This article provides a guide to the OpenPose library for real-time multi-person keypoint detection. We will review its architecture, features, and comparison with other human pose estimation methods.
In the era of AI, more and more computer vision and machine learning (ML) applications need 2D human pose estimation as an input. Its output also feeds downstream tasks in image recognition and AI-based video analytics. Single- and multi-person pose estimation is an important computer vision task that is used in domains such as action recognition, security, and sports.
Pose estimation is still a relatively new computer vision technology. In recent years, however, the accuracy of human pose estimation has improved dramatically with the emergence of Convolutional Neural Networks (CNNs).
Pose Estimation with OpenPose
A human pose skeleton represents the orientation of a person in a structured format. Fundamentally, it is a set of data points that can be connected to describe the person’s pose. Each data point in the skeleton is also called a part, a coordinate, or a point. A valid connection between two points is known as a limb or pair. Note, however, that not every combination of data points forms a valid pair.
Knowing a person’s orientation opens the door to many real-life applications, many of them in sports and fitness. Many approaches to human pose estimation have been proposed over the years. The earliest techniques typically estimated the pose of a single individual in an image containing only one person. OpenPose provides a more efficient and robust approach that makes pose estimation applicable to images of crowded scenes.
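The idea of parts and pairs can be sketched as a small data structure. This is only an illustration: the five-part skeleton, the coordinates, and the helper names `limb_segments` and `is_valid_pair` are hypothetical and do not come from OpenPose itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    """One part of the skeleton: a named 2D point."""
    name: str
    x: float
    y: float


# Hypothetical five-part skeleton; real models define many more parts.
KEYPOINTS = [
    Keypoint("nose", 50.0, 20.0),
    Keypoint("neck", 50.0, 40.0),
    Keypoint("right_shoulder", 35.0, 42.0),
    Keypoint("right_elbow", 30.0, 65.0),
    Keypoint("right_wrist", 28.0, 88.0),
]

# Only anatomically meaningful connections count as pairs (limbs).
PAIRS = [
    ("nose", "neck"),
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]


def limb_segments(keypoints, pairs):
    """Return ((x1, y1), (x2, y2)) for every pair, ready for drawing."""
    by_name = {kp.name: kp for kp in keypoints}
    return [
        ((by_name[a].x, by_name[a].y), (by_name[b].x, by_name[b].y))
        for a, b in pairs
    ]


def is_valid_pair(a, b, pairs=PAIRS):
    """A pair is valid in either direction, but only if it is listed."""
    return (a, b) in pairs or (b, a) in pairs


print(limb_segments(KEYPOINTS, PAIRS)[0])      # ((50.0, 20.0), (50.0, 40.0))
print(is_valid_pair("nose", "right_wrist"))    # False
```

Keeping the list of pairs separate from the list of points makes the last remark explicit: nose and wrist are both valid parts, yet they do not form a limb.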
What is OpenPose?
OpenPose is a real-time multi-person human pose detection library that has for the first time shown the capability to jointly detect the human body, foot, hand, and facial keypoints on single images. OpenPose is capable of detecting a total of 135 keypoints.
The method won the COCO 2016 Keypoints Challenge and is popular for its quality and its robustness in multi-person settings.
Who created OpenPose?
OpenPose was created by Ginés Hidalgo, Yaser Sheikh, Zhe Cao, Yaadhav Raaj, Tomas Simon, Hanbyul Joo, and Shih-En Wei. It is maintained by Yaadhav Raaj and Ginés Hidalgo.
What are the features of OpenPose?
The OpenPose human pose detection library has many features. Some of the most notable ones are:
- Real-time 3D single-person keypoint detections
- 3D triangulation with multiple camera views
- Flir camera compatibility
- Real-time 2D multi-person keypoint detections
- 15, 18, or 25-keypoint body/foot keypoint estimation
- 21 hand keypoint estimation
- 70 face keypoint estimation
- Single-person tracking for speeding up the detection and visual smoothing
- Calibration toolbox for the estimation of extrinsic, intrinsic, and distortion camera parameters
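To make the keypoint counts in this list concrete, here is a minimal sketch of handling one detected hand. It assumes the flat x, y, confidence layout that OpenPose uses in its JSON output; the helper names `group_keypoints` and `detected`, the threshold, and the synthetic values are hypothetical.

```python
def group_keypoints(flat):
    """Turn [x1, y1, c1, x2, y2, c2, ...] into [(x1, y1, c1), ...]."""
    if len(flat) % 3 != 0:
        raise ValueError("expected a multiple of 3 values (x, y, confidence)")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def detected(points, threshold=0.1):
    """Keep only keypoints whose confidence exceeds the threshold."""
    return [p for p in points if p[2] > threshold]


# Synthetic hand with 21 keypoints; the last five are marked as not detected.
flat_hand = []
for i in range(21):
    confidence = 0.9 if i < 16 else 0.0
    flat_hand += [float(i), float(2 * i), confidence]

hand = group_keypoints(flat_hand)
print(len(hand), len(detected(hand)))  # 21 16
```

The same grouping applies to the 70 face keypoints, which would arrive as a flat list of 210 numbers under this layout.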
How to Use OpenPose
Pose estimation algorithms usually require significant computational resources and rely on large, heavy models. This makes them unsuitable for real-time applications (video analytics) and for deployment on resource-constrained hardware (edge devices in edge computing). Hence, there is a need for lightweight real-time human pose estimators that can be deployed to devices to perform on-device edge machine learning.
Lightweight OpenPose is a heavily optimized OpenPose implementation that performs real-time inference on a CPU with minimal accuracy loss. It detects a skeleton, consisting of keypoints and the connections between them, to determine the pose of every person in the image. The pose may include keypoints such as the ankles, ears, knees, eyes, hips, nose, wrists, neck, elbows, and shoulders.
Hardware and Camera
OpenPose accepts input from images, videos, and camera streams, including webcams, Flir/Point Grey cameras, IP cameras (CCTV), and custom input sources (such as depth cameras and stereo lens cameras).
Hardware-wise, OpenPose offers versions for Nvidia GPUs (CUDA), AMD GPUs (OpenCL), and CPU-only computing. It runs on Ubuntu, Windows, Mac, and Nvidia Jetson TX2.
How to use OpenPose?
The fastest and easiest way to use OpenPose is probably Viso Suite, an end-to-end computer vision platform that provides everything out of the box. The Viso Platform makes it very simple to use OpenPose with different cameras and AI hardware.
How Does OpenPose Work?
OpenPose first extracts features from the image using the first few layers of the network. The extracted features are then fed into two parallel branches of convolutional layers. The first branch predicts a set of 18 confidence maps, each representing a specific part of the human pose skeleton. The second branch predicts a set of 38 Part Affinity Fields (PAFs) that represent the degree of association between parts.
The later stages refine the predictions made by the two branches. Using the confidence maps, bipartite graphs are formed between pairs of parts. Using the PAF values, the weaker links in those graphs are pruned. With these steps, human pose skeletons can be estimated and assigned to every person in the picture.
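A confidence map becomes useful once candidate part locations are read off it. The sketch below shows one simple way to do that, by keeping local maxima above a threshold. It is an illustration rather than OpenPose's own code: the function name `find_peaks`, the threshold, and the tiny 5x5 map are assumptions.

```python
def find_peaks(conf_map, threshold=0.3):
    """Return (x, y, score) for local maxima above the threshold.

    A cell is a peak when it is strictly greater than its up, down,
    left, and right neighbours (a simple form of non-maximum suppression).
    """
    height, width = len(conf_map), len(conf_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = conf_map[y][x]
            if value < threshold:
                continue
            neighbours = [
                conf_map[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    return peaks


# Synthetic confidence map for one part type with two people in view.
conf_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]
print(find_peaks(conf_map))  # [(1, 1, 0.9), (3, 3, 0.8)]
```

Each map yields candidates for a single part type, so in a multi-person image one map can produce several peaks, one per visible person.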
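The pruning step can be illustrated by scoring a candidate link against a PAF. A common formulation samples points along the segment between two candidate parts and averages the dot product of the field with the segment's direction. The sketch below follows that idea on a synthetic field; the function name `paf_score`, the nearest-cell sampling, and the grid values are assumptions, not OpenPose's implementation.

```python
import math


def paf_score(paf_x, paf_y, a, b, num_samples=10):
    """Average alignment between the PAF and the segment from a to b.

    paf_x[y][x] and paf_y[y][x] hold the field's vector components.
    num_samples must be at least 2.
    """
    ax, ay = a
    bx, by = b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return 0.0
    ux, uy = (bx - ax) / length, (by - ay) / length  # unit direction
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        x = int(round(ax + t * (bx - ax)))  # nearest grid cell
        y = int(round(ay + t * (by - ay)))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


# Synthetic field: one horizontal limb along row 2, from x=1 to x=4.
SIZE = 6
paf_x = [[0.0] * SIZE for _ in range(SIZE)]
paf_y = [[0.0] * SIZE for _ in range(SIZE)]
for x in range(1, 5):
    paf_x[2][x] = 1.0

true_link = paf_score(paf_x, paf_y, (1, 2), (4, 2))
false_link = paf_score(paf_x, paf_y, (1, 2), (4, 5))
print(true_link, round(false_link, 2))  # 1.0 0.14
```

With a cut-off such as 0.5 (a hypothetical value), the link that follows the limb survives and the diagonal link to an unrelated candidate is pruned.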
Overview of the Pipeline
- a) entire image as input
- b) two-branch CNN to jointly predict confidence maps for body part detection
- c) estimate part affinity fields for parts association
- d) set of bipartite matchings to associate body parts candidates
- e) assemble them into full-body poses for all people in the image
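Step d) can be sketched as a greedy matching over a table of link scores: accept the strongest connection first, then skip any connection that reuses an already matched candidate. This is a simplified illustration under assumed inputs; the function name `match_parts`, the score table, and the `min_score` cut-off are hypothetical.

```python
def match_parts(scores, min_score=0.5):
    """Greedy one-to-one matching between two sets of part candidates.

    scores[i][j] is the link score between candidate i of the first part
    (for example, necks) and candidate j of the second (for example,
    right shoulders). Returns a list of (i, j, score).
    """
    candidates = sorted(
        (
            (score, i, j)
            for i, row in enumerate(scores)
            for j, score in enumerate(row)
            if score >= min_score
        ),
        reverse=True,  # strongest links first
    )
    used_first, used_second, matches = set(), set(), []
    for score, i, j in candidates:
        if i not in used_first and j not in used_second:
            matches.append((i, j, score))
            used_first.add(i)
            used_second.add(j)
    return matches


# Two necks and two right shoulders detected in the same image.
scores = [
    [0.9, 0.2],
    [0.6, 0.8],
]
print(match_parts(scores))  # [(0, 0, 0.9), (1, 1, 0.8)]
```

Repeating this for every pair type and chaining matches that share a part candidate gives step e): one assembled skeleton per person.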
OpenPose vs. Alpha-Pose vs. Mask R-CNN
OpenPose is one of the best-known bottom-up approaches for real-time multi-person body pose estimation. One reason is its well-written GitHub implementation. Like other bottom-up approaches, OpenPose first detects the parts (keypoints) belonging to every person in the image and then assigns those keypoints to specific individuals.
OpenPose vs. Alpha-Pose
RMPE, or Alpha-Pose, is a well-known top-down technique for pose estimation. Its creators observe that top-down methods usually depend on the accuracy of the person detector, because pose estimation is performed on the region where the person was detected. As a result, localization errors and redundant bounding box predictions can cause the pose extraction algorithm to perform sub-optimally.
To solve this issue, the creators introduced a Symmetric Spatial Transformer Network (SSTN) to extract a high-quality person region from an inaccurate bounding box. A Single Person Pose Estimator (SPPE) is applied to this extracted region to estimate the human pose skeleton for that individual. A Spatial De-Transformer Network (SDTN) then remaps the human pose back to the original image coordinate system. Moreover, the authors introduced a parametric pose Non-Maximum Suppression (NMS) method to handle redundant pose detections.
Along with this, a Pose Guided Proposals Generator was proposed to augment the training samples and better train the SPPE and SSTN networks. The most important feature of Alpha-Pose is that it can be extended to any combination of a person detection algorithm and an SPPE.
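The remapping step has a simple geometric core that can be shown without any network. The sketch below maps keypoints between a person box and the full image using a plain translate-and-scale. It is not the learned SSTN or SDTN; the function names and the box layout `(x_min, y_min, width, height)` are assumptions made for illustration.

```python
def image_to_crop(points, box):
    """Express image-pixel points as fractions of the person box."""
    x_min, y_min, width, height = box
    return [((x - x_min) / width, (y - y_min) / height) for x, y in points]


def crop_to_image(points, box):
    """Map points given as fractions of the box back to image pixels."""
    x_min, y_min, width, height = box
    return [(x_min + u * width, y_min + v * height) for u, v in points]


# Hypothetical person box and two keypoints predicted inside the crop.
box = (100.0, 50.0, 80.0, 200.0)
crop_points = [(0.5, 0.25), (0.0, 1.0)]

image_points = crop_to_image(crop_points, box)
print(image_points)  # [(140.0, 100.0), (100.0, 250.0)]
```

The two functions are inverses of each other, which mirrors the symmetric role the text describes: one transform goes into the person region, the other brings the estimated pose back out.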
OpenPose vs. Mask R-CNN
Last but not least, Mask R-CNN is a well-known architecture for instance segmentation. It predicts both the bounding boxes of the objects in an image and a mask that segments each object (image segmentation). The architecture of Mask R-CNN can be extended to human pose estimation with little effort.
It first extracts feature maps from an image with a Convolutional Neural Network (CNN). A Region Proposal Network (RPN) uses these feature maps to propose bounding box candidates for the presence of objects. Each bounding box candidate selects a region from the feature map. Since the candidates can be of different sizes, the RoIAlign layer resamples the extracted features so that they all have a uniform size.
The extracted features are then passed into parallel CNN branches for the final prediction of the bounding boxes and the segmentation masks. The object detection algorithm can be trained to locate people. By combining each person’s location with their set of keypoints, we obtain the human pose skeleton for every individual in the image.
This technique is very similar to the top-down method, except that person detection runs in parallel with part detection rather than before it. Put simply, the keypoint detection phase and the person detection phase are independent of each other.
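The resampling to a uniform size can be sketched with bilinear interpolation. The version below is deliberately simplified: it takes a single sample at the center of each output bin on a single-channel map, whereas real RoIAlign layers typically average several samples per bin. The function names, the box layout `(x_min, y_min, x_max, y_max)`, and the synthetic feature map are assumptions.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate feature[row][col] at a fractional (x, y)."""
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)   # clamp to the map
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2):
    """Resample a box of any size to an out_size x out_size grid."""
    x_min, y_min, x_max, y_max = box
    bin_w = (x_max - x_min) / out_size
    bin_h = (y_max - y_min) / out_size
    return [
        [
            bilinear(feature, x_min + (j + 0.5) * bin_w, y_min + (i + 0.5) * bin_h)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Synthetic 5x5 feature map where the value at (x, y) is x + 10 * y.
feature = [[float(x + 10 * y) for x in range(5)] for y in range(5)]

print(roi_align(feature, (0.5, 0.5, 3.5, 2.5)))
# [[11.25, 12.75], [21.25, 22.75]]
```

Boxes of different sizes all produce the same 2x2 output here, which is what lets the later branches process every candidate with layers of fixed input size.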
The Bottom Line
Real-time multi-person pose estimation is an important element in enabling machines to visually comprehend and analyze humans and their interactions. OpenPose is one of the most popular detection libraries for pose estimation and is capable of real-time multi-person pose analysis.
The lightweight variant makes it possible to apply OpenPose in Edge AI applications and to deploy it for on-device Edge ML Inference.