Recent success in 2D pose estimation has been driven by larger, more varied, labeled datasets. While laborious, it is possible for human annotators to click on the 2D locations of different body parts to generate such training data.

Unfortunately, in the case of 3D pose estimation, it is much more challenging to acquire large amounts of training data containing people in real world settings with their corresponding 3D poses. This lack of large scale training data makes it difficult both to train deep models for 3D pose estimation and to evaluate the performance of existing methods in situations where there are large variations in scene types and poses. As a result, researchers have resorted to various alternative methods for collecting 3D pose training data, including motion capture, synthetic datasets, video, and multi-camera setups.

In this work, we argue that instead of using additional hardware to acquire full 3D ground truth training data in closed settings (Fig. 1(b)), we can use human-annotated relative depth information from images in the wild (Fig. 1(c)) to successfully train 3D pose algorithms. When available, 3D ground truth is a very powerful training signal, but our results show that relative depth data can be used instead at the cost of only a small loss in accuracy at test time. Even with small amounts of relative training data (Fig. 1(d), here measured in megabytes), our model predicts 3D poses that are accurate compared to those obtained with full supervision.



Our main contributions are:


Relative Depth LSP Dataset

We extended the LSP Dataset (using a publicly available JSON version) with relative depth annotations collected on Amazon Mechanical Turk. Crowd annotators were presented with an image along with two randomly selected keypoints and were instructed to imagine themselves looking through the camera and report which of the two keypoints appeared closer to them. We forced annotators to choose one of the two possibilities and did not provide a "same distance" option for ambiguous situations, as those cases can be inferred by inspecting the disagreement between annotators.

For each image in LSP, we collected five random pairs of keypoints and ensured that five different annotators labeled each combination of image and keypoint pair. With 2000 images, 5 pairs per image, and 5 annotators per pair, this resulted in a total of 50,000 annotations by 348 workers, who provided an average of 144 labels each. We merged the five votes per keypoint pair using a crowd annotation system, resulting in a single predicted probabilistic label per pair.
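To make the merging step concrete, here is a minimal sketch that combines the five votes for one keypoint pair by simple majority. This is a stand-in for illustration only, not the crowd annotation system mentioned above, which produces a probabilistic label. The 0/1 vote encoding and the function name `merge_votes` are assumptions.

```python
def merge_votes(votes):
    """Merge raw annotator votes for a single keypoint pair by majority.

    votes: list of 0/1 ints. The encoding is an assumption made for this
    sketch: 1 means "the first keypoint of the pair is closer to the camera".

    Returns (merged_label, agreement), where agreement is the fraction of
    annotators who agree with the merged label. Low agreement hints at an
    ambiguous pair (keypoints at roughly the same depth).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ones = sum(votes)
    zeros = len(votes) - ones
    merged = 1 if ones > zeros else 0  # with five votes a tie cannot occur
    agreement = max(ones, zeros) / len(votes)
    return merged, agreement


label, agreement = merge_votes([1, 1, 0, 1, 0])
print(label, agreement)
```

A majority vote treats all annotators as equally reliable, which is the main simplification compared with a system that outputs a probabilistic label.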

The figures below contain example visualizations of the labels in the provided dataset. The names and ordering of each pair of selected keypoints are written on top of their plot, and the green keypoint is annotated as closer to the camera than the red one. The confidence of the label, shown in the bottom-right box, determines the color of the connection between the keypoints.


The Relative Depth LSP Dataset contains the following data:

  • 'joint_names': The ordered list of the names of the 2D keypoints.
  • 'images': The list of 2000 annotations, one for each image in LSP.

Every annotation is composed of:

  • 'im_file': Name of the image file.
  • 'is_train': Boolean flag indicating if the image is in the train or test set.
  • 'keypoints': List of the 28 coordinates (x and y for each of the 14 keypoints) containing the 2D skeleton of the image.
  • 'occluded': List of flags indicating the visibility of every keypoint.
  • 'anns': List of 5 keypoint pairs containing for every pair:
    1. the indices of the pair;
    2. the raw depth label from every annotator;
    3. the merged label;
    4. the probabilistic confidence and associated risk of the merged label.
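The fields above can be read with a few lines of Python. The sketch below builds a tiny hand-made record with the documented top-level keys instead of loading the released file, so every value in it is made up. It also assumes that 'keypoints' stores interleaved x, y values and that a truthy 'occluded' flag means the keypoint is hidden. Neither is stated above, so check both against the actual data. The helper names are hypothetical.

```python
import json

# Hand-made record mimicking the documented fields. All values are invented.
example = {
    "joint_names": ["joint_%d" % i for i in range(14)],  # placeholder names
    "images": [
        {
            "im_file": "example.jpg",  # hypothetical file name
            "is_train": True,
            "keypoints": [float(v) for v in range(28)],
            "occluded": [i == 3 for i in range(14)],  # only joint_3 hidden
            "anns": [],  # inner layout of each pair is not reproduced here
        }
    ],
}

# Round-trip through JSON, since the dataset is distributed as a JSON file.
dataset = json.loads(json.dumps(example))


def split_by_set(dataset):
    """Separate the per-image annotations into train and test lists."""
    train = [a for a in dataset["images"] if a["is_train"]]
    test = [a for a in dataset["images"] if not a["is_train"]]
    return train, test


def visible_keypoints(annotation, joint_names):
    """Return {joint name: (x, y)} for keypoints not flagged as occluded.

    Assumes interleaved coordinates [x0, y0, x1, y1, ...] and that a truthy
    flag means "occluded"; both are assumptions of this sketch.
    """
    flat = annotation["keypoints"]
    points = {}
    for i, name in enumerate(joint_names):
        if not annotation["occluded"][i]:
            points[name] = (flat[2 * i], flat[2 * i + 1])
    return points


train, test = split_by_set(dataset)
visible = visible_keypoints(train[0], dataset["joint_names"])
print(len(train), len(test), len(visible))
```

To work with the released annotations, replace the hand-made record with the result of `json.load` on the downloaded file.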



We provide 1) a PyTorch implementation of our algorithm, 2) the crowd-collected relative depth annotations of the LSP dataset in JSON file format, and 3) the output of our relative model on all of the Human3.6M and LSP test sets.




If you find our paper or the released data or code useful to your work, please cite:

author = {Matteo Ruggero Ronchi and Oisin {Mac Aodha} and Robert Eng and Pietro Perona},
title = {It's all Relative: Monocular 3D Human Pose Estimation from Weakly Supervised Data},
booktitle = {British Machine Vision Conference 2018, {BMVC} 2018, Northumbria University, Newcastle, UK, September 3-6, 2018},
pages = {300},
year = {2018},
crossref = {DBLP:conf/bmvc/2018},
url = {},
timestamp = {Mon, 17 Sep 2018 15:39:51 +0200},
biburl = {},
bibsource = {dblp computer science bibliography,}



© 2018, Matteo Ruggero Ronchi, Oisin Mac Aodha, Robert Eng, and Pietro Perona
