Numerous ‘non-maximum suppression’ (NMS) post-processing schemes have been proposed for merging multiple independent object detections. We propose a generalization of NMS beyond bounding boxes to merge multiple pose estimates in a single frame. The final estimates are centroids rather than medoids as in standard NMS, and can therefore be more accurate than any of the individual candidates. Using the same mathematical framework, we extend our approach to the multi-frame setting, merging multiple independent pose estimates across space and time and outputting both the number and pose of the objects present in a scene. Our approach sidesteps many of the inherent challenges associated with full tracking (e.g. objects entering or leaving a scene, extended periods of occlusion). We show its versatility by applying it to two distinct state-of-the-art pose estimation algorithms in three domains: human bodies, faces and mice. Our approach improves both detection accuracy (by helping disambiguate correspondences) and pose estimation quality, and is computationally efficient.
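The difference between a medoid and a centroid can be illustrated with a toy example. The sketch below is not the released Pose-NMS code: it assumes a cluster of candidates has already been formed, represents each pose as an array of 2-D keypoints, and uses a simple score-weighted mean as the centroid. The function names (`merge_medoid`, `merge_centroid`), the weighting scheme and the data are all hypothetical choices made for illustration.

```python
import numpy as np

def merge_medoid(poses, scores):
    """Keep one existing candidate (here, the highest-scoring one)."""
    return poses[int(np.argmax(scores))]

def merge_centroid(poses, scores):
    """Score-weighted mean of all candidates (an assumed weighting)."""
    weights = np.asarray(scores, dtype=float)
    weights = weights / weights.sum()
    # Contract the candidate axis: (n,) x (n, k, 2) -> (k, 2)
    return np.tensordot(weights, poses, axes=1)

def mean_keypoint_error(pose, reference):
    """Average Euclidean distance between corresponding keypoints."""
    return float(np.linalg.norm(pose - reference, axis=1).mean())

# Toy ground truth with two keypoints, and four candidates whose
# errors were deliberately constructed to cancel under the weighting.
true_pose = np.array([[0.0, 0.0], [10.0, 0.0]])
offsets = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
candidates = true_pose[None, :, :] + offsets[:, None, :]  # shape (4, 2, 2)
scores = np.array([0.9, 0.9, 0.8, 0.8])

medoid = merge_medoid(candidates, scores)
centroid = merge_centroid(candidates, scores)

print("medoid error:  ", mean_keypoint_error(medoid, true_pose))
print("centroid error:", mean_keypoint_error(centroid, true_pose))
```

Because the medoid must coincide with one of the candidates, its error can never be lower than that of the best candidate. The centroid is free to fall between candidates, so it can land closer to the true pose when their errors partially cancel, as they do in this constructed data.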





Single-frame Pose-NMS improves pose quality by around 5% at similar detection accuracy compared with standard NMS. Full (multi-frame) Pose-NMS improves detection accuracy by 6% and maintains similar pose quality compared to running NMS prior to our tracking phase.


Full (multi-frame) Pose-NMS improves detection accuracy by 1.9% as well as pose quality by 0.9% compared to running NMS prior to our tracking phase.


Full (multi-frame) Pose-NMS improves detection accuracy by 11.3% as well as pose quality by 6% compared to running NMS prior to our tracking phase.





The Buffy Stickmen dataset is one of the most widely used datasets for human body pose estimation. Pose is encoded as the beginning/end points of 5 body parts (head, shoulder, elbow, wrist and hip), converting humans into ‘stick’ figures. The original dataset does not contain any temporal information, so we extended it by collecting 50 short clips from the same episodes as the original test set. Only uncut scenes with a duration longer than 2s (50 frames) were collected.

A new ‘in the wild’ face landmark dataset that includes video was collected. 33 HD movies shot on the streets of 23 different countries were downloaded from YouTube from the series ‘50 people one question’. This represents a realistic and challenging benchmark for face landmark estimation due to the variety of filming conditions, locations and people’s expressions (RCPR was used for landmark estimation). From these 33 ten-minute films, 450 clips were extracted with durations varying between 1 and 10 seconds. Face landmarks on the final frame of each clip were annotated using 29 keypoints as in the LFPW dataset.

Videos from the Caltech Resident-Intruder Mouse (CRIM13) dataset were used. These videos are challenging for pose estimation because of the many inter-object occlusions that result from the frequent social interactions between the two mice. The 133 ten-minute top-view testing videos were downloaded, and 550 clips ranging from 1 to 10s in duration were extracted from them. For each clip, the final frame was annotated by placing direction-sensitive ellipses around each mouse body, as in the original work.


The Pose-NMS code can be downloaded here. Note that Piotr's Image & Video Matlab Toolbox is a required external library. This code is licensed under the Simplified BSD License. The code for facial landmark estimation (RCPR) is also available. Please refer to our ICCV 2013 publication for details.