From Categories to Individuals in Real Time


A method for online, real-time learning of individual-object detectors is presented. Starting with a pre-trained boosted category detector, an individual-object detector is trained with near-zero computational cost. The individual detector is obtained by using the same feature cascade as the category detector along with elementary manipulations of the thresholds of the weak classifiers. This is ideal for online operation on a video stream or for interactive learning. The technique is applied to reidentification and individual tracking. Experiments on four challenging pedestrian and face datasets indicate that it is indeed possible to learn identity classifiers in real time; besides training faster, our classifier has better detection rates than previous methods on two of the datasets.
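To make the idea of reusing the category detector's features while changing only the weak-classifier thresholds more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the IDBoost implementation: the weak classifiers are assumed to be single-feature stumps, and the specific threshold manipulation (replacing each threshold with an interval of mean ± `margin` standard deviations of the individual's feature values) is a hypothetical choice. The names `fit_individual_intervals`, `individual_score`, and `margin` are invented for this example.

```python
from statistics import mean, pstdev


def category_score(features, thresholds, polarities, alphas):
    """Score of a boosted stump classifier: sum_t alpha_t * vote_t.

    Each weak classifier votes +1 if p_t * (f_t - theta_t) >= 0, else -1.
    """
    score = 0.0
    for f, t, p, a in zip(features, thresholds, polarities, alphas):
        score += a if p * (f - t) >= 0 else -a
    return score


def fit_individual_intervals(examples, margin=2.0):
    """ASSUMPTION: derive one interval per feature from an individual's examples.

    `examples` is a list of feature vectors already computed by the category
    detector, so no new features are evaluated. Only means and standard
    deviations are needed, which keeps the cost small.
    """
    intervals = []
    for column in zip(*examples):
        mu = mean(column)
        sd = pstdev(column)
        intervals.append((mu - margin * sd, mu + margin * sd))
    return intervals


def individual_score(features, intervals, alphas):
    """Same features and weights as the category detector, new thresholds.

    A weak classifier votes +1 only if the feature value falls inside the
    interval fitted to the individual (hypothetical rule).
    """
    score = 0.0
    for f, (lo, hi), a in zip(features, intervals, alphas):
        score += a if lo <= f <= hi else -a
    return score
```

In this sketch, "training" the individual detector amounts to one pass over feature values the category detector has already computed, which is the kind of low-cost update that makes online use on a video stream plausible.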


From Categories to Individuals in Real Time — A Unified Boosting Approach
D. Hall and P. Perona
CVPR 2014, Columbus, USA.
PDF Poster


The computational performance versus error rate for the (top) reidentification and (bottom) tracking scenarios using the CRP dataset. Computational performance is the average time it takes to train and evaluate an individual detector (either per window or per frame depending on the application).

IDBoost (the method proposed in this work) runs as fast as L2 and KISSME but has the lowest error rate in the reidentification scenario. In the tracking scenario it still achieves real-time operation with the lowest error rate. All experiments were conducted on a single core of a 3.20 GHz processor. The ideal operating point is the bottom-left corner of the plot.



The Fifty People One Question (FPOQ) face dataset contains 6 videos with 222 annotated individuals across 725 sequences. Each annotation contains the bounding box, the identity, and the sequence number of the face. In total there are 68,676 bounding boxes; 78,181 frames; and 57,274 frames that contain faces. The videos were collected from YouTube and show either a single individual or groups of individuals being asked a question in front of a fixed camera. The responses are edited so that each individual's response is interspersed with the responses of others, which means an individual can appear at any point in the video.
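The annotation structure described above (bounding box, identity, sequence number) can be pictured with a short Python sketch. The field names, the `(x, y, width, height)` box layout, and the `frame` index are assumptions made for illustration; the released annotation files may use a different format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceAnnotation:
    frame: int      # hypothetical frame index within the video
    bbox: tuple     # assumed layout: (x, y, width, height) in pixels
    identity: int   # which individual the face belongs to
    sequence: int   # which contiguous appearance the box belongs to


def sequences_by_identity(annotations):
    """Map each identity to the sorted list of sequences it appears in."""
    groups = defaultdict(set)
    for ann in annotations:
        groups[ann.identity].add(ann.sequence)
    return {identity: sorted(seqs) for identity, seqs in groups.items()}


# Toy data: individual 1 appears, is cut away from, then reappears later.
toy = [
    FaceAnnotation(frame=0, bbox=(10, 20, 40, 40), identity=1, sequence=0),
    FaceAnnotation(frame=1, bbox=(12, 20, 40, 40), identity=1, sequence=0),
    FaceAnnotation(frame=50, bbox=(80, 25, 38, 38), identity=2, sequence=1),
    FaceAnnotation(frame=90, bbox=(11, 22, 41, 41), identity=1, sequence=2),
]
print(sequences_by_identity(toy))
```

Grouping by identity in this way reflects why the dataset has more sequences than individuals: because responses are interleaved, one person typically contributes several separate sequences.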

Caltech Roadside Pedestrians

The Caltech Roadside Pedestrian (CRP) dataset contains 2 videos with 170 annotated individuals across 263 sequences. In total there are 7,450 bounding boxes; 77,450 frames; and 5,606 frames that contain pedestrians. Each video was captured with a rightwards-pointing video camera mounted on the roof of a car. The car completes two laps of a ring road within a park where there are many walkers and joggers. This dataset is more challenging than the face dataset because lighting and pose vary considerably for each individual.


The IDBoost code can be downloaded here. A set of tools to display videos and their annotations is also provided in Video Tools. Note that Piotr's Image & Video Matlab Toolbox is a required external library. This code is licensed under the Simplified BSD License.