Model-Based Object Tracking in Road Traffic Scenes

Dieter Koller



An output of the model-based tracking approach: the left figure shows the last frame of a 2-second video sequence. The right figure shows the detected cars (which appeared right at the beginning of the sequence) with their associated tracks.


Introduction

Image sequence analysis provides intermediate results for a conceptual description of events in a scene. A system that establishes such higher-level descriptions based on tracking moving objects in the image domain has been described in [koller 91]. Here we introduce three-dimensional models of the structure and motion of the moving objects, as well as of the illumination of the scene, in order to verify hypotheses for object candidates and to robustly extract smooth trajectories of such objects.

In order to record and analyze nontrivial events in road traffic scenes we have to cope with the following dilemma: either we fixate the camera on an interesting agent by applying gaze control, so that the agent remains in the field of view, or we use a stationary camera with a field of view large enough to capture significant actions of moving agents. The immediate shortcoming of the passive (stationary-camera) approach, which is pursued in this work, is the small size and low resolution of the image area covered by the projection of a moving agent. Image-domain cues such as gray-value edges and corners are short and can hardly be detected. Additionally, in a road traffic scene we have to cope with a highly cluttered environment full of background features, as well as with occlusions and disocclusions. This renders the task of figure-background discrimination extremely difficult. The use of models representing our a priori knowledge appears necessary in order to accomplish the hard task of detecting and tracking under real-world conditions.


Our Approach

Our approach consists of the following main steps (a one-page block diagram gives an overview):
  1. Motion segmentation: The first step is a motion segmentation, which separates moving objects from the stationary background. We apply a discrete feature-based approach to compute displacement vectors between consecutive frames. A cluster of coherently moving image features then provides a rough estimate of a moving region in the image.

    [Figures: image section => displacement vectors => vector cluster]

  2. Model Hypothesis: The assumption that such a cluster is due to a hypothetical object moving on a planar road in the scene yields a rough estimate for the position of the hypothetical object in the scene: the center of the group of the moving image features is projected back into the scene, based on a calibration of the camera.
    The assumption of forward motion yields the orientation of the principal axis of the model, which is assumed to be parallel to the direction of motion.

    Back-projection of the vector cluster and its enclosing rectangle, overlaid on a digitized image of an official map.

  3. Generic polyhedral vehicle model: We use a 3D generic vehicle model parameterized by 12 length parameters. This enables the instantiation of different vehicles, for example sedan, hatchback, station wagon, bus, or van, from the same generic vehicle model. The model shape parameters can be estimated by including them in the state estimation process (see step 7).

  4. Object recognition and alignment: Straight-line edge segments extracted from the image are matched to the 2D model edge segments (a view sketch) obtained by projecting a 3D polyhedral model of the vehicle into the image plane, using a hidden-line algorithm to determine their visibility. The matching of image edge segments and model segments is based on the Mahalanobis distance of line segment attributes. The midpoint representation of line segments is suitable for expressing the different uncertainties parallel and perpendicular to the line segments that emerge in the edge detection process.

    [Figures, two rows: initial model instantiation => alignment => optimal pose estimate]

    These figures show the alignment results: the left column shows the initial model instantiation and the right column the optimal pose estimate. The figures in the upper row show the image edge segments (red), the model instantiation (green dashed lines), and the matched image edge segments (thick pink lines).

  5. Illumination model and shadows: In order to avoid incorrect matches between model segments and image edge segments which arise from shadows of the vehicles, we enrich the applied a priori knowledge by including an illumination model. This provides us with a geometrical description of the shadows of the vehicles projected onto the street plane.

    [Figures: pose estimation without shadows => + shadow => pose estimation including shadow edges]
    Effect of including an illumination model - casting a shadow on the road - in the pose estimation: left image without shadows, right image including shadow edges.

  6. Motion model: We establish a motion model which describes the dynamic vehicle motion in the absence of knowledge about the intention of the driver. In the stationary case, in which the steering angle remains constant, the result is a simple circular motion with constant magnitude of velocity and constant angular velocity around the normal of a plane on which the motion is assumed to take place. The unknown intention of the driver in maneuvering the car is captured by the introduction of process noise.

  7. Kalman filtering: The motion parameters of this motion model, as well as the shape parameters of our generic polyhedral model, are estimated using a recursive maximum a posteriori (MAP) estimator, which is implemented by an iterated version of the Extended Kalman Filter (IEKF). Furthermore, we use the Levenberg-Marquardt method to minimize the objective function of the MAP estimator.

  8. Model interpretation loop: The key feature of our approach is a model interpretation loop which copes with the non-linear relation between the model features and the image features (due to visibility and perspective projection). A model interpretation is defined as a set of correspondences between model and image features. For this interpretation (set) we compute an optimal pose and shape parameter set according to step 7, project the resulting model back into the image, and continue with step 4 until the process converges towards an optimal estimate (see the block diagram).

  9. Classification: Classification is based on the assumption that differences between class members can be considered as deformations of the shape of a stored prototype. For that purpose we apply a Bayes classifier between a shape parameter instantiation and the shape parameters of the five prototypes.
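
The back-projection in step 2 can be illustrated with a minimal sketch. It assumes a calibrated pinhole camera (focal length f and principal point (cx, cy) in pixels, world-to-camera rotation R, camera centre C) and a road plane at z = 0. All names and the example calibration are hypothetical and chosen for illustration only; they are not taken from the original system.

```python
import math

def backproject_to_road(u, v, f, cx, cy, R, C):
    """Intersect the viewing ray of pixel (u, v) with the road plane z = 0.

    R is the 3x3 world-to-camera rotation (its rows are the camera axes
    expressed in world coordinates); C is the camera centre in world
    coordinates. Both are assumed to come from a prior calibration.
    """
    d_cam = ((u - cx) / f, (v - cy) / f, 1.0)
    # Rotate the ray direction into world coordinates: d_world = R^T d_cam.
    d_world = [sum(R[i][j] * d_cam[i] for i in range(3)) for j in range(3)]
    if d_world[2] == 0.0:
        raise ValueError("viewing ray is parallel to the road plane")
    t = -C[2] / d_world[2]
    if t <= 0.0:
        raise ValueError("road plane is behind the camera for this pixel")
    return (C[0] + t * d_world[0], C[1] + t * d_world[1])

def hypothesize_pose(center_px, mean_disp_px, f, cx, cy, R, C):
    """Rough position and heading of a hypothetical object on the road.

    The cluster centre gives the position; the back-projected mean
    displacement gives the heading (forward-motion assumption).
    """
    x0, y0 = backproject_to_road(center_px[0], center_px[1], f, cx, cy, R, C)
    x1, y1 = backproject_to_road(center_px[0] + mean_disp_px[0],
                                 center_px[1] + mean_disp_px[1],
                                 f, cx, cy, R, C)
    heading = math.atan2(y1 - y0, x1 - x0)
    return x0, y0, heading

# Hypothetical calibration: camera 10 m above the road, looking straight down.
R_DOWN = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
C_DOWN = (0.0, 0.0, 10.0)
pose = hypothesize_pose((320.0, 240.0), (50.0, 0.0),
                        500.0, 320.0, 240.0, R_DOWN, C_DOWN)
```

The heading is obtained by back-projecting both ends of the mean displacement vector rather than by rotating the image-plane direction, so that perspective foreshortening is taken into account.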
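
The segment matching in step 4 can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes a diagonal covariance expressed in the frame of the model segment, and the standard deviations (sigma_par, sigma_perp, sigma_theta, sigma_len) are made-up values. It shows why the midpoint representation is convenient: the midpoint offset can be split into a component along the segment (uncertain, because endpoints are poorly localized) and a component across it (well localized).

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    xm: float      # midpoint x (pixels)
    ym: float      # midpoint y (pixels)
    theta: float   # orientation in [0, pi), undirected
    length: float  # segment length (pixels)

def from_endpoints(x1, y1, x2, y2):
    """Convert an endpoint pair to the midpoint representation."""
    theta = math.atan2(y2 - y1, x2 - x1) % math.pi
    return Segment((x1 + x2) / 2.0, (y1 + y2) / 2.0, theta,
                   math.hypot(x2 - x1, y2 - y1))

def angle_diff(a, b):
    """Difference of two undirected orientations, wrapped to [-pi/2, pi/2)."""
    return (a - b + math.pi / 2.0) % math.pi - math.pi / 2.0

def mahalanobis_sq(model, image, sigma_par=5.0, sigma_perp=1.0,
                   sigma_theta=0.1, sigma_len=10.0):
    """Squared Mahalanobis distance with an assumed diagonal covariance."""
    dx, dy = image.xm - model.xm, image.ym - model.ym
    c, s = math.cos(model.theta), math.sin(model.theta)
    d_par = c * dx + s * dy     # midpoint offset along the model segment
    d_perp = -s * dx + c * dy   # midpoint offset across the model segment
    d_theta = angle_diff(image.theta, model.theta)
    d_len = image.length - model.length
    return ((d_par / sigma_par) ** 2 + (d_perp / sigma_perp) ** 2
            + (d_theta / sigma_theta) ** 2 + (d_len / sigma_len) ** 2)

def best_match(model, candidates, gate):
    """Index of the closest candidate below the gate, or None."""
    best_index, best_d2 = None, gate
    for i, seg in enumerate(candidates):
        d2 = mahalanobis_sq(model, seg)
        if d2 < best_d2:
            best_index, best_d2 = i, d2
    return best_index
```

With these assumed uncertainties, a shift of two pixels along the segment is penalized far less than the same shift across it, which is the behaviour the text attributes to the midpoint representation.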
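
The stationary case of the motion model in step 6 (constant speed v and constant angular velocity omega on the road plane) has a closed-form prediction. The sketch below uses the state (x, y, phi, v, omega), which is an assumed parameterization for illustration; the process noise that captures the driver's unknown intention is deliberately left out.

```python
import math

def predict_circular(x, y, phi, v, omega, dt):
    """Propagate a planar pose along a circular arc for dt seconds.

    x, y   : position on the road plane
    phi    : heading angle (radians)
    v      : constant magnitude of velocity
    omega  : constant angular velocity about the plane normal
    """
    if abs(omega) < 1e-9:
        # Degenerate case: the circle radius goes to infinity (straight line).
        return (x + v * dt * math.cos(phi),
                y + v * dt * math.sin(phi),
                phi, v, omega)
    r = v / omega                 # signed radius of the circle
    phi_new = phi + omega * dt
    x_new = x + r * (math.sin(phi_new) - math.sin(phi))
    y_new = y - r * (math.cos(phi_new) - math.cos(phi))
    return (x_new, y_new, phi_new, v, omega)

# Quarter turn: speed 1, turning pi/2 rad/s for 1 s, starting at the origin.
quarter = predict_circular(0.0, 0.0, 0.0, 1.0, math.pi / 2.0, 1.0)
```

In a Kalman filter setting this function would play the role of the state transition, with the process noise added to the predicted covariance.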
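
The Bayes classification in step 9 can be sketched under simple assumptions: each prototype is the mean of a Gaussian over the shape parameters, all classes share one diagonal covariance, and the priors are equal unless given. The three parameters (length, width, height in metres), the prototype values, and the standard deviations below are hypothetical; the generic model described above uses 12 length parameters.

```python
import math

# Hypothetical prototypes: (length, width, height) in metres.
PROTOTYPES = {
    "sedan": (4.5, 1.7, 1.4),
    "hatchback": (4.0, 1.7, 1.4),
    "station wagon": (4.7, 1.7, 1.5),
    "van": (5.0, 1.9, 2.0),
    "bus": (12.0, 2.5, 3.0),
}
SIGMA = (0.3, 0.1, 0.15)  # assumed standard deviation per parameter

def log_gaussian(x, mean, sigma):
    """Log density of an axis-aligned Gaussian."""
    return sum(-0.5 * ((xi - mi) / si) ** 2 - math.log(si)
               - 0.5 * math.log(2.0 * math.pi)
               for xi, mi, si in zip(x, mean, sigma))

def classify(x, prototypes=PROTOTYPES, sigma=SIGMA, priors=None):
    """Return the most probable class and the posterior of every class."""
    names = list(prototypes)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {name: math.log(priors[name])
                + log_gaussian(x, prototypes[name], sigma)
                for name in names}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post.values())
    unnorm = {name: math.exp(lp - top) for name, lp in log_post.items()}
    total = sum(unnorm.values())
    posterior = {name: u / total for name, u in unnorm.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

label, posterior = classify((4.0, 1.7, 1.4))
```

The posterior values indicate how clearly an estimated shape separates from neighbouring prototypes; similar classes such as sedan and station wagon would typically share probability mass.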


Related Publications:

 * Moving Object Recognition and Classification based on Recursive Shape Parameter Estimation.
D. Koller. In Proc. 12th Israel Conference on Artificial Intelligence, Computer Vision, pp. 359-368, Ramat Gan, Israel, December 27-28, 1993.
 * Model-Based Object Tracking in Monocular Image Sequences of Road Traffic Scenes.
D. Koller, K. Daniilidis, and H.-H. Nagel. International Journal of Computer Vision 10:3 (1993) 257-281.
 * Detektion, Verfolgung und Klassifikation bewegter Objekte in monokularen Bildfolgen am Beispiel von Straßenverkehrsszenen.
D. Koller. Dissertationen zur Künstlichen Intelligenz DISKI 13 (in German), infix-Verlag, Sankt Augustin, 1992.
 * Algorithmic Characterization of Vehicle Trajectories from Image Sequences by Motion Verbs.
D. Koller, N. Heinze, and H.-H. Nagel. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 90-95, June 3-6, 1991.

Last modified on Tuesday, November 20, 1996, Dieter Koller (koller@vision.caltech.edu)