With the three feature templates described above in hand, we now describe how they are used to locate the mouth in each frame of a given sequence. The first step in processing a frame is to isolate a rectangular region of the image that is believed to safely contain the head. We assume that the head lies inside a rectangle centered on the nose, with height equal to twice the vertical distance from the eyes to the mouth and width equal to twice the horizontal separation between the eyes. The position of the nose in frame n is used as the initial guess for the position of the nose in frame n+1. For the first frame in the sequence, the guess is taken from the manually entered feature coordinates.
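As an illustration, the head rectangle rule can be sketched in a few lines of Python. The names `Box` and `head_box` are hypothetical, and we assume image coordinates with x increasing to the right and y increasing downward; neither is specified in the text.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def head_box(nose: Tuple[float, float],
             eye_sep: float,
             eye_mouth_dist: float) -> Box:
    """Rectangle centered on the nose guess.

    width  = 2 * horizontal eye separation
    height = 2 * vertical distance from the eyes to the mouth
    (Assumed convention: y grows downward, so `top` is the smaller y.)
    """
    nx, ny = nose
    width = 2.0 * eye_sep
    height = 2.0 * eye_mouth_dist
    return Box(nx - width / 2.0, ny - height / 2.0, width, height)


# The nose found in frame n would be passed in as the guess for frame n+1.
box = head_box(nose=(100.0, 120.0), eye_sep=40.0, eye_mouth_dist=50.0)
```

For the values above, the box spans 80 pixels horizontally and 100 pixels vertically around the nose guess.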
Once the box containing the head has been extracted for a given frame, we compute the orientation map for the entire head using Kass and Witkin's method followed by smoothing and subsampling. For illustrative purposes, the orientation map computed using this method for the face in Figure 10 is shown in Figure 11.
Figure 11: Example of orientation map for a face.
In the next step, three ROIs are defined in the vicinity of the eyes and nose. Each ROI is a square with side length equal to 4/5 of the eye separation, centered on the presumed feature coordinate (as dictated by the geom array translated to the current nose position). Once the ROIs have been defined, OTC is performed on the orientation map of the face within each ROI, using the corresponding feature template.
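The ROI construction can be sketched as follows. This is a minimal illustration under stated assumptions: the layout of the geom array is not given in this section, so we represent it here as a hypothetical dictionary of (dx, dy) feature offsets relative to the nose, and the function name `feature_rois` is likewise our own.

```python
from typing import Dict, Tuple

Offset = Tuple[float, float]
Roi = Tuple[float, float, float, float]  # (left, top, width, height)


def feature_rois(nose: Tuple[float, float],
                 geom: Dict[str, Offset],
                 eye_sep: float) -> Dict[str, Roi]:
    """Square ROIs of side 4/5 * eye separation, one per feature.

    `geom` is assumed to hold each feature's offset from the nose;
    translating it by the current nose guess gives the ROI centers.
    """
    side = 4.0 * eye_sep / 5.0
    rois = {}
    for name, (dx, dy) in geom.items():
        cx, cy = nose[0] + dx, nose[1] + dy
        rois[name] = (cx - side / 2.0, cy - side / 2.0, side, side)
    return rois


# Hypothetical offsets for two eyes and the nose itself.
geom = {"left_eye": (-20.0, -30.0), "right_eye": (20.0, -30.0), "nose": (0.0, 0.0)}
rois = feature_rois(nose=(100.0, 120.0), geom=geom, eye_sep=40.0)
```

Because the ROI centers are tied to the current nose guess, an error in that guess shifts all three ROIs together; the correlation step then measures the shift.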
After performing OTC in each ROI, the correlation results are overlaid on one another, added, and smoothed once with gauss5. If the head did not move from one frame to the next, the maximum of the correlation should lie at the center of the summed result. In general, however, the head moves a few pixels over the course of a few frames, and the maximum shifts accordingly. Thus, we must locate the maximum correlation value in every frame in order to update the head position guess for the next frame. Since the resolution of the orientation map on which we correlate is half that of the original image, it is in our interest to find the peak with subpixel accuracy. To this end, we first locate the correlation peak at the resolution of the orientation map and then employ bicubic interpolation in the neighborhood of the peak to refine the estimate to the resolution of the original image. Note that bilinear interpolation is unsuitable for this purpose, since it cannot introduce new maxima between adjacent samples: over each cell, a bilinearly interpolated surface attains its maximum at one of the four surrounding samples.
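The peak refinement can be illustrated with a small pure-Python sketch. It is not the implementation used here: we assume the Keys cubic convolution kernel with a = -0.5 (Catmull-Rom) as the bicubic interpolant, clamp indices at the borders, and search offsets of half a sample around the coarse peak, which corresponds to one pixel in the original image. The names `cubic_kernel`, `bicubic`, and `refine_peak` are hypothetical.

```python
import math


def cubic_kernel(t, a=-0.5):
    """Keys cubic convolution kernel; a = -0.5 is an assumed choice."""
    t = abs(t)
    if t <= 1.0:
        return (a + 2.0) * t ** 3 - (a + 3.0) * t ** 2 + 1.0
    if t < 2.0:
        return a * t ** 3 - 5.0 * a * t ** 2 + 8.0 * a * t - 4.0 * a
    return 0.0


def bicubic(grid, y, x):
    """Interpolate `grid` (a list of rows) at fractional (y, x)."""
    rows, cols = len(grid), len(grid[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for j in range(y0 - 1, y0 + 3):
        wy = cubic_kernel(y - j)
        jj = min(max(j, 0), rows - 1)      # clamp at the border
        for i in range(x0 - 1, x0 + 3):
            ii = min(max(i, 0), cols - 1)
            total += wy * cubic_kernel(x - i) * grid[jj][ii]
    return total


def refine_peak(grid, steps=2):
    """Coarse argmax, then a search at 1/steps spacing around it."""
    rows, cols = len(grid), len(grid[0])
    py, px = max(((r, c) for r in range(rows) for c in range(cols)),
                 key=lambda rc: grid[rc[0]][rc[1]])
    best = (grid[py][px], float(py), float(px))
    for dj in range(-steps + 1, steps):
        for di in range(-steps + 1, steps):
            y, x = py + dj / steps, px + di / steps
            if 0 <= y <= rows - 1 and 0 <= x <= cols - 1:
                v = bicubic(grid, y, x)
                if v > best[0]:
                    best = (v, y, x)
    return best[1], best[2]


# Synthetic correlation surface whose true peak lies between two samples.
surface = [[-(c - 2.5) ** 2 - (r - 2) ** 2 for c in range(6)] for r in range(5)]
peak_y, peak_x = refine_peak(surface)
```

On this synthetic surface the two samples at columns 2 and 3 tie for the coarse maximum, and the refinement step moves the estimate to the half-sample position between them. Multiplying the refined coordinates by two would map them to the original image grid, given the factor-of-two subsampling stated above.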
Once the location of the correlation peak has been found, it is used to update the head position as already mentioned and, more importantly, to extract from the original grayscale image a box containing the mouth of the speaker. We set the height of this box to twice the distance from the nose to the mouth and its width to 1.2 times the eye separation. We place the top of the box at a distance equal to 0.4 times the nose-to-mouth distance below the nose location.
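The mouth box geometry can be written out as a short sketch. The text does not state the horizontal placement of the box, so centering it on the nose is an assumption, as is the convention that y increases downward; the function name `mouth_box` is hypothetical.

```python
from typing import Tuple


def mouth_box(nose: Tuple[float, float],
              eye_sep: float,
              nose_mouth_dist: float) -> Tuple[float, float, float, float]:
    """Return (left, top, width, height) of the mouth box.

    width  = 1.2 * eye separation
    height = 2 * nose-to-mouth distance
    top    = 0.4 * nose-to-mouth distance below the nose
    Assumption: the box is centered horizontally on the nose.
    """
    nx, ny = nose
    width = 1.2 * eye_sep
    height = 2.0 * nose_mouth_dist
    top = ny + 0.4 * nose_mouth_dist   # y grows downward (assumed)
    left = nx - width / 2.0
    return (left, top, width, height)


left, top, width, height = mouth_box(nose=(100.0, 120.0),
                                     eye_sep=40.0,
                                     nose_mouth_dist=30.0)
```

With these proportions the mouth itself sits 0.6 nose-to-mouth distances below the top edge of the box, leaving the larger share of the box below the mouth.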
Having described our method of localizing the "mouthbox" for a given frame in the sequence, we now discuss our techniques for feature extraction.