One recent work involving lip tracking by Hennecke et al.\ [HPS94] employs a contour finding scheme known as a deformable template. As introduced by Yuille et al. [YCH89], a deformable template is a parametrized mathematical model used to track the movements of a given object. Specifically, Hennecke et al. make use of a piecewise parabolic/quartic template which seeks to lock on to the upper and lower edges of each lip. In a manner similar to that of snakes, the deformable lip template adjusts its shape according to the value of a number of integrals along the relevant contours. In addition, the authors make use of several configurational and temporal penalty terms, which keep erroneous template deviations under control.
Other works that have dealt with lip tracking based on contour detection
include that of Matsuoka et al. [MFK86], Tamura et al.\
[ea89], and Kass et al. [KWT88]. As noted in a
number of works, e.g. [MP91] and [HPS94], robustly
localizing a contour model can often be quite difficult when the intensity
changes at the lip edges are very gradual.
Moreover, the appearance of the
teeth and tongue can present a threat to the correctness of the template
placement. Despite these potential difficulties,
Hennecke et al. achieved encouraging tracking results for a set of
10-12 individuals.
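The template-fitting idea described above can be illustrated with a minimal sketch. It is not the formulation of [HPS94] or [YCH89]: it assumes a single parabolic arc instead of the piecewise parabolic/quartic template, a precomputed edge-strength map, a sum over sampled points in place of the contour integrals, one temporal penalty on the arc height, and an exhaustive search over candidate heights. All function and parameter names (`parabola_points`, `template_energy`, `fit_height`, `k_temporal`) are hypothetical.

```python
def parabola_points(cx, cy, half_width, height, samples=21):
    """Sample pixel positions along y = cy - height * (1 - t^2), t in [-1, 1]."""
    points = []
    for k in range(samples):
        t = -1.0 + 2.0 * k / (samples - 1)
        x = cx + t * half_width
        y = cy - height * (1.0 - t * t)  # image rows grow downwards
        points.append((int(round(x)), int(round(y))))
    return points


def template_energy(edge_map, cx, cy, half_width, height,
                    prev_height, k_temporal=0.1):
    """Lower is better: strong edges under the curve, small change over time."""
    rows, cols = len(edge_map), len(edge_map[0])
    points = parabola_points(cx, cy, half_width, height)
    # Discrete stand-in for the contour integral of edge strength
    # (assumption: samples falling outside the image contribute nothing).
    edge_sum = sum(edge_map[y][x] for x, y in points
                   if 0 <= x < cols and 0 <= y < rows)
    edge_term = edge_sum / len(points)
    # Temporal penalty: discourage abrupt changes from the previous frame.
    penalty = k_temporal * (height - prev_height) ** 2
    return -edge_term + penalty


def fit_height(edge_map, cx, cy, half_width, prev_height, candidates):
    """Pick the candidate height with the lowest energy (exhaustive search)."""
    return min(candidates,
               key=lambda h: template_energy(edge_map, cx, cy, half_width,
                                             h, prev_height))
```

When the edge map contains a clean arc, the energy is lowest for the matching height; the penalty term is what keeps the template from jumping to a spurious edge (for example, the teeth) between consecutive frames.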
Another system which incorporates auditory information along with visual lip information is that of Wolff et al. [WPSH94]. In this work, the authors make use of one vertical and one horizontal intensity profile through the centre of the mouth. The region of interest around the mouth through which the profiles are extracted is located using a succession of efficient filtering and thresholding steps. By analysing the motion of peaks and valleys in the extracted profiles, the authors obtain a number of phonologically relevant descriptors which are then fed to a time-delay neural network architecture. Results are provided which indicate a notable increase in performance relative to speech-only systems.
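As a rough illustration of the profile-based approach, the sketch below extracts the two profiles from a mouth region and lists their peaks and valleys. It assumes the region of interest has already been located and is given as a list of greyscale rows; the strict local-extremum test and the names `centre_profiles` and `local_extrema` are assumptions made for this example, not details taken from [WPSH94].

```python
def centre_profiles(roi):
    """Return the horizontal and vertical intensity profiles through the ROI centre."""
    rows, cols = len(roi), len(roi[0])
    horizontal = list(roi[rows // 2])            # middle row, left to right
    vertical = [row[cols // 2] for row in roi]   # middle column, top to bottom
    return horizontal, vertical


def local_extrema(profile):
    """Return indices of strict local maxima (peaks) and minima (valleys)."""
    peaks, valleys = [], []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        if mid > left and mid > right:
            peaks.append(i)
        elif mid < left and mid < right:
            valleys.append(i)
    return peaks, valleys


# Toy 5x5 region with a dark spot (the mouth opening) in the middle.
roi = [
    [200, 200, 200, 200, 200],
    [200, 180, 150, 180, 200],
    [200, 120,  40, 120, 200],
    [200, 180, 150, 180, 200],
    [200, 200, 200, 200, 200],
]
horizontal, vertical = centre_profiles(roi)
peaks, valleys = local_extrema(vertical)
```

Tracking how the valley positions shift from frame to frame gives a compact description of mouth opening, which is the kind of descriptor that could then be passed to a classifier.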
One of the earlier works in computer-aided lipreading is that of Petajan
et al. [PBB+88]. In this work, the lips of the speaker
are tracked indirectly by tracking the nostrils. During training,
the mouth images are binarised and put together to form a large
codebook. Then, during testing, a vector quantisation algorithm is
used to associate the appearance of the mouth in a given frame of
a sequence with the closest codeword in the codebook. The authors
also experimented with a more direct minimum image distance method which
did not employ vector quantisation, and the results were actually
better. Experimental evidence is also provided indicating the usefulness
of visual lip information as an aid in speech recognition.
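The codebook lookup can be sketched as a nearest-neighbour search over binarised mouth images. This is only an illustration under stated assumptions: a fixed intensity threshold, a Hamming distance between binary images, and a codebook that simply stores the binarised training images. The names `binarise`, `hamming` and `nearest_codeword` are hypothetical, and the source does not state which distance or codebook construction Petajan et al. actually used.

```python
def binarise(image, threshold):
    """Flatten a greyscale image (list of rows) into a tuple of 0/1 pixels."""
    return tuple(1 if pixel >= threshold else 0
                 for row in image for pixel in row)


def hamming(a, b):
    """Number of positions at which two equal-length binary images differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_codeword(codebook, frame_bits):
    """Index of the codeword closest to the binarised frame."""
    return min(range(len(codebook)),
               key=lambda i: hamming(codebook[i], frame_bits))


# Toy 3x3 training images: a closed mouth (dark middle row) and an open mouth.
closed = [[200, 200, 200], [30, 30, 30], [200, 200, 200]]
opened = [[30, 30, 30], [30, 30, 30], [30, 30, 30]]
codebook = [binarise(closed, 128), binarise(opened, 128)]

# A noisy test frame that resembles the closed mouth.
frame = [[200, 190, 200], [30, 40, 30], [60, 200, 200]]
index = nearest_codeword(codebook, binarise(frame, 128))
```

Each frame of a sequence is thus reduced to a single codeword index, so an utterance becomes a short symbol sequence that can be compared against stored examples.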
While not specifically addressing the task of lipreading per se,
the work of Chen et al. presents an interesting application of lip
tracking for improved low bit-rate video transmission of talking heads.
In this system, which makes use of colour information in the video
signal, the lips are located via nostril tracking (as in [PBB+88]).
The audio signal is then used to render a set of synthetic lips on the
image of the speaker's face. With the use of some temporal smoothing,
the authors note a perceptual improvement in transmission quality of
talking head sequences which employ the synthetic lips.