Quast, Holger
UCSD; University of Göttingen
INC
INC UCSD 9500 Gilman Drive DEPT 0523 La Jolla, California 92093-0523



Simulating the Perception of Nonverbal Vocal Speech Features

This project investigates how neural data processing techniques can be used to model the impression a speaker conveys through nonverbal speech features such as pitch, prosody, and frequency distribution. The information communicated in spoken language can be categorized as linguistic, paralinguistic, or extralinguistic. Current speech recognition systems, for instance, work on the linguistic (verbal) level; to understand the context of an utterance as well as the state of the speaker, the paralinguistic and extralinguistic layers must be analyzed as well. In this work, the impression of social competence (confidence, valence, persuasiveness, authority) was chosen as an example. The database consists of recordings of professional actors, who were asked to generate specific impressions, and of non-actors, each articulating a standardized German monologue of 8 sentences multiple times. From these recordings, values such as frequency variance, power distribution, speaking speed, and variation of the fundamental frequency are computed (comparable to parameters used in previous research, e.g., in emotion recognition). In addition to these statistical parameters, it is of particular interest which pattern recognition systems can be used to find structures in the fundamental frequency contour throughout a sentence. Cepstrum techniques are used to classify the speech recordings as voiced, unvoiced, or silent and to produce an estimate of their pitch contour. Diffusion neural networks and Hidden Markov models offer promising capabilities for this pattern recognition task in the time domain as well as in the spectral domain, and their results are compared. Once a model for the perception of this impression exists, the findings can be verified by evaluating speech synthesis samples and in turn be used to refine a recognition system that is ultimately able to generalize not only to the given standard database sentences but to any given utterance.
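As an illustration of the cepstrum step mentioned above, the following is a minimal sketch of how a single frame might be labeled as voiced, unvoiced, or silent and how a pitch estimate could be read from the cepstral peak. It is not the implementation used in this project: the function name classify_frame, the energy threshold silence_rms, the cepstral threshold peak_threshold, and the search range fmin to fmax are all assumptions chosen for illustration and would need tuning on real recordings.

```python
import numpy as np

def classify_frame(frame, sample_rate, fmin=60.0, fmax=400.0,
                   silence_rms=0.01, peak_threshold=0.1):
    """Return (label, f0_hz) for one frame; f0_hz is None unless voiced.

    All thresholds are illustrative assumptions, not values from the project.
    """
    frame = np.asarray(frame, dtype=float)

    # Silence: frame energy below an (assumed) RMS threshold.
    rms = np.sqrt(np.mean(frame ** 2))
    if rms < silence_rms:
        return "silent", None

    # Real cepstrum: inverse FFT of the log magnitude spectrum.
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)

    # Search for a peak in the quefrency range of plausible pitch periods.
    qmin = int(sample_rate / fmax)
    qmax = int(sample_rate / fmin)
    segment = cepstrum[qmin:qmax + 1]
    peak = int(np.argmax(segment))

    # A weak peak means no clear periodicity: treat the frame as unvoiced.
    if segment[peak] < peak_threshold:
        return "unvoiced", None

    # The peak quefrency (in samples) is the estimated pitch period.
    return "voiced", sample_rate / (qmin + peak)
```

Applying such a function to successive overlapping frames of a recording would give a frame-by-frame pitch contour with gaps at unvoiced and silent frames. The frame must be long enough to contain several pitch periods, and narrowing the fmin to fmax range to the speaker's expected pitch range reduces the risk of locking onto a multiple of the true period.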
[Note to the committee: the project outlined is ongoing research, and it is unclear at this moment what results will be available at the time of the conference. If accepted, I would very much appreciate the opportunity to give a poster presentation to discuss the current results of this investigation.]