Multimodal Far Field Speech Processing

Project Description

The aim of this project was to build an audio-visual speech recognition system that combines advanced motion estimation algorithms developed by the Mathematical Image Analysis (MIA) group with multi-channel audio processing and speech recognition techniques developed by the Spoken Language Systems group.

While reliable motion estimation is essential for capturing visual mouth movements, multi-channel audio processing techniques provide a powerful tool for filtering out sound that does not originate from the direction of the speaker. We used a very challenging corpus, which stands out in that it was recorded under real conditions: in different cars driving at speeds of up to 55 miles per hour, with a microphone array mounted on the sun visor.

In addition to the adverse acoustic conditions, with an SNR between 15 and -10 dB, there are serious challenges on the visual side. Lighting conditions can change abruptly, the cameras shake, and there are severe compression artifacts. The upper parts of the face, especially the eyes and ears, are often occluded by hair blown around by the wind, which causes many standard face detection algorithms to fail. Accurate detection of the mouth region, however, is imperative for audio-visual speech recognition. We therefore developed a novel, robust face tracking algorithm that uses a probabilistic, scale-invariant face geometry model to infer the positions of occluded facial features. The algorithm handles concurrent hypotheses rather than committing to a single data association, and it constrains the facial feature detectors to regions where the features are expected, which makes it robust to clutter.

After extracting the mouth region, the MIA motion estimator was applied to it. The resulting images were reduced in dimensionality and subsequently used to train Gaussian mixture models (GMMs) for audio-visual speaker activity detection (AVSAD).
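The AVSAD back end can be pictured roughly as follows. This is a minimal sketch in Python, assuming the mouth-region motion estimates have already been flattened into per-frame feature vectors; the use of PCA for the dimensionality reduction, the feature dimensionality, and the number of mixture components are illustrative assumptions rather than the project's exact configuration.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def train_avsad(feats_active, feats_inactive, n_dims=32, n_components=8):
        """Fit a shared PCA plus one GMM per class (speaker active / inactive)."""
        pca = PCA(n_components=n_dims)
        pca.fit(np.vstack([feats_active, feats_inactive]))
        gmm_active = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm_inactive = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm_active.fit(pca.transform(feats_active))
        gmm_inactive.fit(pca.transform(feats_inactive))
        return pca, gmm_active, gmm_inactive

    def detect_activity(frames, pca, gmm_active, gmm_inactive, threshold=0.0):
        """Mark a frame as speaker-active where the GMM log-likelihood ratio is positive."""
        z = pca.transform(frames)
        llr = gmm_active.score_samples(z) - gmm_inactive.score_samples(z)
        return llr > threshold

The per-frame decisions would typically be smoothed over time before they are used to gate the acoustic processing described next.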

One of the main achievements, apart from the face tracking system, is an improved speaker localization system: it uses the AVSAD decisions to estimate the time delay of arrival while the speaker is active, and it estimates the noise for channel-wise spectral subtraction while the speaker is inactive. This improved the word error rate (WER) of the delay-and-sum beamformed signal from 11.6% to 7.3%, compared to a WER of 15.5% for a single distant channel. In addition, we developed a novel single-channel noise reduction technique that estimates the noise from a noisy utterance with a Monte Carlo variant of the expectation-maximization (EM) algorithm. It makes use of importance sampling and Parzen window density estimation and achieved a WER of 10.3%.
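To make the localization and beamforming step concrete, the sketch below shows one common way such a front end can be realized, assuming synchronously sampled channels from the visor-mounted array. GCC-PHAT is used here only as a stand-in for the project's time delay estimator; the integer-sample alignment, the FFT handling, and the spectral-subtraction floor are simplifying assumptions.

    import numpy as np

    def gcc_phat_delay(sig, ref, max_shift):
        """Estimate the delay (in samples) of sig relative to ref via GCC-PHAT."""
        n = 2 * max(len(sig), len(ref))
        cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
        cross /= np.abs(cross) + 1e-12                  # PHAT weighting
        cc = np.fft.irfft(cross, n=n)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        return int(np.argmax(np.abs(cc))) - max_shift

    def delay_and_sum(channels, delays):
        """Align each channel by its integer-sample delay (crude alignment) and average."""
        aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
        return np.mean(aligned, axis=0)

    def spectral_subtraction(segment, noise_mag, floor=0.01):
        """Subtract an estimated noise magnitude spectrum from one signal segment.

        noise_mag must have the same length as np.fft.rfft(segment).
        """
        spec = np.fft.rfft(segment)
        mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
        return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(segment))

    # Usage idea: while AVSAD reports activity, re-estimate the inter-channel delays
    # with gcc_phat_delay and beamform with delay_and_sum; while it reports
    # inactivity, update noise_mag per channel (e.g. the mean magnitude spectrum of
    # the inactive frames) and apply spectral_subtraction before beamforming.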
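The Monte Carlo EM noise estimator itself is not reproduced here; the sketch below only illustrates its two stated building blocks in simplified, one-dimensional form: a Parzen window (kernel) density estimate of a distribution known only through samples, and a self-normalized importance sampling approximation of an expectation, as one would use to approximate an E-step. The Gaussian kernel, the proposal interface, and the test integrand are illustrative assumptions.

    import numpy as np

    def parzen_log_density(x, samples, bandwidth):
        """Parzen-window (Gaussian kernel) estimate of log p(x) from 1-D samples."""
        diffs = (x - samples) / bandwidth
        kernels = np.exp(-0.5 * diffs ** 2) / (np.sqrt(2.0 * np.pi) * bandwidth)
        return float(np.log(np.mean(kernels) + 1e-300))

    def importance_expectation(g, log_target, draw_proposal, log_proposal, n=10000):
        """Approximate E_target[g(x)] with self-normalized importance sampling."""
        xs = draw_proposal(n)                                   # proposal samples
        log_w = np.array([log_target(x) - log_proposal(x) for x in xs])
        w = np.exp(log_w - np.max(log_w))
        w /= w.sum()                                            # self-normalized weights
        return float(np.sum(w * np.array([g(x) for x in xs])))

    # Example: E[x**2] under a standard normal target using a wider normal proposal;
    # both log densities may omit their normalizing constants, since constants cancel
    # in the self-normalized weights. In the noise estimator, the target would instead
    # be the posterior over the noise given the noisy utterance at each EM iteration.
    rng = np.random.default_rng(0)
    ex2 = importance_expectation(
        g=lambda x: x ** 2,
        log_target=lambda x: -0.5 * x ** 2,
        draw_proposal=lambda n: rng.normal(0.0, 2.0, size=n),
        log_proposal=lambda x: -0.5 * (x / 2.0) ** 2,
    )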