Computer Based Auditive and Visual Scene Analysis (CAVSA)

Fund Coordinator

Project Description

Within this project, the following milestones have been reached:

Construction of an extended head model

One goal of the present project is to imitate human abilities with a robotic head. Hence, the layout of our robot Bob should resemble the proportions of a human head, and the mechanics should allow degrees of freedom in the movement of the head and the eyes similar to those of humans. These requirements were met by adding a sophisticated mechanical control system and compact, high-quality cameras.

Improvement of the audio tracker

The algorithms for sound source separation are assisted by the motor control of the head. Until now, Bob could only perceive his environment passively; with the newly added degrees of freedom, he is able to interact with the scene. Depending on the structure of the scene, he can move to a position in the environment that yields better separation results. In addition, an iterative approach has been implemented to further improve the result of the source separation.
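The active, iterative strategy described above can be sketched as a simple control loop: reposition the head, record, separate, score the result, and stop once the separation is good enough. This is a minimal illustration only; the function names (`record`, `separate_sources`, `separation_quality`, `move_head`) and the quality threshold are hypothetical placeholders, not the project's actual API.

```python
# Sketch of an active, iterative source-separation loop.
# All callables passed in are hypothetical stand-ins for the real
# recording, separation, and motor-control components.

def iterative_separation(record, separate_sources, separation_quality,
                         move_head, candidate_poses,
                         quality_target=0.9, max_iters=5):
    """Repeat separation, moving the head when quality is insufficient."""
    best_quality, best_sources = -1.0, None
    for pose in candidate_poses[:max_iters]:
        move_head(pose)                      # reposition the microphones
        mixture = record()                   # capture a new mixture
        sources = separate_sources(mixture)  # run the separation algorithm
        q = separation_quality(sources)      # e.g. a signal-to-interference proxy
        if q > best_quality:
            best_quality, best_sources = q, sources
        if best_quality >= quality_target:   # good enough: stop early
            break
    return best_sources, best_quality
```

The early-stop criterion reflects the idea in the text that the head keeps moving only until a position is found that separates the sources well enough.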

Improvement of the face tracker

An algorithm for the detection of faces in image sequences has been developed. It improves on state-of-the-art techniques for 2D object detection by evaluating an additional depth map obtained from Bob's stereo camera. The depth map is estimated for an image region in real time. First, faces are detected in the 2D images using the approach of Viola and Jones (2001). In a second step, incorrectly detected faces are eliminated by analysis of the depth map.
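The second step above, rejecting false 2D detections via the depth map, could look roughly like the following: keep only detection boxes whose depth values are in a plausible face range and reasonably compact (a real face sits at a fairly uniform distance, while a false positive on background texture often does not). The specific criterion and all thresholds here are illustrative assumptions; the report does not specify the actual depth analysis.

```python
import numpy as np

def filter_faces_by_depth(boxes, depth_map, min_depth=0.3, max_depth=3.0,
                          max_rel_spread=0.25):
    """Keep (x, y, w, h) detections whose depth statistics are
    plausible for a face; all thresholds are illustrative guesses."""
    kept = []
    for (x, y, w, h) in boxes:
        patch = depth_map[y:y + h, x:x + w]
        valid = patch[np.isfinite(patch) & (patch > 0)]
        if valid.size == 0:
            continue                          # no usable depth data
        median = np.median(valid)
        if not (min_depth <= median <= max_depth):
            continue                          # implausibly near or far
        spread = np.percentile(valid, 90) - np.percentile(valid, 10)
        if spread > max_rel_spread * median:
            continue                          # depth too scattered: likely background
        kept.append((x, y, w, h))
    return kept
```

In practice this would run on the boxes returned by the Viola-Jones detector, using the depth map estimated for each detected region.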

Multimodal fusion

For multimodal fusion, it is necessary to represent the results of the audio and face trackers in the same global coordinate system. In recent months, a fully automatic calibration algorithm has been developed. The prerequisites have thus been created for conducting first experiments with multimodal fusion.
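A minimal sketch of what such a common representation enables: a face detection (a pixel column in the camera image) and an audio direction of arrival can both be expressed as azimuth angles in a head-centered global frame and then associated if they agree within a tolerance. The pinhole model, parameter names, and tolerance are assumptions for illustration; the project's actual calibration is not described in this detail.

```python
import math

def pixel_to_azimuth(u, image_width, fov_deg):
    """Map a horizontal pixel coordinate to an azimuth angle (degrees)
    in the camera frame, assuming a simple pinhole model."""
    f = (image_width / 2) / math.tan(math.radians(fov_deg / 2))
    return math.degrees(math.atan((u - image_width / 2) / f))

def associate(audio_azimuth_deg, face_u, image_width, fov_deg,
              head_pan_deg, tol_deg=10.0):
    """Express an audio direction of arrival and a face detection in the
    same head-centered frame and test whether they match."""
    face_azimuth = head_pan_deg + pixel_to_azimuth(face_u, image_width, fov_deg)
    return abs(audio_azimuth_deg - face_azimuth) <= tol_deg, face_azimuth
```

A face at the image center, for example, maps to azimuth 0 in the camera frame, so with the head panned by some angle the face's global azimuth equals the pan angle, directly comparable to the audio tracker's output.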