Multimodal Speech Processing

Vision and Research Strategy

The Multimodal Speech Processing Group combines research in human speech production, enabled by intra-oral motion capture and vocal tract imaging modalities, with applications in speech technology, such as text-to-speech (TTS) synthesis and automatic speech recognition (ASR). Our focus lies on endowing these applications with articulatory models, which extend their capabilities and simplify their control, improving performance and allowing integration into more complex systems for human-computer interaction.

We seek to create efficient models of human speech production, using articulatory data from a variety of acquisition modalities, and to integrate them into a unified framework. This in turn allows us to incorporate advanced phonetic knowledge into speech technology applications. One example of this is the MaryTTS platform, which is maintained and extended by this research group.

Composition of Group

Our group was founded in December 2012 in the Cluster of Excellence and is steadily expanding. Currently, we have three PhD students, Alexander Hewer, Arif Khan, and Eran Raveh, one postdoctoral researcher, Sébastien Le Maguer, as well as several graduate students.

Research Interests and Projects

Current projects:

Phonetic Convergence in Human-Computer Interaction

DFG project in collaboration with the Phonetics group.

Information density aware TTS synthesis

Project in the Collaborative Research Center SFB-1102 Information Density and Linguistic Encoding.


MaryTTS

A multilingual text-to-speech engine written in Java.

[ Website | GitHub ]


TTS support for Luxembourgish, in collaboration with the Institut de langue et de littératures luxembourgeoises at the University of Luxembourg.

Vocal tract modeling from MRI data

Hybrid approaches to image processing/segmentation and mesh modeling of speech production MRI data, in collaboration with the Non-Rigid Shape Analysis group.

Multimodal articulatory analysis of posture and noise in speech

Fusing 3D articulography and tongue ultrasound to investigate the influence of posture and noise on speech production, in collaboration with the Phonetics Lab at the University of Trier.

Cross-lingual facial synthesis for video lip syncing

Realistic resynthesis of face and lip motion to match a video actor's performance to the audio dubbed in another language, in collaboration with the Graphics, Vision & Video group at MPI-Inf.

Previous projects:


Cross-platform visualization and processing of articulographic data.

[ Website ]


Facial Expression based Affective Speech Translation, joint work with Éva Székely and Zeeshan Ahmed at UCD.


Articulatory animation synthesis for multimodal applications, using speech production data.

[ Website | GitHub ]


Articulatory corpus from a single speaker, using multiple modalities, including 3D electromagnetic articulography (EMA), MRI, and video; joint work with Korin Richmond at the University of Edinburgh.

[ Website ]

Dr. Ingmar Steiner


Ingmar Steiner has been head of the research group "Multimodal Speech Processing" since December 2012.

Phone: +49-681-302-70028