Multimodal Speech Processing

Vision and Research Strategy

The Multimodal Speech Processing Group combines research in human speech production, enabled by intra-oral motion capture and vocal tract imaging modalities, with applications in speech technology such as text-to-speech (TTS) synthesis and automatic speech recognition (ASR). Its focus lies on endowing these applications with articulatory models, which extend their capabilities and simplify their control, improving performance and allowing integration into more complex systems for human-computer interaction.

We seek to create efficient models of human speech production, using articulatory data from a variety of acquisition modalities, and integrate them in a unified framework. This in turn allows us to integrate advanced phonetic knowledge into speech technology applications. One example of this is the MaryTTS platform, which is maintained and extended by this Research Group.

Composition of Group

Our group was founded in December 2012 in the Cluster of Excellence and has since expanded to roughly a dozen members. Alongside Ingmar Steiner, we have three PhD students, Alexander Hewer, Arif Khan, and Eran Raveh; one postdoctoral researcher, Sébastien Le Maguer; as well as a number of graduate students and assistants.

Research Interests and Projects

Current projects:

Synchronized Face and Tongue Motion Capture

In collaboration with the Perceiving Systems group at MPI-IS and the Quantitative Linguistics group at the University of Tübingen.

Phonetic Convergence in Human-Computer Interaction

DFG project in collaboration with the Phonetics group.

Information-density-aware TTS synthesis

Project in the Collaborative Research Center SFB-1102 Information Density and Linguistic Encoding.

MaryTTS

A multilingual Text-To-Speech Engine written in Java.

[ Website | GitHub ]

Vocal tract modeling from MRI data

Hybrid approaches to image processing/segmentation and mesh modeling of speech production MRI data, in collaboration with the Morpheo group at Inria and the SPAN group at USC.

Multimodal articulatory analysis of posture and noise in speech

Fusing 3D articulography and tongue ultrasound to investigate the influence of posture and noise on speech production, in collaboration with the Phonetics Lab at the University of Trier.

Cross-lingual facial synthesis for video lip syncing

Realistic resynthesis of face and lip motion to match a video actor's performance to the audio dubbed in another language, in collaboration with the Graphics, Vision & Video group at MPI-Inf.

Previous projects:


TTS support for Luxembourgish, in collaboration with the Institut de langue et de littératures luxembourgeoises at the University of Luxembourg.


Cross-platform visualization and processing of articulographic data.

[ Website ]


Facial Expression based Affective Speech Translation, joint work with Éva Székely and Zeeshan Ahmed at UCD.


Articulatory animation synthesis for multimodal applications, using speech production data.

[ Website | GitHub ]


Articulatory corpus from a single speaker, using modalities 3D EMA, MRI, video, and more; joint work with Korin Richmond at the University of Edinburgh.

[ Website ]

Dr. Ingmar Steiner

Ingmar Steiner heads the research group "Multimodal Speech Processing". His research centers on models of human speech production, using articulatory acquisition modalities (such as motion capture and medical imaging), and on integrating these models into multimodal applications (such as speech synthesis).

Phone: +49-681-302-70028