In this section, we present the most important results in the area of text and speech processing. Semantically informed text processing is essential for the Open Science Web (RA5 and Demonstrator). Cross-modal speech processing is an important prerequisite to Multimodal Interaction (RA9 and In-Car Demonstrator), and profits from collaboration with Visual Computing (RA2).
Extending semantically informed linguistic grammars through treebank development:
Cramer and Zhang demonstrated rapid prototyping of a semantically informed grammar of German by combining introspective grammar development with a corpus-driven approach. In collaboration with Emily Bender's team at the University of Washington, Antske Fokkens has developed modularized grammar components covering word order phenomena in the larger project context of the LinGO Grammar Matrix. In an ongoing project with Dan Flickinger (Stanford University), we are exploiting methods that integrate human linguistic intuitions with grammar-based analysis to construct a corpus richly annotated with both syntactic and semantic information. In 2010, Zhang and Kordoni presented efficient ways of using machine learning techniques in treebanking, improving treebanking efficiency by over 50% without hurting annotation quality.
Cross-domain hybrid syntactic and semantic parsing:
A hybrid system that combines statistical models with a hand-crafted linguistic grammar in the dependency parser and semantic role labeler achieved leading performance in the open competition (CoNLL shared task, 2008). The improvements observed on English in previous shared tasks were also confirmed on multiple languages, in collaboration with Stephan Oepen (University of Oslo), in the CoNLL 2009 shared task on syntactic and semantic dependency parsing with seven languages. In 2009, Zhang and Wang showed that the hybrid parsing strategy, which combines hand-crafted linguistic grammars with machine-learning-based statistical models, generally improves cross-domain performance.
Applications of semantically informed grammatical processing:
One application for which deep syntactic parsing has been used is opinion mining. As part of the long-term collaboration with Shanghai Jiao Tong University, we jointly participated in the Chinese Opinion Analysis Evaluation 2009 using a system with integrated syntactic and semantic analysis. A second application is the recognition of textual entailment (RTE), which in turn can be used for question answering or other forms of semantic search. In this application, both syntactic and semantic dependency structures have been used successfully. Dinu & Wang (2009) studied the use of DIRT inference rules in the RTE task and achieved a moderate improvement. In 2009, Wang and Zhang showed that shallow semantic analysis, such as predicate-argument structures, brings a significant improvement in the RTE task.
Minimally supervised relation extraction:
The central goal of this research is to continue the successful application of linguistic analysis methods in bootstrapping approaches to the automatic learning of relation extraction grammars. In 2008, Xu et al. demonstrated a novel kind of coreference resolution that exploits the domain knowledge used for relation extraction. Another method for improving the precision of relation extraction uses closed-world seeds, which implicitly provide negative examples in addition to the positive ones. Recent experiments have used truly deep parsing (HPSG PET Parser with the ERG) for learning relation extraction grammars, yielding better results than were obtained using the semantic (RMRS) representations.
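The role of closed-world seeds in estimating precision can be illustrated with a minimal sketch. It assumes, purely for illustration, that the seed set is complete for a known set of key entities, so that any extracted instance about such an entity that is not among the seeds counts as a negative example. The function name `rule_precision`, the tuple representation of relation instances, and the toy data are hypothetical and are not taken from the system described above.

```python
def rule_precision(extractions, seeds, closed_world_keys):
    """Estimate the precision of one extraction rule from closed-world seeds.

    extractions:       iterable of (key, value) relation instances found by the rule
    seeds:             set of known-correct (key, value) instances
    closed_world_keys: keys for which the seed set is assumed to be complete

    An instance is a positive if it is a seed, a negative if its key is
    covered by the closed world but the instance is not a seed, and is
    left unjudged otherwise. Returns None if nothing could be judged.
    """
    positives = 0
    negatives = 0
    for instance in extractions:
        key = instance[0]
        if instance in seeds:
            positives += 1
        elif key in closed_world_keys:
            negatives += 1
        # Instances with keys outside the closed world stay unjudged.
    judged = positives + negatives
    return positives / judged if judged else None


# Toy data with placeholder names (hypothetical).
seeds = {("PrizeA", "Alice"), ("PrizeA", "Bob")}
closed_world_keys = {"PrizeA"}
extractions = [("PrizeA", "Alice"), ("PrizeA", "Carol"), ("PrizeB", "Dave")]

print(rule_precision(extractions, seeds, closed_world_keys))
```

In this toy run, one extraction is confirmed by a seed, one is contradicted by the closed world, and one cannot be judged, so only the first two enter the estimate.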
Contextualization of distributional meaning representations:
Thater, Fürstenau, and Pinkal developed syntactically enriched vector models supporting the computation of contextualized semantic representations, which allow one to assess semantic similarity between words in context and to associate usages with context-specific word senses. The approach makes crucial use of syntactic dependency information and second-order distributional frequencies. Applied to the SemEval 2007 lexical substitution task, their model substantially outperforms all preceding systems. G. Dinu, Pinkal's doctoral student, employed a latent variable approach for the task of measuring context-specific semantic similarity: meanings are modeled as probability distributions over a set of latent senses, which are modified by contextual disambiguation. Although the input frequencies are collected with a standard bag-of-words approach rather than a syntax-sensitive one, her results are competitive on word similarity and lexical substitution tasks.
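The idea of meanings as distributions over latent senses that are modified by context can be made concrete with a small sketch. It assumes a simple Bayesian-style update in which the sense distribution of a word is reweighted by how likely the context word is under each sense and then renormalized; this update rule, the function names, and all numbers are illustrative assumptions, not the model described above.

```python
from math import sqrt


def contextualize(sense_given_word, context_given_sense):
    """Reweight P(sense | word) by P(context | sense) and renormalize."""
    weighted = [p_s * p_c for p_s, p_c in zip(sense_given_word, context_given_sense)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("context has zero probability under every sense")
    return [w / total for w in weighted]


def cosine(u, v):
    """Cosine similarity between two sense distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy example with two latent senses (hypothetical numbers).
bank = [0.5, 0.5]            # out-of-context sense distribution
money_context = [0.8, 0.2]   # P("money" | sense)
water_context = [0.1, 0.9]   # P("water" | sense)

bank_money = contextualize(bank, money_context)
bank_water = contextualize(bank, water_context)

print(bank_money, bank_water, cosine(bank_money, bank_water))
```

The two contextualized distributions concentrate on different senses, so their similarity is lower than the similarity of either distribution with itself.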
Work in several IRGs complements the RA1 research on contextualization. Titov and Kozhevnikov's work on bootstrapping semantic analyzers from non-contradictory texts successfully treats a known shortcoming of all existing distributional semantic approaches, i.e., the poor distinguishability of words in a lexical field with opposite polarity. Sporleder and Li successfully approached the open problem of token-level figurative speech detection as an important part of contextual disambiguation. In a collaboration with Pinkal's group, Koller and Thater provided a highly efficient method for ambiguity reduction on the sentence level.
Minimally supervised learning of script information:
Regneri, Koller, and Pinkal proposed a novel approach for learning patterns of event sequences that make up scripts, along with constraints on their temporal ordering. A large set of natural language descriptions of script-specific event sequences (collected using Mechanical Turk) is used as the basis for computing a graph representation of the script's temporal structure via a multiple sequence alignment algorithm. The approach is complementary to recent work by N. Chambers and D. Jurafsky, who present an unsupervised method for learning event chains from texts. Our method needs minimal supervision in that scenarios must be identified for script learning; on the other hand, it provides more explicit event information than techniques relying on plain texts.
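As an illustration of how alignment relates event descriptions from different sequences, the following sketch aligns two event sequences with the classic Needleman-Wunsch dynamic program. It is a deliberate simplification: it handles only the pairwise case rather than multiple sequence alignment, and it scores events by exact string match, whereas real event descriptions would need a semantic similarity measure. The function name, the scoring values, and the restaurant example are assumptions made for illustration.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Globally align two event sequences (Needleman-Wunsch).

    Returns a list of (event_a, event_b) pairs; None marks a gap.
    """
    n, m = len(seq_a), len(seq_b)

    def sub(i, j):
        # Substitution score for aligning seq_a[i-1] with seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((seq_a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, seq_b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Two toy descriptions of a restaurant visit (hypothetical data).
description_1 = ["enter", "order", "eat", "pay", "leave"]
description_2 = ["enter", "order", "pay", "leave"]

for pair in align(description_1, description_2):
    print(pair)
```

Events aligned with each other can be treated as the same script event, and events aligned with a gap as optional steps; the order of the aligned columns then suggests temporal ordering constraints.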
Frame-based shallow semantic processing:
A long-term research aim in Pinkal's group has been the development of resources and methods for shallow semantic processing in the framework of frame semantics. Since parser performance has become an increasingly urgent problem, we thoroughly studied the potential for improvement in frame-based semantic parsing and the actual and potential impact of frame-semantic information on language technology applications. Burchardt et al. present a study assessing the impact of frame-semantic information on textual entailment, which in principle corroborates the high potential of frame-semantic information for advanced inference tasks, but shows at the same time that this potential is counterbalanced by the poor performance of existing frame and role labelers. We explored a variety of methods to improve parser performance, including data expansion techniques for training corpora, active learning methods for frame assignment, and methods for automatic generation of coarse-grained FrameNet versions. In joint work with the groups of C. Sporleder and I. Titov, we are investigating semi-supervised and unsupervised methods for frame-semantic analysis.
Situated language comprehension:
A central objective of RA1 is to empirically investigate and computationally model how non-linguistic information in the environment can contribute to spoken language processing. One fundamental finding of our experimental investigations has been the high priority given to scene information in language comprehension, as revealed by both behavioral (eye movements) and neurophysiological (event-related potentials, ERP) methods. This work has led to the development of the Coordinated Interplay Account (CIA) of situated comprehension, which explains the real-time influence of speech on visual attention, and of visual attention on comprehension: computational models of the CIA predict both human gaze behavior and ERP findings. In current research, these findings are being evaluated in the context of navigation instruction generation in dynamic virtual environments, to identify how speech generation can become responsive to user gaze in real time (IRG Koller). Continuing research in this area is supported by the recently established gaze lab, which enables 180-degree tracking of head orientation and eye gaze in virtual environments, and a 64-channel EEG laboratory.
Robot gaze and speech:
In addition to investigating how human visual attention is influenced by speech, empirical research has examined how users integrate spoken and visual information from the environment, including robots and virtual agents. Maria Staudte's doctoral dissertation examined how language-mediated robot gaze can augment user comprehension, demonstrating that users treat robot/agent gaze much like human gaze, using it to ground mentioned referents to objects and events in the visual environment. Not only does human-like referential robot gaze improve comprehension of robot speech, but inappropriate or random gaze can also disrupt user comprehension. In current work, we are collaborating to extend these findings to virtual agents and environments (IRG Kipp), which enable a much richer inventory of gaze and gesture cues to be examined.
Situated language learning:
Virtual environments have been further exploited as an ideal setting in which to investigate semi-naturalistic language learning. J. Köhne, a doctoral student of Crocker, has examined the role of visual, linguistic, and probabilistic cues in second-language learning, with key results identifying the ability of people to integrate diverse information cues, many of which rely on cross-trial inferences. This empirical research is supported by A. Alishahi's development of computational models that can similarly integrate visual, linguistic, and probabilistic cues to explain word learning.
Distant speech recognition:
Speech recognition without the use of a close-talking microphone is a largely unsolved problem involving several processing steps that we have worked on in RA1. We successfully improved speaker tracking algorithms to find the speakers' positions based on the input of multiple microphones, combined this with visual information from lip movement in a collaboration with RA2 (see below), and fused all this information using a novel maximum negentropy beamformer, a paradigm inspired by blind source separation. Before or after the beamforming, single-channel speech enhancement techniques improve the results even further.
Contributions to cross-modal aspects of natural-language generation include V. Rieser's doctoral dissertation on the acquisition of dialogue strategies for multimodal output generation using reinforcement learning, which is also a contribution to RA9 and is described in greater detail there. A. Koller investigates a planning-based approach to model generation and dialogue in virtual environments.