Structured Distributional Models of Semantics
Our efforts in the second funding phase have focused on methods that include a rich representation of linguistic structure (including semantic frames and roles) and exploit prior knowledge to make more effective use of large amounts of linguistic data. We integrated distributional structured vector space models with an incremental parsing model, and developed the first incremental semantic role labeler and, more recently, a deep neural network for selectional preference prediction. We explored non-parametric Bayesian models in shallow semantic parsing tasks, and achieved competitive results both on a question-answering task and in direct evaluation against human annotation. For languages for which role labeling annotation is not available, we also showed that competitive results on a semantic role labeling task can be achieved by using a cross-lingual model transfer approach. We used semantic role information for the task of phrase-level paraphrase identification, and successfully applied the resulting model to short-answer scoring as an educational application.
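To illustrate the kind of architecture involved in selectional preference prediction, the following sketch shows a small feed-forward scorer that rates how plausible a candidate filler is for a given predicate and role. The class name, dimensions and training regime are illustrative assumptions, not a description of the actual model.

```python
# Illustrative sketch only: a small feed-forward scorer for selectional
# preference, i.e. how plausible a candidate filler is for a given
# predicate and role (e.g. eat + PATIENT + apple vs. eat + PATIENT + chair).
# Vocabulary sizes, dimensions and training details are invented here.
import torch
import torch.nn as nn

class SelectionalPreferenceScorer(nn.Module):
    def __init__(self, n_predicates, n_roles, n_fillers, dim=100):
        super().__init__()
        self.pred_emb = nn.Embedding(n_predicates, dim)
        self.role_emb = nn.Embedding(n_roles, dim)
        self.fill_emb = nn.Embedding(n_fillers, dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),   # a single plausibility score
        )

    def forward(self, pred, role, filler):
        x = torch.cat([self.pred_emb(pred),
                       self.role_emb(role),
                       self.fill_emb(filler)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Such a scorer is typically trained with a ranking or logistic loss
# against randomly sampled negative fillers for the same predicate/role.
scorer = SelectionalPreferenceScorer(n_predicates=5000, n_roles=10, n_fillers=20000)
score = scorer(torch.tensor([0]), torch.tensor([1]), torch.tensor([42]))
```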
In a collaboration with RA2 (Bernt Schiele’s group), we grounded lexical semantics in visual information extracted from videos, thus obtaining a finer-grained distributional analysis of event and action verbs. Our distributional semantic models can also be used for cognitive modelling tasks, and we have shown that semantic surprisal is predictive of word durations in spontaneous speech; we are currently extending this work to improve the naturalness of synthesized speech.
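Surprisal is the negative log-probability of a word given its context. The toy sketch below computes it with an add-one-smoothed bigram model; our semantic surprisal models condition on richer distributional and structural context, so this only illustrates the quantity itself.

```python
# Minimal sketch of surprisal: -log2 P(word | context), here with a toy
# bigram model and add-one smoothing.  The corpus below is invented.
import math
from collections import Counter

def train_bigram(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def surprisal(prev, word, unigrams, bigrams, vocab_size):
    # higher values = less predictable word in this context
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

corpus = [["the", "dog", "barked"], ["the", "cat", "meowed"]]
uni, bi = train_bigram(corpus)
print(surprisal("the", "dog", uni, bi, vocab_size=len(uni)))
```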
Lexical Semantics
In collaboration with RA5, we have made major advances in disambiguating the senses of words or multi-word phrases, with a focus on phrases that denote semantic relations. This includes the development of large resources for relational paraphrases and their alignment with other resources. New methods for word sense disambiguation, with applications in named entity recognition and classification, have been developed. We have devised probabilistic graphical models that allow joint inference on several NLP tasks in an integrated manner, enhancing the overall output quality: co-reference resolution and named entity disambiguation, as well as named entity recognition and named entity disambiguation. Additionally, we worked on structured distributional models for thematic fit tasks (i.e., distributional spaces where words are represented as a function of their role in the sentence), and showed that an unsupervised word sense disambiguation step significantly improves over the state of the art on this task.
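As an illustration of the thematic fit setting, a common prototype-based formulation scores a candidate filler by its cosine similarity to the centroid of typical fillers for a verb-role pair. The vectors and word lists in the sketch below are toy placeholders, not our actual structured distributional space.

```python
# Illustrative sketch of a prototype-based thematic fit score: how well a
# candidate noun fits a verb's role (e.g. PATIENT of "cut") is the cosine
# between the noun's vector and the centroid of typical fillers for that
# role.  The random vectors and filler lists below are toy placeholders.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def thematic_fit(candidate, typical_fillers, vectors):
    prototype = np.mean([vectors[w] for w in typical_fillers], axis=0)
    return cosine(vectors[candidate], prototype)

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in
           ["bread", "paper", "rope", "cake", "idea"]}
# Fit of "cake" vs. "idea" as PATIENT of "cut", given typical patients:
print(thematic_fit("cake", ["bread", "paper", "rope"], vectors))
print(thematic_fit("idea", ["bread", "paper", "rope"], vectors))
```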
Combining machine learning and linguistic processing for information extraction
On the topic of schema-based relation extraction, we moved from a minimally supervised learning approach to truly web-based learning via distant supervision. A new method was developed for learning dependency structures belonging to instances of the target relations. Like other research efforts in the area, we took semantic seed instances from existing knowledge repositories, and extracted tens of thousands of Freebase facts for a rather complete harvest of learned patterns. We cover n-ary relations, learning patterns for instances containing between two and n arguments. Instead of training a classifier, we learn discrete dependency rules represented in a recursive rule formalism, and apply a range of semantic filters to the learned rules. One of these filters checks if there is at least one content word in the learned pattern that is semantically connected to the target relation, using BabelNet and lexical disambiguation via Babelfy. We furthermore successfully increased the efficiency of our information extraction pipeline in a collaboration with the Berlin Big Data Center (BBDC), and integrated our pipeline with a real-time analytics system. A current focus of our research in this area lies on extracting information from indirect mentions of relations (textual entailment, presupposition, metonymy, metaphor) or relation arguments outside the sentence.
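The following simplified sketch illustrates the distant supervision step: sentences whose entities match a seed fact are treated as noisy positive examples, and the dependency path between the two arguments is kept as a candidate pattern. The seed facts, sentence encoding and helper functions are invented for illustration; the actual pipeline learns rules in a recursive formalism and applies BabelNet/Babelfy-based semantic filters.

```python
# Toy distant supervision: match seed facts against a (pre-parsed) sentence
# and keep the dependency path between the two arguments as a pattern.

SEED_FACTS = {("Marie Curie", "Pierre Curie"): "spouse"}

# A sentence as tokens plus a head index per token (-1 = root).
sentence = {
    "tokens": ["Marie Curie", "married", "Pierre Curie", "in", "1895"],
    "heads":  [1, -1, 1, 1, 3],
}

def path_to_root(i, heads):
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def dependency_path(i, j, heads):
    """Token positions on the tree path between positions i and j."""
    pi, pj = path_to_root(i, heads), path_to_root(j, heads)
    lca = next(n for n in pi if n in pj)      # lowest common ancestor
    return pi[:pi.index(lca)] + [lca] + list(reversed(pj[:pj.index(lca)]))

def distant_matches(sentence, seeds):
    toks = sentence["tokens"]
    for (e1, e2), relation in seeds.items():
        if e1 in toks and e2 in toks:
            path = dependency_path(toks.index(e1), toks.index(e2),
                                   sentence["heads"])
            yield relation, [toks[k] for k in path]

print(list(distant_matches(sentence, SEED_FACTS)))
# -> [('spouse', ['Marie Curie', 'married', 'Pierre Curie'])]
```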
Discourse-level Processing
Script knowledge, i.e., common-sense knowledge describing patterns of event sequences for typical activities, is relevant for the prediction of discourse relations as well as for other discourse-level tasks like coreference resolution and temporal ordering classification. We explored different models of script knowledge based on existing corpora, a hierarchical Bayesian model as well as a neural network model, the latter substantially outperforming the state of the art on the temporal ordering task. As a basis for high-accuracy, semi-supervised extraction of script knowledge, we crowdsourced a new, much larger corpus containing generic descriptions of script information plus partial event-level alignments between the scripts (the DeScript corpus). We also created a corpus of simple narrative texts annotated with detailed script information (the InScript corpus). We have achieved good results on the task of automatic participant prediction. In a collaboration with RA2, we used script data for the improvement of action recognition in video understanding, as well as for the generation of text from videos. We also developed a distributional approach to discourse relation processing, improving the state of the art on discourse relation parsing by employing a neural network model.
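As a toy illustration of how script knowledge supports temporal ordering, the sketch below counts, over a handful of invented event sequence descriptions, how often one event precedes another and predicts the more frequent order; the actual models are a hierarchical Bayesian model and a neural network.

```python
# Count-based toy model of script-derived temporal ordering.  The sequences
# are invented examples for a "making coffee" script.
from collections import Counter
from itertools import combinations

sequences = [
    ["boil water", "grind beans", "pour water", "drink coffee"],
    ["grind beans", "boil water", "pour water", "drink coffee"],
    ["boil water", "pour water", "drink coffee"],
]

before = Counter()
for seq in sequences:
    for a, b in combinations(seq, 2):   # a occurs before b in this sequence
        before[(a, b)] += 1

def predict_order(a, b):
    return (a, b) if before[(a, b)] >= before[(b, a)] else (b, a)

print(predict_order("drink coffee", "boil water"))
# -> ('boil water', 'drink coffee')
```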
Contrary to the assumptions of standard models of script processing, only some verb instances denote events, or describe specific situations at all. We developed models for the classification of aspectual classes along different dimensions (stative/eventive, generic/specific/habitual). Using sequence labeling techniques (CRFs), we were able to outperform the state of the art, and could show that part of these distinctions depends on the wider discourse context.
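A minimal sketch of the sequence labeling setup, assuming the sklearn-crfsuite package: each document is treated as a sequence of verb instances, so that features from neighboring clauses (the discourse context) can influence each aspectual decision. The features and the tiny training example are illustrative only.

```python
# Aspectual classification as sequence labeling over the verbs of a text.
# Requires the sklearn-crfsuite package; features and data are toy examples.
import sklearn_crfsuite

def verb_features(sent, i):
    form, lemma, tense = sent[i]
    feats = {"lemma": lemma, "tense": tense, "bias": 1.0}
    if i > 0:
        feats["prev_lemma"] = sent[i - 1][1]   # discourse context feature
    return feats

# Each training "sentence" is a sequence of (form, lemma, tense) verb tuples
# with one aspectual label per verb.
train_sents = [[("knows", "know", "pres"), ("ran", "run", "past")]]
train_labels = [["STATIVE", "EVENTIVE"]]

X = [[verb_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X))
```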
Language-centered Multimodal Interaction
In order to approach our long-term goal of flexible, adaptive, open-domain multimodal dialogue systems driven by adequate cognitive modeling of human behavior, we have undertaken research and development efforts in human-computer interaction using speech, gaze and a shared situated frame of reference.
Flexible dialogue management strategies can be learned from dialogue data; we therefore collected interactive multimodal data in domains of varying complexity, e.g., multi-player games, tutoring dialogues for training metacognitive skills, political debate and multi-issue bargaining dialogues. Using multiple sensors (e.g., microphones and Kinect), we recorded more than 100 speakers and annotated a total of about 30 hours according to the ISO 24617-2 dialogue act annotation scheme (issued under our co-authorship). The resulting corpora have been used for automatic dialogue act recognition. Current research building on this resource concentrates on cognitive aspects of dialogue modeling and management. We successfully integrated ACT-R-based cognitive task models into our multi-thread dialogue management architecture.
We have also investigated the interaction of spoken and visual modalities in human cognition. Specifically, we have examined the benefit of visual and linguistic contexts for word learning, complementing our previous work on computational modeling. Eye-tracking evidence reveals the extremely rapid deployment of both immediate context and cross-situational learning mechanisms.
To improve virtual reality interactions with avatars, we investigated whether the interaction benefits from enabling the computer system to react to the eye gaze patterns of the human user. We found that the natural language generation system that provided eye-movement-based feedback outperformed baseline systems that did not have access to eye gaze information. This suggests that NLG systems can, in principle, exploit human eye movements and react quickly enough to encourage correct, or prevent wrong, reference resolution and any resulting actions in the virtual environment. Based on this success, IRGs Staudte and Heloir have now set up an avatar that receives information about listener gaze and can follow the gaze of the human speaker. This novel setup allows for truly interactive and adaptive behavior, enabling us to further explore how an artificial agent should exploit and react to human gaze and speech.
IRGs Steiner and Wuhrer have collaborated on improving real-time naturalistic speech animation for an open-source game engine. They performed unsupervised segmentation of vocal tract MRI and intra-oral 3D scan data to create multilinear shape space models of the tongue and oral cavity, which are then deformed using data from electromagnetic articulography. Furthermore, IRG Steiner extended work on facial-expression-based affective speech translation.
A further important task in situated interaction using speech is speaker localization. Researchers from Klakow’s group have proposed a novel probabilistic framework for localizing multiple speakers with a microphone array. This approach increases localization precision, especially when few microphones are available.
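For illustration, the sketch below implements the classic GCC-PHAT time-delay estimate between two microphone channels, a basic building block behind array-based localization; our actual framework is probabilistic and handles multiple speakers, which this sketch does not attempt.

```python
# GCC-PHAT time difference of arrival (TDOA) estimate between two channels.
# The simulated signals below are toy data.
import numpy as np

def gcc_phat(x, y, fs):
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                            # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))  # center zero lag
    shift = np.argmax(np.abs(cc)) - n // 2
    return shift / fs                                 # delay in seconds

rng = np.random.default_rng(0)
signal = rng.standard_normal(1600)
delayed = np.roll(signal, 8)                 # simulate an 8-sample delay
print(gcc_phat(delayed, signal, fs=16000))   # ~ 8 / 16000 = 0.0005 s
```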
Behavioral, neurophysiological and pupillometric measures of situated interaction
A continuing focus of RA1 has been the experimental investigation of how spoken language processing interacts with visual context. To better understand the cognitive mechanisms underlying scene-utterance integration, we have begun to use neurophysiological methods (EEG/ERP). The focus so far has been on determining the neurophysiological indices associated with the comprehension of referring expressions in visual contexts. In a study comparing over-specified, minimally specified, and mismatching expressions, for example, we found a clear gradient response of the N400 component (over-specified being preferred). In addition, underspecified expressions (in which the visual referent is not uniquely picked out) result in an early positive-going potential, which is being further investigated in ongoing research.
In order to improve the naturalness of interaction with speech-enabled virtual agents, ongoing research is examining both (a) how the neurophysiological (ERP) correlates of referent comprehension are modulated by (artificial) speaker gaze in visual contexts, and (b) the extent to which traditional ERP correlates of spoken language comprehension are modulated by synthesized speech.
In some types of situated tasks (e.g., interacting with a virtual agent or dialog system while driving), the use of EEG or eye movements during reading is, however, not possible (due to the driving task and resulting movement artifacts). We have therefore also investigated the suitability of a novel pupillometric measure, the Index of Cognitive Activity (ICA). The ICA is a reliable indicator of linguistic processing difficulty, and is at the same time a highly dynamic measure which allows us to separate effects of the driving task from effects of language processing. This method has also delivered positive results for measuring effects of processing difficulty in visual world settings.