We conducted research on the modeling of costs and utilities associated with real-world user activities and the influence of computer-based dialog support on the resulting overall cost and outcome values. For this purpose, we assumed a Markovian plan execution model in which state transitions correspond to actions and are labeled with costs, while states correspond to the (partial) attainment of goals and are associated with (intermediate) rewards. We extended this model to account for three types of system interaction by introducing additional transitions: proactive information provision, active anticipation of user actions, and active intervention on user errors.
This interaction model augments the underlying activity model and makes it possible, for example, to decide whether providing proactive assistance benefits the user in a particular context. Moreover, the proposed model can distinguish between different modalities for a single support action and associate each modality with its own cost and expected success probability. This makes the developed cost and reward model a valuable knowledge source for multimodal interaction and dialog planning. Further applications of the model include the assessment of sensor utilities for sensor selection in resource-constrained sensor networks. For this purpose, a Partially Observable Markov Decision Process (POMDP) is constructed from the developed cost-utility model, which can then be solved using standard methods. The accompanying figure shows an example of a dialog planning graph. The activity model (left) summarizes the different possible activities performed by a user, and the corresponding interaction model (right) shows possible intervention points of a dialog system. By calculating the costs and rewards, a system can determine whether it is advantageous to proactively interrupt the user and propose a better strategy for achieving the user's goal.
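The decision described above, whether a proactive support action pays off and in which modality, can be illustrated with a minimal sketch. It is not the model developed in the project: the class `SupportAction`, the functions `expected_gain` and `best_support`, and all numeric values are hypothetical, and the sketch collapses the Markovian plan model into two assumed outcome values (the expected value of the user's plan with and without successful help).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SupportAction:
    """One modality of a support action (hypothetical illustration)."""
    modality: str        # e.g. "speech" or "display"
    cost: float          # cost of interrupting the user in this modality
    success_prob: float  # probability that the user takes up the suggestion


def expected_gain(action: SupportAction,
                  value_with_help: float,
                  value_without_help: float) -> float:
    """Expected net benefit of offering the action instead of staying silent."""
    expected_value = (action.success_prob * value_with_help
                      + (1.0 - action.success_prob) * value_without_help)
    return expected_value - action.cost - value_without_help


def best_support(actions: Iterable[SupportAction],
                 value_with_help: float,
                 value_without_help: float) -> Optional[SupportAction]:
    """Return the action with the largest positive gain, or None to stay silent."""
    best, best_gain = None, 0.0
    for action in actions:
        gain = expected_gain(action, value_with_help, value_without_help)
        if gain > best_gain:
            best, best_gain = action, gain
    return best
```

With these assumptions, a system would intervene only if `best_support` returns an action; a result of `None` means that every modality costs more than it is expected to gain.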
Tangible Multimodal Interaction
We have investigated new fusion methods for physical user actions and spoken user input. On the one hand, we realized an ontology-based real-time fusion engine that can disambiguate spoken input by means of the action context. If the user says "More," the dialog system may interpret this either as "increase the temperature" or as "turn up the radio volume," depending on whether the user is turning the temperature setting knob or the volume control knob. On the other hand, the fusion engine can use spoken input to disambiguate a user action. If the driver says "Open the left window," then turning the multifunctional control wheel of the iDrive device can be used to fine-tune how far the window opens, whereas after saying "Please switch on the CD player" the wheel can be used to control the volume of the loudspeakers.
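The two directions of disambiguation can be sketched with simple lookup tables. This is only an illustration of the idea, not the ontology-based engine itself: the rule tables, the event names such as `temperature_knob`, and the functions `interpret_utterance` and `wheel_target` are all hypothetical.

```python
from typing import Optional

# Hypothetical rules: (normalized utterance, concurrent physical action) -> command.
SPEECH_BY_ACTION = {
    ("more", "temperature_knob"): "increase_temperature",
    ("more", "volume_knob"): "increase_radio_volume",
}

# Hypothetical rules: preceding utterance -> parameter controlled by the wheel.
WHEEL_BY_SPEECH = {
    "open the left window": "left_window_opening",
    "please switch on the cd player": "loudspeaker_volume",
}


def _normalize(utterance: str) -> str:
    """Lower-case the utterance and drop surrounding whitespace and punctuation."""
    return utterance.strip().strip(",.!?").lower()


def interpret_utterance(utterance: str, action: Optional[str]) -> Optional[str]:
    """Disambiguate spoken input using the physical action context."""
    return SPEECH_BY_ACTION.get((_normalize(utterance), action))


def wheel_target(previous_utterance: str) -> Optional[str]:
    """Disambiguate a turn of the control wheel using the preceding utterance."""
    return WHEEL_BY_SPEECH.get(_normalize(previous_utterance))
```

Both functions return `None` when the context does not resolve the ambiguity, which a dialog system could take as a cue to ask a clarification question.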
Tangible Anthropomorphic Interfaces
People tend to treat objects as being similar to humans, which allows users to explain the behavior of a system when they lack a good conceptual model. Anthropomorphic interfaces generally attempt to build on established human skills learned in daily social encounters. The research domain of tangible user interfaces shares this goal, focusing on motor skills mainly related to the physical manipulation of tangible objects. Both concepts complement each other; thus, the naturalness of conversational speech and the physicality of real-world objects can serve as the foundation for a multimodal dialog interaction paradigm for smart objects. Schmitz has introduced the novel concept of tangible interaction with anthropomorphic smart objects, which has been derived by combining tangible with anthropomorphic user interfaces in the domain of smart object interaction. This is the first approach that consistently conceptualizes tangible interaction with physical, anthropomorphic objects. During the reporting period, we experimented with an instrumented cocktail shaker that includes a 3D accelerometer as well as light and color intensity sensors. The shaker can detect incorrect shaking actions of the user for a given cocktail and complain through anthropomorphic speech output such as "Wow! You are shaking me too fast!" When the user pours the appropriate liquids into the shaker, these actions are acknowledged by swallowing sounds. Thus, we created a model of tangible interaction with an animalistic smart object: the life-like cocktail shaker.
Dialog Strategies for Multimodal Output Generation
In Manfred Pinkal's group, the automatic acquisition of dialog strategies for multimodal output generation was investigated in the example domain of an in-car music player. A reinforcement learning (RL) framework was used to optimize multimodal information-seeking dialog strategies. Research focused on jointly and automatically optimizing the decisions of when to present information, how many pieces of information to present to users, and how to present information (speech vs. graphics). These decisions are represented as a hierarchical learning problem using the formalism of Markov decision processes. The simulation environment used to train the policies is learned from Wizard-of-Oz data collected in a collaboration between Pinkal's and Wahlster's groups in the EU project TALK. The RL-based policy significantly outperformed supervised learning (based on the Wizard-of-Oz performances in the TALK data), gaining on average 18% more reward and being rated 10% better in interaction with human users.
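To make the learning setup more concrete, the following sketch shows a tabular Q-learning update for a flat toy version of the presentation decision. The text above does not state which RL algorithm was used, so Q-learning is an assumption here, and the sketch ignores the hierarchical structure. The action names, state names, and reward values are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Tuple

# Hypothetical action set for an information-seeking dialog.
ACTIONS = ("ask_for_constraint", "present_by_speech", "present_on_screen")

QTable = DefaultDict[Tuple[str, str], float]


def new_q_table() -> QTable:
    """Create a Q-table in which every unseen state-action pair has value 0."""
    return defaultdict(float)


def q_update(q: QTable, state: str, action: str, reward: float,
             next_state: Optional[str], alpha: float = 0.1,
             gamma: float = 0.95) -> float:
    """Apply one Q-learning update; next_state=None marks the end of a dialog."""
    if next_state is None:
        best_next = 0.0
    else:
        best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q: QTable, state: str) -> str:
    """Return the action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])
```

In a simulation-based setup, `q_update` would be called once per simulated dialog turn, with the reward supplied by the simulated environment, and `greedy_action` would define the learned policy after training.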
Automatic Generation of Multimodal Meeting Summaries
Tilman Becker and his group have focused on the automatic generation of multimodal meeting summaries. Apart from their generation, there are many ways of presenting these summaries to the user. A new multimodal presentation method has been realized by the SUVI tool (Summary Visualizer). Based on actual audio and video data, it generates a storyboard layout or displays the meeting in the style of a newspaper. From a functional point of view, the constraint-based layout system SUVI is a self-contained, generic, and parameterizable layout generator for multimodal meeting summaries. Speaker utterances are used to fill the layout objects, namely text boxes and balloons. All of this data is analyzed and appropriately represented by an input reader. Based on that representation, a constraint solver derives the necessary constraint variables, which are stored within the corresponding layout objects. The layout objects used in the storyboard implementation of SUVI are balloons, panels, and text boxes. The constraint solver comprises the mechanisms needed to process the layout knowledge by first producing and then solving appropriately defined general constraints for all layout objects. Finally, the layout manager component creates a corresponding layout representation in XML format from the instantiated layout objects. The accompanying figure shows an example of such a meeting summary. The system was tested on parts of a real project review meeting. The review meeting was processed entirely by automatic tools from the University of Sheffield (ASR), the University of Edinburgh (topic segmentation and labeling), and the IDIAP Research Institute in Martigny (recordings).
Generating Introspection-Based Meta-Utterances
The appeal of being able to ask a question to a multimodal Internet terminal and receive a multimedia answer immediately has been renewed by the broad availability of information on the Web. Ideally, a multimodal interaction system should use the Web as its knowledge base and combine this with domain-specific multimedia information to answer a broad range of user questions.
Adaptive information retrieval in the spoken dialog context seeks to increase the usability of such systems by increasing trust in the presented multimodal answers. A second goal is to initiate a system reaction when a trustworthy answer can be expected from a particular heterogeneous information source. Both goals are challenging because, especially in the information and knowledge retrieval context, information sources may change their quality characteristics, e.g., accessibility, response time, and reliability.
Therefore, we investigated an introspective view on the processing workflow: machine learning methods update the reasoning process for dialog decisions.
One of the new methodologies investigated during the reporting period is the ability of the dialog system to learn how to predict the probability that it can answer a complex user query in a given time interval. A distinguishing feature of our approach is that introspective models can be extracted and applied automatically. This paves the way for question answering systems that adapt automatically to new situations.
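As a rough illustration of such a prediction, the sketch below estimates, per information source, the probability that a query is answered within a deadline from previously observed response times. It is a simple smoothed frequency estimate and a stand-in for the machine learning methods mentioned above, not the project's introspective model. The class `AnswerTimePredictor` and its methods are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


class AnswerTimePredictor:
    """Smoothed frequency estimate of P(answer within deadline) per source."""

    def __init__(self) -> None:
        # Observed response times in seconds; failed queries are stored as infinity.
        self._times: Dict[str, List[float]] = defaultdict(list)

    def observe(self, source: str, seconds: float, answered: bool) -> None:
        """Record the outcome of one query sent to the given source."""
        self._times[source].append(seconds if answered else math.inf)

    def probability(self, source: str, deadline: float) -> float:
        """Estimate the chance of an answer within `deadline` seconds.

        Add-one (Laplace) smoothing keeps the estimate at 0.5 for unseen sources.
        """
        times = self._times.get(source, [])
        hits = sum(1 for t in times if t <= deadline)
        return (hits + 1) / (len(times) + 2)
```

A dialog manager could compare this estimate with a threshold to decide whether to query the source silently or to produce a meta-utterance first, for example one announcing that the answer may take a while. Because every new observation changes the estimate, the prediction follows changes in a source's response time and reliability.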