Design of multimodal interaction models
A study was conducted on how speech and tangible interfaces can be combined to provide effective multimodal interaction in vehicles, taking into account the special requirements imposed by driving. In this study, speech was used to set the interaction context (that is, to determine the object to be manipulated), and a turn-and-push dial was used to manipulate or adjust that object. The distraction induced by manual (conventional), speech-only, and multimodal interaction (a combination of speech and turn-and-push dial) was measured using the standardized lane change task (LCT). Results show that while subjects were able to perform more tasks in the manual condition, their driving was significantly safer when using the speech-only or multimodal dialog.
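As a rough illustration of how an LCT run can be scored, the sketch below computes the mean absolute lateral deviation between a driven path and a reference path. It assumes that both paths are sampled at the same positions along the track and given in meters; the function name and the sample values are hypothetical, and the exact scoring procedure used in the study is not described here.

```python
from typing import Sequence


def mean_lane_deviation(actual: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute lateral deviation between driven and reference path.

    Both sequences hold lateral positions (assumed to be in meters) sampled
    at the same points along the track. Lower values mean the driver stayed
    closer to the reference path.
    """
    if len(actual) != len(reference):
        raise ValueError("paths must have the same number of samples")
    if not actual:
        raise ValueError("paths must not be empty")
    total = sum(abs(a - r) for a, r in zip(actual, reference))
    return total / len(actual)


if __name__ == "__main__":
    # Hypothetical samples: the driver changes lanes slightly late.
    driven = [0.0, 0.5, 3.0]
    normative = [0.0, 0.0, 3.5]
    print(mean_lane_deviation(driven, normative))
```

Comparing this value across the manual, speech-only, and multimodal conditions would give one number per condition to compare.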
Role-based access to devices, applications, and services
A study was conducted on how communication between passengers can be supported by transmitting the speech (and later also video) of the communication partners back and forth. In particular, it addressed the following question: is listening to noisy speech coming from the back seat really distracting to the driver? Subjects rated the truth of common-sense statements played from the back of the car (clear or noisy) while driving in a driving simulator. The NOISY condition was rated significantly more distracting than the CLEAR condition, while objective driving performance degraded only for men and not for women.
Development of speaker classification methods
We developed a GMM-SVM supervector system for speaker age recognition and conducted an experimental study to evaluate its performance and explore the selection of parameters. Due to changes in the definition of the age classes and the use of particularly short utterances for testing, the underlying task is considered more difficult than the one in previous studies (by the authors and others). Nevertheless, a comparable accuracy was obtained.
Development of models for role-based access to devices, applications and services
We developed a concept for passenger localization and identification that especially takes into account information security and privacy issues. According to this approach, information that is stored on nomadic devices (mobile, personal devices that are brought into the car) is used to decode and locally store personal information.
Each seat is equipped with a microphone, a pressure sensor, and a screen. Voice profiles are generated and pressure is measured at each position. For each seat, icons on the built-in screens indicate whether speech was detected and whether pressure information is available. This sensory information is made available to the users' personal mobile devices, which also contain their user profiles. Passengers who do not regularly use the car can connect to it using their mobile phone, which obtains the current measurements for all seats and performs a comparison locally on the device. Using speaker identification technology and sensor fusion networks, a match likelihood is computed for each seat. If the profile on the device matches some seat's sensor data above a certain threshold, the device informs the user about the successful positioning and asks whether they want to share certain parts of their user profile with the car. Accepting this enables advanced personalization scenarios on the built-in displays, such as resuming a movie that was previously interrupted. For passengers who have traveled with the car before, the positioning takes effect without requiring any further interaction from the user. A personalized greeting indicates that a successful match has been made. For the driver in particular, permitting the device to share personal information with the car allows the car to retrieve schedule data and link it to automotive functions such as the on-board navigation system.
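The on-device matching step can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it replaces the sensor fusion network with a simple weighted average of two similarity scores, and all names (`SeatReading`, `match_likelihood`, `locate_passenger`), the weight of 0.7, and the threshold of 0.8 are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SeatReading:
    # Similarity in [0, 1] between the seat's voice measurement and the
    # voice profile stored on the device; None if no speech was detected.
    voice_score: Optional[float]
    # Similarity in [0, 1] between the seat's pressure measurement and the
    # stored profile; None if no pressure information is available.
    pressure_score: Optional[float]


def match_likelihood(reading: SeatReading, voice_weight: float = 0.7) -> float:
    """Fuse the available cues into one score (weighted average).

    Stand-in for the sensor fusion network; returns 0.0 if no cue is usable.
    """
    cues = []
    if reading.voice_score is not None:
        cues.append((voice_weight, reading.voice_score))
    if reading.pressure_score is not None:
        cues.append((1.0 - voice_weight, reading.pressure_score))
    total_weight = sum(weight for weight, _ in cues)
    if total_weight == 0:
        return 0.0
    return sum(weight * score for weight, score in cues) / total_weight


def locate_passenger(
    readings: Dict[str, SeatReading], threshold: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return (seat, likelihood) for the best seat above threshold, else None."""
    best: Optional[Tuple[str, float]] = None
    for seat, reading in readings.items():
        likelihood = match_likelihood(reading)
        if best is None or likelihood > best[1]:
            best = (seat, likelihood)
    if best is not None and best[1] >= threshold:
        return best
    return None


if __name__ == "__main__":
    readings = {
        "front_right": SeatReading(voice_score=0.9, pressure_score=0.8),
        "rear_left": SeatReading(voice_score=0.2, pressure_score=None),
    }
    print(locate_passenger(readings))
```

Because the comparison runs on the passenger's device, the stored profile never has to leave it; only after a successful match would the device ask whether to share parts of the profile with the car.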
Development of libraries with presentation strategies
The effectiveness of five modality variants (speech, text-only, icon-only, and two combinations of text and icons) for presenting local danger warnings to drivers was experimentally investigated. We focused on suddenly appearing road obstacles in a maximally up-to-date scenario as envisaged in Car2Car communication research. Effectiveness was measured by the minimum time necessary for fully interpreting the content. Results show that text-only requires the most time, while icon-only is perceived fastest. The two combined versions fall in between. The minimum length for speech was determined by the duration of the utterance, which is longer than the perception time of text-only in this case. However, speech could be decoded reliably by nearly all subjects. Results further indicate that a blinking visual cue provided through the peripheral visual channel was able to enhance the salience of visual modalities. Subjective judgments by the subjects furthermore suggest a combined use of visual and auditory modalities.
In a second study, two presentation factors were selected: modality and level of assistance. The modality factor had 4 variants: speech warning, visual and speech warning, visual warning with blinking cue, and visual warning with sound cue. The level of assistance varied: action suggestions (AS) were sometimes given. In accordance with the ISO usability model, 6 measurements were derived to assess the effectiveness and efficiency of the warnings and the drivers' satisfaction. Results indicate that the combination of speech and visual modality leads to the best performance as well as the highest satisfaction. In contrast, purely auditory and purely visual modalities were both insufficient for presenting high-priority warnings. AS generally improved the usability of the warnings, especially when they were accompanied by supporting information so that drivers could validate the suggestions.
Finally, in a third study, we investigated possible behavioral effects. We compared driving performance as well as driver stress, measured with physiological sensors, in the following experimental conditions: NO ASSISTANCE, WORKING SYSTEM, and UNANTICIPATED FAILURE. Results show that driving performance and stress level were severely affected by system failures.
Post-ASR error correction using eye gazes
We conducted another study on correcting speech recognition errors while driving. Following the ASR post-correction paradigm, we treated ASR as a black box. For dictating (and editing) a text message while driving, the black-box scenario is very likely, as car manufacturers are in the process of moving from their own on-board ASR solutions to off-board third-party services. The obvious means of correcting errors that are familiar from personal computers at home are not immediately applicable here, because a number of constraints apply in mobile situations. Moreover, correcting speech recognition errors by adding another speech recognition step risks cascading errors that frustrate the user. We therefore believe that the problem of post-ASR correction is a suitable testbed for the proposed gaze-based interaction concept.
In our experiment, we used eye-gaze tracking in combination with a button on the steering wheel as explicit input. This approach retains the advantages of direct interaction on visual displays while avoiding the drawbacks of a touchscreen. Particular advantages are the freedom of placement for the screen (even out of the user's reach) and the fact that both of the driver's hands remain on the steering wheel. The results showed that this interaction modality is slightly slower and more distracting than a touchscreen, while being significantly faster than automated speech interaction.
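The combination of gaze and button can be illustrated with a small sketch: the gaze position at the moment of the button press selects the recognized word to be corrected. The layout model (`WordBox`), the function names, and the pixel tolerance are assumptions made for illustration and do not describe the experimental software.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordBox:
    """Screen rectangle of one recognized word (hypothetical layout model)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def word_under_gaze(
    words: List[WordBox], gaze_x: float, gaze_y: float, tolerance: float = 10.0
) -> Optional[int]:
    """Index of the word closest to the gaze point, or None if none is near.

    The distance is measured from the gaze point to the word's rectangle
    (zero inside it). The tolerance absorbs some eye-tracker inaccuracy.
    """
    best_index: Optional[int] = None
    best_distance: Optional[float] = None
    for index, word in enumerate(words):
        dx = max(word.x - gaze_x, 0.0, gaze_x - (word.x + word.width))
        dy = max(word.y - gaze_y, 0.0, gaze_y - (word.y + word.height))
        distance = (dx * dx + dy * dy) ** 0.5
        if distance <= tolerance and (best_distance is None or distance < best_distance):
            best_index, best_distance = index, distance
    return best_index


def on_button_press(words: List[WordBox], gaze_x: float, gaze_y: float) -> Optional[str]:
    """Called when the steering-wheel button is pressed: pick the gazed word."""
    index = word_under_gaze(words, gaze_x, gaze_y)
    return None if index is None else words[index].text


if __name__ == "__main__":
    message = [
        WordBox("meet", 0, 0, 50, 20),
        WordBox("at", 60, 0, 20, 20),
        WordBox("nine", 90, 0, 40, 20),
    ]
    print(on_button_press(message, 65, 10))
```

Using the button as the explicit trigger means that merely looking at a word does not select it, so glances at the display do not cause unintended edits.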
Summary of finished and on-going work for the multimodal in-car dialog demonstrator.