Published November 28, 2014 | Version 1
Project deliverable | Open Access

TWO!EARS Deliverable D3.2 - Software Architecture, progress (WP3: Feature extraction, object formation & meaning assignment; FP7-ICT-2013-C TWO!EARS FET-Open Project 618075)

  • 1. Speech and Hearing Research Group, University of Sheffield, UK
  • 2. Neural Information Processing Group, Technische Universität Berlin, Germany
  • 3. Institute of Communication Acoustics, Ruhr-University Bochum, Germany
  • 4. Audiovisual Technology Group, Technische Universität Ilmenau, Germany
  • 5. Hearing Systems Group, Technical University of Denmark, Copenhagen, Denmark

Description

The goal of the Two!Ears project is to develop an intelligent, active computational model of auditory perception and experience in a multi-modal context. At the heart of the project is a software architecture that optimally fuses prior knowledge with the currently available sensor input, in order to find the best explanation of all available information. Top-down feedback plays a crucial role in this process. The software architecture will be implemented on a mobile robot endowed with a binaural head and stereo cameras, allowing for active exploration and understanding of audiovisual scenes. Our approach recasts a conventional “blackboard system” in a modern machine learning framework. In the blackboard system, knowledge sources cooperate to solve a problem; in this case, the problem is to identify the acoustic sources that are present in the environment and ascribe meaning to them.
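
As a rough illustration of how knowledge sources cooperate on a shared blackboard, the following Python sketch shows a minimal scheduler that fires any knowledge source whose preconditions are satisfied. All class names, cue values and the ITD-to-azimuth mapping are illustrative placeholders under assumed conventions; they do not represent the actual Two!Ears implementation.

    # Minimal sketch of a blackboard architecture: knowledge sources read
    # hypotheses from a shared store and write new ones back to it.

    class Blackboard:
        """Shared data store that knowledge sources read from and write to."""
        def __init__(self):
            self.data = {}                      # hypothesis label -> value

        def add(self, label, value):
            self.data[label] = value

        def has(self, label):
            return label in self.data


    class KnowledgeSource:
        """A knowledge source fires when its preconditions on the blackboard hold."""
        def can_execute(self, bb):
            raise NotImplementedError

        def execute(self, bb):
            raise NotImplementedError


    class LocalisationKS(KnowledgeSource):
        def can_execute(self, bb):
            return bb.has('binaural_cues') and not bb.has('source_azimuth')

        def execute(self, bb):
            itd, _ild = bb.data['binaural_cues']
            # Toy mapping from interaural time difference to azimuth (illustrative only)
            bb.add('source_azimuth', itd * 1e5)


    class IdentificationKS(KnowledgeSource):
        def can_execute(self, bb):
            return bb.has('source_azimuth') and not bb.has('source_label')

        def execute(self, bb):
            bb.add('source_label', 'female voice')   # placeholder classifier output


    def run(blackboard, sources):
        """Simple scheduler: keep firing knowledge sources until none can execute."""
        fired = True
        while fired:
            fired = False
            for ks in sources:
                if ks.can_execute(blackboard):
                    ks.execute(blackboard)
                    fired = True


    bb = Blackboard()
    bb.add('binaural_cues', (0.0003, 4.2))        # (ITD in s, ILD in dB) - toy values
    run(bb, [LocalisationKS(), IdentificationKS()])
    print(bb.data)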

This deliverable documents our progress on the design and development of the Two!Ears software architecture. We have developed a blackboard system with three layers. In the first layer, the acoustic input is pre-segmented using information about pitch, onsets/offsets, interaural coherence and amplitude modulation. This allows sound sources of interest to be separated from the acoustic background. Visual information gathered from cameras (or initially from a 3D simulation) is also segmented. Since sound sources of interest in the environment will not be static, we have also developed a framework for nonlinear tracking of sound sources that models their underlying motion dynamics. In the second layer of the blackboard system, audio-visual events are labelled to indicate their attributes. So far, we have focused on attributes relating to the spatial location of sound events and their source type (e.g., ‘female voice’ or ‘telephone ring’). Correctly labelling these attributes in noisy and reverberant acoustic environments is a challenging problem; we have developed techniques for improving the robustness of spatial location estimation (using multi-condition training and head movements) and of source type classification (using noise-adaptive linear discriminant analysis). We have also developed an i-vector approach to speaker recognition, which gives very promising results. In the third layer of the blackboard system, events are interpreted to derive a meaningful description of the auditory scene. We aim to achieve this within a graphical model framework, based on the open-source software toolkit GMTK. To date, we have demonstrated how sound source localisation can be cast within this framework.
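
The nonlinear tracking stage mentioned above can be illustrated with a generic particle filter that follows the azimuth of a moving source under a constant-velocity motion model. This is only a sketch with assumed parameter values (particle count, noise levels, Gaussian observation model); it is not the tracker developed in the deliverable.

    # Illustrative particle filter for tracking the azimuth of a moving sound source.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 500                                    # number of particles

    # State per particle: [azimuth (deg), azimuth velocity (deg/frame)]
    particles = np.column_stack([rng.uniform(-90, 90, N), rng.normal(0, 1, N)])
    weights = np.full(N, 1.0 / N)

    def predict(particles, q_pos=1.0, q_vel=0.2):
        """Constant-velocity motion model with additive process noise."""
        particles[:, 0] += particles[:, 1] + rng.normal(0, q_pos, len(particles))
        particles[:, 1] += rng.normal(0, q_vel, len(particles))
        return particles

    def update(particles, weights, z, r=5.0):
        """Weight particles by a Gaussian likelihood of the observed azimuth z."""
        weights *= np.exp(-0.5 * ((z - particles[:, 0]) / r) ** 2)
        weights += 1e-300                      # avoid all-zero weights
        weights /= weights.sum()
        return weights

    def resample(particles, weights):
        """Multinomial resampling when the effective sample size drops too low."""
        if 1.0 / np.sum(weights ** 2) < len(weights) / 2:
            idx = rng.choice(len(weights), size=len(weights), p=weights)
            particles = particles[idx]
            weights = np.full(len(weights), 1.0 / len(weights))
        return particles, weights

    # Track a source moving from -30 to +30 degrees with noisy azimuth observations
    true_azimuths = np.linspace(-30, 30, 60)
    for z_true in true_azimuths:
        z = z_true + rng.normal(0, 5.0)        # noisy frame-level localisation estimate
        particles = predict(particles)
        weights = update(particles, weights, z)
        particles, weights = resample(particles, weights)
        estimate = np.average(particles[:, 0], weights=weights)

    print(f"final estimate: {estimate:.1f} deg (true: {true_azimuths[-1]:.1f} deg)")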

Files

D3_2_TWOEARS_Progress_Software_Architecture.pdf (3.0 MB)

Additional details

Funding

TWO!EARS (Grant No. 618075)
European Commission