Published October 27, 2017 | Version v1
Conference paper | Open Access

Correspondence between audio and visual deep models for musical instrument detection in video recordings

  • 1. Universitat Pompeu Fabra

Description

This work investigates cross-modal connections between audio and video sources for the task of musical instrument recognition. We also address the interpretability of the representations learned by convolutional neural networks (CNNs), studying feature correspondence between the audio and visual components of a multimodal CNN architecture. For each instrument category, we select the most activated neurons and examine cross-correlations between neurons from the audio CNN and the video CNN that activate for the same instrument category. We analyse two training schemes for multimodal applications and perform a comparative analysis and visualisation of model predictions.
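The neuron-correspondence analysis described above can be illustrated with a minimal sketch. The code below is not the authors' implementation; the names and shapes (audio_acts, video_acts, top_k) are illustrative assumptions. It selects the most activated neurons per modality for one instrument category and computes Pearson cross-correlations between them over paired examples.

    import numpy as np

    def top_activated_neurons(acts: np.ndarray, k: int) -> np.ndarray:
        """Indices of the k neurons with the highest mean activation.

        acts: (n_examples, n_neurons) activations for one instrument category.
        """
        return np.argsort(acts.mean(axis=0))[::-1][:k]

    def cross_modal_correlation(audio_acts: np.ndarray,
                                video_acts: np.ndarray,
                                k: int = 10) -> np.ndarray:
        """Pearson correlations between top-k audio and top-k video neurons.

        Both activation matrices are indexed by the same paired examples
        (video frames with their audio excerpts), so correlations are taken
        over examples. Returns a (k, k) matrix: rows are audio neurons,
        columns are video neurons.
        """
        a_idx = top_activated_neurons(audio_acts, k)
        v_idx = top_activated_neurons(video_acts, k)
        a = audio_acts[:, a_idx]   # (n_examples, k)
        v = video_acts[:, v_idx]   # (n_examples, k)
        # np.corrcoef treats rows as variables, so stack transposed
        # activations; the off-diagonal block is the audio-vs-video part.
        full = np.corrcoef(np.vstack([a.T, v.T]))
        return full[:k, k:]

    # Illustrative usage with random values standing in for real CNN outputs.
    rng = np.random.default_rng(0)
    audio_acts = rng.standard_normal((200, 512))  # e.g. audio-branch layer
    video_acts = rng.standard_normal((200, 512))  # e.g. video-branch layer
    corr = cross_modal_correlation(audio_acts, video_acts, k=10)
    print(corr.shape)  # (10, 10)

In such an analysis, strongly correlated audio and video neurons for the same instrument category would suggest a cross-modal correspondence in the learned representations.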


This work is supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).

Files

ISMIR2017SlizovskaiaLBD.pdf (305.5 kB)
md5:38d31a689ed0a33010c1ba7ab841dbbe