Published September 21, 2025 | Version v1
Conference paper · Open Access

Joint Object Detection and Sound Source Separation

Description

We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. S2H instead integrates both tasks into a single end-to-end trainable structure built on transformer-based architectures. A naive combination of the two, however, yields suboptimal performance. To resolve this, we propose a dynamic filtering mechanism that selects the object queries from the detector that are relevant to separation. Extensive experiments verify that our approach achieves state-of-the-art audio source separation performance on the MUSIC and MUSIC-21 datasets while maintaining competitive object detection performance. Ablation studies confirm that joint training of detection and separation is mutually beneficial to both tasks.
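The dynamic filtering idea can be sketched roughly as scoring each detector query for relevance and keeping only the top-k to condition the separation branch. The names, shapes, and scoring function below are illustrative assumptions, not the paper's actual API or architecture:

```python
# Hypothetical sketch of dynamic query filtering: score each object
# query from a DETR-style detector with a learned relevance projection
# and keep the top-k queries for the separation branch.
# All names and dimensions here are assumptions for illustration.
import numpy as np

def filter_object_queries(queries, relevance_w, k=4):
    """Select the k most relevant object queries.

    queries:     (num_queries, dim) detector query embeddings
    relevance_w: (dim,) assumed learned relevance projection
    Returns the top-k queries and their scores, highest first.
    """
    scores = queries @ relevance_w            # (num_queries,) relevance scores
    top_k = np.argsort(scores)[::-1][:k]      # indices of the k largest scores
    return queries[top_k], scores[top_k]

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 256))         # e.g. 100 object queries
w = rng.normal(size=256)
selected, selected_scores = filter_object_queries(queries, w, k=4)
print(selected.shape)  # (4, 256)
```

In an end-to-end model the relevance projection would be trained jointly with the separation loss, so the filter learns which detected objects are likely sound sources; the hard top-k selection here stands in for whatever differentiable selection the paper actually uses.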

Files

000095.pdf (1.2 MB)
md5:7cad165177636179284cc407d8cac68a