Published September 21, 2025 | Version v1
Conference paper · Open Access

Joint Object Detection and Sound Source Separation

Description

We propose See2Hear (S2H), a framework that jointly learns audio-visual representations for object detection and sound source separation from videos. Existing methods do not fully exploit the synergy between the detection and separation tasks, often relying on disjointly pre-trained visual encoders. S2H instead integrates both tasks into a single end-to-end trainable structure built on transformer-based architectures. A naive combination of the two, however, yields suboptimal performance. To resolve this, we propose a dynamic filtering mechanism that selects the object queries from the detector that are relevant to separation. Extensive experiments verify that our approach achieves state-of-the-art audio source separation performance on the MUSIC and MUSIC-21 datasets while maintaining competitive object detection performance. Ablation studies confirm that joint training of detection and separation is mutually beneficial to both tasks.
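The dynamic filtering idea can be sketched roughly as scoring each detector query for relevance and keeping only the top-k to condition the separation branch. The names, shapes, and scoring function below are illustrative assumptions, not the paper's actual API or architecture:

```python
# Hypothetical sketch of dynamic query filtering: score each object
# query from a DETR-style detector with a learned relevance projection
# and keep the top-k queries for the separation branch.
# All names and dimensions here are assumptions for illustration.
import numpy as np

def filter_object_queries(queries, relevance_w, k=4):
    """Select the k most relevant object queries.

    queries:     (num_queries, dim) detector query embeddings
    relevance_w: (dim,) assumed learned relevance projection
    Returns the top-k queries and their scores, highest first.
    """
    scores = queries @ relevance_w            # (num_queries,) relevance scores
    top_k = np.argsort(scores)[::-1][:k]      # indices of the k largest scores
    return queries[top_k], scores[top_k]

rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 256))         # e.g. 100 object queries
w = rng.normal(size=256)
selected, selected_scores = filter_object_queries(queries, w, k=4)
print(selected.shape)  # (4, 256)
```

In an end-to-end model the relevance projection would be trained jointly with the separation loss, so the filter learns which detected objects are likely sound sources; the hard top-k selection here stands in for whatever differentiable selection the paper actually uses.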

Files

000095.pdf (1.2 MB)
md5:7cad165177636179284cc407d8cac68a