Record ID: 4245400
DOI: 10.5281/zenodo.4245400
OAI identifier: oai:zenodo.org:4245400
Community: user-ismir

Title: Using weakly aligned score–audio pairs to train deep chroma models for cross-modal music retrieval
Authors: Frank Zalkow; Meinard Müller
Resource type: info:eu-repo/semantics/conferencePaper
Publisher: ISMIR
Publication date: 2020-10-11
Published in: Proceedings of the 21st International Society for Music Information Retrieval Conference, pp. 184-191, Montreal, Canada, 2020-10-11

Access rights: info:eu-repo/semantics/openAccess
License: Creative Commons Attribution 4.0 International (cc-by-4.0, SPDX)
License URL: https://creativecommons.org/licenses/by/4.0/legalcode

Abstract:
Many music information retrieval tasks involve the comparison of a symbolic score representation with an audio recording. A typical strategy is to compare score–audio pairs based on a common mid-level representation, such as chroma features. Several recent studies demonstrated the effectiveness of deep learning models that learn task-specific mid-level representations from temporally aligned training pairs. However, in practice, there is often a lack of strongly aligned training data, in particular for real-world scenarios. In our study, we use weakly aligned score–audio pairs for training, where only the beginning and end of a score excerpt are annotated in an audio recording, without aligned correspondences in between. To exploit such weakly aligned data, we employ the Connectionist Temporal Classification (CTC) loss to train a deep learning model for computing an enhanced chroma representation. We then apply this model to a cross-modal retrieval task, where we aim to find relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. We present systematic experiments that show the effectiveness of the CTC-based model for this theme-based retrieval task.

File: https://zenodo.org/records/4245400/files/23.pdf
File size: 831590
Checksum: md5:2cbe5c2a931796e4c4ec43019e66e339
File access: open
Related identifier: 10.5281/zenodo.4245399 (isVersionOf, doi)
Record timestamp: 20201106002702.0