Conference paper | Open Access
Using weakly aligned score–audio pairs to train deep chroma models for cross-modal music retrieval

Frank Zalkow; Meinard Müller

DOI: 10.5281/zenodo.4245400 (version of 10.5281/zenodo.4245399)
Published: October 11, 2020 | Zenodo | ISMIR community (https://zenodo.org/communities/ismir)
License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Abstract: Many music information retrieval tasks involve the comparison of a symbolic score representation with an audio recording. A typical strategy is to compare score–audio pairs based on a common mid-level representation, such as chroma features. Several recent studies have demonstrated the effectiveness of deep learning models that learn task-specific mid-level representations from temporally aligned training pairs. In practice, however, strongly aligned training data is often lacking, in particular for real-world scenarios. In our study, we use weakly aligned score–audio pairs for training, where only the beginning and end of a score excerpt are annotated in an audio recording, without aligned correspondences in between. To exploit such weakly aligned data, we employ the Connectionist Temporal Classification (CTC) loss to train a deep learning model for computing an enhanced chroma representation. We then apply this model to a cross-modal retrieval task, where we aim to find relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. We present systematic experiments that show the effectiveness of the CTC-based model for this theme-based retrieval task.
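The central idea described in the abstract, training a frame-wise chroma model with the CTC loss so that only the order of the pitch classes in a score excerpt (not their timing) needs to be known, can be illustrated in a few lines. The sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the network architecture (`ChromaNet`), the input feature dimensions, and the dummy data are all assumptions made for the example.

```python
# Minimal sketch of CTC-based training with a weakly aligned score-audio
# pair. Hypothetical example; not the code or architecture from the paper.
import torch
import torch.nn as nn

NUM_CLASSES = 13  # 12 pitch classes plus one CTC blank symbol (index 0)

class ChromaNet(nn.Module):
    """Hypothetical model mapping a time-frequency representation of the
    audio to frame-wise log-probabilities over pitch classes and blank."""
    def __init__(self, num_bins: int = 120):
        super().__init__()
        self.rnn = nn.GRU(num_bins, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, NUM_CLASSES)

    def forward(self, x):  # x: (batch, frames, num_bins)
        h, _ = self.rnn(x)
        return self.fc(h).log_softmax(dim=-1)  # (batch, frames, classes)

model = ChromaNet()
ctc = nn.CTCLoss(blank=0)

# A weakly aligned training pair: an audio excerpt (400 dummy frames) and
# the pitch-class sequence of the corresponding score excerpt (labels in
# 1..12), with no frame-level correspondence between the two.
audio = torch.randn(1, 400, 120)
theme = torch.tensor([[1, 5, 8, 1, 6]])

# CTC expects log-probabilities of shape (frames, batch, classes) and
# marginalizes over all monotonic alignments of the label sequence, which
# is what makes weakly aligned (start/end only) annotations sufficient.
log_probs = model(audio).transpose(0, 1)
loss = ctc(log_probs, theme,
           input_lengths=torch.tensor([400]),
           target_lengths=torch.tensor([5]))
loss.backward()
```

At retrieval time, the frame-wise class probabilities (excluding the blank) can be read as an enhanced chroma representation and compared against a score-derived chroma sequence of the query theme, for instance via subsequence matching; that step is beyond this sketch.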
|  | All versions | This version |
| --- | --- | --- |
| Views | 128 | 128 |
| Downloads | 51 | 51 |
| Data volume | 42.4 MB | 42.4 MB |
| Unique views | 115 | 115 |
| Unique downloads | 45 | 45 |