Published September 23, 2018 | Version v1
Conference paper Open

Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words

Description

We define three retrieval tasks requiring efficient search of the musical content of a collection of ~32k page-images of 16th-century music to find: duplicates; pages with the same musical content; pages of related music. The images are subjected to Optical Music Recognition (OMR), which inevitably introduces errors. To reduce the effect of such errors, we encode pages as strings of diatonic pitch intervals, ignoring rests. We extract indices comprising lists of two kinds of 'word'. Approximate matching is done by counting the number of words a query page has in common with each page in the collection. The two word-types are (a) normal ngrams and (b) minimal absent words (MAWs). The latter have three important properties for our purpose: they can be built and searched in linear time, the number of MAWs generated tends to be smaller, and they preserve the structure and order of the text, obviating the need for expensive sorting operations. We show that retrieval performance of MAWs is comparable with ngrams, but with a marked speed improvement. We also show the effect of word length on retrieval. Our results suggest that an index of MAWs of mixed length provides a good method for these tasks which is scalable to larger collections.
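The matching scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the letter encoding of intervals, the function names, and the input format (integer diatonic step numbers, with None for a rest) are all assumptions made for illustration. In particular, the MAW extraction here is a brute-force enumeration rather than the linear-time construction the description refers to. It relies only on the standard definition: a word is a minimal absent word of a string if it does not occur in the string but its longest proper prefix and longest proper suffix both do.

```python
def diatonic_intervals(steps):
    """Encode diatonic step numbers as a string of interval symbols.

    Assumed (hypothetical) encoding: '-' for a repeated note, 'A', 'B', ...
    for ascending intervals of 1, 2, ... steps, and 'a', 'b', ... for
    descending ones. Rests (None) are dropped before intervals are taken.
    """
    notes = [s for s in steps if s is not None]
    symbols = []
    for prev, cur in zip(notes, notes[1:]):
        d = cur - prev
        if d == 0:
            symbols.append("-")
        else:
            size = min(abs(d), 26)  # clamp very large leaps to one symbol
            base = "A" if d > 0 else "a"
            symbols.append(chr(ord(base) + size - 1))
    return "".join(symbols)


def ngrams(s, n):
    """Return the set of all substrings of s with length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def minimal_absent_words(s, max_len):
    """Brute-force MAWs of s with length between 2 and max_len.

    A candidate w = u + c is built by extending an occurring substring u
    with one letter c, so its longest proper prefix occurs by construction;
    w is kept if w itself is absent and its longest proper suffix occurs.
    """
    alphabet = sorted(set(s))
    # All substrings of s up to max_len characters long.
    factors = set()
    for k in range(1, max_len + 1):
        factors |= ngrams(s, k)
    maws = set()
    for u in factors:
        if len(u) >= max_len:
            continue
        for c in alphabet:
            w = u + c
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


def common_word_count(query_words, page_words):
    """Score a candidate page by the number of words shared with the query."""
    return len(query_words & page_words)


if __name__ == "__main__":
    query = diatonic_intervals([0, 1, None, 3, 3, 1])
    print(query)
    print(sorted(ngrams(query, 2)))
    print(sorted(minimal_absent_words("abaab", 4)))
```

In this sketch a page would be ranked against a query by calling common_word_count on either the ngram sets or the MAW sets of the two encoded strings; the choice of word length (n or max_len) is left as a parameter, since the description reports that word length affects retrieval.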

Files

210_Paper.pdf

md5:41279a5e2a3e734d07006b86f371d7dc
2.2 MB