ADTOF: A large dataset of non-synthetic music for automatic drum transcription

The state-of-the-art methods for drum transcription in the presence of melodic instruments (DTM) are machine learning models trained in a supervised manner, which means that they rely on labeled datasets. The problem is that the available public datasets are limited either in size or in realism, and are thus suboptimal for training purposes. Indeed, the best results are currently obtained via a rather convoluted multi-step training process that involves both real and synthetic datasets. To address this issue, starting from the observation that the communities of rhythm game players provide a large amount of annotated data, we curated a new dataset of crowdsourced drum transcriptions. This dataset contains real-world music, is manually annotated, and is about two orders of magnitude larger than any other non-synthetic dataset, making it a prime candidate for training purposes. However, due to crowdsourcing, the initial annotations contain mistakes. We discuss how the quality of the dataset can be improved by automatically correcting different types of mistakes. When used to train a popular DTM model, the dataset yields a performance that matches the state of the art, thus demonstrating the quality of the annotations.


INTRODUCTION
Automatic drum transcription (ADT) consists of creating a symbolic representation of the notes played by the drums in a music piece. Such transcriptions would benefit both musicians, for example when learning a musical piece, and music information retrieval (MIR), which can leverage the location of the notes to gain a deeper understanding of a music track (e.g., its structure). ADT is known to be difficult to achieve: as explained in a recent state-of-the-art review by Wu et al. [1], it is tackled in several manners and at different levels of complexity. The most basic task is the automatic classification of isolated drum sounds. Here, we are interested in solving the more general and complex task of drum transcription in the presence of melodic instruments (DTM). In DTM, the input consists of polyphonic music (drums and accompanying instruments); the output is a log with a time stamp and an instrument label for each drum note. As Wu et al. argued, much progress has been made recently in ADT (and, therefore, in DTM) thanks to deep learning approaches. However, a high volume of annotated data is needed for neural networks to perform well, and such data is difficult to obtain, mainly because the annotation process is labor-intensive. This explains why the publicly available datasets are usually either small (e.g., [2-4]) or consist of augmented data (e.g., [5, 6]) or synthesized audio (e.g., [7, 8]), the latter two being not direct representations but only approximations of real music. Therefore, current datasets seem to be suboptimal for DTM in terms of either quantity or authenticity.
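To make the expected output concrete, the following minimal sketch (in Python; the type and field names are ours, not taken from any existing ADT library) shows one possible representation of such a log:

```python
from dataclasses import dataclass

@dataclass
class DrumEvent:
    """One transcribed drum note: when it occurs and which instrument plays it."""
    time: float  # onset time in seconds
    label: str   # instrument class, e.g. "KD" (kick), "SD" (snare), "HH" (hi-hat)

# A DTM transcription is then simply a time-ordered list of events:
transcription = [
    DrumEvent(time=0.00, label="KD"),
    DrumEvent(time=0.50, label="HH"),
    DrumEvent(time=1.00, label="SD"),
]
```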
In this work, we explore a way to reach both the required quantity and realism of the data needed for DTM by using crowdsourced annotations of a high volume of real-world (not synthetically created) audio tracks. In fact, we realized that this amount of data can be found in rhythm games such as RockBand or PhaseShift. In these games, one of the goals is to correctly play the drum line of a song on a toy drum kit. Songs come with the game, but players can also add audio tracks and their own annotations of the drum parts. Because of this feature, a large online community of players and musicians emerged to extend the catalog of playable tracks and share custom game files, also known as "custom charts". These files have the advantage of being fundamentally similar to the content of current ADT datasets: they contain the audio track along with a representation of the notes played on the drums.
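For illustration, a custom chart typically bundles the audio with a MIDI file encoding the notes. The sketch below, using the mido library, extracts drum onsets from such a file; the file name notes.mid, the track name PART DRUMS, and the fixed tempo are assumptions made for the sake of the example (real charts carry a tempo map and game-specific pitch conventions):

```python
import mido  # third-party MIDI parsing library

def read_drum_onsets(chart_path: str, tempo: int = 500_000) -> list[tuple[float, int]]:
    """Extract (onset_in_seconds, midi_pitch) pairs from a chart's drum track.

    Simplified: assumes one fixed tempo (500,000 us per beat = 120 bpm),
    whereas real charts contain a tempo map that must be followed.
    """
    midi_file = mido.MidiFile(chart_path)  # e.g. the chart's "notes.mid"
    onsets = []
    for track in midi_file.tracks:
        if track.name != "PART DRUMS":  # drum track name used by several games
            continue
        ticks = 0
        for message in track:
            ticks += message.time  # delta time in ticks
            if message.type == "note_on" and message.velocity > 0:
                seconds = mido.tick2second(ticks, midi_file.ticks_per_beat, tempo)
                onsets.append((seconds, message.note))
    return sorted(onsets)
```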
The outcome of our work is a new methodology to build a dataset from custom charts. Following our method, we built a dataset named Automatic Drums Transcription On Fire (ADTOF) that is composed of a large amount of realistic data. As mentioned above, the large amount is achieved by crowdsourcing annotations from a much larger group of people than previously observed, to our knowledge. Realism is due, instead, to the use of real-world, as opposed to augmented or synthesized, music tracks. Yet, quantity and realism are useful only if the annotations contain as few mistakes as possible.

[Table 1 appears here. Recoverable excerpt: Dataset | Hours | Classes | Real music; ENST [2] | 1.02 | 20 | √.]
In order to ensure sufficient quality, we curated the data from the online source in a systematic way, selecting the tracks that are least likely to contain wrong annotations. In fact, while manually assessing the annotations, we found many discrepancies between the locations of the annotations and the positions of the actual sound onsets. To overcome this issue, we adapted the automatic alignment technique described in the work of Driedger et al. [9] to correct the time precision of the annotations (a toy version of this idea is sketched at the end of this section). We also found many inconsistencies in the labels used to designate specific instruments of the drum kit, which we solved by reducing the set of instrument classes to be detected. Finally, in order to assess how useful the annotations are after being processed, we evaluated our new dataset as both training and test data for the popular convolutional recurrent neural network (CRNN) illustrated in the work of Vogl et al. [8]. The result is that ADTOF allows for the direct training of a model that achieves performance comparable to the state-of-the-art model trained on multiple other datasets. It also provides complementary information and generalization capability.

The rest of this article is organized as follows: Section 2 contains a survey of related works. The data with the annotation and curation process are presented in Section 3, and the methodology to automatically clean them is detailed in Section 4. In Section 5, experiments on training and testing are presented. The results are then discussed in Section 6. Conclusions are drawn and future work is described in Section 7.
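As announced above, here is a toy version of the alignment correction. It is not the method of Driedger et al. [9] itself, only a simplified stand-in: instead of a full alignment, it estimates a single global offset by cross-correlating an impulse train built from the annotated onset times with the onset-strength envelope of the audio (librosa is assumed for the audio analysis):

```python
import numpy as np
import librosa

def estimate_annotation_offset(audio_path, annotation_times, max_shift=0.5):
    """Return a global offset (seconds) to add to the annotation times.

    Simplification of annotation/audio alignment: searches, within
    +/- max_shift seconds, for the shift that best matches the annotated
    onsets to the peaks of the audio's onset-strength envelope.
    """
    y, sr = librosa.load(audio_path, sr=None)
    hop = 512
    envelope = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    # Build an impulse train on the same frame grid as the envelope.
    train = np.zeros_like(envelope)
    frames = librosa.time_to_frames(annotation_times, sr=sr, hop_length=hop)
    train[frames[(frames >= 0) & (frames < len(envelope))]] = 1.0
    # Exhaustive search over integer frame shifts (np.roll wraps around,
    # which is harmless here because the shifts are small).
    max_lag = int(max_shift * sr / hop)
    lags = range(-max_lag, max_lag + 1)
    scores = [np.dot(np.roll(train, lag), envelope) for lag in lags]
    best_lag = lags[int(np.argmax(scores))]
    return best_lag * hop / sr
```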

RELATED WORK
Multiple datasets with different characteristics have been created to address specific aspects of ADT. Since we deal with DTM, though, in this section we discuss only datasets containing polyphonic music (see Table 1).
To our knowledge, the oldest public dataset suitable for DTM is ENST [2], created in 2006. This dataset contains the recordings of three professional drummers playing along with a variety of musical accompaniments composed for drum kit practice. More recently, in 2017, MDB Drums [3] was created by adding drum transcriptions to 23 tracks from MedleyDB [10], and RBMA [4] was released, with annotations, on the freely available album "Various Assets - Not For Sale: Red Bull Music Academy New York 2013". These datasets are standard in DTM, but they are limited in several ways. First, because of the difficulties inherent in annotating music, these datasets are small, with a cumulative time just above three hours. Second, the number of occurrences of each instrument in a drum kit is generally unbalanced, with some instruments (e.g., crash cymbal, ride cymbal) appearing much less often than others (e.g., bass drum, snare drum). Lastly, in these datasets, data diversity is limited (e.g., ENST contains audio from a limited number of drum kits, RBMA is biased toward a few music genres). As a consequence, the majority of DTM research [4, 5, 11-13] narrows down to the identification of three main drum classes: kick drum, snare drum, and hi-hat.
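For illustration, such a reduction can be expressed as a lookup table from General MIDI percussion pitches to the three classes; the exact pitches that are merged or discarded vary between studies, so the mapping below is only a plausible example:

```python
# Hypothetical reduction of General MIDI percussion pitches to three classes;
# pitches that are not listed (toms, cymbals, ...) are simply discarded.
THREE_CLASS_MAP = {
    35: "KD", 36: "KD",            # acoustic / electric bass drum
    38: "SD", 40: "SD",            # acoustic / electric snare
    42: "HH", 44: "HH", 46: "HH",  # closed / pedal / open hi-hat
}

def reduce_labels(events):
    """Map (time, pitch) events to (time, class) and drop unmapped pitches."""
    return [(time, THREE_CLASS_MAP[pitch])
            for time, pitch in events if pitch in THREE_CLASS_MAP]
```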
In an effort to increase the amount of manually annotated data, data augmentation was employed by Vogl et al. [5] and, more recently, by Jacques and Röbel [6]. In these studies, data augmentation techniques (e.g., pitch-shifting or time-stretching the audio) generally improved the performance of the models trained on the augmented data. However, according to some of the authors in a later work [8, p. 4], this improvement is limited.
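As an illustration of these two augmentations, the sketch below applies them with librosa; the parameter values are arbitrary examples, not the ones used in [5] or [6]:

```python
import librosa

def augment(y, sr, semitones=1.0, rate=1.05):
    """Pitch-shift then time-stretch one audio excerpt.

    Note: time-stretching by `rate` shortens the audio, so the annotated
    onset times must be divided by `rate` to stay aligned.
    """
    y = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=semitones)
    y = librosa.effects.time_stretch(y=y, rate=rate)
    return y
```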
Another approach taken to counter the paucity of data is the generation of synthetic datasets, which consist of synthesized audio generated from a symbolic representation of music (i.e., MIDI files). This technique makes it possible to create larger datasets because it removes the labor needed to annotate the audio tracks, since the ground truth is deduced from the generation process. Moreover, audio synthesis gives the flexibility to balance the instrument distribution by artificially replacing more common drum classes with rarer ones.
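A minimal sketch of this idea follows, using the pretty_midi bindings to FluidSynth (the SoundFont path is a placeholder; per-class balancing would additionally rewrite the MIDI notes before rendering):

```python
import pretty_midi

def synthesize_with_ground_truth(midi_path, soundfont_path, fs=44100):
    """Render a MIDI file to audio and read the labels off the MIDI notes.

    Because the audio is generated from the symbolic data, the ground-truth
    annotations (onset time, drum pitch) come for free.
    """
    pm = pretty_midi.PrettyMIDI(midi_path)
    audio = pm.fluidsynth(fs=fs, sf2_path=soundfont_path)  # needs fluidsynth
    labels = [(note.start, note.pitch)
              for instrument in pm.instruments if instrument.is_drum
              for note in instrument.notes]
    return audio, sorted(labels)
```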
Following this approach, Cartwright and Bello proposed in 2018 the Synthetic Drum Dataset (SDDS) [7], which is multiple orders of magnitude larger than the previous datasets. In their work, the audio was rendered from a collection of MIDI drum loops using randomly selected drum samples, augmented with harmonic background noise and other data augmentation methods. The same year, Vogl et al. created another synthetic dataset, which we refer to as TMIDT [8], by using MIDI files available online to synthesize both the drums and the accompaniment parts in such a way that drum classes would be distributed in a natural and balanced fashion. Both works indicate that models trained on a large synthetic dataset alone do not outperform models trained on small real-world datasets, although TMIDT can still improve performance on some underrepresented classes. Furthermore, Vogl et al. [8, p. 5] raised the concern that the atypical nature of drum patterns that underwent a balancing process could harm the model, and they showed that this technique is ineffective when evaluating on real-world datasets. In conclusion, results improve only when real data is somehow involved: by training with both synthetic and real data [7], or by training first with synthetic data and then refining the outcome with real data [8].
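To make the two training regimes concrete, here is a generic sketch (in PyTorch; the loop, loss, and hyperparameters are placeholders, not the exact setup of [7] or [8]):

```python
import torch
from torch import nn

def train(model, loader, epochs, lr):
    """Generic supervised loop for frame-wise drum activation targets."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for features, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), targets)
            loss.backward()
            optimizer.step()

def pretrain_then_refine(model, synthetic_loader, real_loader):
    """Two-stage schedule in the spirit of [8]: pretrain on synthetic data,
    then refine on real data, typically with a lower learning rate."""
    train(model, synthetic_loader, epochs=10, lr=1e-3)
    train(model, real_loader, epochs=5, lr=1e-4)
```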