Published June 7, 2022 | Version v1
Conference paper | Open Access

Transformer and LSTM Models for Automatic Counterpoint Generation using Raw Audio

  • University of Oslo

Description

This study investigated Transformer and LSTM models applied to raw audio for the automatic generation of counterpoint. The dataset was a collection of raw audio waveforms of various pieces by Bach, played on different instruments. Each piece comprised four voices, and the aim was for the models to predict a missing voice from any subset of the remaining three. The research demonstrated the efficacy and behaviour of two deep learning (DL) architectures, the LSTM and the Transformer, when applied to raw audio data, which is typically characterised by much longer sequences than symbolic music representations such as MIDI. To date, the LSTM has been the quintessential DL model for sequence-based tasks such as generative audio modelling, but the research conducted in this study shows that the Transformer can achieve competitive results. The mean squared (MSE) and mean absolute (MAE) errors were as follows:

- Transformer: MSE = (1.0404 ± 0.003)×10⁻⁵, MAE = (7.6733 ± 0.2410)×10⁻⁴
- LSTM: MSE = (1.0388 ± 0.004)×10⁻⁵, MAE = (7.9989 ± 0.5274)×10⁻⁴
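To make the task setup and the reported metrics concrete, the following is a minimal sketch in NumPy. The arrays, sample rate, and the stand-in "prediction" are all hypothetical illustrations, not the paper's data or models: one of four voice waveforms is held out as the target, the remaining three form the model's context, and MSE/MAE are computed per sample over the predicted waveform.

```python
import numpy as np

# Hypothetical illustration of the paper's setup: four raw-audio voices,
# one voice is held out and a model must predict it from the other three.
rng = np.random.default_rng(0)
sr = 16000                                # assumed sample rate (illustrative)
voices = rng.standard_normal((4, sr))     # four one-second voice waveforms

target = voices[3]                        # the missing voice to predict
context = voices[:3].sum(axis=0)          # mix of the remaining three voices

# Stand-in "prediction": the target plus small noise, mimicking a good model.
prediction = target + 1e-3 * rng.standard_normal(sr)

# Per-sample error metrics, as reported in the study.
mse = np.mean((prediction - target) ** 2)
mae = np.mean(np.abs(prediction - target))
print(f"MSE={mse:.3e}  MAE={mae:.3e}")
```

With Gaussian noise of standard deviation 1e-3, the resulting MSE is on the order of 10⁻⁶ and the MAE on the order of 10⁻³, which gives a feel for how small the errors reported above are relative to waveforms of unit scale.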

Both models achieved excellent performance, with very small MSE and MAE values. The LSTM model yielded a slightly smaller MSE on the test set, while the Transformer performed better with regard to MAE. Nevertheless, given the very small differences between the two, it was difficult to single out a better model. Spectral plots of the targets and predictions were also inspected, and the corresponding audio files auditioned, for a couple of randomly selected test samples and one out-of-distribution sample. These showed that the models could in fact generate excellent predictions that were difficult to distinguish from the target samples, even for a musical piece not taken from the original dataset. Overall, we propose a novel application of the Transformer model to automatic counterpoint generation, which achieved results on par with the current state of the art, represented by the LSTM model. Furthermore, the study investigates the models' respective prediction capabilities and proposes new areas of research considered particularly interesting, such as analysing attention weights to improve human-computer interaction in musical systems. We demonstrated the competitiveness of a different deep learning model, compared against recurrent architectures, for raw audio modelling. Having a plethora of models to choose from for a particular application is desirable, as certain features of particular architectures may be advantageous for different research problems.

Files

15.pdf (906.2 kB), md5:9c4d8048f9a653251c4006cb83a9ca9c