Learning to mix with neural audio effects in the waveform domain
The process of transforming a set of recordings into a musical mixture encompasses a number of artistic and technical considerations. This inherent complexity demands a great deal of training and expertise on the part of the audio engineer.
This complexity has also posed a challenge in modeling the task with an assistive or automated system. While many approaches have been investigated, they fail to generalize to the diversity and scale of real-world projects: they cannot adapt to a varying number of sources, capture stylistic elements across genres, or apply the kinds of sophisticated processing used by mix engineers, such as dynamic range compression.
Recent successes in deep learning motivate the application of these methods to advance intelligent music production systems, although the complexity of this task, as well as a lack of data, poses a number of challenges in applying them directly. In this thesis, we address these shortcomings with the design of a domain-inspired model architecture. This architecture aims to facilitate learning to carry out the mixing process by leveraging strong inductive biases through self-supervised pre-training, weight-sharing, and a specialized stereo loss function.
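As a rough illustration of the kind of stereo loss referred to above, one common way to compare stereo signals is over their sum (mid) and difference (side) components rather than the raw left/right channels. The function name and the mean-squared-error formulation below are illustrative assumptions, not the exact loss used in this work.

```python
import numpy as np

def sum_diff_loss(pred, target):
    """Illustrative stereo loss: compare sum (L+R) and difference (L-R) signals.

    pred, target: float arrays of shape (2, num_samples) holding the
    left and right channels. Returns a scalar mean-squared error
    computed over the sum and difference signals.
    """
    sum_p, diff_p = pred[0] + pred[1], pred[0] - pred[1]
    sum_t, diff_t = target[0] + target[1], target[0] - target[1]
    return np.mean((sum_p - sum_t) ** 2) + np.mean((diff_p - diff_t) ** 2)
```

Measuring error on the sum and difference signals penalizes differences in stereo width and panning directly, rather than only per-channel sample error.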
We first investigate waveform-based neural networks for modeling audio effects, and advance the state of the art by demonstrating the ability to jointly model a series connection of audio effects over a dense sampling of their parameters. In the process, we also demonstrate that our model generalizes to the case of modeling an analog dynamic range compressor, surpassing the current state-of-the-art approach.
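Conditioning a waveform model on effect parameters can be achieved in several ways; one common mechanism, shown here purely as a sketch with hypothetical weights and shapes, is feature-wise linear modulation, where the parameter vector is projected to a per-channel scale and shift applied to intermediate activations.

```python
import numpy as np

def film_modulate(features, params, w_gamma, w_beta):
    """Feature-wise linear modulation (FiLM) sketch.

    features: (channels, time) intermediate activations.
    params:   (num_params,) normalized effect parameters
              (e.g. threshold, ratio).
    w_gamma, w_beta: (num_params, channels) projection weights (hypothetical).
    """
    gamma = params @ w_gamma  # per-channel scale, shape (channels,)
    beta = params @ w_beta    # per-channel shift, shape (channels,)
    return gamma[:, None] * features + beta[:, None]

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 128))  # 8 channels, 128 time steps
params = np.array([0.5, 0.25])            # two effect parameters
w_gamma = rng.standard_normal((2, 8))
w_beta = rng.standard_normal((2, 8))
out = film_modulate(features, params, w_gamma, w_beta)
```

In this way a single network can emulate an effect across its full parameter space: changing the parameter vector changes the modulation, and hence the processing applied to the signal.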
We employ our pre-trained model within our framework for learning to mix from unstructured multitrack mix data. We show that our domain-inspired architecture and loss function enable the system to operate on real-world mixing projects, placing no restrictions on the identity or number of input sources. Additionally, our method enables users to adjust the predicted mix configuration, a critical form of user interaction not provided by basic end-to-end approaches. A perceptual evaluation demonstrates that our model, trained directly on waveforms, can produce mixes that exceed the quality of baseline approaches. While effectively controlling all the complex processors in the console remains challenging, we ultimately overcome many of the challenges faced by canonical end-to-end deep learning approaches.
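The key architectural idea summarized above, sharing one set of controller weights across every input track so the system is agnostic to the number of sources, can be sketched as follows. The feature set, the tanh parameterization, and the constant-power pan law are illustrative choices, not the exact design used in this work.

```python
import numpy as np

def predict_gain_pan(track, w, b):
    """Shared controller: map simple track features to (gain, pan) in [-1, 1]."""
    feats = np.array([track.mean(), track.std(), np.abs(track).max()])
    gain, pan = np.tanh(feats @ w + b)
    return gain, pan

def mix(tracks, w, b):
    """Apply the same controller to every track, then sum to stereo.

    tracks: list of mono arrays; the list can have any length,
            so the source count is unrestricted.
    Returns a (2, num_samples) stereo mix.
    """
    out = np.zeros((2, tracks[0].shape[0]))
    for track in tracks:
        gain, pan = predict_gain_pan(track, w, b)
        theta = (pan + 1) * np.pi / 4  # constant-power pan law
        out[0] += (1 + gain) * np.cos(theta) * track
        out[1] += (1 + gain) * np.sin(theta) * track
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 2)) * 0.1
b = np.zeros(2)
two_track = mix([rng.standard_normal(100) for _ in range(2)], w, b)
five_track = mix([rng.standard_normal(100) for _ in range(5)], w, b)
```

Because the predicted gain and pan values are interpretable mix parameters rather than raw audio, a user can inspect and override them before the final mix is rendered, which is the form of interaction described above.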
While the results presented in this work are preliminary, they indicate the potential for the proposed architecture to act as a powerful deep learning based mixing system that learns directly from multitrack mix data. Further investigation of the proposed architecture using larger datasets, along with transforming the controller into a generative model, appears to be a promising direction for advancing intelligent music production beyond current knowledge-based expert systems and classical machine learning approaches for this task.