Interactive Control of Explicit Musical Features in Generative LSTM-based Systems

Long Short-Term Memory (LSTM) neural networks have been effectively applied to learning and generating musical sequences, powered by sophisticated musical representations and integration with other deep learning models. Deep neural networks, including LSTM-based systems, learn implicitly: given a sufficiently large amount of data, they transform information into high-level features that, however, do not correspond to the high-level features perceived by humans. For instance, such models can compose music in the style of the Bach chorales, but they cannot compose a less rhythmically dense version of them, or a Bach chorale that begins with low and ends with high pitches -- even less so interactively and in real time. This paper presents an approach to creating such systems. A very basic LSTM-based architecture is developed that can compose music corresponding to user-provided values of rhythm density and pitch height/register. A small initial dataset is augmented to incorporate more intense variations of these two features, and the system learns and generates music that not only reflects the style but also (and most importantly) reflects the features that are explicitly given as input at each specific time. This system -- and future versions that will incorporate more advanced architectures and representations -- is suitable for generating music whose features are defined in real time and/or interactively.


INTRODUCTION
Artificial Intelligence (AI) and Machine Learning (ML) methods have been extensively explored for generating music. Interesting results have been produced so far in the academic context, and a few attempts with debatable success have been made to incorporate such methods into real-world applications -- either for the production of AI music or for AI-based tools that assist/collaborate with musicians. Such methods can nonetheless be useful in real-world applications: interacting with AI and ML systems that generate music can be both recreational and, especially, educational for people who are not familiar with playing a musical instrument and are willing to explore musical and/or timbral attributes.
Interacting with audio and music generation systems has been examined from many angles: from audio generation systems that help users generate audio mosaics [13] or automatically construct audio content based on target timbres given by the user [21] using adaptive concatenative sound synthesis, to symbolic music systems, which are the focus of the paper at hand; for instance, [19] explored the idea of transforming initial musical "seeds" according to desired musical attributes. However, explicit control of many musical parameters is cumbersome and frustrating, since the generated results are often uncontrollable; this fact gave rise to interactive approaches that involve human ratings instead of direct control of parameters.
Rating-based methods often involve Genetic Programming, and such methods have been proposed for symbolic music generation [8] as well as audio [9]. In a rating-based scheme, users listen to examples and rate them, providing fitness evaluations to algorithms that are expected to converge to an estimated "maximum" (a musical result that should be highly rated). Given user-fatigue constraints, i.e. the fact that users cannot undergo many rating rounds, genetic algorithms combined with Particle Swarm Optimisation have been proposed to accelerate convergence towards feature spaces preferred by the user [11]. Cognitively relevant musical features play an important role in how we understand and classify music qualities, and AI/ML methods have been extensively employed for music classification and generation [6]. Real-time interactive systems are also possible by incorporating the notion of feature-based "divergence"; an example is "evoDrummer", which produces variations of drum rhythms in real time according to a desired divergence value given by the user [10].
The advent of new interaction devices, e.g. the Leap Motion controller or the Kinect sensor, allows direct interaction with computers, strengthening the value of the real-time control paradigm. Along these lines, gestural control of compositional processes using the Leap Motion controller has been proposed in [17]. These compositional processes, however, incorporated "fixed" pattern-based compositions of well-known, idiosyncratic music composers. Another approach, using the Kinect sensor, has been implemented in the Harmonic Navigator [16]; this system uses an informative visualisation and interaction interface that allows the user to choose the next chords in harmonic progressions proposed by a Markovian probabilistic model trained on a given corpus.
Recurrent Neural Networks (RNNs) offer a sophisticated "black box" probabilistic model for learning sequences from data; since the early 90s, RNN models have been tested on music composition [18]. Recent developments have allowed experimentation with RNN-based models that employ more sophisticated representations for folk melody generation [20], as well as hybrid deep learning architectures (involving Restricted Boltzmann Machines) for polyphonic music generation [1].
In recent years, a popular RNN variant that incorporates variable memory capacities has been extensively tested for music generation, namely the Long Short-Term Memory (LSTM) network [7]. A typical test case for the generative capabilities of LSTM-based models is to train such networks on very strictly defined styles, such as the Bach chorales [14]. LSTM-based models have been extended with impressive results in composing Bach chorales [4], incorporating information about attributes of the musical score beyond the notes themselves, i.e. metric structure. Similar techniques have been used for generating drum rhythms [15], given information about the metric structure and the activity of other instruments in a recording. For a more detailed overview of work on deep learning in music, the reader is referred to [2].
LSTM-based models, given a proper representation of music sequences and sufficiently large amounts of training data, are able to implicitly learn the characteristics of musical styles and compose music that very effectively reflects those characteristics. However, these networks, being "black-box" models, are not able to digest explicit information: they internally represent high-level features of music that do not, however, correspond to the high-level structures perceived by humans. For instance, the Bach chorales can be rhythmically dense, and at some points Bach may have used higher or lower pitches. However, these systems do not have "conscious" access to such information. Is it possible to develop a system that learns a given style while also being able to "consciously" modify its compositions according to explicitly described musical input features?
The aim of this work is not to provide yet another proof that LSTM-based methodologies are indeed able to learn given styles. The goal is to develop and examine a neural network methodology that incorporates explicit information in forms that humans understand. To this end, a minimal neural network implementation (in terms of architecture, data and representation) is developed that allows a preliminary examination of whether directly infusing desired feature values (such as rhythm density and pitch register) into a generative musical LSTM-based system provides satisfactory results in interactive, feature-driven control of the system's compositions. It should be noted that the study presented herein differs from the approach proposed in [3]. The aforementioned work provided a neural-network-based solution for learning relations/mappings between parameters (whether musical or non-musical) without introducing a generative music system that composes music according to given parameters. A future version of the system, trained with more data and incorporating more features, will be integrated into a STEAM (Science, Technology, Engineering, Arts and Mathematics) education platform developed in the context of the iMuSciCA project, allowing students to play music with virtual instruments using feature-driven composition.
In the paper at hand, such a system is proposed: a system that learns and composes music reflecting the characteristics of given styles while, at the same time, allowing users explicit and interactive control of some musical features. The methodology, described in Section 2, comprises two parts: data augmentation, where modified copies of the available data with diverse characteristics are concatenated to form a unified dataset, and the network architecture. Section 3 presents the evaluation of the trained network in artificial scenarios that allow a rigorous examination of the system's adaptation to the input features. Conclusions are given in Section 4.

METHODOLOGY
This section describes the methodological steps for selecting, processing and augmenting data, and for building a base-level machine learning model that allows focusing on research questions related to interactive control. Since the goal is not to construct and train a sophisticated system that proves the stylistic adaptability of LSTM-based networks, the applied methodologies related to data and network are kept at the most basic level, focusing on the work-flow that would allow the development of a feature-driven LSTM-based system.

Data Selection, Processing and Augmentation
The collected dataset was deliberately kept small, with pieces that have specific, consistent and simple musical characteristics. Since data augmentation is among the data processing steps, a small dataset makes it practical to experiment with many different augmentation setups, considering that training neural networks for many epochs requires considerable processing time.
A small dataset with such characteristics is the "Little dragon" trilogy, which includes three piano pieces suitable for educational purposes. All three pieces are in the C major scale (with simple yet interesting chromatic deviations) and in the 4/4 time signature, with a tempo of 120 beats per minute, considerable rhythmic diversity and a consistent range of less than two octaves.
Data are straightforwardly represented with a binary matrix that indicates note onsets at each 16th note, while durations (or note ending times) are not considered. The binary matrix of a piece is a matrix P ∈ {0, 1}^(128×M), where M is the length of the piece in 16ths; each column represents a 16th slice of the piece and includes a binary indication of which of the 128 MIDI notes begin at this time slice: the rows of the notes that have an onset at each 16th (column) are marked with 1 and all others with 0. The initial matrix representation of the dataset, before augmentation, is constructed by concatenating the columns of the matrices of all pieces (one next to the other), resulting in a matrix X ∈ {0, 1}^(128×N), where N is the sum of the column lengths of all pieces in the dataset. Data were collected in musicXML format and converted to their numerical/matrix format using the Music21 tool (http://web.mit.edu/music21/). Figure 1 shows the data processing steps: after forming the X matrix, data augmentation and feature extraction follow. During data augmentation, the initial dataset, as represented in the X matrix, is doubled consecutively according to how many augmentation steps are followed; each augmentation step generates a clone of the data samples with altered characteristics, which is appended to the pre-augmentation data samples, doubling their size. In Figure 1, the augmented data are shown as A ∈ {0, 1}^(128×kN), where k is an integer that depends on the augmentation steps followed. Music features are extracted from the augmented data samples, and data augmentation aims to create training data that are more diverse with regard to the considered features.
Since the aim is to examine the ability of LSTM-based networks to interactively adapt to requested musical features, two features were extracted that are simple to compute and to track by ear: rhythm density (or sparsity) and mean pitch height (or register). Data augmentation should therefore generate data that are more diverse in terms of rhythmic sparsity and pitch register. Two augmentation steps are followed: (1) Sparsity augmentation: a copy of the available data is generated and, in each bar (16-tuples of columns), columns are randomly selected for deletion according to probabilities that make the first column the most likely to "survive" the deletion process, while other columns (especially those on weaker beats) are more likely to be deleted. A musical illustration of sparsity augmentation is given in Figure 2 (a). (2) Register augmentation: a copy of the data samples is created that is pitch-shifted by some desired number of semitones. Figure 2 (b) shows a musical illustration of register augmentation by an increase of 7 semitones.
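The two augmentation steps can be sketched as follows; the beat-dependent survival probabilities used here are illustrative assumptions (the paper only specifies that the downbeat is the most likely to survive), and the function names are hypothetical:

```python
import numpy as np

def sparsity_augment(X, keep_probs=None, rng=None):
    """Sparsity augmentation: within each bar (16 columns), each column's
    onsets survive with a beat-dependent probability; the first (downbeat)
    column is most likely to survive. The probability values are assumed."""
    rng = rng or np.random.default_rng(0)
    if keep_probs is None:
        beats = np.arange(16)
        # downbeat almost always survives, quarter-note beats often,
        # weaker 16ths are more likely to be deleted (assumed values)
        keep_probs = np.where(beats == 0, 0.95,
                     np.where(beats % 4 == 0, 0.7, 0.4))
    A = X.copy()
    for bar_start in range(0, X.shape[1] - 15, 16):
        survive = rng.random(16) < keep_probs
        A[:, bar_start:bar_start + 16] *= survive.astype(A.dtype)
    return A

def register_augment(X, semitones):
    """Register augmentation: a copy pitch-shifted by `semitones`
    (MIDI rows shifted; pitches shifted past the range are dropped)."""
    A = np.roll(X, semitones, axis=0)
    if semitones > 0:
        A[:semitones, :] = 0   # clear rows that wrapped around from the top
    elif semitones < 0:
        A[semitones:, :] = 0   # clear rows that wrapped around from the bottom
    return A
```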
The augmented copies generated in this fashion are appended to the end of the available data samples, consecutively doubling their size. For example, if sparsity augmentation is applied to a data sample of length N, its length becomes 2N; if register augmentation follows, the 2N size is doubled to 4N. After augmenting the data with the aforementioned steps (register augmentation can be applied several times with different pitch-shifting values), two features are extracted regarding sparsity and mean register/pitch height. Features are extracted for each 16th time step, over a sliding window of nine 16ths -- four 16ths before and four after the event of interest. Specifically, the two features under discussion are computed for each column in the augmented matrix inside the corresponding window as follows: (1) Density (as opposed to sparsity): the ratio of events in the window that include at least one onset (matrix columns that include a monophonic or polyphonic event) to the total number of events in the window. (2) Register: the mean value of the highest and the lowest pitch in the window of the event of interest, normalised by the value of the highest pitch in the dataset. If no pitch is present in the window, so that the feature cannot be computed, the value is inherited from the previous window; if no pitches are included in the first window, the value 0.5 is assigned.
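The per-column feature computation described above could be sketched as follows; `max_pitch` stands for the highest pitch in the dataset, and the handling of the shrinking window at the matrix edges is an assumption:

```python
import numpy as np

def extract_features(A, half_win=4, max_pitch=127):
    """Compute the density and register features for each 16th-note column
    of the augmented matrix A, over a window of 2*half_win + 1 = 9 steps
    centred on the column (clipped at the matrix edges)."""
    n = A.shape[1]
    density = np.zeros(n)
    register = np.zeros(n)
    prev_reg = 0.5  # fallback value when no pitches have appeared yet
    for t in range(n):
        lo, hi = max(0, t - half_win), min(n, t + half_win + 1)
        win = A[:, lo:hi]
        # density: fraction of window steps that contain at least one onset
        density[t] = (win.sum(axis=0) > 0).mean()
        pitches = np.nonzero(win.any(axis=1))[0]
        if pitches.size:
            # mean of highest and lowest pitch, normalised by max_pitch
            prev_reg = (pitches.max() + pitches.min()) / 2 / max_pitch
        register[t] = prev_reg  # inherited from previous window if empty
    return density, register
```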
Both feature values lie in the range [0, 1]. The feature values in the window of each event have the characteristics of step functions; e.g. the density feature only takes values of the form i/9 (or i/window size), where i ∈ {0, 1, ..., 9}. Since we are interested in providing interactive control of continuous feature values, the features were slightly smoothed by applying a moving average to each feature in a window of length 4 (the integer part of half the window size).
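The smoothing step amounts to a simple moving average; a minimal sketch (the edge handling via `mode='same'` is an assumption):

```python
import numpy as np

def smooth(feature, win=4):
    """Moving-average smoothing (window length 4, the integer part of half
    the 9-step feature window) so the step-like feature curves become
    approximately continuous."""
    kernel = np.ones(win) / win
    return np.convolve(feature, kernel, mode='same')
```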
The feature values extracted from the augmented data are shown in Figure 3. The data samples have been augmented by applying the augmentation steps in the following order: sparsity augmentation, register augmentation 12 semitones down and register augmentation 12 semitones up. The feature values in the aforementioned figure clearly show the feature areas formed by applying these augmentation steps. The initial copy of the data occupies time steps up to 848 and, since sparsity augmentation is applied first, the first copy of the initial data (time steps 849 to 1696) is evidently an area with smaller density values (upper graph in Figure 3); the respective register feature values in the lower graph are the same as for the initial copy of the data. Afterwards, when the register −12 and +12 augmentations are applied, two new copies (of length 1696) are created, leading to a final data sample of length 5088 time steps. The lower graph clearly shows the register differences of the new copies, while the respective copies in the density feature are identical to the first copy.

Network functionality, architecture and training
The "vanilla" version of the network presented in this paper learns compositions implicitly while, at the same time, integrating knowledge about the musical characteristics of what it is actually learning, in terms of the aspects reflected by the aforementioned features. The goal is to allow users to interact with this network "explicitly" by providing desired musical characteristics (or target feature values) that lead the network to compose music reflecting these characteristics. Figure 4 illustrates the desired work-flow and the basic architecture employed for the network under discussion.
Generating: Note events are represented as 128-dimensional binary arrays of onsets, which are the columns of a binary matrix representing the entire data. When in "generating" mode, the two desired features for the next event are appended to the 128-dimensional array of the current event, creating a final 130-dimensional input array to the network. The network then generates the next note event (a 128-dimensional array), which is concatenated with the new user-defined features and fed back as input to recursively compose new music. Initially, the network is given a seed sequence of 16 data samples taken from the training dataset.
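The generating loop can be sketched as follows; `model_step` stands in for the trained LSTM's one-step prediction (not reproduced here), and the thresholding of the network's output probabilities into binary onsets is an assumed decoding choice:

```python
import numpy as np

def generate(model_step, seed, target_features, threshold=0.5):
    """Recursive generation sketch. `model_step(x)` is a placeholder for the
    trained network: it maps a 130-dim input (128-dim current event plus the
    two desired features for the next event) to a 128-dim array of onset
    probabilities. `seed` is a (128, 16) excerpt from the training data;
    `target_features` is an iterable of (density, register) pairs supplied
    interactively by the user."""
    history = [seed[:, i] for i in range(seed.shape[1])]
    composed = []
    for density, register in target_features:
        # 130-dim input: current event with the desired features appended
        x = np.concatenate([history[-1], [density, register]])
        probs = model_step(x)
        event = (probs > threshold).astype(np.int8)  # binary, not one-hot
        history.append(event)
        composed.append(event)
    return np.array(composed).T  # (128, n_generated_steps)
```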
The presented "vanilla" version of the network comprises an LSTM layer of 64 units (see Section 3). Training: Figure 5 illustrates the training process and how it relates to the data processing steps. The augmented binary matrix (A ∈ {0, 1}^(128×kN)) is vertically concatenated with the feature matrix (F ∈ [0, 1]^(2×kN)), after the feature matrix has been column-wise shifted by one position to the left to correspond to the features of the next event. It should be noted that in Figure 5 the f_i and n_(i−1) vectors are illustrated as separate entities for visualisation purposes; however, they are actually concatenated into a single vector. The input matrix (the vertical concatenation of F and A) is divided into batches of length 320 (corresponding to 20 bars), while the network is unfolded in time for 16 steps (one bar) during training. The cost function is the sigmoid cross-entropy (since the note representation is binary and not one-hot) and the employed learning algorithm was the Adam optimiser [12].
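The preparation of the training input described above can be sketched as follows (the function name is illustrative; the circular wrap-around of the last column produced by the shift is ignored here, since it affects a single time step):

```python
import numpy as np

def make_training_input(A, F, batch_len=320):
    """Vertically concatenate the note matrix A (128 x kN) with the feature
    matrix F (2 x kN) shifted one column to the left, so that each input
    column pairs the current event with the features of the *next* event;
    then cut the result into batches of 320 columns (20 bars)."""
    F_shift = np.roll(F, -1, axis=1)   # features of the next event
    inputs = np.vstack([F_shift, A])   # 130 x kN input matrix
    n_batches = inputs.shape[1] // batch_len
    return [inputs[:, i * batch_len:(i + 1) * batch_len]
            for i in range(n_batches)]
```

During training, each 320-column batch would then be unfolded in time in 16-step (one-bar) segments, with the next note column as the prediction target.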

RESULTS
The experiments are designed to test the extent to which the system can generate music that corresponds to the musical attributes used as input. Ideally, the network should be able to follow the user-provided feature input and compose music that reflects the respective characteristics in real time. The experimental processes are based on artificial "simulated" scenarios that test the adaptability of the system to pre-defined target feature curves. To this end, the feature values described by these curves are given as input to the system at every musical event it composes. After the system generates a composition, the "achieved" features of this composition at every time step are compared with the "target" features that were used as input; high correlations between the target and the achieved features would support the assumption that the system does indeed have the ability to adapt. An on-line version of the system can be found at https://athena.imuscica.eu/featureLSTM. The system is examined using three input curve scenarios, illustrated in Figure 6.
It should be noted that the ranges of the artificial target features were set to [0.3, 0.8] for density and [0.2, 0.8] for register. Regarding density, given that the maximum number of voices is 4, a density below 0.25 would mean silence -- which indeed happens in simulations where density values below 0.25 are requested; however, testing the ability of the system to retain the capacity to produce notes close to the lowest possible density was considered more interesting for the presented results. The choice of 0.8 as the upper limit for both features and 0.2 as the lower limit for the register feature relates to the feature values observed in the training set, which lie within these ranges. Nevertheless, even when the system is given feature values that exceed these limits, the results are similar.
The illustrations in Figure 6 reveal that the system is indeed able to adapt, to some extent, to the feature inputs. This is also numerically evident in the high correlation values between the "target" and the "achieved" feature curves, shown in the "corr." column of Table 1. This table also examines another hypothesis: that the achieved features should incorporate some lag relative to the target features, since it can be assumed that the system needs some delay to prepare proper musical responses to a given input.
To examine this hypothesis, the achieved features are shifted backwards one step at a time and their correlation with the target features is measured at each shift. The "opt. lag" and "opt. corr." columns in Table 1 show the optimal shift values and the correlations they achieve, respectively. In some cases, the optimal correlation is achieved with zero lag, while in the cases where the optimum requires (even a considerable amount of) shifting, the gain in correlation is minimal (in most cases around 0.03). Therefore, it can be considered that even without shifting, satisfactory results are achieved in real time, at least as far as correlation can indicate. Two additional remarks need to be made regarding the system architecture and the nature of the input features. Regarding the network architecture, the best results were achieved using 64 units for the LSTM network. Using 256 or 512 units produced poor results, since the network chose to be silent too often. This indicates that, given the small amount of training data, larger networks over-specialise and are not able to generalise and provide new solutions for combinations of input feature values and previous note layouts not encountered during training.
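The lag analysis described above can be sketched as follows; the maximum shift considered (`max_lag`) is an assumed parameter:

```python
import numpy as np

def optimal_lag(target, achieved, max_lag=32):
    """Shift the achieved feature curve backwards one step at a time and
    measure its correlation with the target curve at each shift; return
    the lag yielding the highest correlation, together with that value."""
    best_lag, best_corr = 0, np.corrcoef(target, achieved)[0, 1]
    for lag in range(1, max_lag + 1):
        # align achieved[t] with target[t - lag]
        c = np.corrcoef(target[:-lag], achieved[lag:])[0, 1]
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag, best_corr
```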
Regarding the nature of the features, the artificial scenarios were also tested with different levels of noise, examining the hypothesis that the plateaus and steady-gradient changes in the input feature values might not be "realistic" enough, since in training the features appear noisier. This hypothesis has been rejected, as the results (not presented due to space constraints) of 100 simulations per noise level (ranging from 0.01 to 0.2) indicate that the achieved results are comparable to those produced by the shown artificial input features.

CONCLUSIONS
A system based on Long Short-Term Memory (LSTM) neural networks has been presented that can compose polyphonic music reflecting the characteristics of a learned style while, at the same time, integrating user-defined features. A data augmentation technique is also proposed that generates copies of the dataset which incorporate the examined features more intensely. Specifically, the dataset is augmented with regard to the features of rhythm density and pitch height/register, and the trained system is able to compose music that is more or less rhythmically dense, and with higher or lower pitch registers, according to the feature values requested as input. The results on artificial scenarios show that satisfactory correlations between target features and achieved features (extracted from the artificial generations) are accomplished, indicating that interactive, real-time control of these parameters is indeed feasible. Future work will focus on integrating more features, such as rhythm syncopation or harmonic tension. Additionally, the integration of the feature-related layer into more advanced LSTM-based methodologies will be examined, along with the application of the data augmentation technique to "benchmark" datasets such as the Bach chorales.