A machine learning approach to ornamentation modeling and synthesis in jazz guitar

We present a machine learning approach to automatically generate expressive (ornamented) jazz performances from un-expressive music scores. Features extracted from the scores and the corresponding audio recordings performed by a professional guitarist were used to train computational models for predicting melody ornamentation. As a first step, several machine learning techniques were explored to induce regression models for timing, onset, and dynamics (i.e. note duration and energy) transformations, and an ornamentation model for classifying notes as ornamented or non-ornamented. In a second step, the most suitable ornament for predicted ornamented notes was selected based on note context similarity. Finally, concatenative synthesis was used to automatically synthesize expressive performances of new pieces using the induced models. Supplemental online material for this article containing musical examples of the automatically generated ornamented pieces can be accessed at doi: 10.1080/17459737.2016.1207814 and https://soundcloud.com/machine-learning-and-jazz. In the Online Supplement we present an example of the musical piece Yesterdays by Jerome Kern, which was modeled using our methodology for expressive music performance in jazz guitar.


Introduction
Performance actions (PAs) can be defined as musical resources used by musicians to add expression when performing a musical piece, which consist of variations in timing, pitch, and energy. In the same context, ornamentation can be considered as an expressive musical resource used to embellish and add expression to a melody. In the past, music expression has been mostly studied in the context of classical music, e.g. Puiggròs et al. (2006), in particular classical piano music, e.g. Widmer and Tobudic (2003). Contrary to classical music scores, performance annotations (e.g. ornaments and articulations) are seldom indicated in popular music (e.g. jazz music) scores, and it is up to the performer to include them based on his/her musical background. Therefore, in popular music it may not always be possible to characterize ornaments with the archetypical classical music conventions (e.g. trills and appoggiaturas). Several approaches have been proposed to generate expressive performances in jazz saxophone music, e.g. Arcos, De Mantaras, and Serra (1998), Ramírez and Hazan (2006), and Grachten (2006). Ramírez and Hazan (2006) describe a method to predict ornamentation (among other performance actions). Grachten (2006) detects ornaments of multiple notes to render expressive-aware tempo transformations. Other methods are able to recognize and characterize ornamentation in popular music, e.g. Gómez et al. (2011) and Perez et al. (2008). However, due to the complexity of free ornamentation, most of these approaches study ornamentation in constrained settings, for instance by restricting the study to one-note ornaments or notated trills, e.g. Puiggròs et al. (2006). Based on our previous studies in expressive performance modeling in jazz music (Giraldo 2012; Giraldo and Ramírez 2015a, 2015b, 2015c, 2015d), this article presents a system for automatically predicting and synthesizing expressive jazz guitar music performances with unrestricted ornamentation.
The aim of this work is twofold: (1) to train computational models of music expression using recordings of a professional jazz guitar player, and (2) to synthesize expressive ornamented performances from inexpressive scores. The general framework of the system is depicted in Figure 1. In order to train a jazz guitar ornamentation model, we recorded a set of 27 jazz standards performed by a professional jazz guitarist. We extracted symbolic features from the scores using information on each note, information on the neighboring notes, and information related to the musical context. The performed pieces were automatically transcribed by applying note segmentation based on pitch and energy information. After performing score-to-performance alignment, using dynamic time warping (DTW), we calculated performance actions by measuring the deviations between performed notes and their respective parent notes in the score. For model evaluation, the data set was split using a leave-one-piece-out approach in which each piece was in turn used as test set, using the remaining pieces as training set. The expressive actions considered in this article were duration, onset, energy, and ornamentation transformations. Concatenative synthesis was used to synthesize new ornamented jazz melodies using samples of adapted notes/ornaments from the segmented audio recordings.
The rest of the article is organized as follows. Section 2 surveys related work. Section 3 describes data acquisition. Section 4 presents our machine learning approach to ornament prediction. Section 5 describes the audio synthesis method. Section 6 reports on the results, and finally, Section 7 presents some conclusions and future work.

Related work
Expressive music performance studies the micro variations a performer introduces (voluntarily or involuntarily) when performing a musical piece to add expression. Several studies investigating this phenomenon have been conducted, e.g. Gabrielsson (1999, 2003) and Palmer (1997). Computational approaches to studying expressive music performance have been proposed in which data are extracted from real performances and used to formalize expressive models for different aspects of performance; for an overview see Goebl et al. (2008). Computational systems for expressive music performance (CEMP) are often targeted at automatically generating human-like performances by introducing variations in timing, energy, and articulation (Kirke and Miranda 2013).
Two main approaches have been used to model expression computationally. On one hand, expert-based systems obtain their rules manually from music experts. A relevant example is the work of the KTH group (Bresin and Friberg 2000; Friberg, Bresin, and Sundberg 2006; Friberg 2006). Their Director Musices system incorporates rules for tempo, dynamic, and articulation transformations. Other examples of manually generated expressive systems are the Hierarchical Parabola Model (Todd 1989, 1992, 1995) and the work of Johnson (1991), who developed a rule-based expert system to determine expressive tempo and articulation for Bach's fugues from The Well-Tempered Clavier. The rules were obtained from two expert performers. On the other hand, machine-learning-based systems obtain their expressive models from real music performance data by measuring the deviations of a human performance with respect to a neutral or robotic performance, using computational learning tools. For example, neural networks were used by Bresin (1998) to model piano performances, and by Camurri, Dillon, and Saron (2000) to model emotional flute performances. Rule-based learning algorithms were used by Widmer (2003) to cluster piano performance rules. Other piano expressive performance systems worth mentioning are the ESP piano system by Grindlay (2005), which utilizes Hidden Markov Models, and the generative performance system of Miranda, Kirke, and Zhang (2010), which uses genetic algorithms to construct tempo and dynamic curves.
Most of the proposed expressive music systems are targeted at classical piano music. More recently, there have been several approaches to computationally modeling expressive performance in popular music by applying machine learning techniques. Arcos, De Mantaras, and Serra (1998) report on SaxEx, a performance system capable of generating expressive solo saxophone performances in jazz, based on case-based reasoning. Ramírez and Hazan (2006) compare different machine learning techniques to obtain jazz saxophone performance models capable of both automatically synthesizing expressive performances and explaining expressive transformations. Grachten (2006) applies dynamic programming using an extended version of edit distance and case-based reasoning to detect multiple note ornaments and render expressive-aware tempo transformations for jazz saxophone music. In previous work (Giraldo 2012; Giraldo and Ramírez 2015a, 2015b, 2015c), ornament characterization in jazz guitar performances is accomplished using machine learning techniques to train models for note ornament prediction.

Data acquisition
In this study, the data set consisted of 27 jazz standard audio recordings (resulting in a total of 1368 notes) recorded by a professional jazz guitarist, and their corresponding music scores. Each note in the score of the recorded pieces was characterized by a set of 30 descriptors. The music scores and audio recordings were analyzed as explained in Sections 3.1 and 3.2, respectively.

Score analysis
Music scores were obtained from commercially available compilations of jazz scores (The Real Book Series). Selected scores were rewritten using an open source software for music notation, and saved into MusicXML format. The MusicXML format allows not only information about the notes (pitch, onset, and duration) to be stored, but also other relevant information for note description such as chords, key, and tempo (among others).
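As a rough illustration of the note information that MusicXML stores, the following sketch reads pitch, onset, and duration from a small hand-written fragment using only the Python standard library. The fragment, the function name read_notes, and the conversion to MIDI pitch numbers are illustrative assumptions and not the tooling used in this study; chords, ties, and grace notes are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Hand-written MusicXML fragment (illustrative, not taken from the data set).
FRAGMENT = """
<part id="P1">
  <measure number="1">
    <attributes><divisions>2</divisions></attributes>
    <note><pitch><step>C</step><octave>4</octave></pitch><duration>2</duration></note>
    <note><pitch><step>E</step><alter>-1</alter><octave>4</octave></pitch><duration>1</duration></note>
    <note><rest/><duration>1</duration></note>
  </measure>
</part>
"""

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def read_notes(xml_text):
    """Return (midi_pitch or None for rests, onset, duration) per note.

    Onsets and durations are expressed in quarter-note beats.
    """
    root = ET.fromstring(xml_text)
    divisions = 1  # MusicXML duration units per quarter note
    onset = 0.0
    notes = []
    for measure in root.iter("measure"):
        for element in measure:
            if element.tag == "attributes":
                div = element.find("divisions")
                if div is not None:
                    divisions = int(div.text)
            elif element.tag == "note":
                duration = int(element.find("duration").text) / divisions
                pitch = element.find("pitch")
                midi = None  # stays None for rests
                if pitch is not None:
                    step = pitch.find("step").text
                    octave = int(pitch.find("octave").text)
                    alter = int(pitch.findtext("alter", default="0"))
                    midi = 12 * (octave + 1) + SEMITONES[step] + alter
                notes.append((midi, onset, duration))
                onset += duration
    return notes


notes = read_notes(FRAGMENT)
```

Here the `divisions` element sets how many duration units make up one quarter note, which is why durations are divided by it before being accumulated into onsets.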

Feature extraction
Feature extraction was performed following an approach similar to that of Giraldo (2012), in which each note is characterized by its nominal, neighboring, and contextual properties.
• Nominal descriptors refer to the intrinsic properties of score notes (e.g. pitch, duration, and onset). Duration and onsets were described both in beats and seconds, as the duration in seconds depends on the tempo of the piece. For example the choice of ornamenting two different notes from different pieces with quarter note duration (beats) may differ if the pieces are played at slow and fast tempos. The energy descriptor refers to the loudness of the note, which in MIDI format is measured as velocity (how fast a piano key was pressed).
• Given a particular note, its neighboring descriptors refer to the properties of its neighboring notes, e.g. previous/next interval, previous/next duration ratio, previous/next inter-onset interval. In this work, only one previous and one following note were considered. Inter-onset distance (Giraldo and Ramírez 2015b) refers to the onset difference between two consecutive notes.
• Contextual descriptors refer to the musical context in which the note occurs, e.g. tempo, chord, and key. The phrase descriptor (Giraldo and Ramírez 2015b) refers to the note position within a phrase: initial, middle, or end. Phrase descriptors were obtained using the melodic segmentation approach of Cambouropoulos (1997), which indicates the probability of each note being at a phrase boundary. Probability values were used to decide if the note was a boundary note, annotated as either initial (i) or ending (e). Non-boundary notes were annotated as middle (m). The phrase descriptor was introduced based on the hypothesis that boundary notes (i.e. initial or ending phrase notes) are more prone to be ornamented than middle notes. Note to key and note to chord descriptors are intended to capture harmonic analysis information, as they refer to the interval of a particular note with respect to the key and to the chord root, respectively. Key and mode refer to the key signature of the song (e.g. key: C, mode: major).
The metrical strength concept refers to the rhythmic position of the note inside the bar (Cooper and Meyer 1960). Four levels of metrical strength were used to label notes in three common time signatures, depending on the beat at which the note occurs, as shown in Table 2.
The complete list of the 30 descriptors used for this study, together with their definitions, is given in Table 3.
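The neighboring descriptors listed above can be sketched in a few lines. This is a minimal illustration only: the dictionary keys, the sign convention for intervals, and the direction of the duration ratios are assumptions made here, since the exact definitions are those of Table 3.

```python
def neighbor_descriptors(notes):
    """Compute previous/next descriptors for each note.

    notes: list of dicts with 'pitch' (MIDI number), 'onset' and
    'duration' (both in beats). Descriptors that need a missing
    neighbor (first or last note) are set to None.
    """
    rows = []
    for k, note in enumerate(notes):
        prev = notes[k - 1] if k > 0 else None
        nxt = notes[k + 1] if k < len(notes) - 1 else None
        rows.append({
            # Intervals in semitones, positive when the melody ascends.
            "prev_interval": note["pitch"] - prev["pitch"] if prev else None,
            "next_interval": nxt["pitch"] - note["pitch"] if nxt else None,
            # Ratio of the later note's duration to the earlier note's.
            "prev_duration_ratio": note["duration"] / prev["duration"] if prev else None,
            "next_duration_ratio": nxt["duration"] / note["duration"] if nxt else None,
            # Inter-onset interval: onset difference of consecutive notes.
            "prev_ioi": note["onset"] - prev["onset"] if prev else None,
            "next_ioi": nxt["onset"] - note["onset"] if nxt else None,
        })
    return rows


# Hypothetical three-note melody.
melody = [
    {"pitch": 60, "onset": 0.0, "duration": 1.0},
    {"pitch": 64, "onset": 1.0, "duration": 0.5},
    {"pitch": 62, "onset": 1.5, "duration": 2.0},
]
rows = neighbor_descriptors(melody)
```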

Audio analysis
The audio of the performed pieces was recorded from the raw signal of an electric guitar. The guitarist was instructed not to strum chords or play more than one note at a time. The guitarist recorded the pieces while playing along with prerecorded commercial accompaniment backing tracks (Kennedy and Kernfeld 2002). We opted to use audio backing tracks performed by professional musicians, as opposed to synthesized MIDI backing tracks, in order to provide a more natural and ecologically valid performance environment. However, using audio backing tracks required a preprocessing beat tracking task. Each section of a piece was recorded once (i.e. no repetitions or solos were recorded). For instance, for a piece consisting of sections AABB, only sections A and B were considered.

Melodic transcription
The monophonic audio signal recorded from the guitar was parsed to automatically obtain a MIDI type transcription of the notes performed by the guitarist, based on the previous work of Bantula, Giraldo, and Ramírez (2014). This representation includes the pitch, onset, duration, and energy of each note. To do this, the audio signal was segmented based on the pitch and energy profiles obtained with the YIN algorithm (De Cheveigné and Kawahara 2002). Each segment represents a note with its corresponding information on pitch, onset, and offset (and therefore duration).
To minimize transcription errors, the resulting segments (notes) were filtered using heuristic rules based on human perceptual thresholds for minimum note duration and minimum note gaps. Also, rules to detect octave errors or unusual note intervals were used. To obtain temporal information on the recordings, beat tracking (Zapata et al. 2012) was used, as there was uncertainty concerning the use of a metronome in the recordings of the accompaniment backing tracks. After manual correction of beat tracking, the onset and duration information on each note was adjusted to the beat grid detected for each piece.
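One simple way to express note times on a tracked beat grid is linear interpolation between consecutive beat times. The sketch below assumes this interpolation scheme and uses hypothetical beat times; the function name seconds_to_beats is introduced for illustration and the exact adjustment procedure used in this study may differ.

```python
from bisect import bisect_right


def seconds_to_beats(t, beat_times):
    """Map a time in seconds to a fractional beat index.

    beat_times: increasing list of tracked beat times in seconds.
    Times outside the tracked range are extrapolated from the first
    or last beat interval.
    """
    if len(beat_times) < 2:
        raise ValueError("at least two beat times are required")
    # Index of the beat interval that contains t, clamped to a valid interval.
    k = bisect_right(beat_times, t) - 1
    k = max(0, min(k, len(beat_times) - 2))
    span = beat_times[k + 1] - beat_times[k]
    return k + (t - beat_times[k]) / span


beats = [0.0, 0.5, 1.1, 1.6]  # hypothetical beat tracker output (seconds)
onset_beats = seconds_to_beats(0.8, beats)
```

A note duration in beats can then be obtained as the difference between the beat positions of its offset and its onset, which keeps durations consistent when the tempo of the backing track drifts.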

Score to performance alignment
Score to performance alignment was performed to correlate each performed note with its respective parent note in the score as depicted in Figure 3. This procedure was carried out following the approach of Giraldo and Ramírez (2015d), in which DTW techniques were used to match performance and score note sequences. A similarity cost function was designed based on pitch, duration, onset, and phrase onset/offset deviations. Phrase onset and offset deviations were introduced to force the algorithm to map all the notes of a particular short ornament phrase (lick) to one parent note in the score. We assumed that a group of notes forming a lick is played legato. Therefore, the performed sequence is segmented into phrases in which the time gap between consecutive notes is less than 50 ms. This threshold was chosen based on human time perception studies (Woodrow 1951).
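The 50 ms legato criterion can be sketched as follows. The representation of notes as (onset, offset) pairs in seconds and the function names are assumptions made for illustration.

```python
def segment_licks(notes, max_gap=0.050):
    """Group performed notes into phrases (licks).

    notes: list of (onset, offset) pairs in seconds, sorted by onset.
    A note joins the current phrase when the gap between its onset and
    the previous note's offset is below max_gap (50 ms by default).
    """
    phrases = []
    for note in notes:
        if phrases and note[0] - phrases[-1][-1][1] < max_gap:
            phrases[-1].append(note)
        else:
            phrases.append([note])
    return phrases


def phrase_bounds(phrase):
    """Return (phrase onset, phrase offset) for one phrase."""
    return phrase[0][0], phrase[-1][1]


# Hypothetical performed notes: three legato notes, then a separated one.
performed = [(0.00, 0.20), (0.21, 0.40), (0.42, 0.60), (0.90, 1.30)]
phrases = segment_licks(performed)
```

Every note in a phrase shares the phrase onset and offset returned by phrase_bounds, which is what lets the alignment cost pull all notes of a lick toward the same parent note.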
Each note from the score and the corresponding performed sequence is represented by a five position cost vector as

$$c_s(i) = \bigl(p(i),\, ds(i),\, ons(i),\, ons(i),\, ofs(i)\bigr) \qquad (1)$$

and

$$c_p(j) = \bigl(p(j),\, ds(j),\, ons(j),\, ph_{ons}(j),\, ph_{ofs}(j)\bigr), \qquad (2)$$

respectively, where $c_s$ is the score cost vector and $c_p$ is the performance cost vector. Index $i$ refers to a note position in the score sequence, and $j$ refers to a note position in the performed sequence. The onset of the first note of the lick phrase in which the $j$th note of the performance sequence occurs is represented by $ph_{ons}(j)$. Similarly, $ph_{ofs}(j)$ refers to the offset of the last note of the lick phrase in which the $j$th note of the performance sequence occurs. The total cost is calculated using the Euclidean distance as follows:

$$\mathrm{cost}(i, j) = \sqrt{\sum_{n=1}^{5} \bigl(c_{s,n}(i) - c_{p,n}(j)\bigr)^2}, \qquad (3)$$

where $c_{s,n}$ and $c_{p,n}$ denote the $n$th positions of the two vectors. Notice that in equation (3) phrase onset and offset deviations are calculated when $n$ equals four and five.
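As a small worked example of the Euclidean cost of equation (3), the sketch below compares two five-position vectors. The numeric values are hypothetical, and the layout assumed for the score vector (pitch, duration, onset, then the score note's own onset and offset in positions four and five) is an assumption of this sketch.

```python
import math


def note_cost(cs, cp):
    """Euclidean distance between a score and a performance cost vector."""
    if len(cs) != 5 or len(cp) != 5:
        raise ValueError("cost vectors must have five positions")
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(cs, cp)))


# Hypothetical vectors: (pitch, duration, onset, phrase onset, phrase offset).
cs = (60.0, 1.0, 0.0, 0.0, 1.0)
cp = (63.0, 1.0, 4.0, 0.0, 1.0)
cost = note_cost(cs, cp)
```

In this example the vectors differ by 3 in pitch and 4 in onset, so the cost is 5. Because the positions are combined without weights, the units chosen for each position (semitones, beats, seconds) determine how much each deviation contributes.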
Finally, we apply DTW: a similarity matrix $H_{m \times n}$ is defined in which $m$ is the length of the performed sequence of notes and $n$ is the length of the sequence of score notes. Each cell of the matrix $H$ is calculated as follows:

$$H(i, j) = \mathrm{cost}(i, j) + \min\bigl(H(i-1, j),\, H(i, j-1),\, H(i-1, j-1)\bigr),$$

where $\mathrm{cost}(i, j)$ is the total cost of equation (3) between score note $i$ and performed note $j$, and $\min$ is a function that returns the minimum value of the preceding cells (up, left, and up-left diagonal). The matrix $H$ is indexed by the note position of the score sequence and the note position of the performance sequence. A backtrack path is obtained by finding the lowest cost calculated in the similarity matrix. Starting from the last score/performance note cell, the cell with the minimum cost at positions $H(i-1, j)$, $H(i, j-1)$, and $H(i-1, j-1)$ is stored in a backtrack path array. The process iterates until the indexes arrive at the first position of the matrix, assigning each note in the performance to a parent note in the score.
Figure 4 presents an example of the resulting similarity matrix obtained for one of the recorded songs. The x-axis corresponds to the sequence of notes of the score and the y-axis corresponds to the sequence of performed notes. The cost of correspondence between all possible pairs of notes is depicted darker for the highest cost (less similar) and lighter for the lowest cost (most similar). The dots on the graph show the backtrack path (or optimal path) found for alignment. Diagonal lines represent notes which were not ornamented, as the correspondence from the performance notes to the parent score notes is one to one. On the contrary, vertical lines represent ornaments, as two or more performed notes correspond to one parent note in the score.
Because there are no concrete rules to map performance notes to parent score notes, our alignment algorithm was evaluated by comparing its output with the level of agreement between five human experts who were asked to align performance and score note sequences manually.
Accuracy of the system was estimated by quantifying how much each note pair produced by the algorithm agreed with the human experts, using penalty factors for high, medium, and low agreement. The results of the evaluation showed that the performance of our approach was comparable with that of the human annotators. Details of these evaluations can be found in Giraldo and Ramírez (2015d).
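The cumulative cost recurrence and the backtracking step of the alignment can be sketched as below. This is a generic DTW illustration rather than the exact implementation of Giraldo and Ramírez (2015d): the toy cost uses absolute pitch difference only, the boundary handling is an assumption, and the matrix is indexed as H[score index][performance index].

```python
def dtw_align(cost):
    """Align two sequences given a pairwise cost matrix.

    cost[i][j] is the cost of matching score note i with performed note j.
    Returns the backtrack path as a list of (i, j) pairs from first to last.
    """
    n_score, n_perf = len(cost), len(cost[0])
    inf = float("inf")
    H = [[inf] * n_perf for _ in range(n_score)]
    for i in range(n_score):
        for j in range(n_perf):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    H[i - 1][j] if i > 0 else inf,                 # up
                    H[i][j - 1] if j > 0 else inf,                 # left
                    H[i - 1][j - 1] if i > 0 and j > 0 else inf,   # diagonal
                )
            H[i][j] = cost[i][j] + best
    # Backtrack from the last cell toward the first, always moving to the
    # preceding cell with the lowest cumulative cost.
    i, j = n_score - 1, n_perf - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((H[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((H[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((H[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


# Toy example: the performer inserts one extra note (63) before score note 62.
score_pitches = [60, 62, 64]
performed_pitches = [60, 63, 62, 64]
toy_cost = [[abs(s - p) for p in performed_pitches] for s in score_pitches]
path = dtw_align(toy_cost)
```

In the resulting path, score note 1 is paired with performed notes 1 and 2, which corresponds to the case described above in which several performed notes share one parent note.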

Performance actions
After alignment, score notes which were mapped to only one performance note were labeled as non-ornamented, whereas score notes mapped to several performance notes (or which were omitted in the performance) were labeled as ornamented. Performance actions were calculated for each score note, as defined in Table 4, by measuring the deviations in onset, energy, and duration. Again, indexes i and j refer to the note position in the score and the performance sequence, respectively.
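The labeling rule and two of the deviations can be sketched as follows. The formulas used here (duration ratio as performed over score duration, onset deviation as performed minus score onset) are assumptions made for illustration, since the definitions used in this study are those of Table 4; the energy ratio is omitted because its reference value is not stated in the text.

```python
from collections import defaultdict


def performance_actions(path, score, performance):
    """Label score notes and compute deviations for non-ornamented notes.

    path: (score index, performance index) pairs from the alignment.
    score, performance: lists of dicts with 'onset' and 'duration' in beats.
    """
    mapped = defaultdict(list)
    for i, j in path:
        mapped[i].append(j)
    actions = []
    for i, s in enumerate(score):
        js = mapped.get(i, [])
        if len(js) == 1:
            p = performance[js[0]]
            actions.append({
                "ornamented": False,
                "duration_ratio": p["duration"] / s["duration"],
                "onset_deviation": p["onset"] - s["onset"],
            })
        else:
            # Several performed notes, or none (omitted note): ornamented.
            actions.append({"ornamented": True})
    return actions


# Hypothetical alignment in which score note 1 has two performed notes.
path = [(0, 0), (1, 1), (1, 2), (2, 3)]
score = [
    {"onset": 0.0, "duration": 1.0},
    {"onset": 1.0, "duration": 1.0},
    {"onset": 2.0, "duration": 2.0},
]
performance = [
    {"onset": 0.25, "duration": 0.5},
    {"onset": 1.0, "duration": 0.25},
    {"onset": 1.25, "duration": 0.75},
    {"onset": 2.0, "duration": 3.0},
]
actions = performance_actions(path, score, performance)
```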

Database construction
The collected data were organized by storing each note's descriptors along with its corresponding performance actions. The pitch, duration, onset, and energy deviations of each ornament note with respect to the score parent note were annotated as shown in Table 5.

Expressive performance modeling
Several machine learning algorithms, i.e. artificial neural networks (ANNs), decision trees (DTs), support vector machines (SVMs), and k-nearest neighbor (k-NN), were applied to predict the ornaments introduced by the musician when performing a musical piece. The accuracy of the resulting classifiers was compared with the baseline classifier, i.e. the classifier which always selects the most common class. Timing, onset, and energy performance actions were modeled by applying several regression machine learning methods (i.e. ANNs, regression trees (RTs), SVMs, and k-NN).
Based on their accuracy, we chose the best performance model and feature set to predict the different performance actions. Each piece (used as test set) was in turn predicted based on the models obtained with the remaining pieces (used as training set) and synthesized using a concatenative synthesis approach.
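The leave-one-piece-out scheme and the majority-class baseline can be sketched together in plain Python. The piece names, the labels "o" (ornamented) and "n" (non-ornamented), and the function names are hypothetical; in the study itself the models were trained with WEKA.

```python
from collections import Counter


def leave_one_piece_out(pieces):
    """Yield (test_name, train_notes, test_notes) for each piece in turn.

    pieces: dict mapping a piece name to a list of (features, label) notes.
    """
    for test_name in pieces:
        train = [
            note
            for name, notes in pieces.items()
            if name != test_name
            for note in notes
        ]
        yield test_name, train, pieces[test_name]


def majority_baseline_accuracy(train, test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)


# Hypothetical data set with empty feature dicts.
pieces = {
    "piece_a": [({}, "n"), ({}, "n"), ({}, "o")],
    "piece_b": [({}, "o"), ({}, "n")],
    "piece_c": [({}, "n")],
}
baseline = {
    name: majority_baseline_accuracy(train, test)
    for name, train, test in leave_one_piece_out(pieces)
}
```

Splitting by piece rather than by note keeps all notes of the test piece out of training, so the evaluation reflects how a model behaves on a piece it has never seen.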

Algorithm comparison
In this study we compared four classification algorithms for ornament prediction and four regression algorithms for duration ratio, onset deviation, and energy ratio. We used the implementation of the machine learning algorithms provided by the WEKA library (Hall et al. 2009). We applied k-NN with k = 1, SVM with linear kernel, ANN consisting of a fully-connected multi-layer neural network with one hidden layer, and DT/RT with post pruning.
A paired T-test with a significance level of 0.05 was performed for each algorithm for the ornamentation classification task, over the whole data set, using 10 runs of a 10-fold cross validation scheme. Experimental results are presented in Table 7 and are discussed in Section 6.

Feature selection
Both filter and wrapper feature selection methods were applied. Filter methods use a proxy measure (e.g. information gain) to score features, whereas wrappers make use of predictive models to score feature subsets. Features were filtered and ranked by information gain values, and a wrapper with greedy search and decision tree accuracy evaluation was used to select optimal feature subsets. We used the implementation of these methods provided by the WEKA library (Hall et al. 2009). Selected features are shown in Table 6, and are discussed in Section 6.
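To make the filter criterion concrete, the following sketch computes the information gain of a discrete feature with respect to the class label. It is a textbook formulation written for illustration, not the WEKA implementation used in this study; the feature values and labels are hypothetical, and continuous descriptors would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical notes: "o" = ornamented, "n" = non-ornamented.
labels = ["o", "o", "n", "n"]
phrase = ["i", "e", "m", "m"]          # separates the classes perfectly
mode = ["maj", "min", "maj", "min"]    # carries no class information
gain_phrase = information_gain(phrase, labels)
gain_mode = information_gain(mode, labels)
```

Ranking features by this value and keeping the top entries gives a filter selection, whereas a wrapper would instead retrain a classifier on each candidate subset.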
Learning curves on the number of features, as well as on the number of instances, were obtained to measure the learning rate of each of the algorithms. The selection of the model was based on the evaluation obtained with these performance measures.

Synthesis
Predicted pieces were created in both MIDI and audio formats. A concatenative synthesis approach was used to generate the audio pieces. This process consists of linking note audio samples from real performances to render a synthesis of a musical piece. The use of this approach was possible because we had monophonic performance audio data from which onset and offset information was extracted based on energy and pitch information as described in Section 3.2.1. Therefore, it was possible to segment the audio signal into individual notes, and furthermore to obtain complete audio segments of ornaments.
Similarly to the evaluation of the different machine learning algorithms, for synthesis we followed a leave-one-piece-out scheme in which, on each fold, the notes of one piece were used as test set, whereas the notes of the remaining 26 songs were used as training set.
Figure 5. Concatenative synthesis approach.

Note concatenation
The note concatenation process is divided into three different stages as depicted in Figure 5.
• Sample retrieval: For each note predicted to be ornamented, the k-NN algorithm, using a Euclidean distance similarity function based on note description, was applied to find the most suitable ornamentation in the database (see Section 3.5). This was done by searching for the most suitable ornament in the songs in the training set (Section 4.2).
• Sample transformation: For each note classified as ornamented, transformations in duration, onset, energy, and pitch (in the case of ornaments) were performed based on the deviations stored in the database, as seen in Figures 6(a) and 6(b). For audio sample transformation we used the time and pitch scaling approaches of Serra (1997). Notes classified as not ornamented were simply transformed as predicted by the duration, onset, and energy models.
• Sample concatenation: Retrieved samples were concatenated based on final onset and duration information after transformation. The tempo of the score being predicted (in BPM) was imposed on all the retrieved notes.
Figure 6. Sample retrieval and concatenation example.
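The sample retrieval stage can be sketched as a nearest neighbor search over note descriptors. The descriptor layout, the database entries, and the choice of a single neighbor (k = 1) are assumptions made for illustration; in practice the descriptors would need comparable scales for the Euclidean distance to be meaningful.

```python
import math


def retrieve_ornament(query, database):
    """Return the database entry nearest to the query descriptors.

    query: tuple of numeric descriptors of the note to be ornamented.
    database: list of dicts with a 'descriptors' tuple of the same length
    and an 'ornament' payload (e.g. the stored deviations of its notes).
    """
    return min(database, key=lambda entry: math.dist(query, entry["descriptors"]))


# Hypothetical descriptors: (duration in beats, previous interval, next interval).
database = [
    {"descriptors": (1.0, 2.0, -2.0), "ornament": "ornament_a"},
    {"descriptors": (4.0, -5.0, 7.0), "ornament": "ornament_b"},
]
match = retrieve_ornament((1.5, 1.0, -1.0), database)
```

Restricting the database to ornaments from the training pieces of the current fold keeps the retrieval consistent with the leave-one-piece-out scheme.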

Results

Feature selection
The most relevant features found using the two selection methods described in Section 4.2 are shown in Table 6. The average percentages of correctly classified instances (C.C.I.%) obtained using the features selected by the information gain filtering and the greedy search (decision trees) wrapper methods were 78.12% and 78.60%, respectively (F-measures of 0.857 and 0.866, respectively). Given that both measures are similar, i.e. not significantly different, the smallest subset was chosen. In Figure 7, the accuracy obtained when increasing the number of features based on the information gain ranking (explained in Section 4.2) is presented for each of the algorithms used (SVM, ANN, DT). From the curves it can be seen that the subset with the first three features contains sufficient information, as additional features do not add significant accuracy improvement. SVM exhibits better accuracy on the cross validation scheme, and less over-fitting based on the difference between cross validation (CV) and training set (TS) accuracy curves.
Figure 7. Accuracies on increasing the number of attributes.

Algorithm comparison results
For the (ornament) classification problem we compared each of the algorithms (SVM, DT, ANN, and k-NN) with the baseline classifier (i.e. the majority class classifier) following the procedure explained in Section 4.1. From Table 7 it can be seen that all the algorithms present a statistically significant improvement, except k-NN. Given the accuracy results, we apply the ornamentation prediction model induced by the DT algorithm to determine whether a note is to be ornamented or not. We discarded the use of k-NN for this task due to its low accuracy, which led to a larger number of misclassifications of ornamented and non-ornamented notes.
For the regression problems (duration, onset, and energy prediction) we applied regression trees, SVM, neural networks, and k-NN, and obtained the correlation coefficient values shown in Table 8. Onset deviation has the highest correlation coefficient, close to 0.5. For ornamentation classification using k-NN we explored several values for k (1 ≤ k ≤ 10). However, all of the explored values for k resulted in inferior classification accuracies when compared with decision trees and SVM. As in the case of k = 1, both the decision trees and SVM classifiers resulted in statistically significantly higher accuracies (based on the T-test) when compared with the classifiers for 2 ≤ k ≤ 10.

Learning curves
The learning curves of accuracy improvement, for both cross validation and training sets, over the number of instances are shown in Figure 8. The learning curves were used to measure the learning rate and estimate the level of overfitting. Data subsets of different sizes (in steps of 100 randomly selected instances) were considered and evaluated using 10-fold cross validation. In general, for the three models, it can be seen that the accuracy on CV shows no significant improvement above 600 instances.
Overfitting can be correlated with the difference between the accuracies of CV and TS, wherein a high difference means higher levels of overfitting. In this sense, in Figure 8(c), SVM shows a high tendency for overfitting, but seems to improve slowly over the number of instances. On the other hand, in Figures 8(a) and 8(b), ANN and DT seem to reduce overfitting between 700 and 1100 instances. This could mean that adding more instances may slightly improve the accuracy of both CV and TS for the three models, and may slightly reduce overfitting for SVM, but this may not be the case for ANN and DT.
Figure 9 shows a MIDI piano roll of an example piece performed by a professional musician and the predicted performance obtained by the system, using a decision tree classifier. It can be seen that the predicted piano roll follows a melodic structure similar to the one performed by the musician. For instance, for the score notes predicted correctly as ornamented (true positives), notes 1, 10, and 34 in Figure 9(a) (top sequence), the system finds ornaments of similar duration, offset, and number of notes to those in the musician's performance. Also, score notes 3 and 9 of Figure 9(b) (false positives) are ornamented similarly to score notes 18 and 26 (Figure 9(a)), which are in a similar melodic context.
Figure 9. Musician versus predicted performance. Top and bottom sequences represent score and performance piano roll, respectively. Vertical lines indicate score to performance note correspondence. Gray and white boxes represent notes predicted as not ornamented and ornamented, respectively.

Duration and energy ratio curves
Duration and energy deviation ratios measured in the musician's performance and predicted by the system for one example piece (All of me) are compared in Figures 10(a) and 10(b), respectively. We obtained similar results for the other pieces in the data set. Similarity between the contours of the curves indicates that the deviations predicted by the system are coherent with the ones performed by the musician.
Figure 10. Performed versus predicted duration and energy ratio example for All of me. Gray and black lines represent performed and predicted ratios, respectively, for each note in the score.

Musical samples
Musical examples of the automatically generated ornamented pieces can be found in the Online Supplement (see the unnumbered section directly before the references list at the end of this article). The rendered audio of the Yesterdays music piece generated by the system (as test piece) has been included in this site.

Conclusions
In this article we have presented a machine learning approach for expressive performance (ornament, duration, onset, and energy) prediction and synthesis in jazz guitar music. We used a data set of 27 recordings performed by a professional jazz guitarist, and extracted a set of descriptors from the music scores and a symbolic representation from the audio recordings. In order to map performed notes to parent score notes we have automatically aligned performance to score data. Based on this alignment we obtained performance actions, calculated as deviations of the performance from the score. We created an ornaments database including the information on the ornamented notes performed by the musician. We have compared four learning algorithms to create models for ornamentation, based on performance measures, using a paired T-test for significance. Feature selection techniques were employed to select the best feature subset for ornament modeling. For synthesis purposes, instance based learning was used to retrieve the most suitable ornament from the ornamentation database. A concatenative synthesis approach was used to automatically generate expressive performances of new pieces (i.e. pieces not in the training set). A subjective perceptual evaluation based on listening tests is beyond the scope of this article. As future work, we plan to evaluate the performances generated by the system by computing the alignment distance between the system and the target performance.