Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics

This paper presents an effective technique for automatically clustering undocumented music recordings based on their associated singer. This serves as an indispensable step towards indexing and content-based information retrieval of music by singer. The proposed clustering system operates in an unsupervised manner, in which no prior information is available regarding the characteristics of singer voices, nor the population of singers. Methods are presented to separate vocal from non-vocal regions, to isolate the singers’ vocal characteristics from the background music, to compare the similarity between singers’ voices, and to determine the total number of unique singers from a collection of songs. Experimental evaluations conducted on a 200-track pop music database confirm the validity of the proposed system.

Cover versions, for example, are quite common in popular music. In music collections labeled only by song name, singer-based clustering can be used to distinguish the original and covered versions of a song and even to determine the artist performing the cover. Cameos, or guest appearances in a song, are substantially less common than covers, but often occur in recordings of live concerts. Music collections that are labeled with song name and artist name may still fail to include the names of cameo appearances. Again, singer-based clustering solves this problem. Lastly, the careers of many artists involve collaboration with several different bands, each with a different name. Music collections that are labeled only by band name, as opposed to singer name, must be cross-referenced with band-membership data to determine singer information. Even so, because the relationship between band and singer may be many-to-many, this cross-referencing is insufficient to determine the singer of a given song, and again singer-based clustering may be applied.
Before clustering music recordings by singer, we must detect and exploit the underlying characteristics of the singers' voices. This task resembles the recently emerging research on clustering or segmentation of spoken data based on their associated speakers (Kimber et al. 1995; Jin et al. 1997). However, because the lion's share of popular music contains background accompaniment during most or all vocal passages, it is unfeasible to acquire voice-only data directly for drawing out the desired singer's vocal characteristics, as speaker-based clustering or segmentation generally does. In our earlier work (Tsai and Wang, in press), we proposed a statistical method that leverages an approximate estimation of a piece's musical background to build a reliable model for the solo voice. The method has been shown effective for the problem of automatic singer recognition, in which a set of singers' reference models is created off-line using pre-collected music data labeled with singer identity, and unknown music recordings are then tested on the basis of stochastic matching against the singers' reference models. In contrast to such a supervised singer-recognition problem, this study further extends our statistical modeling of singer voice characteristics to operate in an unsupervised manner, which assumes no prior information is available regarding the singers involved or the population of singers. Special efforts are also made to compare the similarity among singers' voices and to determine the total number of unique singers in a collection of popular music recordings.

Figure 1. Overview of the proposed system: vocal/non-vocal segmentation, singer characteristic extraction, and inter-recording similarity computation and clustering of the M music recordings.

Problem Formulation
Given a set of M unlabeled music recordings, each performed by one of P singers, where M ≥ P and P is unknown, the aim of singer-based clustering is to produce a partitioning of the M recordings into K clusters such that K = P and each cluster consists exclusively of recordings from only one singer. For recordings that contain multiple singers or background vocals, the formulation remains applicable if the recordings are pre-segmented into singer-homogeneous regions. However, during the initial development stage, we limit ourselves to single-singer recordings.
Because the goal of this study is to cluster the recordings on the basis of the singer's voice rather than the background music, musical genre, or other characteristics of the recording, it is necessary to extract, model, and compare the characteristic features of the singers' voices without interference from non-singer features. Thus, a three-stage process, as shown in Figure 1, is proposed.

Performance of the singer-based clustering is evaluated on the basis of cluster purity (Solomonoff et al. 1998), defined as

q_k = \sum_{p=1}^{P} \left( \frac{n_{kp}}{n_k} \right)^2,   (1)

where q_k is the purity of cluster k, n_k is the total number of recordings in cluster k, and n_{kp} is the number of recordings in cluster k that were performed by singer p. From Equation 1, it follows that n_k^{-1} ≤ q_k ≤ 1, in which the upper and lower bounds reflect, respectively, that all within-cluster recordings were performed by the same singer or by completely different singers. To evaluate the overall performance of a K-clustering, we also compute an average purity:

\bar{q} = \frac{1}{M} \sum_{k=1}^{K} n_k \, q_k.   (2)

Vocal/Non-Vocal Segmentation

As a first step in determining the vocal characteristics of a singer, music segments that contain vocals are located and marked as such. This task can be formulated as a problem of distinguishing between vocal segments and accompaniments, analogous to the study by Berenzweig and Ellis (2001). However, in contrast to their work, which uses a speech recognizer for detecting singing voices, we propose constructing a statistical classifier with parametric models trained using accompanied singing voices rather than normal speech. As shown in Figure 2, the classifier consists of a front-end signal processor that converts digital waveforms of a music recording into spectral feature vectors, followed by a back-end statistical processor that performs modeling, matching, and decision-making. It operates in two phases: training and testing.
During training, a music database with manual vocal/non-vocal transcriptions is used to form two separate Gaussian mixture models (GMMs): a vocal GMM and a non-vocal GMM. The use of GMMs is motivated by the desire to model various broad acoustic classes by a combination of Gaussian components. These broad acoustic classes reflect some general vocal-tract and instrumental configurations. It has been shown that GMMs have a strong ability to provide smooth approximations to arbitrarily shaped densities of spectra over a long time span (Reynolds and Rose 1995). We denote the vocal GMM as λ_V and the non-vocal GMM as λ_N. Parameters of the GMMs are initialized via k-means clustering and iteratively adjusted via expectation-maximization (EM; see Dempster et al. 1977).
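To make the training step concrete, the following Python sketch (an illustration only, not the authors' implementation) fits vocal and non-vocal GMMs on synthetic stand-ins for labeled MFCC frames. It uses scikit-learn's GaussianMixture, whose default k-means initialization followed by EM matches the procedure described above; the data, feature dimensionality, and mixture counts are placeholders scaled down from the paper's 64/80-mixture configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for MFCC frames with manual vocal/non-vocal labels
# (hypothetical data; the paper trains on the DB-2 transcriptions).
vocal_feats = rng.normal(loc=1.0, scale=1.0, size=(500, 4))
nonvocal_feats = rng.normal(loc=-1.0, scale=1.0, size=(500, 4))

# k-means initialization followed by EM, as described in the text.
gmm_vocal = GaussianMixture(n_components=4, init_params="kmeans",
                            random_state=0).fit(vocal_feats)
gmm_nonvocal = GaussianMixture(n_components=4, init_params="kmeans",
                               random_state=0).fit(nonvocal_feats)

# Frame log-likelihoods log p(x_t | lambda_V) and log p(x_t | lambda_N)
test_frame = np.ones((1, 4))
ll_v = gmm_vocal.score_samples(test_frame)[0]
ll_n = gmm_nonvocal.score_samples(test_frame)[0]
```

A frame drawn near the vocal distribution scores a higher log-likelihood under the vocal GMM, which is exactly the quantity the decision rule below accumulates.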
In the testing phase, the recognizer takes as input the T_x feature vectors X = {x_1, x_2, ..., x_{T_x}} extracted from an unknown recording, and produces as output the frame log-likelihoods log p(x_t | λ_V) and log p(x_t | λ_N), 1 ≤ t ≤ T_x, for the vocal and the non-vocal GMM, respectively. The attribute of each frame is then hypothesized according to a decision rule applied to the frame log-likelihoods. Recognizing that singing tends to be continuous, classification is preferably made in a segment-by-segment manner rather than a frame-by-frame manner. An intuitive approach is to hypothesize the attribute of a fixed-length segment by taking an accumulation of its frame log-likelihoods into account. In particular, accumulating the frame log-likelihoods over a longer period is more statistically reliable for decision-making. However, long segments run the risk of crossing multiple vocal/non-vocal boundaries.
To better handle this problem, the decision rule is designed on the basis of homogeneous segments that are located in the following way. First, we apply k-means clustering to all the feature vectors of a music recording. Each frame is then assigned the cluster index associated with that frame's feature vector, and hence a recording is tokenized as a cluster-index stream. Next, the cluster-index stream is divided into a sequence of consecutive, non-overlapping, fixed-length short segments. Each short segment is then assigned the majority index of its constituent frames, and adjacent segments are merged into a homogeneous segment if they have the same index. Finally, classification is made per homogeneous segment: the kth homogeneous segment is hypothesized as vocal if

\frac{1}{W_k} \sum_{t=s_k}^{s_k + W_k - 1} \left[ \log p(x_t | \lambda_V) - \log p(x_t | \lambda_N) \right] > g,   (3)

and as non-vocal otherwise, where W_k and s_k represent, respectively, the length and starting frame of the kth homogeneous segment, and g is the decision threshold.
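The tokenization-and-merging steps above can be sketched as follows, assuming the cluster-index stream and the per-frame log-likelihoods are already available (all names, the segment length, and the threshold value are illustrative):

```python
import numpy as np

def homogeneous_segments(tokens, seg_len):
    """Split a cluster-index stream into fixed-length short segments,
    assign each the majority index, and merge equal-index neighbors."""
    labels = []
    for start in range(0, len(tokens), seg_len):
        chunk = tokens[start:start + seg_len]
        vals, counts = np.unique(chunk, return_counts=True)
        labels.append(int(vals[np.argmax(counts)]))
    # Merge adjacent short segments carrying the same majority index.
    bounds = [0]
    for k in range(1, len(labels)):
        if labels[k] != labels[k - 1]:
            bounds.append(k * seg_len)
    bounds.append(len(tokens))
    return list(zip(bounds[:-1], bounds[1:]))

def classify_segments(ll_vocal, ll_nonvocal, segments, g=0.0):
    """Decide vocal/non-vocal per homogeneous segment by comparing the
    accumulated log-likelihood difference against threshold g."""
    out = []
    for s, e in segments:
        score = np.sum(ll_vocal[s:e] - ll_nonvocal[s:e])
        out.append("vocal" if score > g else "nonvocal")
    return out
```

Because decisions are made over merged homogeneous regions rather than fixed windows, a segment is unlikely to straddle a vocal/non-vocal boundary.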

Singer Characteristic Extraction
After locating the vocal portions, the next important step is to extract the singer's vocal characteristics from these portions. In most popular music, substantial similarities exist between the non-vocal regions and the accompaniment of the vocal regions; it is therefore reasonable to assume that the stochastic characteristics of the background music can be approximated by those of the non-vocal regions. This assumption forms the basis of the following formulation. Let V = {v_1, v_2, ..., v_T} represent the feature vectors computed from the vocal regions of a music recording. Owing to the presence of accompaniment, V can be considered a mixture of a solo voice S = {s_1, s_2, ..., s_T} and background music B = {b_1, b_2, ..., b_T}. More specifically, S and B are additive in the time domain and in the linear spectrum domain, but neither of them is observable. Our goal is to construct a stochastic model for the solo voice signal S such that the underlying singer's voice characteristics can be parametrically represented. Adapting techniques developed in robust speech recognition (Nadas et al. 1989; Rose et al. 1994), we assume that the solo voice signal and the background music are drawn randomly and independently according to the GMMs

p(s_t | \lambda_s) = \sum_i w_{s,i} \, \mathcal{N}(s_t; \mu_{s,i}, \Sigma_{s,i}), \qquad p(b_t | \lambda_b) = \sum_j w_{b,j} \, \mathcal{N}(b_t; \mu_{b,j}, \Sigma_{b,j}),

where w_{s,i} and w_{b,j} are mixture weights, μ_{s,i} and μ_{b,j} are mean vectors, and Σ_{s,i} and Σ_{b,j} are covariance matrices. If the accompanied signal is formed from a generative function v_t = f(s_t, b_t), 1 ≤ t ≤ T, the probability of observing V, given λ_s and λ_b, can be represented by

p(V | \lambda_s, \lambda_b) = \prod_{t=1}^{T} \sum_i \sum_j w_{s,i} \, w_{b,j} \, p(v_t | \mu_{s,i}, \Sigma_{s,i}, \mu_{b,j}, \Sigma_{b,j}),   (4)

where p(v_t | μ_{s,i}, Σ_{s,i}, μ_{b,j}, Σ_{b,j}) accounts for the possible combinations of the solo voice and background music that can form an instant of accompanied voice v_t.
Because S and B are statistically independent, the probability p(v_t | μ_{s,i}, Σ_{s,i}, μ_{b,j}, Σ_{b,j}) can be computed by

p(v_t | \mu_{s,i}, \Sigma_{s,i}, \mu_{b,j}, \Sigma_{b,j}) = \iint_{f(s,b) = v_t} \mathcal{N}(s; \mu_{s,i}, \Sigma_{s,i}) \, \mathcal{N}(b; \mu_{b,j}, \Sigma_{b,j}) \, ds \, db,   (5)

where N(·) denotes a Gaussian density function. As mentioned earlier, although the background music B is unobservable, its stochastic characteristics can be approximated from the non-vocal regions. Therefore, the background music model λ_b can be created directly using the feature vectors computed from the non-vocal regions. Then, with the available background music model λ_b and the observable accompanied voice V, it is sufficient to derive the solo voice model λ_s via maximum likelihood estimation:

\hat{\lambda}_s = \arg\max_{\lambda_s} \, p(V | \lambda_s, \lambda_b).   (6)

Letting ∇Q(λ_s, λ̂_s) = 0 with respect to each parameter to be re-estimated, we obtain the update formulas

\hat{w}_{s,i} = \frac{1}{T} \sum_{t=1}^{T} \sum_j p(i, j | v_t, \lambda_s, \lambda_b),   (10)

\hat{\mu}_{s,i} = \frac{\sum_{t=1}^{T} \sum_j p(i, j | v_t, \lambda_s, \lambda_b) \, E\{s_t | v_t, \mu_{s,i}, \Sigma_{s,i}, \mu_{b,j}, \Sigma_{b,j}\}}{\sum_{t=1}^{T} \sum_j p(i, j | v_t, \lambda_s, \lambda_b)},   (11)

\hat{\Sigma}_{s,i} = \frac{\sum_{t=1}^{T} \sum_j p(i, j | v_t, \lambda_s, \lambda_b) \, E\{s_t s_t' | v_t, \mu_{s,i}, \Sigma_{s,i}, \mu_{b,j}, \Sigma_{b,j}\}}{\sum_{t=1}^{T} \sum_j p(i, j | v_t, \lambda_s, \lambda_b)} - \hat{\mu}_{s,i} \hat{\mu}_{s,i}',   (12)

where the prime operator (′) denotes vector transpose and E{·} denotes expectation. The details of Equations 10-12 required for implementation are given in the Appendix. Figure 3 summarizes the procedure for building a solo voice model. Note that if the number of mixtures in the background music GMM is zero, the solo voice modeling degenerates to directly modeling the observed vocal signal without taking the background music into account. This serves as a baseline against which to examine the effectiveness of our solo voice modeling method.

Similarity Computation and Clustering
Once the singer's vocal characteristics are modeled, the similarity between music recordings can be measured, and recordings that resemble each other can be grouped into a cluster. Figure 4 shows the procedure for similarity computation and clustering, using a method extended from Tsai et al. (2001).
To begin, a solo voice model λ_{s,i} and a background music model λ_{b,i} are generated for each of the M recordings to be clustered, 1 ≤ i ≤ M. The log-likelihood L_{i,j} = log p(V_i | λ_{s,j}, λ_{b,i}), 1 ≤ i, j ≤ M, that the vocal portion of recording V_i tests against the model λ_{s,j}, is then computed using Equation 4. Here, a large log-likelihood L_{i,j} indicates that the singer of recording i is similar to the singer of recording j. Accordingly, singer-based clustering may be formulated as a conventional vector-clustering problem by assigning a log-likelihood vector L_i = [L_{i,1}, L_{i,2}, ..., L_{i,M}]′, 1 ≤ i ≤ M, to each recording i, and computing the similarity between two recordings using the Euclidean distance ||L_i − L_j||. Because the dynamic range of the log-likelihood varies from recording to recording, the log-likelihood vectors must be rescaled for the similarity to be a comparable measurement. To perform rescaling, the L_{i,j} for each recording i are ranked in descending order of their values, and the rank of L_{i,j} is denoted by R_{i,j}. Then, a likelihood-ratio vector \hat{L}_i = [\hat{L}_{i,1}, \hat{L}_{i,2}, ..., \hat{L}_{i,M}]′ is formed as

\hat{L}_{i,j} = \begin{cases} \exp[-\alpha \, (R_{i,j} - 1)], & R_{i,j} \le h, \\ 0, & \text{otherwise}, \end{cases}

where α is a positive constant for scaling, and h is an integer constant for pruning the lower log-likelihoods. The likelihood-ratio vector above emphasizes the larger log-likelihoods, suppresses the smaller ones, and warps the vector components to lie between 0 and 1.
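One plausible realization of the rescaling step is sketched below. The exact warping function is not fully specified in the text, so the rank-based form used here (exponential decay in rank, with pruning below rank h) is an assumption that satisfies the stated properties: the largest log-likelihood maps to 1, smaller ones are suppressed, and ranks beyond h are zeroed.

```python
import numpy as np

def likelihood_ratio_vector(L_row, alpha=1.0, h=5):
    """Warp one recording's log-likelihood vector L_i into a
    likelihood-ratio vector: rank the entries in descending order,
    emphasize the top-ranked ones, and prune everything below rank h.
    (Illustrative form; the original warping function may differ.)"""
    order = np.argsort(-L_row)             # indices sorted by descending value
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(L_row) + 1)
    out = np.exp(-alpha * (ranks - 1.0))   # rank 1 -> 1.0, decays with rank
    out[ranks > h] = 0.0                   # prune the lower log-likelihoods
    return out
```

The warped vectors are comparable across recordings, so plain Euclidean distances between them become meaningful inputs to the vector-clustering stage.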
Example likelihood-ratio vectors computed for a collection of 25 music recordings by five female singers are shown as columns in Figure 5. Clustering of the likelihood-ratio vectors is performed using the k-means algorithm, which starts with a single cluster and recursively divides clusters in attempts to minimize the within-cluster variances. In applying the k-means algorithm, we must choose how many clusters to create. If the number of clusters is too low, a single cluster is likely to include recordings from multiple singers. On the other hand, if the number of clusters is too high, a single singer's recordings will be divided across multiple clusters. Clearly, the optimal number of clusters K is equal to the number of singers P, which is unknown. In this study, the Bayesian Information Criterion (BIC; see Schwarz 1978) is employed to decide the best value of K. The BIC is a model-assessment criterion that assigns a value to a stochastic model based on how well the model fits a data set and how simple the model is:

\mathrm{BIC} = \log p(D | \Theta) - c \, \frac{\ell}{2} \log |D|,

where ℓ is the number of free parameters in the model, |D| is the size of the data set D, and c is a penalty factor. The larger the BIC value, the better the model performs. If we view a split into K clusters as a stochastic model in which each of the K clusters is represented by a Gaussian distribution with a full covariance matrix, the BIC for this K-clustering may be computed as

\mathrm{BIC}(K) = -\frac{1}{2} \sum_{k=1}^{K} n_k \log |\Sigma_k| - c \, \frac{\ell}{2} \log M,

where n_k is the number of characteristic vectors in cluster k, and Σ_k is the covariance matrix of the n_k likelihood-ratio vectors in cluster k. The BIC value should increase as the splitting improves the conformity of the model, but it should decline significantly after an excess of clusters is created. Thus, a reasonable number of clusters can be determined by

K^* = \arg\max_{1 \le K \le M} \mathrm{BIC}(K).
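The BIC-based choice of K can be sketched as follows, assuming the likelihood-ratio vectors are the rows of a matrix X and using scikit-learn's KMeans for the partitioning. The small ridge added to each covariance is an implementation detail introduced here to keep tiny clusters from producing singular matrices.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_bic(X, K, c=1.0):
    """BIC of a K-clustering in which each cluster is modeled as a
    Gaussian with full covariance: the data term is
    -0.5 * sum_k n_k * log|Sigma_k|, minus a complexity penalty."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    d = X.shape[1]
    data_term = 0.0
    for k in range(K):
        Xk = X[labels == k]
        # Small ridge keeps near-singular covariances of tiny clusters finite.
        cov = np.cov(Xk.T) + 1e-6 * np.eye(d)
        data_term += -0.5 * len(Xk) * np.log(np.linalg.det(cov))
    n_free = K * (d + d * (d + 1) / 2)   # means plus full covariances
    return data_term - c * 0.5 * n_free * np.log(len(X))

# The number of clusters is then chosen as the K maximizing clustering_bic.
```

On well-separated data, splitting a mixed cluster sharply shrinks the covariance determinants and raises the BIC, while further splits mostly pay the complexity penalty.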

Experimental Results
The music data used in this study consisted of 416 tracks from Mandarin pop music CDs. Most of the tracks were lyrical ballads, but around ten percent were folk-like songs accompanied by a disco beat, and another ten percent were essentially stylistic imitations of Western pop blended with hip-hop and rock. The average length of the tracks was about three minutes. All the tracks were manually labeled with the singer's identity and the vocal/non-vocal boundaries as the ground truth. The database was divided into two subsets, denoted DB-1 and DB-2. DB-1 comprised 200 tracks performed by 10 female and 10 male singers, with 10 distinct songs per singer. DB-2 contained the remaining 216 tracks, involving 13 female and 8 male singers, none of whom appeared in DB-1. All music data were down-sampled from the CD sampling rate of 44.1 kHz to 22.05 kHz to exclude the high-frequency components beyond the range of normal singing voices. Feature vectors, each consisting of 20 Mel-scale Frequency Cepstral Coefficients (MFCCs), were computed using 32-msec Hamming-windowed frames with 10-msec frame shifts.
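For illustration, the framing arithmetic above (32-msec Hamming windows with 10-msec shifts on 22.05-kHz audio) can be sketched in Python; the MFCC computation itself is omitted, and the function name is illustrative:

```python
import numpy as np

SR = 22050                      # down-sampled rate used in the paper
FRAME_LEN = int(0.032 * SR)     # 32-msec window -> 705 samples
FRAME_SHIFT = int(0.010 * SR)   # 10-msec shift  -> 220 samples

def frame_signal(y):
    """Slice a waveform into overlapping Hamming-windowed frames;
    the 20 MFCCs per frame (not shown) would be computed from these."""
    window = np.hamming(FRAME_LEN)
    n_frames = 1 + (len(y) - FRAME_LEN) // FRAME_SHIFT
    return np.stack([y[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * window
                     for i in range(n_frames)])

# A 3-second signal yields roughly 300 frames at a 10-msec shift.
```

At this rate a three-minute track produces on the order of 18,000 feature vectors, which is the frame stream the segmentation classifier operates on.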
The vocal and non-vocal GMMs were trained using DB-2, whereas performance of the proposed methods was evaluated using DB-1. Our first experiments tested the validity of the vocal/non-vocal segmentation methods. Accuracy was computed by comparing the hypothesized attribute of each frame with the manual label:

\text{Accuracy (in \%)} = \frac{\text{number of correctly identified frames}}{\text{number of total frames}} \times 100\%.   (18)

However, in view of the limited precision with which the human ear detects vocal/non-vocal changes, all frames that occurred within 0.5 sec of a perceived switch point were ignored in the computation. Table 1 summarizes the results of vocal/non-vocal segmentation using the empirically most accurate configuration of a 64-mixture vocal GMM and an 80-mixture non-vocal GMM. Here, experiments were conducted with different lengths of short segments and different numbers of clusters for frame-feature tokenization. The last row of the table shows the results yielded by assigning a single classification per short segment of N frames, where N = 1, 20, 40, 60, or 80, without merging adjacent segments into a homogeneous one. We can see the benefit of the homogeneous-segment-based classification. The best accuracy achieved was 79.8 percent, using 60-frame segments along with 16 clusters for frame-feature tokenization.
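A sketch of the accuracy computation of Equation 18, including the 0.5-sec collar around ground-truth switch points (at a 10-msec frame shift, 0.5 sec corresponds to 50 frames; the collar width is a parameter here):

```python
import numpy as np

def frame_accuracy(hyp, ref, collar):
    """Frame accuracy per Equation 18, ignoring frames within `collar`
    frames of a ground-truth vocal/non-vocal switch point (the paper
    ignores frames within 0.5 sec, i.e. 50 frames at a 10-msec shift)."""
    hyp, ref = np.asarray(hyp), np.asarray(ref)
    switches = np.flatnonzero(np.diff(ref)) + 1   # first frame after a switch
    keep = np.ones(len(ref), dtype=bool)
    for b in switches:
        keep[max(0, b - collar): b + collar] = False
    return 100.0 * float(np.mean(hyp[keep] == ref[keep]))
```

Excluding the collar frames prevents the score from penalizing boundary placements that a human listener could not reliably distinguish either.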
Table 2 shows the confusion probability matrix for the best vocal/non-vocal classification results. The rows of the confusion matrix correspond to the ground truth of the segments, and the columns indicate the hypotheses. We can see that the majority of errors are misidentifications of vocal segments. Qualitatively, we found that many falsely identified vocal segments had unusually loud background accompaniment or unusually quiet vocals. However, because such segments have a high accompaniment-to-vocal ratio, we believe that these false judgments may actually benefit the singer clustering.
Next, the entire singer-clustering system, based on both manual vocal/non-vocal segmentation and the best results of automatic segmentation, was examined on DB-1. Figure 6 shows the average purity as a function of the number of clusters. Here, the results yielded by the manual segmentation may be viewed as a performance upper bound for automatic segmentation. We can see that, as expected, the average purity increases sharply as the number of clusters grows in the beginning, and then tends to saturate after too many clusters are created. Comparing the results with and without explicit use of the background model in extracting the solo voice information, it is clear that solo voice modeling is superior to directly modeling the observed vocal signal. When the number of clusters equals the singer population, i.e., K = P = 20, the highest purities of 0.87 and 0.77 were yielded by manual segmentation and automatic segmentation, respectively. Further analysis found that separating singers of different genders is much easier with solo voice modeling: when only two clusters were generated, more than 95 percent of the within-cluster tracks were associated with the same gender.
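For reference, the purity values reported above can be computed directly from cluster membership lists using Equations 1 and 2; the following is an illustrative sketch, not the authors' code:

```python
import numpy as np

def cluster_purity(singer_ids):
    """Purity of one cluster (Equation 1): sum_p (n_kp / n_k)^2."""
    ids = np.asarray(singer_ids)
    _, counts = np.unique(ids, return_counts=True)
    return float(np.sum((counts / len(ids)) ** 2))

def average_purity(clusters):
    """Average purity (Equation 2): (1/M) * sum_k n_k * q_k."""
    M = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_purity(c) for c in clusters) / M
```

A cluster whose recordings all share one singer scores 1, while a cluster of n_k recordings by n_k different singers scores 1/n_k, matching the bounds stated earlier.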
Lastly, the problem of automatically determining the singer population was investigated. A series of clustering experiments was conducted using the following configurations of music data drawn from DB-1: 50 music recordings (5 randomly chosen singers × 10 tracks), 100 music recordings (10 randomly chosen singers × 10 tracks), 150 music recordings (15 randomly chosen singers × 10 tracks), and 200 music recordings (20 singers × 10 tracks).

Conclusions
This study examined the feasibility of unsupervised clustering of music data based on the associated singer. It has been shown that the characteristics of a singer's voice can be extracted from music via vocal segment detection followed by solo voice signal modeling. Singer-based clustering was formulated and solved using a vector-clustering framework with reliable estimation of the correct number of clusters.
Although fairly good results have been reported in this article, more work is needed to validate the proposed methods on a wider variety of music data, such as larger singer populations and songs in a richer variety of musical styles. Furthermore, future work on singer-based clustering will extend the current system to handle duets, choruses, background vocals, and other music data with multiple simultaneous or non-simultaneous singers.

Appendix

Under the max approximation carried into the domain of the MFCCs, the accompanied voice can be approximately expressed by

v_t \approx \max(s_t, b_t),

where the maximum is taken component-wise. Because MFCCs are fairly decorrelated, it is reasonable to assume that the covariance matrices of the GMMs are diagonal, for implementation efficiency. Thus, the probability p(v_t | μ_{s,i}, Σ_{s,i}, μ_{b,j}, Σ_{b,j}) can be computed using

p(v_t | \mu_{s,i}, \Sigma_{s,i}, \mu_{b,j}, \Sigma_{b,j}) = \prod_{d=1}^{D} p(v_{t,d} | \mu_{s,i,d}, \sigma^2_{s,i,d}, \mu_{b,j,d}, \sigma^2_{b,j,d}),

where D is the dimension of the feature vector, and v_{t,d}, μ_{s,i,d}, σ²_{s,i,d}, μ_{b,j,d}, and σ²_{b,j,d} are, respectively, the dth components of v_t, μ_{s,i}, Σ_{s,i}, μ_{b,j}, and Σ_{b,j}. For ease of discussion, we drop the component index d and focus on scalar operations. Consider an arbitrary component v_t of v_t; the probability p(v_t | μ_{s,i}, σ²_{s,i}, μ_{b,j}, σ²_{b,j}) is computed as

p(v_t | \mu_{s,i}, \sigma^2_{s,i}, \mu_{b,j}, \sigma^2_{b,j}) = \mathcal{N}(v_t; \mu_{s,i}, \sigma^2_{s,i}) \, U\!\left(\frac{v_t - \mu_{b,j}}{\sigma_{b,j}}\right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma^2_{b,j}) \, U\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right),

where U(x) = \int_{-\infty}^{x} \mathcal{N}(u; 0, 1) \, du denotes the standard Gaussian cumulative distribution function.
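Assuming the max model v_t = max(s_t, b_t) for independent Gaussian components (the approximation of Nadas et al. 1989 underlying this derivation), the closed-form density N(v; μ_s, σ_s²)U(·) + N(v; μ_b, σ_b²)U(·) can be checked numerically. The sketch below compares it against a Monte Carlo histogram estimate for one illustrative pair of Gaussians:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normal_cdf(x, mu, var):
    # U(.) in the text: a Gaussian CDF, expressible via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * var)))

def mixmax_pdf(v, mu_s, var_s, mu_b, var_b):
    """Density of v = max(s, b) for independent Gaussian s and b."""
    return (normal_pdf(v, mu_s, var_s) * normal_cdf(v, mu_b, var_b)
            + normal_pdf(v, mu_b, var_b) * normal_cdf(v, mu_s, var_s))

# Monte Carlo check: empirical density of max(s, b) in a bin around v = 1.0
random.seed(0)
N = 200_000
samples = (max(random.gauss(0.0, 1.0), random.gauss(1.0, 0.5)) for _ in range(N))
emp = sum(0.9 < x < 1.1 for x in samples) / (N * 0.2)
ana = mixmax_pdf(1.0, 0.0, 1.0, 1.0, 0.25)   # note: var_b = 0.5 ** 2
```

The analytic and empirical values agree to within Monte Carlo error, confirming that the two-term form correctly accounts for whichever signal dominates at v.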

The value of U(·) can be obtained using a table of the error function. On the other hand, the conditional expectation E{s_t | v_t, μ_{s,i}, Σ_{s,i}, μ_{b,j}, Σ_{b,j}} with respect to an arbitrary component s_t of s_t can be shown to take the form given below.

As shown in Figure 1, a three-stage process is proposed. The first stage involves the segmentation of each recording into vocal and non-vocal segments, where vocal segments consist of concurrent singing and accompaniment, and non-vocal segments consist of accompaniment only. Next, the singer's stochastic vocal characteristics are distilled from the vocal segments by specifically suppressing the characteristics of the background. The third and final step involves clustering of the recordings based on singer-characteristic similarity.
Equation 6 can be solved using the EM algorithm, which starts with an initial model λ_s and iteratively estimates a new model λ̂_s such that p(V | λ̂_s, λ_b) ≥ p(V | λ_s, λ_b). It can be shown that the need to increase the probability p(V | λ̂_s, λ_b) can be satisfied by maximizing the auxiliary function

Q(\lambda_s, \hat{\lambda}_s) = \sum_{t=1}^{T} \sum_i \sum_j p(i, j | v_t, \lambda_s, \lambda_b) \log \left[ \hat{w}_{s,i} \, w_{b,j} \, p(v_t | \hat{\mu}_{s,i}, \hat{\Sigma}_{s,i}, \mu_{b,j}, \Sigma_{b,j}) \right],

where p(i, j | v_t, λ_s, λ_b) is the posterior probability that v_t was generated by the ith mixture of the solo voice model and the jth mixture of the background music model.

Figure 3. Procedure for the solo voice modeling.

Figure 4. Illustration of the similarity computation and clustering.

Figure 7. BIC measurements after each division. The arrows indicate the optimal numbers of clusters according to the BIC criterion.

Table 2 . Confusion Probability Matrix of the Vocal/Non-Vocal Discrimination
E\{s_t \mid v_t\} = p(s_t = v_t \mid v_t) \, v_t + \left[ 1 - p(s_t = v_t \mid v_t) \right] E\{s_t \mid s_t < v_t, \mu_{s,i}, \sigma^2_{s,i}\},

where

p(s_t = v_t \mid v_t) = \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma^2_{s,i}) \, U\!\left(\frac{v_t - \mu_{b,j}}{\sigma_{b,j}}\right)}{p(v_t | \mu_{s,i}, \sigma^2_{s,i}, \mu_{b,j}, \sigma^2_{b,j})}

and

E\{s_t \mid s_t < v_t, \mu_{s,i}, \sigma^2_{s,i}\} = \mu_{s,i} - \frac{\sigma^2_{s,i} \, \mathcal{N}(v_t; \mu_{s,i}, \sigma^2_{s,i})}{U\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right)}.

Similarly, the conditional expectation E{s_t s_t′ | v_t, μ_{s,i}, Σ_{s,i}, μ_{b,j}, Σ_{b,j}} is computed component-wise using

E\{s_t^2 \mid s_t < v_t, \mu_{s,i}, \sigma^2_{s,i}\} = \mu_{s,i}^2 + \sigma^2_{s,i} - \frac{(\mu_{s,i} + v_t) \, \sigma^2_{s,i} \, \mathcal{N}(v_t; \mu_{s,i}, \sigma^2_{s,i})}{U\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right)}.
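The conditional moments of a Gaussian truncated above at v, which appear in the expressions above, can be verified numerically. The sketch below compares the closed-form expressions with Monte Carlo estimates for illustrative parameter values:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normal_cdf(x, mu, var):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * var)))

def trunc_mean(v, mu, var):
    """E{s | s < v} for s ~ N(mu, var)."""
    return mu - var * normal_pdf(v, mu, var) / normal_cdf(v, mu, var)

def trunc_second_moment(v, mu, var):
    """E{s^2 | s < v} for s ~ N(mu, var)."""
    return (mu ** 2 + var
            - (mu + v) * var * normal_pdf(v, mu, var) / normal_cdf(v, mu, var))

# Monte Carlo check with truncation point v = 0.5
random.seed(1)
kept = [s for s in (random.gauss(0.0, 1.0) for _ in range(200_000)) if s < 0.5]
mc_mean = sum(kept) / len(kept)
mc_m2 = sum(s * s for s in kept) / len(kept)
```

These two moments are exactly the quantities the EM re-estimation formulas (Equations 11 and 12) require for each mixture pair, so the solo voice model can be updated without ever observing the unaccompanied voice.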