Learning to Cope With Diversity in Music Retrieval

The approach presented here makes productive use of the multidimensionality of music retrieval. It integrates heterogeneous representations of music objects into a self-adapting retrieval system. The different perspectives of users can be expressed by relevance feedback, which directs a learning process that ultimately leads to an optimal solution for a user within a certain context. The paper explores the diversity within music retrieval, which stems from an abundance of approaches for representing musical objects as well as methods for searching by similarity. As a result, the system designer is usually confronted with a large number of arbitrary decisions. These challenges are discussed within the M-MIMOR framework, which provides an appropriate solution. A fusion with linear combination guarantees that every perspective is integrated. The strength of one perspective is reflected by the weight of its representation scheme or matching algorithm in the fusion. These weights are adapted according to their success in previous retrieval tasks.


Introduction
Most areas within information retrieval (IR) have created a variety of domain-specific representation, matching and user modeling schemes. For example, there is still a discussion in cross-language text retrieval whether words or n-grams of letters are the most appropriate items for the representation of language (Peters, Braschler, Gonzalo, & Kluck, 2002).
The complexity of other application domains results in even higher levels of diversity. Multimedia in particular abounds in the number of possible representation mechanisms. Image retrieval systems may be based on color, texture, histograms, orientation, shapes or objects (Santini, 2001). Video adds a temporal aspect as well as, possibly, sound. A video retrieval system needs to consider frames as well as the combination of and mutual dependency between sounds and graphics (Smeaton, 2001).
Designers of music retrieval systems are also confronted with a large number of choices. A look at some of the implementations proves this point (Downie & Bainbridge, 2001; Fingerhut, 2002). Furthermore, the forms for the user query include a significant number of different parameter choices.
The search for appropriate algorithms and units or atoms for proper representation within the universe of possibilities seems to be a curse, since each solution neglects several aspects. However, it can turn out to be a blessing when the diversity is integrated into a self-adapting fusion system considering many heterogeneous solutions. This paper presents a model, including a machine-learning approach, to balance the influence of several system parameters according to users' preferences.
The following section introduces information retrieval and related concepts. The third section demonstrates in more detail that music retrieval is confronted with a highly diverse solution space. Fusion approaches for information retrieval are discussed in the fourth section. Next, a self-adapting fusion system is introduced which takes the challenges faced by music retrieval into account and makes productive use of them.

Information retrieval
Information retrieval deals with the storage and representation of knowledge and the retrieval of information relevant for a specific user problem. The information seeker formulates a query which is compared to document representations extracted during the indexing phase.
The query formulation process is influenced by the users' state of knowledge, their context as well as by the user interface or query language.
Thomas Mandl and Christa Womser-Hacker

The representations of documents and queries are typically matched by a similarity function such as the Cosine or the Dice coefficient. The most similar documents are presented to the users, who can evaluate the relevance with respect to their problem. Figure 1 gives an overview of the retrieval process.
This information retrieval process is inherently vague. Documents and queries traditionally contain natural language or, increasingly, multimedia objects like graphics, pictures or music pieces. The content of these objects must be analyzed, which is a hard task for artificial systems. Robust semantic analysis of large text collections or multimedia objects has yet to be developed. Therefore, text documents are represented by natural language terms, mostly without syntactic or semantic context. These keywords or terms can only imperfectly represent an object. In text retrieval, queries and documents are represented via terms or descriptors. In multimedia retrieval, the context is essential for the selection of a form of query and document representation. Different media representations may be matched against each other, or transformations may become necessary (e.g., to match terms against pictures or spoken language utterances against documents in written text).
Because information retrieval needs to deal with vague knowledge, exact processing methods are not appropriate. Vague retrieval models such as the probabilistic model are more suitable. Within these models, terms are provided with weights corresponding to their importance for a document. These weights mirror different levels of relevance. An overview of information retrieval can be found in Baeza-Yates and Ribeiro-Neto (1999).
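As an illustration of the matching step, the two coefficients mentioned above can be computed over weighted term vectors (a minimal sketch; the vectors below are invented toy data, not from the paper):

```python
from math import sqrt

def cosine(x, y):
    """Cosine coefficient between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def dice(x, y):
    """Dice coefficient between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y)
    return 2 * dot / denom if denom else 0.0

# Hypothetical term weights for a query and two documents.
query = [1.0, 0.5, 0.0]
doc1  = [0.8, 0.4, 0.1]   # shares the query's important terms
doc2  = [0.0, 0.2, 0.9]   # mostly about a different topic

# doc1 ranks above doc2 under both coefficients.
assert cosine(query, doc1) > cosine(query, doc2)
assert dice(query, doc1) > dice(query, doc2)
```

The two coefficients usually agree on the ranking; they differ in how document length influences the score.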

Diversity in music retrieval
The complexity of music as a formal system as well as a cultural phenomenon leads to difficulties for the computational representation. A single note by itself has no meaning, and there is no correspondence to syntax or to words in natural language. In addition, many different technical formats for storing musical data exist (Fingerhut, 2002).
Traditionally, music retrieval has been based merely on textual meta-data, such as author, performer or the name of a piece. In recent years, content-based retrieval methods have been developed which focus on features automatically derived from music objects (Lippincott, 2002). These features try to describe the musical content. Both meta-data retrieval and content-based methods may be appropriate for specific contexts. An optimal music retrieval system allows both and enables cooperation between these approaches.

Musical styles
Music as a cultural phenomenon has led to an abundance of musical styles. They all use different methods to express their intentions or to let the listener enjoy a pleasing experience. Sounds, notes, pauses, tempi, instruments, and voices are combined in many ways. This multi-modal structure of music represents one of the roots of the challenge of music retrieval. Depending on the style, different atoms are assembled for a composition. Which elements, such as a melody or a theme, can be considered the prominent feature depends largely upon the style, the user need and the context. Therefore, a formal model for the representation allows an optimization for one style of music only.

Representation
The possibility to store music in digital representations has made it increasingly attractive to search large music collections. In contrast to texts, music "documents" lack the separators necessary to identify semantic units such as "words" or "phrases." Like words in texts, the same melodic pattern may occur in more than one piece of music, perhaps composed by different composers. The same entity can be represented in two main forms: the notated and the acoustic form. Music communication is performed at two levels: the composer creates a musical structure while the interpreter (musician or singer) translates the written score into sounds. The resulting performances may differ considerably from each other. Knowledge within a musical work can be identified at different levels: melody, harmony and rhythm can be assessed in written formats, whereas in the case of musical performances other dimensions like timbre, articulation, and timing may be of interest. However, only a subset of these dimensions is captured by musical representation formats like MIDI.
The most widespread mode for music retrieval is search via similarity. However, similarity in music retrieval presents several difficulties: what part of a song is likely to be perceived as the theme of the music? How can one determine whether two pieces of music with different sequences of notes represent the same theme? Melucci and Orio (1999) discuss some of these issues of content-based indexing of musical data.
Most systems search for similarity on the level of pitch. Usually, these systems, like SEMEX (Lemström & Perttu, 2000), only process monophonic melodies; however, for some musical styles polyphonic matching would be desirable. Further parameters lie in the consideration of transposition invariance and the global or local focus of a representation. A global representation might, for example, only consider a histogram analysis of pitch values. Approaches from speech recognition have also been applied to music retrieval (Logan, 2000).
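The role of transposition invariance can be illustrated with an interval-based representation, the common device for achieving it (a sketch; the pitch values below are hypothetical MIDI note numbers, not an example from the paper):

```python
def intervals(pitches):
    """Represent a monophonic melody by its successive pitch intervals."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

# A short melodic phrase, and the same phrase transposed up five semitones.
melody     = [60, 62, 64, 60]
transposed = [p + 5 for p in melody]

# The absolute pitches differ, but the interval representation is identical,
# so an interval-based matcher recognizes the transposed melody.
assert melody != transposed
assert intervals(melody) == intervals(transposed)   # [2, 2, -4]
```

A matcher working directly on absolute pitches would miss the transposed version entirely.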
In Foote (1999), a novel approach to visualizing the time structure of music is presented. The acoustic similarity between any two music objects is displayed in a 2D representation, allowing the identification of structural and rhythmic properties.

Retrieval models
The choice of the retrieval model is an important factor in any domain. One central aspect of an information retrieval model is the similarity calculation between query and object representations. In music information retrieval, a large number of functions to calculate the similarity between melodies have been proposed (Rolland & Ganascia, 1999). They consider the closeness of the match between query input and database objects. In most cases, acoustic information is transcribed, converted into intervals, and used for deriving feature vectors. These vectors can be compared with input vectors in order to measure similarity. The vector-space retrieval model can provide ranked output according to the match of query and document vectors.
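A minimal sketch of such vector-space ranking over interval-derived feature vectors follows; the histogram design and the melodies are our assumptions for illustration, not a specific published system:

```python
from math import sqrt

def interval_histogram(pitches, span=12):
    """Global feature vector: a histogram of pitch intervals, clipped to one octave."""
    hist = [0.0] * (2 * span + 1)
    for a, b in zip(pitches, pitches[1:]):
        step = max(-span, min(span, b - a))
        hist[step + span] += 1.0
    return hist

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

query = [60, 62, 64, 65, 67]                 # hummed fragment: ascending steps
collection = {
    "piece_a": [48, 50, 52, 53, 55, 57],     # similar stepwise contour
    "piece_b": [60, 55, 67, 48, 72],         # large leaps
}

# Rank the collection by similarity of interval histograms to the query.
ranked = sorted(collection,
                key=lambda name: cosine(interval_histogram(query),
                                        interval_histogram(collection[name])),
                reverse=True)
assert ranked[0] == "piece_a"
```

Because the histogram discards note order, this is a global representation in the sense discussed above; a local matcher would instead compare interval sequences directly.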

Input mode
Existing music retrieval systems allow a variety of input modes for querying. They can be classified into two fundamental modes: first, the specification of retrieval criteria using textual meta-data; second, the query-by-example mode, which accepts the user's acoustic input, such as singing or humming via a microphone, an uploaded file, or part of a melody typed in, e.g., on a MIDI keyboard. Haus and Pollastri (2000) present a hierarchy of music interfaces that reflects the level of expertise of a particular type of user. At the top of the hierarchy, textual input is placed as the easiest mode of querying, while music notation takes the last position, as it is considered the most difficult interaction mode, requiring considerable expertise.
Previous prototypes for music retrieval mostly concentrate on only one input mode. Naoko, Yuichi, Tetsuo, Masashi, and Kazuhiko (2000) and Rolland (2001) used vocal input, while Uitdenbogerd and Zobel (1999), for example, preferred input via a MIDI interface.
Most online music library catalogues use textual meta-data. They follow the tradition of textual information retrieval, where the rules for maintaining consistency have been refined over many years. These systems do not allow searching by musical content. Within the OMRAS project, an integrated approach combining the two types of systems is investigated (Dovey, 2001).
Modes requiring the automatic detection of criteria, like humming, are associated with a level of vagueness or fuzziness resulting from the uncertainty related to the reliability of the detection method. Many efforts in this area focus on music data that contain some built-in semantic information structures or focus on the classification of music.

User intentions
The intentions of users of music retrieval systems vary greatly. The usage scenarios comprise entertainment, learning, research or support for composition. Users with different skills (trained musicians or non-professionals) may interact with music systems. The usefulness of multimedia systems largely depends on the way they match the users' expectations and their technical as well as musical skills.

Implementation as selection within a solution space
All parameters for music retrieval discussed above need to be considered when implementing a system. Each parameter represents one dimension in the solution space for a specific retrieval system. The space of potential solutions is highly dimensional. The determination of the value of all parameters defines an instance of a retrieval system, as illustrated in Figure 2.
The search for a solution within the high-dimensional space has the goal of achieving a good retrieval quality. This search is guided either by heuristics or by empirical results.
Finding an optimal solution requires a large testbed of tasks and evaluation of the results by users or experts.
However, when the conditions change, a different solution might be optimal. These changes may be the consequence of different queries, new user interests or changes of the music content.

Approaches for coping with diversity and heterogeneity
As a consequence of the heterogeneity presented above, most existing music retrieval systems are focused on one type of music only. This form of content specialization is a typical reaction to such complex domains; however, it limits the content and is therefore not desirable for many applications. Haus and Pollastri (2000) developed a multimodal prototype for different users, in which the musician who is able to write a musical query on a score can play it on a musical instrument or sing it, while the layman can only sing or query by textual data. They suggest translations of audio inputs to pass from the acoustic to the symbolic domain.
Another solution for a highly dimensional optimization problem lies in active adaptation by the user. In a system suggested by Bainbridge (2000), the user can set a large number of parameters of the matching function himself. These settings include duration, start of pattern and type of match. This solution is effective for musical experts who can predict the consequences of their settings, whereas the layman may not even know what the parameters mean. The expert may also profit from a more dynamic approach which optimizes the settings according to the context.
An approach focusing on efficiency is presented in Lemström, Wiggins, and Meredith (2001), where three layers are implemented. The first level is more efficient and less thorough. The longer a user waits, the more occurrences of his query may be encountered. This seems to be an appropriate strategy; however, depending on the context, good performance may already be reached in one layer only.
Another highly promising strategy is an adaptive user model such as the one presented by Rolland (2001), which takes into account the multidimensionality of human similarity judgement as well as the different importance of representations. Consequently, a weight vector is assigned to each user representing the importance of different representations. The model is adapted by user feedback, which has proved highly effective in text retrieval. However, the initialization assigns arbitrary weights, and the context is not modeled. Both aspects are important, since the first contact with a system should lead to reasonably good results. The consideration of context is crucial because the needs of the user may be dynamic, and different optimizations are necessary when the same user queries different musical styles.

Optimization through fusion
The fusion of various approaches is widely used in computer science. The goal of applying several algorithms is to improve the overall performance. Fusion methods delegate a task to several systems and integrate their results into one final result presented to the user. Ideally, the weaknesses of one method do not have a large negative influence on the final result because they are superimposed by another method. Typical examples are committee machines in machine learning (Haykin, 1999). The fusion may be implemented as a voting scheme or as a weighted linear combination. Recently, non-linear committee machines like boosting or bagging have drawn considerable attention because of their high effectiveness (Witten & Frank, 2000).

Fusion in information retrieval
For information retrieval, fusion can also be implemented as a combination of several algorithms. The integration considers several different probabilities for the relevance of a document to a query and calculates one final similarity measure.
Fusion has led to a significant amount of research in information retrieval. This is especially true since experiments carried out within the framework of TREC (Text Retrieval Conference) have shown that the results of similarly well-performing information retrieval systems often differ. This means that the systems find the same percentage of relevant documents, but the overlap between their results is sometimes low. TREC is an initiative which has led to a higher level of comparability in information retrieval. Whereas before TREC most researchers used their own small collection to test their systems, TREC now provides a testbed for the empirical evaluation of different systems (Voorhees & Harman, 2001).
Because of these results, fusion seems to be a promising approach and has been applied to text retrieval (Fox & Shaw, 1994; Vogt & Cottrell, 1998; McCabe, Chowdhury, Grossmann, & Frieder, 1999; Savoy, 2002). A model for a fusion system is presented in Figure 3. A different kind of fusion is carried out by the popular internet meta search engines. These have been developed because single search engines can hardly index all the documents of the internet. Meta search engines attempt to create a greater basis for the search for relevant material by combining the results of single search engines. However, it is not clear whether meta search really leads to better retrieval results. Some empirical studies have shown no improvement (Wolff, 2000).

The MIMOR model
MIMOR (Multiple Indexing and Method-Object Relations) is a fusion approach taking advantage of heterogeneity (Womser-Hacker, 1997). The MIMOR model samples users' relevance feedback to predict optimal method-object relations, where methods are indexing algorithms or retrieval models. These are assigned to the characteristics of users and documents with the goal of improving the overall retrieval quality. From a computational viewpoint, MIMOR is designed as a linear combination of the results of different retrieval systems. The contribution of each system or algorithm to the fusion result is governed by a weight for that system.
A central aspect of MIMOR is learning. The weight of each information retrieval system in the linear combination is adapted according to the success of this system in previous search tasks. The success is measured by the relevance feedback of the users. A system which gave a high retrieval status value (RSV), and consequently a high rank, to a document which then received positive relevance feedback contributes more strongly to future fusion results. Learning in MIMOR leads to a fusion which combines the individual systems in an optimal way after a sufficient period of usage. As a result, MIMOR takes advantage of two of the most promising strategies for improving information retrieval systems: relevance feedback and fusion. However, the optimal combination may depend on the context, especially on the users' individual perspectives and the characteristics of the documents. Therefore, MIMOR needs to consider context.
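This behaviour can be sketched as follows; the concrete update rule, learning rate and normalization are assumptions for illustration, not the paper's own formula:

```python
def fuse(rsvs, weights):
    """Linear combination: the fused RSV is the weighted sum of the systems' RSVs."""
    return sum(w * r for w, r in zip(weights, rsvs))

def update(weights, rsvs, relevant, rate=0.1):
    """Adapt the weights after one relevance judgement: systems that gave a
    high RSV to a relevant document gain weight, systems that gave a high
    RSV to a non-relevant document lose weight (a simple delta-rule reading)."""
    sign = 1.0 if relevant else -1.0
    new = [max(0.0, w + rate * sign * r) for w, r in zip(weights, rsvs)]
    total = sum(new)
    return [w / total for w in new] if total else weights

weights = [0.5, 0.5]                  # two retrieval systems, equal start
doc_rsvs = [0.9, 0.2]                 # system 0 rates this kind of document highly
before = fuse(doc_rsvs, weights)

# The user repeatedly judges documents favoured by system 0 as relevant.
for rsvs in ([0.9, 0.2], [0.8, 0.1], [0.7, 0.3]):
    weights = update(weights, rsvs, relevant=True)

after = fuse(doc_rsvs, weights)
assert weights[0] > weights[1]        # system 0 now dominates the fusion
assert after > before                 # fused score of its documents rises
```

The normalization keeps the weights comparable over time; without it, repeated positive feedback would inflate all weights indefinitely.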

Evaluation
So far, MIMOR has been evaluated twice on a large scale with text data from the Cross-Language Evaluation Forum (CLEF, cf. Peters, Braschler, Gonzalo, & Kluck, 2002). In one set of experiments, a corpus of 80,000 text documents was processed by a retrieval system with different parameter settings. The results were fused with equal weights and with optimized weights. These optimal weights were derived from the CLEF 2001 data. The fusion with MIMOR led to encouraging results and gave a much higher performance than the single systems (Hackl, Kölle, Mandl, & Womser-Hacker, 2002).
Our focus is on another set of experiments, where the heterogeneity of the retrieval systems was higher. Therefore, it is more applicable to music retrieval, where the differences between the systems are quite substantial. In this experiment, MIMOR was evaluated with a commercially available retrieval software, in this case IBM's DB2 Text Extender. Details of this evaluation can be found in Li (2002).
The corpus consisted of all issues of the German news magazine Der Spiegel from the year 1994 and contained some 14,000 documents. Queries were formed from the 30 CLEF topics of the 2000 campaign. The CLEF topics contain three parts: a title, a short description and a longer description (narrative). All of these parts were used for the experiments with DB2.
Text Extender allows many parameter settings, mainly based on different linguistic processing modules. Some of these parameter settings were used to construct the different systems for our fusion experiment. Text Extender supports the Boolean retrieval model as well as a probabilistic model. It comprises linguistic pre-processing including stemming, and an n-gram approach which does not use words but n-grams as basic units. Alternatively, a precise index without pre-processing was used.
As Table 1 shows, there is an increase in the quality of the retrieval results which lies between 5% and 11%. This gain is calculated relative to the best individual retrieval result. This means that fusing the best result with another result which may be worse leads to an overall improvement. As Table 1 also shows, the variability between the individual retrieval systems modeled within DB2 is high. The probabilistic and the Boolean retrieval model differ in their basic approach. Furthermore, some of the runs use words and others use n-grams as a basic representational unit. Therefore, fusion is an especially promising approach in a highly heterogeneous environment like music retrieval.

Context model
The performance of information retrieval systems differs from domain to domain, and characteristics of the documents relevant for the indexing procedure may be responsible. In one experiment, for example, optimal similarity functions for short queries could be developed (Kwok & Chan, 1998). MIMOR builds upon the idea that such formal properties can be exploited to improve fusion. Some retrieval methods work better, for example, for short documents. The weight of these systems should be high for short documents only. Some characteristics of text documents seem to be good candidates for such distinctions. Length, difficulty, syntactic complexity, and even layout can be assessed automatically.
These properties are modeled as clusters. All documents which have a property in common belong to the same cluster. Each cluster can develop its own adequate MIMOR model with weights for all participating systems.
The term clustering is usually used for unsupervised learning methods which find structures in data without hypotheses. However, the assignment of text documents to clusters for the improvement of information retrieval processes may also be carried out with supervised learning methods. Therefore, the term cluster in this article does not restrict this process to algorithms based on unsupervised learning. Supervised learning methods for pre-defined classes and even human assignment are compatible with MIMOR.
A theoretical justification for a cluster model can be found in the evaluation strategies for clustering algorithms, like minimal description length or category utility (Witten & Frank, 2000). Category utility estimates the value of a cluster by checking how well it can be used to predict attribute values of objects. Clusters are good if the probability of an object having a certain value is higher within a specific cluster than over all objects. If good clusters are found and one attribute is the appropriate retrieval system, then the probability is high that a good retrieval system for that specific object is used.
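Category utility can be sketched as follows (the per-cluster averaging variant of the measure for nominal attributes; the toy data and the "best system" attribute are invented for illustration):

```python
from collections import Counter

def category_utility(clusters):
    """Category utility for nominal attributes: how much better attribute
    values can be predicted given the cluster than without it."""
    all_items = [item for c in clusters for item in c]
    n = len(all_items)
    n_attrs = len(all_items[0])
    # Baseline: expected number of correct guesses over the whole collection.
    base = 0.0
    for i in range(n_attrs):
        counts = Counter(item[i] for item in all_items)
        base += sum((k / n) ** 2 for k in counts.values())
    # Gain: how much each cluster improves on that baseline.
    cu = 0.0
    for c in clusters:
        within = 0.0
        for i in range(n_attrs):
            counts = Counter(item[i] for item in c)
            within += sum((k / len(c)) ** 2 for k in counts.values())
        cu += (len(c) / n) * (within - base)
    return cu / len(clusters)

# Each object: (document length class, retrieval system that works best for it).
data = [("short", "sys_a"), ("short", "sys_a"), ("long", "sys_b"), ("long", "sys_b")]
pure_clusters  = [data[:2], data[2:]]                # clusters predict the best system
mixed_clusters = [[data[0], data[2]], [data[1], data[3]]]
assert category_utility(pure_clusters) > category_utility(mixed_clusters)
```

With the "best retrieval system" as one attribute, a high category utility means the cluster is a good predictor of which system to trust, which is exactly the property MIMOR's context model exploits.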
Introducing clusters in MIMOR can be regarded as the implementation of an individual MIMOR model for each cluster. The final result considers only the weight of the cluster to which the document belongs. The learning formula needs to be modified accordingly. The change in the weight is now applied only to the cluster containing the document.
Clustering documents is a tedious task. In many cases, the hard assignment of a document to only one class is difficult. Therefore, this condition needs to be relaxed, and fuzzy clustering has also been integrated into MIMOR (Mandl & Womser-Hacker, 2001).

User model
Further refinement of MIMOR can be achieved by integrating a user model. Unlike other user models in information retrieval, MIMOR introduces an adaptation in the core of an information retrieval system and applies it to the calculation of the RSV.
Similar to the properties of the documents, an additional MIMOR model for each person could be introduced, leading to optimal user models. However, the training of a MIMOR model requires a substantial amount of relevance feedback decisions. Therefore, the user is forced to submit many decisions before the system can be used effectively. Another disadvantage is common to all inductive and incremental learning algorithms: the occurrence of some unusual cases in the initial learning phase may lead the algorithm to an unstable learning curve. This may result in a degradation of the retrieval behavior.
Both problems are solved by introducing separate private and public models. The private model contains a user-specific MIMOR model optimized by all the relevance feedback decisions of that user. The public model is trained with all decisions of all users of the system. The public MIMOR is optimized but not individualized. Therefore, it can be used for any user beginning to work with MIMOR, for whom an individual model is not yet available. Over time, such a beginner will collect a significant number of relevance judgements and will eventually reach a fully individualized and saturated model. During this process, the public model will lose its influence while the importance of the private model grows.
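One way to realize this shifting influence is a simple mixing schedule; the linear schedule and the saturation threshold below are assumptions for illustration, not part of the MIMOR specification:

```python
def blended_weights(private_w, public_w, n_feedback, saturation=50):
    """Mix the public and the private model; the private share grows with the
    number of relevance judgements the user has given (assumed linear schedule)."""
    alpha = min(1.0, n_feedback / saturation)   # 0 = pure public, 1 = pure private
    return [alpha * pr + (1 - alpha) * pu
            for pr, pu in zip(private_w, public_w)]

public  = [0.6, 0.4]     # trained on the feedback of all users
private = [0.1, 0.9]     # this user's own, eventually saturated, preferences

assert blended_weights(private, public, 0) == public     # beginner: public model only
assert blended_weights(private, public, 50) == private   # saturated: private model only
mid = blended_weights(private, public, 25)
assert public[0] > mid[0] > private[0]                   # influence shifts gradually
```

Any monotone schedule would do; the essential property is that the public model dominates at the start and the private model after saturation.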
The user model in MIMOR differs from many individualization approaches in information retrieval. Often, the individual preferences are stored as a content model. Many systems use interest vectors. MIMOR applies individualization to the algorithmic layer of the system.

M-MIMOR: Self adaptation for music retrieval systems
The MIMOR approach is very well suited for music retrieval. Music retrieval incorporates high diversity along several dimensions of system parameters. The choices for parameter values are almost arbitrary. On the other hand, MIMOR offers a fusion method which learns from the preferences of the user. A mapping is established between the application of system features and the success expressed by positive user feedback. Instead of focusing on one value for each system parameter, each user receives the most appropriate mixture of the options available. As a consequence, we propose a MIMOR for music objects called M-MIMOR.
The diversity in music retrieval approaches has been sketched in Section 3. In the following sections, these aspects of diversity are handled by our M-MIMOR model. Based on the literature and on experience from text retrieval, the following distribution of fusion parameters is most favorable. Representation and matching aspects are treated in the basic MIMOR system by allowing a variety of representations. Different styles and contexts are consequently treated in the context model. The heterogeneous user population and different usage scenarios need to be captured by the user model.

Representation of musical objects in M-MIMOR
A large variety of representation formalisms has been presented in Section 3. For a fusion, the aspects shown in Table 2 should be integrated when available, in order to achieve a highly heterogeneous representation mix.
Further aspects may need to be integrated in specific situations. Matching aspects do not play such an important role in music retrieval. Since the representations are very different, the similarity algorithms are often adapted for specific representation schemes. Combining all matching methods and representation schemes is sometimes useful in text retrieval. However, it will rarely prove useful in music retrieval because it may lead to inappropriate combinations.

M-MIMOR context model
Automatic genre detection systems have also been developed for music objects (Tzanetakis, Essl, & Cook, 2001). Therefore, genre can be used as one feature in M-MIMOR. The calculation of the similarity between query and musical objects needs to consider not only the systems involved. In addition, the clusters to which an object belongs and the membership function M enter the formal model.
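The resulting calculation can be sketched with cluster-specific weight vectors and a fuzzy membership function M; the formula shape is inferred from the description, and all numbers are invented:

```python
def fused_rsv(system_rsvs, cluster_weights, membership):
    """Fused RSV with a context model: each cluster contributes its own linear
    combination of the systems' RSVs, weighted by the document's fuzzy
    membership M(d, c) in that cluster (formula shape assumed)."""
    return sum(
        membership[c] * sum(w * r for w, r in zip(weights, system_rsvs))
        for c, weights in cluster_weights.items()
    )

system_rsvs = [0.8, 0.3]                      # RSVs from two retrieval systems
cluster_weights = {
    "pop":       [0.9, 0.1],                  # pitch-based matching learned to work here
    "classical": [0.2, 0.8],                  # timbre-based matching learned to work here
}
membership = {"pop": 0.7, "classical": 0.3}   # fuzzy genre membership of the document

rsv = fused_rsv(system_rsvs, cluster_weights, membership)
assert abs(rsv - (0.7 * 0.75 + 0.3 * 0.40)) < 1e-9
```

With a hard assignment (membership 1 for a single cluster), this reduces to the per-cluster MIMOR model described in the context model section.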

M-MIMOR user model
The reasons for similarity or relevance judgements in music retrieval are highly subjective. Each user may find a different combination of musical characteristics important in a certain situation and apply them to his individual judgement.
Because MIMOR integrates different aspects of music, it must be individualized to reach a high overlap between the users' preferences and the internal representation. Between the public and private models, another layer of group-specific MIMOR models, e.g., for researchers, could be implemented in the future.

Conclusion
This article introduces a model for music retrieval which automatically learns to adapt itself to the cognitive preferences of the user and supports the multimodal nature of music. Since the evaluation of musical objects is highly subjective, a retrieval system needs to dynamically identify the most appropriate combination of system parameters for a given user. M-MIMOR manages this integration in a linear combination of many possible variables. Consequently, M-MIMOR takes personalization and adaptivity one step further.
As a result, no viewpoint expressed in a certain algorithm or representation method is necessarily neglected, but each may contribute with a useful weight to the final result. The fusion of a diversity of perspectives will ultimately lead to better retrieval performance.
The effect of this learning process is shown in Figure 4; it is driven by a formula which updates the weights after each relevance judgement.
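One plausible form of such a weight update, consistent with the description of learning in MIMOR (the learning rate $\eta$ and the relevance coding are assumed notation, not the paper's original formula), is:

$$ w_i \leftarrow w_i + \eta \, r(d) \, \mathrm{rsv}_i(q, d) $$

where $\mathrm{rsv}_i(q, d)$ is the retrieval status value that system $i$ assigns to document $d$ for query $q$, and $r(d)$ is $+1$ for positive and $-1$ for negative relevance feedback. A system that ranks relevant documents highly thus gains weight in the linear combination.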
Emerging standardized test collections for music retrieval will offer potential for the evaluation of M-MIMOR. A previous version of this paper was published in the proceedings of ISMIR: Fingerhut, Michael (Ed.): ISMIR 2002 Proceedings: Third International Conference on Music Information Retrieval.

Table 1. Overview of the experiments.