Joan Serrà
 
In this study we compare the use of different music representations for retrieving alternative performances of the same musical piece, a task commonly referred to as version identification. Given the audio signal of a song, we compute descriptors representing its melody, bass line and harmonic progression using state-of-the-art algorithms. These descriptors are then employed to retrieve different versions of the same musical piece using a dynamic programming algorithm based on nonlinear time series analysis. First, we evaluate the accuracy obtained using individual descriptors, and then we examine whether performance can be improved by combining these music representations (i.e. descriptor fusion). Our results show that whilst harmony is the most reliable music representation for version identification, the melody and bass line representations also carry useful information for this task. Furthermore, we show that by combining these tonal representations we can increase version detection accuracy. Finally, we demonstrate how the proposed version identification method can be adapted for the task of query-by-humming. We propose a melody-based retrieval approach, and demonstrate how melody representations extracted from recordings of a cappella singing can be successfully used to retrieve the original song from a collection of polyphonic audio. The current limitations of the proposed approach are discussed in the context of version identification and query-by-humming, and possible solutions and future research directions are proposed.

J. Salamon, J. Serrà, and E. Gómez. Tonal representations for music retrieval: from version identification to query-by-humming. Int. Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval. In press.

 

ISMIR + MIREX 2012

05/10/2012

 
Next week I'll be attending the ISMIR conference in Porto, Portugal, where I coauthor 4 papers:
Moreover, I'll be presenting our approach to music structure annotation, which obtained very good results in this year's MIREX campaign:

 
 
Popular music is a key cultural expression that has captured listeners' attention for ages. Many of the structural regularities underlying musical discourse are yet to be discovered and, accordingly, their historical evolution remains formally unknown. Here we unveil a number of patterns and metrics characterizing the generic usage of primary musical facets such as pitch, timbre, and loudness in contemporary western popular music. Many of these patterns and metrics have been consistently stable for a period of more than fifty years. However, we prove important changes or trends related to the restriction of pitch transitions, the homogenization of the timbral palette, and the growing loudness levels. This suggests that our perception of the new would be rooted in these changing characteristics. Hence, an old tune could perfectly sound novel and fashionable, provided that it consisted of common harmonic progressions, changed the instrumentation, and increased the average loudness.

J. Serrà, Á. Corral, M. Boguñá, M. Haro, and J. Ll. Arcos. Measuring the evolution of contemporary western popular music. Scientific Reports 2: 521. Jul 2012.

 
 
Digital sampling can be defined as the use of a fragment of another artist’s recording in a new work, and has been common practice in popular music production since the 1980s. Knowledge of the origins of samples holds valuable musicological information, which could in turn be used to organise music collections. Yet the automatic recognition of samples has not been addressed in the music retrieval community. In this paper, we introduce the problem, situate it in the field of content-based music retrieval and present a first strategy to approach it. Evaluation confirms that our modified optimised fingerprinting approach is indeed a viable strategy.

J. Van Balen, M. Haro, and J. Serrà. Automatic identification of samples in hip hop music. Proc. of the Int. Symp. on Computer Music Modeling and Retrieval (CMMR), pp. 544-551. London, UK. June 2012.
 
 
In this contribution, we discuss content-based retrieval strategies that follow the query-by-example paradigm: given an audio query, the task is to retrieve all documents that are somehow similar or related to the query from a music collection. Such strategies can be loosely classified according to their specificity, which refers to the degree of similarity between the query and the database documents. Here, high specificity refers to a strict notion of similarity, whereas low specificity refers to a rather vague one. Furthermore, we introduce a second classification principle based on granularity, where one distinguishes between fragment-level and document-level retrieval. Using a classification scheme based on specificity and granularity, we identify various classes of retrieval scenarios, which comprise audio identification, audio matching, and version identification. For these three important classes, we give an overview of representative state-of-the-art approaches, which also illustrate the sometimes subtle but crucial differences between the retrieval scenarios. Finally, we give an outlook on a user-oriented retrieval system, which combines the various retrieval strategies in a unified framework.

P. Grosche, M. Müller, and J. Serrà. Audio content-based music retrieval. In Multimodal Music Processing, M. Müller, M. Goto, and M. Schedl eds., Dagstuhl Follow-Ups, Dagstuhl Publishing, Wadern, Germany, vol. 3, ch. 9, pp. 157-174. Apr 2012.
 
 
We study and characterize the rank-frequency distribution of MFCC code-words, considering speech, music, and environmental sound sources. We show that, regardless of the sound source, MFCC code-words follow a shifted power-law distribution. This implies that there are a few code-words that occur very frequently and many that occur rarely. We also observe that the inner structure of the most frequent code-words has characteristic patterns. For instance, close MFCC coefficients tend to have similar quantization values in the case of music signals. Finally, we study the rank-frequency distributions of individual music recordings and show that they present the same type of heavy-tailed distribution as found in the large-scale databases. This fact is exploited in two supervised semantic inference tasks: genre and instrument classification. In particular, using just 50 (properly selected) frames, we obtain classification results similar to those obtained by considering all frames in the recordings. Beyond this particular example, we believe that the fact that MFCC frames follow a power-law distribution could potentially have important implications for future audio-based applications.

M. Haro, J. Serrà, Á. Corral, and P. Herrera. Power-law distribution in encoded MFCC frames of speech, music, and environmental sound signals. Proc. of the Int. World Wide Web Conf., Workshop on Advances in Music Information Research (AdMIRe), pp. 895-902. Lyon, France. April 2012.

 
 
In this paper we compare the use of different musical representations for the task of version identification (i.e. retrieving alternative performances of the same musical piece). We automatically compute descriptors representing the melody and bass line using a state-of-the-art melody extraction algorithm, and compare them to a harmony-based descriptor. The similarity of descriptor sequences is computed using a dynamic programming algorithm based on nonlinear time series analysis which has been successfully used for version identification with harmony descriptors. After evaluating the accuracy of individual descriptors, we assess whether performance can be improved by descriptor fusion, for which we apply a classification approach, comparing different classification algorithms. We show that both melody and bass line descriptors carry useful information for version identification, and that combining them increases version detection accuracy. Whilst harmony remains the most reliable musical representation for version identification, we demonstrate how in some cases performance can be improved by combining it with melody and bass line descriptions. Finally, we identify some of the limitations of the proposed descriptor fusion approach, and discuss directions for future research.

J. Salamon, J. Serrà, and E. Gómez. Melody, bassline, and harmony representations for music version identification. Proc. of the Int. World Wide Web Conf., Workshop on Advances in Music Information Research (AdMIRe), pp. 887-894. Lyon, France. April 2012.
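
To give a rough feel for what "dynamic programming over descriptor sequences" means here, the sketch below scores two sequences of pitch classes with a plain Smith-Waterman-style local alignment and tries all 12 transpositions. This is a deliberately simplified stand-in, not the method used in the paper (which relies on nonlinear time series analysis). The function names, the scoring values and the toy sequences are all assumptions made for illustration.

```python
def local_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Best local alignment score between two symbol sequences.

    Smith-Waterman-style recursion; the scoring values are illustrative
    assumptions, not parameters taken from the paper.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for x in a:
        curr = [0.0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0.0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def version_similarity(a, b):
    """Maximum local alignment score over the 12 transpositions of b."""
    return max(
        local_alignment_score(a, [(p + k) % 12 for p in b])
        for k in range(12)
    )


# Toy example: the same melodic fragment, transposed and with a short intro.
original = [0, 2, 4, 5, 7, 9, 11, 0]
version = [1, 1] + [(p + 3) % 12 for p in original]
print(version_similarity(original, version))
```

Searching over transpositions matters because versions are often played in a different key; without it, the two toy sequences above would align poorly.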

 
 
We study and characterize the statistical properties of timbral encodings, here called timbral code-words. In particular, we report on rank-frequency distributions of timbral code-words from disparate sources such as speech, music, and environmental sounds. Analogously to text corpora, we find a heavy-tailed Zipfian distribution with exponent close to one. Importantly, this distribution is found independently of different encoding decisions and regardless of the audio source. Further analysis reveals that the most frequent code-words tend to have a more homogeneous structure. We also find that speech and music databases have specific, distinctive code-words while, in the case of the environmental sounds, such database-specific code-words are not present. Finally, we show that a Yule-Simon process with memory provides a reasonable quantitative approximation for our data, suggesting the existence of a common simple generative mechanism for all considered sound sources.

M. Haro, J. Serrà, P. Herrera, and Á. Corral. Zipf's law in short-time timbral codings of speech, music, and environmental sound signals. PLoS ONE, vol. 7, issue 3, art. e33993. March 2012.
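
As a small illustration of what a rank-frequency distribution is, the sketch below counts how often each code-word occurs, sorts the counts by rank, and estimates the exponent with a straight-line fit in log-log space. The least-squares fit is a crude estimator chosen for brevity and is not claimed to be the fitting procedure used in the article. The function names and the toy code-word labels are assumptions.

```python
import math
from collections import Counter


def rank_frequency(codewords):
    """Code-word counts sorted from most frequent (rank 1) to least."""
    return sorted(Counter(codewords).values(), reverse=True)


def zipf_exponent(freqs):
    """Negated least-squares slope of log(frequency) against log(rank).

    A rough estimate only; it needs at least two distinct ranks.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -num / den


# Toy data (hypothetical labels): the code-word of rank r occurs 120 // r times,
# so frequency is proportional to 1 / rank.
codewords = []
for r in range(1, 7):
    codewords += [f"cw{r}"] * (120 // r)

freqs = rank_frequency(codewords)
print(freqs, zipf_exponent(freqs))
```

For this toy data the estimated exponent comes out at one, which is the Zipfian case described in the abstract above.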

 

Article accepted!

20/02/2012

 
Finally, the part of my thesis related to version groups has been accepted as an article in Pattern Recognition Letters (Title: Characterization and exploitation of community structure in cover song networks). Hopefully it will be available soon on ScienceDirect.
 
 
On January 20 I'll be giving a talk at KIIT-Gurgaon (India), in the Workshop on Computational Models for Music Information Research, a satellite event of the International Symposium on Frontiers of Research in Speech and Music (FRSM). The workshop will be chaired by Xavier Serra and it will include many other talks from the partners of the CompMusic project. The title of my talk will be "Machine learning for music discovery". The handouts will be available at this website soon.