
1 Introduction

The subject of music genre classification, though visible not only in Music Information Retrieval (MIR) research but also in commercial applications, still needs attention in terms of effectiveness and quality of classification [69, 12, 21]. This is especially important when classification is applied to large, publicly available music databases [15]. When looking for a specific audio track, the most popular query criteria are artist, genre, mood, tempo, or title. Lately, genre has become one of the most popular choices, but this information is not always stored with a given music track. That is why deeper content exploration, i.e., taking sound source separation into consideration in the context of music recognition, should be addressed, as the separation of individual auditory sources, apart from its use in instrument recognition and automatic transcription systems, may be very useful in genre classification. The instrument separation approach appears particularly promising for improving genre recognition, as source separation and instrument recognition systems have gained wide applicability in recent years. Therefore, one of the aims of this paper is to propose a set of parameters which, after an audio track separation pre-processing step, may describe the musical content of a piece of music more efficiently for the purpose of better distinguishing between selected musical genres. Such an operation may bring the efficiency of genre classification in databases containing thousands of music tracks close to that of earlier experiments carried out on small music databases.

This paper overviews related research and then presents the databases and parameters used in the experiments. In addition, the main principles of the algorithm used for musical instrument separation are presented, and the proposed optimization of the feature vectors is briefly discussed. The following section contains music genre classification results based on two decision systems and on the separation of music tracks, obtained for two databases. A summary providing the most important conclusions is also included.

1.1 Related Research

Because of the extraordinary increase in the amount of available multimedia data, it is necessary to make the recognition process automatic. Although the division of music into genres is subjective and arbitrary, there are perceptual criteria related to the texture, instrumentation and rhythmic structure of music that can be used to characterize a particular genre. Humans can accurately predict a musical genre based on about 250 ms of audio, which confirms that we can judge the genre using only the musical surface, without constructing any higher-level theoretical descriptions.

The experiments presented by Tzanetakis and Cook (2002) [17] resulted in 61 % accuracy for 10 music genres, which is comparable to the results achieved by humans in musical genre classification. The authors proposed three music feature sets representing timbral texture, rhythmic content and pitch content. Other experiments were carried out by Kirss (2007) [10] on 250 musical excerpts from five different electronic music genres: deep house, techno, uplifting trance, drum and bass, and ambient. Using an SVM (Support Vector Machine), an accuracy of 96.4 % was reached. This shows that it is possible to obtain high classification efficiency on small datasets with a small number of genres, which can also be classified as subgenres. The ISMIS 2011 contest produced final results [9] of almost 88 % accuracy. The tests were conducted on the ISMIS database, which consists of 1310 songs of six genres. It is worth mentioning that most misclassifications occurred between Rock and Metal, and between Classical and Jazz.

The importance of feature extraction for musical genre classification is confirmed by many researchers. In recent years, extensive research has been conducted on the subject of audio sound separation, resulting in interesting ideas and solutions. Uhle et al. (2003) [18] designed a system for drum beat separation based on Independent Component Analysis. In contrast, Smaragdis and Brown (2003) [16] applied Non-Negative Matrix Factorization (NMF) to create a system for the transcription of polyphonic music that showed remarkable results on piano music. Helen and Virtanen (2005) [3] used NMF, combined with feature extraction and classification, and achieved promising results in separating drum beats from popular music. Similar techniques were used by other researchers [1, 13, 19] for percussive-harmonic signal separation or for instrument separation [4].

It is noteworthy that musical social systems, in addition to computer games, have become one of the most profitable ventures in recent years. The development of such applications therefore makes it necessary to improve the classification of music genre and other frequently searched criteria, which is still far from satisfactory. Despite major achievements in the field of MIR, there are still challenges in this area, to name a few: the problem of scalability, the large size of the data, different standards (formats) for storing music information (and other media, e.g. a soundtrack), methods of transmitting multimedia data, varying degrees of compression, synchronizing the playback of various media elements, and others.

2 Genre Classification Experiments

2.1 Separation of Music Tracks

Separation of music tracks is based on the NMF method [11]. This method performs well in the blind separation of drums and melodic parts of audio recordings. NMF decomposes the magnitude spectrogram \( V \in R_{ \ge 0}^{n \times m} \) (with spectral observations in columns), obtained by the Short-Time Fourier Transform (STFT), into two non-negative matrices W and H such that \( V \approx W \bullet H \), where \( W \in R_{ \ge 0}^{n \times r} \), \( H \in R_{ \ge 0}^{r \times m} \) and \( r \in N \) is a constant. The columns of W represent the characteristic spectra of the audio events occurring in the signal (such as notes played by an instrument), and the rows of H contain their time-varying gains. Unlike in Principal Component Analysis, the columns of W are not required to be orthogonal. Specifically, our experiments use an iterative algorithm that computes the two factors by minimizing the Kullback-Leibler divergence of V given \( W \bullet H \), a cost function which interprets the matrices V and \( W \bullet H \) as probability distributions [14, 20].
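A minimal sketch of this factorization step is given below, using Python with librosa and scikit-learn as stand-ins for the tools actually used by the authors; the file name and parameter values (window of about 23 ms, 50 % overlap, r = 20 components) are illustrative only.

import numpy as np
import librosa
from sklearn.decomposition import NMF

# Magnitude spectrogram V (frequency bins x frames) of a music excerpt.
y, sr = librosa.load("excerpt.wav", sr=44100, mono=True)    # illustrative file name
V = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))     # Hann window, 50 % overlap

# Factorize V ~= W . H with r components by iteratively minimizing the
# (generalized) Kullback-Leibler divergence.
r = 20
nmf = NMF(n_components=r, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", max_iter=400)
W = nmf.fit_transform(V)    # n x r: characteristic spectra of the audio events
H = nmf.components_         # r x m: their time-varying gains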

Then, to each NMF component (a column of W and the corresponding row of H) a pre-trained SVM classifier is applied which, based on features such as the harmonicity of the spectrum and the periodicity of the gains, distinguishes percussive from non-percussive components. By selecting the columns of W that are classified as percussive and multiplying them by their estimated gains in H, we obtain an estimate of the contribution of the percussive instruments to each time-frequency bin of V. Thus, we can construct a soft mask that is applied to V to obtain an estimated spectrogram of the drum part, which is transferred back to the time domain through the inverse STFT, using the overlap-add (OLA) operation between the short-time sections. It should be noted that the redundancy within overlapping segments and the averaging of the redundant samples cancel out the effect of the analysis window. More details on the drum separation procedure can be found in the paper by Weninger et al. (2011) [19].
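Continuing the sketch above, the reconstruction of the drum part could look roughly as follows; the is_percussive() predicate is a hypothetical, deliberately crude stand-in for the pre-trained SVM component classifier described in [19].

import numpy as np
import librosa

# W, H, r and the signal y come from the previous sketch; S is the complex STFT
# of the same excerpt (the phase is kept for the inverse transform).
S = librosa.stft(y, n_fft=1024, hop_length=512)

def is_percussive(w, h, eps=1e-10):
    # Crude placeholder for the pre-trained SVM: a flat (inharmonic) spectrum
    # is taken here as a hint that the component is percussive.
    flatness = np.exp(np.mean(np.log(w + eps))) / (np.mean(w) + eps)
    return flatness > 0.3

perc = [k for k in range(r) if is_percussive(W[:, k], H[k, :])]
V_drums = W[:, perc] @ H[perc, :]               # drum contribution per time-frequency bin
mask = V_drums / np.maximum(W @ H, 1e-10)       # soft mask from the full NMF model
y_drums = librosa.istft(mask * S, hop_length=512)   # inverse STFT with overlap-add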

2.2 Databases

Music Information Retrieval systems enable searching for music based on metadata: a set of parameters describing a specific track, such as its title, author, album, year, genre, etc. Another important topic in MIR is music databases available for research and experiments. Two music databases were employed in the experiments described in this paper, namely the ISMIS and SYNAT [8] databases.

The ISMIS music database was prepared for a data mining contest associated with the 19th International Symposium on Methodologies for Intelligent Systems [9]. It consists of over 1300 music tracks of 6 music genres: classical, jazz, blues, pop, rock and heavy metal (see Table 1). For each of 60 performers there are 15-20 music tracks, which are partitioned into 20-s segments and parameterized. Music tracks are prepared as stereo signals (44.1 kHz, 16 bit, .wav format).

Table 1. List of cardinality of the music genres for ISMIS database

The SYNAT database consists of over 50,000 30-second excerpts of songs in mp3 format, retrieved from the Internet by a music robot. ID3 tags of the music excerpts were assigned to the songs by the music robot in a fully automatic way, without human control [5]. SYNAT contains 22 genres: Alternative Rock, Blues, Broadway and Vocalists, Children’s Music, Christian and Gospel, Classic Rock, Classical, Country, Dance and DJ, Folk, Hard Rock and Metal, International, Jazz, Latin Music, Miscellaneous, New Age, Opera and Vocal, Pop, Rap and Hip-Hop, Rock, R and B, and Soundtracks. However, for the experiments carried out within this study, over 8,000 music excerpts from the SYNAT database, representing 13 music genres, were used. The cardinality of the music excerpts for the original and separated signals, in relation to specific music genres, is presented in Table 2. From the original audio signal, harmonic (H), drum (D), piano (P), trumpet (T) and saxophone (S) signals were retrieved using different options of the Non-Negative Matrix Factorization separation method (cost function: extended KL-divergence; window sizes: 20, 30, 40 ms; window function: Hann; window overlap: 0.5; number of components: 5, 10, 20, 30).

Table 2. Cardinality of the classes for original and separated signals, based on 8244 elements received from SYNAT database, representing 13 music genres

Marsyas (GTZAN) [2] is a database commonly used in MIR. It consists of 1000 songs representing 10 genres (100 songs per genre): Pop, Rock, Country, Rap and Hip-Hop, Classical, Jazz, Dance and DJ, Blues, Hard Rock and Metal, and Reggae. To demonstrate the usability of the original feature vector, preliminary results obtained for this database are recalled here.

2.3 Basic Parametrization

The so-called ‘basic’ parameters were adapted from previous studies [9] to enable comparison with earlier results. The list of parameters for the ISMIS and SYNAT databases is given in Table 3. Most of the parameters are based on the MPEG-7 standard; the others are Mel Frequency Cepstral Coefficients (MFCC) and some dedicated time-domain descriptors.

Table 3. List of parameters for ISMIS and SYNAT databases
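As an illustration of how the MFCC part of such a vector can be obtained, a minimal sketch using librosa is given below; the MPEG-7 and dedicated time-domain descriptors are not reproduced, and the file name and parameter values are illustrative only.

import numpy as np
import librosa

y, sr = librosa.load("excerpt.wav", sr=44100, mono=True)   # illustrative file name

# 20 MFCCs per analysis frame, summarized over the whole excerpt by their means
# and standard deviations, a common way to obtain fixed-length FV entries.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
mfcc_features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])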

Before the Feature Vectors (FVs) were used in the classification experiments, two normalization methods were employed for data pre-processing, i.e. Min-Max normalization and Zero-Mean normalization, as well as both methods applied jointly. The normalization of the training and test datasets is performed in such a way that the mean and standard deviation values are calculated only on the training dataset and are then applied to both the training and the test dataset. In addition, the separability of the data was checked with the Best First, Greedy, Ranker and PCA methods, and reduced FVs were employed in the main experiments.
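A minimal sketch of this train-statistics-only step for the Zero-Mean case is shown below (Min-Max normalization is handled analogously with the training minima and maxima); X_train and X_test are assumed to be NumPy arrays holding the FVs row-wise.

import numpy as np

def fit_zero_mean(X_train):
    # Statistics are computed on the training set only...
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant-valued parameters
    return mu, sigma

def apply_zero_mean(X, mu, sigma):
    # ...and then applied unchanged to both the training and the test FVs.
    return (X - mu) / sigma

mu, sigma = fit_zero_mean(X_train)
X_train_norm = apply_zero_mean(X_train, mu, sigma)
X_test_norm = apply_zero_mean(X_test, mu, sigma)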

3 Experiments

3.1 Algorithms Used in Music Genre Classification

The most popular methods for music genre classification are: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Decision Trees, Rough Sets and minimum-distance methods, to which the very popular k-Nearest Neighbor (kNN) method belongs. Preliminary experiments carried out by the authors showed that SVM classification (including co-training applied to SVM) returns the best classification accuracy, which is why the results obtained with this method are shown [13]. Despite its computational expensiveness, the kNN algorithm is a good method for multi-class classification problems and is commonly used in the MIR area, which makes it possible to compare experimental results; for these reasons its results are also presented. The core experiment, as well as the normalization methods, was implemented in the Java programming language using the Eclipse environment. The Weka library for Java was used for data management (selecting attributes by name, selecting instances, i.e. FVs of specific music excerpts) and in the classification process. Selection of the best attributes (Best First, Greedy, Ranker and PCA methods) was done in the Weka graphical interface.
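For orientation only, a minimal sketch of the two classifiers is given below using scikit-learn as a stand-in for the Weka/Java setup described above; X_train, y_train and X_test are assumed to hold the normalized FVs and genre labels, and the hyperparameter values are illustrative.

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# SVM, reported by the authors as the most accurate of the tested methods.
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
svm_pred = svm.predict(X_test)

# kNN with the Euclidean metric, kept for comparability with other MIR studies.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_train, y_train)
knn_pred = knn.predict(X_test)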

3.2 Feature Vector Optimization - Adding New Parameters

In the separation process, the signals of several specific instruments/paths (as mentioned earlier: harmonic, drum, piano, trumpet, saxophone) were retrieved with FV optimization in mind. For this purpose, new FVs were created containing the originally extracted parameters together with those based on the separated music tracks; in this way the original signal (O) with added harmonic (H) components formed the OH feature vector, etc. (a sketch of assembling such a mixed FV is given below). Several mixtures were tried; Table 4 summarizes the overall classification correctness for 16 signal mixtures using the full set of FVs (p_173, i.e. 173 parameters per signal) for the SYNAT database.
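The sketch below illustrates the assembly of such a mixed FV; feats_original and feats_harmonic are assumed to be the 173-element parameter vectors computed from the original excerpt and from its separated harmonic path, respectively.

import numpy as np

# Concatenating the descriptors of the original (O) signal and of the separated
# harmonic (H) signal yields the OH feature vector; other mixtures (OD, OHD, ...)
# are built in the same way from the corresponding separated paths.
fv_oh = np.concatenate([feats_original, feats_harmonic])   # 2 x 173 = 346 parameters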

Table 4. Results of the overall correctness of classification for 16 mixtures of signals for a full set of FVs (p_173 per each signal)

The best results were obtained for the OH signal; however, the differences between the specific mixed signals are not large. It may also be observed that mixtures of two signals are the most promising, resulting in the highest correctness; nevertheless, a combination of three signals (OHD) was also retained in the experiments. Table 5 shows the results (Precision (Prec) and True Positive (TP) rate) for the kNN algorithm and the optimized FV (only 59 parameters were retained using the PCA method) for the SYNAT database.

Table 5. kNN-based classification results

Table 6 presents the results for the SVM algorithm. The OH signal gave much better results (~4 %) than the original signal for genres such as alternative rock, blues, jazz, new age and pop (true positive rate) and Latin music (precision). This confirms that separating the harmonic path is useful for genres in which harmony plays a significant role. Further conclusions may also be drawn from these results: for the OH signal, the confusion between Pop and New Age, Pop and Latin Music, and Pop and Rap and Hip-Hop is reduced by ~2.77 % (in the case of New Age) and by ~0.88 % (in the case of the other two). Using co-training with the SVM algorithm gained approximately 1 % of correctness.

Table 6. Results for the SVM algorithm

Table 7 presents the summary of the overall CCI (Correctly Classified Instances) for the small (ISMIS) and the big (SYNAT) database, obtained with the Co-SVM method as the one giving the highest classification correctness, involving the original FVs (191 parameters for ISMIS) and the reduced ones (VoP: p_52, p_59, p_60).

Table 7. Comparison of results of overall classification for small and big database obtained for the original and mixed signals involving Co-training method

Even though the CCI results for the bigger database (SYNAT) are ~10 % lower than for the smaller one (ISMIS), this is still a very good result considering that as many as 13 genres are to be classified, instead of only six as in the case of the ISMIS database. Moreover, some of those 13 genres are similar and are often confused not only by machine learning methods but also by human listeners, which makes them considerably more difficult to classify correctly.

Finally, the nine music genres common to the SYNAT and GTZAN databases were compared in the context of FV robustness. In these experiments the kNN algorithm (with the Euclidean distance function) and the ‘basic’ FV (containing 173 parameters) were used, resulting in the following approximate classification accuracies: Classical 88 %, Blues 70 %, Country 59 %, DanceDj 52 %, HardRock 78 %, Jazz 68 %, Pop 71 %, Rap 58 %, Rock 51 %. The results obtained for the GTZAN database are comparable to those for the SYNAT database. The slightly lower classification efficiency for these particular genres may be due to the greater variety of songs in the GTZAN database. The database also contains recordings of reduced quality, which may adversely affect the effectiveness of the overall parameterization.

4 Summary

In this paper a new strategy for music genre classification was proposed. Its main principle is to separate music tracks in the pre-processing phase and to extend the parameter vector with descriptors related to those musical instrument components that are characteristic of a specific musical genre. This allows for more efficient automatic musical genre classification. It was also shown that extending the original signal with even one or two separated paths (as in the OP, OT and OS signals) influences the classification results for specific genres, which confirms the importance of specific parameters in relation to music genre. It should also be noted that every type of mixed signal gave better overall classification correctness than the original signal.