1 Introduction

With the advent of the internet, both the number of songs being created and the number available to the average listener have grown enormously. Simply put, the volume is overwhelming. Sifting through this deluge of songs manually is neither practical nor appealing; the process needs to be automated.

Automatic classification of music is a growing field with the primary goal of making it easier for people to find songs they like and for vendors to present those songs to their listeners. It can also lay the foundation for representing the similarity between two musical pieces and for building good recommendation systems.

Given the complex nature of music, its classification requires specialized representations, abstractions and processing techniques for effective analysis, evaluation and classification, fundamentally different from those used for other media and tasks.

2 Literature Review

2.1 History of MIR and Music Classification

The field of Music Information Retrieval (MIR) can be traced back to the 60s with the work of Kassler [1]. Even Automatic Transcription of Music was attempted as early as the 70s [2]. However, two limiting factors prevented progress in the field at the time. Firstly, the computational power demanded by the problem domain was simply not available. Secondly, related fields of study such as Digital Signal Processing, Speech Processing, and Machine Learning were not advanced enough either. So, the field stalled for the next few decades.

In the 1990s, the field regained prominence as computational resources improved greatly and the rise of the internet resulted in massive online music collections. So, there was both an opportunity and a demand for MIR systems. The organization of the first International Symposium on Music Information Retrieval (ISMIR) in 2000 highlights this resurgence of interest in the field; the ISMIR 2015 conference in Malaga drew 280 participants from 25 different countries.

As for the methodologies used, MIR in the 90s was influenced by Text Information Retrieval (IR), the field concerned with searching and retrieving text documents based on user queries. So, most of the algorithms were developed based on symbolic representations such as MIDI files [3]. One such method is described in [4].

However, recognizing approximate units of meaning in music, as is done in many text-IR methods, proved difficult [5].

Instead, statistical non-transcriptive approaches for non-speech audio signals started being adopted in the second half of the 90s [3]. This was probably influenced by the progress of such methods in speech processing. For example, in [6], the authors reported 98% accuracy in distinguishing music from speech in commercial radio broadcasts, based on statistics of the energy contour and the zero-crossing rate.

In [7], the authors introduced similar statistical methods for retrieval and classification of isolated sounds. Similarly, in [8], an algorithm for music-speech classification based on spectral features was introduced and trained using supervised learning.

And so, starting in the 2000s, instead of attempting note-level transcriptions, researchers focused on direct extraction of information from audio signals using Signal Processing and Machine Learning techniques.

Currently, three basic strategies are being applied in MIR [9]:

  • Based on Conceptual Metadata - Suited for low-specificity queries.

  • Using High-level Descriptions - Suited for mid-specificity queries.

  • Using Low-level Signal-based Properties - Used across all specificity levels.

Still, most MIR techniques employed at present use low-level signal features rather than high-level descriptors [10]. Thus, a semantic gap exists between human perception of music and how MIR systems work.

2.2 Audio Processing

Music signal processing may appear to be the junior relation of the large and mature field of speech signal processing, not least because many techniques and representations originally developed for speech have been applied to music, often with good results. However, music signals have characteristics that distinguish them from speech and other signals [11].

2.3 Genre Based Classification

In [12], Scaringella et al. discuss how and why musical genres are a poorly defined concept, which makes the task of automatic classification non-trivial. Although the boundaries between genres are fuzzy and no precise definitions exist, genre remains one of the most widely used ways of classifying music. As for human capability in genre classification, Perrot et al. [13] found that people classified songs in a ten-way classification setup with an accuracy of 70% after listening to 3-second excerpts.

The features used for genre based classification have been heavily influenced by the related field of speech recognition. For instance, Mel-frequency Cepstral Coefficients (MFCC), a set of perceptually motivated features that is widely used in music classification, was first used in speech recognition.

The seminal paper on musical genre classification by Tzanetakis et al. [14] presented three feature sets for representing timbral texture, rhythmic content and pitch content. With the proposed feature set, they achieved a classification accuracy of 61% for ten musical genres.

Timbral features are usually calculated for every short-time frame of sound based on the Short Time Fourier Transform (STFT). So, these are low-level features. Typical examples are Spectral Centroid, Spectral Rolloff, Spectral Flux, Energy, Zero Crossings, and the afore-mentioned Mel-Frequency Cepstral Coefficients (MFCCs). Among these, MFCC is the most widely preferred feature [15, 16]. Logan [17] investigated the applicability of MFCCs to music modeling and found it to be “at least not harmful”.
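
As a rough illustration, the sketch below computes several of these frame-level timbral features with the librosa library. The library choice, file name, frame length and number of MFCCs are assumptions for illustration only, not the settings of any cited work.

```python
# Illustrative only: frame-level timbral features with librosa
# (library choice, file name and parameters are assumptions).
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050)       # mono signal at 22 050 Hz

n_fft, hop = 1024, 1024                          # frame length and hop, in samples

mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
zcr      = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)

# Spectral flux: frame-to-frame change of the magnitude spectrum.
S    = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
flux = np.sqrt(np.sum(np.diff(S, axis=1) ** 2, axis=0))
```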

Rhythmic features capture the recurring pattern of tension and release in music while pitch is the perceived fundamental frequency of the sound. These are usually termed as mid-level features.

Apart from these, many non-standard features have been proposed in the literature.

Li et al. [18] proposed a new set of features based on Daubechies Wavelet Coefficient Histograms (DWCH), and also presented a comparative study with the features included in the MARSYAS framework. They showed that these features significantly improved classification accuracy.

Anglade et al. [19] propose the use of harmony as a high-level descriptor of music.

Music classification has been attempted through a variety of methods. Some of the popular ones are SVMs, k-Nearest Neighbours and variants of neural networks. The reported results vary widely. In [20], 61% accuracy was achieved using a Multilayer Perceptron based approach, while in [21] the authors achieved 71% accuracy through the use of an additional rejection and verification stage. Haggblade et al. [22] compared simpler approaches (k-NN and k-Means) with more sophisticated neural networks and SVMs, and found that the latter gave better results.

However, many unique methods, either completely novel or variations of standard methods, have been put to use as well. In [23], the authors propose a method that uses chord labeling (as ones and zeros) in conjunction with a k-window subsequence matching algorithm, used to find subsequences within music sequences, and a decision tree for the actual genre classification.

It is also noted that high-level and contextual concepts can be as important as low-level content descriptors [19].

2.4 Mood Based Classification

As mood is inherently subjective, Mood Based Classification, also known as Music Emotion Recognition (MER), requires knowledge of both the technical aspects and the human emotional system.

Generally, emotions are conceptualized in two ways:

Categorical Conceptualization. This approach to MER categorizes emotions into distinct classes. It requires a set of base emotions (happiness, anger, sadness, etc.) from which other secondary emotion classes can be derived [24]. However, this approach runs into the problem that the whole spectrum of human emotions cannot be captured by a small number of classes.

Dimensional Conceptualization. This approach represents emotions as numerical values over a number of emotion dimensions, so the focus is on distinguishing emotions based on their position in a predefined space. Most of these conceptualizations map onto three axes of emotion: valence (pleasantness), arousal (activation) and potency (dominance). By placing emotions on a continuum instead of labeling them as discrete classes, this approach can encompass a wide variety of emotions.

Thayer [25] proposed a similar two-dimensional approach, adopting the theory that mood arises from two factors: stress (happy/anxious) and energy (calm/energetic). This divides music mood into four clusters: Contentment, Depression, Exuberance and Anxious/Frantic (Fig. 1).

Fig. 1. Thayer's two-dimensional model of mood

Although the two-dimensional approach has been criticized as deficient (leading to the proposal of a third dimension, potency), it seems to offer the right balance between sufficient “verbosity” and low complexity [26].

So, we use a similar simplified two-dimensional model based on arousal and valence (Fig. 2):

Fig. 2. Two-dimensional model of mood based on Arousal and Valence

2.5 Features in MER

Some of the commonly used features in MER are:

  • Energy: Energy-related features such as audio power, loudness, and specific loudness sensation coefficients (SONE) are correlated with the perception of arousal. Lu et al. [27] used them to classify arousal.

  • Melody: These include features such as pitch (the perceived fundamental frequency), chromagram centroid, etc.

  • Timbre: As with the AMGC problem, MFCC is widely used in MER too. Apart from MFCC, octave-based spectral contrast and DWCH (Daubechies wavelet coefficient histograms) have also been proposed in the literature (a brief extraction sketch follows this list).
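
As a rough sketch of the first two feature groups, the snippet below computes a frame-wise energy curve and a simple chromagram centroid. The use of librosa, the file name, and the centroid definition are assumptions for illustration, not taken from any cited MER system.

```python
# Illustration only (library, file name and centroid definition are assumptions):
# frame-wise energy and a simple chromagram centroid.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)

rms    = librosa.feature.rms(y=y)[0]               # frame-wise energy, an arousal cue
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # 12-bin pitch-class energy per frame

# Chromagram centroid: energy-weighted mean pitch class per frame.
bins = np.arange(12)[:, None]
chroma_centroid = (bins * chroma).sum(axis=0) / (chroma.sum(axis=0) + 1e-9)
```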

So, we see that the features used in MER are almost the same as those in AMGC. However, Fu et al. note in their extensive survey on Audio-based Music Classification [28] that although their effectiveness is debatable, mid-level features such as Rhythm seem to be more popular in MER.

The algorithms used in AMGC are also popular in MER: support vector machines, Gaussian mixture models, neural networks, and k-nearest neighbours are regularly used.

3 Methodology

3.1 Audio Signal Pre-processing

Pre-processing in music classification systems is used to increase the efficiency of the subsequent feature extraction and classification stages and thereby improve overall system performance. Pre-processing commonly includes framing and windowing of the input signal. At the end of pre-processing, the smoothed frames are forwarded to the feature extraction stage.

Framing. Framing is the process of dividing the whole audio sample into frames. Although an audio signal changes continuously, we can assume that it is statistically stationary over short time scales. So, we divide the signal into 20–40 ms frames. A shorter frame contains too few samples, while over a longer one the signal varies too much.

Windowing. Windowing is necessary because a finite Fourier transform implicitly treats the frame as one period of an infinitely repeating signal. If the start and end of the frame do not match, the mismatch looks like a discontinuity in the signal and shows up as spurious high-frequency content in the Fourier transform. Even a pure sinusoid, if it does not complete an integer number of periods within the frame, will spread appreciable energy across frequencies far from its true frequency. Applying a window function that tapers the frame towards zero at its edges mitigates this spectral leakage.
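
A minimal NumPy sketch of framing and Hann windowing is given below; the frame and hop lengths are placeholders, not necessarily the values used later in this paper.

```python
# Minimal framing + windowing sketch (frame/hop values are placeholders).
import numpy as np

def frame_and_window(signal, frame_len=1024, hop_len=1024):
    """Split a 1-D signal into frames and apply a Hann window to each."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    window = np.hanning(frame_len)      # tapers frame edges to reduce spectral leakage
    return frames * window

audio = np.random.randn(30 * 22050)                              # stand-in for 30 s of audio
spectra = np.abs(np.fft.rfft(frame_and_window(audio), axis=1))   # per-frame magnitude spectrum
```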

3.2 Feature Integration

As the features are temporal, the feature integration is also temporal. We used the mean and variance of the features for temporal feature integration, although these capture neither the temporal dynamics nor the dependencies among individual feature dimensions. As seen below, the means and standard deviations of MFCCs for a classical and a hip-hop song are sufficiently distinguishable, so this representation of the features can be used to separate classes of music (Fig. 3).
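
A sketch of this integration step: collapse the (features × frames) matrix of frame-level features into a single song-level vector of per-feature means and standard deviations. The dimensions below are illustrative.

```python
# Temporal feature integration by mean and standard deviation (dimensions illustrative).
import numpy as np

def integrate(frame_features):
    """frame_features: array of shape (n_features, n_frames) -> (2 * n_features,)."""
    return np.concatenate([frame_features.mean(axis=1), frame_features.std(axis=1)])

mfcc = np.random.randn(13, 1290)     # e.g. 13 MFCCs over 1290 frames
song_vector = integrate(mfcc)        # 26-dimensional song-level representation
```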

Fig. 3. Comparison of means for Classical and Hiphop songs

Fig. 4. Comparison of standard deviations for Classical and Hiphop songs

3.3 Dataset

The publicly available GTZAN dataset introduced in [14] has become one of the standard datasets for music genre classification used by researchers across the world. We too used this dataset. The dataset contains 100 representative excerpts for each of ten genres, taken from radio, compact disks, and MP3 compressed audio files. All the files are stored as 22,050 Hz, 16-bit mono audio. The dataset has the following classes: classical, country, disco, hiphop, jazz, rock, blues, reggae, pop, metal (Fig. 4).
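
As a hedged illustration, the snippet below iterates over the dataset assuming the commonly distributed genres/&lt;genre&gt;/ directory layout; the original distribution ships .au files, so adjust the extension if your copy has been converted to WAV.

```python
# Assumed GTZAN layout: genres/<genre>/<genre>.00000.au, 100 files per genre.
from pathlib import Path
import librosa

GENRES = ["classical", "country", "disco", "hiphop", "jazz",
          "rock", "blues", "reggae", "pop", "metal"]

def iter_gtzan(root="genres", ext="*.au"):
    for genre in GENRES:
        for path in sorted(Path(root, genre).glob(ext)):
            y, sr = librosa.load(path, sr=22050, mono=True)
            yield genre, y, sr
```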

For mood based classification, Soleymani et al. [29] created a 1,000-song dataset for emotional analysis of music in 2013, which uses the valence-arousal axes to represent emotional values for songs. The songs in the dataset, each 45 s long, were collected from the Free Music Archive (FMA). The authors used Amazon Mechanical Turk as a crowdsourcing platform to collect more than 20,000 annotations on the 1,000 songs.

Furthermore, their analysis of the annotations revealed higher agreement in the arousal ratings than in the valence ratings. We used a filtered version of that dataset (with some redundancies removed), resulting in a final set of 744 songs. We further labeled the songs as high/low arousal and high/low valence based on the numerical values in the dataset. To achieve an equal number of songs in each class, we finally used 600 of those 744 songs.
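
An illustrative labelling step is sketched below, assuming the per-song mean valence and arousal ratings are available in a CSV file; the file name, column names, and the median-split rule are assumptions, not necessarily the exact procedure used here.

```python
# Hypothetical labelling sketch: split songs into high/low arousal and valence
# at the median rating (file and column names are assumptions).
import pandas as pd

df = pd.read_csv("annotations.csv")              # columns assumed: song_id, valence, arousal
df["arousal_class"] = (df["arousal"] >= df["arousal"].median()).map({True: "high", False: "low"})
df["valence_class"] = (df["valence"] >= df["valence"].median()).map({True: "high", False: "low"})
```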

3.4 Classifier

For the classification process, we used Support Vector Machines and Feed-Forward Artificial Neural Networks.

Support Vector Machine. Support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data for classification. SVMs are popular in this domain, with many papers [16, 19, 21, 22] reporting their use. For the construction of a multi-class SVM, we use the one-vs-one approach, which leads to \(\frac{N(N-1)}{2}\) binary classifiers for \(N\) classes.

Genre Based Classification: For genre based classification, a linear kernel is used with the soft-margin method.

Mood Based Classification: For mood based classification, Gaussian and Laplacian kernels, which are non-linear, are used.
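
The sketch below shows one way to realize these configurations with scikit-learn; the library choice and hyperparameters are assumptions, since the paper does not name an implementation. SVC trains one-vs-one binary classifiers internally, and the Laplacian kernel can be supplied as a custom kernel callable.

```python
# Illustrative scikit-learn setup (library choice and hyperparameters are assumptions).
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import laplacian_kernel

# Genre: linear kernel with a soft margin (C controls margin softness).
genre_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Mood: Gaussian (RBF) kernel, or a Laplacian kernel passed as a callable.
mood_svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
mood_svm_lap = make_pipeline(StandardScaler(), SVC(kernel=laplacian_kernel))

# Usage: genre_svm.fit(X_train, y_train); predictions = genre_svm.predict(X_test)
```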

Feed-Forward Neural Network. A Feed-Forward Neural Network is a type of neural network architecture in which connections do not form cycles and information flows only forward. Several papers [16, 19, 20, 21, 22] report the use of artificial neural networks for music classification. For training, the backpropagation algorithm is used, which calculates the error at the output layer and propagates it back to the earlier layers.

Genre Based Classification: For genre based classification, we used the cross-entropy error function with a softmax output activation, so that the outputs can be interpreted as class probabilities.

Mood Based Classification: For mood based classification, we used the least-mean-squares error function with a logistic sigmoid activation function.
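
A hedged Keras sketch of the two network configurations described above is given below; the layer sizes, hidden activations, optimizer, and other hyperparameters are illustrative rather than the exact settings used here.

```python
# Illustrative Keras models (hyperparameters and hidden activations are assumptions).
from tensorflow import keras

def genre_net(n_features, n_genres, hidden=30):
    # Softmax output + categorical cross-entropy: outputs read as class probabilities.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(hidden, activation="sigmoid"),
        keras.layers.Dense(n_genres, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def mood_net(n_features, hidden=30):
    # Logistic-sigmoid output + mean-squared-error loss for binary high/low labels.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(hidden, activation="sigmoid"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="mse", metrics=["accuracy"])
    return model
```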

4 Results and Discussion

4.1 Effect of Frame Size

As seen in the figure, frame sizes of 11.5 ms and 23 ms performed considerably better than the larger frame sizes. We chose the 23 ms (1024 samples) frame size because the smaller 512-sample frame size would lead to a higher number of frames and hence require more computation (Fig. 5).

Fig. 5. Effect of frame size

4.2 Effect of Frame Overlap

We explored four different overlapping schemes: 0%, 25%, 50% and 75% overlap. In each case, we obtained almost the same accuracy (75.4% with no overlap, 75.8% with quarter overlap, 76% with half overlap, and 75.2% with three-quarters overlap). Since overlap appeared to have no bearing on our results, we chose the less computationally intensive option of using no overlap at all.

4.3 Comparison of Features

Genre Based Classification. MFCC was found to be the best feature for genre classification (in fact, it was found to do well in mood classification too) (Table 1).

Table 1. Genre classification using ANN and SVM

Mood Based Classification. Results favor MFCC here too (Tables 2 and 3).

Table 2. Mood classification (Arousal) using ANN and SVM
Table 3. Mood classification (Valence) using ANN and SVM
Fig. 6. Effect of number of MFCCs on result

4.4 Effect of MFCCs on the Result

The results indicate that once we use more than 10 MFCCs, the accuracy plateaus and does not increase further. Using around 15 coefficients was found to be sufficient for the problem domain (Fig. 6).

4.5 Effect of Number of Hidden Nodes

We used only one hidden layer, as it should be enough for our problem domain. As seen in the figure, for any number of hidden nodes above six or so, we get almost the same accuracy. As a rule of thumb, it is usually recommended that the number of hidden nodes be around the mean of the number of inputs and outputs, so we chose 30 as our final number of hidden nodes (Fig. 7).

Fig. 7. Effect of number of hidden nodes

The number of hidden nodes had minimal effect on mood classification.

4.6 Effect of Number of Iterations

As seen in the figure, for genre classification, the number of iterations has an effect on the accuracy up to a certain point (around 20 iterations) (Fig. 8).

Fig. 8. Effect of number of iterations

As for Arousal, the increase in iterations had no effect on the accuracy (Fig. 9).

Fig. 9. Final SVM and ANN comparison based on genre

4.7 Final Results

Fig. 10. Final SVM and ANN comparison based on arousal

Table 4. Genre classification performance measure

Genre Classification. For our final model we used an ANN with these features: MFCC, Spectral Centroid, Zero Crossing, Compactness and RMS (Fig. 10 and Tables 4, 5 and 6).

Table 5. Genre classification confusion matrix
Table 6. Arousal classification performance measure
Fig. 11. Final SVM and ANN comparison based on valence

Mood Classification. For our final model we used an ANN with these features: Spectral Centroid, MFCC, Zero Crossing, Compactness, Rhythm, Spectral Flux, RMS and Spectral Variability (Fig. 11 and Tables 7, 8 and 9).

Table 7. Arousal classification confusion matrix
Table 8. Valence performance measure

Table 9. Valence confusion matrix

5 Conclusion

Any type of classification of music is difficult simply because the classes themselves lack clear definitions. Still, we can work with the fuzzy boundaries between these classes to get good enough results from music classification systems.

In this paper, we studied many such components and approaches: types and combinations of features for properly representing songs, feature integration approaches, classifier types and their parameters, etc.

All these studies were done in order to tackle two related but distinct problems:

  • In Automatic Music Genre Classification (AMGC), good performance was achieved with both of the classifiers employed: the final SVM model achieved 83% accuracy, while the ANN model achieved 88% accuracy for five genres. These results are comparable with state-of-the-art results, especially those involving the same dataset.

  • In Music Mood Classification, however, these good results could not be replicated. The results along both axes of the mood model used (arousal and valence) were underwhelming: around 73% accuracy was achieved using the ANN for binary low/high arousal classification, and the SVM did worse with around 70% accuracy. For low/high valence classification, both classifiers settled at 67% accuracy.

5.1 Limitations and Future Work

Distance Measure for Songs. One way to achieve song clustering, or even classification, is to develop distance measures that quantify the difference between any two given songs. We attempted this as well; however, our initial attempts using a simple Euclidean distance were unsuccessful, and later attempts using Gaussian Mixture Models proved too computationally intensive to be useful.

Future work could focus on figuring out appropriate distance measures for specific types of music being compared.

Feature Learning. Filtering and pre-processing might result in better high-level features. Or perhaps unsupervised feature learning methods as done in [30] might yield even better features. These approaches weren’t explored in this paper.

Deep Learning. Future work could involve application of deep learning techniques in the problem domain.

Multi-tagging. A song can belong to multiple genres, so it is bound to contain features characteristic of multiple genres. Future work could address resolving this ambiguity.