1 Introduction

Music plays an important role in everyday life. In the words of the German philosopher Friedrich Nietzsche, “without music, life would be a mistake”. Researchers have taken a growing interest in automatically analyzing the emotional content of music, and several recent developments in Music Information Retrieval systems have been reported [1–7]. Music conveys emotion, and as music databases grow in size, recognizing emotion from music plays a significant role in improving music retrieval. Emotion-based retrieval is useful in applications such as music recommendation systems [8], content-based search, music therapy, song selection on portable devices [9], and TV and radio programming. Much work has been done on analyzing and recognizing the emotional content of music, but most researchers treat it as a single-label classification task [3–5, 10]. A piece of music, however, may convey several emotions at the same time, so emotion recognition from music is more naturally a multi-label classification task. The objective of this research work is therefore to focus on multi-label classification methods and to develop an emotion recognition system using a music database.

The goal of multi-label classification is to obtain a model that assigns a set of class labels to each data sample, unlike multi-class classification, in which the classifier predicts a single class [11–13]. In recent years, multi-label learning has gained popularity in the research community, which has led to the development of a variety of multi-label learning algorithms. In multi-label learning, a classifier learns from a number of data samples, each of which can be associated with multiple classes, and is then able to predict the possible class labels for a new data sample. This differs from single-label classification, where each data sample belongs to exactly one class label from a set of disjoint class labels \({\mathcal{L}}\). A single-label classification problem is a binary classification problem for \(\left| {\mathcal{L}} \right| = 2\) and a multi-class classification problem for \(\left| {\mathcal{L}} \right| > 2\). The need for multi-label classification arises in various real-world problems: in text categorization a text may belong to several categories, in medical diagnosis a patient may have diabetes and cancer at the same time, and similar situations occur in image and email classification.

There are two ways to handle a multi-label classification task, which are discussed in detail in Sect. 2. In this research work, we propose a Binary Relevance based Least Squares Twin Support Vector Machine (BR-LSTSVM) classifier, in which the multi-label problem is divided into several single-label classification problems. The reason for choosing the Binary Relevance method is discussed in Sect. 3. The literature survey shows that the Support Vector Machine (SVM) has performed better than other existing classifiers [14–19], but it suffers from high computational complexity [20, 21]. To handle this problem, Jayadeva et al. proposed the Twin Support Vector Machine (TWSVM), which is approximately four times faster than the conventional SVM [21]. TWSVM is not only faster but also shows better performance than SVM. However, TWSVM still requires the optimization of two Quadratic Programming Problems (QPPs). To reduce this cost further, Kumar et al. proposed a binary classifier named LSTSVM, the least squares variant of TWSVM [22]. In LSTSVM, the two QPPs are transformed into two systems of linear equations, which are easy to solve. In this paper we therefore use LSTSVM as the base classifier because of its better generalization ability, faster computational speed and easier implementation, and adopt a Binary Relevance based LSTSVM multi-label classifier for emotion recognition from music.

The paper is organized as follows: Sect. 2 surveys the literature on emotion detection systems using music data and on multi-label classification methods. Sect. 3 gives a detailed description of the proposed approach. The dataset, the performance evaluation parameters and the results are discussed in Sect. 4. Conclusions are drawn in Sect. 5.

2 Literature Survey

2.1 Emotion Recognition System

Feng et al. analyzed two musical features (tempo and articulation) and used them to recognize four emotions: happiness, fear, anger and sadness [23]. Lie et al. recognized emotion from music acoustic data [24]. They used rhythm and timbre features to represent the stress dimension and an intensity feature for the energy dimension of Thayer's model. Yang et al. developed a Music Emotion Recognition System based on the idea of content-based retrieval [3]. They used a regression approach to predict the arousal and valence values of each music sample, so that each sample becomes a point in the arousal-valence plane and users can retrieve songs efficiently by specifying a point in that plane. Han et al. developed a music emotion recognition system (SMERS) using Support Vector Regression [5]. They focused on predicting the arousal and valence of the audio content of a song. They extracted seven musical features (rhythm, pitch, tonality, harmonics, tempo, key and loudness) and, based on these features, recognized eleven emotions: angry, excited, relaxed, sleepy, pleased, bored, sad, nervous, calm, happy and peaceful. In another work, Yang et al. considered both lyrics and audio features and recognized emotion from music using SVM [25]. They divided the music samples into several frames and extracted both textual and audio features from them. Trohidis et al. used the “Tellegen-Watson-Clark” model of mood and treated emotion detection as a multi-label classification task [7]. The authors compared four multi-label classification approaches, among which random k-labelsets (RAkEL) performed well. Tzacheva et al. used Thayer's model to represent different emotions [4]. They treated emotion detection as a single-label classification task and compared the performance of a Bayesian Neural Network and a J48 Decision Tree in detecting four emotions.

2.2 Multi-label Classification

Multi-label classification methods are mainly divided into two categories: problem transformation methods and algorithm adaptation methods.

2.2.1 Problem Transformation Method

In the problem transformation method, the multi-label classification problem is transformed into a set of single-label problems. This approach is independent of the learning algorithm, so any existing single-label classification technique can be applied to the transformed problems. Various problem transformation methods are available in the literature and have been used by researchers to transform a multi-label classification problem into single-label problems [11–13, 26, 27].

  • Binary Relevance (BR): This is one of the most popular problem transformation approaches. The multi-label dataset is divided into k single-label datasets \(({\text{k}} = \left| {\mathcal{L}} \right|)\), one for each class label, and a binary classifier is constructed for each label. Spyromitros et al. proposed a classifier for multi-label data called BRkNN, which is BR followed by kNN (k-Nearest Neighbor) [28]. If the cost of computing the k nearest neighbors is ‘C’, then the computational cost of a naive BRkNN is \(\left| {\mathcal{L}} \right| \times C\). This cost can be reduced by performing a single neighbor search that is shared by all labels; even so, the method does not consider the correlation among class labels and generates independent predictions for each class label. The Classifier Chaining (CC) method, which is closely related to BR, also uses ‘k’ binary classifiers. It was proposed by Read et al.; the k binary classifiers are linked along a chain in which the ith classifier handles the classification problem associated with class label ‘i’ [29].

  • Ranking by Single Label: In this method, the multi-label dataset is transformed into a single-label dataset in one of several ways, such as ignoring instances that carry more than one label, selecting one label at random, selecting the label with the maximum or minimum count in the dataset, or assigning a weight to each label. A single-label classifier then generates a vote (probability) for each class label, from which the labels are ranked [13, 30].

  • Ranking by Pairwise Comparison (RPC): This method transforms the multi-label dataset into k(k-1)/2 binary datasets, where \({\text{k}} = \left| {\mathcal{L}} \right|\), and requires the training of k(k-1)/2 binary classifiers, one for each pair of class labels. The binary dataset for a pair of class labels consists of those examples that are associated with exactly one of the two labels. For a new data sample, all the binary classifiers are invoked and the labels are ranked by counting the votes obtained from each classifier [13, 30].

  • Calibrated Label Ranking (CLR): This method, proposed by Fürnkranz et al., is an extension of the RPC approach discussed above [31]. An additional label \(c_{0}\) (the calibration label) is added to the original multi-label dataset and partitions the labels into relevant and irrelevant class labels. The resulting ranking has the form \(c_{i_{1}} > c_{i_{2}} > \cdots > c_{i_{j}} > c_{0} > c_{i_{j + 1}} > \cdots > c_{i_{k}}\), where the labels ranked above \(c_{0}\) are the relevant ones. Each data sample that is associated with a class label is treated as a positive sample for that label and as a negative sample for the calibration label. Binary classifiers are then trained on these datasets to discriminate between the class labels and the calibration label, and for a new data sample the labels are ranked by counting the votes [11, 31]. A variant of CLR, the Quick Weighted algorithm for Multi-label Learning (QWML), was proposed by Mencia et al. [32]. The voting strategy of QWML differs from the majority voting used by CLR: it focuses on classes with “low voting loss”.

  • Label Powerset (LP): This method treats each unique set of class labels in the original dataset as a single class label and thereby transforms the original multi-label dataset into a single-label dataset. The task becomes a single-label (multi-class) classification problem, and the most probable class, i.e. label set, is assigned to a new data sample. The random k-labelsets (RAkEL) approach is an ensemble of LP classifiers in which each LP classifier is trained on a different random subset of the labels.
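The transformations listed above can be illustrated on a small label-indicator matrix. The following is a minimal sketch, assuming labels are stored as a 0/1 NumPy array with one row per sample; the function names and the toy matrix are hypothetical and are not taken from the cited works.

```python
from itertools import combinations

import numpy as np


def binary_relevance(Y):
    """BR: one binary target vector per label (one column of Y each)."""
    return [Y[:, j].copy() for j in range(Y.shape[1])]


def label_powerset(Y):
    """LP: map every distinct label set (row of Y) to one class id."""
    labelsets, class_ids = np.unique(Y, axis=0, return_inverse=True)
    return labelsets, class_ids.ravel()


def pairwise_datasets(Y):
    """RPC: for each label pair (a, b) keep the samples that carry exactly
    one of the two labels; the binary target is 1 when label a is present."""
    datasets = {}
    for a, b in combinations(range(Y.shape[1]), 2):
        keep = np.flatnonzero(Y[:, a] != Y[:, b])
        datasets[(a, b)] = (keep, Y[keep, a])
    return datasets


# Toy label matrix: 4 samples, k = 3 labels (hypothetical data).
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

br_targets = binary_relevance(Y)          # k = 3 binary problems
labelsets, lp_classes = label_powerset(Y)  # 3 distinct label sets
rpc = pairwise_datasets(Y)                # k(k-1)/2 = 3 pairwise problems
```

With k = 3 labels, BR yields three binary problems, LP yields one multi-class problem whose classes are the distinct label sets, and RPC yields three pairwise problems.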

2.2.2 Algorithm Adaptation Method

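In the algorithm adaptation method, an existing learning algorithm is modified so that it handles multi-label data directly. Clare and King adapted the C4.5 algorithm to multi-label classification by modifying the entropy calculation [33]: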

$$\text{Entropy} = - \mathop \sum \limits_{i = 1}^{N} \left( {p\left( {\lambda_{i} } \right)\log p\left( {\lambda_{i} } \right) + q\left( {\lambda_{i} } \right)\log q\left( {\lambda_{i} } \right)} \right)$$
(1)

where \(p\left( {\lambda_{i} } \right)\) represents the relative frequency of class \(\lambda_{i}\), \(q\left( {\lambda_{i} } \right) = 1 - p\left( {\lambda_{i} } \right)\), and N is the number of class labels. This modified entropy based C4.5 approach allows multiple class labels at the leaves.
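As a small worked illustration of Eq. (1), the sketch below evaluates the modified entropy for the samples that reach a tree node. It assumes a 0/1 label matrix and a base-2 logarithm (the source does not state the base), and uses the usual convention that 0 · log 0 = 0; the function name is hypothetical.

```python
import numpy as np


def multilabel_entropy(Y):
    """Modified entropy of Eq. (1) for a 0/1 label matrix Y (samples x labels).

    p[i] is the relative frequency of label i among the samples and
    q[i] = 1 - p[i]. Terms with probability 0 contribute nothing.
    """
    Y = np.asarray(Y, dtype=float)
    p = Y.mean(axis=0)
    q = 1.0 - p

    def plogp(v):
        out = np.zeros_like(v)
        mask = v > 0
        out[mask] = v[mask] * np.log2(v[mask])  # base 2 is an assumption
        return out

    return float(-(plogp(p) + plogp(q)).sum())


# Label 0 is present in half of the samples, label 1 in none of them.
node_labels = [[1, 0], [0, 0], [1, 0], [0, 0]]
node_entropy = multilabel_entropy(node_labels)
```

Here the first label contributes one bit of entropy and the second contributes none, so a split that separates the samples by their first label would remove all of the remaining uncertainty.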

  • Boosting: AdaBoost.MH and AdaBoost.MR are two extensions of the basic AdaBoost algorithm for multi-label data classification. AdaBoost.MH takes examples in the form of example-label pairs and increases the weight of misclassified example-label pairs in each iteration; it minimizes the Hamming loss. AdaBoost.MR, in contrast, finds a hypothesis that ranks the correct class labels above the incorrect ones. Ensemble of Classifier Chains (ECC) is a multi-label classification technique obtained by combining several CC classifiers \(c_{1}, c_{2}, c_{3}, \ldots, c_{n}\). Each CC classifier \(c_{k}\) is trained on a random subset of the multi-label dataset and gives different multi-label predictions. The results of the CC classifiers are combined per label, so that each label obtains a number of votes [29].

  • Lazy Learning: Various algorithm adaptation approaches based on k-Nearest Neighbor have been proposed; they differ in how the class labels of the neighboring data samples are aggregated. Multi-label k-Nearest Neighbor (ML-kNN) is a lazy learning approach for multi-label data classification. It uses the maximum a posteriori (MAP) rule to predict the class labels by reasoning with the labeling information embodied in the neighbors [34].

3 Proposed Algorithm

Consider a training dataset \(D = \{(x_{i} ,Y_{i} )\}, 1 \le i \le n\), of ‘n’ data samples, where \(x_{i} \in \chi\) (data sample space) represents an input data sample and \(Y_{i} \in y\) (label vector space) represents its class labels. Let each data sample consist of ‘m’ features. The objective of multi-label learning is to obtain a multi-label classifier that optimizes the evaluation parameters. In this research work, we adopt the Binary Relevance (BR) problem transformation method to develop a multi-label classifier for emotion recognition from music. Although BR does not handle label dependency, it has several advantages over other existing methods: its complexity is linear in the number of class labels, and any binary classifier can be used as the base learner [35]. Binary Relevance divides the multi-label dataset into \(\left| {\mathcal{L}} \right|\) datasets, one for each class, and thereby divides the multi-label classification problem into \(\left| {\mathcal{L}} \right|\) single-label classification problems. Each dataset \(D_{j} , 1 \le j \le k\) (where \(k = \left| {\mathcal{L}} \right|\)), treats the data samples that carry the jth label as positive and all others as negative. Binary Relevance predicts the label set of a new data sample by combining the labels that are positively predicted by the individual classifiers. As the base classifier, we use the Least Squares Twin Support Vector Machine because of its better generalization ability and faster computational speed. LSTSVM is a binary classifier that constructs two hyper-planes, one for each class, by solving the following two equality-constrained problems, each of which reduces to a system of linear equations:

$$\begin{aligned} & \mathop {\min }\limits_{{{\text{w}}_{1} ,{\text{b}}_{1} ,\xi }} \;\frac{1}{2}\parallel {\text{X}}_{1} {\text{w}}_{1} + {\text{e}}_{1} {\text{b}}_{1} \parallel^{2} + \frac{{{\text{c}}_{1} }}{2}\xi^{\text{T}} \xi \\ & {\text{s}}.{\text{t}}.\;\; - \left( {{\text{X}}_{2} {\text{w}}_{1} + {\text{e}}_{2} {\text{b}}_{1} } \right) + \xi = {\text{e}}_{2} \\ \end{aligned}$$
(2)
$$\begin{aligned} & \mathop {\min }\limits_{{{\text{w}}_{2} ,{\text{b}}_{2} ,\eta }} \;\frac{1}{2}\parallel {\text{X}}_{2} {\text{w}}_{2} + {\text{e}}_{2} {\text{b}}_{2} \parallel^{2} + \frac{{{\text{c}}_{2} }}{2}\eta^{\text{T}} \eta \\ & {\text{s}}.{\text{t}}.\;\;\left( {{\text{X}}_{1} {\text{w}}_{2} + {\text{e}}_{1} {\text{b}}_{2} } \right) + \eta = {\text{e}}_{1} \\ \end{aligned}$$
(3)

where \({\text{X}}_{1}\) and \({\text{X}}_{2}\) are the matrices containing the data samples of the positive and negative class, respectively; \({\text{w}}_{1}\) and \({\text{w}}_{2}\) are the normal vectors to the hyper-planes; \({\text{b}}_{1}\) and \({\text{b}}_{2}\) are bias terms; \({\text{e}}_{1}\) and \({\text{e}}_{2}\) are vectors of ones; \({\text{c}}_{1}\) and \({\text{c}}_{2}\) are positive penalty parameters; and \(\xi\) and \(\eta\) are slack variables for the negative and positive class, respectively. Solving these two problems gives the hyper-plane parameters in closed form:

$$\left[ {\begin{array}{*{20}c} {{\text{w}}_{1} } \\ {{\text{b}}_{1} } \\ \end{array} } \right] = - \left( {{\text{B}}^{\text{T}} {\text{B}} + \frac{1}{{{\text{c}}_{1} }}{\text{A}}^{\text{T}} {\text{A}}} \right)^{ - 1} {\text{B}}^{\text{T}} {\text{e}}_{2}$$
(4)
$$\left[ {\begin{array}{*{20}c} {{\text{w}}_{2} } \\ {{\text{b}}_{2} } \\ \end{array} } \right] = \left( {{\text{A}}^{\text{T}} {\text{A}} + \frac{1}{{{\text{c}}_{2} }}{\text{B}}^{\text{T}} {\text{B}}} \right)^{ - 1} {\text{A}}^{\text{T}} {\text{e}}_{1}$$
(5)

where \({\text{A}} = \left[ {{\text{X}}_{1} \;\;{\text{e}}_{1} } \right]\) and \({\text{B}} = \left[ {{\text{X}}_{2} \;\;{\text{e}}_{2} } \right]\). These parameters define the two hyper-planes:

$${\text{x}}^{\text{T}} {\text{w}}_{1} + {\text{b}}_{1} = 0\,{\text{and}}\,{\text{x}}^{\text{T}} {\text{w}}_{2} {\text{ + b}}_{2} = 0$$
(6)
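Equations (4) and (5) can be recovered from (2) and (3) in a few steps. The sketch below uses the shorthand \(z_{1} = \left[ {w_{1}^{T} \;\; b_{1} } \right]^{T}\) and \(z_{2} = \left[ {w_{2}^{T} \;\; b_{2} } \right]^{T}\), which is introduced here only for illustration. Substituting the equality constraint of (2), \(\xi = e_{2} + Bz_{1}\), into its objective gives the unconstrained problem

$$\mathop {\min }\limits_{{z_{1} }} \;\frac{1}{2}\parallel Az_{1} \parallel^{2} + \frac{{c_{1} }}{2}\parallel Bz_{1} + e_{2} \parallel^{2}$$

Setting the gradient with respect to \(z_{1}\) to zero yields

$$A^{T} Az_{1} + c_{1} B^{T} \left( {Bz_{1} + e_{2} } \right) = 0\quad \Rightarrow \quad z_{1} = - \left( {B^{T} B + \frac{1}{{c_{1} }}A^{T} A} \right)^{ - 1} B^{T} e_{2}$$

which is Eq. (4). In the same way, the constraint of (3) gives \(\eta = e_{1} - Az_{2}\), and the stationarity condition becomes

$$B^{T} Bz_{2} - c_{2} A^{T} \left( {e_{1} - Az_{2} } \right) = 0\quad \Rightarrow \quad z_{2} = \left( {A^{T} A + \frac{1}{{c_{2} }}B^{T} B} \right)^{ - 1} A^{T} e_{1}$$

which is Eq. (5). Each hyper-plane is therefore obtained from a single linear system of size \((m + 1) \times (m + 1)\), where m is the number of features.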

A given data sample is assigned to the class whose hyper-plane lies nearer to it, where i = 1 corresponds to the positive class and i = 2 to the negative class:

$$f\left( x \right) = \mathop {\arg \min }\limits_{i = 1,2} \frac{{\left| {w_{i} \cdot x + b_{i} } \right|}}{{\parallel w_{i} \parallel }}$$
(7)
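The linear case can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (4), (5) and (7), not the authors' MATLAB implementation; the function names are hypothetical, and the small ridge term `reg` is an added assumption that keeps the matrices invertible.

```python
import numpy as np


def lstsvm_fit(X1, X2, c1=1.0, c2=1.0, reg=1e-8):
    """Solve Eqs. (4) and (5). X1: positive samples, X2: negative samples.

    `reg` is a small ridge term (an assumption, not part of the source)
    that keeps the system matrices well conditioned.
    """
    A = np.hstack([X1, np.ones((X1.shape[0], 1))])  # A = [X1 e1]
    B = np.hstack([X2, np.ones((X2.shape[0], 1))])  # B = [X2 e2]
    ridge = reg * np.eye(A.shape[1])
    e1 = np.ones(A.shape[0])
    e2 = np.ones(B.shape[0])
    z1 = -np.linalg.solve(B.T @ B + A.T @ A / c1 + ridge, B.T @ e2)  # Eq. (4)
    z2 = np.linalg.solve(A.T @ A + B.T @ B / c2 + ridge, A.T @ e1)   # Eq. (5)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])


def lstsvm_predict(X, planes):
    """Eq. (7): +1 if a sample is nearer to hyper-plane 1, otherwise -1."""
    (w1, b1), (w2, b2) = planes
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)


# Toy data (hypothetical): two well separated groups in the plane.
X1 = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 2.0], [3.0, 3.0]])
X2 = -X1
planes = lstsvm_fit(X1, X2)
pred_pos = lstsvm_predict(X1, planes)
pred_neg = lstsvm_predict(X2, planes)
```

Using `np.linalg.solve` instead of forming an explicit inverse is a numerical design choice; both express the same closed-form solution.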

LSTSVM also gives promising results on data samples that are not linearly separable, with the help of a kernel function. The non-linear LSTSVM classifier determines the following kernel surfaces in a high-dimensional space:

$${\text{Ker}}\left( {{\text{x}}^{\text{T}} ,{\text{Z}}^{\text{T}} } \right)\upmu_{1} +\upgamma_{1} = 0 \;{\text{and}}\;{\text{Ker}}\left({{\text{x}}^{\text{T}},{\text{Z}}^{\text{T}}}\right)\upmu_{2} +\upgamma_{2} = 0$$
(8)

Here, ‘Ker’ refers to the kernel function and \({\text{Z}} = \left[ {{\text{X}}_{1} \;\;{\text{X}}_{2} } \right]^{\text{T}}\). Non-linear LSTSVM solves the following two equality-constrained problems to separate the data samples of the two classes:

$$\begin{aligned} & \mathop {\min }\limits_{{\mu_{1} ,\gamma_{1} ,\xi }} \;\frac{1}{2}\parallel \text{Ker}\left( {X_{1} ,Z^{T} } \right)\mu_{1} + e\gamma_{1} \parallel^{2} + \frac{{c_{1} }}{2}\xi^{T} \xi \\ & {\text{s}}.{\text{t}}.\;\; - \left( {\text{Ker}\left( {X_{2} ,Z^{T} } \right)\mu_{1} + e\gamma_{1} } \right) = e - \xi \\ \end{aligned}$$
(9)
$$\begin{aligned} & \mathop {\min }\limits_{{\mu_{2} ,\gamma_{2} ,\eta }} \;\frac{1}{2}\parallel \text{Ker}\left( {X_{2} ,Z^{T} } \right)\mu_{2} + e\gamma_{2} \parallel^{2} + \frac{{c_{2} }}{2}\eta^{T} \eta \\ & {\text{s}}.{\text{t}}.\;\;\left( {\text{Ker}\left( {X_{1} ,Z^{T} } \right)\mu_{2} + e\gamma_{2} } \right) = e - \eta \\ \end{aligned}$$
(10)

Equations (9) and (10) determine kernel surface parameters as:

$$\left[ {\begin{array}{*{20}c} {\mu_{1} } \\ {\gamma_{1} } \\ \end{array} } \right] = - (Q^{T} Q + \frac{1}{{c_{1} }}P^{T} P)^{ - 1} Q^{T} e$$
(11)
$$\left[ {\begin{array}{*{20}c} {\mu_{2} } \\ {\gamma_{2} } \\ \end{array} } \right] = (P^{T} P + \frac{1}{{c_{2} }}Q^{T} Q)^{ - 1} P^{T} e$$
(12)

where \({\text{P}} = \left[ {\text{Ker}\left( {X_{1} ,Z^{T} } \right)\;\;e} \right]\) and \({\text{Q}} = \left[ {\text{Ker}\left( {X_{2} ,Z^{T} } \right)\;\;e} \right]\). A new data sample x is assigned to the class whose kernel surface lies nearer to it:

$$\text{class}\left( x \right) = \mathop {\arg \min }\limits_{j = 1,2} \frac{{\left| {\text{Ker}\left( {x^{T} ,Z^{T} } \right)\mu_{j} + \gamma_{j} } \right|}}{{\parallel \mu_{j} \parallel }}$$
(13)

In this paper, we use the Gaussian kernel function, which is defined as:

$${\text{K}}_{\text{G}} \left( {{\text{x}}_{\text{i}} ,{\text{x}}_{\text{j}} } \right) = \exp \left( { - \frac{{\left\| {{\text{x}}_{\text{i}} - {\text{x}}_{\text{j}} } \right\|^{2} }}{{2\sigma^{2} }}} \right)$$
(14)

where \({\text{x}}_{\text{i}}\) and \({\text{x}}_{\text{j}}\) are two input vectors and \(\sigma\) is the kernel width. In this way, a binary classifier is constructed for each of the k single-label datasets. For a new data sample, each binary classifier predicts whether its label applies, and the labels that are predicted as positive are combined to form the predicted label set of that sample.
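The whole procedure of this section can be put together as follows. This is a minimal NumPy sketch of Binary Relevance with a Gaussian-kernel LSTSVM per label, following Eqs. (11) to (14); it is not the authors' MATLAB code. All function names are hypothetical, the ridge term `reg` is an added assumption, and `Z` is stored with one training sample per row.

```python
import numpy as np


def gaussian_kernel(X, Z, sigma):
    """Eq. (14) evaluated for every pair of rows of X and Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_kernel_lstsvm(X1, X2, c1, c2, sigma, reg=1e-6):
    """Eqs. (11)-(12). `reg` is an added ridge term (assumption)."""
    Z = np.vstack([X1, X2])  # all training samples, one per row
    P = np.hstack([gaussian_kernel(X1, Z, sigma), np.ones((len(X1), 1))])
    Q = np.hstack([gaussian_kernel(X2, Z, sigma), np.ones((len(X2), 1))])
    ridge = reg * np.eye(P.shape[1])
    u1 = -np.linalg.solve(Q.T @ Q + P.T @ P / c1 + ridge, Q.T @ np.ones(len(X2)))
    u2 = np.linalg.solve(P.T @ P + Q.T @ Q / c2 + ridge, P.T @ np.ones(len(X1)))
    return Z, u1, u2  # u = [mu; gamma]


def predict_kernel_lstsvm(X, model, sigma):
    """Eq. (13): 1 if a sample is nearer to kernel surface 1, otherwise 0."""
    Z, u1, u2 = model
    K = gaussian_kernel(X, Z, sigma)
    d1 = np.abs(K @ u1[:-1] + u1[-1]) / np.linalg.norm(u1[:-1])
    d2 = np.abs(K @ u2[:-1] + u2[-1]) / np.linalg.norm(u2[:-1])
    return (d1 <= d2).astype(int)


def fit_br(X, Y, c1=1.0, c2=1.0, sigma=1.0):
    """Binary Relevance: one kernel LSTSVM per label column of Y.
    Assumes every label has at least one positive and one negative sample."""
    return [fit_kernel_lstsvm(X[Y[:, j] == 1], X[Y[:, j] == 0], c1, c2, sigma)
            for j in range(Y.shape[1])]


def predict_br(X, models, sigma=1.0):
    """Stack the per-label decisions into a predicted 0/1 label matrix."""
    return np.column_stack([predict_kernel_lstsvm(X, m, sigma) for m in models])


# Toy data (hypothetical): label 0 means "first feature positive",
# label 1 means "second feature positive".
X = np.array([[2.0, 2.0], [2.5, 1.5], [2.0, -2.0], [1.5, -2.5],
              [-2.0, 2.0], [-2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
Y = np.array([[1, 1], [1, 1], [1, 0], [1, 0],
              [0, 1], [0, 1], [0, 0], [0, 0]])
models = fit_br(X, Y)
Y_hat = predict_br(X, models)
```

Because each label is handled by its own classifier, the per-label models can be trained independently, which is the source of the linear complexity in the number of labels noted above.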

4 Numerical Experiment

4.1 Dataset Description

The experiment is performed on the Emotions dataset, which is taken from Mulan’s repository [36]. The domain of the Emotions dataset is music, and it contains 593 instances. Each music instance can be labeled with one or more of six emotions: L1 angry-aggressive (189 examples), L2 quiet-still (148 examples), L3 amazed-surprised (173 examples), L4 sad-lonely (168 examples), L5 happy-pleased (166 examples) and L6 relaxing-calm (264 examples). The dataset contains 72 features, which fall into two main categories: 8 rhythmic features and 64 timbre features. The label cardinality of a multi-label dataset is the average number of class labels per data sample. The label density is the label cardinality divided by the total number of labels \(\left| {\mathcal{L}} \right|\). The label cardinality and density of the Emotions dataset are 1.869 and 0.311, respectively.
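The two statistics can be checked directly from the per-label counts quoted above. The short sketch below does this with plain Python; the variable names are illustrative only.

```python
# Per-label example counts as quoted for the Emotions dataset.
label_counts = {
    "angry-aggressive": 189,
    "quiet-still": 148,
    "amazed-surprised": 173,
    "sad-lonely": 168,
    "happy-pleased": 166,
    "relaxing-calm": 264,
}
n_samples = 593

# Cardinality: total number of label assignments divided by the number
# of samples. Density: cardinality divided by the number of labels.
cardinality = sum(label_counts.values()) / n_samples
density = cardinality / len(label_counts)
```

The quoted counts sum to 1108 label assignments over 593 samples, which gives a cardinality of about 1.868 and a density of about 0.311, in agreement with the reported values to within about 0.001.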

4.2 Performance Evaluation Parameters

The metrics required for the performance evaluation of a multi-label classifier are different from those used for a conventional single-label multi-class classifier. Let the set of class labels predicted by the proposed classifier BR-LSTSVM for a given data sample \(x_{i}\) be represented by \(P_{i}\), and let \(Y_{i}\) be its actual set of labels. This research work uses two types of evaluation measures: example-based and label-based. “The example based evaluation metrics measure the average differences of the actual and the predicted sets of labels over all data samples of the evaluation dataset”. On the other hand, “label-based evaluation metrics measure the predictive performance for each label separately and then average the performance over all labels”. The example-based measures comprise six metrics (accuracy, precision, recall, subset accuracy, F1 score, Hamming loss), and the label-based measures comprise eight metrics (macro accuracy, macro precision, macro recall, macro F1, micro accuracy, micro precision, micro recall, micro F1), as shown in Table 1. Here, \(TP_{i} ,TN_{i} ,FP_{i} ,FN_{i}\) denote the number of True Positive, True Negative, False Positive and False Negative data samples of the ith class.

Table 1 Performance evaluation measure
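For orientation, the sketch below computes the example-based metrics and two of the label-based metrics from boolean label matrices. It uses the definitions that are common in the multi-label literature, which is an assumption: they may differ in detail from the formulas in Table 1. The function names are hypothetical, and ratios with a zero denominator are counted as 0.

```python
import numpy as np


def _safe_div(num, den):
    """Element-wise num / den, returning 0 where den is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def example_based(Y, P):
    """Y: actual labels, P: predicted labels (boolean, samples x labels)."""
    inter = (Y & P).sum(axis=1)
    union = (Y | P).sum(axis=1)
    n_true, n_pred = Y.sum(axis=1), P.sum(axis=1)
    return {
        "hamming_loss": float((Y != P).mean()),
        "accuracy": float(_safe_div(inter, union).mean()),
        "precision": float(_safe_div(inter, n_pred).mean()),
        "recall": float(_safe_div(inter, n_true).mean()),
        "f1": float(_safe_div(2 * inter, n_true + n_pred).mean()),
        "subset_accuracy": float((Y == P).all(axis=1).mean()),
    }


def label_based_f1(Y, P):
    """Macro F1 averages per-label F1; micro F1 pools the counts first."""
    tp = (Y & P).sum(axis=0)
    fp = (~Y & P).sum(axis=0)
    fn = (Y & ~P).sum(axis=0)
    macro = float(_safe_div(2 * tp, 2 * tp + fp + fn).mean())
    den = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = float(2 * tp.sum() / den) if den else 0.0
    return {"macro_f1": macro, "micro_f1": micro}


# Toy example (hypothetical): 3 samples, 3 labels.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=bool)
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]], dtype=bool)
ex = example_based(Y_true, Y_pred)
lb = label_based_f1(Y_true, Y_pred)
```

In this toy case two of the nine label decisions are wrong, so the Hamming loss is 2/9, while only one of the three samples is predicted exactly, so the subset accuracy is 1/3.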

4.3 Results and Discussion

The performance of the proposed BR-LSTSVM based emotion recognition system is compared with existing multi-label classification approaches including BR, CC, RAkEL, CLR, MLkNN, ML C4.5, QWML and ECC. In this study, Support Vector Machine is used as the base classifier in BR, CC, RAkEL, CLR, QWML and ECC. All these approaches are evaluated against 14 evaluation metrics: Hamming Loss, Accuracy, Precision, Recall, Subset Accuracy, F1-Score, Micro Precision, Macro Precision, Micro Recall, Macro Recall, Micro Accuracy, Macro Accuracy, Micro F1 and Macro F1. The proposed multi-label classifier BR-LSTSVM and the other multi-label classifiers used in this study are implemented in MATLAB on Windows 7 with an Intel Core i7 processor (3.4 GHz) and 12 GB of RAM. The experiment is performed using 10-fold cross validation, and the parameters are selected using a grid search. The penalty parameter is selected from the set \({\text{c}}_{i} \in \left\{ {10^{{ -8}} ,\,\, \ldots \,\,,10^{3} } \right\}\) and the Gaussian kernel parameter sigma is chosen from the set \(\sigma \in \left\{ {2^{ - 5} ,\,\, \ldots \,\,,2^{10} } \right\}\). Table 2 compares the performance of the existing multi-label classification approaches with that of the proposed classifier BR-LSTSVM on the Emotions dataset.
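The parameter selection described above can be sketched as a grid search wrapped around k-fold cross validation. The code below is an illustration only: it assumes consecutive integer exponents between the quoted end points (the intermediate values are not listed in the text), a single penalty value shared by both hyper-planes, and caller-supplied `fit`, `predict` and `score` functions; none of these names come from the source.

```python
import itertools

import numpy as np

# Candidate values, assuming consecutive integer exponents between the
# end points quoted in the text.
C_GRID = [10.0 ** p for p in range(-8, 4)]      # 10^-8 ... 10^3
SIGMA_GRID = [2.0 ** p for p in range(-5, 11)]  # 2^-5 ... 2^10


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold CV."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]


def grid_search(X, Y, fit, predict, score, n_folds=10):
    """Return (best_score, c, sigma) maximising the mean CV score.

    fit(X, Y, c, sigma) -> model, predict(X, model, sigma) -> predictions
    and score(Y_true, Y_pred) -> float are supplied by the caller.
    """
    best = (-np.inf, None, None)
    for c, sigma in itertools.product(C_GRID, SIGMA_GRID):
        scores = []
        for train, test in kfold_indices(len(X), n_folds):
            model = fit(X[train], Y[train], c, sigma)
            scores.append(score(Y[test], predict(X[test], model, sigma)))
        mean_score = float(np.mean(scores))
        if mean_score > best[0]:
            best = (mean_score, c, sigma)
    return best
```

Under these assumptions the grid contains 12 × 16 = 192 parameter pairs, each of which is evaluated on ten train/test splits. For a metric that should be minimized, such as Hamming loss, `score` can return its negative.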

Table 2 Performance comparison for emotions dataset

A downward symbol next to a metric indicates that its value should be as low as possible; for example, the multi-label classifier performs well if the value of Hamming Loss is low. An upward symbol indicates that the value of the metric should be as high as possible; the classifier performs well if the value of accuracy, precision or any other such metric is high. The best result obtained for each evaluation measure is shown in bold. Table 2 shows that the BR-LSTSVM based emotion recognition system achieves better performance than the other existing approaches on 11 evaluation metrics: Hamming Loss, Accuracy, Precision, Recall, F1-score, Micro Precision, Micro Recall, Micro F1, Macro Precision, Macro Recall and Macro F1. Its performance on the remaining evaluation parameters is comparable with that of the other existing approaches.

5 Conclusion

The literature survey shows that several emotion recognition systems based on music databases exist, but most of them treat emotion recognition as a single-label classification task, although a song or a piece of music may convey different emotions at the same time. Researchers who have worked on multi-label music emotion recognition systems have also left room for improvement in performance. In this paper we therefore developed an emotion recognition system based on the multi-label classifier BR-LSTSVM. The proposed system achieves better performance than the other existing multi-label approaches in terms of eleven evaluation parameters. In this research work, we considered only musical features. In future work, it would be interesting to analyze textual features, such as the lyrics of a song, together with musical features for emotion recognition from musical content.