
1 Introduction

Music plays an important and influential role in most of our lives; for instance, we often listen to specific kinds of music to help enhance or alter our mood, particularly during special occasions (e.g. a romantic dinner, a national sports event, etc.). Hence, it is essential that we use information about emotions and mood in music retrieval tasks, such as classification and recommendation [11].

To this end, many approaches based on audio analysis have been proposed and proved applicable, but they quickly reached a so-called “glass ceiling” performance barrier [13]. As it became evident that features based on audio alone are not enough, many researchers started combining features from different domains [9]. One such domain, music lyrics, has become a popular source of features for music emotion and mood classification, among other music retrieval tasks. Mayer et al. [14] show that, in some emotion categories, including features derived from lyrics improves classifier performance over using the leading audio features alone. However, Hu et al. [6] reveal that this does not hold for all mood categories. To further improve classification performance, some researchers integrate audio and lyric features into hybrid representations that carry information from the two modalities (domains) simultaneously [6]. Accordingly, different integration strategies (e.g., early fusion [5], late fusion [10] and model fusion [20]) have been proposed in the literature.

In this work, we follow a model fusion approach and use a hybrid model based on the Multimodal Deep Boltzmann Machine; in addition to fusing different modalities, it can also make use of unlabelled data to further improve performance [19]. Additionally, we adopt the commonly used Russell’s 2-dimensional Valence-Arousal (V-A) model of affect [15] to capture the emotional content of music lyrics. To show the effectiveness of our approach, we conduct an experimental study on the largest dataset that is publicly available for music retrieval research, the Million Song Dataset [1], from which we are able to use over 230,000 music tracks that contain both lyric and audio features.

2 Related Work

Among the first to tackle the task of automatically classifying music into emotion-based categories, Li and Ogihara used Support Vector Machines (SVM) with audio-based features (related to timbre, pitch and rhythm) and reported 45 % accuracy on a dataset consisting of 499 music clips and 13 mood categories [12].

Starting in 2007, the Audio Music Mood Classification task has been run regularly to encourage the development of improved music-IR systems. Since then, datasets comprising hundreds of music tracks have been collected and made available to the research community, and more than two hundred systems have been evaluated. Although other supervised methods, such as the Gaussian Mixture Model [13], Random Forest and K-Nearest Neighbor, have also been applied, many studies found that SVMs combined with spectral features often yield the best results [21].

Due to the limiting factors of features based solely on audio [13] and because of the semantically rich nature of music lyrics, lyric-based features found their way into emotion-based music classification. Among others, Hu et al. [6] investigate the usefulness of low-level text features such as the Bag-of-Words (BoW) representation of lyrics, as well as part-of-speech and function-word features. They also combine lyric and audio features and report accuracy as high as 72 % on a private dataset consisting of 5,585 music tracks and 18 mood categories [7]. He et al. [3] report that higher-order BoW features, such as tf-idf weighted unigrams, bigrams and trigrams, can capture more semantic relations in lyrics for mood classification. Similarly, other lyric features derived from the Affective Norms for English Words also obtain encouraging results [8].

There are several ways to combine information from different domains, such as audio and text. Early fusion methods simply concatenate audio and lyric features to create feature vectors in a new space [5], while in late fusion separate classifiers are typically trained on the features from their respective domains [10]. Xue et al. [20] fused the audio and text domains through a model fusion scheme. In this work, we follow the idea of using Deep Boltzmann Machines for multimodal learning [19] and demonstrate its effectiveness on the largest publicly available music dataset.

3 Bi-Modal Deep Boltzmann Machine Model

The Deep Boltzmann Machine (DBM) [16] is a deep neural network architecture based on the Restricted Boltzmann Machine [18]. It contains a set of visible units \(\mathbf {v}\in {\{0,1\}}^{D}\) and a sequence of layers of hidden units \(\mathbf {h}^{(1)}\in {\{0,1\}}^{{F}_{1}},\mathbf {h}^{(2)}\in {\{0,1\}}^{{F}_{2}},...,\mathbf {h}^{(n)}\in {\{0,1\}}^{{F}_{n}}\). Connections exist only between units in adjacent layers, i.e. no connection is allowed between any two units within the same layer or between any two units in non-adjacent layers. The energy of the joint configuration \(\{\mathbf {v},\mathbf {h}\}\) is defined in terms of \(\mathbf {h=\{h^{(1)},h^{(2)},...,h^{(n)}\}}\) and the parameters \(\mathbf {\theta =\{\mathbf {W}^{(1)},\mathbf {W}^{(2)},...,\mathbf {W}^{(n)},\mathbf {b},\mathbf {b}^{(1)},\mathbf {b}^{(2)},...,\mathbf {b}^{(n)}\}}\). The DBM assigns probability to a set of visible units according to the Boltzmann distribution (written here for a two-layer DBM):

$$\begin{aligned} P(\mathbf {v};\theta )=\frac{1}{ Z (\theta )}\sum _{\mathbf {h}}exp(-E(\mathbf {v},\mathbf {h}^{(1)},\mathbf {h}^{(2)};\theta )) \end{aligned}$$
(1)

where \( Z (\theta )\) is the normalising constant.
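To make the notation concrete, the following minimal sketch (NumPy, with toy layer sizes that are purely illustrative and not those used in our experiments) evaluates the energy of a two-layer DBM configuration; it is a sketch of the definition above, not the training code used in this work.

```python
import numpy as np

# Toy layer sizes (illustrative only).
D, F1, F2 = 8, 6, 4
rng = np.random.default_rng(0)

# Parameters theta = {W1, W2, b, b1, b2}.
W1 = rng.normal(scale=0.1, size=(D, F1))
W2 = rng.normal(scale=0.1, size=(F1, F2))
b, b1, b2 = np.zeros(D), np.zeros(F1), np.zeros(F2)

def dbm_energy(v, h1, h2):
    """Energy of the joint configuration {v, h} of a two-layer DBM:
    E = -v^T W1 h1 - h1^T W2 h2 - b^T v - b1^T h1 - b2^T h2."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + b @ v + b1 @ h1 + b2 @ h2)

# The unnormalised probability of a configuration is exp(-E); the partition
# function Z(theta) sums exp(-E) over all binary configurations.
v, h1, h2 = (rng.integers(0, 2, size=n) for n in (D, F1, F2))
print(dbm_energy(v, h1, h2))
```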

Fig. 1. Bi-modal Deep Boltzmann Machine

The Multimodal DBM is a generative model that can create fused representations by combining features from different modalities in a model fusion scheme [19]. Figure 1 illustrates the proposed audio-text aware bi-modal DBM architecture: it consists of two 2-layer DBM networks, with an additional layer of hidden units added on top to join the two DBMs and form a single model.

Let \(\mathbf {v}_{a}\in \mathbb {R}^{D}\) denote the audio input and \(\mathbf {v}_{t}\in \mathbb {R}^{K}\) denote the text input, where D and K are the dimensions of the audio and text features, respectively. The joint distribution of the bi-modal input can then be written as:

$$\begin{aligned} \begin{aligned} P(\mathbf {v}_{a},\mathbf {v}_{t};\theta )=\sum _{\mathbf {h}_{\mathbf {a}}^{(2)},\mathbf {h}_{\mathbf {t}}^{(2)},\mathbf {h}^{(3)}}&P(\mathbf {h}_{\mathbf {a}}^{(2)},\mathbf {h}_{\mathbf {t}}^{(2)},\mathbf {h}^{(3)})\left( \sum _{\mathbf {h}_{\mathbf {a}}^{(1)}}P(\mathbf {v}_{a},\mathbf {h}_{\mathbf {a}}^{(1)}\mid \mathbf {h}_{\mathbf {a}}^{(2)})\right) \\&\left( \sum _{\mathbf {h}_{\mathbf {t}}^{(1)}}P(\mathbf {v}_{t},\mathbf {h}_{\mathbf {t}}^{(1)}\mid \mathbf {h}_{\mathbf {t}}^{(2)})\right) \end{aligned} \end{aligned}$$
(2)

The second factor in Eq. 2 corresponds to the audio modality, which assigns probability to \(\mathbf {v}_{a}\) following a Gaussian RBM scheme:

$$\begin{aligned} P(\mathbf {v}_{a};\theta )=\frac{1}{ Z (\theta )}\sum _{\mathbf {h}_{\mathbf {a}}^{(1)},\mathbf {h}_{\mathbf {a}}^{(2)}}exp\left( -\sum _{i}\frac{(v_{ai}-b_{i})^{2}}{2\sigma _{i}^{2}}+\sum _{i,j}\frac{v_{ai}}{\sigma _{i}}W_{ij}^{(1)}h_{aj}^{(1)}+\sum _{j,l}h_{aj}^{(1)}W_{jl}^{(2)}h_{al}^{(2)}\right) \end{aligned}$$
(3)

The third factor in Eq. 2 corresponds to the text modality, where each element \({v}_{tk}\) of \(\mathbf {v}_{t}\in \mathbb {N}^{K}\) is the number of times word k occurs in the lyrics, with K being the size of the dictionary. The model assigns probability to \(\mathbf {v}_{t}\) following a Replicated Softmax RBM scheme:

$$\begin{aligned} P(\mathbf {v}_{t};\theta )=\frac{1}{ Z (\theta )}\sum _{\mathbf {h}_{\mathbf {t}}^{(1)},\mathbf {h}_{\mathbf {t}}^{(2)}}exp\left( \sum _{k,j}v_{tk}W_{kj}^{(1)}h_{tj}^{(1)}+\sum _{j,l}h_{tj}^{(1)}W_{jl}^{(2)}h_{tl}^{(2)}\right) \end{aligned}$$
(4)

The parameters of the DBM can be initialised randomly; here, however, we use a greedy layer-wise pre-training strategy [16, 19].
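As a rough illustration of this strategy, the sketch below pre-trains one pathway bottom-up and fits the joint layer last on the concatenated top-level representations. It assumes a generic, hypothetical `train_rbm(visible_data, n_hidden)` routine that fits a single RBM (e.g. by contrastive divergence) and returns the trained model together with its hidden activations; it is not a definitive implementation.

```python
import numpy as np

def pretrain_pathway(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training of one DBM pathway: each RBM is trained
    on the hidden activations produced by the previously trained RBM."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm, layer_input = train_rbm(layer_input, n_hidden)
        rbms.append(rbm)
    return rbms, layer_input

# Hypothetical usage with the layer sizes of Sect. 4:
# audio_rbms, h_audio = pretrain_pathway(audio_features, [100, 50], train_rbm)
# text_rbms,  h_text  = pretrain_pathway(lyric_bow, [2048, 1024], train_rbm)
# joint_rbm, _        = train_rbm(np.hstack([h_audio, h_text]), 1074)
```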

4 Experimental Study

In our experiments, we use the largest publicly available music dataset, the Million Song Dataset (MSD) [1]. It is a conglomeration of several datasets containing different information about the tracks, and we use two of its subsets. The first, MusiXmatch, contains lyric information: each song is described as a bag of words drawn from the 5,000 most frequent words across all lyrics. The second, Last.fm, contains annotations obtained from music listeners in the form of tags, such as “happy” and “upbeat”; from it, we select tracks described by emotion-related tags. Additionally, we obtain pre-extracted audio-based features from the MSD Benchmarking dataset (MSDB), an extension of MSD created for comparing different approaches while keeping various experimental parameters fixed [17]. To capture both modalities, each music track in our experiments is represented by both lyrics (from the MusiXmatch dataset) and audio-based features (from the MSDB dataset); 236,486 tracks satisfy these conditions.
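For concreteness, the sketch below assembles a 5,000-dimensional bag-of-words vector for a single track from (word index, count) pairs of the kind distributed with the MusiXmatch subset, and indicates how tracks present in both modalities can be joined; the variable names and loading steps are hypothetical.

```python
import numpy as np

VOCAB_SIZE = 5000  # top-5,000 word vocabulary of the MusiXmatch subset

def bow_vector(word_counts):
    """Build a dense bag-of-words vector from (word_index, count) pairs,
    where word_index is 1-based over the 5,000-word vocabulary."""
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    for idx, count in word_counts:
        v[idx - 1] = count
    return v

# Hypothetical joining step: keep only tracks present in both modalities.
# lyrics  = {track_id: [(idx, cnt), ...]}           # parsed MusiXmatch entries
# audio   = {track_id: np.ndarray of shape (194,)}  # MSDB descriptors
# tracks  = sorted(set(lyrics) & set(audio))
# X_text  = np.stack([bow_vector(lyrics[t]) for t in tracks])
# X_audio = np.stack([audio[t] for t in tracks])
```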

Initially, to test the validity of our approach, we select only the tracks that are tagged “happy” or “sad”. After removing ambiguous tracks that carry both tags, we obtain 7,945 “happy” and 5,840 “sad” tracks. To avoid classifier bias due to class imbalance, we perform random subsampling of the majority class and then conduct a binary emotion classification experiment.
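The class-balancing step is a plain random subsampling of the majority class; a minimal sketch (assuming the disambiguated “happy” and “sad” track IDs have already been collected) is shown below.

```python
import numpy as np

def balance_classes(happy_ids, sad_ids, seed=0):
    """Randomly subsample the majority class so both classes have equal size."""
    rng = np.random.default_rng(seed)
    n = min(len(happy_ids), len(sad_ids))
    happy = rng.choice(happy_ids, size=n, replace=False)
    sad = rng.choice(sad_ids, size=n, replace=False)
    return happy, sad

# Example with the class sizes reported above (track IDs are placeholders):
# happy, sad = balance_classes(happy_track_ids, sad_track_ids)
# after balancing, len(happy) == len(sad) == 5840
```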

In a multi-class scenario, some songs may cover a variety of emotions, rendering a representation by independent dimensions inadequate. For this reason, we employ Russell’s Valence-Arousal model [15] and follow the scheme of Corona and O’Mahony [2] for selecting social tags that clearly indicate a song’s emotional trend. We group the tags according to their quadrants in the Valence-Arousal model and report the final number of tracks tagged with each emotion group in Table 1. Tracks that carry emotion-related tags are used as labelled data for training the classifier, and the remainder as unlabelled data for unsupervised pre-training. Our final dataset contains 41,727 labelled and 194,759 unlabelled tracks.
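The grouping itself is a simple lookup from tag to Valence-Arousal quadrant, as the sketch below illustrates; the tag lists shown here are only a small illustrative subset, not the full set selected following Corona and O’Mahony [2].

```python
# Illustrative tag-to-quadrant lookup (V = valence, A = arousal).
TAG_QUADRANT = {
    "happy": "V+A+", "upbeat": "V+A+",
    "calm": "V+A-", "mellow": "V+A-",
    "angry": "V-A+", "aggressive": "V-A+",
    "sad": "V-A-", "depressing": "V-A-",
}

def quadrant_labels(track_tags):
    """Return the set of V-A quadrants a track's tags fall into; tracks with
    no emotion-related tag are kept as unlabelled data for pre-training."""
    return {TAG_QUADRANT[t] for t in track_tags if t in TAG_QUADRANT}
```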

Table 1. Mood quadrants and their corresponding number of songs

The deep learning architecture is configured as follows. The audio pathway is modelled by an RBM with 194 visible units, each taking as input an acoustic content descriptor, such as the MFCC and SSD features. The visible layer is followed by two layers of hidden units, with 100 and 50 units, respectively. The text pathway is formed by an RBM with a 5,000-unit visible layer followed by hidden layers of 2,048 and 1,024 units. A joint layer of 1,074 hidden units combines the two modalities; its output can be regarded as a joint probabilistic representation from which the mood classes are estimated. We use the output of our multimodal DBM as input to either a Softmax or an SVM classifier for the final classification decision. Additionally, to test the robustness of our chosen audio features, we expand the audio modality from 194 to 3,456 dimensions by including additional audio-based features; the audio hidden layers are then expanded to 2,048 and 1,024 units, respectively, and the joint layer to 2,048 units.
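For reference, the layer sizes described above can be summarised in a small configuration sketch (a plain dictionary mirroring the sizes in the text, not the API of any particular library):

```python
# Layer sizes of the bi-modal DBM; the expanded audio setting is in comments.
BIMODAL_DBM_CONFIG = {
    "audio": {"visible": 194,  "hidden": [100, 50]},    # expanded: 3456 -> [2048, 1024]
    "text":  {"visible": 5000, "hidden": [2048, 1024]},
    "joint": {"hidden": 1074},                          # expanded: 2048
}
```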

Because the SVM classifier performs slightly better on average, we omit the Softmax results. In our experiments, we perform k-fold repeated random sub-sampling validation with \(k=5\). In each fold, 60 % of the tracks (6,984) are selected for training and 40 % (4,656) for testing. We compute Mean Average Precision (MAP) and Accuracy to comprehensively evaluate the models. The initial experimental results are shown in Fig. 2, where we also report the baseline SVM performance (no DBM) obtained by early fusion, i.e. concatenating the two modalities into a single input vector.
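A sketch of this evaluation protocol is given below, assuming binary 0/1 labels for the “happy”/“sad” task and using scikit-learn’s SVC and metric functions as stand-ins for the actual implementation; the classifier settings are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, average_precision_score

def repeated_subsampling_eval(features, labels, k=5, train_frac=0.6, seed=0):
    """k-fold repeated random sub-sampling: each fold draws a fresh 60/40
    train/test split and reports mean accuracy and mean average precision."""
    accs, aps = [], []
    for fold in range(k):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, train_size=train_frac,
            stratify=labels, random_state=seed + fold)
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
        aps.append(average_precision_score(y_te, clf.decision_function(X_te)))
    return float(np.mean(accs)), float(np.mean(aps))
```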

Fig. 2. MAP and Accuracy achieved by the bi-modal Deep Boltzmann Machine in the “happy”/“sad” binary classification task

As can be seen from Fig. 2, the audio-based features indeed outperform the lyric-based features to some extent. We conjecture that this is because the audio modality is represented by features that have been hand-crafted and improved over the years, whereas the text modality is represented by a shallow BoW statistical measure over a large vocabulary, which results in a sparse input vector. This again motivates the study of higher-level lyric features, which may yield interesting results. We also notice that the classification performance declines along the audio pathway, which indicates that some valuable information is lost as the audio representation passes through the pathway. After expanding the audio modality with additional features, this phenomenon disappears, which indicates the necessity of feature selection. Among all results, the best performance is achieved at the joint layer, which demonstrates the fusing ability of the proposed approach. After expanding the audio features from 194 to 3,456 dimensions, the baseline SVM performance does not improve much.

In addition to using the lyric- and audio-based features with our approach, we also compare the model fusion, early fusion and late fusion methods. In late fusion, we first train two SVM classifiers on the two modalities separately and denote their outputs by \(p_{a}\) and \(p_{t}\). The output mood class is then assigned by

$$\begin{aligned} p = \alpha p_{a}+(1-\alpha )p_{t} \end{aligned}$$
(5)

where \(\alpha \) indicates the relative importance of the audio and lyric features. We set \(\alpha = 0.6\), following Hu et al. [4]. As before, to avoid classifier bias towards the majority class, we maintain class balance by ensuring that both training and testing instances are equally distributed across the mood classes. Results are shown in Table 2.
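For illustration, a minimal sketch of this late-fusion rule is given below, assuming both scikit-learn SVMs were fit with `probability=True` so that class posteriors are available and that their `classes_` orderings match; it is a sketch of Eq. 5, not the exact implementation used in the experiments.

```python
import numpy as np

def late_fusion_predict(svm_audio, svm_text, X_audio, X_text, alpha=0.6):
    """Combine the per-class probabilities of the audio and lyric SVMs as
    p = alpha * p_a + (1 - alpha) * p_t (Eq. 5) and return the argmax class."""
    p_a = svm_audio.predict_proba(X_audio)
    p_t = svm_text.predict_proba(X_text)
    p = alpha * p_a + (1.0 - alpha) * p_t
    return svm_audio.classes_[np.argmax(p, axis=1)]
```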

Table 2. Comparison of accuracy achieved by the different fusion models

Our model outperforms the other baseline models in every mood category. The moods in the \(v^{+}a^{-}\) quadrant obtain the highest accuracy, which is interesting given that this quadrant has the smallest number of songs. The reason may be that music pieces in this mood group contain many unique lyric terms. Among the other mood categories, however, there are no significant differences in classification accuracy. Moreover, all fusion methods outperform classification on a single modality, affirming the effectiveness of multi-modal mood classification, in line with many prior studies.

5 Conclusion

In this work, we used a deep learning architecture, inspired by the work of Srivastava and Salakhutdinov [19], to effectively fuse the audio and text modalities for music mood classification. The results show that fusing modalities is indeed advantageous in the music mood classification task. In addition to including information from other domains/modalities, it would be interesting to see how other lyric-derived features perform with this and other multimodal approaches in the music-IR literature; we leave this to future work.