1 Introduction

Automatic music emotion recognition (MER) aims at modeling the association between music and emotion so as to facilitate emotion-based music organization, indexing, and retrieval. This technology has emerged in recent years as a promising solution for dealing with the huge amount of music information available digitally [1, 25, 33, 77]. It is generally believed that music cannot be composed, performed, or listened to without affective involvement [32]. The pursuit of emotional experience has also been identified as one of the primary motivations for, and benefits of, music listening [31]. In addition to music retrieval, music emotion also finds applications in context-aware music recommendation, playlist generation, music therapy, and automatic music accompaniment for other media content including image, video, and text, amongst others [37, 51, 66, 78].

Despite the significant progress that has been made in recent years, MER is still considered a challenging problem because the perception of emotion in music is usually highly subjective. A single, static ground-truth emotion label is not sufficient to describe the possible emotions different people perceive in the same piece of music [15, 26]. Instead, it may be more reasonable to learn a computational model from the multiple responses of different listeners [47] and to present probabilistic (soft) rather than deterministic (hard) emotion assignments as the final result. In addition, the subjective nature of emotion perception suggests the need for personalization in systems for emotion-based music recommendation or retrieval [82]. Early work on MER often chose to sidestep this critical issue by either assuming that a common consensus can be achieved [25, 62], or by simply discarding music pieces for which a common consensus cannot be achieved [38].

To help address this issue, we have proposed a novel generative model referred to as acoustic emotion Gaussians (AEG) in our prior work [65–68, 72]. The name of the AEG model comes from its use of multiple Gaussian distributions to model the affective content of music. The algorithmic part of AEG was first introduced in [67], along with a preliminary evaluation of AEG for MER and emotion-based music retrieval. More details about the analysis of AEG model learning can be found in a recent article [72]. Owing to the parametric nature of AEG, model adaptation techniques have also been proposed to personalize an AEG model in an online, incremental fashion, rather than learning from scratch [7, 68]. The goal of this chapter is to position the AEG model as a theoretical framework and to provide detailed information about the model itself and its application to personalized MER and emotion-based music retrieval.

We conceptualize emotion by the valence–arousal (VA) model [49], which has been used extensively by psychologists to study the relationship between music and emotion [13, 56]. These two dimensions have been found to be the most fundamental through factor analyses of self-reports of human affective responses to music stimuli. Despite differences in nomenclature, existing studies give similar interpretations of the resulting factors, most of which correspond to valence (or pleasantness; positive/negative affective states) and arousal (or activation; energy and stimulation level). For example, happiness is an emotion associated with a positive valence and a high arousal, while sadness is an emotion associated with a negative valence and a low arousal. We refer to the 2-D space spanned by valence and arousal as the VA space hereafter. Moreover, we are concerned with the emotion an individual perceives as being expressed in a piece of music, rather than the emotion the individual actually feels in response to the piece. This distinction is necessary [15], as we do not necessarily feel sorrow when listening to a sad tune, for example.

However, the descriptive power of the VA model has been questioned by several researchers, and various extensions or alternative solutions have been proposed [14, 46, 85]. Beyond valence and arousal, adding more dimensions (e.g., potency, or dominance–submissiveness) might help resolve the ambiguity between affective terms, such as anger and fear, which are close to one another in the second quadrant of the VA space [2, 10]. AEG is theoretically extendable to model emotion in higher-dimensional spaces. Nevertheless, we stay with the 2-D emotion model here, partly because it is easier to explain AEG graphically, and partly because many existing music datasets to date adopt VA labels [8, 52, 59, 79].

In this chapter, we focus on the dimensional emotion (VA) values. Interested readers can refer to [1, 24, 57] for studies and surveys on categorical MER approaches that view emotions as discrete labels such as mood tags. As the dimensional and categorical approaches may offer complementary advantages [74], researchers have studied the relationship between the discrete emotion labels and the dimensional VA values [50, 65]. Due to its probabilistic nature, AEG can be combined with a probabilistic classification model. Such a combination leads to an approach (called Tag2VA) that is able to project a mood tag to the VA space [65].

The chapter is organized as follows. We first review the related work in Sect. 12.2. Then, we present the mathematical derivation and learning algorithm of AEG in Sect. 12.3, followed by the personalization algorithm in Sect. 12.4. Sections 12.5, 12.6, and 12.7 present the applications of AEG to MER, emotion-based music retrieval, and the Tag2VA projection, respectively. Finally, we conclude in Sect. 12.8.

2 Related Work on Dimensional Music Emotion Recognition

Early approaches to MER [39, 81] assumed that the perceived emotion of a music piece can be represented as a single point in the VA space, in which the valence and arousal values are treated as independent numerical values. The ground-truth VA values of a music piece are obtained by averaging the annotations of a number of human subjects, without considering the covariance of the annotations. To predict the VA values from the feature vector of a music piece, a regression model such as support vector regression (SVR) [55] can be applied. Regression model learning algorithms typically minimize the mismatch (e.g., mean squared loss) between the predicted and the ground-truth VA values in the training data.
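For illustration, the following sketch shows how such a regression baseline could be set up with scikit-learn’s SVR; the feature matrix and the averaged VA targets are placeholders rather than the exact configuration used in the cited studies.

```python
# A minimal sketch of the regression baseline described above (assumed setup):
# one regressor per emotion dimension, trained on clip-level audio features
# against averaged VA annotations. All data below are placeholders.
import numpy as np
from sklearn.svm import SVR

X = np.random.rand(100, 72)                # placeholder clip-level audio features
y_valence = np.random.uniform(-1, 1, 100)  # placeholder averaged valence labels
y_arousal = np.random.uniform(-1, 1, 100)  # placeholder averaged arousal labels

svr_valence = SVR(kernel="rbf", C=1.0).fit(X, y_valence)
svr_arousal = SVR(kernel="rbf", C=1.0).fit(X, y_arousal)

va_prediction = np.column_stack([svr_valence.predict(X), svr_arousal.predict(X)])
```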

As emotion perception rarely depends on a single musical factor but on a combination of them [19, 30], MER algorithms have used feature descriptors that characterize the loudness, timbre, pitch, rhythm, melody, harmony, or lyrics of music [22, 43, 54, 57]. In particular, while it is usually easier to predict arousal using, for example, loudness and timbre features, the prediction of valence has been found more challenging [57, 69, 76]. Cross-cultural aspects of emotion perception have also been studied [23]. To exploit the temporal continuity of emotion variation within a piece of music, techniques such as system identification [34], conditional random fields [27, 53], hidden Markov models [40], deep recurrent neural networks [73], and dynamic probabilistic models [71] have also been proposed. Various approaches and features for MER have been evaluated and compared using benchmarking datasets comprising over 1,000 Creative Commons licensed music pieces from the Free Music Archive, in the 2013 and 2014 MediaEval ‘Emotion in Music’ tasks [59, 60].

Recent years have witnessed growing attempts to model the emotion of a music piece as a probability distribution in the VA space [7, 52, 67, 75] to better account for the subjective nature of emotion perception. For instance, Fig. 12.1 shows the VA values assigned by different annotators to four music pieces. To characterize the distribution of the emotion annotations for each clip, a typical way is to use a bivariate Gaussian distribution, where the mean vector represents the most likely VA values and the covariance matrix indicates the uncertainty. For a clip with highly subjective affective content, the determinant of the covariance matrix is larger.

Fig. 12.1

Subjects’ annotations of the perceived emotion of four 30-s clips, which from left to right are Dancing Queen by ABBA, Civil War by Guns N’ Roses, Suzanne by Leonard Cohen, and All I Have To Do Is Dream by the Everly Brothers. Each circle here corresponds to a subject’s annotation, and the overall emotion for a clip can be approximated by a 2-D Gaussian distribution (the red cross and blue ellipse). The ellipse outlines the standard deviation of a Gaussian distribution

Existing approaches for predicting the emotion distribution of a music clip from acoustic features fall into two categories. The heatmap approach [53, 75] quantizes each emotion dimension into W equally spaced cells, leading to a \(W \times W\) grid representation of the VA space. The approach trains \(W^2\) regression models, one for predicting the emotion intensity of each cell. Higher intensity at a cell indicates that people are more likely to perceive the corresponding emotion from the clip. The emotion intensity over the VA space creates a heatmap-like representation of the emotion distribution. However, the heatmap is not a continuous representation of emotion, and the emotion intensity cannot strictly be considered a probability estimate.

The Gaussian-parameter approach [52, 75], on the other hand, models the emotion distribution of a clip as a bivariate Gaussian and trains multiple regressors, one for each parameter of the mean vector and the covariance matrix. This makes it easy to apply lessons learned from modeling the mean VA values. In addition, performance analysis of this approach is easier; one can analyze the importance of different acoustic features to each Gaussian parameter individually. However, since the regression models are trained independently, the correlation between valence and arousal is not exploited, and the estimation of the mean and covariance parameters is decoupled.

A different methodology to address the subjectivity is to call for a user-dependent model trained on the annotations of a specific user to personalize the emotion prediction [79, 84, 86]. In [79], two personalization methods are proposed; the first trains a personalized MER system for each individual specifically, whereas the second groups users according to some personal factors (e.g., gender, music experience, and personality) and then trains a group-wise MER system for each user group. Another two-stage personalization scheme has also been studied [82]: the first stage estimates the general perception of a music piece, whereas the second predicts the difference between the general perception and the personal perception of the target user.

We note that none of the aforementioned approaches renders a strict probabilistic interpretation [72]. In addition, much of the existing work is built on discriminative models such as multiple linear regression and SVR. Few attempts have been made to develop a principled probabilistic framework that is technically sound for modeling music emotion and that permits extending a user-independent model to a user-dependent one, preferably in an online fashion.

We also note that most existing work focuses on the annotation aspect of music emotion research, namely MER. Little attention has been paid to the retrieval aspect, that is, the development of emotion-based music retrieval systems [77]. In what follows, we present the AEG model and its applications to both of these aspects.

3 Acoustic Emotion Gaussians: A Generative Approach for Music Emotion Modeling

In [65–68, 72], we proposed AEG, which is fundamentally different from the existing regression or heatmap approaches. As Fig. 12.2 shows, AEG involves a generative process from audio signals to VA emotion distributions. While the relationship between audio and music emotion may sometimes be complicated and difficult to observe directly from an emotion-annotated corpus, AEG uses a set of clip-level latent topics \(\{z_k\}_{k=1}^K\) to resolve this issue.

Fig. 12.2

Illustration of the generative process of the AEG model

We first define the terminology and explain the basic principle of AEG. Suppose that there are K audio descriptors \({\{A_k\}}_{k=1}^K\), each of which is related to some acoustic feature vectors of music clips. Then, we map the associated feature vectors of \(A_k\) to a clip-level topic \(z_k\). To implement each \(A_k\), we use a single Gaussian distribution in the acoustic feature space. The aggregation of the Gaussians \({\{A_k\}}_{k=1}^K\) is called an acoustic GMM (Gaussian mixture model). Subsequently, we map each \(z_k\) to a specific area in the VA space, which is modeled by a bivariate Gaussian distribution \(G_k\). We refer to the aggregation of the Gaussians \({\{G_k\}}_{k=1}^K\) as an affective GMM. Given a clip, its feature vectors are first used to compute the posterior distribution over the topics, termed the topic posterior representation \(\varvec{\theta }\). In \(\varvec{\theta }\), the posterior probability of \(z_k\) (denoted as \(\theta _k\)) is associated with \(A_k\) and indicates the clip’s relevance to \(G_k\). Consequently, the posterior distribution \(\varvec{\theta }=\{\theta _k\}_{k=1}^K\) can be incorporated into learning the affective GMM as well as into making emotion predictions for a clip.

AEG-based MER follows the flow depicted in Fig. 12.2. Based on the \(\varvec{\theta }\) of a test clip, we obtain the weighted affective GMM \(\sum _k \theta _k G_k\), which is able to generate various emotion distributions. In this sense, if a clip’s acoustic features can be completely described by the h-th topic \(z_h\), i.e., \(\theta _h =1\) and \(\theta _k=0\), \(\forall k\ne h\), then its emotion distribution would exactly follow \(G_h\). As will be described in Sect. 12.5, we can further approximate \(\sum _k \theta _k G_k\) by a single, representative affective Gaussian \(\hat{G}\) for simplicity. This is illustrated in the rightmost part of Fig. 12.2.

3.1 Topic Posterior Representation

The topic posterior representation of a music clip is generated from its audio. We note that the temporal dynamics of audio signals is regarded as essential for humans to perceive musical characteristics such as timbre, rhythm, and tonality. To capture more of the local temporal variation of the low-level features, we represent the acoustic features at the segment level, where a segment corresponds to a sufficiently long duration (e.g., 0.4 s). A segment-level feature vector \(\mathbf {x}\) can be formed by, for example, concatenating the mean and standard deviation of the frame-level feature vectors within the segment. As a result, a clip is divided into multiple overlapping segments, which are then represented by a sequence of vectors, \(\{\mathbf {x}_1, \ldots , \mathbf {x}_{T}\}\), where T is the number of segments in the clip.

To start the generative process of AEG, we first learn an acoustic GMM as the basis for representing a clip. This acoustic GMM can be trained using the expectation–maximization (EM) algorithm on a large set of segment-level vectors \(\mathcal {F}\) extracted from existing music clips. The learned acoustic GMM defines the set of audio descriptors \(\{A_k\}_{k=1}^K\), and can be expressed as follows:

$$\begin{aligned} p(\mathbf {x}) = \sum _{k = 1}^K {\pi _k A_k (\mathbf {x} \mid {\mathbf {m}}_k ,{\mathbf {S}}_k )}\,, \end{aligned}$$
(12.1)

where \(A_k(\cdot )\) is the k-th component Gaussian distribution, and \(\pi _k\), \(\mathbf {m}_k\), and \(\mathbf {S}_k\) are its corresponding prior weight, mean vector, and covariance matrix, respectively. Note that we substitute equal weights for the GMM (i.e., \(\pi _k = \frac{1}{K}\), \(\forall k\)), because the original \(\pi _k\) learned from \(\mathcal {F}\) does not reflect the prior distribution of the feature vectors in a clip. Such a heuristic usually results in better performance, as pointed out in [63].

Suppose that we have an emotion-annotated corpus \(\mathcal {X}\) consisting of N music clips \(\{s_i\}_{i=1}^N\). Given a clip \(s_i = \{\mathbf {x}_{i,t}\}_{t=1}^{T_i}\), we then compute the segment-level posterior probability for each feature vector in \(s_i\) based on the acoustic GMM,

$$\begin{aligned} p(A_k \mid \mathbf {x}_{i,t} ) = \frac{ A_k (\mathbf {x}_{i,t}\mid {\mathbf {m}}_k ,{\mathbf {S}}_k )}{\sum \nolimits _{h = 1}^K A_h (\mathbf {x}_{i,t} \mid {\mathbf {m}}_h ,{\mathbf {S}}_h ) }\,. \end{aligned}$$
(12.2)

Finally, the clip-level topic posterior probability \(\theta _{i,k}\) of \(s_i\) can be approximated by averaging the segment-level ones,

$$\begin{aligned} \theta _{i,k} \leftarrow p(z_k \mid s_i ) \approx \frac{1}{{T_i }}\sum _{t = 1}^{T_i} p(A_k \mid \mathbf {x}_{i,t}) \,. \end{aligned}$$
(12.3)

This approximation assumes that each segment of \(s_i\) contributes equally to \(\theta _{i,k}\), which is thereby capable of representing the clip’s acoustic features. We use a vector \(\varvec{\theta }_i \in \mathbb {R}^K\), whose k-th component is \(\theta _{i,k}\), as the topic posterior of \(s_i\).
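As a minimal sketch of Eqs. 12.1–12.3, the snippet below fits an acoustic GMM with scikit-learn and averages the segment-level posteriors into a clip-level topic posterior; the segment matrices and the number of topics K are illustrative assumptions, not the actual training data.

```python
# Sketch of the topic posterior computation (Eqs. 12.1-12.3); all data here are
# placeholders. The learned mixture weights are divided out and the result is
# renormalized, which implements the equal-weight heuristic (pi_k = 1/K).
import numpy as np
from sklearn.mixture import GaussianMixture

K = 64
segment_pool = np.random.rand(10000, 144)        # segment vectors from many clips
acoustic_gmm = GaussianMixture(n_components=K, covariance_type="full",
                               random_state=0).fit(segment_pool)

def topic_posterior(clip_segments, gmm=acoustic_gmm):
    """Average the segment-level posteriors p(A_k | x_t) over a clip (Eq. 12.3)."""
    resp = gmm.predict_proba(clip_segments) / gmm.weights_   # drop the learned pi_k
    resp /= resp.sum(axis=1, keepdims=True)                  # Eq. 12.2 with pi_k = 1/K
    return resp.mean(axis=0)                                 # theta_i

theta_i = topic_posterior(np.random.rand(200, 144))          # one clip's segments
```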

3.2 Prior Model for Emotion Annotation

To account for the subjectivity of emotional responses to a music clip, we ask multiple subjects to annotate the clip. However, as some subjects’ annotations may not be reliable, we introduce a user prior model to quantify the contribution of each subject.

Let \(\mathbf {e}_{i,j} \in \mathbb {R}^2\) (a vector including the valence and arousal values) denote one of the annotations of \(s_i\) given by the j-th subject, and let \(U_i\) denote the number of subjects who have annotated \(s_i\). Note that \(\mathbf {e}_{q,j}\) and \(\mathbf {e}_{r,j}\), where \(q\ne r\), may not correspond to the same subject. Then, we build the user prior model \(\gamma \) to describe the confidence of \(\mathbf {e}_{i,j}\) in \(s_i\) using a single Gaussian distribution,

$$\begin{aligned} \gamma (\mathbf {e}_{i,j} \mid s_i ) \equiv G({\mathbf {e}}_{i,j} \mid \mathbf {a}_i ,\mathbf {B}_i ), \end{aligned}$$
(12.4)

where \(\mathbf {a}_i=\frac{1}{U_i}\sum _{j=1}^{U_i} \mathbf {e}_{i,j}\), \(\mathbf {B}_i = \frac{1}{U_i} \sum _{j=1}^{U_i} (\mathbf {e}_{i,j}-\mathbf {a}_i)(\mathbf {e}_{i,j}-\mathbf {a}_i)^T\), and \(G(\mathbf {e} \mid \mathbf {a}_i, \mathbf {B}_i)\) is called the annotation Gaussian of \(s_i\). One can observe what \(\mathbf {a}_i\) and \(\mathbf {B}_i\) look like from the four example clips in Fig. 12.1. Empirical results show that a single Gaussian performs better than a GMM for setting up \(\gamma (\cdot )\) [67].

The confidence of \(\mathbf {e}_{i,j}\) can be estimated based on the likelihood calculated by Eq. 12.4. If an annotation is far away from the mean, its likelihood is accordingly small. In addition to Gaussian distributions, any criterion that is able to reflect the importance of a user’s annotation of a clip can be applied to \(\gamma \).

The probability of \(\mathbf {e}_{i,j}\), referred to as the clip-level annotation prior, can be calculated by normalizing the likelihood of \(\mathbf {e}_{i,j}\) over the cumulative likelihood of all other annotations in \(s_i\),

$$\begin{aligned} p(\mathbf {e}_{i,j} \mid s_i ) \equiv \frac{ \gamma ( \mathbf {e}_{i,j} \mid s_i )}{\sum \nolimits _{r=1}^{U_i} \gamma (\mathbf {e}_{i,r} \mid s_i ) }\,. \end{aligned}$$
(12.5)

Based on the clip-level annotation prior, we further define the corpus-level clip prior to describe the importance of each clip,

$$\begin{aligned} p(s_i \mid \mathcal {X}) \equiv \frac{\sum \nolimits _{j=1}^{U_i} \gamma (\mathbf {e}_{i,j} \mid s_i)}{ \sum \nolimits _{q=1}^N \sum \nolimits _{r=1}^{U_q} \gamma (\mathbf {e}_{q,r} \mid s_q) }\,. \end{aligned}$$
(12.6)

From Eqs. 12.5 and 12.6 we can make two observations. First, if a clip’s annotations are consistent (i.e., \(\mathbf {B}_i\) is small), it is considered less subjective. Second, if a clip is annotated by more subjects, the corresponding \(\gamma \) model should be more reliable. As a result, we can define the corpus-level annotation prior \(\gamma _{i,j}\) for each \(\mathbf {e}_{i,j}\) in the corpus \(\mathcal {X}\) by multiplying Eqs. 12.5 and 12.6:

$$\begin{aligned} \gamma _{i,j} \leftarrow p(\mathbf {e}_{i,j} \mid \mathcal {X}) \equiv \frac{\gamma (\mathbf {e}_{i,j} \mid s_i)}{\sum \nolimits _{q=1}^N \sum \nolimits _{r=1}^{U_q} \gamma (\mathbf {e}_{q,r} \mid s_q)}\,, \end{aligned}$$
(12.7)

which is computed beforehand and fixed in learning the affective GMM.
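The annotation priors in Eqs. 12.4–12.7 can be precomputed as in the sketch below, where `annotations` is a hypothetical list holding, for each clip, a \((U_i, 2)\) array of its VA annotations; the small regularizer added to \(\mathbf {B}_i\) is a numerical safeguard, not part of the original formulation.

```python
# Sketch of the corpus-level annotation prior gamma_{i,j} (Eq. 12.7).
import numpy as np
from scipy.stats import multivariate_normal

def annotation_priors(annotations, reg=1e-6):
    """Return a list of per-clip arrays whose entries sum to 1 over the corpus."""
    likelihoods = []
    for E in annotations:                                # E: (U_i, 2) annotations of one clip
        a = E.mean(axis=0)                               # mean vector a_i
        B = np.cov(E, rowvar=False, bias=True) + reg * np.eye(2)   # covariance B_i
        likelihoods.append(multivariate_normal(a, B).pdf(E))       # gamma(e_ij | s_i), Eq. 12.4
    total = sum(lh.sum() for lh in likelihoods)          # corpus-level normalizer
    return [lh / total for lh in likelihoods]            # gamma_{i,j}, Eq. 12.7
```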

3.3 Learning the Affective GMM

Given a training music clip \(s_i\) in the corpus \(\mathcal {X}\), we assume the emotional responses can be generated from an affective GMM weighted by its topic posterior \(\varvec{\theta }_i\),

$$\begin{aligned} p(\mathbf {e}_{i,j} \mid \varvec{\theta }_i) = \sum _{k = 1}^K {\theta _{i,k} G_k(\mathbf {e}_{i,j} \mid \varvec{\mu }_k ,\varvec{\varSigma }_k )} \,, \end{aligned}$$
(12.8)

where \(G_k(\cdot )\) is the k-th affective Gaussian with mean \(\varvec{\mu }_k\) and covariance \(\varvec{\varSigma }_k\) to be learned. Here \(\theta _{i,k}\) stands for the fixed weight associated with \(A_k\) that carries the audio characteristics of \(s_i\). We therefore call \(\varvec{\theta }_i\) an acoustic prior. The objective function then takes the form of the marginal likelihood function of the annotations

$$\begin{aligned} p(\mathbf {E} \mid \mathcal {X}, \varvec{\varLambda }) & = \sum _{i=1}^N p(s_i \mid \mathcal {X}) \sum _{j=1}^{U_i} p(\mathbf {e}_{i,j} \mid s_i)\, p(\mathbf {e}_{i,j} \mid \varvec{\theta }_i, \varvec{\varLambda }) \\ & = \sum _{i=1}^N \sum _{j=1}^{U_i} p(s_i \mid \mathcal {X})\, p(\mathbf {e}_{i,j} \mid s_i)\, p({\mathbf {e}}_{i,j} \mid \varvec{\theta }_i, \varvec{\varLambda }) \\ & = \sum _{i=1}^N \sum _{j=1}^{U_i} p(\mathbf {e}_{i,j} \mid \mathcal {X}) \sum _{k=1}^K \theta _{i,k}\, G_k( \mathbf {e}_{i,j} \mid {\varvec{\mu }}_k, {\varvec{\varSigma }}_k ) \,, \end{aligned}$$
(12.9)

where \(\mathbf {E} = \{\mathbf {e}_{i,j}\}_{i=1,j=1}^{N,U_i}\), \(\mathcal {X} = \{s_i,\varvec{\theta }_i\}_{i=1}^N\), and \(\varvec{\varLambda }=\{\varvec{\mu }_k,\varvec{\varSigma }_k\}_{k=1}^K\) is the parameter set of the affective GMM. Taking the logarithm of Eq. 12.9 and replacing \(p(\mathbf {e}_{i,j} \mid \mathcal {X})\) by \(\gamma _{i,j}\) leads to

$$\begin{aligned} L = \log \sum _i \sum _j \gamma _{i,j} \sum _k \theta _{i,k} G_k(\mathbf {e}_{i,j} \mid \varvec{\mu }_k , \varvec{\varSigma }_k) \,, \end{aligned}$$
(12.10)

where \({\sum _i}{\sum _j}{\gamma _{i,j}}=1\). To learn the affective GMM, we can maximize the log-likelihood in Eq. 12.10 with respect to the Gaussian parameters. We first derive a lower bound of L according to Jensen’s inequality,

$$\begin{aligned} L \ge L_\text {bound}=\sum _i \sum _j \gamma _{i,j} \log \sum _k \theta _{i,k} G_k(\mathbf {e}_{i,j} \mid \varvec{\mu }_k, \varvec{\varSigma }_k ) \,. \end{aligned}$$
(12.11)

Then, we treat \(L_\text {bound}\) as a surrogate of L and use the EM algorithm [3] to estimate the parameters of the affective GMM. In the E-step, we derive the expectation over the posterior distribution of \(z_k\) for all the training annotations,

$$\begin{aligned} Q = \sum _i \sum _j \gamma _{i,j} \sum _k p(z_k \mid \mathbf {e}_{i,j}) \Big ( \log \theta _{i,k} + \log G_k(\mathbf {e}_{i,j} \mid \varvec{\mu }_k, \varvec{\varSigma }_k ) \Big ) \,, \end{aligned}$$
(12.12)

where

$$\begin{aligned} p(z_k \mid \mathbf {e}_{i,j}) = \frac{\theta _{i,k} G_k(\mathbf {e}_{i,j} \mid \varvec{\mu }_k, \varvec{\varSigma }_k )}{\sum \nolimits _{h = 1}^K \theta _{i,h} G_h(\mathbf {e}_{i,j} \mid \varvec{\mu }_h, \varvec{\varSigma }_h)} \,. \end{aligned}$$
(12.13)

In the M-step, we first set the derivative of Eq. 12.12 with respect to \(\varvec{\mu }_k\) to zero and obtain the update rule for the mean vector,

$$\begin{aligned} \varvec{\mu }'_k \leftarrow \frac{\sum _i \sum _j \gamma _{i,j} p(z_k \mid \mathbf {e}_{i,j}) \mathbf {e}_{i,j}}{\sum _i \sum _j \gamma _{i,j} p(z_k \mid \mathbf {e}_{i,j})}\,. \end{aligned}$$
(12.14)

Following a similar line of reasoning, we obtain the update rule for \(\varvec{\varSigma }_k\):

$$\begin{aligned} \varvec{\varSigma }'_k \leftarrow \frac{\sum _i\sum _j\gamma _{i,j}p(z_k\mid \mathbf {e}_{i,j})(\mathbf {e}_{i,j}-\varvec{\mu }'_k)(\mathbf {e}_{i,j}-\varvec{\mu }'_k )^T }{\sum _i \sum _j \gamma _{i,j} p(z_k \mid \mathbf {e}_{i,j})} \,. \end{aligned}$$
(12.15)

Theoretically, the EM algorithm iteratively maximizes the \(L_{\text {bound}}\) value in Eq. 12.11 until convergence. In practice, one can fix the maximal number of iterations or set a stopping criterion on the relative increase of \(L_{\text {bound}}\).

Note that we can ignore the annotation prior by setting a uniform distribution, i.e., \(\forall i, j\), \(\gamma _{i,j} = 1\). This case is called “AEG Uniform” in the evaluation. In contrast, the case with nonuniform annotation prior is called “AEG AnnoPrior.”
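For concreteness, the EM updates of Eqs. 12.13–12.15 can be sketched as follows, assuming the corpus has been flattened so that each row of `E` holds one annotation, `gamma` holds the corresponding priors \(\gamma _{i,j}\) (or uniform weights for AEG Uniform), `Theta` holds the topic posterior of each annotation’s clip, and `mu`, `Sigma` hold initial affective Gaussian parameters; this is an illustrative sketch rather than the exact implementation used in [67, 72].

```python
# Sketch of learning the affective GMM with EM (Eqs. 12.13-12.15).
import numpy as np
from scipy.stats import multivariate_normal

def learn_affective_gmm(E, gamma, Theta, mu, Sigma, n_iter=10):
    """Early-stopped EM; E: (n, 2), gamma: (n,), Theta: (n, K), mu: (K, 2), Sigma: (K, 2, 2)."""
    K = Theta.shape[1]
    for _ in range(n_iter):                       # small iteration count, cf. Sect. 12.3.4
        # E-step: p(z_k | e_ij), Eq. 12.13
        dens = np.column_stack([multivariate_normal(mu[k], Sigma[k]).pdf(E)
                                for k in range(K)])
        resp = Theta * dens
        resp /= resp.sum(axis=1, keepdims=True)
        w = gamma[:, None] * resp                 # gamma_{i,j} * p(z_k | e_ij)
        # M-step: Eqs. 12.14 and 12.15
        w_k = w.sum(axis=0)
        mu = (w.T @ E) / w_k[:, None]
        for k in range(K):
            d = E - mu[k]
            Sigma[k] = (w[:, k, None] * d).T @ d / w_k[k]
    return mu, Sigma
```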

3.4 Discussion

As Eqs. 12.14 and 12.15 show, the re-estimated parameters \(\varvec{\mu }'_k\) and \(\varvec{\varSigma }'_k\) are collectively contributed by \(\mathbf {e}_{i,j}, \forall ~i,~j\), with the weights governed by the product of \(\gamma _{i,j}\) and \(p(z_k \mid \mathbf {e}_{i,j})\). Consequently, the learning process seamlessly takes the annotation prior, acoustic prior, and annotation clusters over the current affective GMM into consideration. In such a way, the annotations of different clips can be shared with one another according to their corresponding prior probabilities. This can be a key factor that enables AEG to generalize the audio-to-emotion mapping.

As the affective GMM fits the data more and more closely, a small number of affective Gaussian components might overfit to a few emotion annotations, giving rise to the so-called singularity problem [3]. When this occurs, the corresponding covariance matrices become non-positive definite (non-PD). For example, when a component affective Gaussian is contributed to by only one or two annotations, its covariance shape degenerates to a point or a straight line in the VA space. To tackle this issue, we can remove a component Gaussian whenever it produces a non-PD covariance matrix during the EM iterations [72].

We note that “early stop” is a very important heuristic when learning the affective GMM. We find that setting a small maximal number of iterations (e.g., 7–11) or a larger stopping threshold on the relative increase of \(L_{\text {bound}}\) (e.g., 0.01) empirically leads to better generalizability. It not only prevents the aforementioned singularity problem but also avoids overfitting to the training data. Empirical results show that the accuracy of MER improves as the iterations evolve and then degrades after the optimal iteration number has been reached [72]. Moreover, AEG AnnoPrior empirically converges faster and learns smaller covariances than AEG Uniform does.

4 Personalization with AEG

The capability for personalization is a very important characteristic that completes the AEG framework, making it more applicable to real-world applications. As AEG is a probabilistic, parametric model, it can incorporate personal information of a particular user via model adaptation techniques to make custom predictions. While such personal information may include personal emotion annotation, user profile, transaction records, listening history, and relevance feedback, we focus on the use of personal emotion annotations in this chapter.

Because of the cognitive load of annotating music emotion, it is usually not easy to collect at once a sufficient amount of personal annotations for the system to reach an acceptable performance level. Rather, a user may provide annotations sporadically in different listening sessions. To this end, an online learning strategy [5] is desirable. When the annotations of a target user are scarce, a good online learning method needs to prevent over-fitting to the personal data in order to retain a certain degree of model generalizability. In other words, we cannot totally ignore the contributions of emotion perceptions from other users. Motivated by the Gaussian Mixture Model–Universal Background Model (GMM-UBM) speaker verification system [48], we first treat the affective GMM learned from a broad set of subjects (called background users) as a background (general) model, and then employ a maximum a posteriori (MAP)-based method [16, 48] to update the parameters of the background model using the personal annotations in an online manner. Theoretically, the resulting personalized model will find an appropriate trade-off between the target user’s annotations and the background model.

4.1 Model Adaptation

In what follows, the acoustic GMM stays fixed throughout the personalization process, since it is used as a reference model to represent the music audio. In contrast, the affective GMM is assumed to have been learned from plenty of emotion annotations given by a large number of subjects, so it provides a sufficient representation (well-trained parameters) of user-independent (i.e., general) emotion perception. Our goal is to learn the personal perception by adapting the affective GMM \(\varvec{\varLambda }\) accordingly.

Suppose that we have a target user \(u_\star \) who has annotated M music clips, denoted as \(\mathcal {X_\star } = \{\mathbf {e}_i,\varvec{\theta }_i\}_{i=1}^M\), where \(\mathbf {e}_i\) and \(\varvec{\theta }_i\) are the emotion annotation and the topic posterior of a clip, respectively. We first compute each posterior probability over the latent topics based on the background affective GMM,

$$\begin{aligned} p(z_k \mid \mathbf {e}_i, \varvec{\theta }_i ) = \frac{\theta _{i,k} G_k( \mathbf {e}_i \mid \varvec{\mu }_k, \varvec{\varSigma }_k )}{\sum _{h = 1}^K \theta _{i,h} G_h( \mathbf {e}_i \mid \varvec{\mu }_h, \varvec{\varSigma }_h ) }. \end{aligned}$$
(12.16)

Then, we derive the expected sufficient statistics on \(\mathcal {X_\star }\) over the posterior distribution of \(p(z_k \mid \mathbf {e}_i, \varvec{\theta }_i)\) for the mixture weight, mean, and covariance parameters:

$$\begin{aligned} \varGamma _k =&\sum _{i=1}^M p( z_k \mid \mathbf {e}_i , \varvec{\theta }_i )\,, \end{aligned}$$
(12.17)
$$\begin{aligned} \mathbb {E} (\varvec{\mu }_k ) =&\frac{1}{\varGamma _k}\sum _{i=1}^M p( z_k \mid \mathbf {e}_i, \varvec{\theta }_i) \mathbf {e}_i \,, \end{aligned}$$
(12.18)
$$\begin{aligned} \mathbb {E} (\varvec{\varSigma }_k ) =&\frac{1}{\varGamma _k}\sum _{i=1}^M p( z_k \mid \mathbf {e}_i, \varvec{\theta }_i ) \big (\mathbf {e}_i - \mathbb {E}(\varvec{\mu }_k) \big ) \big (\mathbf {e}_i -\mathbb {E} (\varvec{\mu }_k) \big )^T \,. \end{aligned}$$
(12.19)

Finally, the new parameters of the personalized affective GMM can be obtained according to the MAP criterion [16]. The resulting update rules take the form of interpolations between the expected sufficient statistics (i.e., \(\mathbb {E}(\varvec{\mu }_k)\) and \(\mathbb {E}(\varvec{\varSigma }_k)\)) and the parameters of the background model (i.e., \(\varvec{\mu }_k\) and \(\varvec{\varSigma }_k\)) as follows:

$$\begin{aligned} \varvec{\mu }'_k \leftarrow \alpha _k^\text {m} \mathbb {E}(\varvec{\mu }_k) + \left( 1 - \alpha _k^\text {m} \right) {\varvec{\mu }_k} \, , \end{aligned}$$
(12.20)
$$\begin{aligned} \varvec{\varSigma }'_k \leftarrow \alpha _k^\text {v} \mathbb {E}(\varvec{\varSigma }_k) + \left( 1- \alpha _k^\text {v} \right) \left( \varvec{\varSigma }_k + \varvec{\mu }_k \varvec{\mu }_k^T \right) - \varvec{\mu }'_k ({\varvec{\mu }'_k})^T\,. \end{aligned}$$
(12.21)

The coefficients \(\alpha _k^\text {m}\) and \(\alpha _k^\text {v}\) are data-dependent and are defined as

$$\begin{aligned} \alpha _k^\text {m} = \frac{\varGamma _k}{\varGamma _k + \beta ^\text {m}} \,,~~~ \alpha _k^\text {v} = \frac{\varGamma _k}{\varGamma _k + \beta ^\text {v}} \,, \end{aligned}$$
(12.22)

where \(\beta ^\text {m}\) and \(\beta ^\text {v}\) are related to the hyperparameters [16] and thus should be set empirically. Note that there is no need to update the mixture weights, as they are given by the fixed topic posterior weights.
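The following sketch implements the MAP adaptation exactly as written in Eqs. 12.16–12.22; `E_star` and `Theta_star` are a hypothetical user’s annotations and the topic posteriors of the corresponding clips, and only components that actually receive posterior mass are moved away from the background model.

```python
# Sketch of MAP adaptation of the affective GMM (Eqs. 12.16-12.22).
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt(E_star, Theta_star, mu, Sigma, beta_m=0.01, beta_v=0.01):
    """E_star: (M, 2) personal annotations; Theta_star: (M, K) topic posteriors."""
    K = mu.shape[0]
    dens = np.column_stack([multivariate_normal(mu[k], Sigma[k]).pdf(E_star)
                            for k in range(K)])
    resp = Theta_star * dens
    resp /= resp.sum(axis=1, keepdims=True)                 # Eq. 12.16
    Gamma = resp.sum(axis=0)                                # Eq. 12.17
    mu_new, Sigma_new = mu.copy(), Sigma.copy()
    for k in range(K):
        if Gamma[k] == 0:                                   # topic unused by this user:
            continue                                        # keep the background parameters
        m_k = resp[:, k] @ E_star / Gamma[k]                # Eq. 12.18
        d = E_star - m_k
        S_k = (resp[:, k, None] * d).T @ d / Gamma[k]       # Eq. 12.19
        a_m = Gamma[k] / (Gamma[k] + beta_m)                # Eq. 12.22
        a_v = Gamma[k] / (Gamma[k] + beta_v)
        mu_new[k] = a_m * m_k + (1 - a_m) * mu[k]           # Eq. 12.20
        Sigma_new[k] = (a_v * S_k                           # Eq. 12.21
                        + (1 - a_v) * (Sigma[k] + np.outer(mu[k], mu[k]))
                        - np.outer(mu_new[k], mu_new[k]))
    return mu_new, Sigma_new
```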

4.2 Discussion

The MAP-based method is preferable in that we can determine the interpolation factor that balances the contributions of the personal annotations and the background model without loss of model generalizability, as demonstrated by its effectiveness and efficiency in speaker adaptation tasks [48]. If a personal annotation \(\{\mathbf {e}_m,\varvec{\theta }_m\}\) is highly correlated with a latent topic \(z_k\) (i.e., \(p(z_k \mid \mathbf {e}_m, \varvec{\theta }_m)\) is large), the annotation will contribute more to the update of \(\{\varvec{\mu }'_k, \varvec{\varSigma }'_k\}\). In contrast, if the user’s annotations have nothing to do with \(z_h\) (i.e., the cumulative posterior probability \(\varGamma _h=0\)), the parameters \(\{\varvec{\mu }'_h, \varvec{\varSigma }'_h\}\) remain the same as those of the background model, since \(\alpha _h^\text {m}\) and \(\alpha _h^\text {v}\) are then 0.

Another advantage of the MAP-based method is that users are free to provide personal annotations for whatever songs they like, such as the songs they are more familiar with. This can help reduce the cognitive load of the personalization process. As the AEG framework is audio-based, the annotated clips can be arbitrary and do not have to be included in the corpus used to train the background model.

Finally, we note that the model adaptation procedure only needs to be performed once (it does not iterate), so the algorithm is fairly efficient: it only requires computing the expected sufficient statistics and updating the parameters K times. In consequence, we can keep refining the background model whenever a small number of personal annotations become available, and readily use the updated model for personalized MER or music retrieval. The model adaptation method for GMMs is not limited to the MAP method; we refer interested readers to [7, 35] for more advanced methods.

5 AEG-Based Music Emotion Recognition

5.1 Algorithm

As described in Sect. 12.3, we predict the emotion distribution of an unseen clip by weighting the affective GMM with the clip’s topic posterior \({\hat{\varvec{\theta }}} = \{{\hat{\theta }}_k\}_{k=1}^K\) as

$$\begin{aligned} p(\mathbf {e} \mid {\hat{\varvec{\theta }}}) = \sum _{k = 1}^K \hat{\theta }_{k} G_k( \mathbf {e} \mid \varvec{\mu }_k ,\varvec{\varSigma }_k ) \,. \end{aligned}$$
(12.23)

In addition, we can also use a single, representative affective Gaussian \(G(\hat{\varvec{\mu }}, \hat{\varvec{\varSigma }})\) to summarize the weighted affective GMM. This can be done by solving the following optimization problem

$$\begin{aligned} \underset{{\hat{\varvec{\mu }}}, {\hat{\varvec{\varSigma }}}}{\min } ~~ \sum _{k=1}^K {\hat{\theta }}_k D_\text {KL} \big ( G_k( {\varvec{\mu }}_k, {\varvec{\varSigma }}_k ) ~\big | \big |~ G( {\hat{\varvec{\mu }}}, {\hat{\varvec{\varSigma }}} ) \big ) \,, \end{aligned}$$
(12.24)

where

$$\begin{aligned} D_\text {KL} ( G_A \parallel G_B) = \frac{1}{2} \Big ( \text {tr}(\varvec{\varSigma }_A \varvec{\varSigma }_B^{-1}) - \log \mid \varvec{\varSigma }_A \varvec{\varSigma }_B^{-1}\mid + (\varvec{\mu }_A-\varvec{\mu }_B)^T {\varvec{\varSigma }}_B^{-1} (\varvec{\mu }_A-\varvec{\mu }_B) - 2 \Big ) \end{aligned}$$
(12.25)

denotes the one-way (asymmetric) Kullback–Leibler (KL) divergence (a.k.a. relative entropy) [35] from \(G_A(\varvec{\mu }_A, \varvec{\varSigma }_A)\) to \(G_B(\varvec{\mu }_B, \varvec{\varSigma }_B)\). This optimization problem is strictly convex in \({\hat{\varvec{\mu }}}\) and \({\hat{\varvec{\varSigma }}}\), which means that there is a unique minimizer for each of the two variables [11]. Setting the partial derivative with respect to \({\hat{\varvec{\mu }}}\) to zero, we have

$$\begin{aligned} \sum \nolimits _k {\hat{\theta }}_k ( 2 {\hat{\varvec{\mu }}} - 2 \varvec{\mu }_k) =0 \, . \end{aligned}$$
(12.26)

Given the fact that \(\sum _k {\hat{\theta }}_k = 1\), we derive

$$\begin{aligned} {\hat{\varvec{\mu }}} = \sum _{k=1}^K {\hat{\theta }_k} \varvec{\mu }_k \,. \end{aligned}$$
(12.27)

Setting the partial derivative with respect to \({\hat{\varvec{\varSigma }}}^{-1}\) to 0,

$$\begin{aligned} \sum \nolimits _k \hat{\theta }_k \left( {\varvec{\varSigma }}_k - {\hat{\varvec{\varSigma }}} + \left( {\varvec{\mu }}_k - {\hat{\varvec{\mu }}} \right) {\left( {\varvec{\mu }}_k - {\hat{\varvec{\mu }}} \right) }^T \right) = 0\,, \end{aligned}$$
(12.28)

we obtain the optimal covariance matrix as

$$\begin{aligned} {\hat{\varvec{\varSigma }}} = \sum _{k=1}^K {{\hat{\theta }}_k} \left( {\varvec{\varSigma }}_k + \left( {\varvec{\mu }}_k - {\hat{\varvec{\mu }}} \right) \left( {\varvec{\mu }}_k - {\hat{\varvec{\mu }}} \right) ^T \right) \,. \end{aligned}$$
(12.29)
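A direct implementation of Eqs. 12.27 and 12.29 is sketched below; `theta` is the clip’s topic posterior and `mu`, `Sigma` are the learned affective GMM parameters.

```python
# Sketch of collapsing the weighted affective GMM into a single representative
# Gaussian (Eqs. 12.27 and 12.29).
import numpy as np

def representative_gaussian(theta, mu, Sigma):
    """theta: (K,), mu: (K, 2), Sigma: (K, 2, 2)."""
    mu_hat = theta @ mu                                        # Eq. 12.27
    d = mu - mu_hat                                            # deviations of the K means
    Sigma_hat = (np.einsum("k,kij->ij", theta, Sigma)
                 + np.einsum("k,ki,kj->ij", theta, d, d))      # Eq. 12.29
    return mu_hat, Sigma_hat
```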

5.2 Discussion

Representing the predicted result as a single Gaussian is desirable in practice, because it is easier and more straightforward to interpret or visualize the emotion prediction for users with only a single mean (center) and covariance (uncertainty). However, this may run counter to the theoretical arguments given in favor of a GMM, which permits emotion modeling at a finer granularity. For instance, a single Gaussian is inadequate for clips whose emotional responses are inherently bi-modal. We note that in applications such as emotion-based music retrieval (cf. Sect. 12.6) and music video generation [66], one can directly use the raw weighted GMM (i.e., Eq. 12.23) as the emotion index of a song in response to queries given in the VA space. We will detail this aspect later in Sect. 12.6.

The computation of Eqs. 12.27 and 12.29 is quite efficient. The complexity depends mainly on K and the number of segments T of a clip: computing \(\theta _k\) requires KT operations (cf. Eq. 12.2), whereas computing \({\hat{\varvec{\mu }}}\) and \({\hat{\varvec{\varSigma }}}\) requires K vector multiplications and K matrix operations, respectively. This efficiency is important for dealing with a large-scale music database and for applications such as real-time music emotion tracking on a mobile device [27, 53, 64, 70, 71].

5.3 Evaluation on General MER

5.3.1 Dataset

We use the AMG1608 dataset [8] for evaluating both general and personalized MER. The dataset contains 1,608 30-s music clips annotated by 665 subjects (345 male; average age \(32.0\pm 11.4\)) recruited mostly from the crowdsourcing platform Mechanical Turk [44]. The subjects were asked to rate the VA values that best describe their general (instead of moment-to-moment) emotion perception of each clip via the internet. The VA values, which are real values in the range [–1, 1], are entered by clicking on the emotion space on a square interface panel. The subjects were instructed to rate the perceived rather than felt emotion. Each music clip was annotated by 15–32 subjects. Each subject annotated 12–924 clips, and 46 of the 665 subjects annotated more than 150 music clips, making the dataset a useful corpus for research on MER personalization. The average Krippendorff’s \(\alpha \) across the music clips is 0.31 for valence and 0.46 for arousal, both of which are in the range of fair agreement. Please refer to [8] for more details about this dataset.

5.3.2 Acoustic Features

As different emotion perceptions are usually associated with different patterns of features [18], we use two toolboxes, MIRtoolbox [36] and YAAFE [42], to extract four sets of frame-based features from the audio signals, including MFCC-related features, tonal features, spectral features, and temporal features, as listed in Table 12.1. We down-sample all the audio clips in AMG1608 to 22,050 Hz and normalize them to the same volume level. All the frame-based features are extracted with the same frame size of 50 ms and 50 % hop size. Each dimension of the frame-based feature vectors is normalized to zero mean and unit standard deviation. We concatenate all four sets of features for each frame, as this leads to better performance in acoustic modeling in our pilot study [83]. As a result, a frame-level feature vector is 72-dimensional.

Table 12.1 Frame-based acoustic features used in the evaluation

However, it does not make sense to analyze and predict music emotion for a specific frame. Instead of the bag-of-frames approach [61, 63], we adopt the bag-of-segments approach for the topic posterior representation, because a segment is able to capture more of the local temporal variation of the low-level features. Our preliminary results have also confirmed this hypothesis. To generate a segment-level feature vector representing a basic term in the bag-of-segments approach, we concatenate the mean and standard deviation of 16 consecutive frame-level feature vectors, leading to a 144-dimensional vector for a segment. The hop size for a segment is four frames. Given the acoustic GMM (cf. Eq. 12.1), we then follow Eqs. 12.2 and 12.3 described in Sect. 12.3.1 to compute the topic posterior vector of a music clip.
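The bag-of-segments construction described above can be sketched as follows; the frame matrix is assumed to hold the 72-D normalized frame-level vectors of one clip.

```python
# Sketch of the bag-of-segments representation: 16 consecutive frame vectors are
# summarized by their mean and standard deviation into one 144-D segment vector,
# with a hop of four frames.
import numpy as np

def frames_to_segments(frames, seg_len=16, hop=4):
    """frames: (n_frames, 72) array of normalized frame-level features."""
    segments = []
    for start in range(0, len(frames) - seg_len + 1, hop):
        window = frames[start:start + seg_len]
        segments.append(np.concatenate([window.mean(axis=0), window.std(axis=0)]))
    return np.asarray(segments)                       # (n_segments, 144)
```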

5.3.3 Evaluation Metrics

The accuracy of general MER is evaluated using three performance metrics: the two-way KL divergence (KL2) [35], the Euclidean distance, and \(R^2\) (also known as the coefficient of determination) [58]. The first two measure the distance between the prediction and the ground truth: the lower the value, the better the performance. KL2 considers the performance with respect to the bivariate Gaussian distribution of a clip, while the Euclidean distance is concerned with the VA mean only. \(R^2\) is also concerned with the VA mean only; in contrast to the distance measures, a higher \(R^2\) value is preferred. Moreover, \(R^2\) is computed separately for valence and arousal.

Specifically, we are given the distribution of the ground-truth annotations \(\mathcal {N}_i= G( \mathbf {a}_i,\mathbf {B}_i)\) (cf. Sect. 12.3.2) and the predicted distribution \(\hat{\mathcal {N}}_i=G( \hat{\varvec{\mu }}_i, \hat{\varvec{\varSigma }}_i)\) of each test clip, both of which are modeled as bivariate Gaussian distributions, where \(i \in \{1,\ldots ,N\}\) denotes the index of a clip in the test set. Instead of the one-way KL divergence (cf. Eq. 12.25) used for determining the representative Gaussian, we evaluate the performance of emotion distribution prediction based on the KL2 divergence defined by

$$\begin{aligned} D_\text {KL2} (G_A, G_B) \equiv \frac{1}{2} \Big ( D_\text {KL}(G_A \parallel G_B) + D_\text {KL}( G_B \parallel G_A) \Big ) \,. \end{aligned}$$
(12.30)

The average KL2 divergence (AKL), which measures the symmetric distance between the predicted emotion distribution and the ground-truth one, is computed by \(\tfrac{1}{N}\sum \nolimits _{i=1}^{N} D_\text {KL2}(\mathcal {N}_i, \hat{\mathcal {N}}_i)\). Using the \(l_2\) norm, we compute the average Euclidean distance (AED) between the mean vectors of the two Gaussian distributions by \(\frac{1}{N} \sum \nolimits _{i=1}^{N}{\Vert \mathbf {a}_i - {\hat{\varvec{\mu }}}_i \Vert _2 }\). The \(R^2\) statistic is a standard way to measure the fitness of regression models [58]. It is used to evaluate the prediction accuracy as follows:

$$\begin{aligned} R^2= 1-\frac{\sum _{i=1}^N ({\hat{e}}_{i} - e_i )^2 }{\sum _{i=1}^{N}{ (e_i - \bar{e} )^2 } }\,, \end{aligned}$$
(12.31)

where \(\hat{e}_{i}\) and \(e_i\) denote the predicted (either valence or arousal) value and the ground-truth value of a clip, respectively, and \(\bar{e}\) is the mean of the ground-truth values over the test set. When the predictive model perfectly fits the ground-truth values, \( R^2 \) is equal to 1. If the predictive model does not fit the ground truth well, \( R^2 \) may become negative.
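The three metrics can be computed as in the sketch below, where each ground-truth and predicted distribution is given as a (mean, covariance) pair and the \(R^2\) statistic is evaluated per dimension.

```python
# Sketch of the evaluation metrics: one-way KL (Eq. 12.25), KL2 (Eq. 12.30),
# and the coefficient of determination (Eq. 12.31).
import numpy as np

def kl_gauss(mu_a, S_a, mu_b, S_b):
    """One-way KL divergence between two bivariate Gaussians."""
    S_b_inv = np.linalg.inv(S_b)
    M = S_a @ S_b_inv
    d = mu_a - mu_b
    return 0.5 * (np.trace(M) - np.log(np.linalg.det(M)) + d @ S_b_inv @ d - 2)

def kl2(mu_a, S_a, mu_b, S_b):
    """Symmetric KL2 divergence."""
    return 0.5 * (kl_gauss(mu_a, S_a, mu_b, S_b) + kl_gauss(mu_b, S_b, mu_a, S_a))

def r_squared(e_true, e_pred):
    """R^2 for one emotion dimension (valence or arousal) over the test set."""
    ss_res = np.sum((e_pred - e_true) ** 2)
    ss_tot = np.sum((e_true - e_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```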

We perform three-fold cross-validation to evaluate the performance of general MER. Specifically, the AMG1608 dataset is randomly partitioned into three folds, and an MER model is trained on two of them and tested on the other one. Each round of validation generates the predicted result of one-third of the complete dataset. After three rounds, we will have the predicted result of each clip in the complete dataset. Then, AKL, AED, and the \(R^2\) for valence and arousal are computed over the complete dataset, instead of computing the performance over each one-third of the dataset. This strategy gives an unbiased estimate for \(R^2\).

Table 12.2 Performance evaluation on general MER (\(\downarrow \) stands for smaller-better and \(\uparrow \) larger-better)

5.3.4 Result

We compare the performance of AEG with two baseline methods. The first one, referred to as the base-rate method, uses a reference affective Gaussian whose mean and covariance are set to the global mean and covariance of the training annotations, without taking the acoustic features into account. In other words, the base-rate method makes the same prediction for every test clip, so its performance can be considered a lower bound for this task. Moreover, we compare the performance of AEG with SVR [55], a representative regression-based approach for predicting emotion values or distributions, using the same type of acoustic features. Specifically, the feature vector of a clip is formed by concatenating the mean and standard deviation of all the frame-level feature vectors within the clip, yielding a 144-dimensional vector. We use the radial basis function (RBF) kernel SVR implemented in the libSVM library [6], with parameters optimized by grid search with three-fold cross-validation on the training set. We further use a heuristic favorable to SVR to regularize every invalid predicted covariance parameter [72]. This heuristic significantly improves the AKL performance of SVR.

Our pilot study empirically shows that AEG Uniform gives better emotion prediction in AED than AEG AnnoPrior, possibly because the introduction of the annotation prior (cf. Eq. 12.7) may bias the estimation of the mean parameters in the EM learning. In contrast, AEG AnnoPrior leads to better results in AKL, indicating its capability of estimating a more proper covariance for the learned affective GMM. In light of this, we use the following hybrid method to take advantage of both AEG AnnoPrior and AEG Uniform in optimizing the affective GMM. Suppose that we have learned two affective GMMs, one for AEG AnnoPrior and the other for AEG Uniform. To generate a combined affective GMM, for its k-th component Gaussian, we take the mean from the k-th Gaussian of AEG Uniform and the covariance from the k-th Gaussian of AEG AnnoPrior. This combined affective GMM is eventually used to predict the emotion of a test clip with Eqs. 12.27 and 12.29 in this evaluation.
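A minimal sketch of this hybrid combination, assuming the two affective GMMs have already been learned (e.g., with the EM sketch in Sect. 12.3.3):

```python
# Combine two learned affective GMMs: k-th mean from AEG Uniform, k-th
# covariance from AEG AnnoPrior; mu_* is (K, 2) and Sigma_* is (K, 2, 2).
import numpy as np

def combine_affective_gmms(mu_uniform, Sigma_annoprior):
    return np.copy(mu_uniform), np.copy(Sigma_annoprior)
```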

Table 12.2 compares the performance of AEG with the two baseline methods. It can be seen that both SVR and AEG outperform the base-rate method by a great margin, and that AEG can outperform SVR. For AEG, we obtain better AKL and better \(R^2\) for valence when \(K=128\), but better AED and better \(R^2\) for arousal when \(K=256\). The best \(R^2\) values achieved for valence and arousal are 0.1601 and 0.6686, respectively. In particular, the superior performance of AEG in \(R^2\) for valence is remarkable. This observation suggests that AEG is a promising approach, as it is typically more difficult to model valence perception from audio signals [74].

Fig. 12.3

Performance evaluation on general MER, using different numbers of latent topics in AEG. a AKL, smaller-better. b AED, smaller-better. c \(R^{2}\) of valence, larger-better. d \(R^{2}\) of arousal, larger-better

Figure 12.3 presents the results of AEG when we vary the value of K (i.e., the number of latent topics). It can be seen that the performance of AEG improves as a function of K when K is smaller than 256, but starts to decrease when K is sufficiently large. The best results are obtained when K is set to 128 or 256. As the parameters of SVR-RBF have also been optimized, this result shows that, even if the optimal setting of AEG is not attained (e.g., K = 64 or 512), AEG is still on par with the state-of-the-art SVR approach to general MER.

5.4 Evaluation on Personalized MER

5.4.1 Evaluation Setup

The trade-off between the number of personal annotations (feedbacks) and the performance of personalization is important for personalized MER. On the one hand, we hope to have more personal annotations to more accurately model the emotion perception of a particular user. On the other hand, we want to restrict the number of personal annotations so as to relieve the burden on the user. To reflect this, the evaluation of personalized MER is conducted by fixing the test set for each user, but varying the number of available emotion annotations from the particular user, to test how the performance improves as personal data accumulates.

We consider 41 users who have annotated more than 150 clips in this evaluation. We use the data of six of them for parameter tuning and the data of the remaining 35 for the evaluation, and report the average result over these 35 test users. One hundred annotations of each test user are randomly selected as that user’s training set for personalization, and another 50 clips annotated by the same user are randomly selected as the test set. Specifically, for each test user, a general MER model is trained with 600 clips randomly selected from the original AMG1608, excluding the clips annotated by the test user under consideration as well as self-inconsistent annotations. Then, the general model is incrementally personalized five times using different numbers of clips selected from the personalized training set. We use 10, 20, 30, 40, and 50 clips iteratively, with the preceding clips being a subset of the current ones each time. The process is repeated 10 times for each user.

We use the following evaluation metrics here: the AED, the \(R^2\), and the average likelihood (ALH) of generating the ground-truth annotation (a single VA point) \(\mathbf {e}_\star \) of the test user using the predicted affective Gaussian, i.e., \(p(\mathbf {e}_\star \mid {\hat{\varvec{\mu }}}_\star , {\hat{\varvec{\varSigma }}}_\star )\). Larger ALH corresponds to better accuracy. We do not report KL divergence here because each clip in the dataset is annotated by a user at most once, which does not constitute a probability distribution.

5.4.2 Result

We compare the MAP-based personalization method of AEG with the two-stage personalization method of SVR proposed in [79]. In the two-stage SVR method, the first stage creates a general SVR model for general emotion prediction, whereas the second stage creates a personalized SVR that is trained solely on a user’s annotations. The final prediction is obtained by linearly combining the predictions of the general SVR and the personalized SVR with weights 0.7 and 0.3, respectively. The weights are derived empirically from our pilot study. As for AEG, we only update the mean parameters, with \(\beta ^\text {m} = 0.01\), because our pilot study shows that updating the covariance does not empirically lead to better performance. This observation is also in line with the findings in speaker adaptation [48]. We train the background model with AEG Uniform for simplicity.

Fig. 12.4

Performance evaluation on personalized MER, with varying numbers of personal data. a ALH, larger-better. b AED, smaller-better. c \(R^{2}\) of valence, larger-better. d \(R^{2}\) of arousal, larger-better

Figure 12.4 compares the results of the different personalized MER methods when we vary the number of available personal annotations. The starting point of each curve is the result given by the general MER model trained on a subset of the users of the AMG1608 dataset. We can see that the result of the general model is inferior to those reported in Fig. 12.3, showing that a general MER model is less effective when it is used to predict the emotion perception of individual users than when it predicts the average emotion perception of users. We can also see that the performance of the considered personalization methods generally improves as the number of personal annotations increases. When the value of K is sufficiently large, AEG-based personalization methods can outperform the SVR method. Moreover, while the result of SVR starts to saturate when the number of personal annotations is larger than 20, AEG has the potential to keep improving the performance by exploiting more personal annotations. We also note that there is no significant performance difference for AEG when K is large enough (e.g., \(\ge \)128).

Although our evaluation shows that personalization methods can improve the result of personalized emotion prediction, the low \(R^2\) values for valence and arousal still show that the problem is fairly challenging. Future work is still needed to improve either the quality of the emotion annotation data or the feature extraction and machine learning algorithms for modeling emotion perception.

6 Emotion-Based Music Retrieval

6.1 The VA-Oriented Query Interface

The VA space offers a ready canvas for music retrieval through the specification of a point in the emotion space [80]. Users can retrieve music pieces of certain emotions without specifying the titles. Users can also draw a trajectory to indicate the desired emotion changes across a list of songs (e.g., from angry to tender).

Fig. 12.5

The stress-sensitive user interface for emotion-based music retrieval. Users can (a) specify a point or (b) draw a trajectory, while specifying the variance with different levels of duration

In addition to the above point-based query, one can also issue a Gaussian-based query to an AEG-based retrieval system. As Fig. 12.5 shows, users can specify the desired variance (or the confidence level at the center point) of emotion by pressing a point in the VA space with different levels of duration or strength. The variance of the Gaussian gets smaller as one increases the duration or strength of the press, as Fig. 12.5a shows. Larger variances indicate a less specific emotion around the center point. After specifying the size of a circular variance shape, one can even pinch fingers to adjust the shape of the variance. For a trajectory-based query input, similarly, the corresponding variances are determined according to the speed of drawing the trajectory, as Fig. 12.5b shows. Fast speed corresponds to a less specific query, and the system will return pieces whose variances of emotion are larger. If songs with more specific emotions are desired, one can slow down while drawing the trajectory. The queries input via such a stress-sensitive interface can be handled by AEG for emotion-based music retrieval.
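As an illustration only, the sketch below turns a press on the interface into a Gaussian-based query; the inverse mapping from press duration to variance is a made-up calibration for demonstration, not the one used by the actual interface.

```python
# Toy mapping from a stress-sensitive press to a Gaussian query (illustrative
# assumption: the variance shrinks as the press duration grows).
import numpy as np

def press_to_gaussian_query(center_va, press_seconds, base_var=0.25, min_var=0.01):
    var = max(min_var, base_var / (1.0 + press_seconds))   # longer press -> tighter query
    return np.asarray(center_va, dtype=float), var * np.eye(2)

query_mu, query_cov = press_to_gaussian_query(center_va=(0.6, 0.4), press_seconds=2.0)
```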

6.2 Overview of the Emotion-Based Music Retrieval System

As Fig. 12.6 shows, the content-based retrieval system can be divided into two phases. In the feature indexing phase, we index each music clip in an unlabeled music database by one of the following two approaches: the emotion prediction approach indexes a clip with the predicted emotion distribution (an affective GMM or a single 2-D Gaussian) given by MER, whereas the folding-in approach indexes a clip with the topic posterior (a K-dimensional vector). In the subsequent music retrieval phase, given an arbitrary emotion-oriented query, the system returns a list of music clips ranked according to one of the following two approaches: likelihood/distance-based matching and pseudo song-based matching. These two ranking approaches correspond to the two indexing approaches, respectively, as summarized in Table 12.3. We present the details of the two approaches in the following subsections.

Fig. 12.6

The diagram of the content-based music retrieval system using an emotion query

6.3 The Emotion Prediction-Based Approach

This approach indexes each clip by a single, representative Gaussian distribution or an affective GMM in the offline MER procedure. The query is then compared with the predicted emotion distribution of each clip in the database. The system ranks all the clips based on their likelihoods or distances in response to the query; clips with a larger likelihood or a smaller distance are ranked higher.

Given a point query \({\tilde{\mathbf {e}}}\), the corresponding likelihood of the indexed emotion distribution of a clip \(\hat{\varvec{\theta }}_i\) is given by a single Gaussian \(p(\tilde{\mathbf {e}} \mid \hat{\varvec{\mu }}_i, {\hat{\varvec{\varSigma }}}_i)\) or an affective GMM \(p(\tilde{\mathbf {e}} \mid \hat{\varvec{\theta }}_i)\) (cf. Eq. 12.23), where \(\{\hat{\varvec{\mu }}_i, {\hat{\varvec{\varSigma }}}_i\}\) are the predicted parameters of the representative Gaussian for \(\hat{\varvec{\theta }}_i\), and \({\hat{\theta }}_{i,k}\) is the k-th component of \({\hat{\varvec{\theta }}}_i\). Note that here we use the topic posterior vector to represent a clip in the database.

When it comes to a Gaussian-based query \(\tilde{G}=G( {\tilde{\varvec{\mu }}}, {\tilde{\varvec{\varSigma }}})\), the approach generates the ranking scores based on the KL2 divergence. In the case of indexing with a single Gaussian, we use Eq. 12.30 to compute \(D_\text {KL2} \big ( \tilde{G}, G({\hat{\varvec{\mu }}}_i, {\hat{\varvec{\varSigma }}}_i) \big )\) between the query and a clip. On the other hand, in the case of indexing with an affective GMM, we compute the weighted KL2 divergence by

$$\begin{aligned} D_\text {KL2}\big ( \tilde{G}, p(\mathbf {e} \mid {\hat{\varvec{\theta }}}_i ) \big ) = \sum _{k=1}^K {\hat{\theta }}_{i,k} D_\text {KL2} \big ( \tilde{G}, G_k( \varvec{\mu }_k, \varvec{\varSigma }_k) \big )\,. \end{aligned}$$
(12.32)
Table 12.3 The two implementations of the emotion-based music retrieval system
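To make the matching concrete, the following sketch ranks GMM-indexed clips for both query types. It assumes the learned affective Gaussians are stored in arrays `mus` (K×2) and `Sigmas` (K×2×2) and each clip is indexed by its topic posterior in `clip_thetas` (N×K); these names are placeholders, and the single-Gaussian-indexed case would simply evaluate one Gaussian (or one KL2) per clip instead.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gauss(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) in closed form."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl2(m0, S0, m1, S1):
    """Symmetrized KL divergence (KL2, cf. Eq. 12.30)."""
    return kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0)

def rank_point_query(e, clip_thetas, mus, Sigmas):
    """Rank clips by the likelihood of a point query under each clip's
    affective GMM, whose weights are the clip's topic posterior."""
    comp = np.array([multivariate_normal.pdf(e, m, S)
                     for m, S in zip(mus, Sigmas)])   # G_k(e), shape (K,)
    scores = clip_thetas @ comp                       # (N,) query likelihoods
    return np.argsort(-scores)                        # higher is better

def rank_gaussian_query(q_mu, q_Sigma, clip_thetas, mus, Sigmas):
    """Rank clips by the theta-weighted KL2 divergence of Eq. 12.32."""
    div = np.array([kl2(q_mu, q_Sigma, m, S) for m, S in zip(mus, Sigmas)])
    scores = clip_thetas @ div                        # smaller is better
    return np.argsort(scores)
```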

6.4 The Folding-In-Based Approach

As Fig. 12.7 shows, this approach estimates the probability distribution \(\varvec{\lambda } = \{\lambda _k\}_{k=1}^K\), subject to \(\sum _k{\lambda _k}=1\), for an input VA-oriented query in an online manner. Each estimated \(\lambda _k\) corresponds to the relevance of the query to the k-th latent topic \(z_k\), so we can treat the distribution \(\varvec{\lambda }\) as the topic posterior of the query and call it a pseudo song. In the case of Fig. 12.7, for example, we show a query that is very likely to be represented by the second affective Gaussian component. The folding-in process is likely to assign a dominant weight \(\lambda _2 =1\) to \(z_2\), and \(\lambda _h=0\), \(\forall h\ne 2\). This implies that the query is highly related to songs whose topic posteriors are dominated by \(\theta _2\). Therefore, the pseudo song can be matched against the topic posterior vector \({\hat{\varvec{\theta }}}_i\) of each clip in the database.

Fig. 12.7

Illustration of the folding-in process of emotion-based music retrieval by AEG

Given a point query \(\tilde{{\mathbf {e}}}\), we start the folding-in process by generating the pseudo song via maximizing the query likelihood of the \(\varvec{\lambda }\)-weighted affective GMM with respect to \(\varvec{\lambda }\). By taking the logarithm of Eq. 12.23, we obtain the following objective function:

$$\begin{aligned} \underset{\varvec{\lambda }}{\max } ~~ \log \sum _{k=1}^K \lambda _k ~ G_k( \tilde{\mathbf {e}} \mid \varvec{\mu }_k, \varvec{\varSigma }_k) \,, \end{aligned}$$
(12.33)

where \(\lambda _k\) is the k-th component of the vector \(\varvec{\lambda }\). Intuitively, a good \(\varvec{\lambda }\) is one under which the \(\varvec{\lambda }\)-weighted affective GMM generates the query \({\tilde{\mathbf {e}}}\) with high likelihood. The problem in Eq. 12.33 can be solved by the EM algorithm. In the E-step, the posterior probability of \(z_k\) is computed by

$$\begin{aligned} p(z_k \mid \tilde{\mathbf {e}}) = \frac{ \lambda _k G_k( \tilde{\mathbf {e}} \mid \varvec{\mu }_k, \varvec{\varSigma }_k ) }{\sum \nolimits _{h=1}^K \lambda _h G_h( \tilde{\mathbf {e}} \mid \varvec{\mu }_h, \varvec{\varSigma }_h ) }\,. \end{aligned}$$
(12.34)

In the M-step, we then only update \(\lambda _k\) by

$$\begin{aligned} \lambda '_k \leftarrow p(z_k \mid \tilde{\mathbf {e}}) \,. \end{aligned}$$
(12.35)
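A minimal sketch of this EM procedure is given below, again assuming the learned affective Gaussians are available as `mus` and `Sigmas` (placeholder names); the small default number of iterations anticipates the early-stopping remark later in this subsection.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fold_in_point(e, mus, Sigmas, n_iter=3):
    """Pseudo song for a point query (Eqs. 12.33-12.35).

    Starts from a uniform lambda and alternates the E-step of Eq. 12.34
    with the M-step of Eq. 12.35; n_iter is kept small (early stopping).
    """
    K = len(mus)
    lam = np.full(K, 1.0 / K)                        # uniform initialization
    comp = np.array([multivariate_normal.pdf(e, m, S)
                     for m, S in zip(mus, Sigmas)])  # G_k(e), fixed across iterations
    for _ in range(n_iter):
        post = lam * comp                            # E-step (Eq. 12.34), unnormalized
        lam = post / post.sum()                      # M-step (Eq. 12.35)
    return lam
```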

As for a Gaussian-based query \(\tilde{G}\), we likewise fold the query into the learned affective GMM to estimate a pseudo song. This time, we maximize the following log-likelihood function:

$$\begin{aligned} \underset{\varvec{\lambda }}{\max } ~~ \log \sum _{k=1}^K \lambda _k ~p( \tilde{G} \mid G_k) \,, \end{aligned}$$
(12.36)

where \(p(\tilde{G} \mid G_k )\) is the likelihood function based on KL2 (cf. Eq. 12.30):

$$\begin{aligned} p(\tilde{G} \mid G_k ) = \exp \big ( -D_\text {KL2}( \tilde{G}, G_k) \big ) \,. \end{aligned}$$
(12.37)

Again, Eq. 12.36 can be solved by the EM algorithm, with the following update,

$$\begin{aligned} \lambda '_k \leftarrow p(z_k \mid \tilde{G}) = \frac{ \lambda _k p( \tilde{G} \mid G_k)}{\sum \nolimits _{h=1}^K \lambda _h p( \tilde{G} \mid G_h)} \,. \end{aligned}$$
(12.38)
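The Gaussian-query variant differs only in replacing the Gaussian density with the KL2-based likelihood of Eq. 12.37. A self-contained sketch, with the closed-form Gaussian KL divergence written out, could look as follows (variable names are again placeholders):

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) in closed form."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def fold_in_gaussian(q_mu, q_Sigma, mus, Sigmas, n_iter=3):
    """Pseudo song for a Gaussian query (Eqs. 12.36-12.38)."""
    K = len(mus)
    lam = np.full(K, 1.0 / K)                        # uniform initialization
    # KL2-based likelihoods p(G_tilde | G_k) = exp(-KL2), cf. Eq. 12.37.
    lik = np.array([np.exp(-(kl_gauss(q_mu, q_Sigma, m, S)
                             + kl_gauss(m, S, q_mu, q_Sigma)))
                    for m, S in zip(mus, Sigmas)])
    for _ in range(n_iter):
        post = lam * lik                             # Eq. 12.38, unnormalized
        lam = post / post.sum()
    return lam
```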

The EM processes for both point- and Gaussian-based queries are stopped early, after a few iterations (e.g., 3), because the pseudo song estimation is prone to over-fitting. Several initialization settings can be used, such as a random, uniform, or prior distribution. For the stability and reproducibility of the experimental results, we opt for a uniform initialization. Note that random initialization may yield inconsistent results across trials even under identical experimental settings, whereas initializing with a prior distribution may bias the results in favor of songs that predominate in the training data [67]. Finally, the retrieval system ranks all the clips in descending order of the following cosine similarity in response to the pseudo song:

$$\begin{aligned} \varPhi (\varvec{\lambda }, \varvec{\theta }_i) = \frac{\varvec{\lambda }^T \varvec{\theta }_i}{\Vert \varvec{\lambda }\Vert \Vert \varvec{\theta }_i\Vert }\,. \end{aligned}$$
(12.39)
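The final ranking step of Eq. 12.39 is then a plain cosine similarity between the pseudo song and each indexed topic posterior. A sketch, with `clip_thetas` again denoting the assumed N×K index:

```python
import numpy as np

def rank_by_pseudo_song(lam, clip_thetas):
    """Rank clips by cosine similarity to the pseudo song (Eq. 12.39)."""
    lam = lam / np.linalg.norm(lam)
    T = clip_thetas / np.linalg.norm(clip_thetas, axis=1, keepdims=True)
    scores = T @ lam                  # (N,) cosine similarities
    return np.argsort(-scores)        # descending order of similarity
```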

6.5 Discussion

The emotion prediction approach is straightforward, as the purpose of MER is precisely to automatically index unseen music pieces in the database. In contrast, the folding-in approach goes one step further and embeds a VA-based query into the space of music clips. Although the folding-in process requires the additional step of estimating the pseudo song, it is in fact more flexible. In a personalized music retrieval context, for example, a personalized affective GMM can readily produce a personalized pseudo song to be compared with the original topic posterior vectors of all the pieces in the database, without re-predicting the emotion of every clip with the personalized model.

The complexity of the emotion prediction approach mainly comes from computing the likelihood of a point query under each music clip's emotion distribution, or the KL divergence between the Gaussian query and the emotion distribution of each clip. The matching process therefore computes the likelihood or the KL divergence N times, where N is the number of clips in the database. In the folding-in approach, the complexity comes from estimating the pseudo song (with the EM algorithm) and computing the cosine similarity between the pseudo song and each clip. EM computes the likelihood of a component affective Gaussian (or the Gaussian KL divergence) \(K\times \textit{ITER}\) times, where ITER is the number of EM iterations; the matching process then computes N cosine similarities. Computing the likelihood under an emotion distribution (i.e., a single Gaussian or a GMM) is more expensive than computing a cosine similarity (as K is usually not large). Therefore, when N is large (e.g., \(N \gg K\times \textit{ITER}\)), the folding-in approach is the more practical one.
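To make the comparison concrete, here is a back-of-envelope count under illustrative values (the numbers below are assumptions, not figures from the chapter), for the case where clips are indexed with affective GMMs:

```python
# Illustrative operation counts; N, K, and ITER are assumed values.
N, K, ITER = 1_000_000, 128, 3

# Emotion prediction, GMM-indexed: one K-component GMM likelihood per clip.
gmm_component_evals = N * K            # 128,000,000 Gaussian evaluations

# Folding-in: K * ITER Gaussian evaluations for the pseudo song,
# then one K-dimensional cosine similarity per clip.
folding_gaussian_evals = K * ITER      # 384 Gaussian evaluations
cosine_similarities = N                # 1,000,000 cheap dot products

print(gmm_component_evals, folding_gaussian_evals, cosine_similarities)
```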

6.6 Evaluation for Emotion-Based Music Retrieval

6.6.1 Evaluation Setup

The AMG1608 dataset is again adopted in this music retrieval evaluation. We consider two emotion-based music retrieval scenarios: query-by-point and query-by-Gaussian. For each scenario, we create a set of synthetic queries and use the learned AEG model to respond to each test query and return a ranked list of music clips from an unlabeled music database. The generation of the test query set for query-by-point is simple. As Fig. 12.8a shows, we uniformly sample 100 2-D query points within \(\left[ [-1,-1]^T, [1,1]^T\right] \) in the VA space. The test query set for query-by-Gaussian is then based on this set of points. Specifically, we convert a point query to a Gaussian query by associating with the point a 2-by-2 covariance matrix, as Fig. 12.8b shows. Motivated by our empirical observations of the data, the covariance of a Gaussian query is set in inverse proportion to the distance between the mean of the Gaussian query (determined by the corresponding point query) and the origin of the VA space. That is, if a given point query is far from the origin (large emotion magnitude), we assume the user wants to retrieve songs with a more specific emotion (a smaller covariance ellipse).
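The sketch below reproduces this query construction under stated assumptions: points are drawn uniformly at random in [−1, 1]² (the text does not specify whether a random sample or a regular grid was used), and the proportionality constant `c` and floor `eps` in the inverse-distance covariance rule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 point queries in [-1, 1]^2 (random uniform sampling assumed here).
point_queries = rng.uniform(-1.0, 1.0, size=(100, 2))

# Gaussian queries: isotropic covariance inversely proportional to the
# distance from the origin, so far-from-origin queries are more specific.
c, eps = 0.1, 1e-2                       # illustrative constants
gaussian_queries = [(mu, (c / (np.linalg.norm(mu) + eps)) * np.eye(2))
                    for mu in point_queries]
```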

Fig. 12.8

Test queries used in evaluating emotion-based music retrieval: a 100 points generated uniformly within [−1, 1]. b 100 Gaussians generated from these 100 points

The performance is evaluated by aggregating the ground-truth relevance scores of the retrieved music clips according to the normalized discounted cumulative gain (NDCG), a widely used performance measure for ranking problems [28]. The NDCG@P, which measures the relevance of the top P retrieved clips for a query, is computed by

$$\begin{aligned} {\text {NDCG}}@P = \frac{1}{Z_P} \left\{ R(1) + \sum _{i=2}^P \frac{R(i)}{\log _2 i} \right\} \,, \end{aligned}$$
(12.40)

where R(i) is the ground-truth relevance score of the rank-i clip, \(i=1,\ldots ,Q\), in which \(Q\ge P\) is the number of clips in the music database, and \(Z_P\) is the normalization term that ensures the ideal NDCG@P equals 1. Let \(\mathcal {N}_i\) (with parameters \(\{\mathbf {a}_i, \mathbf {B}_i\}\)) denote the ground-truth annotation Gaussian of the rank-i clip. For a point query \(\tilde{\mathbf {e}}\), R(i) is obtained by \(p(\tilde{\mathbf {e}} \mid \mathbf {a}_i, \mathbf {B}_i)\), the likelihood of the query point. For a Gaussian query \(\tilde{\mathcal {N}}\), R(i) is given by \(p(\tilde{\mathcal {N}} \mid \mathcal {N}_i)\) as defined in Eq. 12.37. From Eq. 12.40, the closer the system's ranking is to the descending order of \(\{R(i)\}_{i=1}^Q\), the larger the NDCG. We report the average NDCG computed over the test query set. Note that we do not adopt evaluation metrics such as mean average precision or the area under the ROC curve, because it is not trivial to set a threshold for binarizing R(i).
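A direct implementation of Eq. 12.40 is sketched below; `relevance_ranked` holds the ground-truth scores R(i) in the order produced by the system, and the ideal DCG used for \(Z_P\) is obtained by sorting the same scores in descending order.

```python
import numpy as np

def ndcg_at_p(relevance_ranked, P):
    """NDCG@P as in Eq. 12.40.

    relevance_ranked : R(i) for i = 1..Q, ordered by the system ranking.
    """
    r = np.asarray(relevance_ranked, dtype=float)
    # Discount: 1 for rank 1, 1/log2(i) for ranks i = 2..P.
    disc = np.concatenate(([1.0], 1.0 / np.log2(np.arange(2, P + 1))))
    dcg = np.sum(r[:P] * disc)
    ideal = np.sort(r)[::-1]                 # best possible ordering of R(i)
    z_p = np.sum(ideal[:P] * disc)           # normalization term Z_P
    return dcg / z_p if z_p > 0 else 0.0
```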

We perform threefold cross-validation as in the general MER evaluation. In each round, the test fold (with 536 clips) serves as the unlabeled music database.

6.6.2 Result

We implement a random baseline that returns a random permutation of the database for each test query, reflecting the lower-bound performance of any ranking approach. We further implement an Ensemble approach that averages the rankings of a test query given by emotion prediction and folding-in. Specifically, each approach assigns an ordinal number to every clip according to its ranking; we then average the two ordinal numbers of each clip as a new score and re-rank all the clips in ascending order of this score.
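The rank-averaging ensemble can be sketched as follows; the two inputs are the clip orderings (best first) returned by the emotion prediction and folding-in rankers, and the variable names are placeholders.

```python
import numpy as np

def ensemble_ranking(ranking_a, ranking_b):
    """Average the ordinal positions from two rankings and re-rank."""
    n = len(ranking_a)
    pos_a = np.empty(n)
    pos_b = np.empty(n)
    pos_a[np.asarray(ranking_a)] = np.arange(1, n + 1)  # ordinal number per clip
    pos_b[np.asarray(ranking_b)] = np.arange(1, n + 1)
    avg = (pos_a + pos_b) / 2.0                          # new score per clip
    return np.argsort(avg, kind="stable")                # ascending average position
```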

Note that, for simplicity of presentation, we only consider AEG Uniform. Our preliminary study reveals that AEG Uniform in general performs slightly better than AEG AnnoPrior and the hybrid method mentioned in Sect. 12.5.3.4 on the retrieval task. Moreover, for the folding-in approach, early stopping is not only important for the folding-in process but also necessary for learning the affective GMM. According to our pilot study, setting ITER between 2 and 4 for learning the affective GMM and \(\textit{ITER}=3\) for learning the pseudo song leads to the best performance.

Figure 12.9 compares the NDCG@5 of the emotion prediction and folding-in approaches to emotion-based music retrieval using either point-based or Gaussian-based queries. We are interested in how the result changes as the number of latent topics varies. The two approaches perform very similarly for point-based queries when K is between 64 and 256. Moreover, emotion prediction outperforms folding-in for Gaussian-based queries when K is sufficiently large (\(K\ge 64\)). The optimal model is attained at \(K=128\) in all cases. As with the general MER results, setting K either too large or too small leads to sub-optimal results.

Fig. 12.9

Evaluation result of emotion-based music retrieval. a Point-based queries (larger is better). b Gaussian-based queries (larger is better)

Table 12.4 The query-by-point retrieval performance in terms of NDCG@5, 10, 20, and 30
Table 12.5 The query-by-Gaussian retrieval performance in terms of NDCG@5, 10, 20, and 30

Tables 12.4 and 12.5 present the NDCG@5, 10, 20, and 30 of the different retrieval methods, including the random baseline, emotion prediction, folding-in, and the ensemble approach; the latter three use AEG Uniform with \(K=128\). All three significantly outperform the random baseline, demonstrating the effectiveness of AEG in emotion-based music retrieval, and the ensemble approach leads to the best result.

A closer comparison between emotion prediction and folding-in shows that the two are neck and neck for point-based queries, whereas the former performs consistently better, regardless of the value of P, for Gaussian-based queries. Moreover, the NDCG values are higher for point-based queries than for Gaussian-based ones. Our observation is that the standard deviation of the ground-truth relevance scores (i.e., \(\{R(i)\}_{i=1}^Q\)) is much larger for Gaussian-based queries, which makes the measurement basis more challenging than that of point-based queries. Nevertheless, the relative performance difference between the two methods is similar for point-based and Gaussian-based queries.

7 Connecting Emotion Dimensions and Categories

In addition to describing emotions by dimensions, emotions can also be described in terms of discrete labels (or tags). While the dimensional approach offers a simple means for constructing a 2-D user interface, the categorical approach offers an atomic description of music that is easy to incorporate into conventional text-based retrieval systems. Although they represent two extremes (discrete versus continuous), the two approaches share the common goal of understanding the emotion semantics of music. As the two approaches are functionally complementary, it is interesting to explore the relationship between them and to combine their advantages to enhance the performance of emotion-based music retrieval systems. For example, as a novice user may be unfamiliar with the meaning of the valence and arousal dimensions, it would be helpful to display emotion tags in the emotion space to give the user some cues. This can be achieved if we have a mapping between the emotion tag space and the VA space.

In this section, we briefly introduce the Tag2VA approach, which maps a mood tag to the VA space.

Fig. 12.10

Illustration of the generation flow between tag-based and VA-based emotion semantics of music. Two component models, namely Acoustic Tag Bernoullis (ATB) and AEG, are shown in the left and right panels, respectively. The affective Gaussian of a tag can be generated by following the black dashed arrows

7.1 Algorithm Overview

Based on AEG, we can connect the two semantic modalities within a unified probabilistic framework, as illustrated in Fig. 12.10. Specifically, we establish two component models, the Acoustic Tag Bernoullis (ATB) model and the AEG model, to computationally model the generative processes from acoustic features to a mood tag and to a pair of valence–arousal values, respectively. ATB is a probabilistic classification model (a.k.a. the CBA model [20]) which can be learned from a tag-labeled music dataset. The latent topics \(\{z_k\}_{k=1}^K\) act as a bridge between the two spaces, so that the ATB and AEG models can share and transfer semantic information between each other. The latent topics are learned directly from acoustic feature vectors, and thus the training datasets for learning the ATB and AEG models can be totally separate, relieving the requirement of a dataset jointly annotated with both emotion modalities. Note that we model each tag independently as a binary classification problem, so one ATB model is learned per tag.

Once we have learned the AEG model and the ATB model for a tag, we can obtain the affective Gaussian of the tag in the VA space. As Fig. 12.10 illustrates, we first generate the topic posterior probability of the tag using a method similar to the folding-in approach (cf. Sect. 12.6.4), applied over the mixture of Bernoulli models. With the topic posterior \(\varvec{\theta }\), we can then directly predict the affective Gaussian using AEG-based MER (cf. Sect. 12.5.1). Interested readers are referred to [65] for more details.
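A rough sketch of this pipeline is given below. It assumes the ATB model reduces, for a single tag, to one Bernoulli parameter per latent topic (`beta[k]` approximating p(tag | z_k)), folds the tag in with the same early-stopped EM as in Sect. 12.6.4, and then moment-matches the θ-weighted affective GMM into a single VA Gaussian; the actual ATB formulation and the exact prediction rule of Sect. 12.5.1 are given in [65] and may differ in detail.

```python
import numpy as np

def tag_to_topic_posterior(beta, n_iter=3):
    """Fold a tag into the topic space from its per-topic Bernoulli
    parameters beta[k] ~ p(tag | z_k) (assumed ATB parameterization)."""
    beta = np.asarray(beta, dtype=float)
    lam = np.full(len(beta), 1.0 / len(beta))
    for _ in range(n_iter):                     # early-stopped EM, as in Sect. 12.6.4
        post = lam * beta
        lam = post / post.sum()
    return lam

def topic_posterior_to_gaussian(theta, mus, Sigmas):
    """Collapse the theta-weighted affective GMM into one VA Gaussian by
    moment matching (one plausible reading of the prediction in Sect. 12.5.1)."""
    mu = theta @ mus                            # mixture mean
    second = sum(t * (S + np.outer(m, m))
                 for t, m, S in zip(theta, mus, Sigmas))
    return mu, second - np.outer(mu, mu)        # mixture covariance
```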

7.2 Result

We use the AMG1608 dataset to provide a qualitative evaluation of the Tag2VA approach. AMG1608 additionally contains binary labels for 34 mood tags, which are used to train 34 ATB models (one per tag). The AEG model is trained following the setup described in the general MER evaluation (cf. Sect. 12.5.3.4). Figure 12.11 presents the tag cloud generated from the VA Gaussians of the 34 mood tags, where the font size of a tag is inversely proportional to the variance of the corresponding VA Gaussian. The automatically generated tag cloud reasonably matches the results reported by psychologists [65].

Fig. 12.11

The tag cloud generated from AMG1608

8 Conclusion

AEG is a principled probabilistic framework that unifies the computational processes of MER and emotion-based music retrieval for dimensional emotion representations such as valence and arousal. Moreover, AEG better accounts for the subjective nature of emotional responses to music through probabilistic inference and model adaptation, which in turn makes it possible to personalize an emotion-based MIR system. The source code for AEG can be retrieved from http://slam.iis.sinica.edu.tw/demo/AEG/.

Although AEG is a powerful approach, a number of challenges remain for MER, including:

  • Is it best to treat the valence–arousal space as a Cartesian coordinate space (with two orthogonal axes)?

  • How do we define the “intensity” of emotion? Does the magnitude of a point in the emotion space imply intensity? Would it be possible to train regressors that treat the emotion space as a polar coordinate system?

  • What are the features that are more important for modeling emotion?

  • Cross-genre generalizability [13].

  • Cross-cultural generalizability [23].

  • How to incorporate lyrics features for MER?

  • How to model the effect of the singing voice in emotion perception?

  • How do findings in MER help emotion-based music synthesis or manipulation?

We note that the number of topics in AEG is crucial to its performance. As with many probabilistic models in text information retrieval, choosing this number is an open problem [4, 21]. Empirically, a larger number of topics refines the model resolution and thereby yields better accuracy. Similarly, it makes sense to use more topics to model a larger music dataset (with more songs and annotations). However, more user studies are required to understand the relationship between the number of topics and performance in real-world settings.

Moreover, AEG is only suitable for an emotion-based MIR system when emotions are characterized in terms of valence and arousal. It does not apply to systems that use categorical mood tags to describe emotion. A corresponding probabilistic model for categorical MER is yet to be developed, and more research effort is also needed on the personalization and retrieval aspects of categorical MER.

The AEG model itself can also be improved in a number of directions. For example, several alternative methods can be adopted to enhance the latent acoustic descriptors (i.e., \(\{A_k\}_{k=1}^K\) in Sect. 12.3) for the clip-level topic posterior representation, such as deep learning [54] or sparse representations [61]. One can also perform discriminative training on the same corpus to reduce the prediction error, either by selecting Gaussian components or by refining the parameters of the affective GMM. For example, stacked discriminative learning on parameters initialized by an EM-trained generative model has been studied for years in speech recognition [9, 29]; following this line of research may help improve AEG as well. Finally, the AEG framework can be extended to include multi-modal content such as lyrics, review comments, album covers, and music videos. For instance, one can accompany a given silent video sequence with a piece of music based on music emotion [66]. To incorporate lyrics into AEG, on the other hand, one can learn a lyric topic model via algorithms such as pLSA [21] or LDA [4] and compute a topic distribution for each song's lyrics based on that model.