
1 Introduction

Music Emotion Recognition (MER), a subject of both Music Information Retrieval and Affective Computing, aims to identify the emotion conveyed by a musical clip [18]. Driven by strong demand in the music industry, such as content-oriented categorization schemes, automatic playlist generation, and music recommendation [6, 10, 22], MER has developed rapidly in recent years [1].

Traditional methods are feature-based. The most commonly used acoustic features (e.g., Mel-Frequency Cepstral Coefficients, spectral shape for timbre, chromagrams, and rhythm strength) are summarized in [9]. On the one hand, since hundreds or thousands of features may be considered, feature selection methods [16] for removing redundant features and principal component analysis (PCA) [14] for dimensionality reduction have been introduced. On the other hand, handcrafted features carefully designed to capture the nature of musical emotion have remained of interest in the field. Since most features are low level and related to tone color, [15] designed musical-texture-related and expressive-technique-related features in 2018.

In recent years, with the development of computer hardware and the large amount of data available online, deep learning has demonstrated its great power in many fields, including Music Emotion Recognition. Instead of relying on specifically designed features, which is a challenging task [12] and demands substantial human labor, a neural network can extract pertinent representations by itself.

Besides simply applying deep learning models, such as the CNN used in [12], different mechanisms have been used to improve performance. Following multitask learning theory, and expecting the first several layers to extract common acoustic features while the last several layers extract target-oriented features, [13] stacked one CNN layer with two RNN branches for arousal and valence regression. Others utilize auxiliary information: inspired by speech emotion recognition tasks that consider spoken content, [5, 19] proposed multimodal architectures based on audio and lyrics. In [11], additional harmonic and percussive features are fed into a bi-directional LSTM model. Because of the lack of training data, transfer learning has also been used to exploit strong features from related domains [3].

As can be seen above, despite the different mechanisms, the base architectures are usually CNNs. It is known that a convolutional neural network learns hierarchical features level by level [21] and that higher-layer feature maps depend on lower-layer ones [2]. For instance, in the early layers, low-level information such as tempo, pitch, (local) harmony, or envelope might be extracted [3], while in later layers high-level semantic patterns such as expressivity and musical texture features would be detected [15]. However, on the one hand, during convolutional operations the low-resolution features containing abundant low-frequency information are treated equally across channels [23] (i.e., tempo and pitch information may not contribute exactly the same), so the extracted features are not powerful enough. On the other hand, a human can recognize the emotion of a music clip by attending to only a few important aspects rather than processing the whole clip, whereas a CNN processes all the feature maps, which contrasts with human perception [4].

Fortunately, these problems can be addressed by the channel-wise attention mechanism, which has been successfully applied in Computer Vision [20], Natural Language Processing, and Speech Processing. The channel-wise attention mechanism can re-weight feature maps across channels. Moreover, since a feature map is computed from earlier ones, it is natural to apply the attention mechanism in multiple layers [2]; in this way, multiple semantic abstractions can be obtained [2]. The applied channel-wise attention mechanism is a carefully chosen one, detailed in Sect. 2.

Music Emotion Recognition tasks can be categorized into classification and regression tasks. The proposed method is tested on both, and the performance is improved. It should be mentioned that, since public music emotion classification datasets are small, which limits even the performance of the baseline network, a larger music emotion classification dataset is constructed.

In summary, the contributions of this paper are as follows:

  • (I). To address the problem of treating each feature map equally while recognizing musical emotion patterns, the channel-wise attention mechanism is applied in multiple layers.

  • (II). The channel-wise attention mechanism is carefully chosen for the task.

  • (III). The procedure for constructing a large musical emotion classification dataset is introduced.

  • (IV). The proposed method is shown to be effective on both classification and regression tasks.

2 Proposed Method

Fig. 1. The Channel-wise Attention Mechanism for Music Emotion Recognition

There are many sophisticated channel-wise attention mechanisms in the literature, such as those in SENet [7] and RCAB [23]. We should not simply adopt one of them; instead, we choose or design one according to our needs. Firstly, to fully consider the interrelationships among all channels, the channel-wise attention mechanism is built from a fully connected layer and an activation function, as in SENet [7], rather than from convolutional layers whose receptive field is limited to only a few channels, as in RCAB [23]. Secondly, to learn a non-mutually-exclusive relationship, some channel-wise mechanisms follow an encoder-decoder scheme. However, after an encoder operation, whether it relies on multiplication with a weight matrix or on convolutions, the rank is reduced, meaning some information in the feature map (though less important) is lost, which is undesirable. Thirdly, considering computational efficiency, the designed channel-wise attention mechanism is lightweight.
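To make the rank argument concrete, here is a minimal NumPy sketch (ours; the channel count and reduction ratio are illustrative) showing that a bottleneck of the form \( \mathbf{W}_{2}\mathbf{W}_{1} \) has rank at most C/r, whereas a single full C-by-C weight, as adopted here, can retain full rank:

```python
import numpy as np

C, r = 64, 16                               # channel count and reduction ratio (illustrative)
rng = np.random.default_rng(0)

# Bottleneck (encoder-decoder) scheme: C -> C/r -> C
W1 = rng.standard_normal((C // r, C))       # encoder
W2 = rng.standard_normal((C, C // r))       # decoder
bottleneck = W2 @ W1                        # effective C x C transform

# Full-rank scheme: a single C x C weight
W_full = rng.standard_normal((C, C))

print(np.linalg.matrix_rank(bottleneck))    # at most C // r = 4
print(np.linalg.matrix_rank(W_full))        # typically C = 64
```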

Next, the proposed channel-wise attention mechanism and the backbone architecture will be introduced.

2.1 Channel-wise Attention Mechanism

The channel-wise attention mechanism block is a transformation block. For the above reasons, it is designed as a simple gating mechanism with an activation function. Figure 1 illustrates the mechanism along with its operations and variables.

As a whole, it is a reweighting operation from \( \mathbf{U}^{l} \in \mathbb{R}^{L\times C}\) to \( \tilde{\mathbf{X}}^{l}\in \mathbb{R}^{L\times C}\), where \( \mathbf{U}^{l}\) is the output feature map, of length L with C channels, of the l-th convolutional block with input \(\mathbf{X}^{l-1}\). All superscripts in the notation refer to the layer index.

First, each convolutional kernel, with a fixed-size receptive field, serves as a local semantic information extractor. Therefore, individual values in a feature map cannot represent what the channel learns [7]. To mitigate this problem, the channel-wise statistic \( \mathbf{Z}^{l} = \left[ z_{1}^{l},\ z_{2}^{l},\ \dots,\ z_{c}^{l},\ \dots,\ z_{C}^{l} \right] \), obtained by global average pooling \( f_{gap}\left( \cdot \right) \), is used as the channel feature descriptor. Specifically, \( z_{c}^{l}\) is calculated by:

$$\begin{aligned} z_{c}^{l} = f_{gap}\left( \mathbf{x}_{c}^{l} \right) = \frac{1}{L}\sum _{i=1}^{L} \mathbf{x}_{c}^{l}\left( i \right) \end{aligned}$$
(1)

More sophisticated channel descriptors could also be considered.

Next, inter-channel dependencies are exploited by Eq. (2):

$$\begin{aligned} \mathbf{S}^{l} = \sigma \left( g\left( \mathbf{Z}^{l} \right) \right) = \sigma \left( \mathbf{W}^{l} \cdot \mathbf{Z}^{l} \right) , \end{aligned}$$
(2)

where \( \sigma \left( \cdot \right) \) denotes the sigmoid activation and \( \mathbf{W}^{l} \in \mathbb{R}^{C\times C} \). Obviously, \( g\left( \cdot \right) \) can also be interpreted as a fully connected layer with \( \mathbf{W}^{l} \) as its parameter.

Finally, the attended feature map is obtained by modulating \( \mathbf{U}^{l} \) with \( \mathbf{S}^{l} \); for each channel,

$$\begin{aligned} \tilde{x}_{c}^{l} = f_{rs}\left( u_{c}^{l},\ s_{c}^{l} \right) . \end{aligned}$$
(3)

In Eq. (3), \( f_{rs} \) denotes rescaling \( u_{c}^{l}\) by the scalar \( s_{c}^{l}\).
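The following is a minimal PyTorch sketch of the block described by Eqs. (1)-(3), assuming 1-D feature maps of shape (batch, C, L); the module name and the shapes in the usage example are ours, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Gating over channels: GAP descriptor -> full C x C linear -> sigmoid -> rescale."""
    def __init__(self, channels: int):
        super().__init__()
        # Single full-rank fully connected layer, Eq. (2); no bottleneck, so no rank loss.
        self.fc = nn.Linear(channels, channels, bias=False)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, C, L) feature map from the preceding convolutional block
        z = u.mean(dim=-1)                 # Eq. (1): global average pooling over length L
        s = torch.sigmoid(self.fc(z))      # Eq. (2): channel weights in (0, 1)
        return u * s.unsqueeze(-1)         # Eq. (3): rescale each channel by its weight

# usage sketch
x = torch.randn(8, 32, 1280)               # (batch, channels, length); shapes illustrative
attn = ChannelWiseAttention(channels=32)
print(attn(x).shape)                       # torch.Size([8, 32, 1280])
```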

2.2 Backbone Architecture

The backbone architecture we adopt follows the audio subnet of the Audio-Lyric Bimodal network in [5]. It was originally used for the emotion value regression task; in this paper, the backbone is applied to both regression and classification with different output layers.

It is composed of two convolutional blocks and two dense blocks. Each convolutional block consists of a convolutional layer, a max pooling layer, and batch normalization in sequence. The (number of kernels, kernel size, stride) settings of the two convolutional layers are (32, 8, 1) and (16, 8, 1) respectively, while the (kernel size, stride) of each max pooling layer is (4, 4). Each dense block includes a dropout and a fully connected layer; the number of hidden units between the two dense blocks is 64 [5].
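Below is a hedged PyTorch sketch of this backbone with the optional attention blocks (as in Fig. 2), reusing the ChannelWiseAttention module sketched in Sect. 2.1; the ReLU activations, dropout rate, and the lazy first dense layer are our assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, pool):
    # convolution -> max pooling -> batch normalization, as described above
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=kernel, stride=1),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=pool, stride=pool),
        nn.BatchNorm1d(out_ch),
    )

class AudioNet(nn.Module):
    def __init__(self, n_mels=40, n_outputs=6, use_attention=True):
        super().__init__()
        self.block1 = conv_block(n_mels, 32, kernel=8, pool=4)
        self.block2 = conv_block(32, 16, kernel=8, pool=4)
        self.attn1 = ChannelWiseAttention(32) if use_attention else nn.Identity()
        self.attn2 = ChannelWiseAttention(16) if use_attention else nn.Identity()
        self.head = nn.Sequential(
            nn.Dropout(0.5),                # dense block 1: dropout + FC with 64 units
            nn.LazyLinear(64),
            nn.ReLU(),
            nn.Dropout(0.5),                # dense block 2: dropout + FC to the outputs
            nn.Linear(64, n_outputs),       # 6 classes, or 1 output for regression
        )

    def forward(self, x):                   # x: (batch, n_mels, frames)
        x = self.attn1(self.block1(x))
        x = self.attn2(self.block2(x))
        return self.head(x.flatten(1))
```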

Fig. 2. Music Emotion Classification Model, with dotted lines representing the attention blocks and solid lines representing the backbone architecture blocks

3 Evaluation

Datasets, metrics, and experimental settings are described here. In addition, the method for constructing a large music emotion classification dataset is presented in the hope of helping other researchers design more powerful systems.

3.1 Dataset

Music Emotion Recognition tasks tend to use either categorical psychometrics or scalar/dimensional psychometrics, for classification or regression respectively [9]. Both music emotion representations are supported by psychological theories [9]. Under categorical approaches, emotion tags are clustered into several classes; the well-known MIREX Audio Mood Classification Competition uses this kind of psychometric [8]. Under continuous descriptors, a given emotion can be represented by a point in the Valence-Arousal (V-A) space [17]. Although there are two kinds of music emotion descriptors, under the Circumplex Model of Affect [17] they can be transformed into each other.

Fig. 3. Mapping the selected music emotion tags onto the Circumplex Model of Affect

Classification Task Dataset. Public music emotion classification datasets are small, containing fewer than 1,000 clips, so even the baseline deep neural network cannot demonstrate its power, not to mention the proposed channel-wise attention mechanism. Hence, under the guidance of Russell's Circumplex Model of Affect [17] and with the help of emotion-related playlists (those which have been created intentionally and listened to millions of times) on mainstream music platforms, a large, reliable dataset with thousands of music clips can be built with little human labor.

To begin with, a set of music emotion tags is chosen according to human experience and psychological theory. Six tags are finally determined: Stirring, Empowering, Angry, Somber, Peaceful, and Upbeat. They are constrained by the Circumplex Model of Affect and sparsely located on it, which means the gaps between the tags are large enough for the classes to be well separated. Figure 3 illustrates the relationship between the music tags and the Circumplex Model of Affect.

Next, we search for tag-related playlists on popular music platforms such as NetEase Cloud Music and QQ Music and consider the most played ones. By consulting the comments on the playlists and manually verifying each song, the final songs are determined. After that, the first 5 s of each song are discarded, since the emotion there might differ from that of the whole song, and the remainder is cut into 30 s clips.

Finally, using this method, more than 4,000 music clips with a sample rate of 44,100 Hz are obtained.
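As a rough sketch of this preprocessing (ours; librosa and soundfile are assumed available, and the file naming is hypothetical), each song is loaded at 44,100 Hz, the first 5 s are dropped, and the remainder is split into full 30 s clips:

```python
import librosa
import soundfile as sf

SR = 44100              # target sample rate
SKIP_S, CLIP_S = 5, 30  # seconds to discard, clip length in seconds

def cut_song(path, out_prefix):
    """Discard the first 5 s of a song and split the rest into 30 s clips."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    y = y[SKIP_S * SR:]                          # drop the first 5 seconds
    n_clips = len(y) // (CLIP_S * SR)            # keep only full 30 s segments
    for k in range(n_clips):
        clip = y[k * CLIP_S * SR:(k + 1) * CLIP_S * SR]
        sf.write(f"{out_prefix}_{k:02d}.wav", clip, SR)
    return n_clips
```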

Since the annotated music excerpts are collected from websites and are copyrighted, the dataset cannot be made public. However, using the method described above, researchers can easily build their own datasets.

Regression Task Dataset. The continuous descriptor used by the baseline architecture is song-level: it uses one arousal value and one valence value to describe a 30 s music excerpt. Unfortunately, that dataset is not public. Mainstream public datasets are all dynamically annotated, which means they consider the emotion variation within a piece of music [1] and are annotated at regular intervals. To mitigate this mismatch, we average all annotations within a song to obtain the song-level descriptor.

For the dynamically annotated public dataset, we use the largest one, the Database for Emotional Analysis in Music (DEAM) [1]. It contains 1,802 songs, including 1,744 clips of 45 s and 58 full-length songs. The time resolution of the annotations is 2 Hz, i.e., one annotation every 500 ms. The annotated values are scaled to \( \left[ -1, +1 \right] \).

In our experiment, since the baseline architecture is designed for 30 s clips, only the middle 30 s (from the 7th to the 36th second) of each clip is preserved. The 58 full-length songs are too long for a song-level averaged annotation to represent each segment reliably, so they are discarded. Finally, after processing, 1,744 clips remain.
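A minimal sketch of this song-level conversion is shown below; the CSV layout (one row per song, one column per 500 ms time step, plus a song_id column) is an assumption for illustration and may differ from the actual DEAM annotation files.

```python
import numpy as np
import pandas as pd

def song_level_annotation(csv_path):
    """Average 2 Hz dynamic arousal (or valence) annotations into one value per song."""
    df = pd.read_csv(csv_path)
    values = df.drop(columns=["song_id"]).to_numpy(dtype=float)   # hypothetical id column
    # nanmean tolerates songs with fewer annotated time steps than the widest row
    return pd.Series(np.nanmean(values, axis=1), index=df["song_id"])
```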

3.2 Metric

In the notation, N, \(y^{i}\), and \( \hat{y^{i}}\) denote the number of samples, the predicted label/value, and the real label/value respectively, where \( i \in \left[ 0,\, N-1 \right] \).

Metric for Music Emotion Classification Task. The accuracy score, abbreviated as acc, is the ratio of correctly classified samples to the total number of samples and can be written as:

$$\begin{aligned} acc\left( \hat{y},\, y \right) = \frac{1}{N}\sum _{i=0}^{N-1} 1\left( \hat{y^{i}} = y^{i} \right) . \end{aligned}$$
(4)

The confusion matrix shows more detailed information than acc, making it easier to see how the system confuses classes with each other. Let \( \mathbf{C}\) be the confusion matrix; \( \mathbf{C}_{i,j} \) is the proportion of samples observed in class i but classified into class j. In particular, \( \mathbf{C}_{i,i} \) corresponds to the accuracy score of the i-th class.

Metric for Music Emotion Regression Task. Root Mean Square Error (RMSE) is a typical metric for regression tasks, measuring how far the predicted values are from the real ones:

$$\begin{aligned} RMSE = \sqrt{\frac{1}{N} \sum _{i=0}^{N-1}\left( \hat{y^{i}} - y^{i} \right) ^{2}} . \end{aligned}$$
(5)
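A minimal sketch of these metrics using scikit-learn and NumPy (assuming integer class labels for classification and float arrays for regression):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def classification_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)                        # Eq. (4)
    cm = confusion_matrix(y_true, y_pred, normalize="true")     # row i: proportions of true class i
    return acc, cm

def rmse(y_true, y_pred):
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))                   # Eq. (5)
```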

3.3 Settings

As mentioned in Sect. 2, the backbone architecture adopted from [5] is used for both the regression and classification tasks.

For each audio clip, after upsampling to 44,100 Hz, a Mel-spectrogram is extracted with 40 mel bands and a 1,024-sample Hann window with no overlap as input [5]. The baseline uses pitch shifting to augment the data; however, pitch is an emotion-related feature [11, 15, 18], so the data augmentation method of the baseline is not adopted.
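A hedged librosa sketch of this feature extraction (the log/dB scaling at the end is our assumption and is not stated in the text):

```python
import librosa

def mel_input(path):
    """Mel-spectrogram input: 40 mel bands, 1,024-sample Hann window, no overlap."""
    y, sr = librosa.load(path, sr=44100)          # resample to 44,100 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=1024,  # hop == window length -> no overlap
        window="hann", n_mels=40,
    )
    return librosa.power_to_db(mel)               # log scaling (assumption)
```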

For training, we use cross-entropy loss for the classification task, mean square error loss for the regression task, and Adam as the optimizer.
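A minimal PyTorch sketch of this training configuration for the classification task (the learning rate and the train_loader data loader are assumptions; for regression, nn.MSELoss and a single-output head would be used instead):

```python
import torch
import torch.nn as nn

model = AudioNet(n_outputs=6)                     # 6-class model from the sketch above
criterion = nn.CrossEntropyLoss()                 # nn.MSELoss() for the regression task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate is an assumption

for mel, label in train_loader:                   # train_loader: hypothetical DataLoader
    optimizer.zero_grad()
    loss = criterion(model(mel), label)
    loss.backward()
    optimizer.step()
```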

4 Experiments and Results

In this section, the experiments conducted to validate the effectiveness of the proposed method are presented. In the notation, AudioNet denotes the baseline, and Layer1, Layer2, and LayerALL indicate the location of the added channel-wise attention mechanism (e.g., Layer1 means adding the attention mechanism after the first layer).

4.1 Validating the Proposed Method

In the first set of experiments, we validate the power of the proposed attention mechanism and of the multi-layer attention scheme on the classification dataset. Since the baseline has two convolutional blocks, four architectures are evaluated: AudioNet, AudioNet_Layer1, AudioNet_Layer2, and AudioNet_LayerALL.

Table 1. Overall and class level accuracy score for the baseline and attention added ones in different locations.

Experimental results are shown in Table 1. As can be seen, adding channel-wise attention after either layer 1 or layer 2 improves the performance distinctly, which verifies the ability of channel-wise attention. When the attention mechanism is added after all layers, the performance improves further, which demonstrates the rationale of adding attention to multiple layers.

It is interesting that the architecture adding attention to the later layer shows more improvement than the one adding it to the earlier layer. The reason might be that earlier layers extract low-level representations while later ones extract class-specific features [7]. Emphasizing class-specific features helps more than extracting better common music characteristics; for example, better musical texture features help more than pitch features in music emotion recognition [15].

4.2 Performance on Classification Task

The overall performance of the baseline and the proposed method has been compared above; Table 2 gives more detail using confusion matrices.

Table 2. Confusion matrix for baseline and the proposed

Besides the overall accuracy, the accuracy scores of five of the six classes have been improved.

As seen in the baseline's results, \(\mathbf {C}_{Somber, Peaceful}\) and \(\mathbf {C}_{Peaceful, Somber} \) are both non-negligible. Since Somber and Peaceful both have low arousal, the network tends to confuse them with each other. If more attention is paid to valence-related features, this phenomenon can be eased to some extent. After adding the channel-wise attention mechanism, although the accuracy for Somber is reduced by 0.026, that for Peaceful is improved by 0.192. This illustrates the channel-wise attention mechanism's ability to re-weight and concentrate more on target-related feature maps.

As for Stirring, the baseline's accuracy score is the lowest of all classes. It is easily misclassified as Angry or Empowering because all three have high arousal. After adding the attention mechanism, its accuracy score is improved.

4.3 Performance on Regression Task

Since the baseline was proposed for song-level emotion detection rather than dynamic detection, and there is no public dataset of this type, we processed the dynamic annotations of a public dataset to generate the corresponding song-level one, as detailed in Sect. 3.1. We conduct two sets of experiments, for arousal regression and valence regression.

Table 3. RMSE for the baseline and the proposed

As seen in Table 3, for arousal regression the proposed method performs better with an obviously smaller RMSE, while for valence regression the baseline performs slightly better.

5 Conclusion

In this paper, a channel-wise attention mechanism is introduced and designed to make the network focus more on emotion-related feature maps. Experimental results verify the utility of the proposed method on both classification and regression tasks. In future work, we will concentrate more on the attention scheme, such as designing more sophisticated and accurate channel descriptors or introducing a spatial attention mechanism.