1 Introduction

Recent years have seen a rapid increase in the size of digital multimedia collections, such as music, videos and images. Because emotion is an important component in the human classification and retrieval of digital media, assigning emotional tags to multimedia data has been an active research area in recent decades [17, 20, 50, 51, 62].

Previous research on multimedia emotional tagging mainly recognizes the emotional tags from the multimedia content. Existing approaches can be grouped according to the adopted emotional tags: discrete categories and continuous dimensions. The former annotates multimedia using discrete emotional categories, such as calmness, happiness and fear [18, 19, 23, 28, 30, 43, 48, 53, 54, 55, 56, 57, 58, 63]. The latter maps multimedia to continuous emotional dimensions, such as valence and arousal [1, 7, 13, 27, 37, 49, 66]. The typical framework is as follows: first, several audio or visual features are extracted from the multimedia data; then, classification methods (e.g. Support Vector Machines (SVM)) or regression methods (e.g. Support Vector Regression (SVR)) are used to infer the data's emotional tag, either as an emotional category or as a point in the valence-arousal space.

The assumption of most present research is that one medium carries only one emotional tag or a single point in the emotional dimension space. However, multimedia data often induce a mixture of emotions in users [12]. For example, a shocking video may elicit both anger and sadness, and a piece of music may be characterized as both dreamy and cheerful. Some emotions may appear together frequently, while others may not. For example, the beautiful scene in Fig. 1 may induce a mixture of relaxing, comfortable and happy feelings, but it rarely induces disgust. Such co-existence and mutual exclusion relationships among emotions should be considered in emotional tagging, and one medium should be allowed to carry several emotional tags simultaneously. Thus, we formulate emotional tagging as a multi-label classification problem.

Fig. 1 A beautiful scene image. The image may induce a mixture of emotions including relaxing, comfortable and happy

Presently, few researchers regard emotional tagging of multimedia as a multi-label classification problem, except for a small number of studies on emotion recognition from music data. Furthermore, present multi-label classification methods either ignore the label correlations or fix the relations as pairwise relations or as a subset of the label combinations existing in the training data. They cannot effectively explore the co-existence and mutual exclusion relationships among emotional labels. In this paper, the dependencies among emotional tags are modeled directly by a Bayesian Network (BN).

In this paper, a novel approach named MET (Multiple Emotional Tagging of multimedia data by exploiting emotion dependencies) is proposed. First, several commonly used multi-label classifiers are adopted to obtain the measurements of the emotional tags from the audio-visual content. Then a BN is automatically constructed to model the dependencies among emotional tags. After that, the constructed BN is employed to infer the true tags for a medium based on the measurements. We conduct experiments on a multiple emotion music data set and a multiple emotion video data set. Experimental results show that MET exploits the co-existence and mutual exclusion relationships among emotions successfully. Thus, our method can improve the performance of traditional multi-label classifiers.

2 Related work

2.1 Emotional tagging of multimedia

Emotional tagging of videos, images and music pieces has attracted increasing attention in recent years [17, 20, 50, 51, 62]. There are two kinds of emotional tags, the expected emotion and the actual emotion [13]. The expected emotion is contained in the multimedia data itself and is intended by the program director to be communicated to users. It is likely to be elicited in the majority of users while they consume that multimedia, and can therefore be considered a common emotion. In contrast, the actual emotion is the affective response of a particular user to the multimedia data. It is context-dependent and subjective, and it may vary from one individual to another; it can be considered an individualized emotion. Most current research focuses on the expected emotion, which is also the focus of this paper. Among the three kinds of media, emotion recognition from music has been studied most extensively, since almost every music piece is created to convey emotion. Emotional tagging of images was first studied in Japan in the 1990s [38]; at that time, the Japanese word Kansei was used instead of emotional or affective. Emotional tagging of videos originated at the beginning of this century with Chitra Dorai, who proposed Computational Media Aesthetics (CMA) [6].

Although music pieces, images and videos are different modalities, research on emotional tagging of these three media follows a similar framework. First, several discrete emotional categories or continuous emotional dimensions are adopted to express emotions. Second, audio or visual features are extracted from the multimedia. After that, classification or regression methods are used to assign each medium an emotional tag or a point in the emotional dimension space.

To express emotional categories, besides the six basic emotions (i.e. happiness, sadness, surprise, fear, disgust and anger), adjectives and adjective pairs, such as pleasing, boring, and irritating, are often used. A well-known categorical approach, Hevner's adjective checklist [14], is also adopted, especially for music pieces. Some research groups media into several discrete clusters by applying clustering methods in the arousal-valence space [24, 65]. To express continuous emotional dimensions, valence and arousal are often used for video and music tagging [1, 7, 13, 27, 37, 49, 66], while aesthetics [4] or attractiveness [2] is used for images.

Empirical research shows that the commonly used music features are timbre, rhythm, and harmony, which are associated with emotion perception of music [62]. For images, color, shape, and texture are extracted [17]. Video features contain both visual and audio features. The commonly used audio features include Mel-frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and spectral flux [21]. Shot duration, visual excitement, lighting key and color energy are widely used visual features [48].

Several machine learning algorithms have been applied to learn the relationships between features and discrete emotional labels, such as Gaussian mixture models [23], DBN [1], SVM [3], neural networks [8] and conditional random fields [59]. After training, the model can be applied to recognize the emotion of unseen media. For continuous emotional dimension modeling, support vector regression [25], multiple linear regression [60], or AdaBoost.RT [60] are used to learn regression models that predict the valence and arousal values of the media. Most existing works train two regressors for valence and arousal independently [26, 61]. A comprehensive overview of emotional tagging of music pieces, images and videos can be found in [17, 20, 50, 51, 62].

To the best of our knowledge, present research on emotional tagging of images and videos assumes that there is only one emotional tag or one point in the emotional dimension space for an image or a video. However, it is very hard to find a video or image that induces a high level of a single emotional category without the presence of other emotions, either in day-to-day living or inside the laboratory. Take videos as an example: Gross et al. [12] developed a set of films to elicit eight emotion states. Based on their study, when users watch amusement videos, they always feel amused, happy and surprised simultaneously. Videos that induce anger may also induce some degree of disgust, sadness, fear and surprise. Videos that induce disgust may also induce fear and surprise to some extent. However, videos that induce anger and disgust tend not to induce a high level of happiness. These phenomena of co-existence and mutual exclusion among emotional categories are also reported in [32]. Until now, there has been little research considering multi-emotion tagging of images or videos [52]. Thus, in this paper, we treat emotional image and video tagging as a multi-label classification problem.

For emotion recognition from music, a small number of studies consider assigning multiple emotion labels to a music piece [29, 34, 35, 45]. However, the methods used in these studies, such as multi-label SVM [22] and Multi-Label k Nearest Neighbours (MLkNN) [45], either ignore the label correlations or fix the relations as pairwise relations or as a subset of the label combinations in the training data. They cannot effectively exploit the coexistent and mutually exclusive relations among emotions. Thus, we propose a BN to systematically capture the dependencies among emotional tags, which extends the relations modeled by current multi-label classifiers.

Besides viewing emotions in terms of categories, much research assumes emotions have a systematic, coherent, and meaningful structure that can be mapped to affective dimensions [33, 39, 40]. Among those dimensions, valence (pleasure) and arousal (activation) are most commonly used. There are certain relationships between the emotional categories and dimensions. For example, in Russell's affective model, happiness always lies in the first quadrant, while anger and fear clearly lie in the fourth quadrant. By considering the relationships between arousal-valence and emotional categories, a certain emotional category may be assigned to one of the four quadrants; within each quadrant, the emotional categories may be distinguished further using the relationships among emotional categories and dimensions. Although continuous emotional dimensions provide more information about the media, they are difficult to label and evaluate. Thus, we discretize the emotional dimensions into two categories each: positive and negative valence, and high and low arousal. The relationships among emotional categories, as well as the relationships between emotional categories and emotional dimensions, are then both taken into consideration in this paper.

2.2 Multi-label classifications

Multi-label classification is the classification problem where one sample can be assigned more than one target label simultaneously. Multi-label classification methods can be categorized into two groups: problem transformation methods and algorithm adaptation methods. The former includes Binary Relevance (BR) [47], Label Powerset (LP) [47], Random k-labelsets (RAkEL) [46], etc. They transform the multi-label classification task into one or more single-label classification tasks, after which any traditional classification algorithm can be used. The latter consists of Binary Relevance k Nearest Neighbours (BRkNN) [42], Multi-Label k Nearest Neighbours (MLkNN) [64], AdaBoost.MH [36], etc. They extend specific learning algorithms to handle multi-label data directly. A comprehensive overview of current research in multi-label classification can be found in [41].

Due to the large number of possible label sets, multi-label classification is rather challenging. Successfully exploring the coexistent and mutually exclusive relations inherent in multiple labels is key to facilitating the learning process. With respect to label dependencies, most present multi-label learning strategies can be categorized into three groups: methods ignoring label correlations, methods considering label correlations directly, and methods considering label correlations indirectly. The first group (e.g. BR) decomposes the multi-label problem into multiple independent binary classification problems (one per category). Without considering the correlations among labels, the generalization ability of such methods may be weak. The second group addresses pairwise relations between labels (such as Calibrated Label Ranking (CLR) [9]), the fixed label combinations present in the training data (such as LP), or a random subset of those combinations (such as RAkEL). However, the relations among labels may go beyond pairwise relations and cannot always be expressed by a fixed subset of label combinations existing in the training data; thus, the second group may not capture the label relations effectively. The third group considers label dependencies with the help of features or hypotheses. Godbole and Sarawagi [11] stacked the outputs of BR along with the full original feature space into a separate meta classifier, creating a two-stage classification process. Read et al. [15] proposed the classifier chain model to link n classifiers into a chain; the feature space of each classifier in the chain is extended with the label associations of all previous classifiers. Ghamrawi and McCallum [10] adopted conditional random fields to capture the impact of an individual feature on the co-occurrence probability of a pair of labels. Sun et al. [44] proposed to construct a hyperedge for each label and include all instances annotated with a common label in one hyperedge, thus capturing their joint similarity. Zhang et al. [67] proposed a Bayesian Network to model the dependencies among label errors, and then constructed a binary classifier for each label by combining the features with the parental labels, which were regarded as additional features. Huang et al. [16] modeled the label relations by a hypothesis reuse process: when the classifier of a certain label is learned, all trained hypotheses generated for other labels are taken into account via weighted combinations. With the help of features and hypotheses, these methods can model flexible dependencies among labels to some extent, but their computational costs are usually much higher than those of the second group, which models the dependencies among labels directly.

Among the above, Zhang et al.'s work [67] is the most similar to ours. Zhang et al. proposed to use a BN structure to encode the conditional dependencies of the labels given the feature set, \(P(\lambda_1, \lambda_2, \ldots, \lambda_n \mid x)\), where \(x\) denotes the features, \((\lambda_1, \lambda_2, \ldots, \lambda_n)\) are the multiple target labels, and \(n\) is the number of labels. Since Zhang et al. considered directly modeling \(P(\lambda_1, \lambda_2, \ldots, \lambda_n \mid x)\) in a Bayesian manner to be intractable, they adopted an approximate method that models the dependencies among label errors, which is independent of the features \(x\). Based on the learned BN structure of the errors, a binary classifier was constructed for each label \(\lambda_i\) by combining the features \(x\) with the parental labels \(pa(\lambda_i)\), which were regarded as additional features.

Unlike Zhang et al.'s method, we propose a BN to systematically capture the dependencies among the labels, \(P(\lambda_1, \ldots, \lambda_n)\), directly, without the help of features or hypotheses. The nodes of the BN represent the labels, and the links and their parameters capture the probabilistic relations among the labels. The label relationships encoded in a BN are more flexible than the pairwise or fixed-subset relationships used by the existing direct methods, and the computational cost of our method is lower than that of the indirect methods. In this paper, probabilistic inference is used to infer the set of labels with the largest probability.

Compared to related works, our contributions are as follows:

  1.

    We propose a framework for multi-label multimedia emotional tagging, applicable not only to emotional tagging of music pieces but also to emotional tagging of images and videos. We are the first to formulate emotional tagging of images and videos as a multi-label problem.

  2.

    We propose a novel method to automatically capture the dependencies among emotions directly with a BN and combine the captured emotion dependencies with their measurements to achieve accurate multi-emotion tagging of multimedia data.

3 Multiple emotional tagging methods

The framework of our approach is shown in Fig. 2 and consists of three modules: feature extraction, measurement acquisition, and multi-emotion relationship modeling by a BN. The training phase of our approach includes training the traditional multi-label classifiers (based on SVM and kNN) for measurement acquisition, and training the BN to capture the semantic relationships among emotional tags. For measurement acquisition, we employ audio and visual features to represent the media and then classify them using traditional multi-label algorithms. Given the measurements, we infer the emotional tags of the media through probabilistic inference with the BN model. The details are provided as follows.

Fig. 2 The framework of our proposed emotional tagging approach

3.1 Feature extraction

Here, we focus only on the music and video features. Due to copyright, the music data set [45] does not provide the original music clips but instead 8 rhythmic features and 64 timbre features. The rhythmic features are derived by extracting periodic changes from a beat histogram. The timbre features consist of the first 13 MFCCs, spectral centroid, spectral rolloff and spectral flux, together with their means, standard deviations, mean standard deviations and standard deviations of standard deviations over all frames. We adopt these features in the following sections.

For our collected video data set, visual and audio features are extracted. For visual features, three features, namely lighting key, color energy and visual excitement [48], are extracted from the video clips. According to cinematography and psychology, these features are powerful tools for establishing the mood of a scene and can affect the emotions of the viewer. For audio features, we do not use the same features as for music, since the audio track of a video includes not only background music but also speech and other sounds. Thirty-one features that are widely used in the video tagging field [21] are extracted, including average energy, average energy intensity, spectral flux, Zero Crossing Rate (ZCR), standard deviation of ZCR, 12 Mel-frequency Cepstral Coefficients (MFCCs), the log energy of the MFCC, and the standard deviations of the above 13 MFCC-related coefficients. The features are averaged over the whole clip. Therefore, a total of 34 features are acquired to represent each video. These visual and audio features are complementary for emotional tagging of videos. The details of the features can be found in [21, 48].
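As an illustration of the audio part of this pipeline, the following sketch computes clip-level statistics of energy, spectral flux, ZCR and MFCCs with the librosa library; it is an approximation under our own assumptions, not the exact extraction code of [21].

```python
# Hedged audio-feature sketch (librosa assumed): clip-level means/stds of
# energy, spectral flux, ZCR and 13 MFCCs, roughly mirroring the 31 features.
import numpy as np
import librosa

def audio_features(path):
    y, sr = librosa.load(path, sr=44100)
    rms  = librosa.feature.rms(y=y)                              # stand-in for energy
    zcr  = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    stft = np.abs(librosa.stft(y))
    flux = np.sqrt(np.sum(np.diff(stft, axis=1) ** 2, axis=0))   # simple spectral flux
    return np.hstack([
        rms.mean(), rms.std(),            # 2 energy statistics
        flux.mean(),                      # 1
        zcr.mean(), zcr.std(),            # 2
        mfcc.mean(axis=1),                # 13
        mfcc.std(axis=1),                 # 13 -> 31 values in total
    ])
```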

3.2 Measurement acquisition

Four commonly used multi-label classification methods are adopted to obtain the measurements of the emotional tags: BR, RAkEL, BRkNN and MLkNN. The first two are problem transformation methods, and the last two are algorithm adaptation methods. Below, we briefly introduce the four methods.

Let \(D= \{\left( x_i,y_i\right)\}^m_{i=1}\) represent the training data, where \(x_i\in R^d\) is the feature vector, \(y_i \subseteq \{ \lambda _j\}^n_{j=1}\) is the set of target labels, n is the number of labels, and m is the number of training samples.

BR is the most widely used problem transformation method. It considers each label independently. First, it transforms the original data set into n data sets, one data set D i for each label λ i . Then, any traditional classification algorithm can be used to obtain a classifier h i from D i . For a new instance, each classifier h i outputs a binary label \(Z_i = h_i(X_{fea})\). The combination of the labels predicted by the n classifiers, \(\bigcup_{i=1}^{n}Z_i\), is adopted as the final output. BR assumes the labels are independent, ignoring the correlations among them.
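As a minimal illustration of BR (not our released code), the following Python sketch trains one linear SVM per emotional tag and concatenates the independent binary predictions; the names X_train, Y_train and X_test are illustrative placeholders.

```python
# Minimal Binary Relevance sketch: one linear-kernel SVM per label.
import numpy as np
from sklearn.svm import LinearSVC

def br_train(X_train, Y_train):
    """Train one binary classifier per column of the m x n label matrix Y_train."""
    classifiers = []
    for j in range(Y_train.shape[1]):
        clf = LinearSVC()                      # linear SVM base classifier
        clf.fit(X_train, Y_train[:, j])        # label j is treated independently
        classifiers.append(clf)
    return classifiers

def br_predict(classifiers, X_test):
    """Stack the n independent binary predictions into one label vector per instance."""
    return np.column_stack([clf.predict(X_test) for clf in classifiers])
```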

Another commonly used problem transformation method is LP, which considers each distinct label combination existing in the training set as a different class of a single-label classification task. Any traditional classification algorithm can then be used to obtain the single-label classifier. A possible drawback of LP is that some classes are associated with very few training samples, which makes learning difficult. To deal with this problem, RAkEL breaks the initial set of labels into l random subsets of k labels each and employs LP to train l corresponding classifiers. For a new instance, the outputs of all LP classifiers are combined: the mean of their predictions is computed for each label, and a positive decision is output when the mean exceeds a threshold. RAkEL considers randomly selected combinations of labels, but it does not capture the probabilistic relations among labels and cannot represent their coexistent and mutually exclusive relationships.
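The sketch below illustrates the RAkEL idea under simplifying assumptions (l random k-labelsets, one LP classifier per subset, mean-vote aggregation); it is not the exact implementation used in our experiments, and the names X and Y are placeholders.

```python
# Simplified RAkEL sketch: random k-labelsets + Label Powerset + vote averaging.
import numpy as np
from sklearn.svm import LinearSVC

def rakel_train(X, Y, k=3, l=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(l):
        subset = rng.choice(Y.shape[1], size=k, replace=False)
        combos = [tuple(row) for row in Y[:, subset]]           # LP: each combo -> one class
        classes = {c: i for i, c in enumerate(sorted(set(combos)))}
        y_lp = np.array([classes[c] for c in combos])
        clf = LinearSVC().fit(X, y_lp)
        models.append((subset, clf, {i: np.array(c) for c, i in classes.items()}))
    return models

def rakel_predict(models, X, n_labels, threshold=0.5):
    votes = np.zeros((X.shape[0], n_labels))
    counts = np.zeros(n_labels)
    for subset, clf, inverse in models:
        votes[:, subset] += np.vstack([inverse[p] for p in clf.predict(X)])
        counts[subset] += 1
    return (votes / np.maximum(counts, 1) >= threshold).astype(int)
```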

BRkNN is an algorithm adaptation method. It is conceptually equivalent to using BR with kNN as the base classifier. BRkNN extends the kNN algorithm so that independent predictions are made for each label following a single search for the k nearest neighbors [42]. In this case, the complexity of BRkNN is 1/n of that of applying BR with kNN directly. BRkNN does not consider the dependencies among labels.

MLkNN is another algorithm adaptation method based on BR. It uses the maximum a posteriori principle to determine the final label set, based on the prior and posterior probabilities derived from the labels of the k nearest neighbors. MLkNN does not consider the dependencies among labels.
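For completeness, the following hedged usage sketch shows how all four kinds of measurements could be obtained with the scikit-multilearn package; the module paths and parameter names are our assumption and may differ across library versions.

```python
# Hedged sketch assuming scikit-multilearn; returns one binary measurement
# matrix per multi-label method, to be fed to the BN in Section 3.3.
from sklearn.svm import SVC
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.ensemble import RakelD
from skmultilearn.adapt import BRkNNaClassifier, MLkNN

def acquire_measurements(X_train, Y_train, X_test):
    models = {
        "BR":    BinaryRelevance(classifier=SVC(kernel="linear")),
        "RAkEL": RakelD(base_classifier=SVC(kernel="linear"), labelset_size=3),
        "BRkNN": BRkNNaClassifier(k=10),
        "MLkNN": MLkNN(k=10),
    }
    measurements = {}
    for name, model in models.items():
        model.fit(X_train, Y_train)
        measurements[name] = model.predict(X_test).toarray()  # binary tag vectors
    return measurements
```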

The outputs of the above four methods are binary vectors indicating whether a music piece or a video has a certain emotional tag or not. These binary vectors are used as the measurements for the BN model in the following step.

3.3 Emotional relationship modeling by Bayesian network

Traditional tagging methods treat each emotional category individually and do not consider their dependencies in the training set, so some valuable information may be lost. In order to model the semantic relationships among emotional categories, we utilize a BN model for the inference of emotional tags. As a probabilistic graphical model, a BN can effectively capture the dependencies among variables in data. In our work, each node of the BN is an emotional label, and the links and their conditional probabilities capture the probabilistic dependencies among emotions.

3.3.1 BN structure and parameter learning

BN learning consists of structure learning and parameter learning. The structure consists of the directed links among the nodes, while the parameters are the conditional probabilities of each node given its parents.

Given the data set of multiple target labels \(DL=\{\left(y_i\right)\}^m_{i=1}\), where \(y_i \subseteq \{ \lambda _j\}^n_{j=1}\), structure learning aims to find a structure G that maximizes a score function. In this work, we employ the Bayesian Information Criterion (BIC) score function, which is defined as follows:

$$ Score(G) = \mathop{\max}\limits_{\theta}\log\left(p(DL\mid G,\theta)\right) -\frac{Dim_G}{2} \log m $$
(1)

where the first term is the log-likelihood function of parameters θ with respect to data DL and structure G, representing the fitness of the network to the data; the second term is a penalty relating to the complexity of the network, and Dim G is the number of independent parameters.

To learn the structure, we employ our BN structure learning algorithm [5]. By exploiting the decomposition property of the BIC score function, this method learns an optimal BN structure efficiently and is guaranteed to find the globally optimal structure, independent of the initial structure. Furthermore, the algorithm provides an anytime valid solution, i.e., it can be stopped at any time and returns the best solution found so far together with an upper bound on the global optimum. Representing the state of the art in BN structure learning, this method automatically captures the relationships among emotions. Details of the algorithm can be found in [5].
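For readers without access to the algorithm of [5], a generic score-based alternative gives the flavor of this step. The hedged sketch below uses hill climbing with the BIC score of Eq. 1 via the pgmpy library (module names may vary across versions); unlike [5], hill climbing does not guarantee the global optimum.

```python
# Hedged stand-in for the structure-learning step: hill climbing with BIC
# over the binary emotion labels, assuming the pgmpy library.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

def learn_emotion_structure(label_matrix, label_names):
    """label_matrix: m x n binary array of emotional tags (the data set DL)."""
    data = pd.DataFrame(label_matrix, columns=label_names)
    dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
    return list(dag.edges())   # directed links among emotional tags
```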

After the BN structure is constructed, the parameters can be learned from the training data. Learning the parameters of a BN means finding the most probable values \(\hat{\theta}\) for θ that best explain the training data. Here, let Y i denote a variable of the BN and y i represent a generic state of Y i . Each variable has a state space \(\Omega_{Y_i}\), where \(y_i \in \Omega_{Y_i}\). Let θ ijk denote a probability parameter of the BN; then,

$$ \theta _{ijk}=P\left (y_{i}^{k} \mid {pa^{j}\left ( Y_i \right )}\right ) $$
(2)

where i ∈ {1,...,n}, j ∈ {1,...,r i } and k ∈ {1,...,s i }. Here n denotes the number of variables (nodes in the BN), r i represents the number of possible parent instantiations for variable Y i , and s i indicates the number of state instantiations for Y i . Hence, \(y_{i}^{k}\) denotes the \(k\)-th state of variable Y i .

Based on the Markov condition, any node in a Bayesian network is conditionally independent of its non-descendants given its parents. The joint probability distribution represented by the BN can therefore be written as \(P(y) = P(y_1,\ldots,y_n) = \prod_i P(y_i \mid pa(Y_i))\). In this work, the fitness between the parameters θ and the training data D is quantified by the log-likelihood function log(P(D|θ)), denoted as L D (θ). Assuming the training samples are independent and using the conditional independence assumptions of the BN, the log-likelihood function is shown in Eq. 3, where n ijk indicates the number of samples in D containing both \(y_{i}^k\) and \(pa^j(Y_i)\).

Because there are no missing labels in the training data in this work, the Maximum Likelihood Estimation (MLE) of the parameters can be formulated as the constrained optimization problem shown in Eq. 3.

$$ \begin{array}{rll}\label{eq3} MAX ~~~~L_D\left ( \theta \right ) &=& log\left (\prod_{i=1}^{n} \prod_{j=1}^{r_i} \prod_{k=1}^{s_i}\theta _{ijk}^{n_{ijk}} \right)\\ S.T ~~~~g_{ij}\left ( \theta \right ) &=& \sum_{k=1}^{s_i}\theta_{ijk}-1=0 \end{array} $$
(3)

where g ij imposes the constraint that the parameters of each node sum to 1 over all the states of that node. Solving the above equations, we obtain \(\theta_{ijk}=\frac{n_{ijk}}{\sum_k n_{ijk} }\).
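The closed-form solution above amounts to counting: for each node and each parent instantiation, \(\theta_{ijk}\) is the relative frequency of state k. A minimal sketch with illustrative names follows.

```python
# Minimal MLE sketch for Eq. 3: theta_ijk = n_ijk / sum_k n_ijk.
from collections import defaultdict

def learn_cpt(data, node, parents):
    """data: dict mapping node name -> list of observed states (one per sample)."""
    counts = defaultdict(lambda: defaultdict(float))
    for idx in range(len(data[node])):
        pa_state = tuple(data[p][idx] for p in parents)    # parent instantiation j
        counts[pa_state][data[node][idx]] += 1.0           # accumulate n_ijk
    return {
        pa_state: {s: c / sum(state_counts.values())       # normalize over states k
                   for s, c in state_counts.items()}
        for pa_state, state_counts in counts.items()
    }
```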

3.3.2 BN inference

During BN inference, the posterior probability of the categories is estimated by combining the likelihood of the measurements with the prior model. Let E i and M i , i ∈ {1,...,n}, denote the true label variables and the corresponding measurements obtained by the machine learning methods, respectively. Then,

$$ \begin{array}{rll}\label{eq4} P(E_1,\ldots,E_n \mid M_1,\ldots,M_n) &\propto& \prod_{i=1}^{n}P(M_i \mid E_i)\prod_{i=1}^{n}P(E_i\mid pa(E_i)) \end{array} $$
(4)

The conditional probabilities in the equation are learned from the training set. In this work, the inferred tags are the emotion tag configuration (E 1,...,E n ) with the highest probability given M 1,...,M n . In practice, the belief propagation algorithm [31] is used to estimate the posterior probability of each category node efficiently.
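For the small number of emotion nodes used here, Eq. 4 can also be evaluated by brute force, which makes the inference explicit. The sketch below (with illustrative data structures) enumerates all binary label configurations and returns the one with the highest unnormalized posterior; belief propagation achieves the same result more efficiently.

```python
# Exhaustive-enumeration sketch of Eq. 4: score each binary label configuration
# by prod_i P(M_i|E_i) * prod_i P(E_i|pa(E_i)) and return the argmax.
from itertools import product

def infer_tags(measurements, parents, p_meas, p_label):
    """
    measurements: dict node -> observed measurement M_i (0/1)
    parents:      dict node -> list of parent nodes in the learned BN
    p_meas:       dict node -> {(E_i, M_i): P(M_i | E_i)}
    p_label:      dict node -> {(E_i, parent_states): P(E_i | pa(E_i))};
                  for root nodes, parent_states is the empty tuple ().
    """
    nodes = list(parents.keys())
    best, best_score = None, -1.0
    for assignment in product([0, 1], repeat=len(nodes)):
        e = dict(zip(nodes, assignment))
        score = 1.0
        for v in nodes:
            pa_state = tuple(e[p] for p in parents[v])
            score *= p_meas[v][(e[v], measurements[v])]     # P(M_i | E_i)
            score *= p_label[v][(e[v], pa_state)]           # P(E_i | pa(E_i))
        if score > best_score:
            best, best_score = e, score
    return best
```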

4 Experiments and results

4.1 Experimental conditions

4.1.1 Data sets

Presently, there is only one available music data set with multiple emotional labels, and no multiple emotion image or video data set. Thus, in this work, we use two data sets: the multiple emotion music data set [45] and a multiple emotion video data set collected by us.

The music data set contains 593 songs, each categorized into one or more of 6 classes of emotions: amazed-surprised (amazed), happy-pleased (happy), relaxing-calm (relaxing), quiet-still (quiet), sad-lonely (sad), and angry-fearful (angry). The duration of each music clip is 30 s and the audio sampling rate is 22.05 kHz. The distribution of samples is presented in Table 1. Detailed information about the data set can be found in [45]. Since the music data set does not provide valence and arousal labels, we only model the category relations with the BN.

Table 1 Sample distribution in music data set

For our multi-label emotion video data set, we first obtained 72 videos, lasting 8166 s in total, from the Internet as stimuli. The lengths of the videos vary from half a minute to five minutes. The audio sampling rate is 44 kHz and the video frame rate is 30 fps. We assume there is no temporal change or transition of emotional experience within a single clip because of its short duration. These videos were grouped into several playlists, each containing six video shots. To reduce the interaction between two consecutive target videos, a relaxing video approximately 1–2 min in length was shown between them. More than fifty healthy students were recruited to watch each playlist.

After watching each video shot, subjects were asked to report their actually experienced emotions using emotional valence and arousal on scales from −2 to 2, corresponding to highly negative to highly positive valence and calm to exciting arousal, respectively. Subjects also rated the intensity of the six basic emotional categories for the video on a scale from 0 to 4, where 0 indicates no particular feeling and 4 indicates a strong feeling. The average intensities of the self-reported data were used as the ground-truth emotional tags for the videos.

After data collection, a threshold is applied to transform the intensity of each emotional tag into a binary tag that represents whether a certain emotion is present or not. If the intensity is larger than the threshold, the tag is set to 1; otherwise, it is 0. The threshold for the emotional categories is 0.2, and that for valence and arousal is 1. The sample distribution is presented in Table 2.
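This binarization is a simple thresholding operation; a small sketch with illustrative intensity values follows.

```python
# Binarize averaged self-report intensities into presence/absence tags.
import numpy as np

def binarize(intensity, threshold):
    """Intensity strictly above the threshold -> tag present (1), otherwise 0."""
    return (np.asarray(intensity) > threshold).astype(int)

category_intensity = [0.1, 2.3, 0.4, 0.0, 1.1, 0.2]   # illustrative six categories
print(binarize(category_intensity, 0.2))               # -> [0 1 1 0 1 0]
print(binarize([1.6], 1.0))                            # discretized valence -> [1]
```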

Table 2 Sample distribution in self-constructed video data set

4.1.2 Evaluation metrics

For the two problem transformation methods, BR and RAkEL, an SVM with a linear kernel is used as the base classifier. The measurements obtained using the BR, RAkEL, BRkNN and MLkNN methods are used as the input to the BN to infer the final emotions. 10-fold cross-validation is adopted. For each fold, the four traditional multi-label learning methods and the BN share the same training set and testing set.

The evaluation metrics of multi-label classification differ from those of single-label classification, since for each instance there are multiple labels which may be classified partly correctly and partly incorrectly. There are two kinds of commonly used metrics, example-based and label-based measures (see [41] for an explanation of both), which evaluate the multi-label emotional tagging performance from the view of instances and labels respectively. We adopt both in this work. Let Y i denote the true labels of instance i (a binary vector), Z i the predicted labels of instance i, m the number of instances, and n the number of labels. The example-based measures (accuracy, precision, recall, F1-measure and subset accuracy) are defined in Eqs. 5–9 [41], and the label-based measures (precision, recall and F1-measure) are defined in Eqs. 10–12 [41]; a computational sketch of both groups of measures is given after the equations.

$$\label{eq5} Accuracy = \frac{1}{m}\sum_{i=1}^{m}\frac{\left| Y_i\cap Z_i \right|}{\left| Y_i\cup Z_i \right|} $$
(5)
$$\label{eq6} Precision = \frac{1}{m}\sum_{i=1}^{m}\frac{\begin{vmatrix} Y_i\bigcap Z_i \end{vmatrix}}{\begin{vmatrix} Z_i \end{vmatrix}} $$
(6)
$$\label{eq7} Recall = \frac{1}{m}\sum_{i=1}^{m}\frac{\begin{vmatrix} Y_i\bigcap Z_i \end{vmatrix}}{\begin{vmatrix} Y_i \end{vmatrix}} $$
(7)
$$\label{eq8} F_1 = \frac{1}{m}\sum_{i=1}^{m}\frac{2\begin{vmatrix} Y_i\bigcap Z_i \end{vmatrix}}{\begin{vmatrix} Y_i \end{vmatrix}+\begin{vmatrix} Z_i \end{vmatrix}} $$
(8)
$$\label{eq9} SubsetAccuracy = \frac{1}{m}\sum_{i=1}^{m}I(Y_i=Z_i) $$
(9)
$$\label{eq10} Precision,P_{micro} = \frac{\sum_{j=1}^{n}\sum_{i=1}^{m}Y^j_iZ^j_i}{\sum_{j=1}^{n}\sum_{i=1}^{m}Z^j_i} $$
(10)
$$\label{eq11} Recall,R_{micro} = \frac{\sum_{j=1}^{n}\sum_{i=1}^{m}Y^j_iZ^j_i}{\sum_{j=1}^{n}\sum_{i=1}^{m}Y^j_i} $$
(11)
$$\label{eq12} F_{1-micro}=\frac{2\sum_{j=1}^{n}\sum_{i=1}^{m}Y^j_iZ^j_i}{\sum_{j=1}^{n}\sum_{i=1}^{m}Y^j_i+\sum_{j=1}^{n}\sum_{i=1}^{m}Z^j_i} $$
(12)
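The sketch below computes these measures for binary matrices Y (ground truth) and Z (predictions) of shape m x n; it is for illustration only.

```python
# Example-based measures (Eqs. 5-9) and micro-averaged label-based measures (Eqs. 10-12).
import numpy as np

def multilabel_metrics(Y, Z):
    Y, Z = np.asarray(Y), np.asarray(Z)
    inter = (Y & Z).sum(axis=1).astype(float)          # |Y_i intersect Z_i|
    union = (Y | Z).sum(axis=1).astype(float)          # |Y_i union Z_i|
    example_based = {
        "accuracy":        np.mean(inter / np.maximum(union, 1)),
        "precision":       np.mean(inter / np.maximum(Z.sum(axis=1), 1)),
        "recall":          np.mean(inter / np.maximum(Y.sum(axis=1), 1)),
        "f1":              np.mean(2 * inter / np.maximum(Y.sum(axis=1) + Z.sum(axis=1), 1)),
        "subset_accuracy": np.mean((Y == Z).all(axis=1)),
    }
    tp = float((Y & Z).sum())                          # micro-averaged counts over all labels
    label_based = {
        "precision_micro": tp / max(Z.sum(), 1),
        "recall_micro":    tp / max(Y.sum(), 1),
        "f1_micro":        2 * tp / max(Y.sum() + Z.sum(), 1),
    }
    return example_based, label_based
```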

4.2 Experimental results and analyses of emotional tagging of music

We quantify the co-occurrence between different emotional tags using the conditional probability P(B|A), where A and B are two emotional tags; P(B|A) measures the probability that emotional tag B occurs given that emotion A occurs. Table 3 shows the conditional probabilities between different emotions for the music data set. The table shows that each music piece can display multiple emotions; for instance, quiet is often accompanied by relaxing and sad with high probability. Two kinds of relationships among emotions are clearly visible in the table: co-occurrence relationships and mutually exclusive relationships. For example, the probabilities P(angry|relaxing) and P(amazed|relaxing) are 0.0265 and 0.0492, which indicates that relaxing rarely coexists with angry or amazed. P(happy|angry) and P(happy|sad) are 0.0635 and 0.006, which shows that happy rarely coexists with angry or sad. Quiet often coexists with sad, as indicated by the high P(sad|quiet) of 0.7095.
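The conditional probabilities in Table 3 (and later Table 5) can be computed directly from the binary tag matrix; a small sketch follows.

```python
# Conditional co-occurrence probabilities P(B|A) from a binary m x n tag matrix Y.
import numpy as np

def cooccurrence_table(Y, names):
    """Return a dict {(A, B): P(B|A)} estimated as count(A and B) / count(A)."""
    Y = np.asarray(Y, dtype=float)
    table = {}
    for a, name_a in enumerate(names):
        count_a = Y[:, a].sum()
        for b, name_b in enumerate(names):
            if a != b and count_a > 0:
                table[(name_a, name_b)] = (Y[:, a] * Y[:, b]).sum() / count_a
    return table
```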

Table 3 Dependencies among emotional labels for the music dataset

To systematically capture such relationships among emotions in the music data, we learn a BN. Figure 3 shows the learned BN, where the shaded nodes are hidden nodes representing the true states we want to infer, and the unshaded nodes are the measurement nodes obtained from a traditional multi-label method. The links among the shaded nodes represent the dependencies among emotions. For example, the links from relaxing to angry and amazed demonstrate strong dependencies between these pairs. From Table 3, the probabilities P(angry|relaxing) and P(amazed|relaxing) are 0.0265 and 0.0492, which indicates mutually exclusive relationships. Meanwhile, the link from quiet to sad shows a co-occurrence relationship, because the probability P(sad|quiet) is 0.7095 in Table 3. These examples demonstrate that the BN can effectively capture the mutually exclusive and coexistent relations among emotional labels. Such relations are beyond the scope of those captured by commonly used multi-label learning methods.

Fig. 3 The learned BN structure from the music data set. The links among shaded nodes show the dependencies among emotions

Using the BN, we can then infer the true emotion labels by instantiating the measurement nodes with the emotion estimates obtained from a traditional multi-labeling classification method. The inference results are summarized in Table 4.

Table 4 Results of comparison experiments of our model and commonly used multi-label classifiers in music data set

Table 4 shows the performances of our approach and commonly used multi-label classifiers. From Table 4, we can obtain the following observations:

  1.

    RAkEL performs the best among the four commonly used multi-label classifiers, which is consistent with the results of [45] on the music data set. The reason may be that RAkEL considers the relations of randomly selected label subsets existing in the training label sets, whereas the other three methods ignore relations among labels. This confirms the importance of label relations for multi-label classification.

  2.

    Our approach outperforms the four commonly used multi-label classifiers: both the example-based and label-based measures of our approach are better than those of the four classifiers in most cases. This demonstrates the effectiveness of our approach, which captures the dependencies among emotional labels more effectively. Furthermore, our method increases the example-based accuracy, example-based F1, and label-based F1 in most cases, indicating that it not only improves the recognition accuracy but also makes the recognition results more balanced.

  3.

    The improvements that the BN brings to the four commonly used multi-label classifiers differ. The improvements on most measures are highest for BRkNN and lowest for RAkEL. RAkEL already considers label relations to some extent, whereas the other three methods do not; thus the gain from the relations modeled by the BN is smaller for RAkEL than for the others.

4.3 Experimental results and analyses of emotional tagging of videos

We performed a similar study for multi-emotion tagging of the video data. Table 5 shows the conditional probabilities between emotional labels for the video data set. From Table 5, we can also find that videos can induce multiple emotions. For instance, surprise is present with a high probability when happiness is present. Some degree of fear and surprise is present given disgust. Disgust, sadness, fear and surprise are always present when the video induces anger. Disgust and surprise may appear when fear is present. This is consistent with the previous findings described in [12]. Two kinds of relationships among emotions, co-occurrence and mutual exclusion, are clearly shown in the table. For example, the probability P(valence|happiness) is 1, which means these two tags occur together frequently and reflects the co-occurrence relationship. On the other hand, the probability P(happiness|fear) is 0.0588, which means there are few samples with a mixture of happiness and fear and indicates a mutually exclusive relationship. In addition, the arousal of all videos is evaluated as high, which indicates that the videos we collected aroused the interest of the subjects. Because there is no low-arousal video, the classifiers cannot produce a measurement for the arousal node in the experiment. Thus, in the following experiments, we can only obtain 7 measurements, i.e., the 6 basic emotional tags and valence.

Table 5 Dependencies among emotional labels for video data set

Given the video data, we can then learn a BN to capture the relationships among the emotions. The learned BN is shown in Fig. 4. As discussed above, there are only high-arousal videos in the data set, so the dependencies between arousal and the other emotional tags are not evident, and the arousal node is isolated in this structure. The links among the shaded nodes show the dependencies among emotions. For example, the link from fear to happiness demonstrates a strong relationship between the pair: from Table 5, the probability P(happiness|fear) is 0.0588, which indicates a mutually exclusive relationship. The link from happiness to valence shows a co-occurrence relationship, because the probability P(valence|happiness) is 1 in Table 5.

Fig. 4 The learned BN structure from the video data set. The links among shaded nodes show the dependencies among emotions

Table 6 shows the results of the comparison between our model and the commonly used multi-label classifiers on the video data set. From Table 6, we find that, among the four traditional methods, RAkEL again performs the best. Compared to the traditional methods, MET shows significant performance improvements on most of the measures. The improvements on most measures are highest for the BRkNN and MLkNN methods and lowest for RAkEL. As in Section 4.2, these observations further confirm the importance of label relations for multi-label classification and the effectiveness of our approach.

Table 6 Results of comparison experiment of our model and commonly used multi-label classifiers in video data set

5 Conclusions

Most current emotional tagging research tags multimedia data with a single emotion, ignoring the dependencies among emotions. In this work, we propose a unified probabilistic framework for multiple emotion tagging of media. First, the measurements are obtained using four traditional multi-label classification methods: BR, RAkEL, BRkNN and MLkNN. Second, a BN is used to automatically model the dependencies among emotional tags. The experimental results on two multi-label data sets show that our approach can effectively capture the co-occurrence and mutual exclusion relations among emotions and thus outperforms the other methods. The relations modeled by our approach are more flexible than the pairwise relations or fixed label subsets captured by current multi-label learning methods.

Two data sets are adopted in this study: the multiple emotion music data set [45] and the multiple emotion video data set collected by us. Both data sets are small, especially the multiple emotion video data set. The video data set is also imbalanced, since it does not include any low-arousal videos. A large-scale, balanced multi-label multimedia data set is a key requirement for research on multimedia emotional tagging. In the future, we will add low-arousal videos and extend our multi-label emotion video data set.