1 Introduction

Recent years have seen a rapid increase in the size of digital video collections. Because emotion is an important component in how humans classify and retrieve digital videos, assigning emotional tags to videos has been an active research area in recent decades [35]. This tagging work is usually divided into two categories: explicit and implicit tagging [21]. Explicit tagging involves a user manually labeling a video's emotional content based on his/her visual examination of the video. Implicit tagging, on the other hand, refers to assigning tags to videos based on an automatic analysis of a user's spontaneous response while consuming the videos [21].

Although explicit tagging is the predominant method at present, it is time-consuming and places an extra workload on users. Implicit tagging, in contrast, labels videos based on users' spontaneous nonverbal responses while they watch the videos, and can therefore overcome these limitations of explicit tagging.

Since most current theories of emotion [13] agree that physiological activity is an important component of emotional experience, and several studies have demonstrated the existence of specific physiological patterns associated with basic emotions [25], recognizing subjects' emotions from physiological signals is one of the implicit video tagging methods [1, 29, 30]. There are many types of physiological signals, including Electroencephalography (EEG), Electrocardiography (ECG), Electromyography (EMG), and Galvanic Skin Response (GSR). Research to date has shown that physiological responses are potentially a valuable source of external user-based information for emotional video tagging. Physiological signals reflect unconscious changes in bodily functions controlled by the Sympathetic Nervous System (SNS); these changes cannot be captured by other sensory channels or observational methods. However, physiological signals are susceptible to many artifacts, such as involuntary eye movements, irregular muscle movements and environmental changes. These artifacts pose a significant challenge for signal processing and hinder interpretation. In addition, subjects are required to wear complex apparatus to obtain physiological signals, which may make some subjects feel uncomfortable.

Another implicit video tagging method is to recognize subjects' emotions from their spontaneous visual behavior [2, 3, 8, 22, 24], such as facial expressions, since recent findings indicate that emotions are primarily communicated through facial expressions and other facial cues (smiles, chuckles, frowns, etc.). When obtaining an implicit tag from facial information, no complex apparatus other than a standard visible-light camera is needed, so this approach is more easily applied in real life. Furthermore, facial information is not significantly affected by body posture: subjects can move their bodies as they wish, and this freedom of motion makes them comfortable expressing their emotions. Although spontaneous visual behavior is prone to environmental noise such as varying lighting conditions and occlusion, it is more convenient and unobtrusive. Implicit tagging using spontaneous behavior is therefore a practical alternative to neuro-physiological methods. Research to date has already demonstrated that facial expressions can be a promising source for video emotion tagging. Most researchers have used the recognized expression directly as the emotional tag of the video. However, although facial expressions are the major visual manifestation of inner emotions, the two are distinct concepts and are not always consistent. In addition, facial expressions are easier to annotate than inner emotions, and extensive research on facial expression recognition in recent years has made much progress in this area. Thus, in this paper we propose a new implicit tagging method that infers subjects' inner emotions through a probabilistic model capturing the relations between outer facial expressions and inner emotions. This is more feasible than previous work that directly infers the inner emotion from the video/images, or than methods that simply take the outer facial expression as the inner emotion. We assume that the spontaneous facial expression reflects, to a certain degree, the user's actual emotion as a result of watching a video; the expression hence positively correlates with the user's emotion.

Furthermore, there are two kinds of emotional tags: the expected emotion and the actual emotion [7]. The expected emotion is contained in a video and is intended by the video's director to be communicated to users. It is likely to be elicited from the majority of users watching that video and can be considered a common emotion. In contrast, the actual emotion is the affective response of a particular user to a video. It is context-dependent and subjective, and it may vary from one individual to another; it can be considered an individualized emotion. Most implicit tagging research to date has not considered both tags. In this research, we infer these two kinds of emotional tags and use both of them to tag videos.

Our tagging method consists of several steps. First, the eyes in the onset and apex expression images are located. Head motion features are computed from the coordinates of the eyes in the onset and apex frames, and facial appearance features are extracted using the Active Appearance Model (AAM) [6]. Then, the subjects' spontaneous expressions are recognized using a set of binary Bayesian Network (BN) classifiers and, alternatively, Bayesian networks capturing the relations among appearance features (called structured BNs). After that, the common emotional tags of videos are inferred from the recognized expressions using a BN with three discrete nodes that captures the relations between the outer facial expressions, the individualized emotions and the common emotions. The novelty of this work lies in explicitly modeling the relationships between a video's emotional tag, the user's internal emotion, and the facial expression, and in leveraging these relationships for more effective video emotion tagging. Through this model, we can indirectly infer a video's emotional content instead of directly treating the subject's expression as the video's emotional tag, as is done by existing implicit video emotion tagging methods.

The outline of this paper is as follows. First, in Section 2 we introduce previous work related to implicit emotion tagging by using physiological signals and spontaneous behaviors. Then, our proposed implicit emotion tagging approach is explained in detail in Section 3. The experiments and analyses of facial expression recognition and emotion tagging are described in Section 4. Finally, some discussions and conclusions are presented in Section 5.

2 Related work

An increasing number of researchers have studied emotional video tagging from subjects' spontaneous responses. Vinciarelli et al. [21] were the first to present the disadvantages of explicit tagging and to introduce the concept, implementation and main problems of implicit Human-Centered tagging. Currently, implicit emotion tagging of videos mainly uses physiological signals or subjects' spontaneous visual behavior. In this section, the related studies are briefly reviewed.

2.1 Affective video content analyses using physiological signals

Several researchers have focused on implicit tagging using physiological signals, which can reflect subtle variations in the human body. Money et al. [17, 18] investigated whether users' physiological responses, such as galvanic skin response (GSR), respiration, Blood Volume Pulse (BVP), Heart Rate (HR) and Skin Temperature (ST), can serve as summaries of affective video content. They collected 10 subjects' physiological responses while the subjects watched three films and two award-winning TV shows. Experimental results showed the potential of physiological signals as external user-based information for affective video content summaries. They [19] further proposed Entertainment-Led Video Summaries (ELVIS) to identify the most entertaining sub-segments of videos based on their previous study.

Soleymani et al. [29, 30] analyzed the relationships between subjects' physiological responses, their emotional valence and arousal, and the emotional content of the videos. A dataset of 64 scenes from eight movies was shown to eight participants. The experimental results demonstrated that, besides multimedia features, subjects' physiological responses (such as GSR, EMG, blood pressure, respiration and ST) can be used to rank video scenes according to their emotional content. Moreover, they [9] implemented an affect-based multimedia retrieval system using both implicit and explicit tagging methods. They further constructed two multimodal datasets for implicit tagging. One is DEAP (Database for Emotion Analysis using Physiological Signals) [11], in which EEG and peripheral physiological signals, including GSR, respiration, ST, ECG, BVP, EMG and electrooculogram (EOG), were collected from 32 participants while they watched 40 one-minute excerpts of music videos; frontal face videos were also recorded for 22 of the 32 participants. The other is MAHNOB-HCI [32], in which face videos, audio signals, eye gaze, and peripheral/central nervous system physiological signals of 27 subjects were recorded during two experiments. In the first experiment, subjects self-reported their felt emotions to 20 emotion-inducing videos using arousal, valence, dominance and predictability ratings as well as emotional keywords. In the second experiment, subjects assessed agreement or disagreement of displayed tags with short videos or images.

While these two pioneering groups investigated many kinds of physiological signals as implicit feedback, other researchers focused on only one or two kinds. For example, Canini et al. [4] investigated the relationship between GSR and affective video features along the arousal dimension. Using correlation analysis on a dataset of 8 subjects watching 4 video clips, they found a dynamic correlation between arousal, derived from GSR measured during film viewing, and specific multimedia features in both the audio and video domains. Smeaton et al. [27] proposed to detect film highlights from viewers' HR and GSR. By comparing the physiological peaks with the emotional tags of films on a database of 6 films viewed by 16 participants, they concluded that subjects' physiological peaks and emotional tags are highly correlated and that music-rich segments of a film act as a catalyst in stimulating viewer response. Toyosawa et al. [33] proposed to extract attentive shots with the help of subjects' heart rate and heart rate variability.

Two research groups considered event-related potentials (ERPs) as subjects' implicit feedback. One [10] attempted to validate video tags using the N400 ERP on a dataset of 17 subjects, each recorded for 98 trials. The experimental results showed a significant difference in N400 activation between matching and non-matching tags. Koelstra et al. [12] also found robust correlations between arousal and valence and the frequency powers of EEG activity. The other group [36] performed implicit emotion tagging of multimedia through a brain-computer interface system based on the P300 ERP. Twenty-four video clips (four for each of the six basic emotional categories, i.e., joy, sadness, surprise, disgust, fear, and anger) and six basic facial expression images were displayed to eight subjects. The experimental results showed that their system can successfully perform implicit emotion tagging and that naive subjects who had not participated in the training phase could also use it efficiently.

Instead of using contact and intrusive physiological signals, Krzywicki et al. [14] adopted facial thermal signatures, a non-contact and non-intrusive physiological signal, to analyze the affective content of films. They examined the relationship between facial thermal signatures and emotion-eliciting video clips on a dataset of 10 subjects viewing three film clips selected to elicit sadness and anger. By comparing the distribution of temperatures with the summarized video clip events, they concluded that changes in global temperature are consistent with changes in stimuli and that different facial regions exhibit different thermal patterns in response to stimuli.

Rather than analyzing the affective content of videos, Arapakis et al. [1] predicted the topical relevance between a query and the retrieved results by analyzing implicit feedback, including facial expressions and peripheral physiological signals such as GSR and ST. Their results showed that the prediction of topical relevance is feasible and that the implicit feedback benefits from the incorporation of affective features.

The studies described above indicate the potential of physiological signals for the implicit emotion tagging of videos. However, to acquire physiological signals, subjects are required to wear several contact apparatuses, which may make them feel uncomfortable and hinders the real-world application of this method. Furthermore, some research also indicates that the accuracy of current emotion detection from physiological signals is neither superior to multimedia content analysis nor high enough to replace self-reports [31]. Improved or new methods are needed to meet the requirements of real applications.

2.2 Affective video content analyses using spontaneous visual behavior

Several researchers have turned to affective video content analysis based on human spontaneous visual behavior, since it can be measured with non-contact, non-intrusive techniques and is easily applied in real life. Joho et al. [3, 8] proposed to detect personal highlights in videos by analyzing viewers' facial activities. The experimental results on a dataset of 10 participants watching eight video clips suggested that, compared with activity in the lower part of the face, activity in the upper part tends to be more indicative of personal highlights.

Peng et al. [22, 24] proposed to fuse users' eye movements (such as blinks or saccades) and facial expressions (positive or negative) for home video summarization. Their experimental results on 8 subjects watching 5 video clips demonstrated the feasibility of both eye movements and facial expressions for video summarization. They [23] also proposed an interest meter module, integrated it into a video summarization system, and achieved good performance.

Liu et al. [15] proposed an implicit video multiple emotion tagging method by exploiting the relations among multiple expressions, and the relations between outer expressions and inner emotions. The experimental results on the NVIE database demonstrated that multi-expression recognition considering the relations among expressions improved the recognition performance. The tagging performance considering the relations between expression and emotion outperformed the traditional expression-based implicit video emotion tagging methods.

Rather than focusing on subjects' whole facial activity, Ong [20] analyzed affective video content from viewers' pupil sizes and gaze points. Experimental results on 6 subjects watching 3 videos showed the effectiveness of their approach.

Instead of affective content analysis of videos, Arapakis et al. [2, 3] proposed a video search interface that predicts topical relevance by incorporating facial expressions and click-through actions into user profiling, facilitating the generation of meaningful recommendations of unseen videos. The experiment with 24 subjects demonstrated the potential of multi-modal interaction for improving recommendation performance.

Although all the studies described above explored visual behavior to analyze the affective content of videos, their purposes (i.e., summarization [8, 22, 24], recommendation [2, 3], tagging [15]) and the modalities used (i.e., facial expressions [2, 3, 8, 22, 24], click-through actions [2, 3], eye movements [3, 22, 24]) are not the same. The facial expression classifiers used in the related work are eMotion (a facial expression recognition software package) [2, 3], the Support Vector Machine (SVM) [22, 24] and Bayesian networks [8, 15].

These studies illustrate the development of methods that use spontaneous visual behavior for the implicit tagging of videos. However, the assumption made by the above studies is that the expressions displayed by the subjects are the same as their internal feelings while watching the videos. For this reason, most researchers have used the recognized expression directly as the emotional tag of the video. However, research has indicated that internal feelings and displayed facial behaviors are related but not always the same [5], because some emotions are not always expressed in daily life. Furthermore, little research has distinguished between common emotions and individualized emotions. Therefore, in this paper we propose emotion tagging of videos by inferring videos' common emotions and users' individualized emotions from users' expressions. Furthermore, the datasets used in previous studies were small, with the number of subjects ranging from 6 to 32. Thus, a much larger emotion database, the NVIE database [34], was constructed, in which the facial videos of 128 subjects were recorded while they watched emotional videos under three illumination conditions (i.e., front, left and right).

The most closely related work is [15], which focuses on modeling the co-occurrence and mutual-exclusion relationships among different facial expressions for improved video emotion tagging. It does not differentiate the subject's internal emotion from the video's common emotion, treating them as the same. This paper, on the other hand, explicitly models the differences and the relationships between a video's common emotion and the user's internal emotion. The two works hence address different problems.

3 Implicit emotion tagging approach

Figure 1 gives an overview of our implicit emotion tagging approach. It consists of two components: an expression recognition model and a video emotion tagging model based on the recognized facial expressions. Details of each component are discussed below.

Fig. 1 Framework of our method

3.1 Facial expression recognition

Facial expression recognition includes facial feature extraction, feature selection, and expression recognition.

3.1.1 Facial feature extraction

Two kinds of features are extracted: head motion features and facial appearance features. Two head motion features, the translational speed of the head and the head's rotational speed, are calculated. First, the subject's eyes are located automatically using an eye location method based on AdaBoost and Haar features [16]. Then, the head motion features are calculated from the coordinates of the eyes in the onset and apex frames as follows:

$$ Speed_{m}=\frac{\sqrt{\left(C_{x}^{apex}-C_{x}^{onset}\right)^{2}+\left(C_{y}^{apex}-C_{y}^{onset}\right)^{2}}}{Time} $$
(1)
$$ Speed_{r}=\frac{\left|{\arctan\left(\frac{R_{y}^{apex}-L_{y}^{apex}}{R_{x}^{apex}-L_{x}^{apex}}\right)-\arctan\left(\frac{R_{y}^{onset}-L_{y}^{onset}}{R_{x}^{onset}-L_{x}^{onset}}\right)}\right|}{Time} $$
(2)

where \((C_{x}, C_{y})\) denotes the coordinate of the center point between the two eyes, \(Time=frame_{apex}-frame_{onset}\), and \((L_{x}, L_{y})\) and \((R_{x}, R_{y})\) denote the coordinates of the left and right eyes, respectively. In (1), \(Speed_{m}\) is the translational speed of the head motion; in (2), \(Speed_{r}\) is the rotational speed of the head. Both \(Speed_{m}\) and \(Speed_{r}\) are scalars.
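
For concreteness, the sketch below computes the two head-motion features in (1) and (2) in Python; the function name and argument layout are our own, and arctan2 is used in place of the arctan of the slope purely for numerical robustness.

```python
import numpy as np

def head_motion_features(left_onset, right_onset, left_apex, right_apex,
                         frame_onset, frame_apex):
    """Translational and rotational head speed from eye coordinates.

    Each eye argument is an (x, y) pair; the frame indices give the elapsed
    time in frames between the onset and apex expression images.
    """
    time = frame_apex - frame_onset

    # Eye-center coordinates in the onset and apex frames.
    c_onset = (np.asarray(left_onset) + np.asarray(right_onset)) / 2.0
    c_apex = (np.asarray(left_apex) + np.asarray(right_apex)) / 2.0

    # Eq. (1): translational speed of the eye center.
    speed_m = np.linalg.norm(c_apex - c_onset) / time

    # Eq. (2): change of the inter-ocular line's inclination per frame.
    angle_onset = np.arctan2(right_onset[1] - left_onset[1],
                             right_onset[0] - left_onset[0])
    angle_apex = np.arctan2(right_apex[1] - left_apex[1],
                            right_apex[0] - left_apex[0])
    speed_r = abs(angle_apex - angle_onset) / time

    return speed_m, speed_r
```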

Besides motion features, facial appearance features are also extracted. Since the AAM captures information about both appearance and shape [6], we use AAM to extract visible features from the apex expressional images. The AAM tools from [26] are used to extract the AAM features here.

All apex images were rotated so that the two eyes lie on a horizontal line and then normalized to \(400\times 400\) grayscale images with the center of the two eyes at (200, 160). Each face was labeled with 61 points as shown in Fig. 2. One third of the apex images were selected to build the appearance model, which was then applied to the remaining images to obtain their appearance parameters as appearance features. The AAMs are trained in a person-independent manner. Finally, a 30-dimensional feature vector was extracted from each apex image using the AAM algorithm.
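
As a rough illustration of this normalization step (not the authors' exact implementation), the following OpenCV sketch rotates a face so the eyes lie on a horizontal line and shifts their midpoint to (200, 160) in a 400×400 crop; any rescaling of the inter-ocular distance is omitted because it is not specified above.

```python
import cv2
import numpy as np

def normalize_face(gray, left_eye, right_eye, size=400, eye_center=(200, 160)):
    """Rotate the face so the eyes are horizontal and place their midpoint
    at a fixed position in a size x size grayscale image (illustrative only)."""
    left_eye = np.asarray(left_eye, dtype=np.float64)
    right_eye = np.asarray(right_eye, dtype=np.float64)
    center = (left_eye + right_eye) / 2.0

    # Inclination of the inter-ocular line with respect to the horizontal axis.
    dy, dx = right_eye[1] - left_eye[1], right_eye[0] - left_eye[0]
    angle = np.degrees(np.arctan2(dy, dx))

    # Rotate about the eye midpoint, then translate the midpoint to eye_center.
    M = cv2.getRotationMatrix2D((float(center[0]), float(center[1])), angle, 1.0)
    M[0, 2] += eye_center[0] - center[0]
    M[1, 2] += eye_center[1] - center[1]
    return cv2.warpAffine(gray, M, (size, size))
```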

Fig. 2 Distribution of AAM points

3.1.2 Feature selection

To select distinctive features for the Naive BN classifier of each expression category, the F-test statistic [37] is used for feature selection. Like the Fisher criterion, the F-statistic is the ratio of between-group variance to within-group variance. The significance of all features can be ranked by sorting their corresponding F-test statistics in descending order. The F-test statistic of feature X is calculated as follows:

$$ F(X)=\left(\frac{\sum_{c=1}^{N}n_{c}(\overline{x_{c}}-\overline{x})^{2}}{\sum_{c=1}^{N}(n_{c}-1)\sigma_{c}^{2}}\right) \left (\frac{n-N}{N-1}\right) $$
(3)

where N is the number of classes, \(n_{c}\) is the number of samples of class c, n is the total number of samples, \(\overline {x_{c}}\) is the average of feature X within class c, \(\overline {x}\) is the global average of feature X, and \(\sigma _{c}^{2}\) is the variance of feature X within class c. Features are then selected in descending order of their F-statistics.
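
A compact sketch of this ranking, with variable names of our own choosing, might look as follows.

```python
import numpy as np

def f_statistic(X, y):
    """F-test statistic of each feature column in X, following Eq. (3).

    X: (n_samples, n_features) array; y: class labels of length n_samples.
    """
    classes = np.unique(y)
    n, N = X.shape[0], len(classes)

    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    global_mean = X.mean(axis=0)
    for c in classes:
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - global_mean) ** 2
        within += (n_c - 1) * Xc.var(axis=0, ddof=1)

    return (between / within) * ((n - N) / (N - 1))

# Rank features by descending F-statistic and keep the top k:
# ranking = np.argsort(-f_statistic(X, y)); selected = ranking[:k]
```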

3.1.3 Expression recognition

Expression recognition through Naive BNs

Given the selected facial features, we propose to use naive BNs to recognize facial expressions because of their simplicity. A BN is a probabilistic graphical model (PGM) that encodes the causal probabilistic relationships of a set of random variables via a directed acyclic graph (DAG), where the nodes represent the random variables and the edges represent the conditional dependencies between variables. Compared with commonly used deterministic classifiers such as the SVM, a BN is simple and can effectively model the vagueness and uncertainty associated with affective states and facial features. In addition, a BN offers a principled inference method for classification.

To select discriminative features for each kind of expression, we construct N binary BNs instead of one multi-class BN. That is, the N-class expression recognition problem is solved by N binary BN classifiers, one for each facial expression, as shown in Fig. 3a.

Fig. 3 Expression recognition model and a simple BN model

Each BN consists of two nodes, a feature node F and a category node C, as shown in Fig. 3b. The former is a continuous node, and the latter is a discrete node with two states (1 and 0) representing the recognition result being expression \(C_{i}\) and not \(C_{i}\), respectively. Given the BN's structure, the BN parameters, i.e., the prior probability P(C) and the conditional probability \(P(F|C)\), are learned from the training data through maximum likelihood (ML) estimation. After training, the posterior probability \(P(C_{i}=1|F)\) of a testing sample is calculated according to (4):

$$ P(C_{i}=1|F)=\frac{P(C_{i}=1,F)}{P(F)}=\frac{P(F|C_{i}=1)P(C_{i}=1)}{P(F)} $$
(4)

After all the posterior probabilities for each expression have been calculated, the final recognized expression can be obtained as follows:

$$ RecExp^{*}= \underset{C_{i}}{\arg\max}\;P(C_{i}=1|F) $$
(5)
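
To make the classification rule in (4) and (5) concrete, here is a minimal sketch that models the continuous feature node with a diagonal Gaussian per class; the class structure, the per-expression feature subsets `X_sel`, and the Gaussian assumption are ours and need not match the implementation used in the experiments.

```python
import numpy as np
from scipy.stats import norm

class BinaryNaiveBN:
    """Two-node BN (class node C, continuous feature node F) with a diagonal
    Gaussian P(F|C) learned by maximum likelihood; a sketch, not the authors'
    exact parameterization."""

    def fit(self, X, y):                      # y is a 0/1 numpy array
        self.prior = np.array([np.mean(y == 0), np.mean(y == 1)])
        self.mu = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-6 for c in (0, 1)])
        return self

    def posterior_pos(self, x):
        # P(C=1 | F=x) via Bayes' rule, Eq. (4), computed in log space.
        log_lik = np.array([
            norm.logpdf(x, self.mu[c], self.sigma[c]).sum() for c in (0, 1)])
        joint = np.log(self.prior) + log_lik
        joint -= joint.max()                  # numerical stability
        p = np.exp(joint)
        return p[1] / p.sum()

# One binary classifier per expression; Eq. (5) picks the most probable one:
# models = {name: BinaryNaiveBN().fit(X_sel[name], (labels == name).astype(int))
#           for name in expressions}
# recognized = max(models, key=lambda name: models[name].posterior_pos(x_sel[name]))
```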

Expression recognition through modeling the structure of the feature points using BN

Instead of using Naive BNs, we propose another set of BNs to capture the structure of the feature points embedded in the N expression classes (called structured BNs), as shown in Fig. 4. Each node of the BN is a geometric feature (i.e., the coordinates of the feature points and the head motions), and the links and their conditional probabilities capture the probabilistic dependencies among all the geometric features.

Fig. 4 Expression recognition through modeling their geometric features

BN learning consists of structure learning and parameter learning. The structure consists of the directed links among the nodes, while the parameters are the conditional probabilities of each node given its parents. Structure learning finds a structure G that maximizes a score function. In this work, we employ the Bayesian Information Criterion (BIC) score, defined as follows:

$$ Score(G) = \max\limits_{\theta}\log p(DL|G,\theta) -\frac{Dim_{G}}{2} \log m $$
(6)

where the first term is the log-likelihood of the parameters \(\theta \) with respect to the data DL and the structure G, representing the fitness of the network to the data; the second term is a penalty on the complexity of the network, where \(Dim_{G}\) is the number of independent parameters and m is the number of training samples. After the BN structure is constructed, the parameters can be learned from the training data. Because complete training data are available in this work, Maximum Likelihood Estimation (MLE) is used to estimate the parameters.
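
For illustration, the following sketch evaluates the BIC score in (6) for one candidate structure over discretized features; the real features here are continuous, so the actual implementation may instead use a Gaussian likelihood, and a structure search (e.g., hill climbing) would call such a score function repeatedly.

```python
import numpy as np
from itertools import product

def bic_score(data, parents, n_states):
    """BIC score of a discrete BN structure, as in Eq. (6).

    data: (m, M) integer array of discretized feature values.
    parents: dict mapping node index -> list of parent node indices.
    n_states: number of states per (discretized) node.
    """
    m = data.shape[0]
    log_lik, dim = 0.0, 0
    for i, pa in parents.items():
        # Independent parameters of node i: (n_states - 1) per parent config.
        dim += (n_states - 1) * n_states ** len(pa)
        for pa_config in product(range(n_states), repeat=len(pa)):
            mask = np.all(data[:, pa] == pa_config, axis=1) if pa else \
                   np.ones(m, dtype=bool)
            counts = np.bincount(data[mask, i], minlength=n_states)
            total = counts.sum()
            if total == 0:
                continue
            probs = counts / total             # MLE conditional probabilities
            nz = counts > 0
            log_lik += np.sum(counts[nz] * np.log(probs[nz]))
    return log_lik - 0.5 * dim * np.log(m)
```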

In this work, N models \(\Theta _{c}, c = 1,\cdots ,N\) are established during training, where N is the number of expression categories. After training, the learned BNs capture the muscle movement pattern for N-class expressions respectively.

During testing, the samples are classified into the cth expression according to

$$\begin{array}{@{}rcl@{}} c^{\star}& = & \underset{c\in[1,n]}{\arg\max} \frac{P(E_{T}|\Theta_{c})}{Complexity(M_{c})} \notag\\ &= &\underset{c\in[1,n]}{\arg\max}\frac{ \prod_{i=1}^{M}P_{c}(F_{i}|pa(F_{i}))}{Complexity(M_{c})} \notag\\ &\propto &\underset{c\in[1,n]}{\arg\max} \sum_{i=1}^{M}log(P_{c}(F_{i}|pa(F_{i}))) - log(Complexity(M_{c})) \end{array} $$
(7)

where \(E_{T}\) represents the features of a sample, \(P(E_{T}|\Theta _{c})\) denotes the likelihood of the sample given the cth model, M is the dimensionality of the features, i.e., the number of nodes, \(F_{i}\) is the ith node in the BN, \(pa(F_{i})\) denotes the parent nodes of \(F_{i}\), \(M_{c}\) stands for the cth model, and \(Complexity(M_{c})\) represents the complexity of \(M_{c}\). Since different models may have different spatial structures, the model likelihood \(P(E_{T}|\Theta _{c})\) is divided by the model complexity for balance. We use the total number of links as the model complexity.
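
The decision rule in (7) then reduces to comparing complexity-penalized log-likelihoods, as in this short sketch (assuming the per-model log-likelihoods have already been computed from the learned conditional probabilities).

```python
import numpy as np

def classify_expression(sample_logliks, model_complexities):
    """Pick the expression whose learned BN best explains the sample,
    penalized by model complexity as in Eq. (7).

    sample_logliks[c]: sum_i log P_c(F_i | pa(F_i)) for the test sample
                       under the c-th expression model.
    model_complexities[c]: number of links in the c-th learned BN.
    """
    scores = [ll - np.log(comp)
              for ll, comp in zip(sample_logliks, model_complexities)]
    return int(np.argmax(scores))
```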

3.2 Emotion tagging of videos

The emotion elicitation process can be described as a generative process: a video induces the user's emotion, which in turn causes a certain facial expression on the user's face as an external manifestation of the user's internal emotion. To model this generative process and capture the inherent causal relationships between a video's emotional content, the user's internal emotion, and the user's facial expression, we propose to use another BN, shown in Fig. 5, for video emotion tagging.

Fig. 5 Common emotion (CEmo) recognition based on recognized expressions (RecExp) and subjects' individualized emotions (IEmo)

As a graphical model, a BN can effectively capture the causal relationships among random variables. It is therefore a natural choice for modeling the inherent relationships between the common emotion of the video, the specific emotion of the user, and the user's facial expression. Moreover, a BN allows rigorous inference of the video emotion tag from the recognized facial expression. This BN includes three discrete nodes and their links. The nodes represent the common emotion tag (CEmo), the individualized emotion tag (IEmo) and the recognized expression (RecExp), respectively, while the links capture the causal relationships among the nodes. Each node has N states, representing the N classes.

Given the BN in Fig. 5, a similar maximum likelihood estimation process is used to estimate its parameters, i.e., the conditional probabilities of each node: \(P(CEmo)\) (the prior probability of the common emotion), \(P(IEmo |CEmo)\) (the conditional probability of a subject's individualized emotion given the video's common emotion tag) and \(P(RecExp|IEmo)\) (the conditional probability of a sample's recognized expression given its individualized emotion). During testing, the individualized emotion tag \(IEmo^{*}\) and the common emotion tag \(CEmo^{*}\) are inferred using the following equations:

$$\begin{array}{@{}rcl@{}} IEmo^{*}&=&\underset{IEmo}{\arg\max} P(IEmo|RecExp) \notag\\ &=&\underset{IEmo}{\arg\max}\sum_{CEmo}^{}P(CEmo)P(IEmo|CEmo)P(RecExp|IEmo) \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} CEmo^{*}&=&\underset{CEmo}{\arg\max} P(CEmo|RecExp) \notag\\ &=&\underset{CEmo}{\arg\max}\sum_{IEmo}^{}P(CEmo)P(IEmo|CEmo)P(RecExp|IEmo) \end{array} $$
(9)
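
A minimal sketch of this tagging model, assuming integer-coded labels and that every class occurs in the training triples, is given below; `learn_cpts` and `tag` are hypothetical helper names, not the authors' code.

```python
import numpy as np

def learn_cpts(cemo, iemo, rec_exp, n_classes):
    """ML estimates of P(CEmo), P(IEmo|CEmo), P(RecExp|IEmo) from
    co-annotated training triples (integer labels in [0, n_classes))."""
    p_c = np.bincount(cemo, minlength=n_classes).astype(float)
    p_i_given_c = np.zeros((n_classes, n_classes))
    p_r_given_i = np.zeros((n_classes, n_classes))
    for c, i, r in zip(cemo, iemo, rec_exp):
        p_i_given_c[c, i] += 1
        p_r_given_i[i, r] += 1
    p_c /= p_c.sum()
    p_i_given_c /= p_i_given_c.sum(axis=1, keepdims=True)
    p_r_given_i /= p_r_given_i.sum(axis=1, keepdims=True)
    return p_c, p_i_given_c, p_r_given_i

def tag(rec_exp, p_c, p_i_given_c, p_r_given_i):
    """Infer the individualized and common tags from a recognized
    expression, following Eqs. (8) and (9)."""
    # joint[c, i] = P(CEmo=c) P(IEmo=i | CEmo=c) P(RecExp=rec_exp | IEmo=i)
    joint = p_c[:, None] * p_i_given_c * p_r_given_i[:, rec_exp][None, :]
    iemo_star = int(np.argmax(joint.sum(axis=0)))   # Eq. (8): sum over CEmo
    cemo_star = int(np.argmax(joint.sum(axis=1)))   # Eq. (9): sum over IEmo
    return iemo_star, cemo_star
```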

4 Implicit emotion tagging experiments

4.1 Databases for implicit video tagging

As mentioned in Section 2.1, Soleymani et al. constructed two multimodal datasets for implicit tagging, DEAP and MAHNOB-HCI. Both contain facial images of subjects watching videos, but neither has facial expression annotations. Because facial expression annotation is both onerous and subjective, and we do not at present have the necessary expertise to produce objective expression annotations, we do not use these two databases in this work. The NVIE (Natural Visible and Infrared facial Expression) database [34] is another multimodal database for facial expression recognition and emotion inference. NVIE contains both posed expressions and video-elicited spontaneous expressions of more than 100 subjects under three different illumination directions. During the spontaneous expression collection experiments, the participants self-reported their emotional experiences of the stimulus videos in six basic emotion categories, namely happiness, disgust, fear, sadness, surprise and anger; these self-reports can be regarded as the individualized emotional tags. The common emotional tags are determined by a majority rule. In addition, the NVIE database provides facial expression annotations of both apex facial images and image sequences in the six categories. This database is therefore suitable for our implicit video emotion tagging experiments. The database consists of samples in which the expression positively correlates with the user's emotion; it does not contain cases where the user's expression negatively correlates with their actual emotion. The construction details of the NVIE database can be found in [34]. The Appendix presents information on the stimulus videos and subjects.

For the purposes of this study, we selected from the NVIE database the facial image sequences whose emotion and expression categories are happiness, disgust, fear, surprise, sadness or anger, and whose average self-report evaluation values are larger than 1. Thus, six expression and emotion categories are considered in this paper. Ultimately, we selected 1154 samples together with the annotations of their expressions, individualized emotions and the videos' common emotional tags, as shown in Table 1; a total of 32 videos (6 happiness, 6 disgust, 5 fear, 7 surprise, 4 anger and 4 sadness videos) were watched by these subjects. The relations between the subjects' expressions, individualized emotional tags, and common emotional tags are summarized in Table 2.

Table 1 The information of the selected samples
Table 2 Relations between the samples’ expressions, individualized emotions and common emotions

From Table 2, it is clear that, although there is high consistency between facial expressions, individualized emotions, and common emotions, some discrepancies remain, especially for negative emotion or expression states such as anger. This suggests that while outer facial expressions largely reflect inner individualized or common emotions, they are not completely the same. Furthermore, the table also shows differences between a subject's individualized emotion and the video's common emotion, which means that the same video with the same common tag may evoke different individual emotions in different people. If we can exploit these relationships effectively, they may be helpful for emotion reasoning and video tagging from facial expressions.

4.2 Experimental conditions

To select the best feature dimension for the naive BN classifiers, we employ a model selection strategy with 10-fold cross validation. First, all the samples are divided into ten parts: one part is used as the test set, and the remaining nine are used as the training set. We then apply a 10-fold cross validation on the training set to choose the feature dimension that achieves the highest accuracy on the validation folds. After that, the selected features and the constructed BNs are applied to the test set to classify facial expressions.
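
A sketch of this selection step is given below, assuming the feature columns have already been ranked (e.g., by the F-statistic of Section 3.1.2) and that a hypothetical helper `build_and_score` trains a classifier on the top-k features and returns its validation accuracy.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_feature_dim(X, y, candidate_dims, build_and_score):
    """Inner 10-fold CV on the training set to pick the feature dimension
    with the best mean validation accuracy (illustrative sketch)."""
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    mean_acc = {}
    for k in candidate_dims:
        # Evaluate the top-k (pre-ranked) features on every inner fold.
        accs = [build_and_score(X[tr][:, :k], y[tr], X[va][:, :k], y[va])
                for tr, va in kf.split(X)]
        mean_acc[k] = np.mean(accs)
    return max(mean_acc, key=mean_acc.get)
```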

To evaluate the effectiveness of our approach from different aspects, two commonly used metrics, the precision and the \(F_{1}\) score [28], are adopted; they are defined as follows:

$$ Precision(C_{i})=\frac{TP(C_{i})}{TP(C_{i})+FP(C_{i})} $$
(10)
$$ F_{1}(C_{i}) =\frac{2\times TP(C_{i})}{2\times TP(C_{i}) + FN(C_{i}) +FP(C_{i})} $$
(11)

where TP (true positives) is the number of samples correctly labeled as belonging to the positive class \(C_{i}\), FP (false positives) is the number of samples incorrectly labeled as belonging to the positive class \(C_{i}\), and FN (false negatives) is the number of samples that belong to the positive class \(C_{i}\) but are not labeled as such.
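
Equivalently, both metrics can be read off a per-class confusion matrix, as in this small sketch.

```python
import numpy as np

def precision_f1(confusion):
    """Per-class precision and F1 from a confusion matrix whose rows are
    true classes and columns are predicted classes, following (10)-(11)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as class i but wrong
    fn = confusion.sum(axis=1) - tp   # class i samples predicted otherwise
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fn + fp)
    return precision, f1
```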

Our work focuses on emotion tagging using facial expressions, while the related works mentioned in Section 2.2 explored facial expressions, click-through actions, or eye movements for video summarization, recommendation and tagging. Because their purposes and modalities are not exactly the same as ours, we cannot directly compare our work with theirs. The facial expression classifiers used in the related work are eMotion (a facial expression recognition software package) [2, 3], SVM [22, 24] and Bayesian networks [8, 15]. The Bayesian networks used in [8] are similar to our structured BN; therefore, the experimental results using the structured BN can be regarded as a comparison with [8].

4.3 Experimental results and analyses

4.3.1 Experimental results and analyses of expression recognition

Following the expression classification models described in Section 3.1.3, two comparative experiments, one using only the AAM features and one using the combination of AAM and head motion features, are performed to recognize the outer expressions. The classification precisions and \(F_{1}\) scores of the two experiments are shown in Table 3.

Table 3 Expression recognition results with (AAM+HM) and without (AAM) head motion features

From Table 3, we make the following two observations. (1) For the two expression recognition methods, the overall precisions using the AAM features versus the AAM+HM features are 0.602 vs. 0.640 and 0.530 vs. 0.563, respectively; that is, head motion features improve the overall classification results by more than 3 %. The average \(F_{1}\) scores of the classifiers with head motion are also higher than those without, which indicates that head motion is useful for spontaneous facial expression recognition. (2) The recognition rate of happiness is high, while the recognition rates of the negative expressions are relatively low. The reason may be that it is easier to elicit positive expressions than negative expressions with the video-based emotion elicitation method [5].

For the Naive BN classifiers, the selection probabilities of all 32 features, including the head motion features (feature IDs 31–32), over the ten folds are shown in Fig. 6. From Fig. 6, we can conclude that: (1) For happiness, disgust, and fear, the head motion features are selected, which means they are helpful for distinguishing these expressions when naive BNs are used; this is also reflected in the recognition results in Table 3. (2) The number of selected features differs across expressions. For happiness, disgust, and fear, more than half of the features are selected, while for the other three expressions only a few features are selected, especially for anger and sadness. This shows that the discriminative features differ from expression to expression.

Fig. 6 Feature selection results of expression recognition using Naive BN

For the structured BNs, the learned structures are shown in Fig. 7. From the figure, we can see that: (1) The learned structures for the six expressions differ, which may indicate that the relations among the appearance features embedded in different expressions are different. (2) For most expressions, one or two head motion features are dependent on AAM features, which may confirm that head motions are related to facial appearance in expression manifestation.

Fig. 7 The learned BN structures for the six expressions using AAM and head motion features: a Happiness; b Disgust; c Fear; d Surprise; e Anger; f Sadness

4.3.2 Experimental results and analyses of emotion tagging

Based on the recognized facial expressions, the subjects' individualized emotion states and the videos' common emotional tags are inferred by the 3-node BN model. Tables 4 and 5 present the precisions and \(F_{1}\) scores of the tagging results. From the tables, we can see that both the precision and the \(F_{1}\) score of individualized tagging are lower than those of common tagging, which illustrates the difficulty of personalized video emotion tagging, since individualized emotions are context-dependent, subjective and complex.

Table 4 Individualized emotion tagging results based on AAM and AAM+HM features
Table 5 Common emotion tagging results based on AAM and AAM+HM features

To further validate the effectiveness of our proposed tagging method, comparative tagging experiments that recognize the common and individualized emotional tags directly from the facial features are conducted. The classification models are similar to the expression classification models described in Section 3.1.3, with the original expression labels of the samples replaced by the common and individualized emotional tags.

The precisions and \(F_{1}\) scores of the experiment results are shown in Tables 4 and 5. Comparisons and corresponding conclusions are listed as follows:

  • Comparing the results of directly inferring tags from image features with those of our method, we find that both the common emotion tagging and the individualized emotion recognition results obtained by considering the relations among the outer expressions, individualized emotional tags and common emotional tags are superior to those obtained without considering these relations, in terms of both precision and \(F_{1}\) score. This demonstrates the effectiveness of our proposed implicit emotion tagging method.

  • Comparing the results with and without head motion features, it is clear that head motion features improve the overall tagging results in terms of both precision and \(F_{1}\) score. For specific categories, head motion features improve the precision of happiness, disgust, sadness and, especially, fear.

5 Conclusions

Emotion tagging of videos has been an active research area in recent decades. Implicit tagging using audiences' spontaneous responses has become increasingly attractive because of its potential applications, and preliminary research has been carried out. In this paper, we propose an implicit video tagging method based on subjects' spontaneous facial expressions. To recognize facial expressions, a set of binary Naive BNs and structured BNs are employed. The common and individualized emotional tags of a video are then inferred from the recognized facial expressions through a 3-node BN that explicitly models the relations among the outer facial expressions, the individualized emotional tags and the common emotional tags. The results show that head motion features improve the overall performance of spontaneous expression recognition, individualized emotion tagging and common emotion tagging, and that the captured relations among the outer facial expressions, individualized emotional tags and common emotional tags are helpful for implicit video tagging. However, the performance improvement is minor and incremental, possibly because these relationships vary with the subject and with the emotion; we shall investigate this issue further in the future. Our method also requires a dataset with simultaneous annotations of facial expressions, audiences' inner emotions and video emotion tags. Unfortunately, this is not the case for existing databases other than NVIE; for example, neither DEAP nor MAHNOB-HCI has facial expression annotations. Using these or any other existing databases would require us to provide the missing annotations, and annotating a database is an onerous and time-consuming task that demands expertise we currently lack in order to produce accurate and objective labels. Thus, in this paper, we evaluate our method only on the NVIE database; we will perform further evaluation on other databases in the future.

Existing implicit tagging work regards subjects' facial expressions directly as the videos' emotional tags and has rarely considered both the individualized and the common emotional tag. Compared with these works, we are the first to propose a BN model that systematically captures the differences and relations among the outer facial expressions, subjects' inner individualized emotions, and the videos' common emotion categories. We find that emotion tagging based on facial expression recognition and BN inference that considers these relations outperforms direct individualized or common emotion tagging from facial images.