1 Introduction

News can reach audiences through different communication media, such as newspapers, magazines, radio, television, and the Internet [61]. However, over the last years, a considerable volume of digital innovations has introduced a new set of influences on the public’s news habits. More and more people have expressed a clear preference for getting their news on a screen, especially in the form of videos [47]. In fact, news can be delivered more quickly and accessed more easily through videos. In this scenario, even though television still remains the dominant screen, other popular digital devices, such as smartphones, tablets and computers, have increasingly attracted more viewers [47].

The news media constitute a particular type of discourse and a specific kind of sociocultural practice [70], which has become an important part of public life. Therefore, it is of great importance to understand how the news industry works and how it influences the social world. Since a news program constitutes a particular type of discourse, discourse analysis techniques [11, 14, 56] have been applied to analyze its structure at different levels of description, considering aspects such as the topics addressed, enunciation schemes and its stylistic or rhetorical dimensions [12]. Discourse analysis is the area of linguistics that focuses on the structure of language in enunciation acts [11, 29]. It is of interest both for the complexity of the structures that operate at that level and for the insights it offers about how personality, relationships, and community identity are revealed through patterns of language use [4].

Commonly, discourses have been analyzed without the support of computational tools, such as automated annotation software and video analytics toolboxes [66]. However, with the constant and fast development of areas such as computational linguistics, sentiment analysis, information retrieval and computer vision, novel methods have been proposed to support discourse analysis, especially of multimedia content (e.g., news videos) [4, 10, 14, 15, 18, 20, 27, 34, 57, 76]. Importantly, computer-aided methods serve as complementary tools, providing the analyst with a much better understanding of language use.

According to Stegmeier [66], the use of computational tools in discourse analysis allows qualitative approaches to be combined with quantitative ones, contributing to handle the following aspects: (1) number of documents (corpus size): large databases may be analyzed when computational tools are available, which would not be possible otherwise; (2) document enrichment (corpus quality): the corpus may be enriched with additional information (metadata provided by annotation processes); and (3) automatic pattern detection: computational models and statistical measures may be applied to assist in the automatic detection of patterns and in the description of the significance of these findings.

As a step toward this goal, we present a novel multimodal approach to support discourse analysis of news videos, by estimating tension levels along the news narrative, which are fundamental cues to reveal the distinct communication patterns used by the news industry. Among other things, those patterns may sometimes be used to shape public opinion, promote commercial products and services, publicize individuals, or support other interests [36].

Our key observation is that, by combining audio and visual cues extracted from news participants (e.g., reporters, anchors, among others), as well as textual cues obtained from the closed caption and speech transcriptions of the news narrative, it is possible to estimate tension levels (polarities), as illustrated in Fig. 1. The proposed approach is based on robust computational methods for: (1) emotion recognition from facial expressions [5, 40], (2) field size estimation [14], (3) extraction of audio speech features (e.g., chroma features, Mel Frequency Cepstral Coefficients and spectral features) [22] and (4) sentiment analysis of textual information [54].

Fig. 1

Overview of the proposed approach: combination of audio, textual and visual cues to estimate tension levels (polarities) along the news narrative

To the best of our knowledge, no other approach explores different information modalities (audio, textual and visual) to measure tension levels in news videos. This is the main contribution of this work, and the proposed approach arises as a promising tool for several domains, such as journalism, advertising and marketing [66, 69]. By using our solution, for example, media analysts may perform the semiodiscoursive analysis not only of verbal but also of non-verbal languages, which are manifested along the news narrative through facial expressions and gestures from news participants (e.g., reporters, anchors, among others) [17]. Additionally, by using tension levels as input, alternative news summarization and classification algorithms may be developed, or even novel video advertising strategies may be created, by considering the inclusion of video ads at points of the news narrative where the tension is low.

The multimodal approach presented in this paper builds on our previous work [56, 57] with (1) an updated and more comprehensive discussion of related work, (2) the detailed description of all steps and capabilities of the proposed approach, (3) some improvements in the visual analysis step with respect to the way that faces are detected, their corresponding emotions are recognized and a participant’s field size is determined, (4) an important alteration in the way that tension curves for distinct information modalities are computed and later combined, (5) a new set of experiments and (6) a detailed analysis of the performance of our approach.

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 covers the proposed approach. Experimental results and discussions are presented in Section 4, followed by the conclusions and suggestions for future work in Section 5.

2 Related work

Over the last years, with the emergence of different social media platforms, such as YouTube, Facebook and Flickr, significant efforts have been made to develop methods to mine opinions and identify affective information from multimodal features (e.g., audio, textual and visual features) extracted from video documents [9, 44, 50, 58, 75]. In this section, we initially review the state of the art on multimodal sentiment analysis methods, which tackle problems related to the one addressed in this work. Research in this area has attracted the attention of both academia and industry and led to the creation of innovative intelligent systems [58]. Afterwards, we describe works that have applied sentiment analysis methods to news documents delivered through different communication media.

2.1 Multimodal sentiment analysis

Multimodal sentiment analysis is the analysis of emotions, attitudes and opinions using multimedia content [64]. Its importance lies in the existence of information sources other than text, such as the images and audio in television programs and online videos. A multimodal method can therefore use various media besides text, such as audio and video, to increase the accuracy of sentiment classification by employing emotional content analyzers. The integration of those resources allows the results of sentiment analysis on textual metadata, usually determined by the polarity and intensity of lexical dictionaries, to be combined with emotional audio signal classification and emotional content analysis of videos based on postures, gestures and facial expressions. In this context, [3] presents the use of machine learning with a neural network and multimodal features to perform sentiment analysis on microblogging content containing short texts and, in some cases, one image. The proposed approach obtained good classification results by using efficient models to deal with syntactic and semantic similarities between words, as well as unsupervised learning of robust visual features obtained from partial observations of images modified by occlusions or noise.

Poria et al. [58] presented an innovative approach for multimodal sentiment analysis, which consists of collecting the sentiment of Web videos through a model that fuses audio, visual and textual modalities as information resources, obtaining a precision of around 80%, an increase of more than 20% in precision compared to state-of-the-art systems.

In online news, we can also analyze the comments left by users. Some of those comments contain phrases with highly emotional content but no keyword from which that emotion can be detected. The study presented in [75] analyzed audiovisual comments in order to detect emotions in the users’ facial expressions. From the audiovisual information, they were able to extract emotions from video and audio simultaneously, allowing the classification of the customer experience as positive, negative or neutral.

Maynard et al. [44] describe an approach for sentiment analysis of social media content, combining opinion mining in text and in multimedia resources (e.g., images, videos). They focused on entity and event recognition to help archivists select material for inclusion in social media archives, preserving community memories and organizing them into semantic categories. The approach was also able to resolve ambiguity and to provide more contextual information. They use a rule-based approach for text, addressing issues inherent to social media, such as grammatically incorrect text, profanity and sarcasm. Besides the novel combination of tools for text opinion mining and multimedia resources, Natural Language Processing (NLP) tools were adapted for opinion mining in social media.

2.2 Tension and sentiment analysis in news

Many people read online news on the websites of major news portals. These websites need to create effective strategies to draw people’s attention to their content. Recent efforts have explored sentiment analysis techniques to examine news articles or to create new applications [54].

In this context, [33] focused on sentiment analysis of news articles. They first collected the news and, by analyzing the positive and negative content, found that the majority of news items have a negative subject, such as corruption, robbery and rape. They then observed that news websites usually give prominence to negative news, while positive news tends to be less emphasized. Thus, the main goal of that work was to provide a positive environment by finding news with positive sentiment and creating a platform that highlights them. To accomplish this, they extracted news articles from online news portals and identified the positive and negative sentiments in their content by using a hybrid approach combining two classifiers, Naïve Bayes and decision table, which improved the classification performance.

In [38], the authors proposed a bootstrapping semi-supervised algorithm to analyze news comments in People’s Daily, the biggest newspaper group in China. This approach groups the target sentiment value, the extraction of lexicon features from a dictionary and the sentiment prediction into a unified structure. The lexicon consists of a set of Chinese words that can act as strong or weak clues to subjectivity. The main goal is to help political scientists perform a quantitative analysis of the emotional tension of politics. Unlike other works in the literature, time information was taken into account by using a hierarchical Bayesian model.

In 2012, [39] presented an approach to build a German news corpus about politics for opinion mining. This corpus was trained using a state-of-the-art technique for association rule learning in order to correlate the news title with the polarity of the news comments left by readers. The association rule learning was used as support for a minimally supervised machine learning framework, which obtained a negative sentiment for 86.2% of the news. They also show that the use of high tension in headlines can be a strategy to improve content popularity.

Most of these methods have been developed for English and are difficult to generalize to other languages in order to make cross-cultural comparisons. In this context, the authors in [6] explored state-of-the-art machine translators in order to perform sentiment analysis on the English translation of a foreign-language text. The experiments indicate that, after applying normalization techniques, the sentiment polarities extracted by the method were statistically correlated across news sources in nine languages, independent of the translator.

In [13], the authors analyze the forms of tension between what is presented in the news as objectivity and the television aestheticization of intense human suffering as part of the media strategies of TV programs. The work used as a case study the bombing of Baghdad in 2003, during the Iraq war, shown in long shot in an almost literary narrative that marketed a horror aesthetic. The paper argues that this aesthetics of emotion and suffering generates, at the same time, an attempt to preserve the status of objectivity and impartiality while leading the audience to choose sides on the fact.

Note that the vast majority of the works mentioned above refer to sentiment analysis approaches. Therefore, even though those works aim at computing a specific kind of affective information, they do not address the particular problem of estimating tension levels in news videos. As a matter of fact, sentiment and tension are different concepts. More specifically, while a sentiment is traditionally understood as a polarity measure (positive, neutral or negative) that can be detected through affective states expressed in a user’s opinion on a certain content [64], a tension level indicates to what extent a content can affect the users’ emotions through distinct polarities (e.g., high and low) [11, 14, 56]. That said, tension levels are estimated from the news content itself, while sentiment polarities are frequently derived from users’ opinions.

Usually, a sentiment refers to affective states with either positive polarity (e.g., joy and surprise), neutral polarity (objective content) or negative polarity (e.g., anger, disgust, fear, and sadness) [58]. On the other hand, a tension level is expressed as high or low [57]. In news stories that contain high tension facts, there is a perception that the narrative induces a sense of conflict, violence, tragedy and death (homicide), revealing world problems that can produce a negative affective state in the viewer, especially when the scenario of the fact is close to the public. Low tension news, in turn, can also be tragic, but as the scenario of the event may be more distant from the everyday life of the public, it leads to a somewhat more neutral affective state. Other examples of low tension news are those in which the subject is purely informative, with low spectacularization, such as sporting events, celebrations, cooking tips and technological advances. In this paper, sentiment analysis methods are used to extract affective information from the textual data assigned to the news video. In this case, we map sentiments with positive and neutral polarities to the low tension level. Sentiments with negative polarity, in contrast, are considered cues of a high tension level.

3 The proposed multimodal approach

This section describes the proposed approach to estimate tension levels in news videos, which is divided into five main steps, as illustrated in Fig. 2. The first step is responsible for extracting the elementary data of a video, namely, its audio signal and its list of frames. The second step performs the visual analysis of each image frame by applying robust methods for emotion recognition from facial expressions [5, 40] and field size estimation [14]. In the third step, our approach analyzes the audio signal and extracts audio features, such as chroma features, Mel Frequency Cepstral Coefficients and spectral features. Moreover, this step is responsible for transcribing the audio signal into text by using the IBM Watson™ Speech-to-Text service [28, 51]. The fourth step, in turn, performs the sentiment analysis of the speech transcriptions of the news narrative [2]. As a result of the second, third and fourth steps, three different curves are computed, one for each information modality (audio, images and text). Those curves contain tension levels along time, whose values are modeled as polarities (high or low tension). Finally, the fifth step consists of combining those three curves to obtain a global tension curve for the news video. The five steps are described in the following.

Fig. 2

Overview of the proposed approach to estimate tension levels in news videos

3.1 Elementary data extraction

The first step of our approach consists of extracting the elementary data of a news video (audio signal, image frames and closed caption). To accomplish this task, we use the FFmpeg multimedia framework [7]. Specifically, we obtain for each input video its corresponding audio track as a stereo (2 channels) .WAV file, by considering a sampling rate of 44.1 kHz and a sample size of 16 bits (CD audio quality).

The extracted image frames are represented in the RGB color space and are additionally converted to grayscale and resized to 480 x 360 pixels, so that they can be properly used by the adopted method for emotion recognition from facial expressions [5, 40], which requires this conversion. Furthermore, all frames are resized to the same resolution to ensure that image size does not affect the results.
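For concreteness, the listing below sketches how this elementary data extraction could be scripted by calling FFmpeg from Python; the output file names, frame naming pattern and directory handling are illustrative assumptions rather than part of our pipeline.

```python
import pathlib
import subprocess

def extract_elementary_data(video_path, audio_out="audio.wav", frames_dir="frames"):
    """Sketch of Section 3.1: extract the audio track and image frames of a news video."""
    pathlib.Path(frames_dir).mkdir(exist_ok=True)

    # Audio track: stereo WAV file, 44.1 kHz sampling rate, 16-bit samples (CD audio quality).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-acodec", "pcm_s16le", "-ar", "44100", "-ac", "2", audio_out],
                   check=True)

    # Image frames: converted to grayscale and resized to 480 x 360 pixels.
    subprocess.run(["ffmpeg", "-y", "-i", video_path,
                    "-vf", "scale=480:360,format=gray",
                    f"{frames_dir}/frame_%06d.png"],
                   check=True)
```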

3.2 Visual analysis

The second step of the proposed approach performs the visual analysis of the image frames by initially detecting human faces on them [71] and, subsequently, by applying robust methods for emotion recognition from facial expressions [5, 40] and field size estimation [14]. Through the analysis of those visual cues, we estimate a curve τv(⋅) representing the tension levels along the news narrative for each one-second interval.

3.2.1 Face detection

Faces in an image frame usually attract the viewer’s attention and constitute an important semantic feature that may be used to measure the tension level in the news narrative. By employing the real-time Viola and Jones face detection method [71], we obtain the face information in each frame, including the number of faces, as well as their sizes and positions.

In order to improve the robustness of our face detection approach, the Viola and Jones method has been implemented with two different classifiers: one for the face detection itself, named the primary classifier, and another one for eye detection, named the secondary classifier. Initially, the primary classifier works on the whole frame and provides a list of frame regions that were classified as potential faces. Then, each of those regions is analyzed by the secondary classifier in order to remove eventual false positives.

Considering that the size and position of a face usually reflect its importance and may affect not only the viewer’s attention [32, 43], but also the tension level assigned to the time slot represented by the frame, we use both of those visual cues to select a single face in the frame, which is considered the most relevant according to those features. That face is then used as a reference to estimate the tension level in the news narrative by applying to it methods for emotion recognition from facial expressions [5, 40] and field size estimation [14].

More specifically, for each detected face i of a given image frame k, we compute a relevance measure ρi(k) by using (1) and (2):

$$ \rho_{i}(k) = \sum\limits^{{x^{i}_{o}}+w_{i}}_{x={x^{i}_{o}}}\sum\limits^{{y^{i}_{o}}+h_{i}}_{y={y^{i}_{o}}}g(x,y), $$
(1)
$$ g(x,y) = e^{-\frac{1}{2} \cdot \left[\left( \frac{x - x_{c}}{\sigma_{x}}\right)^{2} + \left( \frac{y - y_{c}}{\sigma_{y}}\right)^{2}\right]}, $$
(2)

in which \(({x^{i}_{o}},{y^{i}_{o}})\) are the origin coordinates of the i-th face (the left bottom corner of the face), wi and hi are the width and height of the i-th face, respectively, g(x,y) is a two-dimensional Gaussian function used to weigh the face’s position in the k-th frame, and (xc,yc) and (σx,σy) are the center coordinates and the standard deviations in the x and y directions of the Gaussian function, respectively. Importantly, the center coordinates of the Gaussian function correspond to the frame center coordinates, and if no faces are detected in the k-th frame, no tension level is estimated from it. A given face i is considered the reference face of frame k if it has the highest relevance measure ρi(k) in that frame.
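A minimal sketch of this reference-face selection is given below, assuming face bounding boxes in (x_o, y_o, width, height) form, as produced by a Viola-Jones detector; the Gaussian spreads σx and σy are illustrative values, not the ones used in our experiments.

```python
import numpy as np

def face_relevance(face, frame_shape, sigma_x, sigma_y):
    """Relevance measure rho_i(k) of Eqs. (1)-(2): a frame-centered 2-D Gaussian
    summed over the face's bounding box, favoring large faces near the center."""
    x_o, y_o, w, h = face
    height, width = frame_shape[:2]
    x_c, y_c = width / 2.0, height / 2.0              # Gaussian centered on the frame
    xs, ys = np.meshgrid(np.arange(x_o, x_o + w), np.arange(y_o, y_o + h))
    return np.exp(-0.5 * (((xs - x_c) / sigma_x) ** 2 +
                          ((ys - y_c) / sigma_y) ** 2)).sum()

def reference_face(faces, frame_shape, sigma_x=160.0, sigma_y=120.0):
    """Return the face with the highest relevance in the frame, or None."""
    if not faces:
        return None
    return max(faces, key=lambda f: face_relevance(f, frame_shape, sigma_x, sigma_y))
```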

3.2.2 Emotion recognition from facial expressions

Here we describe how the method of [5] is used to recognize the emotion of each reference face obtained as described in Section 3.2.1.

For each reference face, our approach recognizes one of the eight basic emotions also considered in [41], namely: happiness, surprise and neutral, which define the low tension polarity, as well as fear, anger, sadness, disgust and contempt, which define the high tension polarity. The emotions were divided into low tension and high tension according to the Geneva Emotion Wheel presented in [60]. In that study, happiness and surprise were classified as positive valence emotions (i.e., those derived from positive situations or items [45]) and fear, anger, sadness, disgust and contempt as negative emotions.

In order to predict these emotions, we apply the methodology proposed by [5], which is based on Gabor filters. Gabor filters have been commonly used in the literature for edge detection and for extracting texture characteristics from an image in many pattern recognition applications [21, 35, 53, 77], and can therefore be used to differentiate between facial expressions depicted in images.

More specifically, the approach in [5] applies a Gabor filter bank and then uses a machine learning method, namely a Support Vector Machine (SVM), to predict the basic emotions. To accomplish this, the authors used a training set defined by Tr = {(f1,a1),(f2,a2),...,(fn,an)}, where ai is the emotion previously annotated by humans for the recognized face fi, and each face fi is represented by its Gabor filter responses. From this training set, the SVM learns a model to predict the basic emotions. In order to train and evaluate this model, the authors used the CK+ dataset [42], which has 4,830 facial expressions from 210 adult faces, annotated with the basic emotions. In our approach, we use the Emotime tool, available at https://github.com/luca-m/emotime.
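The snippet below illustrates the general idea of this recognition pipeline (a Gabor filter bank followed by an SVM). It is a simplified sketch, not the actual Emotime implementation; the filter-bank parameters and the pooling scheme are assumptions made here for illustration only.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def gabor_descriptor(face_gray, ksize=21):
    """Simple Gabor-bank descriptor of a grayscale face crop (illustrative parameters)."""
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):      # four filter orientations
        kernel = cv2.getGaborKernel((ksize, ksize), 4.0, theta, 10.0, 0.5, 0)
        response = cv2.filter2D(face_gray.astype(np.float32), cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])   # mean/std pooling per filter
    return np.array(feats)

# Training sketch for Tr = {(f_i, a_i)}: X holds one descriptor per annotated face
# (e.g., from CK+), y holds the corresponding emotion labels.
# clf = SVC(kernel="linear").fit(X, y)
# emotion = clf.predict(gabor_descriptor(reference_face_crop).reshape(1, -1))[0]
```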

3.2.3 Field size estimation

Once the emotion of a reference face has been recognized, our approach additionally determines the field size assigned to that face [14]. The field size refers to how much of an individual and his/her surrounding area is visible within the camera’s field of view [65], being determined by two factors: (i) the distance of the individual from the camera and (ii) the focal length of the lens used. This concept is usually applied in filmmaking and video production. In order to measure the field size, we compute the proportion of the frame occupied by the detected face, as presented in Fig. 3.

Fig. 3

Basic types of field sizes for an individual

According to [11], the field size greatly affects the narrative power of a newscast, since the way an individual is visualized in a scene may guide and influence the viewers. There are several standardized field sizes [11], the names of which are commonly derived from varying camera-individual distances while not changing the lens. Six types of field sizes are considered in this work, namely: Close-up (effect of intimacy), Medium Close-up (effect of personalization), Medium Shot (effect of sociability), American Shot (effect of sociability), Full Shot (effect of public space) and Long Shot (effect of public space). Those field sizes are illustrated in Fig. 3.

Usually, the field size of an individual is only qualitatively defined [65]. Therefore, to obtain a quantitative measure of an individual’s field size at a given moment of the news narrative, we apply the method proposed in [14]. More specifically, we compute the ratio ϕ between the reference face area and the complete area of the image plane, which is then used as visual cue to determine the field size. As shown in Fig. 3, each field size has a specific range of possible values for ϕ. Those ranges were successfully validated in [14], achieving an overall accuracy as high as 95%.

The field size assigned to the reference face of a given image frame k defines a weighting factor ωk, with ωk = 1,...,6, as illustrated in Fig. 3. Basically, this factor is used to weigh the tension level previously estimated for the reference face by the emotion recognition method. Note that the larger the ratio ϕ, the larger the value of ωk and, consequently, the higher the influence of the tension level of the k-th frame on the calculation of the tension level of a news video. This approach is in accordance with the postulates of [11], which show that the larger the field size, the higher the tension level. Furthermore, the visual attention of a video viewer is directly proportional to the field size of the reference face [14, 32].
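A sketch of this mapping from the face-to-frame area ratio ϕ to a field size class and its weight ωk is shown below; the numeric thresholds are hypothetical placeholders, since the actual ranges are those of Fig. 3, validated in [14].

```python
def field_size_weight(phi):
    """Map the ratio phi (reference face area / frame area) to a field-size class
    and its weighting factor omega_k in 1..6 (Section 3.2.3).
    The lower bounds below are placeholders, not the validated ranges of [14]."""
    table = [  # (lower bound of phi, field size, omega_k) -- illustrative only
        (0.20, "Close-up",        6),
        (0.10, "Medium Close-up", 5),
        (0.05, "Medium Shot",     4),
        (0.02, "American Shot",   3),
        (0.01, "Full Shot",       2),
        (0.00, "Long Shot",       1),
    ]
    for lower, name, omega in table:
        if phi >= lower:
            return name, omega
    return "Long Shot", 1
```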

3.2.4 Visual tension curve computation

At the end of the visual analysis step, our approach computes the visual tension curve τv(⋅) representing the tension levels along the news narrative for each one-second interval.

Let H and L denote the sets of image frames in a one-second interval whose reference faces’ emotions were assigned the high tension and low tension polarities, respectively. Given H and L, our approach computes the parameters η and λ in (3) and (4), respectively, which capture the influence of the individuals’ field sizes on the estimated tension levels:

$$ \eta = \sum\limits_{k \in H}\omega_{k}, $$
(3)
$$ \lambda = \sum\limits_{k \in L}\omega_{k}. $$
(4)

As mentioned in the previous section, ωk is the weighting factor used to weigh the tension level estimated for a reference face at an image frame k (ωk = 1,...,6). Thus, we compute the visual tension curve τv(⋅) according to (5):

$$ \tau_{v}(\cdot) = \left\{\begin{array}{lll} -1 & , \text{if } \eta > \lambda \\ +1 & , \text{otherwise}, \end{array}\right. $$
(5)

in which the scalar values -1 and + 1 denote the high and low tension polarities, respectively, for each one-second interval.
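The computation of τv(⋅) for a single one-second interval can be summarized as in the sketch below, assuming that each frame with a detected reference face contributes a pair (tension polarity, ωk).

```python
def visual_tension(frame_records):
    """tau_v for one one-second interval (Eqs. 3-5).
    frame_records: list of (tension, omega) pairs, where tension is -1 (high)
    or +1 (low) and omega is the field-size weight of that frame."""
    eta = sum(omega for tension, omega in frame_records if tension == -1)   # Eq. (3)
    lam = sum(omega for tension, omega in frame_records if tension == +1)   # Eq. (4)
    return -1 if eta > lam else +1                                          # Eq. (5)
```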

3.3 Audio analysis

The audio analysis is performed in the third step of the proposed approach. Here we have two goals: the first is to compute a function τa(⋅), which estimates the tension level of the audio for each five-second interval; the second is to obtain the audio transcriptions, which are used in the textual analysis step.

We use the pyAudioAnalysis API [22] in order to infer the valence of the analyzed audio. This API implements a set of audio features, such as chroma features, Mel Frequency Cepstral Coefficients and spectral features. Using these audio features, the API can train a regression model to infer a score s from -1 to +1, which represents a positive valence when s > 0 and a negative valence otherwise.

We then use this API to infer the valence of the analyzed audio and compute the τa(⋅) curve as presented in (6). By using (6), we assign a high tension level to an audio segment when its corresponding valence is negative (i.e., s < 0) and a low tension level otherwise:

$$ \tau_{a}(\cdot) = \left\{\begin{array}{ll} -1 & , \text{if } s < 0 \\ +1 & , \text{otherwise}. \end{array}\right. $$
(6)
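The following sketch shows how the audio tension curve could be assembled, assuming a callable `valence_fn` that wraps the valence regression model trained with pyAudioAnalysis; this callable is an assumption made here for illustration and is not part of that library's API.

```python
def audio_tension_curve(signal, sample_rate, valence_fn, window_s=5.0):
    """tau_a(.) per Eq. (6): split the audio into five-second segments, estimate a
    valence score s in [-1, +1] for each segment and map it to a tension polarity."""
    step = int(window_s * sample_rate)
    curve = []
    for start in range(0, len(signal), step):
        s = valence_fn(signal[start:start + step], sample_rate)  # assumed regression wrapper
        curve.append(-1 if s < 0 else +1)                        # negative valence -> high tension
    return curve
```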

In order to transcribe an audio signal, we use the IBM Watson™ Speech-to-Text service, which obtains the corresponding text for every five-second audio interval. This allows us to estimate the tension according to the textual information, as described in the next section.

3.4 Textual analysis

In the textual analysis step, we use the automatic speech transcriptions obtained from the audio signal in order to perform sentiment analysis of the text. By doing this, we are able to compute the curve τt(⋅) for each five-second interval.

After transcribing the text with the IBM Watson™ Speech-to-Text service [28, 51], we use the following sentiment analysis methods: AFINN [52], EmoLex [59], Happiness Index [16], OpinionFinder [73], NRC Hashtag [48], Opinion Lexicon [30], PANAS-t [25], SASA [72], SANN [55], Senticnet [8], Sentiment140 [49], Sentistrength [68], SentiWordNet [19], SO-CAL [67], Stanford Deep Learning [62], Umigon [37] and Vader [31]. These methods are considered the state of the art and are all implemented in the iFeel sentiment analysis software [2], which returns the sentence polarity given by each method (positive, negative or neutral). Since those methods support only English sentences, when the video is not in English we automatically translate each sentence into English using the IBM Watson™ Language Translator service, as shown in Fig. 4. Note that we did not use the Emoticons [24] and Emoticons DS [26] methods from the iFeel tool because they are exclusively based on emoticons, and this kind of information is not present in the text obtained from the news.

Fig. 4

Calculating of sentiment scores from text sentences obtained from automatic speech recognition

After that, we create a vector containing one score per sentiment analysis method. Once the vector is created, we compute the curve τt(⋅) by majority voting over these estimations, as presented in Fig. 4 and (7):

$$ \tau_{t}(\cdot) = \left\{\begin{array}{llll} -1 & , \text{if } n_{neg} > (n_{pos}+n_{neu}) \\ +1 & , \text{otherwise}, \end{array}\right. $$
(7)

in which nneg, npos and nneu are the numbers of methods that assigned, respectively, negative, positive and neutral polarities to the text. In other words, a text is considered high tension (τt(⋅) = − 1) when the majority of methods assign a negative polarity to it; otherwise, it is considered a low tension text (τt(⋅) = + 1).
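A minimal sketch of this voting scheme is shown below, assuming the per-method polarities returned by iFeel are available as a list of strings.

```python
def textual_tension(polarities):
    """tau_t for one five-second interval (Eq. 7), given one polarity string
    ("positive", "negative" or "neutral") per sentiment analysis method."""
    n_neg = sum(1 for p in polarities if p == "negative")
    n_pos = sum(1 for p in polarities if p == "positive")
    n_neu = sum(1 for p in polarities if p == "neutral")
    return -1 if n_neg > (n_pos + n_neu) else +1
```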

3.5 Tension levels estimation

Finally, in this step our approach combines the three modality curves (τv(⋅), τt(⋅) and τa(⋅)) into a single global tension curve τg(⋅), representing the news tension for each five-second interval.

To accomplish this, since the visual tension curve τv(⋅) produces a tension score for each second, we need to combine the scores over five seconds. Then, let nh be the number of times that τv(⋅) = − 1 during five seconds and nl be the number of times that τv(⋅) = + 1 in the same interval. We combine these results according to (8), producing a curve τv5(⋅):

$$ \tau_{v5}(\cdot) = \left\{\begin{array}{lll} -1 & , \text{if } n_{h} > n_{l} \\ +1 & , \text{otherwise}. \end{array}\right. $$
(8)

Finally, our approach obtains the curve τg(⋅) by combining the curves τv5(⋅), τt(⋅) and τa(⋅) using majority voting. More formally, let nh be the number of curves that assigned a high tension score and nl be the number of curves that assigned a low tension score. Thus, τg(⋅) is obtained according to (9):

$$ \tau_{g}(\cdot) = \left\{\begin{array}{llll} -1 & , \text{if } n_{h} > n_{l} \\ +1 & , \text{otherwise}. \end{array}\right. $$
(9)
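The two combination steps of this section can be summarized as in the sketch below, assuming the three curves are aligned and cover the same number of five-second intervals.

```python
def resample_visual(tau_v_per_second):
    """Aggregate the one-second visual tension values into five-second values (Eq. 8)."""
    tau_v5 = []
    for i in range(0, len(tau_v_per_second), 5):
        window = tau_v_per_second[i:i + 5]
        n_h = sum(1 for t in window if t == -1)
        n_l = len(window) - n_h
        tau_v5.append(-1 if n_h > n_l else +1)
    return tau_v5

def global_tension(tau_v5, tau_t, tau_a):
    """Majority voting over the three modality curves (Eq. 9)."""
    curve = []
    for votes in zip(tau_v5, tau_t, tau_a):
        n_h = sum(1 for v in votes if v == -1)
        curve.append(-1 if n_h > len(votes) - n_h else +1)
    return curve
```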

4 Experimental results

To evaluate the accuracy and applicability of our approach, we performed experiments with challenging datasets, which are described in the next subsection.

Initially, in Section 4.1, we present the datasets, including the features and code description, to guarantee the reproducibility of this work. In Section 4.2, in turn, we present the evaluation methodology, which consists in measuring the accuracy and statistical significance of our results. In Section 4.3, we present a thorough analysis of our approach, considering (1) the performance of each modality; (2) the performance of our multimodal approach; and (3) the accuracy of our approach taking into account the agreement among modalities. Finally, in Section 4.4, we present the results obtained on the News Rover Sentiment dataset when applying the proposed multimodal approach, including a comparison with the results obtained by the baseline [18] on the same dataset.

4.1 Datasets

We have evaluated our approach by using two different datasets. The first dataset has 960 news videos obtained from 51 exhibitions of four TV newscasts: three Brazilian news programs (namely Jornal da Band, Jornal da Record and Jornal Nacional) and an American TV newscast (CNN). All video frames were converted to grayscale and resized to 480 × 360 pixels. This first dataset, named ‘Piim News’, is detailed in Table 1.

Table 1 Numbers of videos per newscast and data collection period for the Piim News dataset.

Since each news story is likely to have a single tension level, we divided the videos in such a way that each video contains a single news story. After this division, the shortest video was 3 seconds long and the longest, 815 seconds. To evaluate the proposed approach, we asked 7 workers to annotate the tension levels of the videos. The tension levels considered for annotation were High and Low Tension.

To avoid bias, each video was annotated by at least 3 workers. Figure 5 shows the agreement level obtained in the annotation process. Among the 960 analyzed videos, all contributors annotated the same tension level for 11.15% of the videos. On the other hand, there were few videos (9.69% of them) with an agreement of only 42.86% (3 of 7 workers). Among the 960 videos in the dataset, 619 were annotated as Low Tension and 341 as High Tension.

Fig. 5

Inter-workers agreement in the videos annotation process

We compare our approach with the baseline [18], using the same dataset as them (see Fig. 9 and Section 4.4). This second dataset, named ‘News Rover Sentiment’, comprises 991 videos from the American TV newscast CNN, manually annotated using Amazon Mechanical Turk [1]. The videos were recorded and processed between August 13, 2013 and December 25, 2013, and are between 4 and 15 seconds long.

4.1.1 On reproducibility

The methods and datasets used in our experiments are freely available. The data, videos and code of the News Rover Sentiment dataset can be requested at http://www.ee.columbia.edu/ln/dvmm/newsrover/sentimentdataset/. The data, videos and code of the Piim News dataset can be requested at http://www.icwsm.org/2016/datasets/datasets/ and are available as part of the study [57]. The visual analysis implementation is based on the work presented in [5, 40] and is available at https://github.com/luca-m/emotime/. The pyAudioAnalysis framework used in the audio analysis is freely available at https://github.com/tyiannak/pyAudioAnalysis/. The iFeel Web system used in the textual analysis is freely available at http://blackbird.dcc.ufmg.br:1210/. Hence, we guarantee the reproducibility of our results, which can be used to improve our proposed approach and other future lines of research.

4.2 Evaluation methodology

Our evaluation has three main goals: (1) to evaluate the performance of our multimodal approach; (2) to study the impact of each information modality; and (3) to analyze how close our approach is to a baseline. In this section, we present the metric and procedures used to perform these evaluations.

To evaluate the effectiveness of the proposed approach, we use the accuracy metric. The accuracy of an experiment is its ability to differentiate the low and high tension levels correctly, according to the definitions below:

  • True Positive (TP) = number of videos correctly classified as low tension;

  • False Positive (FP) = number of videos incorrectly classified as low tension;

  • True Negative (TN) = number of videos correctly classified as high tension;

  • False Negative (FN) = number of videos incorrectly classified as high tension.

More specifically, we compute the accuracy by using (10):

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN}. $$
(10)

We compare our approach with a multimodal sentiment analysis baseline [18] that implements a supervised method to infer the sentiment of news. To accomplish this, the baseline provides a sentiment score according to the detected face, the audio and the transcribed text. As it does not provide a way to combine those modalities, in order to compare with our approach, we combined the baseline modalities in the same way as in our approach, using majority voting.

As the proposed approach produces tension levels for each information modality, we changed the output of our method to produce sentiment polarities instead of tension levels in order to compare it with the baseline. To accomplish this, we converted low tension videos into positive polarities and high tension videos into negative polarities. The content of a news video was considered neutral when the visual and the textual modalities expressed a neutral polarity at the same time.

With this comparison, our goal is to see how close our approach comes to a supervised one. Note that our method has the advantage of being unsupervised and, because of that, we do not need any manual labeling effort in order to estimate the tension level. Although our approach is unsupervised, it can use as input data generated by supervised and/or unsupervised learning methods. Since the supervised components were pretrained on different datasets, our method did not require any manual labeling effort.

In order to evaluate our method, we performed a 5-fold cross-validation [46]. In this procedure, the dataset was randomly divided into five parts; in each run, one part was used as the test set and, for the supervised baseline, the remaining parts were used as the training set.

Finally, during the evaluations, to assess whether the performance differences are statistically significant, we use Student’s t-test. We consider differences significant when the value of p is less than 0.05.
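As an illustration of this evaluation protocol, the sketch below computes per-fold accuracies for precomputed predictions and compares two methods with a paired Student’s t-test; the fold seed and the paired form of the test are assumptions made here for the sake of the sketch.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import KFold

def fold_accuracies(predictions, labels, n_splits=5, seed=0):
    """Accuracy on each test fold of a 5-fold split, given precomputed predictions."""
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [np.mean(predictions[test] == labels[test]) for _, test in kf.split(labels)]

# Paired t-test over the same folds for two methods A and B:
# acc_a = fold_accuracies(pred_a, labels)
# acc_b = fold_accuracies(pred_b, labels)
# t, p = ttest_rel(acc_a, acc_b)   # difference considered significant when p < 0.05
```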

4.3 Performance analysis

In this section, we perform an analysis to better understand how our multimodal approach can help to infer tension in news. We first analyze how each modality performs in comparison to our multimodal approach. Figure 6 presents the accuracy for each modality (Visual, Audio and Text) and for our multimodal approach on the Piim News dataset. In addition, to better understand how the modalities compare to each other, Table 2 presents a pairwise comparison using Student’s t-test. The observed differences are indicated by the symbols << and >> if they are statistically significant with p < 0.05; otherwise, the symbols < and > are used.

Fig. 6

Classification accuracy (%) for each information modality when using our approach in Piim News dataset

Table 2 Pairwise comparison for each modality in our approach by using Student’s T-test

Analyzing Fig. 6, we can see that, compared to the audio and visual modalities, the multimodal approach has the best results, with statistically significant gains over the Visual modality (see Table 2). Our textual approach achieved a good performance, being statistically tied with our multimodal approach when we consider the whole video dataset. Recall that the Piim News videos are longer and, consequently, contain more textual content.

In order to better understand our results, we analyze the performance when specific modalities agreed with each other on the estimation. Table 3 presents the accuracy and the number of instances when 2 and 3 modalities agreed with each other. As expected, when all the information modalities agree on the tension level estimation, our method is more accurate.

Table 3 Multimodal approach accuracy and corresponding relative and absolute numbers of videos from Piim News dataset, when 2 and 3 information modalities point out the same tension levels

According to Fig. 7, when we consider the videos in which all modalities assigned the same tension level (639 videos), the accuracy of our multimodal approach is greater than that of any single modality. As expected, when all the modalities agree on the tension estimation, our method is more accurate.

Fig. 7

Classification accuracy (%) for each information modality when using our approach when all modalities assigned the same tension level (639 videos) from the Piim News dataset

Aiming at a deeper analysis, we also examine the videos in which just two modalities agreed on the estimation. To accomplish this, Table 4 presents the accuracy when one modality agreed with another. As we can see, the agreement between the visual and audio modalities obtained the lowest accuracy, but it occurred in just 19 videos. As the textual modality was the best single modality (see Fig. 6), when combined with another modality it helped our method to achieve good results.

Table 4 Accuracy (%) and amount of videos instances when two specific modalities agreed on each other

As the textual modality is computed by combining many sentiment analysis methods, Fig. 8 presents the accuracy of all sentiment analysis methods in comparison to our combined approach (in red) for the Piim News and News Rover Sentiment datasets. Our method was among the top 5 methods in both datasets, with an accuracy of 66.25% on the Piim News dataset and 52.77% on the News Rover Sentiment dataset. The best method for the Piim News dataset was SENTISTRENGTH and, for the News Rover Sentiment dataset, PANAST. As we can observe, our method is the most stable, being the only one in the top 5 for both datasets.

Fig. 8

Textual sentiment analysis performance of the text obtained from speech recognition in Piim News (a) and News Rover Sentiment dataset (b)

4.4 Baseline comparison

We first present, in Fig. 9, the accuracy of our proposed approach and of the baseline for the Text, Visual and Audio modalities, as well as for the multimodal method, using the News Rover Sentiment dataset. Since both our method and the baseline use majority voting for the combination, we can see how this combination technique performs with different methods.

Fig. 9

Comparison between the proposed approach and the baseline regarding the News Rover Sentiment dataset

Here we also performed a pairwise comparison of each modality of our method (see Table 5), as well as a pairwise comparison of our method with the baseline (see Table 6). In these tables, if the difference is statistically significant, we use the symbols >> or <<; otherwise, we use the symbols > and <.

Table 5 Pairwise comparison of accuracy levels regarding the information modalities used by the proposed approach
Table 6 Pairwise comparison of accuracy levels regarding the information modalities used by the proposed approach and the baseline

According to Table 5, our textual approach outperformed our multimodal method on the Piim News dataset; on the other hand, the multimodal approach was better than the textual modality on the baseline dataset (see Table 6). Note that videos from the Piim News dataset are longer and, consequently, contain more textual content than videos from the baseline dataset. Thus, the textual modality alone could take advantage of the larger amount of text in the Piim News dataset. However, for videos with less text, as in the baseline dataset, our multimodal approach improved the results.

Analyzing the modalities of our proposed approach, the best results were obtained by the multimodal method, with statistically significant gains over the Visual and Textual modalities (see Table 5). Analyzing the baseline method, we observe that the Visual modality reached a result as good as the multimodal one (statistically tied). This result shows that majority voting can be a good approach to combine modalities, being able to maintain or even improve the result of a single modality.

Our multimodal approach, even though unsupervised, reaches a result very close to the best baseline result. Note that, differently from our baseline, since our approach is unsupervised, we do not need previously labeled videos for training. Because of that, we can say that our approach is less costly to implement, with an accuracy close to that of a supervised method.

Taking each individual modality into account, the visual and audio modalities of the baseline performed better than ours. As the baseline is a supervised approach, those results were expected. However, analyzing the textual modality, we can see that our approach performed well, being statistically better than the baseline textual modality. Thus, we can conclude that our textual modality approach, combining different sentiment analysis methods, can be a good alternative to infer the sentiment of newscasts. Furthermore, as we can see in Figs. 6, 7 and 9, when the multimodal method does not reach the best result, it maintains a result very close to that of the best modality. Hence, even with only slight gains in some cases, the multimodal method is more stable, maintaining good results regardless of the dataset used.

5 Concluding remarks

Newscasters express their emotions, providing evidence about the tension of the speech generated by the news in order to legitimize the reported fact. Those patterns can be extracted from multiple sources of evidence, such as facial expressions, field size, voice tone, and the vocabulary used in the speech. In this sense, this work presented a method to infer the tension of news videos taking into account multiple sources of evidence: visual, audio and transcribed text.

This method has high applicability. For example, by providing tension levels as input, alternative news summarization and classification algorithms may be developed, or even novel video advertising strategies may be created by considering the inclusion of video ads at points of the news narrative where the tension is low. In addition, this method can help in the discourse analysis of news videos.

In this study, we have shown that our approach can reach an accuracy close to that of a supervised method, but without the need for a labeling effort. Through experiments, we have shown that our approach was better than the baseline in 44% of the dataset. We also showed the importance of textual sentiment analysis methods in this task, as well as how our approach, by combining all the textual methods, can provide a more stable result than using just one sentiment analysis method.

As future work, we intend to evaluate how this approach can help news summarization methods as well as advertising recommendation. In addition, we intend to propose methods to generate a tension curve per newscaster or subject, as well as to analyze the impact on the tension level when different newscasters present news stories on the same theme. Furthermore, we intend to investigate how to infer tension levels of news videos online. Moreover, here we assumed that all the modalities have the same importance in the combination. This combination has its advantages: (i) it is easy to implement; (ii) it models low-level interactions among modalities; and (iii) it allows us to maintain a simple fusion mechanism (i.e., majority voting). However, as future work, we intend to analyze different combination methods (e.g., stacking, autoencoders [63, 74]), as well as a joint model including all the modalities.