1 Introduction

Human speech is a complex signal, carrying a plethora of information beyond the spoken words. In addition to the linguistic content, a speech signal tells the listener a lot about the speaker – such as their age, gender, native language, motivations, and emotions. It is important for a human-computer interaction (HCI) system to recognise these contexts correctly in order to respond accordingly. Today, we are continuously surrounded by human-machine interfaces; a virtual assistant in a handheld device is no longer science fiction, but an everyday reality. There is, therefore, a growing interest in the field of affective computing in making machines ‘understand’ human speech in its entirety, i. e., including the featured emotions and contexts.

Broadly speaking, there are three types of databases used in affect research. Early research utilised acted speech data, which typically featured highly exaggerated affect behaviours, far from natural ones (e. g., EmoDB [1, 12]). In another data collection strategy, participants are made to converse in a laboratory environment. While the behaviours collected are mostly natural and spontaneous, the resulting data is typically clean and unaffected by real-life effects such as noise (e. g., RECOLA [16]). The third type, ‘in-the-wild’ databases, refers to data collected in non-laboratory, everyday, unpredictably noisy environments. However, the so-called ‘in-the-wild’ databases mostly feature recordings collected in identical real-life settings, with very similar acoustic disruptions. This has direct implications for the trained models, limiting their generalisability. Also, most of these databases suffer from the phenomenon called the ‘observer’s paradox’ or ‘one-way mirror dilemma’ – the participants are typically well aware of being recorded right from the beginning of the recordings – which affects the featured affect behaviours [19]. In this contribution, we test, for the first time, the hypothesis that models trained on a closer-to-real-life database are likely to generalise better [14].

While there have been transfer learning studies on affect [2,3,4, 11], there is hardly any research on the generalisability of time-continuous affect recognising models for real-life or in-the-wild datasets. To this end, we first introduce the two databases used in this study in Sect. 2. We describe our experiments in detail in Sect. 3. After this, we present our findings in Sect. 4 before we conclude the paper in Sect. 5.

2 Databases

To test which of the two affect recognising models generalises better – i. e., the one trained on ‘more’ in-the-wild data or the one trained on a database collected under relatively restrained, ‘laboratory’-like settings – we use two prominent benchmark databases, namely the ‘Automatic Sentiment Analysis in the Wild’ (SEWA) corpus used in the AVEC 2017 challenge and the ‘Graz Real-life Affect in the Street and Supermarket’ (GRAS\(^{2}\)) corpus.

The SEWA database features video-chat recordings of participants discussing commercials they had just watched. The recordings were collected using standard webcams and computers in the participants’ homes or offices. The data collection took place over the internet using a video-chat interface specifically designed for this task. The recordings feature spontaneous affect behaviours, real-life noises, and delays due to connectivity and hardware problems. The participants dominated the conversations more or less equally.

The GRAS\(^{2}\) database features audiovisual recordings of conversations with unsuspecting participants, captured from a first-person point of view in a busy shopping mall. The participants were made aware of being recorded only halfway through the conversations, and were then requested to sign a consent form agreeing to release the recordings for research purposes. The database thus features spontaneous and ‘more’ authentic affective behaviours, as they are relatively free of the observer’s paradox. Because the conversations were totally spontaneous, their durations vary widely (standard deviation = 56.3 s). The extent to which the participants dominate the conversations – i. e., the relative durations of the subject’s speech and the speech of the student research assistant collecting the data – also varies widely. Unfortunately, the student research assistants dominate many of the conversations. The sections of the recordings where the participants read the documents before signing the consent form hardly feature any subject speech. The recordings also contain dynamically varying noise, including impact sounds, bustle, background music, and background speech. Only 28 conversations are available. All these factors combine to make this database a lot more ‘in-the-wild’ and the affect tracking task a lot more challenging. The corpus was used previously in a research study establishing a correlation between eye contact and speech [6], and in another study on time-continuous authentic affect recognition in-the-wild [13].

3 Experimental Design

3.1 Data Splits

We split both the SEWA and the GRAS\(^{2}\) corpus into training, validation, and test sets in a roughly similar 2:1:1 ratio, in terms of both the number of files in a split and the cumulative duration of the audio clips. We use the same splits as in the AVEC 2017 challenge [15] when running our experiments (Fig. 1) on the SEWA database. The splits are made such that a participant-independent model can be trained, i. e., no participant is present in more than one split. Likewise, the splits on GRAS\(^{2}\) are made such that no student research assistant is present in more than one split. The statistics for the three splits are presented in Table 1.

Fig. 1. Entire experimental design pipeline.

Table 1. Duration statistics for the SEWA and GRAS\(^{2}\) data splits.

3.2 Feature Engineering

We need features from the two databases that are compatible with one another; ideally, the two should share a common feature space. Because we are interested in predicting time-continuous signals of emotion dimensions, the features should also capture the temporal dynamics of the varying low-level descriptor (LLD) space. Finally, the features should be robust to noise.

We generate bag-of-audio-words (BoAW) features using our openXBOW toolkit [17] by vector quantising the ‘extended Geneva Minimalistic Acoustic Parameter Set’ (eGeMAPS) [5] low-level descriptors (LLDs) extracted using our openSMILE toolkit [7]. This feature set is already quite popular in the affective computing field; as the challenge organisers, we used these exact features to establish the baseline model performance for the AVEC 2017 challenge. The eGeMAPS LLD set is a minimalistic set of acoustic parameters, tailor-made for affective vocalisation and voice research, consisting of only 23 LLDs. To capture the temporal dynamics of the individual parameters and LLD types, we extract BoAW features based on these LLDs. The BoAW approach generates a sparse, fixed-length histogram representation of the quantised features in time, thus capturing the temporal dynamics of the LLD vectors, while remaining noise-robust due to its inherent sparsity and the quantisation step [13, 17, 18].

However, the eGeMAPS LLDs are drastically different for the two databases in terms of their value ranges. Because the critical statistics – such as the mean, the variance, the maximum, and the minimum value – are radically different (some even with opposite signs), the statistics computed on one database cannot reliably be used to standardise or normalise the other such that they share a common feature space. Furthermore, the codebook used in the AVEC 2017 challenge was built by randomly sampling the SEWA eGeMAPS LLD vectors. For transfer learning experiments, however, we should ideally not generate the codebook by sampling only one of the two databases, as such a codebook is likely to represent that dataset better. It is imperative to vector quantise the two databases with an identical codebook that is completely data-independent – especially when the ranges of the feature values are drastically different. Only then can we objectively assess the generalisability of the trained models, free from the effect of the codebook representing the temporal dynamics of one dataset better than the other.

We thus generate a codebook of size 1000, independent of the two databases, consisting of 23-dimensional codewords matching the LLD length. An array of shape \(1000\times 23\), populated with random samples from a normal distribution (mean = 0.5, standard deviation = 0.1), is used as the codebook matrix. We preprocess the LLDs by scaling and offsetting all of the data splits, using the offsets and scaling factors that normalise the respective training split to the range [0, 1]. We then vector quantise all of the LLDs against the randomised codebook, with 10 soft assignments for every LLD. We compute the distribution of the assignments in a moving window of 6 s, with a hop size of 0.1 s – similar to how the AVEC 2017 features were generated [15].
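To make the above concrete, the following is a minimal NumPy sketch of the data-independent randomised codebook and the soft-assignment BoAW histograms. The function names, the Euclidean nearest-neighbour assignment, and the equal-weight counting of soft assignments are our own illustrative assumptions, not the exact openXBOW implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Data-independent random codebook: 1000 codewords x 23 eGeMAPS LLDs,
# sampled from a normal distribution (mean = 0.5, standard deviation = 0.1).
codebook = rng.normal(loc=0.5, scale=0.1, size=(1000, 23))

def min_max_scale(llds, train_min, train_max):
    """Scale LLDs with the offsets/factors that map the training split to [0, 1]."""
    return (llds - train_min) / (train_max - train_min)

def boaw_histograms(llds, codebook, n_soft=10, win_s=6.0, hop_s=0.1, frame_s=0.1):
    """Soft-assignment bag-of-audio-words over a moving window.

    llds: (n_frames, 23) scaled LLDs, one frame every `frame_s` seconds.
    Returns one codeword histogram per hop of `hop_s` seconds.
    """
    # Assign every LLD frame to its n_soft nearest codewords (Euclidean distance).
    dists = np.linalg.norm(llds[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :n_soft]          # (n_frames, n_soft)

    win = int(round(win_s / frame_s))                        # 60 frames = 6 s
    hop = int(round(hop_s / frame_s))
    histograms = []
    for start in range(0, len(llds) - win + 1, hop):
        idx = nearest[start:start + win].ravel()
        histograms.append(np.bincount(idx, minlength=len(codebook)).astype(float))
    return np.asarray(histograms)                            # (n_windows, 1000)
```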

3.3 Gold Standard Generation

We use the gold standard arousal and valence values of the AVEC 2017 challenge when training on the SEWA database [15]. We generate the gold standard for the GRAS\(^{2}\) database using the same algorithm as for SEWA. The gold standard used in our previous studies on GRAS\(^{2}\) differs only in that we previously did not compensate for annotator-specific mean annotation standard deviations [13].

We use the modified Evaluator Weighted Estimator (EWE) method to generate the gold standards, one per subject per emotion dimension. The goal of the EWE metric is to take into account the reliability of the individual annotators, signified by the weight \(r_k\) for every annotation sequence \(y_k\). This confidence value is computed by quantifying the extent to which the annotations by that annotator agree with the rest of the annotations. The gold standard, \(y_{EWE}\), is defined as:

$$\begin{aligned} y_{EWE_{n}}= \frac{1}{\sum _{k=1}^{K} r_{k}} \sum _{k=1}^{K} r_{k}y_{n,k}, \end{aligned}$$
(1)

where \(y_{n,k}\) is the annotation by annotator k \((k\in \mathbb N, 1\le k\le K)\) at instant n \((n\in \mathbb N, 1\le n\le N)\), contributing to the annotation sequence \(y_k\). The symbol \(r_{k}\) is the corresponding annotator-specific weight, with a lower bound of 0. In [8], the weight \(r_{k}\) is defined as the normalised cross-correlation between \(y_{k}\) and the averaged annotation sequence \(\bar{y}_{n}\). The gold standards used in both the AVEC 2017 baseline paper [15] and the GRAS\(^{2}\)-based affect recognition study [13] redefine the weight \(r_{k}\) such that it is strongly influenced both by how many annotation sequences \(y_{k}\) agrees with and by the extent of that agreement, by simply averaging the pair-wise correlations. The weights are lower-bounded at 0 as usual, and then normalised such that they sum to 1:

$$\begin{aligned} \begin{aligned} r''_{k_{i},k_{j}}=\frac{\sum _{n=1}^{N} \left( y_{n,k_{i}}-\mu _{k_{i}}\right) \left( y_{n,k_{j}}-\mu _{k_{j}}\right) }{\sqrt{\sum _{n=1}^{N} \left( y_{n,k_{i}}-\mu _{k_{i}}\right) ^{2}} \sqrt{\sum _{n=1}^{N} \left( y_{n,k_{j}}-\mu _{k_{j}}\right) ^{2}} }, \text { where: } \mu _{k}= \frac{1}{N}\sum _{n'=1}^{N} y_{n',k}, \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} r'_{k_{i}} = {\left\{ \begin{array}{ll} \frac{1}{K} {{\mathop {\sum }\nolimits _{k_{j}=1}^{K}} r''_{k_{i},k_{j}}} &{} \quad \text {if } {{\mathop {\sum }\nolimits _{k_{j}=1}^{K}} r''_{k_{i},k_{j}}} >0\\ 0 &{} \quad \text {if } {{\mathop {\sum }\nolimits _{k_{j}=1}^{K}} r''_{k_{i},k_{j}}} \le 0\\ \end{array}\right. }, \qquad \quad r_{k_{i}} = \frac{r'_{k_{i}}}{{\mathop {\sum }\nolimits _{k_{j}=1}^{K}} r'_{k_{j}}}. \end{aligned}$$
(3)
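For illustration, Eqs. (1)–(3) can be implemented in a few lines of NumPy. The sketch below assumes the K annotation traces have already been resampled to a common length N; the handling of degenerate cases (e.g., all weights zero) is omitted.

```python
import numpy as np

def modified_ewe(annotations):
    """Modified EWE gold standard following Eqs. (1)-(3).

    annotations: (K, N) array with K annotators and N time instants.
    Returns the (N,) gold-standard sequence.
    """
    K, _ = annotations.shape
    # Pairwise Pearson correlations between annotators (Eq. 2).
    r_pair = np.corrcoef(annotations)                  # (K, K)
    # Average the pairwise correlations and lower-bound at 0 (Eq. 3, left).
    r_prime = np.maximum(r_pair.sum(axis=1) / K, 0.0)
    # Normalise the weights so that they sum to 1 (Eq. 3, right).
    r = r_prime / r_prime.sum()
    # Weighted average of the annotations (Eq. 1).
    return r @ annotations
```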

3.4 Annotator Lag Compensation

To compensate for the reaction time of the annotators, we delay the feature vectors in time [10]. We use a delay value of 2.2 s, based on our previous grid search analysis on the SEWA corpus [15]. In this study, we remove the repeated feature vectors at the beginning of every sample sequence introduced by the lag compensating function used in AVEC 2017. We find that there is little to no difference in performance due to the removal of these erroneously repeated feature vectors. This is expected, since the number of removed vectors (22, for an annotator lag compensation of 2.2 s) is less than 2% of the total number of feature vectors for an average SEWA audio recording. Though it neither improves nor deteriorates the performance of the models, we note this addition to our preprocessing steps in comparison with the AVEC 2017 workflow [15], for the sake of correctness and completeness.
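One plausible reading of this step, sketched below under our own assumptions about the framing (0.1 s hop, features and labels already aligned per frame), is to pair every label with the features observed 2.2 s earlier and simply drop the frames that the AVEC 2017 function would have filled with repeated vectors.

```python
def compensate_annotator_lag(features, labels, lag_s=2.2, hop_s=0.1):
    """Pair each label frame with the features observed `lag_s` seconds earlier.

    Instead of padding the delayed feature sequence with repeated initial
    frames (as in the AVEC 2017 pipeline), the first `lag_s / hop_s` label
    frames and the trailing feature frames are dropped.
    """
    lag = int(round(lag_s / hop_s))      # 22 frames for 2.2 s at a 0.1 s hop
    return features[:len(features) - lag], labels[lag:]
```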

3.5 Regression Models

For the new BoAW feature sets generated using the randomised codebook, we first generate baseline regression results by training support vector regression (SVR) models with a linear kernel and complexity values \(C = [2^{-15},2^{-14},...,2^{0}]\), just as was done when establishing the AVEC 2017 challenge baseline. We also experiment with additional C values in the range \([10^{-8},...,10^{-5}]\), as the GRAS\(^{2}\)-trained arousal model was found to perform well for \(C\in [2^{-15}, 2^{-7}]\). We further run regression models using simple feedforward neural networks (FFNNs), and single- and double-stacked recurrent neural networks (RNNs) with gated recurrent units (GRUs) in cascade with FFNNs. To train the GRU-based models, we use feature sequences of length 60, corresponding to 6 s. We experiment with several network topologies (20 to 100 GRU nodes, 10- to 50-node FFNN layers), activation function permutations (selu, tanh, linear), feature sequence lengths (60, 80), learning rates (0.001 to 0.01 in steps of 0.003), and optimisers (rmsprop, adam, adagrad, and adamax).
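As a sketch of the SVR part of this setup (not the authors' exact tooling), the complexity grid can be swept as follows with scikit-learn; `scorer` stands for the evaluation metric, e.g., the CCC used in Sect. 4, and the feature/label arguments are placeholders for the BoAW features and gold standards described above.

```python
import numpy as np
from sklearn.svm import LinearSVR

def svr_grid_search(X_train, y_train, X_dev, y_dev, scorer, max_iter=10000):
    """Linear-kernel SVR over the complexity grid described in the text.

    scorer(y_true, y_pred) returns the validation metric (higher is better).
    """
    # 2^-15 ... 2^0 as in the AVEC 2017 baseline, plus the additional
    # 10^-8 ... 10^-5 values explored for the GRAS2-trained arousal model.
    c_values = np.concatenate([2.0 ** np.arange(-15, 1),
                               np.logspace(-8, -5, num=4)])
    best_model, best_score = None, -np.inf
    for c in c_values:
        model = LinearSVR(C=c, max_iter=max_iter).fit(X_train, y_train)
        score = scorer(y_dev, model.predict(X_dev))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```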

3.6 Post-processing

We post-process the predictions using the equation:

$$\begin{aligned} \displaystyle Y_{new}=(Y_{orig}-\mu _{2})\frac{\sigma _{1}}{\sigma _{2}}+\mu _{1}, \end{aligned}$$
(4)

where \(Y_{orig}\) is the primary prediction, \(Y_{new}\) is the post-processed prediction, and \(\mu _{1}\), \(\sigma _{1}\) and \(\mu _{2}\), \(\sigma _{2}\) are the means and standard deviations of the training label sequence and of the model’s prediction on the training data, respectively [20].
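This amounts to matching the first- and second-order statistics of the predictions to those of the training labels; a direct NumPy transcription of Eq. (4) (function name ours):

```python
import numpy as np

def postprocess_predictions(pred, train_labels, train_pred):
    """Rescale predictions to the training-label statistics (Eq. 4)."""
    mu_1, sigma_1 = np.mean(train_labels), np.std(train_labels)
    mu_2, sigma_2 = np.mean(train_pred), np.std(train_pred)
    return (pred - mu_2) * (sigma_1 / sigma_2) + mu_1
```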

4 Results and Discussions

All of the models we trained (SVRs, GRU-RNNs, and FFNNs) performed reasonably well as long as the test and training splits came from the same database, with a concordance correlation coefficient (CCC) [9] close to 0.25 on average. Of these, only the SVR-based models trained on the GRAS\(^{2}\) arousal annotations could make reasonable predictions in the transfer learning experiments (Table 2). The models otherwise mostly fail to generalise to a different dataset, with CCC values close to zero. For these transfer learning experiments from SEWA to GRAS\(^{2}\), and vice versa, the following are our key findings.
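For reference, the CCC used as the evaluation metric throughout can be computed as below; this is the standard formulation of Lin's coefficient [9], not necessarily the authors' exact implementation.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (CCC) [9]."""
    mu_t, mu_p = np.mean(y_true), np.mean(y_pred)
    var_t, var_p = np.var(y_true), np.var(y_pred)
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```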

Table 2. Performance of the models in the transfer learning experiments for the arousal dimension. The models were trained only on the training split of the GRAS\(^{2}\) database, and were tested on the remaining data splits of GRAS\(^{2}\) and on the entire SEWA German database. We note the performance on the individual data splits of the SEWA database, to get a better understanding of the coincidental data disparities and similarities between the two databases, and of how the performance varies across splits with changing complexity values. Interestingly, similar SVR-based models trained on SEWA did not perform well on the GRAS\(^{2}\) database.

4.1 Neural Networks Tended to Overfit to the Primary Database

We observed that the neural network-based models tended to overfit to the database they were trained on. The predictions were reasonably good for the test and validation splits of the database that the training split came from. While performance on the primary database also depends on the random initialisation of the weights and biases, the models invariably failed to make reasonable predictions on a different database (CCC close to zero).

4.2 Valence Tracking Did Not Generalise Beyond the Database

Valence prediction is a particularly hard problem compared to arousal prediction [13, 16, 18]. We observed that the models could predict the valence dimension for the validation and test splits of the same database (CCC as high as 0.42), but the prediction models tended to overfit to the database. This observation held irrespective of the type of model used and the direction of transfer learning (i. e., whether SEWA to GRAS\(^{2}\) or GRAS\(^{2}\) to SEWA).

4.3 GRAS\(^{2}\)-Trained SVR-Based Arousal Tracking Was Reasonably Generalised

Interestingly though, the SVR-based arousal prediction models trained on GRAS\(^{2}\) alone fared reasonably well on the SEWA database, with CCC values as high as 0.222 over the complete SEWA database – despite the SEWA database being twice the size of GRAS\(^{2}\). In the interest of reproducibility of the experiments presented in this paper, the complexity values and the corresponding performance values for the different models are indicated in Table 2. We note that, out of the three SEWA splits, the model performs the worst on the SEWA training split, which is also the most diversified of the three splits (Table 1).

Despite having a much smaller training set, the GRAS\(^{2}\)-to-SEWA model transfer for arousal prediction worked reasonably well. SEWA-to-GRAS\(^{2}\) transfer learning, however, does not quite work (again, CCC close to zero), despite the training split having twice as much data to train the model on, with identical model parameters. We speculate that this is because the SEWA database is not as in-the-wild as GRAS\(^{2}\). GRAS\(^{2}\) also features random background speech, bustle, impact sounds, background music, and even long non-speech sections. Emotion dimension labels exist even for these non-speech or rare-speech sections, which the model needs to learn – in itself a challenging task. This more in-the-wild nature of the data manifests itself in far more challenging training instances, which help the model learn arousal prediction with more nuance.

5 Conclusions and Future Work

We present a first-of-its-kind transfer learning study on speech-based, time-continuous, in-the-wild affect recognising models. To this end, we used a novel BoAW approach that relies on a data-independent randomised codebook. The GRAS\(^{2}\) database – featuring relatively more observer’s paradox-free affective behaviours, and a lot more data diversity in terms of conversation durations, acoustic events, noise dynamics, and spontaneity of the featured affective behaviours – proved to be highly effective in training a more generalised arousal tracking model than the SEWA database, despite its smaller size. As for the valence dimension, neither database was effective in training a well-generalised valence tracking model. Furthermore, none of our neural network-based models could predict emotion dimensions (either arousal or valence) on a different database through transfer learning. All of these models were, however, observed to perform well on unseen data from the databases they were trained on.

The new BoAW paradigm of using data-independent randomised codebooks helps project dissimilar databases onto a common normalised feature space, while also inherently capturing the temporal dynamics of the LLDs; the technique can be further developed and fine-tuned. We intend to investigate the effect of different randomisation strategies (sampling from differently skewed, uniform, or different normal distributions), as well as the codebook size and the number of assignments, on the model performance.

We would also like to extend this work by adding more in-the-wild databases. Our findings on the better generalisability of the GRAS\(^{2}\)-trained arousal tracking model encourage us to use more databases that are free from the observer’s paradox. Unfortunately, no other observer’s paradox-free databases are publicly available today. We therefore plan to collect new data using a data collection strategy similar to the one used to build GRAS\(^{2}\). The next logical step is to add other prominent affect recognition databases – such as RECOLA [16]. This will culminate in an exhaustive study of affect-related databases and their effectiveness in training the most generalised, real-life, time-continuous affect recognisers.