1 Introduction

The motivations for detecting stress from speech range from its being a non-intrusive way to detect stress, to ranking emergency calls [7], to improving speech recognition systems, since environmentally induced stress is known to cause failures in speech recognition [13]. Public speaking is said to be “the most common adult phobia” [18], which underlines the relevance of a tool to improve public speaking. In VOCE (Footnote 1), we target the development of such a tool by designing algorithms that identify emotional stress in live speech. In particular, the VOCE corpus comes mainly from public speaking events in an academic context, such as coursework presentations or research seminars. The envisioned coaching application must detect emotional stress in live speech in near real time, so that the user receives timely feedback, which in turn requires adapting the computational cost to the limited memory and processing resources available. Decreasing the number of features used for classification reduces the amount of data to collect, the number of features to extract, and the complexity of the classifier, thereby reducing the memory and computational resources required. Additionally, feature selection can increase the classifier’s accuracy [12]. Thus, in this paper, we focus on identifying such reduced feature sets based on their performance as stress discriminators.

In this work, we start from the fusion of two feature sets: the group of features extracted using the openSMILE toolkit [25], and the group of TEO-based features, detailed in Sect. 4.2. We filter these feature sets with Mutual Information (MI) and then use a branch-and-bound wrapper to explore the space of possible feature sets. Finally, we analyse the best feature sets found on the various branches to identify the most frequently chosen feature categories.

2 Related Work

The importance of suprasegmental acoustic phenomena that can serve as global emotion features, such as “hyper-clear speech, pauses inside words, syllable lengthening, off-talk, disfluency cues, inspiration, expiration, mouth noise, laughter, crying, unintelligible voice”, is highlighted in [28]. These features have mainly been annotated by hand; automatic extraction is not straightforward, though possible in some cases.

Stress recognition from speech is a specific case of emotion recognition. The fundamental frequency, F0, is the feature with the widest consensus for stress discrimination [8, 14, 22, 31], but several measures of energy and formant changes have also been proposed, often represented by Mel-Frequency Cepstral Coefficients (MFCCs) [7, 21, 31]. Frequency and amplitude perturbations – jitter and shimmer – and other measures of voice quality, such as the Noise-to-Harmonics Ratio and the Subharmonics-to-Harmonics Ratio [26, 28], have also been used. Teager Energy Operator (TEO)-based features have likewise been shown to perform well for speech under stress [31], and we shall look at them in detail in this work.

TEO-based features have been shown to increase recognition robustness in the presence of car noise [10, 15]. In [17], TEO-based features reached the best performance for stressed speech discrimination outdoors, but not indoors. They have also been used for voiced–unvoiced classification [19]. In the latter work, the advantages of the TEO are laid out: because only three samples are needed to compute the energy at each time instant, the operator is nearly instantaneous; this time resolution makes it possible to capture energy fluctuations and supports robust AM-FM estimation in noisy environments. The Teager Energy Operator is also used in [6] in the development of a system for hypernasal speech detection. In this work, we shall look into the discrimination power of TEO-based speech features for stress discrimination in public speaking.

3 Speech Corpus and Data Annotation

The VOCE corpus [2] currently consists of 38 raw recordings from volunteers aged 19 to 49. Data is recorded in an ecological environment, namely during academic presentations (Footnote 2). Speech was automatically segmented into utterances, according to a process described in [5].

Annotation into stressed and neutral classes was performed per speaker, based on mean heart rate [4]. Utterances in the third quartile of mean heart rates for that speaker are annotated as stressed, while the remaining ones are annotated as neutral.
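A minimal sketch of this per-speaker annotation rule is given below. It assumes that "in the third quartile" means at or above the speaker's Q3 value and that per-utterance mean heart rates are already available; function and variable names are illustrative.

```python
import numpy as np

def annotate_speaker(mean_hr_per_utterance):
    """Label each utterance of one speaker as 'stressed' if its mean
    heart rate is at or above that speaker's third quartile (Q3),
    and as 'neutral' otherwise."""
    hr = np.asarray(mean_hr_per_utterance, dtype=float)
    q3 = np.percentile(hr, 75)
    return ["stressed" if value >= q3 else "neutral" for value in hr]

# Illustrative usage for one speaker (heart rates in beats per minute).
labels = annotate_speaker([72.1, 80.5, 95.3, 88.0, 76.4, 99.2, 84.7, 91.0])
```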

Using an ecologically collected corpus imposes an unavoidable trade-off between recording quality and speaker spontaneity. Higher recording quality not only allows more reliable feature extraction in general, but also affects the performance of the segmentation algorithms we use to split the speech into sentence-like units – utterances – and of the text transcription needed for the extraction of TEO features. For these reasons, we selected only 21 raw recordings for this work.

Table 1. Dataset demographic data. PSE: Public Speaking Experience, rated from 1 (little experience) to 5 (large experience).

For these speakers, 1457 valid utterances were obtained (Footnote 3). The utterances are divided into 15 speakers (507 utterances) for training and 6 speakers (442 utterances) for testing. Since stressed utterances make up approximately 1/4 of the total, we randomly down-sampled the training data to balance the two classes, which yields the 507 utterances mentioned above. During feature selection, the classifier was trained on 354 utterances and tested on 153 utterances, all drawn from the training set. Table 1 characterises the dataset in terms of age, gender, public speaking experience, and the number of utterances considered (Footnote 4).

We performed outlier detection on each feature using the Hampel identifier [20] with t = 10. The outliers were then replaced by the mean value of the feature excluding outliers, and feature values were scaled to the interval [0,1].
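A minimal per-feature sketch of this cleaning step is shown below, assuming a standard Hampel identifier based on the median and the scaled median absolute deviation (the exact formulation in [20] may differ slightly); the function name is illustrative.

```python
import numpy as np

def clean_feature(values, t=10.0):
    """Hampel identifier with threshold t: flag values farther than
    t * 1.4826 * MAD from the median, replace them by the mean of the
    remaining values, then rescale the feature to [0, 1]."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - median))
    outliers = np.abs(x - median) > t * mad
    if outliers.any() and (~outliers).any():
        x[outliers] = x[~outliers].mean()
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```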

4 Methodology

Figures 1(a) and (b) illustrate the workflows for speech segmentation and feature selection, respectively. In this work, we start from the fusion of two feature sets: the group of features extracted using the openSMILE toolkit [25], and the group of TEO-based features, detailed in Sect. 4.2. We filter these feature sets with Mutual Information and then use a branch-and-bound wrapper to explore the space of possible feature sets. We then analyse the best feature sets found on the various branches to identify the most frequently chosen feature categories.

Fig. 1. Workflow for the speech segmentation and the feature selection process.

4.1 Acoustic-Prosodic Features

openSMILE extracts a set of 128 frame-level features from the speech signal, known as low-level descriptors (LLDs) [11]. Statistical functionals are then applied over the LLDs to compute values for longer segments, yielding a total of 6125 features at the segment level [25]. These features and their extraction processes are described in [9, 24].

The openSMILE toolkit is capable of extracting a very wide range of acoustic-prosodic features and has been applied with success in a number of paralinguistic classification tasks [23]. In this study it was used to extract a vector of 6125 speech features, by applying segment-level statistics (means, moments, distances) over a set of energy-, spectral- and voicing-related frame-level features.
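For reference, a hedged sketch of how such a segment-level vector can be produced by calling openSMILE's SMILExtract command-line tool from Python is shown below. The configuration file name (assumed here to be an IS12-style speaker-trait configuration yielding 6125 functionals) and the paths are illustrative and depend on the openSMILE version and installation.

```python
import subprocess

def extract_opensmile_features(wav_path, out_csv,
                               config="config/IS12_speaker_trait.conf"):
    """Run openSMILE's SMILExtract on one utterance.
    The -C/-I/-O flags select the feature configuration, the input
    audio file, and the output file with segment-level functionals."""
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_csv],
        check=True,
    )

# Illustrative usage (paths are placeholders).
extract_opensmile_features("utterance_0001.wav", "utterance_0001.csv")
```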

4.2 Teager Energy Operator Features

The following TEO-based features were extracted: the Normalized TEO Autocorrelation Envelope and the Critical Band Based TEO Autocorrelation Envelope, as in [31]. The literature in which these features are presented targets feature extraction for small voiced parts usually called “tokens” [31]. To work equivalently, we performed phone recognition with the delimitation of each phone [1] and used only voiced sounds. These correspond to the phones represented by the Portuguese SAMPA symbols ‘i’, ‘e’, ‘E’, ‘a’, ‘6’, ‘O’, ‘o’, ‘u’, ‘@’, ‘i~’, ‘e~’, ‘6~’, ‘o~’, ‘u~’, ‘aw’, ‘aj’, ‘6~j~’, ‘v’, ‘z’, ‘Z’, ‘b’, ‘d’, ‘g’, ‘m’, ‘n’, ‘J’, ‘r’, ‘R’, ‘l’, ‘L’ [29, Chap. IV.B].
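These features build on the discrete Teager Energy Operator; for reference, a minimal sketch of the operator itself in Kaiser's three-sample form is shown below. The autocorrelation envelope and critical-band filtering steps of [31] are not reproduced here.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator:
    Psi[x](n) = x(n)**2 - x(n-1) * x(n+1).
    Only three consecutive samples are needed per output value, which
    gives the near-instantaneous time resolution noted in Sect. 2."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```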

These features are extracted per frame. Each frame is about 10 ms long, depending on the feature to extract. Each phone usually spans many frames and each utterance normally contains many phones. Therefore, since we want values per utterance, we take each feature extracted over all phones and apply statistics to it: mean, standard deviation, skewness, kurtosis, first quartile, median, third quartile, and inter-quartile range. This process is also illustrated in Fig. 1(a). The first two columns of Table 2 summarise the feature types considered in this work (Footnote 5).
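A minimal sketch of this aggregation step is given below, assuming the frame-level values of one TEO-based feature have already been collected over all voiced phones of an utterance (scipy is used for the higher-order moments; names are illustrative).

```python
import numpy as np
from scipy.stats import skew, kurtosis

def utterance_statistics(frame_values):
    """Collapse the frame-level values of one feature over an utterance
    into the eight per-utterance statistics used in this work."""
    x = np.asarray(frame_values, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(),
        "std": x.std(),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": q3 - q1,
    }
```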

Table 2. Feature types: Id, Name, number of features of each type selected by the MI filter, and number of features of each type chosen for the best sets T.A.1, T.A.2, G.A., Se., Sp., and Comb.

5 Searching for the Best Feature Sets

As already stated, we apply a filter to reduce the dimensionality of the initial 6285 features (openSMILE functionals plus TEO features) before applying the wrapper, which uses a Support Vector Machine (SVM) classifier with a radial basis function kernel and C = 100 (Footnote 6), implemented with the Python library scikit-learn.
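A minimal sketch of this classifier configuration, as evaluated inside the wrapper, is shown below; the stand-in data are random and only illustrate the shapes involved (354 training and 153 held-out utterances, restricted to a candidate feature subset).

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative stand-in data: 354 training and 153 held-out utterances
# with 10 candidate features (the real data come from the selected subset).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((354, 10)), rng.integers(0, 2, 354)
X_test, y_test = rng.random((153, 10)), rng.integers(0, 2, 153)

# RBF-kernel SVM with C = 100, trained for every candidate feature subset.
clf = SVC(kernel="rbf", C=100)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # wrapper score for this subset
```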

5.1 Filter: Mutual Information

There are several metrics and algorithms for computing the relevance of features in a dataset, and the choice of metric may strongly affect the final subset of features. However, since there is little a priori knowledge about which filter metric suits a specific dataset [30], we based our choice on the work of Sun and Li et al. [27], which reported good classification results for Mutual Information (MI), a metric that measures the mutual dependence between two random variables.

Since MI is based on the probability distribution of discrete variables and our features take continuous values, we had to define a binning. We (1) defined five binning options: 50, 100, 250, 500, or 1000 bins; (2) computed the MI for each feature under each binning option; (3) kept the features whose MI value fell in the top quartile for all binning options. Their distribution per feature type is given in the third column of Table 2.
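A minimal sketch of this filtering step is given below. It assumes the features are already scaled to [0, 1] (Sect. 3), that the MI is estimated between each discretised feature and the class label, and that a simple equal-width discretisation is used; names and the exact MI estimator are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

BIN_OPTIONS = (50, 100, 250, 500, 1000)

def mi_filter(X, y, bin_options=BIN_OPTIONS):
    """Keep the features whose MI with the label lies in the top
    quartile for every binning option."""
    n_features = X.shape[1]
    keep = np.ones(n_features, dtype=bool)
    for n_bins in bin_options:
        mi = np.empty(n_features)
        for j in range(n_features):
            # Equal-width bins over [0, 1]; inner edges only.
            edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
            binned = np.digitize(X[:, j], edges)
            mi[j] = mutual_info_score(y, binned)
        keep &= mi >= np.percentile(mi, 75)
    return np.flatnonzero(keep)  # indices of retained features
```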

5.2 Wrapper

Feature selection has been widely studied and, as a result, a large number of algorithms have been proposed. These algorithms can be categorised into three groups: filter, wrapper, and embedded [16]. Wrapper algorithms find the final solution using a learning algorithm as part of the evaluation criterion. The main idea of these methods is to use the learning algorithm as a black box to guide the search for the optimal solution. The learning algorithm is applied to every candidate solution, and the goodness of a subset is given by the performance of that algorithm. Because the learning algorithm is used directly in the feature selection process, these methods tend to find better solutions. Nonetheless, the final solution applies only to the selected learning algorithm, since a different one will most likely lead to a different final solution. These methods also have a higher computational cost, as they require training and classifying data for each candidate solution.

We designed a branch-and-bound wrapper to search the space of feature sets obtained from the MI filter for the combination of features that delivers the best classifier performance. The wrapper starts by searching all combinations of up to 10 features, keeping all sets whose accuracy is within 1.5 % of the best solution found so far. Larger feature sets are obtained by expanding the previously kept solutions with blocks of features not yet in the sets. Every time a feature subset is tested with the classification algorithm, a score is produced – in this case, the accuracy. Subsets are kept and expanded if the expansion improves on the previous accuracy. The search runs until the work list of feature sets with new combinations empties. This wrapper provides a better exploration of the feature-set space than traditional forward and backward wrappers. Since its search space is much larger than that of most wrapper methods, we used parallel programming techniques to improve throughput, namely Python’s multiprocessing package. A much simplified, single-process sketch of the expansion loop is given below.
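The sketch below starts from pairs rather than all combinations of up to 10 features and expands one feature at a time, whereas the actual implementation enumerates larger seed combinations, expands with blocks of features, and distributes the evaluations over processes with Python's multiprocessing package. Here `evaluate` stands for training and scoring the SVM of Sect. 5 on a candidate subset; all names are illustrative.

```python
from itertools import combinations

def wrapper_search(candidates, evaluate, seed_size=2, tolerance=0.015):
    """Simplified sketch of the branch-and-bound wrapper: keep every
    subset whose accuracy is within `tolerance` of the best seen so
    far, and expand kept subsets with features not yet included as
    long as the expansion improves on the parent subset's accuracy."""
    scores = {}  # cache: frozenset of feature indices -> accuracy

    def score(subset):
        if subset not in scores:
            scores[subset] = evaluate(subset)
        return scores[subset]

    # Seed the work list with small combinations (the paper enumerates
    # all combinations of up to 10 features).
    work = [frozenset(c) for c in combinations(candidates, seed_size)]
    best_acc, kept = 0.0, []
    while work:
        subset = work.pop()
        acc = score(subset)
        best_acc = max(best_acc, acc)
        if acc < best_acc - tolerance:
            continue
        kept.append((subset, acc))
        for f in candidates:
            expanded = subset | {f}
            if f not in subset and expanded not in scores and score(expanded) > acc:
                work.append(expanded)
    return sorted(kept, key=lambda item: -item[1])
```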

6 Results

The Mutual Information filter selected 487 features, distributed into types as described in the third column of Table 2. After selecting the best 280 feature sets with training accuracies below 85 % from the 20 processors, we looked at their distribution by feature type, shown in Fig. 2.

Among these 280 feature sets, we looked for the ones with the best scores in each of the considered metrics (Footnote 7): Train Accuracy, Generalisation Accuracy, Sensitivity (Se), Specificity (Sp) (Footnote 8), and a Combined Metric defined as

$$\mathrm{Combined\;Metric} = \frac{\mathrm{Se}+\mathrm{Sp}}{2}-\left|\mathrm{Sp}-\mathrm{Se}\right|. \qquad (1)$$

The need for this metric follows from our goal of achieving not only good generalisation accuracy but also high sensitivity and high specificity at the same time. This matters because our test set is imbalanced, with many more neutral than stressed utterances, so high generalisation results may be due to high values of true positives while true negatives are neglected – exactly the kind of scenario we want to avoid. In Table 3, each line corresponds to the best feature subset for which the metric specified in the first column was maximal. The last two lines correspond to baseline results, i.e., the classification with the whole set of features and with the set of MI-filtered features.
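A minimal sketch of how these metrics can be computed from a confusion matrix is shown below; treating stressed utterances as the positive class is an assumption of this example, and the counts are made up.

```python
def subset_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and the Combined Metric of Eq. (1),
    assuming stressed utterances are the positive class."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return se, sp, (se + sp) / 2 - abs(sp - se)

# Illustrative usage with a made-up confusion matrix.
se, sp, comb = subset_metrics(tp=80, fn=30, tn=250, fp=82)
```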

Columns T.A.1, T.A.2, G.A., Se., Sp., and Comb. in Table 2 correspond to the best feature sets according to each of these metrics, as presented in Table 3. Each of these columns in Table 2 gives the number of features of each type that was selected (each line corresponds to a feature type).

Table 3. Metrics for the best subsets, as percentages.

Fig. 2. Heatmap of feature type frequencies in each subset.

Table 3 supports the following observations:

  • The sets with the best train accuracy do not correspond to the ones with the best generalisation accuracy; in fact, they show the second worst generalisation results among these sets.

  • The sets with the best generalisation accuracy and with the best specificity, although achieving very good generalisation accuracies, have very low sensitivities. This is the kind of imbalance we want to avoid.

  • The same train accuracy can correspond to sets of very different quality. For a train accuracy of 81.70 % we obtain the best generalisation accuracy, the best sensitivity, and the best combined metric. Looking at the other columns of the table, only the line for the Combined Metric has acceptable results in both sensitivity and specificity.

  • These best reduced sets often achieve better results than both the complete set and the filtered set while being much smaller, which is very good for the envisioned real-time public speaking coaching application.

7 Discussion

The set of features selected by the Mutual Information filter is, grosso modo, the one reported in the literature for other languages (e.g., [14, 32]). It encompasses pitch information (mostly final pitch movements), audio spectral differences, voice quality features (jitter, shimmer, and harmonics-to-noise ratio), and TEO features, the latter usually described as very robust across genders and languages. As for PCM and MFCC features, they are ubiquitous in speech processing and highly informative for a wide range of tasks, so it is not surprising that they are informative for stress detection as well. The features selected by the Mutual Information filter give us a more complete characterisation of stress predictors. From this set, the ones systematically chosen in the best feature sets by the wrapper are mostly TEO features, MFCCs, and audio spectral differences. TEO and MFCC features are also reported in [32], for English and Mandarin, as the most informative ones, even more so than pitch itself.

8 Conclusions

We have used a corpus of ecologically collected speech to search for the speech features that best discriminate stress. Starting from 6125 features extracted with the openSMILE toolkit and 160 Teager Energy features, we used a Mutual Information filter to obtain a reduced subset for stress detection. Next, we searched for the best feature set using a branch-and-bound wrapper with SVM classifiers.

Our results provide further evidence that the features resulting from the Mutual Information filtering process are robust for stress detection tasks, independently of language, and highlight the importance of voice quality features for stress prediction, mostly high jitter and shimmer and low harmonics-to-noise ratio, parameters typically associated with creaky voice.

Our best result compares well with the work in [10, 32], although direct comparisons are hard to establish due to the different corpora, segmentations, and metrics used across studies.