Humans express and recognize emotions using multiple channels in contextually flexible ways (Cowen & Keltner, 2021; Kret et al., 2020; Neal & Chartrand, 2011; Niedenthal et al., 2009; Nummenmaa et al., 2014). These channels include facial movements (Coles et al., 2019; Ekman, 1993; Namba et al., 2022; Wood et al., 2016), body language (C. Ferrari et al., 2022; Poyo Solanas et al., 2020; Reed et al., 2020; Wallbott, 1998), and the tone and content of speech (Bachorowski & Owren, 1995; Beukeboom, 2009; Hawk et al., 2009; Ponsot et al., 2018). Context – both the physical and human environment – also plays a key role (Greenaway et al., 2018; Ngo & Isaacowitz, 2015; Whitesell & Harter, 1996).

Prior research focusing on each individual channel of affective information has advanced a mechanistic understanding of emotion. However, this approach limits generalizability to real-world contexts where different channels of information naturally interact (Yarkoni, 2022). This relatively non-naturalistic tradition of affective research stems in part from technical barriers related to analyzing emotion in more naturalistic contexts. Here, we introduce how deep learning could be applied to overcome these barriers. By understanding its promises and being mindful of its limitations, researchers may use deep learning to advance a more naturalistic affective science.

Current Practices in Affective Research

Before we introduce deep learning applications for affective research, we discuss current practices and challenges. We focus on three topics: quantifying behavior, optimizing stimuli, and modeling affective processes.

Researchers commonly quantify behavior through manual annotation. For instance, they annotate the activation of facial muscles (e.g., cheek raiser) on participants’ faces using the Facial Action Coding System (Ekman & Friesen, 1978; Girard et al., 2013; Kilbride & Yarczower, 1983). Others manually measure the joint angles of mannequin figures and point-light displays to study body language (Atkinson et al., 2004; Coulson, 2004; Roether et al., 2009; Thoresen et al., 2012).

However, manual annotation is time-consuming (Cohn et al., 2007). This limits the quantity and frequency of annotation. For instance, it would be infeasible to annotate every frame in a large set of videos. As a result, prior research has disproportionately used small samples of static, artificial stimuli (Aviezer et al., 2008, 2012; Benitez-Quiroz et al., 2018; Cowen et al., 2021; McHugh et al., 2010).

Existing research also uses computational tools for quantifying behaviors. For instance, computational models of faces and facial expressions help reveal diagnostic features that people use to infer emotions from faces (Blanz & Vetter, 1999; Jack & Schyns, 2017; Martinez, 2017). Using digital equipment, researchers measure vocal features of speech such as amplitude and frequency (Scherer, 1995, 2003). Models that link these vocal features to emotions further enable researchers to manipulate emotional vocalizations during conversations in real time (Arias et al., 2021). Researchers also investigate the emotional content of speech by computing the frequencies of emotion-laden words (e.g., words that commonly express happiness) to perform sentiment analysis (Crossley et al., 2017; Pennebaker & Francis, 1999).
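The word-frequency approach behind such sentiment analysis can be sketched in a few lines. The example below is purely illustrative: the word lists are invented stand-ins for validated dictionaries (e.g., LIWC), and real analyses involve tokenization and normalization steps omitted here.

```python
from collections import Counter

# Illustrative emotion lexicons; real studies use validated
# dictionaries (e.g., LIWC), not these toy word lists.
HAPPY_WORDS = {"happy", "joy", "delighted", "smile"}
SAD_WORDS = {"sad", "gloomy", "tears", "miserable"}

def emotion_word_frequencies(text):
    """Return the proportion of happy and sad words in a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    happy = sum(counts[w] for w in HAPPY_WORDS) / total
    sad = sum(counts[w] for w in SAD_WORDS) / total
    return {"happy": happy, "sad": sad}

scores = emotion_word_frequencies("I was so happy I could not stop my smile")
```

Here two of the ten tokens ("happy", "smile") match the happiness lexicon, yielding a happiness score of 0.2.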

These computational tools work well with highly controlled stimuli (e.g., high quality audiovisual recordings). However, they struggle with naturalistic stimuli, such as real-world conversations where speakers talk over each other in noisy environments. More generally, many of these computational models are based on theory-driven features (e.g., facial/vocal features, or words that researchers think might be associated with emotions), which could miss important emotional features that researchers do not anticipate.

The challenge of quantifying behaviors leads to the further challenge of optimizing naturalistic stimuli that portray these behaviors. Representative sampling is necessary to make inferences from samples to populations. This is widely understood by psychologists, and the field is making increasing efforts to recruit participants from more diverse populations (Barrett, 2020; Henrich et al., 2010; Rad et al., 2018). However, it is less widely appreciated that the need for representative sampling also applies to stimuli (Brunswik, 1955).

The lack of tools for quantifying behavior in naturalistic stimuli (e.g., video recordings of participants’ emotional responses in conversations) makes it difficult to systematically select and manipulate stimuli. As a result, much prior research relies on manual selection (e.g., selecting recordings based on basic emotions) and manual manipulation (e.g., changing the joint angles of point-light displays based on hypotheses). These methods introduce researchers’ preconceived beliefs into experimental designs, and may lead to conclusions that favor those beliefs.

The technical barriers to quantifying behavior and optimizing stimuli contribute to a third challenge, modeling naturalistic affective processes. The mind integrates different channels of affective information in complex, context-dependent ways. For instance, these integrations may be nonlinear (e.g., when paired with a high-pitched vocalization, both wide and squinting eyes could signal frustration). Different subsets of information streams may be integrated at different stages (e.g., identity and context first, and then with facial expressions). Common linear modeling approaches cannot fully capture these complex processes.

The Promise of Deep Learning for Affective Research

Machine learning is an umbrella term for the practice of training computer algorithms to discover patterns in and make predictions about data (Table 1). Deep learning is a subset of machine learning based on deep neural networks (DNNs) (Rumelhart et al., 1986). DNNs consist of networks of artificial neurons, roughly akin to biological neurons, linked by connections that represent synapses. By optimizing the connections (i.e., weights) between neurons, the model learns a mapping between inputs and outputs that minimizes prediction error during training (Fig. 1). There are a wide variety of DNN architectures used for solving different computational problems (Table 1).
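To make the training process concrete, the following is a minimal sketch of a single-hidden-layer network trained by gradient descent via backpropagation, written in NumPy. The data, layer sizes, and learning rate are invented for illustration; practical DNNs rely on frameworks that compute gradients automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = x1 * x2, a nonlinear mapping that a
# single linear layer could not capture.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# One hidden layer of 16 tanh units; the weight matrices are the
# "connections" that training adjusts.
W1 = rng.normal(0, 0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

losses = []
lr = 0.1
for _ in range(500):
    # Forward pass: inputs -> hidden activations -> prediction.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # mean squared error loss
    # Backpropagation: gradients of the loss w.r.t. each weight.
    dpred = 2 * err / len(X)
    dW2 = h.T @ dpred
    db2 = dpred.sum(0)
    dh = dpred @ W2.T * (1 - h ** 2)  # tanh derivative
    dW1 = X.T @ dh
    db1 = dh.sum(0)
    # Gradient descent step: nudge weights to reduce the loss.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Across iterations the loss shrinks as the weights are adjusted, which is the essence of the process depicted in Fig. 1C.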

Table 1 Comparing deep learning with other machine learning methods
Fig. 1

The structure and training process of a DNN. A. The basic components of a DNN. B. The computations performed inside a neuron. C. The training process for minimizing loss (prediction error) using stochastic gradient descent via backpropagation

Here, we describe how DNNs could help address the challenges of behavior quantification, stimuli optimization, and cognitive modeling (Fig. 2). First, many pre-trained DNN models can be readily applied to quantify different channels of affective behavior and do not require any additional model training (Table 2). These models have four distinct advantages over manual annotations and existing computational models.

Fig. 2

Applications of DNNs for advancing naturalistic affective research. A. DNNs provide a more scalable way to quantify behavior of study participants and stimulus targets in naturalistic contexts. B. DNN-based quantifications can support better experimentation by facilitating naturalistic stimulus selection and manipulation. C. DNNs are capable of capturing interactive and nonlinear effects, ideal for modeling cognitive/neural mechanisms underlying the subjective experience, physiological responses, and the recognition and expressions of emotions

Table 2 Examples of pre-trained DNN models and DNN architectures for quantifying naturalistic behavior

First, many pre-trained DNNs are efficient to use. For instance, some face annotation DNNs can quantify action units, facial key points, and head poses across thousands of frames of a video in a few minutes (Baltrusaitis et al., 2018; Benitez-Quiroz et al., 2016). This speed advantage creates new possibilities. For instance, using these tools, researchers could predict participants’ subjective experience of emotions in real time based on behavioral quantifications derived from video recordings (Li et al., 2021) and use these predictions to time exactly when to introduce experimental manipulations (Fig. 2A).

Second, using pre-trained DNNs reduces costs. For instance, to study the body language of real people in social interactions, researchers have traditionally needed expensive equipment such as motion-capture suits or camera systems (Hart et al., 2018; Zane et al., 2019). In comparison, pre-trained DNNs can annotate body poses and joint positions based on ordinary video recordings (Kocabas et al., 2020, 2021; Rempe et al., 2021). These pre-trained and lightweight DNN models provide accessible alternatives to a broader range of researchers.

Third, pre-trained DNNs for behavioral quantification are well suited for complex, real-world contexts. For instance, many pre-trained DNNs are available for naturalistic speech analysis, including the separation of overlapping speech sources, the conversion of speech to text, and the quantification of the resulting text in terms of its meaning (Chernykh & Prikhodko, 2018; Lutati et al., 2022a; C. Wang et al., 2022). They offer researchers more powerful tools to investigate how people communicate emotions in real-world conversations and large text corpora across cultures and languages (Ghosal et al., 2019; Poria et al., 2019; Thornton et al., 2022).

Fourth, pre-trained DNNs for behavioral quantification are flexible to use. For instance, a range of pre-trained DNNs is available for quantifying context (the physical and human environment). At one extreme, researchers can combine multiple models to quantify different elements in the context, such as the interacting partners’ behaviors and the objects present (Bhat et al., 2020). At the other extreme, researchers can extract a global description of the scene (Krishna et al., 2017). The ability to quantify context at these different levels of granularity makes it possible to study the effects of context on emotion recognition, emotion expression, and the subjective, physiological, and neural components of emotion more precisely.

Applying deep learning to quantify behavior benefits stimulus optimization efforts as well (Fig. 2B). Imagine a case in which researchers are investigating how storytelling evokes emotional experiences. Selecting storytelling videos that are representative of the diverse storytelling that people encounter in daily life will facilitate a more generalizable conclusion (Fig. 2B, left). The multi-channel behavioral quantifications from DNNs can help achieve this goal. Specifically, researchers could first scrape a large number of real-world storytelling videos from the internet; then quantify multi-channel information in each video (e.g., face, body, speech, context) using deep learning models; and finally, apply the maximum variation sampling procedure (Patton, 1990) to select a subset of stimuli from every part of the psychological space.
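Maximum variation sampling can be implemented in several ways; one common heuristic is greedy farthest-point selection, sketched below. The feature matrix and the choice of Euclidean distance are illustrative assumptions, standing in for DNN-derived quantifications of real stimuli.

```python
import numpy as np

def max_variation_sample(features, k, seed=0):
    """Greedily pick k stimuli that spread across feature space
    (farthest-point heuristic for maximum variation sampling)."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]  # random starting stimulus
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dists))    # farthest from the chosen set
        selected.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Rows are hypothetical videos; columns are DNN-derived features.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 5.0]])
subset = max_variation_sample(feats, k=3)
```

In this toy example, the two nearly identical videos (rows 0 and 1) are never both selected, because selecting one makes the other the least informative remaining option.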

To better understand the causal relation between different channels of information and affective responses, researchers may wish to manipulate stimuli beyond selecting them (Fig. 2B right). Deep learning models can also manipulate naturalistic stimuli realistically and in real time (Xu, Hong, et al., 2022). For instance, researchers could manipulate the facial expressions of participants as they speak to each other over a video call and measure how one participant’s manipulated facial expressions influence the other’s subjective experience of emotions or physiological responses. This would allow for controlled experiments on naturalistic conversations through the medium of the conversation itself, rather than imposing an external intrusion upon it (e.g., prompts to change conversation topics).

Researchers can also use deep learning to achieve experimental control over naturalistic stimuli by synthesizing novel stimuli that never existed in the real world (Balakrishnan et al., 2018; Daube et al., 2021; Guo et al., 2023; Liu et al., 2021; Masood et al., 2023; Pumarola et al., 2018b; Ren & Wang, 2022; Roebel & Bous, 2022; Schyns et al., 2023; Wang et al., 2018; Yu et al., 2019). These tools can generate high-quality, realistic images, audio, and videos of any combination of features that researchers might be interested in, some even in real time. This can provide an unprecedented level of control to researchers while still retaining naturalism in the stimuli.

Finally, deep learning could advance a computational understanding of naturalistic affective processes in the mind and brain (Fig. 2C). Many researchers have already applied deep learning to cognitive modeling, such as how information is represented in the visual cortex (Cichy & Kaiser, 2019; Dobs et al., 2022; Khaligh-Razavi & Kriegeskorte, 2014; Kohoutová et al., 2020; Konkle & Alvarez, 2022; Mehta et al., 2020; Perconti & Plebe, 2020; Richards et al., 2019; Saxe et al., 2021; Su et al., 2020).

Research has started applying deep learning to model affective cognition (Kragel et al., 2019; Thornton et al., 2023). Three qualities of DNNs make them a promising avenue for advancing a naturalistic understanding of affective processes. First, by virtue of their nonlinear activation functions and multi-layered structure (Fig. 1), DNN models excel at discovering complex interactions among both observable variables (e.g., affective behaviors) and latent variables. Given the importance of latent variables (e.g., emotions) and the complex interactions between behaviors and contexts, this feature is essential for building realistic cognitive models of affective processing.

Second, DNN models can predict multidimensional dependent variables in a single integrated model. Unlike common regression-based models, which typically have scalar outputs, the dependent variables in DNN models can be scalars, vectors, or multidimensional arrays. Moreover, DNN models can capitalize on the structure of the data, modeling both spatial relationships (e.g., via convolutions) and temporal relationships (e.g., via recurrence). Although one can find these individual pieces in other bespoke statistical models (Table 1), arguably nothing rivals the flexibility of deep learning at combining them into a single computationally efficient package. Since both the inputs (others’ naturalistic behavior) and outputs of affective processes (subjective, physiological, and neural components of emotions) are frequently multidimensional and complexly structured in time and space, this flexibility makes deep learning useful for affective modeling.
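As an illustration of how spatial and temporal structure can be combined in one model, the sketch below chains a temporal convolution, a simple recurrence, and a multidimensional readout in plain NumPy with random (untrained) weights. All dimensions and names are invented; the point is the shape of the computation, not a usable model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy input: 100 time steps of 8 behavioral features per step.
x = rng.normal(size=(100, 8))

# Temporal convolution: each of 16 filters spans 5 consecutive
# time steps, capturing local temporal patterns.
kernel = rng.normal(size=(5, 8, 16)) * 0.1   # (width, in, out)
conv = np.stack([
    np.tanh(np.einsum('wf,wfo->o', x[t:t + 5], kernel))
    for t in range(len(x) - 4)])             # shape (96, 16)

# Simple recurrence: a hidden state integrates the whole sequence.
Wh = rng.normal(size=(16, 16)) * 0.1
Wx = rng.normal(size=(16, 16)) * 0.1
h = np.zeros(16)
for step in conv:
    h = np.tanh(h @ Wh + step @ Wx)

# Readout to a multidimensional dependent variable, e.g., ratings
# on four emotion dimensions produced by one integrated model.
Wout = rng.normal(size=(16, 4)) * 0.1
output = h @ Wout
```

In a deep learning framework, every stage of this pipeline would be trained jointly against the multidimensional outcome, which is precisely the flexibility described above.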

Third, DNN models provide a useful framework for simulating causal effects. For instance, to understand how different types of affective behaviors (e.g., face and voice) interact to express emotions, one can manipulate a DNN’s architecture so that different cues are allowed to interact in different layers of the models. These manipulations are impossible to do in the human brain as one cannot simply rewire it at will. Deep learning can also be embedded within embodied agents (Arulkumaran et al., 2017) so that researchers can use them to study how affective processes shape action and decision-making as agents learn to causally manipulate their environment.

Limitations of Deep Learning for Affective Research

Despite its promise, deep learning is not a magic box. Understanding the limitations of DNNs will help affective scientists use them effectively. Here, we describe the limitations of DNNs for behavior quantification, stimuli optimization, and cognitive modeling.

First, although the accuracy of pre-trained DNNs for annotating affective behavior is relatively high, many of them have yet to achieve human-level accuracy. For instance, the accuracy of estimating three-dimensional facial expressions and body movements is constrained by the two-dimensional inputs that these models are trained on (images and videos). However, given the fast pace of deep learning improvements (Fig. 3), there is reason to be optimistic about improvement in this regard.

Fig. 3

Improvements of DNNs for behavioral quantification over time. Title indicates the behavior channel and the corresponding benchmark dataset that the models were evaluated on. X-axis indicates the year the models were published. Y-axis indicates the metric for measuring model performance. Data reflect benchmarks reported on paperswithcode.com (Papers with Code, n.d.)

Second, the benefits of using pre-trained DNNs for behavioral annotation vary with the context. For instance, many of these models annotate only a subset of behavioral features that researchers might be interested in, such as only 20 out of 46 facial action units. The performance of these models may be significantly reduced in certain situations. For instance, DNN audio source separation may fail when the quality of audio recordings is low. These models also struggle to generalize. For instance, a facial expression classification model that performs well in the conditions it was trained on (e.g., frontal, well-lit, adult faces), may perform poorly when applied to different conditions (Cohn et al., 2019). Careful accuracy and bias auditing should be part of any study relying on deep learning as an objective quantification tool.
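A basic accuracy audit of this kind can be as simple as comparing model outputs with human annotations within each group of interest. The sketch below is a minimal illustration; the labels and group names are hypothetical.

```python
import numpy as np

def audit_by_group(y_true, y_pred, groups):
    """Compute per-group accuracy and the largest gap between groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
            for g in np.unique(groups)}
    gap = max(accs.values()) - min(accs.values())
    return accs, gap

# Hypothetical emotion labels from a classifier vs. human annotation,
# with each stimulus assigned to a demographic group A or B.
true_labels  = ["joy", "anger", "joy", "fear", "joy", "anger"]
model_labels = ["joy", "anger", "joy", "anger", "joy", "joy"]
group        = ["A",   "A",     "A",   "B",    "B",   "B"]
per_group, accuracy_gap = audit_by_group(true_labels, model_labels, group)
```

A large accuracy gap between groups would flag the model as unsuitable, or at least signal that group must be accounted for in downstream analyses.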

Third, DNNs are susceptible to social biases. These biases may result from the composition of the training dataset (e.g., having more samples for certain ethnicities than others), the bias of the humans who provided the training labels (e.g., stereotyped associations), and/or the architecture of the algorithm itself (Mehrabi et al., 2021; Shankar et al., 2017). For instance, some algorithms wrongly assign more negative emotions to Black men’s faces than White men’s faces, reflecting a stereotype widespread in the US (Kim et al., 2021; Rhue, 2018; Schmitz et al., 2022).

Since the application of stimulus optimization uses outputs from pre-trained DNNs, the above limitations of the quantification models can carry through to influence stimulus selection. For instance, if a voice quantification model has been trained on more male than female speech, it may represent male voices as more distinct from one another than female voices. Applying maximum variation sampling based on these quantifications might thus lead to over-sampling of male speech.

Maximum variation sampling is also susceptible to class imbalance (Van Calster et al., 2019). For instance, if the initial stimulus set has significantly more positive valence stories than negative ones, then positive stories will be selected more frequently. However, both issues with stimulus selection can be mitigated with stratified maximum variation sampling (e.g., applying the procedure to male and female speech separately, or positive and negative stories separately) (Lin et al., 2021, 2022; Lin & Thornton, 2023).
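A minimal sketch of stratified maximum variation sampling follows, assuming greedy farthest-point selection as the within-stratum procedure; the feature matrix and stratum labels are invented for illustration.

```python
import numpy as np

def farthest_point_sample(features, k, seed=0):
    """Greedy maximum variation sampling within one stratum."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(features)))]
    d = np.linalg.norm(features - features[chosen[0]], axis=1)
    while len(chosen) < k:
        chosen.append(int(np.argmax(d)))
        d = np.minimum(
            d, np.linalg.norm(features - features[chosen[-1]], axis=1))
    return chosen

def stratified_max_variation(features, strata, k_per_stratum):
    """Run the sampler separately in each stratum (e.g., positive vs.
    negative stories) so imbalanced classes are sampled evenly."""
    features, strata = np.asarray(features), np.asarray(strata)
    picks = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        local = farthest_point_sample(features[idx], k_per_stratum)
        picks.extend(idx[local].tolist())
    return picks

feats = np.random.default_rng(2).normal(size=(30, 5))
labels = ["pos"] * 20 + ["neg"] * 10   # imbalanced initial stimulus set
chosen = stratified_max_variation(feats, labels, k_per_stratum=4)
```

Even though positive stories outnumber negative ones two to one in the initial set, the selected subset contains an equal number of each.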

While some applications of deep learning are approaching maturity (i.e., achieving high accuracy), such as behavior quantification (Fig. 3), others are just emerging, such as manipulating and synthesizing realistic stimuli in real time. At present, applying DNNs to manipulate and synthesize stimuli is most reliable for features that the models have been exposed to (Hosseini et al., 2017; Papernot et al., 2016). For instance, if an algorithm has never been trained on the motion of jumping, its synthesis of people jumping with joy is unlikely to look realistic.

When applying DNNs for affective modeling, in addition to the overall level of accuracy, researchers should carefully consider the types of errors that DNNs make, which may be systematically different from those of humans. Researchers should also be cautious about equating deep learning models with the human mind and brain. Correlations between human performance and DNNs do not indicate that the two systems share similar causal mechanisms (Bowers et al., 2022; Schyns et al., 2022).

Finally, researchers should also distinguish between the inherent versus current limitations of DNNs. For instance, many existing DNN models are trained on aggregate-level data and thus cannot represent individual differences in affective processes. However, with the proper inputs (e.g., individual-level perceptions with individual difference measures), DNNs could in principle model individual differences in affective processes.

Besides the limitations highlighted for each of the three applications above, we have summarized the most prominent limitations of deep learning in general, alongside potential mitigation strategies for each of them (Table 3).

Table 3 Limitations of deep learning

Conclusion

In this review, we have provided a brief introduction to how deep learning could be applied to tackle challenges in affective science. We focused on three main applications: behavior quantification, stimuli optimization, and affective modeling. These applications can advance naturalistic research on the verbal and nonverbal expressions of emotions, the recognition of emotions, and the subjective, physiological, and neural components of affective experiences. We encourage interested readers to explore other works that provide detailed primers on how to use these tools to their fullest (Pang et al., 2020; Thomas et al., 2022; Urban & Gates, 2021; Yang & Molano-Mazón, 2021). With deep learning tools in hand, they will stand poised to substantially expand our understanding of emotion in more naturalistic contexts.