
1 Introduction

Clinical psychiatry has long utilized so-called mental state signs to assist in the diagnosis of mental disorders. In assessing the presence of a psychiatric disorder, particular attention is given to appearance, behavior, speech, thought, perception, insight into illness and, of key relevance to this chapter, mood and facial affect [1]. The latter contributes to delineating depressive illnesses such as melancholia from non-melancholic depression, and is typically assessed by clinicians (e.g., psychiatrists). However, such clinical observation is, by virtue of its non-standardized nature, subjective and thus may not reliably capture the presence or absence of specific disorders. As more objective means of quantifying facial affect and emotion emerge, driven largely by advances in affective computing, the diagnosis of disorders characterized by specific affective signs—such as melancholia—will be made more valid. Prior to examining the potential utility of these emerging technologies in depressive illness, we broadly overview conceptual models of emotion and depression classification, with an emphasis on the relevance of affect in such disorders.

2 Emotion and Affect

The subjective affective experience of an individual, variably referred to as one’s ‘passions’, ‘emotions’, ‘feelings’, or ‘moods’, has, to this day, largely eluded definition. Over a century ago, James [2] suggested that emotions evoke a certain “phenomenal quality”, referring to the notion that emotions are “sensation-like mental states” [3], in effect ‘intuitions’ arising in response to emotion-eliciting events. Others (e.g., [4]) challenged James’s notion that emotions were reflex-like in nature (e.g., fear in response to a bear in the woods), and instead positioned emotion as an appraisal-based process (e.g., feeling happy or sad towards an object). Throughout the 1960s, appraisal theories of emotion dominated [5], but there continues to be philosophical debate over what constitutes an emotional state [6]. The fundamental biological and psychological processes underlying the above theories of emotion generation and/or appraisal are referred to as the ‘evolutionary core’ [3]—that is, a set of discrete emotion mechanisms aligned to one’s “basic” emotions.

There have been several dominant theories in the field of psychology that offer insight into such basic emotions. Beginning with McDougall [7], emotion was defined as a small set of adaptive processes, so-called emotional instincts, including the behaviorally well-defined emotions of flight, repulsion, curiosity, pugnacity, self-abasement, self-assertion, and the parental instinct [8]. Most modern variants of discrete emotion theory correspond at least partially to McDougall’s model. For instance, Tomkins [9–11], who is credited with the development of “affect theory”, considered there to be nine independent affects: two positive (enjoyment/joy and interest/excitement), one neutral (surprise/startle), and six negative affects (anger/rage, disgust, dissmell [similar to distaste], distress/anguish, fear/terror, and shame/humiliation). Likewise, Izard [12] considered fear, anger, shame, contempt, disgust, guilt, distress, interest, surprise, and joy to be primary emotions, with various combinations of these giving rise to tertiary emotions (e.g., affection). While the basic emotions were initially believed to be largely subjective states of mind, Ekman [13] postulated that seven basic emotions correspond near-universally to observable facial expressions, namely anger, disgust, fear, happiness, sadness, surprise, and contempt. Such categorical definitions of emotion have received substantial empirical support, but there is also a question as to whether affect can be conceptualized dimensionally [14]; that is, whether emotions exist along interrelated continua of, for example, arousal and valence [15].

It has been suggested that discrete (i.e., categorical) emotion theory provides a more accurate index of the current, momentary experience of emotion, while dimensional perspectives may be most relevant to emotional experience over time, such as mood states [16]. Rather than being mutually exclusive, however, it is likely that both viewpoints are pertinent to affective states such as depression. The work of Ekman in particular [17], which highlights the importance of emotional expressions, is especially relevant for quantifying affect. Indeed, ‘affect display’ refers to the externally displayed (i.e., observable) affect of an individual, expressed through facial, vocal or gestural means. When affect is in line with the subjective mood state of an individual it is termed ‘congruent affect’; when subjective state and affect are misaligned it is referred to as ‘incongruent affect’. As will become apparent throughout this chapter, affect in depression is typically congruent with the individual’s self-reported, subjective mood state; for instance, subjective low mood or sadness is often identifiable in the depressed patient through observation of a flat and/or non-reactive affect.

2.1 What Precisely Is an Affective Disorder?

Historical descriptions of disordered affective states date back over 2000 years. Hippocrates described “melancholia” as a disease-like state arising from an excess of black bile (one of four ‘humors’, or bodily fluids, thought to directly influence health and temperament). Current-day formulations position melancholia as the prototypical depressive disease, principally of biological and genetic origin [18], which presents with characteristic clinical features such as psychomotor slowing (i.e., physical slowing and concentration impairment), anergia, anhedonia, diurnal mood variation (mood worse in the morning), early morning wakening, and appetite and weight loss. Despite evidence for the existence of melancholia as a distinct condition, it was largely abandoned in psychiatric circles in 1980 with the publication of the third edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-III) [19], which popularized ‘symptom-centric’ models of affective illness. DSM-III brought three psychiatric disorders under the diagnostic umbrella of the ‘affective disorders’: major depression (subsuming melancholia), bipolar disorder (depression and/or hypomania or mania), and dysthymia (chronic low mood, defined as having fewer symptoms than major depression). Under the DSM framework, symptom reports by patients guide diagnostic decisions, which in isolation are thought to contribute to misdiagnosis and thus to inappropriate management [18]. Hence, while the affective disorders began with melancholia as an observable illness state, they are now characterized by symptom severity. It has been argued that this has contributed to an era of insipidity in depression research and is of limited clinical import. Here, we argue that there is scope to depart from dimensional accounts of depression (e.g., major depression) and return to more refined categorical depressive subtypes (e.g., melancholic vs. non-melancholic conditions).

3 The Case for Objectively Informed Classification of Depression Subtypes

The classification of depression has long been contentious, with numerous typologies being proposed since the beginning of the twentieth century. So-called ‘simple typologies’ conceptualized depression as comprising anywhere between one and five categories. The father of modern psychiatry, Emil Kraepelin, distinguished between dementia praecox (now schizophrenia) and manic-depressive insanity [20]. All major affective disturbances, namely mania and depression, were thought by Kraepelin to be part of the same illness (manic-depressive insanity). In 1926, British psychiatrist Edward Mapother at the Maudsley Hospital in London claimed [21] that there was only one form of depression, and did not distinguish between depressive subtypes—and hence saw depression as varying along a continuum. The influential theories of Kraepelin and Mapother laid the foundations for a move toward unitary models of depression. Sir Aubrey Lewis, an eminent psychiatrist at the same institute as Mapother, also viewed all depressions as essentially the same (and hence proposed a single category, “depressive illness”), but conceded that there are likely differences between individuals regarding its causation: specifically depressions that are more hereditary versus those caused by environmental factors [22].

The “separatists” opposed the unitary view of depression from the mid-twentieth century, arguing that depression could be classified into different types [23]. Such categorical views principally framed the diagnosis and classification of depression according to a “binary” model, differentiating depressions of ‘constitutional origin’ (i.e., arising from within the individual) from those that were reactions to environmental stressors. Several lines of evidence support the existence of different depressive subtypes, typically established by identifying differences between groups on some metric (e.g., clinical or self-report ratings of an illness variable). Kiloh and Garside [24] quantified the independence of neurotic and endogenous (synonymous with melancholic) depression through analysis of reported symptoms and clinical variables. Features such as psychomotor retardation and concentration difficulties correlated with the diagnosis of endogenous depression, whereas ‘reactivity of depression’ (presumably meaning reactivity of affect), irritability and variability of illness correlated with neurotic depression. Similarly, a ‘point of rarity’ was identified between endogenous and neurotic depressives in a series of papers from the Newcastle school [25, 26]. Here, clinical ratings of depressive symptoms and signs, along with personality and anxiety, were shown to differentiate endogenous and neurotic depressive subgroups. Again, psychomotor change was more prevalent in the endogenous group than in the neurotic group. Despite strong support for the notion that endogenous depression is a categorically distinct entity (one either has it or does not), while neurotic depression consists of symptoms varying dimensionally [23], this distinction was abandoned under the DSM system in favor of a dimensional approach to diagnosis.

‘Melancholia’ was retained in DSM-III (and subsequent iterations of the DSM), albeit in much-diluted form, as a ‘specifier’ diagnosis [19]. Retention of a melancholic specifier allows for a diagnosis of major depression with melancholic features, but has been criticized for its focus on symptom expression (a severity framework) while explicitly disregarding differing aetiological contributions [18]. In the DSM, melancholia is diagnosed by the presence of additional symptoms of anhedonia (reduced interest in, or reactivity to, previously pleasurable events) and psychomotor slowing, amongst others, which are present in almost all individuals with clinically significant depression [27]. Observable (not just reported) psychomotor disturbance has been proposed as a specific diagnostic marker of melancholia, which aligns with some of the earliest definitions of the disorder, from classical antiquity, where “symptoms … were not part of the concept” ([6] p. 298). We therefore developed a clinician-rated scale (the CORE) to measure psychomotor disturbance in depressed patients [28]. The tool allows rating of 18 clinical signs (observable features) across the domains of non-interactiveness (including features such as emotional non-reactivity and inattentiveness), retardation (including slowed motor movements and facial immobility), and agitation (including facial apprehension and motor agitation). Clinically diagnosed melancholia (still arguably the “gold standard” for diagnosing the condition) was associated with “substantial” CORE scores; in these studies a score of >8 defined “substantial” psychomotor disturbance [27], with such scores being representative of those with melancholia but not those with non-melancholic depression. Despite its high sensitivity in detecting melancholia, such a system—much like clinical diagnosis—requires extensive training and exposure to appropriate clinical populations (i.e., services where those with melancholia are likely to present) to be of benefit. Several investigators, including our own research team, have thus sought to clarify the biological correlates of melancholia in the hope that any identified perturbations will eventually assist in its diagnosis.

Throughout the late 1970s and early 1980s there were many investigations into disturbances of hypothalamic-pituitary-adrenal (HPA) axis function in depressive illness [29]. The HPA axis plays a central role in regulating homeostasis across many physiological systems, including the metabolic, reproductive, immune, and central nervous systems [30]. Insights into the function of this system in depression have been achieved by challenging patients with the synthetic corticosteroid dexamethasone (DEX) and then observing fluctuations in plasma cortisol—referred to as the DEX suppression test or DST. It was seen as a watershed for psychiatry when Carroll and colleagues [31] reported that 48 % of those with primary depression were DEX ‘non-suppressors’, compared to only 2 % of other psychiatric patients. Carroll et al. [32] subsequently demonstrated that the DST had utility in detecting melancholic depression, with a sensitivity of 67 % and specificity of 96 %. The measure hence appeared highly informative for depressive subtyping (i.e., nearly all non-melancholic patients were correctly classified as not having melancholia). Despite such promising early findings, the DST lost favor as a diagnostic tool in the mid-1980s after the DSM-III was introduced, given that it lacked sensitivity in later studies using broader diagnostic criteria [33]. Since then, disruptions across other cognitive and biological systems have been identified with specificity to the melancholic phenotype, including working memory impairments and disturbances in sleep architecture [34]. Our team also recently identified a neurobiological signature for melancholia [35]—which was not observed in non-melancholic depression—involving disrupted integration of brain regions supporting interoception and attention.

While such research has been of key importance in understanding the underlying causes of melancholia, the use of these measures as diagnostic tools will continue to be limited given their invasiveness, relatively high cost, and difficulty of access (e.g., in rural areas). Facial imaging has the potential to overcome these barriers and become an important tool in the diagnosis of melancholia. In the following sections we overview methodological advances in facial imaging research, and highlight the utility of these methods in contributing to an objectively informed diagnostic tool for depressive disorders.

4 Methodological Considerations for Quantifying Facial Affect in Depression

Significant advances have been made in inferring affect from facial imaging over the past two decades [36–40]. In this section we discuss the general methodological approaches to affect recognition based on facial analysis. A typical facial analysis system has the following main components: face and fiducial point detection (localization of facial parts), feature extraction, and classification. Figure 1 depicts the main components of such a system. Once an image has been captured, face detection algorithms, such as the popular Viola-Jones (VJ) detector [41], are used to locate the face. Next, fiducial points are inferred using parametric models such as the widely used Active Appearance Models (AAM) [42]. Once the locations of facial parts such as the eyes and mouth are known, facial features are computed. Features can be extracted either at a holistic level or on individual parts of the face; in the latter case, the individual features are concatenated to construct the final feature vector. Further, dimensionality reduction methods such as Principal Component Analysis (PCA) can be applied to the feature vector, providing a compact and less noisy representation with greater discriminative power. The classifier then categorizes a given face into differing affective states (such as depressed/non-depressed, discrete emotion categories, or continuous valence/arousal labels). The components of a facial affect recognition (FAR) system are discussed in detail in the following sections.

Fig. 1 A typical FAR system
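
To make this pipeline concrete, the following is a minimal sketch of the stages in Fig. 1, assuming OpenCV and scikit-learn are available; the feature extractor, the label scheme and all parameter values are illustrative placeholders rather than the configuration of any system cited in this chapter.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Stage 1: face detection with OpenCV's pre-trained Viola-Jones cascade
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(grey_frame):
    """Return the largest detected face region, or None."""
    faces = detector.detectMultiScale(grey_frame, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return grey_frame[y:y + h, x:x + w]

# Stage 2: feature extraction (placeholder: resized grey-level values;
# in practice a geometric or appearance descriptor would be used here)
def extract_features(face_region):
    return cv2.resize(face_region, (64, 64)).astype(np.float32).ravel()

# Stages 3-4: dimensionality reduction and classification on labelled training
# data (X: stacked feature vectors, y: e.g. depressed / non-depressed labels)
def train_far_model(X, y, n_components=50):
    pca = PCA(n_components=n_components).fit(X)
    clf = SVC(kernel="linear").fit(pca.transform(X), y)
    return pca, clf

def predict(pca, clf, grey_frame):
    face = detect_face(grey_frame)
    if face is None:
        return None
    return clf.predict(pca.transform([extract_features(face)]))[0]
```

In practice the placeholder feature step would be replaced by one of the geometric or appearance descriptors discussed below.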

4.1 Face and Fiducial Points Detection

One of the most widely used face detectors is the classic VJ face detector [41]. It is based on a cascade-boosting framework [43] in which Haar-like features are extracted using an integral image; the integral image allows near real-time execution of the face detection method. The cascade classifier scans an image with a sub-window at different scales and localizes the regions that the classifier labels as faces. The AdaBoost algorithm is applied to select discriminative Haar features during the training phase of the classifier. The cascade consists of a sequence of weak classifiers, which together act as a strong classifier; the use of weak classifiers early on allows fast rejection of patches that do not resemble faces. The open source computer vision library OpenCV contains an implementation of the VJ object detector. For multi-view face detection, multiple models are learned for different head poses [44]; these select features using forward feature selection before training the cascade classifier.
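
To illustrate why the integral image makes Haar-like features cheap to evaluate, the sketch below computes a simple two-rectangle feature using four lookups per rectangle; the specific feature geometry is an assumption for illustration and is not a feature selected by AdaBoost in the actual cascade.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum with a zero row/column prepended, so any rectangle
    sum can be read off with four lookups."""
    ii = np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, y, x, h, w):
    """Sum of pixel values in the h x w rectangle with top-left corner (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    """A two-rectangle Haar-like feature: top half minus bottom half."""
    half = h // 2
    return rect_sum(ii, y, x, half, w) - rect_sum(ii, y + half, x, half, w)

# Example: evaluate the feature on a random 24 x 24 patch in constant time
patch = np.random.randint(0, 256, (24, 24))
ii = integral_image(patch)
print(haar_two_rect_vertical(ii, 0, 0, 24, 24))
```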

There are several other face detection methods, including energy-based models that infer the face location and head pose simultaneously [45], and vector-boosting methods in which a tree representation divides the face space into smaller subspaces [46]. Turning to fiducial point localization, the power of non-rigid deformable models stems from the low-dimensional representation of the shape and texture of a face that they provide. One of the earliest deformable model methods is the Active Contour Model [47]. The Active Shape Model (ASM) algorithm [48] models the shape of an object and has been used extensively in face tracking. During training, the landmark points of all input samples are aligned into a common coordinate frame using Procrustes analysis, and the model is then computed by applying PCA over the aligned shapes. The shape of the model can be controlled via its deformation parameters. Fitting to a new image proceeds iteratively, finding the best match between the model boundary and the image and updating the locations of the model points accordingly.
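
Below is a minimal sketch of the shape-model side of ASM, assuming landmark annotations are already available; the alignment step removes only translation and scale (a simplified stand-in for full Procrustes analysis), so it illustrates the idea rather than reproducing the implementation of [48].

```python
import numpy as np

def align(shape):
    """Remove translation and scale from an (n_points, 2) landmark array."""
    centred = shape - shape.mean(axis=0)
    return centred / np.linalg.norm(centred)

def build_shape_model(shapes, n_modes=10):
    """shapes: list of (n_points, 2) arrays of corresponding landmarks."""
    X = np.stack([align(s).ravel() for s in shapes])   # one row per shape
    mean_shape = X.mean(axis=0)
    # PCA via SVD of the centred data matrix
    U, S, Vt = np.linalg.svd(X - mean_shape, full_matrices=False)
    modes = Vt[:n_modes]                               # deformation modes
    variances = (S[:n_modes] ** 2) / (len(shapes) - 1)
    return mean_shape, modes, variances

def synthesize(mean_shape, modes, b):
    """Generate a shape from deformation parameters b (one per mode)."""
    return (mean_shape + b @ modes).reshape(-1, 2)
```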

4.2 Active Appearance Models

Active Appearance Models (AAM) extend the ASM: they model not only the shape but also the grey-level appearance. During training, the grey-level appearance is modeled by warping each training image, using a triangulation algorithm, so that it is aligned to the mean shape; the grey-level information is then sampled and PCA is applied to the samples. The model can then be fitted to a new image using an optimization algorithm driven by the difference between the intensities of the learnt model and the target image. AAM fitting algorithms can be broadly classified into two classes: generative and discriminative. In generative fitting (Fixed Jacobian [42], Project-Out Inverse Compositional [49], Simultaneous Inverse Compositional [50]), a measure of fit between the model’s texture and the warped image region is optimized directly. In discriminative fitting (Iterative Error Bound Minimization [51], Haar-like Feature Based Iterative Discriminative Method [52]), a mapping is learned from image features to parameter updates, using features extracted at parameter settings perturbed from their optimal values in each training image. A disadvantage of AAM is limited generalizability to unobserved subjects. AAMs can be classified as subject-dependent or subject-independent: subject-dependent AAM is best suited to scenarios in which the training and test images contain the same subjects, whereas subject-independent modeling is for situations where the subjects in the training and test sets differ. In some of the earliest work in automatic depression analysis, McIntyre et al. [53] and Cohn et al. [54] used subject-dependent AAM for facial part detection; in these studies, the facial points were used as feature descriptors for learning a classifier.

Constrained Local Models (CLM) [55] extend the AAM approach by dividing the texture into local blocks, which aids generalization and improves subject-independent performance. Subject-dependent AAM methods still perform better than subject-independent CLM [56]; however, current state-of-the-art descriptors compensate for the small errors introduced by subject-independent CLM [56]. Another limitation of both AAM and CLM is their requirement for large volumes of labeled training data covering different conditions of illumination, pose and expression; moreover, labeling fiducial points is a laborious and error-prone manual task.

4.3 Pictorial Structure

In the Pictorial Structure (PS) framework [57], an object is represented as a graph with $n$ vertices $V = \{v_1, \ldots, v_n\}$ for the parts and a set of edges $E$, where each pair $(v_i, v_j) \in E$ encodes the spatial relationship between parts $i$ and $j$. For a given image $I$, PS learns two models. The first learns the evidence for each part as an appearance model, where each part is parameterized by its location $(x, y)$, orientation $\theta$, scale $s$, and foreshortening; all of these parameters (together referred to as $D$) are learned from exemplars and produce a likelihood model for $I$. The second learns the kinematic constraints between each pair of parts as a prior configuration model, where $L$ denotes the configuration of the parts. Given the two models, the posterior distribution over the whole set of part locations is:

$$ p\left(L \mid I, D\right) \;\propto\; p\left(I \mid L, D\right)\, p\left(L \mid D\right) $$
(1)

where $p(I \mid L, D)$ measures the likelihood of the image $I$ under a particular configuration and $p(L \mid D)$ is the kinematic prior over configurations. A major problem with this framework is the low contribution of occluded parts, resulting in erroneous or missed detection of these parts and hence inaccurate pose estimation. Everingham and colleagues [58] proposed a PS-based fiducial point detector, which is initialized using the VJ face detector. Zhu and Ramanan [59] later proposed an extension to the PS framework by adding mixtures representing different face poses; this latter approach performs face detection, fiducial point detection and head pose inference within a single framework, and its face detector outperforms the VJ face detector. The disadvantage of the PS-based method as used by Everingham et al. [58] is that it requires initialization from a face detector such as VJ, whereas Zhu and Ramanan [59] overcome this limitation by treating the multiple poses as mixture detectors.
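
For completeness, a common way to make inference over Eq. (1) tractable is to restrict the graph to a tree, in which case the prior and likelihood are typically assumed to factorize over edges and parts respectively; the factorization below is a standard sketch of this assumption rather than the exact formulation of [57]:

$$ p\left(L \mid D\right) = \prod_{(v_i, v_j) \in E} p\left(l_i \mid l_j, D\right), \qquad p\left(I \mid L, D\right) = \prod_{i=1}^{n} p\left(I \mid l_i, D\right) $$

where $l_i$ denotes the parameters (location, orientation, scale) of part $v_i$. With a tree structure, the maximum a posteriori configuration can then be found efficiently by dynamic programming over the parts.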

Selecting the appropriate face and fiducial point detector is problem-driven. For example, in the case of affect analysis, it is desirable that the system generalizes over subjects. Joshi et al. [60] used the PS framework of Everingham and colleagues [58] for fiducial point detection. Even though the Mixture of Pictorial Structures (MoPS) framework performs better than the PS method, our own research group [60] prefers PS, as MoPS inference is substantially slower than the Everingham et al. [58] method, which matters when analyzing long-duration depression video clips. Furthermore, such work can utilize spatio-temporal descriptors (Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) [61]) that compensate for detector error. On the other hand, applications such as facial performance transfer [62] require highly accurate fiducial points; Asthana and colleagues [62] use subject-dependent AAM models as they are more accurate than both CLM and subject-independent AAM.

4.4 Facial Descriptors

Once the face and facial part locations are identified and aligned, feature descriptors are computed to extract information for learning classifiers. FAR techniques can be segregated on the basis of the type of descriptor used, broadly classified as either geometric- or appearance-based features. Geometric features [63–65] correspond to facial points and the locations of different facial parts, whereas appearance features generally correspond to face texture information [61, 66, 67]. Facial feature descriptors can also be divided on the basis of temporal information (i.e., spatio-temporal descriptors [61] versus frame-based descriptors [67]). A popular method for modeling geometric features is based on Facial Animation Parameters (FAP), defined in the MPEG-4 video-coding standard. Lavagetto and Pockaj [68] present a method for synthesizing facial animations using FAP and Facial Definition Parameters (FDP). Similarly, Sebe and colleagues [69] use the Piecewise Bézier Volume Deformation (PBVD) tracker for tracking facial parts, with the motion information between two consecutive frames measured using template matching.

Different feature and classifier combinations can also be compared, offering insight into their utility for depression classification. Asthana et al. [70] compute geometric features by fitting AAM models to input faces; they compared various AAM fitting techniques, experimenting on the Cohn-Kanade (CK+) database. One limitation of this work is that it required manual initialization of facial parts. Dhall and colleagues [63] propose the geometric descriptor “Emotion Image” (EI), which constructs a visual map from an undirected graph derived from a facial point detector. The EIs of two faces (images) are compared using the Structural Similarity Index Metric (SSIM) of Wang et al. [71], allowing computation of their similarity, and the approach is applied to the problem of expression-based album creation. The discriminative ability of EI therefore depends on the quality of fiducial point detection, which may introduce errors when the detected points are not accurate. Valstar and Pantic [72] showed that the performance of geometric features is similar to that of appearance features. However, the limitation of geometric features stems from their dependence on accurate facial part locations: facial part detection is relatively accurate on lab-controlled data, but remains an open problem for images captured in real-world conditions, and any detection error generally propagates into the geometric feature representation. Chew et al. [56] argue that appearance descriptors can compensate for errors produced by facial part detectors to some extent. Popular appearance descriptors are described below.
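
Before moving to appearance descriptors, the sketch below illustrates the SSIM comparison step mentioned above on two roughly aligned face images, using scikit-image; the library choice, file names and resizing step are assumptions for illustration, not the EI pipeline of [63].

```python
import cv2
from skimage.metrics import structural_similarity as ssim

# Hypothetical input images of two faces (grey-scale, roughly aligned)
img_a = cv2.imread("face_a.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("face_b.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.resize(img_b, (img_a.shape[1], img_a.shape[0]))

# SSIM returns 1.0 for identical images; lower values indicate dissimilarity
score = ssim(img_a, img_b)
print(f"SSIM similarity: {score:.3f}")
```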

4.4.1 Local Binary Patterns (LBP)

The LBP family of descriptors has been used extensively in computer vision for texture and face analysis [59, 73, 74]. The LBP descriptor assigns binary labels to pixels by thresholding the neighborhood pixels against the central value. Thus, for a center pixel with grey value $p$ and its $k$ neighboring pixel values $N_i$, a decimal value $d$ is assigned:

$$ d = \sum_{i=1}^{k} 2^{\,i-1}\, s\left(p, N_i\right), \qquad \text{where } s\left(p, N_i\right) = \begin{cases} 1 & \text{if } p < N_i \\ 0 & \text{otherwise} \end{cases} $$
(2)
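
A minimal sketch of this computation for a 3 × 3 neighborhood (k = 8) is given below; the neighbor ordering is an arbitrary choice for illustration, and image borders are simply skipped.

```python
import numpy as np

def lbp_pixel(patch):
    """patch: 3x3 grey-level array; returns the LBP code of the centre pixel."""
    centre = patch[1, 1]
    # 8 neighbours in a fixed clockwise order starting at the top-left
    neighbours = np.array([patch[0, 0], patch[0, 1], patch[0, 2],
                           patch[1, 2], patch[2, 2], patch[2, 1],
                           patch[2, 0], patch[1, 0]])
    bits = (neighbours > centre).astype(np.uint8)   # 1 where p < N_i
    weights = 2 ** np.arange(8)                     # 2^(i-1) for i = 1..8
    return int(np.dot(bits, weights))

def lbp_image(img):
    """Apply lbp_pixel over an entire grey-level image (borders skipped)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y - 1, x - 1] = lbp_pixel(img[y - 1:y + 2, x - 1:x + 2])
    return out
```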

4.4.2 Local Binary Pattern-Three Orthogonal Planes (LBP-TOP)

Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) [61] is a popular spatio-temporal descriptor in computer vision. It considers patterns in three orthogonal planes, XY, XT and YT, and concatenates the pattern co-occurrences from these three directions. As with LBP, binary labels are assigned by thresholding neighborhood pixels against the central value. Thus, for a center pixel $O_p$ of an orthogonal plane $O$ and its $k$ neighboring pixel values $N_i$, a decimal value $d$ is assigned:

$$ d = \sum_{O \in \{XY,\, XT,\, YT\}} \; \sum_{p} \; \sum_{i=1}^{k} 2^{\,i-1}\, s\left(O_p, N_i\right) $$
(3)
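
The sketch below illustrates the idea on a grey-level video volume, reusing the lbp_image helper from the previous sketch; taking a single central slice per plane and 256-bin histograms are simplifying assumptions for illustration, not the exact procedure of [61].

```python
import numpy as np

def lbp_hist(img, n_bins=256):
    """Normalized histogram of LBP codes for one plane."""
    codes = lbp_image(img)
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)

def lbp_top(volume):
    """volume: (T, H, W) array; returns concatenated XY, XT and YT histograms."""
    T, H, W = volume.shape
    xy = lbp_hist(volume[T // 2, :, :])   # spatial plane at the mid time step
    xt = lbp_hist(volume[:, H // 2, :])   # temporal plane at the mid row
    yt = lbp_hist(volume[:, :, W // 2])   # temporal plane at the mid column
    return np.concatenate([xy, xt, yt])   # 3 x 256-dimensional descriptor
```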

Joshi et al. [75] proposed an LBP-TOP based framework for analyzing depression data. The video clips were divided into temporal slices and LBP-TOP was computed on each slice; temporal slicing helps to encode spatio-temporal changes. A Bag-of-Words (BoW) representation was then learnt from the LBP-TOP descriptors of each temporal slice of an interview-based video (a ‘document’). BoW representations come from the domain of document processing: a BoW feature represents a document (here, an image or video) as an unordered set of word frequencies. Li and colleagues [76] were the first to use a BoW for FAR—they fused PHOG- and BoW-based histograms constructed from a dictionary based on the Scale Invariant Feature Transform (SIFT). Even though BoW-based vectors can represent the frequency of different stages of an expression, temporal sequencing information is still missing. To overcome this problem, a data-driven technique was recently proposed to explicitly encode temporal information using n-grams: Bettadapura et al. [77] performed experiments on human action recognition and activity analysis and showed that adding temporal sequencing information in this way increases the accuracy of BoW-based techniques.
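
A minimal sketch of such a slice-then-quantize BoW representation is given below, reusing the lbp_top function from the previous sketch; the slice length, dictionary size and use of k-means from scikit-learn are assumptions for illustration rather than the settings of [75].

```python
import numpy as np
from sklearn.cluster import KMeans

def slice_descriptors(volume, slice_len=50):
    """Cut a (T, H, W) video into temporal slices and describe each with LBP-TOP."""
    return np.stack([lbp_top(volume[t:t + slice_len])
                     for t in range(0, volume.shape[0] - slice_len + 1, slice_len)])

def learn_dictionary(all_descriptors, n_words=64):
    """Cluster slice descriptors pooled over the training videos into 'words'."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bow_histogram(video_descriptors, dictionary):
    """Quantize one video's slice descriptors and return its word-frequency vector."""
    words = dictionary.predict(video_descriptors)
    hist = np.bincount(words, minlength=dictionary.n_clusters)
    return hist / max(hist.sum(), 1)
```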

Facial descriptors computed from the output of a facial part detection module can be referred to as explicit modeling of affect, in that an explicit (face) model is used to localize the facial parts. In a different approach (implicit modeling), Joshi et al. [60] proposed computing Space Time Interest Points (STIP) [78] on the upper body of depressed subjects in video frames. STIP are widely used in the computer vision community for human action recognition; they are salient spatio-temporal locations in a video where a change has occurred, and are based on a 3D extension of the 2D Harris interest point detector. Once an interest point is detected, Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF) descriptors are computed around it. Joshi et al. [75] computed STIP on the upper body and compared the performance with face-area-only STIP in depressed subjects, and a BoW was then learnt on the HOG and HOF features generated around the interest points. Figure 2 shows the STIP computation on a sample from the FEEDTUM database [79].

Fig. 2 STIP computation on a video from the FEEDTUM database [79]. The blocks represent the gradient information around the generated interest points. The HOG and HOF features represent the local spatio-temporal movements

Note that the yellow ellipses around the eyes and mouth are the generated interest points. The regions of interest around the interest points are scaled up, and gradient information is shown (as demonstrated previously [80]). It is easy to notice that the two gradient blocks around the same region of interest (i.e., the right corner of the mouth) show different gradients due to changes in facial expression. The flow information here describes the motion change around the interest point at the right corner of the mouth. Hence, STIP capture local movements at a holistic (full-face) level, and the information is captured implicitly, without requiring output from a facial part localization method.

4.5 Objective Identification of Melancholia

Many of the above methods have recently been used to facilitate better identification of those with depression. Using facial action units, Williamson and colleagues [81] showed that depression was associated with changes in the coordination and movement of facial gestures. When used in multivariate modeling, such facial features were predictive of self-reported depressive symptomatology, highlighting the utility of the approach in quantifying the depressive syndrome. There is also evidence that facial landmark fitting (66 points fitted to the face on each frame) allows robust prediction of affective dimensions (i.e., valence/arousal) and global depression state (measured through self-rated depression scores) [82]. In addition, the predictive capacity of facial imaging in detecting depression appears to increase with the addition of other ‘affective computing’ data modalities: Pérez-Espinosa and colleagues [83] demonstrated that fusing affective dimensions and audiovisual features (including facial imaging) allowed accurate depression recognition models to be constructed. Whilst no studies have directly examined the performance of the methods above in classifying different types of depression such as melancholia, we propose that they will likely be of significant benefit in future studies. Indeed, our own research team recently completed recruitment for a large facial imaging study of those with melancholic depression, those with non-melancholic depression, and healthy controls, to determine whether the above methods can accurately classify these groups. Preliminary analyses using data from this study have been completed [84], and analyses are now underway to determine whether melancholia can be detected on the basis of its unique, and quantifiable, facial features.

5 Summary and Future Trends

In this chapter we have reviewed the role of facial imaging technologies in detecting affective states, and specifically the potential of emerging methods for classifying depressive disorders. Methodological advances and continuing algorithm development in the field of affective computing have the potential to offer unique insights regarding depression detection. Objective characterization of facial expressions with specificity to melancholia will be the next step in determining the overall clinical utility of these methods. Based on the literature to date, facial imaging technologies are well positioned to contribute to the assessment and monitoring of depressive disorders.