1 Introduction

To simplify the diagnosis process and alleviate the shortage of medical resources, computer-aided analysis of the information generated during medical diagnosis has become a focus of both academia and industry [4, 12]. Most related work on computer-aided diagnosis focuses on the doctor’s diagnosis of the patient’s condition, such as analyzing the patient’s medical images [5, 19] or the doctor’s performance when communicating with the patient [1, 22]. However, most current computer-aided methods only analyze information from the doctor’s diagnosis process to learn medical patterns, ignoring the fact that medical diagnosis is a comprehensive, complicated, multi-step judgment process.

Pre-diagnosis is one of the most important procedures in the diagnosis process, but how to combine pre-diagnosis information with a computer-aided diagnosis model to obtain more accurate conclusions has not yet received enough attention. We therefore propose the Multiple Graph Regularized Diagnosis (MuGRED) method to integrate the pre-diagnosis information with the information from the diagnosis process. The proposed MuGRED method consists of two main components: a multimodal representation learning module, which learns the representation of patients from the diagnosis process, and a multiple graph feature fusion module, which fuses the information from various kinds of pre-diagnosis with that from the diagnosis process to reach the final diagnosis conclusion.

In this paper, we focus on the diagnosis of mental disorders. We evaluate our method’s effectiveness on two datasets: a practical dataset for ADHD and a well-recognized mental disorder dataset for anxiety and depression. We introduce the pre-diagnosis results into these two multimodal behavior analysis problems and compare our method with state-of-the-art methods. Experimental results show that our method makes better use of pre-diagnosis information to achieve a more accurate diagnosis. The main contributions of this work can be summarized as follows:

  • We propose a novel end-to-end Multiple Graph Regularized Diagnosis model for mental disorder diagnosis. To the best of our knowledge, it is the first work to connect pre-diagnosis information with machine-learning-based multi-modal representation learning.

  • According to similar pre-diagnosis patterns, we construct multiple graphs to represent the connection between patients. The graphs can regularize the patients’ representations extracted from the diagnosis process and generate the optimized feature with a feature fusion mechanism.

  • Extensive experiments on two datasets of mental disorders show that our way of introducing pre-diagnosis information can effectively improve diagnostic performance, and the introduction of the pre-diagnosis information conforms to intuition and medical logic.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces our proposed multiple graph regularized diagnosis method. Section 4 gives extensive experimental results. Finally, Section 5 concludes the paper.

2 Related Work

2.1 Computer-Aided Diagnosis

In recent years, deep learning has been widely used in the medical field, with most work focusing on medical image analysis. There are also some works on mental disorder diagnosis. For example, [11] used computer vision analysis to predict attention deficit and hyperactivity disorder (ADHD) and autism spectrum disorder (ASD); they considered the participant’s facial expression, head position, movement, etc., and used a support vector machine (SVM) to classify ADHD and ASD. [9] build a multi-modal method that uses 3D facial expressions and spoken language with C-CNNs to predict the PHQ scale score. [18] process acceleration signals from the wrist and ankle with an RNN to distinguish the behavior of participants diagnosed with ADHD from that of normal ones. [2] use clinical conversation messages to train a text-based multi-task BiLSTM network that models both depression severity and binary health state. [17] try to diagnose major depressive disorder (or simply depression) from electroencephalograms using logistic regression, support vector machines, and naive Bayes. [24] describe an intelligent auxiliary diagnosis system based on multimodal information fusion to diagnose ADHD. [26] use a cross-task approach that transfers attention from speech recognition to depression severity measurement. [3] use an atrous residual temporal convolutional network and temporal fusion based on visual behavior to capture depression signals. These works focus on the analysis of the diagnosis process and ignore the integration of pre-diagnosis information, which is not enough to support a complete ADHD diagnostic logic chain.

2.2 Graph-Based Information Fusion

Constructing a graph to fuse features between related entities is widely applied in various applications. [20] build a multi-modal heterogeneous graph for recommendation. [23] design a heterogeneous graph neural network that jointly considers the heterogeneous structural information and the content information of each node. Graph-based fusion models that simulate human logic have also been widely applied in computer-aided diagnosis. For example, [15] establish the interaction between individuals and groups through a graph convolutional network to assist in diagnosing COVID-19. [25] introduce a graph-based semi-supervised model that uses partly labeled samples to diagnose dementia. [16] construct a knowledge graph to assist diagnosis based on doctor experience, case data, and other information. These works demonstrate the feasibility and research value of graph-network-assisted intelligent diagnosis.

We combine the pre-diagnosis information with the diagnostic analysis process. First, a temporal neural network fuses the patient’s various characteristics during the test. Then a graph neural network is built from the pre-diagnosis information, and the features are optimized through an attention mechanism to improve prediction performance.

Fig. 1. Overview architecture of Multiple Graph REgularized Diagnosis (MuGRED).

3 Method

This paper proposes the MuGRED (Multiple Graph REgularized Diagnosis) model for mental disorder diagnosis. The diagnosis of a mental disorder can be summarized in two steps: pre-diagnosis and diagnosis. During pre-diagnosis, patients are required to complete some questionnaires, and the doctors get a preliminary understanding of their psychological condition from the quantitative indicators. During diagnosis, doctors interview patients or ask them to complete specific tasks and observe their multi-modal behaviors, such as eye movements, facial expressions, head posture, body movements, conversation content, or voice intonation. The doctors consider both the pre-diagnosis information and the multi-modal behavior observed during the diagnosis process to draw the final diagnosis conclusion.

The proposed MuGRED model imitates the above diagnosis process of human doctors. As Fig. 1 shows, the MuGRED model consists of three modules: the Multimodal Representation Learning module, the Multiple Graph Construction module, and the Multiple Graph Feature Fusion module. The Multimodal Representation Learning module extracts the behavior representation of the diagnosis process from the frame-level temporal multi-modal feature sequence. The Multiple Graph Construction module introduces the pre-diagnosis information to construct the correlation graph, in which patients are connected by edges according to similar patterns in their pre-diagnosis results. The Multiple Graph Feature Fusion module first performs an intra-correlation feature fusion to regularize the representation of patient behavior and then performs a cross-correlation feature fusion to integrate the features regularized with different pre-diagnosis indicators for the final diagnosis conclusion. In the following, we introduce the technical details of the MuGRED model.

3.1 Multi-modal Representation Learning

The inputs of the MuGRED model come from the diagnostic process. The patient behaviors are recorded as multimedia data, and we extract the patient’s behavioral features according to the needs of the diagnostic logic, such as gaze position, facial Action Units (FAUs) [8], facial landmarks, head pose, etc. The multi-modal behavior of the entire process is represented as a temporal feature sequence \(\textbf{X} = [\textbf{x}_1,\textbf{x}_2, \cdots ,\textbf{x}_T]^{\top }\) with T frames, where \(\textbf{x}_i\) is the frame-level feature at timestamp i.

We need to extract a person-level feature that describes patient behavior from the temporal feature sequence. First, we split the sequence into N fragments with the same number of frames (overlaps between fragments are allowed), so that the temporal behavior feature sequence is transformed into a set of fragment sequences: \(\textbf{C} = \{\textbf{c}_1,\textbf{c}_2, \cdots , \textbf{c}_N\}\). For each fragment sequence \(\textbf{c}_i\), we summarize and incorporate the contextual information of each frame with a temporal neural network, and transform the sequence of frame-level features \(\textbf{c}_i\) into a hidden state sequence \(\textbf{h}_i\), denoted as:

$$\begin{aligned} \textbf{h}_{i} = \mathbf {\Phi }(\textbf{c}_{i}, \varTheta ) \end{aligned}$$
(1)

where \(\mathbf {\Phi }\) can be any temporal encoder with parameters \(\varTheta \), such as a Temporal Convolutional Network (TCN) [14] or a Long Short-Term Memory network (LSTM) [10]. Then we aggregate the hidden states of the frames with a self-attention mechanism to generate the fragment representation \(\textbf{r}_i\) as:

$$\begin{aligned} \textbf{r}_i = \frac{1}{M}\sum _{j=1}^{M}\textbf{w}_{i,j}\textbf{h}_{i,j} \end{aligned}$$
(2)

where \(\textbf{h}_{i,j}\) is the hidden state of the j-th frame in \(\textbf{h}_i\), \(\textbf{w}_{i,j}= \frac{e^{\textbf{h}_{i,j}}}{\sum _{j'=1}^{M}e^{\textbf{h}_{i,j'}}}\) is the attention weight, and M is the number of frames in the fragment sequence \(\textbf{c}_i\). We thus obtain the sequence of fragment features \(\textbf{r} = [\textbf{r}_1,\textbf{r}_2,\cdots ,\textbf{r}_N]\). By repeating the process that transforms the frame-level feature set \(\textbf{c}_i\) into the fragment feature \(\textbf{r}_i\), we transform the fragment-level feature set \(\textbf{r}\) into the person-level representation \(\textbf{v}_{Behavior}\).
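The two-level aggregation described above can be illustrated with a minimal sketch in plain Python. It is written under stated assumptions and is not the implementation used in the experiments: the function names, the fragment length and the stride are hypothetical, and the temporal encoder \(\mathbf {\Phi }\) is replaced by an identity stand-in so that the example stays self-contained. The attention weights are taken as an element-wise softmax over the items of a sequence, as in Eq. (2).

```python
import math

def split_fragments(frames, frag_len, stride):
    """Split frame-level features into equal-length fragments.

    A stride smaller than frag_len yields overlapping fragments.
    """
    return [frames[s:s + frag_len]
            for s in range(0, len(frames) - frag_len + 1, stride)]

def attention_pool(hidden):
    """Eq. (2): element-wise softmax over the M items, then (1/M) * weighted sum."""
    m, dim = len(hidden), len(hidden[0])
    pooled = []
    for d in range(dim):
        col = [h[d] for h in hidden]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        pooled.append(sum(e / total * v for e, v in zip(exps, col)) / m)
    return pooled

def encode(sequence):
    """Hypothetical stand-in for the temporal encoder Phi (identity mapping)."""
    return sequence

def person_level_feature(frames, frag_len, stride):
    """Frames -> fragment features r_i -> person-level feature v_Behavior."""
    fragments = split_fragments(frames, frag_len, stride)
    frag_feats = [attention_pool(encode(c)) for c in fragments]
    return attention_pool(encode(frag_feats))

# Toy sequence: 10 frames with 2 features each (made-up values).
frames = [[0.1 * t, 1.0] for t in range(10)]
fragments = split_fragments(frames, frag_len=4, stride=2)
v_behavior = person_level_feature(frames, frag_len=4, stride=2)
```

With a TCN or LSTM in place of `encode`, the same pooling would be applied to the hidden states rather than to the raw features.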

3.2 Multiple Graph Construction

After generating the person-level feature, we introduce the pre-diagnosis information. Given P kinds of pre-diagnosis patterns, the pre-diagnosis result of the patient i is represented as \(\textbf{R}_i=\{\textbf{R}_i^1, \textbf{R}_i^2, \cdots , \textbf{R}_i^P \}\).

Suppose the behavioral representations of \(\hat{N}\) previous patients are known. We build the correlation hypergraph \(G(\hat{\textbf{V}}, \textbf{E})\) to represent the connections among these patients, where the node set \(\hat{\textbf{V}}= \{ \hat{\textbf{v}}_1, \hat{\textbf{v}}_2, \cdots , \hat{\textbf{v}}_{\hat{N}} \}\) denotes the person-level features of the previous cases. As there are P kinds of pre-diagnosis patterns, the graph includes P kinds of edges, denoted as \(\textbf{E}=\{E^1, E^2, \cdots , E^P\}\). The p-th kind of correlation between nodes \(\hat{\textbf{v}}_i\) and \(\hat{\textbf{v}}_j\) is denoted as:

$$\begin{aligned} e_{ij}^p = {\left\{ \begin{array}{ll} 0&{}\textbf{R}_i^p\ne \textbf{R}_j^p\\ 1&{}\textbf{R}_i^p = \textbf{R}_j^p. \end{array}\right. } \end{aligned}$$
(3)

where \(e^p_{ij} \in E^p\) and \(i,j=1,2, \cdots , \hat{N}\). That is, if the nodes \(\hat{\textbf{v}}_i\) and \(\hat{\textbf{v}}_j\) have the same result on the p-th pre-diagnosis indicator, we define them as correlated in the corresponding aspect, and they are connected by the p-th kind of edge in the graph. For example, if two people both have severe insomnia, we construct their correlation in the insomnia aspect. Note that a pair of nodes may share multiple correlations, which indicates a stronger correlation.
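The edge rule of Eq. (3) can be illustrated with a short sketch in plain Python. The function name `build_edges` and the toy pre-diagnosis results (gender and an insomnia level) are hypothetical and serve only as an example; one 0/1 adjacency matrix is built per pre-diagnosis pattern.

```python
def build_edges(results):
    """Build one adjacency matrix per pre-diagnosis pattern, following Eq. (3).

    results[i][p] is the result of patient i on the p-th pattern.
    The returned edges[p][i][j] is 1 if patients i and j agree on pattern p.
    """
    n = len(results)
    num_patterns = len(results[0])
    return [[[1 if results[i][p] == results[j][p] else 0 for j in range(n)]
             for i in range(n)]
            for p in range(num_patterns)]

# Hypothetical pre-diagnosis results: (gender, insomnia level).
results = [
    ("F", "severe"),
    ("M", "severe"),
    ("F", "mild"),
]
edges = build_edges(results)

# Number of patterns shared by each pair; a larger value means a stronger correlation.
strength = [[sum(edges[p][i][j] for p in range(len(edges)))
             for j in range(len(results))]
            for i in range(len(results))]
```

In this toy case, patients 0 and 1 are connected only by the insomnia edge, patients 0 and 2 only by the gender edge, and patients 1 and 2 are not connected at all.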

3.3 Multiple Graph Feature Fusion

In the previous section, we used the pre-diagnosis information to construct an association graph \(G(\hat{\textbf{V}}, \textbf{E})\) between existing cases. Next, we obtain the regularized patient behavior feature with a two-step operation. First, for each correlation constructed from a specific pre-diagnosis pattern, we perform an intra-correlation feature fusion to regularize the behavior feature; then we perform a cross-correlation feature fusion to aggregate the set of features regularized with the single correlations.

Intra-correlation Feature Fusion. Having obtained the person-level behavioral feature vector \(\textbf{v}_{Behavior}\) of the current patient, we regularize the feature representation with the correlation graph. We add the node \(\textbf{v}_{Behavior}\) to the correlation graph and connect its edges following the above rule, i.e., according to the consistency of the pre-diagnosis results.

For the p-th kind of pre-diagnosis pattern, the feature \(\textbf{v}_{Beh}\) (short for \(\textbf{v}_{Behavior}\)) is optimized on the sub-graph of G denoted as \(g_p=(\textbf{V}, \textbf{E}^p)\). We aggregate the feature representations correlated with the current case using a graph attention network on each pre-diagnosis indicator. The message passing from the node \(\mathbf {\hat{v}}_i\) to the node \(\textbf{v}_{Beh}\) is modeled as a multi-head attention mechanism, expressed as:

$$\begin{aligned} \begin{aligned} \xi _{k,i}^p&= LeakyReLU((\textbf{W}_{self}^k\textbf{v}_{Beh}) \cdot (\textbf{W}_{obj}^k\mathbf {\hat{v}}_i)^T),\ \hat{e}^p_i \ne 0 \\ \alpha _{k,i}^p&= Softmax_i(\xi _{k,i}^p/s_{scale})\\ \textbf{s}^p&= \Big (\mathop {\parallel } \limits _{k=1}^K\sigma \big (\sum _{i:\, \hat{e}^p_i \ne 0} \alpha _{k,i}^p\textbf{W}^k_{fea} \mathbf {\hat{v}}_i\big )\Big ) \textbf{W}^O \end{aligned} \end{aligned}$$
(4)

where \(\hat{e}^p_i\) is the p-th kind of edge between the current patient and the previous patient i, defined as in Eq. (3), K is the number of attention heads, and \(\textbf{W}_{self}^k\), \(\textbf{W}_{obj}^k\) and \(\textbf{W}_{fea}^k\) are the linear transformation parameters of the k-th attention head. \(\parallel \) represents the concatenation operation, and the hyper-parameter \(s_{scale}\) is a scaling factor that adjusts the sensitivity to feature differences in practical applications. Finally, a matrix \(\textbf{W}^O\) is applied for dimensionality reduction.
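To make the data flow of Eq. (4) concrete, the following is a minimal sketch in plain Python for a single pre-diagnosis pattern. It rests on several assumptions that the text does not fix: the activation \(\sigma \) is taken to be a sigmoid, the LeakyReLU slope is set to 0.2, the weights are random rather than learned, and all names (`intra_fusion`, `rand_matrix`, the dimensions) are hypothetical.

```python
import math
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def intra_fusion(v_beh, prev_feats, edge_row, heads, w_out, s_scale=1.0):
    """Regularize v_beh on one pre-diagnosis pattern, following Eq. (4).

    edge_row[i] is 1 if the current patient has the same result as previous
    patient i on this pattern, else 0. heads is a list of
    (W_self, W_obj, W_fea) triples, one per attention head.
    """
    neighbours = [v for v, e in zip(prev_feats, edge_row) if e != 0]
    if not neighbours:
        raise ValueError("the current patient has no neighbour on this pattern")
    concat = []
    for w_self, w_obj, w_fea in heads:
        query = matvec(w_self, v_beh)
        scores = []
        for v in neighbours:
            xi = sum(a * b for a, b in zip(query, matvec(w_obj, v)))
            xi = xi if xi > 0 else 0.2 * xi  # LeakyReLU, slope 0.2 assumed
            scores.append(xi / s_scale)
        alphas = softmax(scores)
        messages = [matvec(w_fea, v) for v in neighbours]
        for d in range(len(messages[0])):
            agg = sum(a * m[d] for a, m in zip(alphas, messages))
            concat.append(1.0 / (1.0 + math.exp(-agg)))  # sigma assumed sigmoid
    # Row vector times W^O reduces the concatenated heads to the output size.
    return [sum(c * row[col] for c, row in zip(concat, w_out))
            for col in range(len(w_out[0]))]

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

dim, hidden, num_heads = 3, 2, 2
heads = [(rand_matrix(hidden, dim), rand_matrix(hidden, dim), rand_matrix(hidden, dim))
         for _ in range(num_heads)]
w_out = rand_matrix(num_heads * hidden, dim)

v_beh = [0.2, -0.1, 0.4]
prev_feats = [[0.3, 0.0, 0.5], [-0.6, 0.9, 0.1], [0.1, -0.2, 0.3]]
edge_row = [1, 0, 1]  # previous patient 1 differs on this pattern
s_p = intra_fusion(v_beh, prev_feats, edge_row, heads, w_out, 2.0)
```

Because attention is restricted to connected nodes, the features of previous patient 1 have no influence on `s_p` in this example.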

According to the above setting, we construct the correlation graph of the previous patients through P kinds of pre-diagnosis indicators and obtain an optimized representation of the current case through each indicator. We thus get the regularized feature set \(S = \{\textbf{s}^1, \textbf{s}^2,\cdots ,\textbf{s}^P\}\).

Cross-correlation Feature Fusion. The features regularized through the various correlations of pre-diagnosis indicators describe the patient’s characteristics from different aspects. To generate the pre-diagnosis-related feature of the patient, we propose a cross-correlation feature fusion with an attention mechanism to fuse the features in S. The attention weight of each feature is computed as:

$$\begin{aligned} w^p = q^T \cdot \tanh (\textbf{W}\cdot \textbf{s}^p+b) \end{aligned}$$
(5)
$$\begin{aligned} \beta ^p = \frac{\exp (w^p)}{\sum _{p'=1}^P \exp (w^{p'})} \end{aligned}$$
(6)

where \(\textbf{W}\) is the weight matrix, b is the bias, q is the attention vector, and \(\beta ^p\) is the normalized weight. The final feature \(\textbf{v}_{Reg}\) is:

$$\begin{aligned} \textbf{v}_{Reg} = \sum \limits _{p=1}^P \beta ^p\cdot \textbf{s}^p \end{aligned}$$
(7)

Finally, we concatenate the feature directly extracted from the temporal behavioral feature sequence with the optimized feature, so as to consider both the patient’s direct behavior and the correlations with previous diagnostic experience related through pre-diagnosis. The final feature representation for making the diagnostic decision is denoted as \(\textbf{z} = [\textbf{v}_{Behavior} \ \Vert \ \textbf{v}_{Reg}]\) and is used to predict the diagnosis result.
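The cross-correlation fusion and the final concatenation can be sketched in plain Python as follows. This is an illustration under assumptions, not the trained model: each pattern is assumed to receive one score \(q^T \tanh (\textbf{W}\textbf{s}^p+b)\) that is normalized by a softmax over the P patterns, and the function name `cross_fusion` as well as all parameter and feature values are made up.

```python
import math

def cross_fusion(features, w, b, q):
    """Fuse the P regularized features s^p into v_Reg with attention weights.

    Each pattern gets a score q^T tanh(W s^p + b); the scores are normalized
    with a softmax over patterns and used to average the features.
    """
    scores = []
    for s in features:
        hidden = [math.tanh(sum(wij * sj for wij, sj in zip(row, s)) + bi)
                  for row, bi in zip(w, b)]
        scores.append(sum(qi * hi for qi, hi in zip(q, hidden)))
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    betas = [e / total for e in exps]
    dim = len(features[0])
    v_reg = [sum(beta * s[d] for beta, s in zip(betas, features))
             for d in range(dim)]
    return v_reg, betas

# Toy values: two patterns, two-dimensional features, identity weight matrix.
features = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
q = [1.0, 1.0]
v_reg, betas = cross_fusion(features, w, b, q)

v_behavior = [0.5, -0.5]
z = v_behavior + v_reg  # list concatenation, i.e. z = [v_Behavior || v_Reg]
```

In this symmetric toy case both patterns receive the same score, so the fused feature is the plain average of the two regularized features.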

4 Experiment

4.1 Dataset Description

We evaluate the performance of our proposed method on two datasets of mental disorders: a well-recognized mental disorder benchmark, DAIC-WOZ, and a practical dataset on attention deficit and hyperactivity disorder (ADHD), which are described as follows:

ADHD: We cooperated with The Children’s Hospital of Zhejiang University School of Medicine to collect a dataset of patients with Attention Deficit and Hyperactivity Disorder (ADHD), including 109 children for training and 15 for testing. Before the diagnosis begins, each patient and their parents must complete various psychological assessment scales during the pre-diagnosis process. With the patients’ informed consent, we record videos of these children’s behaviors during their diagnosis process with multiple cameras and extract their gaze, head poses, facial action units, facial landmarks, and body movements from the videos as a temporal behavioral feature sequence. In this work, we introduce the Conners Comprehensive Behaviour Rating Scale (Conners) [6] as the pre-diagnosis information. It is a questionnaire used to gain a better understanding of academic, behavioral, and social issues. The diagnostic conclusion comes from the result of the SNAP-IV Teacher and Parent Rating Scale (SNAP-IV) [21], which is an assessment of ADHD risk with scores ranging from 0 to 3 from three main perspectives: Inattention (Inatt for short), Hyperactivity-Impulsivity (H/Imp for short) and Oppositional Defiant Disorder (ODD for short).

DAIC-WOZ: A well-recognized mental disorder benchmark that contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety [7], depression, and post-traumatic stress disorder. The participants are required to communicate with an animated virtual interviewer controlled by a human interviewer in another room, and their behaviors are collected as multimedia records, including video, audio and text. Of the 189 participants, 107 are used as the training set, 35 as the development set and 47 as the test set. The diagnostic conclusion of DAIC-WOZ is evaluated with the depression index score of the Patient Health Questionnaire (PHQ-8) [13]. The questionnaire estimates the mental disorder with eight indicators: NoInterest, Depressed, Sleep, Tired, Appetite, Failure, Concentrating, and Moving. The score of each indicator ranges from 0 to 3 according to the severity, so the total score ranges from 0 to 24. In the experiment, we consider the multi-modal behavioral features of FAUs, head poses, gazes, and 2D facial landmarks extracted from the interview videos as the input temporal behavioral feature sequence. We regard gender and the indicators in the PHQ-8 scale as candidate pre-diagnosis information. Since there is a strong correlation between the label and the indicators, introducing all the indicators as pre-diagnosis information is meaningless. Instead, we use gender and several randomly selected indicators as pre-diagnosis knowledge.

4.2 Implementation Details

To simplify the training process, we apply a transfer learning strategy. We first train the temporal feature extraction model to predict the label directly for 100 epochs, and then transfer it to the Temporal Feature Extraction module of MuGRED, with gradient propagation through this module disabled, to train the rest of the model. We evaluate the diagnosis conclusion with the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE).
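For reference, the two evaluation metrics can be computed as in the following short sketch in plain Python. The function names and the toy scores are illustrative and are not taken from either dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: the average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up scores on a 0 to 24 scale, for illustration only.
y_true = [4, 10, 0, 16]
y_pred = [6, 10, 3, 12]

print(mae(y_true, y_pred))   # absolute errors 2, 0, 3, 4 -> 9 / 4 = 2.25
print(rmse(y_true, y_pred))  # squared errors 4, 0, 9, 16 -> sqrt(29 / 4)
```

RMSE penalizes large individual errors more strongly than MAE, which is why the two metrics are usually reported together.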

The hyper-parameters are the same for all the experiments. The model is optimized with the Adam optimizer, with the coefficients \(\lambda _1\) set to 0.9 and \(\lambda _2\) set to 0.999 and a weight decay of 1e-4. The initial learning rate is set to 1e-3. The final result of each experiment follows an early-stopping strategy, and the number of training epochs is 500.

4.3 Baselines

We consider three baseline methods for multi-modal representation learning. DepArt-Net [3] is one of the state-of-the-art methods on the DAIC-WOZ dataset; it applies an atrous temporal convolutional network to extract the multi-modal sentence embedding from the multi-modal feature sequence and then makes the diagnostic decision. In addition, HAN+TCN and HAN+LSTM are baselines that use a hierarchical attention network with a TCN and an LSTM sequential encoder, respectively.

Table 1. The RMSE and MAE scores comparison of the different pre-diagnosis information aggregation methods on DAIC-WOZ dataset and ADHD dataset

4.4 Results

Since MuGRED introduces extra pre-diagnosis information, to keep the comparison fair we also consider another way of introducing the same information: simply concatenating the vector of pre-diagnosis indicators to the multi-modal behavior embedding. We select gender and the indicator NoInterest as the pre-diagnosis information. The results in Table 1 compare the performance of the two pre-diagnosis information integration methods on the three kinds of temporal feature extractors. For both datasets, integrating pre-diagnosis information improves the diagnostic performance for all three feature extractors, which demonstrates the value of introducing pre-diagnosis information. Comparing the two integration mechanisms, MuGRED leads to a larger improvement, indicating that the model design of MuGRED exploits the pre-diagnosis information more effectively. In addition, in most cases, integrating the pre-diagnosis information with both methods together achieves the best performance, which may serve as a practical extension of MuGRED.

We then evaluate the impact of the number of combined pre-diagnosis indicators on the diagnostic prediction performance on the DAIC-WOZ dataset. As shown in Table 2, we compare the model performance in three settings: without pre-diagnosis, with gender and 1 indicator (NoInterest), and with gender and 3 indicators (NoInterest, Depressed and Sleep). The results show that the richer the pre-diagnosis information introduced, the more accurate the prediction of the diagnosis conclusion.

Table 2. The RMSE and MAE scores comparison of the diagnosis results with different amounts of the pre-diagnostic indicators on DAIC-WOZ dataset
Table 3. The RMSE and MAE scores comparison of the diagnosis items with/without pre-diagnostic indicators on ADHD dataset

For the ADHD dataset, the ADHD risk rating consists of three items, i.e., Inattention (Inatt), Hyperactivity-Impulsivity (H/Imp) and Oppositional Defiant Disorder (ODD). We examine whether the pre-diagnosis correlation benefits the prediction of every diagnosis item. To this end, we evaluate how the pre-diagnosis information affects each single item of the diagnosis result. Table 3 compares the prediction results for the three symptom indices with and without pre-diagnosis information. In most cases, the performance on every single item with the same method improves with the help of pre-diagnosis information.

5 Conclusion

In this paper, we focus on the problem of mental disorder diagnosis, where both pre-diagnosis and diagnosis process information are important for medical experts to make a diagnosis. However, most existing methods in computer-aided diagnosis (CAD) only focus on the diagnosis process information while ignoring the pre-diagnosis information. Moreover, patients are commonly treated as independent samples in previous CAD methods, while medical experts, in real applications, always need to connect patients for comparison. Therefore, we adopted the pre-diagnosis information to connect the patients and proposed a Multiple Graph REgularized Diagnosis (MuGRED) method for mental disorder diagnosis. The proposed MuGRED method is an end-to-end learning algorithm that fuses the representations from both the pre-diagnosis and diagnosis processes for the final diagnosis prediction. Extensive experiments on the DAIC-WOZ and ADHD datasets demonstrated the effectiveness of the proposed MuGRED method for mental disorder diagnosis.