1 Introduction

According to the World Health Organization (WHO), mental health has become an important indicator of sustainable development (World Health Organization, 2019). Statistics show that people with mental disorders have disproportionately high rates of disability and death. For example, people with schizophrenia and depression are 40% to 60% more likely to die prematurely than the general population (World Health Organization, 2021). A meta-analysis of 174 surveys estimated the lifetime prevalence of mental illness worldwide at 29.2% (Steel et al., 2014). Worse yet, mental illness is one of the leading causes of disability, driving the global cost of mental health treatment into trillions of dollars (World Health Organization, 2021; Patel et al., 2018).

Studies have shown that when mental illness is detected and treated early, treatment outcomes and long-term results improve substantially (Bird et al., 2010; Treasure & Russell, 2011). In the long run, early detection of mental illness will reduce its impact on society and ease the economic burden. The WHO's Mental Health Action Plan (World Health Organization, 2021) recommends strengthening research on mental health information systems. Following this recommendation, we focus on early detection of mental illness by analyzing social media data.

Owing to the popularity of the Internet, social media has become an integral part of daily life, and more and more people are willing to share their lives through social networks. This allows us to indirectly understand people's inner worlds through their posts on social media. With billions of participants, social media data is large enough to be suitable for deep learning.

Many studies have collected datasets from social media for the analysis or detection of mental health conditions (Coppersmith et al., 2015; MacAvaney et al., 2021; Benton et al., 2017). At the same time, many studies have tried to improve prediction performance by changing the model architecture or considering different types of data (Gui et al., 2019; Shen et al., 2017). However, the performance of the proposed models was often limited by the distribution of the training data, which constrains the generality of the models and also prevents them from scaling.

Moreover, people judge things not only by the data they collect, but also by their experience and background knowledge. This means that when making decisions, people often incorporate relevant knowledge and follow established conventions. Clinically, a doctor or psychotherapist asks the patient questions based on mental health screening tools to understand their psychological state and symptoms (Butcher et al., 2001; Krug et al., 2008). A final assessment is then made based on the physician's expertise and standard diagnostic criteria (American Psychiatric Association, 2013). Diagnosing mental illness precisely therefore requires a great deal of knowledge.

To overcome overreliance on data gathered from social media, we draw on the decision-making processes of humans and doctors and incorporate external knowledge into the model. The main idea is to incorporate relevant knowledge from mental health screening tools and diagnostic criteria into the deep learning model so that the model is given a psychological perspective. As an extension, we also explore the impact of introducing simple common sense into the model. We collected screening tools for the mental health conditions, Wikipedia's mental health-related entries, and the authoritative book DSM-5 as our psychological knowledge. The contents of Wiki_dpr, created by Hugging Face, are also collected and treated as common sense. Our goal is to study whether external knowledge from psychology or common sense can improve the predictive ability and interpretability of the model. Introducing external knowledge has a further advantage: the knowledge can be retrieved automatically and replaced freely. The former means that incorporating external knowledge does not require manual annotation, since relevant knowledge can be found automatically; this saves considerable labor cost and makes our method easier to apply widely. The latter means that the contents of the external knowledge can be adjusted freely, which solves the scalability problem for a large-scale model.

Our work includes the following four steps: (1) gather and index the employed external knowledge, (2) incorporate the knowledge into the model to aid in prediction, (3) add an attention layer for determining which posts and knowledge receive more attention, and (4) use a fully connected layer and a sigmoid activation function to predict mental health conditions. Our contributions can be summarized as follows:

  • We incorporate relevant external knowledge into the model, and the experiment results show that the F1-score of the prediction increases by more than 10% in some situations compared with existing approaches.

  • By providing the model with external knowledge from psychology and other fields, humans can better understand what the model has learned, which makes it easier for researchers to optimize the model.

  • This method can be automated, and the external knowledge can be freely adjusted. This minimizes the cost of manual annotation and solves the scalability problem of the model.

  • Finally, some statistical guidelines are provided for those who want to adopt our approach to build external knowledge for their model.

The rest of the paper is organized as follows: the related works are reviewed in Section 2, the details of the external knowledge are introduced in Section 3, our approach is described in Section 4, the experiment results are presented in Section 5, and the conclusion is provided in Section 6.

2 Related work

In this section, we first describe the datasets for detecting mental health conditions, and then introduce the past work on the detection of mental health conditions.

2.1 Data collection from social media

With the rise of social media, researchers have begun to use social media data to study mental illness (Park et al., 2012). Moreover, a growing body of research is concentrating on analyzing the copious amounts of text on social media to learn more about mental health conditions (Coppersmith et al., 2015; Birnbaum et al., 2017). However, these studies rely on manual annotation to label data. Although manual labeling can produce reliable data, the number of users from whom data is collected is limited (Choudhury et al., 2021). Even when crowdsourcing is used to collect and label data, it is still difficult to gather a large amount of user data.

In order to collect larger datasets, Coppersmith et al. (2014) developed a method to identify self-diagnostic posts on social media by using regular expressions, which has since been widely used to collect data from users with mental disorders on social media. Four types of mental health conditions were considered, and the collected Tweets were analyzed using corresponding linguistic features and predictive models. Since depressive users tend to express their emotions and even reveal their diagnoses on social media, labelling users with this method can often achieve a high degree of reliability.

Cohan et al. (2018) extended this approach to Reddit data and a larger number of mental health conditions, and called the dataset Self-reported Mental Health Diagnoses (SMHD). A self-reported diagnosis means that if, for example, "I was diagnosed with depression last year" is found in a post, then this user is considered a depression patient. The dataset contains data from users with nine different mental health conditions: depression, attention deficit hyperactivity disorder (ADHD), anxiety disorder, bipolar disorder, post-traumatic stress disorder (PTSD), autism, obsessive-compulsive disorder (OCD), schizophrenia, and eating disorder. Note that a user may have one or more mental disorders.

For each diagnosed user, nine or more control users were collected according to the following restrictions (Cohan et al., 2018): the number of posts by the control user must be between half and twice that of the diagnosed user, and the control user must have at least one post on a subreddit where the diagnosed user once posted. It is important to note that these control users cannot have any mental health-related post. Likewise, the diagnosed users are normalized by removing posts containing mental health signals, leaving only general posts in the final dataset, so that the text analysis can focus on the tendencies of diagnosed users in general posts. Table 1 shows the statistics of the posts in the two groups of diagnosed users and control users.

Table 1 Average (standard deviation) counts of posts and tokens for diagnosed and control users in SMHD (Cohan et al., 2018)

2.2 Detection of mental health conditions on social media

In order to improve prediction performance, early studies focused more on optimizing the model (Jiang et al., 2020; Murarka et al., 2021) or on considering different types of data available on social media. Some studies (Choudhury et al., 2021; Coppersmith et al., 2014; Reece and Danforth, 2017) used handcrafted features derived from different types of data, such as the number of posts per day or the number of faces in a photo, as input for the predictive model. Other studies combined different types of data, such as text and images, to construct a multimodal model (Gui et al., 2019; Shen et al., 2017) for the prediction.

However, relying only on social media data to determine whether a user suffers from a mental health condition is less convincing. Past research has focused on adding different types of features, such as the handcrafted features and image features mentioned above, so the predictive performance of the model remains limited by the training data. Moreover, because the model parameters cannot be easily changed, the model's generalizability is limited and the model itself is unable to scale.

In other fields, researchers have addressed these problems by incorporating external knowledge into the model. For example, Ghazvininejad et al. (2018) used a memory network to import external knowledge in the conversation generation domain, enabling a chatbot to answer questions asked by humans. With the introduction of external knowledge, the chatbot can answer questions using knowledge that is not in the training data. As another example, in the question answering domain, Li et al. (2020) used the Word Mover's Distance algorithm to compare the distance between the query and the external knowledge, and fed the best-matching external knowledge into the model. Li et al. tested the model on the National Licensed Pharmacist Examination in China, and the results showed that the model can pass the examination.

Although these methods have demonstrated the effectiveness of incorporating external knowledge, the way relevant knowledge is retrieved for the model is relatively simple. In recent years, owing to the excellent feature extraction capabilities of Pre-trained Language Models (PLMs), the Meta AI team developed a retrieval method based on high-dimensional feature representations, Dense Passage Retrieval (DPR) (Karpukhin et al., 2020), which greatly increases the accuracy and reliability of knowledge retrieval. Experiment results show that the retrieval accuracy of DPR not only exceeds the traditionally used BM25 (Robertson and Zaragoza, 2009); because it is based on PLMs, its performance can also continue to improve with more training. The Meta AI team also used DPR to achieve gratifying results in open-domain question answering and generation (Lewis et al., 2020).

Because of DPR’s outstanding retrieval ability, we use it to retrieve external knowledge, and then feed the relevant knowledge and posts into deep learning models for predicting mental health conditions. The models are then tested on the SMHD dataset to show the performance of our method.

3 External knowledge

In this section, we first introduce the external knowledge used in our model, and then describe how we collected it, including psychological knowledge and general knowledge.

3.1 Introduction of external knowledge

We consider two types of external knowledge to incorporate into our model. One is the knowledge that a clinical psychologist uses, called psychological knowledge, and the other is unconstrained general knowledge. By combining these two, we hope to effectively improve the predictive performance of the model and enhance the interpretability of the final results.

3.2 Psychological knowledge

Psychological knowledge is collected from three main sources:

  1. Screening tools for the mental health conditions (psychological test questionnaires).

  2. Wikipedia’s mental health-related entries.

  3. DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition) (American Psychiatric Association, 2013).

3.2.1 Screening tools

Screening tools are used to assess an individual’s mental health status or to identify signs or symptoms of mental disorders. These tools help clinicians understand the individual’s situation and choose a suitable treatment.

Screening tools were collected for each of the nine mental health conditions included in SMHD. We also collected screening tools that can be used for multiple disorders at the same time, such as the MMPI (Butcher et al., 2001). In total, 10 types of screening tools were collected; they are listed in ??.

In order to turn a screening tool into model-usable knowledge, it needs to be broken down into knowledge segments. We treat each question in the screening tool as a knowledge segment. If a question is too long to fit within the length limit (100 tokens) of a knowledge segment, we divide it into multiple segments.

For the ten types of screening tools, a total of 50 screening tools with 2,556 questions were collected and divided into 2,674 knowledge segments, as shown in Table 2.
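
For illustration, the following is a minimal sketch of this 100-token segmentation, assuming a standard BERT tokenizer; segment_question is a hypothetical helper, not the code used in this work.

```python
from transformers import BertTokenizerFast

# Illustrative sketch: split a screening-tool question into knowledge
# segments of at most 100 tokens (the limit used in this paper).
# `segment_question` is a hypothetical helper, not the authors' code.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def segment_question(question, max_tokens=100):
    token_ids = tokenizer.encode(question, add_special_tokens=False)
    return [tokenizer.decode(token_ids[i:i + max_tokens])
            for i in range(0, len(token_ids), max_tokens)]

# A short question stays in one segment; a very long one is split.
print(segment_question("Do you often feel down, depressed, or hopeless?"))
```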

Table 2 Statistics of the screening tools

3.2.2 Wikipedia’s mental health-related entries

In Wikipedia, related entries are grouped into templates. When collecting Wikipedia data, we found two templates related to mental health conditions, mood disorders and mental disorders, and scraped all the webpages listed there. These webpages contain descriptions of the diseases, their symptoms, and other information, making the knowledge more comprehensive.

Similar to the processing of screening tools, the Wikipedia data is divided into knowledge segments by sentence for use by the model. A total of 165 webpages with 22,002 sentences were collected and divided into 22,002 knowledge segments (no sentence has more than 100 tokens).

3.2.3 DSM-5

DSM-5 is the standard diagnostic manual for mental health conditions and serves as the principal authority for psychiatric diagnoses in the United States. It contains disease definitions, symptoms, treatment recommendations, and so on. We remove the parts before the preface and after the ??, and collect the body of DSM-5 as psychological knowledge. As above, we use a sentence as a segment and divide the book into many knowledge segments. Long sentences or paragraphs that are difficult to segment with automatic tools are also split into several segments using 100 tokens as breakpoints.

There are 17,329 sentences in total, which are finally split into 17,482 knowledge segments. This shows that there are not many sentences with more than 100 tokens, and most sentences retain their complete meaning.

3.3 General knowledge

Wikipedia is a vast online encyclopedia whose content is universal and unrestricted, so we treat it as common sense. The Wikipedia data we used is Wiki_dpr, a dataset created by Hugging Face that covers a wide and large amount of content from Wikipedia.

The text in Wiki_dpr is segmented into passages of 100 tokens, yielding 21 million segments in total. Wiki_dpr not only contains the text, but also provides embeddings of the knowledge segments generated with the same knowledge encoder as ours. The creators also provide a fast search index over these segments, so users can quickly and accurately find the most relevant knowledge segments.

Since Wiki_dpr contains a wide range of contents, we treat it as common sense to analyze whether the model would perform better if general knowledge was introduced.
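
As a rough sketch, the Wiki_dpr passages and their prebuilt embedding index can be loaded through the Hugging Face datasets library as follows; the configuration name and encoder checkpoint are assumptions, since the exact setup used in this work is not specified.

```python
from datasets import load_dataset
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Sketch only: load Wiki_dpr passages with a prebuilt FAISS index of their
# DPR embeddings. The config name is an assumption (one of the published
# Wiki_dpr configurations); note that the download is very large.
wiki = load_dataset("wiki_dpr", "psgs_w100.nq.exact", split="train")

q_tok = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")

query = "I can't sleep and feel hopeless every day."
q_emb = q_enc(**q_tok(query, return_tensors="pt")).pooler_output
q_emb = q_emb.detach().numpy()[0]

# Retrieve the most similar passages by inner product over the index.
scores, passages = wiki.get_nearest_examples("embeddings", q_emb, k=5)
for text in passages["text"]:
    print(text[:80])
```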

4 Mental health conditions detection

In this section, we describe the methods for detecting mental health conditions, including extracting features from the posts and external knowledge segments, finding external knowledge segments related to the posts, and the deep learning model architecture used.

4.1 Feature extraction

To convert text into representative features, a good choice is to represent the text as a high-dimensional vector. In recent years, the rise of Pre-trained Language Models (PLMs) has made extracting such high-dimensional feature vectors more universal and effective.

BERT (Devlin et al., 2019) is a transformer-based natural language processing model that is pre-trained on a large corpus and can cover multiple languages at the same time. Because of the self-attention mechanism, the context of a passage is taken into account, so the meaning of individual words or of the whole passage can be accurately represented. It is one of the most popular pre-trained language models today.

We use BERT as our feature extraction encoder. The user’s posts and the knowledge segments are fed into their respective encoders, and the outputs are used as the final feature representations. In this way, vector representations of all posts and knowledge segments are obtained, which completes the feature extraction.
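
A minimal sketch of this feature extraction step, assuming the [CLS] hidden state of an off-the-shelf BERT is taken as the representation (the specific pooling choice is our assumption):

```python
from transformers import BertTokenizerFast, TFBertModel

# Sketch: encode a batch of posts with a frozen BERT and keep the [CLS]
# vector as the post representation (the pooling choice is an assumption).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # the paper freezes BERT and does not fine-tune it

posts = ["I stayed in bed all weekend again.",
         "Finally finished my project, feeling great!"]
inputs = tokenizer(posts, padding=True, truncation=True,
                   max_length=128, return_tensors="tf")
post_vectors = bert(**inputs).last_hidden_state[:, 0, :]  # (batch, 768)
print(post_vectors.shape)
```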

4.2 Relevant knowledge retrieval

Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) is a tool that has been widely used for knowledge retrieval in recent years. DPR is useful for retrieving relevant content and provides a solution to the scalability problem of large pre-trained models. Its basic concept is to use the excellent feature extraction ability of PLMs to measure the correlation between a query and the knowledge, so that knowledge related to the query can be found.

All external knowledge segments are first passed through PLMs to obtain their feature representations, and an index is built over these representations so that relevant knowledge segments can be retrieved efficiently. Next, the query is passed through PLMs to obtain its feature representation. Finally, Maximum Inner Product Search (MIPS) is performed between the query representation and the representations of all knowledge segments, and the top m knowledge segments most relevant to the query are returned.

We treat each post as a query and let each post retrieve its top m relevant knowledge segments. The feature representations of the posts and knowledge segments are then passed to the model for prediction. In the end, the input of the prediction model includes n posts and mn knowledge segments, where m is the number of most relevant knowledge segments retrieved per post.

We train a total of nine binary classification models for the nine corresponding mental health conditions to get nine prediction results. The overall flow is illustrated in Fig. 1.

Fig. 1

Architecture for Retrieving External Knowledge and Predicting a Mental Health Condition

The details of this method are described as follows.

Let the posts be \(P = \{p_{1}, p_{2}, ..., p_{t}, ..., p_{n}\}\), where n is the total number of posts and \(p_{t}\) is the t-th post. For each \(p_{t}\), we use DPR to compute the similarity with every \(k_{j} \in K\), where K represents all knowledge segments:

$$ ke(k_{j})=PLMs_{ke}(k_{j}), pe(p_{t})=PLMs_{pe}(p_{t}) $$
(1)
$$ p_{\eta}(k_{j}|p_{t}) \propto exp(ke(k_{j})^{\intercal} pe(p_{t})) $$
(2)

where \(ke(k_{j})\) is the feature representation of a knowledge segment produced by the knowledge encoder based on PLMs, and \(pe(p_{t})\) is the feature representation produced by the post encoder, also based on PLMs. Calculating \(\text{top-}m(p_{\eta}(\cdot|p_{t}))\) is a MIPS problem: we let each \(p_{t}\) compute \(p_{\eta}(k_{j}|p_{t})\) for all \(k_{j} \in K\), sort K according to the result to obtain \(K^{t}\), and take the top m elements of \(K^{t}\) to get \(K^{t, m}=\{{k_{1}^{t}}, {k_{2}^{t}}, ..., {k_{m}^{t}}\}\), the top m relevant knowledge segments of \(p_{t}\). We represent the hidden state of \(p_{t}\) as \({h_{t}^{p}} = pe(p_{t})\), and that of the top-i-th knowledge segment of \(p_{t}\) as \(h_{t}^{k, i}=ke({k_{i}^{t}})\).
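
The retrieval described above can be sketched as follows, assuming FAISS is used for the MIPS step and the public DPR checkpoints serve as the post and knowledge encoders (both are illustrative assumptions, not the exact setup used here):

```python
import faiss  # illustrative MIPS library; the paper does not name one
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_tok = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")

# ke(k_j): encode the knowledge segments and index them for MIPS.
knowledge = ["Loss of interest in daily activities is a symptom of depression.",
             "How often have you had little interest or pleasure in doing things?"]
ke = ctx_enc(**ctx_tok(knowledge, padding=True, return_tensors="pt")
             ).pooler_output.detach().numpy()
index = faiss.IndexFlatIP(ke.shape[1])   # exact inner-product index
index.add(ke)

# pe(p_t): encode a post and retrieve its top-m knowledge segments.
post = "Nothing feels fun anymore, I just scroll my phone all day."
pe = q_enc(**q_tok(post, return_tensors="pt")).pooler_output.detach().numpy()
scores, top_m = index.search(pe, 2)      # MIPS over all segments (m = 2)
print([knowledge[i] for i in top_m[0]])
```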

4.3 Deep learning model architecture

Because different posts and knowledge segments have different contributions to determine mental health conditions, we use an attention mechanism (Vaswani et al., 2017) to let the model pick out posts and knowledge segments that are more important for the prediction.

We use three attention model architectures to test different interactions between posts and knowledge segments:

  1. Each post is concatenated with the relevant knowledge segments, then enters the self-attention layer.

  2. The posts and knowledge segments enter the self-attention layers separately.

  3. The posts and knowledge segments enter the attention layers separately, with cross attention to each other first, followed by the self-attention layers.

The details are as follows.

  1. After finding the relevant knowledge segments, each \({h_{t}^{p}}\) is concatenated with \({h_{t}^{k}}\) to become the t-th hidden state \(h_{t}\). The hidden states \(h_{1}\) to \(h_{n}\) are put into the self-attention mechanism to calculate the attention weight of each hidden state. Then, the output for each hidden state \(h_{t}\) is obtained as a linear combination using these weights. The output goes through a global average pooling layer to obtain the user representation υ.

    $$ h_{t} = {h_{t}^{p}} \oplus {h_{t}^{k}}, {h_{t}^{k}}=\{h_{t}^{k, 1}, h_{t}^{k, 2}, ..., h_{t}^{k, m}\} $$
    (3)
    $$ \upsilon = pooling(self\_attention(h_{1}, h_{2}, ..., h_{n})) $$
    (4)
  2. The posts and the knowledge segments are passed to the attention mechanism separately to calculate the attention weights among the posts and among the knowledge segments. After the same linear combination of the weights, a global average pooling layer is applied. We then concatenate the weighted average hidden states of the posts and knowledge segments to obtain the user representation υ. This method corresponds to Fig. 1 with the cross-attention layer removed.

    $$ h^{p} = ({h_{1}^{p}}, ..., {h_{n}^{p}}), h^{k}=(h_{1}^{k, 1}, ..., h_{1}^{k, m}, h_{2}^{k, 1}, ..., h_{2}^{k, m}, ..., h_{n}^{k, 1}, ..., h_{n}^{k, m}) $$
    (5)
    $$ \upsilon = pooling(self\_attention_{p}(h^{p})) \oplus pooling(self\_attention_{k}(h^{k})) $$
    (6)
  3. The cross attention over the sequences of posts and knowledge segments allows the posts and knowledge segments to interact with each other. Then, as in the second method, the posts and knowledge segments are passed to the attention mechanism separately, followed by the global average pooling layer. We then concatenate the weighted average hidden states of the posts and knowledge segments to get the user representation υ. The complete process is shown in Fig. 1.

    $$ {h_{c}^{p}} = cross\_attention_{k \to p}(h^{k}, h^{p}), {h_{c}^{k}} = cross\_attention_{p \to k}(h^{p}, h^{k}) $$
    (7)
    $$ \upsilon = pooling(self\_attention_{p}({h_{c}^{p}})) \oplus pooling(self\_attention_{k}({h_{c}^{k}})) $$
    (8)

After obtaining the user representation, it is passed through a fully connected layer and a sigmoid activation function to get the prediction \(\hat {y}\), where \(\hat {y}\) is the predicted value of the binary classification for each mental health condition:

$$ \hat{y} = sigmoid(W\upsilon + b) $$
(9)
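
A compact sketch of the third variant (cross attention, separate self-attention, pooling, and a sigmoid output) in TensorFlow 2 might look as follows; the layer sizes, number of heads, and input lengths are illustrative assumptions rather than the exact hyperparameters used here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_variant3_model(n_posts=160, m=1, dim=768, heads=4):
    """Sketch of the cross-attention variant; hyperparameters are illustrative."""
    posts = layers.Input(shape=(n_posts, dim), name="post_states")          # h^p
    knowledge = layers.Input(shape=(n_posts * m, dim), name="know_states")  # h^k

    # Cross attention: posts attend to knowledge and vice versa (Eq. 7).
    h_p = layers.MultiHeadAttention(heads, dim // heads)(posts, knowledge)
    h_k = layers.MultiHeadAttention(heads, dim // heads)(knowledge, posts)

    # Separate self-attention over each sequence, then global average pooling.
    h_p = layers.MultiHeadAttention(heads, dim // heads)(h_p, h_p)
    h_k = layers.MultiHeadAttention(heads, dim // heads)(h_k, h_k)
    v = layers.Concatenate()([layers.GlobalAveragePooling1D()(h_p),
                              layers.GlobalAveragePooling1D()(h_k)])        # Eq. 8

    # Fully connected layer with a sigmoid for one binary condition (Eq. 9).
    y_hat = layers.Dense(1, activation="sigmoid")(v)
    model = tf.keras.Model([posts, knowledge], y_hat)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy")
    return model

model = build_variant3_model()
model.summary()
```

One such binary model is trained per mental health condition, as described above.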

5 Experiments

In this section, we first give dataset statistics. Then we present the setup of our experiment and analyze the results of the experiment. Next we provide some statistical guidelines of knowledge sources. Finally, we explain the effectiveness of our model with cases.

5.1 Dataset statistics

We use the SMHD dataset, introduced in Section 2.1, to validate our method. It has been divided into training, validation, and test sets in equal proportions (Cohan et al., 2018). The statistics are shown in Table 3.

Table 3 Train, Validation, Test Split

5.2 Experiment setup

Our models are trained with batch sizes of 32 and 8. The optimizer is Adam with an initial learning rate of \(10^{-4}\). The loss function used for gradient descent is binary cross entropy. We use TensorFlow 2 to implement the models and train them on an NVIDIA Tesla V100 GPU.

In terms of feature extraction, we use BERT as our PLM and freeze its parameters without fine-tuning. The main reason is that the feature extraction ability of the pre-trained BERT is good enough to test the effectiveness of our method. For retrieving relevant knowledge segments, we use a pre-trained encoder from Hugging Face as the PLM for our knowledge encoder.

Cohan et al. (2018) selected nine or more control users for each diagnosed user, and mixed all control users with all diagnosed users. In order to compare experiment results fairly, Sekulic and Strube (2019) set the ratio of diagnosed users to control users to 1 : 9 and re-implemented the benchmarks of Cohan et al. (2018) for comparison. We follow this setup in our experiments. It is important to note that our way of adjusting the user ratio differs from Sekulic and Strube (2019), who adjusted the ratio of diagnosed to control users through multiple rounds of random sampling. We instead make predictions for all users and then adjust the values of false positives (FP) and true negatives (TN). Because the sum of true positives (TP) and false negatives (FN) equals the number of diagnosed users, and the sum of FP and TN equals the number of control users, we calculate the variable x from the ratio of diagnosed to control users for each mental health condition so that the following equation holds:

$$ (TP + FN):(\frac{FP}{x} + \frac{TN}{x})=1:9 $$
(10)

We then use the values of FP/x and TN/x to calculate the F1 score. Because we make predictions for all users, the results are not biased, whereas random sampling has to be repeated a sufficient number of times to remove bias. Our experiment results are therefore more statistically stable and more representative of the true predictive power of the model.
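
As a worked example with invented counts, suppose a condition has 100 diagnosed users and 1,800 control users; then x = 2 scales the control-group errors so that the ratio becomes 1 : 9:

```python
# Worked example with invented counts (not real results), following Eq. (10).
tp, fn = 60, 40          # tp + fn = 100 diagnosed users
fp, tn = 150, 1650       # fp + tn = 1,800 control users

x = (fp + tn) / (9 * (tp + fn))     # here x = 2, so (fp + tn) / x = 900
fp_adj, tn_adj = fp / x, tn / x

precision = tp / (tp + fp_adj)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.444 0.6 0.511
```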

Furthermore, following the recommendation of Sekulic and Strube (2019), we only make predictions on 160 posts per user. If a user has more than 160 posts, the most recent 160 posts are selected; if a user has fewer than 160 posts, the remaining slots are filled with zero vectors.

From Table 3, we see that the number of control users is much larger than that of diagnosed users. To address the problem of data imbalance, we randomly draw diagnosed users and control users into the batch with the same probability during training. This allows all diagnosed and control users to be fairly used in the training.
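
A simple way to realize this balanced sampling is sketched below; the user-id lists are hypothetical and the helper is ours, not the authors' code:

```python
import random

def balanced_batch(diagnosed_ids, control_ids, batch_size=32):
    """Draw each user for a batch from either group with equal probability."""
    return [random.choice(diagnosed_ids if random.random() < 0.5 else control_ids)
            for _ in range(batch_size)]

# Hypothetical user id lists; in practice these come from the SMHD split.
diagnosed = list(range(100))
controls = list(range(100, 2000))
print(balanced_batch(diagnosed, controls, batch_size=8))
```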

5.3 Experiment results

We compare our results with those of Sekulic and Strube (2019), whose main method is based on the Hierarchical Attention Network (Yang et al., 2016). The basic concept is to use two layers of GRU-based encoders for feature extraction, and to perform attention over the feature representations of the posts to obtain the user representation. That study also re-implemented some classic machine learning baselines from Cohan et al. (2018), including Logistic Regression, Linear SVM, and Supervised FastText.

To test the impact of different types of knowledge on the model, we divide the knowledge into two categories for the experiments: the psychological knowledge specially collected for this task, and the psychological knowledge plus general knowledge, which we call total knowledge. The main purpose of the total knowledge experiment is to see whether common sense helps model predictions. In the first experiment, we retrieve the top 1 relevant knowledge segment for each post, so each post has one most related knowledge segment.

The method of incorporating external knowledge segments is detailed in Section 4. We train the first two prediction models with knowledge segments from the two categories, and finally obtain four experiment results, as shown in Table 4. It is seen from the experiment results that psychological knowledge is more effective than total knowledge. Therefore, we further test psychological knowledge using a cross-attention mechanism to test whether the interaction of knowledge segments and posts improves model performance.

Table 4 Prediction results of mental health conditions (1 : 9)

The F1 score can be misleading on extremely imbalanced data: if more data is used in the control group, precision becomes lower under the same model weights because the number of false positives is more likely to be high. To avoid this situation, we additionally tested the case of an equal number of diagnosed users and control users, as shown in Table 5.

Table 5 Prediction results of mental health conditions (1 : 1)

Because the previous baselines do not use PLMs, their feature extraction for natural language is poor. In order to show that our results improve not only because of PLMs but also because of the external knowledge, we exclude the external knowledge from our model (called Only_Post) and use it as a baseline to evaluate the effect of including external knowledge for prediction. In Tables 4 and 5, Only_Post represents only the posts part of our model, PK represents psychological knowledge, and TK represents total knowledge. S represents the model in which the hidden states of knowledge segments and posts are passed to the attention layer separately, and C represents the model in which the cross-attention layer is used. If neither S nor C is indicated, the hidden states of posts and knowledge segments are concatenated before being passed to the attention layer.

5.3.1 Analysis of the experiment results

It can be seen from Tables 4 and 5 that the Only_Post model already performs better than all the baselines from previous work. This coincides with our conjecture that PLMs have very good feature extraction capabilities. We use this as the baseline to analyze the results of adding external knowledge segments.

We find that using psychological knowledge outperforms total knowledge. There are two possible reasons. First, the total knowledge has more than 20 million knowledge segments, which makes it difficult to find the commonalities and differences between the various kinds of knowledge segments, and thus makes prediction harder. Second, the total knowledge contains much more common sense than the psychological knowledge, so it is easier to retrieve knowledge segments unrelated to mental health; the truly relevant knowledge segments are then missed, leading to poor results.

In terms of the models, passing the hidden states of knowledge segments and posts to the attention layer separately performs better than concatenating them. The reason is that the importance of knowledge segments and posts is not consistent: the posts with greater attention weight do not necessarily retrieve the important knowledge segments, and vice versa. Therefore, simply concatenating the hidden states of knowledge segments and posts before the attention layer reduces the predictive power of the model, because it makes it more difficult for the model to find the important contents. In contrast, passing the hidden states of knowledge segments and posts separately through the attention mechanism performs better, because the model can weigh them separately and thus identify the posts and knowledge segments that matter most for the prediction. However, separating the hidden states of knowledge segments and posts prevents them from influencing each other, so we add a cross-attention mechanism to further improve the prediction. With this, we achieve good results in most of the disease categories. Among them, schizophrenia shows the greatest improvement, with an increase of more than 10% in F1 score; the incorporation of psychological knowledge together with the cross attention layer increases the effect by more than 3%.

5.3.2 Extended experiments: More knowledge segments

This section explores the impact of introducing more external knowledge segments on model predictions. We let each post find the top 1, 3, and 5 related knowledge segments (from Psychological Knowledge) and compare the impact on the model. The results are shown in Tables 6 and 7.

Table 6 Experiments of top 1, 3, and 5 related knowledge segments (1 : 9)
Table 7 Experiments of top 1, 3, and 5 related knowledge segments (1 : 1)

We find that more related knowledge segments may help model prediction, but not necessarily. While importing more knowledge segments increases the likelihood of acquiring important and relevant knowledge, it also creates more noise and makes it harder for the model to focus on the truly important knowledge segments. For mental health conditions with a small number of diagnosed users, the more knowledge segments are input, the harder it is to find useful information, and the more serious the overfitting becomes. For example, for eating disorder, the more knowledge segments are introduced, the worse the results. Therefore, introducing more knowledge segments is not necessarily helpful; the decision depends on the amount of data and the actual situation of the experiment.

5.3.3 Co-occurrence of mental health conditions

In this section, we discuss whether our nine binary classification models for the nine mental health conditions are able to distinguish a user’s multiple disorders, and we analyze the comorbidity.

We use the model with the best predictive performance, i.e. Only_Post + PK + C, and analyze the prediction results of all diagnosed users over the nine disorders to study the comorbidities. We made predictions for the nine disorders for each diagnosed user and evaluated them with the Exact Match Ratio (EMR), a strict metric requiring every label in a multi-label prediction to be correct, and the Hamming Loss (HL), a softer metric that reports the average fraction of incorrectly predicted labels. The EMR is only 1.34%, while the HL is 61.04%, which shows that our model cannot distinguish well between different disorders.
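
For clarity, the two metrics can be computed as follows on a toy multi-label example (the labels are invented and only three conditions are shown):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy multi-label example: rows are users, columns are disorders
# (only three conditions shown; the real evaluation uses nine).
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 1]])

emr = accuracy_score(y_true, y_pred)   # subset accuracy = Exact Match Ratio
hl = hamming_loss(y_true, y_pred)      # average fraction of wrong labels
print(f"EMR = {emr:.2f}, Hamming Loss = {hl:.2f}")
```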

We compare the comorbidity in the test data, shown in Fig. 2, with our predicted results, shown in Fig. 3. Figure 3 shows that once a user has one mental health condition, the model almost always predicts other mental health conditions as well, which does not match the actual situation. Our model can detect well whether a user has a mental health condition, but cannot accurately tell which mental health condition it is.

Fig. 2

The relative co-occurrence of the disorder with other disorders in the test dataset

Fig. 3

The relative co-occurrence of the disorder with other disorders in the prediction result

5.4 Source statistics for psychological knowledge

The psychological knowledge is collected from three sources: Screening tools, Wikipedia (mental health-related entries), and the DSM-5. Table 8 shows a statistical analysis of the knowledge segments retrieved by our method from these three sources.

Table 8 The source distribution of the top 1 to 5 related knowledge segments

We search for the top 5 related psychological knowledge segments for all posts and count their sources. Table 8 shows that, across the top 1 to 5 related knowledge segments, segments from Wikipedia (mental health-related entries) are used far more often than those from the other two sources. As seen in Section 3.2, DSM-5 occupies a large proportion of the psychological knowledge, yet it is the least relevant and has a low probability of being retrieved. There are two speculative reasons. The first is that the encoder weights of DPR are trained on Wikipedia, which makes segments from Wikipedia easier to retrieve. The second is that DSM-5 is a book; compared with the screening tools and Wikipedia (Mental Health), it usually uses abstract or high-level sentences, which differ from colloquial posts.

In order to understand which sources of knowledge segments are more important for predicting mental health conditions, we analyze the Only_Post + PK + S model and compute statistics on the sources of the ten related knowledge segments to which the model pays the most attention. As can be seen from Table 9, DSM-5 is again the least important source compared with the screening tools and Wikipedia (Mental Health).

Table 9 The source distribution of the top ten related knowledge segments that the model pays the most attention to

From Tables 8 and 9, it can be concluded that the screening tools and Wikipedia (Mental Health) are very important to the model. If we can increase the amount of data for both, there is a good chance that the model will perform even better.

5.5 Case study

In this section, we show why introducing external knowledge segments can improve model prediction and interpretability. The model used is Only_Post + PK + S.

Because of user privacy and the data usage agreement, we paraphrase the posts. The post attention parts present the three posts with the highest attention weights. The knowledge attention parts present the knowledge segment with the highest attention weight, together with the post that retrieved it.

Table 10 shows a case where the prediction is wrong in the baseline model but correct after including external knowledge segments. Based on the content of the posts alone, it is difficult to determine that this is a depressed patient. However, the attention weights of the external knowledge segments reveal that the user has a tendency toward impulse buying, which leads the model to the correct judgment. It is worth noting that past studies (Mueller et al., 2011; Lejoyeux et al., 1997) have found a fairly high correlation between impulse buying and depression, which is consistent with the knowledge the model focuses on.

Table 10 Example of depression

Table 11 is also an example where the baseline prediction is incorrect but becomes correct after including the external knowledge segments. This example is from an autistic patient. From the posts alone, it is hard to understand why the model made its prediction, but the most heavily attended knowledge segments suggest that the user does not like to be in contact with other people, which makes the reason for the prediction easier to understand.

Table 11 Example of autism

Table 12 lists some counter-examples that illustrate the inadequacies of the current method. Since our method retrieves knowledge segments for every post, some unimportant posts retrieve noisy knowledge segments during training and testing. This problem of miscited knowledge segments needs to be overcome when incorporating knowledge segments into the model.

Table 12 Counter example

6 Conclusion

In this study, we improved the performance of detecting mental health conditions by incorporating psychological knowledge. The experiment results show that our method outperforms previous work, with the F1-score increasing by more than 10% in some situations. Through the attention weights of the knowledge segments, our model can find the knowledge segments that are important for predicting mental health conditions, and the content of these segments improves interpretability. This suggests that our model has the potential to serve as a reference for psychiatrists assessing patients, or to allow users to learn more about their own mental health.

Moreover, DPR is an automatable process and the external knowledge can be adjusted freely, which makes our method more likely to be applied in practice. Through the source statistics for knowledge segments, useful sources can be found to improve the performance of the model. We are working on solving the problem of miscited knowledge segments and improving the feature extraction capability of the PLM and DPR encoders, hoping to achieve an even better performance.