Introduction

When under strain, healthcare workers (HCWs) may develop a state of sustained or high work-related stress [1,2,3]. Work-related stress refers to adverse physical and emotional effects that occur when work expectations are inconsistent with the available resources and the needs of workers [3, 4]. High exposure to potentially traumatic events coupled with work-related stress may trigger episodes of psychological distress (e.g., anxiety, depression, PTSD) [5,6,7]. As a result, HCWs may develop mental health disorders that increase sick leave and turnover rates [8, 9]. Studies conducted in healthcare settings have often underlined the resource scarcity (e.g., human, material) prevalent in these environments before COVID-19, which the pandemic has only worsened [5, 10,11,12,13,14,15]. This resource scarcity may contribute to psychological distress among HCWs [5, 6, 12, 13, 16]. Hence, the healthcare community needs interventions that efficiently target anxiety, depression, and PTSD in work settings [14, 17]. In the event of a crisis, providing interventions to all HCWs regardless of their mental health state may lead to a waste of resources [11, 16, 18]. Identifying the HCWs most likely to need assistance is therefore critical when implementing effective preventive interventions to avoid staff shortages.

In the first months of a crisis, the guidelines of the National Institute of Health and Care Excellence (NICE) suggest using active monitoring as a method to identify individuals at risk of psychological distress [19,20,21]. However, studies show that continuous screening and active monitoring with traditional methods (e.g., self-report diaries, telephone follow-ups, in-person visits by professionals, supervision by management in the workplace) and questionnaires may be time-consuming and redundant, and may lead to reduced adherence [17, 22]. Ecological momentary assessment of mental health shows some similar limitations when administered daily [23, 24].

One potential solution to this issue involves the use of machine learning algorithms to train novel models to screen for distress. Machine learning may reduce the burden of active monitoring by decreasing the number of questions asked of HCWs. This analytic approach allows multiple factors and their complex interactions to be tested simultaneously to identify the best performing algorithm [25]. By using machine learning algorithms, researchers can narrow down the number of questions that need to be asked on a weekly basis to more efficiently identify HCWs at risk of anxiety, depression, and PTSD. Reducing the number of questions may increase the likelihood that active monitoring remains part of HCWs' practices [21].

In a recent systematic review of machine learning studies predicting mental health outcomes, using labeled examples to produce predictions (i.e., supervised learning) emerged as the primary approach [26]. Notably, Chung and Teo (2021) highlighted support vector machine (SVM) models as prominent for their accuracy in predicting anxiety and depression [26]. Le Nguyen et al. (2023) employed SVM models, achieving 99.32% accuracy when simulating distress experiences in HCWs [27]. However, the prevailing focus on cross-sectional methodology using retrospective rather than prospective data collection may have introduced a blind spot in mental health research with respect to long-term outcomes and trajectories [26].

Given the complexity of machine learning, several concerns have arisen in the research literature [28]. A common challenge is small sample size, influenced by data collection costs and ethical constraints [26, 27]. In fact, many machine learning studies are still in the early stages of demonstrating feasibility because of small samples and limited external validation [26]. Data quality also influences the predictive ability of machine learning and underscores the need for representative samples, especially in studies of HCWs [29]. Additionally, machine learning model performance varies with the data sample and preprocessing choices, so multiple models are needed to obtain optimal accuracy [26].

The current study

Given the limitations of current practices, the purpose of this study was to use machine learning to develop and test preliminary models with fewer questions to screen for the risk of developing anxiety, depression, and PTSD in HCWs.

Methods

Study design

We used the data collected from a prospective cohort study through a mobile application, in the province of Quebec, Canada, between May 8, 2020, and January 24, 2021, during the first and the second waves of COVID-19 [30]. The purpose of this original cohort study was to examine the evolution and the trajectory of psychological distress in HCWs during and after the first wave of COVID-19 [30]. Participants were prompted to complete several questionnaires through the Ethica app on a weekly basis for a period of 12 weeks. A notification reminded the participants to complete their assessment each week. Weekly reports remained confidential and voluntary. The research ethics board of the CRCHUM approved the research project. Each participant provided informed consent before their participation. The original study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines [31]. We applied a fivefold cross-validation approach to train and test our machine learning models using the data from the original prospective study.

Participants

A total of 816 HCWs from eight healthcare centers in Quebec, Canada, participated in the prospective cohort study. The number of respondents varied each week (i.e., not all participants responded to the questionnaire). No participant responded more than 12 times; 39.9% of HCWs responded 10 times or more, 28.7% between five and nine times, and 31.4% fewer than five times. Our sample involved HCWs from different sectors of health and social services, including hospitals (71%), long-term care (8%), and local community services centres (11%). We invited all workers to participate in the study regardless of the position they held within their organization. Our research protocol excluded participants on sick leave for a reason unrelated to COVID-19 at the time of recruitment. Note that we did not include age beyond the initial assessment, as the maximum duration of participation included in our analyses was limited to five weeks. We removed any participants with missing values (features or labels) before conducting our analyses.

Measures

The mobile application used for data collection included questions about psychological distress, quantified through anxiety, depression, and PTSD. We measured these indicators with the French versions of the Generalized Anxiety Disorder-7 (7 items; GAD-7; range 0 to 21; cut-off score = 10) [32], the Patient Health Questionnaire (9 items; PHQ-9; range 0 to 27; cut-off score = 11) [33], and the short form of the Post-Traumatic Stress Disorder Checklist for the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (8 items; PCL-5; range 0 to 32; cut-off score = 13) [34]. For each measure, participants were asked to report on the state of their mental health over the past seven days. If a participant's score exceeded the clinical threshold, a message appeared at the end of the questionnaire encouraging them to contact support resources. These measures served as outcomes for the supervised machine learning algorithms.

As part of the prospective study, each participant had to respond to 11 additional questions related to their contact and concerns with COVID-19, their perceived support, their quality of life, and their perceived level of stress (see Supplementary material in Fig. A for the original questions). To develop a model that required the least effort from the HCWs, we selected two of those questions to use as features (i.e., predictors) in our models. To select the two questions, we first removed the questions relating to COVID-19 (n = 5) so that our models could be used beyond the context of a pandemic. Second, our analysis involved identifying the remaining questions (n = 6) with the highest correlations, on average, with the three outcome measures: anxiety (M = 5.60, SD = 4.41), depression (M = 6.16, SD = 4.89), and PTSD (M = 6.61, SD = 5.88) (see Correlation Matrix in Table 1). Of the three remaining questions with the highest correlations, two involved quality of life and were highly correlated with each other. Thus, we removed the quality of life question with the lower correlation and were left with two questions overall. The two questions were: (Q1) In the last seven days, what was your level of stress at work?, and (Q2) When reflecting on your life over the last seven days, how would you rate your personal quality of life? Each question involved a 10-point scale from lowest (1) to highest (10). The purpose was to examine whether responses to these two questions could identify scores on the GAD-7, PHQ-9, and PCL-5.

Table 1 Correlation Matrix of original six questions
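As a hedged illustration of this selection step, the average correlation of each candidate question with the three outcomes could be computed as in the sketch below; the file and column names (q1 … q6, gad7, phq9, pcl5) are hypothetical placeholders, not the labels used in the original dataset.

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("weekly_responses.csv")
candidate_questions = ["q1", "q2", "q3", "q4", "q5", "q6"]
outcomes = ["gad7", "phq9", "pcl5"]

# Pearson correlations between each remaining question and each outcome,
# averaged (in absolute value) across the three outcome measures.
corr = df[candidate_questions + outcomes].corr()
mean_corr = corr.loc[candidate_questions, outcomes].abs().mean(axis=1)

# Questions with the highest average correlation are retained.
print(mean_corr.sort_values(ascending=False))
```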

Analyses

To compare cross-sectional and cumulative measures of anxiety, depression, and PTSD, our analyses applied two machine learning algorithms: logistic regression and support vector machines. Logistic regression applies the sigmoid function to a linear combination of predictors; it is akin to linear regression, but its output is a probability between 0 and 1. Support vector machines allow nonlinear classification by projecting the data into a higher-dimensional space and finding a hyperplane that separates the binary labels. We chose to test two different algorithms to strengthen the conclusions that could be drawn from our analyses.

For both the logistic regression and the support vector machine, we generally used the default values provided by the sklearn package in Python. More specifically, our logistic regression involved an L2 penalty term, a C regularization parameter of 1, a tolerance for stopping of 0.0001, the inclusion of an intercept, and the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm as a solver. We set the maximum number of iterations to 200 and balanced our class weights to minimize skew in the error patterns. Our support vector machine involved the following parameters: a C regularization parameter of 1, a radial basis function kernel, and a gamma value of 1 divided by the product of the number of features and the variance of x. As with the logistic regression, we also used balanced class weights.
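In scikit-learn terms, this configuration corresponds roughly to the sketch below; the parameter values are those stated above, and everything else is left at its default.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Logistic regression: L2 penalty, C = 1, tolerance 1e-4, intercept included,
# L-BFGS solver, 200 iterations, balanced class weights.
log_reg = LogisticRegression(
    penalty="l2",
    C=1.0,
    tol=1e-4,
    fit_intercept=True,
    solver="lbfgs",
    max_iter=200,
    class_weight="balanced",
)

# Support vector machine: C = 1, RBF kernel, balanced class weights,
# gamma = 1 / (n_features * X.var()), i.e., scikit-learn's "scale" setting.
svm = SVC(
    C=1.0,
    kernel="rbf",
    gamma="scale",
    class_weight="balanced",
)
```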

The three predictor variables, or features, for our models were the responses to the two questions (i.e., work stress and quality of life) and biological sex for each participant. The three outcomes, or labels, were binary variables extracted from the GAD-7, PHQ-9, and PCL-5. To screen for potential anxiety problems, the authors of the GAD-7 recommend using a cut-off of 10 [32]. Hence, a score of 10 or more was considered a positive result when screening for anxiety, whereas a score of less than 10 was labelled as a negative result. When screening for depression, the cut-off value was set at 13, as recommended by more recent studies [35, 36]. In this case, a value of 13 or more was scored as positive, whereas a value of less than 13 was categorized as a negative result. Finally, we also set the cut-off at 13 for the PCL-5, as recommended by Price et al. [34].
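A minimal sketch of this binarization step is shown below; the file and column names (gad7, phq9, pcl5) are hypothetical placeholders rather than the labels used in the original dataset.

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("weekly_responses.csv")

# Cut-offs reported in the text: GAD-7 >= 10, PHQ-9 >= 13, PCL-5 >= 13.
df["anxiety_label"] = (df["gad7"] >= 10).astype(int)
df["depression_label"] = (df["phq9"] >= 13).astype(int)
df["ptsd_label"] = (df["pcl5"] >= 13).astype(int)
```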

Our analyses involved testing one cross-sectional and four cumulative models for each outcome (for a total of 10 models per algorithm). The cross-sectional model used the data from time 1 only as predictors and as outcomes. The first cumulative model used the sum of the predictors at times 1 and 2 as input and the outcomes at time 2 as output. Correspondingly, the next cumulative model used the sum of the predictors at times 1 to 3 and the outcomes at time 3. Our analyses applied the same procedure at times 4 and 5, wherein all preceding measures were summed into the predictors, but only the last time point was used as an outcome. This manipulation allowed us to examine whether cumulative scores provided better screening for anxiety, depression, and PTSD.
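For illustration, assuming a wide-format table with one row per participant and hypothetical columns such as stress_t1 … stress_t5 and qol_t1 … qol_t5, the cumulative predictors could be assembled as follows.

```python
import pandas as pd

# Assumed wide format: one row per participant, hypothetical column names.
df = pd.read_csv("wide_format.csv")

def cumulative_features(df, up_to_week):
    """Sum each predictor over weeks 1..up_to_week; biological sex is kept as-is."""
    stress = df[[f"stress_t{t}" for t in range(1, up_to_week + 1)]].sum(axis=1)
    qol = df[[f"qol_t{t}" for t in range(1, up_to_week + 1)]].sum(axis=1)
    return pd.DataFrame({"stress": stress, "qol": qol, "sex": df["sex"]})

# Cumulative 3-week model: predictors summed over weeks 1-3,
# with the binary outcome taken at week 3 only.
X = cumulative_features(df, up_to_week=3)
y = df["anxiety_label_t3"]
```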

For each model, our analyses produced four measures to examine the adequacy of the results: accuracy, sensitivity, specificity, and positive predictive value. Accuracy measured agreement between the outcome values produced by the model and the true outcome values by dividing the number of agreements by the total number of samples. Sensitivity, also referred to as the true positive rate, divides the number of true positives correctly identified by the model by the total number of actual positives. Specificity, also referred to as the true negative rate, computes the opposite: the number of true negatives correctly identified by the model divided by the total number of actual negatives. Positive predictive value reflects the proportion of positive predictions that are correct, dividing the number of true positives by the sum of true positives and false positives. We created confusion matrices to analyze the total error rate and its distribution (see Supplementary material in Figs. B and C).
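These four measures follow directly from a 2 × 2 confusion matrix; a sketch with toy labels (in practice, the true and predicted labels of a test fold) is shown below.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration; in practice these come from a test fold.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
ppv = tp / (tp + fp)           # positive predictive value
```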

A risk to consider with machine learning is overfitting the data, which would result in a failure of the models to generalize to novel data. To prevent this issue, our analyses involved a fivefold cross-validation, wherein the models were trained on 80% of the data and tested on the remaining 20%. Using a fivefold cross-validation balanced our need to keep the maximum amount of data in the training set while keeping at least 100 samples in our test set. The latter was necessary to control the margin of error of our outcome measures (e.g., accuracy, specificity). This process was conducted five times so that each sample (i.e., individual) was in the test set exactly once. Our results section reports the means across folds for given model parameters. We conducted all our analyses using Python (version 3.7) with the scikit-learn machine learning package (version 0.23.2). The raw (anonymized) data and code are available freely in an online repository at: https://osf.io/3ey8x/?view_only=c5a41aeeaeea4061995900fe7ac14631.
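As a sketch of the fivefold scheme, the loop below trains on four folds and tests on the held-out fold, then averages the fold-level results; the toy data, shuffling, and fixed seed are illustrative assumptions rather than details reported in the text.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Toy stand-ins for the real feature matrix (3 features) and binary labels.
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(500, 3)).astype(float)
y = rng.integers(0, 2, size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sensitivities = []

for train_idx, test_idx in cv.split(X):
    model = LogisticRegression(max_iter=200, class_weight="balanced")
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_sensitivities.append(recall_score(y[test_idx], y_pred))

# Report the mean across the five folds, as in the Results section.
print(np.mean(fold_sensitivities))
```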

Results

Table 2 presents the profile of the cohort at each time point. The female to male ratio, age, and number of years of experience remained consistent across time points, suggesting that participant loss was random. The scores on the GAD-7, PHQ-9, and PCL-5 all decreased over time. Two potential explanations for this observation were that participant distress decreased over time or that the participants who were distressed were more likely to drop out of the study. To distinguish between these explanations, we conducted a post hoc analysis in which we compared changes in scores for participants who completed both times 1 and 5. The mean within-participant change in scores for each of the three measures closely matched the changes observed in Table 2. This observation indicates that participants dropped out at random and that distress did decrease over time among those who remained in the study.
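As an illustrative sketch of this post hoc check (with hypothetical wide-format column names), the within-participant change between times 1 and 5 can be compared with the cohort-level difference, which also reflects dropout.

```python
import pandas as pd

# Assumed wide format: one row per participant, hypothetical column names.
df = pd.read_csv("wide_format.csv")

# Participants with both a time-1 and a time-5 GAD-7 score.
both = df.dropna(subset=["gad7_t1", "gad7_t5"])

# Mean within-participant change between times 1 and 5...
within_change = (both["gad7_t5"] - both["gad7_t1"]).mean()

# ...compared with the cohort-level difference across all respondents.
cohort_change = df["gad7_t5"].mean() - df["gad7_t1"].mean()

print(within_change, cohort_change)
```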

Table 2 Sociodemographic profile of the cohort at each time point according to sex, healthcare centers, age, years of experience, scores of the two questions selected and scores of the GAD7, PHQ9 and PCL5

Table 3 presents the four outcome measures for the models using logistic regressions (upper half) and support vector machines (lower half). For the logistic regression, the most accurate and sensitive model for screening anxiety as measured by the GAD-7 was the cross-sectional one. That is, cumulatively measuring anxiety did not appear to improve screening for anxiety using the two questions and biological sex. Consistently, the most sensitive model was also the cross-sectional one when applying the support vector machine. However, the support vector machine produced the highest overall accuracy and specificity with three weeks of cumulative measurements, yet sensitivity was at its lowest in that model. Given the importance of sensitivity for screening tools, these results suggest that cumulative measurement does not produce more adequate screening for anxiety.

Table 3 Accuracy, Sensitivity and Specificity in Identifying a Severe Score on the GAD-7, PHQ-9, and PCL-5 Using Logistic Regressions and Support Vector Machines: Training Results

In contrast, cumulative models were better at screening for depressive symptoms (i.e., PHQ-9) than cross-sectional models. Notably, the logistic regression model reached accuracy, sensitivity, and specificity above 0.80 only when the cumulative 3-week measures were used as predictors. Table 3 indicates that a consistent pattern was observable with support vector machines. Specifically, the cumulative 3-week measures produced the best accuracy and sensitivity for depression screening in the workers. Yet, adding more weekly measures did not further improve the results. Moreover, the positive predictive value decreased with more measures for both models, which is an issue because it would increase the number of HCWs needing further testing. The 5-week cumulative predictors led to the worst screening with logistic regression, and results similar to the cross-sectional model were found for the support vector machines.

We observed similar patterns when using the PCL-5 to screen for PTSD. The most accurate model overall was the 3-week cumulative model for both algorithms, but the cross-sectional model followed closely behind and had much better positive predictive values. The worst performing models were the 5-week cumulative model for the logistic regression and the 2-week cumulative model for the support vector machine. Taken together, these results underline the limited utility of cumulative weekly measures to screen for anxiety, but support a 3-week running measure for potentially detecting depression and PTSD in workers.

Discussion

The results of this study showed that our preliminary models accurately screened for anxiety, depression, and post-traumatic stress disorder in 70% to 80% of cases using two questions and biological sex. These results highlight the ability of machine learning to reduce the number of questions needed to screen for psychological distress, from the 24 items of the GAD-7, PHQ-9, and PCL-5 to two questions and biological sex. Our comparison of cross-sectional and cumulative measures also revealed that the cross-sectional model provides the best balance between accuracy and positive predictive values.

The dynamic interaction of the three items included in the model may explain the observed results. The first question assessed the level of subjective work-related stress on a 10-point scale. Stress underlies anxiety, depression, and PTSD: anxiety results from anticipatory stress, depression develops through chronic stress, and PTSD emerges as a reaction to life-threatening stress [37,38,39]. As such, this question may capture the influence of a transdiagnostic symptom. The second question, related to quality of life, may act as a protective factor by moderating different work-related stress components. A higher quality of life seems to impart an increased resistance to the negative consequences of stress, much like a buffer, giving the individual a more secure distance from the stress itself [1, 40, 41]. Lastly, biological sex contributes to the model inasmuch as women are at higher risk of developing anxiety, depression, and PTSD [42]. In sum, these three items allow the screening of psychological distress while considering biological sex differences.

One of the implications of this study is that it may be possible to mitigate attrition in active monitoring by reducing the number of questions and the assessment duration required to screen for psychological distress. Not only could this decrease the burden on organizations to identify at-risk workers, but it would also take less time away from workers who need to focus on patient services, especially during a crisis. Such an approach may also allow more rapid identification of workers in distress, as early interventions are more likely to prevent the emergence of mental health problems (in contrast to late interventions) [43]. Finally, proper screening allows for a better distribution and organization of resources instead of systematically offering aid to all HCWs, which may avoid wasting limited resources.

This research has limitations that should be noted. The main limitation of our analyses is that we did not conduct hyperparameter tuning for our models, which could have further improved their accuracy. Concerns with overfitting and the size of our sample guided our decision not to apply hyperparameter tuning to the current dataset. A second limitation is that participants did not respond at all measurement points. Considering that the HCWs participated in the context of the COVID-19 pandemic, we recognize that the reality of workers made it difficult to take part in active monitoring consistently. A third limitation lies in the fact that the various sectors within healthcare services, including hospitals, long-term care facilities, and local community services centers, each present unique challenges that could potentially confound our models' results. This limitation is underscored by the clustered nature of the data, which comprise workers from eight distinct Quebec healthcare centers, where the influence of internal management and environmental factors could lead to correlated residuals among the workers. However, the data were compiled with the intention of limiting the number of questions. Another limitation is the self-reported nature of the data, where social desirability and self-representation distortion may produce bias. However, the use of a mobile app in this context reduced the probability of bias because of its anonymous nature. Finally, the positive predictive values remained too low for our models to be adopted in work settings. Further tuning with more data needs to be conducted to produce models that achieve higher values.

That said, our proof of concept clearly shows that machine learning is a promising tool to develop more efficient screening procedures for psychological distress at work. These results provide further support for the potential of machine learning [44,45,46]. Our machine learning approach could pave the way for future research to develop more accurate models. To increase the power and positive predictive values of our models, one avenue may be to increase the range of our response scale. For example, moving from a 0–10 to a 0–15 scale might increase the specificity and sensitivity of our models. Combined with a larger dataset, machine learning algorithms may eventually train models capable of identifying the correct cut-off score in a higher proportion of cases.