Introduction

Mental health assessment is a critical step in clinical practice and research, guiding the treatment and follow-up of patients by clinicians. So far, most of the tools used for screening and diagnosis have been paper-and-pencil assessments designed to evaluate the psychopathology of mental disorders such as depression, anxiety disorders, bipolar disorder, and schizophrenia. As the need to easily handle information about patients’ psychiatric symptoms has increased over time, paper-and-pencil instruments have been transformed into digital questionnaires and used in different digital formats to assess mental health. In recent years, a large number of mobile phone mental health assessment applications (apps) have become available on platforms such as Google Play (Android) and the App Store (iOS), accessible to anyone with a smartphone or tablet. Mobile health (mHealth) is a promising field available to clinicians and patients across distinct areas of medicine, including psychiatry. In this chapter, we review the current literature on the psychometric properties of self-reported digital instruments used for the screening, diagnosis, symptom assessment, and treatment response of mental illness. When available, paper-and-pencil questionnaires are compared with their digital versions. In addition, we discuss the potential and limitations of mHealth in the assessment of mental disorders.

Psychometrics: A Brief Overview

Several psychological scales are available to assess aspects of human behavior such as personality traits, thoughts, memory, cognition, mood, and motivation. However, all scales used to measure these psychological characteristics must be meaningful and reliable. The science that analyzes the basic principles of psychological scales is known as psychometrics [1]; it deals with the validity and reliability of instruments that measure some hypothetical construct (for example, depression, anxiety, self-esteem, intelligence, etc.).

When we say that a scale is valid, we are referring to the degree to which the instrument measures the behavior it is intended to measure. According to the Standards for Educational and Psychological Testing, “validity refers to the degree to which evidence and theory support the interpretations of scale scores for proposed uses of scales. Validity is, therefore, the most fundamental consideration in developing and evaluating instruments” [2]. Furthermore, when an existing scale is transformed from paper-and-pencil format to a digital format (web page, computer software, or mobile application), some steps are necessary to assess whether the two formats are equivalent, since problems may arise from differences in the visual presentation of the items and in the environment in which the assessment is carried out [3]. Thus, studies that assess psychometric equivalence between different formats of a scale (e.g., paper vs. digital) are needed to ensure that both instruments measure the same construct.

In terms of psychometric properties, there are objective ways to analyze the validity and reliability of an instrument, based on the contemporary view that construct validity is the essential concept of validity. In this sense, construct validity is the degree to which an instrument score represents, or can be interpreted as reflecting, a psychological construct (e.g., anxiety, depression, self-esteem, motivation, etc.). According to some authors, the validity of an instrument can be assessed through types of evidence such as content validity (face validity), the internal structure of the scale or reliability (internal consistency and test-retest), construct validity (convergent validity and discriminant validity), and criterion validity (concurrent and predictive validity) [4]. Thus, validity is a unitary concept, and these types of evidence, taken together, add information about the scale’s validity.

Content Validity

Content validity refers to whether the items or questions of a scale cover the content that would be expected of an instrument measuring a specific construct. The items of an instrument must include all relevant facets of the construct; otherwise, the instrument may contain construct-irrelevant content (questions or items) that reduces validity. For example, an instrument to measure occupational functioning may include questions about the ability to work, to look for a job, or to take care of one’s home independently [5]. However, if the instrument had included questions about work preferences or house-cleaning skills, these would likely be irrelevant items for measuring occupational functioning and would not represent the functioning construct. In addition, each facet of the construct should be covered by enough questions or items to avoid reduced validity through under-representation of the construct.

Another important aspect in assessing the content validity of a scale, especially when developing a new one or translating it into a different language, is face validity [4]. Face validity deals with how the respondent perceives the items of an instrument as relevant to measuring the construct under study. For example, Mustafa et al. [6] translated and adapted the mHealth App Usability Questionnaire (M-MAUQ), an instrument that assesses the usability of mobile apps, into Malay and measured face validity by comparing expert scores with target users’ opinions on the understandability of the translated M-MAUQ items. In this example, all items showed an excellent level of agreement (modified kappa >0.75), with a mean face validity index of 0.961 for understandability across the 18 items, indicating equivalence of face validity with the original version.

The Internal Structure of the Scale

The internal structure of the test is another important aspect to analyze in the validation of a new instrument. The internal structure refers to the items (or questions) of a scale and how they relate to each other to form one or more clusters that reflect the construct intended to be measured. Usually, items that correlate strongly with some items but weakly with others form clusters, indicating that more than one domain is being measured. This is particularly useful for understanding whether the scale allows the assessment of a global measure or of specific domains of the construct. Therefore, if a test was developed to have one dimension and the factor analysis shows good correlations among items, there is good evidence that the predicted internal structure was achieved [7]. The internal structure of scales is commonly examined by means of factor analysis (exploratory or confirmatory) or principal component analysis.
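To make this concrete, the sketch below illustrates how an exploratory factor analysis might be used to inspect the internal structure of an item set. The data, item counts, and factor structure are simulated for illustration only and are not drawn from any of the instruments or studies cited in this chapter.

```python
# Minimal sketch: inspecting the internal structure of a 10-item scale
# with exploratory factor analysis (data and structure are hypothetical).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated responses: 300 respondents x 10 items, where items 1-5 and
# items 6-10 are driven by two distinct latent dimensions.
latent = rng.normal(size=(300, 2))
loadings = np.zeros((2, 10))
loadings[0, :5] = 0.8   # first dimension loads on items 1-5
loadings[1, 5:] = 0.8   # second dimension loads on items 6-10
items = latent @ loadings + rng.normal(scale=0.5, size=(300, 10))

# Fit a two-factor model; items that cluster on the same factor should
# show high loadings on it and low loadings on the other, supporting the
# hypothesized internal structure.
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(items)
print(np.round(fa.components_.T, 2))  # rows = items, columns = factors
```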

Internal Consistency Reliability

Internal consistency reliability assesses the degree to which the questions of an instrument measure the same underlying concept. It can be used to determine the consistency of an instrument’s score when it is administered once or across replications of the same test. When the test-retest approach is used, scores obtained at distinct points in time may be assessed by correlation analysis, whereas coefficient alpha, or Cronbach’s alpha, may be used when the instrument is administered only once [1, 2, 8]. Furthermore, the reliability of a score can be estimated empirically by its reliability coefficient, generalizability coefficient, item response theory (IRT) information functions, standard errors, error/tolerance ratios, or various indices of classification consistency [2]. Based on classical test theory (CTT), reliability coefficients are estimated by statistical analysis of internal consistency.
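As an illustration, the sketch below shows how coefficient alpha is typically computed from a respondents-by-items score matrix. The function and the example data are hypothetical and are not taken from any of the cited studies.

```python
# Minimal sketch: Cronbach's alpha from a respondents-by-items matrix,
# using the standard formula
#   alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: 2-D array, rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 5 respondents answering 4 Likert-type items.
example = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])
print(f"alpha = {cronbach_alpha(example):.2f}")
```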

In general, reliability can be considered stronger or weaker, as no score represents 100% reliability. Keeping that in mind, and according to CTT, the reliability coefficient of a score ranges from 0 to 1, with 0 indicating no evidence of reliability and 1 indicating perfect reliability. As CTT takes into account observed scores, true scores, and measurement error, a score with a reliability coefficient of 0.70 would indicate that 70% of the observed score variance reflects the true score of the construct and 30% reflects measurement error from any source [2]. According to some authors, a reliability coefficient >0.70 indicates a satisfactory level of reliability [8].
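In standard CTT notation (restated here for clarity rather than quoted from the cited sources), this interpretation follows from decomposing the observed score into a true score and an error term and expressing reliability as the proportion of observed-score variance due to true-score variance:

\[
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},
\qquad
\rho_{XX'} = 0.70 \;\Rightarrow\; \sigma^2_T = 0.70\,\sigma^2_X,\;\; \sigma^2_E = 0.30\,\sigma^2_X .
\]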

Construct Validity (Convergent Validity and Discriminant Validity)

Convergent Validity

Convergent validity refers to the extent to which different ways of measuring the same construct produce similar results. Specifically, it is the degree to which scores on the instrument under study relate to measures of other constructs that, on theoretical grounds, are expected to be close to the one tapped by this instrument. Evidence of convergent validity is provided by the extent to which a newly developed scale correlates highly with other variables designed to measure the same construct. Therefore, if the score of the newly developed scale is highly correlated with that of another scale measuring the same construct, we conclude that there is some level of convergent validity [4].

Discriminant Validity

Discriminant validity refers to evidence that a measure is distinct and not simply a reflection of some other construct [4]. In other words, it is the degree to which the scores of the instrument under study are differentiated from the behavioral manifestations of other constructs that, from a theoretical point of view, should not be related to the underlying construct of the investigated instrument [4]. For instance, González-Robles et al. [9] studied the psychometric properties of the online version of the Overall Anxiety Severity and Impairment Scale (OASIS) among Spanish patients with anxiety and depressive disorders, including discriminant validity. In that study, the correlation of the OASIS with the Positive and Negative Affect Schedule-Positive Affect was not as high (r = −0.40, p < 0.01) as with the Beck Anxiety Inventory (BAI; r = 0.61, p < 0.01), suggesting that the OASIS retained the property of evaluating symptoms of anxiety rather than positive affect.

Criterion Validity (Concurrent and Predictive Validity)

In addition to what has been mentioned so far, the other relevant aspects of validity are concurrent validity and predictive validity [4].

Concurrent validity refers to the relationship between the scores of two instruments measuring the same construct administered at the same time, usually the new instrument compared with a “gold standard” for the construct of interest. For example, the Beck Depression Inventory (BDI) depression score delivered through the ReMAP app showed a good correlation with “gold standard” clinician-rated depression severity on the Hamilton Depression Rating Scale (HDRS) in a subset of the sample (r = 0.78), suggesting evidence of concurrent validity [10].

In contrast to concurrent validity, predictive validity is the extent to which a measure predicts the answers to some other question or an outcome to which it ought to be related, i.e., the scale should be able to predict a behavior in the future [4]. For instance, the online version of the Dutch Penn State Worry Questionnaire (PSWQ), a self-reported assessment of pathological worry, had its predictive validity estimated through its relationship with worry frequency and worry duration [11]. In this study, the PSWQ score was significantly associated with the total time spent worrying during the day (r(187) = 0.446, p < 0.001) and during the night (r(187) = 0.324, p < 0.001), as well as with the frequency of worry episodes during the day (r(187) = 0.418, p < 0.001) and during the night (r(187) = 0.310, p < 0.001), suggesting that worry frequency and worry duration were predicted by PSWQ scores.
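Both concurrent and predictive validity are commonly quantified as a correlation between the instrument’s score and a criterion measure (a simultaneous gold standard rating or a later outcome, respectively). The sketch below shows this computation on hypothetical data; the variable names and values are illustrative and do not reproduce the analyses of the studies cited above.

```python
# Minimal sketch: criterion validity as a Pearson correlation between an
# instrument score and a criterion measure (e.g., a clinician rating taken
# at the same visit for concurrent validity, or a behavior measured later
# for predictive validity). Data are hypothetical.
import numpy as np
from scipy.stats import pearsonr

app_score = np.array([12, 25, 8, 31, 19, 22, 5, 27, 15, 10])  # e.g., app-delivered scale
criterion = np.array([10, 22, 9, 28, 17, 24, 6, 25, 14, 12])  # e.g., clinician rating

r, p_value = pearsonr(app_score, criterion)
print(f"r = {r:.2f}, p = {p_value:.3f}")  # a high r with a small p supports criterion validity
```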

The Psychological Process Used in the Scale Responses

The psychological process used in the test responses concerns the cognitive process a respondent actually uses while answering a test versus the cognitive process they are expected to use [1, 2]. This is an important step in assessing the degree of validity of a measure, as any deviation from the expected response process can affect the test score beyond the intended purpose of the test. Some authors illustrate this issue with test takers who drew on more than the intended cognitive and attentional resources to answer a word task [1]. In that example, the scores were inflated because one group did not follow the rules, and the scores therefore did not show strong evidence of validity.

Consequences of Using the Test

The consequences of using the test concern sources of bias and the appropriate application of scores in decision-making, which affect the degree of validity of the construct measure and its intended use. For example, suppose men score higher than women on a depression screening test and, on the basis of that result, are referred to a psychiatrist. If the test items are not truly gender balanced, men would be given undue priority for consultation. In this hypothetical scenario, construct validity is impaired, as scores can be biased and result in adverse consequences for test takers. Typically, most instrument comparisons in clinical practice do not assess this aspect of construct validity.

In sum, we have given an overview of the main components of the psychometric properties commonly used by researchers to assess the validity and reliability of scales, whether newly developed or transformed from existing instruments into digital format, mainly to help the reader follow the next sections. It is beyond the scope of this section to discuss Item Response Theory (IRT) as another method for evaluating the measurement properties of scales; we suggest [12] as a starting point for interested readers. Finally, we have chosen examples to clarify most definitions of validity, although we cannot guarantee that the results given in the examples are in fact confirmations of validity, as validity is a matter of degree rather than a yes/no question.

Psychometrics of Mental Health Instruments: Paper-and-Pencil Versus Digital Formats

With the widespread use of the internet in the 1990s, the assessment of mental disorders entered a new era of digital assessment through computer-based instruments, internet web pages, and, more recently, mobile apps on smartphones or tablets [7, 13, 14]. While in the previous section we discussed the main steps to consider when assessing the validation of new instruments in psychiatry, here we describe the process that must be followed when those instruments are transformed from paper-and-pencil format into digital versions.

While the instruments available in digital format cover a broad range of mental illnesses [15], there are concerns that the psychometrics of the digital format may not match those of the original paper-and-pencil format, which can affect, to some extent, the validity and reliability of the scores measured [16]. For instance, the assessment of mild cognitive impairment by the Cambridge University Pen to Digital Equivalence assessment (CUPDE) showed significant differences in the reliability and validity of scores compared with its paper-and-pencil counterpart, the Saint Louis University Mental Status examination (SLUMS) [17], even after the change from a web-based to an app-based interface/layout [18]. In addition, the assessment of anxiety in patients with panic disorder by the internet-based BDI questionnaire showed a significant difference in mean scores, with lower scores observed in the internet version compared with the pen and paper assessment [19]. Furthermore, not all instruments used to assess psychological symptoms through mobile apps have been validated, suggesting that more studies are needed to analyze the equivalence between formats [16].

In this sense, the equivalence of the different formats of instruments used in psychiatry has been reviewed by several studies [7, 13, 14] considering some aspects of validity and reliability. According to van Ballegooijen et al. [13], the equivalence between distinct formats should be assessed by the same steps used in validity and reliability studies of newly developed scales. Therefore, the following tests should be considered in order to examine equivalence between formats: internal consistency, test-retest reliability, measurement error, internal structure and model fit or explained variance, correlation between the two instruments, difference in mean scores between online and paper versions, and criterion validity in terms of sensitivity and specificity (for the optimal cut-off point). Likewise, another systematic review [7] highlighted the importance of performing test-retest reliability, internal consistency, and mean difference analyses between instruments, including effect size tests.
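As a simple illustration of two of these checks, the sketch below estimates (1) agreement between paired paper and digital administrations with a two-way mixed-effects, consistency, single-measure intraclass correlation coefficient (often labeled ICC(3,1)), and (2) the difference in mean scores with a paired t-test. The data and variable names are hypothetical and are not taken from the reviews cited above.

```python
# Minimal sketch: two format-equivalence checks for paired paper vs.
# digital administrations of the same scale (hypothetical data).
import numpy as np
from scipy.stats import ttest_rel

paper   = np.array([14.0, 22.0, 9.0, 30.0, 18.0, 25.0, 7.0, 16.0])
digital = np.array([15.0, 21.0, 10.0, 29.0, 19.0, 24.0, 8.0, 18.0])

def icc_3_1(x: np.ndarray, y: np.ndarray) -> float:
    """Two-way mixed, consistency, single-measure ICC for two formats."""
    data = np.column_stack([x, y])   # rows = subjects, columns = formats
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)
    col_means = data.mean(axis=0)
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    resid = data - row_means[:, None] - col_means[None, :] + grand
    ms_error = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

print(f"ICC(3,1) = {icc_3_1(paper, digital):.2f}")   # agreement between formats

t_stat, p_value = ttest_rel(paper, digital)          # mean-score difference
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```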

Furthermore, there is some evidence that the respondent’s perception of how the questions are delivered should be taken into account and may provide evidence of face validity. For instance, participants reported a preference for a single item per web page rather than multiple items when answering instruments such as the Beck Depression Inventory (BDI), Beck Anxiety Inventory (BAI), Quality of Life Index (QOLI), and Montgomery-Åsberg Depression Rating Scale (MADRS) [20]. It is also important to consider the respondent’s perception of the digital layout, along with the functionality, navigation, personalization, and appearance of a mobile app [21].

In sum, all these aspects might influence the way respondents answer questions, thus affecting the validity and reliability of the instrument. Accordingly, instruments that assess psychological symptoms require further validation studies when the original format is adapted to digital devices, including original paper-and-pencil versions transformed into computer-based instruments, web page instruments, and mobile applications. In the tables below, we summarize the psychometric properties (i.e., face validity, discriminant validity, concurrent validity, internal consistency, intraclass correlation coefficients (ICC), correlations, and mean score comparisons) of some digital instruments based on pen and paper scales commonly used to measure symptoms in the field of psychiatry.

Online Web Page Self-Reported Questionnaires

The online web page format includes any platform accessed over the internet using a browser. This digital format requires an internet connection and a mouse, keyboard, or fingertip to navigate and select web page content. In the field of mental health, few studies have compared the equivalence of online web pages with pen and paper versions [7, 13]. Overall, online and pen and paper versions have been compared in terms of correlation between scores, comparison of mean scores, effect size of differences, internal consistency, convergent validity, and criterion validity [13] (see Table 4.1). For example, instruments that assess symptoms of anxiety have shown good degrees of reliability (Table 4.1). However, the Beck Anxiety Inventory (BAI) assessed online showed a notable difference in mean score compared to the paper-and-pencil version.

Table 4.1 Psychometric properties of instruments to assess self-reported symptoms of depression and anxiety in online web pages

In general, instruments that assess post-traumatic stress disorder had a good level of reliability when delivered in web page format compared to their pen and paper counterparts. For example, mean scores of the PTSD Checklist-Civilian Version (PCL-C), Trauma Symptom Screen Frequency (TSS Frequency), Trauma Symptom Screen Distress (TSS Distress), and Traumatic Life Events Questionnaire (TLEQ) were similar to those of the pen and paper versions [7]. In addition, all showed format correlations (ICC and/or r > 0.65) with pen and paper and internal consistency >0.80 in the web page format [22–24].

As with the other measures summarized so far, questionnaires assessing self-reported symptoms of panic disorder (Body Sensations Questionnaire, BSQ; Agoraphobic Cognitions Questionnaire, ACQ; Mobility Inventory Accompanied, MI Accompanied; Mobility Inventory Alone, MI Alone) showed good reliability, with format correlations (ICC or r > 0.90) with pen and paper and high internal consistency (Cronbach’s alpha >0.9) [19, 25]. However, assessment of web page mean scores showed that the BSQ, ACQ, and MI Alone might differ slightly from pen and paper scores [7]. Even though the results are informative, researchers have to consider such differences when transforming the pen and paper format of those instruments to web page format.

The instruments used to measure perceived physical and mental health have not performed as well in web page format. For instance, there were some differences in scores on subscales of the General Health Questionnaire-28 (GHQ-28) and the Symptom Checklist 90 Revised (SCL-90-R) [7], indicating that subscale scores might not be consistent with the pen and paper versions. However, format correlations (GHQ-28 r = 0.49–0.92; SCL-90-R r = 0.74–0.96) and internal consistency (Cronbach’s alpha >0.90) showed some evidence of validity [26]. Other scales, such as the 12-item Short Form Health Survey Version Two (SF12V2), had scores similar to the paper-and-pencil version [7], with moderate internal consistency (Cronbach’s alpha = 0.68) [27]. Thus, researchers should use the GHQ-28 and SCL-90-R with caution regarding subscale scores, while the SF12V2 might be a good alternative for assessing the physical and mental health construct.

The instruments assessing self-reported alcohol misuse have shown good evidence of reliability. For example, the Alcohol Dependence Scale (ADS), the Alcohol Use Disorder Identification Test (AUDIT), the Rutgers Alcohol Problem Index 1 month (RAPI 1 month), the Rutgers Alcohol Problem Index 6 months (RAPI 6 months), and the Rutgers Alcohol Problem Index 1 year (RAPI 1 year) all showed equivalence of mean scores with the pen and paper versions [7]. In addition, all performed very well regarding test-retest reliability (r > 0.78) [28].

The only instrument analyzed by Alfonsson et al. [7] to assess symptoms of insomnia, the Insomnia Severity Index (ISI), showed good reliability compared to the pen and paper version. Analysis showed format correlations of 0.99/0.98 and internal consistency of 0.61/0.88 [20], with no statistically significant difference in mean scores between the paper and pen (15.86 ± 3.80) and online (16.00 ± 3.87) versions [7].

Altogether, there is a good level of evidence that instruments assessing a wide range of psychological symptoms through online web pages maintain equivalence with pen and paper measurements, except for a few subscales assessing panic symptoms (BSQ and ACQ, with marginally lower and higher scores, respectively, compared to the pen and paper format) and physical and mental health (SCL-90-R and GHQ-28), which showed some differences in mean scores.

Computer-Based Instruments

Computerized self-report instruments are digital versions of pen and paper questionnaires delivered through desktop software [35] rather than through a web page accessed over the internet. For instance, the Patient Health Questionnaire-9 (PHQ-9) and the Beck Depression Inventory-II (BDI-II) were part of a computer-based therapy program designed to improve symptoms of depression, delivered via a flash drive on a designated onsite computer in an outpatient clinic [36]. The assessment of mental health by computer-based instruments also covers a broad range of self-reported symptoms, including depression and anxiety (Table 4.2).

Table 4.2 Psychometric properties of instruments to assess self-reported symptoms of depression and anxiety in computer-based studies

In the assessment of depression, the BDI was studied in four independent studies [35, 37–39], with good reliability (Table 4.2). For the assessment of reliability between the pen and paper and computer-based formats, studies performed with anxiety instruments provided too few data to allow a full analysis, with few reporting scores and intraclass correlations (Table 4.2).

The study by Schmitz et al. [40] compared pen and paper and computerized versions of the SCL-90-R to assess perceived mental health. Internal consistency was high (Cronbach’s alpha = 0.98), but no information on interformat correlation was reported. In addition, there was no statistical difference in mean scores between formats (pen and paper: 1.20 ± 0.66 vs. computerized version: 1.29 ± 0.66) [7].

The studies performed by Chan-Pensley [41] and Murrelle et al. [42] assessed the psychometrics of computer-delivered instruments measuring alcohol and tobacco dependence or misuse. The AUDIT (mentioned above), the Michigan Alcohol Screening Test (MAST), the CAGE Substance Abuse Screening Tool (CAGE), the Drug Abuse Screening Test (DAST), and the Fagerstrom Tolerance Questionnaire (FTQ) showed format correlations of r > 0.65. A more recent analysis showed that the computerized version of the AUDIT had scores very close to those of the paper-and-pencil version, while mean scores for the other instruments were not reported in comparison studies [7]. However, not all studies assessed format correlation and internal consistency between instruments, which may limit the interpretation of the results.

Paper-and-pencil scales transformed into computer-based instruments were the earliest digital format used to assess psychological symptoms. In general, most scales delivered through computer software showed some evidence of equivalence with the pen and paper format, except for the BDI (depression) and the State-Trait Anxiety Inventory-State subscale (STAI-S; anxiety), which had higher mean scores in the pen and paper version [7].

Mobile Application (App) Format

The number of health apps available for download was estimated to be as high as 325,000 in 2017 [47], with more than 10,000 related to mental health [48]. The use of mHealth technologies in severe mental disorders such as bipolar disorder, schizophrenia, and major depressive disorder has been systematically reviewed, yielding valuable results regarding the psychometric properties of some apps [15]. Most studies in the area of mental health assessment through mobile apps were published after 2013 [13], probably as a result of the widespread use of smartphones. Thus, in this section, we summarize some findings in the field published in recent years.

The Mobile Screener was an app developed on the iOS platform (iPhone) to assess symptoms of PTSD (PTSD Checklist, PCL-C), depression (Patient Health Questionnaire-9, PHQ-9), suicidal ideation (Revised Suicidal Ideation Scale, R-SIS), anger (Dimensions of Anger 5, DAR5), common sleep difficulties and daytime tiredness (Sleep Evaluation Scale), and clinical symptoms (BI Self-Report of Symptoms) in healthy volunteer soldiers [22]. All measures were analyzed by internal consistency and intraclass correlation between the app and pen and paper formats. In general, digital scores on all instruments were close to those of the original format, with intraclass correlations ranging from 0.62 (DAR5) to 0.95 (Sleep Evaluation Scale). In addition, respondents rated the app favorably for ease of submitting answers and navigating through pages, sections, and questions. Indeed, more than 70% of them preferred the digital app format over other questionnaire formats [22]. However, the limitations of the study included the assessment of symptoms in healthy volunteers and the small sample size (N = 46), meaning the results may not generalize to patients.

Another study developed a mobile tablet app to measure psychosocial functioning in patients with schizophrenia based on the pen and paper full version of the University of California San Diego Performance-Based Skills Assessment (UPSA) [49]. The mobile app (UPSA-M) retained four of the five subtests (planning recreational activities, finance, communication, and transportation) of the original version. The UPSA-M showed feasibility and 80% sensitivity in differentiating healthy subjects from patients with schizophrenia, and app scores correlated significantly with the UPSA pen and paper version (r = 0.61). However, in healthy controls the correlation did not reach significance (r = 0.24). The authors stated that the UPSA-M may possess the same psychometric properties as the full UPSA and that further studies are needed to validate it for use in clinical practice [49].

The ClinTouch app was developed to assess daily self-reported psychosis compared with face-to-face Positive and Negative Syndrome Scale (PANSS) and Calgary Depression Scale (CDS) interviews [50]. The app was developed on the Android platform and contained two sets of questions based on the PANSS and CDS. Set 1 consisted of questions assessing guilt, hopelessness, depression, social withdrawal, conceptual disorganization, excitement, and hallucinations, while set 2 assessed anxiety, grandiosity, hostility, somatic concern, guilty ideas of reference, paranoia, and delusions. The validity of ClinTouch was evaluated in remitted patients, acutely psychotic patients, and those at ultra-high risk of developing psychosis. The patients showed good compliance with the study procedure, and only those with negative symptoms were likely to show greater reactivity to the app (i.e., changes in thoughts or mood from answering the questions). In addition, alpha scores showed satisfactory internal consistency (Cronbach’s alpha >0.76). In general, there were significant correlations with PANSS positive and affective symptoms, whereas the passive/apathetic social withdrawal, hostility, excitement, and cognitive disorganization items did not correlate with the corresponding PANSS subscales, suggesting some limitations of self-reported assessment in this group of patients.

Apps that allow patients to assess daily measures of mania and depression are extremely useful for providing data on mood changes over time and can serve as a guide for relapse prevention in individuals with bipolar disorder. The “Monitoring, treatment and prediction of bipolar disorder episodes” (MONARCA) app was developed specifically to assess mood symptoms in bipolar disorder. The app asked participants to rate, every evening for 3 months, items regarding subjective mood, sleep duration, medicine intake, irritability, activity level, mixed mood, cognitive problems, alcohol consumption, stress, and individual warning signs. In addition, objective measurements of social activity, physical activity, speech duration, and cell tower ID were taken automatically. The MONARCA validity study showed 88% adherence to the self-report measures using the app and a significant correlation between depressive symptoms measured by the app and the Hamilton 17-item Depression Rating Scale interview. However, no correlation was found between the Young Mania Rating Scale (YMRS) and self-reported manic symptoms, which was explained by the low prevalence of manic symptoms in the sample (mean YMRS score = 2.7) [51].

The Mindful Moods app was developed to assess real-time symptoms of depression in daily life in a sample of adult patients with major depressive disorder (N = 13), using a smartphone version of the PHQ-9 administered three times a day for 29–30 days [52]. Respondents received survey notifications via the app with three randomly selected PHQ-9 items to answer on a Likert scale throughout the day. In addition, patients attended in-person visits to complete a pen and paper PHQ-9 at the beginning and end of the study. The analysis showed a good score correlation between the two formats (r = 0.84), although the app scores were on average 3.02 points higher than those of the pen and paper version. Furthermore, suicidal ideation at levels 2 and 3 was reported only in the PHQ-9 app version, suggesting that the assessment setting may have influenced responses and scores. In addition, adherence to the study protocol was 77% over 30 days, suggesting the feasibility of a long-term protocol to assess symptoms of depression in real time.

Another study developed the Remote Monitoring Application in Psychiatry (ReMAP) app to collect ecological momentary assessment (EMA) data on symptoms of depression in a sample of healthy controls and patients with major depressive disorder (MDD), bipolar disorder, social anxiety disorder (SAD), MDD with comorbid SAD, or specific phobia (SP) of the spider subtype [10]. The study app was a digital format of the BDI, and the concordance of its scores with the paper-and-pen versions of the BDI and BDI-II and with the clinician-rated HDRS was assessed by intraclass correlation coefficients and internal consistency (Cronbach’s alpha). The overall agreement of the BDI between formats was high (ICC = 0.92), but lower for healthy controls (ICC = 0.63) and patients with anxiety disorders (ICC = 0.72). The internal consistency of the ReMAP BDI (Cronbach’s α = 0.944) was similar to that of the pen and paper BDI (BDI-I: α = 0.945; BDI-II: α = 0.944). In addition, concurrent validity was established for the ReMAP BDI, which correlated with clinician-rated depression severity on the HDRS in a subset of the sample (r = 0.78), comparable to the association between the HDRS score and the pen and paper BDI score (r = 0.68), suggesting that the ReMAP BDI showed evidence of equivalence with the pen and paper BDI in bipolar patients.

Lastly, a recent systematic review evaluated the feasibility and evidence of validity of mobile apps developed to monitor episodic symptoms and the course of symptoms over time in patients with bipolar disorder [14]. The review included 13 studies, but only eight assessed the equivalence of the scores obtained in the digital version with clinician-rated assessments or pen and paper self-reported scales. In general, the authors concluded that there is some evidence of concurrent validity for the Monsenso system app (compared with the clinician-rated HDRS and YMRS) and MONARCA (compared with the clinician-rated HDRS-17 and YMRS), whereas a mood chart scale app did not show concurrent validity compared with the pen and paper mood chart, the MADRS, and the YMRS. In addition, there was convergent validity between MONARCA self-reported mixed symptoms and the Cohen Perceived Stress Scale (PSS), but not with the abbreviated World Health Organization Quality of Life scale (WHOQoL-BREF) scores. Furthermore, MONARCA showed convergent validity for both irritability and mood instability with the Functional Assessment Short Test (FAST), the PSS, and the WHOQoL-BREF. These findings suggest that mobile app-based self-report tools are valid for assessing symptoms of mania and depression in euthymic patients with bipolar disorder.

Conclusion

The evaluation of the psychometric properties of instruments transformed into digital format has great potential in psychiatry. Researchers must carefully examine all types of validity evidence when developing new digital formats based on established, high-quality mental health instruments. First, the selection of gold standard instruments is suggested, ideally those already studied in the target population (general population or clinical sample). When developing a new instrument in digital format from scratch, it is extremely important to choose the appropriate construction and content, usually based on previous instruments and the opinions of experts in the field. Second, another key aspect of developing a digital assessment is testing whether the target population is able to use the format, especially considering their ability to use mobile devices; ideally, pilot studies with the target population would improve face validity before establishing a new scale in digital format. Third, after collecting data in a pilot study, the agreement of the internal structure of the digital instrument with the original paper-and-pen version should be checked, usually by internal consistency and factor analysis. If agreement is not complete, consider the extent to which the differences might impair the accuracy of the construct being measured. Fourth, it is very important to compare the scores of the newly developed digital instrument with those of other instruments measuring the same construct, preferably a gold standard instrument, to confirm concurrent validity. Finally, and most importantly, researchers must plan carefully before starting research on scale validation in a new format, as digital health is continually evolving in the way data are collected.

In conclusion, the field of digital mental health assessment has evolved over the past 25 years from computer-based instruments to today’s use of mobile apps to measure symptoms across a wide range of mental conditions. Although mobile assessment psychometrics has been studied for some recognized instruments, it is imperative that more psychometric studies be carried out in patients with symptoms of anxiety, post-traumatic stress disorder, and dependence or misuse of alcohol and tobacco. In addition, the respondent’s perception of digital layout in mobile apps, along with their judgment on navigation, safety, and ease of use, should be addressed in future studies that compare the psychometric properties of mobile app questionnaires with their paper-and-pencil versions.