Introduction

In psychiatric research, a diagnosis is important both for selecting participants and as an outcome measure. Obtaining a sample of patients who fulfill the criteria for the condition under study, or assessing the outcome, is demanding because of the length of the psychiatric interview required. Until the 1970s, these interviews were mainly conducted face-to-face. Telephone interviews were hardly mentioned as an alternative in textbooks on survey methods [1] and were seen as inferior to face-to-face interviews [2]. Researchers assumed that telephone interviews had to be short and were only suitable for gathering factual data, not for more sensitive issues [1, 3]. The main advantage of telephone research is obvious: its low cost compared with face-to-face interviews [2, 4–6], which are about twice as expensive [1, 3]. Another advantage may be greater control over the interview process [3, 4, 7, 8], thus decreasing interviewer influence [2]. The obvious drawback of the telephone interview is the lack of visual signs, which may cause important diagnostic cues to be missed [2].

Telephone interviews in general show more compliance or acquiescence (yes-saying), more evasiveness (“I don’t know” answers, or no response at all) and more extreme responses than face-to-face interviews [2, 3, 9–11]. Also, respondents tend to give more information in face-to-face interviews, especially in response to open-ended questions [2, 4, 7, 10]. Telephone interviews may be less suitable for people who are hearing impaired [2, 3, 10], mistrustful [8, 12], older [3, 7, 10, 13] or very ill [3, 7]. The same applies to people from minorities or of lower socioeconomic class [3, 12] and to people with lower education [3, 4, 7, 10, 11].

A systematic review comparing telephone and face-to-face interviews for a specific psychiatric disorder, depression, showed good comparability between the two methods, but the authors noted that study quality was generally low [14]. There are, as far as we know, no reviews covering psychiatric disorders in general. An important question is therefore how valid telephone interviews are for psychiatric diagnosis compared with face-to-face interviews. This study reviews the value of telephone-administered standardized psychiatric diagnostic interviews from two perspectives: (1) the sensitivity and specificity of telephone interviews, with face-to-face interviews as the gold standard, and (2) the agreement between telephone and face-to-face interviews.

Methods

We performed a systematic review of the available literature in PubMed, PsycINFO and Embase, examining the value of telephone interviews for establishing a psychiatric diagnosis compared with face-to-face interviews.

Search strategy

In June 2012, we systematically searched for publications with a comparison between telephone and face-to-face diagnostic interviewing. We did not restrict our search by language or by age of participants. An academic reference librarian was consulted to ensure that search strategies and relevant articles were not overlooked.

We searched three databases: PubMed, PsycINFO and Embase. For PubMed, our search consisted of the All Fields and MeSH terms for “mental disorder(s),” “Diagnostic and Statistical Manual of Mental Disorders,” “psychiatry,” “psychiatric,” “bipolar disorder(s),” “anxiety disorder(s),” “depressive disorder(s)” or “depression(s),” AND “interview(s),” “psychological,” “interviewing,” “Interviews as Topic,” “telephone-administered,” “face to face,” “questionnaires,” “diagnosis,” “diagnoses,” “diagnostic,” “assessment,” “measuring,” “telephone” or “phone” (the complete search string for PubMed is shown in the “Appendix”). We adapted the search for the other databases as required.

Selection of publications

We screened titles and abstracts for inclusion. When the title and abstract did not provide sufficient information for inclusion or exclusion, the investigators read the full-text publication. Two investigators (EM, WG) independently selected publications from the list of retrieved publications. Disagreements about inclusion or exclusion were resolved by consulting a third investigator (PL). Inter-rater reliability for inclusion and exclusion was calculated as kappa; we considered a kappa of 0.6–0.8 as good and 0.8–1.0 as excellent agreement [15]. After inclusion, we checked the references for additional publications.

To be included in the selection, studies had to be original studies comparing telephone and face-to-face interviews that used the same standardized diagnostic criteria for a mental health problem. Each patient had to undergo both modes of interviewing. We included studies that considered (1) the comparison between telephone and face-to-face interviewing as a criterion validity issue, with face-to-face interviewing as the gold standard, and (2) the agreement between the two methods. Agreement was based on all items of the questionnaire.

We excluded (1) studies with interviews about topics outside the field of mental health, (2) studies with non-standardized psychiatric interviews, (3) studies with non-diagnostic interviews, (4) studies using a different diagnostic interview by telephone than face-to-face, (5) studies using different respondents for the two interview methods, (6) interviews using interactive voice response and (7) studies comparing the scores of the two instruments with statistical testing or ICC values, as these studies did not determine whether a diagnosis was present or not.

Outcome assessment

We ranked the outcomes of the selected studies according to the risk of psychiatric morbidity: studies in the general population were considered to have a low risk of psychiatric morbidity; studies in general practice and studies of patients with risk factors an intermediate risk; and studies in outpatients of psychiatric hospitals a high risk (Table 1). For the outcome assessment of the selected studies, we examined sensitivity, specificity, percentage agreement and kappa values. Sensitivity is the proportion of true positives correctly identified by the test; specificity is the proportion of true negatives correctly identified by the test. In general, the higher the sensitivity, the lower the specificity and vice versa [16]. Percentage agreement is the extent to which the outcomes of the telephone and face-to-face interviews agree with each other [17]. Kappa is a measure of reliability in which the agreement between two observers or two assessment methods is corrected for chance: a kappa of 0 means that the agreement rests entirely on chance, and a kappa of 1 means perfect agreement [18].
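To make these definitions concrete, the sketch below computes all four measures from a single 2×2 cross-classification of telephone against face-to-face diagnoses. The counts are invented for illustration and do not come from any of the included studies.

```python
# Illustrative sketch: computing the four outcome measures from a 2x2
# cross-classification of telephone vs. face-to-face diagnoses.
# The counts used in the example call are invented, not study data.

def diagnostic_measures(tp, fp, fn, tn):
    """tp/fp/fn/tn: telephone result cross-classified against the
    face-to-face interview taken as the reference standard."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)      # true positives correctly identified
    specificity = tn / (tn + fp)      # true negatives correctly identified
    p_observed = (tp + tn) / n        # percentage agreement (as a proportion)
    # Chance-expected agreement for Cohen's kappa, from the marginals:
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return sensitivity, specificity, p_observed, kappa

sens, spec, agree, kappa = diagnostic_measures(tp=20, fp=5, fn=10, tn=65)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"agreement={agree:.2%} kappa={kappa:.2f}")
```

With these invented counts, the raw agreement is 85 % while kappa is only 0.63, which illustrates how the chance correction lowers the figure.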

Table 1 Included studies

Quality assessment

We used the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool to estimate the risk of bias in the individual studies [19]. The use of this tool is recommended in systematic reviews of diagnostic accuracy by the Agency for Healthcare Research and Quality, the Cochrane Collaboration and the U.K. National Institute for Health and Clinical Excellence. To estimate the risk of bias, the QUADAS-2 tool distinguishes four key domains that have to be rated, each with its own signalling questions:

Patient selection (questions 1–3): (1) Was a consecutive or random sample of patients enrolled? (2) Was a case–control design avoided? (3) Did the study avoid inappropriate exclusions?

Index test (questions 4–5): (4) Were the index test results interpreted without knowledge of the results of the reference standard? (5) If a threshold was used, was it pre-specified?

Reference standard (questions 6–7): (6) Is the reference standard likely to correctly classify the target condition? (7) Were the reference standard results interpreted without knowledge of the results of the index test?

Flow and timing (questions 8–11): (8) Was there an appropriate interval between index test(s) and reference standard? (9) Did all patients receive a reference standard? (10) Did all patients receive the same reference standard? (11) Were all patients included in the analysis?

We chose not to rank the included studies with numerical scores because quality scores have been shown to produce different results depending on how the individual items are weighted [20]. Two researchers (EM, WG) independently scored the risk of bias; disagreements were resolved by consulting a third researcher (PL) [19].
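As an illustration of how such domain judgments can be operationalized, the hypothetical sketch below encodes the eleven signalling questions and derives a per-domain rating, following the QUADAS-2 convention that a domain is judged at low risk of bias only when all of its signalling questions are answered “yes”. The function and its answer format are our own illustration, not part of the QUADAS-2 tool itself.

```python
# Hypothetical sketch of recording QUADAS-2 domain ratings. Per the QUADAS-2
# guidance, a domain is "low" risk only if every signalling question is
# answered "yes"; any "no" suggests high risk; otherwise the risk is unclear.

SIGNALLING_QUESTIONS = {
    "patient selection": [1, 2, 3],
    "index test": [4, 5],
    "reference standard": [6, 7],
    "flow and timing": [8, 9, 10, 11],
}

def domain_risk(answers: dict) -> dict:
    """answers maps question number -> 'yes', 'no' or 'unclear'."""
    risk = {}
    for domain, questions in SIGNALLING_QUESTIONS.items():
        replies = [answers.get(q, "unclear") for q in questions]
        if all(r == "yes" for r in replies):
            risk[domain] = "low"
        elif any(r == "no" for r in replies):
            risk[domain] = "high"
        else:
            risk[domain] = "unclear"
    return risk

# Example: a study with a convenience sample (question 1 answered 'no').
print(domain_risk({1: "no", 2: "yes", 3: "yes", 4: "yes", 5: "yes",
                   6: "yes", 7: "yes", 8: "yes", 9: "yes", 10: "yes", 11: "yes"}))
```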

Data extraction

Data extraction was performed independently by two researchers (EM, WG). For the construction of the data extraction form, we used the items of the STARD statement (Standards for Reporting of Diagnostic Accuracy) [21]. The items relevant for the quality assessment according to the QUADAS-2 tool [19] could be derived from this data extraction procedure.

Results

Selection of publications

Our database search retrieved 3,042 publications. We found six additional articles: four by checking the references of the retrieved articles and two on the internet. After removing duplicates, 1,879 publications remained to be screened (Fig. 1). Applying the exclusion criteria to the titles and abstracts of these 1,879 publications resulted in the selection of 41 citations. The inter-investigator agreement was good, with a kappa of 0.77 (95 % CI 0.71–0.83). Assessment of the full text of the 41 citations resulted in the exclusion of 25 studies, leaving 16 studies to be included.
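For readers who wish to reproduce such interval estimates, the sketch below uses one common large-sample approximation to the standard error of kappa, SE ≈ sqrt(p_o(1 − p_o)) / ((1 − p_e) sqrt(n)). The proportions and n in the example call are invented for illustration; they are not the actual screening data behind the kappa of 0.77 reported above.

```python
# A minimal sketch of an approximate 95% confidence interval for kappa,
# using the large-sample approximation SE = sqrt(po*(1-po)) / ((1-pe)*sqrt(n)).
# All inputs below are invented for illustration.

import math

def kappa_ci(p_observed, p_expected, n, z=1.96):
    kappa = (p_observed - p_expected) / (1 - p_expected)
    se = math.sqrt(p_observed * (1 - p_observed)) / ((1 - p_expected) * math.sqrt(n))
    return kappa, kappa - z * se, kappa + z * se

k, lo, hi = kappa_ci(p_observed=0.90, p_expected=0.55, n=1879)
print(f"kappa={k:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```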

Fig. 1 Flowchart

Description of selected studies

The included studies were generally small, with 13 studies reporting on fewer than 100 participants (Table 1). Many different instruments had been used. Studies using standardized psychiatric interviews (SCID [22], DIS [23] and CIDI [24]) frequently used only one diagnostic section. There was also large heterogeneity in the age and psychiatric morbidity of the included participants. Most studies reported on outpatients visiting specialized clinics. The number of psychiatric disorders addressed in individual studies ranged from 1 to 21. Several small studies addressed a large range of disorders [25–28]. Two studies examined general population samples [26, 27], four studies examined samples with an intermediate risk of psychiatric disorder [28–31] and the remaining 10 studies examined high-risk samples of psychiatric outpatients [25, 32–40] (Table 2). Four studies used semi-structured interviews; their outcomes did not differ from those of studies with structured psychiatric interviews (Table 3). The time between the telephone and the face-to-face interview did not influence the outcomes (Table 4). Finally, there were no differences between the outcomes of interviews by trained lay interviewers and those by professionals (Table 5).

Table 2 Possible sources of bias in studies
Table 3 Subdivision of studies by structured and semi-structured questionnaires
Table 4 Subdivision of studies by time between telephone and face-to-face interview
Table 5 Subdivision of studies by interviewer type

Sensitivity and specificity

The two studies in samples with a low risk of psychiatric morbidity [25, 26] mainly aimed at diagnosing depressive and anxiety disorders. The study by Cacciola [26], with 41 respondents, found a specificity of 94.1 % for any disorder. The study by Watson [25], with 49 respondents, found a specificity of 98 % or higher for substance use disorders. Sensitivity was low in both studies (Table 1). Of the four studies with an intermediate risk of psychiatric morbidity [27, 31–33], only the study by Wells [32], with 230 patients, provided data about criterion validity. They found high specificities for lifetime major depression (89 %), lifetime dysthymia (95 %) and lifetime MDD and/or dysthymia (89 %). Sensitivity was 55 % for lifetime dysthymia, 56 % for lifetime major depression and 71 % for the combination of both disorders. Of the remaining 10 studies, with a high risk of psychiatric morbidity, three provided data about criterion validity [29, 30, 37]. Hajebi [29] assessed 72 outpatients with the SCID psychotic disorder module. Sensitivity and specificity were 86.5 and 82.9 % for any lifetime psychotic disorder, 80.6 and 80.6 % for lifetime primary psychotic disorder, and 73.3 and 67.9 % for primary psychotic disorder in the past 12 months, respectively. Aziz [30] tested the CAPS for the detection of PTSD and the HAM-D for depression in 34 outpatients; sensitivity and specificity were 84 and 80 % for the CAPS (cutoff 65), and 79 and 100 % for the HAM-D. Burke [37] assessed the criterion validity of a version of the Geriatric Depression Scale in 83 elderly outpatients, using a cutoff point of 14. Specificity was 42 % and sensitivity 94 %.
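To illustrate what figures such as Burke's specificity of 42 % imply in practice, the sketch below projects expected counts from sensitivity, specificity and sample size under an assumed prevalence; the 40 % prevalence is our assumption for illustration, not a figure from the study.

```python
# Hypothetical illustration of what the reported accuracy figures mean.
# Burke's figures (sensitivity 94%, specificity 42%, n=83) are from the
# review; the 40% prevalence is an assumption for illustration only.

def expected_counts(n, prevalence, sensitivity, specificity):
    pos = n * prevalence                  # face-to-face positives
    neg = n - pos                         # face-to-face negatives
    tp = sensitivity * pos                # cases detected by telephone
    fp = (1 - specificity) * neg          # false alarms by telephone
    return round(tp), round(pos - tp), round(fp), round(neg - fp)

tp, fn, fp, tn = expected_counts(n=83, prevalence=0.40,
                                 sensitivity=0.94, specificity=0.42)
print(f"true pos={tp} false neg={fn} false pos={fp} true neg={tn}")
# -> roughly 31 true positives, 2 missed cases, 29 false positives,
#    21 true negatives: few missed cases but many false alarms.
```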

Agreement

Of the studies with a low risk of psychiatric morbidity [26, 27], Cacciola reported low agreement and low kappa values; for any disorder, these were 22.2 % and 0.27, respectively. Watson only reported kappa values, which were generally low, with the exception of a kappa of 0.92 for substance use disorders. Of the intermediate-risk studies [28–31], one reported percentage agreement: Paulsen [27] found high values, with the percentage agreement for no mental disorder (85 %) being the lowest. Kappa values in the four studies ranged between 0.45 and 0.84. Paulsen found kappa values between 0.69 (agoraphobia with panic, major depression, no mental disorder) and 0.84 (alcoholism). Evans [33] reported kappa values of 0.72 and 0.75 for common mental disorders and psychiatric caseness, respectively, in a study of general practice attendees. Crippa [31] assessed 100 volunteering undergraduate students with the SCID social phobia module and found a kappa value of 0.84; this study enriched the sample by screening for social phobia before the study. Wells [32] found kappa values of 0.45, 0.48 and 0.57 for lifetime major depression, lifetime dysthymia, and lifetime MDD and/or dysthymia, respectively; they stratified the sample for the presence of indicators of depression prior to the study. The studies in high-risk samples generally reported high percentages of agreement and high kappa values. Six of these studies, however, reported on fewer than 40 participants [27, 30, 34, 36, 38, 39]. Paing assessed 12 parents of child patients, covering 21 psychiatric disorders. Of the larger studies [29, 35, 37, 40], two reported on agreement, providing only kappa values [35, 40]. Lyneham [40] assessed 73 outpatient children with the ADIS-C-IV for anxiety, mood and externalizing disorders and found a kappa of 0.86. Rohde [35] used the KIDDIE-SADS in 60 psychiatric outpatients and found kappa values of 0.96 for major depressive disorder, 0.87 for anxiety disorder, 1.00 for alcohol and substance use, and 0.74 for adjustment disorder with depressed mood.

Quality of included studies

Both studies in low-risk samples had a high risk of bias [36, 37]; three of the four studies in intermediate-risk samples had a medium risk of bias [27, 29, 31, 32]; of the remaining 10 studies, in high-risk samples, two had a low risk of bias (Table 2). In 13 studies, there were problems concerning patient selection [25–32, 34, 35, 38–40]: for instance, oversampling of patients with depressive symptoms [32] or with any lifetime psychotic disorder [29], or other sampling strategies leading to one group with cases and one group with non-cases. Such strategies are likely to exaggerate diagnostic accuracy. Three studies used a convenience sample, resulting in uncertainty about the direction in which the results are biased [25, 26, 30]. Apart from patient selection, the other main cause of bias was interpretation of the index test with knowledge of the results of the reference test, or vice versa, which also inflates measures of validity or agreement. In one study, the same interviewer performed all tests [25], thus introducing bias in the direction of favorable validity measures.

Discussion

Is it valid to perform telephone interviews instead of face-to-face interviews? The use of telephone interviews relies on the premise that a diagnosis obtained with this method is as valid as a diagnosis obtained in a face-to-face interview [29]. Overall, we conclude that there are too few properly performed studies to draw a definite conclusion about the comparability of telephone and face-to-face interviews for psychiatric morbidity.

The included studies are very heterogeneous (in patient groups, setting, type of instrument and quality of the data). The two studies in the general population (low risk of psychiatric disorder) had high specificities for the DIS (Diagnostic Interview Schedule) and the SCID (Structured Clinical Interview for DSM Disorders). This implies that the cases identified by telephone would probably also be identified by the face-to-face interview. This conclusion is subject to doubt, however, because these studies had a high risk of bias. Moreover, the sensitivity was low, implying that many cases might be missed by the telephone interview in comparison with the face-to-face interview. Agreement measures also showed that the two interview modes do not lead to comparable results. The studies with an intermediate risk of psychiatric disorder still had reasonably high specificity and low sensitivity, but a medium to high risk of bias. The study in general practice had good kappas for the broad category of psychiatric caseness. The studies with a high risk of psychiatric disorder had higher sensitivity (fewer false-negative diagnoses) but lower specificity (more false-positive diagnoses) and were of low quality.

The reliability of the assessment of the lifetime prevalence of a psychiatric diagnosis is questionable regardless of the method used [41]. If we therefore restrict our conclusion to the studies assessing current psychiatric morbidity with agreement measures and a medium or low risk of bias, three studies remain [27, 31, 36] that compared the two modes of interviewing in patients with anxiety and depressive disorders. Kappa values in these studies range between 0.69 and 0.84, indicating good agreement. Possibly, in this field, the results of telephone interviewing are comparable with those of face-to-face interviews.

Strengths and weaknesses

A strength of our study is that, to our knowledge, this is the first systematic review of the diagnostic agreement between telephone and face-to-face interviewing that uses methodological criteria such as the QUADAS-2 tool. We performed a broad search in three databases for publications comparing telephone and face-to-face diagnostic interviewing. Inclusion and exclusion of publications and the data extraction were performed by two researchers. An important weakness of our study is that a meta-analysis was impossible because the eligible studies were too heterogeneous with respect to sampling, number of participants and study quality. We limited our review to studies using the diagnostic instrument to make a diagnosis and excluded studies that only compared the scores of the two modes of questioning.

Comparison with the literature

There are few systematic reviews comparing telephone diagnostic interviewing with face-to-face interviewing for mental health. One Dutch study [14] about depression concluded that telephone interviewing for depression is feasible and yields results comparable to those of face-to-face interviews, but the selected studies were methodologically weak. According to the authors, the face-to-face psychiatric interview is still the gold standard. The reliability of psychiatric interviews, however, is not perfect even when considering agreement for interviews regarded as gold standards. For example, Segal [41] reports test–retest reliabilities (kappa) of the SCID interview of 0.32–1.00. Wittchen [42] found very high kappas for the test–retest reliability of the CIDI interview, probably due to the fully structured nature of the CIDI. Our results are broadly in line with these studies. Another review [42], comparing telephone and video conferencing for assessing cognitive function, concluded that the telephone interview has much to offer the clinician and researcher, but that the choice of cognitive instrument should nevertheless allow for the limitations of telephone interviewing (the lack of visual cues). This finding could also apply to mental health interviews. Psychiatric interviews are frequently clinician-administered, and clinicians always use more information than the direct answers to the questions; non-verbal cues probably play an important role in the final judgment about the diagnosis [2]. Therefore, a telephone interview cannot be as specific as a face-to-face interview.

Implications for future research

We recommend that further studies in this field adhere to the guidelines of the QUADAS statement. Specifically, researchers should pay attention to patient selection and to unbiased judgment of the tests. Patients should be enrolled consecutively (or randomly) to avoid a case–control design. Future studies should include larger samples, for example, at least 200 respondents for pilot studies or 400 for reliability studies, and even more for validity studies [43]. Finally, it would be desirable to study a specific disorder with a specific instrument, for example depression with a specific structured depression questionnaire in a group of psychiatric outpatients, instead of a combination of disorders, including psychotic and affective disorders, with a general instrument. Such a study should use a structured rather than a semi-structured interview, because of the variability inherent in the latter. We propose to start with the field of depressive and anxiety disorders.
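As a rough guide to such sample size choices, the sketch below inverts the kappa standard error approximation used earlier to estimate how many paired interviews are needed for a given confidence interval width around kappa. All parameter values are illustrative assumptions, not recommendations from the cited literature.

```python
# A rough sketch, under the same large-sample approximation as above, of the
# number of paired interviews needed for a given precision of kappa.
# The expected agreement proportions and target half-width are assumptions.

import math

def n_for_kappa_halfwidth(p_observed, p_expected, halfwidth, z=1.96):
    """Smallest n such that the approximate 95% CI is kappa +/- halfwidth."""
    se_unit = math.sqrt(p_observed * (1 - p_observed)) / (1 - p_expected)
    return math.ceil((z * se_unit / halfwidth) ** 2)

# E.g., expecting 85% raw agreement and 60% chance agreement, and wanting
# a CI of +/- 0.10 around kappa, roughly 300 paired interviews are needed:
print(n_for_kappa_halfwidth(p_observed=0.85, p_expected=0.60, halfwidth=0.10))
```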

Conclusion

Taking all of this together, we conclude that the evidence that telephone interviews are valid for the diagnosis of psychiatric disorders, compared with face-to-face interviews, is inconsistent. Telephone interviewing in the general population may not be valid, because comparability measures are lowest in these low-risk populations. Telephone interviewing for research purposes in depression and anxiety disorders, however, might be a proper and valid method. Future research on depression and anxiety disorders may benefit the field and should preferably be conducted with fully structured interviews, leaving no room for clinical interpretation of the answers.