Introduction

Depression can occur at any age. Pregnancy and childbirth make some women vulnerable to developing major depressive disorder (MDD) [1]. Although postpartum depression (PPD) is widely known, depression is also common during pregnancy, where it is called antepartum depression (APD). Public-health experts are increasingly focused on APD [2]. An estimated 7–20% of pregnant women suffer from a depressive disorder [3], and APD significantly increases the risk of PPD [4, 5]. Untreated APD and PPD can impair maternal-infant attachment and child development, and in severe cases can lead to outcomes such as infanticide [6,7,8]. Thus, early and accurate detection of MDD is imperative [8], and screening should be required in the peri-partum period. Fortunately, depression during and after pregnancy is treatable [9]. With early improvement, the eventual remission rate reaches over 80% [10].

Many screening tools for MDD are available. Representative tools include the Beck Depression Inventory (BDI), the Center for Epidemiologic Studies Depression Scale (CES-D), the ten-item Kessler Psychological Distress Scale (K-10), and the Patient Health Questionnaire (PHQ-9), all of which are self-rating tools for adults [11]. Strong negative emotions such as sadness, loss of interest, and hopelessness are prevalent in all types of depression. However, because pregnancy and childbirth are often celebrated events, pregnant women may not always be aware of their depressive symptoms [12]. Even when they are aware, they often hide their symptoms because they feel compelled to be a good mother [13]. In addition, symptoms such as fatigue and changes in appetite and sleep are typical of pregnancy and the postpartum period, so they may be misinterpreted and lead to false-negative results [14, 15].

Considering this, Cox et al. [15] developed the Edinburgh Postnatal Depression Scale (EPDS) in 1987 to better detect PPD. Whereas other tools such as the BDI and the PHQ-9 ask simple questions about sleep disturbance, the EPDS asks about difficulty sleeping in relation to unhappiness and excludes questions about appetite and fatigue. The target population of the EPDS has since expanded from postpartum women to pregnant women [16], and the scale is widely used in epidemiologic surveys to measure the prevalence of major depression in perinatal women [17]. Therefore, the EPDS, as a perinatal-specific depression screening tool, may be more useful than general tools for evaluating depression in perinatal women.

The screening accuracy of the EPDS has been verified through several systematic reviews (SRs) [18,19,20]. These SRs tested the performance of the EPDS as a depression screening tool or sought the optimal EPDS cutoff scores in pregnant and postpartum women. However, none of them compared the screening accuracy of the EPDS with that of other depression screening tools or contrasted its screening performance between APD and PPD.

Thus, the purpose of this study was to compare the screening performance of the EPDS and other depression screening tools in two ways. First, we compared the predictive validity of the EPDS and other screening tools after dividing the subjects into pregnant, postpartum, and perinatal women (a mix of pregnant and postpartum women). Second, we compared the screening performance of the EPDS with that of each of the other depression screening tools.

Methods

This study was performed according to the guidelines of the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy [21] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 Statement [22].

Search strategy and literature sources

We systematically searched four electronic databases (MEDLINE, Embase, CINAHL, and PsycArticles) for eligible articles on July 17, 2021. Key search terms were depression, perinatal-related terms, and EPDS. Depression and postpartum were searched both as exploded MeSH terms and as free text. EPDS was searched using its full name and abbreviation. We broadened the search by applying free-text searching to full texts in addition to titles and abstracts. The search strategies for all databases are presented in Supplementary Table 1.

Table 1 Characteristics of selected studies
Table 2 Summary results of meta-analysis

Eligibility criteria

Inclusion criteria were: (i) types of studies, studies that reported diagnostic accuracy such as sensitivity and specificity (e.g., observational studies such as cohort or cross-sectional studies); (ii) types of participants, studies on pregnant and postpartum women over 18 years of age; (iii) index test, studies that used the 10-item EPDS; (iv) comparators, studies on depression screening tools of any type compared with the EPDS (for the meta-analysis, depression screening tools reported in three or more studies were selected); (v) gold standard, studies in which trained psychiatric professionals conducted direct interviews (e.g., based on the Diagnostic and Statistical Manual of Mental Disorders [DSM] or the International Classification of Diseases [ICD]) or structured interviews (e.g., Structured Clinical Interview for DSM [SCID], Mini-International Neuropsychiatric Interview [MINI]) using diagnostic criteria for MDD as the gold standard; and (vi) types of outcomes, studies with data on true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). From these data we derived sensitivity, specificity, positive and negative likelihood ratios, the diagnostic odds ratio, and the sROC curve as outcome measures.
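As a rough illustration of how the outcome measures named above follow from a two-by-two table, the sketch below computes them for a single study. The function name accuracy_measures and the example counts are hypothetical and are not taken from any included study. Pooled estimates in a review like this come from a meta-analytic model, not from this per-study arithmetic.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Per-study diagnostic accuracy measures from a 2x2 table.

    No continuity correction is applied, so a table with a zero cell
    in a denominator raises ZeroDivisionError.
    """
    sensitivity = tp / (tp + fn)          # diseased women who screen positive
    specificity = tn / (tn + fp)          # non-diseased women who screen negative
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg                 # equals (tp * tn) / (fp * fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }


# Hypothetical counts, for illustration only.
example = accuracy_measures(tp=40, fp=15, fn=10, tn=135)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```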

Exclusion criteria were: (i) retrospective studies such as case–control studies; (ii) non-original articles such as reviews, letters, or editorials; (iii) studies using the EPDS to assess other risks such as anxiety or suicide; (iv) studies that presented only sensitivity or specificity and did not provide sufficient data to create a two-by-two contingency table; and (v) studies that included subjects with other mental disorders or diseases. No language restrictions were applied.

Full text screening and data extraction

After removing duplicate articles, two authors (S-H and J-I) independently screened titles and abstracts to confirm potential eligibility and then extracted data from the selected studies. Differences of opinion between the authors were resolved through consensus-based discussion. We extracted the following information from the full texts of selected studies: the year of publication, authors, location, subjects, age, gestational age, gestation or postpartum period in weeks, sample size, gold standard, blinding, cutoff scores of the EPDS and other tools, and outcomes such as TP, FP, FN, and TN.

Quality and risk of bias

The quality of the selected studies was assessed using QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) [23]. QUADAS-2 assesses risk of bias and applicability across four domains: patient selection, index test, reference standard, and flow and timing. Applicability is assessed only in the first three domains. Two authors (S-H and J-I) completed this assessment independently. Disagreements between the two authors were resolved by consensus through discussion.

Statistical analysis

The meta-analysis was conducted using MetaDiSc 1.4 [25] and the MetaDTA program [26, 27]. Using a bivariate random-effects model, the MetaDTA program allowed evaluation of screening accuracy and of heterogeneity across studies [24]. Screening accuracy was evaluated with pooled sensitivity, specificity, positive and negative likelihood ratios (LRs), and diagnostic odds ratios (ORs), each with 95% confidence intervals (CIs), together with the area under the curve (AUC) of the summary receiver operating characteristic (sROC) curve and the index Q* value. AUC values were interpreted as follows: AUC of 0.5 = non-informative test; AUC of 0.5–0.7 = low accuracy; AUC of 0.7–0.9 = moderate accuracy; AUC of 0.9–1 = high accuracy; and AUC of 1 = perfect test [28]. The index Q* value represents the point on the sROC curve at which sensitivity and specificity are equal, with a value of 1 indicating an accuracy of 100% [29]. Heterogeneity among studies was judged using the random-effects (RE) correlation.
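The interpretation bands and the definition of Q* above can be sketched in a few lines. This is a minimal illustration, not the computation performed by the software used in this study. Two assumptions are made: each AUC band is taken to include its upper bound, because the quoted scale does not say where boundary values belong, and Q* is derived for a symmetric sROC curve with a constant diagnostic odds ratio. The function names are hypothetical.

```python
import math


def interpret_auc(auc):
    """Map an sROC AUC to the interpretation bands quoted in the text.

    Assumption: each band includes its upper bound (0.70 counts as low,
    0.90 as moderate); the quoted scale leaves boundary values open.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUC must lie between 0 and 1")
    if auc < 0.5:
        return "below the quoted scale"
    if auc == 0.5:
        return "non-informative"
    if auc <= 0.7:
        return "low accuracy"
    if auc <= 0.9:
        return "moderate accuracy"
    if auc < 1.0:
        return "high accuracy"
    return "perfect"


def q_star_from_dor(dor):
    """Q* for a symmetric sROC curve with a constant diagnostic odds ratio.

    Setting sensitivity = specificity = Q gives DOR = (Q / (1 - Q)) ** 2,
    hence Q = sqrt(DOR) / (1 + sqrt(DOR)).
    """
    if dor <= 0:
        raise ValueError("DOR must be positive")
    root = math.sqrt(dor)
    return root / (1 + root)


print(interpret_auc(0.89))
print(round(q_star_from_dor(16.0), 3))
```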

We presented the sensitivity and specificity of the EPDS and other tools as forest plots and sROC curves. We categorized subjects into three groups (pregnant, postpartum, and perinatal women) and performed a subgroup analysis. In addition, we compared the screening performance of the EPDS with that of each of the other tools.

Results

Selection process

We identified 1,129 articles from the electronic databases. After duplicate articles (k = 306) were excluded, the inclusion and exclusion criteria were applied to the titles and abstracts of 823 articles. When eligibility could not be determined from the title and abstract alone, we retrieved and checked the full text. Seventeen studies [30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46] were retained for quantitative synthesis, while 806 (97.9%) articles were excluded. The study selection process is detailed in the PRISMA 2020 flow diagram (Fig. 1).
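The counts in the selection process can be checked with simple arithmetic. The short sketch below uses only the numbers reported in this paragraph; the variable names are illustrative.

```python
identified = 1129   # records retrieved from the databases
duplicates = 306    # duplicate records removed
included = 17       # studies retained for quantitative synthesis

screened = identified - duplicates      # records screened
excluded = screened - included          # records excluded after screening
excluded_pct = 100 * excluded / screened

print(screened, excluded, f"{excluded_pct:.1f}%")
```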

Fig. 1 Flow diagram of article selection

Risk of bias assessment

Among the 17 selected studies, 6 (35.3%) [31, 34, 36, 40,41,42] were assessed as having a low risk of bias in all domains. In patient selection, eight studies were assessed as low risk because they used random [31, 34, 35, 41] or consecutive [30, 36, 40, 42] sampling, while one study [44] had a high risk of bias because it used a convenience sample. Because the EPDS and other tools were all self-reported questionnaires, the risk of bias in the index test domain was assessed as low for all studies. Ten studies [31, 33, 34, 36, 40,41,42, 44,45,46] were assessed as having a low risk of bias in the reference standard. Of these, nine studies were blinded during the test process. In the remaining study [44], the index test and the reference standard were administered in random order, and each was interpreted without information on the results of the other. Flow and timing, as well as the applicability concerns for each domain, were all assessed as low risk (Fig. 2).

Fig. 2 Quality assessment results of the selected studies by QUADAS-2

Summary of included studies

A total of 17 studies, including 2902 women, analyzed the predictive validity of the EPDS. Nine studies [30, 31, 34,35,36,37, 39, 40, 45] were on pregnant women. Six studies [32, 40, 42,43,44, 46] included postpartum women (one of these studies [40] analyzed pregnant and postpartum women separately). Three studies [34, 38, 41] included both pregnant and postpartum women as perinatal women. The average age of women was in the 20s in nine studies [30, 31, 33,34,35, 38, 39, 41, 42] and in the 30s in eight studies [32, 36, 37, 40, 43,44,45,46]. The selected studies came from 12 countries. Four studies [38, 40, 43, 46] were from the United States. The United Kingdom [39, 42] and South Africa [33, 35] each had two studies. Six studies [32, 33, 38, 40, 43, 44] had fewer than 100 women, six studies [31, 34, 39, 42, 45, 46] had 100–200 women, and five studies [30, 35, 36, 37, 41] had 200 or more women. A total of 11 types of depression screening tools were compared with the EPDS, of which the PHQ-9 [34, 35, 37, 40, 43] and the BDI [31, 36, 38, 45, 46] were each reported in five studies, and the K-10 [35, 39, 42] and the Postpartum Depression Screening Scale (PDSS) [43, 44, 46] were each reported in three studies. The optimal cut-off score of the EPDS presented in each study ranged from 3/4 to 16. The most common cut-off score was 13, reported in six studies [31, 32, 38,39,40, 45] (Table 1).

Predictive validity of the EPDS versus the other tools by subjects

The EPDS

The predictive validity of the EPDS was analyzed using 17 studies involving 2902 women (Fig. 3). The prevalence of MDD was 26.0%. The sensitivity ranged from 0.56 to 1.00 and the specificity ranged from 0.54 to 0.99. In the meta-analysis, pooled sensitivity and specificity were 0.79 (95% CI 0.74–0.84) and 0.88 (95% CI 0.82–0.92), respectively, with RE correlation of −0.506, sROC AUC of 0.89 (SE = 0.02), and Q* value of 0.82 (SE = 0.02).
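To give a feel for what these pooled values imply, the sketch below plugs the pooled sensitivity (0.79), pooled specificity (0.88), and the 26.0% prevalence into the standard formulas for likelihood ratios and predictive values. These derived figures are illustrative only and are not results reported by this study; model-based pooled likelihood ratios may differ from values computed from point estimates. The function name is hypothetical.

```python
def derived_measures(sensitivity, specificity, prevalence):
    """Likelihood ratios and predictive values from point estimates."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(MDD | positive screen)
    npv = true_neg / (true_neg + false_neg)   # P(no MDD | negative screen)
    return lr_pos, lr_neg, ppv, npv


# Pooled EPDS estimates and prevalence as reported in the text.
lr_pos, lr_neg, ppv, npv = derived_measures(0.79, 0.88, 0.26)
print(f"LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}, PPV {ppv:.3f}, NPV {npv:.3f}")
```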

Fig. 3 Predictive validity of the EPDS

The predictive validity of the EPDS by subject group was as follows. In pregnant women (9 studies including 1831 women), pooled sensitivity and specificity were 0.81 (95% CI 0.75–0.86) and 0.87 (95% CI 0.81–0.91), respectively, with RE correlation of −0.807, sROC AUC of 0.90 (SE = 0.02), and Q* value of 0.83 (SE = 0.02). In postpartum women (6 studies including 515 women), pooled sensitivity and specificity were 0.79 (95% CI 0.67–0.88) and 0.92 (95% CI 0.76–0.98), respectively, with RE correlation of −0.953, sROC AUC of 0.90 (SE = 0.02), and Q* value of 0.83 (SE = 0.02). In perinatal women (3 studies including 652 women), pooled sensitivity and specificity were 0.72 (95% CI 0.64–0.79) and 0.83 (95% CI 0.63–0.93), respectively, with sROC AUC of 0.74 (SE = 0.04) and Q* value of 0.68 (SE = 0.03); the RE correlation could not be calculated (NaN).

The other tools

In meta-analysis of all other tools (Fig. 4), pooled sensitivity and specificity were 0.78 (95% CI 0.71–0.84) and 0.86 (95% CI 0.77–0.92), respectively, with RE correlation of −0.304, sROC AUC of 0.87 (SE = 0.02), and Q* value of 0.81 (SE = 0.02).

Fig. 4 Predictive validity of the other tools

In pregnant women, pooled sensitivity and specificity were 0.78 (95% CI 0.70–0.84) and 0.86 (95% CI 0.83–0.89), respectively, with RE correlation of −1.000, sROC AUC of 0.89 (SE = 0.01), and Q* value of 0.82 (SE = 0.01). In postpartum women, pooled sensitivity and specificity were 0.80 (95% CI 0.60–0.91) and 0.79 (95% CI 0.33–0.97), respectively, with RE correlation of 0.002, sROC AUC of 0.86 (SE = 0.08), and Q* value of 0.79 (SE = 0.08). In perinatal women, pooled sensitivity and specificity were 0.70 (95% CI 0.58–0.80) and 0.90 (95% CI 0.74–0.97), respectively, with RE correlation of −0.802, sROC AUC of 0.83 (SE = 0.06), and Q* value of 0.76 (SE = 0.06) (Table 2).

Predictive validity of the EPDS versus each of the other tools

The PHQ-9

Five studies (1054 women) compared the PHQ-9 and the EPDS. The pooled sensitivity and specificity of the EPDS were 0.82 (95% CI 0.74–0.89) and 0.79 (95% CI 0.69–0.87), respectively, with RE correlation of −1.000, sROC AUC of 0.86 (SE = 0.03), and Q* value of 0.79 (SE = 0.03). By contrast, the pooled sensitivity and specificity of the PHQ-9 were 0.72 (95% CI 0.54–0.85) and 0.69 (95% CI 0.40–0.88), respectively, with RE correlation of 0.643, sROC AUC of 0.74 (SE = 0.12), and Q* value of 0.68 (SE = 0.10).

The BDI

Five studies (8054 women) compared the BDI and the EPDS. The pooled sensitivity and specificity of the EPDS were 0.82 (95% CI 0.75–0.87) and 0.92 (95% CI 0.82–0.97), respectively, with RE correlation of −1.000, sROC AUC of 0.90 (SE = 0.02), and Q* value of 0.83 (SE = 0.02). The pooled sensitivity and specificity of the BDI were 0.73 (95% CI 0.55–0.86) and 0.96 (95% CI 0.71–1.00), respectively, with RE correlation of −0.915, sROC AUC of 0.91 (SE = 0.04), and Q* value of 0.84 (SE = 0.04).

The K-10

Three studies (670 women) compared the K-10 and the EPDS. The pooled sensitivity and specificity of the EPDS were 0.92 (95% CI 0.71–0.98) and 0.82 (95% CI 0.77–0.85), respectively, with RE correlation of 1.000, sROC AUC of 0.87 (SE = 0.04), and Q* value of 0.80 (SE = 0.04). The pooled sensitivity and specificity of the K-10 were 0.91 (95% CI 0.70–0.98) and 0.82 (95% CI 0.78–0.86), respectively, with RE correlation of −1.000, sROC AUC of 0.91 (SE = 0.02), and Q* value of 0.84 (SE = 0.02).

The PDSS

Three studies (239 women) compared the PDSS and the EPDS. The pooled sensitivity and specificity of the EPDS were 0.72 (95% CI 0.58–0.82) and 0.95 (95% CI 0.86–0.99), respectively, with RE correlation of 1.000, sROC AUC of 0.54 (SE = 0.25), and Q* value of 0.53 (SE = 0.18). The pooled sensitivity and specificity of the PDSS were 0.94 (95% CI 0.86–0.97) and 0.71 (95% CI 0.09–0.98), respectively, with sROC AUC of 0.98 (SE = 0.01) and Q* value of 0.94 (SE = 0.02); the RE correlation could not be calculated (NaN).

Others

Other depression screening tools were reported in 7 studies (1449 women). The pooled sensitivity and specificity of the EPDS were 0.73 (95% CI 0.65–0.80) and 0.90 (95% CI 0.79–0.96), respectively, with RE correlation of −0.424, sROC AUC of 0.84 (SE = 0.04), and Q* value of 0.77 (SE = 0.04). The pooled sensitivity and specificity of the other tools were 0.70 (95% CI 0.62–0.77) and 0.90 (95% CI 0.85–0.94), respectively, with RE correlation of −0.552, sROC AUC of 0.88 (SE = 0.01), and Q* value of 0.81 (SE = 0.01).

Discussion

Depression occurs mainly during the postpartum period, but pregnancy can also be an emotional time [1]. In many cases, depression sets in while women are still pregnant, so the term peri-partum is preferred to postpartum [47]. The EPDS was developed as a screening tool to detect PPD [15], and it is also actively used to screen for depression in pregnant women [16]. The target of the EPDS has therefore expanded from postpartum women to pregnant women. This was also apparent in the 17 studies included in this review: the studies on postpartum women were published mainly before 2010, whereas the majority of the studies on pregnant women were published recently. The screening accuracy of the EPDS has been established through several SRs [19, 48]. However, in most cases only the EPDS was verified [20]. Even when pregnant women were included, depression or depressive disorder was not reported separately for pregnant and postpartum women [18], or adolescents were mixed in among the subjects [49]. In addition, another SR was limited to one country [50]. Therefore, this study aimed to be the first systematic review and meta-analysis to analyze pregnant, postpartum, and perinatal women separately, using literature selected with a standardized method, and then to determine how the EPDS differed from other depression screening tools.

First, the quality of the selected studies was assessed using QUADAS-2 [23]. The EPDS is a self-report tool with a quantified scoring system, and the subjects were women in their 20s and 30s, so the index test can be regarded as free of bias in its execution. Blinding of the reference standard is therefore the important factor when assessing the quality of the literature. Because this review included only studies in which psychiatrists conducted unstructured or semi-structured interviews, there was no bias in the reference standard for correctly classifying the target condition, MDD. More than half of the studies (9 out of 17) were blinded, which allowed the analysis to be performed using well-designed literature with a relatively low risk of bias.

In the meta-analysis, the EPDS, a tool specialized for perinatal women, showed moderate accuracy with an sROC AUC of 0.89 (SE = 0.02). When pregnant and postpartum women were analyzed separately, the sROC AUC was 0.90 (SE = 0.02) in both groups. The pooled sensitivity was slightly higher in pregnant women, while the pooled specificity was slightly higher in postpartum women. The RE correlation was negative for both pregnant women (−0.807) and postpartum women (−0.953), indicating no heterogeneity among studies. These results confirm that the EPDS is a suitable screening tool not only for PPD but also for depression during pregnancy, which was overlooked in the past. In perinatal women, however, the sROC AUC was relatively low at 0.74. Because the number of studies was small, this result could not be interpreted meaningfully. The sROC AUC of the other tools was 0.87 (SE = 0.02), slightly lower than that of the EPDS. Therefore, we could confirm that the EPDS is a perinatal-specific depression screening tool. For the other tools, as for the EPDS, the sROC AUC in perinatal women (0.83) was lower than in pregnant and postpartum women.

Subgroup analyses of other tools were performed only when there were three or more studies. However, because the number of studies was small, the individual tools could not be analyzed further by subject group. In the meta-analysis of all studies reporting other tools, the sROC AUC was 0.87 (SE = 0.02), similar to that of the EPDS (0.89). However, the subgroup analyses for each tool showed differences. In the five studies (1,054 women) that compared the two tools, the sROC AUC of the PHQ-9 was 0.74 (SE = 0.12), lower than that of the EPDS at 0.86 (SE = 0.03). The SR by Wang et al. [51] also compared the EPDS and the PHQ-9, but no quantitative meta-analysis was performed and only median AUCs were presented (EPDS: 0.88; PHQ-9: 0.86). Among the five studies, the PHQ-9 differed from the EPDS in only one study for each measure: Gawlik et al. [37] for sensitivity and Hausa et al. [43] for specificity. Therefore, although the screening performance of the EPDS in perinatal women is rather good, we did not conclude from these results alone that the PHQ-9 has lower screening performance than the EPDS.

On the other hand, in the comparison with the PDSS (sROC AUC of 0.98), the sROC AUC of the EPDS indicated very low accuracy at 0.54. Both are perinatal-specific depression screening tools. The EPDS was developed in 1987 and the PDSS in 2000 [52]. In all individual studies, the PDSS showed higher sensitivity and the EPDS showed higher specificity. Accordingly, the pooled sensitivity was relatively high for the PDSS (0.94), while the pooled specificity was high for the EPDS (0.95). From these results alone, the two could be considered complementary tools: if women with suspected depression are first identified with the PDSS and women in the normal range are then correctly excluded with the EPDS, the combination may have a synergistic effect that increases the accuracy of depression screening in perinatal women. However, there were only three studies, the RE correlation could not be calculated for the PDSS, and for the EPDS the positive RE correlation (1.000) indicated heterogeneity between studies. In addition, the PDSS has both a long and a short version. Thus, we withheld interpretation of this tool. For the BDI, the sROC AUC was 0.91, similar to that of the EPDS (0.90). The RE correlation was negative for both, indicating no heterogeneity between studies. The category of other tools comprised eight depression screening tools, including the CES-D and the Zung Self-Rating Anxiety Scale, which were analyzed together.
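The two-step use of the PDSS and the EPDS suggested above can be explored with a simple calculation. The sketch below assumes that the two tests err independently given true disease status (conditional independence) and that a woman is counted as screen-positive only if both tests are positive. Neither assumption was tested in this study, and the function name is hypothetical. Under these assumptions, serial testing raises specificity at the cost of sensitivity, which is one more reason to interpret the combination cautiously.

```python
def serial_both_positive(se_first, sp_first, se_second, sp_second):
    """Combined accuracy when a positive result requires both tests positive.

    Assumes the two tests are conditionally independent given disease status.
    """
    sensitivity = se_first * se_second
    # A non-diseased woman is a false positive only if both tests misfire.
    specificity = 1 - (1 - sp_first) * (1 - sp_second)
    return sensitivity, specificity


# Pooled estimates from the PDSS versus EPDS comparison in this review.
se, sp = serial_both_positive(se_first=0.94, sp_first=0.71,
                              se_second=0.72, sp_second=0.95)
print(f"combined sensitivity {se:.3f}, combined specificity {sp:.3f}")
```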

Limitations

This study has limitations. First, the EPDS cut-off score applied was the optimal score suggested by each study. The most commonly recommended cut-off score for the EPDS was 13, which was within the optimal cut-off score range of 9–13 in Fellmech’s SR [53]. However, some studies used exceptional cut-off scores such as 3/4 [41] and 16 [34]. Second, in the meta-analysis of the 17 studies on pregnant and postpartum women and in the subgroup analysis with the BDI, the RE correlation indicating heterogeneity between studies had negative values, whereas in subgroup analyses with other tools such as the PDSS it had positive values. In the present meta-analysis, the cause of heterogeneity could not be identified because of the small number of studies. Third, this meta-analysis did not include studies on women under the age of 18 with high-risk pregnancies, because pregnant and postpartum women under 18 should be considered a different group of subjects when estimating the sensitivity and specificity of the EPDS for screening postpartum depression.

Conclusion

This study demonstrates that the EPDS is an excellent depression screening tool for pregnant and postpartum women aged over 18 years, based on 17 well-designed studies with a low risk of bias. In this study, the sROC AUC of the EPDS was 0.9, indicating moderate accuracy. In a primary care setting or a midwifery center, the EPDS is a very good screening tool that can be applied before other tools when screening for depression in perinatal women. Among the other depression screening tools, the BDI has the screening accuracy most similar to that of the EPDS. The PHQ-9 and the K-10 can also be used.