Background

The World Health Organization (WHO) defines stillbirth as the death of an infant after more than 28 weeks of gestation, but before or during labor [1, 2]. The personal, emotional, and financial impact of stillbirth has a profound effect on parents, healthcare providers, and society [3, 4]. In 2014, in an effort to end preventable stillbirth, the WHO Every Newborn Action Plan (ENAP) set a target of ≤ 12 stillbirths per 1000 total births for every country by 2030. Despite this, there are approximately 2 million cases of stillbirth globally each year, and the incidence of stillbirth remains particularly high in low- and middle-income countries, with rates reaching 22.8 stillbirths per 1000 total births [5] in some regions [1].

The etiology of stillbirth is multifactorial, but interventions aimed at prevention are effective [6]. Globally, there is no standardized system of investigating and reporting stillbirth, and available information is classified using numerous and disparate systems [7]. There is an unmet need to collect quality information on the causes of stillbirth to inform predictive models of stillbirth [1, 8].

Simple linear statistics lack the capacity to model complex problems such as stillbirth. Advances in the processing power, memory, and storage of computers and the widespread availability of rich datasets have led to the application of artificial intelligence (AI) and machine learning (ML) in healthcare to improve risk prediction [9]. ML can integrate vast and heterogeneous datasets and identify patterns and correlations. ML algorithms are created and trained to make classifications or predictions [10, 11]. The performance of an ML model may be described by its accuracy, assessed using metrics such as sensitivity, specificity, predictive value, likelihood ratio, and the area under the receiver operating characteristic (ROC) curve (AUC) [12].
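The threshold-based metrics named above can all be derived from the four cells of a confusion matrix. The following is a minimal sketch in Python, not taken from any of the reviewed studies; the class name, function name, and counts are hypothetical and chosen only to illustrate the arithmetic. It assumes every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing predicted risk with observed outcome."""
    tp: int  # stillbirths correctly flagged as high risk
    fp: int  # live births incorrectly flagged as high risk
    fn: int  # stillbirths the model missed
    tn: int  # live births correctly classified as low risk


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return common threshold-based metrics (assumes non-zero denominators)."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": c.tp / (c.tp + c.fp),  # positive predictive value
        "npv": c.tn / (c.tn + c.fn),  # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts, for illustration only.
example = ConfusionCounts(tp=80, fp=900, fn=20, tn=9000)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome such as stillbirth, the positive predictive value can remain low even when sensitivity and specificity are both high, which is one reason these metrics are usually reported together.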

ML has the potential to improve early disease prediction, diagnosis, and treatment in maternal-fetal medicine [13], and has been used to assess fetal well-being and to predict and diagnose gestational diabetes, pre-eclampsia, preterm birth, and fetal growth restriction [14]. Predictive models established using ML can inform clinical decision-making but should not be used to make definitive diagnoses. An increasing number of models are being constructed to screen and monitor pregnancies and detect those at high risk of stillbirth; however, there is no comprehensive systematic review of the latest advances in this field [15]. This study reviewed the literature on predictive ML models for stillbirth, highlighting input characteristics, performance metrics, and validation. The findings are intended to inform improvements in care relevant to stillbirth.

Methods

This study is reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [16]. The protocol that includes these analyses was registered on the PROSPERO database under the ID CRD42022380270.

To account for the disparate systems used for classifying stillbirth globally, for the purpose of this study, stillbirth was defined as the death of an infant after 20 weeks of gestation, but before or during labor or at birth.

The PubMed, Cochrane, and Web of Science databases were searched for studies using AI (e.g., ML and deep learning) to develop predictive models for stillbirth. Search terms included “stillbirth”, “fetal death”, “mortality”, “death”, “artificial intelligence”, “machine learning”, “deep learning”, and “predictive models” (see Supplementary Table 1).

Inclusion criteria were: (1) use of an ML technique to develop and/or validate a predictive model for stillbirth; (2) the model integrated data for ≥ 2 variables (features) for stillbirth prediction; and (3) comprehensive performance assessment was conducted. Exclusion criteria were: (1) reviews, meta-analyses, conference abstracts, letters, or comments; (2) studies conducted in animals; (3) studies with a sample size < 100 stillbirths, because ML requires a sufficient sample size for model training [17]; (4) studies investigating the molecular mechanisms underlying stillbirth; or (5) studies that did not propose a predictive model for stillbirth.

Duplicate references were eliminated, and two reviewers (Q.L. and P.L.) independently screened titles and abstracts and reviewed the remaining full text articles to determine which studies met the inclusion criteria.

The two reviewers independently extracted information pertaining to the construction of predictive models for stillbirth from the included studies, including number of models created, features used and their contribution to model prediction, and use of hyperparameter optimization/tuning, internal and/or external validation, and calibration analysis. Full details of the features used in the predictive models are provided in Supplementary Table 2.

The two reviewers independently assessed risk of bias and the applicability of included studies using PROBAST, a tool that consists of 20 signaling questions in four domains: participants, predictors, outcomes, and analysis [18].

Discrepancies between reviewers were resolved through consultation with a third reviewer (J-Y.C.). Findings were analyzed qualitatively using narrative synthesis and graphics generated with RevMan and R version 4.2.3.

Results

Study Selection

The initial search identified 455 studies. Titles and abstracts were screened and 368 studies were excluded, including 128 duplicates and 240 studies that did not meet the inclusion criteria. The full texts of 49 studies were reviewed, and 8 studies were included in the qualitative analysis (Fig. 1).

Fig. 1 PRISMA flow diagram of study selection

Study Characteristics

The characteristics of the included studies are summarized in Table 1. These studies described 84 (range 1–30) predictive models for stillbirth created using 22 ML algorithms applied across various datasets.

Table 1 Characteristics of included studies

All studies explicitly outlined their inclusion and exclusion criteria. Of the included studies, 62.5% reported methods for handling missing data; of these, 80% omitted cases with missing data. All studies reported AUC as a performance metric for the models. Five studies provided information on the sensitivity and specificity of the models [19,20,21,22,23,24]. Two studies gave detection rates at false positive rates of 5% and 10% [24, 25]. One study reported the positive likelihood ratio, negative likelihood ratio, positive predictive value, and negative predictive value [25].
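A detection rate at a fixed false positive rate is the highest sensitivity a model reaches while flagging no more than that share of unaffected pregnancies. The sketch below shows one way this can be computed from risk scores; the function name and the scores are hypothetical, and the reviewed studies may have derived the value differently.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Highest sensitivity achievable with a false positive rate <= max_fpr.

    scores: predicted risk per pregnancy (higher means higher risk)
    labels: 1 for stillbirth, 0 otherwise
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    # Try every observed score as a threshold; flag cases with score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Hypothetical scores: 4 stillbirths and 20 unaffected pregnancies.
case_scores = [0.95, 0.80, 0.60, 0.30]
control_scores = [0.85, 0.70, 0.50] + [0.01 * i for i in range(1, 18)]
scores = case_scores + control_scores
labels = [1] * len(case_scores) + [0] * len(control_scores)

print(detection_rate_at_fpr(scores, labels, 0.05))  # 0.5
print(detection_rate_at_fpr(scores, labels, 0.10))  # 0.75
```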

Risk of Bias

The risk of bias assessment based on the PROBAST tool is shown in Supplementary Table 3. Three studies had a “high risk” of bias, one study had a “low risk” of bias, and four studies had an “unclear” risk of bias.

Characteristics of Predictive Models

The most common ML algorithms included in the predictive models were logistic regression (LR), artificial neural networks (ANN), and random forest (RF) [26]. Fewer than 50% of the studies conducted model calibration, hyperparameter tuning, and external validation.

Across studies, 226 predictive features of stillbirth were identified, including 154 distinct features. Individual predictive models analyzed 15 to 53 features. Certain features emerged as potential predictors of stillbirth (Fig. 2). The top five predictors were age, parity, hypertension, smoking, and miscarriage. Other features that predicted > 50% of stillbirths included maternal body mass index, place of birth, maternal education, previous stillbirth, gestational diabetes, gestational hypertension, diabetes, and intrauterine growth restriction.

Fig. 2 Bar chart of model features. BMI: Body Mass Index; SLE: Systemic Lupus Erythematosus; HBV: Hepatitis B Virus

The performance of all models was assessed using AUC. The performance of the ML models by dataset is summarized as a box plot (Fig. 3). Mean AUC is shown by the line that divides the box into two parts. The whiskers represent the standard deviation. Outliers are depicted as black circles. The predictive models had a mean AUC > 0.7 (range, 0.54–0.9). When reported, the sensitivity (range, 60–90%) and specificity (range, 64–93.3%) of most models were > 70%. Notably, the stacked ensemble (SE) model proposed by Khatibi et al. (2021) [15] achieved an AUC of 0.9 and sensitivity and specificity > 85% (Figs. 4 and 5).

Fig. 3 Box plot of AUC

Fig. 4 Model sensitivity

Fig. 5 Model specificity

Discussion

Stillbirth remains a significant public health concern worldwide. Mothers who have previously experienced stillbirth are at a considerably higher risk of recurrence in subsequent pregnancies. Given the wide-reaching impact of stillbirth, accurately predicting this condition is of paramount importance. However, conventional methods for stillbirth prediction, such as statistical modeling, may be insufficient due to limitations in their ability to account for the multifactorial determinants of stillbirth and the complex interactions between potential risk factors [27, 28]. Unlike statistical modeling, ML-based predictive models do not require a priori selection of predictors. Instead, they are capable of automatically and thoroughly exploring the complex associations and interactions among potential risk factors and outcomes. This approach allows the outcome to be investigated in great depth and facilitates the identification of new insights in complex systems [29,30,31,32,33,34]. Therefore, ML-based predictive models represent a promising tool for enhancing the accuracy and effectiveness of stillbirth prediction [35].

Despite numerous studies exploring predictive models for stillbirth, there is a lack of systematic evaluations consolidating and assessing the effectiveness of these models [11, 36,37,38]. To fill this gap, the present study undertook the first systematic qualitative analysis of published predictive ML models for stillbirth. The findings identified the strengths and limitations of existing models and opportunities for improvement.

Selection of Predictive Features

ML algorithms such as decision tree (DT), support vector machine (SVM), and RF can account for non-linear and high-dimensional relationships, which may lead to better predictive performance than traditional prediction methods that involve statistical modeling [9, 39, 40]. When using ML algorithms, selecting appropriate datasets and predictive factors is crucial for model development, training, and validation, and for achieving optimal results [9].

This qualitative analysis revealed that the selection of predictive factors for stillbirth varied across published ML models. The influence of factors such as maternal age, number of births, history of miscarriage, infectious diseases during pregnancy, and smoking status was consistent with the results of clinical studies. Preexisting conditions such as hypertension, diabetes, and obesity were significantly associated with stillbirth. As the choice of features can significantly impact the accuracy and efficiency of a predictive ML model, there is a need to select the right input features and achieve a balance between accuracy and data limitations [39,40,41,42]. The characteristics of the target population should also be considered. For example, different predictive factors may be relevant for women with underlying chronic conditions (e.g., systemic lupus erythematosus, obesity, hypertension, and acute fatty liver) compared with healthy women [21,22,23].

The models evaluated in this review included a range of features. The fewest features were used by Yerlikaya et al. (2016) (n = 14 features) and Kumar et al. (2022) (n = 15 features). Both studies constructed single LR models, which achieved AUCs of 0.642 and 0.846, respectively. The most features (n = 53) were used in the stacked ensemble (SE) model proposed by Khatibi et al. (2021) [15], potentially making its application impractical in a clinical setting.

Missing Data

Among the 8 studies included in this qualitative synthesis, there was limited reporting on the handling of missing data. Of the 5 studies that reported on the management of missing data, 4 omitted data with ambiguous records, including cases with imprecise timing of stillbirth [20, 22, 23, 25]. This may have led to bias and loss of accuracy [43]. Alternative approaches to handling missing data can help ensure more reliable and robust predictions [43]. Among the published ML models, alternative approaches included imputing missing data using records from hospital visits as supplementary data sources [25], or computing medians and prevalence of stillbirth for pregnancies with missing data on any of the predictive variables [19].
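The difference between omitting incomplete records and imputing them can be illustrated with a small example. This is a generic sketch of complete-case deletion versus median imputation, not a reproduction of the method used in any reviewed study; the feature names and values are hypothetical. In a real modeling pipeline, the medians would normally be computed on the training data only and then applied to validation data.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"maternal_age": 31, "bmi": 24.0},
    {"maternal_age": 27, "bmi": None},
    {"maternal_age": None, "bmi": 30.5},
    {"maternal_age": 40, "bmi": 27.5},
]


def complete_cases(rows):
    """Complete-case analysis: drop every record with any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows):
    """Replace each missing value with the median of the observed values."""
    features = list(rows[0].keys())
    medians = {
        f: median(r[f] for r in rows if r[f] is not None) for f in features
    }
    return [
        {f: (r[f] if r[f] is not None else medians[f]) for f in features}
        for r in rows
    ]


kept = complete_cases(records)
imputed = impute_median(records)
print(len(kept), "of", len(records), "records survive complete-case deletion")
print(len(imputed), "records are retained after median imputation")
```

Here deletion discards half of the records, whereas imputation keeps all of them at the cost of replacing unknown values with an estimate.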

Validation

One notable issue with the studies included in this qualitative analysis was the relative lack of validation techniques employed, with only 50% of studies utilizing model validation, and even fewer performing external validation. In the absence of scrupulous and impartial external validation, ML may generate erroneous high-risk predictions by incorrectly capturing the interconnectedness of features [44]. It is imperative for future studies to conduct comprehensive validation of their models to ensure accuracy and reliability.

Of the 8 studies reviewed, cross-validation was the most commonly employed method for validation. While all studies used internal validation, only two studies also used external datasets. The predictive model of stillbirth constructed by Koivu [45] was developed using two datasets, from the Centers for Disease Control and Prevention (CDC) and the New York City Department of Health and Mental Hygiene (NYC). The external validation of the model utilized the NYC dataset in combination with a subset of the CDC dataset, consisting of 31,429 pregnancies. In the model constructed by Khatibi [20], the dataset used in external validation included stillbirth cases that occurred between 2011 and 2018 in hospitals across different provinces in Iran. This external dataset was obtained from a different registry than the one used for model development. The feature selection process and preprocessing techniques applied during model development were also applied to the external validation dataset to ensure consistency. The remaining 6 studies did not report external validation. Yerlikaya [24] evaluated the performance of their predictive model of stillbirth across different gestational ages by analyzing data from a different cohort in California, which to some extent supports its generalizability. Kumar [21] assessed the performance of their model using 30% of the data as a separate test set.
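The two internal validation strategies mentioned above, a held-out test set (such as a 70%/30% split) and k-fold cross-validation, differ only in how record indices are partitioned. The following sketch generates those partitions with the Python standard library; the function names, the seed, and the choice of 5 folds are assumptions for illustration and are not taken from the reviewed studies.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle indices once and reserve a fraction as a held-out test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))  # 70 30

for fold_number, (train, test) in enumerate(k_fold_indices(10, k=5), start=1):
    print(fold_number, sorted(test))
```

Both strategies reuse records from the same source population, so neither substitutes for external validation on an independently collected dataset.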

Overall, the included literature lacked sufficient reporting on external validation, indicating a need for future ML predictive models to recognize this as a necessity. Without comprehensive assessment using a diverse collection of datasets, the external validity of a model may be limited, potentially hindering its effectiveness when applied to data collection systems from different patient populations and regions [44, 46].

Model Performance

This qualitative analysis focused on AUC as a metric of the performance of the published predictive models for stillbirth. AUC is widely used for assessing the discriminative ability of ML-based prediction models [47]. An AUC close to 0.5 represents low discriminative ability, while an AUC close to 1 indicates high discriminative ability [48]. The majority of the models evaluated in this qualitative synthesis had an AUC between 0.7 and 0.9, and none were considered highly discriminative, highlighting the need to optimize the performance of existing predictive models for stillbirth before they can assist in clinical decision-making.
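AUC can be read as the probability that a randomly chosen stillbirth case receives a higher risk score than a randomly chosen unaffected pregnancy, with ties counted as one half. Under that reading, a model that gives every pregnancy the same score has an AUC of 0.5. The sketch below computes AUC directly from this pairwise definition; the function name and scores are hypothetical, and the quadratic loop is intended for illustration rather than for large datasets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0  # case ranked above non-case
            elif case == control:
                wins += 0.5  # tie counts as half
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores for 3 stillbirths (label 1) and 3 non-cases (label 0).
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_labels = [1, 1, 1, 0, 0, 0]
print(round(auc(example_scores, example_labels), 3))  # 0.889
```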

Overall, the predictive models had a mean AUC > 0.7. When reported, the sensitivity and specificity of most models were > 70%, with the exception of the LR model constructed by Amark (2018) [19]. Sensitivity was highest for the LR model constructed by Meng (2021) [22], and specificity was highest for the LR model constructed by Wu (2019) [23].

The SE model proposed by Khatibi (2021) [20] performed better than other models, achieving an AUC of 0.9 and sensitivity and specificity > 85%.

Future Perspectives

This review performed a qualitative analysis of 8 studies that included a large sample (n = 14,840,654) of pregnant women. Findings should be interpreted with caution because the analysis had limitations: studies that differed in their reporting of missing data handling and in their use of external validation were combined in a single synthesis. However, the results demonstrate the potential of ML to improve the prediction of stillbirth by analyzing large datasets to identify cases of stillbirth in populations with medical disorders and infections during pregnancy, and in populations without obvious high-risk conditions [22].

Existing predictive models for stillbirth created using ML algorithms require further development before being applied in clinical care. Optimal decision thresholds are challenging to assign due to variability in the classification of stillbirths and miscarriages across regions. Model performance has not been consistently evaluated, making comparisons between models difficult: some studies report AUC, and others rely on sensitivity and specificity. In particular, most models have not been validated in external datasets, which is crucial when assessing model performance. External validation provides an understanding of the importance of individual features and allows comparisons between models, which is essential when multiple models claim high performance on their respective training datasets. A high AUC on the training or validation set does not guarantee that a model will perform similarly on new, unseen data, because a model may overfit its training data. Training and validation datasets may represent a specific patient population or setting at a certain point in time and have inherent biases due to sampling methods and data collection procedures. External validation helps ensure that a model generalizes well to new data and is robust and applicable in diverse populations or settings over time. This gives healthcare stakeholders, including researchers, clinicians, and regulatory agencies, confidence that the model can be applied in real-world scenarios.

Conclusion

This qualitative analysis revealed that available ML models can attain a considerable degree of accuracy for prediction of stillbirth; however, these models require further development before they can be applied in a clinical setting, including improved approaches to handling missing data and external validation. The optimization of predictive ML models for stillbirth should ensure precise support of individual patients and contribute to the wellbeing of mothers, children, and society.

Registration and Protocol

  1. The protocol of this review was registered on the PROSPERO database under the ID CRD42022380270.

  2. The review protocol can be accessed at: https://www.crd.york.ac.uk/PROSPERO/display_record.php?RecordID=380270.