Introduction

Delivery is a critical event that exposes the fetus to various risky conditions, and it takes a short time compared with the pregnancy period [1]. Undesired and stressful events for the fetus, such as hypoxia and asphyxia, frequently occur during labor, generally because of a lack of oxygen [2]. The fetus is equipped with defense mechanisms and struggles against these developments throughout pregnancy and, more importantly, throughout labor [3]. Cardiotocography (CTG), which consists of simultaneously recorded fetal heart rate (FHR) and uterine contraction (UC) signals, is the most widely used diagnostic technique for both determining the distress level and continuously monitoring the fetus [4]. This surveillance technique has been commonly adopted because of the sense of security it provides to observers [5]. Although CTG is in very common use, no gold standard for evaluating CTG traces has yet been accepted. CTG has also led to several debates about the increased rate of cesarean sections as well as high inter- and even intra-observer variability [6, 7].

Several parameters of the FHR, known as morphological features, are reliable and prominent indices for ascertaining fetal well-being. More specifically, the baseline level of the FHR, its short- and long-term variability, and its temporal transients, namely accelerations and decelerations, are the primary indicators in clinical assessment [8]. Various guidelines, such as that of the International Federation of Gynecology and Obstetrics (FIGO) [6], have been published by different health organizations to help identify the morphological features and to provide a consistent CTG interpretation. The main aim of the guidelines is to decrease the variability among observers while preventing unnecessary interventions as far as possible. Despite the existing guidelines, unfortunately, the disagreement level among clinicians has remained stable [10]. Possible approaches to tackle these drawbacks were discussed in [11]. As a result, computerized CTG systems that support the decision process of observers have been identified as the most promising solution.

Automated CTG analysis requires several basic steps: acquiring the database, preprocessing, feature extraction, feature selection, and classification [12]. In computerized CTG analysis, the morphological features mentioned above are extended with diagnostic features obtained from the linear and nonlinear [13], time–frequency [14,15,16], and, recently, image-based time–frequency (IBTF) [12, 17,18,19] domains [20, 21]. Furthermore, FHR signals are classified using numerous machine learning techniques such as artificial neural networks (ANNs) [22, 23], extreme learning machine (ELM) [17, 24], and support vector machine (SVM) [25,26,27]. Feature selection algorithms are utilized to improve the performance of classifiers and to propose clinically applicable models. For these purposes, genetic algorithms (GAs) [28], principal component analysis (PCA) [29], information gain (InfoGain), the group of adaptive models evolution (GAME) neural network [30], correlation-based feature selection (CFS), Relief, and mutual information (MI) [31] have been employed.

In this study, combinations of five feature selection algorithms and four machine learning algorithms, namely artificial neural network (ANN), k-nearest neighbor (kNN), decision tree (DT), and support vector machine (SVM), are evaluated on CTG data. To this end, the weighted by support vector machine (WSVM), information gain ratio (IGR), Relief, backward elimination (BE), and recursive feature elimination (RFE) algorithms are examined. Lastly, the features commonly selected by these algorithms are used to generate the most relevant feature subset.

Materials and methods

Data description

An open-access intrapartum CTG database was introduced in 2012 [32] and can be downloaded from PhysioNet. The database consists of 552 recordings, which are a subset of 9164 intrapartum CTG recordings acquired by means of STAN S21/S31 and Avalon FM40/FM50 electronic fetal monitoring (EFM) devices. All signals were selected carefully according to several technical and clinical criteria. Furthermore, the signals were stored in electronic form using the OBTraceVue® system.

We use the umbilical artery pH value obtained after delivery to separate the signals into normal and hypoxic classes. Different pH values have been used in the literature as the borderline for separating FHR signals [30]. In this study, the 177 recordings with umbilical artery pH < 7.20 were considered hypoxic. The remaining signals have umbilical artery pH ≥ 7.20 and were thus considered normal.
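The labeling rule above can be sketched in a few lines. This is only an illustration: the function name and the example pH values are hypothetical and are not taken from the database.

```python
PH_THRESHOLD = 7.20  # borderline used in this study


def label_recording(ph: float) -> str:
    """Return 'hypoxic' for pH < 7.20 and 'normal' for pH >= 7.20."""
    return "hypoxic" if ph < PH_THRESHOLD else "normal"


# Hypothetical pH values, for illustration only
example_ph = [7.05, 7.20, 7.31]
labels = [label_recording(p) for p in example_ph]
```

Note that a recording with pH exactly equal to 7.20 falls into the normal class under this rule.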

Signal preprocessing

FHR signals can be acquired using either Doppler ultrasound or a scalp electrode. In both cases, the signals are contaminated by several factors such as maternal and fetal movements, displacement of the transducer, and network interference. Segment selection, outlier detection, and interpolation are the basic preprocessing procedures. The experimental study is performed on signals that last 15 min (3600 sample points at the 4 Hz sampling frequency). Extreme values (≥ 200 bpm or ≤ 50 bpm) are interpolated, and long gaps (> 15 s) are excluded from the subsequent feature extraction process. In the last step of preprocessing, FHR signals are detrended using a second-order polynomial since nonlinear signal processing techniques are utilized. After preprocessing, we obtain 15-min segments of the signals that are as close as possible to the delivery. Figure 1 shows a sample recording before and after preprocessing. The grid in Figure 1 consists of small squares and large rectangles. Each small square corresponds to 30 s on the horizontal axis and 10 bpm on the vertical axis, whereas each large rectangle spans 3 min on the horizontal axis and 30 bpm on the vertical axis.
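A minimal sketch of these preprocessing steps is given below, assuming NumPy and a synthetic signal. The function name `preprocess_fhr`, the use of linear interpolation, and the choice to flag long gaps with a boolean mask are assumptions made for illustration; the text does not specify these details.

```python
import numpy as np

FS = 4  # Hz, sampling frequency stated in the text


def preprocess_fhr(fhr, low=50.0, high=200.0, max_gap_s=15, fs=FS):
    """Interpolate outliers, flag long gaps, detrend with a 2nd-order polynomial."""
    x = np.asarray(fhr, dtype=float)
    t = np.arange(x.size) / fs
    bad = ~np.isfinite(x) | (x <= low) | (x >= high)

    # Locate runs of invalid samples; runs longer than max_gap_s are long gaps
    long_gap = np.zeros(x.size, dtype=bool)
    padded = np.concatenate(([False], bad, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    for start, stop in zip(edges[::2], edges[1::2]):
        if (stop - start) / fs > max_gap_s:
            long_gap[start:stop] = True

    # Linear interpolation over invalid samples (assumed interpolation type)
    clean = x.copy()
    clean[bad] = np.interp(t[bad], t[~bad], x[~bad])

    # Fit the trend only on samples that lie outside long gaps
    keep = ~long_gap
    coeffs = np.polyfit(t[keep], clean[keep], deg=2)
    detrended = clean - np.polyval(coeffs, t)
    return clean, detrended, long_gap


# Synthetic 15-min segment with one short dropout and one long gap
rng = np.random.default_rng(0)
n = 15 * 60 * FS
signal = 140 + 5 * np.sin(2 * np.pi * np.arange(n) / n) + rng.normal(0, 2, n)
signal[100:110] = 0      # 2.5 s dropout, interpolated
signal[2000:2100] = 0    # 25 s gap, flagged as a long gap
clean, detrended, long_gap = preprocess_fhr(signal)
```

Samples where `long_gap` is true would then be left out of feature extraction, while shorter dropouts are kept in their interpolated form.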

Fig. 1 The signal state before and after the preprocessing stage (Rec. Id: 1061, the internal number of the CTU-UHB database)

Feature transform (feature extraction and selection)

The features used in this study to identify FHR recordings are obtained with CTG-OAS, an open-access software tool for analyzing CTG recordings [29].

First, the morphological features describing the shape and changes of FHR signals are extracted in accordance with FIGO guidelines. The baseline and the numbers of transient changes, called accelerations (ACC) and decelerations (DCC), are taken into consideration [9].

Then, the morphological features are supplemented with several linear features: mean (µ), standard deviation (σ), long-term irregularity (LTI), delta, short-term variability (STV), and interval index (II) [22, 33].

The third category of features comes from the nonlinear domain. Approximate Entropy (ApEn), Sample Entropy (SampEn), and Lempel–Ziv Complexity (LZC) are the most commonly used features from this domain [34]. Two parameter pairs (embedding dimension m = 2 with tolerance r = {0.15; 0.20}) are used individually in the experiment for ApEn and SampEn.
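As an illustration of the nonlinear features, the following sketch computes SampEn for m = 2 and r = 0.15. It assumes the common convention that the tolerance is r times the standard deviation of the signal and that the Chebyshev distance is used; the text does not state either choice, and this is not the CTG-OAS implementation.

```python
import numpy as np


def sample_entropy(x, m=2, r=0.15):
    """SampEn with tolerance r * std(x); self-matches are excluded."""
    x = np.asarray(x, dtype=float)
    n = x.size
    tol = r * np.std(x)  # assumed convention: r is relative to the signal std

    def count_matches(length):
        # Use the same n - m template start points for both lengths
        templates = np.array([x[i:i + length] for i in range(n - m)])
        count = 0
        for i in range(len(templates) - 1):
            # Chebyshev distance between template i and all later templates
            dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            count += int(np.sum(dist <= tol))
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    return float(-np.log(a / b)) if a > 0 and b > 0 else float("inf")


# A regular signal should yield a lower SampEn than white noise
rng = np.random.default_rng(0)
sine = np.sin(2 * np.pi * np.arange(500) / 50)
noise = rng.normal(size=500)
se_sine = sample_entropy(sine)
se_noise = sample_entropy(noise)
```

The second parameter pair from the text would be evaluated in the same way with `r=0.20`.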

The last category consists of the IBTF features: contrast, correlation, energy, and homogeneity. IBTF features are obtained using a combination of the Short-Time Fourier Transform (STFT) and the Gray Level Co-occurrence Matrix (GLCM) [19]. GLCM is a directional pattern counter, and IBTF features are extracted according to angle and distance parameters. The distance (δ) and angle (θ) parameters are set to 1 and 90°, respectively. The spectrograms of the very low frequency (VLF, 0–0.03 Hz), low frequency (LF, 0.03–0.15 Hz), middle frequency (MF, 0.15–0.50 Hz), and high frequency (HF, 0.50–1 Hz) bands are used to obtain the IBTF features.
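The IBTF idea can be sketched as follows: compute a spectrogram, cut out each frequency band, quantize it to a small number of gray levels, and derive the four GLCM statistics for δ = 1 and θ = 90°. The window length, hop size, number of gray levels, and the exact definitions of energy and homogeneity used here are assumptions for illustration; the text does not specify them.

```python
import numpy as np


def stft_magnitude(x, fs=4, win=512, hop=64):
    """Magnitude spectrogram (frequency x time) using a Hann window."""
    window = np.hanning(win)
    frames = np.array([x[s:s + win] * window
                       for s in range(0, len(x) - win + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(win, d=1 / fs)
    return freqs, spec


def glcm_features(image, levels=8):
    """Contrast, correlation, energy, homogeneity for distance 1, angle 90 deg."""
    lo, hi = image.min(), image.max()
    q = np.floor((image - lo) / (hi - lo + 1e-12) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    # Angle 90 deg, distance 1: pair each pixel with the pixel directly above it
    ref, nb = q[1:, :].ravel(), q[:-1, :].ravel()
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (ref, nb), 1)
    p = glcm / glcm.sum()
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    return {
        "contrast": float(((i - j) ** 2 * p).sum()),
        "correlation": float(((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)),
        "energy": float((p ** 2).sum()),                       # sum of squared entries
        "homogeneity": float((p / (1 + np.abs(i - j))).sum()),
    }


# Frequency bands from the text (Hz)
bands = {"VLF": (0.0, 0.03), "LF": (0.03, 0.15),
         "MF": (0.15, 0.50), "HF": (0.50, 1.0)}

rng = np.random.default_rng(0)
x = rng.normal(size=3600)  # synthetic stand-in for a detrended FHR segment
freqs, spec = stft_magnitude(x)
ibtf = {name: glcm_features(spec[(freqs >= f_lo) & (freqs < f_hi)])
        for name, (f_lo, f_hi) in bands.items()}
```

With four statistics per band and four bands, this layout yields 16 values per recording.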

At the end of the feature extraction stage, a total of 30 features are extracted: 3 morphological, 6 linear, 5 nonlinear, and 16 IBTF features (4 features for each of the specified frequency bands).

To determine the most relevant features and to generate an effective subset, we utilize three filter methods and two wrapper methods. The features commonly selected by these algorithms are added to the final, most relevant feature subset.

Artificial neural network (ANN)

ANN is a computational model inspired by the human brain and nervous system [35]. The ANN architecture comprises an input layer, one or more hidden layers, and an output layer [25]. Each node in a layer is connected to the nodes in the subsequent layer, and these connections are represented by weights [36]. The output of a neuron in a layer is computed as follows:

$$a^{i} = \sigma \left( {\mathop \sum \limits_{j = 1}^{N} \omega_{ij} x_{j} + b^{i} } \right)$$
(1)

where σ is the activation function, N is the number of input neurons, and \(\omega_{ij}\) and \(b^{i}\) represent the weights and the bias value, respectively. The ANN was configured with a single hidden layer of 30 nodes and trained with the Levenberg–Marquardt backpropagation algorithm. The other parameters were left at their default values.
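Eq. (1) can be illustrated with a short forward pass through one layer. The sigmoid activation and the random, untrained weights below are assumptions chosen only to make the example concrete; the text does not state which activation function was used.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation (assumed; the text does not name the activation)."""
    return 1.0 / (1.0 + np.exp(-z))


def layer_output(x, weights, bias, activation=sigmoid):
    """Eq. (1): a^i = sigma(sum_j w_ij * x_j + b^i) for every neuron i."""
    return activation(weights @ x + bias)


rng = np.random.default_rng(0)
x = rng.normal(size=30)               # one recording described by 30 features
W_hidden = rng.normal(size=(30, 30))  # 30 hidden nodes x 30 inputs (untrained)
b_hidden = np.zeros(30)
hidden = layer_output(x, W_hidden, b_hidden)
```

Each row of `W_hidden` holds the weights of one hidden node, so the matrix product evaluates the sum in Eq. (1) for all 30 nodes at once.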

k-Nearest neighbor (kNN)

kNN is a non-parametric classification method [37]. It carries out the classification task using a distance metric such as the Euclidean distance described in Eq. (2). It needs a training set to determine the distribution of the samples. The test data are then classified by a majority vote of the k nearest neighbors in the training set [38].

$$d_{E} = \sqrt {\mathop \sum \limits_{i = 1}^{N} \left( {x_{i} - y_{i} } \right)^{2} }$$
(2)

We used the Euclidean distance metric for kNN and searched k in the range of 1 to 10. The best results were obtained with k = 3.
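A minimal sketch of kNN classification with the standard Euclidean distance and k = 3 is shown below. The two-dimensional toy data are hypothetical and serve only to show the majority vote.

```python
from collections import Counter

import numpy as np


def euclidean(x, y):
    """Standard Euclidean distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest training samples."""
    dists = [euclidean(row, query) for row in train_X]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data
train_X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [2.0, 2.0], [2.1, 1.9], [1.8, 2.2]])
train_y = ["normal", "normal", "normal", "hypoxic", "hypoxic", "hypoxic"]
prediction = knn_predict(train_X, train_y, query=[1.9, 2.0], k=3)
```

An odd k such as 3 avoids tied votes in a two-class problem.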

Decision tree (DT)

DT is a useful machine learning method for generating regression or classification models based on a tree structure. A DT consists of a root node, branch nodes, and leaf nodes [39]. The nodes encode conditional statements, so the path from the root to a leaf corresponds to a set of classification rules. The root is determined using information gain theory, and the tree is grown until the leaf nodes are obtained [40]. To achieve an efficient DT model, hyperparameters such as the depth of the tree, the merging criteria of the leaves, the size of the parent nodes, and the splitting predictor should be chosen properly. In the experiment, we employed Bayesian hyperparameter optimization to tune all eligible parameters. Gini's diversity index (GDI) was used as the split criterion.
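To make the split criterion concrete, the sketch below computes Gini's diversity index for a node and the weighted impurity of a candidate split. The root-node example uses the class sizes given in the data description (177 hypoxic recordings out of 552); the candidate split itself is hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini's diversity index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(left, right):
    """Weighted Gini index of the two children produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)


# Root node with the class sizes from the data description
root = ["hypoxic"] * 177 + ["normal"] * 375
root_gdi = gini_index(root)

# Hypothetical split that isolates part of the hypoxic class
left = ["hypoxic"] * 120 + ["normal"] * 30
right = ["hypoxic"] * 57 + ["normal"] * 345
child_gdi = split_impurity(left, right)
```

A split is preferred when it lowers the weighted impurity of the children relative to the parent node.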

Support vector machine (SVM)

SVM is an important machine learning concept that can be used for supervised classification and regression as well as for unsupervised data clustering [30]. SVM aims to find an optimal separating hyperplane between positive and negative samples such that the margin around the hyperplane is maximized. Assume that a set of training samples \(\left( {\varvec{x}_{1} ,\varvec{y}_{1} } \right), \ldots ,\left( {\varvec{x}_{\varvec{N}} ,\varvec{y}_{\varvec{N}} } \right)\) is given, where xi is the feature vector of a sample and yi is its class label, which is either positive or negative. To find the maximal-margin hyperplane, the following optimization problem is solved:

$$maximize\left( {\mathop \sum \limits_{i = 1}^{N} \alpha_{i} - \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K\left( {x_{i} ,x_{j} } \right)} \right)$$
(3)
$$Subject\,to \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} = 0, \quad 0 \le \alpha_{i} \le C$$
(4)

where αi is the Lagrange multiplier (weight) associated with xi and C is the regularization parameter. K is the kernel function used to calculate the similarity between xi and xj. The Gaussian radial basis function (RBF), the linear function, and the polynomial function can be used as the kernel function.

In the experiment, the RBF kernel was used, and sigma was assessed in the range of 1 to 10; the best results were obtained with sigma set to 2. The regularization parameter was evaluated in the range of 1 to 100 and was set to 10.
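The sketch below evaluates the RBF kernel with sigma = 2 and the dual objective of Eq. (3) for a given set of multipliers, and checks the usual dual constraints (the multipliers weighted by the labels sum to zero, and each multiplier lies between 0 and C = 10). The kernel parameterization exp(-||x - y||² / (2σ²)) is an assumption, since toolboxes differ in how sigma enters the kernel, and the two-sample data and multipliers are hypothetical; no solver is implemented here.

```python
import numpy as np


def rbf_kernel(X, Y, sigma=2.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) (assumed parameterization)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))


def dual_objective(alpha, y, K):
    """Eq. (3): sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    v = alpha * y
    return float(alpha.sum() - 0.5 * v @ K @ v)


def is_feasible(alpha, y, C=10.0, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    return bool(abs(alpha @ y) <= tol
                and np.all(alpha >= -tol)
                and np.all(alpha <= C + tol))


# Hypothetical two-sample problem
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])
K = rbf_kernel(X, X, sigma=2.0)
objective = dual_objective(alpha, y, K)
feasible = is_feasible(alpha, y, C=10.0)
```

A solver would search over the feasible multipliers for the maximum of this objective; the sketch only evaluates it at one point.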

Results

A total of 30 features were obtained by means of CTG-OAS. The features and marginal histograms are illustrated using the first two principal components in Fig. 2. As shown in Fig. 2, separating the recordings into normal and hypoxic classes and finding a borderline for this purpose is quite a challenging task. For this reason, we utilized several machine learning models, namely ANN, kNN, DT, and SVM.

Fig. 2 The distribution of the recordings on the first two principal components

To measure the performance of the feature selection algorithms, the 10-fold cross-validation (CV) method was used. Several performance metrics derived from the confusion matrix, namely accuracy (Acc), sensitivity (Se), specificity (Sp), quality index (QI), and F-measure, were considered. The confusion matrix consists of four prognostic indices: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). TP and TN represent the numbers of hypoxic and normal fetuses identified correctly, respectively, whereas FP represents the number of normal fetuses incorrectly identified as hypoxic and FN the number of hypoxic fetuses incorrectly identified as normal. The aforementioned performance metrics are calculated as follows:

$$Acc = \frac{TP + TN}{TP + FN + TN + FP}$$
(5)
$$Se = \frac{TP}{TP + FN}$$
(6)
$$Sp = \frac{TN}{TN + FP}$$
(7)
$$QI = \sqrt {Se*Sp}$$
(8)
$$F{ - }measure = \frac{2TP}{2TP + FP + FN}$$
(9)

Acc gives the overall efficiency of the model. Se and Sp describe the efficiency of the model on positively and negatively labeled data, respectively. QI is the geometric mean of Se and Sp, and it is a very useful metric when the distribution of data is imbalanced among the classes. F-measure is the harmonic mean of precision and recall. Furthermore, we used the receiver operating characteristic (ROC) curve, which describes the trade-off between Se and Sp. The area under this curve (AUC) was also calculated to determine the performance of the classifier.
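Eqs. (5)–(9) can be evaluated directly from the four confusion-matrix counts, as in the sketch below. The counts are hypothetical: they were chosen only to agree with the class sizes in the data description (177 hypoxic and 375 normal recordings) and are not copied from Table 3.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Acc, Se, Sp, QI, and F-measure as defined in Eqs. (5)-(9)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "Acc": (tp + tn) / (tp + fn + tn + fp),
        "Se": se,
        "Sp": sp,
        "QI": math.sqrt(se * sp),
        "F": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts with hypoxic as the positive class
metrics = classification_metrics(tp=137, fn=40, tn=352, fp=23)
```

Because QI multiplies Se and Sp, a model that ignores the minority class scores close to zero on QI even when its accuracy looks high.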

In the experimental study, feature ranking techniques were examined first. For this purpose, the Weighted by SVM, IGR, and Relief methods were employed individually with the 10-fold cross-validation procedure and the machine learning models. The selected features and classification performances are reported in Tables 1 and 2, respectively. As shown in Table 2, the best results were obtained using the combination of Weighted by SVM and the SVM classifier. Then, two wrapper methods, BE and RFE, were utilized. Sp values were superior to Se values because of the imbalanced data distribution. The features selected by at least 3 of the 5 feature selection algorithms were used to generate the final, most relevant feature subset. A total of 12 features were determined to be the most relevant; these features and their relationships with each other are illustrated in Fig. 3.
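The "at least 3 of 5" rule amounts to a simple vote count over the selections of the individual algorithms. In the sketch below, the per-algorithm feature lists are made up for illustration and do not reproduce Table 1.

```python
from collections import Counter


def consensus_features(selections, min_votes=3):
    """Keep the features chosen by at least `min_votes` selection algorithms."""
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, v in votes.items() if v >= min_votes)


# Hypothetical selections; the real ones are listed in Table 1
selections = {
    "WSVM": ["Baseline", "Mean", "STV", "LZC"],
    "IGR": ["Baseline", "LTI", "STV"],
    "Relief": ["Mean", "STV", "SampEn"],
    "BE": ["Baseline", "Mean", "LTI"],
    "RFE": ["Baseline", "STV", "LZC"],
}
final_subset = consensus_features(selections, min_votes=3)
```

Raising `min_votes` gives a smaller, more conservative subset, while lowering it admits features that only one or two algorithms favor.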

Table 1 Selected features by the feature selection algorithms
Table 2 The performance results of the feature selection algorithms with machine learning models
Fig. 3 Pairwise correlation matrix of selected features

A high correlation is observed among the IBTF features, especially among those belonging to the VLF band. As Figure 3 shows, a similar relationship can also be seen among the Baseline, Mean, and LTI features. It should be noted that the values of the IBTF features are normalized to the range of 0 to 1.

In the last step of the experiment, the most relevant features were applied as input to the machine learning models. The aggregate confusion matrices and performance metrics are given in Tables 3 and 4, respectively. Comparing Table 4 with Table 2 shows clearly that the best results were obtained using the most informative feature subset. As a result, Se of 77.40% and Sp of 93.86% were achieved. The values of the QI (85.23%) and F-measure (81.30%) metrics were also quite satisfactory. As mentioned above, another significant tool for evaluating two-class models is the ROC curve together with the AUC. An AUC close to 1 indicates high certainty in fetal hypoxia detection for the analyzed feature set. The ROC curves of the models with the most relevant features are illustrated in Fig. 4; the AUCs were 0.7890, 0.6777, 0.8591, and 0.8874 for ANN, kNN, DT, and SVM, respectively.
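As a sketch of how a ROC curve and its AUC are obtained from classifier scores, the example below sweeps the decision threshold over sorted scores and integrates the curve with the trapezoidal rule. The scores and labels are hypothetical, and the sketch assumes that all scores are distinct (tied scores would need to be grouped).

```python
import numpy as np


def roc_points(scores, labels):
    """False and true positive rates obtained by sweeping the threshold downward."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    lab = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(lab) / lab.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - lab) / (1 - lab).sum()))
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Hypothetical scores; label 1 = hypoxic, 0 = normal
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
area = auc(fpr, tpr)
```

Here the true positive rate is Se and the false positive rate is 1 − Sp, so each point on the curve corresponds to one Se/Sp trade-off.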

Table 3 The aggregate confusion matrices of machine learning models obtained after 10-fold CV procedure
Table 4 The performance results of the most relevant feature set
Fig. 4 ROC curves of the models with the most relevant features for fetal hypoxia detection

Discussion

As underlined in the introduction, CTG interpretation shows a high level of disagreement among observers because it relies on visual inspection, and it lacks practicable standards for daily clinical practice [7]. For this reason, automated CTG analysis is regarded as the most promising way to tackle these disadvantages. Feature selection algorithms are of great importance for automated CTG analysis. In this paper, we evaluate five feature selection algorithms, consisting of three filter methods and two wrapper methods, on CTG data for the fetal hypoxia detection task.

Describing FHR signals with diagnostic indices obtained from different domains, namely morphological, linear, nonlinear, and IBTF, enhances the possibility of recognizing fetal hypoxia. A crucial factor is the selection of the most relevant features, which are applied as the input to the classifiers. The use of multiple feature selection algorithms can produce better results, as in our experiment, since the most relevant features are determined according to how frequently they are selected by the feature selection algorithms. Consequently, the most informative feature set, which is a subset of the full set of 30 features, has only 12 diagnostic indices. Moreover, this subset provided the best results.

According to the results of each method used in the experiments, Sp values were higher than Se values due to the imbalanced data distribution. Using either an oversampling or a downsampling technique to balance the data distribution could lead to better results [30]. A further step for improving the classification performance would be to use different machine learning techniques. Furthermore, the spectrogram images may be a rich information source for the detection of fetal hypoxia with deep learning algorithms such as convolutional neural networks [12, 41]. We believe that more successful results can be obtained by extending the IBTF analysis in this direction.

In this section, we also present a comparison of the related works considering several parameters such as methods, datasets, the number of features used to describe the CTG signals, and performance metrics. However, it is important to be aware that a one-to-one comparison among the related works is not appropriate because these parameters differ. The comparison results are given in Table 5. Subha et al. [42] and Velappan et al. [43] used a public dataset called UCI CTG. This dataset was generated using the SisPorto software [44] and comes with 21 diagnostic features extracted automatically by the software. In other words, thanks to the SisPorto software, there is no need to apply advanced signal processing techniques to raw CTG signals for this dataset. As a result, high-performance results were achieved. Genetic algorithms, filter methods, and wrapper methods have been examined on the CTU-UHB intrapartum CTG database to reach more consistent diagnosis models. However, because of the different division criteria and the complex structure of the intrapartum recordings, this area has remained challenging. To address this issue, we generated a more relevant feature set based on five feature selection algorithms covering filter and wrapper methods. Each feature in this subset was selected by at least three feature selection algorithms. As a result, we achieved 88.58% classification accuracy.

Table 5 Comparison of the related works

Conclusion

CTG is one of the fetal surveillance techniques used routinely in obstetric clinics to monitor fetal well-being. Its main weakness is its reliance on visual examination, and for this reason, computerized systems are in demand. In this study, we applied advanced signal processing techniques to obtain reliable segments and to extract features describing the signals. Then, the combinations of four machine learning algorithms and five feature selection algorithms were examined. As a result, 12 features were determined to be the most relevant from the full set of 30 diagnostic features. With this subset, we achieved Acc of 88.58%, Se of 77.40%, and Sp of 93.86%. This work indicates that determining the optimal feature set leads to more consistent and effective diagnosis models.