
1 Introduction

In medical imaging diagnoses, where decisions can have significant implications for an individual's health, it is essential to gain a thorough understanding of the factors influencing these decisions. While Machine Learning (ML) models have proven effective in diagnosing a variety of medical conditions in medical imaging [29], their limited interpretability poses a challenge to their broader adoption. Moreover, the recently introduced European Union (EU) Communication on Fostering a European approach to AI [1] specifically targets explainability as a key concern for the deployment of ML and Artificial Intelligence (AI) models.

Another prevalent challenge in the medical imaging domain is the issue of class imbalance within the dataset. Methods such as the Synthetic Minority Oversampling Technique (SMOTE), Edited Nearest Neighbour (ENN), and Mixup, combined as STEM [16], leverage the full distribution of minority classes and can effectively address both inter-class and intra-class imbalances. In [16], STEM was applied in conjunction with an ensemble of ML classifiers, producing promising outcomes. However, understanding the reasoning behind ML model predictions remains a complex task. Furthermore, as the volume of instances and the specificity of problems grow, the complexity of the derived solutions also increases.

Building trust in ML classifiers and understanding the behaviour of the solutions is pivotal to their broader acceptance. Employing inherently explainable models is a useful strategy when generating Explainable AI models. Grammatical Evolution (GE) [26], an Evolutionary Computation (EC) technique, leverages grammars to define and constrain the syntax of potential solutions, producing inherently explainable models [22].

To address these challenges, we developed a classification system based on GE. Our study includes a comprehensive comparison with an ensemble of other ML classifiers. Notably, GE models show enhanced interpretability compared to traditional ML models: GE provides solutions in the form of symbolic expressions, offering a more intuitive understanding of the decision-making process. This emphasis on interpretability is crucial, especially in healthcare, where understanding the rationale behind decisions is of paramount importance.

Our research hypothesises that the STEM augmentation technique, combined with an approach rooted in GE, produces more interpretable solutions than the other ensemble ML classifiers.

The contributions of this paper are as follows. Firstly, we develop a method that combines a GE classifier with STEM, outperforming an ensemble of ML classifiers, as indicated by the superior AUC. Secondly, our approach distinguishes itself by offering more interpretable solutions compared to the ensemble method. Finally, the paper presents rigorous statistical analyses to comprehensively evaluate the performance of implemented data augmentation techniques on each data setup.

The rest of the paper is structured as follows: Sect. 2 reviews the existing literature. Section 3 outlines the proposed methodology, and Sect. 4 addresses experimental details performed in this work. Results and discussion are described in Sect. 5, and Sect. 6 presents the conclusion and future guidelines.

2 Literature Review

In the realm of medical applications, particularly in the context of breast cancer diagnosis, the issue of imbalanced datasets is a critical concern. Imbalances, where one class significantly outweighs the other, can introduce biases and compromise the reliability of ML models. Implementing effective strategies for class balancing, such as oversampling, undersampling, and their combination, results in a more balanced and representative training dataset [9]. Previous studies [14, 17] have recognized the impact of class imbalance in medical datasets for ML tasks.

Moreover, ML algorithms have demonstrated notable efficiency in the classification of medical data. A compelling study showcases the effectiveness of ensembles, where Bayesian networks and Radial Basis Function (RBF) classifiers with majority voting resulted in an accuracy of 97% [20] when applied to the Wisconsin Breast Cancer (WBC) dataset. Furthermore, an approach that combined linear and non-linear classifiers using Micro Ribonucleic Acid (miRNA) profiling achieved an impressive accuracy of 98.5% [28].

While these findings are promising, ML algorithms may struggle to contextualize information and are susceptible to unexpected or undetected biases originating from input data. Additionally, they often lack transparent justifications for their predictions or decisions [25]. In response to this, employing GE can yield interpretable solutions. As a variant of Genetic Programming, GE evolves human-readable solutions, offering explanations for the rationale behind its classification decisions, which is a significant advantage over current paradigms in unsupervised and semi-supervised learning [10].

Previous studies have already demonstrated the effectiveness of GE across a range of ML tasks. It has proven valuable for feature generation and feature selection [11], as well as for hyperparameter optimization [24]. The GenClass system [3], built upon GE, demonstrates promising outcomes and outperforms RBF networks in certain classification problems. Its authors utilized thirty benchmark datasets from the UCI and KEEL repositories, including Haberman, which consists of breast cancer instances. While GE has excelled in these areas, there are still avenues for further exploration.

In this paper, we aim to investigate the efficiency of utilizing GE as a medical imaging classifier combined with STEM to handle imbalance distributions of data samples, particularly in breast cancer diagnosis. Leveraging the interpretive and adaptable features of GE, our objective is to achieve accurate and reliable outcomes that can be easily explained.

3 Methodology

For analysis, we utilize two primary breast cancer datasets. One consists of images, the Digital Database for Screening Mammography (DDSM) [18], while the other, the WBC [31] dataset, consists of tabular data. DDSM is a comprehensive collection of mammograms, encompassing both normal and abnormal images. For this study, we focused on DDSM's Cancer 02 volume and three volumes of normal samples (Volumes 01-03). By selecting one volume of cancer images against three volumes of normal images, we maintain a realistic class imbalance ratio. These images come from the Craniocaudal (CC) and Mediolateral Oblique (MLO) views of both the left and right breasts. We work with 152 cancerous images and 876 healthy ones. Each image yields four inputs: the entire breast (I), the top segment (It), the middle segment (Im), and the bottom segment (Ib).

Fig. 1. Outline of the proposed approach for breast cancer classification using GE and other classifiers.

The WBC dataset consists of 30 features derived from Fine Needle Aspiration (FNA) samples of breast masses, categorising patients into benign (non-cancerous) and malignant (cancerous) cases. It comprises 212 malignant samples and 357 benign samples.

To create a dataset containing breast cancer images from the DDSM image for evaluating the proposed methodology, we first need to extract features that will be used for training. This involves isolating the breast region, eliminating irrelevant background data, segmenting the breast region, and extracting pertinent features to generate a comprehensive training dataset of breast segments. Initially, a median filter is applied to reduce noise within the images. Subsequently, non-essential background data, often containing machine-generated labels such as ‘CC’ or ‘MLO’, is removed. For this step, we employed a precise Otsu thresholding technique. Following this, the segmenting process proposed in [27] effectively partitioned images into three overlapping segments.

Feature extraction is the next critical phase. In our study, we extracted a set of Haralick's Texture Features [15] from both whole and segmented images. The selection of these features is based on the hypothesis that there are discernible textural differences between normal and abnormal images. Specifically, we compute thirteen distinct Haralick features from the Gray-Level Co-occurrence Matrix (GLCM), employing four orientations corresponding to the two diagonal and two adjacent neighbours in the grey-level image. With thirteen features over four orientations, this process generates a total of 52 features per segment or image.
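A compact numpy sketch of the GLCM computation and a representative subset of Haralick's features (contrast, IDM, and entropy, three of the thirteen) may clarify where the 13 features × 4 orientations = 52 values per image come from. The binning to a small number of grey levels is an assumption for illustration.

```python
import numpy as np

def glcm(image, dx, dy, levels=8):
    """Grey-Level Co-occurrence Matrix for one (dx, dy) neighbour offset."""
    g = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                g[image[r, c], image[r2, c2]] += 1
    g += g.T                      # make the matrix symmetric
    return g / g.sum()            # normalise to joint probabilities

def haralick_subset(p):
    """Three of Haralick's thirteen features: contrast, IDM, and entropy."""
    i, j = np.indices(p.shape)
    contrast = np.sum((i - j) ** 2 * p)
    idm = np.sum(p / (1.0 + (i - j) ** 2))            # inverse difference moment
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return contrast, idm, entropy

# Four neighbour offsets: two adjacent (0 deg, 90 deg) and two diagonal (45 deg, 135 deg)
offsets = [(1, 0), (0, 1), (1, 1), (1, -1)]
```

Computing all thirteen features over the four offsets for one image yields the 52-dimensional feature vector described above.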

High class imbalance present in the utilized datasets poses a significant challenge in developing robust and accurate predictive models. Therefore, explicit data augmentation has been implemented in the training set to effectively address this class imbalance challenge. Using nine distinct augmentation approaches outlined in Sect. 4.3, synthetic samples are generated to enrich the dataset with more discriminative information, ultimately improving the learning capabilities of the model.

In the last step, the GE classifier and an ensemble of other ML classifiers are trained separately to make predictions on the test set. Augmented training data is used, while the original imbalanced test set is used for testing. For ensembling, eight ML classifiers are used as mentioned in Sect. 4.5. The top three classifiers, based on AUC, are selected and combined through majority voting to create the final predictions. The complete pipeline of the proposed approach is shown in Fig. 1.
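The top-3 selection and majority vote can be sketched as follows; the classifier names and AUC values are hypothetical, and in practice the predictions would come from the PyCaret-trained models.

```python
import numpy as np

def majority_vote(classifier_aucs, predictions, k=3):
    """Select the k classifiers with the highest AUC and combine their
    binary test-set predictions by majority vote."""
    # rank classifier names by AUC, best first
    top = sorted(classifier_aucs, key=classifier_aucs.get, reverse=True)[:k]
    votes = np.array([predictions[name] for name in top])   # shape (k, n_samples)
    final = (votes.sum(axis=0) > k // 2).astype(int)        # majority of k votes
    return final, top
```

For example, with hypothetical AUCs {"RF": 0.88, "LDA": 0.90, "QDA": 0.87, "KNN": 0.80}, the LDA, RF, and QDA predictions would be combined sample-wise by majority.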

4 Experimental Details

The DDSM and WBC datasets are used to evaluate the proposed technique. The study employs five different data setups to train the classifiers. For the WBC dataset, a single setup is utilized, consisting of 30 breast mass features per sample acquired through FNA.

In contrast, the DDSM dataset includes images from two views, CC and MLO. To conduct experiments, the dataset is categorized into four distinct configurations based on these views. In the initial setup, denoted as “\(S_{CC}\)”, data is exclusively extracted from segments of the CC view. Conversely, the second category, “\(S_{MLO}\)”, comprises segmented images exclusively from the MLO view. The third configuration, “\(S_{CC+MLO}\)”, combines segments from both views. Lastly, the fourth setup, “\(F_{CC+MLO}\)”, considers the full image (non-segmented) features from both the CC and MLO views for comprehensive analysis. Each of these setups uses 52 features per segment or image.

We divided the datasets into training and testing sets at an 80:20 ratio. Notably, all DDSM configurations exhibit significant class imbalance, with a class ratio of 6:94 for the \(S_{CC}\), \(S_{MLO}\), and \(S_{CC+MLO}\) setups. For \(F_{CC+MLO}\), the ratio of positive to negative classes is 15:85. Likewise, the WBC dataset has a class distribution of 37% positive and 63% negative samples, as illustrated in Fig. 2.

Fig. 2. Concentric ring chart for setup description. Each ring is a setup, and the coloured areas indicate the percentage of positive training samples. The legend includes the total positive and negative training samples.

4.1 System Settings

All the ML experiments were conducted using the PyCaret library [2]. The GRAPE [8] framework was used to perform GE experiments. For statistical analysis, we employed the AutoRank Python library [19] to evaluate the performance of the implemented augmentation approaches. Our code, along with our dataset configurations, is available in our GitHub repository.

4.2 Performance Metric

To evaluate the performance of the designed approach, AUC has been selected as the assessment metric; it is computed using the trapezoidal rule. AUC has become a widely accepted performance measure in classification problems due to its reliability, particularly in the context of imbalanced datasets [13, 21]. AUC serves as a comprehensive metric, encompassing both sensitivity (Eq. 1) and specificity (Eq. 2) across various threshold values. \(T_{Pos}\) denotes true positives, \(T_{Neg}\) true negatives, \(F_{Pos}\) false positives, and \(F_{Neg}\) false negatives.

$$\begin{aligned} Sensitivity = \frac{T_{Pos}}{T_{Pos} + F_{Neg}} \end{aligned}$$
(1)
$$\begin{aligned} Specificity = \frac{T_{Neg}}{T_{Neg} + F_{Pos}} \end{aligned}$$
(2)
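The trapezoidal-rule computation of AUC named above can be sketched directly from the two equations: sensitivity is the true positive rate and (1 − specificity) the false positive rate, traced over every decision threshold. This is a minimal illustration, not the library implementation used in the experiments.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the trapezoidal rule over the ROC curve (FPR vs. TPR),
    sweeping the decision threshold across every observed score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    tpr = [((scores >= t) & (y_true == 1)).sum() / pos for t in thresholds]  # sensitivity
    fpr = [((scores >= t) & (y_true == 0)).sum() / neg for t in thresholds]  # 1 - specificity
    # trapezoidal rule: sum the areas of trapezoids between consecutive ROC points
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))
```

A perfectly separating score assignment yields an AUC of 1.0, while chance-level scores yield 0.5.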

4.3 Class Balancing

The methods utilized for generating synthetic data with the aim of equalizing the class distribution ratio include the Synthetic Minority Oversampling Technique (SMOTE) [7], Borderline SMOTE (BSMOTE) [14], SMOTENC (S-NC) [7], Support Vector Machine SMOTE (SVM-S) [23], Mixup [32], and ADASYN (ADA) [17]. Additionally, three hybrid methods, SMOTE Edited Nearest Neighbour (S-ENN) [30], SMOTE-Tomek (S-Tomek) [5], and the combination of SMOTE, ENN, and Mixup (STEM), are also implemented and compared against each other.

Notably, STEM generates a balanced number of samples for each class. Compared to other methods, it demonstrates the ability to increase the number of data samples more extensively, resulting in improved model performance.
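The Mixup component of STEM is simple enough to sketch in a few lines: each synthetic sample is a convex combination of two training samples, with the mixing weight drawn from a Beta distribution. This illustrates only the Mixup step; the SMOTE and ENN stages of STEM are typically provided by a library such as imbalanced-learn and are not reproduced here. The alpha value shown is an assumption.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Create one synthetic sample as a convex combination of two
    training samples, with the weight drawn from Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2       # interpolated feature vector
    y = lam * y1 + (1 - lam) * y2       # soft label between the two classes
    return x, y
```

Repeatedly applying this to pairs drawn from the minority class (after SMOTE and ENN cleaning) grows the pool of discriminative samples, which is how STEM extends the dataset more than the individual methods.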

4.4 Grammatical Evolution

GE’s grammars are typically defined in Backus-Naur Form (BNF), a notation represented by the tuple \(\langle N, T, P, S\rangle\), where N is the set of non-terminals, transitional structures usually with semantic meaning; T is the set of terminals, the items that appear in the phenotype; P is the set of production rules; and S is the start non-terminal. The following simple grammar was created to evolve solutions for the first four data setups with 52 numerical features, whereas, for the last setup, 30 numerical features were used:

figure a

This grammar permits the use of basic arithmetic operations (addition, subtraction, multiplication, and division, protected in case the divisor equals 0) and the inclusion of real-valued constants. These constants are helpful because GE can explore beyond the given parameter space to minimize the error between expected and predicted outputs, something that does not happen with other ML classifiers. The non-terminal X encompasses the fifty-two numerical features for the first four setups of the DDSM dataset and the thirty numerical features for the WBC dataset.
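The grammar itself appears only as a figure in the original; the BNF below is a hypothetical reconstruction consistent with the surrounding description (arithmetic operators with protected division, real-valued constants, and a feature non-terminal X), not the paper's exact grammar:

```bnf
<expr> ::= <expr> <op> <expr> | ( <expr> ) | <X> | <const>
<op>   ::= + | - | * | /              ; division protected when divisor is 0
<X>    ::= x[0] | x[1] | ... | x[51]  ; x[0]..x[29] for the WBC setup
<const> ::= <digit>.<digit> | -<digit>.<digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```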

The output domain of the evaluations is \(o \in (-\infty, \infty)\). Subsequently, a sigmoid function is applied to constrain the values to \(\sigma(o) \in (0, 1)\). For binary classification, the sigmoid output is typically interpreted as the probability of belonging to class 1, and we therefore use \(\sigma(o)\) to calculate AUC. Table 1 presents the experimental parameters used in this work:
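Evaluating an evolved individual therefore amounts to computing its symbolic expression with protected division and squashing the result through a sigmoid. The phenotype below is hypothetical, with feature indices chosen only to echo the frequently selected features reported later (IDM, contrast, difference entropy); it is a sketch of the mechanism, not an evolved solution from the paper.

```python
import math

def pdiv(a, b):
    """Protected division: return 1.0 when the divisor is zero."""
    return a / b if b != 0 else 1.0

def sigmoid(o):
    """Squash the unbounded model output into a class-1 probability."""
    return 1.0 / (1.0 + math.exp(-o))

def predict_proba(features):
    # hypothetical evolved phenotype, e.g. x[17]*2.5 - pdiv(x[5], x[41])
    o = features[17] * 2.5 - pdiv(features[5], features[41])
    return sigmoid(o)
```

The resulting probabilities over the test set feed directly into the AUC computation described in Sect. 4.2.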

Table 1. List of parameters used to run GE

4.5 Other Classifiers

We also used the augmented training data to train a diverse ensemble of eight ML classifiers. This ensemble includes Random Forest (RF), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis, LightGBM, XGBoost, AdaBoost, KNN, and Extra Trees models. Initially, a comprehensive model is trained using all eight classifiers. Subsequently, based on the AUC metric, the three best-performing models are selected. These selected models are then combined through a majority voting approach. The final predictions are made on the test dataset, which consists of imbalanced and unseen samples.

5 Results and Discussion

To evaluate the performance of the proposed method, five distinct data setups are employed. Four configurations are derived from the DDSM dataset, considering variations in views, segments, and full images. The fifth setup is from the WBC dataset. To enhance the robustness of the training setups, nine augmentation approaches are applied and compared. The assessment is conducted using an ensemble of other ML classifiers, alongside GE.

The performance of the classifiers is compared based on AUC for each dataset. The ensemble classifiers are denoted by their respective initials: \(L_d\) for Linear Discriminant Analysis, Q for Quadratic Discriminant Analysis, E for Extra Trees, R for Random Forest, \(L_i\) for LightGBM, K for KNN, A for AdaBoost, and X for XGBoost. It is important to note that the AUC values of the other ensemble classifiers are presented for a single run, and they are then compared against the median AUC derived from 30 runs conducted with GE.

Table 2 provides an overview of the results. In the first setup, \(S_{CC}\), an AUC of 0.91 was achieved, outperforming the ensemble of \(L_dQE\), which obtained an AUC of 0.90. Similarly, in the second setup, \(S_{MLO}\), an AUC of 0.90 was attained, while the ensemble of \(L_dQE\) achieved a slightly lower AUC of 0.84.

For the third setup \(S_{CC+MLO}\), an AUC of 0.92 was observed using the GE classifier, outperforming other classifiers that yielded the highest AUC of 0.87 using \(L_dQE\). When the classifiers were trained on full image features in setup \(F_{CC+MLO}\), the highest AUC values were 0.94 and 0.85, obtained by the GE classifier and the ensemble of \(L_dQE\), respectively.

When comparing the AUC using the WBC dataset, both GE and the ensemble of \(AKL_r\) achieved an AUC of 0.99.

Table 2. A comparison of the AUC for GE and the ensemble approaches using the nine different augmentation techniques for each data setup.

The augmentation approaches are compared using the boxplot presented in Fig. 3. The plot indicates the AUC obtained from all nine augmentation approaches for each setup across all 30 runs. The horizontal line in red indicates the median value of the respective group.

Fig. 3. Boxplot analysis comparing the competing augmentation approaches and their AUC distributions across multiple runs.

GE provides valuable insights into the most informative features used in the solutions, as demonstrated in Table 3, which presents the most frequently used features for each setup, sorted by their impact on the solutions. Common features consistently found in Table 3 for the DDSM dataset include “Inverse Difference Moment (IDM)” (feature 17), “Contrast” (feature 5), and “Difference Entropy” (feature 41). Both contrast and IDM capture differences in grey levels between neighbouring pixels, while entropy indicates the level of randomness in the grey levels.

For the WBC dataset, as shown in Table 3, the top three features that consistently appear in the solutions are 21, 20, and 27, corresponding to “Concave Point Worst”, “Fractal Dimension”, and “Radius Worst” respectively. The concave point worst feature indicates the severity of the concave portion of the shape, with “worst” denoting the highest mean value. The “fractal dimension” is a crucial characteristic that provides information related to the geometric shape of the fractals. The third feature, radius worst, represents the largest mean value for the distances from the centre to points on the perimeter.

While other ML models may share the feature of interpretability, they often present challenges that GE does not encounter. Decision trees and RF, though interpretable, lose clarity with complex structures and aggregation [4]. LDA relies on the Gaussian distribution of the data and assumes that the covariance of two classes is the same [12], limiting its applicability. In contrast, GE does not depend on these factors and maintains transparency throughout its evolution, even when addressing complex and non-linear problems.

Table 3. This analysis unveils prevalent features used by GE in all five setups. For \(S_{CC}\) and \(S_{MLO}\), percentages are computed from 8684 and 7945 occurrences. Likewise, contributions to \(S_{CC + MLO}\) and \(F_{CC + MLO}\) are based on 8138 and 8522 occurrences, respectively. The features of WBC are also examined, with percentages drawn from 9076 appearances.

5.1 Statistical Analysis

The statistical comparison of the implemented data augmentation techniques involved a non-parametric Bayesian signed-rank test [6] applied to each dataset. In our analysis, conducted on nine augmentation techniques with 30 paired AUC samples each, the test classifies each pair of methods as larger, smaller, or inconclusive. The approaches listed in the rows are compared with the methods presented in the corresponding columns. The Bayesian signed-rank test revealed significant distinctions among the techniques. Cases where STEM outperformed the other approaches are underlined in Table 4.

In the \(S_{CC}\) setup, as illustrated in Table 4(a), STEM, Mixup, SMOTE, ADA, S-NC, SVM-S, S-Tomek and BSMOTE all exhibit larger medians than S-ENN.

The statistical comparison of medians depicted in Table 4(b) among various augmentation populations reveals notable differences for \(S_{MLO}\) setup. STEM, S-NC, S-Tomek, ADA, and Mixup exhibit larger medians compared to BSMOTE, SVM-S, SMOTE, and S-ENN.

Similarly, for the \(S_{CC+MLO}\) setup in Table 4(c), STEM again showcases its effectiveness by outperforming S-NC, BSMOTE, Mixup, SMOTE, and ADA in medians. S-ENN demonstrates superiority by exhibiting larger medians than Mixup, SMOTE, and ADA, while S-Tomek outperforms SMOTE in median values. SVM-S, in particular, stands out with a larger median than ADA.

Moreover, STEM stands out by consistently surpassing S-Tomek, Mixup, BSMOTE, ADA, SVM-S, SMOTE, and S-ENN in the median values presented in Table 4(d) for \(F_{CC+MLO}\). Additionally, S-NC demonstrates superiority over SMOTE and S-ENN, while S-Tomek outperforms S-ENN in median values. Mixup, BSMOTE, ADA, SVM-S, and SMOTE all exhibit larger medians than S-ENN.

Finally, in the WBC setup, as depicted in Table 4(e), STEM emerged as the top-performing method, surpassing S-NC, BSMOTE, S-Tomek, Mixup, SVM-S, SMOTE, and ADA. S-NC exhibited a higher median than SMOTE and ADA, while Mixup outperformed SMOTE in median value. SVM-S demonstrated a larger median than SMOTE and ADA.

Table 4. The results of the Bayesian signed-rank test are summarized here for the nine augmentation approaches for each data setup. Arrows indicate the direction of differences: \(\Uparrow \) for larger, \(\Downarrow \) for smaller, - for inconclusive, and N/A for not applicable results. A family-wise significance level of \(\alpha = 0.05\) is employed.

The Bayesian analysis results are summarized in Fig. 4. It reveals that STEM, a combination of S-ENN and Mixup, emerges as the top-ranking approach. This result underscores the effectiveness of this combined strategy in enhancing performance. Notably, S-ENN and Mixup individually secure the second and third positions, further affirming the significance of this ensemble approach.

Fig. 4. Overall results of the Bayesian signed-rank test. The cumulative score is the total number of times one approach outperforms another. STEM obtained a cumulative score of 23 out of a maximum possible 40 (each approach is compared against the other 8 approaches across 5 setups), outperforming the other approaches. Each colour represents a distinct test setup used for the evaluation.

6 Conclusion and Future Work

In this study, we addressed class imbalance and interpretability challenges in medical imaging diagnosis by using GE to produce a classifier trained on data augmented by the recently introduced STEM technique. Our approach not only delivers interpretable solutions but also outperforms an ensemble of other ML classifiers. The analysis conducted on the DDSM and WBC datasets emphasizes the effectiveness of GE, as evidenced by improvements in AUC and its ability to identify critical data features. Notably, our Bayesian signed-rank test results confirm that STEM emerges as the best-performing augmentation approach. The improved AUC and enhanced interpretability of our approach can help build trust and facilitate informed decisions. Thus, our study validates the proposed hypothesis, demonstrating the efficacy of the combined GE and STEM approach.

For future research, we suggest improving performance by incorporating additional image attributes, such as wavelet transformations and local binary patterns, to enhance the feature set and dataset diversity. Furthermore, exploring the mixture of different datasets to assess the robustness of our approach across various image data sources would be interesting.