1 Introduction

Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death among women [6]. Early diagnosis plays an important role in both the treatment and the prognosis of breast cancer. It has been extensively reported that patients diagnosed with smaller primary breast tumors have significantly higher disease-free survival and overall survival than patients with locally advanced breast tumors. Early detection and diagnosis of breast cancer are therefore of great interest. Various imaging modalities have been applied to breast cancer diagnosis. Among these, ultrasound (US) imaging, which employs sound waves to generate images of the internal morphology of the breast, is the most widely used method because it is safe and painless. US can help diagnose breast lumps and other abnormalities noninvasively.

Despite its usefulness and wide applicability, breast US suffers from limited specificity and interobserver variability, both of which contribute to high rates of false positives and false negatives. These misdiagnoses lead either to unnecessary biopsies and surgeries or to missed cases. To address these challenges, there has been growing interest in applying machine learning to automatic US breast tumor identification [4].

Unlike conventional US diagnosis, machine learning approaches make decisions based on extracted computational features. Feature extraction can be performed using either deep neural networks [2] or spatial and texture computational tools. While features derived from deep neural networks are usually elusive and lack interpretability, spatial and texture computational tools extract features from radiologic images that are directly related to tumor size and shape, the image intensity histogram, and the relationships between image voxels. The mathematical definitions of these features are explicit and easy to reproduce. Some of these features, such as tumor texture, have been shown to be useful for differentiating malignant from benign breast tumors. In this study, we aimed to develop a learning model based on computational ultrasound image features and to apply it to breast tumor identification. Clinically relevant features were used to differentiate breast tumor malignancy.

2 Method

Fig. 1. Adopted pipeline of the research.

Fig. 2. Sample of a malignant ultrasound image, a benign ultrasound image, and their corresponding masks.

Radiomics studies follow a fairly standard pipeline [3], which we adopted. First, we prepared the data, for which segmentations of the regions of interest (ROIs) were already available. Next, we extracted features from the ROIs with the PyRadiomics package. Then, we selected features and prepared them for modeling. Finally, we built our model and evaluated it with common metrics. The adopted pipeline is shown in Fig. 1.

2.1 Data Preparation

The BUSI dataset [1] was collected from 600 female patients and is divided into three categories: benign, malignant, and normal. Both the ultrasound images and the segmentation masks are stored as 8-bit PNG files. A malignant ultrasound image, a benign ultrasound image, and their corresponding masks are shown in Fig. 2.

Since radiomics extracts information from the region of interest (ROI) rather than from the entire image, an ultrasound image containing more than one tumor yields more than one ROI, so the number of tumor ROIs exceeds the number of ultrasound images. After pairing the ultrasound images with their masks, we obtained 454 benign tumor ROIs and 211 malignant tumor ROIs.

2.2 Feature Extraction

PyRadiomics [7] is an open-source Python library for radiomics feature extraction. With PyRadiomics, we extracted 1318 image-related features, which consist of eight classes:

  • First Order Statistics

  • Shape-based (2D)

  • Shape-based (3D)

  • Gray Level Co-occurrence Matrix (GLCM)

  • Gray Level Run Length Matrix (GLRLM)

  • Gray Level Size Zone Matrix (GLSZM)

  • Neighbouring Gray Tone Difference Matrix (NGTDM)

  • Gray Level Dependence Matrix (GLDM).

2.3 Feature Selection

A feature space of too high a dimension hinders classification algorithms, so feature selection is required. The following steps reduce the number of features to an appropriate range.

Data Standardization. Standardization puts all features on a common scale and prevents differences in order of magnitude from affecting the selection and modeling processes.

We standardized the data by the formula

$$\begin{aligned} \hat{x} = \frac{x - \mu }{\sigma }, \end{aligned}$$
(1)

where x represents the original data and \(\hat{x}\) represents the standardized data. \(\mu \) represents the mean of the data, and \(\sigma \) represents the standard deviation of the data.
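As a minimal sketch of Eq. (1), the following standardizes one feature column using only the Python standard library. The function name `standardize` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores (x - mu) / sigma for one feature column."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Toy column, for illustration only.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(column)
```

When the data are later split into training and test sets, a common practice is to estimate the mean and standard deviation on the training part only and reuse them for the test part.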

Mutual Information Filtering. The mutual information (MI) of a chosen feature X and the label Y is defined as

$$\begin{aligned} I(X ; Y)=E\left[ I\left( x_{i} ; y_{j}\right) \right] =\sum _{x_{i} \in X} \sum _{y_{j} \in Y} p\left( x_{i}, y_{j}\right) \log \frac{p\left( x_{i}, y_{j}\right) }{p\left( x_{i}\right) p\left( y_{j}\right) }, \end{aligned}$$
(2)

where \(x_i\) ranges over the values taken by the chosen feature X, \(y_j\) ranges over the values of the binary label Y, and \(p\) denotes the corresponding joint and marginal probabilities.

For a chosen feature, the less mutual information it shares with the label, the less information it provides for classification. On this principle, we filtered the features by MI: features whose MI with the label was below the threshold of 0.1 were eliminated.
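The filtering step can be illustrated with a plug-in estimate of Eq. (2). This is a sketch under stated assumptions, not the estimator used in the study: continuous features are first discretized by equal-width binning (the text does not specify how probabilities were estimated), MI is measured in nats, and the names `mutual_information`, `bin_feature`, and `mi_filter` are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += p_xy * log(c * n / (count_x[x] * count_y[y]))
    return mi

def bin_feature(values, n_bins=4):
    """Equal-width binning of a continuous column (an assumed discretization)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant column: everything lands in bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mi_filter(features, labels, threshold=0.1):
    """Return the names of features whose binned MI with the label reaches the threshold."""
    return [name for name, column in features.items()
            if mutual_information(bin_feature(column), labels) >= threshold]

# Toy data, for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2],
    "noise": [0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2],
}
kept = mi_filter(features, labels, threshold=0.1)
```

The number of bins affects the estimate, so the threshold of 0.1 is only meaningful relative to the estimator actually used.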

Recursive Feature Elimination. The recursive feature elimination (RFE) method works with a predictive model. In each round, the model identifies the feature that contributes least to the result, and that feature is eliminated. The process continues until the number of remaining features no longer exceeds a preset threshold.

In our implementation, we used a random forest with an ensemble of 25 decision trees as the predictive model during RFE, and 30 features were selected.
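The elimination loop itself can be sketched as follows. To keep the example self-contained, the random forest is replaced by a simple stand-in scorer (the gap between class means scaled by the spread of the feature); this is a swapped-in technique for illustration only, and all function names are hypothetical.

```python
from statistics import fmean, pstdev

def class_separation_scores(features, labels):
    """Stand-in for model importances: |mean(pos) - mean(neg)| / spread per feature."""
    scores = {}
    for name, column in features.items():
        pos = [v for v, y in zip(column, labels) if y == 1]
        neg = [v for v, y in zip(column, labels) if y == 0]
        spread = pstdev(column) or 1.0
        scores[name] = abs(fmean(pos) - fmean(neg)) / spread
    return scores

def recursive_feature_elimination(features, labels, n_keep,
                                  score_fn=class_separation_scores):
    """Re-score the remaining features each round and drop the weakest one."""
    remaining = dict(features)
    while len(remaining) > n_keep:
        scores = score_fn(remaining, labels)  # a real model would be refit here
        weakest = min(scores, key=scores.get)
        del remaining[weakest]
    return list(remaining)

# Toy data, for illustration only.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "strong": [0.0, 0.1, 0.2, 1.0, 1.1, 1.2],
    "weak": [0, 1, 0, 1, 0, 1],
    "noise": [1, 0, 1, 1, 0, 1],
}
selected = recursive_feature_elimination(features, labels, n_keep=2)
```

With a univariate scorer like this one the loop reduces to a plain ranking. The recursion only matters when `score_fn` refits a multivariate model, such as a random forest, on the remaining features, because importances can then change after each removal.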

It is worth mentioning that these feature selection steps were not fixed in advance. Instead, they were determined by trying common feature selection methods (including filter, wrapper, and embedded methods), guided by the principle that each single selection step should eliminate an appropriate number of features. We avoided removing too many or too few features in one step: the extreme threshold needed for the former distorts the training data distribution, while the latter defeats the purpose of selection.

2.4 Modeling and Evaluation

We chose logistic regression (LR), a simple machine learning model for classification, with an \(L_1\)-norm penalty and liblinear as the solver. The maximum number of iterations was set to \(10^4\).

For evaluation, we used common metrics, including:

  • F1-score

  • Accuracy

  • Sensitivity

  • Specificity

  • Precision

  • ROC curve [5] and area under curve (AUC).
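For reference, the threshold-based metrics in the list above can all be derived from the four confusion-matrix counts. The sketch below treats malignant as the positive class (label 1); the function names and the convention of returning 0.0 for an empty denominator are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_ratio(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F1-score."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = safe_ratio(tp, tp + fn)  # recall on malignant ROIs
    specificity = safe_ratio(tn, tn + fp)  # recall on benign ROIs
    precision = safe_ratio(tp, tp + fp)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": safe_ratio(2 * precision * sensitivity, precision + sensitivity),
    }

# Toy predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```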

Each metric was calculated on the 30% test set for each of 50 random splits of the dataset.
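The repeated-split protocol can be sketched as below. The 95% CI here is a normal approximation around the mean over splits, which is an assumption, since the text does not state how the interval was computed. The threshold classifier is a hypothetical stand-in for the actual model, and the data are synthetic.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def random_split(n_samples, test_fraction, rng):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI of the mean (assumed CI method)."""
    centre = fmean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return centre, centre - half_width, centre + half_width

def repeated_evaluation(n_samples, evaluate, n_splits=50, test_fraction=0.3, seed=0):
    """Score a model on n_splits random train/test splits and summarize the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train_idx, test_idx = random_split(n_samples, test_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return mean_with_ci(scores)

# Synthetic one-feature data with two balanced classes, for illustration only.
xs = [i / 20 for i in range(20)]
ys = [0] * 10 + [1] * 10

def threshold_accuracy(train_idx, test_idx):
    """Stand-in model: cut halfway between the class means of the training part."""
    pos = [xs[i] for i in train_idx if ys[i] == 1]
    neg = [xs[i] for i in train_idx if ys[i] == 0]
    cut = (fmean(pos) + fmean(neg)) / 2
    hits = sum((xs[i] > cut) == (ys[i] == 1) for i in test_idx)
    return hits / len(test_idx)

mean_acc, ci_low, ci_high = repeated_evaluation(len(xs), threshold_accuracy)
```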

3 Results and Discussions

3.1 Metrics Performance

The performance of the LR model on the selected metrics is listed in Table 1 and visualized in Fig. 3. The error bars indicate the 95% confidence interval (95% CI).

The figure shows that the model is robust to different splits of the training and test sets, so the 95% CIs of the metrics are narrow.

The sensitivity is relatively low compared with the other metrics. This can be attributed to data imbalance in the BUSI dataset, where the number of available benign ROIs is roughly twice that of malignant ones. Given the definition of sensitivity, it may be improved by properly oversampling the positive samples, i.e., malignant ROIs.
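One simple form of oversampling is to duplicate randomly chosen minority-class samples until both classes are equally represented. The sketch below illustrates that idea only; it was not part of the reported experiments, and the function name `oversample_minority` is hypothetical.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate; return copies of the inputs unchanged.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: four benign (0) and two malignant (1) samples, for illustration only.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
labels = [0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
```

Oversampling should be applied to the training split only, so that duplicated samples never appear in both the training and the test data.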

Fig. 3. Numerical metrics

Table 1. Metrics

3.2 ROC Curve

The ROC curve of the model on a random split of the dataset is shown in Fig. 4. The corresponding AUC is 0.9469.

The ROC curve and corresponding AUC reveal that the model has a relatively high predictive value from an overall perspective, especially considering the imbalance of the dataset in this study.

Fig. 4. The blue line is the ROC curve of our model on a random split of the dataset. (Color figure online)
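For readers unfamiliar with how an ROC curve and its AUC are obtained from predicted scores, the following sketch sweeps the decision threshold from high to low and integrates the resulting curve with the trapezoidal rule. The function names are hypothetical and the scores are toy values, not outputs of the model in this study.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
area = auc(curve)
```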

3.3 Calibration Curve

The calibration curve of the model that produced the ROC curve above is shown in Fig. 5. When the predicted value is low (<0.3) or high (>0.7), the calibration curve is close to the perfectly calibrated curve. The deviation in the interval around 0.5 indicates that the model has considerable room for improvement. Adding attention mechanisms, or simply putting more weight on the training samples whose predicted values fall in the interval around 0.5, may bring the calibration curve closer to the perfectly calibrated one and improve the performance of the model.

Fig. 5. The calibration curve of a random split.
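A calibration curve of this kind can be computed by grouping predictions into probability bins and comparing the mean predicted probability with the observed fraction of positives in each bin. The sketch below assumes equal-width bins; the number of bins and the function name are assumptions, as the text does not specify them.

```python
def calibration_curve(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for truth, prob in zip(y_true, probs):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes to the last bin
        bins[index].append((prob, truth))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(t for _, t in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve

# Toy predictions, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
probs = [0.05, 0.15, 0.45, 0.55, 0.85, 0.95]
points = calibration_curve(y_true, probs, n_bins=2)
```

A perfectly calibrated model would place every point on the diagonal, where the mean predicted probability equals the observed positive fraction.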

4 Conclusion

We present a computational US image modeling algorithm to accurately identify breast tumors. The algorithm extracts reproducible and interpretable features to differentiate breast tumor malignancy. Using these clinically relevant features, the proposed classification model achieves promising results on clinical US images from the public BUSI dataset. We anticipate that the proposed tumor identification, feature extraction, and feature selection scheme can be adapted to a broader range of cancers.