1 Introduction

In recent years, doctors have classified breast cancer into various subtypes. In 2020, 2.3 million women were diagnosed with breast cancer and 685,000 died of it globally. At the end of 2020, 7.8 million women were alive who had been diagnosed with breast cancer in the previous 5 years. According to GLOBOCAN, breast cancer is the predominant cancer in women, making up 25.1% of all cancers. There has been a significant decline in the number of cases of breast cancer because of improvements in the medical sector, helped by advanced technology and software. Over time, machine learning has considerably improved the accuracy of breast cancer detection in the early stages. Breast cancer becomes more difficult and costly to treat as it reaches later stages; hence, proper equipment and technology for detection and treatment are called for. This paper presents a machine learning approach that detects breast cancer in women efficiently and with high accuracy.

2 Related Works

This section reviews the literature on breast cancer detection using machine learning. The authors reviewed multiple sources on the analysis of breast cancer using machine learning, as well as datasets from reliable sources for testing the reviewed methods. The most popular methods [1] of breast cancer detection, namely the Naive Bayes classifier, K-nearest neighbors (KNN) [2], logistic regression, the decision tree classifier, the support vector machine (SVM) classifier [3] and the random forest classifier [4, 5], are given in Table 1.

Table 1 Literature work

3 Proposed Methodology

The purpose of this work is to analyze the dataset using the random forest classifier, logistic regression and the decision tree classifier, and to compare the algorithms in terms of their accuracy in detecting cancer in the subject.

3.1 Experimental Setup

The program was written in Python using the Jupyter Notebook application. We used the Wisconsin Breast Cancer dataset [14] from the UCI Machine Learning Repository for our analysis.

We imported the NumPy library to work with array data, the Pandas library to analyze, clean and manipulate the dataset, and the Matplotlib and Seaborn libraries to visualize its statistics. We used scikit-learn (sklearn), a machine learning library for Python, to import and apply the logistic regression, random forest classifier and decision tree classifier algorithms to the data.

Logistic Regression: It uses a logistic function to model the relationship between a dependent variable and one or more independent variables. It is used for binary outcomes and is efficient to train. Logistic regression performs well when the given dataset is linearly separable.
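To make the role of the logistic function concrete, the following minimal sketch shows how a fitted logistic regression model could turn one feature vector into a class probability and a binary decision. The weights, bias and sample values are hypothetical placeholders chosen for illustration; they are not values learned from the dataset.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for large negative z
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical weights and one hypothetical, already scaled sample.
weights = [1.2, -0.7]
bias = -0.3
sample = [0.9, 0.4]
print(predict_proba(sample, weights, bias), predict(sample, weights, bias))
```

The decision boundary is the set of points where the weighted sum equals zero, which is why the method works best when the two classes are close to linearly separable.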

Decision Tree Classifier: A decision tree consists of decision nodes and leaf nodes. Construction begins at the root node, where the best attribute is found using an attribute selection measure and the data are divided into multiple subsets. From these, a decision tree node is generated. The process repeats recursively until the nodes cannot be split any further. The resulting tree can make predictions for all types of outcomes.
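The attribute selection step can be illustrated with a short sketch. It assumes Gini impurity as the selection measure (the text does not state which measure was used) and searches for the best threshold on a single numeric feature. The feature values and labels are hypothetical toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    n = len(labels)
    # Splitting at the maximum value would leave the right side empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy feature with benign (B) and malignant (M) labels.
radius = [11.0, 12.5, 13.1, 16.8, 18.2, 20.4]
diagnosis = ["B", "B", "B", "M", "M", "M"]
print(best_threshold(radius, diagnosis))
```

A full tree builder would apply this search to every feature, split on the best one, and recurse on the two subsets until a node is pure or cannot be split further.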

Random Forest Classifier: It is an ensemble algorithm that combines the predictions of a group of decision trees. Individual trees are generated using an attribute selection indicator. A greater number of trees leads to a better average and thus increases the accuracy of predictions on the given dataset. It mitigates the overfitting problem that arises with individual decision trees.
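Two ingredients of the ensemble idea can be sketched briefly: drawing a bootstrap sample of the training rows for each tree, and combining the individual tree predictions. The sketch assumes the classic majority-vote formulation; library implementations may instead average predicted class probabilities. The row indices and votes are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = list(range(10))  # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical votes from five trees for a single patient record.
tree_votes = ["M", "B", "M", "M", "B"]
print(len(sample), majority_vote(tree_votes))
```

Because each tree sees a different resampled view of the data, the errors of individual trees tend to differ, and combining them reduces the variance of the final prediction.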

3.2 Feature Analysis

First, the features are analyzed for their frequency count [15], and then their pairwise correlation is computed. The code is given below.
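As an illustration only, and not the authors' original listing, the two steps can be sketched in plain Python as follows. The column names and values are hypothetical toy data. With Pandas and Seaborn, the same analysis would typically rely on `value_counts`, `DataFrame.corr`, `countplot` and `heatmap`.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping feature name to its values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy records; the real dataset has many more rows and features.
diagnosis = ["B", "M", "B", "B", "M"]
features = {
    "radius_mean": [11.0, 18.2, 12.5, 13.1, 20.4],
    "area_mean": [370.0, 1040.0, 480.0, 530.0, 1300.0],
}

counts = Counter(diagnosis)  # frequency count per class
corr = correlation_matrix(features)  # values a heat map would display
print(counts)
print(corr["radius_mean"]["area_mean"])
```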

The count of malignant (M) and benign (B) diagnoses is plotted in Fig. 1. The pairwise correlation is shown as a heat map [15,16,17] in Fig. 2.

Fig. 1 Frequency plot for the two classes, malignant (M) and benign (B). The bar graph shows the count of diagnoses per class; benign has the higher count.

Fig. 2 Heat map plot for pairwise correlation among features

The classification models are then constructed. After training a model with each algorithm, we computed its accuracy and confusion matrix using the scikit-learn library to assess its performance.

We calculated the accuracy of each model using a confusion matrix:

TrPo: True Positive
TrNe: True Negative
FaNe: False Negative
FaPo: False Positive

$$\text{Accuracy} = \frac{\text{TrPo} + \text{TrNe}}{\text{TrPo} + \text{TrNe} + \text{FaNe} + \text{FaPo}}$$

4 Results and Discussions

We successfully detected breast cancer in patients using machine learning algorithms. To assess the performance of each algorithm, we used a confusion matrix. We used 80% of the dataset to train each model and the remaining 20% to test its accuracy.
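The 80/20 split can be sketched without any library as a shuffle of row indices followed by a cut. The sketch assumes 569 records, the size of the diagnostic variant of the Wisconsin dataset; the function name, the seed and the choice to round the test size up are illustrative assumptions.

```python
import math
import random


def train_test_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round up
    return indices[n_test:], indices[:n_test]


# Assumption: 569 records in the dataset.
train_idx, test_idx = train_test_indices(569)
print(len(train_idx), len(test_idx))
```

Under these assumptions the split yields 455 training records and 114 test records, which is consistent with the 114 cases that each confusion matrix in Figs. 3, 4 and 5 sums to.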

Logistic regression achieved an accuracy of 96.49%, while the decision tree achieved only 93.85%. The best results were obtained with the random forest classifier, with an accuracy of 97.36%. Hence, we are able to tell with high accuracy whether the cancer in the subject is benign or malignant (Figs. 3, 4 and 5).

Fig. 3 Confusion matrix for logistic regression: TrPo = 66, FaPo = 1, FaNe = 3, TrNe = 44.

Fig. 4 Confusion matrix for decision tree: TrPo = 64, FaPo = 3, FaNe = 4, TrNe = 43.

Fig. 5 Confusion matrix for random forest classifier: TrPo = 67, FaPo = 0, FaNe = 3, TrNe = 44.
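As a consistency check, the accuracy formula can be applied directly to the counts reported in Figs. 3, 4 and 5. The function and variable names below are illustrative.

```python
def accuracy(tr_po, tr_ne, fa_po, fa_ne):
    """Accuracy = correct predictions divided by all predictions."""
    return (tr_po + tr_ne) / (tr_po + tr_ne + fa_po + fa_ne)


# Counts as reported in Figs. 3, 4 and 5.
reported = {
    "logistic regression": dict(tr_po=66, tr_ne=44, fa_po=1, fa_ne=3),
    "decision tree": dict(tr_po=64, tr_ne=43, fa_po=3, fa_ne=4),
    "random forest": dict(tr_po=67, tr_ne=44, fa_po=0, fa_ne=3),
}

for model, counts in reported.items():
    print(f"{model}: {100 * accuracy(**counts):.2f}%")
```

The computed values are 110/114, 107/114 and 111/114, that is, about 96.49%, 93.86% and 97.37%. They agree with the accuracies stated above; the decision tree and random forest figures (93.85% and 97.36%) appear to be truncated rather than rounded.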

5 Conclusions

Breast cancer is one of the most predominant cancers found in women. Early detection and diagnosis of this disease can save lives by allowing the patient to undergo suitable treatment. Machine learning has vast applications in the modern healthcare system. One such application, the detection and diagnosis of diseases, has been discussed in this paper. Integrating machine learning with medical databases and devices will make the healthcare system more efficient and help organize data better. Healthcare companies such as Tempus are using machine learning on their clinical data to provide personalized treatments for patients. Hence, increased adoption of machine learning technology is expected in medical fields in the future.