Introduction

Medical scans are valuable tools that help specialists identify abnormalities in body organs, supporting the detection, diagnosis, and treatment of many diseases. The main medical imaging modalities are ultrasound (Us), magnetic resonance imaging (MRI), computed tomography (CT), and X-ray [1, 2], and imaging of malignant and benign tumors with these modalities has become a major element of healthcare. Us imaging is widely used in clinical practice as a diagnostic tool, and in some situations it is the standard procedure because it is usually painless, widely available, relatively inexpensive, and uses non-ionizing radiation. X-ray is the most common diagnostic imaging test and is also widely available. It uses ionizing radiation to form images of the body and bones, which in some situations is harmful, so precautions must be taken. A CT scan combines X-ray projections taken from different angles to produce cross-sectional images of the scanned object, with clearly visible subject contrast. An MRI scan uses a powerful magnetic field, magnetic field gradients, and radio waves to form images of different organs. It can reveal finer image details than X-ray and CT scans, and it uses non-damaging radiation.

The medical image diagnosis process is performed in two steps. First, the most significant features are identified and extracted. Then, these features are used to build the diagnostic model. In conventional practice, doctors use their experience to extract the most significant features and then determine the type of disease, which makes diagnosis time-consuming and subject to a small percentage of human error. At present, convolutional neural networks (CNNs) have shown remarkable superiority in diagnostic tasks, sometimes making diagnoses that doctors are unable to make [3,4,5,6,7,8,9,10]. The authors of [11] proposed a CNN model based on k-means to extract significant features and then applied a multi-class SVM model to diagnose a mammography dataset. In [12], the authors suggested an automatic computer-aided diagnosis model for Us breast images, in which a segmentation model was used to localize the disease region before the classification models were applied. Their approach achieved 85.42% classification accuracy with a CNN and between 77% and 80% with machine learning models. Yi Wang et al. [13] proposed a multi-view CNN diagnostic model on a Us breast image dataset comprising 135 malignant and 181 benign breast lesions. In [14], a multi-organ CNN CAD model is proposed to classify breast and thyroid lesions in Us images.

Rajeshwari S. Patil et al. [15] proposed a hybrid CNN and recurrent neural network to detect lesions in mammogram images; its basic phases are pre-processing followed by segmentation, feature extraction, and detection. Hua Li et al. [16] proposed a classification of benign and malignant mammogram images based on an improved DenseNet model for effective and accurate classification. The model has three stages: preprocessing and normalization; replacing the first convolutional layer of the model with an Inception structure; and, finally, applying the datasets to pre-trained models and the DenseNet model. Umar Albalawi et al. [17] proposed a mammogram classification model based on a CNN. They used a Wiener filter to remove noise and the k-means clustering technique to segment the images, followed by the CNN classifier. Shen et al. [18] proposed a DL model to classify mammogram lesions; compared with previous models, it achieved an AUC of 0.91 on the CBIS-DDSM database and 0.95 on the FFDM database. Yuezhong Zhang et al. [19] proposed a CNN classifier for CT images based on the CDBN model. They used an SVM as the feature classifier and enhanced feature transfer and reuse to enrich the features; using the Adam optimizer, they obtained both good accuracy and speed. Huseyin Polat and Homay Danaei Mehr [20] proposed a hybrid CNN lung classifier. They used SoftMax and a radial basis function-based SVM to study the model's performance and compared it with AlexNet and GoogleNet, achieving 91.81%, 88.53%, and 91.91% for accuracy, sensitivity, and precision, respectively. Li et al. [21] proposed an augmentation-based CNN classifier for hyperspectral images. Their augmentation technique increases the number of training samples; the method benefits a deep CNN, extracts PBP features, and uses a decision fusion classifier.

Agrawal et al. [22] proposed a CNN model to classify gastrointestinal system features using only a few training samples together with transfer learning models. They also developed a metric to study model performance, which achieved a correlation of 87% in the validation stage. Keita Saito et al. [23] proposed a CNN classifier for heart diseases, trained from scratch on heart disease images. Samir S. Yadav et al. [24] proposed a CNN classifier to diagnose disease from chest X-ray images and showed that using augmentation techniques together with transfer learning is very effective and improves performance. Feng-Ping An et al. [25] proposed a CNN classifier for breast masses and brain tumor tissue; their method constructs different CNN models suited to the features of medical images using an adaptive sliding window fusion mechanism. The biggest problem in classifying medical images with neural networks is the size of the available databases. In addition, classical methods require pre-processing, whereas in CNNs explicit pre-processing and feature extraction do not have to be performed. Table 1 shows an overview of recent work using deep learning techniques for medical image classification.

Table 1 Overview of the recent work using deep learning techniques for medical image classification

Transfer learning has been used in many computer vision tasks. However, natural images differ from medical images, which makes it difficult to build an effective CNN model for medical image diagnosis that outperforms other intelligent systems. In this work, a CNN architecture for diagnosing benign and malignant tumors is proposed. The network is simple and requires few resources, so it can be implemented on mobile platforms. For an effective evaluation, four different datasets were used: Us, X-ray, CT, and MRI. Detailed comparisons were also made with the latest transfer learning models, including different accuracy measures and the confusion matrix.

This paper aims to classify tumors from four different database modalities as benign or malignant using a CNN architecture. Two approaches are tested experimentally: transfer learning with three CNN models (VGG16, VGG19, and AlexNet) and a proposed network trained from scratch. The more complex models are compared with the simple one in terms of efficiency and training time. In addition, a general model that works on various data is practical, easy, and reliable to implement. The main contributions of the present study are as follows:

  • Development of three transfer learning models (AlexNet, VGG-16, and VGG-19) for classifying multitype medical images.

  • Employing several pre-trained CNN models with fine-tuning and applying them to four different datasets, namely MRI, X-ray, Us, and CT, with and without data augmentation.

  • Development of a proposed CNN architecture built from scratch, characterized by low complexity and low training time.

  • Application of the proposed CNN architecture, which uses 3 × 3 kernels with a stride of 1 in all convolutional layers, unlike other more complex models. The proposed model also achieves higher diagnostic accuracy than state-of-the-art models.

The rest of this paper is organized as follows. Sect. "Convolutional neural networks (CNN)" gives short notes about CNN. Sect. "Material and methods" describes the material and methods used in our work. Sect. "Deep features extraction and classification via transfer learning" gives short notes on the transfer learning methodology. Sect. "Proposed CNN Model" illustrates the proposed model architecture. Sect. "Experimental results and discussion" shows the experimental results and discussion. Sect. "Conclusions and future work" provides the conclusions followed by references.

Convolutional neural networks (CNN)

In the past few years, the medical field has become a research priority, and a great deal of research has been developed in it. Most recent research focuses on the use of artificial intelligence in many medical branches because of its superiority over traditional techniques. A CNN architecture consists of an input layer, convolutional layers, classification layers, and an output layer [4, 5, 26,27,28,29]. The input layer matches the dimensions of the input images. The convolutional layer is the main layer, performing the two operations of feature extraction and feature selection. It relies on trainable filters, each consisting of a number of weights that adapt to the input images during training. The convolutional layer may also apply padding, which adds rows and columns of zeros to the borders of the input so that the image dimensions do not change. In addition, the number of convolutional layers reflects the complexity of the network. Each convolutional layer ends with a sub-layer called an activation layer, which introduces the nonlinearity that lets the network learn suitable filter weights; the behavior depends on the chosen activation type. The most used is the ReLU layer, which maps values to the range between zero and infinity: it applies a threshold to each input, replacing values smaller than zero with zero.

$$f(x)=\left\{\begin{array}{c}x, x\ge 0\\ 0, x<0\end{array}\right.$$
(1)
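The following minimal PyTorch snippet (sizes are illustrative, not from the paper's networks) demonstrates both points: a 3 × 3 convolution whose zero padding preserves the image dimensions, and the ReLU thresholding of Eq. (1):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)  # one single-channel 64x64 image (illustrative size)

conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)  # zero padding keeps 64x64
relu = nn.ReLU()

y = relu(conv(x))
print(y.shape)        # torch.Size([1, 8, 64, 64]) -- spatial dims unchanged by padding
print((y < 0).any())  # tensor(False) -- ReLU replaced all negative values with zero
```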

As mentioned earlier, convolutional layers select the best features, yet the feature maps shrink relatively slowly. Therefore, pooling layers such as max-pooling and average (mean) pooling are usually used. Pooling layers reduce the number of features without trainable parameters, so they add almost nothing to memory. Classification layers are neural networks (NNs) called fully connected (FC) layers. FC layers combine all the features learned in the previous layers to identify patterns and then classify the images. The output layer is based on the SoftMax activation function and, in addition, calculates the cross-entropy loss. The output for class \(r\) is

$${y}_{r}(x)=\frac{exp({a}_{r}(x))}{\sum_{j=1}^{k}exp({a}_{j}(x))}$$
(2)

where \(0\le {y}_{r}\le 1\) and \(\sum_{j=1}^{k}{y}_{j}=1\).

In probabilistic terms, the SoftMax function is:

$$P\left(\left.{c}_{r}\right|x,\theta \right)=\frac{P\left(x,\left.\theta \right|{c}_{r}\right)P\left({c}_{r}\right)}{\sum_{j=1}^{k}P\left(x,\left.\theta \right|{c}_{j}\right)P\left({c}_{j}\right)}=\frac{exp({a}_{r}\left(x,\theta \right))}{\sum_{j=1}^{k}\mathrm{exp}({a}_{j}\left(x,\theta \right))}$$
(3)

where \(0\le P\left(\left.{c}_{r}\right|x,\theta \right)\le 1\) and \(\sum_{j=1}^{k}P\left(\left.{c}_{j}\right|x,\theta \right)=1\).

Moreover, \({a}_{r}=\mathrm{ln}(P\left(x,\left.\theta \right|{c}_{r}\right)P\left({c}_{r}\right))\), where \(P\left(x,\left.\theta \right|{c}_{r}\right)\) is the conditional probability of the sample given class \(r\), and \(P\left({c}_{r}\right)\) is the class prior probability. The SoftMax output values are assigned to one of the two classes using the cross-entropy function [19]:

$$\mathrm{loss}=-\sum_{i=1}^{N}\sum_{j=1}^{K}{t}_{ij}\mathrm{ln}{y}_{ij}$$
(4)

where \(N\) is the number of samples, \({t}_{ij}\) is the indicator that the \({i}^{th}\) sample belongs to the \({j}^{th}\) class, and \({y}_{ij}\) is the output for sample \(i\) for class \(j\), which, in this case, is the value from the SoftMax function. That is, it is the probability that the network associates the \({i}^{th}\) input with class \(j\).
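To make Eqs. (2)–(4) concrete, the following minimal NumPy sketch computes the SoftMax probabilities and the cross-entropy loss for a small two-class batch; the logits and labels are illustrative values, not taken from the experiments:

```python
import numpy as np

def softmax(a):
    # Eqs. (2)/(3): normalized exponentials; subtract the row max for numerical stability
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, t):
    # Eq. (4): loss = -sum_i sum_j t_ij * ln(y_ij)
    return -np.sum(t * np.log(y))

# N = 3 samples, K = 2 classes (benign vs. malignant); values are illustrative
logits = np.array([[2.0, 0.5],
                   [0.1, 1.7],
                   [1.2, 1.1]])
targets = np.array([[1, 0],   # benign
                    [0, 1],   # malignant
                    [1, 0]])  # benign

y = softmax(logits)
print(y)                        # each row sums to 1, each entry lies in [0, 1]
print(cross_entropy(y, targets))
```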

Material and methods

In this paper, a simple CNN structure is proposed to classify tumors in multitype medical images. This section describes all the datasets used in this research; Table 2 summarizes the details of the databases.

Dataset

Simulations are conducted on four different examples of medical images (Us breast images [30], X-ray (mammogram) images [31], CT chest images [32], and MRI brain images [33]); each dataset contains different numbers of benign and malignant images. Each dataset is divided into a 70% training set, used to train the proposed model, and a 30% test set, used to verify the training results. Samples from all datasets are shown in Fig. 1.
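As an illustrative sketch of this split (the directory layout, file extension, and helper name are assumptions, not the authors' code), the images of each class can be partitioned as follows:

```python
import random
from pathlib import Path

def split_dataset(root, train_frac=0.7, seed=42):
    """Split image paths under root/benign and root/malignant into 70% train / 30% test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in ("benign", "malignant"):
        paths = sorted(Path(root, label).glob("*.png"))  # assumed file layout
        rng.shuffle(paths)
        cut = int(len(paths) * train_frac)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test

# Example usage for the Us dataset (path is hypothetical):
# train_set, test_set = split_dataset("data/us_breast")
```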

Fig. 1 Samples of Us, X-ray, CT, and MRI tumor images: a Benign tumor images; b Malignant tumor images

Image processing and data augmentation

One of the most important problems facing any training process is the lack of training data, which is the key to achieving the best classification accuracy. Therefore, a data augmentation technique is used, whereby the images presented to the model differ in each epoch. The five augmentation techniques employed here are resizing and rotation, followed by adding speckle and Gaussian noise, blurring, sharpening, and filtering [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21, 34]. Table 2 shows the datasets' image numbers and their specifications before and after data augmentation.

Table 2 Medical image datasets and their specifications before and after data augmentation
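The following scikit-image sketch illustrates one randomized augmentation pass of the kind described above; the parameter values (target size, rotation range, noise modes, and kernel settings) are assumptions, since the paper does not specify them:

```python
import random
from skimage import transform, util, filters

def augment(image):
    """One randomized pass of the augmentations described above (parameters assumed)."""
    img = transform.resize(image, (227, 227))                    # resize (input size assumed)
    img = transform.rotate(img, angle=random.uniform(-20, 20))   # random rotation
    if random.random() < 0.5:
        img = util.random_noise(img, mode="speckle")             # speckle noise
    else:
        img = util.random_noise(img, mode="gaussian")            # Gaussian noise
    if random.random() < 0.5:
        img = filters.gaussian(img, sigma=1.0)                   # blur
    else:
        img = filters.unsharp_mask(img, radius=1.0, amount=1.0)  # sharpen
    return img
```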

Deep features extraction and classification via transfer learning

The main purpose of using pre-trained networks is to transfer the values of previously learned weights, which is called transfer learning. Most pre-trained networks are trained on the ImageNet dataset, which contains various types of images. The trained weights are transferred and applied to smaller datasets to take advantage of the earlier training, and only the last (FC) layers of the pre-trained model are changed to adapt it to the tumor classification task [35,36,37,38,39,40]. The transfer learning procedure is shown in Fig. 2.

Fig. 2 The procedure of transfer learning (TL)
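As an illustrative sketch of this procedure (not the authors' exact code), the following PyTorch snippet loads an ImageNet-pre-trained VGG16 and replaces only its final FC layer with a two-class benign/malignant head; the same pattern applies to VGG19 and AlexNet:

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes=2):
    # Load VGG16 with ImageNet weights (the transfer learning step)
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # Optionally freeze the convolutional features so only the new head is trained
    for p in model.features.parameters():
        p.requires_grad = False
    # Replace only the last FC layer to output benign/malignant scores
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, num_classes)
    return model

model = build_transfer_model()
```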

The transfer learning models used in this research are VGG16, VGG19, and AlexNet. These networks are applied to the proposed datasets with and without data augmentation, and the results are compared. They are described as follows:

VGG-16 architecture

VGG16 [35, 36, 39, 40] is a classic CNN consisting of 13 convolutional layers and three FC layers, interspersed with five max-pooling layers and ending with a SoftMax output layer. The VGG16 network contains about 138 million parameters, a very large number, but one that guarantees stable and relatively high classification accuracy. VGG16 achieved top results in the ImageNet (ILSVRC) 2014 competition. Figure 3 contains all the details of the VGG16 network.

Fig. 3 VGG-16 architecture

VGG-19 architecture

VGG19 [35, 37,38,39,40,41,42,43,44,45] is similar to the VGG16 network in layer arrangement but has 16 convolutional layers instead of 13, and it contains about 143 million parameters. Figure 4 contains all details of the VGG19 network.

Fig. 4 VGG-19 architecture

AlexNet architecture

The AlexNet architecture [35, 37,38,39,40] consists of five convolutional layers and three FC layers and contains about 60 million parameters. Figure 5 contains all AlexNet details.

Fig. 5 AlexNet architecture

AlexNet is significant because it won the 2012 ImageNet (ILSVRC) competition. It was also among the first networks to use ReLU instead of the sigmoid or hyperbolic tangent function, and it addressed the overfitting problem by applying dropout between the FC layers.

Proposed CNN model

In the proposed model, we intend to use fewer parameters than the pre-trained models described above. It is known that increasing the depth of a neural network generally increases its accuracy and performance; however, memory and GPU consumption also increase, and sometimes performance does not improve at all.

As shown in Fig. 6, the proposed model consists of three convolutional layers and one FC layer, interspersed with three batch normalization layers and two max-pooling layers. The details of each layer are given in Fig. 6, and an illustrative sketch of the architecture follows it. The major advantage of the proposed model is the use of batch normalization layers to speed up the training process while keeping the number of parameters low.

Fig. 6 The proposed CNN model
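The following PyTorch sketch illustrates the proposed architecture under stated assumptions: the 3 × 3, stride-1 convolutions, three batch normalization layers, two max-pooling layers, and single FC layer match the description above, while the channel counts, input channels, and the global average pooling before the FC layer are illustrative choices, since the exact values appear in Fig. 6.

```python
import torch.nn as nn

class ProposedCNN(nn.Module):
    """Three conv layers (3x3 kernels, stride 1) with batch norm, two max-pools, one FC layer."""
    def __init__(self, num_classes=2, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # assumed pooling before the FC layer
            nn.Linear(64, num_classes),             # the single FC layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```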

The training options and hyperparameters are specified as follows. The stochastic gradient descent with momentum (SGDM) optimizer, with a momentum of 0.9, is employed while training the VGG-16, VGG-19, AlexNet, and proposed models. A relatively small learning rate of \(10^{-5}\) is used, with 10 training epochs. An iteration is one gradient-descent step toward minimizing the loss function using a mini-batch, so the number of iterations depends on the number of dataset images, i.e., 740, 740, 440, and 700 for the Us, X-ray, CT, and MRI datasets, respectively. The training data are shuffled before each training epoch, and the mini-batch size for each training iteration is 10. The momentum value specifies the contribution of the gradient step from the previous iteration to the current iteration as a scalar from 0 to 1, and L2 regularization (weight decay) is also applied.
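A rough PyTorch equivalent of these training options is sketched below; it is not the authors' code, and the weight-decay value is an assumption, since the paper does not state it:

```python
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=10, lr=1e-5, batch_size=10, weight_decay=1e-4):
    # SGDM: stochastic gradient descent with momentum 0.9; weight_decay gives L2 regularization
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()  # applies SoftMax and cross-entropy (Eqs. 2-4)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # reshuffled each epoch
    for epoch in range(epochs):
        for images, labels in loader:  # one iteration = one gradient step per mini-batch
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```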

Experimental results and discussion

In this section, the experimental results of the proposed CNN model and the pre-trained models are presented for the classification of the different medical image datasets, and the effects and benefits of using augmentation techniques are discussed.

Without data augmentation

The validation accuracies, validation losses, and training times for each dataset without data augmentation, using the VGG-16, VGG-19, and AlexNet models and the proposed model, are shown in Table 3. The accuracies of the VGG16 model are 52.6%, 50%, 100%, and 100% on the Us, X-ray, CT, and MRI datasets, respectively; those of the VGG19 model are 68.4%, 53.3%, 100%, and 100%; and those of the AlexNet model are 89.47%, 60.00%, 100%, and 100%. When applying the proposed model to the Us, X-ray, CT, and MRI datasets, the accuracies are 75%, 63.3%, 100%, and 100%, respectively, and the proposed model also achieves the lowest validation loss.

Table 3 The accuracy (Acc.) and loss of the pre-trained models and the proposed model for Us, X-ray, CT, MRI datasets without data augmentation

The results for the Us and X-ray datasets are not convincing enough due to the low image quality of Us and X-ray; in addition, the properties of the two classes are very similar, and the number of images in each dataset is small. On the other hand, all four models perform very well on the CT and MRI datasets due to the high variance between the two classes and the good image quality. The CPU time reflects the complexity of the models; as shown in Table 3, the proposed model has the lowest complexity owing to its simple architecture.

With data augmentation

In this section, the effect of using data augmentation on the transfer learning models and the proposed model is discussed; with augmentation, the evaluation results are more compelling for use in real-world applications. When applying the VGG16 model, the accuracy increases from 52.6% to 58.3% for the Us dataset and from 50% to 54.7% for the X-ray dataset. Similarly, for the VGG19 model, the accuracies for Us and X-ray increase from 68.4% to 75% and from 53.3% to 63.8%, respectively.

For AlexNet, the accuracy increases from 83.47% to 89.9% for the Us dataset and from 60% to 70.6% for the X-ray dataset. The overall accuracies obtained by the proposed model are 92.7% and 91.1% for the Us and X-ray datasets, much greater than those achieved without data augmentation. Thus, the experiments illustrate that data augmentation techniques have an apparent effect on classification accuracy. Figures 7 and 8 show the training and validation curves and the confusion matrices of the proposed CNN model with data augmentation.

Fig. 7 Training and validation curves for accuracy and loss using the proposed model for different datasets after data augmentation

Fig. 8 The confusion matrices of the proposed architecture for different datasets

One difference between medical images and natural images is the gradation of colors. From this point of view, pre-trained models may not be the most suitable choice for classifying medical images (Table 4).

Table 4 The accuracy (Acc.) and loss of the pre-trained models and the proposed model for Us, X-ray, CT, MRI datasets with data augmentation

Initially, the proposed model achieved relatively low training accuracy for the Us and X-ray datasets due to the low image quality and the small number of images. However, after using the data augmentation technique, the accuracy began to gradually increase.

From the simulation results, this paper presents a comprehensive comparison between the transfer learning models (VGG16, VGG19, and AlexNet) and the simple proposed model with few parameters. The results show that the proposed model outperforms the pre-trained models in classifying various medical images, but only after using the data augmentation technique. The proposed model can also help the radiologist make an accurate decision when classifying different medical images.

Conclusions and future work

In the scientific community, it has become necessary to use general models that work on different datasets. Therefore, in this paper, a relatively simple neural network model with few parameters is proposed and used to classify various datasets, namely Us, X-ray, CT, and MRI. The proposed model achieved classification accuracies of 92.7%, 91.1%, 100%, and 100% for the Us, X-ray, CT, and MRI datasets, respectively, and was compared with transfer learning models such as VGG16, VGG19, and AlexNet. The proposed model can also run on the simplest available resources due to its small number of layers and parameters. Accessible medical diagnosis in developing countries is one of the most important factors for epidemic prevention; therefore, the proposed model can be used on mobile platforms because it is a small neural network with high efficiency on different datasets. It can also be used in real-time detection applications. In future work, the performance of the proposed model will be tested on recent datasets with further improvements in accuracy and complexity.