1 Introduction

Brain tumors are among the most serious health problems in the world and can affect anyone. Cancer is the second leading cause of death worldwide, accounting for roughly one in six deaths. Early classification of cancer can be life-saving, but it is not always possible. Brain tumors are among the deadliest cancer types due to their aggressiveness and low survival rate. Since 2000, June 8 has been observed as World Brain Tumor Day, whose purpose is to raise awareness and inform people about brain tumors. The brain is a very complex and sensitive organ; it contains about 100 billion nerve cells that control the human nervous system [41]. A tumor can alter brain behavior, so any abnormality in the brain is dangerous to human health. Brain tumors are the uncontrolled growth of abnormal cell populations in or around the brain, and they can generally be classified as malignant or benign. A benign tumor can be removed surgically because it does not spread to other parts of the brain. Malignant tumors, by contrast, grow larger than benign tumors and can spread to other parts of the body. Therefore, early detection of brain tumors is essential to improve patient survival.

According to the American Brain Tumor Society, about 700,000 patients in the United States suffer from brain tumor disease, with a reported survival rate of only 36%. In 2020, approximately 87,000 patients were diagnosed with brain tumors, and in 2021, 84,170 patients worldwide were diagnosed with brain tumors [62]. There are over 120 types of brain tumors, but the most common are glioma, pituitary tumor, and meningioma. Among all brain tumors, gliomas account for 45% of cases, pituitary tumors for 15%, and meningiomas for 15% [67]. Meningioma is the most common benign tumor; it develops in the membranes that surround the brain and central nervous system. Pituitary tumors primarily affect the pituitary gland. Glioma, on the other hand, originates within the tissue of the brain itself. The main difference is that gliomas are malignant, while meningiomas and pituitary tumors are usually benign. Based on the tumor type, doctors can diagnose patients and predict their survival rate; tumor grading is therefore an important part of treating patients with brain tumors.

Medical imaging techniques are used to detect tumors. Medical imaging is the most economical and accurate approach for diagnosing and detecting dangerous human diseases, such as brain tumor detection [54], skin cancer classification [30], stomach cancer [31], and lung cancer [32]. There are different ways to treat a brain tumor, depending on its size and type. Computed tomography (CT), magnetic resonance imaging (MRI), and other diagnostic imaging methods are used to look inside the human body. MRI is considered the first choice for brain tumors because it is a painless medical imaging method that provides excellent images of brain tumors. However, due to the large number of patients, reviewing these images manually is time-consuming and error-prone. MRI makes it easy to measure the size, shape, and location of the affected tissue. Depending on the tissue characteristics, different MRI protocols are used, such as T1WI, CE-T1WI, and T2WI.

For early detection and classification of brain tumors, computer-aided diagnosis (CAD) systems may be helpful and can serve as a tool to assist radiologists and doctors [3]. Automatic detection of brain tumors is necessary not only for accurate assessment and timely diagnosis, but also to save radiologist time. Several efforts have been made to develop powerful solutions for the automatic classification of brain tumors. Over the past few years, many machine learning (ML) and deep learning (DL) methods based on feature selection and learning techniques have been proposed to classify brain tumors, including machine learning methods [39, 74], fusion vectors [56], deep networks [46], and transfer learning (TL) [68]. Deep learning handles complex classification problems much better than traditional machine learning techniques [51]. With the recent development of deep networks, several studies have adopted Convolutional Neural Networks (CNNs) for the diagnosis of brain tumors [17, 71]. The essence of this work is to find the best deep learning framework for the classification of brain cancer. In this article, an enhanced deep learning model is proposed to examine brain MRI and provide early diagnosis. Most previous research has focused on binary classification, which is comparatively simple because the shape of the tumor can be easily interpreted; multiclass classification is difficult due to the high similarity between tumor types. We used publicly available three-class and four-class brain MRI datasets for the performance analysis of our proposed model. The main contributions of this study are as follows.

  1. We have proposed a novel and robust deep learning-based system for multiclass brain tumor classification on two benchmark datasets, exploiting five state-of-the-art architectures: Xception, DenseNet201, DenseNet121, ResNet152V2, and InceptionResNetV2.

  2. The performance of the proposed model, which uses a deep dense block on top of the Xception architecture, is compared with state-of-the-art methods. The proposed model applies various preprocessing techniques, data augmentation, and the deep dense block to improve classification performance. Several techniques are used to avoid overfitting, including dropout, batch normalization, global average pooling, early stopping, and L2 regularization.

  3. We also implemented 3-class and 4-class versions of the proposed model and compared the results with other studies in the literature.

2 Related work

In recent years, there have been many attempts to create an accurate and effective classification system for brain tumors. Many methods have been proposed to automatically classify brain MRI based on traditional machine learning and deep learning approaches such as convolutional neural networks (CNNs) and transfer learning. We therefore conducted a detailed survey of previously proposed brain tumor classification methods from sources such as Springer, IEEE Xplore, and Elsevier. In the literature, most methods focus on binary classification, which is comparatively simple because the shape of the tumor can be easily interpreted; due to the high degree of similarity between tumors, multi-class classification of brain tumors is difficult.

Several authors used traditional ML methods to obtain the final output through sequential stages. Different feature extraction schemes have been used, such as DWT [18, 43, 50], GLCM [37, 45], and genetic algorithms [7]. Several authors used support vector machines (SVMs) because they are the most popular technique for classification problems [10, 18]. Others used classification methods such as Random Forests [38], Extreme Learning Machines [63], and Sequential Minimal Optimization [16]. Ullah et al. [72] extracted the approximation coefficients, used color moments (CM) to reduce them, and finally used a feedforward artificial neural network to classify brain tumors. Zang et al. [73] used the ML paradigm to conduct brain tumor classification research, with binary classification as the main focus. In addition, it is difficult to distinguish between glioblastoma multiforme (GBM) and brain metastases (MET) using MRI, which is another challenge faced by researchers in this field. Yang et al. [70] used morphological features to study MET and GBM tumor classification. Rajan and Sundar [52] proposed a hybrid method consisting of seven sequential stages for automatic tumor detection and reported 98% accuracy. It is noticeable that traditional machine learning methods involve manual feature extraction, which is time-consuming and error-prone. Traditional ML methods rely on hand-crafted features that require reliable prior information, such as the location of the tumor, and the potential for human error is high. Therefore, it is necessary to develop a robust and effective method that does not depend on manual features. The DL approach has recently been widely used in medical imaging and brain tumor classification [35]. DL methods do not require hand-crafted features; however, preprocessing and a suitable architecture are sometimes necessary to achieve improved classification performance. CNNs are a type of deep neural network widely used for classification and detection, and various researchers have recently proposed CNNs to classify brain tumors using MRI [23, 61]. Several authors used the brain tumor dataset called Figshare [12], generated by Cheng, to develop efficient methods for brain tumor classification; we use the same dataset for the experiments in this work. Cheng et al. [11] used this dataset to address the three-class brain tumor problem. They used GLCM and the BoW model for feature extraction and an SVM classifier, improving the classification accuracy to 91.28%. In 2018, the Figshare dataset was used to classify brain tumors in [26, 61]. Anaraki et al. [4] proposed a CNN based on a genetic algorithm to classify brain tumor types and achieved 94% classification accuracy using conventional neural networks. Ahmad et al. [59] also used a CNN for brain tumor classification; their method combines DWT with a CNN model and reaches an overall accuracy of 99.3%. Deepak and Ameer [17] used the GoogleNet model with transfer learning to extract MRI features; they trained and tested on the Figshare dataset and used an SVM classifier, achieving an accuracy of 97%. Saxena et al. [60] applied transfer learning to three deep learning models, namely Inception V3, ResNet-50, and VGG-16, to classify brain tumor data.
The ResNet-50 model achieved the highest accuracy of 95%. Francisco et al. [19] proposed a multi-path CNN architecture for automatic segmentation of brain tumors; they tested the proposed model on a publicly available MRI dataset and achieved an accuracy of 97.3%. Sajjad et al. [57] used a deep convolutional neural network (CNN) with data augmentation to classify brain tumors; the overall accuracy of the proposed CNN model reached 94.5%. Maharjan et al. [42] published a multi-class brain tumor classification study designed to avoid overfitting; the proposed CNN claims a 2% improvement in accuracy using a modified softmax loss function. Citak et al. [14] used SVMs, multilayer perceptrons, and logistic regression in their brain tumor study, achieving 93% accuracy and 96.4% sensitivity. Khwaldeh et al. [33] proposed a CNN model to classify brain tumors by modifying the AlexNet architecture, with an accuracy of 91%. Badža and Barjaktarović [6] proposed a 4-layer CNN for extracting features from brain tumor images and performed classification, achieving 97.39% accuracy. Zar et al. [68] proposed a block-by-block fine-tuning strategy based on the TL paradigm using a CNN; this method attracted attention because it achieved an average accuracy of 94.82% without hand-crafted features. Sultan et al. [65] proposed a deep learning model that relies on CNNs to classify brain tumors, achieving 96.13% and 98.7% accuracy on two problems, respectively. In another study, eight CNN models [28] were designed and trained on a brain MRI dataset to classify brain tumors, achieving accuracies of 90% to 99%. In the research conducted by Ruba et al. [55], a semantic segmentation network was first used to segment brain images, and the GoogleNet transfer learning model was then used to classify them; they produced almost 99% classification performance for each category. Kang et al. [29] used 13 different pretrained deep convolutional neural networks and 9 different ML classifiers; they experimented with three different brain tumor datasets and achieved a highest classification accuracy of 98.50%. Naseer et al. [47] proposed a CNN model for the early diagnosis of brain tumors using MR images; they used different enhancement techniques and six different datasets to train and validate the model, achieving a classification accuracy of 98.8% for brain tumor detection. Avşar et al. [58] proposed a deep learning model based on Faster Region-based Convolutional Neural Networks (Faster R-CNN); the authors trained and tested the model using 3064 MR images of the brain, achieving 91.66% accuracy. Saba et al. [56] used the VGG-19 model for brain tumor detection by applying transfer learning techniques; the proposed method was evaluated on the BRATS 2015–17 datasets, achieving 98.78%, 99.63%, and 99.67% accuracy on BRATS 2015, BRATS 2016, and BRATS 2017, respectively. Aderghal et al. [1] proposed a CNN model using transfer learning techniques to classify brain scans focusing only on a small ROI; they used a shallow CNN architecture with fewer layers and two transfer learning techniques, namely cross-domain and cross-modal, achieving good results even on small datasets. Various classification schemes for brain tumors are thus presented in the literature.
Achieving high classification accuracy on brain tumor images remains a difficult task. The research above shows that brain tumor classification using DL is considerably more accurate than traditional ML techniques. In addition, many of the models proposed above were not validated further. Thus, we identified a clear gap in studies of the multiclass classification of brain tumors using the TL method, and we propose a new DL method using the TL technique to classify brain tumors. We study five distinct DL models, namely Xception, DenseNet201, DenseNet121, ResNet152V2, and InceptionResNetV2, on brain MRI and apply the TL method to two publicly available benchmark datasets. Finally, we investigate and compare these models using several performance measures that are important for brain tumor classification.

3 Materials and methods

This section describes the methods and materials used in this study. Figure 1 shows the proposed approach for classifying brain tumor disease based on the deep transfer learning technique. Section 3.1 details the brain tumor imaging datasets used to train the proposed method. Section 3.2 describes how the MR images are preprocessed and cropped. Section 3.3 introduces the data augmentation procedures used to mitigate the limited dataset size and improve classification performance. The deep transfer learning models for feature extraction and classification are introduced in Section 3.4. Finally, Section 3.5 presents the performance metrics used to analyze the effectiveness of the proposed method.

Fig. 1

Proposed approach for brain tumor classification

3.1 Datasets for this study

Most of the latest models use the Figshare benchmark brain tumor dataset [11] to assess performance. Therefore, we also considered this dataset to evaluate the effectiveness and robustness of the proposed method. Two publicly available MRI datasets were used to perform the experiments. The first dataset contains 3064 T1-weighted contrast-enhanced MRI images acquired at Nanfang Hospital and the General Hospital of Tianjin Medical University, China, between 2005 and 2010. It was released by Cheng [11] in 2017 to support the development of brain tumor classification models. The dataset comprises 3064 brain MRI slices from 233 anonymized cancer patients and contains three types of brain tumors: glioma (1426 images), meningioma (708 images), and pituitary tumor (930 images). The second dataset, called Brain Tumor Classification, was downloaded from the Kaggle open data repository [9]. It contains 3264 brain MRI slices divided into four classes: normal (500 images), glioma (926 images), meningioma (937 images), and pituitary (901 images). Figure 2 shows example brain MR images from the three-class and four-class datasets; in the images, the tumor is marked with a red outline.

Fig. 2

An example of a brain MRI image from a class-labeled brain tumor dataset

3.2 MRI data preprocessing and cropping

Before feeding the images into the proposed architecture, both datasets were preprocessed in several stages to ensure maximum accuracy. Almost all images in our brain MRI datasets contain unwanted space and noise, which can reduce performance. Our goal is to crop each image to remove unwanted areas, ensure that all images are of the same type, and focus only on the central part of the brain. Extreme point calculation and contour detection are used to perform this preprocessing. Figure 3 shows the cropping process step by step. First, we load the original MR image from the dataset. The MR image is then converted to a binary image by applying a threshold. Erosion and dilation operations are performed to remove any small noisy parts of the MR image, after which the largest contour is selected and the four extreme points of the image (extreme left, extreme right, extreme top, and extreme bottom) are computed. Finally, we crop the images using the contour and extreme points to ensure that the brain region is in focus in each image.

Fig. 3

The cropping process of MR images
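As a rough illustration, the cropping pipeline in Fig. 3 could be implemented with OpenCV as follows; this is a minimal sketch in which the threshold value, erosion/dilation iterations, and file name are illustrative assumptions rather than values taken from the paper.

```python
import cv2
import numpy as np

def crop_brain_region(image):
    """Crop an MR image to the brain region via thresholding and contours."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    # Threshold to a binary image, then erode/dilate to remove small noise.
    _, thresh = cv2.threshold(gray, 45, 255, cv2.THRESH_BINARY)
    thresh = cv2.erode(thresh, None, iterations=2)
    thresh = cv2.dilate(thresh, None, iterations=2)
    # Select the largest contour and compute its four extreme points.
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    c = max(contours, key=cv2.contourArea)
    left = tuple(c[c[:, :, 0].argmin()][0])
    right = tuple(c[c[:, :, 0].argmax()][0])
    top = tuple(c[c[:, :, 1].argmin()][0])
    bottom = tuple(c[c[:, :, 1].argmax()][0])
    # Crop the image to the bounding box of the extreme points.
    return image[top[1]:bottom[1], left[0]:right[0]]

image = cv2.imread("mri_slice.png")  # hypothetical file name
cropped = crop_brain_region(image)
```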

The MR images in the datasets have different sizes, and it is recommended to resize them to the same height and width for best results. Different models have different input requirements: the DenseNet201, DenseNet121, and ResNet152V2 architectures expect an input size of 224 × 224, while Xception and InceptionResNetV2 expect 299 × 299. A resize function is used to scale all brain tumor images to the shape 224 × 224 so that all architectures used in this study can accept a common size. Data partitioning also plays an important role in image classification. To start the training phase of a deep learning model, the image data is divided into three parts: training, validation, and testing. Following the Pareto principle [20, 66], 80% of the images are reserved for training and validation, and 20% for testing. This 80–20 split is one of the most common split ratios in deep learning and has been used in similar studies on medical images [2, 29]. Table 1 shows the details of the images in the datasets and the distribution of the data used to train and test the models.

Table 1 Dataset details and distribution
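As a rough sketch of the resizing and 80–20 partitioning described above, assuming the cropped `images` and their `labels` are already loaded as arrays; the stratified split and fixed seed are our own assumptions, not stated in the paper.

```python
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

IMG_SIZE = 224  # common input size adopted for all architectures here

def resize_all(images):
    """Resize every MR image to IMG_SIZE x IMG_SIZE."""
    return np.array([cv2.resize(img, (IMG_SIZE, IMG_SIZE)) for img in images])

X = resize_all(images)
# 80% reserved for training/validation and 20% for testing (Pareto split).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, labels, test_size=0.20, stratify=labels, random_state=42)
```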

3.3 Data augmentation

We use image augmentation to ensure that each model receives enough input images and to avoid overfitting caused by the limited number of images in the datasets. By augmenting existing data instead of collecting new data, the classification performance of a DL model can be significantly improved. Specifically, this study used three augmentation strategies to generate a new training set: 1) the images were rotated by an angle of 90 degrees, 2) all images were horizontally flipped, and 3) random contrast adjustment with a factor of 0.2 was applied during training.
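A hedged sketch of these three strategies using Keras preprocessing layers is shown below; note that `RandomRotation(0.25)` samples rotations up to ±90°, which only approximates the fixed 90-degree rotation described above.

```python
import tensorflow as tf

# Augmentation applied to the training set only.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.25),      # rotations up to 90 degrees
    tf.keras.layers.RandomFlip("horizontal"),  # horizontal flip
    tf.keras.layers.RandomContrast(0.2),       # random contrast, factor 0.2
])
```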

3.4 Proposed deep transfer learning models for feature extraction and classification

Designing a CNN from scratch is a challenging task. The process requires many iterations, considerable experience to ensure correct convergence, and careful setting of many hyperparameters (such as architecture depth). Leveraging established pre-trained models (Xception, DenseNet201, DenseNet121, ResNet152V2, InceptionResNetV2, etc.) for brain tumor classification is therefore an attractive alternative. In the field of medical imaging, labelled data is scarce, which is a major challenge for building a reliable and accurate detection system. Pre-trained models combined with the TL technique can quickly learn new tasks and address these challenges. In this research, data augmentation and deep transfer learning are used to overcome the problem of insufficient training data and reduce overfitting. In TL, a CNN trained for a specific task is reused for another related task; the TL approach is also much faster and simpler than training a network from scratch. Here, we examine five pre-trained models, Xception, DenseNet201, DenseNet121, ResNet152V2, and InceptionResNetV2, using MR images and apply the TL technique to the given datasets. Figure 4 shows the layers and their order in the deep dense block (DDB). First, we remove the fully connected layers from these architectures, leaving only the convolutional and pooling layers, which are responsible for extracting features. Table 2 lists architectural details of each model, such as the required input image size and the number of spatial feature maps extracted from the convolutional base. All parameters are initialized with weights learned on the ImageNet dataset. We introduce a deep dense block to improve the accuracy of brain tumor classification. In the deep dense block, we first add a global average pooling layer as a better alternative to a flattening layer. It transforms an (H × W × N) feature map into a (1 × N) feature vector, where (H × W) is the spatial size and N is the number of filters. Compared with adding a fully connected layer, it is more meaningful and interpretable because it enforces a correspondence between feature maps and categories. Global average pooling also mitigates overfitting, enables a direct mapping between output channels and feature categories, reduces the number of parameters, and requires no parameter optimization [40]. Then batch normalization layers, dropout layers, and three dense layers are added to the network, where the first and second dense layers consist of 512 and 256 neurons, respectively, with ReLU activation functions. The parameters of the dense layers are learned at each epoch and capture features of brain tumors that help improve classification accuracy. ReLU is a commonly used activation function in dense layers because it improves training and testing performance. Overfitting is a major problem in deep networks; it occurs when a model is over-trained on the training data and consequently performs poorly on test data. The dropout layers prevent the model from overfitting; each dropout layer drops 20% of the neurons. Such an operation also helps to speed up model training considerably [64]. L2 regularization [15] was used in the first dense layer with a value of 0.0001.
A batch normalization layer is used after each dense layer to normalize the extracted features using the batch mean and standard deviation, which plays an important role in our classification model. Batch normalization performs very well when applied immediately after dense layers [25]. It speeds up training, balances activation values, and improves generalization performance. In the deep dense block, the last dense layer contains 3 or 4 neurons for the three-class and four-class brain tumor problems, respectively. The softmax activation function is used to assign each image to its corresponding class. The softmax function maps the resulting values into the range 0 to 1 so that they can be interpreted as probabilities. It is defined by the following equation:

$$\mathrm{softmax}{(x)}_i=\frac{e^{x_i}}{\sum_{j=1}^{k}{e}^{x_j}}$$
(1)
Fig. 4

Layer types used in the deep dense block
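For concreteness, a minimal Keras sketch of the DDB on top of Xception is given below, following the layer ordering described above and in Fig. 4; the exact placement of batch normalization and dropout is our reading of the figure, and the 224 × 224 input follows the common size used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_model(num_classes=3, input_shape=(224, 224, 3)):
    # Convolutional base pre-trained on ImageNet, fully connected top removed.
    base = tf.keras.applications.Xception(
        include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False  # frozen for the initial training phase

    x = layers.GlobalAveragePooling2D()(base.output)  # replaces flattening
    x = layers.Dense(512, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)                        # drop 20% of neurons
    x = layers.Dense(256, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base.input, out), base

model, base = build_model(num_classes=3)  # 3 neurons for the 3-class dataset
```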

Table 2 Details of the model, including input size and number of features extracted

Figures 5, 6, 7, 8 and 9 highlight the basic architecture and customization of the deep transfer learning models that were finally deployed to obtain the classification results on brain MR images. Below we briefly describe the architecture of each model.

Fig. 5

Basic architecture and customization in Xception for multiclass classification of brain tumors

Fig. 6

Basic architecture and customization in DenseNet121 network architecture

Fig. 7

Basic architecture and customization in DenseNet201 network architecture

Fig. 8

Basic architecture and customization in ResNet152V2 network architecture

Fig. 9

Basic architecture and customization in InceptionResNetV2 network architecture

3.4.1 Xception

Xception, developed by François Chollet [13], takes the Inception hypothesis to the extreme and was proposed as an improved version of Inception V3. The architecture relies entirely on depthwise separable convolutional layers, under the strong assumption that spatial and cross-channel correlations can be decoupled. A depthwise separable convolution consists of a depthwise convolution, which operates independently on each input channel, followed by a pointwise convolution that maps the correlations between channels. The network consists of 14 modules, with linear residual connections around all modules except the first and last. When trained on the ImageNet dataset, the Xception framework reports a top-1 accuracy of 79.0% and a top-5 accuracy of 94.5%, slightly better than InceptionV3 on the same dataset. Due to its outstanding performance on different image classification tasks, we use the Xception model to classify brain tumors.
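To make the efficiency of this factorization concrete, the sketch below compares the parameter counts of a standard convolution and a depthwise separable convolution in Keras; the layer sizes are arbitrary illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(64, 64, 128))
standard = layers.Conv2D(256, 3, padding="same")(inp)
separable = layers.SeparableConv2D(256, 3, padding="same")(inp)  # depthwise + 1x1 pointwise

print(tf.keras.Model(inp, standard).count_params())   # 295,168 weights
print(tf.keras.Model(inp, separable).count_params())  # 34,176 weights
```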

3.4.2 DenseNet

DenseNet, short for Dense Convolutional Network, requires fewer parameters than a traditional CNN because it does not learn redundant feature maps. Huang et al. [24] introduced the DenseNet architecture, which connects each layer of the network to every other layer in a feed-forward manner. DenseNets share the advantages of ResNets and have several attractive properties, such as alleviating the vanishing gradient problem, achieving high performance, and significantly reducing the number of training parameters. A deep DenseNet is built from multiple dense blocks, where each layer is a sequence of convolution, batch normalization, and ReLU activation. DenseNet introduces a bottleneck layer to prevent the number of feature maps from growing exponentially, and applies a transition layer between dense blocks to reconcile the differing feature map sizes. DenseNet has four variants: DenseNet264, DenseNet201, DenseNet169, and DenseNet121. In our research, we experimented with two variants: the 121-layer and 201-layer architectures. The DenseNet121 CNN model requires fewer parameters and is computationally efficient; it can improve training time by propagating gradient values directly from the loss function. DenseNet169 has over 14 million parameters and a model size of 57 MB, while DenseNet121 has about 8 million parameters and a model size of 33 MB, which significantly reduces the computational cost and makes it an attractive choice.
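For reference, the dense connectivity pattern of [24] can be written as follows, where the ℓ-th layer receives the concatenated feature maps of all preceding layers and $H_{\ell}$ denotes the composite function of the operations listed above:

$$x_{\ell}=H_{\ell}\left(\left[x_0,x_1,\ldots,x_{\ell-1}\right]\right)$$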

3.4.3 ResNet152V2

ResNet152V2 is a deep residual network developed by He et al. [22] as an updated version of ResNet152. ResNet contains a large number of layers and delivers strong performance; we chose ResNet152V2 because it has the highest accuracy in the ResNet family [22]. Although its depth is greatly increased, the 152-layer ResNet is still less complex than architectures such as VGG16 and VGG19. Deeper models generally extract better features, but very deep models are difficult to train because gradients vanish during backpropagation. ResNet addresses this problem by adding residual connections that reduce the impact of vanishing gradients. The significant difference between ResNet-V1 and ResNet-V2 is that ResNet-V2 applies batch normalization and ReLU activation before each weight layer. When trained on the ImageNet dataset, this architecture reported a top-1 error rate of 21.1% and a top-5 error rate of 5.5%.
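The residual connection described above can be summarized by the following equation from [22], where $x$ is the input to a block, $\mathcal{F}(x,\{W_i\})$ is the residual mapping learned by the stacked layers, and the addition implements the identity shortcut:

$$y=\mathcal{F}\left(x,\{W_i\}\right)+x$$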

3.4.4 InceptionResNetV2

InceptionResNetV2 [69] is a modified version of the Inception model that incorporates residual learning to improve performance; the residual connections also shorten training time. The network is built by combining the Inception and ResNet architectures. Batch normalization is used only on top of the traditional layers. InceptionResNetV2 replaces the filter concatenation stage with residual connections to take advantage of both approaches (i.e., going deeper and wider) while maintaining the same computational efficiency. A 1 × 1 convolutional layer without activation follows each Inception block to match the dimensionality of the various feature maps. The model contains three different types of blocks, namely the Inception-ResNet block, the Reduction block, and the Stem block, which consist of convolutional layers, pooling layers, and activation functions. The Stem block accepts the input and computes three 3 × 3 convolutions on the input data. This is followed by three Inception blocks, where the first and third blocks consist of two paths, 3 × 3 convolutions and max pooling, while the second block includes two paths with 1 × 1 and 3 × 3 convolutions and another path with 3 × 3, 1 × 7, and 7 × 1 convolutions. InceptionResNetV2 has three types of Inception modules: Inception-ResNet-A uses 35 × 35 grid modules, Inception-ResNet-B uses 17 × 17 grid modules, and Inception-ResNet-C uses 8 × 8 grid modules. Finally, there are two reduction modules that use convolution and max pooling to reduce the number of features: the Reduction-A block has two paths for convolution and one for max pooling, while the Reduction-B block has three convolution paths and one max pooling path.

3.5 Performance evaluation metrics

The performance of the models for classifying brain tumors was assessed based on several indicators: accuracy, sensitivity, precision, specificity, and F1-score. Correspondingly, a confusion matrix is introduced to visualize diagnostic instances of MR images from the proposed model. In the equations below, the overall performance of the trained model using the proposed method was calculated using test data.

$$Accuracy\ (ACC)=\frac{TP+ TN}{TP+ TN+ FP+ FN}$$
(2)
$$Sensitivity\ (SEN)=\frac{TP}{TP+ FN}$$
(3)
$$Precision\ (PRE)=\frac{TP}{TP+ FP}$$
(4)
$$Specificity\ (SPE)=\frac{TN}{TN+ FP}$$
(5)
$$F1- Score=2\times \frac{Precision\times Recall}{Precision+ Recall}$$
(6)

In the equations above, TP means true positives, FP means false positives, TN means true negatives, and FN means false negatives.
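As a convenience, the metrics in Eqs. (2)–(6) can be computed from model predictions as in the sketch below; `y_true` and `y_pred` are assumed to be integer class labels from the test set, and specificity is derived per class from the confusion matrix since scikit-learn offers no direct specificity function.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

acc = accuracy_score(y_true, y_pred)                    # Eq. (2)
sen = recall_score(y_true, y_pred, average="macro")     # Eq. (3): sensitivity = recall
pre = precision_score(y_true, y_pred, average="macro")  # Eq. (4)
f1 = f1_score(y_true, y_pred, average="macro")          # Eq. (6)

# Eq. (5): per-class specificity TN / (TN + FP), macro-averaged.
cm = confusion_matrix(y_true, y_pred)                   # also used for Figs. 11 and 12
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
spe = np.mean(tn / (tn + fp))
```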

4 Results and experiments

This section presents the experimental setup and results of the five architectures used in the study. We analyze the effectiveness of the proposed deep transfer learning models together with competitive models. Two publicly available brain tumor datasets are used to evaluate the proposed method; these datasets are widely used and well suited to this task. The main goal of this work is to improve the accuracy of multiclass brain tumor classification.

4.1 Experimental settings

We trained the proposed deep transfer learning models using the Python programming language and the Keras framework. All experiments were performed in a Google Colaboratory notebook using the GPU runtime. This platform is provided by Google for research activities and is free to use. The models were trained on an NVIDIA Tesla K80 GPU with 12 GB of memory and 16 GB of RAM.

4.2 Hyperparameter and optimization techniques

The main goal of this task is to design the optimal model for multiclass brain tumor classification. This can be achieved by finding the best hyperparameter configuration so that the model has greater recognition capability. The parameters that affect model training are called hyperparameters; parameters such as the number of layers, learning rate, number of epochs, and activation functions play a vital role in model performance. In this study, we trained five deep transfer learning models and tuned the hyperparameters to find an optimal configuration. In the training process, we first trained only the DDB added on top of each pre-trained model. The convolutional base of the pre-trained models was completely frozen, so the weights of these layers did not change during training. Freezing the convolutional base, by setting its trainable parameters to false, is necessary to avoid destroying the pre-learned filters. We used the Adam optimizer with a learning rate of 1e-2 to train the new DDB for 50 epochs with our data augmentation method. This initial training runs quickly because the convolutional base remains frozen and only the DDB is trained. After training the DDB, we unfroze some layers of the convolutional base and jointly trained these unfrozen layers and the DDB. This time the models were trained once more on the same dataset for 50 epochs using the Adam optimizer with a lower learning rate of 1e-3. Fine-tuning the entire network is not recommended because the risk of overfitting is high given the large number of parameters and the small dataset. Table 3 shows the complete details of the hyperparameters used to train the models.

Table 3 Configuration details of the parameters used to train the models
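The two-phase procedure above could be sketched as follows, reusing `model` and `base` from the DDB sketch in Section 3.4; `train_ds`/`val_ds`, the early-stopping patience, and the number of unfrozen layers are our assumptions, while the learning rates and epoch counts follow the text.

```python
import tensorflow as tf

# Early stopping on validation accuracy (the patience value is assumed).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=10, restore_best_weights=True)

def compile_and_fit(model, lr):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds,
              epochs=50, callbacks=[early_stop])

# Phase 1: convolutional base frozen, train only the DDB at lr = 1e-2.
compile_and_fit(model, lr=1e-2)

# Phase 2: unfreeze the top of the base and fine-tune jointly at lr = 1e-3.
base.trainable = True
for layer in base.layers[:-30]:   # number of unfrozen layers is illustrative
    layer.trainable = False
compile_and_fit(model, lr=1e-3)
```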

Most of the selected hyperparameters were motivated by related work on the classification of brain tumors [5, 21, 29, 36, 44, 48, 49, 53]. To avoid overfitting during training, an L2 regularization function with a fixed value of 0.001 is used in the first dense layer; this penalizes large weights and prevents the network from overfitting. We used a variety of techniques to prevent the models from overfitting. As discussed earlier, TL is an effective method when there is a risk of overfitting, and the image data is also augmented to compensate for the limited data size. In addition, batch normalization and global average pooling are applied to prevent overfitting. We chose Adam as the optimizer because it combines the advantages of the AdaGrad and RMSProp algorithms and has been found to work well in practice: AdaGrad suits computer vision problems and sparse gradients, while RMSProp works well in non-stationary settings. The Adam optimizer is computationally efficient and specifically designed for training deep models [34]. Most similar studies have also used the Adam optimizer and achieved better results than with other optimizers; Adam is currently recommended as the default algorithm because it generally outperforms the alternatives. Therefore, we use the Adam optimizer with an initial learning rate of 1e-3 to train the proposed model. A categorical cross-entropy loss function is used because our work involves 3-class and 4-class classification of brain MRI datasets. Overfitting is further reduced by applying dropout during training; we dropped 20% of the neurons in the dropout layers and found this to be the best setting. All models are trained for 50 epochs, and we use the early stopping method to halt training if the validation accuracy does not improve within a predefined number of epochs. Early stopping improves generalization and prevents the model from running useless epochs, which would take a long time and could reduce accuracy [8]. We chose this number of epochs because experimental observations showed that the proposed model converged well and achieved the desired accuracy within 50 epochs. The code used in this work is available to facilitate future research (https://github.com/sohaibasif1592/M-R-I).

4.3 Experimental results: Three class classification

In our study, a total of five deep transfer learning models were developed, and the performance of each model was assessed using the metrics described in Section 3.5. We report a comparative analysis of each architecture. The main purpose of this study is to test the proposed deep transfer learning model on the multiclass classification of brain tumors and compare it with the most advanced CNN models in the literature. As described in Section 3.4, TL was used to train all deep learning models, and all experiments were performed on both the original and cropped datasets. The average accuracy, sensitivity, specificity, precision, and F1-score obtained by all models on the test set for the original and cropped datasets are given in Table 4; the values marked in bold represent the best model for each performance criterion. On the original dataset, the average accuracy values of all models are very close to each other, with Xception + DDB and ResNet152V2 + DDB outperforming the other models on almost every metric. On the cropped dataset, the Xception + DDB model achieves the best overall performance with an accuracy of 99.67%, together with the highest sensitivity, specificity, precision, and F1-score of 99.54%, 99.83%, 99.69%, and 99.62%, respectively. The Xception + DDB model also showed good sensitivity, which is important because we want to limit the rate of misdiagnosis of brain tumors as much as possible. These results show that Xception + DDB can distinguish between brain tumor types more accurately. A possible reason is that Xception replaces general convolutions with depthwise separable convolutions, making the model computationally efficient; depthwise separable convolutions are more productive and have stronger expressive ability than classical convolutions, allowing the model to learn several distinct, high-level features that simpler models may miss. DenseNet201 + DDB also performed well, with an accuracy of 97.06% and sensitivity, specificity, and F1-score of 96.28%, 98.24%, and 97.10%, respectively. While all models improved their accuracy by varying margins on the cropped dataset, the Xception + DDB model, which offered the best performance, improved in accuracy and sensitivity by 4.06% and 4.13%, respectively. This shows that cropping consistently outperformed the original dataset, so we continue the remaining experiments with the cropping strategy only.

Table 4 The performance comparison of various deep transfer learning models using different indicators for multi-class classification. Results are shown in percentages and best values are shown in bold

Figure 10 shows the accuracy obtained in the test set of the original dataset and the cropped dataset of all models. The test accuracy shown in this figure was calculated as the ratio of the number of correctly classified patients to the number of all patients. It can be clearly seen from the figure that the Xception + DDB model is superior to the other four proposed models in terms of accuracy. It can be seen from the test accuracy curve that the success rate of all models on the cropped dataset is higher than that of the original dataset. Our results show that cropping images is the best strategy to provide superior classification over the original dataset.

Fig. 10

Comparison of the classification accuracy of the proposed models on the cropped and original dataset

The class-wise performance of the models is presented in Table 5, with the best results highlighted in bold. A total of 2451 images were used for training and 613 images for testing, covering three classes: glioma, meningioma, and pituitary. From the results, we observe that the Xception + DDB model performed well in all classes, achieving an average precision of 1.0, an average sensitivity of 1.0, and an average F1-score of 1.0. The model achieves a precision of 1.0 for the glioma and meningioma classes and 0.99 for the pituitary class, and an ideal sensitivity of 1.0 for the pituitary and glioma classes with 0.99 for the meningioma class. Considering the macro-average scores of all evaluation metrics, the Xception + DDB model again provides better performance than the other models. The confusion matrix in Fig. 11 summarizes the detailed class-wise results of the Xception + DDB model, giving the number of correctly classified and misclassified images per class. From the confusion matrix, we can conclude that the Xception + DDB model made only 2 misclassifications on the test dataset, both in the meningioma class; out of 613 test images, our model correctly classified 611. Therefore, based on the evaluation metrics, Xception + DDB performs better than the other models in all aspects, showing an average increase of 2–3% in accuracy, precision, sensitivity, and F1-score.

Table 5 Class-wise Precision, Sensitivity and F1-Score for all the models
Fig. 11

a Confusion matrix of highest accuracy for Xception + DDB model b Normalized Confusion matrix

4.3.1 Performance comparison with baseline models

To highlight the advantages of the proposed model, we benchmarked its performance against five base models: Xception, DenseNet121, DenseNet201, InceptionResNetV2, and ResNet152V2. We used the same dataset to build the base models for comparison, but trained them without the DDB. Table 6 reports the classification performance of the baselines and the proposed models in terms of accuracy, sensitivity, precision, and F1-score, with the best results highlighted in bold. The accuracy of the proposed Xception + DDB model is 7.15% higher than its baseline, DenseNet201 + DDB is 7.79% higher than its baseline, and DenseNet121 + DDB is 5.03% higher than its baseline, which shows that the multiple dense layers in the DDB enhance the learning ability of the models and thus improve accuracy. The DDB significantly improves the detection rate and offers better stability than the baselines for brain tumor classification. The baseline models perform poorly on the test set, with accuracies between 86% and 93%, and their sensitivity and F1-scores are also below the acceptable range. There are several reasons for the poorer performance of the base models: (a) the varied classes in the dataset, (b) overfitting, which is the biggest cause of their poor performance, and (c) the difficulty of extracting features from MRI images due to the high degree of similarity between tumors. The results in Table 6 show that after adding the DDB, classification performance improves significantly compared to the baselines; the approach proposed in this work is clearly much better than the baseline models.

Table 6 Performance comparison of the proposed method and baseline models on three-class classification problem

4.4 Experimental results: Four class classification

This section presents the classification results for the four-class problem. A total of 2611 images were used for training and 653 images for testing, covering four classes: glioma, meningioma, pituitary, and normal. Table 7 summarizes the average evaluation metrics of the competing deep transfer learning models on the test dataset; all values are shown as percentages, with the best results highlighted in bold. The Xception + DDB model outperforms the other models on almost every metric, including accuracy, precision, sensitivity, F1-score, and specificity. The proposed model based on the Xception architecture achieves an impressive average classification accuracy of 95.87% on the test dataset for classifying glioma, meningioma, pituitary, and normal patients. It also achieved an average sensitivity of 95.60% and an average specificity of 98.56%, two very important performance indicators in medical applications. ResNet152V2 + DDB performed poorly on the test dataset, giving the lowest accuracy value of 93.11%. Compared to the other models, the proposed model delivers satisfactory performance with an average improvement of about 1% on all metrics.

Table 7 Performance comparison of different models for detecting brain tumors using different metrics

Table 8 shows the class-wise performance of the models, with the best results highlighted in bold. The classes used in this study are glioma, meningioma, pituitary, and normal. The results show that the Xception + DDB model performs well in all classes, achieving an average precision of 0.96, an average sensitivity of 0.96, and an average F1-score of 0.96. The model achieves 0.99 precision in both the pituitary and normal classes, and 0.94 and 0.93 precision in the glioma and meningioma classes, respectively. It attains a sensitivity of 0.98 for the pituitary class and high sensitivity for the other classes as well. Considering the macro-average scores of all evaluation metrics, the proposed model again provides better performance than the other models. The DenseNet121 + DDB network was found to be the second-best predictor of brain tumors, with an average precision of 0.95, an average sensitivity of 0.95, and an average F1-score of 0.95. The confusion matrix is the main tool for evaluating errors in classification problems, and Fig. 12 shows the confusion matrix of the proposed Xception-based model. The figure shows that the proposed model successfully classifies the four patient classes, with the highest rate for pituitary images (0.9824), followed by meningioma (0.9551), glioma (0.9476), and normal (0.9375). The results obtained on the test dataset are good: out of 653 test images, the proposed model correctly classified 626 cases and misclassified 27 cases, yielding an overall accuracy of 95.87%. This result confirms that classification is performed correctly for all four classes, and the model could be used for real-time detection of tumors in the human brain.

Table 8 Class-wise Precision, Sensitivity and F1-Score for all the models
Fig. 12

a Confusion matrix of the proposed model b Normalized Confusion matrix

4.4.1 Performance comparison with baseline models

To show the effectiveness of our proposed method in classifying brain tumors into 4 classes, we compared the performance of the proposed model with the base models. Table 9 shows the detailed comparison results of the proposed model with the baseline models. The performance analysis shows that the proposed model shows a performance improvement over the baseline models in terms of accuracy, sensitivity, precision, and F1-score. It was found that the accuracy of the proposed Xception + DDB model was 6.44% higher than the Xception baseline model, and the accuracy of DenseNet121 + DDB was 6.74% higher than the DenseNet121 baseline, which proved that DDB can significantly improve the accuracy of the model. It can be seen that DDB significantly improves the detection rate and has better stability than the baseline for classifying brain tumors.

Table 9 Performance comparison of the proposed method and baseline models on four-class classification problem

4.5 Proposed model comparison with different optimizers

In this experiment, various optimizers, namely Adam, SGD, and RMSProp, are explored to obtain the best classification accuracy for the proposed Xception + DDB model. Initially, the training phase of the proposed model was carried out with the empirically selected Adam optimizer. To evaluate its effectiveness, the results are compared with two popular optimization methods, RMSProp and SGD. Table 10 shows the effect of the different optimizers on the proposed model using the two brain MRI datasets, i.e., the 3-class and 4-class datasets. With the Adam optimizer, the proposed model achieves 99.67% classification accuracy on the 3-class dataset, whereas RMSProp and SGD achieve 95.61% and 96.75%, respectively; Adam thus yields better classification accuracy than RMSProp and SGD. On the 4-class dataset, the proposed model achieved a similar classification accuracy of 95.87% with both the Adam and RMSProp optimizers, while SGD achieved 93.11%. Overall, the proposed model adapted well to all three optimizers for the 3-class and 4-class classification of brain MR images.

Table 10 Classification performance of the proposed model among different optimizers

4.6 Computational cost

In addition to classification performance, the models proposed for brain tumor classification were also compared in terms of computational cost. Table 11 compares the proposed models with the base models in terms of total training time, time per epoch, test time, and number of parameters. The total training time of the proposed Xception + DDB model was 3950.63 seconds for the 3-class task and 1772.87 seconds for the 4-class task. Although the base models finished training earlier than our method, they did not achieve the best classification results. As the table shows, our proposed method performed best among the models on every measure except training time; our aim here, however, is to improve the accuracy of the system. The results show that adding the DDB significantly improves performance over the baselines while increasing the number of parameters only slightly, by less than 1M. This small increase in model parameters is acceptable given the large improvement in diagnostic outcome.

Table 11 Computational time and parameter comparison between the proposed model and the base model

4.7 Comparison with state-of-the-art methods

The performance of the proposed model based on the Xception architecture is compared with the most recent competitive models. Several papers use the same dataset to classify brain tumors; to compare our results with previous studies, we selected only articles that used the Figshare brain tumor dataset. Accuracy is the main metric used to compare the classification results. Table 12 compares the proposed model with benchmark studies in the literature on the same dataset; it shows that the proposed model outperforms existing models with 99.67% accuracy on the 3-class dataset and 95.87% on the 4-class dataset. Most existing work focuses on either three-class or four-class classification; as far as the authors know, no similar study addresses both three- and four-class classification of brain tumors. The proposed model can effectively solve both the three-class and four-class problems.

Table 12 Comparative analysis of the proposed model with state-of-the-art models using the same dataset

4.8 Strengths, limitations and future work

Until now, most brain tumor research has focused on binary or three-class problems. As described above, the experimental studies were conducted using two publicly available datasets covering three and four tumor classes. Following the Pareto principle, 80% of the images were reserved for training and validation and 20% for testing. We used image augmentation to ensure that each model receives enough input images and to avoid overfitting caused by the limited number of images in the datasets. For best performance, the datasets were trained using transfer learning based on five deep neural networks, Xception, DenseNet201, DenseNet121, ResNet152V2, and InceptionResNetV2, for predicting brain tumors in MR images. The last layers of these architectures were replaced with our deep dense block, with a softmax layer as the output. The deep dense block contains three dense layers, including the output layer, to improve the classification accuracy of the proposed deep transfer learning models for multi-class classification; the dense layers adapt to the features of brain tumors and significantly improve classification performance. The proposed model does not require separate feature extraction because it is built on a deep neural network. We use global average pooling in the deep dense block instead of a flattening layer to convert the multidimensional feature maps into a one-dimensional feature vector; this reduces overfitting by lowering the total number of parameters and requires no parameter optimization. We apply batch normalization immediately after each dense layer to increase the stability of the model. The motivation for the early stopping method is to end training when there is no further improvement, avoiding overfitting and poor generalization. Dropout layers and L2 regularization further minimize overfitting and allow the model to produce meaningful predictions with reasonable accuracy. The proposed model uses Adam, one of the most popular gradient descent optimization algorithms, as it combines the advantages of AdaGrad and RMSProp and is computationally efficient for deep neural networks. The classification results of the proposed models are reported using various evaluation metrics, such as accuracy, sensitivity, precision, and F1-score. This study compares five DL models, and Xception + DDB turns out to have clear advantages over the others. On the one hand, the depthwise separable convolution in Xception is more productive than the general convolution, making the model highly efficient at learning high-level features that simpler models may miss. On the other hand, the pointwise convolution, i.e., the 1 × 1 convolution, is performed on each channel before the depthwise convolution, making the model computationally efficient. Another reason is the combination of depthwise separable convolutional layers and residual connections, which enables the model to learn richer representations from brain MR images; in addition, the absence of intermediate non-linearities makes the model highly efficient across all performance measures. As shown in Tables 4 and 7, the proposed Xception + DDB model achieves 99.67% accuracy for 3-class classification of brain tumors and 95.87% for 4-class classification. Furthermore, the obtained results are compared with existing methods.
As shown in Table 12, our proposed model clearly outperforms the benchmark studies in terms of classification accuracy. Based on these encouraging results, we believe our Xception + DDB model can help doctors diagnose and detect brain tumors, classifying MRI brain tumors with low misdiagnosis rates and supporting accurate decisions. The proposed classification method is a major strength of this article.

Due to the scarcity of brain tumor data, this study is limited to single-institution data, and more MR images are needed for better model training. In addition, the study used single-protocol T1W MRI data; the system could be made more robust by merging multiple MRI protocols.

5 Conclusion

This article focuses on the development of an automated deep learning system for the multiclass classification of brain tumors. Due to the high similarity between tumor types, multi-class classification of brain tumors is a complex task, and in earlier works the four-class paradigm was largely absent. In this work, we conducted experiments on the classification of three and four classes of brain tumor patients, and all images were enhanced using image processing techniques in the preprocessing stage. Five popular deep learning architectures using the deep transfer learning technique were adopted for brain tumor detection from MR images. The last layers of these architectures were replaced with our deep dense block and a softmax output layer to improve classification performance. We propose a deep learning model based on the Xception architecture to detect brain tumor cases from MR images. The proposed model uses depthwise separable convolutions, which make it highly efficient at learning several distinct, high-level features, while the deep dense block significantly improves performance. The proposed model learns quickly with the Adam optimizer, while batch normalization, data augmentation, global average pooling, and dropout prevent overfitting. The proposed model achieves 99.67% and 95.87% overall classification accuracy on the 3-class and 4-class datasets, respectively. The results show that the proposed method gives the best performance on the selected datasets among all models used in this study and is superior to existing models in terms of classification accuracy. Therefore, the proposed model can be used as a tool to accurately identify multiple types of brain tumors. In future work, we will extend this work to experiment with more brain MRI data without compromising performance.