1 Introduction

Lung cancer is one of the most serious health problems in the world. Its mortality rate is the highest among all types of cancer. Almost 80–85% of lung cancer cases are non-small cell lung cancer (NSCLC) [1]. However, the survival rate of lung cancer patients can be increased if nodules are diagnosed accurately.

Lung nodule detection is a time-consuming and complex process, and radiologists must carefully analyze the nodules in CT scan images. Doi [2] shows that 30% of lung nodules overlap with other common anatomic structures, which may cause radiologists to miss them. Lung images are obtained through several imaging techniques. Among these, CT is a basic imaging method for the detection of lung nodules and is used by many researchers.

A number of methods have been developed for lung nodule detection and classification. Gong et al. [3] proposed a probabilistic Naïve Bayes classifier for effective binary classification. This method improves on Fisher’s linear discriminant analysis (FLDA) in terms of accuracy and sensitivity. The K-nearest neighbor (K-NN) method, a probabilistic-based method suitable for 2-class classification, was proposed by Mao et al. [4]. This method provides a classification accuracy of 89%, which is better than that of other methods used for lung nodule detection such as random forest, regression tree, and learning vector quantization. A support vector machine (SVM)-based method, a powerful approach for classification problems, was proposed for lung nodule classification by Han et al. [5]. A rule-based filtering process combined with features of the region of interest (ROI) is used to reduce the false-positive (FP) rate. The adaptive neuro fuzzy inference system (ANFIS) combines the merits of the artificial neural network (ANN) and the fuzzy inference system (FIS) to improve lung nodule detection performance. Tariq et al. [6] proposed a neuro fuzzy classifier approach which provides accurate and effective detection of lung nodules.

Recently, deep learning-based methods have been extensively used to provide effective solutions for many applications such as natural language processing, speech recognition, and image analysis. The advantage of deep learning in CAD systems is that it can perform nodule detection by learning from features that are extracted automatically during training. Deep learning architectures consist of different types of neural networks with more hidden layers than traditional machine learning models. The convolutional neural network (CNN) is widely used for object recognition and classification with a remarkable increase in accuracy [7, 8].

The CNN architecture consists of convolution layers that perform the convolution operation to attain an improved understanding of input images for effective classification. Here, every neuron is associated with others such that it reacts to the receptive field surrounding it. In addition, the learnable parameters are extracted with the convolution and pooling processes in the CNN. As a result, the CNN approach produces improved accuracy for image classification. Many CNN-based approaches have been proposed for various pattern classification applications, including the 7-layer ImageNet [9], 8-layer AlexNet [10], 25-layer VGG net [11], and 152-layer ResNet [12, 13].

Kumar et al. [14] presented a deep learning technique with a stacked autoencoder and achieved an accuracy of 75.01% for lung nodule classification. Shen et al. [15] analyzed lung cancer detection using the LIDC database with a multiscale CNN method and achieved an accuracy of 86.84% for lung nodule detection. Shin et al. [16] evaluated the performance of transfer learning on lung CT images with CNN and attained 91.1% as nodule detection accuracy. Li et al. [17] proposed an 8-layer CNN-based deep learning method for the effective detection of lung nodules and achieved an accuracy of 92.4%. Jiang et al. [18] developed a deep learning model with CNN for computer-assisted diagnosis of lung cancer and achieved an accuracy of 94% with reduced false positives.

Since more accurate detection of lung nodules leads to better and prompter treatment by oncologists, new approaches with improved performance are always needed. To improve the performance of lung nodule detection using CT images, a new CNN-based deep learning method, CNN-5CL, which consists of 11 layers, is proposed in this paper. This work focuses on developing a unified method for ROI segmentation of the lung region from the database image along with the classification of nodules. A CNN constructed with the rectified linear unit (ReLU) function [19] is used for automatic feature extraction from the normalized lung region and classification of cancerous lung images with greater accuracy.

This introduction section outlines the need for the proposed method and the related works. Section 2 describes the proposed CAD system for lung nodule detection. Section 3 provides the results obtained through the implementation. Section 4 discusses the results of the proposed method. Section 5 concludes the paper.

2 Proposed method

The proposed CNN-5CL architecture for lung nodule classification consists of 11 layers. The classification of lung nodules is carried out using LIDC/IDRI database images [20, 21].

2.1 Convolutional neural networks (CNNs)

A convolutional neural network (CNN) is a multilayer neural network that has several convolution layers followed by the required number of fully connected layers, as in a typical multilayer neural network. CNN deep learning methods are developed with the concepts of local perception, the sharing of weights, and sampling in terms of space or time. Local perception can detect maximum local features of the image as basic features for classification or analysis. Another important merit of CNN is that more features of the input can be used for training to make an efficient decision for classification [22].

The proposed CNN-5CL architecture is given in Fig. 1. The CNN architecture is evaluated with different numbers of layers, neurons, and filters, and the effects of these changes on the performance are discussed in Sections 3 and 4. The layers, filters, and neurons selected for each layer after the evaluation are given in Fig. 1. The architecture is composed of 5 convolutional layers; each convolution layer is composed of a set of filters for convolution. These convolution operations create a set of feature maps. The number of feature maps is equal to the number of filters used for convolution. Every feature map is associated with one convolution kernel (i.e., weight), and each convolution kernel represents a feature, such as an edge in the image. The input images used in this work are 256 × 256 grayscale images.

Fig. 1 Proposed 2D CNN-5CL architecture

It is always advantageous to use the max pooling layer after the convolution layers. Instead of using all the convolutional layers together, if the convolutional layers are used in small numbers followed by a max pooling layer, the performance is improved [23]. This reduces the number of parameters used for processing. The 11 layers are split into 4 groups. The first and second groups have two convolutional layers and one max pooling layer. The third group contains one convolutional layer and one max pooling layer. The last group consists of a fully connected dense layer, dropout layer, and fully connected softmax layer.
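The grouping of the 11 layers can be illustrated with a short shape trace. This is only a sketch under stated assumptions: the 3 × 3 convolution kernels and 2 × 2 pooling windows are the sizes selected in the Discussion, while stride 1 with no padding for convolution (the case covered by Eq. (1)) and non-overlapping pooling are assumptions, since the padding and strides are not stated in the text. The names `LAYERS` and `trace_spatial_size` are hypothetical.

```python
# Hypothetical shape trace for the 11-layer grouping described above.
# Assumptions: 3x3 kernels, stride 1, no padding; 2x2 non-overlapping pooling.

LAYERS = [
    "conv", "conv", "pool",          # group 1
    "conv", "conv", "pool",          # group 2
    "conv", "pool",                  # group 3
    "dense", "dropout", "softmax",   # group 4
]


def trace_spatial_size(n, layers, f=3, p=2):
    """Return (layer kind, spatial size) after each conv/pool layer."""
    sizes = []
    for kind in layers:
        if kind == "conv":
            n = n - f + 1      # unpadded convolution shrinks each side by f - 1
        elif kind == "pool":
            n = n // p         # non-overlapping pooling, floor division
        else:
            break              # remaining layers act on the flattened vector
        sizes.append((kind, n))
    return sizes


sizes = trace_spatial_size(256, LAYERS)
for kind, n in sizes:
    print(f"{kind:5s} -> {n} x {n}")
```

Under these assumptions the 256 × 256 input shrinks to 29 × 29 before the fully connected group; a padded convolution would give different sizes.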

2.2 Convolution layer

The input image is split into small sizes based on the convolutional kernel, and a convolution operation is carried out on the image. The convolution layer has sparse interactions with the input and kernel function of the filter. It can be seen in CNN architecture that the kernel size is smaller than the size of the input. The advantages of this sparse interaction are the detection of useful features such as edges, reduction in memory space, and accuracy of classification. The convolution function has a weight-sharing capability, and this results in the extraction of diverse features of an image and efficient parameter learning similar to that of fully connected networks. The output of the convolutional layer is obtained based on Eq. (1) [24].

$$\left(n \times n \times {n}_{c}\right) * \left(f \times f \times {n}_{c}\right) = \left(n-f+1\right) \times \left(n-f+1\right) \times {n}_{c}^{\prime}$$
(1)

where \(n \times n\) is the size of the input image, \(f \times f\) is the size of the filter kernel, \({n}_{c}\) is the number of channels (\({n}_{c} = 1\) for a 2D grayscale image), and \({n}_{c}^{\prime}\) is the number of filters used for convolution.
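The output size in Eq. (1) can be checked with a minimal pure-Python sketch of an unpadded, stride-1 convolution on a single-channel image. The function names and the filter count of 32 are hypothetical and chosen only for illustration. As in most deep learning frameworks, the kernel is applied without flipping (cross-correlation).

```python
def conv_output_shape(n, f, num_filters):
    """Output shape predicted by Eq. (1) for an n x n input and f x f kernels."""
    return (n - f + 1, n - f + 1, num_filters)


def conv2d_valid(image, kernel):
    """Unpadded, stride-1 convolution of a square image with a square kernel."""
    n, f = len(image), len(kernel)
    out = n - f + 1
    result = []
    for r in range(out):
        row = []
        for c in range(out):
            acc = 0.0
            # Weighted sum over the f x f window whose top-left corner is (r, c)
            for i in range(f):
                for j in range(f):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        result.append(row)
    return result


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1]] * 3                 # 3 x 3 summing kernel
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))   # spatial size n - f + 1
print(conv_output_shape(256, 3, 32))           # 32 filters is an assumed value
```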

The activation function used in the convolutional layer is ReLU (rectified linear unit) because of its effectiveness in computation. The ReLU activation function is given in Eq. (2) [25].

$$f(x)= \begin{cases} x, & \text{if } x\ge 0,\\ 0, & \text{otherwise} \end{cases}$$
(2)

where x is the input and \(f(x)\) is the activation function.
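Eq. (2) translates directly into code. The following sketch applies it elementwise to a feature map stored as a list of rows; the helper names are hypothetical.

```python
def relu(x):
    """Eq. (2): pass non-negative inputs through, clamp negatives to zero."""
    return x if x >= 0 else 0.0


def relu_map(feature_map):
    """Apply ReLU to every element of a 2D feature map."""
    return [[relu(v) for v in row] for row in feature_map]


activated = relu_map([[-1.5, 2.0], [0.0, -0.25]])
print(activated)
```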

2.3 Pooling layer

The pooling layer is added to the CNN to summarize neighboring elements and to reduce the size of the feature maps produced by the convolution layer. Three types of operations are available in the pooling layer: min pooling, mean (average) pooling, and max pooling. Min pooling takes the minimum of the feature points within a neighborhood, average pooling takes their average, and max pooling takes their maximum. The max pooling operation is chosen due to its ability to reduce the output error [3]. The max pooling operation is performed on the output image of the convolution layer as given in Eq. (3) [26].

$${f}_{\mathrm{max}}\left(x\right)=\mathrm{max}\left({x}_{i}\right)$$
(3)

where \({x}_{i}\) is the ith element of the pooling window, with i = 1, 2, 3, 4 for a 2 × 2 kernel.
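A minimal sketch of Eq. (3) for a 2 × 2 window is given below. It assumes non-overlapping windows (stride equal to the window size), which is not stated explicitly in the text; the function name is hypothetical.

```python
def max_pool2d(feature_map, k=2):
    """Max pooling with non-overlapping k x k windows (Eq. (3) per window)."""
    rows = len(feature_map) // k
    cols = len(feature_map[0]) // k
    return [
        [
            max(feature_map[r * k + i][c * k + j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 1, 8]]
pooled = max_pool2d(fmap)
print(pooled)
```

Each output value is the maximum of the four elements in its window, so a 4 × 4 map is reduced to 2 × 2.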

2.4 Fully connected layer

The fully connected layers along with the dropout layer are used for classification in the last section of the CNN. The operation of the fully connected layer is global, which differs from the convolution and pooling layer operations. Therefore, this operation produces a nonlinear grouping of features which are used for classification. The output of the max pooling layer is flattened into an array for the fully connected layer, as it represents the feature vector of the input from all the neurons of the previous layer.
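The flattening step and a fully connected layer can be sketched as follows. The weights and biases here are arbitrary illustrative values, not the trained parameters of the proposed model, and the function names are hypothetical.

```python
def flatten(feature_maps):
    """Concatenate a list of 2D feature maps into one feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(x, weights, biases):
    """Fully connected layer: every output neuron sees every input element."""
    return [
        sum(w * v for w, v in zip(w_row, x)) + b
        for w_row, b in zip(weights, biases)
    ]


vector = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
outputs = dense([1, 2], weights=[[1, 0], [0.5, 0.5]], biases=[0, 1])
print(vector, outputs)
```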

2.5 Dropout layer

The dropout process is the practice of randomly selecting neurons from the previous layer for training. The dropout layer uses max-norm regularization, which improves the efficiency of classification through random skipping of connections with a selected probability. Here, the dropout layer is used to reduce the number of parameters extracted through the convolution and max pooling operations, and this layer helps avoid overfitting during the training process. The dropout rate represents the fraction of parameters considered from the available parameters. A proper dropout rate must be chosen to achieve improved performance.
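The random skipping of connections can be sketched as below. This sketch assumes the convention used by Keras, in which the rate is the fraction of units set to zero during training and the surviving activations are rescaled by 1 / (1 − rate); at the rate of 0.5 selected later in the paper, the dropped and retained fractions coincide. The function name and the fixed seed are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)                    # fixed seed for repeatability
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))
```

Dropout is applied only during training; at test time all activations are passed through unchanged.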

Finally, the fully connected softmax layer is used to identify the nodule availability based on the probability of the class label [27] given in Eq. (4).

$${S}_{i}= \frac{{e}^{{z}_{i}}}{\sum_{j=1}^{k}{e}^{{z}_{j}}}$$
(4)

where \({S}_{i}\) is the probability value of softmax output for class i (i = 1, 2), \({z}_{i}\) is the softmax layer output value of neuron i (i = 1, 2), and k is the number of neurons in the output layer (k = 2).
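Eq. (4) can be evaluated with a few lines of code. The sketch below subtracts the maximum input before exponentiation, a common numerical-stability step that does not change the result; the function name and the example inputs are hypothetical.

```python
import math


def softmax(z):
    """Eq. (4): convert output-layer values into class probabilities."""
    m = max(z)                             # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.5])                # k = 2: two output neurons
print(probs)
```

The two outputs sum to 1, and the class with the larger probability is taken as the predicted label.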

2.6 Dataset

Images from the Lung Image Database Consortium (LIDC) [28] are used for evaluating the performance of the proposed system. The dataset constructed from the LIDC database consists of 1018 images of non-nodules and nodules with a minimum nodule diameter of 3 mm along with the annotations [29]. Among the 1018 images, 135 images did not have nodules, and the remaining 983 images have 2669 nodules. The input segmented database contains 5338 nodule images and 1796 non-nodule images with sizes of 256 × 256. The datasets with segmented images are further divided into a training set (80%) and a testing set (20%) for the evaluation. The number of images used for training and testing is 5867 and 1467, respectively. The sample input images of the nodule and non-nodule categories are shown in Fig. 2.
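The 80%/20% division can be sketched as a shuffled index split. The total of 7334 images used here is simply the sum of the reported training and testing counts; the shuffling, the seed, and the function name are assumptions, since the paper does not describe how the split was drawn.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into training and testing sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(5867 + 1467)
print(len(train_idx), len(test_idx))
```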

Fig. 2 Sample input images with a size of 256 × 256: (a) nodule category and (b) non-nodule category

2.7 Performance metrics

The metrics considered for performance evaluation are accuracy (correctly classifies nodule and non-nodule images), sensitivity (correctly determines the nodule images), and specificity (correctly determines the non-nodule images). The mathematical relationships of these parameters are provided in Eqs. (5) to (7).

$$\mathrm{Accuracy}=\frac{\mathrm{TN}+\mathrm{TP}}{\left(\mathrm{TN}+\mathrm{TP}+\mathrm{FN}+\mathrm{FP}\right)}$$
(5)
$$\mathrm{Sensitivity}=\frac{\mathrm{TP}}{\left(\mathrm{TP}+\mathrm{FN}\right)}$$
(6)
$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\left(\mathrm{TN}+\mathrm{FP}\right)}$$
(7)

where TP (true positive) denotes the images are correctly classified as nodules; FP (false positive) denotes the images are wrongly classified as nodules; TN (true negative) denotes the images are correctly classified as non-nodules; and FN (false negative) denotes the images are wrongly classified as non-nodules.
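Eqs. (5) to (7) can be computed from the four confusion-matrix counts as sketched below. The counts in the example are hypothetical and are not results from this paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity as defined in Eqs. (5)-(7)."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    sensitivity = tp / (tp + fn)           # fraction of nodules detected
    specificity = tn / (tn + fp)           # fraction of non-nodules rejected
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only
acc, sens, spec = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(acc, sens, spec)
```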

The receiver operating characteristic (ROC) curve, another way of evaluating performance, plots the true positive rate (TPR) against the false positive rate (FPR). It is used to assess the overall performance of the classifier. From this curve, the area under the ROC curve (AUC) is calculated, which represents an aggregate measure of performance across all possible classification thresholds.
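One way to obtain the ROC points and the AUC is to sweep the decision threshold over the predicted scores and integrate with the trapezoidal rule, as sketched below. The labels and scores are hypothetical, and this is not necessarily the procedure used to produce the results reported in this paper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]                # 1 = nodule, 0 = non-nodule
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical nodule probabilities
auc = auc_trapezoid(roc_points(labels, scores))
print(auc)
```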

3 Results

The proposed CNN-5CL method is implemented in Python 3.7.4. The implementation platform uses Keras 2.1.3 with TensorFlow as the backend along with CPU acceleration. The experiments are conducted with different input batch sizes. The batch size is the number of input images used to train the network in a single iteration.

The experiments are conducted with various batch sizes and numbers of convolutional layers, and the attained performances are listed in Table 1. The best performance is achieved with a batch size of 128 and 5 convolutional layers.
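The effect of the batch size on the number of training iterations can be illustrated with a short sketch. It uses the 5867 training images reported for the dataset and the batch size of 128; the helper names are hypothetical, and keeping the final partial batch is an assumption.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass over the training set once."""
    return math.ceil(n_samples / batch_size)


def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs for consecutive batches."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


steps = steps_per_epoch(5867, 128)
slices = list(batch_slices(5867, 128))
print(steps, slices[-1])
```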

Table 1 Performances of the CNN-5CL method for various numbers of convolution layers and input batch sizes

Now the 5 convolutional layers are split into 3 groups, with two each in the first and second groups and one in the third group. All these groups will have one max pooling layer. The fourth group consists of a fully connected layer, dropout layer, and softmax layer.

The selection of the number of filters, filter sizes, and the number of neurons in each layer is discussed in Section 4.

The performance of the proposed CNN method for various dropout rates is presented in Table 2. This analysis is carried out to select a suitable dropout rate to achieve better performance. From the results, the dropout rate of 0.5 is found to be suitable to attain the best performance.

Table 2 Performances of the CNN-5CL method for various dropout rates

With the proper selection of various components of the architecture, the layer specifications of the proposed CNN are given in Table 3.

Table 3 Layer specifications of the proposed model for 256 × 256 input

The ROC curve represents the classification performance as a plot of the true positive rate (TPR) against the false positive rate (FPR), where TPR is the same as sensitivity and FPR is 1 − specificity. The ROC curve of the proposed method is shown in Fig. 3. The area under the ROC curve (AUC) value achieved is 0.928, which indicates the overall performance of the predictions.

Fig. 3 ROC of the proposed method

The performance of the proposed CNN-5CL is compared with the K-nearest neighbor (Mao et al. [4]), Naïve Bayes (Gong et al. [3]), SVM (Han et al. [5]), and ANFIS (Tariq et al. [6]) methods, and with the other CNN approaches proposed by Li et al. [17] and Jiang et al. [18]. The results are shown in Table 4.

Table 4 Comparison of performance of CNN-5CL with other methods

From the results shown in Table 4, it is seen that the proposed method outperforms the other methods. The accuracy, sensitivity, specificity, and AUC of the proposed method are 98.88%, 99.62%, 93.73%, and 0.928, respectively, which are better than those of the other machine learning and deep learning methods. Also, our method considers all the images in the LIDC/IDRI database, whereas the other methods considered a slightly smaller number of images in their experiments.

The comparison of computation time for training and testing of the proposed CNN-5CL method with other methods is shown in Table 5. It can be seen that the proposed method takes less time than the other methods for both training and testing. The proposed method achieves a computation time of 30.27 min for training and 19.23 s for testing.

Table 5 Comparison of computation time for training and testing of the proposed CNN-5CL method with other methods

4 Discussion

The architecture of the proposed CNN-5CL model is designed by conducting various experiments by varying the number of convolution layers, number of filters, and neurons in each layer and dropout in the dropout layer. Also, the batch size of the input for training is chosen to achieve the best performance.

The convolution layer consists of neurons which take as input a selected area of the preceding layer, called the receptive field of the neuron. The receptive field of a neuron is square, and it is defined by the size of the filters used in the convolution layers. Here, filters of sizes 3 × 3, 5 × 5, and 7 × 7 are used for evaluation. The corresponding receptive fields of each neuron contain 9, 25, and 49 elements, respectively. It is observed that the performance is higher with the 3 × 3 filter size than with the 5 × 5 and 7 × 7 filter sizes, as it extracts local features of images effectively.

When the number of convolution layers is too small, it leads to underfitting of the network parameters, and the performance is reduced. If the number of layers is too large, it leads to overfitting of the network parameters, which affects the training process, and the performance is degraded. With a 3 × 3 filter size, experiments are conducted for 4 and 5 convolution layers. For 4 layers, 2 convolutional layers in group 1 and 1 convolution layer each in groups 2 and 3 are used for implementation. For 5 layers, the architecture shown in Fig. 1 is used. The results are shown in Table 1. Since the performance is better for 5 layers, this configuration is selected. Hence, 2 convolution layers in each of the first and second groups and one convolution layer in the third group are selected as shown in Fig. 1.

Three max pooling layers, one for each group, are used in the model. When different sizes of filters such as 2 × 2, 3 × 3, and 4 × 4 are used for the evaluation, the filter size of 2 × 2 is found to be suitable as it preserves key features from the input. In the first, second, and third groups, one max pooling layer in each group is used. The number of neurons in each layer is the product of the size of the input and the number of filters in that layer.

The dropout rate is to be chosen in the dropout layer to achieve better performance. With different dropout rates (0.4, 0.45, 0.5, 0.55, and 0.6) the performance is studied. From Table 2, it can be seen that the performance measures such as accuracy, sensitivity, and specificity are better for the dropout rate of 0.5 when compared with other dropout rates. Hence, the dropout rate of 0.5 is selected.

After selecting the various layers, number of filters, filter sizes, and neurons, the architecture of the CNN-5CL model is proposed. Table 3 shows the layer specifications of the proposed model. This architecture is used for performance comparison with other methods.

Table 4 shows the comparison of performances of CNN-5CL with other methods. The Naïve Bayes (Gong et al. [3]) and K-nearest neighbor (Mao et al. [4]) classifiers are probabilistic-based methods. Classification is carried out by these methods based on the configuration of feature-independent probability. These methods do not require a large number of samples for effective training. The SVM classifier (Han et al. [5]) is a powerful method for classification even for larger datasets and high-dimensional features. However, the selection of hyperplanes for classification is difficult. The ANFIS classifier (Tariq et al. [6]) gives good classification accuracy. However, the formation of rules for proper classification is very challenging. The deep learning methods proposed by Li et al. [17] and Jiang et al. [18] have produced better nodule detection accuracy. However, they used only selected images for classification. Our method considers all the images in the LIDC database and offers improved classification results. Also, automatic parameter tuning capability is achieved. The accuracy, sensitivity, specificity, and AUC of the proposed method are 98.88%, 99.62%, 93.73%, and 0.928, respectively.

The execution time taken for both training and testing is relatively less in the proposed method which is shown in Table 5. Automatic feature extraction reduces the execution time when compared with machine learning techniques. The proposed architecture requires less execution time when compared with other deep learning methods.

5 Conclusion

In this paper, the 11-layer CNN-5CL method is proposed for the effective classification of nodules and non-nodules from the LIDC/IDRI database of lung images. The CNN-5CL architecture is derived from the results of experiments conducted by varying the number of convolution layers, number of filters, number of neurons, and dropout rates. The performance of the proposed method is compared with the performance of other methods. The experimental results show that the proposed CNN-5CL method performs better than the K-nearest neighbor method proposed by Mao et al. [4], the Naïve Bayes method proposed by Gong et al. [3], the SVM method proposed by Han et al. [5], the ANFIS method proposed by Tariq et al. [6], and the other CNN approaches proposed by Li et al. [17] and Jiang et al. [18]. The accuracy, sensitivity, specificity, and AUC of the proposed method are 98.88%, 99.62%, 93.73%, and 0.928, respectively, which are better than those of the other machine learning and deep learning methods. This method can be extended to develop and design high-performance CAD systems for the categorization of lung nodule types. Also, attempts can be made to apply this approach to other image formats of lung images for nodule classification.