
1 Introduction and Background

Cancer is a condition in which cells grow out of control. Cells in any part of the body can become cancerous and tend to spread readily to surrounding tissues and organs. Lung cancer is commonly referred to as carcinoma and is characterized by abnormal development in cells; carcinoma accounts for a high share of cancer-related deaths. Left undiagnosed or untreated, the disease progresses over time and can be fatal. A carcinoma is a type of cancer that begins in the cells that line the skin or an organ such as the lungs or kidneys. Cancers of the lungs are classified as small-cell lung carcinoma or non-small-cell lung carcinoma.

Cancer is generally a genetic disease, although this does not exclude external causes. It is driven by changes to the genes of the body's cells, sometimes triggered by stimuli the body responds to, which alter cell behavior, especially how cells divide. Most lung cancers, however, are not inherited, although a genetic component influences susceptibility. The primary cause of lung cancer is smoking; the disease has also been found in people with no smoking history but with exposure to air pollution, second-hand smoke, or certain toxic gases. Lung cancer was once rare; today it is usually attributable to exposure to airborne pollutant chemicals in their many forms. As with any other cancer, early-stage diagnosis is crucial to treating lung cancer. A chest X-ray can show the abnormal growth of cells in the lungs caused by cancer. Cancer can also be detected by sputum cytology, in which the subject's sputum is analyzed to identify affected cells, or by biopsy, in which a tissue sample is collected from the suspected area and tested. Each of these methods has a different accuracy rate, depending on the stage of the cancer.

Conventional detection of lung cancer uses radiographic techniques and computed tomography (CT). A CT scan achieves better results than a plain X-ray because it combines a computer with X-ray imaging to visualize the lungs, letting doctors view them from different angles with clearer images. In this study, 20 lung image samples were taken into consideration. The study could thus help in the early detection of lung cancer with greater accuracy.

1.1 Literature Survey

The project proposed here uses an ML model with convolutional neural networks, emphasizing filtering features to initialize a cognitive methodology for faster and more accurate decision-making. Several other methods of identifying the disease, including but not limited to those below, follow a similar approach with little or no directive process. Predicting an optimal subset of genes is a systematic procedure of analyzing gene-expression microarray data to predict the most probable cancer-causing agent. While several classification techniques such as the multi-layer perceptron and random subspace are compatible with various attribute subsets, the precision-recall value of sequential minimal optimization (SMO) is much better. The support vector machine (SVM) classifier has also been used for multi-stage classification in detection. Other methods such as electrical impedance tomography have proved superficially better at understanding cancer development by visualizing the respiratory system. Around the idea of promoting salient feature extraction and classification, and although biomedical methods offer near-field monitoring and direct analysis, using CNNs and machine learning to analyze the possibility of cancer development can be considered an early stage of building a sustained diagnosis. Other existing procedures are time consuming, and relying solely on them is walking a thin line, especially where time is a deciding factor. The CNN algorithm works through various layers of classification, involving criteria such as max-pooling and ReLU over several stages. While biopsy remains the most used method, deep learning models have a high success rate since they work in a hierarchy. Descriptors such as GLCM, MLPNN, and DT help propagate the subsequent ML structure toward the maximum accuracy possible after filtering. Cancer prediction using a CNN follows image processing with segmentation and multiple channeling, where the data is used to train the model and features are then extracted by the CNN. The extracted data is compared and validated after classification, thus predicting the cancer.

Achieving the required verification through deep prognosis differs across methods of justification and imaging techniques. Although certain demographics suggest the importance of the area of analysis using MRI or CT, concrete results always vary with the topology. Studies show that the sensitivity of positive prediction through CT and MRI is markedly higher than that of biopsies; identifying structural value with image sourcing through CT therefore helps the classifier acquire high initial predictive discrimination, followed by constructive feature extraction for accurate classification. The image processing techniques employed through GLCM after de-noising take minimal time to produce the result (Figs. 1 and 2).

2 Proposed Model and Method

2.1 Model Development Using CNN

In our study, we use a convolutional neural network trained with gradient descent [1, 2]. CNNs are widely used for classifying two-dimensional structures such as images, which is why they are treated as among the most powerful algorithms for this task. CNNs are often called end-to-end algorithms: they contain various sub-sampling, pooling, and convolutional layers, all connected to a final fully connected layer [3]. First, images of 64 \(\times \) 64 pixels are given to the input layer. The architecture of the CNN and the training process of the algorithm are shown in Figs. 1 and 3. The architecture has six layers in total: the first is the input layer, the last is the output layer, layers two and four are convolutional layers, and layers three and five are sub-sampling layers made of max-pooling with ReLU activation [4]. We use batch normalization for feature extraction, and the output layer consists of a single neuron with sigmoid activation. This output layer classifies the image as a benign or malignant tumor, the two available classes. The convolutional layers extract features while preserving the spatial relationships of the input pixels; this reduces the size of the images while retaining their features, which in turn reduces the computational cost. More specifically, the second layer has 32 feature maps of 3 \(\times \) 3 dimensions, followed by a max-pooling layer of 2 \(\times \) 2 pixels [5, 6]. The pooling layer down-samples the convolved image: max-pooling reduces the dimensions of the image while preserving its features, again lowering the computational cost. Finally, the output layer is a 1 \(\times \) 1 matrix with a sigmoid function and a single neuron, responsible for matching the output to the available categories (benign, malignant). Each neuron receives a linear combination of inputs from its corresponding neurons, with weights and biases applied to the prior layer's output, and the output layer computes a probability from the data of the prior layer.
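To make the layer structure concrete, the following minimal Keras sketch assembles the six layers described above. The filter count of the second convolutional block, the single grayscale input channel, and the plain SGD optimizer are assumptions, since the text only specifies the first block and the 64 \(\times \) 64 input.

from tensorflow.keras import layers, models

# Layer numbering follows the six-layer description in the text.
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),               # layer 1: 64 x 64 input (grayscale assumed)
    layers.Conv2D(32, (3, 3), activation="relu"),  # layer 2: 32 feature maps, 3 x 3
    layers.MaxPooling2D((2, 2)),                   # layer 3: 2 x 2 max-pooling
    layers.Conv2D(64, (3, 3), activation="relu"),  # layer 4: second convolution (64 filters assumed)
    layers.MaxPooling2D((2, 2)),                   # layer 5: 2 x 2 max-pooling
    layers.BatchNormalization(),                   # batch normalization before the classifier
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),         # layer 6: one neuron, benign vs. malignant
])

# Gradient-descent training as stated; the optimizer variant is an assumption.
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])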

Fig. 1 Architecture of CNN algorithm

Fig. 2 Activation diagram

Fig. 3 Training process of CNN

2.2 Methodology

Dataset Collection

We collected the dataset from Kaggle. The data was gathered from a hospital within a span of 3 months. The dataset contains lung CT scan images of different patients diagnosed with lung cancer in benign or malignant stages: 1100 CT scan images in total, drawn from the slices of 105 cases. The cases are divided into two categories: 35 are malignant and 70 are benign. We de-identified all images before analyzing them. A scan consists of several slices, between 70 and 110, with each slice representing a different side or angle of a human chest.
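As an illustration of how such a dataset might be read in, the sketch below loads the slices into arrays with benign/malignant labels. The directory layout (data/benign, data/malignant), the PNG file format, and the grayscale conversion are assumptions for illustration, not details given by the dataset description.

import numpy as np
from pathlib import Path
from PIL import Image

def load_slices(root="data"):
    """Load CT slices and labels (0 = benign, 1 = malignant)."""
    images, labels = [], []
    for label, cls in enumerate(["benign", "malignant"]):
        for path in sorted(Path(root, cls).glob("*.png")):
            img = Image.open(path).convert("L")  # single-channel CT slice
            images.append(np.asarray(img))
            labels.append(label)
    return images, np.array(labels)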

Image Preprocessing

The first step in training a model is preprocessing. It is akin to cleaning the data so that no corner cases break the model, and it makes the model robust by letting it accept data of any dimensionality. The preprocessing techniques we implemented are checking for and removing false data, resizing and re-scaling the images to the desired dimensions, and data augmentation to produce images from different views.

Image Resizing

Every image given as input to the convolutional neural network must be resized to a specific dimensionality, which is done by image resizing, one of the crucial preprocessing techniques. Images can be resized in two ways: by down-scaling them or by cropping their borders. With the second approach, we may lose border data that could carry crucial features of the image. With the first approach, we may end up with slightly deformed images, but deformed images are usually a more reasonable choice than cropped ones, so down-scaling is more feasible for most applications: we get deformed images without the risk of losing border patterns or features. Here, we perform re-scaling and resizing to make the input image match the desired dimensions of the model.
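A minimal sketch of the preferred down-scaling approach follows, under the assumption that PIL handles the image I/O; the 64 \(\times \) 64 target comes from the model's input layer.

import numpy as np
from PIL import Image

def resize_and_rescale(img, size=(64, 64)):
    """Down-scale the whole slice (no cropping) and re-scale pixels to [0, 1]."""
    resized = img.resize(size)  # slight deformation, but border patterns survive
    return np.asarray(resized, dtype=np.float32) / 255.0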

Data Augmentation

CNNs require large amounts of data to avoid overfitting. Overfitting happens when a model achieves high accuracy on training data but poor accuracy on test data, often because the training data is too small. Data augmentation is a technique that counters overfitting by generating new lung CT images from the data we have through certain mathematical transformations. In this model, every image is transformed using zooming, rotation, and flipping; horizontal-axis flipping is used more often than vertical-axis flipping, and rotation augmentation is implemented by rotating the image either left or right. We implemented the augmentation process with the image generator of the Keras library.
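The sketch below shows how this augmentation might be configured with the Keras image generator named above. The transforms (zoom, rotation, horizontal flip) follow the text, but the exact zoom and rotation ranges are assumptions.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    zoom_range=0.2,        # random zoom (range assumed)
    rotation_range=15,     # rotate left or right by up to 15 degrees (range assumed)
    horizontal_flip=True,  # horizontal-axis flipping, preferred over vertical
)

# train_images: (n, 64, 64, 1) array; train_labels: matching 0/1 labels.
# flow() yields augmented batches for model.fit():
# batches = augmenter.flow(train_images, train_labels, batch_size=32)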

3 Results

We compared the CNN and SVM algorithms and observed an accuracy difference of 2.5%, with the CNN being the better of the two. For testing, we used 30, 20, and 10% of the images from the proposed dataset of 1100 images.
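The following sketch illustrates one way this comparison could be run: hold out 30, 20, and 10% of the images in turn and score both classifiers on the held-out split. The SVM baseline on flattened pixels and the training settings are assumptions about the protocol, not details given in the text.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# images: (1100, 64, 64, 1) preprocessed array; labels: 0 = benign, 1 = malignant.
for test_frac in (0.30, 0.20, 0.10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        images, labels, test_size=test_frac, stratify=labels, random_state=0)

    model.fit(X_tr, y_tr, epochs=10, verbose=0)         # CNN from Sect. 2.1 (weights carry
    _, cnn_acc = model.evaluate(X_te, y_te, verbose=0)  # over between splits; re-initialization
                                                        # is omitted for brevity)
    svm = SVC().fit(X_tr.reshape(len(X_tr), -1), y_tr)  # SVM on flattened pixels
    svm_acc = svm.score(X_te.reshape(len(X_te), -1), y_te)
    print(f"test={test_frac:.0%}  CNN={cnn_acc:.3f}  SVM={svm_acc:.3f}")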

4 Scope

We make the following observations to support future work:

  1. Though the end results we acquired in the study were good, we focused on limited data and a restricted count of algorithms. Increasing the scope with a larger volume of data could help elevate performance.

  2. A more radical investigation of the optimal input dimensions for deep learning models can yield better accuracy.

  3. Broadening the input dimension to 3D data features and building the relevant CNN model (Fig. 4).

Fig. 4 Accuracy comparison between different models based on the number of testing images