1 Introduction

Oral cancer is a severe global health issue with a high mortality rate, particularly in low- and middle-income countries. Early detection and diagnosis are crucial for improving survival rates. Automated detection systems based on deep neural networks have shown promise in identifying patterns associated with oral cancer. Transfer learning, which leverages knowledge from related domains, can enhance the performance of such systems.

Oral cancer has proven difficult to treat [1]. In the early stages of oral cancer, which present as ulcers, abnormal cells can be found in the oral tissue. When the disease metastasizes, such cells are also found at distant sites in the body. Oral cancer is one of many forms of cancer, and about 90% of oral cancers are squamous cell carcinomas [2]. Visual patterns and features that do not require staining can be used to detect both the clinical forms of tumors and lesion-free tissue. Machine learning algorithms have been used to predict numerous physiological indicators of oral cancer, to categorize specimens as cancerous or non-cancerous, and then to evaluate the stage of the cancer [3]. To test the validity of such associations, three screening tests are applied across the different stages of cancer. Early identification of oral cancer is aided by sample-based assessment of tumor regions and of emerging lesions in the tissue [4].

The oral cavity can be examined visually without any specialized equipment. Medical experts use the visual appearance of lesions to make presumptive diagnoses and treatment decisions for oral cancer [5,6,7]. Oral cancer lesions generally appear as white patches, in some instances accompanied by red patches or as mixed red-and-white patches. Often, the surface of the mucous membrane becomes increasingly uneven, granular, and inflamed. However, non-specialist medical personnel may mistake these lesions for other diseases of the oral mucosa [8]. There is no established vision-based method for detecting oral cancer. Even in developed nations, general practitioners and community-based settings may not have access to oral biopsy, which is not always available and can take a very long time [9]. As a result, many people with oral cancer are unable to get timely diagnoses and referrals. These findings suggest that deep learning may be able to capture fine-grained features of oral cancer lesions, which could help in the early detection of the disease.

The basic idea of transfer learning is that knowledge from a related domain can be transferred to improve learning in a different domain. Consider two people who want to learn the flute [10]. One of them is a complete novice in music, while the other is an accomplished musician. Learning to play the flute will be easier for the person who has previously studied music because of their prior musical training. We evaluated our model by measuring its computational efficiency and, in particular, by comparing it with the average performance of six oral cancer experts on clinical test data from both internal and external validation datasets. Using a deep learning model, we were able to identify distinct visual patterns in oral cancer lesions. This could allow oral cancer to be detected at the point of care in a less invasive, less expensive, and more effective way.

Transfer learning has been shown to improve the histopathologic prediction of oral cancer based on oral squamous cell carcinoma biopsies. A limitation of that study is its reliance on a specific type of cancer biopsy, which may limit the generalizability of the findings. Additionally, the performance of the transfer learning approach should be validated on larger and more diverse datasets to assess its robustness [11]. Similarly, the lack of real-time data integration, the small sample size, and the lack of external validation in a deep transfer learning-driven approach to oral cancer detection and classification all limit how generalizable its findings are [12].

1.1 Organization of the Paper

The remainder of the paper is organized as follows. Sect. 2 outlines several traditional approaches [13,14,15,16,17]. Sect. 3 discusses the proposed technique in detail. Sect. 4 presents the results of the approach. Sect. 5 concludes the paper.

2 Related Work

Image-based computer-aided diagnosis of oral cancer has been dominated by specialized imaging techniques, including computed tomography, multispectral imaging, and fluorescence imaging. A few studies, summarized in Table 1 and mostly focused on oral lesions, indicate that white-light photography can also be used to identify them. The related work largely covers the methods and resources initially used for cancer detection and classification, as well as the evaluation of machine learning approaches. Detecting oral lesions is therefore a critical step in diagnosing the disease. Deep learning approaches are now being used to better understand how oral cancer progresses at each stage.

Table 1 Related approaches

3 Methodology

To categorize oral cancer patients, each patient's data was passed to an algorithm that produced two output values ranging from 0 to 1.

To create reliable detection and classification networks, we used models that were pre-trained on the ImageNet dataset, which contains millions of images, and then fine-tuned them on our dataset. TensorFlow is a multi-purpose, large-scale machine learning system. It uses dataflow graphs to represent computation, shared state, and the operations that modify that state. The TensorFlow packages used to classify oral cancer in this analysis are listed in Table 2.

Table 2 Basic TensorFlow packages

For this paper, oral cancer images were separated from non-oral cancer images. Image pre-processing included scaling, horizontal flipping, saturation changes, and exposure modifications. The validation datasets did not go through this procedure. Reduced network training time and reduced overfitting are two advantages of our deep learning method based on transfer learning. Semantic segmentation classifies every pixel of a picture, including the background. This makes it possible for advanced automated diagnostic systems to delineate lesion boundaries and to segment anatomy pixel by pixel. Although semantic segmentation can be effective for finding lesions in pictures of the oral cavity, it cannot distinguish between the individual lesions that may appear in the same picture.
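The augmentation operations named above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names, the representation of an image as nested lists of (r, g, b) tuples, and the blend-with-gray definition of saturation are all assumptions chosen to keep the example self-contained. Scaling is omitted for brevity.

```python
# Illustrative sketch only: images are nested lists of (r, g, b) tuples
# with channel values in 0..255. All names here are hypothetical.

def _clamp(value):
    """Round to an integer and keep the result inside 0..255."""
    return min(255, max(0, round(value)))


def horizontal_flip(image):
    """Mirror every row left to right."""
    return [list(reversed(row)) for row in image]


def adjust_exposure(image, factor):
    """Scale every channel by `factor` (values above 1 brighten the image)."""
    return [[tuple(_clamp(c * factor) for c in px) for px in row] for row in image]


def adjust_saturation(image, factor):
    """Blend each pixel with its gray value.

    factor 0 gives a grayscale image, factor 1 leaves the image unchanged.
    """
    result = []
    for row in image:
        new_row = []
        for r, g, b in row:
            gray = (r + g + b) / 3
            new_row.append(tuple(_clamp(gray + factor * (c - gray)) for c in (r, g, b)))
        result.append(new_row)
    return result


def augment(image):
    """Return the original image plus three augmented variants.

    The factors 1.2 and 1.1 are arbitrary illustrative choices.
    """
    return [
        image,
        horizontal_flip(image),
        adjust_saturation(image, 1.2),
        adjust_exposure(image, 1.1),
    ]
```

In line with the text, such a function would be applied to the training images only, while the validation images are left unchanged.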

3.1 Proposed System for Detection of Oral Cancer Using CNN

Data mining is the process of discovering patterns in huge datasets through computation. It combines machine learning, database management systems, deep learning, and statistics. For data mining to be useful, it must be possible to extract data from large datasets and then transform it into an understandable structure. Large medical datasets are prime candidates for this kind of classification. To accurately diagnose and predict oral cancer in a patient, a variety of data mining methods are employed together.

The purpose of data mining here is to identify the most efficient technologies and methodologies for classifying data. Classification is ultimately performed by a machine learning model known as a convolutional neural network (CNN). The patient's diagnosis helps determine whether the disease can be treated successfully. Figure 1 shows the block diagram of oral cancer detection using a CNN. The diagram shows the overall workflow of the detection process, from the input data to the final prediction. The input images are processed through multiple convolution and pooling layers, followed by fully connected layers for classification. The output is a prediction of the presence or absence of oral cancer.

Fig. 1 Block diagram of oral cancer detection using CNN
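To make the convolution and pooling stages of this workflow concrete, the following is a minimal sketch of one convolution, activation, and pooling step on a single-channel image. It is an illustration under simplifying assumptions (no padding, stride 1, one hand-written kernel, hypothetical function names), not the network used in this work, where kernels are learned and many channels are processed at once.

```python
# Illustrative sketch only: a single-channel image is a list of rows of numbers.

def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2-D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows and columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 5x5 image with a vertical edge and a hand-written vertical-edge kernel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 0, 1] for _ in range(3)]

features = relu(conv2d(image, kernel))  # 3x3 feature map, strongest at the edge
pooled = max_pool2d(features)           # 1x1 summary of the feature map
```

In a full CNN, the pooled feature maps would be flattened and passed to fully connected layers that output the final class prediction.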

3.2 Dataset

The dataset is divided into two parts. The first part includes samples from both healthy individuals and patients with oral cancer. The second part includes test results from healthy individuals and from the healthy oral cavities of cancer patients. This division provides a sense of how the disease progresses. The dataset includes features suitable for characterizing oral cancer, among them the commonly used ones. The data on oral cancer is gathered from a single clinic or from a variety of disease organizations, as listed in Table 3. Several different types of oral cancer data are involved, including those from the brain, mouth, and throat. Medical data relating to oral cancers can be acquired from the UTI medical data sets.

Table 3 Attributes of input

The oral data source contains a total of 30 photos covering three severity classes (mild, moderate, and severe), which are evaluated in the same way for all 30 individuals. Table 4 shows the abnormal and normal mucosal regions of interest (RoI) for each patient (54 patients in total across both datasets).

Table 4 The number of instances and regions of interest (RoI) for both normal and suspicious cases

3.3 Pre-processing

When data mining methods are employed in the pre-processing stage, they can help to find the target data within a large dataset. Pre-processing covers data cleaning, integration, transformation, reduction, and discretization. During data cleaning, noise is removed from the data, and the data is made consistent and coherent. This stage includes detecting missing values as well as outliers.

  • Data integration: Data from multiple sources, such as data cubes, files, and distinct databases, is combined.

  • Data transformation: The data is normalized and consolidated.

  • Data reduction: The volume of data is significantly reduced while the quality of the analytical results is preserved.

  • Data discretization: Numerical attributes of the data are replaced with categorical or interval values.

Real-world data is often incomplete, unreliable, and noisy, to name just a few of its drawbacks. Pre-processing routinely attempts to fill in missing values and smooth out noise. Anomalies and errors in the data are identified at this phase. Missing values can all be replaced with a single global constant, or each can be replaced separately.
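The two strategies for handling missing values can be sketched as follows. This is an illustration, not the procedure used in this work: missing entries are assumed to be marked with None, the function names are hypothetical, and the per-attribute mean is only one possible choice for replacing values separately.

```python
# Illustrative sketch only: each record is a list of numeric attributes,
# and a missing value is represented by None (an assumption of this example).

def fill_with_constant(rows, constant):
    """Replace every missing value with the same global constant."""
    return [[constant if v is None else v for v in row] for row in rows]


def fill_with_column_mean(rows):
    """Replace each missing value with the mean of the observed values of its attribute."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # If an attribute has no observed values at all, it stays missing.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


records = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]

constant_filled = fill_with_constant(records, 0.0)
mean_filled = fill_with_column_mean(records)
```

A global constant is simple but can distort the distribution of an attribute, whereas a per-attribute statistic keeps each replaced value within the range of the observed data.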

3.4 Fuzzy C-Means Clustering

Clustering refers to the process of grouping objects based on similarity. It is one of the most important tasks in exploratory data mining. Clustering is also one of the most typical descriptive tasks, in which a limited number of clusters is used to describe the data. Objects that share similar features are placed together, so that each cluster gathers similar data points, and the approach uses the mean of each cluster. The fuzzy clustering method known as Fuzzy C-Means (FCM) is employed in the current analysis. FCM performs fuzzy partitioning: a data point can belong to several clusters, with degrees of membership ranging from 0 to 1. FCM is iterative by design. Its goal is to locate the cluster centers (centroids) that minimize a dissimilarity measure.
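As a minimal sketch of this iterative procedure, the following implements the standard Fuzzy C-Means updates on one-dimensional data. It is not the implementation used in this work: the fuzzifier m = 2, the fixed number of iterations, the small constant eps that avoids division by zero, and all function names are assumptions of the example.

```python
# Illustrative sketch only: standard Fuzzy C-Means on 1-D points.

def memberships_for(points, centers, m=2.0, eps=1e-9):
    """Return one row per point with its degree of membership in each cluster.

    Each row sums to 1 and every value lies between 0 and 1.
    """
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) + eps for c in centers]
        rows.append([1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists])
    return rows


def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = list(initial_centers)
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        for i in range(len(centers)):
            # Each center is the mean of all points, weighted by membership ** m.
            weights = [row[i] ** m for row in u]
            centers[i] = sum(w * x for w, x in zip(weights, points)) / sum(weights)
    return centers, memberships_for(points, centers, m)


points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
centers, memberships = fuzzy_c_means(points, initial_centers=[0.0, 6.0])
```

On these two well-separated groups, the centers settle close to 1 and 5, and each point has a membership close to 1 in its nearest cluster and close to 0 in the other.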

3.5 Feature Selection

To increase classification accuracy, feature selection retains only a subset of the characteristics in a dataset. Feature selection is an important step before analyzing medical data related to oral cancer. It identifies the characteristics that are relevant and that improve the classifier's performance. Feature extraction, in turn, can be used to uncover patterns in the data and to emphasize its commonalities and differences. Contrast this with strategies that use information from a variety of classes.

3.6 Neural Networks

Image classification is improved by employing a variety of neural network architectures. We use VGG16, VGG19, DenseNet121, DenseNet169, EfficientNetB0, EfficientNetB1, EfficientNetB2, InceptionV3, and ResNet101, along with ResNetV3, MobileNet, Xception, and dense neural networks.

In each subsequent layer, DenseNet combines the outputs of the previous layers. Several DenseNet variants are available to suit specific needs. They differ in the number of layers they employ, as shown in Fig. 2. A combination of several DenseNet layers is needed to get the best results.

Fig. 2 CNN Architecture

After the success of ResNet, the idea for InceptionResNet was born. The result was a hybrid of the Inception and ResNet architectures. InceptionResNetv1 is the first version, InceptionResNetv2 is the second, and InceptionResNetv3 is the latest. InceptionResNetv3 was used in our analysis.

ResNet: More layers are added to convolutional neural networks to improve quality and precision. These layers are added so that the result becomes more accurate and the loss decreases. In the image recognition example of Figs. 3 and 4, the first layer recognizes edges, the second layer recognizes textures, the third layer recognizes objects, and so on. However, standard deep neural networks have been found to have a depth threshold beyond which adding layers no longer helps. ResNet's skip connections address the vanishing gradient problem of deep neural networks by providing a shortcut path for the gradient to flow through.

Fig. 3 Architecture of InceptionResNet

Fig. 4 Basic ResNet50 Architecture
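The effect of ResNet's skip connections on the gradient can be illustrated with a small arithmetic sketch. It is a deliberate simplification: each layer is treated as a scalar function with a fixed local slope, and the function names are hypothetical. A plain layer y = f(x) passes the gradient on multiplied by the slope of f, whereas a residual layer y = x + f(x) multiplies it by 1 plus that slope.

```python
# Illustrative sketch only: every layer is reduced to a single local slope df/dx.

def plain_gradient(layer_slopes):
    """Gradient through a stack of plain layers y = f(x): the product of the slopes."""
    gradient = 1.0
    for slope in layer_slopes:
        gradient *= slope
    return gradient


def residual_gradient(layer_slopes):
    """Gradient through a stack of residual layers y = x + f(x).

    The identity path contributes the constant 1 to every factor.
    """
    gradient = 1.0
    for slope in layer_slopes:
        gradient *= 1.0 + slope
    return gradient


slopes = [0.1] * 20                   # 20 layers, each with a small local slope
vanishing = plain_gradient(slopes)    # about 1e-20: the signal effectively disappears
preserved = residual_gradient(slopes) # stays above 1: the signal survives
```

With small slopes, the plain product shrinks exponentially with depth, while the identity term in each residual factor keeps the gradient from vanishing.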

EfficientNet splits the standard convolution into depth-wise and point-wise convolutions to reduce the cost of computation while maintaining accuracy. It then uses a linear activation in the final layer of each block to avoid the information loss caused by ReLU, and it expands and then contracts the number of channels so that the skip connections link the layers with fewer channels, as shown in Fig. 5. The EfficientNet model family includes eight models, from B0 to B7, each with a different number of parameters and degree of accuracy. Models from B0 to B7 were used for this analysis. InceptionV3 is a CNN architecture that is 48 layers deep. Pre-trained on images from ImageNet, it was able to classify our images quickly and accurately.

Fig. 5 Architecture of InceptionV3

VGG Network: A convolutional neural network passes the input pictures through convolution, pooling, flattening, and fully connected layers in order to classify them, as shown in Fig. 6. Image augmentation is necessary if convolutional networks are built from scratch. Consequently, VGG-16, one of the VGG variants, is used in our analysis to classify the pictures and to evaluate the accuracy on the training and validation data.

Fig. 6 Basic VGG Architecture

MobileNet is based on depth-wise separable convolution, a form of factorized convolution. MobileNet is distinctive in that it requires less computational power to run and to train than traditional networks. For this reason, machines without GPUs or with limited computational capacity are ideal candidates for this architecture, as shown in Fig. 7. There are three versions of the MobileNet architecture: MobileNet-V1, MobileNet-V2, and MobileNet-V3. MobileNet-V2 and MobileNet-V1 use 53 layers for classification. Compared with MobileNet-V2, MobileNet-V1 is notably faster.

Fig. 7 Architecture of MobileNet
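The saving that depth-wise separable convolution brings can be checked with a short parameter count. The sketch below compares a standard convolution with its depth-wise plus point-wise factorization. Bias terms are ignored, and the layer sizes (a 3x3 kernel mapping 32 channels to 64) are arbitrary illustrative values, not taken from the models used in this work.

```python
# Illustrative sketch only: weight counts for one convolutional layer, biases ignored.

def standard_conv_params(kernel_size, in_channels, out_channels):
    """Every output channel has its own kernel_size x kernel_size filter per input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """One spatial filter per input channel, followed by a 1x1 convolution that mixes channels."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_params(3, 32, 64)          # 18432 weights
separable = depthwise_separable_params(3, 32, 64)   # 288 + 2048 = 2336 weights
reduction = standard / separable                    # roughly 7.9 times fewer weights
```

The same factorization reduces the number of multiply-accumulate operations by a similar ratio, which is why such networks suit machines with limited computational capacity.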

3.7 Performance Evaluation Metrics

The reliability of the system is measured and evaluated by computing and analyzing several metrics. Table 1 mentions a few of these metrics. The available literature has studied only a small fraction of all oral lesions, such as mouth sores or tongue lesions.

Sensitivity, precision, specificity, and the likelihood of misclassification are used to evaluate the proposed system's performance. Mean square error, accuracy, and overall performance are calculated using the equations in Table 5. In a medical setting, true negatives (TN) are healthy people who are correctly predicted to have no disease. False negatives (FN), on the other hand, are patients who have the disease but are predicted to be disease-free. True positives (TP) are patients who have the disease and are correctly predicted to have it. Finally, false positives (FP) are people who were predicted to have the disease but were found to be healthy.

Table 5 Performance metrics of CNN classifier
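The metrics named above can be computed from the four confusion-matrix counts as in the following sketch. It uses the standard textbook definitions, which are assumed here and may differ in detail from the equations of Table 5. The function name and the example counts are hypothetical, and the function assumes that no denominator is zero.

```python
# Illustrative sketch only: standard definitions, hypothetical counts.

def classification_metrics(tp, tn, fp, fn):
    """Return common classification metrics from confusion-matrix counts.

    Assumes tp + fn, tn + fp, and tp + fp are all non-zero.
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / total
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "misclassification_rate": (fp + fn) / total,
        "f1_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for 100 cases, not results from this study.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts, the sensitivity is 0.80, the specificity 0.90, and the accuracy 0.85, which shows how a model can look accurate overall while still missing one in five diseased patients.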

4 Results

Medical image analysis encompasses a wide range of application domains, of which object recognition is one. Automated robotic navigation and manipulation, understanding of geographic position, and many other applications have great potential. The Deep Learning Expertise is a collection of courses designed to help students learn about the capabilities, problems, and consequences of transfer learning and to prepare them to contribute to the advancement of cutting-edge AI technology. AI has strengthened the core capabilities and technologies of the medical sector. Figure 8 shows the distinction between oral cancer and normal cases.

Fig. 8 Segregation between Cancer & Non-Cancer

After additional training on two datasets, we compared the accuracy and F1-score of the pre-trained CNN models, such as Inception-ResNetV2, Inception-V3, VGG-16, and ResNet-101. Table 6 presents the F1-score, recall, and precision for patches, patients, and RoI. On the oral dataset, Inception-V3 and MobileNet achieve the highest F1-scores. Although MobileNet scores highest at the RoI level, Inception-V3 scores highest at the patient level. The patient-level result of Inception-V3 is the most reliable, since the number of regions may vary from one person to the next. Recall and accuracy for patches, patients, and RoI are shown in Figs. 9, 10, 11 and 12 for ResNet50, Inception-V3, VGG-16, and MobileNet, respectively.

Table 6 F1-score, recall, and precision for patches, patients, and RoI

Fig. 9 Performance measures of ResNet 50 for the three parameters

Fig. 10 Performance measures of Inception-V3 for the three parameters

Fig. 11 Performance measures of VGG-16 for the three parameters

Fig. 12 Performance measures of MobileNet for the three parameters

The study presents a comparative analysis of different deep learning techniques for the prediction and diagnosis of oral cancer. The results indicate that Inception-V3 outperforms ResNet 50, VGG-16, and MobileNet in terms of F1-score, sensitivity, and precision across the evaluated levels (patches, patients, and regions of interest). These findings suggest that the proposed work advances the state of the art in automated oral cancer detection.

5 Conclusion

It is often said that “happiness is the highest form of health”, and everyone should take care of their health in every way possible. Healthcare is one of the foremost domains today, and its development is a leading priority. Cancer is one of the major diseases affecting human society. Oral cancer is a serious disease that needs to be prevented and managed through early diagnosis. In the course of this research, we explored a technique that uses deep transfer learning to automate the detection of oral cancer. Among the range of models explored, including a sequential convolutional neural network, ResNet-50, VGG-19, and others, Inception-V3 and MobileNet stand out for oral cancer detection. Notably, they achieve impressive F1-scores on the oral dataset. While MobileNet achieves the highest scores at the region-of-interest (RoI) level, Inception-V3 achieves the highest scores at the patient level. The patient-level result of Inception-V3 is particularly notable because the number of regions varies across individuals.