1 Introduction

Oral cancer is a fatal condition with a complex etiology and a high death rate. According to the World Cancer Research Fund (WCRF) International, malignancies of the oral cavity and lip are among the most prevalent types of cancer, with more than 377,700 cases recorded globally in 2020. Malignancies of the oral cavity and lip are the 11th and 18th most frequently occurring cancers in men and women, respectively. Addressing oral cancer requires a well-formulated strategy that includes early detection, risk factor management, and health literacy. Risk factors include infection with human papillomavirus (HPV), alcohol consumption, smoking, poor dental hygiene, geographical location, lifestyle, and ethnicity [1].

Squamous cell carcinoma (SCC) may develop from precancerous lesions such as erythroleukoplakia, oral leukoplakia, and verrucous hyperplasia [2]; SCCs account for 90% of all oral cancers [3]. The most accurate way to diagnose oral cancer is through biopsy; however, this method is painful, and in cases of extensive or multiple lesions, selecting the appropriate site and size of the biopsy sample for surgical treatment can be challenging [4]. Additionally, owing to lesion variability, the prepared histology specimen may not accurately represent the entire lesion. To achieve a successful cure, higher chances of survival, and reduced mortality and morbidity rates, oral squamous cell carcinoma (OSCC) must be detected early [5]. The average survival rate for OSCC stands at 50% [6, 7]. The accepted approach for diagnosing OSCC is microscopy-based histopathological examination of tissue samples [8, 9]. However, the clinical value of this approach is constrained by the histopathologist's interpretation, which is frequently laborious and prone to error [10]. Therefore, it is crucial to offer efficient diagnostic techniques to support pathologists in the evaluation and diagnosis of OSCC.

Recently, deep learning (DL) algorithms have become the state of the art in computer vision and image processing owing to their strength in processing vast volumes of data [11,12,13]. As a result, numerous investigations have been conducted to aid pathologists through DL techniques, especially convolutional neural networks (CNNs), in medical image classification, segmentation, and localization [14,15,16]. Although CNNs excel at feature extraction, they are unable to encode the relative positions of distinct features. Convolution operations fail to capture global information [17] and long-range relationships across an entire image [18]. Many researchers proposed architectural changes to address this limitation, and eventually [19] introduced the attention mechanism, which learns the correlation between output and input patterns without relying on recurrence. This enables efficient parallelization of Transformer implementations. In response to the popularity of Transformers in natural language processing (NLP) tasks, the Transformer architecture was redesigned for images by [20] as the vision transformer (ViT). In the adapted version, the Transformer accepts a sequence of fixed-size image patches as input to extract complex features of the image. It pays global attention to the entire image, overcoming the long-range dependency issue of CNNs. The potential of ViT has been explored by several researchers in diverse computer vision applications such as point cloud classification, image enhancement, and object detection. Building on the success of Transformers in NLP, ViT has also made significant contributions to medical computer vision across a variety of imaging modalities.

In the realm of histopathological image classification, ViTs have demonstrated notable success in cancer diagnosis, e.g., renal cell carcinoma, breast cancer, cancerous esophageal tissue, glioblastoma, bladder urothelial carcinoma, lower-grade glioma, and lung cancer [21, 22]. Despite the widespread use of ViTs in various disease diagnoses, their potential in the domain of oral cancer has been underexplored. Applying ViTs to oral cancer classification introduces a novel dimension, emphasizing the distinct histopathological characteristics and clinical considerations unique to oral tissues. Oral cancer presents its own set of challenges, marked by specific cellular compositions, anatomical variations, and staining patterns that differentiate it from other cancers. The prevalence of oral cancer, often associated with risk factors such as tobacco use, underscores the critical need for accurate diagnostic tools. While ViTs have been leveraged for other cancer types, their adaptation and application to oral cancer represent a pioneering effort that addresses a notable gap in the existing literature. By recognizing the unique characteristics of oral cancer and harnessing the power of ViTs, this research contributes to advancing the understanding of oral cancer pathology and heralds a promising avenue for improved clinical outcomes. Although Transformers outperform CNNs in interpreting contextual information, their computational demands and the need for extensive datasets present challenges in medical imaging. The scarcity of publicly accessible imaging datasets for oral cancer further intensifies these difficulties. Considering these constraints, we are motivated to employ a fine-tuned ViT to create an automated diagnostic framework for the detection of oral cancer.

The contributions of the paper are listed as:

  1.

    The performance of the proposed fine-tuned ViT model is either superior or comparable to that of state-of-the-art models in binary-class oral cancer classification across various publicly available oral cancer histopathology datasets.

  2.

    We have performed a comparative analysis of deep learning (DL) models against the fine-tuned ViT, from which it is inferred that the ViT model performs better than the DL models for the classification of oral cancer.

  3.

    The fine-tuned ViT performs well with a smaller dataset, challenging the assumption that transformer models require large datasets for optimal performance.

The rest of this manuscript is organized as follows: Sect. 2 discusses prior art on oral cancer classification and ViTs in the medical domain. Section 3 describes the methodology used in this work. Section 4 presents the results of the proposed methodology and of eight pre-trained deep learning models. Section 5 summarizes the work and outlines the future scope of the proposed approach.

Table 1 Prior art related to publicly available oral cancer image databases
Table 2 Prior art related to private oral cancer image databases

2 Related works

Various approaches based on both machine learning and deep learning have been introduced in the literature for diagnosing oral cancer through the analysis of medical images. OSCC image databases involve hyperspectral imaging, autofluorescence imaging, computed tomography (CT), magnetic resonance imaging (MRI), and histopathological imaging. Tables 1 and 2 detail some of the earlier approaches to oral cancer classification implemented using machine learning and DL neural networks. Among machine learning applications on OSCC images, [27] used an SVM classifier to attain 91.64% accuracy. Among CNN applications on OSCC images, [28] created a DL method that uses patient hyperspectral images for advanced computer-aided oral cancer diagnosis; the proposed regression-based partitioned DL strategy was assessed against other methods in terms of classifier accuracy, sensitivity, and specificity. [23] used CNN models to attain 91.13% accuracy. [29] developed an automated ensemble DL method that combines the benefits of Resnet-50 and VGG-16 to examine oral lesions, achieving an accuracy of 96.2%. [15] developed a lightweight EfficientNet-B0 DL model for classifying oral lesion images, separating benign from malignant or potentially malignant lesions. [24] explored a tailored AlexNet model designed for the detection of OSCC in histopathological images. [25] introduced a ten-layer CNN model, demonstrating superior performance in diagnosing OSCC from histopathological images compared to pre-trained CNN models. A hybrid optimization algorithm [26] combined particle swarm optimization (PSO) with Al-Biruni Earth Radius Optimization and was employed to optimize the design parameters of deep belief networks and CNNs for identifying malignant oral lesions.

Fig. 1
figure 1

Few histopathology images from Normal category

Fig. 2
figure 2

Few histopathology images from OSCC category

Based on the preceding discussion, it is evident that CNNs have proven highly effective in classifying oral cancer, showcasing remarkable accuracy and establishing their significance in this domain. While CNNs with deep architectures excel at extracting features for numerous small objects within an image, identifying the truly critical regions may pose a challenge. To address this challenge, the vision transformer (ViT) model has become prevalent in medical image classification, including CT scans, X-rays, OCT/fundus images, MRI scans, PET, histopathology images, endoscopy, and microscopy. [31] performed multi-class colorectal cancer tissue classification using a ViT and a Compact Convolutional Transformer, achieving accuracies of 93.3% and 95%, respectively. [32] developed the IL-MCAM framework, which employs interactive learning with attention techniques. [33] carried out a comprehensive analysis and review of the ViT framework for emphysema classification. [34] utilized ViTs for Covid-19 detection on CT scans, employing ViTB-16, ViTB-32, ViTL-16, ViTL-32, and ViTH-14 for image classification. [35] compared the performance of pneumonia classification using ViT, CNN, and VGG16 models and demonstrated that ViT achieved the highest classification accuracy of 96.45%. [18] put forward an integrated Transformer model for multimodal image classification; the hybrid model comprised a CNN to learn low-level features, followed by Transformers for global information. [36] classified normal and abnormal fundus images using a Transformer model, achieving an accuracy of 85.7%. [37] put forth a model that interprets visual neural activity induced by natural images in the form of descriptive text. In [30], a deep-learning methodology utilizing the Swin Transformer attained a classification accuracy of 0.986 and an AUC of 0.99 in classifying OSCC on clinical photographs. [22] provides an extensive overview of cutting-edge ViTs investigated for histopathological image analysis, covering applications such as segmentation, classification, and survival risk regression.

In our comprehensive review, it is evident that researchers strive to achieve promising diagnostic accuracy through diverse methods. Consequently, we have tailored the ViT framework for enhanced oral cancer detection.

3 Proposed methodology

Figure 3 shows the workflow of the classification methodology. We used the Vision Transformer architecture inspired by [20] to classify oral histopathology images into normal and OSCC classes and named it ViT-14, where 14 denotes the patch size. In this study, we also compared the effectiveness of the proposed approach against eight pre-trained DL models: Xception [38], Resnet50 [11], InceptionV3 [39], InceptionResnetV2 [40], Densenet121 [41], Densenet169 [41], Densenet201 [41], and EfficientNetB7 [13].

3.1 Dataset description

An oral cancer histopathological imaging dataset is publicly available in [42]. It has three directories, namely train, test, and val. We have utilized the train directory [43], as in the study [24]. The considered oral histopathology dataset contains two types of subjects: patients with oral squamous cell carcinoma and healthy subjects. Table 3 shows the number of images in the dataset, and Figs. 1 and 2 show samples from both dataset categories.

Table 3 Number of oral histopathology images

3.2 Preprocessing and data augmentation

All images in the dataset have been resized to \(224 \times 224\) pixel resolution. Data augmentation is employed to increase the image count, since training on the original dataset alone would lead to overfitting. The Keras DL toolbox provides the ImageDataGenerator function to generate images with appropriate data augmentation. The resized oral histopathology images undergo augmentation techniques such as normalization, random rotation, zooming, horizontal flipping, and height and width shifts to enhance the generalizability of the model. The details of the augmentation techniques are shown in Table 4.
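A minimal sketch of such an augmentation pipeline with Keras' ImageDataGenerator is given below. The exact ranges are those of Table 4; the numeric values and the directory path shown here are illustrative placeholders, not the settings of this study.

```python
# Sketch of the augmentation pipeline; the ranges below are placeholders,
# the actual values used in the study are listed in Table 4.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalization
    rotation_range=20,        # random rotation (degrees)
    zoom_range=0.2,           # random zoom
    horizontal_flip=True,     # random horizontal flip
    height_shift_range=0.1,   # random vertical shift
    width_shift_range=0.1,    # random horizontal shift
)

# Assumed directory layout with one sub-folder per class (Normal / OSCC).
train_generator = train_datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),   # images are resized to 224 x 224
    batch_size=32,
    class_mode="binary",
)
```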

3.3 Description of the ViT-14 model used in our proposed work

After pre-processing and data augmentation, the images are split into non-overlapping patches, inspired by the [19] architecture, before being fed to the encoder section. However, non-overlapping partitioning partly breaks the internal structure of an image [44]. Multi-headed self-attention (MSA) blocks alleviate this issue by integrating information from several patches. Additionally, feeding non-overlapping patches into the Transformer avoids computational redundancy. In our study, an input image of size 224 \(\times \) 224 \(\times \) 3 (H=224, W=224, C=3) is split into flattened patches of size 588 (\(P^{2}C\), where P=14, C=3). Thus, 256 patches (N=HW/\(P^{2}\)) are generated before entering the Transformer encoder section. Note that the sequence length of the Transformer is inversely proportional to the square of the patch size; hence, models with smaller patch sizes require more computation.
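The patch arithmetic above can be verified with a short TensorFlow sketch: a 224 \(\times \) 224 \(\times \) 3 image with P=14 yields 256 non-overlapping patches, each flattened to 588 values.

```python
# Verifying the patch arithmetic: N = HW / P^2 = 256, flattened size = P^2 * C = 588.
import tensorflow as tf

P = 14
images = tf.random.uniform((1, 224, 224, 3))          # dummy batch of one image

patches = tf.image.extract_patches(
    images,
    sizes=[1, P, P, 1],
    strides=[1, P, P, 1],                             # non-overlapping patches
    rates=[1, 1, 1, 1],
    padding="VALID",
)
patches = tf.reshape(patches, (1, -1, P * P * 3))     # (batch, N, P^2 * C)
print(patches.shape)                                  # (1, 256, 588)
```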

The resulting flattened patches are used to create linear embeddings in a lower-dimensional latent space (D) of size 64, known as patch embeddings. The size of the latent space remains constant through all layers of the encoder. ViT does not use convolution or recurrence in the multi-head self-attention module of the encoder section; hence, to preserve positional information, position embeddings are added to the patch embeddings using Eq. 1.

$$\begin{aligned} y_{j}=y_{j}+x_{j} \end{aligned}$$
(1)

where \(y_{j}\) denotes the patch embedding of the jth patch, \(x_{j}\) denotes the position embedding of the jth patch, and \(y_{j}, x_{j} \in \mathbb {R}^{D_{y}}\), with \(D_{y}\) the dimensionality of the patch embedding. An additional learnable (class) embedding is also prepended, similar to BERT's [class] token, as shown in Eq. 2. The class of the input image is predicted using this class embedding.

$$\begin{aligned} P_\textrm{o}= [{y_\textrm{class};y_{1};y_{2};...y_{N}}]+[{x_{1};x_{2};...x_{N};x_{N+1}}] \end{aligned}$$
(2)

where \(y_\textrm{class}\) is the additional learnable class embedding and \(P_\textrm{o}^\textrm{o}\) is \(y_\textrm{class}\). The outcome of the Transformer encoder (TE) at the Lth layer (\(L=8\)) is denoted as \(P_{L}^\textrm{o}\). The series of patches is then passed to the TE layer. The TE module is composed of alternating multi-head self-attention (MSA) layers and feed-forward networks (FFN). The patch embeddings pass through the encoder layers as depicted by Eqs. 3 and 4.

$$\begin{aligned} P_{l}^{'}= MSA(LN(P_{l-1}))+P_{l-1} \hspace{1cm} l={1,2,\ldots ,L} \end{aligned}$$
(3)
$$\begin{aligned} P_{l}= FFN(LN(P_{l}^{'}))+P_{l}^{'} \hspace{1cm} l={1,2,\ldots ,L} \end{aligned}$$
(4)

where LN denotes the layer normalization layer. The output of the encoder at the Lth layer, \(P_{L}^\textrm{o}\), is layer normalized and passed through a learnable classification network known as the multi-layer perceptron (MLP) head, as shown in Fig. 6.

Table 4 Details of data augmentation techniques

In basic terms, the patches split from an input image are transformed into latent vectors of a specified size. Position embeddings are then added to the transformed patch embeddings, and a class token is prepended. The modified input then passes through a chain of encoder layers. A pictorial representation of the ViT-14 model is shown in Fig. 4.
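The following is an illustrative sketch of this embedding step (Eqs. 1 and 2): flattened patches are projected to the latent dimension D=64, a learnable class token is prepended, and learnable position embeddings are added. The layer and variable names are ours, not part of the original implementation.

```python
# Hedged sketch of the patch/position/class embedding step described above.
import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbedding(layers.Layer):
    def __init__(self, num_patches=256, latent_dim=64):
        super().__init__()
        self.projection = layers.Dense(latent_dim)               # linear patch projection
        self.class_token = self.add_weight(
            shape=(1, 1, latent_dim), initializer="zeros", trainable=True
        )
        self.position_embedding = layers.Embedding(
            input_dim=num_patches + 1, output_dim=latent_dim     # +1 for the class token
        )
        self.num_patches = num_patches

    def call(self, flat_patches):                                # (batch, 256, 588)
        y = self.projection(flat_patches)                        # patch embeddings
        batch = tf.shape(y)[0]
        cls = tf.repeat(self.class_token, batch, axis=0)         # prepend class token
        y = tf.concat([cls, y], axis=1)                          # (batch, 257, 64)
        positions = tf.range(self.num_patches + 1)
        return y + self.position_embedding(positions)            # Eq. (2)
```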

In the TE block, the embedded patches pass through the MSA layer and the feed-forward network. A residual connection [11] is added around both the MSA layer and the FFN, with layer normalization [45] applied before each. The TE module is shown in Fig. 5. Multi-headed attention boosts the performance of the model by performing multiple self-attention operations simultaneously. Each self-attention operation serves as a head, and each head tries to learn something unique, thus improving the representational power of the encoder module. Therefore, the model is able to capture intricate correlations between patches at distinct locations in a histopathology image. It attends to both local and global features within an image, in contrast to conventional CNN models, which emphasize local attention. The parameters of the adopted ViT-14 model are tabulated in Table 5. Details of the layers, output shapes, and number of parameters are shown in Table 6.
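A minimal sketch of one such encoder layer (Eqs. 3 and 4) is given below: pre-layer normalization, multi-head self-attention, and a feed-forward network, each wrapped in a residual connection. The head count and latent dimension follow Table 5 as stated in the text; the FFN width is an assumption on our part.

```python
# Hedged sketch of one Transformer encoder layer (Eqs. (3)-(4)).
from tensorflow.keras import layers

def transformer_encoder_layer(inputs, latent_dim=64, num_heads=4, ffn_dim=128):
    # MSA sub-layer with residual connection (Eq. 3)
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=latent_dim)(x, x)
    x = layers.Add()([x, inputs])

    # FFN sub-layer with residual connection (Eq. 4)
    y = layers.LayerNormalization(epsilon=1e-6)(x)
    y = layers.Dense(ffn_dim, activation="gelu")(y)
    y = layers.Dense(latent_dim)(y)
    return layers.Add()([y, x])
```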

Fig. 3
figure 3

Proposed classification methodology utilizing Vision Transformer and DL models

Fig. 4
figure 4

Architecture of ViT-14 model for classifying normal and OSCC histopathology images

Fig. 5
figure 5

Transformer encoder for processing oral cancer histopathology image patches with multi-head attention layer

Fig. 6
figure 6

MLP head for classifying oral cancer histopathology images

Table 5 Specifications of the ViT-14 model
Table 6 Details of the names of the layers, the output shape, and the number of parameters used in each layer of the ViT-14 model(a to b layers forms the encoder section and are repeated 8 times)

3.4 Pre-trained deep learning models for comparison

This subsection gives a brief description of the various DL models used in our work for comparative analysis. The architectures of the DL models used in our study are shown in Fig. 7, and Table 7 lists a few details of these models.

Fig. 7
figure 7

Architecture of the deep neural network models for comparative analysis with ViT-14 model using oral cancer histopathology images

3.4.1 Xception

The elementary idea of Inception is pushed to an extreme in the Xception architecture [38]. In Inception, 1\(\times \)1 convolutions were used to extract features from the initial input, and filters of varying sizes were employed at every depth space. Xception reverses this: it applies filters to every depth space independently before compressing the input at once using a 1\(\times \)1 convolution. The feature extraction backbone of the Xception architecture is composed of 36 convolutional layers. The Xception architecture can be summed up as a linear stack of residually connected depthwise separable convolution layers. As a result, developing and altering the architecture is relatively simple.

3.4.2 Resnet50

Resnet50 utilizes a bottleneck framework for its building block. The residual block consists of 1\(\times \)1 convolutions, termed bottlenecks, which reduce the matrix multiplications and parameter count, making the training of each layer considerably faster. Instead of a stack of two layers, it leverages three layers [11]. It is widely known that increasing the depth of a model for deeper feature extraction can reduce performance due to exploding or vanishing gradients. To resolve this issue and enable the training of deeper networks, residual blocks were introduced.
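A hedged sketch of such a bottleneck residual block is shown below: 1\(\times \)1 convolutions compress and then restore the channel dimension around a 3\(\times \)3 convolution, and the block input is added back through a skip connection. Batch normalization is omitted here for brevity, so this is an illustration of the idea rather than the exact Resnet50 block.

```python
# Simplified bottleneck residual block (batch normalization omitted).
from tensorflow.keras import layers

def bottleneck_block(x, filters=64):
    shortcut = x
    y = layers.Conv2D(filters, 1, activation="relu")(x)          # 1x1 bottleneck
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(4 * filters, 1)(y)                         # 1x1 expansion
    if shortcut.shape[-1] != 4 * filters:                        # match channel count
        shortcut = layers.Conv2D(4 * filters, 1)(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))
```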

3.4.3 InceptionV3

InceptionV3 is an image recognition model that achieved an accuracy higher than 77.9% on the ImageNet dataset. It is an optimized and upgraded adaptation of the InceptionV1 model. Factorized convolutions, smaller convolutions, asymmetric convolutions, auxiliary classifiers, and grid size reduction form the architecture of InceptionV3 [39].

3.4.4 InceptionResnetV2

A convolutional neural network known as InceptionResNetV2 expands on the Inception group of architectures while incorporating residual connections. It replaces the filter concatenation step of the Inception model [40].

3.4.5 Densenet121/169/201

DenseNets are deep CNNs that ease the training of deeper networks by connecting the feature map of each layer with all the layers preceding it [41]. This increases effectiveness with regard to memory utilization and computation, and allows fine-grained features of the input images to be extracted with few channels. DenseNet further improves feature propagation, increases feature reuse, significantly lowers the number of parameters, and mitigates the vanishing gradient issue and its impacts.

3.4.6 EfficientNetB7

EfficientNetB7 belongs to a family of models obtained through neural architecture search that optimizes accuracy and floating point operations per second (FLOPS) by balancing resolution, network depth, and width. The architecture uses seven inverted residual block stages, each with its own parameters; these blocks employ the swish activation and squeeze-and-excitation blocks [13].

Table 7 A brief introduction to the DL models used along with the ViT-14 for a comparative study in oral histopathology image classification

4 Experiments and analysis

All studies related to this work were carried out using Python 3.7.6, TensorFlow 2.7.0, and Keras 2.7.0 on a PC with 2.40 GHz Intel(R) Core(TM) i5-1135G7 processor, Intel(R) Iris(R) Xe graphics and 16.0 GB of RAM.

4.1 Evaluation indicators

The overall performance of our proposed approach is evaluated on the basis of the confusion matrix. Four terms, namely true positive (TP), false positive (FP), false negative (FN), and true negative (TN), are included in this evaluation matrix. TP means a person has OSCC and the model predicts it accurately. TN means a person has healthy oral mucosa and the model predicts it accurately. FP means healthy oral mucosa is inaccurately predicted as OSCC. FN means OSCC is inaccurately predicted as healthy oral mucosa. The evaluation indicators specificity, sensitivity, F1-score, precision, Cohen's kappa score (CKS), Matthews correlation coefficient (MCC), error rate, false omission rate (FOR), false discovery rate (FDR), negative predictive value (NPV), false negative rate (FNR), and false positive rate (FPR) were computed to assess the performance of our proposed approach. These indicators are calculated using the formulas given below; a compact sketch computing them from confusion-matrix counts follows the list.

  1.

    Precision: It represents the proportion of accurately predicted positive instances out of the total instances predicted as positive.

    $$\begin{aligned} {\text {Precision}}=\frac{{\text {TP}}}{{\text {TP}}+{\text {FP}}} \end{aligned}$$
    (5)
  2.

    Sensitivity: It denotes the proportion of accurately predicted positive instances relative to all instances in the actual positive class.

    $$\begin{aligned} {\text {Sensitivity}}=\frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}} \end{aligned}$$
    (6)
  3.

    Specificity: It measures the ability of the model to correctly identify true negatives.

    $$\begin{aligned} {\text {Specificity}}=\frac{{\text {TN}}}{{\text {TN}}+{\text {FP}}} \end{aligned}$$
    (7)
  4.

    Accuracy: It measures the ratio of accurately identified images to the total number of test images.

    $$\begin{aligned} {\text {Accuracy}}=\frac{{\text {TP}}+{\text {TN}}}{{\text {TP}}+{\text {TN}}+{\text {FP}}+{\text {FN}}} \end{aligned}$$
    (8)
  5.

    F1-Score: It is the harmonic mean of precision and recall and serves as a means to balance the model between recall and precision.

    $$\begin{aligned} F1 \; {\text {Score}} =\frac{2 * {\text {Precision}} * {\text {Recall}}}{{\text {Precision}}+{\text {Recall}}} \end{aligned}$$
    (9)
  6.

    Cohen Kappa score: It is a metric used to measure the agreement between predicted and actual classifications while accounting for the possibility of random agreement.

    $$\begin{aligned} {\text {CKS}} = \frac{P_\textrm{o}-P_\textrm{e}}{1-P_\textrm{e}} \end{aligned}$$
    (10)

    where \(P_\textrm{o}\) is the observed agreement and \(P_\textrm{e}\) is the expected agreement.

  7.

    MCC: It is a correlation coefficient between the observed and predicted binary classifications.

    $$\begin{aligned} {\text {MCC}} = \frac{{\text {TP}} \times {\text {TN}} - {\text {FP}} \times {\text {FN}}}{\sqrt{({\text {TP}}+{\text {FP}})({\text {TP}}+{\text {FN}})({\text {TN}}+{\text {FP}})({\text {TN}}+{\text {FN}})}} \end{aligned}$$
    (11)
  8.

    Error rate: It provides a measure of misclassification.

    $$\begin{aligned} {\text {Error \; rate}}=1-{\text {Accuracy}} \end{aligned}$$
    (12)
  9.

    False omission rate (FOR): It is the proportion of false negatives out of the total actual negative instances.

    $$\begin{aligned} {\text {FOR}} = \frac{{\text {FN}}}{{{\text {FN}} + {\text {TN}}}} \end{aligned}$$
    (13)
  10.

    False discovery rate (FDR): It is the proportion of false positives out of the total predicted positive instances.

    $$\begin{aligned} {\text {FDR}} = \frac{{\text {FP}}}{{{\text {FP}} + {\text {TP}}}} \end{aligned}$$
    (14)
  11.

    Negative predictive value (NPV): It is the proportion of correctly predicted negative instances out of the total predicted negative instances.

    $$\begin{aligned} {\text {NPV}} = \frac{{\text {TN}}}{{{\text {TN}} + {\text {FN}}}} \end{aligned}$$
    (15)
  12.

    False negative rate (FNR): It is the proportion of false negatives out of the total actual positive instances.

    $$\begin{aligned} {\text {FNR}} = \frac{{\text {FN}}}{{{\text {FN}} + {\text {TP}}}} \end{aligned}$$
    (16)
  13.

    False positive rate (FPR): It is the proportion of false positives out of the total actual negative instances.

    $$\begin{aligned} {\text {FPR}} = \frac{{\text {FP}}}{{{\text {FP}} + {\text {TN}}}} \end{aligned}$$
    (17)
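The sketch below computes these indicators directly from the confusion-matrix counts, following Eqs. 5 to 17. The function and variable names are ours; CKS (Eq. 10) is omitted since it is defined via the observed and expected agreement rather than the raw counts alone.

```python
# Compute the evaluation indicators of Eqs. (5)-(17) from confusion-matrix counts.
def evaluation_indicators(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                              # recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    )
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1_score": f1_score,
        "mcc": mcc,
        "error_rate": 1 - accuracy,                           # Eq. (12)
        "for": fn / (fn + tn),                                # Eq. (13)
        "fdr": fp / (fp + tp),                                # Eq. (14)
        "npv": tn / (tn + fn),                                # Eq. (15)
        "fnr": fn / (fn + tp),                                # Eq. (16)
        "fpr": fp / (fp + tn),                                # Eq. (17)
    }

# Example with hypothetical counts:
# evaluation_indicators(tp=248, tn=236, fp=8, fn=3)
```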

4.2 Model parameters

We selected sparse categorical cross-entropy as the loss function for our binary classification task. Training is carried out over 100 epochs with the AdamW optimizer. We used a patch size of \(14 \times 14 \times 3\), with each image yielding 256 patches. The Transformer encoder is configured with 4 heads and 8 layers. A batch size of 32, a learning rate of 0.001, and a weight decay of 0.0001 are chosen for model training. Table 8 lists the optimal hyperparameters used in our study.
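A hedged end-to-end sketch of this configuration is given below, reusing the PatchEmbedding layer and transformer_encoder_layer sketched in Sect. 3.3. AdamW is taken from tensorflow_addons, since it is not part of core Keras in TF 2.7; the single-Dense head emitting logits is a simplification of the MLP head in Fig. 6.

```python
# Hedged sketch of the ViT-14 training setup in Table 8 (patch size 14, latent
# dimension 64, 8 encoder layers, 4 heads, AdamW, sparse categorical cross-entropy).
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 3))
flat_patches = layers.Lambda(
    lambda img: tf.image.extract_patches(
        img, sizes=[1, 14, 14, 1], strides=[1, 14, 14, 1],
        rates=[1, 1, 1, 1], padding="VALID")
)(inputs)
flat_patches = layers.Reshape((256, 588))(flat_patches)      # 256 patches of size 588

x = PatchEmbedding()(flat_patches)                           # Sect. 3.3 sketch
for _ in range(8):                                           # 8 encoder layers
    x = transformer_encoder_layer(x, latent_dim=64, num_heads=4)
x = layers.LayerNormalization(epsilon=1e-6)(x)[:, 0]         # class-token output
outputs = layers.Dense(2)(x)                                 # two logits: normal / OSCC

vit_14 = tf.keras.Model(inputs, outputs)
vit_14.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# Batch size 32 is set in the data generator (Sect. 3.2):
# vit_14.fit(train_generator, validation_data=val_generator, epochs=100)
```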

Table 8 Hyperparameters used in the ViT-14 model

4.3 Ablation study on model parameters

We perform an ablation study to analyze how different components and hyperparameters in our proposed model contribute to the overall performance of the model.

4.3.1 Impacts of different parameters

In the initial experimentation phase with the ViT-14 model, default hyperparameters were assumed: a learning rate of 0.001, a batch size of 8, a weight decay of 0.0001, a patch size of \(14 \times 14 \times 3\), a latent dimension of 64, 6 Transformer encoder layers, and 4 heads. Subsequent exploration varied the batch size (8, 16, and 32) while keeping the other parameters constant, and a batch size of 32 yielded the highest accuracy, as outlined in Table 9. Further experiments focused on patch size variations while keeping the other parameters constant, confirming that the initial patch dimensions of \(14 \times 14 \times 3\) achieved the highest accuracy. Likewise, alternative latent dimensions of 16 and 32 were explored while keeping the other parameters constant, with the initial choice of 64 demonstrating the highest accuracy, as illustrated in Table 9. Once optimized values for batch size, patch size, and latent dimension were obtained, experiments on the number of layers indicated that 8 layers outperformed 6 and 10, as detailed in Table 9. Finally, experiments on the number of heads, exploring values of 6 and 8, validated the initial choice of 4 as yielding the highest accuracy, as indicated in Table 9.

Table 9 Accuracy of the ViT-14 model using different hyperparameters

4.4 Results

After obtaining the optimal model hyperparameters, we have evaluated the performance of the model. The dataset is divided into two subsets: the training set, comprising 90% of the data, with 10% of this subset allocated for validation; and the testing set, which constitutes the remaining 10% as shown in Table 10.

Table 10 Number of training, validation and testing images in ViT-14 model

It is extremely important that the model does not exhibit significant overfitting, to ensure the overall effectiveness of the proposed method. Figure 8a shows the training and validation accuracy and loss curves plotted over 100 epochs. The model exhibits no major overfitting, and robustness is maintained. The confusion matrix (CM) is displayed in Fig. 8b, which further helps in understanding the results. It can be inferred from the CM that FP and FN are far fewer than TN and TP, indicating correct predictions of images into the normal and OSCC classes. Table 11 shows the evaluation metrics of the proposed model: the ViT-14 model achieved an accuracy, specificity, and sensitivity of 97.78%, 96.72%, and 98.80%, respectively.

Table 11 Evaluation metrics of the ViT-14 model
Fig. 8
figure 8

a Training and validation accuracy vs epoch plot of ViT-14 model over 100 epochs (top); Loss vs epoch plot of ViT-14 model over 100 epochs (bottom) b Confusion matrix of the ViT-14 model

In our study, we implemented a fivefold cross-validation methodology, running the model five times to ensure a comprehensive and robust evaluation of its generalization to unseen data. For each iteration, the dataset was shuffled to create unique training and test sets, with the training set comprising 90% of the data and the test set the remaining 10%. Evaluation metrics were computed in each iteration on the assigned test set, offering a thorough assessment of the performance of the model across diverse data splits. The CMs generated from each of the five folds are displayed in Fig. 9, which further helps in understanding the results. Evaluation metrics for the ViT-14 model using fivefold cross-validation are listed in Tables 12 and 13.
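One way to realize the described protocol, i.e., five runs on freshly shuffled 90:10 splits with metrics computed per run, is sketched below. The names `images`, `labels`, and `build_and_train` are placeholders for the dataset arrays and the ViT-14 training routine, not code from this study.

```python
# Sketch of five shuffled 90:10 train/test runs with per-run accuracy.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=42)
fold_accuracies = []

for train_idx, test_idx in splitter.split(images, labels):
    model = build_and_train(images[train_idx], labels[train_idx])   # placeholder
    predictions = np.argmax(model.predict(images[test_idx]), axis=1)
    fold_accuracies.append(accuracy_score(labels[test_idx], predictions))

print("mean accuracy:", np.mean(fold_accuracies))
```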

Fig. 9
figure 9

The confusion matrices of the five folds of the cross-validation technique for the ViT-14 proposed approach

Table 12 Evaluation metrics for the ViT-14 model using fivefold cross-validation technique

4.4.1 Impacts of split ratio

The model underwent evaluation with different combinations of training and testing ratios. This evaluation retained consistency with the same set of hyperparameters, as detailed in Table 8. This approach allowed for an assessment of its performance under different data split scenarios while keeping the experimental conditions uniform.

Table 13 Classification report for the ViT-14 model using fivefold cross-validation technique

Case 1 (90:10) Training images constitute 90%, and testing images make up 10% of the entire dataset. For the ViT-14 model, 4451 images were used for training, and 495 for testing. The model underwent five runs to assess generalizability, and the resulting average accuracy values are detailed in Table 14.

Case 2 (80:20) Training images constitute 80%, and testing images make up 20% of the entire dataset. For the ViT-14 model, 3956 images were used for training, and 990 for testing. The model underwent five runs to assess generalizability, and the resulting average accuracy values are detailed in Table 14.

Case 3 (70:30) Training images account for 70%, with testing images at 30% of the overall dataset. The ViT-14 model was trained on 3462 images and tested on 1484. Similar to Case 1, the model underwent five runs to evaluate generalizability, and the corresponding average accuracy values are presented in Table 14.

It is observed from Table 14 that the accuracy appears to decrease as the proportion of training data decreases relative to testing data. Thus, a higher proportion of training data (90:10 ratio) contributes to better model performance.

Table 14 Average accuracy of ViT-14 model using fivefold cross validation technique for different split ratios (train:test)

4.5 Comparative analysis of model performance across different datasets

After the ablation study, the optimal model configuration is used for further analysis on two publicly available oral cancer histopathological datasets.

Dataset 1 [46] was collected from a histopathological image repository of normal oral cavity epithelium and OSCC images. The repository consists of 1224 images in total, divided into two sets at two different magnifications, 100x and 400x. In total, there are 290 normal epithelium images and 934 OSCC images.

Dataset 2 is an oral cancer histopathological image dataset available in [42]. It comprises three directories: train, test, and val, containing a total of 5192 images. There are 2,494 normal images and 2,698 images with OSCC.

We employed a fivefold cross-validation approach, executing the model five times to ensure a thorough and robust evaluation of its ability to generalize to new data. In each iteration, the dataset was shuffled, creating distinct training and test sets; the training set constituted 90% of the data, while the test set comprised the remaining 10%. Evaluation metrics were calculated in each iteration on the assigned test set, providing a comprehensive assessment of the performance of the model across a variety of data partitions. Tables 15 and 16 present the evaluation metrics for dataset 1 and dataset 2, respectively, using the fivefold cross-validation technique.

Table 15 Evaluation metrics of ViT-14 model using fivefold cross validation technique for Dataset 1 (Acc:Accuracy, Prec:Precision, Sens:Sensitivity, F1:F1-score, Spec:Specificity)
Table 16 Evaluation metrics of ViT-14 model using fivefold cross-validation technique for Dataset 2 (Acc:Accuracy, Prec:Precision, Sens:Sensitivity, F1:F1-score, Spec:Specificity)

4.6 Comparison with deep learning models

The proposed approach is compared with eight pre-trained DL models to demonstrate its effectiveness. The hyperparameters are listed in Table 18. The dataset is divided into two subsets: the training set, comprising 90% of the data, with 10% of this subset allocated for validation; and the testing set, which constitutes the remaining 10%, as shown in Table 17. The training-to-testing split ratio of 9:1 was maintained, consistent with the proposed ViT-14 method. We selected binary cross-entropy as the loss function for this binary classification task. The Adam optimizer is used for training over 100 epochs. Our main objective during model training was to reduce the generalization gap between training loss and validation loss. A batch size of 32 and a learning rate of 0.001 are used. Additionally, a dropout rate of 0.2 is used to address overfitting during training [29]. The model weights with the lowest validation loss are saved for evaluation. We adhered to the original architectural specifications of the convolutional filters, padding, pooling, and strides in the Xception, Resnet50, InceptionV3, InceptionResnetV2, Densenet121/169/201, and EfficientNetB7 models.
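A hedged sketch of this baseline setup is shown below for one backbone (DenseNet121); the other pre-trained models are configured analogously. The pooling choice and head size are our assumptions; ImageNet weights, dropout 0.2, Adam, binary cross-entropy, and checkpointing on the lowest validation loss follow the text.

```python
# Sketch of one pre-trained baseline with a binary classification head.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.DenseNet121(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),                       # to mitigate overfitting
    layers.Dense(1, activation="sigmoid"),     # binary output: normal vs. OSCC
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_densenet121.h5", monitor="val_loss", save_best_only=True
)
# model.fit(train_generator, validation_data=val_generator,
#           epochs=100, callbacks=[checkpoint])
```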

We utilized models pre-trained on the ImageNet dataset for the analysis of the DL models on the oral histopathology dataset. The training and validation accuracy and loss curves over 100 epochs are displayed in Figs. 10 and 11. The confusion matrices (CM) for the DL models were also computed to help interpret the results, as shown in Fig. 12. It can be inferred from the CMs that FP and FN have increased in comparison to the ViT-14 model, indicating fewer correct predictions of images into the normal and OSCC classes. Table 19 lists the evaluation measures of the compared DL models and the ViT-14 model. Table 20 shows the superior performance of the ViT-14 model in terms of accuracy, specificity, and sensitivity in comparison to the DL models.

Table 17 Number of training, validation and testing images in DL models
Table 18 Hyperparameters used in the considered deep learning models
Fig. 10
figure 10

The convergence behavior of the Deep learning models used for comparative analysis

Fig. 11
figure 11

The convergence behavior of the DL models used for comparative study

Fig. 12
figure 12

The confusion matrices of the DL models used for comparative study

4.7 Comparison with previous works

Table 21 provides a comprehensive comparative analysis of diverse methods and models applied to various publicly available oral cancer datasets. In previous research [23], transfer learning with Resnet50, MobileNet, and InceptionV3 achieved accuracies ranging from 76.61% to 91.13% on a dataset containing 290 normal and 934 OSCC images [46]. A customized 10-layer CNN [25] attained a higher accuracy of 97.82% on the same dataset [46]. A hybrid approach involving CNNs and SVM and integrating deep and texture-based features [47] demonstrated an accuracy of 97.00% on 2698 OSCC images and 2494 healthy tissue images [42]. Additionally, a Gabor filter combined with a CatBoost classifier [48] achieved 94.92% accuracy on the same dataset [42]. A transformer with external attention [49] attained an accuracy of 96.97% on 2511 OSCC images and 2435 healthy tissue images [43]. Transfer learning using AlexNet [24] achieved 90.06% accuracy on the same set [43]. While the proposed method demonstrated an accuracy of 95.12% on the dataset [46], it is noteworthy that the 10-layer CNN model [25] achieved a higher accuracy of 97.82%. However, the proposed method showed competitive performance on the other datasets, achieving accuracies of 97.69% and 97.78% on datasets [42] and [43], respectively. The reduction in performance on the dataset [46] is largely attributable to class imbalance, which potentially impacts the model's ability to learn and generalize effectively across both classes. While the data augmentation techniques presently employed, such as rotation, zoom, flip, and height and width variations, contribute to model resilience, addressing the class imbalance may require additional augmentation strategies. These could involve techniques such as the synthetic minority over-sampling technique (SMOTE) and generative adversarial networks (GANs) for creating realistic synthetic samples, particularly for the minority class. By implementing such additional data augmentation techniques tailored to address class imbalance, the proposed model is likely to achieve improved generalization and classification accuracy across all datasets, ensuring consistent performance in the presence of varied class distributions.

Table 19 Oral cancer image classification summary report without cross-validation technique

5 Conclusion

Table 20 Comparison of ViT-14 model with other DL models for oral cancer histopathology image classification without cross-validation
Table 21 Comparative analysis with previous research on different publicly available oral cancer histopathology datasets (Acc: Accuracy, Prec: Precision, Sens: Sensitivity)

Histopathological assessment by pathologists stands as the gold standard for detecting oral squamous cell carcinoma (OSCC). However, the intricate morphological variations in cancerous conditions pose a significant challenge for human evaluation. This study is a dedicated effort to aid clinicians in early OSCC identification. While deep learning (DL) models have advanced to enhance various applications for effective medical assessments, the incorporation of attention mechanisms into Vision Transformers (ViTs) introduces a level of precision that is essential in the medical industry, where inaccuracies could have profound consequences. The study introduces ViT-14, a fine-tuned ViT framework, specifically designed for classifying oral histopathology images into normal and OSCC categories across diverse publicly available datasets. The ViT-14 model demonstrates performance on par with or exceeding that of state-of-the-art models, emphasizing its effectiveness in early oral cancer detection using histopathological images. This study not only underscores the capabilities of ViTs in the field of medical imaging but also establishes ViT-14 as a promising instrument to assist clinicians in achieving more precise and timely diagnoses in cases of oral cancer.

The potential for enhancing oral cancer classification with fine-tuned ViT models is promising, but it is crucial to recognize certain limitations. Limited and imbalanced datasets may hinder generalization, and interpreting complex models like ViT remains difficult. Class imbalance and the "black-box" nature of these models can introduce bias and limit explainability. Computational demands pose challenges for resource-limited institutions, and integrating these models into clinical workflows requires addressing privacy and regulatory issues. Despite these challenges, the future outlook is promising, with ongoing efforts to overcome these limitations through the accumulation of more diverse and expansive datasets, advancements in model interpretability, and optimization of computational efficiency for broader applicability in clinical settings.