1 Introduction

Cancers in the oral cavity or the oropharynx are among the most common malignancies in the world [27, 30]. As with cervical cancer, visual inspection of brush-collected samples has been shown to be a practical and effective approach for early diagnosis and reduced mortality [25]. We, therefore, work towards introducing screening of high-risk patients in General Dental Practice by dentists and dental hygienists. Computer-assisted cytological examination is essential for the feasibility of this project, due to the large amounts of data and the high costs involved [26].

Whole slide imaging (WSI) refers to the scanning of conventional microscopy glass slides to produce digital slides. WSI is gaining popularity among pathologists worldwide, due to its potential to improve diagnostic accuracy, increase workflow efficiency, and improve integration of images into information systems [5]. WSI produces very large amounts of data, typically images of around 100,000 \(\times \) 100,000 pixels containing up to 100,000 cells, so manipulation and analysis are challenging and require special techniques. In spite of these challenges, the ability to reproduce the traditional light microscopy experience in digital format makes WSI a very appealing choice.

Deep learning (DL) has been shown to perform very well in cancer classification. An important advantage, compared to (classic) model-based approaches, is that no nucleus segmentation is needed; segmentation is a difficult task otherwise typically required prior to feature extraction. At the same time, the large amount of data provided by WSI makes DL a natural and favorable choice. In this paper we present a complete, fully automated, DL-based, segmentation-free pipeline for oral cancer screening on WSIs.

2 Background and Related Work

A number of studies suggest using DL for classification of histology WSI samples [1, 2, 18, 28]. A common approach is to split tissue WSIs into smaller patches and perform analysis at the patch level. Cytological samples are, however, rather different from tissue. For tissue analysis, the larger-scale arrangement of cells is important, and region segmentation and processing is natural. For cytology, in contrast, the extra-cellular morphology is lost and cells are essentially individual (especially for liquid-based samples); the natural unit of examination is the cell.

Cytology generally has slightly higher resolution requirements than histology; texture is very important and accurate focus is therefore essential. On the other hand, the auto-focus of slide scanners works much better for tissue samples, which form more or less flat surfaces. In cytology, cells partly overlap and lie at different z-levels. Tools for tissue analysis rarely support z-stacks (focus level stacks) or provide means for handling them. In this work we present a carefully designed, complete chain of processing steps for handling cytology WSIs acquired at multiple focus levels, including cell detection, per-cell focus selection, and CNN-based classification.

Malignancy-associated changes (MACs) are subtle morphological changes that occur in histologically normal cells due to their proximity to a tumor. MACs have been shown to be reproducibly measurable via image cytometry for numerous cancer types [29], making them potentially useful as diagnostic biomarkers. Using a random forest classifier, [15] reliably detected MACs in histologically normal (normal-appearing) oropharyngeal epithelial cells located in tissue samples adjacent to a tumor, and suggests using the approach as a noninvasive means of detecting early-stage oropharyngeal tumors. Reliance on MACs enables using the patient-level diagnosis for training a cell-level classifier, where all cells of a patient are assigned the same label (either cancer or healthy) [32]. This hugely reduces the burden of otherwise very difficult and laborious manual annotation at the cell level.

Cell Detection: State-of-the-art object detection methods, such as the R-CNN family [7, 8, 23] and YOLO [20,21,22], have shown satisfactory performance on natural images. However, being designed for computer vision, where perspective changes the size of objects, we find them not ideal for cell detection in microscopy images. Although it is appealing to learn classification end-to-end directly from the input images, such that the network jointly learns region of interest (RoI) selection and classification, for cytology WSIs this is rather impractical. The classification task is very difficult and requires tens of thousands of cells to reach top performance, while a per-cell RoI detector is much easier to train (far fewer annotated cell locations are needed), requires less detail, and can be run at lower resolution (thus faster). Jointly training localization and classification would require (manual) localization of the full tens of thousands of cells. Our proposal, relying on patient-level annotations for the difficult classification task, reaches good performance using only around 1000 manually marked cell locations. Methods that detect objects of varying size and predict bounding boxes also spend unnecessary computation, since all cell nuclei are of similar size and bounding boxes are not of interest in diagnosis. Further, these methods tend not to handle very large numbers of small, clustered objects well [36].

Many DL-based methods specifically designed for the task of nucleus detection are similar to the framework summarized in [12]: first generate a probability map by sliding a binary patch classifier over the whole image, then find nuclei positions as local maxima. However, considering that WSIs are as large as 10 gigapixels, this approach is prohibitively slow. U-Net models avoid the sliding window and reduce computation time. Detection is performed as segmentation, where each nucleus is marked as a binary disk [4]. However, when images are noisy and nuclei are densely packed, the binary output mask is not ideal for further processing. We find the regression approach [16, 33, 34], where the network is trained to reproduce fuzzy nucleus markers, to be more appropriate.

Focus Selection: In cytological analysis, the focus level has to be selected for each nucleus individually, since different cells lie at different depths. Standard tools (e.g., the microscope auto-focus) fail since they only provide a large field-of-view optimum, and often focus on clumps or other artifacts. Building on the approaches of Just Noticeable Blur (JNB) [6] and Cumulative Probability of Blur Detection (CPBD) [19], the Edge Model based Blur Metric (EMBM) [9] provides a no-reference blur metric, using a parametric edge model to detect and describe edges with both contrast and width information. It is claimed to achieve results comparable to the former two while being faster.

Classification: Deep learning has successfully been used for different types of cell classification [10], and for cervical cancer screening in particular [35]. Convolutional Neural Networks (CNNs) have shown the ability to differentiate healthy and malignant cell samples [32]. Whereas the approach in [32] relies on manually selected free-lying cells, our study proposes to use automatic cell detection. This allows improved performance by scaling up the available data to all free-lying cells in each sample.

3 Materials and Methods

3.1 Data

Three sets of images of oral brush samples are used in this study. Dataset 1 is a relatively small Pap smear dataset imaged with a standard microscope. Dataset 2 consists of WSIs of the same glass slides as Dataset 1. Dataset 3 consists of WSIs of liquid-based cytology (LBC) slides. All samples were collected at the Dept. of Orofacial Medicine, Folktandvården Stockholms län AB. From each patient, samples were collected with a brush scraped at areas of interest in the oral cavity. Each scrape was either smeared onto a glass slide (Datasets 1 and 2) or placed in a liquid vial (Dataset 3). All samples were stained with standard Papanicolaou stain. Dataset 3 was prepared with Hologic T5 ThinPrep equipment and the standard non-gynecological protocol. Dataset 1 was imaged with an Olympus BX51 bright-field microscope with a \(20\times \), 0.75 NA lens, giving a pixel size of 0.32 \(\upmu \text {m}\). From 10 Pap smears (10 patients), free-lying cells (the same as in “Oral Dataset 1” in [32]) were manually selected and \(80\,\times \,80\,\times \,1\) grayscale patches were extracted, each with one centered in-focus cell nucleus. Dataset 2: The same 10 slides as in Dataset 1 were imaged using a NanoZoomer S60 digital slide scanner, \(40\times \), 0.75 NA objective, at 11 z-offsets (±2 \(\upmu \text {m}\), step size 0.4 \(\upmu \text {m}\)), providing RGB WSIs of size \(103936 \,\times \,107520\,\times \,3\), 0.23 \(\upmu \text {m}\)/pixel. Dataset 3 was obtained in the same way as Dataset 2, but from 12 LBC slides from 12 other patients.

Slide-level annotation combined with reliance on MACs appears to be a useful way to avoid the need for large-scale, very difficult, manual cell-level annotation. Both [15] and [32] demonstrate promising results for MAC detection in histology and cytology. In our work we therefore aim to classify cells based on the patient diagnosis, i.e., all cells from a patient with diagnosed oral cancer are labeled as cancer.

3.2 Nucleus Detection

The nucleus detection step aims to efficiently detect each individual cell nucleus in WSIs. The detection is inspired by the Fully Convolutional Regression Networks (FCRNs) approach proposed in [33] for cell counting. The main steps of the method are described below, and illustrated on an example image from Dataset 3, Fig. 1.

Training: Input is a set of RGB images \(I_i, i = 1\ldots K\), and corresponding binary annotation masks \(B_i\), where each individual nucleus is indicated by one centrally located pixel.

Each ground truth mask is dilated by a disk of radius r [4], followed by convolution with a 2D Gaussian filter of width \(\sigma \); this generates a fuzzy ground truth. A fully convolutional network is trained to learn a mapping between the original image I (Fig. 1a) and the corresponding “fuzzy ground truth” D (Fig. 1b). The network follows the architecture of U-Net [24], but with the final softmax replaced by a linear activation function.

Inference: A corresponding density map \(D'\) (Fig. 1c) is generated (predicted) for any given test image I. The density map \(D'\) is thresholded at a level T, and centroids of the resulting blobs indicate the detected nucleus locations (Fig. 1d).
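A minimal sketch of these two steps, assuming NumPy/SciPy/scikit-image and a trained network that maps an image to its density map (function names are illustrative; parameter values follow Sect. 4.1):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.measure import label, regionprops
from skimage.morphology import binary_dilation, disk

def fuzzy_ground_truth(point_mask, r=15, sigma=1.0):
    # Dilate single-pixel nucleus markers by a disk of radius r,
    # then convolve with a 2D Gaussian of width sigma.
    dilated = binary_dilation(point_mask.astype(bool), disk(r))
    return gaussian_filter(dilated.astype(np.float32), sigma=sigma)

def detect_nuclei(density_map, T=0.59):
    # Threshold the predicted density map D' at level T; blob centroids
    # are the detected nucleus locations, as (row, col) coordinates.
    blobs = label(density_map > T)
    return [region.centroid for region in regionprops(blobs)]
```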

Fig. 1. A sample image at different stages of nucleus detection (Color figure online)

3.3 Focus Selection

Slide scanners do not provide sufficiently good focus for cytological samples, and a focus selection step is needed. Our proposed method utilizes N equidistant z-levels acquired of the same specimen. Traversing the z-levels, the change between consecutive images shows the largest variance where the specimen moves into/out of focus. This novel focus selection approach provides a clear improvement over the Edge Model based Blur Metric (EMBM) proposed in [9].

Following the Nucleus detection step (which is performed at the central focus level, \(z=0\)) we cut out a square region for each detected nucleus at all acquired focus levels. Each such cutout image is filtered with a small median filter of size \(m\!\times \!m\) on each color channel to reduce noise. This gives us a set of images \(P_i\), \(i = 1,\dots ,N\), of an individual nucleus at the N consecutive z-levels. We compute the difference of neighboring focus levels, \(P'_{i} = P_{i+1} - P_{i}\), \(i=1,\ldots ,N\!-\!1\). The variance, \(\sigma ^2_i\), is computed for each difference image \(P_i'\):

$$\begin{aligned} \textstyle \sigma ^2_i = \frac{1}{M} \sum _{j=1}^{M}\left( p'_{ij}-\mu _i \right) ^{2} \,, \text { where } \mu _i = \frac{1}{M} \sum _{j=1}^{M} p'_{ij} \,, \end{aligned}$$

M is the number of pixels in \(P'_i\), and \(p'_{ij}\) is the value of pixel j in \(P'_i\). Finally, the level l corresponding to the largest variance is selected, \(l = \arg \max _i \sigma ^2_i\).

To determine which of the two images forming the difference \(P'_l\) is in best focus, we use the EMBM method [9] as a post-selection step to choose between \(P_l\) and \(P_{l+1}\).
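A compact sketch of this procedure (NumPy/SciPy assumed; the EMBM post-selection is omitted and names are illustrative):

```python
import numpy as np
from scipy.ndimage import median_filter

def select_focus_pair(stack, m=3):
    # stack: (N, H, W, C) array holding one nucleus cutout at N consecutive
    # z-levels. Each cutout is median filtered per channel to reduce noise.
    smoothed = np.stack(
        [median_filter(p, size=(m, m, 1)) for p in stack.astype(np.float32)])
    diffs = smoothed[1:] - smoothed[:-1]            # P'_i = P_{i+1} - P_i
    variances = diffs.reshape(len(diffs), -1).var(axis=1)
    # Best focus is P_l or P_{l+1}; EMBM decides between the two afterwards.
    return int(np.argmax(variances))
```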

Fig. 2. Example of focus sequences for experts to annotate

3.4 Classification

The final module of the pipeline is classification of the generated nucleus patches into two classes – cancer and healthy. Following the recommendation of [32], we evaluate ResNet50 [11] as a classifier. We also include the more recent DenseNet201 [13] architecture. In addition to random (Glorot-uniform) weight initialization, we also evaluate the two architectures using weights pre-trained on ImageNet.

Considering that texture information is a key feature for classification [15, 31], the data is augmented without interpolation: during training, each sample is reflected with 50% probability and rotated by a random integer multiple of 90°.
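Such interpolation-free augmentation can be expressed, for instance, with TensorFlow image ops (a sketch; the original implementation may differ):

```python
import tensorflow as tf

def augment(patch):
    # Reflect with 50% probability, then rotate by a random integer
    # multiple of 90 degrees; both operations only permute pixels,
    # so no interpolation occurs and texture is preserved.
    patch = tf.image.random_flip_left_right(patch)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(patch, k=k)
```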

4 Experimental Setup

4.1 Nucleus Detection

The WSIs at the middle z-level (\(z=0\)) are used for nucleus detection. Each WSI is split into an array of \(6496\!\times \!3360\!\times \!3\) sub-images using the open-source tool ndpisplit [3]. The model is trained on 12 and tested on 2 sub-images (1014 resp. 119 nuclei) from Dataset 3. The manually marked ground truth is dilated by a disk of radius \(r\!=\!15\). All images, including ground truth masks, are resized to \(1024\!\times \!512\) pixels using area-weighted interpolation. A Gaussian filter, \(\sigma \!=\!1\), is applied to each ground truth mask, providing the fuzzy ground truth D.

Each image is normalized by subtracting the mean and dividing by the standard deviation of the training set. Images are augmented by random rotation in the range \(\pm {}{30^{\circ }}{}\), random horizontal and vertical shifts within \(30\%\) of the total scale, random zoom within \(30\%\) of the total size, and random horizontal and vertical flips. Nucleus detection does not rely on fine texture details, so interpolation does no harm here. To improve training stability, batch normalization [14] is added before each activation layer. Training is performed using RMSprop with mean squared error as loss function, learning rate \(\alpha = 0.001\), and decay rate \(\rho = 0.9\). The model is trained with mini-batch size 1 for 100 epochs; the checkpoint with minimum training loss is used for testing.
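In Keras terms, this training configuration corresponds roughly to the following (a sketch; `unet`, `train_images`, and `fuzzy_targets` are assumed to be the model and data described above):

```python
import tensorflow as tf

# RMSprop with MSE loss against the fuzzy ground truth (Sect. 3.2).
unet.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9),
    loss='mse')
# Keep the checkpoint with minimum training loss for testing.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'unet_best.h5', monitor='loss', save_best_only=True)
unet.fit(train_images, fuzzy_targets, batch_size=1, epochs=100,
         callbacks=[checkpoint])
```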

Performance of nucleus detection is evaluated on Dataset 3. A detection is considered correct if its closest ground truth nucleus is within the cropped patch and that ground truth nucleus has no closer detection (such that one true nucleus is paired with at most one detection).
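One possible implementation of this one-to-one pairing is a greedy matching by distance (a sketch under the stated criterion; the distance threshold corresponds to the patch extent):

```python
import numpy as np
from scipy.spatial.distance import cdist

def evaluate_detections(detected, gt, max_dist):
    # Greedily pair each ground-truth nucleus with its closest unpaired
    # detection within max_dist, so one true nucleus is matched by at most
    # one detection. Returns (true pos., false pos., false neg.) counts.
    tp = 0
    if len(detected) and len(gt):
        d = cdist(np.asarray(gt, float), np.asarray(detected, float))
        while d.min() <= max_dist:
            i, j = np.unravel_index(np.argmin(d), d.shape)
            tp += 1
            d[i, :] = np.inf  # ground-truth nucleus i is now paired
            d[:, j] = np.inf  # detection j is now paired
    return tp, len(detected) - tp, len(gt) - tp
```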

Fig. 3. Results of nucleus detection

4.2 Focus Selection

100 detected nuclei are randomly chosen from the two test sub-images (Dataset 3). Every nucleus is cut to an \(80\!\times \!80\!\times \!3\) patch at each of the 11 z-levels. For the EMBM method, the contrast threshold of a salient edge is set to \(c_T=8\), following [9].

To evaluate the focus selection, 8 experts are asked to choose the best of the 11 focus-levels for each of the 100 nuclei (Fig. 2). The median of the 8 assigned labels is used as true best focus, \(l_{GT}\). A predicted focus level l is considered accurate enough if \(l \in [l_{GT}-2,l_{GT}+2]\).
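This evaluation criterion can be written compactly as follows (a sketch; array names are illustrative):

```python
import numpy as np

def focus_selection_accuracy(predicted, expert_levels, tol=2):
    # predicted: (n,) selected focus levels; expert_levels: (n, 8) per-expert
    # annotations. The median expert label l_GT is the true best focus; a
    # prediction within +/- tol levels of l_GT counts as correct.
    l_gt = np.median(expert_levels, axis=1)
    return float(np.mean(np.abs(predicted - l_gt) <= tol))
```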

4.3 Classification

The classification model is first evaluated on Dataset 1 as a benchmark, and then on Dataset 2, to evaluate the effectiveness of the nucleus detection and focus selection modules in comparison with the performance on Dataset 1. The model is also run on Dataset 3 to validate the generality of the pipeline. Datasets are split at the patient level; no cell from the same patient exists in both training and test sets. On Datasets 1 and 2, three-fold validation is used, following [32]. On Dataset 3, two-fold validation is used. Our trained nucleus detector with threshold \(T = 0.59\) (best-performing in Sect. 5.1) is used on Datasets 2 and 3 to generate nucleus patches. Some cells in Datasets 2 and 3 lie outside the \(\pm {2\,\mathrm{{\upmu \text {m}}}}\) imaged z-range, so even their best focus level is still rather blurred. We use the EMBM to exclude the most blurred ones: cell patches with an EMBM score \(<0.03\) are removed, leaving 68,509 cells in Dataset 2 and 130,521 in Dataset 3.

We use the Adam optimizer with cross-entropy loss and parameters as suggested in [17], i.e., initial learning rate \(\alpha =0.001\), \(\beta _1=0.9\), \(\beta _2=0.999\). \(10\%\) of the training set is randomly chosen as validation set.

When using models pre-trained on ImageNet, since the weights require three input channels, the grayscale images from Dataset 1 are duplicated into each channel. Pre-trained models are trained (fine-tuned) for 5 epochs. The learning rate is scaled by 0.4 every time the validation loss does not decrease compared to the previous epoch. The checkpoint with minimum validation loss is saved for testing.
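These two details map onto standard Keras facilities (a sketch; `ReduceLROnPlateau` with `patience=0` approximates the per-epoch rule, and `model`, `x_gray`, `y` are assumed placeholders):

```python
import numpy as np
import tensorflow as tf

# Duplicate grayscale patches into three channels for the ImageNet weights.
x_rgb = np.repeat(x_gray, 3, axis=-1)   # (N, 80, 80, 1) -> (N, 80, 80, 3)

callbacks = [
    # patience=0 scales the LR whenever val_loss fails to decrease.
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                         factor=0.4, patience=0),
    # Keep the checkpoint with minimum validation loss for testing.
    tf.keras.callbacks.ModelCheckpoint('classifier_best.h5',
                                       monitor='val_loss',
                                       save_best_only=True),
]
model.fit(x_rgb, y, validation_split=0.1, epochs=5, callbacks=callbacks)
```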

Fig. 4. Accuracy of focus selection

A slightly different training strategy is used when training from scratch. ResNet50 models are trained with mini-batch size 512 for 50 epochs on Datasets 2 and 3, and with mini-batch size 128 for 30 epochs on Dataset 1, since it contains fewer samples. Because DenseNet201 requires more GPU memory, its mini-batch size is set to 256 on Datasets 2 and 3. To mitigate overfitting, DenseNet201 models are trained for only 30 epochs on Datasets 2 and 3, and 20 epochs on Dataset 1. When the validation loss has not decreased for 5 epochs, the learning rate is scaled by 0.1. Training is stopped after 15 epochs without improvement. The checkpoint with minimum validation loss is saved for testing.

5 Results and Discussion

5.1 Nucleus Detection

Results of nucleus detection are presented in Fig. 3. Figure 3a shows Precision, Recall, and F1-score as the detection threshold T varies in [0.51, 0.69]. At \(T = 0.59\), the F1-score reaches 0.92, with Precision and Recall being 0.90 and 0.94, respectively. Using \(T=0.59\), 94,685 free-lying nuclei are detected in Dataset 2 and 138,196 in Dataset 3.

Inference takes \({0.17\,\mathrm{s}}\) to generate a density map \(D'\) of size \(1024\!\times \!512\) on an NVIDIA GeForce GTX 1060 Max-Q. Generating a density map of the same size with the sliding window approach (Table 4 of [12]) takes \({504\,\mathrm{s}}\).

5.2 Focus Selection

Performance of the focus selection is presented in Fig. 4. The “human” performance is the average over the experts, using a leave-one-out approach. We plot the performance when using EMBM to select among the \(2(k+1)\) levels closest to our selected pair l; for increasing k the method approaches plain EMBM.

Fig. 5. Cell classification results per microscope slide; green samples (bars to the left) are healthy, red samples (bars to the right) are from cancer patients. ResNet50 is used for (a)–(c) and DenseNet201 pre-trained on ImageNet is used for (d)–(f). (Color figure online)

It can be seen that EMBM alone does not achieve satisfactory performance on this task. Applying a median filter improves performance somewhat. Our proposed method performs very well on the data and is essentially at the level of a human expert (accuracy 84% vs. 85.5%, respectively) using \(k=0\) and a \(3\!\times \!3\) median filter.

5.3 Classification

Classification performance is presented in Table 1 and Fig. 5. The two architectures (ResNet50 and DenseNet201) perform more or less equally well. Pre-training seems to help somewhat on the smaller Dataset 1, whereas for the larger Datasets 2 and 3 no essential difference is observed. Results on Dataset 2 are consistently better than on Dataset 1. This confirms the effectiveness of the nucleus detection and focus selection modules; by using more nuclei (from the same samples) than those manually selected, improved performance is achieved. The results on Dataset 3 indicate that the pipeline generalizes well to liquid-based images. We also observe that our proposed pipeline is robust w.r.t. the network architectures and training strategies used for classification.

In Fig. 6 we plot how classification performance decreases when nuclei are intentionally selected n focus levels away from the detected best focus. The drop in performance as we move away from the detected focus confirms the usefulness of the focus selection step.

When aggregating the cell classifications over whole microscopy slides, as shown in Fig. 5, and comparing Fig. 5a–5b with Fig. 5d–5e, we observe that the non-separable slides 1, 2, 5, and 6 in Dataset 1 become separable in Dataset 2. Global thresholds can be found which accurately separate the two classes of patients in both datasets processed by our pipeline.

Fig. 6. The impact of a defocused test set on Dataset 2, fold 1 (ResNet50)

Table 1. Classification performance. The best F1-score for each dataset is presented in bold.

6 Conclusion

This work presents a complete, fully automated pipeline for oral cancer screening on whole slide images; the source code (utilizing TensorFlow 1.14) is shared as open source. The proposed focus selection method performs at the level of a human expert and significantly outperforms EMBM. The pipeline provides fully automatic inference for WSIs within reasonable computation time. It performs well for smears as well as for liquid-based slides.

Comparing the performance on Dataset 1, using human-selected nuclei, with that on Dataset 2, using computer-selected nuclei from the same microscopy slides, we conclude that the presented pipeline can reduce human workload while at the same time making the classification easier and more reliable.