Introduction

Early detection of gastric cancer is one of the most effective ways to reduce cancer-related deaths. Patients with advanced gastric cancer have a poor prognosis, whereas the 5-year survival rate exceeds 90% when the lesion is still localized because it was detected at an early stage [1,2,3,4]. Gastrointestinal endoscopy plays a critical role in improving the detection rate of early gastric cancer (EGC) and enables early intervention with endoscopic submucosal dissection (ESD) through precise area diagnosis. However, endoscopic diagnosis, including detection and demarcation, is difficult even for well-trained gastroenterologists because a lesion's morphology is usually only slightly depressed or elevated and, because of chronic inflammation, lesions exhibit only a minor color change relative to the background [5,6,7].

Research in computer-aided lesion detection using endoscopic images or videos to assist endoscopists and standardize diagnostic ability has been increasing in recent years [8,9,10,11,12,13]. Such research has strong potential to reduce cancer-care disparities caused by differences in endoscopists' proficiency or in endoscopy equipment. In particular, substantial improvements in the detection of gastrointestinal lesions have been reported since the introduction of convolutional neural networks (CNNs) [14]. In a previous study, we developed an automatic detection scheme for EGC; this scheme was based on a CNN trained efficiently on a small training dataset and revealed the approximate extents of lesions [15]. Although that study indicated the feasibility of computer-aided lesion detection, the performance of the proposed method was insufficient for practical use because not only lesion detection but also demarcation of lesions is critical for curative resection by either ESD or partial gastrectomy [16,17,18,19].

In this study, we enhanced our detection scheme so that lesion regions can be demarcated in greater detail while retaining the efficient CNN trained on a small dataset, and we then delineated the horizontal extent of lesions. The proposed scheme was validated using consecutive cases collected over 1 year. In addition, we conducted a pilot study with endoscopists to compare their ability to demarcate EGC with that of our enhanced scheme.

Methods

Acquisition of training and test datasets with accurately annotated images

To train the CNN, we collected 150 cancer images and 150 normal images randomly selected from 68 consecutive patients (926 images) with EGC who were treated with ESD [20,21,22] at National Cancer Center Hospital East (NCCHE) from November 2016 to April 2017. In the present study, “normal” means that the image does not contain cancerous regions. We next collected a test dataset completely independent of the training dataset; the test dataset comprised 462 cancer images (137 patients) from consecutive EGC cases that underwent ESD from April 2017 to March 2018 at NCCHE. The test dataset also included 396 normal images containing no cancer regions. All images in both datasets were 24-bit full-color images of 1000 × 870 pixels, originally obtained from endoscopic still images captured under white light (GIF-H290Z or GIF-H260Z; Olympus, Tokyo, Japan) using a standard video endoscopy system (EVIS LUCERA ELITE; Olympus). On the basis of the ESD pathological results, an experienced and certified gastroenterologist (K.H.) annotated the lesions on all cancer images to train and validate the CNN. The datasets included different lesion characteristics such as size, macroscopic type, location, invasion depth, and histological type (see Table 1 for details).

Table 1 Patient and lesion characteristics of the training and test datasets

This study used a retrospective design, and patients' personal information was removed. The protocol was approved by the Institutional Review Board of the National Cancer Center (2017-090). Because of the retrospective nature of the study and the use of completely anonymized data, informed consent was obtained using an opt-out method, through disclosure of a public document on the website of the facility.

Efficient increase in training dataset

A CNN is one of the machine learning tools used in various computer vision tasks, such as image classification and segmentation. CNNs generally require huge amounts of training data. By contrast, an extremely small number of endoscopic images (only 150 cancer and 150 normal images) was collected for training the CNN in this study. The collected images were expanded into approximately 1.13 million small images using the following procedure. First, each image was cropped automatically and randomly into approximately 100 images with a size of 224 × 224 pixels. Each cropped image labeled as cancer contained lesions in at least 80% of its pixels, enabling the characteristics of lesions (e.g., the pit pattern, margin, and central depression) to be identified. Conversely, the cropped images labeled as normal contained no cancer regions. Next, data augmentation, which increases the number of images via geometric or appearance image transformation, was performed to increase the number of cropped images artificially. We used Keras [23], an open-source neural network library, for data augmentation. Finally, 534,926 cancer images and 593,874 normal images with sizes of 224 × 224 pixels were obtained (see Online Resource Fig. S1a).
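
To make the procedure concrete, the following Python sketch illustrates the patch extraction and Keras-based augmentation described above. The helper name random_patches and the specific augmentation ranges are illustrative assumptions; only the 224 × 224 patch size, the roughly 100 crops per image, the 80% lesion-pixel rule, and the use of Keras are taken from the text.

```python
# Sketch of the patch-extraction and augmentation step (hypothetical helper
# names and augmentation ranges; not the exact pipeline used in the study).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

PATCH = 224            # patch size in pixels
CROPS_PER_IMAGE = 100  # approximately 100 random crops per endoscopic image
LESION_FRACTION = 0.8  # a "cancer" patch must contain >= 80% lesion pixels

def random_patches(image, lesion_mask, label, max_tries=10000):
    """Randomly crop 224 x 224 patches from one image.

    A crop is kept only if it satisfies the labeling rule:
    >= 80% lesion pixels for 'cancer', 0% lesion pixels for 'normal'.
    """
    h, w = image.shape[:2]
    patches, tries = [], 0
    while len(patches) < CROPS_PER_IMAGE and tries < max_tries:
        tries += 1
        y = np.random.randint(0, h - PATCH + 1)
        x = np.random.randint(0, w - PATCH + 1)
        frac = lesion_mask[y:y + PATCH, x:x + PATCH].mean()
        if (label == "cancer" and frac >= LESION_FRACTION) or \
           (label == "normal" and frac == 0.0):
            patches.append(image[y:y + PATCH, x:x + PATCH])
    return np.stack(patches)

# Geometric and appearance augmentation with Keras, as mentioned in the text;
# the specific transform ranges below are assumptions.
augmenter = ImageDataGenerator(rotation_range=90,
                               horizontal_flip=True,
                               vertical_flip=True,
                               zoom_range=0.1,
                               brightness_range=(0.8, 1.2))
```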

Training network by fine-tuning

Fine-tuning [24], a type of transfer learning, is a powerful tool for efficiently training a neural network model. GoogLeNet [25], one of the simplest CNN models, pre-trained on the dataset of the ImageNet Large Scale Visual Recognition Challenge [26], was adopted in this study as the pre-trained network for fine-tuning. The aforementioned training dataset, totaling approximately 1.13 million images, was then used to fine-tune the pre-trained network to the data of this study (Online Resource Fig. S1b). The training time was approximately 10 h on a workstation with an Intel Xeon CPU (2.4 GHz, 128 GB RAM) and two NVIDIA GeForce GTX 1080 GPUs (8 GB memory each). After this training, the trained network could classify whether an input was cancer or normal (noncancerous). We evaluated the generalization ability of the trained network by cross-validation, a technique used to evaluate machine learning algorithms, and confirmed that the trained network performed well on previously unseen input (Online Resource Fig. S2).
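
A minimal fine-tuning sketch is shown below. Keras does not ship the original GoogLeNet (Inception-v1), so InceptionV3 is used here as a stand-in, and the optimizer, learning rate, and epoch count are illustrative assumptions rather than the settings used in this study.

```python
# Minimal fine-tuning sketch. InceptionV3 stands in for GoogLeNet, and the
# hyperparameters are illustrative assumptions.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models, optimizers

# Load the ImageNet-pre-trained backbone without its 1000-class head.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")

# Replace the ImageNet classifier with a 2-class (cancer / normal) head.
outputs = layers.Dense(2, activation="softmax")(base.output)
model = models.Model(inputs=base.input, outputs=outputs)

# Fine-tune all layers with a small learning rate so that the pre-trained
# features are adapted to endoscopic images rather than overwritten.
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(train_patches, train_labels, validation_split=0.1,
#           epochs=10, batch_size=64)
```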

Detection and delineation by per-pixel prediction

We implemented a sliding-window procedure that predicted the existence probability of lesions for an unknown input image using the trained CNN, leading to semantic segmentation, a technique for annotating images at the pixel level. The precise segmentation eventually enabled the delineation of lesions (i.e., demarcation of EGC). A short summary is as follows. An input endoscopic image with an original size of 1000 × 870 pixels was divided into W × H (horizontal × vertical) small blocks with a size of 224 × 224 pixels. Each block, which was not resized and therefore preserved fine image features, passed through the trained network and was classified with a predicted probability between 0.0 and 1.0 (i.e., block-based prediction). A block with a high probability was considered cancer, whereas a block with a low probability was considered normal (Online Resource Fig. S3). After all the blocks had been applied to the trained network, the predicted probability of each block was assigned to each pixel inside the original input image as the existence probability of EGC, according to a weighting function (Online Resource Fig. S4). The blocks in which the probability prediction was performed were defined by sliding a window across the input image from the top left to the bottom right, and the variable sliding pitch determined the number of blocks. Neighboring blocks partially overlapped according to the sliding pitch, leading to multiple predictions for each pixel and increasing the prediction reliability. In this study, we divided each input image into 43 × 39 (horizontal × vertical) blocks by sliding the window in 28-pixel steps, resulting in 1677 predictions (6.7 s) per input image.
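
The following sketch outlines one way to implement the sliding-window accumulation described above, assuming the two-class Keras model from the previous sketch. Uniform weighting is used instead of the weighting function of Online Resource Fig. S4, and the function name sliding_window_probability is hypothetical.

```python
# Sketch of the sliding-window per-pixel prediction. Uniform weighting is
# used for simplicity instead of the paper's weighting function.
import numpy as np

def sliding_window_probability(image, model, patch=224, stride=28):
    """Return a per-pixel cancer-probability map for one endoscopic image."""
    h, w = image.shape[:2]
    prob_sum = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)

    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            block = image[y:y + patch, x:x + patch]
            # The model returns [p_normal, p_cancer] for one 224 x 224 block.
            p_cancer = model.predict(block[np.newaxis], verbose=0)[0][1]
            prob_sum[y:y + patch, x:x + patch] += p_cancer
            count[y:y + patch, x:x + patch] += 1.0

    count[count == 0] = 1.0        # guard against uncovered border pixels
    return prob_sum / count        # averaged existence probability per pixel
```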

The probabilities assigned and summed at each pixel were then translated into a pseudo-color that was superimposed onto the input image. The result provided the region of EGC as a heat map by selecting pixels whose probability was greater than 0.5, because our trained network is based on two-class classification, in which a pixel with a predicted probability greater than 0.5 is regarded as cancer. If missed lesions are unacceptable, the threshold should be set below 0.5 to reduce the number of missed detections. However, our goal is the accurate delineation of EGC, and we therefore set the threshold to 0.5. The detected cancer region was depicted as a rainbow-colored region, whereas the noncancerous (normal) region was depicted as a blue-colored region (Online Resource Fig. S1c).
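
A minimal sketch of the heat-map rendering step is given below; the colormap choice, blending factor, and function name overlay_heatmap are assumptions, with only the 0.5 threshold and the rainbow/blue color scheme taken from the text.

```python
# Sketch of the heat-map rendering: pixels with probability > 0.5 are shown
# with a rainbow colormap, the rest in blue, blended with the input image.
import numpy as np
import matplotlib.pyplot as plt

def overlay_heatmap(image, prob_map, threshold=0.5, alpha=0.4):
    """image: uint8 RGB array; prob_map: float array in [0, 1]."""
    rainbow = plt.get_cmap("jet")(prob_map)[..., :3]     # RGB in [0, 1]
    blue = np.zeros_like(rainbow)
    blue[..., 2] = 1.0                                   # plain blue for normal
    colored = np.where(prob_map[..., None] > threshold, rainbow, blue)
    return (1.0 - alpha) * image / 255.0 + alpha * colored
```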

Evaluation methods and outcome measures

To validate our proposed scheme, we performed two types of evaluations on the test dataset: one of detection ability and the other of demarcation accuracy. Detection ability refers to whether our scheme can correctly detect the presence of cancer in an input, and demarcation accuracy refers to the overlap rate of the detected region with the exact cancer region. The detection ability was calculated on the basis of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP and FP denote the numbers of images correctly and incorrectly recognized as EGC, respectively, whereas TN and FN denote the numbers of images correctly and incorrectly recognized as normal, respectively. We calculated five metrics for evaluating the detection ability: sensitivity (TP/(TP + FN)), specificity (TN/(FP + TN)), accuracy ((TP + TN)/(TP + FN + FP + TN)), positive predictive value (PPV = TP/(TP + FP)), and negative predictive value (NPV = TN/(FN + TN)). These evaluations were performed for block-based, image-based, and case-based predictions on the test dataset (Online Resource Table S1). The evaluation of block-based prediction reflects the detection ability before the predicted probabilities are summed across blocks in the sliding-window procedure. We examined the detection ability for 1,438,866 blocks obtained from 462 cancer and 396 normal images by counting the blocks whose prediction matched the ground truth. The co-located ground-truth block is regarded as a cancer block when at least 80% of its pixels lie within cancer regions. For the image-based evaluation, if at least one cancer pixel is detected in the target input image, the image is counted as correctly detected. For the case-based evaluation, a case (among the 137 cases) is counted as correctly detected if at least one of its cancer images is detected correctly.
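
The five metrics are simple functions of the confusion-matrix counts, as the following snippet shows; the example call uses the image-based counts reported in the Results section (387 of 462 cancer images and 307 of 396 normal images detected correctly).

```python
# The five detection metrics as defined in the text, computed from
# confusion-matrix counts.
def detection_metrics(tp, tn, fp, fn):
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "ppv":         tp / (tp + fp),
        "npv":         tn / (fn + tn),
    }

# Image-based counts from the Results section: 387 of 462 cancer images and
# 307 of 396 normal images were predicted correctly.
print(detection_metrics(tp=387, tn=307, fp=396 - 307, fn=462 - 387))
# -> sensitivity 0.838, specificity 0.775, accuracy 0.809, ppv 0.813, npv 0.804
```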

The demarcation accuracy was measured for the cancer images of the test dataset in terms of the mean intersection over union (mIoU) [27], a measure commonly used to evaluate the accuracy of semantic segmentation that quantifies the degree of overlap between a segmented region and the ground truth. The images showing the lesion extent annotated by experienced gastroendoscopists on the basis of the pathological examination of the ESD specimens were used as the ground truth for calculating the mIoU.
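
For a single image, the IoU between the predicted and ground-truth binary masks can be computed as follows (a straightforward sketch; the averaging over images that yields the mIoU is omitted).

```python
# IoU between a predicted binary mask and the ground-truth mask; the mIoU
# reported here is this value averaged over the evaluated images.
import numpy as np

def iou(pred_mask, gt_mask):
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```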

We also compared the regions automatically delineated by our scheme with the regions delineated via manual annotation by six experienced endoscopists. The endoscopists had performed an average of 6000 upper gastrointestinal endoscopy procedures (range 3000–12,000). Thirty-eight cancer images randomly selected from the successfully detected cases in the test dataset, together with their ground truths (lesion regions annotated on the basis of pathological examination), were used for the comparison. When annotating each image, the endoscopists were permitted to reference only the white-light endoscopic image without magnification. The distribution of lesion characteristics approximately matched that of the test dataset. For each selected image, the mIoU, sensitivity, specificity, accuracy, PPV, and NPV were calculated between the ground truth and the manual annotation.

Results

Evaluation of the detection ability using consecutive cases throughout 1 year

Table 2 presents the detection ability based on the numbers of images and cases. For the image-based result, EGC detection was accomplished in 387 (83.8%) of 462 cancer images. For the normal images, 307 (77.5%) of 396 were predicted correctly. A PPV of 81.3%, NPV of 80.4%, and accuracy of 80.9% were obtained, indicating stable detection performance. The PPV of the block-based results was low (Online Resource Table S1), whereas that of the image-based results was improved. This result means that the sliding-window procedure substantially reduced the occurrence of false positives. We speculate that the multiple predictions performed at each pixel by the sliding-window procedure led to this stable detection accuracy. Notably, in terms of the case-based results, our proposed method achieved a detection accuracy of 94.9%.

Table 2 Detection ability based on the numbers of images and cases

Heat map of cancer probability delineating the horizontal extent of lesion

Figure 1a–d shows the demarcation results for four macroscopic types of gastric cancer: 0-I, 0-IIa, 0-IIb, and 0-IIc. A heat map shows not only the extent of a lesion but also the probability of the lesion. In addition, the margins of cancer regions can be observed at the pixel level. Although the lesion size ranged from approximately 70 to 650 square pixels in each image, the demarcation results were accurate irrespective of the lesion size. Especially in the cases of types 0-IIa, 0-IIb, and 0-IIc, few salient features were observed in the lesion shapes. Nevertheless, our proposed scheme provided accurate pixel-level demarcation across these various macroscopic types of EGC. In terms of incorrect detection, irregular textured patterns on the gastric surface resulted in over-segmentation, and mis-segmentation was observed in regions with shadows or in deeper areas (Online Resource Fig. S5).

Fig. 1

Demarcation results of four early gastric cancer (EGC) types: a protruding type (type 0-I); b superficial elevated type (type 0-IIa); c superficial flat type (type 0-IIb); d superficial depressed type (type 0-IIc). Each row shows the following images: an input image, a ground truth, a heat map showing the predicted probability of EGC, and a pseudo-colored TP–TN–FP–FN image. In the ground truth images, the manually annotated regions, which were based on the pathological examination by a gastroenterologist (K.H.), are marked in green. The heat map images, which were produced fully automatically by our proposed scheme, indicate the potential cancer and normal regions detected on the basis of the predicted probability. In the TP–TN–FP–FN images, light blue, white, red, and yellow indicate true positive, true negative, false positive, and false negative, respectively

The mIoU calculated over the successfully detected cancer images (387/462) was 66.5% on average. In general, with respect to the degree of overlap of two regions, we considered an overlap greater than 65% to be sufficient because shifting two identical regions diagonally relative to each other by only 10% gives an mIoU of approximately 65% (Online Resource Fig. S6).

Analyzing the results in terms of lesion characteristics

Figure 2 shows the demarcation accuracy, expressed as the mIoU, in terms of lesion size, depth, location, and histological type. The mIoU was significantly better for lesions of depth pSM2 (vs. pM, p < 0.0001; vs. pSM1, p < 0.05), lesions in the upper stomach (vs. middle, p < 0.05), and lesions of the histologically differentiated type (vs. undifferentiated type, p < 0.001). There were no significant differences in mIoU with regard to lesion size.

Fig. 2

Differences in mean intersections over unions (mIoUs) for each lesion characteristic. Scatter plots of mIoUs in terms of a lesion size, b lesion depth, c lesion location, and d histological type of lesion. The differentiated lesions include tub, tub1, tub2, pap, and carcinoma, and the undifferentiated lesions include sig, por, and por2. In each plot, the p-value was calculated using the Kruskal–Wallis test for size, depth, and location and using the Mann–Whitney U test for the histological types. *p < 0.05, ***p < 0.001, ****p < 0.0001; ns, not significant (p ≥ 0.05)

We also analyzed the correct-detection rate in terms of lesion characteristics such as the size, depth, location, and histological type of the lesion (Table 3). Similar to the results in Fig. 2, lesions with deeper invasion, lesions located in the upper part of the stomach, and lesions of the differentiated type were more likely to be detected precisely by our scheme. This tendency in detection ability was the same as that of the endoscopists. Notably, according to the case-based results, our proposed method achieved a correct-detection rate of approximately 95%.

Table 3 Correct-detection rate based on the lesion characteristics

Comparison with the manual demarcation by experienced endoscopists

Table 4 shows the results of a comparison between the computer detection and the averaged results of the six endoscopists in terms of six evaluation metrics. In three of the metrics (mIoU, sensitivity, and NPV), our scheme demonstrated diagnostic ability superior to that of the endoscopists, whereas in the other three (specificity, accuracy, and PPV), the endoscopists were superior. The small differences in specificity and accuracy indicate that the diagnostic ability of our scheme was comparable to that of the endoscopists.

Table 4 Comparison of the delineation accuracy with six endoscopists

Figure 3 shows three representative examples of lesion demarcation. Similar to the trends shown in Table 3, our scheme tended toward over-segmentation compared with the lesion demarcations of the endoscopists, whereas the endoscopists' demarcations tended to be mis-segmented. The results also show that the regions annotated by the individual endoscopists varied substantially from one another.

Fig. 3

Comparison of the demarcation accuracy of the proposed method with that achieved by experienced endoscopists. Three representative examples (Cases 1–3) of the computer detection and those of manual detection by six endoscopists. In each result, the upper row shows the input endoscopic image, the results of manual detection by three endoscopists (D1–D3), and the results of detection by our method (Ours). The lower row shows the ground truth indicating the region of EGC, the results of manual detection by three endoscopists (D4–D6), and the merged border of regions detected by six endoscopists and by our method. In the merged border image, the green line shows the real border of EGC and the red line shows the border detected by our scheme

Discussion

We developed a scheme for the computer-aided demarcation of EGC that demonstrates area diagnosis ability similar to that of endoscopists. The demarcation accuracy of the correctly detected EGC, expressed as the mIoU, was 66.5%, which we consider a sufficient degree of overlap with the exact cancer regions because the mIoU is an extremely strict evaluation metric. For lesions with ambiguous boundaries, such as EGC, our pixel-level lesion delineation will assist endoscopists in understanding the spread of lesions and aid in the development of an appropriate treatment plan for ESD.

In a previously reported method that enabled the real-time detection of upper gastrointestinal cancer, greater than 90% sensitivity and specificity were achieved using approximately 1 million images (approximately 84,000 cases) [28]. However, the target lesions for automatic detection were primarily advanced cancers (81% of the total), which deviates from the concept of automatic detection at the early stage of gastric cancer. Ping et al. proposed an automatic delineation method for EGC [29]. However, their aim was to delineate the lesion area only in cancer images; they did not attempt to detect the presence of lesions. We attribute the difficulty of automatic EGC detection and delineation to two primary causes. First, there is a lack of large amounts of high-quality data for use in training datasets [30]. By contrast, our scheme does not focus on collecting large amounts of training data but instead uses precise training data based on pathological examination; in this study, the data originated from only 150 cancer images and 150 normal images. The high-quality training data appear to have positively affected the training accuracy, even for one of the simplest CNNs, GoogLeNet. Our scheme, which overcomes the problem of training data collection, has the potential to be readily applied to a wide range of detection tasks in other hospitals and clinics, and even to other gastrointestinal diseases. Normally, the appearance of images acquired by different facilities or devices differs because of differences in imaging conditions and methods; thus, re-training using images from those facilities or devices is necessary for accurate detection. In such cases, the ability to train with a small dataset is a major advantage. Second, many previous studies used a one-stage detection framework typified by the single-shot multibox detector (SSD) [31]. We speculate that the SSD-based schemes might be linked to the low PPV [10, 28]. The SSD uses low-resolution images derived from compressed inputs (e.g., the SSD-300 model compresses the original image to 300 × 300 pixels); thus, the detailed texture patterns of the images are lost, and lesions with flat morphology and indistinct boundaries, typified by EGC, may not be detected correctly. Our scheme, by contrast, detected the minor differences in local texture patterns around the lesion boundary as a result of the multiple predictions derived from the sliding-window procedure performed with a CNN trained on uncompressed images. A particularly striking result is that the occurrence of both FPs and FNs was strongly suppressed for both cancer and normal images and that high performance in both PPV and NPV was eventually achieved.

Although we did not explore methods to shorten the detection time, numerous options are available for this purpose. For example, our scheme can offer coarse or fine segmentation results depending on the sliding pitch selected for the sliding windows (Online Resource Fig. S7). For the probability prediction of each block in the sliding-window procedure, a coarse sliding pitch could first be adopted for a rough prediction, and then only the blocks with high probabilities would be re-predicted with a fine pitch. We speculate that this approach will shorten the detection time enough to enable real-time diagnosis in the future. Alternatively, the pitch could be selected to meet the needs of endoscopists by balancing segmentation quality and execution time. For example, the advantage of a small sliding pitch is that more accurate segmentation results with a smooth outline can be obtained, which would be useful for area diagnosis before ESD in our method. Conversely, a large sliding pitch has the advantage of decreased execution time because the number of predictions decreases; it might contribute to mass screening programs or real-time detection using high-quality video endoscopy in the future. Because a tradeoff exists between segmentation quality and execution time, parameters that suit the needs of the endoscopist must be selected. In addition, the sliding-window procedure can be performed in parallel via GPU implementation, thereby accelerating detection.
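
One possible realization of this coarse-to-fine strategy is sketched below; it is a design suggestion rather than the implementation used in the study, and predict_block, the coarse pitch of 112 pixels, and the 0.3 refinement threshold are assumptions.

```python
# One possible coarse-to-fine sliding-window scheme (a design suggestion,
# not the implementation used in this study). predict_block is assumed to
# return the cancer probability of one 224 x 224 block.
import numpy as np

def coarse_to_fine(image, predict_block, patch=224, coarse=112, fine=28,
                   refine_threshold=0.3):
    h, w = image.shape[:2]
    prob_sum = np.zeros((h, w))
    count = np.zeros((h, w))

    def accumulate(y, x):
        p = predict_block(image[y:y + patch, x:x + patch])
        prob_sum[y:y + patch, x:x + patch] += p
        count[y:y + patch, x:x + patch] += 1.0
        return p

    # Pass 1: coarse pitch over the whole image.
    hot = []
    for y in range(0, h - patch + 1, coarse):
        for x in range(0, w - patch + 1, coarse):
            if accumulate(y, x) > refine_threshold:
                hot.append((y, x))

    # Pass 2: fine pitch only around suspicious coarse blocks.
    for y0, x0 in hot:
        for y in range(max(0, y0 - coarse), min(h - patch, y0 + coarse) + 1, fine):
            for x in range(max(0, x0 - coarse), min(w - patch, x0 + coarse) + 1, fine):
                accumulate(y, x)

    count[count == 0] = 1.0
    return prob_sum / count
```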

This study had several limitations. First, the numbers of training and test images were imbalanced with respect to lesion characteristics. For example, the number of undifferentiated-type images was less than one-tenth the number of differentiated-type images. This imbalance might be responsible for the lower detection accuracy for undifferentiated-type images compared with differentiated-type images, and its effect on detection accuracy should be investigated in future work. Second, the proposed scheme was evaluated using retrospective datasets from a single institution with a relatively small number of patients. Additional studies with larger sample sizes from multiple institutions would increase the generalization capability of our proposed scheme. Third, the scheme was trained and tested on images from a limited imaging environment captured using an Olympus video endoscopy system. Additional studies using datasets from multiple imaging environments may lead to hardware-agnostic prediction. Fortunately, our scheme does not require a large amount of training data and can be readily applied to images from other institutions and imaging environments. In addition, we recently developed a system to accumulate annotated endoscopic images from a daily reporting system, which could make training data collection more convenient in the future [32].

In conclusion, we demonstrated that our scheme achieved high performance in the computer-aided area diagnosis of EGC using a CNN trained with an extremely small number of endoscopic images. A particularly striking result is that our scheme offered pixel-level demarcation of EGC with well-balanced accuracy for both cancer and normal images and demonstrated area diagnosis ability similar to that of experienced endoscopists. In the future, we believe that our scheme will offer substantial assistance in the decision-making of gastroendoscopists regarding curative resection of EGC by ESD.