Abstract
Reproducibility of AI models on biomedical data remains a major concern for their adoption into clinical practice. Initiatives for reproducibility in the development of predictive biomarkers, such as the MAQC Consortium, have already underlined the importance of appropriate Data Analysis Plans (DAPs) to control for different types of bias, including data leakage from the training to the test set. In digital pathology, leakage typically lurks in weakly designed experiments whose data partitioning schemes do not account for subjects. The issue is then exacerbated when fractions or subregions of slides (i.e. “tiles”) are considered. Although this aspect is widely recognized by the community, we argue that it is often overlooked. In this study, we assess the impact of data leakage on the performance of machine learning models trained and validated on multiple histology data collections. We show that, even with a properly designed DAP (\(10 \times 5 \) repeated cross-validation), predictive scores can be inflated by up to \(41\%\) when deep learning models use tiles from the same subject in both training and validation sets. We replicate the experiments for 4 classification tasks on 3 histopathological datasets, for a total of 374 subjects, 556 slides and more than 27,000 tiles. We also discuss the effects of data leakage on transfer learning strategies with models pre-trained on general-purpose datasets or off-task digital pathology collections. Finally, we propose a solution that automates the creation of leakage-free deep learning pipelines for digital pathology based on histolab, a novel Python package for histology data preprocessing. We validate the solution on two public datasets (TCGA and GTEx).
N. Bussola and A. Marcolini—Joint first author.
G. Jurman and C. Furlanello—Joint last author.
1 Introduction
Bioinformatics on high-throughput omics data has been plagued by countless reproducibility issues since its early days: Ioannidis and colleagues [1] found that almost 90% of papers in a leading genetics journal were not repeatable due to methodological or clerical errors. Although the landscape seems to have improved [2], and broad efforts have been spent across different biomedical fields [3], computational reproducibility and replicability still fall short of the ideal. Lack of reproducibility has been linked to inaccuracies in managing batch effects [4, 5], small sample sizes [6], or flaws in the experimental design, such as data normalization performed simultaneously on development and validation data [7, 8]. The MAQC-II project for reproducible biomarker development from microarray data demonstrated, through a community-wide research effort, that a well-designed Data Analysis Plan (DAP) is mandatory to avoid selection-bias flaws in the development of models for high-dimensional datasets [9].
Among the various types of selection bias that threaten the reproducibility of machine learning algorithms, data leakage is possibly the most subtle [10]. Data leakage refers to the use of information from outside the training dataset during model training or selection [11]. A typical leakage occurs when data in the training, validation and/or test sets share indirect information, leading to overly optimistic results. For example, one of the preclinical sub-datasets in the MAQC-II study consisted of microarray data from mouse triplets. These triplets were expected to have an almost identical response under each experimental condition, and therefore had to be kept together in the DAP partitioning to prevent any possible leakage from training to internal validation data [9].
The goal of this study is to provide evidence that similar issues are still lurking in the grey areas of preprocessing, ready to emerge in the everyday practice of machine learning for digital pathology. The BreaKHis dataset [12], one of the most popular histology collections of breast cancer samples, has been used in more than 40 scientific papers to date [13], with reported results spanning a broad range of performance. In a non-negligible number of these studies, overfitting effects due to data leakage are suspected to have impacted the outcomes.
Deep learning pipelines for histopathological data typically require Whole Slide Images (WSIs) to be partitioned into multiple patches (also referred to as “tiles” [14]) to augment the original training data and to comply with the memory constraints imposed by GPU hardware architectures. For example, a single WSI of size \(67,727\times 47,543\) pixels can be partitioned into multiple \(512 \times 512\) tiles, which are randomly extracted and checked to ensure that the selected subregions preserve enough tissue information. These tiles are then processed by data augmentation operators (e.g. random rotation, flipping, or affine transformation) to reduce the risk of overfitting. As a result, the number of subimages originating from the very same histological specimen is significantly amplified [15, 16], consequently increasing the risk of data leakage. Protocols for data partitioning (e.g. a repeated cross-validation DAP) are not naturally immune to replicates, and so the source originating each tile should be tracked to avoid any risk of bias [17].
In this work, we quantify the importance of adopting Patient-Wise split procedures with a set of experiments on digital pathology datasets. All experiments are based on DAPPER [18], a reproducible framework for predictive digital pathology composed of a deep learning core (“backbone network”) acting as feature encoder, and multiple task-related classification models, i.e. Random Forest or Multi-Layer Perceptron (see Fig. 1). We test the impact of various data partitioning strategies on the training of multiple backbone architectures, i.e. DenseNet [19] and ResNet [20] models, fine-tuned to the histology domain.
Our experiments confirm that train-test contamination is a serious modeling concern that hinders the development of a dataset-agnostic methodology, with an impact similar to the lack of standard protocols for the acquisition and storage of WSIs in digital pathology [21]. Thus, we present a protocol to prevent data leakage during data preprocessing. The solution is based on histolab, an open-source Python library designed as a reproducible and robust environment for WSI preprocessing, available at https://github.com/histolab/histolab. The novel approach is demonstrated on two large-scale public datasets: GTEx [22] (non-pathological tissues) and TCGA [23] (cancer tissues).
2 Data Description
We tested our experimental pipeline on three public datasets for image classification in digital pathology, namely GTEx [22], Heart Failure (HF) [24], and BreaKHis [12]. Descriptive statistics of the datasets are reported in Table 1, and Fig. 1.
The GTEx Dataset. The current release of GTEx (v8) includes a total of 15,201 H&E-stained WSIs, acquired with an Aperio scanner (\(20\times \) native magnification) from a cohort of 838 non-diseased donors. In this work, we consider a subset of 265 WSIs from 83 subjects, randomly selected across 11 histological classes with an approximately balanced number of WSIs per tissue: adrenal gland (\(n=24\)); bladder (\(n=19\)); breast (\(n=26\)); liver (\(n=26\)); lung (\(n=21\)); ovary (\(n=26\)); pancreas (\(n=26\)); prostate (\(n=24\)); testis (\(n=26\)); thyroid (\(n=26\)); uterus (\(n=21\)).
We implemented a data preprocessing pipeline to prepare the tile dataset from the GTEx collection. First, the tissue region is automatically detected in each WSI; this step combines Otsu-threshold binarization [25] with dilation and hole-filling morphological operations. A maximum of 100 tiles of size \(512 \times 512\) is then randomly extracted from each slide. To ensure that only highly informative images are used, tiles whose tissue area accounts for less than 85% of the whole patch are automatically rejected. At the end of this step, a total of 26,174 random tiles is extracted from the WSIs, each available at different magnification levels (i.e., \(20\times , 10\times , 5\times \)). In this paper we limit experiments and discussion to tiles at \(5\times \) magnification, with no loss of generality.
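The tissue-detection and filtering step can be illustrated with a minimal pure-Python sketch of Otsu thresholding and a tissue-percent filter. This is a simplification for exposition only: the morphological operations (dilation, hole filling) and the actual histolab implementation are omitted, and tiles are modelled as flat lists of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (maximum between-class variance)
    for a sequence of 8-bit grayscale pixel values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg, w_bg, best_t, best_var = 0.0, 0, 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def tissue_fraction(tile, threshold):
    """Fraction of pixels darker than the threshold (stained tissue)."""
    return sum(1 for p in tile if p < threshold) / len(tile)

def keep_tile(tile, threshold, min_tissue=0.85):
    """Reject tiles whose tissue area is below 85% of the patch."""
    return tissue_fraction(tile, threshold) >= min_tissue
```

A tile with 90% dark (tissue) pixels passes the filter, while a half-background tile is rejected, mirroring the 85% rule described above.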
The HF Dataset. The Heart Failure collection [24] originates from 209 H&E-stained WSIs of left ventricular tissue, each corresponding to a single subject. The learning task is to distinguish images of heart failure (\(n=94\)) from those of non-heart failure (\(n=115\)). Slides in the former class are categorized by disease subtype: ischemic cardiomyopathy (\(n=51\)); idiopathic dilated cardiomyopathy (\(n=41\)); undocumented (\(n=2\)). Subjects with no heart failure are further grouped into: normal cardiovascular function (\(n=41\)); non-HF with no other pathology (\(n=72\)); non-HF with other tissue pathology (\(n=2\)). WSIs in this dataset were acquired with an Aperio ScanScope at \(20\times \) native magnification, and then downsampled to \(5\times \) magnification by the authors. From each WSI, 11 non-overlapping patches of size \(250\times 250\) were randomly extracted. The entire collection of 2,299 tiles is publicly available on the Image Data Resource repository (IDR number: idr0042).
The BreaKHis Dataset. The BreaKHis histopathological dataset [12] collects 7,909 H&E-stained tiles (size \(700\times 460\)) of malignant or benign breast tumour biopsies. Tiles correspond to regions of interest manually selected by expert pathologists from a cohort of 82 patients, and are made available at different magnification factors (i.e., \(40\times \), \(100\times \), \(200\times \), \(400\times \)) [12]. To allow for a more extensive comparison with the state of the art, only the \(200\times \) magnification factor is considered in this paper. The BreaKHis dataset currently contains 4 histologically distinct subtypes each of benign and malignant tumours, respectively: Adenosis (\(n=444\)); Fibroadenoma (\(n=1,014\)); Tubular Adenoma (\(n=453\)); Phyllodes Tumor (\(n=569\)); Ductal Carcinoma (\(n=3,451\)); Lobular Carcinoma (\(n=626\)); Mucinous Carcinoma (\(n=792\)); Papillary Carcinoma (\(n=560\)). This dataset is used for two classification tasks: (BreaKHis-2) binary classification of benign and malignant tumour samples; (BreaKHis-8) classification of the 8 distinct tumour subtypes.
3 Methods
The pipeline used in this work is based on the DAPPER framework for digital pathology [18], extended by (i) integrating specialised train-test splitting protocols, i.e. Tile-Wise and Patient-Wise; (ii) extending the feature extractor component with new backbone networks; (iii) applying two transfer learning strategies for feature embedding. Figure 1 shows the three main blocks of the experimental environment defined in this paper: (A) dataset partition in train and test set; (B) feature extraction procedure with different transfer learning strategies; (C) the DAP employed for machine learning models.
A. Dataset Partitioning Protocols. The tile dataset is partitioned into a training set and a test set, with an 80%–20% split ratio, respectively. We compare two data partitioning protocols to investigate the impact of train-test contamination (Fig. 1A): in the Tile-Wise (TW) protocol, tiles are randomly split between the training and the test sets, regardless of the originating WSI; the Patient-Wise (PW) protocol splits the tile dataset strictly ensuring that all tiles extracted from the same subject fall either in the training or the test set. To avoid other sources of leakage due to class imbalance [26], both protocols are combined with stratification of samples over the corresponding classes, and any residual class imbalance is accounted for by weighting the error on generated predictions.
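The difference between the two protocols can be sketched in a few lines of plain Python. This is an illustrative simplification (the stratification and error weighting described above are omitted), not the code used in the study; each tile is a `(tile_id, subject_id, label)` record.

```python
import random
from collections import defaultdict

def patient_wise_split(tiles, test_size=0.2, seed=42):
    """Split (tile_id, subject_id, label) records so that all tiles of a
    subject land entirely in either the training or the test set."""
    by_subject = defaultdict(list)
    for tile in tiles:
        by_subject[tile[1]].append(tile)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(test_size * len(subjects)))
    train = [t for s in subjects[n_test:] for t in by_subject[s]]
    test = [t for s in subjects[:n_test] for t in by_subject[s]]
    return train, test

def tile_wise_split(tiles, test_size=0.2, seed=42):
    """Naive split that ignores subjects -- prone to data leakage."""
    tiles = list(tiles)
    random.Random(seed).shuffle(tiles)
    n_test = round(test_size * len(tiles))
    return tiles[n_test:], tiles[:n_test]
```

With `patient_wise_split`, the subject sets of the two partitions are disjoint by construction; with `tile_wise_split`, tiles from the same WSI almost surely end up on both sides.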
B. Deep Learning Models and Feature Extraction. The training set is then used to train a deep neural network for feature extraction (Fig. 1B), i.e. a “backbone” network whose aim is to learn a vector representation of the data (feature embedding). In this study, we consider two backbone architectures, namely ResNet-152 [20] and DenseNet-201 [19]. Given that the DenseNet model has almost double the number of parameters (see Notes), and thus a higher computational footprint, diagnostic experiments and transfer learning are performed only with the ResNet-152 model. Similarly to [16] and [18], we started from off-the-shelf versions of the models, pre-trained on ImageNet, and then fine-tuned them to the digital pathology domain using transfer learning. Specifically, we trained the whole network for 50 epochs with learning rate \(\eta = 10^{-5}\) and the Adam optimizer [27], in combination with the categorical cross-entropy loss. The \(\beta _{1}\) and \(\beta _{2}\) parameters of the optimizer are set to 0.9 and 0.999, respectively, with no regularization. To reduce the risk of overfitting, we use train-time data augmentation, namely random rotation and random flipping of the input tiles.
The impact of adopting a single- or double-step transfer learning strategy in combination with the Patient-Wise partitioning protocol is also investigated in this study. Two sets of feature embeddings (FE) are generated: \(FE_{1}\), backbone model fine-tuned from ImageNet; \(FE_{2}\), backbone model sequentially fine-tuned from ImageNet and GTEx.
C. Classification and Data Analysis Plan (DAP). The classification is finally performed on the feature embedding within a DAP for machine learning models (Fig. 1C). In this work, we compare the performance of two models: Random Forest (RF) and Multi-Layer Perceptron (MLP). In particular, we apply the 10 \(\times \) 5-fold CV schema proposed by the MAQC-II Consortium [9]. In the DAP setting, the inputs are the two separate training and test sets resulting from the 80–20 train-test split protocol. The test set is kept completely unseen by the model, and only used for the final evaluation. The training set further undergoes a 5-fold CV iterated 10 times with different random seeds, resulting in 50 separate internal validation sets. These validation sets are generated adopting the same protocol used for the train-test split, namely Tile-Wise or Patient-Wise. The overall performance of the model is evaluated across all the iterations, in terms of average Matthews Correlation Coefficient (MCC) [28] and Accuracy (ACC), both with \(95\%\) Studentized bootstrap confidence intervals (CI). Moreover, results are reported both at tile level and at patient level, in order to assess the ability of machine learning models to generalise to unseen subjects (see Sect. 4).
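A group-aware version of the 10 \(\times \) 5-fold CV schema can be sketched as follows. The fold assignment here is an illustrative simplification (subjects are dealt round-robin after shuffling, stratification is omitted, and the per-repetition seed is hypothetical), but it preserves the key property: subjects never straddle a train/validation boundary.

```python
import random
from collections import defaultdict

def repeated_patient_wise_cv(records, n_folds=5, n_repeats=10):
    """Yield (train_idx, valid_idx) pairs for a 10x5 repeated CV in which
    folds are formed over subjects, never over individual tiles."""
    by_subject = defaultdict(list)
    for idx, (tile_id, subject_id, label) in enumerate(records):
        by_subject[subject_id].append(idx)
    subjects = sorted(by_subject)
    for repeat in range(n_repeats):
        rng = random.Random(repeat)  # a different seed per repetition
        shuffled = subjects[:]
        rng.shuffle(shuffled)
        # deal shuffled subjects round-robin into n_folds folds
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            valid = [i for s in folds[k] for i in by_subject[s]]
            train = [i for j in range(n_folds) if j != k
                     for s in folds[j] for i in by_subject[s]]
            yield train, valid
```

Across the 10 repetitions this yields the 50 internal validation sets described above, each disjoint (in subjects, hence in tiles) from its training counterpart.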
As an additional safeguard against selection bias, the DAP integrates a random-labels schema (RLab) (Fig. 2). In this setting, the training labels are randomly shuffled and presented as reference ground truth to the machine learning models. In particular, we consistently randomize the labels of all the tiles of a given subject, so that they all share the same random label (Fig. 2A); we then alternatively use the Patient-Wise (Fig. 2B1) or the Tile-Wise (Fig. 2B2) split within the DAP environment. Notice that an average MCC score close to zero (\(MCC \approx 0\)) indicates a protocol immune to sources of bias, including data leakage; we thus focus on the RLab validation to highlight evidence of data leakage under the TW and the PW protocols.
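The subject-consistent shuffling of the RLab schema can be sketched as a permutation of subject-level labels. This sketch assumes each subject carries a single class label (as in the datasets above, where all tiles of a subject share the subject's diagnosis):

```python
import random

def random_labels_per_subject(records, seed=0):
    """RLab schema: permute labels at the subject level, so that every
    tile of a subject receives the same (random) label and the overall
    subject-level class proportions are preserved."""
    subject_label = {}
    for _tile, subj, lab in records:
        subject_label.setdefault(subj, lab)
    subjects = sorted(subject_label)
    permuted = [subject_label[s] for s in subjects]
    random.Random(seed).shuffle(permuted)
    shuffled = dict(zip(subjects, permuted))
    return [(tile, subj, shuffled[subj]) for tile, subj, _ in records]
```

Under a leakage-free protocol, a model trained on such labels should score \(MCC \approx 0\); a consistently positive MCC instead signals that tile-level information is crossing the train/validation boundary.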
Performance Metrics. Several patient-wise performance metrics have been defined in the literature [12, 24, 29]. Two metrics are considered in this study: (1) Winner-takes-all (WA), and (2) Patient Score (PS).
In the WA metric, the label associated with each patient is the majority vote over the labels predicted for their tiles. With this strategy, standard metrics based on the classification confusion matrix can be used as overall performance indicators; in this paper, ACC is used for comparability with the PS metric. The PS metric is defined for each patient [12] as the ratio of the \(N_c\) correctly classified tiles over the \(N_P\) total number of tiles of that patient, namely \(PS=\frac{N_c}{N_P}\). The overall performance is then computed as the global recognition rate (RR), defined as the average of the PS scores over all \(N\) patients: \(RR = \frac{1}{N}\sum _{i=1}^{N} PS_i\).
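Both patient-level metrics can be expressed compactly. The following sketch assumes tile-level ground truth and predictions paired with subject identifiers:

```python
from collections import Counter, defaultdict

def patient_scores(y_true, y_pred, subjects):
    """PS: correctly classified tiles / total tiles, per patient."""
    correct, total = Counter(), Counter()
    for t, p, s in zip(y_true, y_pred, subjects):
        total[s] += 1
        correct[s] += int(t == p)
    return {s: correct[s] / total[s] for s in total}

def recognition_rate(y_true, y_pred, subjects):
    """RR: average of the per-patient PS scores."""
    ps = patient_scores(y_true, y_pred, subjects)
    return sum(ps.values()) / len(ps)

def winner_takes_all(y_pred, subjects):
    """WA: each patient gets the majority label over their tiles."""
    votes = defaultdict(Counter)
    for p, s in zip(y_pred, subjects):
        votes[s][p] += 1
    return {s: votes[s].most_common(1)[0][0] for s in votes}
```

For example, a patient with 2 of 3 tiles classified correctly contributes \(PS = 2/3\) to the RR average, while WA assigns them the label predicted for the majority of their tiles.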
In this paper, the WA metric and the PS metric are used for comparison of patient-level results on the HF dataset and the BreaKHis dataset, respectively.
Preventing Data Leakage: The histolab Library. As a solution to the data leakage pitfall, we have developed a protocol for image and tile splitting based on histolab, an open-source library recently developed for reproducible WSI preprocessing in digital pathology. The library implements a tile extraction procedure whose reliability and quality result from robust design and extensive software testing. A high-level interface for image transformations is also provided, making histolab an easy-to-adopt tool for complex histopathological pipelines.
To rule out data leakage conditions, the protocol is designed to create a leakage-free collection (tile extraction with the Patient-Wise split) that can be easily integrated in a deep learning workflow (Fig. 3). The protocol is already customized for standardizing WSI preprocessing on GTEx and TCGA, two large-scale public repositories widely used in computational pathology. The code can also be adapted to rebuild the training and test datasets from GTEx used in this study, thus extending the HINT collection presented in [18].
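Assuming the extracted tile filenames encode the subject identifier as a prefix (histolab's tilers accept a filename prefix, so a convention such as `GTEX-1117F_tile42.png` is easy to enforce; the exact pattern here is a hypothetical example), a leakage-free train-test assignment over the extracted tiles can be sketched as:

```python
import random
import re
from collections import defaultdict

def split_tile_files(filenames, test_size=0.2, seed=7,
                     subject_pattern=r"^(?P<subject>[^_]+)_"):
    """Assign extracted tile files to train/test sets by the subject
    identifier encoded in the filename, Patient-Wise."""
    by_subject = defaultdict(list)
    for name in filenames:
        match = re.match(subject_pattern, name)
        if match is None:
            raise ValueError(f"cannot parse subject from {name!r}")
        by_subject[match.group("subject")].append(name)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(test_size * len(subjects)))
    test = {n for s in subjects[:n_test] for n in by_subject[s]}
    train = {n for s in subjects[n_test:] for n in by_subject[s]}
    return train, test
```

Because the assignment is made over subjects rather than files, no subject can contribute tiles to both partitions, regardless of how many tiles each slide produced.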
4 Results
Data Leakage Effects on Classification Outcome. The results of the four classification tasks using the ResNet-152 pre-trained on ImageNet as backbone model (i.e. feature vectors \(FE_1\)) are reported in Table 2 and Table 3, with the Tile-Wise and the Patient-Wise partitioning protocols, respectively. The average cross validation \(\mathrm {MCC_v}\) and \(\mathrm {ACC_v}\) with 95\(\%\) CI are presented, along with results on the test set (i.e. \(\mathrm {MCC_t}\), and \(\mathrm {ACC_t}\)). State of the art results (i.e. Others) are also reported for comparison, whenever available.
As expected, estimates are more favourable for the TW protocol (Table 2) than for the PW one (Table 3), both in validation and in test, and consistently across all the datasets. Moreover, the inflation of the Tile-Wise estimates is amplified in the multi-class setting (see BreaKHis-2 vs BreaKHis-8). Notably, these results are comparable with those in the literature, consistent with data leakage affecting studies that adopt the Tile-Wise splitting strategy. Results on the GTEx dataset do not show significant differences between the two protocols; however, both MCC and ACC lie in a very high range in either case. Analogous results (not reported here) were obtained using the DenseNet-201 backbone model, further confirming the generality of the conclusions.
Random Labels Detect Signal in the Tile-Wise Split. A data leakage effect is signalled for the Tile-Wise partitioning by a consistently positive MCC in the RLab validation schema (Sect. 3). For instance, for BreaKHis-2 coupled with MLP, \(\mathrm {MCC_{RL}}\) \(=0.354\) (0.319, 0.392) in the Tile-Wise setting, compared with \(\mathrm {MCC_{RL}}\) \(=-0.065\) \((-0.131, 0.001)\) using the Patient-Wise protocol. Full \(\mathrm {MCC_{RL}}\) results over 5 trials of the RLab test are reported in Table 4, with the corresponding \(\mathrm {ACC_{RL}}\) values included for completeness. Notably, all the tests using the Patient-Wise split perform as expected, i.e. with median values near 0, whereas results in the Tile-Wise case exhibit high variability, especially for the BreaKHis-2 dataset (Fig. 4).
Benefits of Domain-Specific Transfer Learning. The adoption of the domain-specific GTEx dataset for transfer learning (Table 5) proves beneficial over the use of ImageNet only (Table 3). Notably, the Patient-Wise partitioning protocol with the \(FE_{2}\) embedding has performance comparable to \(FE_{1}\) with the inflated TW splitting (Table 2). However, only minor improvements are achieved on the BreaKHis-8 task, with results not reaching the state of the art. It must be observed that the BreaKHis dataset is highly imbalanced in the multi-class task; as a countermeasure, the authors of [34, 35] adopted a balancing strategy during data augmentation, which we did not introduce here for comparability with the other experiments.
To verify how much previous domain knowledge can still be re-used for the original task, we devised an additional experiment on the GTEx dataset: on the feature-extractor component (i.e. the convolutional layers) of the model trained on GTEx and fine-tuned on BreaKHis-2, we added back the MLP classifier of the model trained on GTEx. Notably, this configuration recovers high predictive performance (\(\mathrm {MCC_t}=0.983\)) on the classification task after only a single epoch of full training on GTEx.
Patient-Level Performance Analysis. We report patient-wise performance using the ResNet-152 backbone model, either with the \(FE_1\) feature embedding under both Tile-Wise and Patient-Wise protocols (Table 6), or with the \(FE_2\) strategy under the Patient-Wise split (Table 7).
5 Discussion
We report here a short description of the approaches employed by comparable studies on the same datasets considered in this work; we attribute a Patient-Wise partitioning protocol when the authors clearly state the adoption of a train-test split consistent with the patient, or when the code is provided as reference. Notice that the different accuracy scores obtained by deep learning models applied to the same data can be explained by the adoption of diverse experimental protocols (e.g. preprocessing, data augmentation, transfer learning methods).
Nirschl et al. [24] train a CNN on the HF dataset to distinguish patients with or without heart failure. They systematically apply the Patient-Wise rule both for the initial 50–50 train-test split and for partitioning the training set into three folds for cross-validation. Data augmentation strategies are also applied, including random cropping, rotation, mirroring, and staining augmentation. As for the BreaKHis dataset, Alom et al. [29] use a 70–30 Patient-Wise partitioning protocol to train a CNN with several (unspecified) hidden layers, reporting average results from 5-fold cross-validation. Further, the authors apply augmentation strategies (i.e., rotation, shifting, flipping) to enlarge the dataset by a factor of 21 for each magnification level. Han et al. [34] propose a novel CNN, adopting a Tile-Wise partition with the training set accounting for 50% of the dataset. Data augmentation (i.e. intensity variation, rotation, translation, and flipping) is used to adjust for imbalanced classes. Jiang et al. [30] train two different variants of the ResNet model to address the binary and the multi-class task, for each magnification factor. They adopt a Tile-Wise partitioning protocol for the train-test split, using 60% and 70% of the data for training in BreaKHis-2 and BreaKHis-8, respectively. Data augmentation is also exploited during training, and experiments are repeated 3 times.
Other authors employed a similar protocol to address the BreaKHis-8 task by training a CNN pre-trained on ImageNet: Nawaz et al. [33] implemented a DenseNet-inspired model, while Nguyen et al. [36] chose a custom CNN model instead. Both studies use a Tile-Wise partition on the BreaKHis dataset (70–30 and 90–10, respectively) and do not apply any data augmentation. Xie et al. [32] adapt a pre-trained ResNet-V2 to the binary and multi-class tasks of BreaKHis, at different magnification factors, using a 70–30 Tile-Wise partition; data augmentation was applied to balance the least represented classes in BreaKHis-8. Jannesari et al. [31] used a 90–10 Tile-Wise train-test split with data augmentation (i.e. resizing, rotation, cropping, and flipping) to fine-tune a ResNet-V1 for binary and multi-class prediction. Moreover, experiments in [31] were performed combining images at different magnification factors into a unified dataset. Finally, both [37] and [38] used a Tile-Wise train-test split for the prediction of malignant vs. benign samples using a pre-trained CNN, and [38] also employed data augmentation (rotation and flipping).
6 Conclusions
Possibly even more than other areas of computational biology, digital pathology faces the risk of data leakage. The first part of this study clearly demonstrates the impact of weakly designed experiments with deep learning for digital pathology. In particular, we found that predictive performance estimates are inflated unless the DAP strictly confines all tiles extracted from the same subject and/or tissue specimen to either the training or the test set. Fortunately, many studies already adopt the correct procedure [12, 16, 17, 24, 34, 35]. However, we argue that this subtle form of selection bias still constitutes a threat to the reproducibility of AI models that may have affected a considerable number of works; indeed, a significant number of the studies considered here do not explicitly mention a patient-wise strategy [30,31,32,33, 39, 40]. We encourage the community to adopt our code (https://github.com/histolab/histolab/tree/master/examples) as a launchpad for reproducible AI pipelines in digital pathology.
Notes
- DenseNet-201: \(\sim \)12M parameters; ResNet-152: \(\sim \)6M parameters.
References
Ioannidis, J.P.A., et al.: Repeatability of published microarray gene expression analyses. Nat. Genet. 41(2), 149 (2009)
Iqbal, S.A., et al.: Reproducible research practices and transparency across the biomedical literature. PLoS Biol. 14(1), e1002333 (2016)
National Academies of Sciences, Engineering, and Medicine, Policy and Global Affairs. Reproducibility and Replicability in Science. National Academies Press (2019)
Leek, J.T., et al.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11(10), 733 (2010)
Moossavi, S., et al.: Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota. bioRxiv:2020.04.20.052035 (2020)
Turner, B.O., et al.: Small sample sizes reduce the replicability of task-based fMRI studies. Commun. Biol. 1(1), 1–10 (2018)
Barla, A., et al.: Machine learning methods for predictive proteomics. Briefings Bioinform. 9(2), 119–128 (2008)
Peixoto, L., et al.: How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets. Nucleic Acids Res. 43(16), 7664–7674 (2015)
The MAQC Consortium: The MAQC-II project: a comprehensive study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 28(8), 827–838 (2010)
Ching, T., et al.: Opportunities and obstacles for deep learning in biology and medicine. J. Roy. Soc. Interface 15(141), 20170387 (2018)
Saravanan, N., et al.: Data wrangling and data leakage in machine learning for healthcare. Int. J. Emerg. Technol. Innov. Res. 5(8), 553–557 (2018)
Spanhol, F.A., et al.: A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 63(7), 1455–1462 (2016)
Shahidi, F., et al.: Breast cancer classification using deep learning approaches and histopathology image: a comparison study. IEEE Access 8, 187531–187552 (2020)
Cohen, S.: Artificial Intelligence and Deep Learning in Pathology. Elsevier, Amsterdam (2020)
Komura, D., et al.: Machine learning methods for histopathological image analysis. Comput. Struct. Biotechnol. J. 16, 34–42 (2018)
Mormont, R., et al.: Comparison of deep transfer learning strategies for digital pathology. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2343–234309. IEEE (2018)
Marée, R.: The need for careful data collection for pattern recognition in digital pathology. J. Pathol. Inform. 8(1), 19 (2017)
Bizzego, A., et al.: Evaluating reproducibility of AI algorithms in digital pathology with DAPPER. PLOS Comput. Biol. 15(3), 1–24 (2019)
Huang, G., et al.: Densely connected convolutional networks. In: Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE (2018)
He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE (2016)
Barisoni, L., et al.: Digital pathology and computational image analysis in nephropathology. Nat. Rev. Nephrol. 16, 669–685 (2020)
The GTEx Consortium: The genotype-tissue expression (GTEx) project. Nat. Genet. 45(6), 580–585 (2013)
Tomczak, K., et al.: The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19(1A), A68 (2015)
Nirschl, J.J., et al.: A deep-learning classifier identifies patients with clinical heart failure using whole-slide images of H&E tissue. PLOS ONE 13(4), e0192726 (2018)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9(1), 62–66 (1979)
Raschka, S.: Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808v3 (2020)
Kingma, D.P., et al.: Adam: a method for stochastic optimization. In: Published as a conference paper at ICLR 2015. arXiv:1412.6980 (2014)
Jurman, G., et al.: A comparison of MCC and CEN error measures in multi-class prediction. PLOS ONE 7(8), 1–8 (2012)
Alom, M.Z., Yakopcic, C., Nasrin, M.S., Taha, T.M., Asari, V.K.: Breast cancer classification from histopathological images with inception recurrent residual convolutional neural network. J. Digital Imaging 32(4), 605–617 (2019). https://doi.org/10.1007/s10278-019-00182-7
Jiang, Y., et al.: Breast cancer histopathological image classification using convolutional neural networks with small SE-ResNet module. PLOS ONE 14(3), e0214587 (2019)
Jannesari, M., et al.: Breast cancer histopathological image classification: a deep learning approach. In: Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2405–2412 (2018)
Xie, J., et al.: Deep learning based analysis of histopathological images of breast cancer. Front. Genet. 10, 80 (2019)
Nawaz, M., et al.: Multi-class breast cancer classification using deep learning convolutional neural network. Int. J. Adv. Comput. Sci. Appl. 9(6), 316–332 (2018)
Han, Z., et al.: Breast cancer multi-classification from histopathological images with structured deep learning model. Sci. Rep. 7(1), 4172 (2017)
Alom, M.J., et al.: Advanced deep convolutional neural network approaches for digital pathology image analysis: a comprehensive evaluation with different use cases. arXiv:1904.09075 (2019)
Nguyen, P.T., et al.: Multiclass breast cancer classification using convolutional neural network. In: Proceedings of the 2019 International Symposium on Electrical and Electronics Engineering (ISEE), pp. 130–134. IEEE (2019)
Deniz, E., Şengür, A., Kadiroğlu, Z., Guo, Y., Bajaj, V., Budak, Ü.: Transfer learning based histopathologic image classification for breast cancer detection. Health Inf. Sci. Syst. 6(1), 1–7 (2018). https://doi.org/10.1007/s13755-018-0057-x
Myung, J.L., et al.: Deep convolution neural networks for medical image analysis. Int. J. Eng. Technol. 7(3), 115–119 (2018)
Pan, X., et al.: Multi-task deep learning for fine-grained classification/grading in breast cancer histopathological images. In: Lu, H. (ed.) ISAIR 2018. SCI, vol. 810, pp. 85–95. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-04946-1_10
Shallu, R.M.: Breast cancer histology images classification: training from scratch or transfer learning? ICT Exp. 4(4), 247–254 (2018)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Bussola, N., Marcolini, A., Maggio, V., Jurman, G., Furlanello, C. (2021). AI Slipping on Tiles: Data Leakage in Digital Pathology. In: Del Bimbo, A., et al. (eds.) Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol. 12661. Springer, Cham. https://doi.org/10.1007/978-3-030-68763-2_13
Print ISBN: 978-3-030-68762-5
Online ISBN: 978-3-030-68763-2