
1 Introduction

Bioinformatics on high-throughput omics data has been plagued by countless reproducibility issues since its early days; Ioannidis and colleagues [1] found that almost 90% of papers in a leading genetics journal were not repeatable due to methodological or clerical errors. Although the landscape seems to have improved [2], and broad efforts have been made across different biomedical fields [3], computational reproducibility and replicability still fall short of the ideal. Lack of reproducibility has been linked to inaccuracies in managing batch effects [4, 5], small sample sizes [6], or flaws in the experimental design such as data normalization performed simultaneously on development and validation data [7, 8]. The MAQC-II project for reproducible biomarker development from microarray data demonstrated, through a community-wide research effort, that a well-designed Data Analysis Plan (DAP) is mandatory to avoid selection bias in the development of models for high-dimensional datasets [9].

Among the various types of selection bias that threaten the reproducibility of machine learning algorithms, data leakage is possibly the most subtle [10]. Data leakage refers to the use of information from outside the training dataset during model training or selection [11]. A typical leakage occurs when data in the training, validation and/or test sets share indirect information, leading to overly optimistic results. For example, one of the preclinical sub-datasets in the MAQC-II study consisted of microarray data from triplets of mice. These triplets were expected to have an almost identical response for each experimental condition, and therefore they had to be kept together in the DAP partitioning to prevent any possible leakage from training to internal validation data [9].

The goal of this study is to provide evidence that similar issues are still lurking in the grey areas of preprocessing, ready to emerge in the everyday practice of machine learning for digital pathology. The BreaKHis [12] dataset, one of the most popular histology collections of breast cancer samples, has been used in more than 40 scientific papers to date [13], with reported results spanning a broad range of performance. In a non-negligible number of these studies, overfitting due to data leakage is suspected to have affected the reported outcomes.

Deep learning pipelines for histopathological data typically require Whole Slide Images (WSIs) to be partitioned into multiple patches (also referred to as “tiles” [14]) to augment the original training data and to comply with memory constraints imposed by GPU hardware architectures. For example, a single WSI of size \(67,727\times 47,543\) pixels can be partitioned into multiple \(512 \times 512\) tiles, which are randomly extracted and checked to ensure that the selected subregions preserve enough tissue information. These tiles are then processed by data augmentation operators (e.g. random rotation, flipping, or affine transformation) to reduce the risk of overfitting. As a result, the number of subimages originating from the very same histological specimen is greatly amplified [15, 16], and the risk of data leakage increases accordingly. Protocols for data partitioning (e.g. a repeated cross-validation DAP) are not naturally immune to such replicates, so the slide and patient from which each tile originates should be taken into account to avoid any risk of bias [17].

In this work, we quantify the importance of adopting Patient-Wise split procedures with a set of experiments on digital pathology datasets. All experiments are based on DAPPER [18], a reproducible framework for predictive digital pathology composed of a deep learning core (“backbone network”) used as feature encoder, and multiple task-related classification models, i.e. Random Forest or Multi-Layer Perceptron Network (see Fig. 1). We test the impact of various data partitioning strategies on the training of multiple backbone architectures, i.e. DenseNet [19] and ResNet [20] models, fine-tuned to the histology domain.

Our experiments confirm that train-test contamination is a serious modeling concern that hinders the development of a dataset-agnostic methodology, with an impact similar to the lack of standard protocols for the acquisition and storage of WSIs in digital pathology [21]. We therefore present a protocol to prevent data leakage during data preprocessing. The solution is based on histolab, an open-source Python library designed as a reproducible and robust environment for WSI preprocessing, available at https://github.com/histolab/histolab. The approach is demonstrated on two large-scale public datasets: GTEx [22] (non-pathological tissues) and TCGA [23] (cancer tissues).

2 Data Description

We tested our experimental pipeline on three public datasets for image classification in digital pathology, namely GTEx [22], Heart Failure (HF) [24], and BreaKHis [12]. Descriptive statistics of the datasets are reported in Table 1, and Fig. 1.

The GTEx Dataset. The current release of GTEx (v8) includes a total of 15,201 H&E-stained WSIs, acquired with an Aperio scanner (\(20\times \) native magnification) from a cohort of 838 non-diseased donors. In this work, we consider a subset of 265 WSIs from 83 subjects, randomly selected from 11 histological classes. From this subset, we randomly selected a balanced number of WSIs per tissue: adrenal gland (\(n=24\)); bladder (\(n=19\)); breast (\(n=26\)); liver (\(n=26\)); lung (\(n=21\)); ovary (\(n=26\)); pancreas (\(n=26\)); prostate (\(n=24\)); testis (\(n=26\)); thyroid (\(n=26\)); uterus (\(n=21\)).

We implemented a data preprocessing pipeline to prepare the tile dataset from the GTEx collection. First, the tissue region is automatically detected in each WSI; this step combines Otsu-threshold binarization [25] with dilation and hole-filling morphological operations. A maximum of 100 tiles of size \(512 \times 512\) is then randomly extracted from each slide. To ensure that only highly informative images are used, tiles whose tissue area accounts for less than 85% of the whole patch are automatically rejected. At the end of this step, a total of 26,174 random tiles is extracted from the WSIs, each available at different magnification levels (i.e., \(20\times , 10\times , 5\times \)). In this paper we limit experiments and discussion to tiles at \(5\times \) magnification, with no loss of generality.
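As an illustration of the tissue-detection and tile-filtering steps described above, the following sketch re-implements them with scikit-image; the function names are placeholders and the structuring-element size and hole-area threshold are illustrative choices, not the exact values used in our pipeline.

```python
import numpy as np
from skimage import color, filters, morphology


def tissue_mask(rgb_region: np.ndarray) -> np.ndarray:
    """Binary tissue mask via Otsu thresholding plus morphological cleanup.

    Illustrative re-implementation of the preprocessing described above.
    """
    gray = color.rgb2gray(rgb_region)
    # Tissue is darker than the bright glass background.
    mask = gray < filters.threshold_otsu(gray)
    mask = morphology.binary_dilation(mask, morphology.disk(5))
    mask = morphology.remove_small_holes(mask, area_threshold=1024)
    return mask


def keep_tile(tile_rgb: np.ndarray, min_tissue: float = 0.85) -> bool:
    """Reject tiles whose tissue area covers less than 85% of the patch."""
    return tissue_mask(tile_rgb).mean() >= min_tissue
```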

Fig. 1. Experimental environment for evaluation of data leakage impact on machine learning models in digital pathology. (A) Tile datasets are split into train/test sets following either the Tile-Wise or the Patient-Wise protocol; (B) the train set is used to train a backbone network for feature extraction, using different transfer learning strategies; (C) machine learning classifiers on the deep features are evaluated within the Data Analysis Plan.

Table 1. Statistics of the datasets considered in this study.

The HF Dataset. The Heart Failure collection [24] originates from 209 H&E-stained WSIs of left ventricular tissue, each corresponding to a single subject. The learning task is to distinguish images of heart failure (\(n=94\)) from those of non-heart failure (\(n=115\)). Slides in the former class are categorized according to the disease subtype: ischemic cardiomyopathy (\(n=51\)); idiopathic dilated cardiomyopathy (\(n=41\)); undocumented (\(n=2\)). Subjects with no heart failure are further grouped into: normal cardiovascular function (\(n=41\)); non-HF and no other pathology (\(n=72\)); non-HF and other tissue pathology (\(n=2\)). WSIs in this dataset were acquired with an Aperio ScanScope at \(20\times \) native magnification, and then downsampled to \(5\times \) magnification by the authors. From each WSI, 11 non-overlapping patches of size \(250\times 250\) were randomly extracted. The entire collection of 2,299 tiles is publicly available on the Image Data Resource repository (IDR number: idr0042).

The BreaKHis Dataset. The BreaKHis histopathological dataset [12] collects 7,909 H&E-stained tiles (size \(700\times 460\)) of malignant or benign breast tumour biopsies. Tiles correspond to regions of interest manually selected by expert pathologists from a cohort of 82 patients, and are made available at different magnification factors (i.e., \(40\times \), \(100\times \), \(200\times \), \(400\times \)) [12]. To allow for a more extensive comparison with the state of the art, only the \(200\times \) magnification factor is considered in this paper. The BreaKHis dataset currently contains 4 histologically distinct subtypes each of benign and malignant tumours: Adenosis (\(n=444\)); Fibroadenoma (\(n=1,014\)); Tubular Adenoma (\(n=453\)); Phyllodes Tumor (\(n=569\)); Ductal Carcinoma (\(n=3,451\)); Lobular Carcinoma (\(n=626\)); Mucinous Carcinoma (\(n=792\)); Papillary Carcinoma (\(n=560\)). This dataset is used for two classification tasks: (BreaKHis-2) binary classification of benign vs malignant tumour samples; (BreaKHis-8) classification of the 8 distinct tumour subtypes.

3 Methods

The pipeline used in this work is based on the DAPPER framework for digital pathology [18], extended by (i) integrating specialised train-test splitting protocols, i.e. Tile-Wise and Patient-Wise; (ii) extending the feature extractor component with new backbone networks; (iii) applying two transfer learning strategies for feature embedding. Figure 1 shows the three main blocks of the experimental environment defined in this paper: (A) dataset partitioning into training and test sets; (B) feature extraction with different transfer learning strategies; (C) the DAP employed for the machine learning models.

A. Dataset Partitioning Protocols. The tile dataset is partitioned into a training set and a test set, with an 80%/20% split ratio. We compare two data partitioning protocols to investigate the impact of train-test contamination (Fig. 1A): in the Tile-Wise (TW) protocol, tiles are randomly split between the training and the test sets, regardless of the original WSI; the Patient-Wise (PW) protocol splits the tile dataset so that all tiles extracted from the same subject end up either in the training or in the test set. To avoid other sources of leakage due to class imbalance [26], both protocols are combined with stratification of samples over the corresponding classes, and any residual class imbalance is accounted for by weighting the error on the generated predictions.
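The two protocols can be sketched with scikit-learn's splitting utilities. The snippet below is a minimal illustration on synthetic metadata (variable names are placeholders for the tile metadata used in the actual pipeline); note that exact class stratification under the Patient-Wise constraint requires additional care, e.g. a grouped stratified splitter.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Placeholder metadata: one entry per tile, with its class label and source patient.
rng = np.random.default_rng(0)
tile_ids = np.arange(1000)
labels = rng.integers(0, 2, size=1000)
patients = rng.integers(0, 50, size=1000)

# Tile-Wise (TW): tiles are shuffled regardless of their source WSI/patient.
tw_train, tw_test = train_test_split(
    tile_ids, test_size=0.2, stratify=labels, random_state=42
)

# Patient-Wise (PW): all tiles of a patient fall on the same side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
pw_train_idx, pw_test_idx = next(gss.split(tile_ids, labels, groups=patients))

# Sanity check: no patient appears in both the PW training and test sets.
assert not set(patients[pw_train_idx]) & set(patients[pw_test_idx])
```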

B. Deep Learning Models and Feature Extraction. The training set is then used to train a deep neural network for feature extraction (Fig. 1B), i.e. a “backbone” network whose aim is to learn a vector representation of the data (feature embedding). In this study, we consider two backbone architectures, namely ResNet-152 [20], from the residual network family, and DenseNet-201 [19]. Given that the DenseNet model has almost twice as many parameters, and hence a larger computational footprint, diagnostic experiments and transfer learning are performed only with the ResNet-152 model. Similarly to [16] and [18], we started from off-the-shelf versions of the models, pre-trained on ImageNet, and then fine-tuned them to the digital pathology domain using transfer learning. Specifically, we trained the whole network for 50 epochs with a learning rate \(\eta = 10^{-5}\) and the Adam optimizer [27], in combination with the categorical cross-entropy loss. The \(\beta _{1}\) and \(\beta _{2}\) parameters of the optimizer are set to 0.9 and 0.999, respectively, with no regularization. To reduce the risk of overfitting, we use train-time data augmentation, namely random rotation and random flipping of the input tiles.
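A minimal PyTorch sketch of this fine-tuning stage is given below; the augmentation operators and hyperparameters follow the description above, while the function name, the data loader and the device handling are placeholders rather than the actual DAPPER code (the weights enum assumes a recent torchvision release).

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Train-time augmentation (random rotation and flipping), to be applied inside
# the tile Dataset before batching.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=90),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])


def finetune_backbone(train_loader, n_classes, epochs=50, lr=1e-5, device="cuda"):
    """Fine-tune an ImageNet-pretrained ResNet-152 on histology tiles.

    `train_loader` is assumed to yield (tile_batch, label_batch) pairs built
    from the Patient-Wise training split.
    """
    model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, n_classes)  # new task-specific head
    model = model.to(device).train()

    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 betas=(0.9, 0.999), weight_decay=0.0)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for tiles, labels in train_loader:
            tiles, labels = tiles.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(tiles), labels)
            loss.backward()
            optimizer.step()
    return model
```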

The impact of adopting a single-step or a double-step transfer learning strategy in combination with the Patient-Wise partitioning protocol is also investigated in this study. Two sets of feature embeddings (FE) are generated: \(FE_{1}\), from the backbone model fine-tuned from ImageNet; \(FE_{2}\), from the backbone model sequentially fine-tuned from ImageNet and then on GTEx.
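The double-step strategy amounts to reloading the GTEx-fine-tuned weights and replacing the classification head before fine-tuning again on the target task; the short sketch below illustrates the idea (the checkpoint filename and class counts are hypothetical placeholders).

```python
import torch
import torch.nn as nn
from torchvision import models

# FE2: start from the backbone already fine-tuned on GTEx (step 1), then adapt
# it to the target task (step 2). The checkpoint path is a hypothetical example.
backbone = models.resnet152(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 11)   # 11 GTEx tissue classes
backbone.load_state_dict(torch.load("resnet152_gtex_finetuned.pth"))

backbone.fc = nn.Linear(backbone.fc.in_features, 2)    # e.g. BreaKHis-2 head
# ...continue fine-tuning on the target training split, as in the sketch above.
```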

C. Classification and Data Analysis Plan (DAP). The classification is finally performed on the feature embedding within a DAP for machine learning models (Fig. 1C). In this work, we compare the performance of two models: Random Forest (RF) and Multi-Layer Perceptron Network (MLP). In particular, we apply the 10 \(\times \) 5-fold CV schema proposed by the MAQC-II Consortium [9]. In the DAP setting, the input datasets are the separate training and test sets resulting from the 80–20 train-test split protocol. The test set is kept completely unseen by the model and only used for the final evaluation. The training set further undergoes a 5-fold CV iterated 10 times, each time with a different random seed, resulting in 50 separate internal validation sets. These validation sets are generated adopting the same protocol used for the train-test split, namely Tile-Wise or Patient-Wise. The overall performance of the model is evaluated across all the iterations, in terms of average Matthews Correlation Coefficient (MCC) [28] and Accuracy (ACC), both with \(95\%\) Studentized bootstrap confidence intervals (CI). Moreover, results are reported both at tile level and at patient level, in order to assess the ability of the machine learning models to generalise to unseen subjects (see Sect. 4).
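The core of the DAP cross-validation loop can be sketched with scikit-learn as follows; the function name, the Random Forest hyperparameters and the use of StratifiedGroupKFold for the Patient-Wise case are illustrative choices (the actual framework also evaluates an MLP head and computes Studentized bootstrap confidence intervals).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedGroupKFold, StratifiedKFold


def dap_cross_validation(features, labels, patients, patient_wise=True,
                         n_repeats=10, n_folds=5):
    """10x5-fold CV on the training set, returning average MCC and ACC.

    Folds are stratified by class and, in the Patient-Wise case, also grouped
    by patient (StratifiedGroupKFold requires scikit-learn >= 1.0).
    """
    mcc_scores, acc_scores = [], []
    for repeat in range(n_repeats):
        if patient_wise:
            cv = StratifiedGroupKFold(n_splits=n_folds, shuffle=True,
                                      random_state=repeat)
            splits = cv.split(features, labels, groups=patients)
        else:
            cv = StratifiedKFold(n_splits=n_folds, shuffle=True,
                                 random_state=repeat)
            splits = cv.split(features, labels)
        for train_idx, val_idx in splits:
            clf = RandomForestClassifier(n_estimators=500, random_state=repeat)
            clf.fit(features[train_idx], labels[train_idx])
            predictions = clf.predict(features[val_idx])
            mcc_scores.append(matthews_corrcoef(labels[val_idx], predictions))
            acc_scores.append(accuracy_score(labels[val_idx], predictions))
    return float(np.mean(mcc_scores)), float(np.mean(acc_scores))
```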

Fig. 2. Random Labels experimental setting. (A) The labels of the extracted tiles are randomly shuffled consistently within each patient; (B) the train/test split is then performed either Patient-Wise or Tile-Wise.

As an additional safeguard against selection bias, the DAP integrates a random labels schema (RLab) (Fig. 2). In this setting, the training labels are randomly shuffled and presented as reference ground truth to the machine learning models. In particular, we randomize the labels consistently across all the tiles of a single subject, so that they all share the same random label (Fig. 2A); we then apply either the Patient-Wise (Fig. 2B1) or the Tile-Wise (Fig. 2B2) split within the DAP environment. Notice that an average MCC score close to zero (\(MCC \approx 0\)) indicates a protocol immune to sources of bias, including data leakage; we use the RLab validation to expose evidence of data leakage under the TW as opposed to the PW protocol.
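The patient-consistent label shuffling can be implemented in a few lines; the sketch below (with placeholder variable names) permutes the labels at the patient level, so that every tile of a subject inherits the same random label.

```python
import numpy as np


def randomize_labels_per_patient(labels, patients, seed=0):
    """Shuffle class labels at the patient level (RLab schema).

    All tiles of the same patient receive the same random label, so any
    residual predictive signal can only come from leakage between splits.
    """
    rng = np.random.default_rng(seed)
    unique_patients = np.unique(patients)
    # One label per patient (tiles of a patient share the same true label),
    # then permute that patient-level vector.
    patient_label = {p: labels[patients == p][0] for p in unique_patients}
    permuted = rng.permutation([patient_label[p] for p in unique_patients])
    new_patient_label = dict(zip(unique_patients, permuted))
    return np.array([new_patient_label[p] for p in patients])
```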

Performance Metrics. Several patient-wise performance metrics have been defined in the literature [12, 24, 29]. Two metrics are considered in this study: (1) Winner-takes-all (WA), and (2) Patient Score (PS).

In the WA metric, the label associated with each patient corresponds to the majority of the labels predicted for their tiles. With this strategy, standard metrics based on the classification confusion matrix can be used as overall performance indicators; in this paper, ACC is used for comparability with the PS metric. The PS metric is defined for each patient [12] as the ratio between the number \(N_c\) of correctly classified tiles and the total number \(N_P\) of tiles for that patient, namely \(PS=\frac{N_c}{N_P}\). The overall performance is then calculated using the global recognition rate (RR), defined as the average of the PS scores over all patients:

$$ RR = \frac{1}{|P|} \sum _{p \in P} PS_p $$
where \(P\) denotes the set of patients.

In this paper, the WA metric and the PS metric are used for comparison of patient-level results on the HF dataset and the BreaKHis dataset, respectively.
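Both patient-level metrics are straightforward to compute from tile-level predictions; the following sketch (assuming integer-encoded labels and that all tiles of a patient share the same ground-truth label) illustrates the WA accuracy and the recognition rate RR.

```python
import numpy as np


def winner_takes_all_accuracy(y_true, y_pred, patients):
    """WA: each patient receives the majority label among its predicted tiles."""
    correct = []
    for p in np.unique(patients):
        idx = patients == p
        majority = np.bincount(y_pred[idx]).argmax()
        correct.append(majority == y_true[idx][0])
    return float(np.mean(correct))


def recognition_rate(y_true, y_pred, patients):
    """RR: average Patient Score PS = N_c / N_P over all patients."""
    scores = [np.mean(y_pred[patients == p] == y_true[patients == p])
              for p in np.unique(patients)]
    return float(np.mean(scores))
```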

Fig. 3. Workflow of the proposed protocol against data leakage in digital pathology, using the histolab software. The documentation of histolab is available at http://histolab.readthedocs.io.

Preventing Data Leakage: The histolab Library. As a solution to the data leakage pitfall, we have developed a protocol for image and tile splitting based on histolab, an open-source library recently developed for reproducible WSI preprocessing in digital pathology. This library implements a tile extraction procedure whose reliability and quality result from robust design and extensive software testing. A high-level interface for image transformations is also provided, making histolab an easy-to-adopt tool for complex histopathological pipelines.

To intercept data leakage conditions, the protocol is designed to create a leakage-free tile collection (tile extraction combined with the Patient-Wise split) that can be easily integrated into a deep learning workflow (Fig. 3). The protocol is already customized for standardizing WSI preprocessing on GTEx and TCGA, two large-scale public repositories that are widely used in computational pathology. The code can also be adapted to rebuild the training and test datasets from GTEx used in this study, thus extending the HINT collection presented in [18].
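A minimal usage sketch of the tile-extraction step is shown below, based on histolab's Slide and RandomTiler interface; paths and identifiers are illustrative, some argument names may differ across library versions, and the maintained end-to-end protocol lives in the examples folder of the repository.

```python
from histolab.slide import Slide
from histolab.tiler import RandomTiler

# Organize output tiles by subject, so that the collection can later be split
# Patient-Wise simply by assigning whole subject folders to train or test.
wsi_path = "wsi/subject_042.svs"                       # illustrative slide path
slide = Slide(wsi_path, processed_path="tiles/subject_042")

random_tiler = RandomTiler(
    tile_size=(512, 512),
    n_tiles=100,
    level=0,            # native magnification; higher levels give 10x/5x
    seed=42,
    check_tissue=True,  # discard tiles with too little tissue
)
random_tiler.extract(slide)
```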

Table 2. DAP results for each classifier head, using the Tile-Wise partitioning protocol, and the \(FE_1\) feature embedding with the ResNet-152 as backbone model. The average cross validation metrics (\(\mathrm {MCC_v}\) and \(\mathrm {ACC_v}\)) with 95\(\%\) CI are reported for each classification task, along with metrics on the test set (\(\mathrm {MCC_t}\) and \(\mathrm {ACC_t}\)). The Others column reports the highest accuracy achieved among the compared papers.
Table 3. DAP results for each classifier head, using the Patient-Wise partitioning protocol, and the \(FE_1\) feature embedding with the ResNet-152 as backbone model. The average cross validation metrics (\(\mathrm {MCC_v}\) and \(\mathrm {ACC_v}\)) with 95\(\%\) CI are reported for each classification task, along with metrics on the test set (\(\mathrm {MCC_t}\) and \(\mathrm {ACC_t}\)). The Others column reports the highest accuracy achieved among the compared papers.

4 Results

Data Leakage Effects on Classification Outcome. The results of the four classification tasks using the ResNet-152 pre-trained on ImageNet as backbone model (i.e. feature vectors \(FE_1\)) are reported in Table 2 and Table 3, with the Tile-Wise and the Patient-Wise partitioning protocols, respectively. The average cross validation \(\mathrm {MCC_v}\) and \(\mathrm {ACC_v}\) with 95\(\%\) CI are presented, along with results on the test set (i.e. \(\mathrm {MCC_t}\), and \(\mathrm {ACC_t}\)). State of the art results (i.e. Others) are also reported for comparison, whenever available.

As expected, estimates are more favourable for the TW protocol (Table 2) than for the PW one (Table 3), both in validation and on the test set, consistently across all datasets. Moreover, the inflation of the Tile-Wise estimates is amplified in the multi-class setting (see BreaKHis-2 vs BreaKHis-8). Notably, the Tile-Wise results are comparable with those reported in the literature, suggesting evidence of data leakage in studies adopting the Tile-Wise splitting strategy. Results on the GTEx dataset do not show significant differences between the two protocols; however, both MCC and ACC lie in a very high range in either case. Analogous results (not reported here) were obtained using the DenseNet-201 backbone model, further confirming the generality of the derived conclusions.

Random Labels Detect Signal in the Tile-Wise Split. A data leakage effect is signalled for the Tile-Wise partitioning by an MCC that remains consistently positive in the RLab validation schema (Sect. 3). For instance, for BreaKHis-2 coupled with the MLP classifier, \(\mathrm {MCC_{RL}}\) \(=0.354\) (0.319, 0.392) in the Tile-Wise setting, to be compared with \(\mathrm {MCC_{RL}}\) \(=-0.065\) \((-0.131, 0.001)\) using the Patient-Wise protocol. Full \(\mathrm {MCC_{RL}}\) results over 5 trials of the RLab test are reported in Table 4, with the corresponding \(\mathrm {ACC_{RL}}\) values included for completeness. Notably, all the tests using the Patient-Wise split perform as expected, i.e. with median values near 0, whereas the Tile-Wise results exhibit high variability, especially for the BreaKHis-2 dataset (Fig. 4).

Table 4. Random Labels (RLab) results using the ResNet-152 as backbone model, and Tile-Wise and Patient-Wise train-test split protocols. The average \(\mathrm {MCC_{RL}}\) and \(\mathrm {ACC_{RL}}\) with 95\(\%\) CI are reported.
Table 5. DAP results for each classifier head, using the Patient-Wise partitioning protocol, and the \(FE_2\) feature embedding with ResNet-152 as backbone model. The average cross validation \(\mathrm {MCC_v}\) and \(\mathrm {ACC_v}\) with 95\(\%\) CI are reported, along with results on the test set (i.e. \(\mathrm {MCC_t}\), and \(\mathrm {ACC_t}\)). The Others column reports the highest accuracy achieved among the compared papers.
Fig. 4. \(\mathrm {MCC_{RL}}\) results on the test set. TW: Tile-Wise, PW: Patient-Wise.

Table 6. Patient-level results for each classifier head, using the Patient-Wise and Tile-Wise partitioning protocols, and the \(FE_1\) feature embedding with the ResNet-152 backbone model. The average cross-validation Patient-level accuracy with 95\(\%\) CI (\(\mathrm {ACC_v}\)), and corresponding scores on the test set (\(\mathrm {ACC_t}\)), are reported. The Others column reports the highest accuracy achieved among the compared papers.

Benefits of Domain-Specific Transfer Learning. The adoption of the domain-specific GTEx dataset for transfer learning (Table 5) proves to be beneficial over the use of ImageNet only (Table 3). Notably, the Patient-Wise partitioning protocol with the \(FE_{2}\) embedding reaches performance comparable to that of \(FE_{1}\) with the inflated TW splitting (Table 2). However, only minor improvements are achieved on the BreaKHis-8 task, with results not reaching the state of the art. It must be observed that the BreaKHis dataset is highly imbalanced in the multi-class task; as a countermeasure, the authors of [34, 35] adopted a balancing strategy during data augmentation, which we did not introduce here for comparability with the other experiments.

To verify how much of the previously acquired domain knowledge can still be reused for the original task, we devised an additional experiment on the GTEx dataset: to the feature extractor component (i.e. the convolutional layers) of the model trained on GTEx and fine-tuned on BreaKHis-2, we added back the MLP classifier of the model trained on GTEx. Notably, this configuration recovers high predictive performance (\(\mathrm {MCC_t}\) = 0.983) on the classification task after only a single epoch of full training on GTEx.

Patient-Level Performance Analysis. We report patient-level performance using the ResNet-152 backbone model, with the \(FE_1\) feature embedding under both the Tile-Wise and Patient-Wise protocols (Table 6), and with the \(FE_2\) strategy under the Patient-Wise split (Table 7).

Table 7. Patient-level results for each classifier head, with the Patient-Wise partitioning protocol and the \(FE_2\) feature embedding with the ResNet-152 model. The average cross-validation Patient-level accuracy with 95\(\%\) CI (\(\mathrm {ACC_v}\)) and corresponding scores on the test set (\(\mathrm {ACC_t}\)) are reported. The Others column reports the highest accuracy achieved among the compared papers.

5 Discussion

We report here a short description of the approaches employed by comparable studies on the same datasets considered in this work; we refer to a Patient-Wise partitioning protocol when the authors clearly state the adoption of a train-test split consistent with the patient, or when the code is provided as a reference. Notice that the different accuracy scores obtained by deep learning models applied to the same data can be explained by the adoption of diverse experimental protocols (e.g. preprocessing, data augmentation, transfer learning methods).

Nirschl et al. [24] train a CNN on the HF dataset to distinguish patients with or without heart failure. They systematically apply the Patient-Wise rule both for the initial train-test split (50–50) and for partitioning the training set into three folds for cross-validation. Data augmentation strategies are also applied, including random cropping, rotation, mirroring, and staining augmentation. As for the BreaKHis dataset, Alom et al. [29] use a 70–30 Patient-Wise partitioning protocol to train a CNN with several (not specified) hidden layers, reporting average results from 5-fold cross-validation. Furthermore, the authors apply augmentation strategies (i.e., rotation, shifting, flipping) to increase the dataset by a factor of \(21\times \) for each magnification level. Han et al. [34] propose a novel CNN adopting a Tile-Wise partition, with the training set accounting for 50% of the dataset. Data augmentation (i.e. intensity variation, rotation, translation, and flipping) is used to adjust for imbalanced classes. Jiang et al. [30] train two different variants of the ResNet model to address the binary and the multi-class tasks, for each magnification factor. They adopt a Tile-Wise partitioning protocol for the train-test split, using 60% and 70% of the data in the training set for BreaKHis-2 and BreaKHis-8, respectively. Data augmentation is also exploited in the training process, and experiments are repeated 3 times.

Other authors employed a similar protocol to address the BreaKHis-8 task by training a CNN pretrained on ImageNet: Nawaz et al. [33] implemented a DenseNet-inspired model, while Nguyen et al. [36] chose a custom CNN model instead. Both studies use a Tile-Wise partition on the BreaKHis dataset (70–30 and 90–10, respectively), and do not apply any data augmentation. Xie et al. [32] adapt a pre-trained ResNet-V2 to the binary and multi-class tasks of BreaKHis, at different magnification factors, using a 70–30 Tile-Wise partition; data augmentation is applied to balance the least represented class in BreaKHis-8. Jannesary et al. [31] used a 90–10 Tile-Wise train-test split with data augmentation (i.e. resizing, rotation, cropping and flipping) to fine-tune a ResNet-V1 for binary and multi-class prediction. Moreover, experiments in [31] were performed combining images at different magnification factors into a unified dataset. Finally, both [37] and [38] used a Tile-Wise train-test split for the prediction of malignant vs benign samples using a pre-trained CNN, and [38] also employed data augmentation (rotation and flipping).

6 Conclusions

Possibly even more than other areas of computational biology, digital pathology faces the risk of data leakage. The first part of this study clearly demonstrates the impact of weakly designed deep learning experiments for digital pathology. In particular, we found that predictive performance estimates are inflated whenever the DAP does not strictly keep all tiles extracted from the same subject and/or tissue specimen within either the training or the test dataset. Fortunately, many studies already adopt the correct procedure [12, 16, 17, 24, 34, 35]. However, we argue that this subtle form of selection bias still constitutes a threat to the reproducibility of AI models, and that it may have affected a considerable number of works; indeed, a significant portion of the studies considered in this work do not explicitly mention a patient-wise strategy [30,31,32,33, 39, 40]. We encourage the community to adopt our code (https://github.com/histolab/histolab/tree/master/examples) as a launchpad for reproducibility of AI pipelines in digital pathology.