Introduction

Validation studies on the feasibility of whole slide imaging (WSI) systems have been conducted by pathology laboratories across a wide range of subspecialties to produce solid evidence supporting the use of this technology for several applications, including primary diagnosis. The guideline statement of the College of American Pathologists Pathology and Laboratory Quality Center (CAP-PLQC) on the validation of WSI systems summarizes recommendations, suggestions, and expert consensus opinion about the methodology of validation studies in an effort to standardize the process. The guideline recommends including a sample set of at least 60 cases for one application and establishing diagnostic concordance between digital and glass slides for the same observer—intraobserver variability—with a minimum washout period of 2 weeks between views [1]. Surprisingly, the recommendations do not suggest a consecutive or random selection of cases or a need to blind evaluators, but they do note that the viewing order can be random or non-random.

Validation studies are cross-sectional studies by definition, and their designs have many methodological variations, which should be considered when evidence is assembled [2]. All these variations can lead to skewed estimates of test accuracy. The most important variation concerns how the sample was selected, included, and analyzed [3]. Aspects regarding the configuration of the test, its purpose, and the risks that could prevent it from serving that purpose should be considered in validation studies, since performance may be influenced by analysis bias; reproducibility; washout period; response time; and the size, scope, and suitability of certain types of specimens. In addition, the learning curve and performance problems may be related to the method or to the pathologists [2]. The order of the analyses—digital or conventional—does not appear to affect interpretation in this context [3].

The most common biases in diagnostic studies are verification bias/detection bias/work-up bias (when the reference standard is not applied to the entire sample), incorporation bias (when the index test and the reference standard are not independent, which leads to overestimation of the sensitivity and specificity of the test), and inspection bias (when the tests are not blinded). The methodological characteristics should be evaluated individually by domain, which represents the way the study was conducted [4].

The most common problems identified in the design of previously published validation studies are case selection—selected samples cover a narrow range of subspecialty specimens or consist of known malignant diagnoses—and comparison of the study results with a “gold standard”/consensus diagnosis/expert diagnosis instead of establishing concordance by assessing intraobserver agreement [5].

The FDA recently approved a WSI system for primary diagnosis [6]; although this approval offers some assurance about the safety and feasibility of digital systems, only one device was tested and approved. Despite this achievement, individual validation studies conducted by each laboratory and customized for each service and WSI system remain necessary and will provide the best evidence to attest to the feasibility of digital pathology, especially if based on the CAP-PLQC guidelines.

Given the absence of broader collective agreement on the use of WSI in human pathology, it is necessary to assemble evidence on the performance of digital microscopy in order to establish whether this technology can be used to provide a primary diagnosis. Therefore, this systematic review assessed the diagnostic performance of WSI in human pathology. In addition, this review compiled the main reported reasons for disagreements.

Materials and methods

The present systematic review was conducted following the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [7] and was registered in the PROSPERO database under protocol CRD42018085593. The review question was defined as: “Is digital microscopy performance as reliable for use in clinical practice and routine surgical pathology for diagnostic purposes as conventional microscopy?” The best evidence to answer this question comes from intraobserver agreement [1].

Definition of eligibility criteria

The eligibility criteria (Table 1) were based on two important recommendations and one suggestion established by the CAP-PLQC guideline [1]: the validation process should include a sample set of at least 60 cases for one application, the validation study should establish diagnostic concordance between digital and glass slides for the same observer (i.e., intraobserver variability), and a washout period of at least 2 weeks should separate the viewing of digital and glass slides.

Table 1 Inclusion and exclusion criteria
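As an illustration only (not part of the original review protocol), the three CAP-PLQC-derived criteria above can be expressed as a simple screening check; the field names (sample_size, intraobserver_design, washout_weeks) are hypothetical.

```python
# Illustrative sketch only: the three CAP-PLQC-derived eligibility checks
# expressed as code. Field names are hypothetical, not part of the review.

def meets_capplqc_criteria(study: dict) -> bool:
    """Return True if a candidate study satisfies the three criteria used
    in this review: >=60 cases for one application, an intraobserver
    (same-observer) design, and a washout period of >=2 weeks."""
    return (
        study.get("sample_size", 0) >= 60
        and study.get("intraobserver_design", False)
        and study.get("washout_weeks", 0) >= 2
    )

# Example: a study with 100 cases, an intraobserver design, 3-week washout
print(meets_capplqc_criteria(
    {"sample_size": 100, "intraobserver_design": True, "washout_weeks": 3}
))  # True
```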

Literature review

Recognizing the need to check whether similar systematic reviews had been registered, executed, were in progress, or had been published on the same theme, the primary researcher (ALA) first conducted a preliminary literature review. A systematic review with a similar proposal, registered with PROSPERO in 2015 under protocol CRD42015017859 and entitled “The diagnostic accuracy of digital microscopy: a systematic review,” was in progress. Two published systematic reviews were found: “A systematic analysis of discordant diagnoses in digital pathology compared with light microscopy” [8] and “The Diagnostic Concordance of Whole Slide Imaging and Light Microscopy: A Systematic Review” [9]. Based on these findings, the research team decided to proceed with the present systematic review, since its methodology focused on studies that followed the CAP-PLQC guidelines [1]. Such well-designed studies can provide more reliable evidence about the performance of WSI systems for primary diagnosis in human pathology than the previously published systematic reviews.

Search strategy

An electronic search was carried out in the following databases: Scopus (Elsevier, Amsterdam, the Netherlands), MEDLINE via the PubMed platform (National Center for Biotechnology Information, US National Library of Medicine, Bethesda, Maryland), and Embase (Elsevier, Amsterdam, the Netherlands). Scopus was searched first (owing to its interdisciplinary scope and article-indexing capabilities) in order to align the keywords. The search strategy was: [ALL (validation) AND ALL (“whole slide image”)]. The search was then reproduced in the other databases. As a result, 599 articles were retrieved from Scopus, 132 from Embase, and 115 from PubMed. A manual search was conducted to identify any eligible articles that might not have been retrieved by the search strategy, but none were compatible with the eligibility criteria.

Article screening and eligibility evaluation

Two reviewers (ALDA and ARSS) independently screened the articles by reading titles and abstracts and excluding those that clearly did not fulfill the eligibility criteria. The assessment of eligibility was guided by a flow diagram drawn up in phase 2 of the quality assessment. The two reviewers then read the full text of the screened articles to identify those eligible; the primary reason for each exclusion was recorded for the composition of the article selection flow chart. Rayyan QCRI was used as the reference manager to screen the articles, exclude duplicates, and register the primary reason for each exclusion [10].

Extraction of qualitative and quantitative data and quality assessment

Data extraction was conducted by the primary researcher (ALDA) and guided by a tailored data extraction form (Appendix 1) originally suggested by The Cochrane Collaboration [11]. The tailored tool has five sections: general information; eligibility; interventions, participants, and sample; methods; and risk of bias assessment, applicability, and outcomes. The “risk of bias assessment” and “applicability” sections were added based on a tailored QUADAS-2 (University of Bristol, Bristol, England), a tool designed to assess the quality of primary diagnostic accuracy studies. Specific guidance was produced for each signaling question, and signaling questions that did not apply to the review were removed (Appendix 2). Qualitative and quantitative data were tabulated and processed in Microsoft Excel®. The studies identified in this review were highly heterogeneous with regard to the equipment used, magnification, number of pathologists involved, specimen type (subspecialty), washout time, and, above all, how the sample was analyzed. These variations in study design represent limitations; they did not justify a meta-analysis and allowed only a narrative synthesis of the findings from the included studies.

Results

PRISMA flowchart

The search strategy identified a total of 846 records through database searching. After duplicates were removed, 681 records were screened; among these, 48 articles were selected to be assessed for eligibility. A total of 13 articles [12,13,14,15,16,17,18,19,20,21,22,23,24] were included and 35 articles were excluded based on eligibility criteria. The composition of the article selection flow is shown in Fig. 1.
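The record flow reported above is internally consistent with the per-database counts given in the search strategy; a minimal check, using only numbers stated in the text, is sketched below.

```python
# Minimal consistency check of the reported PRISMA flow,
# using only counts stated in the text.
identified = 599 + 132 + 115                 # Scopus + Embase + PubMed = 846
screened = 681                               # after duplicate removal
duplicates_removed = identified - screened   # 165 duplicates
full_text_assessed = 48
included = 13
excluded_full_text = full_text_assessed - included  # 35, as reported

print(identified, duplicates_removed, excluded_full_text)  # 846 165 35
```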

Fig. 1
figure 1

Flow diagram of literature search adapted from PRISMA (Moher et al. 2009)

One article (2.1%) [25] was excluded for being published in French, one (2.1%) [26] for having an insufficient sample size, one (2.1%) [27] for having a sample with a known malignant diagnosis, and 11 studies (22.2%) [28,29,30,31,32,33,34,35,36,37,38] for being available only as abstracts (gray literature). Two studies (4.1%) [39, 40] were excluded because their main objective was not to examine diagnostic concordance between WSI and conventional light microscopy (CLM). Four studies (8.3%) [8, 41,42,43] were excluded because they used an insufficient washout time between the analyses.

The most important eligibility criterion establishes that intraobserver agreement should be the preferred measure of the performance of digital microscopy, according to the CAP-PLQC guidelines [1]. Thirteen studies (27.1%) did not meet this criterion and were excluded for the following reasons: in six studies (12.5%) [44,45,46,47,48,49], the pathologists assessed only WSI and concordance was established by comparing the WSI diagnosis with the original glass slide diagnosis; in four studies (8.3%) [50,51,52,53], the WSI diagnosis was compared with a consensus panel diagnosis; in one study (2.1%) [54], one group of students assessed only WSI and the other assessed only glass slides; and in two studies (4.1%) [55, 56], the sample analyzed was not the same in both methods. Two further studies (4.1%) [57, 58] reported neither an intraobserver concordance percentage nor a kappa value. Disagreements between the reviewers during screening and the assessment of eligibility were resolved by consensus.
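Summing the exclusion reasons reported in the two paragraphs above reproduces the 35 full-text exclusions; a sketch of that tally (labels are shortened paraphrases of the text):

```python
# Tally of the full-text exclusion reasons reported above.
exclusions = {
    "published in French": 1,
    "insufficient sample size": 1,
    "known malignant diagnoses": 1,
    "abstract only (gray literature)": 11,
    "concordance not the main objective": 2,
    "insufficient washout time": 4,
    "WSI compared with original glass-slide diagnosis": 6,
    "WSI compared with consensus panel diagnosis": 4,
    "separate observer groups per modality": 1,
    "different samples per modality": 2,
    "no concordance percentage or kappa reported": 2,
}
print(sum(exclusions.values()))  # 35 excluded full-text articles
```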

Methodological characteristics of the studies

Publication dates ranged from 2010 to 2017. Only six articles (46.1%) [18,19,20,21, 23, 24] mentioned the use of the CAP-PLQC guidelines, but the methodologies of all included studies conformed to these guidelines. The included studies used scanners from eight different manufacturers. The most commonly used scanner was the ScanScope (Aperio, Vista, CA), which was reported in eight studies (61.5%) [13,14,15,16, 18, 19, 21, 23] (Table 2).

Table 2 Technical characteristics of the equipment used in included studies

The aims of the studies were highly variable: five (38.4%) [13,14,15, 17, 21] aimed to test the feasibility of digital methods, two (15.4%) [21, 24] aimed to determine the utility of the CAP-PLQC guidelines [1], two (15.4%) [20, 23] intended to assess primary digital pathology reporting, one (7.7%) [22] proposed to determine the accuracy of WSI interpretation, one (7.7%) [12] proposed to investigate whether conventional microscopy of skin tumors can be replaced by virtual microscopy, one (7.7%) [19] proposed to evaluate whether diagnosis from WSI is inferior to diagnosis from glass slides, and one (7.7%) [16] aimed to evaluate the use of WSI for the diagnosis of placental tissue and pediatric biopsies.

The most relevant methodological characteristics of the included studies are shown in Table 3. Full information about the methodological characteristics of the studies included in this systematic review is available as supplementary material (Supplementary Table 1). The included studies performed validations in the following areas: dermatopathology, neuropathology, gastrointestinal, genitourinary, breast, liver, and pediatric pathology. Surgical pathology specimens in pediatric pathology included gastrointestinal, heart, liver, lung, neuropathology, placenta, rectal suction, skin, and tonsil specimens. Subsets also included endocrine, head and neck, hematopoietic organ, hepatobiliary-pancreatic organ, soft tissue, bone, hematopathology, medical kidney, and transplant biopsies.

Table 3 Methodological characteristics of the included studies

These 13 papers included a total sample of 2145 glass slides and corresponding digital slides, of which the largest subset, 695 (32.4%), came from dermatopathology, followed by 200 (9.3%) from gastrointestinal pathology. The mean number of samples per included study was 165. Four studies included cases from various pathology subspecialties.
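The proportions and the mean sample size quoted above follow directly from the totals reported; a short check:

```python
# Reproducing the sample-composition figures from the totals given above.
total_slides = 2145
dermatopathology = 695
gastrointestinal = 200
n_studies = 13

print(round(dermatopathology / total_slides * 100, 1))  # 32.4 (%)
print(round(gastrointestinal / total_slides * 100, 1))  # 9.3 (%)
print(round(total_slides / n_studies))                  # 165 cases per study (mean)
```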

The samples were analyzed in two different ways: (1) pathologists assessed the cases with one modality and—after a washout period—reassessed them with the other modality; or (2) when WSI diagnoses were compared with the original glass slide diagnoses, the cases were assigned to the original pathologist, which provided a satisfactory washout period and maintained intraobserver agreement as the preferred measure. In one study (7.7%) [24], half of the cases were first evaluated as glass slides and half as digital images, and each half was then reviewed with the other modality after washout. The washout period between views in the included studies ranged from 2 weeks to 12 months.

Three studies (23%) [12, 23, 24] reported set training, and eight (61.5%) stated that the pathologists had previous experience with WSI systems. One study (7.7%) [20] did not include a trained pathologist in the validation process but stated that the pathologist was familiar with the method. Neither set training nor previous experience was mentioned in one study (7.7%) [18].

Only one study (7.7%) [13] measured the scan time of slides (an average of 2.5 min), and only one (7.7%) [24] measured the diagnosis time (median time of 132 s for glass slides and 210 s for WSI). Two studies (15.4%) [16, 17] considered WSI more time-consuming than CLM, although no formal timings were performed. The use of a consensus diagnosis was mentioned in three included studies (23%) [19, 20, 23].

Intraobserver concordance

Among the included studies, one (7.7%) [12] did not report the percentage of concordance but reported an almost perfect kappa index of 0.93. Two other studies (15.4%) [21, 22] reported the concordance percentage for each pathologist rather than an overall concordance percentage. For these reasons, these three studies are not represented graphically in Fig. 2; they are, however, detailed in Table 3. The majority of the reported intraobserver agreements showed excellent concordance, with values ranging from 87% to 98.3% (κ coefficient range 0.8–0.98). Only one study (7.7%) [24] showed a lower concordance of 79%. All intraobserver agreement values are shown in Table 3. Interobserver agreement was reported in addition to intraobserver agreement in four studies (30.7%) [12, 19, 20, 22].
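For readers unfamiliar with the two measures reported above, a minimal sketch of how intraobserver percent agreement and Cohen's kappa are computed from one pathologist's paired diagnoses follows; the diagnosis labels are hypothetical and purely illustrative.

```python
from collections import Counter

def intraobserver_agreement(glass: list, digital: list):
    """Percent agreement and Cohen's kappa for one pathologist's paired
    glass-slide and WSI diagnoses of the same cases."""
    n = len(glass)
    observed = sum(g == d for g, d in zip(glass, digital)) / n   # p_o
    glass_freq, digital_freq = Counter(glass), Counter(digital)
    expected = sum(                                              # p_e (chance agreement)
        (glass_freq[c] / n) * (digital_freq[c] / n)
        for c in set(glass) | set(digital)
    )
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical paired diagnoses for illustration only
glass   = ["benign", "malignant", "benign", "dysplasia", "benign"]
digital = ["benign", "malignant", "benign", "benign",    "benign"]
print(intraobserver_agreement(glass, digital))  # (0.8, kappa ~0.58)
```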

Fig. 2
figure 2

Graphic presentation of intraobserver agreement of included studies

Reasons for disagreements

Of these 13 included studies, ten (77%) reported a total of 128 disagreements [13,14,15,16,17,18,19,20,21, 23]. The other three studies (23%) [12, 22, 24] did not report the exact number and nature of the disagreements. We provide an overview of the reasons for disagreements—i.e., pitfalls—according to the subspecialties of pathology (Fig. 3). Among all the reasons that might explain the occurrence of disagreements, the most frequent were borderline, difficult, or challenging cases, which were reported in seven articles (53.8%) [12, 13, 15, 19, 20, 22, 24]. Along with the subjective nuances of non-melanocytic lesions with dysplasia and of melanocytic lesions (widely considered challenging cases), the study by Kent et al. also reported the lack of clinical information in inflammatory lesions as a reason for disagreement [19]. Nielsen reported challenging diagnoses—specifically actinic keratosis—as a reason for discordance, as well as the poor quality of the biopsies, the lack of clinical information, and the inexperience of the pathologists [12].

Fig. 3
figure 3

Overview of the reasons for disagreements (pitfall) according to subspecialties of pathology

Six authors reported limitations of the equipment and/or limited image resolution as reasons for disagreements. Among these, one study (7.7%) [18] indicated pitfalls in the identification of eosinophilic granular bodies in brain biopsies, eosinophils, and nucleated red blood cells (which demonstrate refractile eosinophilic cytoplasm). Another study (7.7%) [21] reported difficulties in identifying mitotic figures, nuclear details, and chromatin patterns in neuropathology specimens. Three articles (23%) [14, 16, 24] reported difficulties in identifying microorganisms (e.g., Candida albicans, Helicobacter pylori, and Giardia lamblia). Thrall et al. also reported a limitation of the technology related to the lack of image clarity at magnifications above 320×, where the image becomes pixelated and unclear [24]. However, the authors stated that the intraobserver variances did not derive from technical limitations of WSI.

The lack of clinical information was reported by four authors [12, 17, 19, 21] as a source of disagreements.

Two studies (15.4%) reported poor quality of the biopsy—specifically, the small size of the material [22] or inadequate routine laboratory processing [12]—as a reason for disagreements.

Another cited reason was the use of suboptimal navigation tools, reported by two authors (15.4%) [16, 17]. One author (7.7%) [23] remarked on the difficulty of determining whether a discordance arises from disagreement between the methods or from intraobserver disagreement in the pathological diagnosis; the author likely intended to refer to variation in the interpretation of the pathological diagnosis, so “intraobserver disagreement” should not be used in this context. One author (7.7%) indicated the lack of immunohistochemical/special stains as a source of discordance [21].

Furthermore, nine studies (69.2%) [12,13,14,15,16, 19, 21, 22, 24] did not consider the performance of the digital method—i.e., limitations of the equipment, insufficient magnification/limited image resolution—as reasons for disagreements.

Eight studies (61.5%) [13,14,15,16,17,18, 20, 23] provided a preferred diagnosis when disagreements occurred. These preferred diagnoses were reached by reviewing the discordant cases and choosing the most accurate diagnosis. Among 99 disagreements, only 37 (37.3%) had preferred diagnoses rendered by means of WSI.

Categorization for digital pathology discrepancies

To summarize the pitfalls of digital pathology practice and better address the root cause of the discordances, we developed a Categorization for Digital Pathology Discrepancies, which can be used to report reasons for disagreements in future validation studies. This categorization can help establish whether there are valid concerns about the performance of the digital method (Table 4). We based this categorization on data retrieved from this systematic review and from the previously published systematic reviews [9, 59].

Table 4 Categorization for digital pathology discrepancies
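The actual categories are listed in Table 4; purely as an illustration of how such a categorization could be applied when recording discordances, the sketch below uses category labels drawn from the reasons reported in the preceding paragraphs (these labels are ours and are not necessarily those of Table 4).

```python
from enum import Enum

class DiscrepancyCategory(Enum):
    """Illustrative categories drawn from the reasons for disagreement
    reported in the included studies; not necessarily identical to the
    labels used in Table 4."""
    BORDERLINE_OR_CHALLENGING_CASE = "interpretive: borderline/difficult case"
    EQUIPMENT_OR_IMAGE_RESOLUTION = "technical: scanner/image resolution limit"
    MISSING_CLINICAL_INFORMATION = "pre-analytical: clinical data not supplied"
    POOR_BIOPSY_QUALITY = "pre-analytical: specimen/processing quality"
    SUBOPTIMAL_NAVIGATION_TOOLS = "technical: viewer/navigation tools"
    MISSING_ANCILLARY_STAINS = "pre-analytical: IHC/special stains unavailable"
    OBSERVER_VARIABILITY = "interpretive: intraobserver variation"

# Example record for a single discordant case (hypothetical)
discordance = {
    "case_id": "D-042",
    "subspecialty": "dermatopathology",
    "category": DiscrepancyCategory.BORDERLINE_OR_CHALLENGING_CASE,
    "preferred_modality": "WSI",
}
print(discordance["category"].value)
```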

Quality assessment (risk of bias)

The results of the quality assessment are shown in Table 5 and Fig. 4. Of the 13 included articles, two (15.4%) [13, 14] presented an unclear risk of bias in the sample selection domain because the selection criteria of the sample were not clarified (e.g., whether selection was randomized or consecutive). One study [21] excluded several lesions not relevant to the study (pituitary adenomas, degenerative diseases or other reactive lesions, metastatic carcinomas and melanomas, vascular malformations, and other benign or descriptive diagnoses such as meningoceles, dermoid cysts, or focal cortical dysplasia) and also excluded cases for which the slides were not available for WSI scanning. These exclusions were acceptable and do not indicate bias. Two studies (15.4%) [21, 24] presented a high risk of bias in the index test domain owing to the absence of a specified threshold; the term “threshold” refers to the parameters used to classify the diagnoses—e.g., as concordant, slightly discordant, or discordant. The risk of bias was considered low in all other domains across the remaining included studies. Regarding applicability, all included studies were classified as of low concern in all domains.
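A compact way to record and tally these judgments by domain is sketched below; the domain names follow the standard QUADAS-2 risk-of-bias domains, and the counts simply mirror the summary in the paragraph above (Table 5 holds the study-level assessments).

```python
# Sketch of how the QUADAS-2 risk-of-bias judgments summarized above can
# be tallied by domain; counts reflect the text (13 included studies).
from collections import Counter

domains = {
    "patient selection":  ["unclear"] * 2 + ["low"] * 11,  # [13, 14] unclear
    "index test":         ["high"] * 2 + ["low"] * 11,     # [21, 24] no threshold
    "reference standard": ["low"] * 13,
    "flow and timing":    ["low"] * 13,
}

for domain, judgments in domains.items():
    print(domain, dict(Counter(judgments)))
```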

Table 5 QUADAS-2
Fig. 4
figure 4

Graphic presentation for QUADAS-2 results for included studies

Discussion

Validation studies have improved over time, and the recommendations of the CAP-PLQC guidelines are particularly important in this respect, since the standardization of study designs yields validations with homogeneous methodology [1]. The main purpose of systematic reviews is to minimize the chance of systematic error by eliminating studies with a high risk of bias. Therefore, the exclusion of studies with highly discrepant methodologies allowed the comparison of only well-designed studies and the drawing of solid, reliable conclusions. The way the sample is analyzed should encompass the index test and the reference standard, with appropriate timing between analyses of paired samples (glass slides and corresponding digital slides). The analyses must be blinded, and the sample flow should encompass the analysis of all glass slides by CLM and, after the washout, the analysis of all corresponding digital slides.

Studies with a known malignant diagnosis, which may lead to falsely high performance estimates, and studies that compared the WSI diagnosis with an original or consensus diagnosis were excluded. These issues represent the most common problems in validation studies [60] and generate selection bias [4]. The use of the index test alone and comparison with a consensus panel refer to the concept of accuracy, which is not the recommended design for this particular purpose. Three articles included in this systematic review mentioned a consensus diagnosis in two different yet justifiable situations: to include in the sample only cases appropriate for the intended purpose [19], and to reach a preferred diagnosis in discordant cases [20, 23]. The importance of reaching a preferred diagnosis lies in the possibility of identifying the pitfalls and missing details of the pathology, which are decisive in some cases [1].

Among the included studies, one (7.7%) [22] proposed to determine the accuracy of WSI interpretation but presented intraobserver agreement instead. Accuracy is defined as the concordance between the result of the method under test and the diagnosis established by a consensus or gold standard, whereas intraobserver agreement is essentially the percentage of concordance between the diagnoses reached by one observer when assessing the two diagnostic modalities [1]. The outcome of this study was not aligned with its aim but provided appropriate data, which allowed correct interpretation of the results. Another study [12] proposed to evaluate whether conventional microscopy can be replaced by virtual microscopy and, for this purpose, measured accuracy, sensitivity, specificity, and positive/negative predictive values. Accuracy, in this context, was defined as the sum of the percentages of concordance and minor discordance, which is not the best definition of the concept. Diagnostic performance was intended to be calculated by means of sensitivity and specificity. However, sensitivity and specificity are used to calculate the reliability of the method and indicate the consistency of the results as the test is repeated, not the performance of the test. Fortunately, this study also provided the percentage of concordance (intraobserver agreement) between WSI and CLM diagnoses. It is very important to delineate the study design correctly according to the aim; these sources of inconsistency generate divergent measures and provide conflicting and unreliable data.
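To make the distinction in the paragraph above concrete, a minimal sketch contrasts accuracy (computed against a consensus/gold-standard diagnosis) with intraobserver percent agreement (computed between one observer's paired readings, with no reference standard); the diagnosis labels are hypothetical.

```python
def accuracy(index_test: list, reference: list) -> float:
    """Concordance of the tested method with a consensus or gold-standard
    diagnosis -- what 'accuracy' means in the paragraph above."""
    return sum(t == r for t, r in zip(index_test, reference)) / len(reference)

def intraobserver_agreement_pct(reading_clm: list, reading_wsi: list) -> float:
    """Concordance of the SAME observer's diagnoses across the two
    modalities (CLM, then WSI after washout); no reference standard."""
    return sum(a == b for a, b in zip(reading_clm, reading_wsi)) / len(reading_clm)

# Hypothetical labels for illustration only
consensus = ["malignant", "benign", "benign", "malignant"]
wsi_reads = ["malignant", "benign", "malignant", "malignant"]
clm_reads = ["malignant", "benign", "benign", "malignant"]
print(accuracy(wsi_reads, consensus))                     # 0.75
print(intraobserver_agreement_pct(clm_reads, wsi_reads))  # 0.75
```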

The validated pathology areas included dermatopathology, neuropathology, gastrointestinal, genitourinary, breast, liver, and pediatric pathology. Surgical pathology specimens in pediatric pathology included gastrointestinal, heart, liver, lung, neuropathology, placenta, rectal suction, skin, and tonsil specimens. Subsets also included endocrine, head and neck, hematopoietic organ, hepatobiliary-pancreatic organ, soft tissue, bone, hematopathology, medical kidney, and transplant biopsies. However, Saco et al. considered, in 2016, that the areas of hematopathology, endocrine pathology, soft tissue, and bone had not been fully studied [61]. Tabata et al., in 2017 [23], included soft tissue and bone pathology specimens in their sample, but it is not possible to know how representative these specimens were, and a more targeted and specific validation is recommended. Saco et al. had also pointed out the need for validations in the head and neck area, because there was only one study in this subspecialty. Fortunately, our research group recently published a validation in oral pathology [62], adding original evidence of the high performance of WSI in this unexplored area. This study was not included in the present systematic review because it was published after the search.

The washout time is highly variable in the literature, and there is no consensus on what period is most appropriate to avoid recall bias; either too short or too long a washout may introduce bias through the sample flow. A short washout period may cause memorization bias in the test, and a long one may allow diagnostic criteria to change over time [12]. Surprisingly, this systematic review found that the study with the lowest intraobserver agreement was conducted with one of the shortest washout periods: 3 weeks [24]. This study also stated that the intraobserver variations did not derive from the technical limitations of WSI.

The inclusion of trained pathologists is one of the recommendations of the CAP-PLQC guideline and appears to yield better concordance rates and shorter diagnosis times [1]. One included study reported the pathologists' lack of experience as a reason for disagreement [12], even though its methodology reported the inclusion of four trained pathologists. In addition to increasing intraobserver disagreement, an inexperienced pathologist also increases the time needed for diagnosis. Most pathologists are convinced that the digital method is more time-consuming. However, the learning curve [39, 43] and the use of suboptimal navigation tools [16, 17] are likely explanations for this increased time and may also be related to the pathologist's lack of confidence in manipulating WSI [63].

Although no formal assessment of timing was conducted, two included studies [16, 17] stated that suboptimal navigation tools, such as a computer mouse, are not adequate for exploring the digital slide and may increase digital diagnosis time.

Scan time may also be influenced by file size, which depends on the scanning magnification [64], and scanning represents an extra step in the diagnostic process. This is one of the chief barriers to the acceptance of digital pathology, even greater than the time required to render a diagnosis [8]. However, scan time is also highly variable and depends on the type of scanner used and its throughput capacity. It is therefore very difficult to include scan time as part of validation studies, since it does not provide a reproducible parameter, which may explain the absence of timing in most validation studies. These issues should be considered an integral part of the digital methodology rather than a disadvantage.
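As a rough illustration of why file size grows with scanning magnification, the sketch below estimates image size from tissue area and pixel size; the pixel-size values (about 0.5 µm/pixel at 20× and 0.25 µm/pixel at 40×) and the compression ratio are typical assumptions, not data from the included studies.

```python
# Rough, illustrative estimate of WSI file size vs. scanning magnification.
# Pixel sizes and the 10:1 compression ratio are assumptions, not study data.
def estimated_size_gb(tissue_mm2: float, um_per_pixel: float,
                      bytes_per_pixel: int = 3, compression: float = 10.0) -> float:
    pixels = (tissue_mm2 * 1e6) / (um_per_pixel ** 2)  # mm^2 -> um^2 -> pixel count
    return pixels * bytes_per_pixel / compression / 1e9

print(round(estimated_size_gb(225, um_per_pixel=0.50), 2))  # ~0.27 GB at ~20x
print(round(estimated_size_gb(225, um_per_pixel=0.25), 2))  # ~1.08 GB at ~40x (about 4x larger)
```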

A higher intraobserver agreement is related to the high quality of digital slides and the better workflow provided by WSI systems [65], which appear to be easier to navigate than glass slides [66]. Some studies stated that digital microscopy provides the best definition of histologic images and is the best method for identifying microscopic structures [67]. The intraobserver agreement values of the included studies support the high performance of the digital method, and even the study with the lowest intraobserver agreement [24] dismissed the technical limitations of WSI as a reason for the occurrence of discordances. However, it is important to recognize when the test's performance is being overestimated. Validation studies carry incorporation bias, since the index test and the reference standard are not independent. In addition, intraobserver variability increases when the same glass slide is compared over time, and interobserver variability can increase in difficult cases. These facts support the cross-analysis of intraobserver and interobserver variability [24]. However, the CAP-PLQC advocated that, for validation purposes, it is important to have one pathologist reproducing the same diagnosis with both modalities—i.e., intraobserver agreement—with the main objective of achieving a high concordance rate [1]. Interobserver agreement should not be used to evaluate the performance of the test because it introduces bias arising from each pathologist's individual diagnostic interpretation [68].

The secondary objective of this review was to identify the reported reasons for disagreements and to determine the cause of these occurrences, which is also stated by the CAP-PLQC as an important outcome [1]. In this systematic review, the most commonly reported reason for diagnostic discordance was borderline cases. The difficulty posed by borderline cases is inherent in the diagnostic process and can occur with CLM as well [20]. Sometimes higher magnifications are needed to visualize the subtle details that may be present in difficult cases [64]. The subjectivity of some diagnoses, such as the interpretation of dysplasia [19], often indicates greater complexity and also correlates directly with the individual interpretation and experience of the pathologist. This systematic review identified a higher frequency of borderline and challenging cases in dermatopathology validation studies.

Seven authors reported limitations of the equipment and/or limited image resolution as pitfalls. Among these, one study [18] indicated pitfalls in the identification of eosinophilic granular bodies, eosinophils, and nucleated red blood cells in a neuropathology specimen within a pediatric validation study, although the authors did not consider this a failure of the WSI method. Pekmezci et al. reported difficulties in identifying mitotic figures, nuclear details, and chromatin patterns in a neuropathology validation study [21]. Difficulties in identifying microorganisms were also pointed to as reasons for disagreements in three studies [14, 16, 24], although the need for higher magnifications appeared to be of little relevance to the authors of these studies [14, 16]. Thrall et al. stated that the lack of image clarity was a limitation of the technology but dismissed it as a reason for the intraobserver variances [24]. The impairment in recognizing eosinophilic granular bodies, eosinophils, mitotic figures, nuclear details and chromatin patterns, and some microorganisms—such as Candida albicans, Helicobacter pylori, and Giardia lamblia—points to a limitation of the scanners and occurs more frequently in certain subspecialties of pathology (neuropathology, gastrointestinal pathology, and pediatric pathology within a neuropathology specimen). These pitfalls highlight the need for more advanced scanners, which will certainly improve as the technology advances; therein lies the need for regulation of these devices, which should be standardized and improved.

The lack of clinical information supplied with the cases in both analyses represents an absence of reproducibility [1], increases the difficulty of the diagnostic process, and may lead to wrong diagnoses. Four included studies [12, 17, 19, 21] did not provide clinical data for the analyses. Nielsen reported that this absence could make it more difficult to render the diagnoses, which may add an element of error [12], while Al-Janabi et al. indicated that the provision of clinical data may decrease these errors [17]. According to Kent et al., the lack of clinical information led to disagreement in a sample of inflammatory lesions; however, these authors also reported a high level of concordance for inflammatory lesions with a mixed infiltrate containing eosinophils [19]. One included study [23] did not mention whether clinical data were provided and did not correlate the absence of this information with the occurrence of discordant diagnoses. Fortunately, the majority of validation studies recognized the need to correlate histopathological and clinical information to provide a correct diagnosis, whether by glass or digital slides.

The selection and inclusion of cases should, ideally, be consecutive or random. However, this selection strategy may not provide a sample with the most relevant diagnoses and a broad range of tissue sources. Stratified uniform sampling is more appropriate for selecting the cases; it yields a smaller estimation error and allows measurements and estimates to be made within cases grouped into strata [69]. Unfortunately, none of the included studies followed this methodology, and two studies included in this systematic review [13, 14] did not clarify how the samples were retrieved. Inappropriate exclusion of cases may result in overly optimistic estimates of diagnostic accuracy [4]. One included study reported exclusions [21], but these were acceptable and coherent with the purpose of the study. Pre-specification of the test threshold is important so that there is no bias in interpreting the results, which could otherwise lead to an over-optimistic estimate of test performance [70]. Two included studies [21, 24] did not pre-specify the threshold, although one [24] mentioned deliberately setting a low threshold to maximize the identification of discordances.
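A minimal sketch of the stratified uniform sampling recommended above, drawing an equal number of cases from each subspecialty stratum, is shown below; the strata, case identifiers, and counts are illustrative only.

```python
import random

def stratified_uniform_sample(cases_by_stratum: dict, per_stratum: int,
                              seed: int = 42) -> list:
    """Draw the same number of cases from every stratum (e.g., every
    pathology subspecialty) instead of a purely consecutive/random pull."""
    rng = random.Random(seed)
    sample = []
    for stratum, cases in cases_by_stratum.items():
        sample.extend(rng.sample(cases, min(per_stratum, len(cases))))
    return sample

# Illustrative strata; case identifiers are hypothetical
archive = {
    "dermatopathology": [f"derm-{i}" for i in range(200)],
    "gastrointestinal": [f"gi-{i}" for i in range(150)],
    "neuropathology":   [f"neuro-{i}" for i in range(80)],
}
validation_set = stratified_uniform_sample(archive, per_stratum=20)
print(len(validation_set))  # 60 cases, meeting the CAP-PLQC minimum
```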

In general, this systematic review showed a high concordance between diagnoses achieved with WSI and CLM. The included studies were optimally designed to validate WSI for general clinical use, providing evidence with confidence and—most importantly—making it possible to confirm that this technology can be used to render primary diagnoses in several subspecialties of human pathology. The reported difficulties related to specific findings in certain areas of pathology reinforce the need for validation studies in areas not yet fully studied, such as hematopathology, endocrine pathology, and bone and soft-tissue pathology.