Introduction

Artificial intelligence (AI)—the use of computer algorithms to perform tasks or solve problems traditionally associated with the human capacity for flexible thinking and adaptivity—has undergone explosive progress in the last decade thanks to improvements in computing power and access to ever-increasing repositories of data [1,2,3]. AI now plays a role in several commonplace technologies, including voice recognition and targeted advertising, and is poised to soon become involved in others, such as autonomous vehicles [1].

Given the portion of the gross domestic product that healthcare represents in developed countries, and given the enormous quantities of data generated by both clinical medicine and biomedical research, it is unsurprising that healthcare has been a hotbed for recent AI research. Indeed, in radiology, arguably the medical field where AI has made the most progress to date, 90% of practicing radiologists surveyed in 2019 anticipated that AI would be incorporated into their future practice.

Simultaneously, as the volume of medical data available grows at an exponential pace, investigators and clinicians face the “data rich, information poor” problem, wherein the quantity—and complexity—of the data generated exceed our ability to make use of them; other commentators have noted that merely possessing increasing amounts of data is no guarantee that more knowledge, or better patient care, will follow [3]. While still nascent, AI holds great promise for medicine broadly, and for the practice of hematology specifically. This review provides a description of relevant AI concepts for those unfamiliar with the field, a survey of applications where AI shows utility in hematology, and a discussion of future challenges pertaining to AI’s integration into clinical practice.

Machine Learning: Basic Concepts

The vast majority of healthcare-oriented AI falls under the subheading of machine learning (ML) (Fig. 1). Broadly, ML refers to any algorithm that, rather than being programmed directly to complete some objective, is instead designed to take source data and find the most effective way to use those data to complete that objective [4]. Among ML tasks, most applications are considered “supervised learning,” meaning that a specific outcome of interest (called a “label”) is known and the algorithm in question is designed to find the best way to predict that outcome [5]. Supervised tasks typically take the form of either classification problems (predicting a category) or regression problems (predicting a continuous value). “Unsupervised” applications, where a prespecified outcome is not known, also exist and are used to identify previously unknown structure within data, for example by clustering different samples according to similarities in their genomic data [4, 5].

Fig. 1

Overview of machine learning implementation. Source data of varying types are collected for a given dataset, then fed into a machine learning model (decision tree, upper part of the diagram) or a deep neural network (lower part), which is trained to make predictions for two types of problems: classification (predicting a class or one of multiple classes) and regression (predicting a continuous number). For example, a model can be fed images of dogs and cats and trained on a deep neural network to predict each class.
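To make the supervised/unsupervised distinction concrete, the following minimal sketch uses scikit-learn with synthetic data; the dataset and variable names are illustrative placeholders rather than anything drawn from the studies cited here:

```python
# Supervised vs. unsupervised learning on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: features X come with a known outcome ("label") y to predict.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised: no label is given; the algorithm searches for structure,
# e.g., grouping samples with similar profiles into clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])
```

Classification and regression differ only in the type of label predicted: a categorical class in the former, a continuous number in the latter.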

Myriad algorithms exist under the umbrella of machine learning. Of these, neural network methods are some of the most commonly used in recent years. As suggested by their name, neural networks draw inspiration from the structure of the nervous system; the basic units of communication in a neural network, called nodes or neurons, are arrayed in sequential layers, with connections of varying strength between each layer [1, 6]. The initial layer of a neural network receives input data such as an array of pixels, a series of words, or an array of categorical data. These data are relayed through intermediate, “hidden” layers, which can create representations of relationships or higher-level concepts within data, for example progressing from a set of individual genes to gene networks, or from an array of pixels to a collection of objects like faces and bodies [1, 6]. Hidden layers ultimately feed into an output layer, which can perform standard classification/regression tasks or create more sophisticated outputs, such as novel images, sound, or text. “Deep neural networks” or “deep learning” refers to any neural network with more than one hidden layer; most neural networks employed in healthcare are deep networks [2, 6].
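As a concrete illustration of this layered structure, the following is a minimal sketch of a deep neural network (i.e., more than one hidden layer) in PyTorch; the layer sizes and the two-class output are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# Input layer -> two hidden layers -> output layer (two classes).
model = nn.Sequential(
    nn.Linear(100, 64),  # input features feed the first hidden layer
    nn.ReLU(),
    nn.Linear(64, 32),   # a second hidden layer makes the network "deep"
    nn.ReLU(),
    nn.Linear(32, 2),    # output layer, e.g., two diagnostic classes
)

x = torch.randn(8, 100)   # a batch of 8 inputs, 100 features each
logits = model(x)         # data relayed through every layer in sequence
print(logits.shape)       # torch.Size([8, 2])
```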

Neural networks can be built in a variety of configurations that allow them to excel at various functions. Convolutional neural networks (CNNs) are highly successful models designed primarily for handling visual data. CNNs work by passing small filters (typically three to seven pixels on a side) over an image to detect simple elements such as edges, curves, or changes in color; the output of these filters is passed through subsequent layers of the network, which in turn encode increasingly high-level features such as shapes, textures, and eventually entire objects [1, 7].
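The following minimal PyTorch sketch shows this structure: small (here 3 × 3 pixel) filters in the first layer respond to low-level elements, and a stacked second layer combines their outputs into higher-level features; the channel counts and image size are arbitrary:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # edges, curves, color changes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations: shapes, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
)

image = torch.randn(1, 3, 64, 64)   # one 64x64 RGB image
features = cnn(image)
print(features.shape)               # torch.Size([1, 32, 16, 16])
```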

CNNs’ architecture gives them remarkable flexibility, and since their inception, they have significantly outperformed previous approaches [8]. CNNs carry the additional benefit of requiring less domain-specific knowledge to construct than other methods, and because their initial layers encode basic shapes rather than specific objects, highly accurate existing networks can be repurposed for new tasks in a process called “transfer learning.” This ability is crucial when, as in many healthcare settings, datasets number in the hundreds of samples rather than the millions used to develop the original network [6]. Some CNN-based models have even achieved human-level accuracy in image processing tasks as varied as screening for diabetic retinopathy, classifying histopathologic data, and interpreting screening mammograms, highlighting both the effectiveness of CNNs and their broad applicability [9,10,11,12].
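A minimal transfer-learning sketch, assuming the torchvision ≥ 0.13 API: a network pretrained on millions of natural images is adapted to a small two-class task by freezing its early layers and retraining only the final one. The two-class head is a placeholder for whatever labels a given dataset provides:

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pretrained on ImageNet (millions of natural images).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the early layers, which already encode generic edges/shapes/textures.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for the new task, e.g., a
# medical dataset with only a few hundred labeled images.
model.fc = nn.Linear(model.fc.in_features, 2)
```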

Image Analysis

Pathology

As with other medical applications, AI-based image processing in hematology has leapt forward with the advent of CNNs. CNN-based models can now discriminate between different leukocytes on peripheral smears with area under the receiver operating characteristic curve (AUROC) scores exceeding 95%, demonstrating promising proof of principle for greater automation in routine pathology practice [13,14,15]. Similar work is under way in the automated interpretation of bone marrow specimens [16]. CNNs have also demonstrated utility in describing qualitative and quantitative differences within individual cell lineages, such as erythrocyte morphology and textural changes in sickle cell disease [17]. These successes extend to the differential diagnosis of disease, where models have demonstrated the ability to diagnose acute myeloid leukemia (AML), differentiate between causes of bone marrow failure, and, in resource-constrained settings, serve as a point-of-care screening tool for lymphoma [14, 18, 19].
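For readers less familiar with the metric, AUROC summarizes how well a model’s scores rank positive cases above negative ones (1.0 is a perfect ranking; 0.5 is chance). A minimal sketch of its computation, using made-up labels and scores rather than data from the cited studies:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])               # e.g., is the cell a blast?
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.7])  # model probabilities
print(roc_auc_score(y_true, y_score))               # 1.0: perfect ranking here
```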

The above developments suggest several practical uses for AI. Pathology results often play a major role in informing the treatment of malignancy, but significant interobserver variation may exist, a problem that can be especially pronounced for uncommon diseases and one that carries significant clinical ramifications [20,21,22]. In such settings, AI models could provide a consistent reference standard, either supporting diagnoses or prompting review by another observer. For uncommon conditions or in resource-limited environments, AI systems could serve as a means of effectively triaging patients for referral to specialist care.

Radiology

Radiology has enjoyed similar benefits from advancements in computer vision. AI models have been developed for the detection of marrow involvement and bony lesions in patients based on PET or CT data [23, 24]. Similar methods have been used to define regions of marrow involvement in recurrent AML [25]. CNN-based methods have also been employed in image segmentation, a process that involves delineating the boundaries of different structures in an image and is laborious for human operators to perform [26]. Beyond diagnosis and anatomic measurement, AI can extract additional information from radiographic data, such as more effective risk stratification of patients with Hodgkin lymphoma, which could in turn affect treatment decisions [27, 28]. As with pathology, AI has several potential roles to play in radiology, including use as a screening tool, decision aid, or prognostic model.

Laboratory and EHR Data

Beyond image processing, several other data sources, either individually or in concert with one another, provide valuable substrate for AI models and can augment clinical care in the settings of diagnosis, prognosis, and response prediction.

Diagnosis

AI has been applied along several avenues to improve the reliability, convenience, or efficiency of diagnosis. CNN-based approaches have been demonstrated to effectively diagnose multiple myeloma based exclusively on mass spectrometry data from peripheral blood [29]. For difficult-to-differentiate conditions, such as the various causes of bone marrow failure, personalized models have demonstrated high diagnostic ability by integrating patient demographics, laboratory data, and basic genetic information [30]. Similar approaches have also been employed for the differential diagnosis of peripheral leukemia versus lymphoma [31]. As with image analysis, these advances open the door to AI’s use as a means to reduce the resources required for diagnostic studies and to standardize their interpretation.
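As an illustration of how heterogeneous tabular inputs can feed a single model, the following sketch trains a random forest on synthetic demographic, laboratory, and binary genetic features; all values and the outcome label are fabricated placeholders, not a reconstruction of the cited models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.integers(18, 90, n),     # age in years (demographics)
    rng.normal(9.0, 2.0, n),     # hemoglobin, g/dL (laboratory)
    rng.integers(0, 2, n),       # mutation present? 0/1 (basic genetics)
])
y = rng.integers(0, 2, n)        # synthetic diagnosis label

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict_proba(X[:3]))   # per-patient class probabilities
```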

Prognosis and Risk Stratification

Prognosis is a notoriously difficult task, and even within widely used clinical prognostication tools, significant variability exists within risk strata [32]. AI, which is well equipped to handle nonlinear, complex data, has the potential to provide more refined, personalized prognoses. Such approaches have been used in benign hematology to refine risk scores for central catheter thrombosis, identifying low-risk individuals with a 95% negative predictive value [33]. For patients receiving hematopoietic stem cell transplants, AI has been used to stratify individuals at low versus high risk for acute graft-versus-host disease, with implications for decision-making about immunosuppression [34]. Similar efforts have been undertaken in autologous transplant for multiple myeloma [35]. In malignant hematology, AI has been used to improve upfront risk stratification in AML and myelodysplastic syndrome (MDS) [36, 37]. In the post-treatment setting, where the presence of minimal residual disease (MRD) is an adverse prognostic factor, AI has demonstrated human-level performance at MRD detection via flow cytometry and mass cytometry, which could streamline and standardize the processing of such data [38,39,40].
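Negative predictive value, the statistic cited for the catheter-thrombosis model above, is simply the fraction of predicted-negative (low-risk) patients who truly lack the outcome. A minimal sketch with synthetic labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])   # 1 = thrombosis occurred
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0])   # 1 = model called high-risk
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("NPV:", tn / (tn + fn))   # fraction of low-risk calls that were correct
```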

Beyond providing valuable information to patients, advances in prognostic ability may better inform clinicians’ treatment decisions by more accurately assessing risk within heterogeneous risk strata. Sasaki et al. (2019) describe a decision tree–based approach to chronic myeloid leukemia (CML) treatment and demonstrate, in retrospective data, that ML-informed treatment was associated with longer survival compared with usual care [41]. While prospective validation is needed, the ability to apply risk stratification to treatment planning is appealing, and in CML, it could provide guidance for clinicians as they seek to balance drug tolerability and efficacy.
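For readers unfamiliar with decision trees, the sketch below fits a shallow tree to synthetic features and prints its human-readable rules; the feature names and outcome are hypothetical, not those used by Sasaki et al.:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                          # e.g., age, lab value, risk score
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # synthetic treatment outcome

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "lab", "risk"]))  # if/then rules
```

A key appeal of tree-based models in the clinic is exactly this property: the fitted model reads as a sequence of explicit thresholds rather than an opaque score.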

Genomics and Response Prediction

In hematologic malignancy, patients’ disease biology is, with a few notable exceptions, driven by heterogeneous and complex genetic factors that are difficult to elucidate. The advent of accurate, relatively inexpensive next-generation sequencing (NGS) technologies makes it increasingly feasible to enumerate the genetic alterations present in an individual patient’s malignancy. In lymphoma, NGS in tandem with AI has been used, with or without other data such as in vitro drug sensitivities, to predict response to chemotherapy [42, 43]. In MDS, NGS data have been used to predict response to lenalidomide or hypomethylating agents using a recommender system inspired by targeted advertising algorithms [44, 45]. Similar approaches have been employed in the treatment of AML, where the generation of the BeatAML data repository has additionally demonstrated the utility of creating large, well-curated datasets available to the research community at large [46].
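In the recommender-system spirit described above (predicting unobserved patient-drug responses from patterns across other patients), the following sketch factorizes a synthetic patients-by-drugs response matrix into a low-rank approximation; real systems additionally handle missing entries, which this toy version ignores:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(20, 5)).astype(float)   # 20 patients x 5 drugs (synthetic)

# Low-rank factorization via truncated SVD: approximate R with k latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(R_hat[0])   # reconstructed response scores for patient 0 across all drugs
```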

As seen in larger pan-cancer cohorts, adequately sized datasets also make it possible to apply ML in pursuit of basic science insights; recent work by Rheinbay et al. demonstrates the use of unsupervised methods to augment the discovery of novel non-coding driver mutations in a pan-cancer cohort [47]. Developing both the techniques and the data collections needed to elucidate disease biology should, in turn, inform novel, biologically sound therapies better suited to patients’ unique disease biology.

Novel Therapeutics and Trial Design

Devising, refining, and testing new therapies is an expensive, time-consuming task with a low success rate. AI-based strategies may aid in developing more rational and efficient pipelines for drug development. This includes models designed to integrate in vitro sensitivity data from drug screens with genomic information about the cell lines used, in order to more accurately predict response to new agents and to investigate alternative uses of existing compounds [42]. From the perspective of medicinal chemistry, neural networks have demonstrated the ability to closely approximate the performance of more computationally expensive techniques for modeling protein-drug interactions, which may allow researchers to take on otherwise prohibitively demanding modeling tasks [48].

Beyond basic and translational science, personalized care informed by AI has the potential to affect how clinical trials are conducted. Low response rates necessitate larger, more expensive trials, and strategies that effectively select the patients most likely to benefit from an intervention can lower costs and increase the likelihood of finding a use for new therapies [49, 50]. Conversely, methods that effectively identify patients unlikely to benefit from a treatment spare them futile treatment and needless toxicities, and open up other avenues of treatment. In the setting of malignancy, this could mean that patients unlikely to benefit from the current standard of care could receive investigational agents without first being subjected to therapies with little chance of helping them.

Future Challenges

AI has several hurdles to clear before realizing its potential in medicine. It is likely, however, to increase in prominence and, ultimately, to foster changes in clinical practice. Considering this, an understanding of AI’s basics will need to become part of physicians’ statistical literacy, and physicians should be aware of some of the challenges that accompany AI’s emergence. These include logistical questions related to implementation, the adequacy of the data used to develop AI models, and the adoption of clinically meaningful standards for AI development. While there has been some speculation that AI will displace human pathologists or radiologists, such views are most common among those less familiar with these fields or with AI itself; domain experts instead envision complementary roles for AI and physicians [51,52,53].

Implementation

Bringing AI into the clinic represents a distinct challenge from the initial development of AI systems. Given their need for access to large amounts of patient data, AI systems will likely have to be integrated into the electronic health record (EHR), posing potential challenges regarding security and data ownership. The prospect of an EHR-based system also raises questions about its effect on clinicians. The EHR is frequently cited as a contributing factor in physician burnout, and potential unintended consequences, such as a perceived loss of autonomy or an increased documentation burden, need to be considered.

Practically speaking, AI also needs to be engineered in a manner that makes its use convenient in the setting of a busy practice. Systems that require extra work to manually enter patient data or access models are less likely to see uptake regardless of how well they perform. In cases such as pathology where physical specimens are used, the logistics of specimen preparation and analysis bear consideration. For implementation-focused research, e.g., recently published work in AI-augmented microscopy, the outcomes of interest include not only the discriminative ability of the system but also its ability to perform in real time and without substantially affecting existing workflows [54] (Table 1). Such factors directly influence a technology’s ability to be incorporated into the practice of medicine, and as such should be considered endpoints in similar work.

Table 1 Representative AI publications in hematology. AA, aplastic anemia; MDS, myelodysplastic syndrome; CNN, convolutional neural network; sens., sensitivity; spec., specificity; AML, acute myeloid leukemia; AUROC, area under the receiver operating characteristic curve; SUV, standardized uptake value; MRD, minimal residual disease; SVM, support vector machine; NGS, next-generation sequencing

Appropriate Training Data

As with any other statistical model, AI models’ performance hinges on the data used to develop them. Watson for Oncology (WFO), an IBM initiative designed to process, integrate, and implement large volumes of both patient data and medical literature to help oncologists choose appropriate options for patients, is an illustrative case. Enthusiasm for WFO has waned substantially, in large part because WFO’s treatment recommendations often varied from those of clinicians, sometimes to an alarming degree [55, 56]. Studies of WFO’s recommendations found such discrepancies more often in intermediate-stage malignancy or in patients unfit for standard therapies [57, 58]. WFO’s example highlights both the inherent challenge of delivering truly personalized care and the significance of adequate source data. As with other clinical research, source data need to be considered when applying AI models in new settings, whether that is in a new health system or in a population different from the one used to develop the model.

The data used to generate models can also precipitate ethical challenges. Language models derived from repositories of free text from the internet display a clear tendency to associate male individuals more closely with leadership roles and professional accomplishment than their female counterparts [59, 60]. Similarly, programs designed to predict crime or approve loans have been observed to exhibit bias against racial minorities [61, 62]. These cases highlight the downside of AI’s ability to learn from large datasets; if the data themselves reflect problematic outcomes or practices, then models run the risk of internalizing them.

Making AI Clinically Meaningful

Algorithms can only be as clinically meaningful as the outcomes they are designed to predict. This requires, as with other clinical research, the use of appropriately patient-oriented endpoints. On a more technical level, it also requires appropriate metrics for evaluating model performance. For instance, accuracy is of dubious value as a metric in medical settings because it assigns equal value to true positive and true negative results; in healthcare, where the outcome of interest is often present in only a minority of the population, accuracy will overstate a model’s utility because of a high proportion of true negative results, even when the model performs poorly at identifying the outcome of interest. To best align algorithms’ performance with patients’ interests, metrics should be selected and reported with an eye to the outcomes that clinicians and patients care about (e.g., explicitly describing the burden of false positive and false negative results alongside model accuracy).
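A minimal sketch of the problem with synthetic data: at 5% prevalence, a model that never detects the disease still reports 95% accuracy while catching no cases at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 5 + [0] * 95)   # outcome present in 5% of patients
y_pred = np.zeros(100, dtype=int)       # model always predicts "no disease"

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks excellent
print(recall_score(y_true, y_pred))     # 0.0 -- identifies zero true cases
```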

Beyond reliable performance, AI models’ decisions need to be interpretable. Knowing not only what an algorithm predicts, but why it does so, is critical when faced with clinical ambiguity, when discussing decision-making with patients, or when human and machine predictions disagree. One way to accomplish this is through separate algorithms designed to identify how individual variables contribute to a model’s output, an approach that has been developed for ML more broadly but has also been studied specifically in the context of healthcare [63, 64]. Such approaches have been used in hematologic malignancy to highlight the most salient features of individual patients’ diseases, something that can prove useful both in explaining diagnoses and in making predictions about responsiveness to treatment [30, 37, 65].
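One generic example of such an approach, sketched here with scikit-learn’s permutation importance on synthetic data (dedicated attribution methods such as SHAP pursue the same goal with different machinery): each feature is shuffled in turn, and the resulting drop in performance estimates that variable’s contribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)   # by construction, only feature 0 matters

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 dominates the importances
```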

In settings outside of hematology, there has also been success in designing neural networks whose architecture contains intrinsic mechanisms for explainability, such as highlighting regions of interest on pathology slides or identifying the phrases in a patient’s medical record that most strongly suggest a particular diagnosis or prognosis [66]. Such measures lend accountability and credibility to predictions, and facilitate AI models’ integration into the larger picture of clinical deliberation rather than serving as stand-alone decision points.

Conclusion

Hematology as a field stands to benefit significantly from contemporary AI, both across a spectrum of data types and across the spectrum of patient care, including diagnosis, prognosis, and more effective management of hematologic disorders. As AI continues to gain a foothold in the management of pathologic, radiologic, genomic, and EHR data, attention must be paid to its effective implementation in clinical practice and to developing AI systems with an eye to how they will ultimately affect patient care.