Ultrasound (US) is one of the core diagnostic imaging modalities and is routinely used as the first-line medical imaging test for evaluation of internal body structures, including solid organ parenchyma, blood vessels, the musculoskeletal system, and the fetus. US has become a ubiquitous diagnostic imaging tool owing to several major advantages over other medical imaging methods such as computed tomography (CT) and magnetic resonance imaging (MRI). These key advantages include real-time imaging, the absence of ionizing radiation, and better cost effectiveness than CT and MRI in many situations. In addition, US is portable, requires no shielding, and runs on conventional electrical power, making it well suited to point-of-care applications, especially in under-resourced settings. As the field progresses, US, especially when combined with other technologies, has the potential to serve as an in-home biosensor, providing ambulatory, long-duration, and non-intrusive monitoring with real-time biofeedback.

US also presents unique challenges, including operator dependence, noise, artifacts, limited field of view, difficulty imaging structures behind bone and air, and variability across different manufacturers’ US systems. Dependence on operator skill is particularly limiting. Many healthcare providers who are not imaging specialists do not use US at the point of care because they lack the skills to acquire and interpret images. For those who do, high inter- and intra-operator variability remains a significant challenge in clinical decision making. Because of this inter-operator variability, US-derived tumor measurements are not accepted in most cancer drug trials, and US is therefore generally not used clinically for serial oncologic imaging. Automated US image analysis promises to play a crucial role in addressing some of these challenges.

Recent surveys of ML for medical imaging, such as [1,2,3,4], focus primarily on CT, MRI, and microscopy. In this review, we focus on the use of machine learning (ML) in US. The objective of this paper is to review how recent advances in ML have accelerated the adoption of US image analysis by modeling complex, multidimensional data relationships to answer questions of diagnosis and disease severity classification. We have two goals: (1) to highlight contributions that utilize ML advances to solve current challenges in medical US, and (2) to discuss future opportunities in which ML techniques can further improve clinical workflow and US-based disease diagnosis and characterization. Our survey is not exhaustive; we mainly focus on work within the past 5 years, during which ML, particularly deep learning (DL), has started to have a major impact. We also emphasize solutions at the system level, an important consideration given the unique characteristics of the US image generation workflow. Figure 1 shows that US image processing involves more than a classification step alone; it additionally includes preprocessing and various types of analyses, depending on the intended application.

Fig. 1 Overview of ultrasound processing system workflow

This article is divided into four sections: (i) an overview of basic principles of US, (ii) an overview of ML, (iii) ML for US, and (iv) summary and outlook.

Overview of US imaging

US imaging

Medical US images are formed by using an US probe to transmit mechanical wave pulses into tissue. Echoes are generated at boundaries where tissues differ in acoustic impedance. These echoes are recorded and displayed as an anatomic image, which may contain characteristic artifacts including signal dropout, attenuation, speckle, and shadows. Image quality depends strongly on multiple factors, including the force exerted on the US transducer and the transducer’s location and orientation.

Using various signal-to-image reconstruction approaches, several different types of images can be formed using US equipment. The most well-known and routinely used clinically is a B-mode image, which displays the acoustic impedance of a two-dimensional cross section of tissue. Other types of US imaging display blood flow (Doppler imaging and contrast-enhanced US), motion of tissue over time (M-mode), the anatomy of a three-dimensional region (3D US), and tissue stiffness (elastography).

US elastography

US elastography is a relatively new imaging technique of which there are two main types in current clinical use: (1) strain elastography, in which image data are compared before and after application of an external compression force to detect tissue deformation, and (2) shear wave elastography (SWE), which uses acoustic energy to displace tissue, generating shear waves that propagate laterally through the tissue. These shear waves can be tracked to compute shear wave velocity, which is algebraically related to tissue stiffness expressed as the Young’s modulus.
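Under the assumptions commonly made in SWE (locally homogeneous, isotropic, incompressible, linearly elastic tissue), this relationship is E = 3ρc_s², where E is the Young’s modulus, ρ is the tissue density (approximately 1000 kg/m³ for soft tissue), and c_s is the measured shear wave speed.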

Tissue stiffness is a useful biomarker for pathologic processes, including fibrosis and inflammation, leading to several additional clinical applications for medical US. A diagnostic imaging gap recently addressed by US elastography is the evaluation of chronic liver disease [5,6,7,8]. US elastography liver stiffness measurements have been shown to be a promising liver fibrosis staging biomarker, and as a result highly relevant to chronic liver disease risk stratification. These technologies have the potential to replace liver biopsy as the diagnostic standard of care for key biologic variables in chronic liver disease. SWE has also been used to assess breast lesions [9,10,11], thyroid nodules [12,13,14,15], musculoskeletal conditions [16,17,18,19,20], and prostate cancer [21,22,23].

Figure 2 is an example of a SWE image (the colored pixels) overlaid on top of a B-mode US image, acquired for liver fibrosis staging. Tissue stiffness measurements are obtained by placing a region of interest (ROI) inside the SWE image box. Similar to B-mode US, elastography also suffers from inter- and intra-observer variability [8]. This represents an area of opportunity for ML-based automated image analysis improvement. We will discuss this in detail in Sect. “Additional applications of machine learning to US.”

Fig. 2 Example SWE color map overlaid on a B-mode ultrasound image

Contrast-enhanced US (CEUS)

Contrast-enhanced US utilizes gas-filled microbubbles for dynamic evaluation of the microvasculature and macrovasculature. At present, US contrast agents are exclusively intravascular blood pool agents. Differentiation between benign and malignant focal liver lesions is an application of particular clinical interest [24,25,26]. The late phase of contrast enhancement allows real-time characterization of washout, a critical feature in differentiating benign liver lesions (e.g., hemangioma, focal nodular hyperplasia, adenoma, regenerative nodule) from malignant liver lesions (e.g., hepatocellular carcinoma, cholangiocarcinoma, metastasis). The DEGUM study, a multicenter German study that analyzed 1328 focal liver lesions, reported that CEUS distinguished benign from malignant liver lesions with 90.3% accuracy, 95.8% sensitivity, 83.1% specificity, 95.4% positive predictive value, and 95.9% negative predictive value [27]. Other areas of clinical interest include evaluation of focal renal lesions [28], thyroid nodules [29], splenic lesions [30], and prostate cancer [31]. CEUS limitations include operator dependence, motion sensitivity, and the need for a good acoustic window. Advanced US image processing offers potential opportunities to augment CEUS by mitigating these limitations.

Overview of machine learning (ML)

ML is an interdisciplinary field that aims to construct algorithms that can learn from and make predictions on data [32, 33]. It is part of the broad field of artificial intelligence and overlaps with pattern recognition. Substantial progress has been made in applying ML to natural language processing (NLP), computer vision (e.g., image and text search, face recognition), video surveillance, financial data analysis, and many other domains. Recent progress in deep learning, a form of ML, has been dramatic, resulting in significant performance advances in international competitions and wide commercial adoption. The application of ML to diverse areas of computing is gaining popularity rapidly, not only because of more powerful hardware, but also because of the increasing availability of free and open source software, which enables ML to be readily implemented. The purpose of this section is to introduce ML approaches and capabilities to US researchers and clinicians. Historical reviews of the field and its relationship with pattern recognition can be found elsewhere [34,35,36]. The following essential concepts are introduced at a level appropriate for understanding this review: supervised and unsupervised learning, learning based on handcrafted features, deep learning, testing, and performance metrics.

Supervised vs. unsupervised learning

Most ML applications for US involve supervised learning, in which a classifier is trained on a database of US images labeled with desired classification outputs. For example, a classifier could be trained to output a value of 1 for input images of malignant tumors and a value of 0 for benign tumors. Once a classifier is trained, it can be used to classify previously unseen test images.
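As a minimal illustration of this supervised setup, the following Python sketch (with hypothetical, randomly generated feature vectors standing in for real US-derived features) trains a classifier on labeled examples and scores it on held-out test images:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: each row is a feature vector derived from one US image,
# each label is 1 (malignant) or 0 (benign).
X = np.random.rand(200, 12)              # 200 images, 12 features (placeholder values)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                # supervised training on labeled examples
print(clf.score(X_test, y_test))         # accuracy on previously unseen test images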

Unsupervised learning involves finding clusters or similarities in data, with no labels provided. This can be useful for applications such as content-based retrieval, or to determine features that can distinguish different classes of data.
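A correspondingly minimal unsupervised sketch, again with hypothetical feature vectors, groups unlabeled images by similarity:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 8)               # unlabeled feature vectors from US images
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Images assigned to the same cluster are similar in feature space; no labels were used.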

A type of learning that falls between supervised and unsupervised learning is termed weakly supervised learning [37]. A significant challenge in building up large US image databases has been the time involved for expert annotation to support supervised learning. Annotation effort can be simplified by reducing the detail of information provided by the expert. For example, an image containing a tumor can be labeled as such, without having to annotate the precise location or boundaries. The ability to train a classifier with this type of less detailed information is termed weakly supervised learning. These and other types of learning, such as reinforcement learning, are described in detail elsewhere [38].

Learning based on handcrafted features

Traditionally, ML has involved computing handcrafted features that are believed to be able to distinguish between classes of data. These features are then used to train and test a classifier. For US, common types of features are morphologic (e.g., lesion area or perimeter), textural [39], based on information in the frequency domain [40], or derived from parameter fitting. Often a large number of candidate features is computed, and then a feature selection algorithm is applied to select the best features, or a dimensionality reduction algorithm [41] is applied to combine the features into a smaller composite set.

A classifier is then trained to map the features to the desired outputs. It is important to constrain the classifier so that it does not overfit the training data, because overfitting produces model errors that do not generalize beyond the training set to new data. The need to avoid overfitting is one of the main reasons feature selection or extraction algorithms are applied before training a classifier, and it is a special concern for US research, which has thus far involved relatively small databases. Over the years, many supervised classification algorithms have been developed, and many have been applied to US with handcrafted features. The most common approaches in the surveyed papers are random forests [42], support vector machines [43, 44], and multilayer feedforward networks [45,46,47], also known as artificial neural networks.
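The classical pipeline described above might look like the following sketch, in which a hypothetical matrix of handcrafted features is reduced by univariate feature selection before a support vector machine is trained:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical handcrafted features (e.g., lesion area, perimeter, texture statistics)
X = np.random.rand(150, 40)              # 150 lesions, 40 candidate features
y = np.random.randint(0, 2, size=150)    # 0 = benign, 1 = malignant

model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),   # keep the 10 most informative features
    SVC(kernel="rbf", C=1.0),                  # classifier trained on the reduced feature set
)
model.fit(X, y)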

Deep learning (DL)

The effort and domain expertise involved in handcrafting features has led researchers to seek algorithms that can learn features automatically from data. DL is a particularly powerful tool for extracting non-linear features from data. This is particularly promising in US, where predictive acoustic patterns are typically neither obvious nor easily hand-engineered. Figure 3 illustrates high-level differences between conventional ML and DL. The fast adoption of DL has been enabled by faster algorithms, more capable Graphics Processing Unit (GPU)-based computing, and large data sets.

Fig. 3 Conventional machine learning vs. deep learning

DL extends multilayer feedforward networks from the two layers of weights used in the past to many layers. Figure 4 is an example of a generic supervised DL pipeline that includes both the learning phase and the deployment phase. In the learning phase, labeled samples (e.g., labeled US thyroid nodule images) are randomly divided into training/test sets or training/validation/test sets. The training data are used to find the weights of each layer; during this process, features are discovered automatically and a model is learned. The validation set is used to select the network structure and other hyperparameters. The test data are used to estimate the performance of the learned network. When this splitting is repeated over different partitions of the data, the model estimation and selection procedure is known as cross-validation [48]. During the deployment phase, the machine applies the learned model to make a prediction on a new, unlabeled input (e.g., an unlabeled US thyroid nodule image that the machine has not seen before).

Fig. 4 Supervised learning with deep neural networks

The multiple processing layers have been demonstrated to learn features of the data with multiple levels of hierarchy and abstraction [49]. For example, in imagery of humans, a low level of abstraction is edges; higher levels are body parts. A variety of deep learning structures have been explored. Among them, convolutional neural networks (CNNs) are one of the most popular choices for classifying images, due to unprecedented classification accuracy [50, 51] in applications such as object detection [52,53,54], face detection [55,56,57], and segmentation [58, 59]. In a typical CNN, convolutional filters are applied in each CNN layer to automatically extract features from the input image at multiple scales (e.g., edges, colors, and shapes), and a pooling process (termed ‘max pooling’) is often used between CNN layers in order to progressively reduce the feature map size. The last two layers are typically fully connected layers, from which classification labels are predicted (Fig. 5).
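For illustration, a toy CNN of this kind can be sketched in PyTorch as follows; the layer sizes and the 64 x 64 single-channel input are arbitrary choices for the example, not a recommended architecture:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN for single-channel (grayscale) US image patches of size 64 x 64."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(         # two fully connected layers at the end
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )
    def forward(self, x):
        return self.classifier(self.features(x))

logits = SmallCNN()(torch.randn(4, 1, 64, 64))   # batch of 4 dummy patches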

Fig. 5 Example convolutional neural network (CNN)

Testing and performance metrics

As mentioned in Sect. “Deep learning,” classifier development and testing typically involve splitting the randomized labeled data into training/test sets or training/validation/test sets. The validation set is used to determine the best network structure and other classifier variations over several training runs, while an independent test set is held aside and used to evaluate performance only once classifier development is complete.

When a database is sufficiently large, it can be partitioned a priori into these distinct sets. For the smaller data sets commonly seen in US research, k-fold cross-validation is often used instead. The number of folds can be increased up to a maximum of N for a database of N samples; in this limiting case, termed leave-one-out testing, all samples are used for training except one (e.g., a single image), which is used for testing. Details of these and other cross-validation techniques can be found in [60, 61].
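The following sketch illustrates k-fold and leave-one-out evaluation on a small hypothetical data set using scikit-learn:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(60, 10)               # small hypothetical database (N = 60)
y = np.random.randint(0, 2, size=60)

clf = SVC()
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())   # k = N: leave-one-out testing
print(kfold_scores.mean(), loo_scores.mean())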

Classifier performance is reported in a variety of ways. The most common metric for two classes is the area under the receiver operating characteristic curve (AUROC), often simplified as “area under the curve” (AUC). The operating characteristic is formed by measuring the true positive and false positive rates as the decision threshold applied to the classifier output is varied [62]; the AUC is then computed as the area under this curve.
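A short sketch of the AUC computation, using made-up labels and classifier scores:

import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # ground-truth labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])   # classifier outputs

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sweep the decision threshold
print(auc(fpr, tpr))                                # area under the operating characteristic
print(roc_auc_score(y_true, y_score))               # equivalent one-line computation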

ML for US

Principal applications of ML to US include classification or computer-aided diagnosis, regression, and tissue segmentation. Other applications include image registration and content retrieval. Each of these applications is surveyed in the following subsections, with an aim to provide insights into progress and best approaches. In particular, advances in approaches using deep learning are highlighted, compared to approaches that use handcrafted features. Table 1 provides a summary of the applications in the papers surveyed.

Table 1 List of applications for papers surveyed

Classification

Computer-aided disease diagnosis and classification in radiology have received extensive attention and have benefited greatly from recent advances in ML. A variety of applications have been addressed in computer-aided diagnosis, primarily detection or classification of lesions, mainly in the breast and liver. Most of the recent papers surveyed follow the classic approach of computing handcrafted features, applying a feature selection algorithm, and training a classifier on the reduced feature set. This basic approach has been investigated for over 20 years, e.g., [63, 64]. Specific feature and algorithm choices for each step vary. Preprocessing commonly includes despeckling.

Features considered are primarily texture-based or morphological. The largest number of publications has been on classifying breast lesions. A review of breast image analysis [65] places US in the context of several imaging modalities. For classifying breast lesions, computerized methods have been developed to automatically extract features from the BI-RADS (Breast Imaging Reporting and Data System) lexicon, relating to shape, margin, orientation, echo pattern, and acoustic shadowing [66]. These features are standardized and readily understandable by radiologists. Typically, a large set of features is reduced in dimension either by selecting the most informative features or by linearly combining features with principal component analysis [41]. Commonly used classifiers include multilayer networks (neural networks) [67], support vector machines [43], and random forests [68], the details of which extend beyond the scope of this review.
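As a sketch of the dimensionality reduction step alone, principal component analysis can compress a hypothetical matrix of handcrafted features into a small number of linear combinations before classification:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(120, 30)              # 120 lesions, 30 BI-RADS-style handcrafted features
X_reduced = PCA(n_components=8).fit_transform(X)   # 8 linear combinations of the original features
print(X_reduced.shape)                   # (120, 8)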

Although these papers indicate the promise of US computer-aided diagnosis, the reported studies have several limitations. These classification studies typically rely on manual region-of-interest (ROI) selection of the portion of the image that includes candidate pathology; that subimage is then classified. Manual ROI selection assumes significant involvement by a radiologist in practice, or at least neglects the problem of ROI selection. The number of patients and images available for training and testing is typically small; in nearly all cases, the number of images was < 300. In addition, the US images were often collected at a single location by a single type of US device. Each paper reports results obtained on a different validation database, making results difficult or impossible to compare.

Two recent papers have compared the performance of commercial diagnosis systems with that of radiologists. In [69], the performance of a system from ClearView Diagnostics (Piscataway, New Jersey, USA) for diagnosing breast lesions was compared to that of three certified radiologists. At the time of publication, the system was being reviewed for FDA clearance. The study was co-authored by ClearView Diagnostics employees and thus was not an independent evaluation. Ground truth for 1300 images was determined based on biopsy or one-year follow-up. Both the likelihood of malignancy and a preliminary BI-RADS assessment were evaluated. The comparison focused on images alone; the reading radiologists did not have access to other information, such as patient history and previous imaging studies. Based on likelihood of malignancy, the computer system was found to have outperformed the radiologists. Fusing the radiologist and computer assessments was also found to improve sensitivity and specificity over radiologist assessments alone.

In [70], the performance of a system from Samsung (Seoul, South Korea) for assessing malignancy of thyroid nodules was compared to that of an experienced radiologist. One hundred two nodules with a definitive diagnosis, from 89 patients, were included in the study. The system’s performance was lower than the radiologist’s. The authors speculated that improved segmentation would improve performance.

The number of papers applying deep learning techniques to US disease classification has increased dramatically in the last 2–3 years [71, 72]. Until very recently, it was unclear whether CNNs trained on non-medical color images could be used as a starting point and partially retrained to classify US images, which do not resemble optical photographs. Recent work, such as [73], has shown that this method, referred to as “transfer learning,” can be effective. The technique helps avoid overfitting on the small data sets typical of US imagery. Fusing handcrafted features with those computed with deep learning has been shown to further improve performance [74]. Weakly supervised learning has also been successfully applied to US [75].
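A typical transfer learning recipe, sketched here with torchvision’s ImageNet-pretrained ResNet-18 (one possible choice; the surveyed papers use a variety of networks), freezes the pretrained feature extractor and retrains only a new output layer:

import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on natural (non-medical) color images
net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for p in net.parameters():               # freeze the pre-trained feature extractor
    p.requires_grad = False

net.fc = nn.Linear(net.fc.in_features, 2)   # new output layer: benign vs. malignant
# Only the new layer (and optionally the last few blocks) is then trained on the
# small labeled US dataset; grayscale US frames are usually replicated to 3 channels.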

Regression

Regression involves estimating continuous values rather than discrete classes. Deep learning has been applied to regression, for example in [76] to estimate muscle fiber orientation from US imagery; it was found to improve on previous approaches using handcrafted features, specifically a well-established wavelet-based method. However, another regression application illustrates how handcrafted features may still be the preferred approach. In this application, gestational age is estimated from 3D US images of the fetal brain [77]: a semi-automated approach based on deformable surfaces is used to compute standard biometric features, e.g., head circumference, as well as information on local structural changes in the brain.
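The distinction from classification can be made concrete with a small sketch in which a regressor is fit to a hypothetical continuous target such as gestational age:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(100, 15)              # hypothetical features (e.g., biometric measurements)
y = 20 + 15 * np.random.rand(100)        # hypothetical continuous target, e.g., gestational age in weeks

reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))                # continuous estimates rather than class labels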

Segmentation

Segmentation is the delineation of structural boundaries. Automated US segmentation is challenging because US data are often affected by speckle, shadowing, and missing boundaries, as well as by tradeoffs among US frequency, depth, and resolution during image acquisition.

Many US segmentation approaches have been developed, including methods based on intensity thresholding, level sets, active contours [78], and other model-based methods. These techniques are reviewed in [79, 80]. Intensity-based approaches are sensitive to noise and image quality. Active contour and level set methods require initialization, which can affect the results. Most conventional approaches are not fully automated.

Segmentation methods based on ML typically involve two steps: first, pixel-wise classification of the desired structure; second, a clean-up or smoothing step, since the raw pixel-wise result is noisy. In recent papers, several classification approaches have been investigated, involving handcrafted features [81,82,83,84,85,86,87] and various types of neural networks, including deep learning [88,89,90,91,92]. Three papers [82, 91, 92] made use of 3D US.
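A sketch of this two-step pattern, with a hypothetical per-pixel probability map and simple morphological operations standing in for the clean-up step used in the surveyed papers:

import numpy as np
from scipy import ndimage

prob_map = np.random.rand(256, 256)      # hypothetical per-pixel probabilities from a classifier
mask = prob_map > 0.5                    # step 1: pixel-wise classification

# Step 2: clean-up, since the raw pixel-wise result is noisy
mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))   # remove small speckle-like islands
mask = ndimage.binary_closing(mask, structure=np.ones((3, 3)))   # fill small holes

labels, n = ndimage.label(mask)          # optionally keep only the largest connected component
if n > 0:
    sizes = ndimage.sum(mask, labels, np.arange(1, n + 1))
    mask = labels == (np.argmax(sizes) + 1)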

Additional applications of machine learning to US

In addition to US segmentation, ML has also been applied to US registration, for example for imagery of vertebrae [93] and transrectal US [94].

One key advantage of US over some other modalities is its suitability for real-time guidance (e.g., needle guidance, intra-cardiac procedures, and robotic surgery), but this real-time potential has not been fully realized owing to limitations in US image processing, including the lack of robust content retrieval from US video clips. A number of very recent papers focus on using deep learning techniques for frame labeling or content interpretation [95,96,97,98]. One approach [99] was evaluated on a database of about 30,000 images, which is very large for US. Techniques that integrate spatiotemporal information have started to emerge, particularly for echocardiograms acquired from different views, to capture key information about cardiac motion [100,101,102]. We predict that ML will play a major role in enabling US-guided interventions in the near future.

US elastography and CEUS

Elastography, particularly SWE, is increasingly used in conjunction with B-mode US as a quantitative measurement to characterize tissue lesions [103]. Key limitations of SWE, as summarized in [13], include variability in stiffness cutoff thresholds, lack of image quality control, and variability in ROI selection. SWE measurements have been shown to depend greatly on the quality of the acquired data [104, 105]. Using liver fibrosis staging as an example, Fig. 6A illustrates the existing clinical workflow and its challenges. Consequently, the current clinical protocol requires multiple image acquisitions to mitigate measurement variability. Figure 6B presents a potential solution for improving the clinical workflow. It includes algorithms to automatically check image quality and ML methods to quantify SWE and classify disease stages. In addition, algorithms can assist with assessing additional useful biomarkers (e.g., subcutaneous fat content, steatosis, inflammation), which are currently not used because of the time-intensive manual interpretation required.

Fig. 6 A Example pipeline of using SWE for liver fibrosis staging. B Proposed semi-automated SWE acquisition and analysis workflow

Among the surveyed papers from the past 5 years, the most common ML approach is to extract statistical features from the SWE images and then apply a classifier [106,107,108,109,110].

SWE images often contain irrelevant patterns (e.g., artifacts, noise, and areas lacking SWE information), which can be difficult to handle both for handcrafted feature extraction approaches and for typical DL methods such as CNNs. Very recently, [111] reported using a two-layer DL network for automated feature extraction from SWE breast data. The work focuses on differentiating task-relevant patterns (i.e., patterns of interest) from task-irrelevant (distracting) patterns.

CEUS is a non-invasive diagnostic tool for focal liver lesion evaluation. Typically, time intensity curves (TICs) are extracted from a manually selected ROI in the CEUS data. Results are often subject to operator variability, motion sensitivity, and speckle noise. Recently, DL has been applied to CEUS to improve the classification of benign and malignant focal liver lesions from automatically extracted TICs with respiratory compensation [112]; the DL approach showed higher accuracy than conventional ML methods.
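The TIC concept can be sketched as follows, using a hypothetical CEUS clip and a fixed rectangular ROI; real pipelines add motion or respiratory compensation and more elaborate curve features:

import numpy as np

ceus_clip = np.random.rand(120, 256, 256)    # hypothetical clip: 120 frames of 256 x 256 pixels
roi_rows, roi_cols = slice(100, 140), slice(80, 120)   # fixed rectangular ROI

# Time intensity curve: mean contrast intensity inside the ROI for each frame
tic = ceus_clip[:, roi_rows, roi_cols].mean(axis=(1, 2))

peak_enhancement = tic.max()             # simple summary features often fed to a classifier
time_to_peak = int(tic.argmax())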

Discussion and outlook

While the use of medical US is becoming ubiquitous, advanced US image analysis techniques lag behind other modalities such as CT and MRI. As with CT and MRI, ML is a promising approach to improve US image analysis, disease classification, and computer-aided diagnosis.

Overall, the application of ML to US is at an early stage but is rapidly progressing, as evidenced by the large number of surveyed papers from 2016 and 2017. Most of the recent papers surveyed use databases of a few hundred images. Only a few papers use databases of at least one thousand images, which is still orders of magnitude smaller than large challenge databases of optical images. On the other hand, it is unrealistic to expect US databases to reach that size in the foreseeable future, pointing to the need for ML techniques that can train on smaller databases. In many cases, databases are generated from a single device type and a single collection site, limiting the generalizability of ML classification models derived from them. Large, publicly accessible challenge databases such as ImageNet, which have significantly advanced conventional image classification performance, are currently unavailable for US. Most of the present US ML research has concentrated on single functions within an overall system, such as segmentation or classification.

Within the past few years, deep learning approaches have been shown to significantly improve performance compared with classifiers operating on handcrafted features. Transfer learning, which involves retraining a portion of a network originally trained on other images, has been shown to be effective for the relatively small databases currently available. These results address early skepticism that transfer learning would not be useful for US because US images appear quite different from the optical color imagery on which the networks were originally trained. Deep learning approaches have also obviated the need for sophisticated preprocessing, such as despeckling. On the other hand, certain applications are based on sophisticated handcrafted features that are unlikely to be surpassed by deep learning with currently available databases. Moreover, surveyed papers that combine deep learning and handcrafted features have shown improved results over either approach alone, indicating that deep learning techniques by themselves are unlikely to achieve the full potential of ML in US.

There are several challenges in applying ML to US and other medical imaging modalities: (1) because US is often used as a first-line imaging modality, there is often a class imbalance with an excess of normal “no-disease” images, and (2) obtaining consistently annotated data is difficult because of significant inter-operator and inter-observer variability among expert US physicians. The variability that this subjectivity adds to the annotations requires a larger database so that the classifier can be trained to smooth over the variations. Transfer learning has been widely adopted to address the challenge of operating with relatively small databases. Weakly supervised learning was also successfully used in one surveyed paper; its use is likely to increase, although challenges have been found in unpublished work. In addition to these techniques, other approaches commonly used by the deep learning community to address small annotated databases are unsupervised learning, database augmentation, and active learning. Interestingly, these techniques have rarely been used in US and are likely to be promising approaches. Active learning requires an interactive annotation tool that is somewhat more complex than a static tool but, once developed, has the value of focusing the expert’s time on the images most important to annotate. Another approach to annotating images would be to apply natural language processing tools to extract annotations from patient reports. This remains an area of research with its own challenges.
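As an example of database augmentation, the following sketch applies random geometric perturbations to training images using torchvision transforms; the specific transforms and parameters are illustrative and must be chosen to respect US image formation (e.g., flips along the depth direction are usually avoided):

from torchvision import transforms

# Each training epoch then sees a slightly different version of every image,
# effectively enlarging a small annotated US database.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])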

Another algorithmic challenge is the need for the results to be interpretable by radiologists, as opposed to a “black box” result that might suffice in domains other than clinical medicine. Although interpretability is not an intrinsic characteristic of deep learning, it is an active area of research. Within the past few years, new techniques for interpreting CNNs have emerged, and other classification techniques are being developed that are intrinsically interpretable [113].

One key strength of US is its ability to produce real-time video. ML applied to echocardiography and obstetrics has increasingly exploited the advantages of spatiotemporal data to improve results. Even in the case of detecting tumors and other pathologies, video clips provide more information than a single image frame. None of the surveyed papers about classifying pathologies exploited video data. This is an aspect that will likely advance in future work.

Returning to the system view in Fig. 1, advances are needed across the workflow. ML enables part of the system solution, but not all of it. For example, a unique challenge of US is the expertise required for image acquisition, which currently contributes to variable interpretations. Operating on freehand US acquisitions is preferred. In the future, it will be important for ML systems to provide real-time feedback to the sonographer during image acquisition, and not only to interpret freehand US post hoc. In addition, manual ROI selection and caliper placement for measurements are still common and also result in significant variability. Image quality control, automatic ROI selection, and attention to computer–human integration are needed to replace these manual steps.

Based on the recent rapid progress summarized in this review, we expect ML for US to continue to advance and to be one of the most important trends in diagnostic US in the coming years. Broadly speaking, US will likely become one of many inputs to an ML-based intelligent diagnostic assistant system, in which multimodal and multiscale observations are learned over time and turned into clinically viable quantitative models (Fig. 7); the aggregated machine intelligence will have the ability to observe data, orient the end user, assess new information, and assist with decision making. Such a system has the potential to greatly improve not only the clinical workflow but also the overall outcome of care.

Fig. 7 Proposed framework for machine learning-based intelligent diagnostic assistant