Introduction

Cardiovascular disease (CVD)-related deaths in the United States of America (USA) were 17.6 million in 2016, and that is 14.5% higher than reported in 2006 [1]. The major causes of such an increase in fatalities can be attributed to an increase in tobacco use, physical inactivity, obesity, high blood pressure, diabetes, arthritis, coronary artery disease, and other disorders. This has deep economic implications on the US economy. It is estimated that the total direct and indirect costs were US $351.2B in 2014–2015, adjusting to inflation [1, 2].

The main cause of CVD is the blood vessel inflammatory disease, called atherosclerosis [3, 4]. Atherosclerosis is initiated by endothelium dysfunction [4, 5], where the thin wall of the interior surface of blood vessels gets damaged leading to the formation of more complex lesions and fatty streak within arterial walls [1, 6,7,8]. Several studies have been conducted demonstrating how atherosclerosis progresses in the carotid arteries. One study on 68 asymptomatic patients with greater than 50% stenosis showed that the wall area increased by 2.2% per year over 18 months [9]. Another study that ran over \(22\pm 15\) months, consisting of 250 patients with asymptomatic plaques having 40–99% stenosis, showed a high percentage area of lipid-like matter. This risk factor caused subsequent cerebral infarction (hazard ratio = 4.4) [10]. Such a situation generally signifies a time bomb where the artery can no longer sustain the plaque burden leading to the rupture of the fibrous cap. This leads to the intrusion of plaque into the bloodstream causing thrombosis and finally resulting in a stroke [11,12,13] (shown in Fig. 1). Atherosclerosis has been also linked to neuronal diseases such as dementia [14], leukoaraiosis [15, 16], and Alzheimer’s [17], renal diseases [18, 19], arthritis [20], and coronary artery diseases [21].

Atherosclerotic disease can be quantified by tracking the carotid intima-media thickness (cIMT) and plaque area (PA) over time, so-called atherosclerotic disease monitoring or vascular screening [22]. The cIMT is the measured distance between the lumen-intima (LI) and media-adventitia (MA) borders [23, 24], while the area enclosed between the LI and MA boundary walls is typically referred to as PA [25,26,27,28,29,30,31,32]. The role of cIMT/PA measurement has gone beyond the normal tracking of the atherosclerosis burden to computing a composite CVD risk estimate, unlike the traditional risk calculators such as Framingham [33], ASCVD [34], Reynolds risk score (RRS) [35], United Kingdom Prospective Diabetes Study (UKPDS56) [36], UKPDS60 [37], QRISK2 [38], and Joint British Societies (JBS3) [39]. None of these conventional calculators used plaque burden as a risk factor. This changed recently with the development of the AtheroEdge Composite Risk Score (AECRS 1.0) [20], an automated morphological-based CVD risk prediction tool that integrates both conventional and imaging-based phenotypes such as cIMT and PA. The image-based biomarkers are based on the scanning of carotid segments such as the common carotid artery, bulb, or internal carotid artery acquired using 2D B-mode ultrasound [5, 40, 41]. AECRS 2.0 is another class of image-based 10-year risk calculator that combines conventional risk factors, blood biomarkers, and image-based phenotypes. Because these image-based phenotypes are an integral part of the risk calculators, a better understanding is needed of how cIMT/PA can be computed automatically in an artificial intelligence (AI) framework. To that end, one must first see how cIMT/PA measurement evolved over time.

There are different schools of thought (SOTs) dealing with automated identification of the cIMT/PA region (so-called segmentation) for static (or frozen) and motion images; however, the scope of this work is limited to static scans. These SOTs range from simple image morphology-based threshold techniques to deep learning (DL)-based AI systems, developed over 50 years. Therefore, a generation-wise categorization is necessary. The SOTs can be divided into three generations based on how the region of interest (ROI) is determined [42] and then how LI/MA is searched in this ROI. The first-generation technologies were low-level segmentation techniques that used conventional image processing methods based on a primary threshold to get the edges of LI/MA and then used a caliper-based solution to measure the mean distance. The second-generation methods were contour-based and consisted of parametric curves or geometric curves. Some of these methods are semi-automated [43,44,45,46]. Further in the same class was a fusion of signal-processing-based methods (such as scale-space) and deformable models. These were categorized under the class of AtheroEdge™ models and are fully automated. Some methods were region-based, combined with deformable models under the fusion category in generation 2. These were also fully automated [47, 48]. The latest third-generation models use AI technology such as machine learning (ML) [49] and DL [50] for both ROI estimation and LI/MA interface detection. All three generations used some kind of distance measurement method such as shortest distance, centerline distance, Hausdorff distance [51], and, more often, the adapted method called “Suri’s polyline distance method” [52]. This review is focused on how ML and DL can be used for cIMT/PA measurements, which in turn requires LI/MA detection in plaque and non-plaque regions.

The layout of the paper is given as follows: the “Chronological Generations of cIMT Regional Segmentation and cIMT Measurement” section gives an overall overview of conventional and advanced techniques in joint carotid cIMT/PA estimation along with their classification. The “ML Application for cIMT and PA Measurement” section is dedicated to both ML and DL techniques applied for the segmentation of cIMT/PA regions. The “Deep Learning Application for cIMT and PA extraction” section provides different quantification techniques, and finally, the paper concludes in the “Discussion” section.

Chronological Generations of cIMT Regional Segmentation and cIMT Measurement

As already known, the carotid IMT is the surrogate biomarker for carotid/coronary artery disease [53]. Several first-generation computer vision and image-processing techniques have been developed for the low-level segmentation of the cIMT region (the region between LI and MA borders) and LI/MA interface detection, such as dynamic programming (DP) [54], Hough transform (HT) [55], Nakagami mixture modeling [56], active contour [57], edge detection [58], and gradient-based techniques [59].

These come under the class of computer-aided diagnosis [40, 60, 61]. A comparison of these methods has been presented previously [62,63,64,65,66]. Dynamic programming uses optimization techniques to minimize a cost function that is the weighted sum of local estimates such as echo intensity, intensity gradient, and boundary continuity. All sets of spatial pixel points forming a polyline are considered. The polyline with the lowest cost is taken to define the cIMT vertices, and such polylines constitute the boundary [54]. The HT of a line is a point in the (s, θ) plane: all the points on the line map into this single point. This fact is utilized to find different line segments through edge points, which can be used to detect curves in an image. HT has been used for lumen segmentation in various works by Golemati et al. [67], Stoitis et al. [68], and Petroudi et al. [69]. Destrempes et al. [56] used the Nakagami distribution and motion estimation to segment the cIMT region. This generation also involved the usage of signal processing techniques such as scale-space [70] to detect LI and MA boundaries [46]. Several versions of scale-space-based cIMT segmentation methods were augmented by second-generation methods [24, 71,72,73].
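
The dynamic-programming idea described above can be sketched as follows. This is only a minimal illustration, assuming a precomputed per-pixel cost map (low where a boundary is likely) and a continuity penalty on row jumps between adjacent columns; the function name `trace_boundary` and the parameters `smooth` and `max_jump` are hypothetical and are not taken from [54].

```python
import numpy as np

def trace_boundary(cost, smooth=1.0, max_jump=1):
    """Return one row index per column for the minimum-cost left-to-right path.

    cost     : 2D array, local cost per pixel (assumed to be built elsewhere
               from echo intensity and intensity gradient).
    smooth   : weight of the boundary-continuity term (hypothetical parameter).
    max_jump : largest allowed row change between adjacent columns.
    """
    rows, cols = cost.shape
    acc = np.full((rows, cols), np.inf)       # accumulated cost of best path
    back = np.zeros((rows, cols), dtype=int)  # row of the predecessor pixel
    acc[:, 0] = cost[:, 0]
    for c in range(1, cols):
        for r in range(rows):
            lo, hi = max(0, r - max_jump), min(rows, r + max_jump + 1)
            prev = np.arange(lo, hi)
            # cost of arriving from each candidate row, plus continuity penalty
            cand = acc[prev, c - 1] + smooth * np.abs(prev - r)
            k = int(np.argmin(cand))
            acc[r, c] = cost[r, c] + cand[k]
            back[r, c] = prev[k]
    # backtrack from the cheapest end point in the last column
    path = np.empty(cols, dtype=int)
    path[-1] = int(np.argmin(acc[:, -1]))
    for c in range(cols - 1, 0, -1):
        path[c - 1] = back[path[c], c]
    return path
```

Running the tracer once on a cost map tuned for the LI interface and once on a map tuned for the MA interface would give two polylines from which a thickness can be measured; how the cost terms are weighted is a design choice left open here.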

In the second generation, the concept of active contours evolved, which involves fitting a contour to local image information. Snakes are an example of active parametric contours [14] that were used for LI/MA estimation, followed by cIMT measurement [74]. In the same class, curve-fitting models were a classic example of the cIMT estimation method [75]. Level sets, in which topologically independent zero-level curves propagate and settle at the LI and MA interfaces, were used to compute the LI/MA interfaces, the so-called cIMT borders [76]. There have been methods that fuse the first and second generations, such as the fusion of scale-space [70] with deformable models [46]. Edge detection techniques use the variation of grey levels, including the image gradient, to delineate cIMT borders [77]. Molinari et al. used a fusion of these approaches to segment the cIMT borders [78]. Elisa et al. [27] used the automated AtheroEdge™ software system [19, 66, 79], based on scale-space, for computing cIMT and plaque area (PA) after computing the LI and MA boundary interfaces. The third-generation advanced techniques consist of intelligence-based techniques such as ML [49] and DL [50], primarily banking on neural network models. ML was based on hand-crafted features, while DL was more along the lines of automated feature extraction. Regarding the risk prediction of CVD, both ML and DL have been objective methodologies that can be replicated with high diagnostic accuracy [80, 81]. A generation-wise representation of the technologies is shown in Fig. 2. A detailed discussion of the general working of ML and DL follows next.

Fig. 1

(i) Left panel: atherosclerosis progression consisting of (a) endothelium dysfunction and lesion initiation, (b) formation of a fatty streak, (c) formation of fibrous plaque underlying fibrous cap, and (d) rupture and thrombosis. (ii) Right panel: ultrasound scanning of the carotid artery (both images courtesy of AtheroPoint™, Roseville, CA, USA)

ML and DL differ in the methodology of feature derivation from instances. In the case of ML (details in Appendix 1), feature extraction is handcrafted and independent of the actual classification model (shown in Fig. 3 (i)), whereas in DL, feature extraction and the characterization model are not separated from each other. ML classification models mostly apply a single statistical learning and inference technique to make accurate predictions using the extracted features [82, 83]. The DL models, on the other hand, apply multiple layers of statistical learning and inference techniques to extract features and make predictions. Hence, the DL techniques are costlier in terms of computation time and space, but are more robust and in some cases provide higher accuracy when applied to a very large dataset [84]. ML techniques are more economical in both time and cost when compared with DL. Often these two techniques are combined for better performance. An interesting characteristic is how ultrasound image noise (speckle noise, scattering noise) was handled by each of these generations. While the first two generations used various denoising techniques such as Gaussian filtering, anisotropic diffusion, and smoothing, no comparable denoising step evolved in the third generation [64, 65, 75, 85]. A brief generation-wise outline of the technologies is given in Table 1. Although the technologies were divided generation-wise, they were not water-tight compartments. Many models were a fusion of different technologies belonging to different generations to increase performance. In this regard, Molinari et al. [23] used a combination of first- and second-generation techniques to minimize the cIMT error.

Fig. 2

Three generations of cIMT/PA measurement evolution (color image) (courtesy of AtheroPoint™, Roseville, CA, USA)

Table 1 The three generations of cIMT regional segmentation

The PA biomarker received significant research focus after it was shown to be as important as cIMT in the works of Spence et al. [86, 87], Mathiesen et al. [88], and Saba et al. [79]. Mathematically, PA can be computed along with cIMT if the LI/MA borders are known. It is computed by counting all pixels between the LI and MA borders and then calibrating the count to mm². PA has been well adapted in the AECRS 1.0 [89] and AECRS 2.0 [18] systems.
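
The pixel-counting idea can be illustrated with a short sketch. It assumes that the LI and MA borders of the far wall are available as one row index per image column, that the MA border lies below the LI border, and that the pixel spacing is isotropic. The function name `cimt_and_pa` and the `mm_per_pixel` parameter are hypothetical, and the thickness here is a simple column-wise distance rather than the polyline distance used in the cited works.

```python
import numpy as np

def cimt_and_pa(li_rows, ma_rows, mm_per_pixel):
    """Estimate cIMT (mm) and PA (mm^2) from per-column border positions.

    li_rows, ma_rows : row index of the LI and MA border in each column.
    mm_per_pixel     : calibration factor (assumed equal in both directions).
    """
    li = np.asarray(li_rows, dtype=float)
    ma = np.asarray(ma_rows, dtype=float)
    thickness_px = ma - li                 # pixels between the borders, per column
    if np.any(thickness_px < 0):
        raise ValueError("MA border must not lie above the LI border")
    cimt_mm = thickness_px.mean() * mm_per_pixel
    # number of pixels between the borders, converted to an area
    pa_mm2 = thickness_px.sum() * mm_per_pixel ** 2
    return cimt_mm, pa_mm2
```

Because both quantities come from the same pair of borders, any error in the LI/MA delineation propagates to cIMT and PA together.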

ML Application for cIMT and PA Measurement

In Appendix 1, a brief outline of ML techniques is given. The different AI methodologies are shown in Fig. 3 (ii) and their description is given in Table 2. They are categorized into conventional, machine learning, and deep learning strategies. In this section, we study in detail the different ML techniques that are applied for cIMT/PA segmentation and measurement from common carotid artery (CCA) images. As stated earlier, features need to be extracted before the ML model is applied. Some part of the features is used for training the model, i.e., training data (\({X}_{tr}\)), while the rest is used to test the model, i.e., test data (\({X}_{ts}\)). The training process uses a feedback mechanism based on the known actual outputs, wherein the model parameters are changed based on the error. This training process repeats until the model parameters converge, i.e., no longer change their values, at which point training is considered complete. The model with trained parameters is then tested on \({X}_{ts}\), whose outputs are unknown to the model. Once the model outputs are available, they are compared with the actual outputs to check the performance of the model. Several applications of ML have been developed that use the concept of offline and online systems, such as arrhythmia [90] and diabetes [91]. Since they all use ground truth during cross-validation, they are called supervised learning, which is applied extensively in cIMT and PA computation from CCA images. There are two different SOTs (Rosa et al. [92, 93] and Molinari et al. [23]) regarding segmenting the cIMT region from ultrasound CCA images. Both SOTs extract the ROI before applying the ML paradigm for cIMT segmentation. They are given as follows:

Table 2 Comparison between different cIMT regional segmentation methods for given attributes

ANN Model for cIMT Region Detection

Rosa et al. [92] used three stages to acquire cIMT values from 60 B-mode ultrasound CCA images. In stage I, the CCA images were preprocessed to extract the ROI. Stage II is the AI-based classification stage, where the IMT regional pixels were segregated from non-IMT regional pixels, resulting in a binary image. Stage III consisted of delineation of the LI and MA boundaries from the binary image. The preprocessing stage consisted of the application of the watershed transform [94,95,96] to the CCA image to detect the lumen, wherein the lower limit of the lumen is assigned as the posterior wall. The final ROI is the region where the uppermost point of the far wall is taken as 0.6 mm above the binary lumen, and the bottom boundary is fixed at 1.5 mm below the lowest point detected in the binary mask. Once the ROI is detected, its dimension is noted and it is extracted from the original image as shown in Fig. 4 (i). The extracted ROIs are input to the next stage for IMT border delineation. The next stage applies an ensemble of four artificial neural networks (ANNs) [97] to classify cIMT from non-cIMT region pixels. Each ANN is a multi-layer perceptron consisting of three layers, i.e., input, hidden, and output, as shown in Fig. 4 (ii). In the ANN model, the computation is done by the nodes, whereas the “learning experience” is embedded in the weights between the input-hidden and hidden-output layers. These weights are generally initialized randomly and converge to stable values based on the feedback from error propagation as the learning progresses. The computation is generally a multiplication of the weights with the input values, to which a bias term \(\beta\) is added, followed by the application of a sigmoid function \(\sigma (x)=\frac{1}{1+{e}^{-x}}\). The output function is given by:

$$\widehat{f}(x) = \sigma \left({\beta }^{jy}+{W}^{jy}\,\sigma \left({\beta }^{ij}+{W}^{ij}x\right)\right)$$
(1)

where the superscripts \(ij\) and \(jy\) denote the weights and biases of the input-hidden and hidden-output layers, respectively. The weight values (\({w}_{ij}\)) are updated for each error propagation using the gradient of the error \(\varepsilon\) with respect to the weight (\(\frac{\Delta \varepsilon }{\Delta w}\)), given as:

$${w}_{ij}={w}_{ij}-\gamma \frac{\Delta \varepsilon }{\Delta w}$$
(2)

where \(\gamma\) is the learning rate. The input pattern to each ANN consists of the pixel intensities generated by a kernel process. For each image, three kernels of size 3 × 3, 7 × 7, and 11 × 11 were applied pixel-by-pixel through shifting to collect contextual information from neighborhood pixel intensities. The process is also known as convolution, and its operation using the \(3\times 3\) kernel is shown in Fig. 4 (iii). The ground truth was the pixel class information collected through manual segmentation of CCA images and annotation of each pixel as cIMT boundary or not. The inputs from the three kernels were passed through three ANNs, respectively, for training and testing. The output of each of the three ANNs is a reconstructed binary image. Thereafter, these three output binary images are merged, another kernel of size 3 × 3 is applied to the merged image, and this data is fed into the fourth ANN to get the final binary mask. A representative model of the entire process is shown in Fig. 4 (iv). In the final stage, the LI and MA boundaries are identified (shown in Fig. 4 (v)) and cIMT is computed. The mean absolute distance, polyline distance, and centerline distance were \(0.03763\pm 0.02518\) mm, \(0.03670\pm 0.02429\) mm, and \(0.03683\pm 0.02450\) mm, respectively.
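
The kernel-based input generation and the weight update can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of [92]: a single hidden layer with sigmoid activations, a squared-error loss, edge padding at the image border, and hypothetical names (`extract_patches`, `forward`, `train_step`, `lr`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_patches(image, k):
    """Flattened k x k neighbourhood around every pixel (edge-padded, odd k)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    rows, cols = image.shape
    patches = np.empty((rows * cols, k * k))
    for r in range(rows):
        for c in range(cols):
            patches[r * cols + c] = padded[r:r + k, c:c + k].ravel()
    return patches

def forward(x, w_ih, b_h, w_ho, b_o):
    """Input -> hidden -> output, each with a bias term and a sigmoid."""
    hidden = sigmoid(x @ w_ih + b_h)
    return sigmoid(hidden @ w_ho + b_o), hidden

def train_step(x, y, w_ih, b_h, w_ho, b_o, lr=0.5):
    """One gradient-descent update (w <- w - lr * gradient), done in place.

    Returns the mean squared error measured before the update.
    """
    out, hidden = forward(x, w_ih, b_h, w_ho, b_o)
    n = x.shape[0]
    d_out = (out - y) * out * (1.0 - out)             # error at the output node
    d_hid = (d_out @ w_ho.T) * hidden * (1.0 - hidden)  # error propagated back
    w_ho -= lr * hidden.T @ d_out / n
    b_o -= lr * d_out.mean(axis=0)
    w_ih -= lr * x.T @ d_hid / n
    b_h -= lr * d_hid.mean(axis=0)
    return float(np.mean((out - y) ** 2))
```

In this sketch, `extract_patches(image, 3)`, `extract_patches(image, 7)`, and `extract_patches(image, 11)` would provide the inputs of three separate networks, with one label per pixel taken from the manual segmentation.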

Extreme Learning Machines-Radial Basis Neural Network Model for cIMT Region Detection

Rosa et al. [93] used a radial basis neural network (RBNN) [98] for the estimation of the cIMT region from 25 ultrasound CCA images. RBNNs are feed-forward neural networks consisting of input, hidden, and output layers. One of the key differences between RBNN and ANN is that the number of hidden layers in RBNN is restricted to one. The working principle of RBNN lies in interpolating \(N\) training points \({x}^{r}\) to their corresponding target variables \({y}^{r}\). Therefore, the model output function for an input \({x}^{ts}\) is given by:

$$\widehat{f}\left({x}^{ts}\right)=\sum_{r=1}^{N}{w}_{r}{\varphi }_{r}\left(\Vert {x}^{ts}-{x}^{r}\Vert \right)$$
(3)

where \({\varphi }_{r}\) is the Gaussian radial basis function implemented by the hidden layer and \({w}_{r}\) is the weight between the hidden and output layers. The function \({\varphi }_{r}\) is given by:

$${\varphi }_{r}\left(\Vert {x}^{ts}-{x}^{r}\Vert \right)=\mathrm{exp}\left(-\frac{{\Vert {x}^{ts}-{x}^{r}\Vert }^{2}}{2{\rho }^{2}}\right)$$
(4)

where \(\rho\) represents the width of the hidden neuron. Equation (3) can be reduced into a matrix notation and is given as:

$$\boldsymbol{\varphi }{\varvec{W}}={\varvec{T}}$$
(5)

where \({\varvec{T}}\) is the target vector. The weights can be found by standard matrix inversion

$${\varvec{W}}={\boldsymbol{\varphi }}^{-1}{\varvec{T}}$$
(6)

The initialization of the hidden nodes’ Gaussian parameters (number of hidden nodes, centers, and deviation of each radial unit) is done using the optimally pruned extreme learning machine (OP-ELM) [99, 100]. The ground truth information and ROI of the 25 CCA images are extracted using the method described in the “ANN Model for cIMT Region Detection” section. Similar to the previous method [92], the input pattern is generated by the kernel process. A comparative study was performed using kernel sizes varying in the range of 3 to 23. The optimized kernel window size was \(19\times 19\). The ground truth was manually traced by experienced radiologists. Three classes of pixels were considered for each CCA image: pixels of the LI boundary, pixels within the region between the LI and MA boundaries, and pixels of the MA wall. Finally, each class of pixels was extracted and superimposed on the original image as shown in Fig. 4 (vi). The cIMT error for this experiment was \(0.065\pm 0.046\) mm.
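
A compact sketch of the radial basis model may help make the interpolation step concrete. It assumes the centers and the width \(\rho\) are already given (the OP-ELM initialization is not reproduced), and it solves the linear system with a least-squares routine rather than an explicit matrix inverse, which is a substitution made here for numerical safety. All function names are hypothetical.

```python
import numpy as np

def rbf_design_matrix(x, centers, rho):
    """Gaussian basis values: entry (n, r) is exp(-||x_n - c_r||^2 / (2 rho^2))."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * rho ** 2))

def fit_rbf_weights(x_train, targets, centers, rho):
    """Solve (design matrix) @ weights = targets in the least-squares sense."""
    phi = rbf_design_matrix(x_train, centers, rho)
    weights, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return weights

def predict_rbf(x, centers, rho, weights):
    """Weighted sum of the basis responses for each input row."""
    return rbf_design_matrix(x, centers, rho) @ weights
```

When every training point is used as a center, the design matrix is square and the fitted model passes through the training targets, which is the interpolation property referred to above.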

Fuzzy K-means Classifier for cIMT Region Extraction

Molinari et al. [23] introduced the concept of unsupervised learning of fuzzy K-means clustering [101] (FKMC) to segregate a CCA image into three parts, i.e., plaque region, LI, and MA boundaries. This method is also called CULEXia. The FKMC is unsupervised in the sense that there is no ground truth for the pixels. The FKMC is similar to the K-means algorithm [102], where initially K random points denoting K cluster centers are initialized, and then all data points in the dataset are assigned to each of the K clusters based on the nearest mean. For each cluster centroid, \({c}_{j}\) represents the mean of all points in the cluster. The membership function \({b}_{ij}\) of each point \({a}_{i}\) determining degree of its membership for each cluster \({c}_{j}\), is given by:

$${b}_{ij}=\frac{1}{\sum_{k=1}^{K}{\left(\frac{\Vert {a}_{i}-{c}_{j}\Vert }{\Vert {a}_{i}-{c}_{k}\Vert }\right)}^{\frac{2}{m-1}}},\quad 0\le {b}_{ij}\le 1$$
(7)

where \(m\) is a hyper-parameter controlling fuzziness.
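
The membership computation can be sketched as follows, using the standard fuzzy c-means form of the membership function. The centroid update shown is the usual fuzzy c-means rule and is an assumption here, since the text does not spell it out; the small constant `eps` and the function names are likewise hypothetical.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0, eps=1e-12):
    """Membership b[i, j] of point i in cluster j; each row sums to 1."""
    # dist[i, j] = distance from point i to center j (eps avoids division by zero)
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + eps
    # ratio[i, j, k] = dist[i, j] / dist[i, k]
    ratio = dist[:, :, None] / dist[:, None, :]
    return 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)

def update_centers(points, memberships, m=2.0):
    """Membership-weighted mean of the points for each cluster."""
    w = memberships ** m
    return (w.T @ points) / w.sum(axis=0)[:, None]
```

Alternating the two functions until the centers stop moving gives the clustering; in the column-wise setting described next, `points` would hold the pixel intensities of a single image column.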

In this work, the ROI is extracted from each CCA image by tracing the MA wall. The MA wall is traced by locating, for each column, the brightest local maximum starting from the bottom of the image. The upper and lower limits of the ROI are set 1.25 mm above and 0.625 mm below the MA wall, respectively, as shown in Fig. 5 (i). Thereafter, in the next stage, three clusters of pixels are considered for each column in the ROI, i.e., lumen, LI, and MA wall. The pixels in each column are automatically assigned to each cluster using the FKMC algorithm. The sequence of LI and MA centroids over the columns marks the delineation of the respective walls in the ROI as shown in Fig. 5 (ii). This procedure was repeated for 200 ultrasound CCA images. The cIMT error for the method was \(0.054\pm 0.035\) mm. The ML models discussed covered cIMT measurement in brief; however, they did not consider plaque area until the DL work by Elisa et al. [103] in 2018 and the joint cIMT and PA measurement by Biswas et al. [104] in 2020. A brief outline of DL in cIMT and PA measurement is given in the next section.

Fig. 3

(i) Generalized ML model and (ii) classification tree of different automated models

Deep Learning Application for cIMT and PA Extraction

There are several DL techniques used for image classification and segmentation [82, 105], such as CNN [50], deep belief networks (DBN) [106], autoencoders [107], and residual neural networks [108]. The fully convolutional network (FCN) [109] is a variation of CNN that excludes the fully connected network and is used specifically for segmentation. Three different SOTs are working on cIMT measurement from ultrasound CCA images using DL. Rosa et al. [110] used autoencoders for feature extraction before employing ANNs for LI and MA boundary delineation. Suri and his team used FCN on both whole [111] and patch [104] CCA ultrasound images for cIMT and PA [103] measurement, while Del Mar et al. [112] used FCN on whole CCA images for cIMT estimation. The next few subsections deal with some of these works in detail for cIMT and PA measurement computed using CCA images.

ANN Autoencoder-Based cIMT Region Segmentation

Rosa et al. [110] used ANNs for characterizing LI and MA pixels in 55 ultrasound CCA images. The authors used trained autoencoders [107] to extract features at the LI and MA interfaces. Autoencoders are neural networks designed to reproduce their input. They are generally used for unsupervised learning to understand complex relationships within input images. Given an input \({\varvec{X}}\in {\left[\mathrm{0,1}\right]}^{dx}\), the autoencoder maps it to a compressed representation \({\varvec{Y}}\in {\left[\mathrm{0,1}\right]}^{dy}\) using single or multiple hidden layers and a mapping function similar to Eq. 3. Finally, the compressed representation is upsampled to its original dimension. This is done to capture the essential structures and relationships within the input vector that are required to regenerate it. The compressed representation of neighborhood pixels denotes the features. An illustration of the autoencoder is shown in Fig. 6 (i). The ROI extraction approach is similar to [92]. The authors trained two autoencoders using five ground truth images. The neighborhood pixels of the LI and MA interfaces were used to train the two autoencoders. The LI and MA features were then used to train two ANNs for pixel characterization. Finally, the offline CCA images were segmented to extract the LI and MA boundaries. Postprocessing was applied to discard unnecessary LI and MA pixels. The system’s cIMT error was \(0.0499\pm 0.0498\) mm. The process model is shown in Fig. 6 (ii), while the results are shown in Fig. 6 (iii).

Fig. 4

(i) ROI extraction (reproduced with permission), (ii) ANN, (iii) convolution operation, (iv) stage II ANN ensemble network, (v) outputs of [92] (reproduced with permission), (vi) segmentation using RBNN [93] (reproduced with permission)

Fully Convolutional Network for cIMT Region Estimation

In the work by Biswas et al. [111], the LI and MA borders were extracted in three stages from 396 ultrasound CCA images. The ground truth was binary images obtained from tracings of the LI and MA borders by two experienced radiologists using the general-purpose tracing software AtheroEdge™ [66, 113, 114]. In the first multiresolution stage, the images were cropped by 10% from each side to ensure that the low-contrast and nonrelevant portions of the images do not affect learning in the second stage. Such portions typically arise due to poor probe-to-neck contact or lack of gel during image acquisition. In the second stage, FCN was applied to segment the cIMT region from the rest of the image. The FCN-based system consisted of two subsystems: (i) an encoder and (ii) a decoder (shown in Fig. 7 (i)). The encoder consisted of 13 layers of convolution (two layers of 64 (window size = 3 × 3) kernels + two layers of 128 (window size = 3 × 3) kernels + three layers of 256 (window size = 3 × 3) kernels + three layers of 512 (window size = 3 × 3) kernels + three layers of 512 (window size = 3 × 3) kernels), and five max-pooling layers to draw a downsampled representation or feature map of the image. The convolution operation (similar to that discussed in the “ANN Model for cIMT Region Detection” section for Rosa et al. [92]) applies kernel filters over the image pixel-by-pixel to extract contextual information and position-invariant features of the cIMT wall region. Before every max-pooling operation, a 1 × 1 convolution operation is employed to reduce the feature maps to a single map. The decoder consists of three upsampling layers to upsample the downsampled feature map, two intermediate skipping operations to merge intermediate feature maps, and one softmax layer to make a pixel-to-pixel comparison with the ground truth. Mathematically, the convolution operation is given as:

$$g\left(a,b\right)=I\left(a,b\right)\otimes k\left(a,b\right)={\sum }_{s=-\frac{m}{2}}^{\frac{m}{2}}{\sum }_{t=-\frac{m}{2}}^{\frac{m}{2}}I\left(a+s,b+t\right) \times k\left(s,t\right)$$
(8)

where \(g\) is the output representation at the location \((a,b)\), \(I\) is the input image, \(k\) is the kernel of size \(m\times m\), \((a,b)\) represents the location of the pixel, and \((s, t)\) are dummy variables. The max-pooling operation is used for the downsampling of the feature map by retaining the most important information from each block of the image. The FCN model applied is shown in Fig. 7 (i). The encoder applies 13 layers of convolution (shown in red boxes) and five max-pooling layers (shown in blue boxes) to obtain a feature map of 1/32nd the size of the original image, as shown in Fig. 7 (i). The decoder uses a series of three up-sampling layers (shown in gray boxes) and intermediate skipping operations (shown in green boxes) to perform dense softmax classification (shown in the orange box) against the ground truth for each feature map pixel. The up-sampling operation is the inverse of convolution applied to the feature map at the end of the encoder. Three up-sampling layers were applied to expand the image to its original size. The skipping operations merged feature maps from the contracting layers of the encoder into the intermediate layers of the decoder to recover spatial information lost during downsampling in the encoder. The cross-entropy loss function for the pixel-to-pixel characterization is given by:

$$\theta_{class}(\alpha_1,\alpha_2)=-\frac1{\left|N\right|}{\textstyle\sum_{n\in N}}{\textstyle\sum_{l\in L}}{\alpha_2}_n\left(l\right)\log{\alpha_1}_n(l)$$
(9)

where \({\alpha }_{1}\) is the prediction, \({\alpha }_{2}\) is the gold standard or ground truth (GT), \(L\) is the set of classes, and \(N\) is the set of images. The softmax layer is used for final characterization, where a pixel is assigned the class with the highest probability, as given by:

$$P\left({z}_{i}\right)=\frac{{e}^{{z}_{i}}}{\sum_{j=1}^{N}{e}^{{z}_{j}}}$$
(10)

where \({z}_{i}\) represents the output score of the instance for the \(i\)th class. The segmented images were computed by fixing the number of DL iterations to 20,000. Finally, in the third stage, refinement of the MA boundary was done using calibration employing a matrix inverse operation [115]. The LI and MA borders along with their ground truth counterparts are shown in Fig. 7 (ii). The cIMT errors obtained for this application using DL on the two ground truths were \(0.126\pm 0.134\) mm and \(0.124\pm 0.10\) mm, respectively.
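
The building blocks named in this section (kernel convolution, max-pooling, softmax, and cross-entropy) can be illustrated with a small NumPy sketch. It is not the FCN of [111]: it assumes a single-channel image, an odd kernel size, zero padding, 2 × 2 pooling, and a loss averaged over pixels rather than images. All function names are hypothetical.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Sum of I(a+s, b+t) * k(s, t) over the kernel window, zero-padded."""
    m = kernel.shape[0]
    half = m // 2
    padded = np.pad(image, half, mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for a in range(image.shape[0]):
        for b in range(image.shape[1]):
            out[a, b] = np.sum(padded[a:a + m, b:b + m] * kernel)
    return out

def max_pool_2x2(fmap):
    """Keep the largest value of every non-overlapping 2 x 2 block."""
    rows, cols = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

def softmax(scores, axis=-1):
    """Class probabilities from raw scores (shifted for numerical stability)."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(pred, truth, eps=1e-12):
    """Mean over pixels of -sum_l truth(l) * log(pred(l)); truth is one-hot."""
    return float(-np.mean(np.sum(truth * np.log(pred + eps), axis=1)))
```

Applying `max_pool_2x2` five times halves each side five times, which is why the encoder output is 1/32nd of the input size along each dimension.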

Fig. 5

(i) Far wall detection and (ii) segmentation of LI (white) and MA (black) boundary (reproduced with permission) [23]

FCN for PA Measurement

Elisa et al. [103] used a strategy similar to that of [111], discussed in the “Fully Convolutional Network for cIMT Region Estimation” section, to obtain the PA values for the same cohort using the same ground truth, and drew important conclusions. The PA error values for the two ground truths were \(20.52\) mm² and \(19.44\) mm², respectively. The coefficients of correlation between PA and cIMT using the outputs of DL for the two ground truths were \(0.92 \left(p<0.001\right)\) and \(0.94 (p< 0.001)\), respectively. The output image is shown in Fig. 7 (iii).

Two-Stage Patching-Based AI Model for cIMT and PA Measurement

Recently, Biswas et al. [104] used a two-stage DL-based network for cIMT and PA measurement. In the first stage, a CNN was applied for extraction of ROI while in the second stage, an FCN similar to [111] was used for delineation of LI and MA borders and cIMT measurement. CNN is different from FCN in the way that CNN applies a neural network in the form of a fully connected layer for classification, whereas a trained FCN only applies convolution and subsequently upsampling to regenerate a feature map representing semantic segmentation of the actual image. A representative diagram of the 2-layer CNN model is shown in Fig. 8 (i). The softmax classification function (Eq. 12) is used for final characterization for the image. In this work, authors have used 22-deep layers [116] for the extraction of high-level features from the images. Initially, 250 CCA images of the diabetic cohort were collected. A preprocessing was performed using the AtheroEdge™ (AtheroPoint™, Roseville, CA, USA) system to crop the image to remove background information and ensure the lumen region is central to the cropped CCA image. The resultant cropped CCA image is split horizontally into two halves. The bottom half of the image consisting of the far-wall is taken out and further spit horizontally. The upper split in the bottom half consists of wall information whereas the lower split consisted of the tissue region. Both the upper and lower splits are divided column-wise into sixteen equal-sized patches. The patches form the input to the two-stage DL-based system. In the first stage, an independent 22-layered CNN network [116] is applied to characterize the input images into the wall and nonwall patches. The CNN accuracy performance for characterization was approximately 99.5%. Once the patches are characterized, the wall patches are combined patient-wise to generate the ROI segment. The preprocessing and DL stage I inputs and outputs are shown in Fig. 8 (ii). 
These ROI segments along with their binary ground truths are used to train the second stage DL. The second stage DL architecture is similar to the previous architecture [111]. Once trained, the second stage DL system partitions the ROI segment into plaque and nonplaque regions.
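The patching step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of [104]: the function names `far_wall_patches` and `rebuild_roi` are hypothetical, the image is assumed to be a 2D grayscale array that has already been cropped, and the wall/nonwall flags are assumed to come from some stage-one classifier that is not shown here.

```python
import numpy as np


def far_wall_patches(cropped, n_patches=16):
    """Split a cropped CCA image into far-wall patches.

    The bottom half (far wall) is split horizontally again into an upper
    split (wall information) and a lower split (tissue region); each split
    is then divided column-wise into n_patches patches.
    """
    height = cropped.shape[0]
    bottom = cropped[height // 2:, :]          # bottom half holds the far wall
    half = bottom.shape[0] // 2
    upper, lower = bottom[:half, :], bottom[half:, :]
    # Assumption: np.array_split is used so that widths not divisible by
    # n_patches still work; the text itself speaks of equal-sized patches.
    upper_patches = np.array_split(upper, n_patches, axis=1)
    lower_patches = np.array_split(lower, n_patches, axis=1)
    return upper_patches, lower_patches


def rebuild_roi(patches, is_wall):
    """Concatenate the patches flagged as wall into one ROI segment."""
    kept = [p for p, keep in zip(patches, is_wall) if keep]
    return np.concatenate(kept, axis=1)


# Usage with a synthetic 64 x 160 image and all upper patches flagged as wall.
image = np.arange(64 * 160, dtype=float).reshape(64, 160)
upper_patches, lower_patches = far_wall_patches(image)
roi = rebuild_roi(upper_patches, [True] * len(upper_patches))
```

In this sketch the ROI is rebuilt by simple column-wise concatenation of the retained patches; how the original work orders and merges patches patient-wise is not specified in the text.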

Fig. 6

(i) An autoencoder, (ii) process model, (iii) image outputs (reproduced with permission) [110]

Thereafter, the LI and MA boundaries are delineated from the plaque region, and cIMT and TPA are computed as shown in Fig. 8 (iii). The cIMT error, PA, and PA error were found to be \(0.0935\pm 0.0637\) mm, \(21.5794\pm 7.9975\) mm\(^2\), and \(2.7939\pm 2.3702\) mm\(^2\), respectively. A similar strategy [112] was applied to whole CCA images using DenseNet [108] on a set of 331 images, with a mean cIMT error of \(0.02\) mm. Although DL techniques are accurate, robust, and more scalable than ML techniques, their storage and computation costs are much higher. The many layers of neurons require a huge number of parallel computations, which may not be feasible on desktop CPU architectures and instead call for graphics processing units (GPUs). Storing the intermediate values also requires a large amount of memory. Therefore, ML is better suited to small systems where fast results are needed and accuracy is not a high priority. DL, on the other hand, is suited to industrial medical imaging, where patient volumes are large and higher accuracy is required.
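Since cIMT is the distance between the LI and MA borders and PA is the area enclosed between them, both quantities can be approximated directly from delineated borders. The sketch below is a simplification under stated assumptions and not the measurement method of the cited works: the hypothetical function `cimt_and_pa` assumes each border is given as one row position (in pixels) per image column, that the MA border lies below the LI border (far wall), and that the pixel spacing `mm_per_pixel` is known and isotropic. It uses the vertical distance per column rather than a polyline or centerline distance.

```python
import numpy as np


def cimt_and_pa(li_rows, ma_rows, mm_per_pixel):
    """Approximate cIMT (mm) and PA (mm^2) from far-wall LI and MA borders.

    li_rows, ma_rows: row index of each border for every image column.
    mm_per_pixel: assumed isotropic pixel spacing of the scan.
    """
    li = np.asarray(li_rows, dtype=float)
    ma = np.asarray(ma_rows, dtype=float)
    if li.shape != ma.shape:
        raise ValueError("LI and MA borders must cover the same columns")
    thickness_px = ma - li                     # vertical wall thickness per column
    if np.any(thickness_px < 0):
        raise ValueError("MA border is expected to lie below the LI border")
    cimt_mm = thickness_px.mean() * mm_per_pixel
    # Each column contributes a strip one pixel wide and thickness_px tall.
    pa_mm2 = thickness_px.sum() * mm_per_pixel ** 2
    return cimt_mm, pa_mm2


# Usage: a flat wall that is 10 pixels thick over 4 columns at 0.1 mm/pixel.
cimt_mm, pa_mm2 = cimt_and_pa([10, 10, 10, 10], [20, 20, 20, 20], 0.1)
```

The vertical-distance approximation is adequate only when the wall is roughly horizontal in the image; for tilted or curved vessels a polyline-type distance would be more appropriate.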

Discussion

In this study, we have looked at several state-of-the-art AI techniques implemented in recent years for plaque burden quantification in the form of cIMT and PA using carotid ultrasound images. Although AI is a newer concept, it has had a significant impact on the medical imaging industry due to its robustness and accuracy. ML techniques introduced the notion of learning from training images, an approach that DL techniques have recently taken over. Even though forthcoming deep learning systems are superior to older-generation systems, there is a price to pay in hardware cost or computational time. We have yet to see a more thorough comparison in terms of speed, accuracy, large cohorts, and variability in the resolution of the carotid scans, applied to both CCA and bulb segments. In some earlier works, such as the application of DL to the liver, it has been seen that the presence of noise or redundant information can significantly affect the performance of deep learning [117]. Hence, cropping was done to achieve significant performance levels. The training data size also significantly affects the training performance irrespective of the cross-validation protocol. A large training pool is always better for training DL models, as it captures wider complexities and intricacies of the data pool. This helps achieve better performance when the model is applied to unknown data. Further, this review sheds light on three generations of cIMT/PA evolution, followed by an in-depth analysis of several key ML and DL paradigms that computed cIMT and PA values directly from carotid ultrasound images without any human intervention. Note that since the scope of this study was limited to static (or frozen) ultrasound scans, there was no attempt to study cIMT/PA in motion imagery. However, studies have been done using conventional (non-AI) methods for cIMT estimation in selected frames of the cardiac cycle for understanding the plaque movement [118,119,120].
A visual comparison of AtheroEdge™ (a patented software) [121, 122], the scale-space method [46], and the AI-based model (Biswas et al. [111]) is shown in Fig. 9. The AtheroEdge™ model is based on splines and elastic contours and achieves fairly good results with clear border tracing. However, it fails around noisy corners. On the other hand, the AI model achieves a distinct and clear delineation when compared with the scale-space model. The AI model results are more closely aligned with the ground truth than those of the scale-space model for the same patient. The AI model parameters are trained to align along the LI-far and MA-far walls over several iterations, leading to better learning. Better training means that the parameters have also accounted for plausible noise around the walls and worked around it to delineate the correct borders. Low-level segmentation models such as scale-space fail to account for the noisy information in their computation and therefore give a less accurate delineation.

Fig. 7

(i) FCN model, (ii) IMT output using the FCN [111], and (iii) TPA output using the FCN [103]

Benchmarking

The first- and second-generation techniques have already been briefly described by Molinari et al. [123]. In this section, we present Table 3, which benchmarks the first-, second-, and third-generation techniques used over the last two decades. The first-generation techniques are discussed first. In the year 2000, Liang et al. [124] used DP to quantify cIMT from 50 images. Both Stein et al. [125] and Faita et al. [126] used edge detection techniques to measure cIMT from 50 and 150 CCA images, respectively, with cIMT errors of \(0.012\pm 0.006\mathrm{ mm}\) and \(0.010\pm 0.038\mathrm{ mm}\), respectively. Ikeda et al. [48] used a bulb edge point detection technique to quantify cIMT from 649 images, with the bias between the predicted values and the ground truth being \(0.0106\pm 0.00310\mathrm{ mm}\). Fusions of first- and second-generation techniques are discussed next. Molinari et al. [72] used a combination of level set and morphological image processing to estimate cIMT from 200 images. The cIMT error was found to be \(0.144\pm 0.179\) mm. Similar fusion-based patented techniques were developed by Molinari et al. in the form of CARES [127], CAMES [24], CAUDLES [128], and FOAM [129] for cIMT estimation. Their cIMT errors were found to be \(0.172 \pm 0.222\mathrm{ mm}\), \(0.154 \pm 0.227\mathrm{ mm}\), \(0.224 \pm 0.252\mathrm{ mm}\), and \(0.150 \pm 0.169\mathrm{ mm}\) from 647, 657, 630, and 665 images, respectively. Some of the second-generation techniques are discussed now. In 2002, Gutierrez et al. [130] developed an active contour model for cIMT measurement from 180 images. Molinari et al. [131] developed a snakes-based model for cIMT estimation from 200 images. In 2014, Molinari et al. [132] developed the second-generation CALEX 1.0 model for estimating cIMT from 665 images.
The cIMT errors for the active contour, snakes, and CALEX 1.0 models were \(0.090 \pm 0.060\mathrm{ mm}\), \(0.01 \pm 0.01\mathrm{ mm}\), and \(0.191 \pm 0.217\mathrm{ mm}\), respectively. Among the third-generation AI-based techniques, Rosa et al. [92] used an ANN as the ML model and reported the cIMT error on 60 ultrasound CCA images using three metrics (mean, Suri’s polyline, and centerline), with average values of \(0.03763 \pm 0.02518\) mm, \(0.03670 \pm 0.02429\) mm, and \(0.03683 \pm 0.02450\) mm, respectively. Further, Rosa et al. [93] used an RBNN for cIMT computation, with an error of \(0.065 \pm 0.046\) mm on a database of 25 CCA images. Molinari et al. [23] used the unsupervised FKMC technique for cIMT computation from 200 ultrasound CCA scans. The cIMT error obtained was \(0.054 \pm 0.035\) mm. Rosa et al. [110] further combined ML (ANN) with a DL autoencoder to compute cIMT from 55 ultrasound images, with a mean cIMT error of \(0.0499 \pm 0.0498\) mm. Biswas et al. [111] used a database of 396 carotid scans and demonstrated the DL paradigm using an FCN for cIMT computation, with ground truth taken from two different observers. The cIMT errors with respect to the two ground truths were \(0.126 \pm 0.134\) mm and \(0.124 \pm 0.10\) mm, respectively. Elisa et al. [103] used the same FCN technique and obtained TPA of \(20.52\) mm\(^2\) and \(19.44\) mm\(^2\), respectively, using the two sets of ground truths. Biswas et al. [104] further showed the use of a two-stage (CNN + FCN) DL system to compute cIMT from 250 ultrasound CCA images. The authors used a patching-based approach to extract the ROI in the first stage and delineate the LI and MA borders in the second stage. The cIMT error was \(0.0935\pm 0.0637\) mm. The patch-based method was a unique solution in the sense that both stages were DL stages. In another study by Del Mar et al. [112], an FCN was applied to 331 carotid ultrasound scans with a cumulative cIMT error of \(0.02\) mm.
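The benchmark figures above are reported in the form mean \(\pm\) standard deviation. As a hedged illustration of how such a summary can be derived from per-image measurements, the sketch below computes both the absolute cIMT error and the signed bias against a ground truth. The function name `cimt_error_summary` is hypothetical, and the choice of absolute versus signed differences and of the sample standard deviation (`ddof=1`) are assumptions; the individual studies may define their error differently.

```python
import numpy as np


def cimt_error_summary(predicted_mm, ground_truth_mm):
    """Summarize per-image cIMT differences as mean and standard deviation.

    Returns the mean and SD of the absolute error and of the signed bias
    (predicted minus ground truth), all in millimetres.
    """
    pred = np.asarray(predicted_mm, dtype=float)
    truth = np.asarray(ground_truth_mm, dtype=float)
    if pred.shape != truth.shape:
        raise ValueError("one ground-truth value is needed per prediction")
    diff = pred - truth
    abs_err = np.abs(diff)
    return {
        "mean_abs_error": abs_err.mean(),
        "sd_abs_error": abs_err.std(ddof=1),   # sample SD (assumption)
        "bias": diff.mean(),
        "sd_bias": diff.std(ddof=1),
    }


# Usage with three synthetic cIMT readings (mm) and their manual tracings.
summary = cimt_error_summary([0.8, 0.9, 1.0], [0.7, 0.9, 1.2])
```

Reporting both quantities is useful because a small bias can hide large errors of opposite sign, whereas the absolute error cannot cancel out.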

Table 3 Benchmarking table showing ML/DL methods for cIMT/PA measurements
Fig. 8

(i) A CNN model, (ii) patching and reconstruction process, and (iii) outputs of the two-stage DL model [104]

For further reading, short notes on cardiovascular risk assessment, clinical impacts of AI on cIMT/PA techniques, inter- and intra-observer variability analysis, 10-year risk assessment, statistical power analysis and diagnostic odds ratio, and graphics processing units are given in Appendix 2.

Conclusions

The paper presented three generations of cIMT/PA measurement systems, from conventional methods to intelligence-based methods using machine learning and deep learning. The key reason for this wave is the ability to perform deeper number crunching to derive sophisticated information, leading to better accuracy, reliability, and stability of the systems. The improved results also show that the systems have significant clinical viability. Deep learning powered by GPUs has significantly impacted medical imaging, ushering in new horizons in automated diagnosis and treatment.