Introduction

Hepatocellular carcinoma (HCC) and cholangiocarcinoma (CCA) are the most common primary liver cancers, with markedly increasing incidences worldwide1,2. HCC commonly develops in individuals with cirrhosis, which is the strongest risk factor for HCC and can arise from several etiologies, including hepatitis B virus (HBV) and hepatitis C virus (HCV) infection, heavy alcohol consumption, and non-alcoholic fatty liver disease2,3,4,5. CCA occurs in individuals with chronic inflammation of the biliary epithelium2,3,6, particularly from liver fluke infection in East Asian populations7 and primary sclerosing cholangitis in Western populations2,3,4. HCC and CCA can be difficult to differentiate from benign liver lesions because their appearance on ultrasound images varies considerably with their growth pattern6,8. HCC typically appears as a hypoechoic lesion with hypervascularity, while CCA appears as an intrahepatic hypoechoic lesion with hepatobiliary duct dilation5,6. Patients with early HCC and CCA often lack symptoms, leading to advanced-stage diagnosis2,5. Thus, the best opportunity to improve patient outcomes is surveillance of at-risk individuals, which has been shown to detect the disease at early, curable stages and thereby reduce mortality2,5.

Radiologic examination is currently recommended for HCC and CCA surveillance. Ultrasonography (USG) is the most commonly used radiologic tool due to its wide availability, relatively low cost, and lack of radiation exposure. Despite these advantages, the worldwide USG surveillance rate remains low9,10. One of the major barriers to surveillance is the limited availability of skilled professionals, including radiologists and ultrasonographers proficient in cancer surveillance, particularly in remote areas10,11. In addition, USG is operator dependent, and examination results can be subjectively misinterpreted. Developing an assisted-ultrasound system that helps detect and classify focal liver lesions (FLLs) during real-time USG examination could potentially overcome these barriers.

Owing to the rapid advancement of deep-learning algorithms and their high capability to analyze complex data, artificial intelligence (AI) has been increasingly utilized in healthcare, in particular to enhance the sensitivity and accuracy of medical image interpretation12,13. An AI-assisted USG image analysis system would be of great benefit in increasing the accuracy of USG examination and interpretation for non-radiologist physicians and healthcare personnel. Such a system could potentially be one strategy to improve overall liver cancer surveillance rates14,15.

Recently, AI models were developed for the detection and diagnosis of FLLs in ultrasound images and demonstrated promising performance, with 81–97% sensitivity and 82–97% specificity for detecting and classifying FLLs commonly found in clinical practice16,17,18. We previously developed a convolutional neural network (CNN) AI model for USG image analysis17. Despite the satisfactory performance of the CNN model in detecting and diagnosing 5 different types of FLLs, we acknowledged some limitations. For example, the model had limited capability to differentiate FLLs with atypical characteristics, leading to misclassification17. Importantly, the model was inapplicable in clinical settings due to its relatively slow real-time processing speed.

A newer class of AI model for object detection tasks, called “YOLO”, has been shown to outperform CNN models and is more suitable for real-time object detection in images19. Accordingly, we aimed to develop an updated AI analysis system using the YOLOv5 model as a framework to detect and differentiate various types of FLLs in USG still images. In addition to the 5 FLLs studied in the previous work, we expanded the model’s capacity to enable the detection and diagnosis of CCA and regenerative liver nodules, which are commonly observed in cirrhotic patients.

Methodology

Dataset

The retrospective study was approved by the Research Affairs Institutional Review Board of the Faculty of Medicine, Chulalongkorn University (IRB No. 423/61), and the need to obtain informed consent was waived. Clinical data were de-identified and analyzed anonymously. The study was compliant with the ethical guidelines of the Declaration of Helsinki.

The study comprised a total of 7 FLLs (2 malignant and 5 benign lesions) (Fig. 1). For malignant FLLs, we focused on HCC and CCA because they have the highest prevalence among primary liver cancers2,4,5. Regarding the benign FLLs, liver cysts, hemangioma, focal fatty infiltration (FFI), focal fatty sparing (FFS), and regenerative nodules were chosen as they were most frequently seen during HCC and CCA surveillance by USG examination20.

Fig. 1
figure 1

The ultrasonographic images of seven different types of FLLs. (A) FLLs manually labeled, (B) FLLs predicted by AI.

The gold standard for diagnosing FLLs was based on pathology and/or imaging criteria. For benign lesions, diagnoses were primarily determined by characteristic imaging features observed in USG, computed tomography (CT), or magnetic resonance imaging (MRI), which are widely accepted in clinical practice. These lesions were confirmed as benign if their appearance and size remained consistent over a 2-year follow-up period. For malignant lesions, histopathological examination was used for confirmation whenever possible (approximately 10% of HCC and CCA patients had pathological diagnosis). In cases where pathology was unavailable, HCC was diagnosed using the American Association for the Study of Liver Diseases (AASLD) criteria, which include the presence of non-peripheral arterial enhancing lesions with washout in the portal venous and/or delayed phase in dynamic contrast CT or MRI5. For CCA, diagnoses were based on imaging criteria combined with a rising serum tumor marker CA 19–9 level and/or malignant clinical progression. The imaging criteria for CCA varied by type: for perihilar and distal CCA, criteria included malignant-appearing biliary obstruction, abrupt narrowing or stricture of the bile duct with irregular thickening and enhancing walls, or an intraductal irregular enhancing mass. Intrahepatic CCA was diagnosed by identifying an ill-defined mass-forming lesion with peripheral arterial enhancement followed by gradual centripetal delayed enhancement. Additional features included adjacent dilated intrahepatic ducts, capsular retraction, atrophic changes in the affected liver lobe with compensatory hypertrophy in the contralateral lobe, smaller satellite nodules, and vascular encasement2,6,8.

The process of image selection and allocation to the three datasets followed a previously published protocol as follows17:

Image selection

We retrospectively retrieved still images from upper abdominal USG examinations stored in the Digital Imaging and Communications in Medicine (DICOM) format within the Picture Archiving and Communication System (PACS) of a medical school hospital, the King Chulalongkorn Memorial Hospital, Bangkok, Thailand, and a community hospital, Roi-et Hospital, Roi-et, Thailand. These still images were captured as snapshots during ultrasound examinations using 13 machines from 11 different brands and models across the two hospitals (Supplemental Table 1). All raw USG images were obtained as original color images. Inclusion criteria were: (1) Images containing FLLs with a definitive diagnosis, (2) Images without FLLs, which served as negative controls establishing a baseline for liver parenchyma without any occupying lesions, (3) A maximum of two images from one USG study and two USG studies from one patient, (4) For studies with a single FLL, two images from different probe angles were randomly selected, and (5) For studies with multiple FLLs (≥ 2 FLLs), images containing different FLLs were chosen. Exclusion criteria were: (1) Images depicting metastatic liver lesions, (2) Images showing HCC or CCA that had undergone treatments affecting lesion appearance, (3) Images obtained using Doppler, contrast, or elastography techniques, and (4) Images severely compromised by motion artifacts, poor resolution, or excessive noise that hindered accurate interpretation, although no strict image-quality requirements were otherwise imposed.

Image allocation

To optimize the AI training process, we included images containing FLLs and images without FLLs in a 1:2 ratio. All images were randomly distributed into three sets: the training set (for model training), the tuning set (for adjusting and optimizing model performance during training), and the test set (for assessing model performance) (Fig. 2). Images obtained from the same patient were allocated to the same set to ensure that all three sets were completely independent.
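The patient-level grouping described above can be sketched as follows. This is an illustrative sketch only, not the allocation code used in the study; the function name, the random seed, and the use of the 16:1:3 image ratio reported in the Results as the patient-split ratio are our assumptions.

```python
import random

def patient_level_split(image_patient_ids, ratios=(16, 1, 3), seed=42):
    """Assign every image from the same patient to the same split so that
    the training, tuning, and test sets are completely independent."""
    patients = sorted(set(image_patient_ids))
    random.Random(seed).shuffle(patients)
    total = sum(ratios)
    n = len(patients)
    cut1 = n * ratios[0] // total          # end of the training block
    cut2 = cut1 + n * ratios[1] // total   # end of the tuning block
    assignment = {}
    for i, p in enumerate(patients):
        assignment[p] = "train" if i < cut1 else "tune" if i < cut2 else "test"
    # Map each image back to its patient's split.
    return [assignment[p] for p in image_patient_ids]
```

Splitting at the patient level, rather than the image level, prevents near-duplicate images of the same lesion from leaking between the training and test sets.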

Fig. 2
figure 2

Overview of the AI system development process.

AI model

This work employed the YOLOv5 framework as the AI model. Since its first release in 2015, YOLO has become a popular algorithm for real-time object detection due to its small size and rapid computation speed19. At the time of this work, the most up-to-date version was YOLOv5, whose capability was markedly improved over the previous YOLOv4: a more lightweight model, increased versatility, and much faster training19,21. It also has data-enhancement functions that effectively detect small objects, which was the most problematic limitation of the prior YOLO models22.

The fundamental principle of object detection involves delineating each object of interest, such as liver lesions, by its bounding box and corresponding class (e.g., cyst, HCC). Conventional computer vision methodologies typically employ a two-stage approach: first predicting the bounding box encompassing the liver lesion, followed by the classification of the detected anomaly. In contrast, the YOLO algorithm adopts a unified approach, simultaneously predicting both the object's bounding box and its class. This innovative design significantly accelerates the detection process.

YOLO's methodology involves processing the entire input image through a deep neural network to extract a high-level feature map representing the input image. This feature map is subsequently partitioned into an N × N grid, where N is a user-defined parameter. Each cell within this grid is tasked with detecting objects whose center falls within its boundaries (Fig. 3). To handle objects of varying dimensions and aspect ratios, multiple bounding boxes are predicted for each cell.
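As an illustration of this cell-assignment rule, the cell responsible for a lesion is simply the one containing its center. The helper name is ours, the 640-pixel input size is taken from the pre-processing section below, and the grid size N = 20 is an arbitrary example value, not a parameter reported in this study:

```python
def responsible_cell(cx, cy, img_size=640, n=20):
    """Return the (column, row) of the N x N grid cell whose boundaries
    contain the object's center (cx, cy), given in pixels."""
    cell = img_size / n  # side length of one grid cell in pixels
    return int(cx // cell), int(cy // cell)

# A lesion centered at (325, 190) on a 640 x 640 input falls in cell (10, 5).
```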

Fig. 3
figure 3

Illustration of YOLO inference phase with grid partition (left), bounding boxes prediction (middle), and the result after NMS and confidence thresholding (right).

During the inference phase, YOLO predicts a large number of bounding boxes. However, bounding boxes lacking liver lesions are assigned low confidence scores across all classes and subsequently eliminated. Note that YOLO may generate multiple bounding boxes around a single FLL. To mitigate this redundancy, a post-processing technique known as "non-maximum suppression" (NMS) is employed. This algorithm evaluates groups of proximal bounding boxes, quantifying their similarity through the intersection over union (IoU) metric. Within each group, only the bounding box exhibiting the highest confidence score is retained.

The confidence threshold and the IoU threshold used in the NMS algorithm play a crucial role in modulating the performance of the YOLOv5 model. To optimize model performance, detection rates were evaluated using an IoU threshold of 0.3 and a confidence threshold of 0.25 as criteria for correct detection. These thresholds were empirically determined to provide an optimal balance between detection sensitivity and specificity in the context of hepatic lesion identification.
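A minimal pure-Python sketch of the NMS step with these thresholds might look like the following; boxes are represented as (x1, y1, x2, y2) corner coordinates, and this is an illustration of the algorithm rather than the YOLOv5 implementation itself:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thr=0.3, conf_thr=0.25):
    """Drop low-confidence boxes, then keep only the highest-scoring box
    within each group of overlapping predictions."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Retain this box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

Two boxes around the same FLL typically have a high IoU, so only the more confident one survives, while boxes around distinct lesions have near-zero IoU and are all retained.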

To simulate authentic clinical scenarios encountered during USG examination, the confidence scores from all seven differential diagnoses were aggregated and normalized to generate class posterior probabilities. These probabilities were subsequently presented in descending order for each FLL, providing a prioritized list of potential diagnoses that closely resembles the decision-making process in clinical practice.
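This normalization-and-ranking step can be sketched as below; the helper is our own illustration and assumes non-negative per-class confidence scores:

```python
def ranked_diagnoses(class_confidences):
    """Normalize per-class confidence scores into posterior-like
    probabilities and list the diagnoses from most to least likely."""
    total = sum(class_confidences.values())
    return sorted(
        ((cls, score / total) for cls, score in class_confidences.items()),
        key=lambda item: item[1],
        reverse=True,
    )
```

Presenting the full ranked list, rather than only the top class, mirrors how a clinician weighs several differential diagnoses for one lesion.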

Experimental setup

Data pre-processing

The FLLs depicted in USG images were independently reviewed by two hepatologists to verify the diagnosis of each FLL. They each reviewed different sets of images without cross-checking due to time constraints and practicality. After verifying the diagnosis of FLLs, the images were labeled by experienced sonographers following standardized criteria, which included the ultrasound characteristics of each type of lesion and specific features to consider (Fig. 1a). Periodic quality control checks were conducted by two hepatologists, who independently reviewed a random sample of labeled images to ensure adherence to the protocol.

Patient identifiers and the peripheral regions of the USG images were cropped out. The images were then resized to 640 × 640 pixels and input into the YOLOv5 algorithm.

Training and tuning process

YOLOv5 code obtained from the Ultralytics repository was customized and optimized for the task of detecting and classifying FLLs in USG images21,22. We implemented the YOLOv5 architecture in the PyTorch framework; its backbone builds on the Darknet architecture combined with Cross Stage Partial Networks (CSPNet). To enhance object-detection performance on USG images, the algorithm was pretrained on the COCO dataset17. The model was further refined by fine-tuning during the training process using the tuning set. The model achieved its maximum mean average precision, indicating optimal performance, at or beyond 150 epochs. This checkpoint was then selected for assessing model performance.
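The checkpoint-selection rule described above amounts to picking the epoch whose checkpoint achieved the best tuning-set mAP; a trivial sketch (function name and example numbers are illustrative only):

```python
def best_checkpoint(map_by_epoch):
    """Return the epoch whose checkpoint achieved the highest mean
    average precision (mAP) on the tuning set."""
    return max(map_by_epoch, key=map_by_epoch.get)

# e.g. best_checkpoint({100: 0.80, 150: 0.87, 200: 0.86}) -> 150
```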

Performance evaluation

The model’s performance was evaluated on the test set. Two primary tasks, namely detection and diagnosis, were evaluated independently on both a per-image and per-lesion basis using the following metrics:

Detection task

An FLL was considered as being correctly detected when the model generated a box that accurately encompassed the region of the FLL in the image. The detection rate was derived from the number of FLLs correctly detected divided by the total number of FLLs.

Diagnosis task

The diagnostic performance was evaluated as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy, calculated using the following formulas: sensitivity = a/(a + c), specificity = d/(b + d), PPV = a/(a + b), NPV = d/(c + d), and accuracy = (a + d)/(a + b + c + d), where a, b, c, and d were the numbers of FLLs that were correctly diagnosed, falsely diagnosed, falsely excluded, and correctly excluded, respectively.
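These formulas translate directly into code; a small sketch with the paper's a, b, c, and d counts (the helper name is ours):

```python
def diagnostic_metrics(a, b, c, d):
    """Diagnostic metrics from counts of FLLs correctly diagnosed (a),
    falsely diagnosed (b), falsely excluded (c), and correctly excluded (d)."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "accuracy": (a + d) / (a + b + c + d),
    }
```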

Statistical analysis

The evaluation of the overall model performance included two aspects: 1) a comparative analysis of detection and diagnosis between benign and malignant FLLs, and 2) an evaluation of detection and diagnosis of each specific type of FLLs. The model performance in detecting FLLs was reported as detection rates. The model performance in diagnosing FLLs was reported as sensitivities, specificities, PPVs, NPVs, and accuracies with their respective 95% confidence intervals (95%CI). All statistical analyses were conducted using Python version 3.8.13 (Python Software Foundation, Delaware, USA).

Results

Dataset characteristics

Table 1 describes the characteristics of the three datasets, consisting of 26,288 images (9397 images with FLLs and 16,891 images without FLLs) from 5444 patients (Supplemental Table 2). There were 7493, 523, and 1381 images with FLLs randomly allocated to the training, tuning, and test sets, respectively (16:1:3 ratio). The images with FLLs contained a total of 16,098 FLLs (some images contained more than one FLL). The numbers of CCAs, HCCs, FFS, cysts, hemangiomas, FFIs, and regenerative nodules were 6454 (40.1%), 1635 (10.2%), 3306 (20.5%), 2229 (13.9%), 1378 (8.6%), 563 (3.5%), and 533 (3.3%), respectively, with median sizes (interquartile range, IQR) of 4.12 (3.40), 3.80 (4.24), 1.49 (1.03), 1.47 (1.66), 1.16 (1.37), 2.71 (3.18), and 0.97 (0.44) cm, respectively.

Table 1 Characteristics of patients and focal liver lesions (FLLs).

Among the images without FLL, 7,414 images were randomly assigned to the training set and the remaining images were randomly allocated in a 2:1 fashion, resulting in 6184 images in the tuning set and 3293 images in the test set. This allocation ratio of 2:1 was chosen to minimize false positive errors and yielded the best model performance.

Performance of AI system

The performance of the AI system for overall FLLs is demonstrated in Table 2 and Fig. 1b. The model provided an overall FLL detection rate of 84.83% (95%CI: 83.30–86.36). When categorized by FLL size as shown in Table 3, the detection rates of FLLs ≤ 1 cm and > 1 cm were 84.66% (95%CI: 81.72–87.59) and 84.95% (95%CI: 83.16–86.74), respectively.

Table 2 Performance of deep learning model for detection of focal liver lesions.
Table 3 Number of each focal liver lesion classified by size of lesion.

Malignant and benign FLLs

As shown in Table 4, the detection rates for malignant and benign lesions were 86.50% (95%CI: 82.99–90.02) and 86.65% (95%CI: 84.57–88.74), respectively. After successfully detecting FLLs, the AI model achieved comparable sensitivity for correctly classifying malignant and benign lesions, with values of 97.03% (95%CI: 95.88–98.18) and 96.96% (95%CI: 95.87–98.05), respectively. The NPV for malignant lesions was slightly greater than for benign lesions, with 97.37% (95%CI: 96.35–98.39) and 96.57% (95%CI: 95.34–97.80), respectively. The overall accuracy for diagnosing both malignant and benign lesions was 96.99% (95%CI: 87.78–96.20).

Table 4 Performance of deep learning model for detection and diagnosis of malignant and benign FLLs.

Types of FLLs

The diagnostic performance for the seven types of FLLs is shown in Table 5. Among the seven FLLs, the model demonstrated the highest detection rate for CCA, at 92.21% (95%CI: 87.97–96.44), followed by FFS at 89.69% (95%CI: 87.06–92.32). The lowest detection rate was observed for FFI, at 62.26% (95%CI: 49.21–75.31). Detection rates for the other FLLs ranged between 78.05% and 88.65%.

Table 5 Performance of deep learning model for detection and diagnosis of 7 types of focal liver lesions.

Upon the successful detection of each type of FLLs, the model further classified the FLLs into one of the seven specific diagnoses. For malignant FLLs, the model had a better overall diagnostic performance for CCA than HCC, with accuracies of 97.38% (95%CI: 96.64–98.12) and 94.82% (95%CI: 93.79–95.84), respectively. The model also demonstrated superior sensitivity and PPV for diagnosing CCA compared to HCC, with sensitivities of 96.51% (95%CI: 95.08–97.94) vs. 78.10% (95%CI: 72.50–83.69) and PPVs of 96.06% (95%CI: 94.54–97.57) vs. 77.73% (95%CI: 72.11–83.34), respectively, while the specificities and NPV for diagnosing both types of cancers were comparable, ranging from 97.03% to 98.11%.

Regarding the individual diagnostic results for the five benign FLLs, the model’s performance ranged from 74.25 to 98.81% for sensitivity, 97.03 to 100% for specificity, 77.73 to 100% for PPV, 97.10 to 99.66% for NPV, and 94.82 to 99.94% for accuracy. The model showed the highest sensitivity of 98.81% (95%CI: 97.86–99.76) for diagnosing FFS and the lowest sensitivity of 74.25% (95%CI: 67.62–80.88) for diagnosing hemangioma. The highest accuracy of 99.94% (95%CI: 99.84–100) was observed for diagnosing regenerative nodules, while the lowest accuracy of 96.27% (95%CI: 95.39–97.14) was noted for diagnosing hemangioma.

Discussion

In this study, we developed and evaluated an AI system using the YOLOv5 model to detect and classify FLLs from 26,288 USG images. The AI demonstrated high accuracy, achieving an 84.8% detection rate and excellent sensitivity and specificity for both benign and malignant lesions, including HCC and CCA, indicating its potential to enhance clinical diagnostic accuracy.

The model demonstrated an accuracy of 96.99% in classifying malignant and benign FLLs. Compared to the previous CNN model, the YOLOv5 model’s accuracy in classifying malignant and benign FLLs increased by 13%17. This improved performance was likely due to the YOLO model’s superior ability in image detection, particularly its enhanced capacity to detect small objects compared to the CNN model22.

The sensitivity of YOLOv5 in detecting and classifying FLLs was also superior to that of the CNN model. For HCC, the diagnostic sensitivity increased from 73.6% to 78.10%, and for FFI, it improved from 69.0% to 82.86%. Additionally, the overall specificity for hemangioma, cyst, HCC, FFS, and FFI surpassed that of the CNN model. Across all seven FLL diagnoses, the AI system consistently showed high specificities and NPVs, all over 97%. These excellent specificities and NPVs indicate that the AI system could be highly useful in clinical practice23. Its high specificities help physicians enhance the accuracy of FLL diagnosis and make informed decisions about further management.

Among the seven FLLs studied, the model exhibited the lowest diagnostic sensitivities for hemangioma, HCC, and FFI, at 74.25%, 78.10%, and 82.86%, respectively. Interestingly, these findings were consistent with the diagnostic performance of our previous model, which reported sensitivities of 80.8%, 73.6%, and 69.0% for hemangioma, HCC, and FFI, respectively17. We suspect that these three types of FLLs may be challenging to distinguish from one another due to their atypical characteristics and similar appearance on USG images, leading to lower sensitivity.

This study has several strengths. Notably, it included CCA and regenerative nodules in the AI system. The inclusion of regenerative nodules as an input class has not been previously explored, and data on AI detection of CCA in USG images were also very limited. Indeed, USG screening for CCA remains challenging due to the absence of specific lesion features, occasionally resulting in misclassification between CCA and HCC6,8. Early detection of CCA is particularly crucial because it is an aggressive cancer known for its resistance to chemotherapy2,24. The 5-year survival rate for metastasized CCA was below 2%; however, when detected early, the 5-year survival rate increased to 74%25. With a diagnostic sensitivity and specificity of 96.51% and 97.85%, respectively, this AI model is valuable for detecting and classifying CCA lesions, particularly in areas of high prevalence such as Southeast Asia. Its application can aid in preventing the fatal progression of CCA and improving survival outcomes for at-risk individuals. Lastly, the USG images were obtained from various USG models and machines. This diversity in USG vendors and versions enhances the generalizability of the developed AI system in clinical settings where different USG machines are used. Moreover, the YOLOv5 model eliminates the slow real-time processing-speed constraint of the CNN, suggesting its feasibility for implementation in real clinical settings.

There are several limitations to this study. We acknowledge the limitation of using still images captured by experienced radiologists, which contrasts with the real-time detection process during ultrasound examinations. Our primary aim was to evaluate the accuracy of the YOLOv5 deep-learning model in detecting and diagnosing FLLs using a diverse set of pre-identified images. This initial step is necessary to establish the model's capability before progressing to real-time applications. Further studies are warranted to validate the model's performance in real-time clinical settings, dynamically identifying lesions during ultrasound examinations, to assess its practical applicability and robustness in real-world scenarios. Given this limitation, this work is considered a phase 0 study within the domain of AI-assisted systems for healthcare, focusing on system development and performance evaluation26. A phase 1 study to evaluate the performance of the developed AI-assisted USG model is currently underway. Nonetheless, we recently assessed the utility of an AI-assisted system in a randomized controlled trial27. That study evaluated whether the AI system could assist non-expert operators in detecting FLLs, and the results showed that the AI system significantly increased the FLL detection rate for non-expert sonographers. Although that work suggested that an AI-assisted system can enhance the FLL detection rate for non-expert sonographers, it remains necessary to validate the performance of the YOLO model in real-time USG examinations performed by non-experts, given that the model in the present study is not the same CNN model used previously. Another limitation is that ultrasound interpretations are influenced by both the proficiency and experience of the examiners and the inherent characteristics of the lesions. Since USG is operator dependent, the experience of the examiner significantly affects the quality of the obtained images. In the present study, the model's performance was not validated on images obtained by less experienced examiners, yet the maximum utility of such a system lies precisely in improving diagnostic accuracy for less experienced examiners.

The YOLO model demonstrated a confusion rate of 40% per lesion and 8% per image when distinguishing cysts from normal transonic structures in the liver. This was based on 126 false positive bounding boxes in a test set of 1385 images, primarily around vessels, with a few around other normal transonic structures. To minimize this confusion rate, increasing the number of images of normal transonic structures in the training set may be effective. Clinically, sonographers can differentiate these structures using various techniques, such as probe-angle adjustments, compressibility checks, and Doppler mode. Integrating such expertise with AI systems can further mitigate this issue. The accuracy of AI in detecting and categorizing FLLs may also vary with ultrasound settings and equipment quality. Our study utilized images from various ultrasound machines and models, each with settings manually adjusted by ultrasonographers, which yielded a broad range of ultrasound images. Due to the retrospective nature of our data collection, specific ultrasound settings such as gain, depth, and focus position were not available, limiting our ability to assess their impact on model performance. Despite this, the AI system showed robust performance across different settings, indicating its applicability in diverse clinical environments. Finally, although the YOLOv5 model exhibited an approximately 5% improvement in diagnostic sensitivity for HCC compared to our previous CNN model (increasing from 73.6% to 78.1%), we have still not attained the desired sensitivity17. We are committed to exploring methodologies to enhance the detection rate and boost the efficacy of our AI system.

Conclusion

The new AI model demonstrates excellent performance in detecting and diagnosing malignant FLLs, including HCC and CCA, as well as benign FLLs in USG still images. External validation and evaluation of real-time clinical performance are required to confirm its feasibility and applicability.