Introduction

Liver fibrosis quantification plays a crucial role in the management of patients with chronic liver disease (CLD) as a prognostic hallmark for the detection, surveillance and staging of the disease and its complications, and as a guide for the management of antiviral treatment. Liver biopsy is still considered the gold standard for assessing the stage of fibrosis [1], although an imperfect one, mainly because of sampling variability and invasiveness, the latter carrying potentially severe complications that limit repeated assessments [2]. Sampling variability affects the histological evaluation: the optimal liver biopsy specimen should be 20 mm in length with 11 portal tracts, but this is not often obtained in clinical practice [3], an important limitation considering that the sample represents only 1/50,000 of the liver tissue [3].

In the last ten years, several new diagnostic methods, in particular elastosonographic techniques, have been intensively evaluated for their ability to estimate the amount of liver fibrosis in a non-invasive, accurate, inexpensive, reliable and reproducible manner.

Ultrasound elastography technologies are all based on the measurement of the deformation induced in tissue by a static/quasi-static or dynamic stress. The European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB) and World Federation for Ultrasound in Medicine and Biology (WFUMB) guidelines summarize the features and clinical applications of all these techniques [4,5,6,7,8,9].

Transient elastography (TE, FibroScan®, Echosens, Paris, France) was the first technique proposed for the assessment of liver stiffness (LS) as an indirect index of hepatic fibrosis and is now widely applied in daily clinical practice to estimate liver fibrosis. The diagnostic performance of TE in patients with HCV-related liver disease has been evaluated in three important multicenter studies [9,10,11], and it has also been tested in patients with NAFLD [12]. The intra- and inter-observer reproducibility of the technique is good [13]; the disadvantages of TE assessment of fibrosis are the inability to discriminate between contiguous stages of fibrosis [14], falsely overestimated results during ALT flares [15], the high cost of the dedicated device, and measurement failure in patients with ascites or poor acoustic windows [5,6,7,8,9,10]. Furthermore, a training course is necessary to instruct the operators [12].

Point quantification elastography (PQE) with ElastPQ® (Philips Healthcare, Bothell, WA, USA) is a one-dimensional technique integrated into a conventional ultrasound system, so the examination can be performed with the same device used for conventional US scans. Unlike transient elastography, PQE relies on the mechanical excitation of the tissue by the shear waves produced by a localized, impulsive acoustic radiation force, and it is classified as an ARFI-based technology. An interrogation region (ROI, region of interest) can be placed anywhere in the US-visualized liver parenchyma. Immediately after the image-guided acquisition, the machine displays the result (in kPa) together with the mean, the median and the deviations of the valid measurements. The elastic value E (in kPa) is calculated using the equation $E = 3\rho c_s^2$, where $c_s$ (in m/s) is the shear wave propagation velocity and $\rho$ is the tissue density (whose approximate value in the human body is 1 g/cm³). According to this formula, the stiffer the tissue, the faster the shear wave propagates [16,17,18].
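As a worked illustration of the relation between shear wave velocity and elastic value given above, the following Python sketch converts a velocity into a modulus in kPa. It assumes a tissue density of 1000 kg/m³ (i.e., 1 g/cm³); the function name and the example velocities are hypothetical and are not taken from the ElastPQ software.

```python
def youngs_modulus_kpa(shear_velocity_m_s, density_kg_m3=1000.0):
    """Return E = 3 * rho * c_s**2 in kPa for a shear wave velocity in m/s.

    The default density (1000 kg/m^3) is an assumed soft-tissue value.
    """
    if shear_velocity_m_s < 0:
        raise ValueError("shear wave velocity must be non-negative")
    modulus_pa = 3.0 * density_kg_m3 * shear_velocity_m_s ** 2
    return modulus_pa / 1000.0  # convert Pa to kPa


# Hypothetical velocities, chosen only to show the quadratic scaling.
for velocity in (1.0, 1.5, 2.0):
    print(f"c_s = {velocity:.1f} m/s -> E = {youngs_modulus_kpa(velocity):.2f} kPa")
```

Because E scales with the square of the velocity, doubling the shear wave velocity quadruples the elastic value.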

The few data available in the literature for this technique show good reproducibility [16], but little is known about the need for training or the level of expertise required to perform a trustworthy examination, and at present there is no agreement as to what constitutes an experienced operator for PQE [8].

The aim of this study was to evaluate the reproducibility of PQE in assessing normal liver elasticity among operators with different levels of expertise, and to investigate any training effect.

Materials and methods

Study design

This was a single-center prospective study. Healthy subjects were enrolled to evaluate liver PQE measurements by three raters, each of whom performed LS measurements in the same subject on the same day. Two raters were experienced in ultrasound and elastography: one with more than 5 years of expertise (US expert: C.S.) and one with 2 years of expertise (US intermediate: C.D.M.). The third had no expertise in US but 5 years of expertise in transient elastography (novice: F.C.), and he received a brief explanation of how to perform the measurements. All raters were blinded to the results obtained by the others.

Subjects

Fifty volunteers, drawn from a pool of hospital staff members, were evaluated from December 2012 to April 2013. All of them had normal liver parameters and a normal US examination, with no known liver disease. All subjects underwent a preliminary US scan to confirm the absence of hepatic parenchymal abnormalities (i.e., liver lesions, steatosis, etc.). The age ranged between 25 and 66 years with a mean of 34.2 ± 10.4 years (median: 30 years; IQR: 28–36 years), with a male-to-female ratio of 0.56 (18 males, 36.0%; 32 females, 64.0%). The body mass index (BMI) ranged between 16.7 and 33.6 kg/m² with a mean of 22.8 ± 3.2 kg/m² (median: 22.4 kg/m²; IQR: 20.8–24.9 kg/m²). Twelve out of 50 subjects (24.0%) had a BMI greater than 25 kg/m²; the BMI of these subjects ranged between 25.2 and 33.6 kg/m² with a mean value of 27.2 kg/m² (median: 26.5 kg/m²; IQR: 25.6–28.1 kg/m²), while the BMI of the 38 subjects with normal BMI had the following descriptive statistics: mean: 21.4 kg/m²; SD: 1.9 kg/m²; range: 16.7–24.8 kg/m²; median: 21.9 kg/m²; IQR: 20.2–22.6 kg/m².

Ultrasound imaging protocol

Ultrasound evaluation was performed using an iU22 ultrasound machine (Philips, Bothell, WA, USA, software version 6.3.2.2) with the C5-1 curvilinear probe. Using the B-mode US image as guidance, an ROI of 0.5 × 1.5 cm was freely placed in the liver parenchyma at 3–8 cm from the capsule, avoiding vascular structures and artifacts. When the measurement returned a value equal to 0, often caused by excessive target motion or poor beam coupling, the acquisition was classified as not quantifiable [4,5,6,7,8,9].

The mean of ten valid acquisitions was considered representative for LS measurement as recommended [5, 8, 19].
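The acquisition rule described above (readings equal to 0 are not quantifiable, and the mean of ten valid acquisitions represents the subject) can be sketched as follows. This is only an illustration: the function name, the choice to keep the first ten valid readings, and the sample values are assumptions and do not describe the behavior of the device.

```python
from statistics import mean

REQUIRED_VALID = 10  # number of valid acquisitions averaged per subject


def liver_stiffness_kpa(acquisitions_kpa):
    """Average the first ten valid (non-zero) acquisitions, in kPa."""
    # A reading of 0 marks an acquisition that could not be quantified.
    valid = [value for value in acquisitions_kpa if value > 0]
    if len(valid) < REQUIRED_VALID:
        raise ValueError(
            f"only {len(valid)} valid acquisitions, {REQUIRED_VALID} required"
        )
    return mean(valid[:REQUIRED_VALID])


# Hypothetical readings for one subject; the two zeros are discarded.
readings = [3.4, 0.0, 3.6, 3.5, 3.2, 3.8, 0.0, 3.3, 3.7, 3.5, 3.4, 3.6]
print(f"LS = {liver_stiffness_kpa(readings):.2f} kPa")
```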

To reduce potential confounding factors, the subjects were examined in fasting conditions, in the supine position with the right arm in maximal abduction, and were asked to hold their breath in intermediate expiration for 5–10 s during each measurement, since these factors may influence LS in healthy subjects [5,6,7,8].

All LS measurements were obtained from the right lobe with an intercostal approach [5, 17].

Main outcome measurements

The intra-operator variability was evaluated by assessing the agreement among all the ten measurements made by the same operator.

The inter-operator variability was assessed by comparing the mean values of the ten measurements obtained on the same day by each operator among the three operators, as well as between pairs of operators.

To establish if a training effect was present, both the intra- and inter-operator agreements observed in the last 25 subjects were compared with the agreements found in the first 25 subjects.

In addition, we evaluated whether a reduced number of valid shear wave acquisitions could provide reliable results. For this purpose, we evaluated the inter- and intra-operator agreements using two subsets of five measurements (i.e., the first five and the last five acquisitions) and compared them with the agreement obtained with the entire set of ten measurements.

Statistical analysis

Mean ± standard deviation (SD), range, median, interquartile range (IQR) and absolute and relative frequencies were used as descriptive statistics. Non-parametric statistics were used to analyze stiffness values: Friedman's two-way ANOVA was applied to comparisons among the three observers and the Kruskal–Wallis test was applied to comparisons between different groups of subjects.

Reproducibility and reliability were assessed through the calculation of the average measure intraclass correlation coefficient (ICC) [19]. The ICCs were reported together with their 95% confidence intervals (CIs). The ICC ranges from 0 to +1; values below about 0.40 may be taken to represent poor reliability, values above 0.75 excellent reliability, and values between 0.40 and 0.75 fair to good reliability [20]. The ICCs among the ten measurements made by each operator in the 50 subjects were used to assess the intra-operator agreement, and the overall agreement was evaluated by means of the same ICC computed by pooling together the data of the three operators. The inter-observer variability was evaluated by estimating the LS of each subject as the average of the ten measurements made by each operator. The overall ICC among the three operators was evaluated, as well as the ICCs of the three pairs of operators. The differences between the measurements made by pairs of operators were computed and plotted (ordinate) against the mean of the two measurements (abscissa, expected value) according to the method described by Bland and Altman [21]. This analysis made it possible to: (i) assess the existence of any systematic difference between the measurements (i.e., fixed bias); (ii) investigate any possible relationship between the discrepancies and the expected value (i.e., proportional bias); and (iii) identify the presence of outliers. The exact lower and upper limits of agreement (together with their 95% CIs) were also computed [22] and the repeatability coefficients were evaluated according to the British Standards Institution [23]. The proportional bias was tested by means of Pearson's r coefficient of correlation between differences and average values.
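The two agreement measures described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SPSS procedure used in the study: the ICC model is not specified in the text, so the sketch assumes a two-way, consistency-type, average-measures ICC; the limits of agreement use the common approximation of mean ± 1.96 SD rather than the exact limits and CIs of [22]; all function names and data values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def icc_average_consistency(ratings):
    """Average-measures ICC (assumed two-way, consistency): (MSR - MSE) / MSR.

    `ratings` holds one row per subject and one column per measurement/rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def interpret_icc(icc):
    """Reliability bands quoted in the text: <0.40, 0.40-0.75, >0.75."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


def bland_altman(first, second):
    """Fixed bias, approximate 95% limits of agreement, and Pearson r
    between differences and pairwise means (proportional bias)."""
    diffs = [a - b for a, b in zip(first, second)]
    means = [(a + b) / 2 for a, b in zip(first, second)]
    bias, sd = mean(diffs), stdev(diffs)
    mean_of_means = mean(means)
    cov = sum((d - bias) * (m - mean_of_means) for d, m in zip(diffs, means))
    denom = sqrt(sum((d - bias) ** 2 for d in diffs)
                 * sum((m - mean_of_means) ** 2 for m in means))
    r = cov / denom if denom > 0 else float("nan")
    return {"bias": bias, "lower": bias - 1.96 * sd,
            "upper": bias + 1.96 * sd, "r": r}


# Hypothetical mean LS values (kPa) for six subjects and two operators.
expert = [3.4, 3.9, 3.1, 4.6, 3.5, 5.2]
novice = [3.6, 3.7, 3.3, 4.1, 3.5, 5.9]
icc = icc_average_consistency([list(pair) for pair in zip(expert, novice)])
print(f"ICC = {icc:.3f} ({interpret_icc(icc)})")
print(bland_altman(expert, novice))
```

A fixed bias appears as a mean difference far from zero, while a proportional bias appears as a non-zero correlation between the differences and the pairwise means.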

To compare pairs of ICCs, the z-test was applied to the corrected z-transformations of the ICCs evaluated according to Fisher [24]. The standard error of each corrected z-transformation was estimated as the difference between the corrected z-transformations of the upper and lower 95% confidence limits of the ICC divided by 3.92 (i.e., 2 × 1.96).
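The comparison of two ICCs described above can be sketched as follows. The sketch uses the plain Fisher transformation (atanh) and does not reproduce the correction mentioned in the text, whose exact form is not given here; the function names are hypothetical, and the confidence limits in the example are invented for illustration only (they are not the study's values).

```python
from math import atanh, erfc, sqrt


def fisher_z_se(ci_lower, ci_upper):
    """Standard error on the z scale, from the width of the 95% CI / 3.92."""
    return (atanh(ci_upper) - atanh(ci_lower)) / 3.92


def compare_iccs(icc_a, ci_a, icc_b, ci_b):
    """Two-tailed z-test for the difference between two ICCs.

    Uses the uncorrected Fisher transformation, which is an assumption.
    """
    se_a = fisher_z_se(*ci_a)
    se_b = fisher_z_se(*ci_b)
    z = (atanh(icc_a) - atanh(icc_b)) / sqrt(se_a ** 2 + se_b ** 2)
    p_value = erfc(abs(z) / sqrt(2))  # two-tailed normal probability
    return z, p_value


# ICC point estimates as in the text; the CI limits are hypothetical.
z, p = compare_iccs(0.941, (0.913, 0.962), 0.888, (0.835, 0.928))
print(f"z = {z:.2f}, P = {p:.3f}")
```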

All statistical analyses were performed using the IBM SPSS Statistics (Ver. 23 for Windows, IBM Co., Armonk, NY, USA) package. Two-tailed P values less than 0.05 were considered statistically significant.

Results

The LS values assessed by operators with different levels of expertise are shown in Table 1; the values were comparable among the three operators (overall P = 0.980).

Table 1 Liver stiffness measurements (kPa) of 50 healthy volunteers according to the operators’ expertise

Table 2 shows the LS values assessed by the three operators in various subgroups of subjects. Only the expert operator managed to capture lower LS values in females as compared with the male population (P = 0.028), while no significant differences between sexes were found by the other two operators (P = 0.225 for the intermediate and P = 0.140 for the novice). Higher BMI had no influence on LS measurement, although the difference was near to the significance level for the expert operator (P = 0.058). No significant differences were found between the first and the last subgroups of 25 subjects for each operator.

Table 2 Liver stiffness measurements (mean ± SD; kPa) of 50 healthy volunteers stratified according to gender, body mass index (BMI) and learning sample

An excellent intra-observer agreement was found by the three operators in the overall population (ICC = 0.918; P < 0.001) (Table 3). A mild, but progressive reduction of the intra-operator ICC was found with the decline of the operator skill (ICC: 0.941, 0.917, and 0.888 for expert, intermediate and novice operators, respectively) and a significant difference was found between the expert and the novice operators (P = 0.035) while the intra-observer agreement of the intermediate operator was not significantly different from those of the expert (P = 0.262) and the novice (P = 0.323) operators.

Table 3 Intra-observer agreement between different types of operators

As far as the role of different factors in influencing the intra-observer agreement is concerned, a significantly better agreement was found in women than in men (ICC: 0.930 vs. 0.877; P = 0.029) (Table 4). This was mainly due to the novice operator, who showed a fair grade of agreement in males (ICC = 0.673); this value was significantly lower than the excellent agreement shown by the same operator in females (ICC = 0.916; P = 0.005), and it was also significantly lower than the agreements shown in males by the other two operators (expert: ICC = 0.915, P = 0.013; intermediate: ICC = 0.900, P = 0.027). All operators showed a lower intra-observer agreement in overweight than in normal weight subjects; these differences reached significance in the overall population (P < 0.001) and in the expert (P = 0.012) and novice (P = 0.022) operators, but not in the intermediate one (P = 0.276). A significant increase in agreement was also found with the increasing experience of the operators (ICC: 0.936 vs. 0.881 in the last vs. the first set of subjects, respectively; P = 0.013). This effect of experience on the intra-observer agreement was significant in the novice operator only (ICC: 0.921 vs. 0.733, P = 0.006).

Table 4 Intra-operator variability in the overall sample studied and stratified according to different factors

An excellent inter-observer agreement for PQE estimation of LS was found among the three raters (overall ICC = 0.882; P < 0.001) (Table 5). The lowest agreement was seen between the expert and the novice operators (ICC = 0.791), although no operator showed a significant difference between the agreements shown with the other two operators (expert: P = 0.350; intermediate: P = 0.881; novice: P = 0.429). The results of the analysis of the inter-observer reproducibility between pairs of operators are presented in Table 6. The best agreement was found between the intermediate and the novice operators (repeatability coefficient: 1.07 kPa), while the greatest variability was found between the expert and the novice operators (repeatability coefficient: 1.40 kPa); thus, the variability among operators ranged between ± 1 and ± 1.5 kPa. No significant fixed biases were present, while a significant positive proportional bias was found between the expert and the intermediate operators only (P = 0.019). The Bland–Altman plots are shown in Fig. 1. Two outlier values were identified for each of the three pairs of evaluations. For the agreement between the expert and the intermediate operators, both outliers were within the range of variability of the upper limit of agreement, whereas one value of the novice operator fell well below the lower limit of agreement when compared with both the expert and the intermediate operators.

Table 5 Inter-observer agreement between different types of operators
Table 6 Inter-observer reproducibility between pairs of operators: differences between the measurements made by the pairs of operators (kPa)
Fig. 1 Bland–Altman plots representing the level of agreement between pairs of operators: (a) expert vs. intermediate, (b) expert vs. novice, (c) intermediate vs. novice. Dark lines show the mean value of the differences (solid) together with its 95% limits of agreement (dashed), while the light dotted lines are the 95% confidence intervals of the lower and upper limits of agreement. The wider the interval between the limits of agreement, the greater the difference between the operators

The role of different factors in influencing the inter-observer agreement is shown in Table 7. No significant differences were found between males and females, although the novice operator showed fair agreement values that were only near the significance limit (ICC = 0.576, P = 0.040 vs. the expert operator; ICC = 0.656, P = 0.020 vs. the intermediate operator). On the other hand, the inter-observer agreement was significantly lower in overweight/obese subjects than in normal weight subjects (ICC: 0.603 vs. 0.923; P = 0.011). In particular, in overweight/obese subjects, the overall agreement among the three operators (ICC = 0.603, P = 0.016), as well as the agreement between the expert and the intermediate operators (ICC = 0.692, P = 0.016), was only near the significance level, while the novice operator showed no significant agreement with either the expert (ICC = 0.263, P = 0.255) or the intermediate (ICC = 0.576, P = 0.079) operator. Thus, the agreement of the novice operator was greatly influenced by BMI: the agreement with the expert operator was significantly lower in overweight/obese subjects than in normal weight subjects (ICC: 0.263 vs. 0.883, P = 0.021), while the difference between overweight/obese (ICC = 0.576) and normal weight (ICC = 0.902) subjects in the agreement shown by the novice operator with the intermediate operator was only near the significance level (P = 0.092). Finally, learning had no effect on the inter-observer agreement: no significant differences were found between the agreement observed in the first 25 subjects and that observed in the last 25 subjects (Table 7).

Table 7 Inter-observer agreement among the three operators according to different factors

The intra- and inter-operator agreements obtained using the two subsets of five measurements (i.e., first five vs. last five) are shown in Table 8. The ICC values in the two subsets of five measurements were lower than those obtained using the entire set of ten measurements. Both sets of five measurements showed an overall intra-operator agreement significantly lower than that of the ten measurements. No significant differences were found as far as the inter-operator agreement was concerned. The reduction of the intra-operator agreement when considering five instead of ten measurements was significant in the subjects with normal weight only, while no significant reduction was found in the overweight subjects.

Table 8 Comparison of intra- and inter-operator variability calculated by taking into account two sets of five measurements (the first five and the last five measurements within the overall set of ten measurements) vs. the intra- and inter-operator variability calculated in the entire set of ten measurements

Discussion

The main objective of this study was to assess the performance and reproducibility of PQE in normal subjects examined by operators with different levels of expertise. The findings are clinically relevant, because the use of elastographic/elastometric techniques is widening and is almost always involved in the non-invasive assessment of CLD. Unlike TE, which uses a device developed specifically for measuring LS, PQE is embedded into an US machine, so previous experience in ultrasound would presumably be expected. Our study contradicts this assumption, showing that even operators inexperienced in US can obtain reliable results, comparable with those of expert and intermediate operators. In our study, an excellent inter-operator agreement was found among the three operators (ICC = 0.882), as well as between each pair of operators (Table 5). These data are consistent with other reports that evaluated the agreement between expert and novice examinations using different elastographic techniques in patients with CLD or healthy subjects: an ICC of 0.82 for the intercostal assessment with point SWE in patients with chronic hepatitis [25], and an ICC of 0.88 in healthy subjects evaluated by 2D-SWE [26]. These findings are also in line with previous data on the reliability of FibroScan in the hands of a non-experienced operator [27] and open the possibility of LS assessment with various techniques by non-specialists (trained nurses or doctors without a US background).

With the advent of numerous shear wave-based elastographic techniques, the determination of the normal LS range is an important prerequisite for clinical practice. Several studies in healthy populations have investigated normal LS by either TE or ARFI [28,29,30,31,32]. The mean stiffness value of normal liver in different studies was 5.49 kPa (range 1.5–12.7 kPa) [30], 4.6 kPa (range 3.3–5.6 kPa) [31], and 4.4 kPa (range 2.1–17.5 kPa) [32]. Our study revealed a slightly lower median stiffness value for subjects without liver disease, 3.5 kPa (range 2.47–7.16 kPa), with a tendency to differ according to gender: in men, normal liver seems to be stiffer than in women, even if this finding reached significance in the expert operator only. This is consistent with previous findings from a head-to-head comparison of elastographic techniques in chronic viral hepatitis patients [33] and with feasibility data of PQE in healthy Asian subjects [34].

Obesity (BMI > 30 kg/m²) was identified as the most important factor associated with LS measurement failure at TE, and in the same large French cohort, the proportion of unreliable results reached almost 25% in patients with BMI > 30 kg/m² [35]. Taking this into account, we wanted to evaluate the influence of BMI on the LS values assessed by PQE. LS measurement by PQE was feasible in all of our moderately overweight subjects, with no particular difficulties for less experienced examiners, demonstrating the importance of having visual control. In our population, we found no differences in LS values between subjects with BMI > 25 kg/m² and those with BMI < 25 kg/m² (Table 2), but the level of agreement was significantly lower in overweight subjects than in normal weight ones (Table 7). Considering the small number of subjects, the selection bias (healthy volunteers) and the only slightly elevated BMI, severe obesity cannot yet be excluded as a drawback for PQE quantification; further investigations are needed in truly obese patients.

We also wanted to assess the learning curve of PQE, since one of our operators had no experience in the use of ultrasound. Considering all three operators, we did not find any difference between the liver stiffness values acquired from the first 25 subjects and those from the last 25, nor between the ICCs. Considering the novice operator alone, we found a better agreement in the last 25 subjects, suggesting a training effect in non-skilled operators.

We found an excellent intra-observer agreement for each operator, which tended to increase with the level of expertise (Table 3). This finding is in line with previous reports [13]. Both BMI and gender significantly influenced reproducibility in the case of the novice operator, demonstrating the role of US experience in overcoming some challenges related to PQE examinations.

Finally, a reduction in the number of acquisitions from 10 to 5 led to a significant reduction of intra-operator agreement in normal weight but not in overweight subjects; this is due to the fact that, in the latter, a wider variability and a lower agreement were already observed with ten measurements for all operators. On the other hand, no significant differences in inter-operator agreement were found. Even if some previous studies suggested that a valid assessment of LS could be obtained with fewer than ten measurements, according to our results, we suggest acquiring no fewer than ten valid measurements. This recommendation also takes into account that our results were obtained in a population of healthy volunteers, in which the LS values are not affected by pathological factors that clearly influence measurement variability.

In summary, this study demonstrates the wide applicability of the PQE technique, which has excellent intra- and inter-operator agreement, preserving high reproducibility and reliability even in the hands of less experienced or inexperienced operators. Despite this, we believe that basic ultrasound knowledge is desirable to minimize intra-operator variability, especially in obese or technically difficult patients. The level of agreement with the expert operator tends to decrease in overweight subjects, and also in men. Based on the excellent intra-observer agreement figures, one can hypothesize that five measurements would be enough to obtain an adequate estimation of LS, except for the novice operator examining overweight patients; in this case, we suggest collecting ten liver stiffness values.