Introduction

The lips are an important anthropometric feature of the human face [1] and play crucial roles in vocalization, mastication and emotional expression [2]. Accordingly, there is growing interest in plastic and reconstructive surgery and aesthetic procedures of the lips. A comprehensive characterization of lip morphology, including lip length and thickness, lip contour and facial proportions, is a prerequisite for obtaining a harmonious and aesthetically appealing result.

Early research on lip anthropometry was mainly based on classic direct anthropometry or two-dimensional (2D) imaging tools such as standardized photographs [3, 4]. However, manual measurement requires physical contact between the evaluator and the patient, and its accuracy and precision are not guaranteed. 2D photographs also have major drawbacks, including their inability to evaluate curvatures, areas and volumes, their dependence on patient compliance and the time they consume [5]. Three-dimensional (3D) imaging systems such as reconstructive computerized tomography (CT) and magnetic resonance imaging (MRI) were subsequently used in lip anthropometry, but CT exposes patients to excessive radiation, and both modalities introduce errors related to body position (i.e. standing or lying) and are time-consuming and costly.

Emerging non-invasive 3D surface imaging technology has overcome the above obstacles. Its further advantages are its fast acquisition and rapid reconstruction of the 3D surface morphology, and its straightforward image presentation is particularly handy for highlighting slight asymmetries or contour defects that exist pre-operatively, which is extremely helpful during patient consultation. This readily available and promising tool has therefore been widely used in morphological assessment of facial parts such as the eyes and lips [6,7,8,9]. Weinberg et al. established a 3D Facial Norms database, which consists of 3D craniofacial anthropometric normative data from 2454 European Caucasians ranging from 3 to 40 years old [10]. However, as in many other perioral anthropometry studies [11,12,13,14], only a few traditional perioral surface landmarks were identified, so the corresponding measurements were limited. Furthermore, although 3D surface imaging has been proven highly reliable in evaluating the periocular region [6], no solid validation of its reliability and accuracy in perioral anthropometry has been conducted.

Therefore, we proposed and verified a practical protocol for perioral morphology assessment with a 3D surface imaging system in an Asian population. We introduced a variety of novel perioral soft-tissue landmarks to ensure standardized and adequate perioral surface coverage, included linear distances, curvatures, areas and angular measurements, and validated their intrarater, interrater and intramethod reliability.

Methods

Patients

A total of fifty healthy Asian individuals, 12 males (mean age 32.1 ± 10.2 years; range 24–60 years) and 38 females (mean age 31.4 ± 9.4 years; range 19–57 years), were recruited from the Department of Plastic and Reconstructive Surgery, Peking Union Medical College Hospital between April 2022 and September 2022. Exclusion criteria included facial pathologies, congenital deformities, recent or previous trauma, and procedures or surgeries influencing the perioral morphology. All participants provided written consent. This study was performed in line with the 1964 Declaration of Helsinki and its later amendments and was approved by the Institutional Review Board of the Peking Union Medical College Hospital (Reference Number I-22PJ693).

Equipment

The commercially available VECTRA H2 handheld camera (Canfield Scientific, Inc., Parsippany, NJ, USA) is a three-dimensional stereophotogrammetry system. For each set of facial images, three images were captured at a fixed distance from three angles (right oblique, frontal and left oblique) to ensure overlapping fields of view. A 3D model of the subject’s face was then automatically generated from these captures. The image acquisition time is 2 milliseconds per capture and the 3D model synthesis takes less than 1 min. The manufacturer claims a geometry resolution of 1.2 mm. The 3D files were exported in OBJ format and imported into Geomagic Wrap 2017 (Geomagic, Inc., Research Triangle Park, NC, USA) for landmark identification, measurement and analysis.

Three-Dimensional Image Acquisition

Before image acquisition, each subject was asked to wash the face with cleansing foam and to fully expose the face by pulling back the hair, shaving off any beard and removing glasses and jewelry. Participants were seated upright in a chair with a neutral facial expression, with their teeth and lips gently closed and their eyes gazing forward. An experienced operator performed all image acquisitions according to the manufacturer’s instructions in the same room under uniform lighting conditions. Two sets of 3D images (Capture 1 and Capture 2) were taken for each participant at an interval of at least 45 minutes, with recalibration of the VECTRA device.

Landmark Identification and Perioral Measurement

A unified coordinate system was established in each 3D model according to a previous study [13]. Then the perioral landmarks were digitally identified (Table 1) and marked following a standard protocol (Fig. 1). In brief, landmarks easily identified around the nose, philtrum and on the midline of the lips and chin were located first. Landmarks were then identified to trisect the upper vermilion border between the cheilion and crista philtri on the left and right sides, respectively. Next, labial landmarks were located vertically to them along the coordinate axes. Finally, the upper and lower lip tubercles as well as the apex of the upper arches and the lateral thickening of the upper lip were identified. Ricketts’ E-line was drawn from the tip of the nose (pronasale) to the most anterior soft tissue point of the chin (pogonion). Subsequently, 28 linear distances, 2 curvatures, 4 areas and 9 angles were measured according to these landmarks (Table 2 and Fig. 2). Among the 28 linear distances, 6 were assessed on the lateral view as the horizontal distances of landmarks to the E-line; positive values were assigned to positions anterior to the E-line and negative values to positions posterior to it.
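The three kinds of landmark-based quantities described above (linear distances, angles and signed horizontal distances to the E-line) can be illustrated with a minimal Python sketch. This is not the Geomagic Wrap workflow used in the study. The function names are hypothetical, and the axis convention on the lateral view (y vertical, z pointing anteriorly) is an assumption made only for this example.

```python
import math

def distance(p, q):
    """Euclidean distance between two 3D landmarks (same unit as the input, e.g. mm)."""
    return math.dist(p, q)

def angle_deg(a, vertex, b):
    """Angle (degrees) at `vertex` between the rays towards landmarks a and b."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    v = [bi - vi for bi, vi in zip(b, vertex)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def horizontal_distance_to_e_line(point, pronasale, pogonion):
    """Signed horizontal distance from a landmark to the E-line on the lateral view.

    All arguments are (y, z) pairs in the sagittal plane.
    ASSUMPTION: y is vertical and +z points anteriorly, so a positive result
    means the landmark lies anterior to the E-line.
    """
    y, z = point
    y1, z1 = pronasale
    y2, z2 = pogonion
    if y1 == y2:
        raise ValueError("pronasale and pogonion must differ in height")
    # z-coordinate of the E-line at the height of the landmark
    t = (y - y1) / (y2 - y1)
    z_line = z1 + t * (z2 - z1)
    return z - z_line
```

With this convention, a lip landmark lying behind the line joining pronasale and pogonion returns a negative value, matching the sign rule stated in the text.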

Table 1 Definitions of 25 anthropometric perioral landmarks
Fig. 1

The 41 three-dimensional anthropometric landmarks of the perioral region. Nine landmarks on the midline and 16 landmarks on each side of the lip are indicated on the frontal view, and the 9 midline landmarks and the E-line are shown on the lateral view

Table 2 List of 43 perioral measurements
Fig. 2

Perioral measurements. a Linear distances on the frontal view (left) and the lateral view (right). b Curvatures on the frontal view. c Areas on the frontal view. d Angles on the frontal view (left) and the lateral view (right)

Reliability Assessment

The first author (Yuyan Yang, rater 1) performed two measurement sessions (session 1 and session 2) at an interval of at least 24 h for each of the two sets of 3D images. The third author (Lin Jin, rater 2) performed two measurement sessions at an interval of at least 24 h for the first set of 3D images. For intrarater reliability (repeatability), the two measurements of the same set of 3D images by the same rater were compared. For interrater reliability (reproducibility), the second measurements of the same set of 3D images by rater 1 and rater 2 were compared. For intramethod reliability, the second measurements of the two sets of 3D images were compared (Fig. 3).
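The comparison scheme can be written down as a small lookup table. The following Python sketch is illustrative only: the key layout (rater, capture, session), the dictionary name and the helper function are hypothetical, and it assumes that the intrarater comparison for rater 1 uses Capture 1, which the text does not state explicitly.

```python
# Each measurement session is identified by (rater, capture, session).
# ASSUMPTION: intrarater reliability for rater 1 is evaluated on Capture 1.
COMPARISONS = {
    "intrarater_rater1": ((1, 1, 1), (1, 1, 2)),  # same rater, same capture, two sessions
    "intrarater_rater2": ((2, 1, 1), (2, 1, 2)),
    "interrater":        ((1, 1, 2), (2, 1, 2)),  # second sessions of both raters, Capture 1
    "intramethod":       ((1, 1, 2), (1, 2, 2)),  # rater 1, second sessions, Capture 1 vs 2
}

def paired_values(sessions, comparison, measurement):
    """Return per-participant value pairs for one measurement and one comparison.

    `sessions` maps (rater, capture, session) to a dict of
    measurement name -> list of values, one value per participant.
    """
    first_key, second_key = COMPARISONS[comparison]
    first = sessions[first_key][measurement]
    second = sessions[second_key][measurement]
    if len(first) != len(second):
        raise ValueError("both sessions must cover the same participants")
    return list(zip(first, second))
```

The pairs returned by such a helper would then feed the reliability statistics for that measurement.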

Fig. 3

Schematic of intrarater, interrater and intramethod reliability assessment

Data Analyses

Seven statistics were evaluated to assess reliability (Table 3) [6]. The intraclass correlation coefficient (ICC) indicates the reliability between repeated measures, with values close to 1 representing high reliability and values close to 0 representing low reliability. Generally, an ICC value greater than 0.75 indicates excellent agreement [15, 16]. For the mean absolute difference (MAD) and technical error of measurement (TEM), we defined an acceptable error threshold of 1 unit (millimeter) for linear distances and curvatures because of the relatively small magnitude of these perioral measurements, and acceptable error thresholds of 4 units (square millimeters) for areas and 2 units (degrees) for angles because of their relatively large magnitude. In consideration of the influence of sample size on reliability assessment, we also calculated the relative error measurement (REM) and relative TEM (%TEM). According to previous reports, five reliability categories were defined for these relative statistics: < 1%, excellent; 1–3.9%, very good; 4–6.9%, good; 7–9.9%, moderate; and > 10%, poor [17]. Total TEM and relative total TEM (% total TEM) were calculated to account for the influence of different raters when more than one rater was involved.
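For orientation, the pairwise statistics can be sketched in a few lines of Python. The exact definitions used in this study are those listed in Table 3. The formulas below are the commonly used ones for two repeated measurements and are given here as an assumption: MAD is the mean of the absolute differences, TEM is the square root of the sum of squared differences divided by 2n, and REM and %TEM divide MAD and TEM by the grand mean. Taking the absolute value of the grand mean (so that signed E-line distances do not yield negative percentages) and treating the category bounds as half-open intervals are further assumptions of this sketch. Total TEM is not covered.

```python
import math

def reliability_stats(first, second):
    """MAD, TEM, REM (%) and %TEM for two paired series of the same measurement."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty series of equal length")
    n = len(first)
    diffs = [a - b for a, b in zip(first, second)]
    grand_mean = (sum(first) + sum(second)) / (2 * n)
    mad = sum(abs(d) for d in diffs) / n
    tem = math.sqrt(sum(d * d for d in diffs) / (2 * n))
    scale = abs(grand_mean)  # ASSUMPTION: magnitude of the grand mean
    return {
        "MAD": mad,
        "TEM": tem,
        "REM": 100.0 * mad / scale,
        "%TEM": 100.0 * tem / scale,
    }

def reliability_category(percent):
    """Map a REM or %TEM value to the five categories quoted in the text.

    ASSUMPTION: bounds are treated as half-open intervals
    (<1, [1, 4), [4, 7), [7, 10), >=10) to close the gaps between categories.
    """
    if percent < 1:
        return "excellent"
    if percent < 4:
        return "very good"
    if percent < 7:
        return "good"
    if percent < 10:
        return "moderate"
    return "poor"
```

For example, a REM of 7.40% falls in the "moderate" category and a %TEM of 0.8% in the "excellent" category.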

Table 3 List of statistics for reliability evaluation

For statistical analysis, means and standard deviations (SDs) as well as MAD, TEM, REM, % TEM, total TEM and % total TEM were calculated using Microsoft Excel 2022 (Microsoft Corp., Redmond, WA, USA), and ICC using SPSS version 25 (IBM Corp., Armonk, NY, USA). Graphs were generated using GraphPad Prism 8 (GraphPad Software Inc., San Diego, CA, USA). For normally distributed measurements, statistical differences were assessed by paired-sample t tests; for non-normally distributed measurements, Wilcoxon signed rank tests were performed. A P value of < 0.05 was considered statistically significant.
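The ICC model selected in SPSS is not specified here. As an illustration only, the following Python sketch computes one common variant, the two-way random-effects, absolute-agreement, single-measure coefficient (often written ICC(2,1)), from the ANOVA mean squares. The choice of this variant and the function name are assumptions, and the study's own values may rest on a different model.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is a list of rows, one per subject; each row holds the k
    repeated measurements (e.g. session 1 and session 2) of that subject.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 subjects and 2 measurements per subject")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for subjects (rows), measurements (columns) and residual
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Identical repeated measurements give a coefficient of 1, and the value decreases as the disagreement between sessions grows relative to the spread between subjects.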

Results

General Results

Fifty healthy individuals with a mean age of 31.6 years (range 19–60 years) were recruited, of whom 12 (24%) were men and 38 (76%) were women. Descriptive statistics (means and SDs) of all perioral measurements are shown in Table 4. Intrarater, interrater and intramethod reliability assessed by MAD, TEM, REM, %TEM, total TEM, % total TEM and ICC are shown in Supplementary Figures 1–5 and Table 5. In brief, the majority of perioral measurements had high reliability, with an ICC estimate larger than 0.95, MAD and TEM estimates smaller than our defined limits, and REM and %TEM estimates less than 7%.

Table 4 Mean and standard deviations (SDs) of all measurements
Table 5 Intrarater, interrater and intramethod intraclass correlation coefficient (ICC) and mean differences of all measurements

Intrarater Reliability

The vast majority of the intrarater ICC estimates were larger than or equal to 0.95, except for SULW, SPW, PCW, PW and LVMLH in one or both raters, which were still larger than 0.90.

All intrarater MAD and TEM estimates were smaller than our defined limits (1 unit for linear distances and curvatures, 4 units for areas and 2 units for angles). Among the intrarater REM estimates, 2 curvatures (UVML and LVML), 2 linear distances (HSNE and HSTE), all areal measurements (SACUL, SALV, SAUV and SAPH) and 6 of the angles (MLSA, NA, CBA, ULVA, LSA and CBA′) had excellent reliability of less than 1%. Twenty-one linear distances and the remaining 3 angles (ULA, LLA and CBAA) had very good reliability (1–3.9%), and 5 linear distances (HLSE, CLLH, LVMLH, PCW and SPW) had good reliability (4–6.9%). Similarly, of all intrarater %TEM estimates, 2 curvatures (UVML and LVML), 3 linear distances (HSNE, LW and HSTE), all areal measurements (SACUL, SALV, SAUV and SAPH) and 6 of the angles (MLSA, ULVA, NA, CBA, LSA and CBA′) had excellent reliability. Twenty-three linear distances and the remaining 3 angles had very good reliability, and only 2 linear distances (CLLH and SPW) had good reliability.

Interrater Reliability

Most interrater ICC estimates were larger than or equal to 0.95, except for SULW, SPW, PW, UVH, LLH, UVMLH, UVMMH, ULST and LW, which ranged from 0.91 to 0.94, and PCW, which had the lowest ICC value of 0.83.

The vast majority of interrater MAD and TEM estimates were smaller than our defined limits, except for LW (MAD = 1.11 mm). Similarly, most measurements had high reliability in terms of total TEM, with the exception of LW (total TEM = 1.10 mm), CBA′ (total TEM = 2.24°) and LLA (total TEM = 2.37°).

As for interrater REM estimates, 2 curvatures (UVML and LVML), 3 areas (SACUL, SALV and SAUV) and 2 angles (NA and MLSA) had a REM of less than 1%. Fourteen linear distances, SAPH and 6 angles (ULVA, CBA, LSA, CBAA, LLA and ULA) had a REM between 1 and 3.9%, and 13 linear distances and CBA′ had a REM between 4 and 6.9%. PCW had the lowest but still moderate reliability (REM = 7.40%). The results of the interrater %TEM estimates were similar: 2 curvatures (UVML and LVML), 3 areas (SACUL, SALV and SAUV) and 3 angles (NA, MLSA and ULVA) had excellent reliability; 17 linear distances, SAPH and 3 angles (CBA, LSA and CBAA) had very good reliability; and the remaining 11 linear distances and 3 angles (ULA, LLA and CBA′) had good reliability. Furthermore, most % total TEM estimates were less than 7%, with the exception of LVMLH (% total TEM = 7.12%), PCW (% total TEM = 7.17%), HLSE (% total TEM = 7.27%), UVMLH (% total TEM = 7.44%), HMTE (% total TEM = 7.58%) and CBA′ (% total TEM = 8.74%).

Intramethod Reliability

16 of the 43 perioral measurements had intramethod ICC estimates larger than or equal to 0.95, 13 were between 0.90 and 0.94 and 3 were below 0.90 (PCW, ICC = 0.88; PW, ICC = 0.89; MLSA, ICC = 0.89).

The intramethod MAD and TEM estimates were less than our defined limits in the vast majority of perioral measurements, except for LW (MAD = 1.01 mm), ULVA (MAD = 2.32°, TEM = 2.03°), CBA′ (MAD = 2.37°, TEM = 2.43°) and MLSA (MAD = 3.12°, TEM = 3.10°). Three (SACUL, SALV and SAUV) of the 43 perioral measurements had an intramethod REM of less than 1%, 17 were between 1 and 3.9%, 18 were between 4 and 6.9%, 1 was between 7 and 9.9% and 4 were larger than 10% (HSTE, HLSE, HLIE and HMTE). As for intramethod %TEM, all 4 areal measurements were less than 1%; 2 curvatures, 9 linear distances and 7 angles were between 1 and 3.9%; 14 linear distances and 2 angles were between 4 and 6.9%; 1 was between 7 and 9.9%; and, as with intramethod REM, 4 linear distances were larger than 10% (HSTE, HLSE, HLIE and HMTE).

Discussion

Unlike for the periocular region, no research has thoroughly evaluated the reliability and accuracy of 3D stereophotogrammetry in perioral anthropometry, which prevents the technology from being used to its full potential in clinical practice, especially in reconstructive and aesthetic plastic surgery of the lower face. To address this, the present study developed a feasible and repeatable protocol for evaluating the perioral region with the VECTRA H2 3D imaging system, involving identification of 25 anthropometric perioral landmarks in a sequential order and subsequent generation of the corresponding 28 linear distances, 2 curvatures, 4 areas and 9 angles. These measurements covered the entire perioral region between the nose base and the mentolabial sulcus and involved both frontal and lateral views. Our protocol showed high intrarater, interrater and intramethod reliability as reflected by several statistics including MAD, REM, TEM, %TEM, total TEM, % total TEM and ICC.

The acceptable precision error limit for measurements has usually been defined as 1 [6, 18] or 2 units [15, 19]. In this study, we defined different error limits for different measurements because of the large differences in their magnitudes. The mean values of linear distances and curvatures in our perioral measurements ranged from 0.35 to 67.87 mm, those of areas from 135.03 to 664.64 mm², and those of angles from 26.32° to 139.04°. Our findings showed highly reliable mean values for MAD, TEM, REM, %TEM and ICC in rater 1 (0.57 unit, 0.51 unit, 2.18%, 2.02% and 0.98) and rater 2 (0.57 unit, 0.55 unit, 2.44%, 2.34% and 0.98). Such consistency also extended to interrater comparisons (0.78 unit, 0.74 unit, 3.26%, 3.06% and 0.97) and intramethod comparisons (1.01 unit, 0.97 unit, 4.74%, 4.57% and 0.95).

We found the highest reliability in intrarater measurements and the lowest, but still good, reliability in intramethod measurements. Ideally, evaluation of intramethod reliability should involve only errors from camera recalibration, while the patient’s position and facial expression remain consistent. However, the way we calculated intramethod reliability in this study also introduced intrarater errors, because the rater had to perform landmark identification separately on the two sets of 3D images captured at different times. To address this issue, Gibelli et al. pre-marked 50 facial landmarks on participants’ faces using liquid eyeliner and compared the linear, angular and surface area measurements between two captures from one 3D imaging device and between captures from two devices (VECTRA H1 and VECTRA M3). The intramethod reliability was shown to be high in M3–M3, H1–H1 and M3–H1 comparisons (TEM range: 0.3–2.0 mm, 0.4–1.8°; REM range: 0.2–3.1%) [5].

Consistent with previous studies [6, 16], we found that measurements of small magnitude have small MAD and TEM estimates but large REM and %TEM estimates, and vice versa. In particular, the 4 measurements relating the lips’ distance to the E-line (HLSE, HMTE, HSTE and HLIE) had small MAD and TEM estimates in the intrarater, interrater and intramethod reliability assessments, but had high REM and %TEM estimates, especially for intramethod reliability (> 10%). This is most likely due to the very small values of these measurements in our study population. Compared to Caucasians, Asians tend to have a less prominent nose and chin [20], and thus the upper and lower lips are in close proximity to the E-line. We therefore believe that these measurements should have higher reliability in Caucasian populations with a concave facial profile, although this needs further validation. In contrast, because of their large magnitude, the four areal measurements showed relatively large MAD and TEM estimates (1.02–3.71 mm²) but small REM and %TEM estimates (0.34–1.28%).
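The effect of measurement magnitude on the relative statistics can be made concrete with a short calculation. The numbers in this Python sketch are hypothetical values chosen for illustration and are not data from this study. The function name is likewise hypothetical, and %TEM is taken here as TEM divided by the magnitude of the mean, which is an assumed definition.

```python
def percent_tem(tem, mean_value):
    """Relative TEM (%): absolute TEM as a percentage of the mean magnitude."""
    return 100.0 * tem / abs(mean_value)

# Hypothetical values for illustration only (not study data).
small = percent_tem(tem=0.3, mean_value=2.0)    # a short lip-to-E-line distance, mm
large = percent_tem(tem=3.0, mean_value=400.0)  # a surface area, mm^2

print(f"small-magnitude measurement: {small:.2f}%")
print(f"large-magnitude measurement: {large:.2f}%")
```

In this example the absolute error of the area is ten times larger than that of the distance, yet its relative error is 0.75% against 15.00% for the distance, which mirrors the pattern described above.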

Overall, we demonstrated a standardized method of perioral assessment that shows high precision. This study lays the foundation for applying this new imaging technology in many clinical situations. For example, certain diseases involving changes in perioral morphology could be diagnosed more easily once normative and pathologic 3D perioral anthropometric databases are established. Attractive lips can be better defined with detailed measurements. With such knowledge, planning of lip surgeries or aesthetic procedures (e.g. fillers and/or botulinum injection) can be more individualized and precise, and the therapeutic or aesthetic effects can be objectively assessed and better anticipated. Accordingly, a clearer and more reliable clinician-patient relationship might be built.

This study has several limitations. First, it is hard for participants to keep exactly the same lip position across different captures, and this resulted in a relatively large intramethod error compared to the intrarater and interrater errors. The human mouth is a very muscular region with many muscles originating from or terminating at the orbicularis oris, and the lip morphology is further influenced by the occlusion status. Hence, although each participant was required to gently close their mouth and teeth with a relaxed, neutral facial expression, involuntary lip movement could not be completely eliminated. Indeed, the study of Othman et al. showed that larger evaluation errors were recorded in the mouth area than in other parts of the face of patients with cleft lip and palate [14].

Second, no external validation was conducted in this study. Comparison of 3D stereophotogrammetry with direct anthropometry or digital photogrammetry may further confirm its validity and accuracy. Third, the sample size of this study is relatively small and the majority of participants were female. This method needs to be evaluated in different ethnic groups as well. For example, aged Caucasians (especially those with Fitzpatrick skin types I, II and III) tend to have an effaced vermilion border due to actinic damage, which might impair the reliability of assessment. In addition, perioral lines are a chief complaint predominantly among Caucasians. As our subjects were all Asian and relatively young, perioral rhytides were not measured in this study. Future studies should encompass perioral pathologies (such as cleft lip and palate, acromegaly and congenital macrostomia), more males, more elderly people and other ethnic groups, and assessment of perioral rhytides should be taken into account during the evaluation.

Conclusions

In this study, we proposed a novel and thorough evaluation protocol for perioral anthropometry using 3D digital stereophotogrammetry with linear, curvilinear, angular and areal measurements, and showed it to be highly reliable and repeatable for the analysis of perioral morphological characteristics. These results imply great potential for its application in guiding clinical practice, although further validation is required.