Introduction

The lower eyelid area, including the pretarsal roll, tear trough, lid-cheek junction, and eyelid bag, has long been important to the aesthetics and rejuvenation of the face. According to previous studies, from the fourth decade of life onward, the restriction of the orbicularis retaining ligament (ORL) and tear trough ligament on swollen orbital septum fat leads to the gradual appearance of the eyelid bag, tear trough and lid-cheek junction, which in turn produces an aged and fatigued appearance [1,2,3,4,5]. Moreover, the pretarsal roll, formed by the orbicularis oculi muscle, is widely believed by Easterners to make the eyes more attractive [6]. Therefore, procedures such as blepharoplasty and pretarsal roll injection are often used to achieve facial rejuvenation. A complete evaluation method for the lower eyelid region would greatly help in obtaining satisfactory surgical results.

Three-dimensional (3D) surface imaging has been widely used in recent years for the morphological assessment of facial parts, such as periocular aging and lower eyelid sag [7,8,9,10]. Compared with traditional 2D photography, 3D stereophotogrammetry provides a more complete and accurate collection of soft tissue features, allowing researchers to more finely assess facial structure, feature differences between populations, and soft-tissue changes before and after operation [11,12,13,14]. Furthermore, instead of lower eyelid evaluation scales, which can only give us subjective grades, or computerized tomography (CT), magnetic resonance imaging (MRI) and other imaging techniques, which are costly and easily affected by body position, 3D stereophotogrammetry overcomes such obstacles with its accuracy, convenience and noninvasiveness [15, 16].

Unfortunately, the existing periocular evaluation systems that use 3D imaging analysis either rely on a single basic index, such as the height and width of the eyelid fissure, focus on the upper eyelid region, or analyze only the age-related changes in lower eyelid fullness; no effective 3D morphological assessment system has been proposed for the lower eyelid area [7, 10, 17, 18]. For this reason, the periocular anthropometric assessment we established describes the characteristics of the lower eyelid region in detail through metrics such as lines, curvatures, angles and areas, making it the first 3D imaging-based system for assessing lower eyelid aging and the effect of blepharoplasty.

In this study, we proposed a practical protocol for lower eyelid morphology assessment with 3D stereophotogrammetry, introduced various novel periocular landmarks, and validated the intrarater, interrater and intramethod reliability of our anthropometry system through the measurement of linear distances, curvatures, angles and area metrics.

Methods

Patients and Recruitment

Fifty-eight volunteers (9 males, 49 females, 116 eyes) were recruited at the Department of Plastic and Reconstructive Surgery, Peking Union Medical College Hospital, from September 2021 to January 2023. They were aged 23 to 61 (39.14±11.25) years. Obvious pretarsal roll and eyelid bags were observed in all patients collected. Patients with congenital facial deformity, pathologies, recent or previous trauma, periorbital surgery and botulinum toxin injection were excluded. All patients signed the informed consent. This study was performed in line with the Declaration of Helsinki and was approved by the Institutional Review Board of the Peking Union Medical College Hospital (Reference Number K2502).

Equipment and Three-Dimensional Image Acquisition

The VECTRA H1-270 handheld camera (Canfield Scientific, Inc., Parsippany, NJ, USA), a three-dimensional stereophotogrammetry system, was used for patient image collection. Before 3D image acquisition, all participants were required to remove all make-up, glasses and jewelry and to fully expose their eyebrows and forehead by pulling back their hair. During image acquisition, patients were asked to sit in a fixed position, maintain a neutral facial position and expression, and look straight ahead. An experienced operator used the VECTRA H1-270 camera to capture photos of each patient at a fixed distance and from three different angles (right oblique, frontal and left oblique), exactly according to the manufacturer’s instructions. Before shooting, the lens height was adjusted so that it was level with the midpoint between the patient’s bilateral medial canthi. Subsequently, the 3D models of the patients’ faces were automatically generated with a resolution of 0.95 mm and exported in the OBJ format for image processing. Two sets of 3D images (Capture 1 and Capture 2) were taken for each patient at an interval of more than 45 minutes, with recalibration of the VECTRA device in between. Geomagic Wrap 2021 (Geomagic, Inc., Research Triangle Park, NC, USA) was used for facial landmark identification, metric measurement and analysis.

Landmark Identification and Metric Measurement

A coordinate system is essential for 3D model evaluation. Taking the middle point between the left and right endocanthions as the origin, Camper’s plane was determined by the left tragus, right tragus, and midpoint of the right and left ala. Then, the horizontal plane was obtained after turning up 7.5° and translating to pass through the origin. The sagittal plane was then defined as the plane perpendicular to both Camper’s plane and the horizontal plane. Finally, the vertical plane was set to pass through the origin and be perpendicular to both the sagittal plane and the horizontal plane.
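The plane construction described above can be sketched in a few lines of vector arithmetic. This is only an illustrative sketch, not the procedure carried out in Geomagic Wrap: the function name `build_planes`, the sample coordinates, the choice of the tragus-to-tragus line as the rotation axis, and the sign of the 7.5° tilt are all assumptions made for the example.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def unit(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def rotate(v, axis, degrees):
    """Rodrigues rotation of vector v about `axis` by `degrees`."""
    k, t = unit(axis), math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return add(add(scale(v, c), scale(cross(k, v), s)),
               scale(k, dot(k, v) * (1.0 - c)))

def build_planes(en_left, en_right, tragus_left, tragus_right, ala_mid,
                 tilt_deg=7.5):
    """Return the origin and unit normals of the reference planes.

    Assumption: the tilt is applied about the tragus-to-tragus axis;
    the rotation direction (sign of tilt_deg) is also an assumption.
    """
    origin = scale(add(en_left, en_right), 0.5)       # midpoint of endocanthions
    transverse = unit(sub(tragus_left, tragus_right))  # lies in Camper's plane
    camper = unit(cross(transverse, sub(ala_mid, tragus_right)))
    horizontal = unit(rotate(camper, transverse, tilt_deg))
    # A plane perpendicular to both Camper's and the horizontal plane has a
    # normal perpendicular to both of their normals, i.e. the transverse axis.
    sagittal = transverse
    vertical = unit(cross(sagittal, horizontal))
    return {"origin": origin, "camper": camper, "horizontal": horizontal,
            "sagittal": sagittal, "vertical": vertical}

# Hypothetical landmark coordinates in millimetres.
planes = build_planes(en_left=(16, 30, 0), en_right=(-16, 30, 0),
                      tragus_left=(70, 0, -90), tragus_right=(-70, 0, -90),
                      ala_mid=(0, -20, 10))
```

Each plane is then fully described by its unit normal together with the shared origin, so a landmark's signed distance to a plane is simply the dot product of the normal with the landmark's offset from the origin.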

A total of 27 periocular landmarks are described in Table 1; they were digitally identified and marked following the standard protocol shown in Fig. 1. In the frontal view, the pupillary center (Pc), endocanthion (En), exocanthion (Ex), medial corneoscleral limbus (Mcl) and lateral corneoscleral limbus (Lcl) were marked first because they are easy to recognize. Then, the vertically corresponding points of these five landmarks on the lower palpebral margin were marked. The midpoints of En and Mcl and of Ex and Lcl were labeled for a more detailed description. With reference to the points on the lower palpebral margin, the vertically corresponding points on the pretarsal roll, tear trough and lid-cheek junction were identified on their lower margins, because these structures have a certain width. The vertically corresponding points of the pupillary center on the upper eyelid margin and the lower eyebrow margin were also marked for the overall evaluation of the orbital region. In the lateral view, the nasion point was labeled to describe lateral orbital features for different subjects. On the basis of these 27 landmarks, 19 linear distances, 4 curvatures, 7 angles and 2 areas were measured (Table 2 and Fig. 2), of which 2 linear distances were assessed in the lateral view.
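To make the four metric types concrete, the following minimal sketch computes each of them from landmark coordinates. It is a simplification under stated assumptions: the helper names are hypothetical, the curvature metric is approximated by the length of a polyline through the landmarks, and the area is approximated by fan triangulation, whereas the study measured these quantities on the 3D surface in Geomagic Wrap.

```python
import math

def linear_distance(p, q):
    """Straight-line distance between two 3D landmarks."""
    return math.dist(p, q)

def angle_deg(vertex, a, b):
    """Angle at `vertex` between the rays vertex->a and vertex->b, in degrees."""
    u = [ai - vi for ai, vi in zip(a, vertex)]
    w = [bi - vi for bi, vi in zip(b, vertex)]
    cos_t = sum(x * y for x, y in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def polyline_length(points):
    """Approximate a curve metric as the summed length of its segments."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def fan_area(points):
    """Approximate the area enclosed by ordered boundary landmarks.

    Uses fan triangulation from the first point; exact only for planar
    convex outlines, so it is an approximation on a curved surface.
    """
    p0, total = points[0], 0.0
    for p, q in zip(points[1:], points[2:]):
        u = [a - b for a, b in zip(p, p0)]
        w = [a - b for a, b in zip(q, p0)]
        cx = (u[1]*w[2] - u[2]*w[1], u[2]*w[0] - u[0]*w[2], u[0]*w[1] - u[1]*w[0])
        total += 0.5 * math.hypot(*cx)
    return total

# Hypothetical coordinates (mm) chosen so the results are easy to check by hand.
d = linear_distance((0, 0, 0), (3, 4, 0))
theta = angle_deg((0, 0, 0), (1, 0, 0), (0, 1, 0))
length = polyline_length([(0, 0, 0), (3, 4, 0), (3, 4, 12)])
area = fan_area([(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)])
```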

Table 1 Definition of 27 anthropometric periocular landmarks
Fig. 1

The 27 three-dimensional anthropometric landmarks of the periocular region. Twenty-six landmarks are indicated on the frontal view and 3 on the lateral view (Pc and Ex represent the same points on both the frontal and lateral views).

Table 2 List of 32 periocular measurements
Fig. 2

Schematic drawing of each metric of the lower eyelid region. A and B show the 19 linear distances of the protocol in the frontal and lateral views. C shows the 4 curvatures of the protocol. D shows the angles of the protocol, in which the dotted lines are parallel to the solid lines of the same color. E shows the 2 areal metrics of the protocol.

Intrarater, Interrater and Intramethod Reliability Assessment

The first author (Y. C., Rater 1) finished two measurement sessions at an interval of more than 24 hours for each set of 3D pictures, and the second author (L. J., Rater 2) performed two measurements on the first set of 3D pictures at an interval of at least 24 hours. From their results, intrarater, interrater and intramethod reliability were assessed (Fig. 3). Intrarater reliability referred to the uniformity of two measurements of the same image performed by the same rater. Interrater reliability was the comparison of the second measurements of the same image performed by Rater 1 vs. Rater 2. Intramethod reliability meant the similarity between the second measurements of two different sets of images carried out by the same rater.
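The three comparison types can be expressed as a small rule over measurement sessions. The sketch below is illustrative only; the `Session` record and the `comparison_type` function are hypothetical names that simply encode the definitions given above.

```python
from typing import NamedTuple, Optional

class Session(NamedTuple):
    rater: int     # 1 or 2
    capture: int   # 1 or 2 (image set)
    session: int   # 1 or 2 (measurement session on that image set)

def comparison_type(a: Session, b: Session) -> Optional[str]:
    """Classify a pair of sessions according to the definitions in the text."""
    if a.rater == b.rater and a.capture == b.capture and a.session != b.session:
        return "intrarater"   # same rater, same image, repeated measurement
    if a.rater != b.rater and a.capture == b.capture and a.session == b.session == 2:
        return "interrater"   # second measurements, same image, different raters
    if a.rater == b.rater and a.capture != b.capture and a.session == b.session == 2:
        return "intramethod"  # second measurements, different captures, same rater
    return None               # pair not used in the reliability analysis

# Sessions actually performed: Rater 1 on both captures, Rater 2 on Capture 1.
r1_c1_s1, r1_c1_s2 = Session(1, 1, 1), Session(1, 1, 2)
r1_c2_s2 = Session(1, 2, 2)
r2_c1_s2 = Session(2, 1, 2)
```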

Fig. 3

Schematic of intrarater, interrater and intramethod reliability assessment.

Data Analysis

Intrarater, interrater and intramethod reliability were assessed by 7 statistics (Table 3) [18]. The intraclass correlation coefficient (ICC) was used to evaluate the reliability between repeated measures; a value close to 1 indicates high reliability, and a value close to 0 indicates low reliability. Generally, ICC values are divided into 3 categories with thresholds of 0.4 and 0.75, where a value greater than 0.75 indicates excellent reliability. The mean absolute difference (MAD) and technical error of measurement (TEM) shared the same acceptable error thresholds: 1 unit (millimeter) for linear and curvature metrics, 2 units (degrees) for angular metrics and 4 units (square millimeters) for areal metrics, chosen according to the relatively small or large magnitudes of the periocular and lower eyelid measurements. The relative error magnitude (REM) and relative TEM (%TEM) were also calculated to provide an estimate of variation that takes into account the influence of measurement size on the reliability assessment. Five reliability categories were defined for REM and %TEM with reference to previous studies: < 1%, excellent; 1 to 3.9%, very good; 4 to 6.9%, good; 7 to 9.9%, moderate; and > 10%, poor [17]. Total TEM and relative % total TEM were also calculated to negate the influence of the positive correlation between total TEM and sample size.
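As an illustration of how the paired-measurement statistics relate to one another, the sketch below computes MAD, TEM, REM and %TEM for two repeated measurement series and assigns the percentage-based category. The formulas are the ones commonly used in anthropometry and are an assumption here, since Table 3 is not reproduced in this text; the function names and sample values are hypothetical, and the category boundaries are implemented as half-open intervals at 1, 4, 7 and 10%.

```python
import math

def reliability_stats(first, second):
    """MAD, TEM, REM and %TEM for two paired measurement series.

    Assumed definitions (standard in anthropometry, not taken from Table 3):
      MAD  = mean(|x1 - x2|)
      TEM  = sqrt(sum((x1 - x2)^2) / (2 n))
      REM  = 100 * MAD / grand mean
      %TEM = 100 * TEM / grand mean
    """
    n = len(first)
    diffs = [a - b for a, b in zip(first, second)]
    grand_mean = (sum(first) + sum(second)) / (2 * n)
    mad = sum(abs(d) for d in diffs) / n
    tem = math.sqrt(sum(d * d for d in diffs) / (2 * n))
    return {"MAD": mad, "TEM": tem,
            "REM": 100 * mad / grand_mean, "%TEM": 100 * tem / grand_mean}

def percent_category(value):
    """Map an REM or %TEM value (in percent) to its reliability category."""
    if value < 1:
        return "excellent"
    if value < 4:
        return "very good"
    if value < 7:
        return "good"
    if value < 10:
        return "moderate"
    return "poor"

# Hypothetical repeated measurements of one linear metric (mm).
session_1 = [10.0, 12.0, 11.0, 9.0]
session_2 = [10.2, 11.8, 11.0, 9.4]
stats = reliability_stats(session_1, session_2)
label = percent_category(stats["REM"])
```

Because REM and %TEM divide by the grand mean, the same absolute error yields a larger percentage for a small measurement than for a large one.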

Table 3 List of statistics for reliability evaluation

All 7 statistics mentioned above and the means of the measurements with their standard deviations (SDs) were entered into the software Microsoft Excel 2019 (Microsoft Corp., Redmond, WA, USA) after all measurements were finished, for further processing. ICCs were calculated and significance tests run using SPSS version 25 software (IBM Corp., Armonk, NY, USA). The normality of all measurements was tested by the Kolmogorov–Smirnov test first; then, the paired-sample t test was employed for normally distributed measurements, and the Wilcoxon signed-rank test was performed for nonnormally distributed paired measurements. A P value of < 0.05 was considered statistically significant.
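For readers without SPSS, an ICC can be computed directly from the two-way ANOVA mean squares. The sketch below implements the two-way random-effects, absolute-agreement, single-measure form; this choice of model is an assumption, because the text does not state which ICC model was selected in SPSS. The function names are hypothetical, as are the labels of the two lower categories (the text names only the thresholds 0.4 and 0.75 and the "excellent" label).

```python
def icc_absolute_single(rows):
    """ICC for a table of subjects (rows) by repeated measurements (columns).

    Assumed model: two-way random effects, absolute agreement, single measure:
      ICC = (MSR - MSE) / (MSR + (k - 1) MSE + k (MSC - MSE) / n)
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    msr = ss_rows / (n - 1)                                       # between subjects
    msc = ss_cols / (k - 1)                                       # between measurements
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))    # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def icc_category(icc):
    """Three categories with the thresholds 0.4 and 0.75 given in the text."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.4:
        return "fair to good"   # label assumed, not stated in the text
    return "poor"               # label assumed, not stated in the text

# Hypothetical data: 4 eyes measured twice (mm).
repeated = [(10.0, 10.2), (12.0, 11.8), (11.0, 11.0), (9.0, 9.4)]
icc_value = icc_absolute_single(repeated)
```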

Results

Baseline Data and General Results

Fifty-eight healthy volunteers were recruited (49 females, 9 males). The two sexes had similar ages (male 38.22±9.90 years, female 39.30±11.49 years). The mean value and standard deviation (SD) of each measurement are shown in Table 4. The reliability assessment was divided into three parts, namely, intrarater, interrater and intramethod, and a total of seven evaluation indicators were calculated: MAD, TEM, REM, %TEM, total TEM, % total TEM and ICC (Supplementary Figs. 1, 2, 3, 4, and 5 and Table 5, respectively). On the whole, most of the measurements showed relatively high reliability, as the ICC was more than 0.95, the MAD and TEM values were within the specified normal ranges, and the REM and %TEM were less than 7%. No REM or %TEM estimates exceeded 10%, which was the threshold for poor reliability. In addition, the vast majority of statistics calculated indicated a higher reliability in intrarater assessment than in interrater and intramethod measurements.

Table 4 Mean and standard deviations (SDs) of all measurements
Table 5 Intrarater, interrater and intramethod intraclass correlation coefficient (ICC) and mean differences of all measurements

Intrarater Reliability

All of the ICC estimates in the two intrarater measurements exceeded 0.95, with 17 of them reaching 1.00, which indicated high intrarater correlation. All intrarater MAD and TEM estimates were below the self-set limits, which were 1 unit for linear and curvature metrics, 2 units for angles and 4 units for areal measurements. In particular, the linear and curvature MAD estimates were less than 0.2 units. As for REM estimates, 10 linear estimates (CW, TTED_pc, PFW, LJmD_lcl, TTmD_pc, TTmD_mcl, PFH, LJmD_ml, ED, LJmD_ex), 3 curvature estimates (PRL, LPML, LCJL), 5 angular estimates (TLA, MCA, TTA, LCA, PRAm) and both area estimates (SLES, STT) reached excellent reliability, with REM values of less than 1%. The remaining 9 linear estimates, 1 curvature estimate and 2 angular estimates all had very good reliability (1–3.9%). A total of 12 intrarater %TEM values had excellent reliability (< 1%), including 3 linear estimates (CW, TTED_pc, PFW), 2 curvature estimates (PRL, LPML), 5 angular estimates (TLA, TTA, MCA, LCA, PRAm) and 2 area estimates (STT, SLES). The other 13 linear estimates, 2 curvature estimates and 2 angular estimates had very good reliability, and only 2 linear estimates (ET, TTmD_en) had good reliability, with values of 4%–6.9%.

Interrater Reliability

The majority of interrater ICC estimates were no less than 0.95. The exceptions were PFW, PFH, LJmD_ml, LCJL and PRAm, whose ICCs ranged from 0.92 to 0.94, and only two of the ICCs were below 0.90 (0.83 for CW, 0.88 for PRL). All interrater MAD estimates were within the self-set limits, as were all TEM estimates except that of one linear metric, LJmD_ex, whose TEM value was 1.10 units. Similarly, nearly all interrater measurements had high reliability in terms of total TEM, except LJmD_ex, which had a total TEM value of 1.12%. Furthermore, of the % total TEM values, only one linear distance (PPmD_mcl) failed to reach good reliability, with a value of 8.05%.

Among the interrater REM estimates, only 1 angle (TLA) and 2 areal measurements had REM of less than 1%; 10 linear distances, all 4 curvatures and the remaining 6 angular measurements had very good reliability, with REM between 1% and 3.9%. Eight other linear distances (LJmD_ex, TTmD_en, TTmD_mm, PPmD_ex, PPmD_lcl, PPmD_pc, PPmD_mm, PPmD_ml) had good reliability (4%-6.9%). The last linear distance (PPmD_mcl) had the worst REM of 7.56%, which was still in the moderate reliability category. Of the %TEM values, similarly, 1 angle and 2 areal estimates achieved excellent reliability, with %TEM of less than 1%; 8 linear estimates (TTED_pc, PFW, LJmD_lcl, LJmD_ml, CW, PFH, ED, ET), 4 curvatures and 6 angles had very good reliability (1–3.9%); 10 other linear estimates reached good reliability, with %TEM ranging from 4 to 6.9%; and one linear distance had the worst %TEM of 7.52%, which was considered moderate reliability.

Intramethod Reliability

Twenty-three intramethod measurements had ICC estimates of no less than 0.95, seven others had ICC estimates between 0.90 and 0.94, and only 2 ICC estimates were below 0.90 (PFH, ICC=0.81; PRL, ICC=0.88). All intramethod MAD and TEM estimates were within our defined limits. In terms of intramethod REM, only 1 angle (TLA) and 2 areal measurements reached excellent reliability, with values below 1%. Three linear distances (PFW, TTED_pc, CW), 2 curvatures (LPML, PRL) and the 6 remaining angular measurements were in the very good reliability category (REM ranging from 1 to 3.9%). Eleven linear distances and 2 curvatures had good reliability, with REM ranging from 4 to 6.9%. The other 5 linear measurements (ED, ET, PPmD_ex, PPmD_pc and PPmD_mcl) had moderate reliability, with a maximum REM of 7.65%. As for %TEM, 1 angle (TLA) and 2 areal estimates had excellent reliability, with %TEM of less than 1%. Three linear measurements (PFW, TTED_pc, CW), 2 curvatures (LPML, PRL) and the other 6 angles had %TEM between 1 and 3.9%. Fifteen linear distances and 2 curvatures had good reliability, with %TEM ranging from 4 to 6.9%. Only 1 linear estimate (PPmD_mcl) had moderate reliability (%TEM=7.09%).

Discussion

Considering the marked individual differences in the lower eyelid region, previous lower eyelid assessments were mostly conducted with grading scales, which are convenient but subjective and inaccurate. Among the few quantitative methods, orbital grayscale analysis can provide only limited information, such as the difference in grayscale value between the lower eyelid fold and the surrounding area [19, 20]. In addition, imaging examinations such as CT and MRI are cumbersome and expensive, and so far there has been no accurate evaluation model of lower eyelid soft tissue that can be widely promoted. Lo and Lin provided a comprehensive overview of the application of 3D imaging in craniomaxillofacial surgery, highlighting the numerous advantages offered by 3D photography, including minimal invasiveness, high repeatability and ease of use [21]. Fan et al. applied a 3D imaging system to the measurement of upper eyelid area and volume, finding that it was reliable in area measurement despite doubts about the accuracy of volume measurement [22]. Our study presents the first periorbital morphology evaluation system focusing on the lower eyelid region by using 3D stereophotogrammetry. Referring to the approach of Guo et al. [18], 58 subjects with both pretarsal roll and eyelid bags were recruited, 27 periorbital landmarks were set up, and 19 linear distances, 4 curvatures, 7 angular metrics and 2 areas were calculated to describe the characteristics of the lower eyelid region in detail. Through the assessment of multiple statistical indices, namely MAD, TEM, REM, %TEM, total TEM, % total TEM and ICC, the protocol was found to have high intrarater, interrater and intramethod reliability.

Referring to previous research and considering the wide range of magnitudes among our measured values, we defined 1 unit for linear and curvature metrics, 2 units for angles and 4 units for areas as the precision error limits for MAD and TEM evaluation. Almost all measurements were within these limits, except the TEM of interrater LJmD_ex, which slightly exceeded them (TEM = 1.10). High reliability was confirmed by the mean values of MAD, TEM, REM, %TEM and ICC for Rater 1 (0.22 units, 0.31 units, 0.90%, 1.53%, 0.99) and Rater 2 (0.08 units, 0.15 units, 0.31%, 0.64%, 0.99). Such consistency also extended to the interrater measures (0.53 units, 0.59 units, 2.94%, 3.41%, 0.96) and intramethod measures (0.71 units, 0.77 units, 4.12%, 4.05%, 0.95). Among them, the highest reliability was found in the intrarater measurements and the lowest, though still acceptable, in the intramethod measurements, in line with Guo et al. [18]. Moreover, reliability decreased sharply from intrarater to interrater and intramethod measurements: 32 of the estimates reached excellent intrarater reliability, compared with 6 for both interrater and intramethod reliability. This may be explained by the fact that the tear trough, lid-cheek junction and pretarsal roll each have a certain width rather than being characterized by a single line. Although their lower margins had been uniformly marked, there were still differences in judgment between raters. As for the intramethod differences, in addition to the errors caused by equipment recalibration, patient position and facial expression, intrarater measurement errors were also introduced because the two sets of landmark identification were carried out at different times. By contrast, Gibelli et al. assessed intramethod differences by premarking 50 landmarks on patients’ faces, which effectively eliminated intrarater errors [23].

In terms of specific indicators, it was found that when MAD and TEM were small, %TEM and REM tended to be large, and vice versa, which may be because a given error is larger relative to the value itself when the measured value is small [18, 24]. This study and that of Guo et al. shared a total of 3 linear distances, 1 curvature and 2 angles, namely, PFW, PFH, CW, LPML, MCA and LCA. The mean value differences of PFH, CW and LCA were within the self-set limits (this study vs. Guo’s study, 11.61 mm vs. 11.71 mm, 11.55 mm vs. 12.11 mm, 35.21° vs. 36.70°), whereas the others were not (PFW 29.34 mm vs. 30.62 mm, LPML 31.88 mm vs. 34.44 mm, MCA 44.91° vs. 39.42°). These discrepancies may be caused by the different ethnic groups enrolled in the studies: Caucasians tend to have longer eyes and Asians rounder eyes [18].
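The comparison with the self-set limits can be reproduced from the mean values quoted above. The short sketch below is only a worked check of that arithmetic; the dictionary and function names are hypothetical, while the means and limits are those stated in the text.

```python
# Self-set precision limits: mm for linear/curvature metrics, degrees for angles.
LIMITS = {"linear": 1.0, "curvature": 1.0, "angle": 2.0}

# metric: (type, mean in this study, mean in Guo et al.)
SHARED_METRICS = {
    "PFH":  ("linear",    11.61, 11.71),
    "CW":   ("linear",    11.55, 12.11),
    "LCA":  ("angle",     35.21, 36.70),
    "PFW":  ("linear",    29.34, 30.62),
    "LPML": ("curvature", 31.88, 34.44),
    "MCA":  ("angle",     44.91, 39.42),
}

def within_limit(kind, mean_a, mean_b):
    """True if the absolute difference of two means is within the limit."""
    return abs(mean_a - mean_b) <= LIMITS[kind]

within = sorted(name for name, (kind, a, b) in SHARED_METRICS.items()
                if within_limit(kind, a, b))
outside = sorted(name for name in SHARED_METRICS if name not in within)

for name, (kind, a, b) in SHARED_METRICS.items():
    print(f"{name}: |{a} - {b}| = {abs(a - b):.2f} (limit {LIMITS[kind]})")
```

The absolute differences are 0.10 mm, 0.56 mm and 1.49° for PFH, CW and LCA, and 1.28 mm, 2.56 mm and 5.49° for PFW, LPML and MCA, which matches the grouping described in the text.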

The linear distances with low accuracy were concentrated in the pretarsal roll and the LJmD_ex, where the lower margin was difficult to judge precisely because of the fine wrinkles. However, the related curvatures and angles still met the self-defined ranges, indicating the same judgment of the curve shapes between different raters. Furthermore, since the boundary of areas did not involve pretarsal roll, both area metrics showed high reliability in all of the evaluation statistics, which confirmed our conclusion. In addition, some estimates of eyes had high accuracy, such as PFW, PFH and CW, which also illustrated the high reliability of this protocol.

There is no denying the significance of volume measurement in evaluating the outcomes of lower eyelid surgery. We did not include volume measures in this article for two primary reasons. First, 3D photography primarily focuses on capturing surface information. Although certain software can measure volumes of marked areas, it still relies on surface labeling within the images. Fan et al. also highlighted that volume measurements obtained through 3D photography are not highly reliable [22]. Second, considering the complex anatomical structure and nonplanar bottom surface of the lower eyelid region, such internal information cannot be adequately captured by 3D photography alone, making it challenging to accurately evaluate lower eyelid volume solely through 3D imaging. For precise measurement of lower eyelid volume, we suggest combining CT, MRI and ultrasound measurements to obtain more accurate data.

This study proposed a standardized method for measuring the lower eyelid region by using 3D photography and confirmed its reliability and accuracy. The new imaging technology can thus be applied to many other clinical situations. Before performing lower blepharoplasty, the standardized method can be used to evaluate the severity of tear trough deformity and eyelid bags so that operators can facilitate more individualized surgical planning for each patient. Meanwhile, the therapeutic or aesthetic effects can be further assessed by such a protocol. Additionally, the attractiveness of pretarsal rolls can be better defined with such measurements, and changes in the lower eyelid region caused by aging or certain diseases can be easily identified and diagnosed through the establishment of a normative 3D anthropometric database of the lower eyelid region.

This study had some limitations. First, only 58 participants were included, of whom more than 80% were female. This sample size was relatively small, making it difficult to generalize the results. Second, the inevitable use of flash in 3D photography not only weakened the visibility of the margins of the pretarsal roll, tear trough and lid-cheek junction after exposure but also made it hard for the participants to maintain exactly the same expression in two different captures because the eyes are extremely sensitive to light. These edges can still be accurately identified and marked, but the small changes in expression inevitably led to the larger intramethod error compared to the intrarater and interrater errors. Third, no other validation was conducted in this study. A further comparison of 3D stereophotogrammetry to 2D photogrammetry or direct measurement may be best to confirm its validity and reliability. In summary, further studies are still necessary for more reliable and universal evaluation of the lower eyelid region.

Conclusion

This is the first periocular anthropometry system concentrating on the lower eyelid region using a 3D digital stereophotogrammetry technique. It proved highly reliable and repeatable. It takes advantage of the convenience of 3D imaging and can be widely applied to lower eyelid aging assessment, surgical design, therapeutic effect evaluation and periocular rejuvenation.