Introduction

Success of a bariatric procedure should be measured, both primarily and in the long term, by postoperative weight loss, improvement of (metabolic) comorbidities, impact on quality of life, and complications caused by the operation. For three of these effects, success can readily be defined, with standardized criteria for comorbidities, validated questionnaires for quality of life, and clear classifications for postoperative complications [1,2,3,4]. For bariatric weight loss, however, defining success remains troublesome.

Fundamentally, successful results would lie around or above the average outcome that can be expected with a specific therapy. A criterion for success should serve as a clear cutoff between those results below average that are still acceptable (successful because close enough below average) and those that are not (unsuccessful because too far below average: poor responders).

This simple premise becomes problematic when applied to bariatric weight loss. First of all, there is no single average weight loss outcome in bariatric surgery: each type of procedure yields different results [5]. Common criteria for bariatric weight loss success do not differentiate between procedures. To validate them, a benchmark procedure has to be chosen to compare them with, one that is generally accepted as the gold standard in bariatric surgery. The laparoscopic Roux-en-Y gastric bypass (LRYGB) could be considered as such. Until very recently, it was the most frequently performed bariatric procedure worldwide, with an average outcome between that of the increasingly popular sleeve gastrectomy and that of malabsorptive procedures like duodenal switch [5, 6].

Second, bariatric weight loss is dynamic over time [7]. Early weight loss success would therefore look different from success after 1, 5, or 10 years. To validate weight loss criteria, they should be compared to a benchmark that is dynamic as well. A weight loss percentile chart, similar to growth charts in children’s healthcare, would fill that need, picturing the course and spread of weight loss over time in one graph.

A third major problem is commonly overlooked. With common outcome metrics, postoperative weight loss results are influenced by differences in body mass index (BMI) at baseline. There is strong evidence that this distorting effect can be significant in bariatric surgery, where patients typically show a wide range of baseline BMI, from 35 kg/m2 to more than double [8]. To validate bariatric criteria, they should be compared to a benchmark that is influenced by this effect as little as possible. Several researchers have shown that the outcome metric percentage total weight loss (%TWL) is affected less than percentage excess BMI loss (%EBMIL) [9,10,11,12,13]. In 2011, Baltasar et al. described a metric that seemed not to be affected at all [14]. It was derived from the results of 7410 Spanish bariatric patients and validated with the outcome of 265 different subjects [15]. In 2013, van de Laar described a less elaborate baseline BMI-independent metric called percentage alterable weight loss (%AWL), based on a similar algorithm derived from the results of 8945 American subjects and validated with the outcome of 500 Dutch LRYGB patients [16, 17]. It was confirmed for sleeve gastrectomy and lower BMIs and proven superior to other weight loss metrics when compared to metabolic outcome after LRYGB [18, 19]. An overview of this relatively new weight loss metric is presented in Table 1.

Table 1 Background information on the percentage alterable weight loss (%AWL) metric

These findings paved the way for constructing a new kind of bariatric benchmark: a baseline BMI-independent, time-dependent percentile chart [20]. Such a benchmark (provided that it is based on the outcome of a large and representative cohort of patients after the gold standard bariatric operation) would be ideal to test criteria for bariatric weight loss success.

With charts based on American and Dutch LRYGB data, it was already demonstrated that the popular criteria ≥ 50%EBMIL and < BMI35 are weak in recognizing insufficient weight loss [20, 23]. At this time, all further evidence on the validity of common bariatric criteria is expert opinion only [24]. In this study, we updated our elaborated LRYGB percentile chart and used this benchmark as a representation of the bariatric gold standard to systematically establish the sensitivity and specificity of commonly used criteria for bariatric weight loss success.

Materials and Methods

Benchmark

To validate common criteria for bariatric weight loss success, one needs a benchmark to compare them to, preferably based on the outcome of the gold standard bariatric procedure. We chose LRYGB outcome for this purpose, expressed in a %AWL-based percentile chart, which is both independent of common differences in baseline BMI among patients and takes into account the dynamic weight changes over time. Such a gold standard benchmark should be based on a cohort that is both sufficiently large and representative. We therefore used our own large LRYGB cohort and compared our results with the outcome of landmark studies in bariatric research. For that purpose, we searched the bariatric literature for studies that reported LRYGB outcome of at least 250 patients, with more than 1-year follow-up and expressed with %AWL or %TWL, the two metrics least influenced by differences in baseline BMI.

We retrospectively analyzed data from all consecutive primary LRYGB patients operated in our bariatric center of excellence from the beginning of our bariatric program in 2007 until January 2017. Revision procedures were excluded, because results could have been influenced by preceding bariatric surgery. Patients were urged to lose weight preoperatively to reduce the operative risk. All procedures were standardized, with 4-cm × 8-cm gastric pouch, 50-cm biliary limb, and 150-cm alimentary limb. Patients were weighed with the same type of scale at all pre- and postoperative visits. All data were collected prospectively in a central electronic database. We used the measurement at first preoperative visit for baseline weight [21]. We specified age, gender, and type 2 diabetes (defined with the American Diabetes Association criteria) as potential confounders for weight loss, to illustrate the representativeness of the cohort [1, 25]. We only used the last measured weight for each patient postoperatively. Thus, all patients contributed to the chart equally, with only one result, even if lost to follow-up. In this way, the number of postoperative visits (and therefore the compliance to follow-up) could not introduce a bias. Weight loss was calculated with five different metrics:

  1. BMI reduction = (baseline weight − last weight) / (body length)².

  2. %TWL = 100% × BMI reduction / baseline BMI.

  3. %AWL = 100% × BMI reduction / (baseline BMI − 13), as described by van de Laar et al. [16, 17].

  4. %EBMIL = 100% × BMI reduction / (baseline BMI − 25), as suggested by Deitel et al. [26].

  5. Percentage excess weight loss (%EWL) = 100% × (baseline weight − last weight) / (baseline weight − ideal weight), based on the 1983 gender-specific Metropolitan Life insurance tables for ideal weight, using “medium frame” and subtracting 2.54 cm (the tables were based on people wearing “shoes with 1-in. heels”) [27].
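The five metrics listed above can be sketched in a few lines of Python. This is only an illustration of the formulas as written: the function and key names are hypothetical, weights are assumed to be in kg and body length in m, and the ideal weight from the Metropolitan Life tables is not reproduced here but passed in as an optional argument.

```python
from typing import Dict, Optional


def weight_loss_metrics(baseline_kg: float, last_kg: float, height_m: float,
                        ideal_kg: Optional[float] = None) -> Dict[str, float]:
    """Return the weight loss metrics defined in the list above.

    ideal_kg is an assumed input (it would come from the ideal weight
    tables); %EWL is only computed when it is supplied.
    """
    baseline_bmi = baseline_kg / height_m ** 2
    bmi_reduction = (baseline_kg - last_kg) / height_m ** 2
    metrics = {
        "bmi_reduction": bmi_reduction,
        "twl": 100 * bmi_reduction / baseline_bmi,           # %TWL
        "awl": 100 * bmi_reduction / (baseline_bmi - 13),    # %AWL
        "ebmil": 100 * bmi_reduction / (baseline_bmi - 25),  # %EBMIL
    }
    if ideal_kg is not None:
        # %EWL uses weights rather than BMI values
        metrics["ewl"] = 100 * (baseline_kg - last_kg) / (baseline_kg - ideal_kg)
    return metrics


# Hypothetical patient: 120 kg at baseline, 84 kg at last visit, 1.70 m tall
example = weight_loss_metrics(120.0, 84.0, 1.70, ideal_kg=65.0)
```

For any given weight change, the three BMI-based percentages differ only in the constant subtracted from baseline BMI in the denominator (0, 13, or 25).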

Out of these five metrics, only %AWL was proven to render baseline BMI-independent outcome and therefore was used for the percentile chart [17, 22]. We created percentile curves using Excel (Microsoft Office 2016) by determining percentiles for results grouped by consecutive intervals of 100 days and smoothed them using polynomial trendlines. We used the trendline formulas to judge the adequacy of each curve.
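The binning step can be sketched as follows. This is not the Excel workflow itself but a minimal stand-in under stated assumptions: the helper names are hypothetical, percentiles use linear interpolation between ranks (the exact Excel percentile function used is not stated), and the polynomial trendline smoothing is left out.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def percentile(values: Iterable[float], p: float) -> float:
    """Percentile p (0-100) with linear interpolation between sorted ranks."""
    s = sorted(values)
    if not s:
        raise ValueError("no values")
    rank = (len(s) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def binned_percentiles(results: Iterable[Tuple[float, float]], p: float,
                       bin_days: int = 100) -> List[Tuple[float, float]]:
    """Group (days since surgery, %AWL) results into consecutive intervals of
    bin_days and return (interval midpoint in days, percentile) per interval."""
    bins = defaultdict(list)
    for days, awl in results:
        bins[int(days // bin_days)].append(awl)
    return [((k + 0.5) * bin_days, percentile(v, p))
            for k, v in sorted(bins.items())]


# Hypothetical last results: (days since surgery, %AWL)
points = binned_percentiles([(50, 40.0), (60, 50.0), (150, 30.0)], p=50)
```

The resulting points per interval would then be smoothed, for instance with a polynomial fit, to obtain a continuous curve.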

Criteria

From a 2015 review of bariatric literature on the subject by Mann et al., we chose 11 popular criteria for weight loss success: ≥ 50%EWL, ≥ 25%EWL, ≥ 50%EBMIL, ≥ 25%EBMIL, ≥ 20%TWL, ≥ 15%TWL, ≥ 10-kg/m2 BMI reduction, thresholds < 30 BMI and < 35 BMI, < 10% weight regain relative to the absolute nadir weight (%WRnadir), and < 25% regain of the initial maximum weight loss (%WRmax) [28]. We added ≥ 25%TWL and ≥ 35%AWL.

Bariatric weight loss criteria are not very useful during the period of rapid initial weight loss. They are rather meant to assess the lowest (nadir) weight that results from it and the mid- and long-term weight thereafter. More than 95% of LRYGB patients reach their nadir weight only after the first year [20, 23]. We therefore used the benchmark chart only from 1 year upward to test the 11 weight loss criteria. For the two weight regain criteria < 10%WRnadir and < 25%WRmax, we compared the nadir results of the percentile curves with the corresponding results at 5 years, as suggested by Aminian et al. [29].
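The two weight regain criteria can be written out as simple ratios. The sketch below follows the wording of the criteria (regain relative to the nadir weight, and regain relative to the initial maximum weight loss); the function names are hypothetical and weights are assumed to be in kg.

```python
def wr_nadir(nadir_kg: float, current_kg: float) -> float:
    """%WRnadir: weight regained since nadir, as a percentage of nadir weight."""
    return 100 * (current_kg - nadir_kg) / nadir_kg


def wr_max(baseline_kg: float, nadir_kg: float, current_kg: float) -> float:
    """%WRmax: weight regained since nadir, as a percentage of the
    initial maximum weight loss (baseline weight minus nadir weight)."""
    max_loss = baseline_kg - nadir_kg
    if max_loss <= 0:
        raise ValueError("nadir weight must be below baseline weight")
    return 100 * (current_kg - nadir_kg) / max_loss


# Hypothetical patient: 120 kg at baseline, 80 kg at nadir, 86 kg at 5 years
meets_wr_nadir = wr_nadir(80.0, 86.0) < 10       # < 10%WRnadir
meets_wr_max = wr_max(120.0, 80.0, 86.0) < 25    # < 25%WRmax
```

Because %AWL is proportional to the weight lost for a given patient, the %WRmax ratio is the same whether it is computed from weights or from %AWL values at nadir and at follow-up; this does not hold for %WRnadir.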

Cutoff

We looked at four %AWL percentile curves as possible cutoffs between success (true positive) and failure (true negative). First, we asked 200 of the primary LRYGB patients preoperatively above which weight they would expect to be disappointed in the long run. They were subjects taking part in a different study on expectations and preferences with respect to bariatric surgery. With those answers and with their weight and height at baseline, we calculated their mean “expected disappointing %AWL” and applied that to our ≥ 5-year outcome to select the patients’ expectation cutoff percentile. Second, we used the 25th percentile (p25), being the lower border of the interquartile range (IQR), as cutoff, as suggested by Mor et al. [30]. Third, we chose percentile p15.9 as cutoff, representing 1 standard deviation (SD) below median. Finally, we applied ≥ 50%EBMIL, the criterion most commonly used by surgeons, to our outcome and used the percentage of results after the first year that were < 50%EBMIL to indicate the surgeons’ goal cutoff percentile [24].
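Two of these cutoffs are data-driven and one is fixed, which a short sketch may make concrete. The names below are hypothetical. The value 15.9 is taken here to be the share of a normal distribution lying more than 1 SD below its center, which is an assumption about how the percentile was chosen; the patients' expectation and surgeons' goal percentiles are both simply the percentage of results falling below a given threshold.

```python
from statistics import NormalDist
from typing import Iterable


def percent_below(values: Iterable[float], threshold: float) -> float:
    """Percentage of results strictly below a threshold; read as a percentile
    (e.g. 10.9 would be written p10.9)."""
    values = list(values)
    if not values:
        raise ValueError("no values")
    return 100 * sum(v < threshold for v in values) / len(values)


# Share of a normal distribution more than 1 SD below its center
ONE_SD_BELOW_PERCENTILE = 100 * NormalDist().cdf(-1)

# Hypothetical %EBMIL results after the first year
surgeons_goal_percentile = percent_below([45.0, 62.0, 71.0, 80.0], 50.0)
```

The same `percent_below` helper would give the patients' expectation percentile when applied to the ≥ 5-year %AWL results with the mean expected disappointing %AWL as threshold.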

Sensitivity and Specificity

We expressed sensitivity and specificity of the 11 criteria for successful weight loss as percentages and calculated them with the number of false positive results (successful according to the criterion, but below the cutoff percentile curve) and false negative results (unsuccessful according to the criterion, but above the cutoff percentile curve), with respect to each of the four cutoff percentiles. We considered a criterion “inadequate” for a specific cutoff if either sensitivity or specificity was below 60%. We considered a criterion “useful” for a specific cutoff if both sensitivity and specificity were above 80%.
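The calculation described above can be sketched as follows. Each result is reduced to two booleans: whether it meets the criterion, and whether it lies above the cutoff percentile curve. The function names and the "intermediate" label for criteria that are neither inadequate nor useful are assumptions made for this illustration, as is treating a result exactly on the curve as a success.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """results: (meets_criterion, above_cutoff_curve) per patient.
    Returns (sensitivity, specificity) as percentages."""
    tp = fp = fn = tn = 0
    for meets_criterion, above_cutoff in results:
        if above_cutoff:
            if meets_criterion:
                tp += 1   # successful by both criterion and benchmark
            else:
                fn += 1   # false negative: criterion misses a success
        else:
            if meets_criterion:
                fp += 1   # false positive: poor responder goes unnoticed
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need results both above and below the cutoff")
    return 100 * tp / (tp + fn), 100 * tn / (tn + fp)


def grade(sensitivity: float, specificity: float) -> str:
    """Apply the 60% and 80% thresholds from the text."""
    if sensitivity < 60 or specificity < 60:
        return "inadequate"
    if sensitivity > 80 and specificity > 80:
        return "useful"
    return "intermediate"  # label assumed for this sketch


# Hypothetical sample: 10 results above the cutoff curve, 5 below
sample = ([(True, True)] * 9 + [(False, True)]
          + [(True, False)] + [(False, False)] * 4)
sens, spec = sensitivity_specificity(sample)
```

A low specificity in this setup means that many results below the cutoff curve still satisfy the criterion, so poor responders are labeled successful.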

Results

Benchmark

Until January 2017, we performed 4497 primary LRYGB procedures. These patients had mean age 43 (range 18–66) years, mean baseline BMI 43.4 (range 34.5–71.5) kg/m2, and 16.7% were male, 10.3% were super-obese (BMI ≥ 50 kg/m2), and 22.6% had type 2 diabetes. There were no missing data for these variables. Time between first preoperative visit and operation was median 5.4 months. Preoperatively, they lost mean 6.2 (± 1 SD 4.1) kg. The longest possible follow-up per patient was median 24.3 months, mean 31.3 (range 0.0–108.9) months, while the longest actual follow-up was mean 22.0 (± 1 SD 19.3) months. One patient died within 30 days from bowel ischemia (0.02% mortality). Of the 3535 patients that were ≥ 1-year postoperative, 3031 (85.7%) had ≥ 1-year follow-up. Of the 518 patients that were ≥ 5-year postoperative, 304 (58.7%) had ≥ 5-year follow-up. All 4497 last weight loss results are presented in Fig. 1, with percentile curves p50 (median), p + 1 SD, and p − 1 SD.

Fig. 1

All 4497 last weight loss results after laparoscopic Roux-en-Y gastric bypass expressed as percentage alterable weight loss (%AWL) with percentile curves p50 (median), and ± 1 standard deviation (SD)

Our data came close to outcome reported in literature. We found seven studies reporting %TWL of more than 250 patients, more than 1 year after LRYGB [7, 31,32,33,34,35,36]. These results are presented together with weight loss of our patients with last results at 0.5, 1, 2, 3, 5, and 7 years in Table 2.

Table 2 Mean weight loss results and standard deviation (SD) after laparoscopic Roux-en-Y gastric bypass (LRYGB) at 0.5, 1, 2, 3, 5, and 7 years (± 10%) and compared to results from seven landmark studies on percentage total weight loss (%TWL) after LRYGB. Body mass index (BMI), percentage excess weight loss (%EWL), percentage excess BMI loss (%EBMIL), percentage alterable weight loss (%AWL)

Our data confirmed that %AWL was the only metric that is baseline BMI-independent. For %TWL, %EWL, %EBMIL, BMI, and BMI-reduction, we found significant differences in mean outcome between the heaviest 1000 patients (mean baseline BMI 50.0 kg/m2) and the lightest 1000 (mean baseline BMI 38.0 kg/m2): + 3.4%TWL, − 11.8%EWL, − 19.2%EBMIL, + 6.9-kg/m2 BMI, and + 5.1-kg/m2 BMI reduction (all p < 0.0001; Mann-Whitney U test); but not for %AWL: − 0.1%AWL (p = 0.896).
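The direction of these differences follows from the metric definitions alone. The sketch below is a worked illustration rather than a reanalysis of the data: it assumes two hypothetical patients with an identical result of 35%AWL and baseline BMIs equal to the two subgroup means (38 and 50 kg/m2), and converts that result into %TWL and %EBMIL. The function names are hypothetical.

```python
def twl_from_awl(awl: float, baseline_bmi: float) -> float:
    """%TWL implied by a given %AWL (denominator BMI instead of BMI - 13)."""
    return awl * (baseline_bmi - 13) / baseline_bmi


def ebmil_from_awl(awl: float, baseline_bmi: float) -> float:
    """%EBMIL implied by a given %AWL (denominator BMI - 25 instead of BMI - 13)."""
    if baseline_bmi <= 25:
        raise ValueError("%EBMIL is undefined for baseline BMI <= 25")
    return awl * (baseline_bmi - 13) / (baseline_bmi - 25)


AWL = 35.0  # assumed identical %AWL for both hypothetical patients
for baseline_bmi in (38.0, 50.0):
    print(f"baseline BMI {baseline_bmi:.0f}: "
          f"{twl_from_awl(AWL, baseline_bmi):.1f} %TWL, "
          f"{ebmil_from_awl(AWL, baseline_bmi):.1f} %EBMIL")
```

With equal %AWL, the heavier patient shows a slightly higher %TWL and a markedly lower %EBMIL, the same signs as the subgroup differences reported above.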

Cutoff

In total, 3031 patients (67.4%) had their last measurement after the first year. Of these results, 10.9% were < 50%EBMIL. We therefore used p10.9 as the surgeons’ goal cutoff percentile curve. Characteristics of the subgroup of 200 patients who were interviewed about their expectations after bariatric surgery were mean age 45 (range 18–66) years, mean baseline BMI 42.2 (range 35.0–58.8) kg/m2, 22.0% male, 8.0% super-obese, and 12.0% type 2 diabetes. The distribution of their long-term expected disappointing weight loss results is presented in Fig. 2. The mean value was 34.6%AWL (or 23.9 (range 3.2–44.1) %TWL), which corresponded with percentile p31.6 of the 304 results at ≥ 5 years. We therefore used p31.6 as the patients’ expectation cutoff percentile curve. Percentile curves p50 (median), p31.6 (patients’ expectation cutoff), p25 (IQR cutoff), p15.9 (1 SD below median cutoff), and p10.9 (surgeons’ goal cutoff) from 1 year up are presented in Fig. 3, together with all last results beyond 1 year that correspond with ≥ 50%EBMIL. The nadir results of these four cutoff percentile curves were 42.2%AWL at 17 months, 39.8%AWL at 16.5 months, 36.4%AWL at 15.5 months, and 33.8%AWL at 15 months, respectively.

Fig. 2

The distribution of the expected disappointing long-term weight loss results expressed as percentage alterable weight loss (%AWL) of 200 bariatric patients interviewed preoperatively about their postoperative expectations

Fig. 3

Percentile curves (from top to bottom) p50 (median), p31.6 (patients’ expectation cutoff), p25 (inter quartile range cutoff), p15.9 (1 standard deviation below median cutoff), and p10.9 (surgeons’ goal cutoff), from 1 year up, expressed as percentage alterable weight loss (%AWL). The dots are all > 1-year %AWL results that correspond with ≥ 50 percentage excess body mass index loss (≥ 50%EBMIL)

Sensitivity and Specificity

Sensitivity and specificity of the 11 criteria for weight loss success based on the 3031 results beyond 1 year and with reference to each of the four cutoff percentile curves are presented in Fig. 4. Weight loss criteria ≥ 25%EWL, ≥ 25%EBMIL, ≥ 15%TWL, and < 35BMI could be considered inadequate for all four cutoffs, as their specificities were (far) below 60%. Only four criteria ≥ 10 kg/m2, ≥ 50%EWL, ≥ 25%TWL, and ≥ 35%AWL could be considered useful, as their sensitivities and specificities were all above 80%, but only for cutoffs p15.9 and p10.9.

Fig. 4

Sensitivity and specificity (expressed as percentages) of 11 criteria for weight loss success with reference to four cutoff percentile curves p31.6 (patients’ expectation cutoff), p25 (inter quartile range cutoff), p15.9 (1 standard deviation below median cutoff), and p10.9 (surgeons’ goal cutoff). Body mass index (BMI); percentage excess BMI loss (%EBMIL); percentage excess weight loss (%EWL); percentage total weight loss (%TWL); percentage alterable weight loss (%AWL). Red: inadequate criteria with sensitivity or specificity < 60%. Green: useful criteria with sensitivity and specificity ≥ 80%

All percentile curves showed weight regain. Compared to the < 10%WRnadir criterion, at 5 years, the four cutoff percentile curves gained 7.8, 8.5, 9.2, and 9.8% from their absolute nadir result for p31.6, p25, p15.9, and p10.9, respectively. Compared to the < 25%WRmax, at 5 years, the cutoff percentile curves p31.6, p25, p15.9, and p10.9 lost 18.6, 22.1, 27.0, and 31.4% from their initial maximum weight loss.

Discussion

For the first time, an attempt was made to systematically validate common bariatric weight loss criteria for success. We compared them to mid-term results (1–9 years) of a large cohort (n = 3031) with a wide variety of patients (BMI 35–72 kg/m2; 18–66 years; 1/6 male; 1/5 type 2 diabetes) after the gold standard bariatric procedure (LRYGB). Results were (very) disappointing for criteria ≥ 50%EBMIL, ≥ 25%EBMIL, ≥ 25%EWL, ≥ 15%TWL, ≥ 20%TWL, and < 35BMI. They all had low specificities, leaving too many poor responders unnoticed. The popular criterion ≥ 50%EBMIL differentiated successful results with ease (sensitivities > 96%), but missed 28–66% of the truly unsuccessful results. This inadequate specificity is clearly visible in Fig. 3, where many of the dots representing ≥ 50%EBMIL results lie below all four cutoff percentile curves (false positive results). Remarkably, the old ≥ 50%EWL criterion (based on life insurance tables for ideal weight) had better specificities, very likely because the %EBMIL metric is influenced more by differences in baseline BMI than %EWL [10, 22]. This study again shows how important this distorting effect can be in assessing bariatric weight loss results [8, 12, 16, 21, 23].

Criteria for weight loss success express treatment goals. They are used as simple tools to conveniently judge real-life results. The benchmark to test them should therefore be as close to real-life outcome as possible. As explained before, there is strong evidence that, although bariatric patients typically show a wide variety of BMI at baseline, their weight loss essentially is baseline BMI-independent. We therefore constructed a benchmark that was baseline BMI-independent as well. It might explain why the most adequate criteria found in this study (≥ 35%AWL and ≥ 25%TWL) were based on metrics least influenced by baseline BMI. Bariatric weight loss is also essentially dynamic over time. We therefore constructed a benchmark that expressed our outcome against time, using percentile curves. This might explain why none of the criteria were fully sensitive and specific, as they are all static over time. Still, some turned out to be quite useful among the dynamic real-life results, like ≥ 50%EWL, > 10-kg/m2 BMI reduction, ≥ 35%AWL, and ≥ 25%TWL.

A limitation of our study is that we only used LRYGB outcome for our benchmark, while the criteria are not meant to be procedure specific. As bariatric outcome varies for different procedures, we needed a gold standard to compare them to and considered the LRYGB procedure as such. Criteria commonly used in bariatric surgery are also not specified for gender, age, ethnicity, comorbidities, or other confounders on bariatric outcome. It was therefore important that our bariatric cohort was both large enough and sufficiently representative to test them. The external validation with landmark studies on LRYGB outcome showed that this was the case. It is important that other researchers confirm our findings, with other cohorts, that are large and representative as well; and with outcome after other procedures, like gastric banding, sleeve gastrectomy, and duodenal switch.

Another limitation of this study was the imbalance between short- and longer-term results. Fortunately, this was caused more by the lower numbers of procedures performed in the earlier days of our bariatric program than by a lack of follow-up.

As the criteria tested are widely used by bariatric professionals and researchers, they represent a consensus. They could therefore help identify the most appropriate cutoff in a percentile-based benchmark. Although criteria ≥ 25%EBMIL, ≥ 25%EWL, ≥ 15%TWL, and < 35 BMI seemed to be inadequate for all four cutoff percentiles, the other weight loss criteria clearly returned more acceptable specificities for p15.9 and p10.9 than for p25 and p31.6. Furthermore, the weight regain criterion < 25%WRmax could be situated between p25 (22%) and p15.9 (27%). All of this suggests that the %AWL p15.9 (1 SD below median) is the most expedient cutoff percentile curve for defining success. It would mean that at any time after the operation, 84.1% of all bariatric patients can be categorized as successful, whatever BMI they had before the operation. In our LRYGB chart, only the ≥ 35%AWL and ≥ 25%TWL criteria came close to that curve, both with > 90% sensitivity and specificity.

Specificities of the ≥ 25%EBMIL, ≥ 25%EWL, ≥ 15%TWL, ≥ 20%TWL, and < 35BMI criteria were so low that they should be abandoned altogether. The ≥ 10-kg/m2 criterion has merits, with better specificities than ≥ 50%EBMIL and sensitivities comparable to ≥ 35%AWL. The < BMI30 threshold had low sensitivities (< 80%), but high specificities (> 80%) for all four cutoffs, including the patients’ expectation cutoff p31.6. It should therefore be useful in preoperatively informing patients on what (not) to expect: that not reaching below 30BMI is not exceptional. With only one postoperative measurement per patient, we could not calculate sensitivities and specificities for the weight regain criteria. The < 25%WRmax criterion did fit the benchmark between nadir and 5 years; the < 10%WRnadir criterion did not.

Popular criteria ≥ 50%EBMIL, ≥ 25%EBMIL, ≥ 20%TWL, and < 35BMI all had low specificities, leaving too many unsuccessful results unnoticed. This has never been demonstrated before and calls for a radical change in how we define bariatric weight loss success. Looking for yet another static criterion might not be the right way to go. Our second best static criterion ≥ 25%TWL had 91% sensitivity and 95% specificity, yet Wood et al. and Sjöström et al. showed that mean LRYGB outcome can drop below 25%TWL beyond 7 years [7, 35]. This suggests that static criteria suited for mid-term outcome might not be appropriate for long-term results.

A better way to define success would be with elaborate baseline BMI-independent percentile charts, based on large, representative cohorts of bariatric patients. They present success in a dynamic way, with a cutoff between good and poor responders (percentile curve 1 SD below median) that evolves over time. When similar charts are available for other types of bariatric procedures, it will become possible to compare success of different patients, with different preoperative BMIs, after different bariatric procedures, at different postoperative intervals, in a clear and simple way. Cumbersome criteria for weight regain would become redundant. All patients have to do to stay successful, even if some weight regain occurs, is to keep above the cutoff curve. Given the relative complexity of the %AWL algorithm, all of this is only feasible digitally, with applications designed to convert individual postoperative weight results that are hard to compare into percentile results that can easily be compared and plotted in a chart.