The article by Harii and Kawashima [1] is a classic paper derived from a well-designed study that nevertheless illustrates some of the prime limitations of rigorously scientific papers concerning aesthetic procedures. Their paper (which was sponsored by Allergan, Inc.) mines the same vein as an earlier work [2] but in a decidedly different patient population. The authors are quite right to point out that the majority of studies on aesthetic injections are performed in populations of Caucasian women. While they wisely do not go so far as to hint that there may be differences at the neuromuscular junction between races, they point out that facial skin and musculature differ between populations. To these differences I would also add fat, not only in amount but also in placement.

A common four-point scale is used to measure efficacy, with success (response) defined as moving either a moderate or severe group of lines to a mild or nonexistent group. Clearly, this introduces many variables into the equation, such as muscle mass, muscle length, amount of subdermal fat, skin thickness, prior skin damage from motion, and skin photoaging. By its very design, this study looks well beyond the neuromuscular junction.

Not that the design is ideal. The study length of 16 weeks is clearly too short; anecdotally, we know that the effect of the toxin can last well beyond this point. Patients were seen only every 4 weeks, so any claims about duration are dubious indeed. The study is also purported to be double-blinded, but standard 100-U vials of BOTOX® were used. To achieve the two different doses of 10 and 20 U with the same volume of injection, two different volumes were used to reconstitute the vials. While not directly stated in the paper, simple arithmetic indicates that the amounts used were 2.5 and 5 cc of normal saline (NS). The authors then state that an “identical volume of placebo” was added to placebo vials containing only sodium chloride. My question is: Identical to what? The vials were coded so as not to indicate which had BOTOX and which had placebo. This implies that the injectors themselves drew the study medication from the vials. So, which amount was placed in the placebo vials, 2.5 or 5 cc? In either case, the double-blind is broken unless every vial at a given site held the same volume, which would mean that each site tested only one dose against placebo and would allow for even more bias.
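The “simple arithmetic” behind the 2.5 and 5 cc figures can be sketched as follows. This is an illustration only: the total injected volume of 0.5 mL (for example, five sites at 0.1 mL each) is my assumption, not a figure stated in the paper, and the function and variable names are hypothetical.

```python
# Sketch of the reconstitution arithmetic for a standard 100-U vial.
# ASSUMPTION: total injected volume per patient is 0.5 mL; the paper
# does not state this, so treat the absolute volumes as illustrative.

VIAL_UNITS = 100.0                  # units per vial (stated in the text)
ASSUMED_INJECTED_VOLUME_ML = 0.5    # hypothetical total injection volume


def diluent_volume_ml(dose_units, injected_volume_ml, vial_units=VIAL_UNITS):
    """Return the saline volume needed so that `injected_volume_ml`
    drawn from the vial delivers `dose_units`."""
    concentration_u_per_ml = dose_units / injected_volume_ml
    return vial_units / concentration_u_per_ml


# Diluent volume required for each of the two study doses.
volumes = {dose: diluent_volume_ml(dose, ASSUMED_INJECTED_VOLUME_ML)
           for dose in (10, 20)}

for dose, volume in volumes.items():
    print(f"{dose} U dose: reconstitute with {volume} mL NS")
```

Whatever injected volume is assumed, the 10-U vials need exactly twice the diluent of the 20-U vials, so the two active preparations cannot both match a single placebo volume.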

The primary efficacy endpoint was physician-rated line severity at maximal contraction at 4 weeks. Other data collected were physician-rated line severity at rest, patient assessment of line improvement at each visit, and patient satisfaction at weeks 4 and 16. Patients in the study were nearly evenly split between moderate and severe lines. This means that, to be considered responders, about half the subjects (those with severe lines) would need to improve by at least 2 points. Unsurprisingly, at the primary efficacy endpoint patients receiving drug improved while those receiving placebo did not. However, there were certainly a few subtle surprises hidden in the data. First, the response rate in the placebo patients was zero. Not one of the 48 patients was graded even to a 1 (minimal lines). That means that not one patient was a “low 2,” in other words a patient of borderline minimal/moderate line severity who, even by expert graders, could have been given either a 1 or a 2 at any moment. Not a single patient had a reduction in muscle function from the sheer trauma of having a needle inserted into the muscle and/or NS injected within its inelastic fascia. This is unusual.
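The responder rule described above can be written out as a short sketch. The numeric coding (0 = none, 1 = mild/minimal, 2 = moderate, 3 = severe) follows the scores quoted in this commentary; the function names are hypothetical and this is not the authors' analysis code.

```python
# Sketch of the responder definition on the four-point severity scale.
# Coding assumed from the commentary: 0 none, 1 mild, 2 moderate, 3 severe.

ELIGIBLE_BASELINES = (2, 3)   # only moderate or severe lines were enrolled


def is_responder(baseline, follow_up):
    """A responder moves from moderate/severe (2 or 3) to mild/none (1 or 0)."""
    if baseline not in ELIGIBLE_BASELINES:
        raise ValueError("baseline must be moderate (2) or severe (3)")
    return follow_up <= 1


def points_needed(baseline):
    """Minimum improvement, in scale points, required to count as a responder."""
    return baseline - 1


for baseline in ELIGIBLE_BASELINES:
    print(f"baseline {baseline}: needs at least {points_needed(baseline)} point(s)")
```

A severe patient who improves by a single point (3 to 2) is therefore scored as a non-responder, which is one way a rigid scale can hide real improvement.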

The real surprise is the 10-U dose. The U.S. label is for 20 U in five equally divided doses across the glabella. Clearly, the authors expected the 20-U dose to be the “winner” in their study as well. It was not. At their primary endpoint, physician-rated line severity at maximal contraction at 4 weeks, there was no statistically significant difference between the 20-U group and the 10-U group. In fact, across all maximal contraction endpoints there was never a statistically significant difference between 10 U and 20 U (Fig. 2)! Figure 3 plots the mean change in line severity at maximal contraction. Certainly, one might think that doubling the dose would lead to more 2- and 3-point improvements. But no, once again, there was no statistically significant difference between 10 U and 20 U at any time point.

For this study, the primary efficacy endpoint was at maximum frown, not at rest. Data at rest in these studies are notoriously more difficult to score because the differences between the four grades of glabellar rhytids are smaller. There are even fewer data to go around, as over half the subjects in each of the three groups (10 U, 20 U, and placebo) had scores of 0 or 1 at rest and thus were limited in the amount of improvement possible. Four-week data are also typically used for comparison. Then why is so much written in this paper about resting line data at 8 weeks? A cynic might point out that this might be because it is the only data point measured by physicians where there is a statistically significant difference between the 10- and 20-U cohorts.

As for patient-derived data, patients were not asked to score their lines directly but to assess improvement, with at least moderate improvement being the benchmark for response. Again, it is only at 8 weeks that the 10-U and 20-U doses show a statistically significant difference in outcome.

The final data set collected was patient satisfaction. This, after all, is why we do what we do. “Improvement” in facial lines or in other objective measures of facial beauty or attractiveness is meaningless if it does not correlate with patient satisfaction. We try to make patients appear younger or more attractive so that they feel better about themselves, not for some absolute scoring scale. Once again, while both doses dramatically improved satisfaction compared to placebo, there was no statistically significant difference between the 10-U and 20-U groups. Clearly, some of the lack of difference noted is due to the rigid four-point study design. Rigid design is necessary for truly scientific papers. But rigid, scientific study design frequently misses nuances of improvement and attractiveness. Rigid study design with cosmetic endpoints is akin to trying to digitize a strictly analog world. That is unlikely to succeed entirely. Yet, these types of papers are critically important when analyzing new drugs and devices. I am all for “evidence-based medicine” when it comes to cholesterol-lowering drugs and antihypertensives. Beware those words when it comes to aesthetic outcomes.

Overall, this is a landmark paper that shows the effectiveness of BOTOX (because all neurotoxins are different, the authors note that “these results do not apply to any formulation of BoNTA other than that used in the present study”) for improving glabellar rhytids in the Japanese female population. These findings were unequivocal at rest as well as at contraction and correlated with subjects’ assessment of improvement and satisfaction. The authors have done an excellent job of scientifically getting us to an aesthetic endpoint, which is not easy. The study, despite a small flaw in its double-blinding, is well structured. I have designed studies for years for several neurotoxin formulations, including that of the sponsor, and have yet to see a study completely free from bias. It is an extremely difficult thing to accomplish, but it must remain our goal.

I disagree with the discussion and conclusions of this paper. Rather than looking for possible causes of the lack of difference between the two doses, the discussion massages it. A frank discussion would have brought up the possibility of skin and fat differences, but mostly muscle size differences, between patient populations. Is it possible that Japanese women can have the same effect from a lower dose of BOTOX than Caucasian women? I certainly think so. However, this very basic conclusion from the data is never even addressed. Maybe the authors were reluctant to assess a possible difference in, say, muscle mass based solely on race. However, I hope that is not the case because potential racial differences were the driving point of the entire paper! In fact, in a study of crow’s feet patterns to be treated with BOTOX, there were differences between races [3]. Caucasian women are most likely to have a full-fan pattern of rhytids, while Asian women are more likely to have concentrated rhytids at the lateral canthus.

This is not to say that I think 10 U is an ideal dose or equivalent to 20 U in women of Japanese origin. I certainly do not. I have been playing devil’s advocate to drive the discussion. Do I think 20 U is a high dose for most Japanese women? Yes, I do. In my practice, I estimate that the median glabellar dose for women of Japanese descent (who are different from Japanese women because of diet and other environmental factors; remember we are trying to eliminate bias here) is between 15 and 17.5 U (nearly midway between 10 and 20 U). Unconstrained in my practice by the rigors of a scientific study, I base my doses on estimated muscle mass, with no ideal dose. There are a variety of factors that may allow equivalent results with a lower dose, but you will not find them in the discussion. To be fair, the discussion does point out the lack of significant differences between the doses, but in a roundabout manner. It states that “the two doses of BoNTA did not differ significantly on several of the endpoints.” Actually, they did not differ significantly on any of the physician-measured endpoints except one, nor on any of the subject-measured endpoints except one. Trying to interpolate a potential difference in duration of effect (which was only 1.5 weeks) from monthly data is difficult at best and a reach at worst.

When the sponsor of the paper is the drug manufacturer and marketer, I feel it is especially important for any author to attempt to eliminate even the appearance of bias. However, the above differences I have with the authors are relatively small. The conclusion of this paper, I feel, is completely wrong. The conclusion states “...the 20-U dose provides greater efficacy...than does the 10-U dose.” Maybe in the real world it does, but not according to this paper.