Trichotillomania (TTM) studies generally include measures of symptom severity such as standardized self-report questionnaires (e.g., Massachusetts General Hospital-Hairpulling scale: MGH-HPS; Keuthen et al. 1995) or clinician ratings derived from structured interview (e.g., NIMH-Trichotillomania Symptom Severity Scale: NIMH-TSS; Swedo et al. 1989). It is also common in TTM treatment studies to incorporate objective measurement of alopecia, or hair loss resulting from hair pulling. Assessment of alopecia is not a substitute for general measures of TTM symptom severity, as it does not address indicators such as the frequency or intensity of urges to pull hair, time spent pulling, distress stemming from urges to pull, and so forth. Even as a marker of the amount of recent hair pulling behavior itself, alopecia measurement would be problematic, because the time required for regrowth of hair that has been pulled can vary by pulling site, sex, and age of the hair puller (Myers and Hamilton 1951).

Accordingly, we consider alopecia not a primary outcome measure for TTM research but a potentially valuable supplementary measure. Alopecia measurement may contribute to the social validation of TTM treatment effects. To be sure, each person with TTM has her or his own set of concerns about it, but when one performs an “inconvenience review” (e.g., Stanley and Mouton 1996) in conducting habit reversal therapy (HRT; Azrin et al. 1980) for TTM, a commonly cited negative effect of pulling behavior is the resultant hair loss and its effect on one’s appearance. The widespread use of costly and time-consuming methods for disguising the extent of hair loss (wigs, eyeliner or other makeup, avoidance of activities that might expose hair loss) also attests to the centrality of alopecia to the impact of TTM on self-esteem, relationships and quality of life more generally (e.g., Diefenbach et al. 2005b).

The research reported in this article examined an alopecia photo rating measure developed by Tolin et al. (2002). The measure consists of one item, a rating from 1 [no evidence of hair pulling] to 7 [large bald spots that are difficult to conceal] of hair loss evident in a photo of the person’s most affected site. In a recent stepped care treatment study for TTM (Rogers et al. 2014), we reported some encouraging data on the alopecia rating. First, interrater reliability was high. Two coders (masked to time of assessment and treatment condition) rated each photo. Their average rating was highly reliable (ICC = 0.82), consistent with earlier studies using this rating scale based on live observation (Diefenbach et al. 2005a) or photographs (Tolin et al. 2002). Second, the alopecia rating proved to be sensitive to change during treatment. Average scores for the sample as a whole declined significantly from baseline to the end of the stepped care program (Rogers et al. 2014).

Several other important questions about the alopecia measure could not be addressed in that initial brief report of the clinical trial, however, and have not been the subject of any published studies to date. First, the acceptability of the measure is unknown. Some patients pull mainly from private areas (e.g., pubic hair), and even those who pull from more visible sites (e.g., head, arms) might balk at having photos taken of affected sites for rating purposes. Whatever its other virtues, the alopecia rating will be of limited utility if a high proportion of TTM patients refuse to be photographed for it.

Second, retest reliability of alopecia ratings is uncertain. Knowing how stable scores on a measure are in the absence of intervention is useful for interpretation of treatment-related change. There was no significant change in scores on the alopecia rating during a waitlist period in Diefenbach et al. (2006), but no retest reliability correlations addressing rank-order stability have been reported for this measure.

Third, convergent validity of the alopecia photo rating with other ways (e.g., live interviewer rating) of measuring alopecia has not been studied thoroughly. In one study (Diefenbach et al. 2005a), alopecia ratings correlated very highly (0.97) with the severity item (#6) from the Psychiatric Institute Trichotillomania Scale (PITS; Winchel et al. 1992), which is based on the PITS interviewer’s inspection of the most affected pulling site and yields a score from 0 (“No loss”) to 7 (“Total loss of hair of brows or lashes or almost total loss of scalp hair or hair on other body part”). However, in that study the alopecia rating scale itself was made live by the same diagnostic interviewer as was providing the PITS item rating. It is not known whether alopecia rating from a photo, made by a different person, would converge as well with interviewer ratings of severity of hair loss.

Fourth, concurrent validity of the alopecia rating in relation to functional impairment stemming from TTM has not been addressed in published studies. We would expect an association with impairment of one’s social life in particular, as opposed to work/school or family/home life functioning. TTM as a whole can certainly interfere with work or home responsibilities, but the extent to which it does so would be less likely to vary with extent of hair loss. Experimental research has shown that more severe hair loss is associated with greater social rejection and lower acceptance (Ricketts et al. 2012).

Finally, information on the association of alopecia with TTM symptom severity is limited. As noted earlier, there is reason to believe that alopecia could be largely independent of symptom severity, but it remains of interest to know empirically what the association is. Rating alopecia from live observation, Diefenbach et al. (2005a) found nonsignificant correlations with total TTM symptom severity. However, statistical power constraints (sample N = 28) may have contributed to nonsignificance, and the effect sizes themselves varied considerably from r = 0.10 (with MGH-HPS) to 0.25 (with NIMH-TSS) to 0.53 (with PITS).
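To make the power concern concrete, the following is a minimal sketch of approximate power for a two-sided test of a Pearson correlation using the Fisher z transformation. It is an illustration only, not an analysis reported in any of the cited studies; the function name `correlation_power` and the 0.05 alpha level are assumptions introduced here.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(rho, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0.

    Uses the Fisher z approximation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). Hypothetical helper, for illustration.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    noncentrality = atanh(rho) * sqrt(n - 3)
    upper_tail = 1 - standard_normal.cdf(z_critical - noncentrality)
    lower_tail = standard_normal.cdf(-z_critical - noncentrality)
    return upper_tail + lower_tail


# Effect sizes quoted in the text, evaluated at N = 28.
for rho in (0.10, 0.25, 0.53):
    print(f"rho = {rho:.2f}: approximate power = {correlation_power(rho, 28):.2f}")
```

Under this approximation, a sample of 28 gives roughly a one-in-four chance of detecting a true correlation of 0.25, which is in line with the caution expressed above about interpreting nonsignificant results.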

In summary, this study was intended to provide further information on an alopecia photo rating measure (Tolin et al. 2002) used in TTM research. Acceptability of the measure (i.e., willingness to have the spot of greatest hair loss photographed for this purpose), retest reliability, convergence with ratings made by a live interviewer, and associations with degree of social impairment and with overall TTM symptom severity were evaluated in the context of a clinical trial of a stepped care approach for treatment of TTM.

Method

A detailed description of the larger clinical trial from which this study is derived may be found in Rogers et al. (2014), which focused on the efficacy of web-based self-help and acceptability of stepped care treatment to participants with TTM. The method is briefly summarized below with emphasis on measures used in studying the alopecia rating scale.

Participants

Participants were 60 adults with TTM (57 female) enrolled in the study. Their average age was 33.18 (SD = 10.87). Three-quarters were Caucasian (75 %), while 17 % were African American, 3 % Asian, 2 % Native Hawaiian/other Pacific Islander, and 3 % “other” race. One participant (2 %) was Hispanic. The sample was recruited via newspaper and online ads and clinician referrals. Inclusion criteria were: at least 18 years old, regular access to the Internet, and meeting DSM-IV-TR criteria for TTM except that criteria B (tension before pulling) and C (pleasure, relief, or gratification when pulling) were not required, just as they are not required in DSM-5 (American Psychiatric Association 2013).

Prospective participants were excluded if they showed any of the following within the past month: (1) suicidality; (2) major depressive episode; (3) psychosis; (4) severe anxiety; or (5) substance abuse. These are exclusion criteria for users of our Step 1 intervention, StopPulling.com, outside the research context. Prospective participants were also excluded if they were in concurrent psychotherapy for TTM, or were taking medication for TTM and not on a stable dose for ≥ 4 weeks.

Materials

Measures of Exclusion and Inclusion Criteria

Suicidality was assessed by administering the first 5 items of the Beck Scale for Suicide Ideation (Beck et al. 1988) with a “past month” time frame. Any response greater than 0 was followed up with a clinical interview. If the prospective participant reported a recent suicide attempt or active suicidal ideation, she or he was excluded from the study. To assess the other exclusion criteria we used the Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition With Psychotic Screen (SCID-I/P; First et al. 2002), a semi-structured diagnostic interview. If the prospective participant received a rating of 3 (threshold or true) for current major depressive episode or at least one symptom of delusion or hallucination, or met criteria for any anxiety disorder with current rating of “severe”, or met criteria for a substance use disorder with active use in the past month, she or he was excluded from the study. All prospective participants excluded from the study were given suitable clinical referrals.

The Trichotillomania Diagnostic Interview (TDI; Rothbaum and Ninan 1994) was used in diagnosing TTM. All TDIs were recorded, and a 20 % random sample of the videos was coded by a second rater (masked to assessment period and treatment condition). Interrater agreement was high (92 %, kappa = 0.77).

TTM Symptoms

The Massachusetts General Hospital Hairpulling Scale (MGH-HPS; Keuthen et al. 1995) is a 7-item self-report measure of past-week TTM symptoms. Each item is rated on a 0 to 4 scale (total = 0 to 28). Internal consistency is high, as are short-term retest reliability and discriminant validity in relation to anxiety and depression (O’Sullivan et al. 1995). In our sample, alpha for the MGH-HPS was 0.74.

The Psychiatric Institute Trichotillomania Scale (PITS; Winchel et al. 1992) is a six-item, semi-structured interviewer-rated measure of TTM symptoms. Each item is rated on a 0 to 7 scale (total = 0 to 42). The PITS shows strong convergent validity with other clinician-rated TTM measures, albeit low internal consistency (Diefenbach et al. 2005a). In our sample as well, internal consistency was low (alpha = 0.37). Our PITS interviews were recorded, and a 20 % random sample selected for coding by a second rater (masked to treatment condition and assessment point). Item 6 (severity of hair loss) could not be evaluated from videos. For the sum of items 1–5, interrater reliability was high (r = 0.95).

Impairment

The Sheehan Disability Scale (SDS; Sheehan 1983) is a 3-item measure of impairment in work/school, social life, and family life/home responsibilities. Each item is scored on a 0 (“not at all”) to 10 (“extremely”) scale with regard to the extent to which symptoms have disrupted one’s life in that domain. For concurrent validity analysis, we used the social life item only. In a large TTM sample, SDS scores correlated positively with TTM symptom severity (Woods et al. 2006).

Alopecia

As described in the Introduction, the Alopecia rating (Tolin et al. 2002) is a one-item evaluation of hair loss evident in a photo of the most affected pulling site. Photographs were taken using a Sony Cyber-shot W350 digital camera (14.1 megapixels; 4x optical zoom) and then transferred to a portable hard drive and stored in a secure room. Two raters, masked to experimental condition and to time of assessment, independently rated each photo on a 1 (no evidence of hair pulling) to 7 (large bald spots that are difficult to conceal) scale. Scores were averaged across raters.
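For readers who wish to compute the reliability of an averaged two-rater score of this kind, the sketch below implements one common intraclass correlation, ICC(2,k) (two-way random effects, absolute agreement, average of k raters). The ICC variant is an assumption made for illustration, because the form used in the trial is not specified here, and the ratings shown are hypothetical rather than study data.

```python
from statistics import mean


def icc_average_two_way_random(ratings):
    """ICC(2,k) from a complete photos-by-raters table of scores.

    ratings: list of rows, one per photo, each holding k raters' scores.
    Hypothetical helper; the ICC form is an assumption, not the study's.
    """
    n = len(ratings)          # number of photos
    k = len(ratings[0])       # number of raters
    grand_mean = mean(score for row in ratings for score in row)
    photo_means = [mean(row) for row in ratings]
    rater_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_photos = k * sum((m - grand_mean) ** 2 for m in photo_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_total = sum((s - grand_mean) ** 2 for row in ratings for s in row)
    ss_error = ss_total - ss_photos - ss_raters

    ms_photos = ss_photos / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_photos - ms_error) / (ms_photos + (ms_raters - ms_error) / n)


# Hypothetical 1-7 ratings from two raters for six photos (not study data).
hypothetical_ratings = [(6, 5), (3, 3), (7, 6), (4, 5), (2, 2), (5, 5)]
averaged_scores = [mean(row) for row in hypothetical_ratings]
print(averaged_scores)
print(round(icc_average_two_way_random(hypothetical_ratings), 2))
```

The averaged scores correspond to the per-photo values analyzed in this report; the ICC describes how dependable those averages would be if a different pair of raters had been used.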

Procedure

Design Overview

At the end of a baseline pre-treatment assessment session including the measures described earlier and other interviews and questionnaires not relevant to this report, participants were randomized to immediate Step 1 access or to a waitlist (WL) condition; those in the WL condition completed a safety check-in by phone after 5 weeks and a full post-wait list assessment in person 10 weeks after baseline, prior to Step 1. Step 1 consisted of 10 weeks of (free) access to web-based self-help via StopPulling.com (with another phone check-in at the midpoint). At an in-person post-Step 1 assessment, participants chose whether to enter Step 2, in-person individual therapy (habit reversal therapy; HRT). Regardless of what they chose, an additional in-person assessment (post-Step 2) was conducted 8 weeks later. Finally, 3 months later a follow-up in-person assessment was conducted. Data for the present study were derived from baseline or post-wait list assessment, so we do not consider the treatments further, but details may be found in Rogers et al. (2014).

Results

Table 1 shows means and standard deviations, as well as intercorrelations of all measures included in this report.

Table 1 Descriptive data and intercorrelations of baseline measures

Acceptability

Acceptability of the alopecia rating was high. At the baseline assessment session, 53 of 60 participants (88 %) consented to have the most affected pulling site photographed for the purposes of alopecia rating. Average scores were high in absolute terms, but with enough spread to permit study of individual differences in alopecia (M = 5.12, SD = 1.51 on 1–7 scale).
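As a supplementary illustration, the uncertainty around the 53 of 60 acceptance proportion can be gauged with a Wilson score interval. The interval is not reported in the study; the function name `wilson_interval` and the 95 % confidence level are assumptions introduced for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denominator = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denominator
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denominator
    )
    return center - half_width, center + half_width


# Counts taken from the text: 53 of 60 participants consented to a photo.
lower, upper = wilson_interval(53, 60)
print(f"Observed: {53 / 60:.1%}; 95% Wilson interval: {lower:.1%} to {upper:.1%}")
```

With a sample of this size the interval spans roughly 78 % to 94 %, so the acceptance rate in the wider TTM population could plausibly be somewhat lower or higher than the observed 88 %.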

Retest Reliability

For the subsample of participants (n = 24) randomly assigned to the waitlist condition and with alopecia ratings from baseline as well as the post-waitlist assessment 10 weeks later, the mean alopecia rating across two raters was generally stable. The baseline mean (M = 5.15, SD = 1.55) did not change significantly within 10 weeks (M = 5.08, SD = 1.49), paired t (23) = 0.23, p = .82. The retest reliability correlation was 0.63, p < .001.
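The paired t statistic can be approximately reconstructed from the summary values above, because the variance of a difference score depends on the two standard deviations and the retest correlation. The sketch below is a consistency check under the assumption that the rounded values in the text are adequate inputs; the function name is hypothetical.

```python
from math import sqrt


def paired_t_from_summary(mean_1, sd_1, mean_2, sd_2, r, n):
    """Paired t statistic and degrees of freedom from summary statistics.

    Var(difference) = sd_1^2 + sd_2^2 - 2 * r * sd_1 * sd_2.
    """
    variance_of_difference = sd_1 ** 2 + sd_2 ** 2 - 2 * r * sd_1 * sd_2
    standard_error = sqrt(variance_of_difference / n)
    return (mean_1 - mean_2) / standard_error, n - 1


# Rounded values as reported in the text for the waitlist subsample.
t_value, df = paired_t_from_summary(5.15, 1.55, 5.08, 1.49, 0.63, 24)
print(f"t({df}) = {t_value:.2f}")
```

This yields a value of about 0.26, close to the reported 0.23; the small gap may reflect rounding of the means, standard deviations, and correlation before they were entered here.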

Convergent Validity

Baseline mean alopecia ratings correlated significantly and substantially (r = 0.51, p < .001) with severity ratings (item #6) from the PITS interview.

Concurrent Validity

The mean alopecia rating at baseline did not correlate significantly with the self-rated extent to which symptoms had disrupted the participant’s social life (SDS item 2), r = 0.16, p = .27.

Association with Symptom Severity

The mean alopecia rating did not correlate significantly at baseline with either self-reported symptoms (total MGH-HPS score, r = −0.06, p = .68) or interviewer-rated symptoms (total PITS score, r = 0.26, p = .06).
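A rough check of these significance levels can be made with the Fisher z approximation. The sketch assumes that all 53 photographed participants contributed to each correlation, which the text does not state, so the resulting p values are approximate and need not match the reported ones exactly.

```python
from math import atanh, sqrt
from statistics import NormalDist

ASSUMED_N = 53  # assumption: every photographed participant had complete data


def fisher_z_p_value(r, n):
    """Two-sided p value for H0: rho = 0 via the Fisher z approximation."""
    z_statistic = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z_statistic)))


# Correlations quoted in the text for MGH-HPS and PITS totals.
for label, r in (("MGH-HPS total", -0.06), ("PITS total", 0.26)):
    print(f"{label}: r = {r:+.2f}, approximate p = {fisher_z_p_value(r, ASSUMED_N):.2f}")
```

Under that assumption the PITS correlation sits just short of the conventional 0.05 threshold, which is consistent with the reported p = .06.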

Discussion

This study used data from a clinical trial of stepped care for TTM to provide further evidence pertaining to a 1–7 alopecia rating scale (Tolin et al. 2002). Ratings were based on photographs of participants’ most severely affected pulling site. Previously published articles had shown high interrater reliability for this scale (Rogers et al. 2014; Tolin et al. 2002) and sensitivity to treatment-related change in that average scores declined significantly from baseline to post-treatment evaluation (Rogers et al. 2014).

The present report extended prior research in several ways. The alopecia rating proved highly acceptable, and scores were generally stable across a 10-week no-treatment waitlist period. Convergent validity with an interviewer’s rating of hair loss severity from the PITS measure was high (r = 0.51), particularly considering that this association may have been attenuated by unreliability, as each measure consisted of just a single item. Alopecia ratings did not correlate significantly with the degree to which TTM symptoms were perceived as impairing the participant’s social life, or with total TTM symptom severity as measured by self-report or interviewer rating.
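The attenuation argument can be illustrated with Spearman's classical correction, which divides an observed correlation by the square root of the product of the two measures' reliabilities. In the sketch below the photo rating's reliability is taken as the interrater ICC of 0.82 cited in the Introduction, whereas the reliabilities for the single PITS item are purely hypothetical because none is reported; the function name is likewise an assumption.

```python
from math import sqrt


def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation due to measurement error."""
    for reliability in (reliability_x, reliability_y):
        if not 0 < reliability <= 1:
            raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / sqrt(reliability_x * reliability_y)


OBSERVED_R = 0.51         # photo rating with PITS item 6, from the text
PHOTO_RELIABILITY = 0.82  # interrater ICC cited from Rogers et al. (2014)

# Hypothetical reliabilities for the single PITS item (not reported anywhere).
for pits_item_reliability in (1.0, 0.8, 0.6):
    corrected = disattenuated_correlation(
        OBSERVED_R, PHOTO_RELIABILITY, pits_item_reliability
    )
    print(f"PITS item reliability {pits_item_reliability:.1f}: corrected r = {corrected:.2f}")
```

The lower the assumed reliability of either single-item measure, the larger the corrected estimate, which is the sense in which the observed 0.51 may understate the underlying association.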

Acceptability of the alopecia rating was high but not universal, as 88 % permitted a photo to be taken of the most affected pulling site. Reasons for declining were not systematically or formally assessed, but they included, for instance, being unwilling to wash off makeup or undo a hairstyle completely to reveal a pulling site for a photograph and then reapply makeup or redo one’s hair before leaving the assessment session. To achieve complete acceptability of photo rating of alopecia even in such circumstances, it may be necessary to develop standard pictures anchored to specific scores, and allow patients themselves to say which picture most closely resembles their own state of hair loss, analogous to the approach taken in developing the Clutter Image Rating scale for use in studying hoarding (Frost et al. 2008).

The absence of a positive correlation between the alopecia rating and self-reported social impairment associated with TTM was unexpected. Social impairment ratings were sufficiently variable (M = 2.98, SD = 2.59, range = 0–10) to make a positive correlation possible, and indeed social impairment (SDS) did correlate significantly (see Table 1) with interviewer-rated total symptom severity. It may be that participants with severe alopecia, or at least enough of them to lower the full-sample correlation, were sufficiently satisfied with their means of disguising alopecia (wigs, scarves, eyeliner, etc.) as to not be excessively self-conscious. Alternatively, between-participant analyses (such as the Pearson correlation we computed with cross-sectional data) might miss the effect of alopecia on social impairment. If perceptions of social impairment track changes from one’s own norm rather than differences from other people with TTM, such an effect would only be testable in future research using within-participant analyses taking advantage of repeated measurement. In other words, someone might feel self-conscious or perceive herself as being judged by others if her own hair loss is worse than it was 2 months ago, more so than by its being worse than the hair loss shown by other people with TTM, which is what cross-sectional Pearson correlations reflect.

Favorable evidence of interrater reliability, retest reliability, and convergent validity suggests that the photo rating of alopecia can serve as a useful supplementary measure for treatment studies. It is independent of total symptom severity and would not serve well as a primary outcome measure, but it addresses a side effect of hair pulling of considerable importance to people with TTM.

We speculate that alopecia rating from photographs may also have treatment utility. Clinically, photographic evidence of the extent of hair loss can be useful as a baseline measure, and photos taken later in treatment can serve to document an aspect of progress in a very compelling fashion. Gradual changes in appearance can elude one’s notice or be underestimated if one relies entirely on memory. Just as photos can shock people in everyday life (“did I look that young just 3 years ago? What happened?”), they can serve to concretize progress over the course of a few months of treatment (“yes, I still have urges and occasional lapses to pulling, but look how much fuller my eyebrows are in this photo vs. this one from before therapy”). Rothbaum and Ninan (1994) observed that taking multiple photos over the course of treatment can “help the client break through denial and see her progress” (p. 657). Future research could test experimentally the treatment utility (Nelson-Gray 2003) of alopecia rating of photos by randomly assigning people with TTM either to have photos taken, rated, and shown to them periodically throughout treatment or not, and then testing whether the alopecia-rated group shows greater treatment response.

Methodological limitations constrain interpretability of the findings. Our sample size was modest, particularly for the retest reliability evaluation based only on those participants in the wait list condition. Future reliability studies of the alopecia rating measure should attempt to enroll a larger sample. Also, 12 % of participants declined to be photographed for alopecia rating, and we did not systematically collect data on their reasons for doing so. Finally, we evaluated only one of the possible ways of rating alopecia severity. Future research might do well to test the comparative validity of alternate methods of rating alopecia. For example, if someone pulls from multiple sites, an aggregate alopecia rating, rather than the rating of the most affected site as in our study, could be useful. Alternatively, judges’ ratings of photos could be supplemented by actual counts of the hairs in defined areas of the scalp or skin, a metric that has proven useful in dermatology treatment research (e.g., Olsen et al. 2007).