Introduction

Between 30% and 50% of depressed subjects in clinical trials have a substantial reduction in symptoms during treatment with placebo (Beecher 1955; Shapiro and Shapiro 1997; Walsh et al. 2002; Fava et al. 2003). While the placebo response is both an interesting and useful clinical phenomenon, it bedevils antidepressant drug discovery. “Failed trials” (those in which the drug fails to separate from placebo) are common and cost the pharmaceutical industry hundreds of millions of dollars annually (Enserink 1999; Robinson and Rickels 2000). Reducing the rate of placebo response could enhance the efficiency of screening potential antidepressant compounds, decrease the needed size (and therefore the cost) of clinical trials, and potentially decrease the number of failed trials (Fava et al. 2003).

Control over placebo response rates has proved to be elusive. One strategy has been to alter the design of the clinical trial, but there is little or no agreement among studies regarding which factors are crucial to control (Fava et al. 2003; Michelson et al. 1999; Zimbroff and Mendez 2002). Another strategy has been the single-blind placebo lead-in, in which subjects receive placebo for a brief period with the expectation that eventual placebo responders (PRs) will show a response during the lead-in (Landin et al. 2000). Only a small minority of subjects actually responds to placebo during a 1-week lead-in, however, and there is no overall reduction in placebo response rates (Reimherr et al. 1989; Trivedi and Rush 1994; Faries et al. 2001).

Other lines of work have aimed at identifying the characteristics of individuals who constitute likely placebo responder (LPR) subjects. The most successful work in this regard is that of Quitkin and colleagues (Quitkin et al. 1993; Quitkin 1999; Stewart et al. 1998) who examined the time-course of improvement during a clinical trial. While this “pattern analysis” method is powerful, it can identify PRs only at the end of a clinical trial.

A more useful method would be prospective identification of LPR subjects so that they could be screened out of clinical trials using a selective enrollment strategy. Demographic factors such as subject age, gender, education level, or occupation have not proved to be reproducible predictors of response (Shapiro and Shapiro 1997). Depressive symptoms may have usefulness in predicting which subjects will respond to placebo (Peselow et al. 1992). McGrath et al. (2000) found that those subjects with more severe neurovegetative symptoms on the Hamilton Depression Rating Scale (Ham-D) were less likely to respond to placebo.

We previously reported changes in brain function during treatment in depressed subjects who responded to placebo (Leuchter et al. 2002). In the present study, we examined this same group of individuals to determine whether there were pretreatment factors that would prospectively identify subjects who were eventual PRs. We hypothesized that data from three different domains would be useful for identifying LPR subjects prior to treatment. These domains were neurophysiological (quantitative electroencephalographic; QEEG), neuropsychological (results from a cognitive testing battery), and symptomatic (neurovegetative items from the Ham-D). The neurophysiological data included QEEG power and cordance, a measure that has moderately strong associations with cerebral metabolism and perfusion (Leuchter et al. 1994a,b, 1999). In contrast to our previous work that focused on a limited number of brain regions (Leuchter et al. 2002), in the present study we examined brain function from all recording electrodes to determine whether any brain region would show pretreatment differences in brain function. Neuropsychological data were collected for seven spheres of cognitive function. Four individual symptom items from the 17-item Ham-D were selected based on previous findings regarding neurovegetative symptoms (early, middle, and late insomnia, as well as appetite) (McGrath et al. 2000). Items from each domain were examined in an exploratory multiple variable model to determine whether a combination of data from these three domains might be useful for prospective identification of LPR subjects.

Materials and methods

Subjects

Subjects were recruited from community advertisement and from the outpatient clinics of the UCLA Neuropsychiatric Hospital, and were enrolled in one of two 9-week, double-blind, placebo-controlled treatment studies. These studies were conducted sequentially and independently over a 24-month period: the first study utilized fluoxetine 20 mg (n=24) and the second venlafaxine 150 mg (n=27) as total daily medication doses. The UCLA IRB approved all experimental procedures, and all procedures were performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki. Written informed consent was obtained after experimental procedures were explained fully to the subjects, and prior to their inclusion in the study.

All subjects were over 21 years of age, met DSM-IV criteria for a major depressive episode, and had 17-item Hamilton Depression Rating Scale (Ham-D) scores ≥16 (Hamilton 1960). Diagnoses were established using the Structured Clinical Interview for DSM-IV. Subjects were excluded if they previously had failed treatment with the antidepressant being studied, had a history of suicidal ideation, or suffered from any medical illness or received any medication known to significantly affect brain function. Demographic and clinical characteristics of the subjects are shown in Table 1.

Table 1 Demographic characteristics of subjects. There were no significant clinical or demographic differences between the subjects enrolled in the fluoxetine or venlafaxine trials

Assessment instruments

The severity of depression was assessed at baseline and throughout the study period using the 17-item Ham-D (Hamilton 1960). The total score of the Ham-D was used as the entry criterion for the study, as well as to determine treatment response. In order to examine differences in neurovegetative symptoms across response groups, items for early, middle, and late insomnia, as well as appetite were selected for further analysis.

Cognitive function was assessed using a battery of tests that broadly measured cognitive performance in seven spheres: (1) information processing speed, (2) executive functioning, (3) language, (4) verbal memory, (5) non-verbal memory, (6) basic attention, and (7) visuoperceptual ability. The specific tests in each sphere are detailed in Table 2. Cognitive testing data were available only on 42 of the 51 subjects; the remaining 9 subjects were unable to complete testing.

Table 2 Tests comprising the seven spheres of cognitive function

Experimental procedures

Fifty-three subjects received 1 week of single-blinded placebo lead-in treatment. Two subjects met response criteria (Ham-D ≤10) at the end of this week and were removed from the studies. The remaining 51 subjects then were randomized to receive 8 weeks of double-blind treatment with either placebo or the active medication. Subjects enrolled in the fluoxetine trial received 20 mg/day for the entire 8 weeks. Those enrolled in the venlafaxine trial began at 37.5 mg/day and increased the dose by 37.5 mg every 3 days, until a dose of 75 mg b.i.d. was attained after 1 week. They continued at a dose of 150 mg/day for the remaining 7 weeks. To preserve blinding, the number of placebo pills was increased on the same schedule as the medication.

After randomization, research staff evaluated subjects at 2 days and at weekly intervals thereafter. In addition to symptom evaluation using the Ham-D, subjects had brief supportive psychotherapy at each visit, consisting of 15–25 min of unstructured counseling and assistance in problem solving by a research nurse. This support was required by the IRB to address concerns about dispensing placebo as the sole treatment for patients with major depression. Response was defined as Ham-D ≤10 after 8 weeks of double-blind treatment. At this time, blind was broken and subjects were classified as medication responders (MRs), PRs, medication non-responders (MNRs), or placebo non-responders (PNRs).
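The outcome classification described above can be written as a small rule. The following is a minimal sketch, assuming the only inputs are the randomized arm and the week-8 Ham-D total; the function name and the arm labels are hypothetical and are not taken from the study's own software.

```python
RESPONSE_CUTOFF = 10  # response = Ham-D <= 10 after 8 weeks (from the text)


def classify_outcome(arm: str, final_ham_d: int) -> str:
    """Return MR, MNR, PR, or PNR for one subject.

    `arm` is a hypothetical label: "medication" or "placebo".
    """
    if arm not in ("medication", "placebo"):
        raise ValueError(f"unknown arm: {arm!r}")
    responded = final_ham_d <= RESPONSE_CUTOFF
    if arm == "medication":
        return "MR" if responded else "MNR"
    return "PR" if responded else "PNR"


# Example: a placebo-arm subject ending at Ham-D 8 is a placebo responder.
example_group = classify_outcome("placebo", 8)
```

Because the blind was broken only after week 8, such a rule can be applied only once the arm assignment is known.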

QEEG techniques

QEEG data were examined from a recording performed at the time of enrollment, before any treatments were administered. Recordings were performed using the QND system (Neurodata, Inc.; Pasadena, CA) while subjects rested in the eyes-closed, maximally alert state in a sound-attenuated room with subdued lighting, using procedures previously described in detail (Leuchter et al. 1999; Cook et al. 1998, 1999). Electrodes were placed with an electrode cap (ElectroCap; Eaton, OH) using an extended International 10–20 System with 35 recording electrodes (Fig. 1). Eye movements were monitored with right infraorbital and left outer canthus electrodes. Data were collected using a Pz reference montage and were digitized at 256 samples/channel/s, with a high-frequency filter of 50 Hz and a low-frequency filter of 0.3 Hz. Data were reformatted by amplitude subtraction to construct a linked-ears reference montage, and then were reviewed by a technician who was blinded to subject identity, treatment condition, and clinical status. The technician examined the record carefully for eye movement, muscle, or other artifacts, and selected the first 20–32 s of artifact-free data for processing. This amount of data may be used to obtain reliable frequency spectra (Leuchter et al. 1992, 1999; Brenner et al. 1995). A fast Fourier transform was used to calculate absolute power (the intensity of energy in a frequency band in microvolts squared) in each of four frequency bands (0.5–4, 4–8, 8–12, and 12–20 Hz).
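The band-power computation can be illustrated with a short sketch. It is not the QND system's implementation: to stay self-contained it uses a direct discrete Fourier transform in place of a fast Fourier transform (same result, slower), it treats band edges as half-open intervals, and it scales power so that the sum over all bins equals the mean square of the signal. All three choices are assumptions. The band names other than theta are conventional labels not given in the text.

```python
import cmath
import math

# Band edges in Hz, from the text; only "theta" is named there.
BANDS_HZ = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta": (12.0, 20.0),
}


def band_power(samples, fs, lo, hi):
    """Absolute power (signal units squared) in the band lo <= f < hi."""
    n = len(samples)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if not (lo <= freq < hi):
            continue
        # Direct DFT coefficient for bin k (an FFT gives the same value).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        # One-sided spectrum: double every bin except DC and Nyquist.
        is_edge = k == 0 or (n % 2 == 0 and k == n // 2)
        total += (1.0 if is_edge else 2.0) * abs(coeff) ** 2 / n ** 2
    return total


# Synthetic check: 2 s of a 10 microvolt, 6 Hz sine sampled at 256 Hz.
fs = 256
segment = [10.0 * math.sin(2 * math.pi * 6 * i / fs) for i in range(2 * fs)]
theta_power = band_power(segment, fs, *BANDS_HZ["theta"])
```

For a pure sine of amplitude A that falls inside the band, this scaling returns A²/2, which gives a simple sanity check on any implementation.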

Fig. 1 Electrode montage. The 35 scalp electrodes from the extended International 10–20 System. “Neighboring” electrodes are linked by line segments, to denote bipolar channels that were used for averaging in the reattributional montage. All electrode pairs sharing a common electrode were averaged to obtain a reattributed power for that electrode (e.g., for electrode C3, power from the pairs FC5-C3, FC1-C3, CP5-C3, and CP1-C3 was averaged)

Cordance values were then calculated for each electrode site in each of the four frequency bands. Cordance is a measure derived from QEEG power and has a moderately strong association with cerebral perfusion (as assessed by simultaneous O15 positron emission tomography); this association is superior to that seen for conventional QEEG power measurements in each frequency band (Leuchter et al. 1999). Cordance is calculated using a three-step algorithm that normalizes power across both electrode sites and frequency bands. This algorithm has been defined in detail elsewhere (Leuchter et al. 1999) and may be summarized briefly as follows. First, absolute power values are reattributed to each individual electrode by averaging power from all bipolar electrode pairs sharing that electrode (Fig. 1). This electrode referencing method is similar to the Hjorth transformation, except that the present method averages power from neighboring electrode pairs, whereas the Hjorth transformation averages voltage amplitudes. We previously reported that electrode referencing based on power averaging provides a stronger association between surface-measured EEG and perfusion of the underlying brain than either the linked-ears reference or the conventional Hjorth transformation (Cook et al. 1998). Based on reattributed absolute power, relative power (the percentage of the total energy from all bands concentrated in a single band) is also calculated. Second, absolute and relative power values undergo spatial normalization within each frequency band using a z-score transformation, yielding z-scores for each electrode site s and frequency band f (Anorm(s,f) and Rnorm(s,f), respectively). Third, the z-transformed power scores are summed to yield cordance values.
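The three steps summarized above can be sketched as follows. This is an illustration of the summary only, not of the full published algorithm (Leuchter et al. 1999), which may include details omitted here. The data layout, the function names, the use of the population standard deviation for the z-score, and the toy power values are all assumptions. The three-pair montage is hypothetical and is far smaller than the real 35-electrode array.

```python
from statistics import mean, pstdev


def reattribute(pair_power):
    """Step 1: average power over all bipolar pairs sharing an electrode.

    pair_power maps (electrode_a, electrode_b) -> {band: absolute power}.
    """
    collected = {}
    for (a, b), bands in pair_power.items():
        for electrode in (a, b):
            for band, power in bands.items():
                collected.setdefault(electrode, {}).setdefault(band, []).append(power)
    return {
        e: {band: mean(values) for band, values in bands.items()}
        for e, bands in collected.items()
    }


def relative_power(abs_power):
    """Fraction of each electrode's total power that falls in each band."""
    return {
        e: {band: p / sum(bands.values()) for band, p in bands.items()}
        for e, bands in abs_power.items()
    }


def zscore_across_sites(power, band):
    """Step 2: spatial z-score within one band (population SD assumed)."""
    values = [power[e][band] for e in power]
    mu, sd = mean(values), pstdev(values)  # sd == 0 would need special handling
    return {e: (power[e][band] - mu) / sd for e in power}


def cordance(pair_power, band):
    """Step 3: sum the normalized absolute and relative power per site."""
    absolute = reattribute(pair_power)
    relative = relative_power(absolute)
    a_norm = zscore_across_sites(absolute, band)
    r_norm = zscore_across_sites(relative, band)
    return {e: a_norm[e] + r_norm[e] for e in absolute}


# Hypothetical toy data: three electrodes, two bands, arbitrary power values.
toy_pairs = {
    ("C3", "FC5"): {"theta": 4.0, "alpha": 6.0},
    ("C3", "FC1"): {"theta": 8.0, "alpha": 2.0},
    ("FC5", "FC1"): {"theta": 2.0, "alpha": 2.0},
}
theta_cordance = cordance(toy_pairs, "theta")
```

Because both inputs are z-scored across sites, the cordance values in a band sum to zero over the array, so the measure expresses each site relative to the rest of the scalp rather than in absolute units.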

For each individual over the course of treatment, we calculated power and cordance values for individual electrodes (Fig. 1). In contrast with our previous study, in which we focused on certain specific brain regions (Leuchter et al. 2002), we examined data from each recording electrode. We limited our analysis to the theta frequency band (4–8 Hz), because previous work from this and other laboratories has indicated that energy in the theta band is associated most strongly with treatment outcomes in depression (Cook et al. 1999; Ulrich et al. 1984, 1994).

Data analysis

Demographic and clinical differences among the four response groups (MR, MNR, PR, and PNR) were examined first. Relationships between categorical independent variables and a single continuous dependent variable were examined using analysis of variance (ANOVA). Chi-square analyses were used to explore possible relationships when both the independent and dependent variables were categorical.

A dichotomous treatment response variable was created to look at PRs versus all other study participants (MR, MNR, and PNR). In order to conduct a more focused examination of the responders, we also created a dichotomous response variable including only the PRs and MRs.
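The two dichotomous codings can be made concrete with a brief sketch. The function names and the 1/0 coding are assumptions chosen for illustration; the text specifies only which groups are contrasted.

```python
def code_pr_vs_all(groups):
    """1 for placebo responders, 0 for MR, MNR, and PNR subjects."""
    return [1 if g == "PR" else 0 for g in groups]


def code_pr_vs_mr(groups):
    """Keep responders only; return (subject index, code) with PR = 1, MR = 0."""
    return [
        (i, 1 if g == "PR" else 0)
        for i, g in enumerate(groups)
        if g in ("PR", "MR")
    ]


# Hypothetical outcome labels for five subjects.
labels = ["PR", "MNR", "MR", "PNR", "PR"]
all_coding = code_pr_vs_all(labels)        # [1, 0, 0, 0, 1]
responder_coding = code_pr_vs_mr(labels)   # [(0, 1), (2, 0), (4, 1)]
```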

We hypothesized that each of the three domains of data at baseline (neurophysiological, neuropsychological, and neurovegetative symptoms) would be useful for identifying PRs. Multivariate analysis of variance (MANOVA) was performed first to determine whether there were differences among the groups of subjects in multiple dependent variables (QEEG power and cordance) in the neurophysiological domain. For the neuropsychological testing domain, a series of MANOVAs was performed to determine whether there was a difference among the groups in any cognitive sphere (if a sphere consisted of a single measure, ANOVA was used instead of MANOVA). MANOVA was utilized because it corrects for multiple comparisons by maintaining the groupwise error rate at 0.05. Box’s M was used to test the assumption of equality of covariance matrices and Levene’s test was used to assess equality of error variance across groups. In domains where the omnibus MANOVA was significant, univariate F-tests were then used to further examine the effect of the specific dependent variables that contributed to the overall effect. Pillai’s Trace was used because it is the most robust test statistic when unequal sample sizes are examined.

After the initial examination of the data, exploratory logistic regression was used to develop a multivariable model to estimate the probability of placebo response. Neurophysiological and neuropsychological variables that showed between-group differences in the MANOVAs as well as the four neurovegetative items from the Ham-D were used as the independent variables. Forward stepwise regression was used to select variables with the strongest predictive value to enter the final model based on a classification cutoff of 0.5. Nagelkerke’s R squared was used to assess the strength of association between the independent and dependent variables for the final logistic regression model. The model chi-square statistic was used to assess the improvement in fit when the independent variables were in the model versus the null model. Logistic regression coefficients were assessed using the Wald statistic to test the significance of each individual variable, while holding constant all other independent variables in the model.
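Both fit statistics named above are functions of the log-likelihoods of the null (constant-only) and fitted models. The sketch below shows the standard definitions; the function names are hypothetical, and the sample values in the example (40 subjects, 10 events) are invented for illustration and are not the study's data.

```python
import math


def null_log_likelihood(n_events, n):
    """Log-likelihood of the constant-only model for a binary outcome."""
    p = n_events / n
    return n_events * math.log(p) + (n - n_events) * math.log(1.0 - p)


def model_chi_square(ll_null, ll_model):
    """Likelihood-ratio statistic comparing the fitted model with the null."""
    return 2.0 * (ll_model - ll_null)


def nagelkerke_r2(ll_null, ll_model, n):
    """Cox-Snell R squared rescaled so that its maximum is 1."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical example: 10 events among 40 subjects.
ll0 = null_log_likelihood(10, 40)
r2_no_gain = nagelkerke_r2(ll0, ll0, 40)   # model no better than the null
r2_perfect = nagelkerke_r2(ll0, 0.0, 40)   # log-likelihood of 0 = perfect fit
```

The rescaling is what distinguishes Nagelkerke's statistic from the Cox-Snell version, whose upper bound is below 1 for binary outcomes.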

Results

Clinical outcomes

A comparison of the subjects from the two medication trials showed that their demographic and clinical characteristics, medication and placebo response rates, mean final Ham-D scores, and dropout rates were not significantly different (Table 1). Data from the two trials, therefore, were pooled for further analyses.

Overall, 52% of the subjects (13/25) receiving antidepressant medication responded to treatment, while 38% of those receiving placebo (10/26) responded; the difference in response rates was not statistically significant. Both responder groups had lower final Ham-D scores than non-responders, but no group could be distinguished from any other based on pretreatment levels of depression, number of prior episodes, family history of depression, or any demographic characteristics (Table 1). Both responder groups had similar rates of decline in depression and ended with substantially lower depression scores than non-responders (Fig. 2).
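The response-rate comparison can be checked by hand from the counts given above. The sketch below computes a Pearson chi-square for the 2×2 table of arm by response; the use of the uncorrected Pearson statistic is an assumption, since the text does not state which variant was applied to this comparison.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts from the text: 13 of 25 on medication and 10 of 26 on placebo responded.
med_resp, med_nonresp = 13, 25 - 13
pbo_resp, pbo_nonresp = 10, 26 - 10

med_rate = med_resp / (med_resp + med_nonresp)   # 0.52
pbo_rate = pbo_resp / (pbo_resp + pbo_nonresp)   # about 0.385
chi2 = chi_square_2x2(med_resp, med_nonresp, pbo_resp, pbo_nonresp)

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05
significant = chi2 > CRITICAL_DF1_ALPHA_05
```

Under these assumptions the statistic is approximately 0.94, well below the critical value of 3.841, which agrees with the reported absence of a significant difference.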

Fig. 2 Changes in mean Ham-D scores over time for the four outcome groups. There were no differences at any time point between the mean scores of the medication responders (MR) and placebo responders (PR), although both groups of responders were significantly different from both non-responder groups at all time points (P<0.05) except baseline

Neurophysiological data

MANOVA for QEEG absolute and relative power showed no difference among the groups of subjects for any of the 35 electrodes. For QEEG cordance, the omnibus F was 2.58 with P=0.023. The follow-up univariate tests indicated the only significant group differences were in electrodes Af1 and Af2. Because these two electrodes were immediately adjacent to each other over the frontocentral region, mean values for these two electrodes were averaged to create a frontocentral cordance value for each treatment response group. Frontocentral cordance was significantly lower in PR subjects than in all other subjects in general (P=0.006, two tailed, t=2.87) and in the MR subjects in particular (P=0.004, two tailed, t=3.29) (Table 3).

Table 3 Group mean differences in frontocentral cordance. Cordance values from Af1 and Af2 electrodes were averaged to obtain a measure of frontocentral cordance. PR placebo responder, MR medication responder

Neuropsychological data

Results of the MANOVA and ANOVA tests for baseline differences among the groups of subjects are shown in Table 4. Only the sphere of information processing speed was significant, within which further univariate tests indicated that the difference was attributable to the PR subjects’ performance on Digit Symbol Test, which was faster than that for all other groups combined (Table 5). PRs also performed better, though not significantly better, than all other groups combined on Trails A and the Stroop Color Word tests. There were no significant differences between the subset of PR and MR subjects alone (data not presented).

Table 4 Analyses of variance for neuropsychological tests for each sphere of cognitive function. Only information processing speed showed a difference among the groups of subjects
Table 5 Group mean difference on the Digit Symbol Test. Neuropsychological testing data were available on 42 of 51 subjects and showed a significant difference between placebo responders (PRs) and all other subjects combined. There was no difference between PR and medication responder (MR) subjects alone

Neurovegetative symptom data

There were no pretreatment differences among the treatment outcome groups in overall severity of depression as measured by the total Ham-D scores. Univariate testing for the four neurovegetative items, however, revealed that the PR subjects reported significantly lower levels of late insomnia than all other subjects combined (t=2.22, two tailed, P=0.03). There was no significant difference between PR and MR subjects alone, and no difference among the groups in the other three Ham-D items.

Logistic regression

Frontocentral cordance, Digit Symbol Test, late, middle, and early insomnia, as well as appetite were offered as candidate variables for logistic regression. Using the forward stepwise method of entry, the final model included frontocentral cordance, the Digit Symbol Test, and late insomnia. Nagelkerke’s R squared value was 0.653, demonstrating a moderately strong relationship between the independent variables and the dependent variable. The significance level of the omnibus test of the model coefficients was 0.001, indicating a strong improvement in fit when the independent variables were in the model versus the model with only the constant. The P values for the individual coefficients based on the Wald statistic were all significant (P<0.03) (Table 6). This model correctly classified 97.6% of the subjects into their treatment response category (Table 7). A separate logistic regression carried out for PR versus MR subjects alone revealed that frontocentral cordance was the only variable selected for the model, with 75% of the subjects classified correctly.

Table 6 Logistic regression. Variable(s) entered on step 1, frontocentral cordance; variable(s) entered on step 2, late insomnia; variable(s) entered on step 3, Digit Symbol
Table 7 Classification table. Classification matrix based on 42 of 51 subjects who had complete data (including neuropsychological testing). PR placebo responder

Discussion

These results suggest that it may be possible, on the basis of pretreatment measures of brain function, symptom severity, and cognitive performance, to identify prospectively those subjects who are likely to exhibit a placebo response at the end of a clinical trial. These subjects showed substantially lower cordance in the frontocentral region, had slightly less late insomnia, and had slightly faster cognitive processing times than did those subjects who did not show a placebo response. The differences in brain function were significant whether the PR subjects were compared with all other response groups combined, or only with those subjects who responded to medication. The fact that PR subjects differed specifically and significantly from MR subjects suggests that it may be possible to identify prospectively a selective group of LPR subjects for future studies.

The finding that frontocentral brain activity is associated with responsiveness to placebo treatment is consistent with prior literature implicating the anterior cingulate region in mediating the response to antidepressant medication (Mayberg et al. 1997). The frontocentral recording electrodes from which we detect brain functional characteristics of PRs overlie the anterior cingulate region, and activity from these electrodes in the theta-frequency band has been reported to reflect anterior cingulate metabolism (Asada et al. 1999; Ishii et al. 1999).

Previous studies have suggested that PRs have less severe illness than PNRs. The PR subjects in this study did not differ significantly in terms of overall severity of depression from other subjects, but did have slightly less severe late insomnia, a finding consistent with the literature (McGrath et al. 2000; Thase et al. 1993). The finding that overall severity of depression did not differ between the responder groups indicates that baseline brain functional differences between PR and other subjects cannot be explained solely on the basis of illness severity.

To our knowledge, cognitive processing speed has never been examined in relation to the placebo response in depression. Previous studies have reported that MRs are more likely to show normal cognition than MNRs, who frequently show deficits on tests of executive function (Dunkin et al. 2000; Alexopoulos et al. 2000). In the present study, however, PR subjects did not differ from any other response group in executive or other measures of cognitive function. The present results suggest that those subjects who respond to placebo may have the least amount of cognitive slowing of all subjects with major depression.

It is important to note several limitations of the current study. First, the number of PRs on whom we report is relatively small. These results, therefore, should be interpreted with caution, and await replication on a larger number of subjects. Second, although the proportion of subjects responding to medication was greater than the proportion responding to placebo, this difference was not statistically significant. The lack of a significant difference in response rates is an increasingly frequent occurrence in clinical trials (Walsh et al. 2002) and may reflect a lower than usual medication response rate (52%), a higher than usual placebo response rate (38%), or a combination of these factors. One factor contributing to the substantial placebo response rate in this study may have been the supportive psychotherapy offered to all subjects. Although this interaction with the subjects was not substantially different from the support offered to subjects in most antidepressant clinical trials, it may have helped engender the placebo response in these subjects (Fava et al. 2003) and could limit the generalizability of the findings of this study. Third, it is likely that some of the subjects in the MR group actually constitute PRs (i.e., received medication but improved for reasons unrelated to medication treatment). We have no method to identify the PRs who may be embedded within the MR group. PRs had significantly lower frontocentral cordance than the group of MRs, however, so their removal could actually increase the between-group difference. Fourth, the rate of correct classification with logistic regression must be interpreted with caution both because of the limited sample size and because of the absence of an independent sample of subjects on whom to test the classification variables. The results of the logistic regression should be interpreted primarily as demonstrating the potential value of the multimodal approach to classification of PRs, and await replication on an independent sample of subjects.

It would be ideal if restrictive entry criteria could be employed to exclude LPR subjects from phase-IIb clinical trials in order to reduce placebo response rates. Reduced placebo response rates could reduce the number of subjects required for these early trials and improve the efficiency with which trials could detect medication effectiveness (Fava et al. 2003). If the findings of this study can be replicated, it might be possible to utilize a combination of neurophysiological, neuropsychological, and symptomatic measures as a screening method to identify and exclude LPR subjects from these medication development studies. It is important to note that, while this type of subject screening and restrictive entry criteria could be useful for rapid identification of lead compounds for antidepressant drug development, it would not at this juncture be applicable to phase-III trials. Restrictive entry criteria in trials of medication efficacy could result in selection of study populations that would not be representative of patients in general practice.

Whether the multimodal approach described here would be practical for phase-IIb trials would depend ultimately on a cost–benefit analysis: would the burden of the assessments proposed here constitute a barrier to subject enrollment, and would the expense of this testing battery be offset by the cost savings of smaller trials? In our experience, the assessment described here could be completed in less than 2 h and poses a modest additional cost and minimal burden to subjects. A more complete cost–benefit analysis would nonetheless be an important component of future studies. The preliminary results reported here are encouraging and suggest that a combination of brain functional, cognitive, and symptomatic measures may be a useful strategy for reducing placebo response rates in medication development trials. This approach, if replicated by additional studies, could be a useful component of early clinical trials.