Background

Vestibular schwannomas and neurofibromatosis type 2

Vestibular schwannomas (VS; also known as acoustic neuromas) are among the most common benign central nervous system tumors, with an annual incidence in the United States of approximately 1.5 per 100,000 person years and comprising about 9% of all brain tumors [1]. No medical treatment for VS currently exists. Patients are managed either through observation with serial imaging studies or, when active treatment is required, via either surgical resection or radiation [2]. Vestibular schwannomas commonly cause unilateral hearing loss, trigeminal deficit, and (usually after treatment) facial weakness. However, in most patients with sporadic, unilateral VS good functional status and quality of life are accomplished with observation, radiation treatment, surgery, or some combination of these; many are cured by treatment.

In neurofibromatosis type 2 (NF2), a dominantly inherited syndrome caused by alterations in the NF2 tumor suppressor gene, VS occur bilaterally, commonly accompanied by other cranial and spinal nerve schwannomas and meningiomas. NF2 affects about one person per 25,000 live births, and accounts for 2–7% of patients with VS [3–5]. In contrast to sporadic VS, in NF2 patients the inexorable growth of VS and other tumors and the cumulative morbidity from treatment typically cause multiple cranial nerve deficits, often followed by death in young adulthood [6]. In one registry study, 60% of patients died before age 44 [7]. Nearly all NF2 patients die from sequelae of tumor growth or from treatment mortality [8]. Unlike the unilateral cranial nerve defects encountered in sporadic VS patients, the cranial nerve deficits in NF2 patients are often bilateral, including complete hearing loss, facial diplegia causing dysarthria and difficulty eating, and inability to swallow or handle secretions because of bilateral lower cranial nerve dysfunction. These can have profound effects on quality of life and may contribute to mortality. In addition, the morbidity of treatment from either surgery [9] or radiation [10–12] is greater than for sporadic VS, probably because nerve fibers are much more likely to run through NF2-associated schwannomas rather than on the tumor surface, as in sporadic VS [13–16].

Treatment choices in NF2 often weigh the possible consequences of progressive brainstem compression and gradual cranial nerve dysfunction from multiple enlarging tumors against the potential for devastating combinations of cranial nerve deficits from surgical or radiation treatment (Fig. 1). A particularly common clinical scenario in NF2 is progressive loss of hearing from VS growth in an only hearing ear, after the other ear has already suffered complete hearing loss due to tumor growth or treatment morbidity. For patients such as these, in contrast to patients with sporadic, unilateral VS, both surgery and radiation treatment may pose unacceptable risks, and novel medical treatments are urgently needed.

Fig. 1. Axial gadolinium-enhanced T1-weighted MRI scan showing multiple confluent cerebellopontine angle tumors in an NF2 patient

Need for standardized response criteria in NF2

Although radiation therapy has emerged as an effective treatment for VS, the lack of standardized response criteria complicates attempts to compare results in separate clinical series, e.g., with different fractionation schemes for fractionated stereotactic radiotherapy [17–19]. Currently there are no established criteria specifically designed for assessing responses to antitumor drugs in benign nervous system tumors such as schwannomas and meningiomas. Recently, the introduction of targeted therapies developed for treatment of common cancers, combined with work on the tumor biology of schwannomas, has raised the possibility of designing rational drug treatments for NF2 patients with progressive intracranial tumors [20–26]. For example, the treatment of small series of NF2 patients using erlotinib [27, 28] and bevacizumab [29] has recently been reported, with apparent benefit in some cases, including stabilization and shrinkage of some VS and/or substantial reversal of chronic hearing loss. This initial experience suggests that a range of responses (and nonresponses) is possible after drug treatment for benign nervous system tumors.

In this communication, we suggest response criteria for measuring the success of antitumor drugs used to treat NF2-related VS. Specifically, we highlight the suggested roles of volumetric tumor measurement and word recognition hearing scores in assessing both direct antitumor effects and an important component of health-related quality of life in NF2 patients, especially those who are losing the hearing in an only hearing ear. Although other potential endpoints in such trials are of great interest, and may have a place as secondary endpoints in trials, we argue here, based on our initial experience with small numbers of patients as well as published studies on NF2 and VS, that a standardized assessment of these two endpoints should be included routinely in trials of antitumor drugs in NF2. We do not here systematically address certain other important aspects of phase II trial design, such as entry criteria (including definitions of NF2 and of VS in the absence of histological diagnosis), choice of trial design (e.g., single-arm, randomized with active or placebo comparator arm [30, 31], crossover [32], or randomized discontinuation [30, 33]), measuring responses in non-VS lesions such as meningiomas or other schwannomas, toxicities that should be monitored with special significance in NF2 such as ototoxicity [34] and corneal keratopathy (to which patients with trigeminal lesions may be especially susceptible) [35], or the “go/no go” choice of what level of response justifies proceeding to phase III trials for a specific agent [36].

Justification of endpoint selection: tumor size

Typical endpoints used in phase II cancer trials

Phase II trials are designed to detect the first evidence of an anticancer drug’s activity against tumors in patients. (Phase I trials test the safety of drugs that have not previously seen human use, usually in a small single-arm design, and phase III trials aim to prove drug efficacy, usually in a large randomized trial.) A typical phase II design using tumor response as an endpoint enrolls 40 patients who are treated with a novel drug; if the drug produces a prespecified number of “responses”, it may proceed to definitive phase III testing.

Historically, most phase II trials of anticancer drugs have relied on objective tumor response rates (i.e., tumor shrinkage, by imaging or by direct measurement, in patients whose tumors are shown to be enlarging) in justifying the selection of active agents for further testing in definitive phase III trials. The assumptions behind this choice of endpoint are that most anticancer drugs work through a cytotoxic mechanism, that more cytotoxicity is reflected by greater macroscopic tumor shrinkage, and that objective tumor shrinkage is an adequate surrogate for other endpoints (such as longer overall or progression-free survival) that usually serve as primary endpoints in definitive phase III trials [37–40].

Recent developments in phase II drug testing in some other tumors have emphasized alternative imaging endpoints instead of or in addition to objective reduction of tumor size measurements, for several reasons. Many targeted agents are assumed to have cytostatic, rather than cytotoxic effects, and lack of tumor shrinkage may not reflect lack of drug activity. Conversely, some tumor enlargement by imaging may represent “pseudoprogression”, an increase in enhancement due to treatment that does not reflect actual tumor growth [41, 42], and spontaneous regression of these changes can falsely mimic active drug response. Under either of these circumstances, freedom from disease progression over time may be a more appropriate marker of drug activity than objective response. The increasingly common application of progression-free survival at 6 months of treatment as the primary endpoint in recurrent malignant glioma drug trials reflects these concerns [43–45]. Even given the difficulties in accurately defining tumor progression cited above, progression-free survival deserves consideration as a primary endpoint in NF2-related tumors, particularly since these tumors are histologically benign and typically slow-growing, and metastasis is exceptionally rare.

Additionally, imaging changes other than changes in size are sometimes more sensitive early indicators of anticancer drug activity. Examples are reduction in tumor density on contrast-enhanced computed tomographic (CT) scans [46, 47], reductions in tumor metabolism detected using positron emission tomography [46–48], and changes in diffusion or perfusion within tumors detected using MRI [49–51]. To date, there is little clinical experience with these imaging modalities in either sporadic or NF2-related VS.

Special considerations for NF2 trial endpoints

For growing VS in NF2 patients, clinical experience suggests that a durable stabilization or reduction in tumor size from treatment would be a direct clinical benefit. NF2 patients suffer progressive declines in hearing over time, and these declines are associated with gradual VS growth [52]. In one retrospective study, 27% of 108 ears suffered decreased hearing within 2 years of diagnosis of NF2 [53]. Although hearing can decline in ears during an interval of time without measurable VS growth, sometimes suddenly, hearing decline in VS is significantly more common in actively growing tumors [54–57]. In addition, the morbidity of surgery or radiation treatment of VS is highly dependent on tumor size [9, 10, 58]. Other symptoms, such as trigeminal deficits, are directly correlated with tumor size, and can resolve if mass effect on the nerve is reduced by surgery.

Response rate versus time to progression in NF2

We suggest reduction in tumor size as an endpoint for initial single-arm phase II drug trials in NF2-related VS, rather than time before tumor progression, because of our present imperfect understanding of NF2 natural history. Several studies have addressed the natural history of NF2, including the course of untreated VS [52, 53, 59–63]. Most of these studies reported growth rates of VS over various periods of time in individuals or in groups [52, 59, 60, 62, 64]; one study reported the fraction of untreated tumors that grew or were stable over an unspecified period [63]. None of these studies reported time until progression in individual untreated NF2 VS. One study reported time to tumor progression in 11 untreated VS in NF2 patients; median time to tumor growth was 6 months [61]. These tumors were under observation because they were contralateral to growing VS that were treated with radiosurgery, but whether the observed tumors were growing at the time observation started was not reported. In a similar report [65], 15 of 29 untreated VS contralateral to a VS undergoing radiosurgery required treatment at an average of 27 months after starting observation. In addition, in sporadic VS followed conservatively, initial growth rate is a good predictor of subsequent continued growth [57, 66], although sporadic VS appear to have a substantially lower growth rate than that reported for NF2 VS [67–69]. Although tumor growth is correlated with loss of hearing, reduction of tumor size after treatment with surgical resection or radiosurgery is only rarely accompanied by an improvement in hearing, even in sporadic VS [10, 70–76]. Whether this also holds true for drug treatment remains to be seen.

These data suggest that time to tumor progression in initially growing NF2-related VS is an endpoint that, without effective treatment, would be reached in many patients in a relatively short period of time. However, a much better understanding of the natural history of untreated NF2 VS, including potentially important prognostic factors for growth such as age, the presence of meningiomas, other measures of disease severity such as number of prior surgical procedures or overall tumor burden, measured baseline tumor volume and growth rates, family history and familial pattern of disease, and specific NF2-defining genetic defects [59, 60, 62, 64] will be necessary before this endpoint can be used in single-arm phase II trials that use a historical control group. Progression-free survival could potentially be used at present in randomized phase II trials of NF2-related VS with an untreated control cohort, or placebo controlled crossover trials. Randomized phase II trials that compare active treatment to a placebo control arm are considered to be most useful when there is little prior information on the expected efficacy rate of a novel treatment, or when the endpoint is thought to be dependent on patient selection factors (such as time to tumor progression) [77]—both criteria that are applicable to NF2-VS studies. However, the present understanding of progression-free survival in NF2 related VS is sufficiently vague that it would be difficult to determine accrual goals and planned follow-up periods to properly power such trials, which also risk poor patient acceptance because of the use of a placebo arm. Better knowledge of the possible influence of patient-related prognostic factors noted above would probably also be necessary in properly stratifying the randomization and analysis of such trials. 
Crossover trials in which each patient alternates between active and placebo treatment periods would also be difficult to design in the absence of better natural history data.

Recommended use of radiographic response rate

We propose that objective radiographic response rate is a better choice for initial drug trials in NF2 VS. In contrast to time to progression, the expectation for rates of spontaneous tumor shrinkage in untreated VS is better understood. Spontaneous involution of untreated sporadic VS is well known; it occurs in about 8% of tumors in most reports [68, 78]. Tumors can reduce by 3–15 mm in diameter [79], usually during a long period of observation. Regression has occasionally been reported in VS that were initially growing [80, 81], although most reports do not specify initial growth. In the NF2 Natural History Consortium patients, 3 of 84 tumors (3.6%) regressed by 2 mm or more during 9 months to 2 years of followup [60]. This study was not limited to patients with initially growing tumors, in which spontaneous regression would probably be expected to be even less frequent. This low rate of spontaneous regression in NF2 related VS suggests that most objective responses in a drug trial for these tumors would likely represent true activity of the agent being tested.
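The argument that a low spontaneous regression rate makes objective response a specific signal can be illustrated with a simple binomial calculation. The sketch below is an illustration only: it assumes a hypothetical single-arm trial of 40 patients (the typical size mentioned earlier), independent patients with one target tumor each, and a spontaneous regression probability of 3/84 taken from the Natural History Consortium figure. That figure was defined by a 2 mm linear change rather than by a volumetric criterion, so applying it here is an assumption. The function name and the response counts shown are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed inputs (not a trial design): 40 patients, 3/84 spontaneous regression rate.
n_patients = 40
spontaneous_rate = 3 / 84

# Chance that spontaneous regression alone yields at least k apparent "responses".
for k in (1, 3, 5):
    print(k, round(prob_at_least(k, n_patients, spontaneous_rate), 4))
```

Under these assumptions the tail probability falls quickly as the required number of responses rises, which is the sense in which most observed responses would likely reflect drug activity.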

Methods of measuring tumor size

Tumor response has historically been measured in relation to linear dimension, as in the 2000 RECIST criteria [82]; bidimensional area, as in the 1979 WHO criteria [83], and the 1990 “Macdonald” criteria for malignant glioma, which also incorporated steroid dose and neurological condition [84]; and volumetric measures based on cross-sectional imaging such as CT or MRI [51, 85]. For sporadic VS, measures based on greatest linear dimension are typically used in clinical practice to assess growth over time [54, 57] and operative risk of complications [58]. In following VS for growth in a “wait and scan” strategy, 2 mm of growth is commonly chosen as the threshold before recommending treatment [86, 87], or less often 1 mm [88]. Either the greatest tumor diameter measured parallel to the petrous face or simply the greatest tumor dimension (potentially including the intracanalicular portion), both on axial images, are common assessment methods [60, 89]. An average tumor diameter calculated as the square root of the product of the greatest dimension parallel to the petrous ridge and the greatest perpendicular diameter on axial images has been recommended as a standard measure of VS size [90]. While such measures may adequately summarize tumor size in sporadic VS, which typically have a regular, ovoid shape in their extracanalicular portion, NF2-related VS are often highly irregular in form (particularly after prior treatment) (Fig. 2), and such measures are likely to be inaccurate surrogates for true NF2 VS volumes. A comparative study showed that volumetric measures were much more sensitive to tumor growth (and, presumably, to tumor shrinkage) than linear measures [89], and volume estimates based on bidimensional area have been shown to be a poor surrogate for actual tumor volumes [91].
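As a small illustration of the average tumor diameter described above (the square root of the product of the greatest dimension parallel to the petrous ridge and the greatest perpendicular diameter), the following sketch computes that value. The function name and the example measurements are hypothetical.

```python
from math import sqrt


def mean_diameter_mm(parallel_mm, perpendicular_mm):
    """Square root of the product of two perpendicular axial diameters (mm)."""
    if parallel_mm <= 0 or perpendicular_mm <= 0:
        raise ValueError("diameters must be positive")
    return sqrt(parallel_mm * perpendicular_mm)


# Hypothetical tumor: 24 mm parallel to the petrous ridge, 16 mm perpendicular.
print(round(mean_diameter_mm(24.0, 16.0), 1))
```

Because the measure collapses the tumor to two axial diameters, very different irregular shapes can share the same value, which is the limitation noted for NF2-related VS.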

Fig. 2. Axial gadolinium-enhanced T1-weighted MRI scans. a Sporadic unilateral VS, showing regular ovoid shape. b NF2-related VS showing highly irregular shape

For volumetric measure of a VS, a simple method is the “box model” in which the greatest anteroposterior, mediolateral, and superoinferior measures are multiplied [52, 59], or a two-box modification that measures intracanalicular and extracanalicular volumes separately [62]. Two limitations of the box method are the considerable interobserver variation in measuring the three tumor dimensions used in this method [87, 92] and the irregular shape of most NF2 VS, in contrast to the regular shape of sporadic VS. A method that uses summed areas from individual cross-sectional imaging slices (typically axial slices) has been shown to yield accurate volumetric measurements, within 10% of true volumes when five or more axial slices through the target volume are used [93]. A study of volumetric measurement of NF2 VS using this method, with 3 mm axial slices, showed an intrarater coefficient of variation of 5% or less for tumors larger than 1 cm³ [89]. Isotropic voxels of 1 mm are now routinely feasible on many MRI systems, and the smaller the voxel size, the smaller the VS size and change that can be accurately detected. We recommend volumetric measurement as the means of assessing objective tumor response in NF2 VS trials.
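The two volume estimates discussed above can be sketched as follows. This is a minimal illustration, not a segmentation pipeline: it assumes contiguous slices of uniform thickness with the tumor cross-sectional area already traced on each slice, and the function names are hypothetical. The check uses an idealized 21 mm sphere rather than patient data.

```python
from math import pi


def box_volume_ml(ap_mm, ml_mm, si_mm):
    """Box model: product of the three greatest dimensions (1 ml = 1000 mm^3)."""
    return ap_mm * ml_mm * si_mm / 1000.0


def slice_sum_volume_ml(areas_mm2, slice_thickness_mm):
    """Summed-area method: per-slice tumor area times slice thickness."""
    return sum(areas_mm2) * slice_thickness_mm / 1000.0


# Idealized 21 mm sphere sampled by seven contiguous 3 mm axial slices.
radius = 10.5
slice_centers = [-9, -6, -3, 0, 3, 6, 9]  # slice mid-planes in mm
areas = [pi * (radius**2 - z**2) for z in slice_centers]

true_ml = 4 / 3 * pi * radius**3 / 1000.0
print("true:", round(true_ml, 2))
print("slice sum:", round(slice_sum_volume_ml(areas, 3.0), 2))
print("box:", round(box_volume_ml(21, 21, 21), 2))
```

Even for this regular shape the box product overstates the true volume, while the slice sum stays close to it; for irregular NF2 tumors the per-slice areas would come from manual or semi-automated contouring.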

Volumetric measurement of VS: special considerations in NF2

An additional issue in assessing tumor volume for trials is the number of lesions that can contribute to volume measurement. In cancer trials, the widely used RECIST criteria specify that up to five lesions per organ and up to 10 in total can contribute to tumor size measurement [82]. NF2 differs from malignant primary tumors in that, although multiple lesions are usually present, they do not represent metastases. In addition, because of their critical sites of origin, most individual NF2 lesions have the potential to cause distinct neurological symptoms through growth. We recommend that individual lesions be measured separately. Most trials will probably specify a single index or “target” lesion that qualifies a patient for trial entry and is tracked as the principal endpoint. For example, a growing VS on the side of an only hearing ear might be the target lesion, and the sizes of other schwannomas and meningiomas would then be followed as secondary endpoints.

Threshold for defining objective response

No universally agreed definition of the degree of tumor volume reduction necessary to declare response in phase II trials has emerged, although a 50% change in volume has been proposed for malignant glioma trials [51, 94]. For benign tumors such as VS, as well as for agents assumed to have a cytostatic rather than cytotoxic mechanism of action, a smaller change in volume may be more sensitive in detecting drug activity. Volumetric changes of 15% [95] or 20% [96] have been used to define response and progression in early phase trials in neurofibromatosis type 1 associated plexiform neurofibromas. We propose using a 20% change in comparison with baseline as the threshold to define response or progression in NF2 VS. This corresponds roughly to a 2 mm change in diameter of a 22 mm diameter sphere, which would be the typical threshold for reliable clinical measurement of a minimal change in a VS of average size.
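The correspondence between a 20% volume change and a diameter change of roughly 2 mm can be checked with sphere geometry. The sketch below assumes an idealized spherical tumor, which is a simplification (NF2-related VS are often irregular), and the function names are hypothetical.

```python
from math import pi


def sphere_volume_mm3(diameter_mm):
    """Volume of a sphere from its diameter: V = pi * d^3 / 6."""
    return pi * diameter_mm**3 / 6


def percent_volume_change(d0_mm, d1_mm):
    """Percent volume change when a sphere's diameter goes from d0 to d1."""
    return 100.0 * (sphere_volume_mm3(d1_mm) / sphere_volume_mm3(d0_mm) - 1.0)


def diameter_change_mm(d0_mm, volume_change_pct):
    """Diameter change (mm) that produces a given percent volume change."""
    return d0_mm * ((1.0 + volume_change_pct / 100.0) ** (1.0 / 3.0) - 1.0)


# A 22 mm sphere shrinking by 2 mm, and the diameter change behind a 20% shrinkage.
print(round(percent_volume_change(22.0, 20.0), 1))
print(round(diameter_change_mm(22.0, -20.0), 2))
```

For an ideal 22 mm sphere, a 2 mm decrease in diameter corresponds to a volume reduction of about 25%, and a 20% volume reduction corresponds to a diameter decrease of about 1.6 mm, consistent with the rough equivalence stated above.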

Response should be measured at standardized timepoints to facilitate comparison across trials. We suggest the inclusion of (at least) 6-, 12-, and (when available) 24-month assessments in trial reports, although investigators often would specify more frequent scanning in trial protocols. Responses must be durable, i.e., confirmation of radiographic response should be performed at least 2 months after the criteria for response are first met.
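The volumetric threshold and the durability requirement can be combined in a short sketch. Several details here are assumptions made for illustration and are not specified in the text: the labels "response", "progression", and "stable"; treating a change of exactly 20% as meeting the threshold; and resetting the confirmation clock when an intervening scan no longer meets the response criterion. Function and parameter names are hypothetical.

```python
def classify_volume_change(baseline_ml, current_ml, threshold_pct=20.0):
    """Label a scan by percent volume change from baseline (20% threshold)."""
    change_pct = 100.0 * (current_ml - baseline_ml) / baseline_ml
    if change_pct <= -threshold_pct:
        return "response"
    if change_pct >= threshold_pct:
        return "progression"
    return "stable"  # assumed label for changes inside the threshold


def is_confirmed_response(baseline_ml, scans, min_gap_months=2.0):
    """True if a response is seen again at least min_gap_months after it first appears.

    scans: list of (months_since_baseline, volume_ml), sorted by time.
    Assumption: a non-response scan in between resets the confirmation clock.
    """
    first_response_month = None
    for month, volume_ml in scans:
        if classify_volume_change(baseline_ml, volume_ml) == "response":
            if first_response_month is None:
                first_response_month = month
            elif month - first_response_month >= min_gap_months:
                return True
        else:
            first_response_month = None
    return False


# Hypothetical patient: 4.0 ml at baseline, then scans at 6 and 12 months.
print(is_confirmed_response(4.0, [(6, 3.0), (12, 3.1)]))
```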

Justification of endpoint selection: hearing response

Hearing loss and quality of life in NF2 patients

Bilateral VS are the hallmark of NF2, and most NF2 patients eventually suffer complete bilateral hearing loss either from tumor growth or from the morbidity of surgical or radiation treatment. Clinical experience indicates that many, if not most, NF2 patients lose all useful hearing before death. Because many patients have their VS completely excised after hearing is lost through tumor growth, prolonged survival with complete hearing loss as the dominant neurological deficit is possible if the burden of other NF2-related tumors (such as meningiomas and spinal tumors) is low; such patients are frequently encountered in clinical practice. Although data on the frequency of hearing loss in an NF2 population are not available, Baser et al. [64] reported that 37% of patients in the UK NF2 registry had undergone bilateral VS removal by age 50; most such patients would have no useful hearing, and many more who had not had bilateral surgery would have lost all hearing as well through tumor growth. Because most NF2 patients lose hearing after acquisition of language (i.e., they are postlingually deafened), the effect of hearing loss on their quality of life is typically severe, affecting psychological and social function, and employment [97]. Although no quality of life studies specific to NF2 patients have been performed, the adverse effect of severe and even moderate hearing loss on health-related quality of life (HRQoL) has great face validity and has been well documented in children and adults who lose hearing from other causes [98–102]. Conversely, treatments that improve hearing function in patients with moderate to severe hearing loss, such as hearing aids, cochlear implants, and ear surgery, cause improvements in HRQoL that are easily measured [103–110].

Non-hearing neurological consequences of VS

Other neurological consequences of VS deserve consideration as possible response measures in NF2 clinical trials. Balance difficulties and tinnitus are common in NF2 patients, especially those with bilateral surgery, but both are too difficult to quantify to use as a primary endpoint in trials [111–115]. Facial nerve palsy, unilateral or bilateral, is frequent in NF2 patients, can be graded using standard scales [116], and has a significant impact on HRQoL [117], but it more often represents an effect of surgical treatment or of coincident facial nerve schwannomas than of growing VS. Similarly, trigeminal nerve deficits often reflect consequences of treatment or of trigeminal schwannomas, and are difficult to quantify. Although we know of no comparable NF2-specific research, quality of life studies in patients with sporadic, unilateral VS consistently identify hearing loss as the deficit with most impact on their subjective well-being, particularly after surgery [115, 118, 119]. Although non-VS tumors such as schwannomas of other cranial nerves, meningiomas, and spinal tumors can certainly affect HRQoL adversely in some NF2 patients, not all patients have such tumors [7, 59, 120] and their neurological consequences vary so widely that uniform quantification is not possible.

Objective measurement of hearing function

Although hearing seems to be the obvious choice for a trial endpoint with direct relevance to patients’ quality of life, hearing can be measured in many ways in modern audiometric practice, and adoption of a uniform hearing measure for the purposes of NF2 trials would be an advantage. Hearing measures include those included in most routine audiograms, such as simple detection of sound (measured as pure tone thresholds at individual stimulus frequencies or their average of several frequencies [pure tone average, PTA]) and speech discrimination (measured as word recognition score based on standardized word lists and presentation conditions). One composite hearing measure, commonly known as the Gardner–Robertson classification [90, 121] has been recommended for reporting hearing preservation after VS surgery and is commonly used to grade hearing after radiosurgical VS treatment as well. This scale combines PTA and word recognition scores into a four-level scale. Other hearing tests are frequently abnormal in VS patients, such as electrocochleography or auditory brainstem response (ABR) testing, but these methods commonly detect changes in auditory function that are not subjectively perceived by patients—that is, these abnormalities may be statistically, but not clinically significant. Other tests appraise the subject’s assessment of their own hearing function in more subjective terms, such as the Glasgow Benefit Inventory [122], the Hearing Participation Scale [123], the Hearing Handicap Inventory for Adults [103], and the Hearing Satisfaction Scale [124], among others. These scales are commonly used in assessing the societal benefit of interventions that improve hearing (i.e., cost-utility studies, especially across different interventions), but are not typically used as primary outcome measures in such trials.

Word discrimination and hearing in NF2

Of the available means of testing hearing, we suggest that speech discrimination (word recognition score) be used as the primary hearing endpoint in phase II NF2 VS trials, for several reasons. First, word recognition scores directly measure the aspect of severe hearing loss that affects patients most acutely in daily life, namely communication [125]. Second, word recognition is affected disproportionately to pure tone thresholds in VS patients. VS cause hearing loss primarily through dysfunction of the auditory nerve, presumably because of direct compression from the tumor. Recent histopathologic data suggest that secondary degeneration of the cochlea may also be present, as manifest by loss of cochlear hair cells and spiral ganglion neurons and other histopathologic changes [126]. Together, these cochlear and neural alterations lead to the characteristic hearing loss pattern in VS patients, in which word recognition is affected to a greater extent than detection of pure tones [127]. In many cases of hearing loss from non-VS etiologies, such as conductive hearing loss, hearing aids can improve hearing loss with elevated thresholds by amplifying incoming sound presented to the cochlea. However, hearing aids cannot reliably address hearing loss-related impaired word recognition, such as VS-related hearing loss, because poor quality word information delivered at loud levels provides limited improvement. Indeed, while hearing with poor word recognition but relatively preserved pure tone thresholds may still have value in detecting environmental sounds (such as a car horn), its main value in communication may be only in detecting speech onset and orienting toward the speaker to assist lip reading. An improvement in word recognition from treatment would therefore be of special value in NF2 patients, who often have only one hearing ear and whose hearing deficits usually impose limited benefit from amplification. 
As a third reason for basing hearing response assessment on word recognition scores, we cite our own small clinical experience with treating NF2 patients with targeted agents, in which we observed some cases of striking improvement in word recognition with only modest PTA improvement; these patients considered their gains in hearing to be of real benefit [27–29] (S.R. Plotkin, unpublished data).

Gardner–Robertson scale as an NF2 trial endpoint

The Gardner–Robertson scale, a composite hearing endpoint that is in wide clinical use, has previously been recommended for use in reporting hearing preservation results after VS surgery [90, 121]. This scale uses both PTA and word recognition scores and classifies hearing into four levels. This scale was designed to identify “useful” hearing in ears where the ear being tested was the worse-hearing ear and the contralateral ear was normal, or nearly so. It therefore uses only one category to score all hearing ears with word recognition less than 50%. We feel that hearing differences that fall within this broad range have clinical significance when the ear being tested is the only hearing ear, as in many NF2 patients, and do not recommend this scale for use as a primary endpoint in NF2 trials because the score lacks sensitivity in this application. Similarly, the various criteria for success of steroid treatment of idiopathic sudden sensorineural hearing loss often assume good hearing function in the contralateral ear in defining response (that is, the affected ear must improve to a certain degree toward “normal” as defined by the unaffected ear) and are typically insensitive to minor improvements in the affected ear [128].

Time to hearing failure in NF2

As for tumor size, the natural history of hearing changes in NF2, for both PTA and word recognition, is only partly understood; few NF2 natural history reports have included data on hearing. No study reports the time over which PTA or word recognition scores can be expected to be stable in NF2 ears with growing VS, values which would be required to design single-arm phase II trials with a time to progression hearing endpoint. Abaza et al. [52] reported that baseline PTAs were worse in NF2 patients (ears) with larger VS and that older patients had worse PTAs. PTAs were more likely to decline over time intervals with active tumor growth. Similarly, worse speech recognition scores were found in ears with larger VS tumors. In the NF2 Natural History Consortium patients, PTA values showed a slow decline, with differences becoming significant only after 3–5 years of observation [53]. Only 4 of 106 ears (4%) declined from Gardner–Robertson hearing class A or B to class C or D over 7 months to 2 years of follow-up, and only 3 of 36 ears (8%) showed similar declines over 3–5 years of observation. This cohort was not limited to patients with growing VS, whose hearing might be expected to be at higher risk. The authors recommended against using decline in hearing as an endpoint in NF2 trials because the endpoint was too infrequent (using their definition) and unrealistically large trials would be required.

Proposed hearing response criteria

We propose that improvement in hearing as measured by word recognition scores be considered as an important endpoint in initial NF2 VS phase II trials. Information from natural history studies does not exclude possible transient improvements in hearing in some patients, although the overall pattern of progressive decline over time is very clear. Clinical experience would indicate that such improvement in untreated patients is uncommon, small in magnitude, and transient in duration. In addition, neither surgery nor radiation treatment of VS is followed by improvement in hearing except in unusual cases [70–76]. For initial phase II trials in NF2-related VS, we suggest using the smallest change in word recognition score that can be detected with statistical significance as the criterion for a hearing response (see below). If an additional category of “minor hearing response” is desired, an appropriate change in PTA can be used (see below). Responses must be durable, i.e., confirmation of hearing response should be performed at least 2 months after the criteria for response are first met.

As an important research goal, we propose including an additional test such as the Glasgow Benefit Inventory to confirm the clinical significance of hearing improvement in only hearing ears when responses measured in this way are achieved.

In the next sections, we offer more detailed suggestions for assessing tumor volume and hearing function in phase II drug trials in NF2-related VS.

Measuring VS tumor response by volume

Definitions

At baseline, VS will be categorized as follows: measurable (lesions that are 0.5 ml or greater in volume when imaged with contrast enhanced MRI as detailed below) or nonmeasurable (all other lesions). VS previously treated with radiation are considered nonmeasurable, because both early increases and later gradual decreases in tumor volume after radiation might confound response assessment [129–135]. (An exception may be made for index VS that are unequivocally enlarging >5 years after prior radiation treatment.) VS that have previously been partially resected are measurable if investigators are certain that enhancement represents solid schwannoma tissue rather than linear postoperative enhancement that represents scar or inflammatory tissue, as is sometimes seen in the IAC after a complete resection [136–139]. Measurements should be recorded in metric notation. Baseline evaluations should be performed as closely as possible to the beginning of treatment and not more than 6 weeks prior to beginning treatment.

Patients with NF2 may have special circumstances that hinder measurement of a VS. Collision tumors (i.e., tumor masses formed by multiple tumors that abut each other; Fig. 3) in which the VS cannot be clearly distinguished from the neighboring tumor(s) are considered nonmeasurable. VS obscured by artifact related to placement of an auditory brainstem implant (Fig. 4) or other devices such as some cerebrospinal fluid shunts are also considered non-measurable.
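The baseline categorization described above can be sketched as a short rule check. This is only an illustration under stated assumptions: the class `BaselineVS`, its field names, and the function `is_measurable` are hypothetical, and the optional exception for previously irradiated VS is treated here as always granted when the investigator flags it.

```python
from dataclasses import dataclass

MIN_MEASURABLE_ML = 0.5  # volume threshold stated in the text


@dataclass
class BaselineVS:
    """Hypothetical record of one VS at baseline (field names are illustrative)."""
    volume_ml: float
    prior_radiation: bool = False
    # Exception in the text: unequivocal enlargement >5 years after radiation.
    enlarging_over_5y_after_radiation: bool = False
    prior_partial_resection: bool = False
    # Investigator judgment that enhancement is solid schwannoma, not scar.
    enhancement_is_solid_tumor: bool = True
    collision_tumor: bool = False
    obscured_by_artifact: bool = False


def is_measurable(vs: BaselineVS) -> bool:
    """Return True if the VS meets the measurability rules described above."""
    if vs.volume_ml < MIN_MEASURABLE_ML:
        return False
    if vs.prior_radiation and not vs.enlarging_over_5y_after_radiation:
        return False
    if vs.prior_partial_resection and not vs.enhancement_is_solid_tumor:
        return False
    if vs.collision_tumor or vs.obscured_by_artifact:
        return False
    return True
```

In a protocol, the judgment fields (solid tumor versus scar, collision tumor, artifact) would be recorded by the study radiologist; the code only combines those judgments.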

Fig. 3 Axial gadolinium enhanced T1 weighted MRI scan showing NF2 related VS with multiple abutting tumors, probably both meningiomas and schwannomas of other cranial nerves. Such a tumor would be nonmeasurable by MRI volumetric technique

Fig. 4 Axial gadolinium enhanced T1 weighted MRI scan showing imaging artifact from an auditory brainstem implant device, preventing accurate MRI volumetric measurement of the ipsilateral VS volume

Specification of methods of measurement

The same techniques should be used to characterize VS at baseline and during follow-up. To declare a tumor response, patients should be on stable or decreasing corticosteroid doses compared to baseline.

Radiographic response is determined by quantitative volumetric tumor analysis of contrast-enhanced MRI scans [89]. MRI scans should include pre- and post-contrast images of the brain including thin spin-echo or gradient-echo T1-weighted axial images through the internal auditory canal (slice 1.5 mm or thinner, skip 0 mm). Fat-suppression techniques should be used for post-contrast studies if there is a prior history of surgery in the internal auditory canal, to avoid mistaking fat grafts in the internal auditory canal for enhancing tumor [140]. Similarly, true fluid cysts within a VS or arachnoid cysts abutting a VS should be excluded from volume measurement. Tumor volumetry will typically be performed centrally in multicenter trials, and should be performed using post-contrast images through the internal auditory canal with the Vitrea2 (Vital Images, Minnetonka, MN) semi-automated segmentation software (or comparable software), edited manually as needed by the reviewer. The coefficient of variation for VS greater than 1 cc using this technique is generally below 5% [89], so we suggest that 5% would be a reasonable lower bound for definition of minor response. Although drugs tested to date do not affect VS enhancement, T2 images should also be examined to confirm that apparent volume reduction represents a true change in tumor size (as apparent from T2 visualization of tumor borders) rather than a change in intrinsic enhancement of VS tissue.

We anticipate that most NF2 phase II trials will define an “index” lesion to be used for response assessment, such as the VS on the side of an only hearing ear. (The definition of an index lesion may vary for trials that focus on other NF2 histologies such as meningiomas or ependymomas, or for patients with bilateral VS and no hearing in either ear.) In patients with bilateral VS the index VS should be selected before trial enrollment. Some NF2 patients have seemingly innumerable meningiomas and schwannomas; we propose that up to five lesions other than the index VS that can each be accurately measured by MRI scan should be identified as non-target lesions. Non-target lesions for NF2 patients should be defined by the study radiologist at baseline as contralateral VS, non-vestibular schwannomas (non-VS), and meningiomas. The volume of the contralateral VS and individual volumes for all non-VS and meningiomas (or, alternatively, total non-target schwannoma and total meningioma volumes) should be recorded at baseline. The change in volume of non-target lesions will not be used for response criteria but can be used and reported as secondary endpoints, if desired.

Radiographic response

In the proposed criteria, an objective radiographic response is defined as ≥20% reduction in tumor volume, comparing the nadir of tumor volumes measured during treatment to the baseline volume. If desired, a minor radiographic response may be defined as a reduction in tumor volume between 5% and 19%. Radiographic progression is defined as an increase in tumor volume greater than 20% compared to the baseline evaluation. In general, only measurable tumors can be assessed for response. Occasionally, a VS that was nonmeasurable at baseline may cause progression to be declared, i.e., a tumor that was smaller than 0.5 ml at baseline that enlarges to greater than 0.5 ml during treatment.
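The volumetric thresholds above can be expressed as a small classifier. This is a minimal sketch, not a validated implementation: the function names are hypothetical, the label "stable" for all remaining cases is an assumption (the text does not define it for volume), and reductions between 19% and 20% are treated as minor responses.

```python
MIN_MEASURABLE_ML = 0.5  # measurability threshold stated in the text


def classify_radiographic(baseline_ml: float, followup_ml: float) -> str:
    """Classify one follow-up volume against the baseline volume."""
    if baseline_ml < MIN_MEASURABLE_ML:
        # Nonmeasurable at baseline: only progression can be declared,
        # when the tumor grows beyond the measurability threshold.
        if followup_ml > MIN_MEASURABLE_ML:
            return "progression"
        return "nonevaluable"
    change_pct = 100.0 * (followup_ml - baseline_ml) / baseline_ml
    if change_pct > 20.0:
        return "progression"
    if change_pct <= -20.0:
        return "response"
    if change_pct <= -5.0:
        return "minor response"
    return "stable"  # assumption: every other case is labeled stable


def best_radiographic_response(baseline_ml: float, on_treatment_ml: list) -> str:
    """Best response compares the nadir on-treatment volume to baseline."""
    return classify_radiographic(baseline_ml, min(on_treatment_ml))
```

Because the comparison uses floating-point percentages, volumes that fall exactly on a threshold may need an explicit tolerance in a real analysis plan.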

Measuring VS tumor response by hearing assessment

Definitions and methods of measurement

Hearing in phase II studies will be measured by standard audiologic techniques including determination of word recognition scores and pure-tone averages ipsilateral to the target vestibular schwannoma.

Word recognition is measured using 50-item CID-W22 (Central Institute for the Deaf; Ira Hirsh recording) monosyllable wordlists and ranges from 0% to 100% [141]. Other standardized monosyllable recorded lists may be used including NU#6 [142], CNC [143] and others as long as similar performance-to-intensity functions can be shown for the recording used. Because word lists can differ in difficulty, the word list used for an individual in a trial should be noted and kept consistent for all testing throughout the protocol treatment. The principle of using equivalent performance functions may also be applied to word lists in other languages, such as bisyllabic Spanish words [144]. When the tester is an English speaker, we recommend using a trial entry criterion of “testable in English” (rather than “English as a first language”) which promotes inclusion without reducing validity [145]. In all cases, live voice presentation and truncated (e.g., 25-word) lists are not acceptable.

Pure-tone thresholds are most often measured at the frequencies of 250, 500, 1000, 2000, 3000, 4000, 6000 and 8000 Hz. The pure-tone average is calculated as the average of thresholds at 500, 1000, 2000 and 3000 Hz as recommended by the American Academy of Otolaryngology (AAO) [146]. In cases of VS, the hearing loss is sensorineural. However, an unrelated, additional conductive loss may also be present which would further elevate the air conduction thresholds. While air conduction thresholds (with masking as necessary) are most often the best data, in cases with additional conductive loss, the masked bone conduction results are a better measure of the sensorineural hearing loss. It is recommended that masked bone conduction results be used when air-bone gaps larger than 10 dB are found at more than one frequency.
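As an illustration of the calculation described above, the following sketch computes the four-frequency pure-tone average and applies the air-bone gap rule. The function names and the dictionary layout (frequency in Hz mapped to threshold in dB HL) are assumptions made for the example.

```python
PTA_FREQS_HZ = (500, 1000, 2000, 3000)  # frequencies named in the text


def pure_tone_average(thresholds_db: dict) -> float:
    """Average the thresholds (dB HL) at 500, 1000, 2000 and 3000 Hz."""
    return sum(thresholds_db[f] for f in PTA_FREQS_HZ) / len(PTA_FREQS_HZ)


def use_bone_conduction(air_db: dict, bone_db: dict, gap_db: float = 10.0) -> bool:
    """True when air-bone gaps larger than gap_db occur at more than one frequency."""
    shared = set(air_db) & set(bone_db)  # frequencies tested by both routes
    n_gaps = sum(1 for f in shared if air_db[f] - bone_db[f] > gap_db)
    return n_gaps > 1
```

For example, air conduction thresholds of 40, 50, 60 and 70 dB HL at the four frequencies give a PTA of 55 dB HL; if masked bone conduction is 15 dB better at two of those frequencies, the bone conduction values would be used instead.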

When forming a PTA, it is sometimes necessary to include data where the upper limit of the audiometer is reached and the patient has not responded. This should not be treated as missing data. Our recommended approach is to extrapolate conservatively by adding one audiometric step (5 dB) to the upper limit value, thereby capturing the large magnitude of the loss at that frequency [145].
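The extrapolation rule can be sketched as follows. The names are hypothetical, a "no response" is represented by `None`, and the audiometer limit at each frequency is supplied by the caller because it depends on the equipment.

```python
STEP_DB = 5  # one audiometric step, as stated in the text


def effective_threshold(measured_db, audiometer_limit_db: float) -> float:
    """Return the measured threshold, or limit + 5 dB when there was no response."""
    if measured_db is None:  # patient did not respond at the equipment limit
        return audiometer_limit_db + STEP_DB
    return measured_db


def pta_with_no_response(measured: dict, limits: dict) -> float:
    """Four-frequency PTA in which no-response values are extrapolated, not dropped."""
    freqs = (500, 1000, 2000, 3000)
    values = [effective_threshold(measured[f], limits[f]) for f in freqs]
    return sum(values) / len(values)
```

With an assumed equipment limit of 110 dB HL, thresholds of 80 and 95 dB HL plus two no-response frequencies yield (80 + 95 + 115 + 115) / 4 = 101.25 dB HL.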

Hearing response

Because reduced word recognition directly limits patients’ quality of life, the primary audiologic outcome will be word recognition, with response defined as the smallest improvement in word recognition score over baseline that meets criteria for statistical significance at the P = 0.05 level. Table 1 shows values for detecting a statistically significant change in word recognition at the P = 0.05 level for different baseline scores [147]. The table can be used to compare any two scores from the same patient, such as before and after treatment [148].

Table 1 Clinical criteria for definition of hearing response based on a 50-word hearing test

When hearing variables are used as response criteria, it is important to recognize the presence of “ceiling” and “floor” effects. Normal audiologic values (100% correct for word recognition and 0 dBHL for thresholds and PTA) represent a limit for improvement at one extreme of the range (i.e., “ceiling” effect). At the other end of the spectrum, non-detectable audiologic values (0% correct for word recognition and equipment maximum [e.g., 110 dBHL] for thresholds and PTA) represent a limit for decline at the other extreme of the range (i.e., the “floor” effect). As a result of the ceiling effect, patients with baseline word recognition scores above 94% cannot achieve a statistically significant improvement in hearing at the P = 0.05 level with treatment and hence are nonevaluable for hearing response using these criteria. (As a practical matter, one or two-word lapses in attention will prevent patients with baseline word recognition scores ranging between 86% and 94% from achieving 100% on subsequent tests even if their hearing improved to normal. Investigators may wish to consider such patients nonevaluable for hearing responses in designing protocols for which hearing response is a primary endpoint.) In practice, the magnitude of potential improvement on audiologic measures is related to the baseline hearing loss. For example, a patient with word recognition scores of 60% can potentially recover “twice as much” as a patient with word recognition scores of 80%. This concept applies equally to PTA: patients with worse (larger) starting threshold values have more freedom to improve than those with better threshold values.

A hearing response is defined as an improvement in the word recognition scores above the 95% critical difference threshold (Table 1), comparing the best word recognition score achieved during treatment to the baseline word recognition score. To declare a hearing response, patients should be on stable or decreasing corticosteroid doses compared to baseline. Hearing should be stable at baseline or slowly decreasing, i.e., patients with sudden hearing loss immediately before beginning treatment should not be considered eligible for a hearing response. Such patients often regain hearing spontaneously or with corticosteroid treatment, at least in sporadic VS [149–151].

If desired, investigators may also define a minor hearing response as an improvement in the pure-tone average of 10 dB in the setting of stable word recognition scores (i.e., within the 95% critical difference threshold defined in Table 1), taking as reference the baseline pure-tone average.

Progression in hearing deficit is defined as a decrease in the word recognition scores below the 95% critical difference threshold (Table 1), taking as reference the baseline word recognition score. Stable hearing is defined as all other responses.
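The hearing categories defined above can be sketched as a classifier. This is an illustration under stated assumptions: the function names are hypothetical, the lower and upper critical-difference bounds must be looked up in Table 1 for the patient's baseline score and passed in by the caller, and the minor response is read as a PTA improvement of at least 10 dB.

```python
CEILING_PCT = 94  # baseline scores above this cannot improve significantly


def evaluable_for_hearing_response(baseline_wrs_pct: float) -> bool:
    """Patients above the ceiling are nonevaluable for hearing response."""
    return baseline_wrs_pct <= CEILING_PCT


def classify_hearing(followup_wrs_pct: float,
                     lower_bound_pct: float,
                     upper_bound_pct: float,
                     baseline_pta_db: float,
                     followup_pta_db: float,
                     minor_pta_gain_db: float = 10.0) -> str:
    """Classify hearing using caller-supplied 95% critical-difference bounds."""
    if followup_wrs_pct > upper_bound_pct:
        return "response"
    if followup_wrs_pct < lower_bound_pct:
        return "progression"
    # Word recognition is stable; a lower PTA means better hearing.
    if baseline_pta_db - followup_pta_db >= minor_pta_gain_db:
        return "minor response"
    return "stable"
```

When declaring a response, the follow-up score passed in would be the best word recognition score achieved during treatment. The corticosteroid-dose and durability requirements are not modeled here and would need separate checks.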

Discussion

In this communication, we suggest two important means of assessing tumor response of NF2-related VS to drug treatment: radiographic tumor response (defined as a 20% reduction in tumor volume) and hearing response (defined as a statistically significant improvement in word recognition score). Prior experience with chemotherapy for VS has been very limited, and no standard response criteria for such trials have been defined. Riccardi reported improvement in depression, weight gain and improved muscle mass, gait improvement, absence of tumor-associated pain, and “slowing of neurofibromas growth” in one NF patient treated using ketotifen [152], and Jahrsdoerfer and Benjamin [153] reported cessation of tumor growth and stabilization or improvement in hearing in two patients treated with cyclophosphamide, doxorubicin and dacarbazine. Neither report used predefined criteria to classify response.

Recent advances in the understanding of schwannoma tumor biology [20–26], combined with development of active anticancer drugs designed to target specific molecular pathways in neoplastic cells, suggest that trials of drug treatment for NF2-related VS may now be warranted. We suggest that the criteria defined here deserve consideration for inclusion as primary endpoints in such trials, at least initially. When more detailed information becomes available on the time to tumor progression and time to hearing loss in untreated NF2 VS, these endpoints are likely to replace tumor shrinkage and hearing improvement in trials. However, because many NF2 patients represent new somatic mutations in the NF2 gene and do not belong to known NF2 kindreds [4], some patients will continue to present as new diagnoses with poor hearing in both ears. Such patients would benefit from drugs that cause hearing improvement, rather than stability. In addition, a recent review of phase II trial design for targeted anticancer agents (many of which would be expected to be cytostatic) showed that most agents reaching regulatory approval generated objective responses in phase II testing; this was correlated with endpoints reflecting tumor stability [154].

Special considerations for primary endpoints in NF2 trials, and composite endpoints

Some special considerations in NF2 phase II trials may prevent either radiographic response or hearing response from serving all of the typical functions of a phase II primary endpoint. For example, patients in a phase II trial whose tumors progress according to the primary endpoint (e.g., radiographic criteria) do not normally continue to receive the study drug and are no longer treated on protocol. However, in an NF2 trial some patients may achieve a hearing response to treatment while developing progressive disease by MRI scan. Although treatment is deemed to have failed in these patients by traditional response criteria, we feel these patients should be allowed to continue treatment with the study drug on a compassionate-use basis at the discretion of the study chair in an NF2 trial. We believe this can be justified based on the current standard of care for NF2-related VS, in which growing tumors are often monitored without active intervention to preserve hearing. Conversely, some patients might enter an NF2 VS trial because of a growing VS in an only hearing ear with word recognition scores above 94%. Such patients would be ineligible for hearing responses because of the “ceiling effect” described above, although drug activity could still be detected through changes in tumor size.

One approach to the problem of two endpoints, either one of which might be more important in a given clinical situation, is to combine the two into a composite endpoint. Table 2 shows one possible means of integrating objective radiographic response and hearing response in NF2 VS. As shown, the combination of major and minor radiographic and hearing responses can be used to construct a composite endpoint with four response categories: major response, minor response, stable disease, and progressive disease. Whether a composite endpoint of this nature has value will probably become clear as clinical experience with such trials accrues. Initially, we suggest using objective (radiographic) tumor response as the major endpoint for designing phase II trials, because its measurement in relation to known NF2 VS natural history seems most secure. All patients enrolled in an NF2 VS trial would be expected to be eligible for volumetric response, while our preliminary experience suggests only half of patients might be eligible for a hearing response or composite endpoint response. A volumetric primary endpoint would increase efficiency in this rare disease. Hearing response rates, though, also deserve significant weight in judging a drug’s clinical benefit in NF2 related VS and whether to progress to phase III testing, as do response durability, treatment toxicity, and other factors.

Table 2 Composite response criteria (combining radiographic and hearing response) for phase II studies of NF2-related VS

Future research and novel endpoints

This communication also identifies several important research goals in NF2 natural history in order to facilitate future development of new endpoints for NF2 trials. These goals include identifying the time to tumor progression and time to measurable hearing loss in untreated NF2-related VS, and the relation of both endpoints to patient prognostic factors (such as age, baseline tumor volume, and measures of disease severity). In addition, researchers should seek better definition of the smallest improvement in hearing (specifically word recognition) that has clinical significance for NF2 patients with only hearing ears (i.e., the minimal clinically significant difference). If the test for word recognition using standard audiometric technique that we describe proves to be insensitive to small but meaningful hearing improvements in the context of an only hearing ear, more sensitive hearing endpoint measures may need to be developed. Finally, the possibility of using a composite endpoint in NF2 trials, such as the one we describe here, should be considered for future use in selecting drugs for phase III trials in this rare disease.