Key Points for Decision Makers

Trial-based economic evaluations require careful consideration of design, conduct, analysis and reporting.

Areas for improvement include the need for more robust methods to estimate resource utilisation, use of multiple imputation for missing data, and application of appropriate modelling for the assessment of the impact of non-adherence.

This review identifies economic methods that may also enhance clinical trials, such as calculating minimally important differences using discrete choice experiments and extrapolating data using modelling.

1 Introduction

Randomised controlled trials (RCTs) are an important foundation for economic evaluation. They generate patient-level data that provide an unbiased estimate of the efficacy or effectiveness of an intervention being tested, as well as an opportunity for collecting the resource use and cost data necessary for calculating an intervention’s cost effectiveness. However, RCTs are not a panacea and rarely provide the full information necessary for an economic evaluation or to inform a decision [1]. Approaches to trial-based economic evaluation have consequently evolved, as detailed in recent guidelines and reviews [2–4]. This article focuses on specific aspects of trial-based evaluations where there is scope for improvement, and introduces novel methodologies that may complement or supplant current approaches. The topics discussed can be broadly categorised into methods of interpreting trial-based data, and techniques for using trial-based data in model-based evaluation. The former category considers methods for resource use estimation, choice of outcome measure, handling missing data, accounting for patient non-adherence and analysis of multinational trial data. The latter category includes calculating minimally important differences and methods for extrapolation. The concepts discussed are based on a purposive review of the literature and are internationally relevant, but with a focus on practice in the UK.

2 Trial-Based Economic Evaluations

2.1 When to Conduct Economic Evaluation?

Several factors signpost the need to undertake an economic evaluation, including the following.

Anticipated difference in therapeutic benefit between interventions. New health technologies are less likely to be cost effective, and there is thus more reason to conduct an economic analysis, if any difference in health outcome compared with existing interventions is anticipated to be small.

Large cost differences. An economic analysis is warranted if the new health technology is potentially very costly and/or the two (or more) interventions being compared are of greatly different cost; or if there are considerable downstream costs that occur or continue beyond the trial time horizon that may differ between interventions, such as increased intensity of care or frequency of testing.

High levels of uncertainty around the cost effectiveness or economic impact of health technology; the greater the uncertainty, the stronger the case for economic evaluation.

Sizeable budget impact as a result of a potentially large population or population subgroup. Health economic analyses may be less appropriate if the decision rule is based on budget impact and the overall budgetary impact is very small as the opportunity cost associated with an incorrect decision may be limited.

2.2 Does the Economic Evaluation Need to be Based on Randomised Controlled Trial (RCT) Evidence?

Where an economic evaluation is indicated, it is imperative that due consideration is given as to whether an RCT is the most suitable platform for evidence acquisition. Decisions may have to be made before results are available from clinical trials; or, put another way, the value of generating economic evidence from a clinical trial might be less than the costs of postponing a decision. In such circumstances, an economic model may be preferred. Trials which include comparators that differ from routine care, include selective subpopulations of patients, or which are not pragmatic in design may not lead to reliable estimates of the comparative effectiveness of treatments, and may therefore bias the cost-effectiveness result.

Organisations such as the UK National Institute for Health Research Health Technology Assessment (NIHR-HTA) programme aim to inform clinical decisions by funding pragmatic studies that address issues of clinical and cost effectiveness. A review of 78 randomised studies funded by the NIHR-HTA programme published to June 2009 identified 75 (96 %) studies that included an economic analysis [5]; however, not all trials are amenable to economic analysis and not all interventions require trial-based economic evidence. It may be the case, for instance, that the intervention is low cost but non-inferior to current practice. While this might only be known ex post, there are specific instances where RCTs are less relevant for economic evaluations. A trial-based economic evaluation would be unnecessary in the case of a comparison of two similar interventions that are delivered in different settings, whereby incremental benefits might mean less use of expensive resources. The cost of either saline or albumin for resuscitation in patients with severe sepsis, for instance, is dwarfed by the potential savings from a reduction in expensive intensive care should one prove to be more effective. Cost would not be a factor to influence the decision to change clinical practice. Economic evaluation can also, in the appropriate situation, take data from non-RCT sources. For example, a cost-minimisation analysis comparing endometrial ablation performed as a day case under general anaesthesia or as an outpatient using local anaesthetic used hospital data that were routinely collected [6]. Similarly, a cost-minimisation analysis would be appropriate for generic medicines that are bioequivalent to their branded counterparts, without the need to source economic evidence directly from trials.

2.3 Modelling of RCT Data, or a Trial-Based Evaluation?

It is generally accepted that some form of economic evaluation is necessary to support funding decisions on new healthcare interventions. However, it is important to note that only a small proportion of healthcare interventions and services are prioritised for appraisal by HTA organisations. The National Institute for Health and Care Excellence (NICE) in the UK expects an economic model to form part of the assessment, but whether this is a trial-based economic evaluation is not a high priority [7].

A model can flexibly evaluate different patient populations, subgroups, interventions, outcomes, resources and cost effects. Such an approach is almost always justified in order to synthesise the evidence base and to take into account the uncertainty around the cost-effectiveness point estimate. However, the choice of whether to use RCT data, as one of several information sources to populate an economic model, or to undertake a trial-based economic evaluation and subsequent model whereby the majority of the model inputs are obtained from the trial, is dependent on a number of factors.

Situations that would point towards a trial-based economic evaluation being more appropriate might include those where costs are largely driven by, or estimated from, the primary outcomes of the trial (such as for cardiovascular outcomes or chronic obstructive pulmonary disease [COPD] exacerbations). Other factors that point to a preference for trial-based evaluations include situations where considerable heterogeneity of costs is expected, thereby justifying a more individual patient-specific data collection method, or where adherence, only apparent in a more pragmatic trial, may drive costs and outcomes.

2.4 Is the Trial in Question a Suitable Vehicle for an Economic Evaluation?

The decision as to whether a given trial is a suitable vehicle for an economic evaluation depends on whether the trial design is capable of providing unbiased answers to the clinical question, whether current practice (which may be best supportive care) is one of the alternatives being compared, and whether the trial setting allows for generalisable results. Factors to be considered include the trial type, relevance of the outcome measures and populations, expected effect size, external validity, as well as logistical considerations for data collection [8–11].

With regard to trial type, explanatory trials seek to identify whether an intervention works under ideal conditions (a measure of efficacy), while pragmatic trials consider whether an intervention works in real-life conditions (a measure of effectiveness). Ideally, trials must be at least partly pragmatic and therefore relate to actual clinical practice if they are to be suitable for economic analysis. However, in the case of new medicines, most RCTs are explanatory since they are related to efficacy (as demanded by regulatory authorities), are typically conducted within well-resourced settings, recruit a select sample of patients (e.g. defined by their adherence, co-morbidity, age) and often measure a short-term surrogate outcome. These do not always contain all the elements necessary for economic evaluation since they may be limited by factors such as lack of appropriate comparators, insufficient duration of follow-up, particularly of lifelong treatments for chronic diseases, inadequate powering for economic outcomes, and poor external validity due to limited generalisability to other populations or care settings. These factors may be mitigated by the use of modelling, for instance to synthesise an indirect comparison against a relevant comparator or to extrapolate treatment effects and costs over a longer time horizon.

On the other hand, pragmatic trials have broad inclusion criteria for trial participants, use patient-relevant outcome measures and are thus more suited to inform clinical decision making. They are intended to reflect usual practice and are better suited as vehicles for economic evaluation, but again modelling is invariably necessary in order to meet the evidential requirements of decision makers. However, in the absence of such studies (as is normally the case with new medicines), the analyst is faced with the challenge of how effectiveness, and hence cost-effectiveness, might be predicted from data pertaining to a treatment’s efficacy. Neither explanatory nor pragmatic trials are typically sufficient for data on adverse events, especially if these are rare.

2.5 Trials of Public Health Interventions: A Special Case?

Many public health interventions are best described as complex [12]. They have several components, and effects depend on the context of delivery, synergistic impact of other socioeconomic factors and general societal trends that are outside the remit of those conducting a trial. This makes it difficult to attribute cause and, by contrast to trials of clinical interventions, trials of public health interventions are conducted as pragmatic, community-based, and often cluster RCTs. Cluster RCTs require appropriate statistical methods of analysis, such as the use of multilevel modelling to control for the effect of clustering on outcomes and costs, and other considerations required for the sample size calculation, the univariate analysis of incremental costs and outcomes, the correlation between individual costs and effects, and the distribution of costs and outcomes [13].

There is a growing literature pertaining to solutions to methodological challenges of conducting economic evaluations of public health interventions [14–17]. Fischer et al. [18] question the ‘fit’ of traditional methods of HTA and preoccupation with evidence-based medical decision making to a public health context, arguing that this will inhibit spending decisions on public health interventions. They state that many population-focussed public health interventions seek to make small changes in behaviour across large populations, and large RCTs with sufficient power to detect such changes are impractical or prohibitively expensive. Consequently, many trials are underpowered and unable to demonstrate whether public health interventions are effective; however, even if underpowered, such trials can generate point estimates that can serve as inputs to economic models.

Fischer et al. [18] propose instead a decision theory approach where decision makers use their prior beliefs and experience in a structured Bayesian approach. They suggest that this approach is suitable for health improvement programmes such as screening and immunisation, behaviour change ‘nudges’ such as food labelling, government policy on health harming products, e.g. taxation and pricing, and social and economic policies such as those relating to the environment, housing and education that have important health impacts. Such community-level interventions are associated with large population health effects that would not be readily captured in RCTs without the application of extremely large sample sizes. Applying a decision theory approach involves the use of prior belief evidence combined with RCT evidence, even if underpowered, with the intent of generating a set of informed posterior beliefs about the effectiveness of a public health intervention to be weighed up against costs. This approach differs from the standard evidence-based medicine approach, which relies heavily on RCT evidence.
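As a purely illustrative sketch of how prior beliefs and (possibly underpowered) trial evidence might be combined, the Python snippet below applies a conjugate normal-normal update. It is not presented as the specific model proposed by Fischer et al. [18]; the normal distributions, the function name `posterior_normal` and all numbers are assumptions chosen for illustration.

```python
from math import sqrt

def posterior_normal(prior_mean, prior_sd, trial_mean, trial_se):
    """Combine a normal prior with a normal trial estimate by precision weighting."""
    prior_precision = 1.0 / prior_sd ** 2
    trial_precision = 1.0 / trial_se ** 2
    post_precision = prior_precision + trial_precision
    post_mean = (prior_precision * prior_mean
                 + trial_precision * trial_mean) / post_precision
    return post_mean, sqrt(1.0 / post_precision)

# Hypothetical inputs: effect expressed as an absolute change in a behaviour
post_mean, post_sd = posterior_normal(
    prior_mean=0.02, prior_sd=0.02,   # decision makers' prior belief (assumed)
    trial_mean=0.05, trial_se=0.04,   # underpowered trial estimate (assumed)
)
```

With these hypothetical inputs the posterior mean lies between the prior mean and the trial estimate, closer to whichever is more precise, and the posterior standard deviation is smaller than either input.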

Trials with embedded ‘process and economic evaluations’ help researchers understand mechanisms of change and implications for cost effectiveness across the socioeconomic gradient in society, hence helping to address a key tenet in public health—that of reducing inequalities in health.

3 Methods for Resource Use Estimation

Costs are determined by the perspective adopted, and this varies according to the payer and healthcare system, and by country. For example, a US study might take an insurers’ perspective and measure direct costs such as hospital stays, whereas the preference for a societal perspective in The Netherlands might lead to an economic evaluation to estimate indirect costs as well as the broader public sector costs of social services, educational and/or criminal justice services, etc. In developing countries, patients may be asked about coping costs and what assets they have to sell in order to fund their treatment. Regardless of which items are to be costed, reliable estimation of resource use depends on the availability of valid methods of measurement. Methods employed in trial-based economic evaluations typically include a combination of one or more of the following: abstraction of data from routine medical records (e.g. patient notes, electronic medical records or patient administration systems), data collated mainly for the purpose of payment (e.g. hospital episode statistics, medicines dispensing data, insurance claims), dedicated sections within case report forms and administration of questionnaires to patients [5, 19] (Fig. 1). The choice of data collection method is particularly important to minimise missing values and achieve a high level of accuracy.

Fig. 1 Methods for measuring resource use and assigning costs within trial-based economic evaluations
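Once resource use has been measured, costs are typically assigned by multiplying each quantity by a unit cost. A minimal sketch of that step is given below; the resource categories, the quantities and the unit costs are hypothetical placeholders and are not taken from any national tariff.

```python
# Hypothetical resource-use counts for one trial participant
resource_use = {"gp_visit": 4, "inpatient_day": 2, "outpatient_visit": 1}

# Hypothetical unit costs (GBP); real analyses would use published sources
unit_costs = {"gp_visit": 40.0, "inpatient_day": 400.0, "outpatient_visit": 120.0}

def total_cost(use, costs):
    """Sum quantity x unit cost, failing loudly if a unit cost is missing."""
    missing = set(use) - set(costs)
    if missing:
        raise KeyError(f"No unit cost for: {sorted(missing)}")
    return sum(quantity * costs[item] for item, quantity in use.items())

participant_cost = total_cost(resource_use, unit_costs)
```

Raising an error for an item without a unit cost, rather than silently costing it at zero, is a design choice intended to keep missing cost data visible.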

3.1 Based on Patient Recall

Our experience with the Database of Instruments for Resource Use Measurement (DIRUM; http://www.dirum.org) suggests that patient questionnaires are used extensively in many countries. For example, the European version of the widely used Client Service Receipt Inventory (CSRI) [20] has been piloted in Spain, Italy, Denmark and The Netherlands, and found to be an acceptable method for collating service receipt and associated data alongside assessment of outcomes in patients with schizophrenia [21]. In the UK, methods that are reliant on patient recall are the most common for measuring resource use among NIHR-HTA programme-funded trials [5]. This is partly due to ease of use compared with the logistical challenges associated with obtaining electronic resource use data from local or national databases, and partly because there are no other means of collecting certain data such as time off work (productivity costs) or non-medical costs (such as transport to/from place of care). Data reliant on patient recall are subject to bias, which may adversely affect the resulting cost estimates. It is generally accepted that there is an inverse relationship between length of recall period and accuracy of data [22] owing to patients forgetting specific events and/or their timings. It is difficult to define an optimum recall period as this is context-specific, depending on factors such as the nature of the resource use being asked about (which might define how memorable the event was), the ability of the respondent to understand what is being asked of them, the nature of the illness or disease and the frequency of measurement and/or follow-up in relation to patients’ clinical requirements.

While Brown and Adams [23] suggested that patients could recall their clinical encounters effectively up to 3 months after the event, Evans and Crawford [24] suggest recall is in fact related to the saliency or the impact of the event on the patient’s life. For example, a recall of 6 months might suffice for hospitalisations, whereas a shorter time frame would be required for repeated visits to a general practitioner (GP) or community pharmacy. Questionnaires that require recall periods of greater than 12 months are not recommended [25]. Richards et al. [26] highlighted the association between age and recall bias, advising caution with older patients and suggesting there are disparities, even within shorter time frames of 3 months, on high saliency items such as hospitalisations. It is somewhat reassuring that the median recall time frame employed in a sample of 100 UK NIHR-HTA programme-funded trials was 3 months (interquartile range 0.5–6 months) [27].

Where the likelihood of comprehension is low (e.g. in children) or cognitive impairment is high, researchers may have no alternative but to rely on proxy recall, especially if collecting data on a broad spectrum of resource usage to a given costing perspective. Levels of agreement between patient and proxy or healthcare record and proxy are not well-established [24], although, in palliative care at least, there is some evidence to suggest information that relies on observable aspects of resource use can have a good level of agreement between patient and proxy [28].

3.2 Based on Routinely Collected Data

In the UK, health and social care is provided by multiple agencies that use different database systems, currently with limited record linkage. Routinely collected data such as from hospital patient notes and GP records are nonetheless used widely alongside patient and proxy-reported healthcare use in trial-based economic evaluations [5]. GP records have traditionally been very diffuse, although systems such as EMIS-web (https://www.emishealth.com) and the Clinical Practice Research Datalink (http://www.cprd.com) provide a basis for more accessible data capture. One limitation in the use of GP records relates to the trial setting. When studies recruit patients from a hospital setting, for instance, sourcing data from individual GPs can be challenging as patients will be served by many. Patients recruited via community pharmacies can be geographically dispersed, meaning tracking to general practice or hospital can be logistically very challenging. However, the availability of national hospital datasets can mitigate this problem.

Acute inpatient care data for hospitals in England are collated for submission to the secondary uses service (SUS) primarily for the purposes of payment by results (PbR), planning, commissioning and management, but also for audit and research [29]. Similar data are collected within the Patient Episode Database for Wales (PEDW) and the Scottish Morbidity Database in Scotland. All use Healthcare Resource Groups (HRGs) as a currency; however, whereas bundled HRGs, which constitute all the elements of care associated with a particular intervention, are nationally agreed and can be costed using the national tariff, unbundled HRGs are more reliant on the average unit cost to the National Health Service (NHS) as these are commissioned, priced and paid for at a local level [29]. The flow of electronic data is presented in Fig. 2, which also highlights data amenable to costing purposes.

Fig. 2 Flow of electronic data in secondary health care in the English National Health Service [29]. Paler green cells represent the key areas where routinely collected data could be utilised for trial-based economic evaluations

The advantages and disadvantages of alternative data sources are listed in Table 1. For example, Hospital Episode Statistics (HES), a data warehouse of details of all admissions, outpatient appointments and accident and emergency attendances at NHS hospitals in England, do not contain unbundled HRGs but do contain operational procedure codes (OPCS) which can be used to generate some unbundled HRGs such as those associated with diagnostic imaging or expensive drugs. Perhaps the biggest limitation of HES data is that they cannot be used to generate unbundled HRGs for critical care, which can vary by as much as several thousand pounds per day per HRG. Costing carried out by NHS Trusts can also provide patient spell-level data using outputs from patient level information and costing systems (PLICS). These provide spell HRGs, admission dates and discharge dates, and can also identify clinical codes and total costs associated with critical care and expensive drugs, although these costs tend not to be disaggregated into their unbundled HRGs, as seen in SUS PbR.

Table 1 Advantages and disadvantages of routinely available data in the National Health Service in the UK

In addition to the perspective and methods of resource use data collection, the scope of the data being collected needs to be considered; whether only intervention-related resource use data should be collected, or all resource use, remains a moot point. Electronic databases offer the advantage of providing often complete and extensive resource-use data, allowing for cost effectiveness to be presented at various levels of inclusion of resource use. However, there are disadvantages to accessing routine electronic data, including the cost, the delay between a patient exiting a trial and the data becoming available, potential for missing data, and the complexity of data management. The prospect of more extensive linked data sources will benefit health economists engaged in trial-based evaluations, especially if they can ease the burden associated with accessing, extracting and interpreting the coding within the data systems in order to determine the relevant costs. In Wales there already exists the secure anonymised information linkage (SAIL) databank, which uses a matching algorithm for GP care, secondary care and local authority social services [30]. A comparable initiative across England (Care.data) is currently under development.

4 Methods for Measuring Health Outcomes

The conventional view from an economics perspective, that RCTs are mainly an opportunity for economic data collection, needs to be challenged. Economists need to be embedded within the trial management group and encouraged to lead on Studies Within a Trial (SWATs) to develop novel methodologies applicable to trial design, conduct or analysis. In addition to costs, there are also opportunities for developing and refining health outcome measures. Key to this is the shift towards involving patients in general, and ascertaining their preferences in particular, in patient-reported outcome measures of health.

4.1 Choice of Outcome Measure

The choice of economic outcome measure for a trial is dependent on the research question, the setting and characteristics of the trial participants. In the UK, NICE favours utilities derived from the European Quality of Life 5-Dimension 3-Levels (EQ–5D–3L) in adults, although it accepts mapped EQ–5D–3L utilities from other health-related quality of life (HR-QoL) measures in situations where direct measurements are not available. A five-level version of the EQ–5D (EQ–5D–5L) has been developed, but as yet no scoring algorithm has been released for calculating utility values. Thus, the established EQ–5D–3L is currently more widely used in economic evaluation. In cases where the EQ–5D lacks content validity, which considers whether an instrument contains all the domains required to measure what it is attempting to measure, NICE suggests that alternative HR-QoL measures may be used but does not explicitly refer to preference-based, disease-specific utility measures.

Challenges with eliciting patient-reported health outcome data can arise when the study population has cognitive limitations, for example dementia or learning difficulties, or when the study population involves children and young people. A systematic review of the use of the EQ–5D–3L with people with dementia found a satisfactory completion rate for people in the mild to moderate stage of dementia due to its brevity [31]. To achieve a satisfactory completion rate in studies involving children and young people, the design of data collection materials, e.g. questionnaire, form or diary, should be tailored for age. Satisfactory completion rates may be achieved, for example, by adapting the questionnaire design in terms of colour and personalisation for different age categories [32, 33]. Alternative preference-based utility measures validated for children include the Health Utility Index Mark 2 (HUI-2), Child Health Utility 9 Dimensions (CHU–9D) and Assessment of Quality of Life 6 Dimensions (AQoL–6D) [34].

In cases where trial participants are unable to answer questions themselves, the use of a proxy to complete questionnaires on behalf of the participant may be considered. This in turn raises the question of whether the proxy should be a family member with direct experience of the participant’s health, or a healthcare professional who has a broader experience of the condition. With reference to people with dementia, EQ–5D-rated quality of life is consistently scored higher by patients than by their proxies [31]. Similarly, children and young people may rate their quality of life higher than their proxies, such as in the case of diabetes where they respond more positively to the Pediatric Quality of Life (PedsQL)–General Module, diabetes self-efficacy (PedsQL–Diabetes Module) and their health utility (EQ–5D) than their proxies [32]. In the context of clinical trials, it is important to record who is completing proxy questionnaires to avoid any uncertainty over whether the questionnaire was completed by the family member, carer, healthcare professional or researcher. This may be facilitated by providing training and instructions to researchers on how to interview and how to administer questionnaires consistently [35].

Healthcare interventions may have an HR-QoL impact beyond the index patient, for example on family carers. Al-Janabi et al. [36] posed the argument that using an HR-QoL measure to assess carer quality of life imposes on them a ‘patient identity’, misinterpreting their role. Their suggested solution was to measure care-related QoL instead.

Use of generic, preference-based utility measures allows for the consistent comparison of results across different settings, populations, and interventions; however, disease-specific, preference-based measures allow a greater degree of sensitivity. By way of example, the epilepsy-specific quality-adjusted life-year (QALY) measure, NEWQOL–6D, was derived from the NEWQOL measure of HR-QoL, first by psychometric and Rasch analyses to establish the dimension structure of NEWQOL and, second, by valuing the health states generated by the classification system using time-trade-off [37]. The results were modelled to generate a utility score for every health state defined by six dimensions (worry about attacks, depression, memory, concentration, stigma, and control). A reanalysis of economic evaluation of an RCT comparing standard versus new antiepileptic drugs [38] revealed a different rank ordering of cost effectiveness based on the NEWQOL–6D than with the EQ–5D, suggesting that selection of dimensions of health more relevant to epilepsy might affect policy choices. However, the increased sensitivity of disease-specific preference-based outcome measures is traded against a loss of comparability of utility across different interventions and a potential insensitivity to side effects and/or comorbidities [39]. Further development of disease-specific utility measures, their comparison against generic instruments and impacts on decisions regarding technical and allocative efficiency is warranted.

Within a trial, QALYs are typically analysed according to treatment allocation; however, trials are rarely powered to detect differences in QALYs. An alternative is to adopt an event-based approach, whereby clinical events (e.g. stroke, fracture, myocardial infarction) are included as covariates in the regression equation [40]. Results allow a utility (decrement) to be attached to clinical events, and provide a coefficient (mediated by clinical events) that captures utility differences by trial arm. Briggs et al. [41] demonstrated that for an RCT where no significant difference in utility was observed using a traditional approach, event-based methods resulted in a moderately significant difference between treatment groups. This approach is more closely aligned with model-based economic evaluations, where utilities are typically attached to health states.
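The event-based approach can be sketched as an ordinary least squares regression of utility on treatment arm and an indicator for a clinical event. The example below is a minimal illustration only: the data are invented and noise-free, the single event indicator and the helper `ols` are assumptions, and the model specifications used in the cited studies [40, 41] may differ.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical participants: (treatment arm, had clinical event, utility)
participants = [
    (0, 1, 0.60), (0, 1, 0.60), (0, 0, 0.80),
    (1, 0, 0.82), (1, 0, 0.82), (1, 1, 0.62),
]
X = [[1.0, arm, event] for arm, event, _ in participants]  # intercept, arm, event
y = [utility for _, _, utility in participants]
intercept, arm_effect, event_decrement = ols(X, y)
```

Here `event_decrement` is the utility decrement attached to the clinical event, and `arm_effect` is the remaining utility difference by trial arm after adjusting for events.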

4.2 Calculating Minimally Important Differences

Preference surveys and ranking exercises can be used to inform the choice of outcome of most importance; however, methods that measure patients’ willingness-to-trade enable researchers to profile the impact of differences in health technologies with common characteristics to ascertain the most acceptable threshold at which harms and benefits need to be considered (traded).

Minimally important differences are necessary for calculating the sample sizes of trials to ensure adequate powering; however, these are typically determined by clinicians, whose views may not necessarily align with those of patients.
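To illustrate how the minimally important difference feeds into powering, the sketch below uses the standard normal-approximation formula for comparing two proportions. The control-arm proportion, the candidate differences, the significance level and the power are all assumed values, and the function name `n_per_arm` is hypothetical; a real trial would use a calculation matched to its design and primary outcome.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_control, mid, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a difference `mid` in proportions."""
    p_treat = p_control + mid
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / mid ** 2)

# Hypothetical: 60% survival in the control arm, two candidate differences
n_small_mid = n_per_arm(p_control=0.60, mid=0.05)
n_large_mid = n_per_arm(p_control=0.60, mid=0.10)
```

Because the difference enters the formula squared, the required sample size rises steeply as the minimally important difference shrinks, which is why the choice of this value matters so much at the design stage.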

Consider an example of a drug that is effective in causing remission of disease symptoms but is associated with dose-dependent toxicity that may reduce survival. A trial using a lower dose is to be designed, with survival as the primary outcome, and which requires a minimally important difference to calculate the sample size. Using a discrete choice experiment (DCE) to determine preferences in the absence of other data, with remission and survival as attributes, the relative importance of movements between levels of each attribute can be estimated as the marginal rate of substitution (MRS) [42]. This can be used to estimate patients’ maximum acceptable decrease in survival, for a given increase in remission. The MRS is calculated as follows:

$$ {\text{MRS}}_{\text{remission}} = \beta_{\text{remission}} /\beta_{\text{survival}} $$

where $\beta_{\text{remission}}$ and $\beta_{\text{survival}}$ are the coefficients for each attribute, respectively.

Using a known decrease in remission between low and standard dose ($\Delta\text{remission} = \text{remission}_{\text{low}} - \text{remission}_{\text{standard dose}}$), the minimum percentage difference in overall survival ($\Delta\text{survival}$) that a DCE respondent would be willing to accept (where the utility associated with the lower dose is greater than or equal to the utility of standard dose) is estimated as:

$$ - {\text{MRS}}_{\text{remission}} *\Delta {\text{remission}} \le \, \Delta {\text{survival}} . $$

If the difference in survival between treatment groups is less than ∆survival, then, from a patient’s perspective, the treatment option with lower recurrence would be chosen. In the case of multiple attributes, this generalises to:

$$ - \sum\limits_{i = 1}^{N} {\text{MRS}}_{{\text{attribute}}(i)} *\Delta_{{\text{attribute}}(i)} \le \Delta_{\text{survival}} . $$

This trade-off between different outcomes of interest provides the point of indifference from the respondent’s perspective and therefore represents the minimally important difference in the primary outcome measure (probability of survival) between choosing the trial intervention or not.
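The calculation above can be illustrated with a minimal Python sketch. The coefficient values, the 10 percentage-point reduction in remission, and the function names are hypothetical and chosen only to show the arithmetic; they are not taken from any DCE. Attribute levels are assumed to be coded as percentage points, so that each coefficient is the utility change per percentage point.

```python
def marginal_rate_of_substitution(beta_attribute, beta_survival):
    """MRS of an attribute relative to survival: the ratio of DCE coefficients."""
    if beta_survival == 0:
        raise ValueError("beta_survival must be non-zero")
    return beta_attribute / beta_survival


def min_acceptable_survival_difference(betas, deltas, beta_survival):
    """Lower bound on delta_survival: -sum(MRS_i * delta_i) over the other attributes."""
    return -sum(
        marginal_rate_of_substitution(betas[name], beta_survival) * deltas[name]
        for name in deltas
    )


# Hypothetical DCE coefficients (utility per percentage point of each attribute).
betas = {"remission": 0.04}
beta_survival = 0.10

# Hypothetical change with the lower dose: remission falls by 10 percentage points.
deltas = {"remission": -10.0}

mrs_remission = marginal_rate_of_substitution(betas["remission"], beta_survival)
threshold = min_acceptable_survival_difference(betas, deltas, beta_survival)

print(f"MRS_remission = {mrs_remission:.2f}")
print(f"Minimum acceptable survival difference = {threshold:.1f} percentage points")
```

Under these assumed values, respondents would accept the lower dose only if it improved survival by at least the printed threshold, which would then serve as the candidate minimally important difference for the sample size calculation.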

Caveats arise from the type of study required to derive this result; sufficient qualitative research into the design of the DCE is required to ensure that attributes are meaningful, both to the patient and clinically. It is noted that DCEs are estimated with uncertainty, and responses vary depending on an individual patient’s situation and preferences. Indeed, for a given patient, the MRS may not be constant along the indifference curve. Notwithstanding these caveats, this method may provide a quantitative means of involving patients in the design stage of an RCT, resulting in an effect size that is relevant and meaningful to the patient, rather than chosen based solely on clinician consensus or in concordance with previous trials where the initial effect size may have been chosen somewhat more arbitrarily.

5 Handling Data-Related Methodological Challenges

5.1 Methods of Extrapolation

Trial follow-up periods are typically shorter than the period during which differences in health effects and use of healthcare resources between interventions persist. Consequently, time horizon bias is a problematic feature in trial-based economic evaluations. Consider two time scales of importance: t_RCT, which refers to the duration of an RCT (the observed period), and t_TH, which refers to the time horizon of an economic evaluation (the unobserved period). In general, an economic analysis at t_TH requires extrapolation of the incomplete (censored) data from t ≤ t_RCT in order to generate predictions of health outcomes and costs after the RCT ceases. Multivariable regression-based statistical models fitted to RCT data allow simple extrapolation for trial participants beyond the observed period, given information on individual-level characteristics such as age, sex, disease history, or the presence of other known risk factors that appear as predictor variables in the model. However, this approach may be limited by the validity of assumptions such as the linear dependence of health outcomes on predictor variables or the appropriateness of extrapolating associations beyond the trial period, as well as by factors such as censored (or missing) individual-level data or bias arising from loss to follow-up; such models should therefore be used with caution.

Other common approaches include the use of Markov models, which take the form of dynamic discrete-time models represented by difference equations (or, in continuous time, differential equations), and the extrapolation of survival/time-to-event data by fitting parametric models to RCT data and assuming proportional hazards (PHs) to determine intervention group survival. However, in the latter case, PHs become an increasingly unreliable assumption as t_TH increases towards lifetime duration due to additional factors such as age-dependent mortality [43]. Alternatives to parametric models for time-to-event data (such as spline-based methods, piecewise modelling, and other semi-parametric or non-parametric models) have also been proposed [44].

Whatever the method, robust extrapolation relies on thoughtful model development, reliable parameterisation, and intelligent fitting to data; uncertainty in extrapolation increases as the ratio t_TH/t_RCT increases, and the time horizon and time-discounting rates are therefore critical. When fitting models to RCT data, fitting only to the (stable) data in the observed period that most closely replicate expected conditions in the unobserved period, rather than to the full dataset, may lead to more efficient use of the available evidence [45]. If models used for economic evaluation are underpinned by dynamic mechanistic disease models (such as in the case of infectious disease), consideration must be given to how the time scales of disease processes relate to t_RCT in order to ensure that model fit is not based on transient model behaviour that will not be representative of long-term disease dynamics reached within the unobserved period (Fig. 3).

Fig. 3

HIV infection dynamics based on the dynamic mechanistic transmission model (and parameters) of Keeling and Rohani [79] showing the total AIDS incidence (per year per 1000 population) across all sexual activity classes. Calculating the cost effectiveness of a new intervention against HIV, based on extrapolating data obtained from an RCT (which typically takes place over a relatively short timescale), will depend strongly on what phase of the epidemic curve is used for disease predictions and the time horizon of interest. If the RCT takes place near the start of a new HIV epidemic (when incidence increases approximately exponentially), this will give very different estimates of long-term cost effectiveness compared with estimation in a population where the disease has already become endemic (in this study, after approximately 60 years). In the context of endemic infections, this early rapid growth and subsequent decline in the number of cases is an example of transient model behaviour, which differs significantly from the long-term disease dynamics, which, in this study, reach a steady state. RCT randomised controlled trial

Model design and parameterisation should be based on underlying assumptions informed by the problem (and disease) in hand, which should, in turn, be motivated by factors such as pathophysiological processes, clinical and epidemiological knowledge, previous modelling work, statistical evidence, and/or desired model complexity. Assimilation of this knowledge unavoidably introduces uncertainty in the model construction process, together with uncertainty arising from the data itself (such as quality, bias or missing data). Bias in model selection due to (subjective) modeller preference also represents a potential source of error.

The central methodological issue in extrapolation is to understand the robustness of (and confidence that should be attached to) long-term model predictions (and economic decisions) given the array of structural, parameter, and methodological uncertainties (as well as variability and heterogeneities) associated with the modelling process [46]. This question ideally necessitates (a) identifying and specifying all uncertainties arising from the modelling process; (b) quantifying (and representing) the resultant total uncertainty arising in the economic evaluation associated with extrapolation; and (c) assessing which sources of uncertainty most significantly affect the conclusions of the economic evaluation, although this three-step procedure is currently rarely followed in full. Included in the process of quantifying uncertainty in model development should be a critique of whether structural assumptions in model design for t ≤ t_RCT are valid for (and can be trusted to generalise to) t > t_RCT; examples include consideration of whether new interventions may change in efficacy over time (e.g. new antimalarial drugs due to resistance) or whether temporal changes in disease parameters (e.g. due to seasonal fluctuations) are important. Structural uncertainty may be addressed using scenario analysis or model comparison/averaging [47], but it is important to recognise that good model fit for t ≤ t_RCT does not guarantee model reliability, accuracy or robustness of assumptions for t > t_RCT; if t_RCT is considerably less than t_TH, structurally different models with almost equivalent goodness-of-fit in the observed period can produce very different predictions for cost effectiveness at t_TH (see Fig. 4).

Fig. 4

Illustrative example of extrapolating RCT survival data from an observed period (in this study, 1 year) to a follow-up period of 4 years using a simple two-state Markov model, after which an economic evaluation of the intervention cost effectiveness is undertaken. In this case, the Markov model consists of two states (‘well’ and ‘dead’, with the probability of moving from the former to the latter within a time-step given by the survival distribution of interest). In this study, monthly simulated RCT data is generated from a log-Cauchy distribution, and five parametric survival models (gamma, log-normal, Dagum, log-logistic and Weibull) are fitted to the RCT data using least-squares and then extrapolated using the Markov model to a time horizon of 5 years. Despite a near identical fit to the trial data, estimation of the cost effectiveness over the follow-up period differs significantly depending on the parametric model adopted. RCT randomised controlled trial
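The sensitivity of extrapolation to the choice of survival model can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reproduction of the analysis in Fig. 4: only two models (exponential and Weibull) are compared, they are calibrated to agree at a single hypothetical 12-month survival proportion of 0.80 rather than fitted by least squares, and the Weibull shape parameter of 1.5 is arbitrary. A two-state ('well', 'dead') Markov cohort model with monthly cycles then carries each model forward to 5 years.

```python
import math


def exponential_survival(t, rate):
    """S(t) for a constant hazard."""
    return math.exp(-rate * t)


def weibull_survival(t, scale, shape):
    """S(t) for a Weibull model; shape > 1 gives a hazard that rises with time."""
    return math.exp(-((t / scale) ** shape))


def markov_extrapolate(survival, horizon_months):
    """Two-state cohort model; per-cycle death probability is 1 - S(t+1)/S(t).

    Returns (expected months in 'well', proportion alive at the horizon).
    """
    alive = 1.0
    life_months = 0.0
    for month in range(horizon_months):
        p_death = 1.0 - survival(month + 1) / survival(month)
        alive *= 1.0 - p_death
        life_months += alive  # state membership counted at the end of each cycle
    return life_months, alive


# Hypothetical calibration target: 80% alive at the end of a 12-month trial.
OBSERVED_SURVIVAL_12M = 0.80
SHAPE = 1.5  # arbitrary illustrative Weibull shape

cum_hazard = -math.log(OBSERVED_SURVIVAL_12M)
rate = cum_hazard / 12.0
scale = 12.0 / cum_hazard ** (1.0 / SHAPE)

models = {
    "exponential": lambda t: exponential_survival(t, rate),
    "weibull": lambda t: weibull_survival(t, scale, SHAPE),
}

results = {
    name: {horizon: markov_extrapolate(s, horizon) for horizon in (12, 60)}
    for name, s in models.items()
}

for name, by_horizon in results.items():
    for horizon, (life_months, alive) in by_horizon.items():
        print(f"{name:12s} horizon={horizon:2d} months: "
              f"alive={alive:.3f}, life-months={life_months:.1f}")
```

Both models return the same proportion alive at 12 months by construction, yet their predictions at 60 months differ markedly, so any cost-effectiveness estimate built on life-months gained would depend on which model was carried forward.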

Ultimately, model validation, assessing a model’s accuracy in making predictions for the specific purpose of interest, is critical to confidence in the extrapolation process. As a minimum, this should include a careful critique of model structure, parameterisation, and problem formulation (in the specific context of t > t_RCT conditions), but should ideally consider more rigorous validation principles, including examination of different models addressing the same problem (and assessment of the implications for resultant economic evaluations and decision making), comparison with relevant available data from sources beyond those from the RCT used for model fitting, and identification of opportunities to compare the predictions of models with observable events and outcomes that may occur for t > t_RCT (arguably the most desirable form of validation) [48]. Dominant sources of uncertainty in the unobserved period may not be the same as those arising in the observed period, and the results may depend on the problem in hand, the type of outcomes of interest, and the nature of the model itself.

5.2 Dealing with Missing Data

Missing data is almost unavoidable in clinical trials and can impact on economic evaluations in terms of loss of precision and statistical power, which may bias the results such that any consequent conclusions made might lead to incorrect policy decisions. Resource use, cost or health outcome data derived from patient questionnaires or case report forms are prone to being incomplete, as are alternative electronic data sources, such as hospital episode statistics, which also have the potential to contain anomalies that are required to be treated as missing data.

The reporting of missing data in trial-based economic evaluations and the methods used to handle missing data are varied and unclear, with sensitivity analysis around the missing data rarely being performed [49]. Poor handling of missing data introduces the risk of imprecision, loss of power, and bias; in addition, missing baseline data become problematic when estimating effect size [50], leading to challenges in cost-effectiveness analysis when baseline adjustments are considered [51]. Sterne et al. [52] offered guidance on the reporting of missing data, and the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) Task Force report on conducting cost-effectiveness analysis alongside RCTs [53] makes explicit recommendations on both the analysis and reporting of missing data. The Consolidated Standards of Reporting Trials (CONSORT) guidelines require that RCTs report the flow of patients through the trial, as well as the numbers included for analysis [54]; however, no such clause is included for reporting how missing data is analysed [55]. Neither the NICE reference case [7] nor the Consolidated Health Economic Evaluation Reporting Standards (CHEERS) [56] explicitly require the reporting of missing data or give guidance as to how missing data ought to be analysed, although it may be assumed that missing data contributes to uncertainty, which is covered by the guidelines [49].

The potential for missing data should be recognised at the trial design stage, with procedures implemented to minimise it wherever possible [57]. Patient questionnaires require careful design and piloting, with consideration given as to how, when and where they are administered, as described above. With regard to cost data, it may be prudent to explore multiple sources of data (electronic sources in combination with case report forms) so as to have overlap and a secondary source from which to replace missing data.

The mechanism of missing data can be classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [58]. Complete case analysis, effectively ignoring incomplete data, remains popular for trial data [49, 57] but is valid only when data are assumed to be MCAR. However, when the MCAR assumption does not hold, the data may no longer be representative of the target population, thus compromising external validity [59]. Ad hoc methods, such as mean imputation, attempt to address missing data but are also only valid under MCAR. Whilst preserving point estimates, mean imputation attenuates variance, leading to artificially small confidence intervals.
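The effect of mean imputation on variance can be seen in a small Python sketch. The cost values and the number of missing observations are hypothetical and serve only to illustrate the point made above.

```python
import statistics

# Hypothetical observed per-patient costs; four further patients have missing costs.
observed = [120.0, 340.0, 95.0, 410.0, 260.0, 180.0]
n_missing = 4

# Mean imputation: every missing value is replaced by the observed mean.
mean_observed = statistics.mean(observed)
imputed = observed + [mean_observed] * n_missing

var_observed = statistics.variance(observed)  # sample variance, n - 1 denominator
var_imputed = statistics.variance(imputed)

print(f"Mean:     observed={mean_observed:.1f}, after imputation={statistics.mean(imputed):.1f}")
print(f"Variance: observed={var_observed:.1f}, after imputation={var_imputed:.1f}")
```

The imputed values add nothing to the sum of squared deviations but do increase the sample size, so the sample variance shrinks by the factor (n_obs − 1)/(n_total − 1), and standard errors computed from the completed data are correspondingly too small.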

Maximum-likelihood estimation (MLE) methods such as expectation maximisation are valid under the less restrictive MAR assumption; however, separate MLE models are required for each different analysis. Additionally, MLE can be challenging for certain approaches, such as generalised linear modelling [60]. Multiple imputation (MI) [50] is also valid under MAR and requires only a single model; however, other challenges are introduced as multiple imputed data sets need to be analysed, and the methodology for pooling results is not always intuitive, often with transforms required prior to pooling [61]. Nevertheless, MI has been shown to be robust to departures from normality, in cases of low sample size, and when the proportion of missing data is high [62]. MNAR-specific methods include the Heckman selection model [63] and pattern mixture models [64], and a novel weighting approach adapts MI specifically for data that are MNAR [65]. However, it is noted that for data that are MNAR, MLE and MI analysed using informative and relevant covariates often produce accurate, unbiased results [66, 67]. Methods such as these should be applied, where indicated, in trial-based economic evaluations.
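Pooling across imputed data sets is commonly done with Rubin's rules, which can be sketched as follows. The five incremental-cost estimates and their variances are hypothetical, the function name is illustrative, and the estimates are assumed to be approximately normally distributed on the scale at which they are pooled (otherwise a transformation would be needed first, as noted above).

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances using Rubin's rules.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical incremental cost estimates from m = 5 imputed data sets,
# with their squared standard errors.
estimates = [510.0, 495.0, 530.0, 502.0, 488.0]
variances = [400.0, 420.0, 390.0, 410.0, 405.0]

pooled_estimate, total_variance = pool_rubin(estimates, variances)
print(f"Pooled incremental cost = {pooled_estimate:.1f}")
print(f"Pooled standard error   = {total_variance ** 0.5:.1f}")
```

The between-imputation term is what carries the uncertainty due to missingness into the pooled standard error; analysing a single imputed data set would omit it.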

5.3 Accounting for Non-Adherence

A key issue when generalising the findings of controlled trials to clinical practice is whether efficacy can translate to effectiveness; that is, whether an intervention that works under trial conditions also works in practice. While adherence encompasses healthcare professionals’ delivery of an intervention, as well as following clinical guidelines, a major factor that can impact on a treatment’s effectiveness is the effect of patient non-adherence. It is generally recognised that significant differences exist between adherence to (or persistence with) treatments in efficacy trials and routine care. In the case of 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA) reductase inhibitors (statins) for instance, adherence following 4–6 years’ follow-up of patients in pivotal trials of secondary prevention ranges from 81 to 99 %. This contrasts with pragmatic studies conducted in community settings, where adherence ranges from 21 to 87 % [68]. Differences may be due to intensity of monitoring, frequency of follow-up clinic visits and selection of patients, either through protocol specification or through self-selection of interested (and hence more adherent) patients. Failure to consider patients’ non-adherence to treatments or interventions in economic evaluations can consequently lead to biased estimates of cost effectiveness, with treatments potentially not being as cost effective as anticipated. Reviews of economic evaluations indicate that only a minority assess the impact of non-adherence explicitly [69].

Methods for adjusting effect and cost data for non-adherence and/or premature discontinuation are mainly reliant on conventional Markov or decision analytic approaches, with assumed reductions in benefit and a change in cost, for reductions in adherence [70, 71]. Patients who discontinue treatment are commonly assumed to experience the benefits and incur the costs associated with a placebo group. These strong assumptions may not be valid given that patients may rationally discontinue treatment if they experience adverse reactions or perceive no benefit from treatments for symptomatic diseases. Placebo data, representing no treatment, may not be available in many trials. Alternative approaches require knowledge of baseline covariates that predict adherence for each treatment group of the trial. Fischer et al. [72] proposed a structural mean modelling approach to obtain adherence-adjusted estimates for treatment effects in an RCT comparing two active treatments, and Kadambi et al. [69] proposed discrete event simulation for economic analyses as this may be more amenable to modelling individuals’ risk of non-adherence and the link between reduced efficacy or increased adverse events and treatment discontinuation. The use of mechanism-based modelling, in which individual doses are simulated according to known patterns of adherence, and the pharmacokinetic and pharmacodynamic consequences are propagated to estimate the reduction in treatment effect, is an alternative approach that may assist in predicting the dilution of effect from perfect to typical medication-taking behaviour [73]. Further research is warranted for methods of estimating the impact of variable adherence on the efficacy, and hence cost-effectiveness, of treatments.
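The conventional decision analytic adjustment can be sketched in Python as follows. All costs, quality-adjusted life-years (QALYs), the adherence proportion, and the function names are hypothetical. The sketch follows the common simplifying assumption described above, that patients who discontinue take on placebo-group outcomes, and adds a further assumed fixed cost for treatment dispensed before discontinuation.

```python
def adherence_adjusted(adherence, treated, placebo, wasted_cost=0.0):
    """Expected (cost, QALYs) per patient offered treatment.

    Adherent patients receive the trial-arm cost and QALYs. Non-adherent
    patients are assumed to receive placebo outcomes plus `wasted_cost`
    for treatment dispensed before discontinuation.
    """
    if not 0.0 <= adherence <= 1.0:
        raise ValueError("adherence must be a proportion between 0 and 1")
    treated_cost, treated_qaly = treated
    placebo_cost, placebo_qaly = placebo
    cost = adherence * treated_cost + (1 - adherence) * (placebo_cost + wasted_cost)
    qaly = adherence * treated_qaly + (1 - adherence) * placebo_qaly
    return cost, qaly


def icer(intervention, comparator):
    """Incremental cost-effectiveness ratio: cost difference per QALY gained."""
    d_cost = intervention[0] - comparator[0]
    d_qaly = intervention[1] - comparator[1]
    if d_qaly == 0:
        raise ValueError("no QALY difference; ICER undefined")
    return d_cost / d_qaly


# Hypothetical per-patient (cost, QALYs) under trial conditions.
treated = (6000.0, 8.0)
placebo = (2000.0, 7.5)

icer_trial = icer(adherence_adjusted(1.0, treated, placebo), placebo)
icer_routine = icer(adherence_adjusted(0.6, treated, placebo, wasted_cost=500.0), placebo)

print(f"ICER with full adherence: {icer_trial:.0f} per QALY")
print(f"ICER with 60% adherence:  {icer_routine:.0f} per QALY")
```

Without the wasted-cost term, incremental costs and QALYs would both scale by the adherence proportion and the ratio would be unchanged, which illustrates how strongly the conclusions of this type of adjustment depend on what non-adherent patients are assumed to cost.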

6 Multinational Trials

Drug trials are increasingly conducted at multiple centres across many nations. The decision to include more than one country for a given trial is typically driven by regulatory requirements, achieving the required sample size (e.g. for treatments of a rare disease), or a desire to complete within a shorter time frame [74]. Trials conducted in multiple countries may also enhance the generalisability of the findings to the participating countries [3, 75] and serve to facilitate immediate adoption following regulatory approval; however, multinational trials face methodological challenges that need to be considered at each stage.

While multinational trials increase the heterogeneity in the sample in terms of patient characteristics and disease severity [3, 74, 76], it remains common practice for most regulatory authorities to accept pooled estimates of clinical efficacy across countries. This is usually based on the assumption that the clinical response is unlikely to vary significantly and can therefore be considered homogeneous [76]. However, difficulties arise when economic evaluations are conducted alongside multinational trials, where other sources of heterogeneity become pertinent. Moreover, an aggregate estimate of cost effectiveness (or economic value) would be of limited, if any, value for each participating country’s decision makers, where the use of country-specific estimates is usually a requirement for reimbursement authorities.

Costs of care vary widely from one country to another. This variation originates from differences in demographics, underlying disease epidemiology and patient mix. However, it is usually the differences in the price weights (both absolute and relative) and clinical practice that are the most influential [75, 76], and these must be taken into account when analysing trial data. Prices may themselves influence clinical practice and resource use, for instance with substitution of more costly (or less subsidised) resources by cheaper ones. Differences in health care financing mechanisms also affect the choice of the perspective from which an economic evaluation is undertaken, be it a national health service, private insurer or the patient [35].

Due consideration should also be given to the choice of outcome measures to be used in the trial, the availability of translated and validated versions [77], and the selection of tariffs for preference-based measures. Differences in clinical practice between countries typically mean that the comparator may be less relevant, necessitating alternative or supplementary analysis, e.g. using indirect comparisons.

The methods for multinational trial data analysis have been reviewed by Reed et al. [74], who developed a classification system that combines both the type of analysis used for clinical effectiveness and resource use (fully pooled, partially split or fully split) and the costing method (one country or multi-country costing). Random effects/multi-level modelling has been identified as a promising approach that offers an intermediate solution between fully split and fully pooled analysis, and offers country-specific estimates without compromising the power of the analysis [78]. In choosing an approach, Reed et al. [74] recommended that health economists need to balance a number of issues, including generalisability and transferability of data between countries, awareness of the statistical power, and consideration of uncertainty within the analysis.
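One of the simpler combinations in this classification, fully pooled resource use with multi-country costing, can be sketched in Python. The resource items, quantities, country labels and unit costs are all hypothetical, and the sketch deliberately ignores the possibility, noted above, that prices themselves influence resource use.

```python
def incremental_cost(resource_use, unit_costs):
    """Value a vector of incremental resource use with one country's unit costs."""
    missing = set(resource_use) - set(unit_costs)
    if missing:
        raise KeyError(f"no unit cost for: {sorted(missing)}")
    return sum(quantity * unit_costs[item] for item, quantity in resource_use.items())


# Hypothetical pooled incremental resource use per patient (intervention minus control).
pooled_incremental_use = {"inpatient_days": -0.8, "gp_visits": 1.5, "drug_packs": 12.0}

# Hypothetical country-specific price weights, in local currency units.
unit_costs_by_country = {
    "Country A": {"inpatient_days": 400.0, "gp_visits": 40.0, "drug_packs": 30.0},
    "Country B": {"inpatient_days": 250.0, "gp_visits": 20.0, "drug_packs": 45.0},
}

country_costs = {
    country: incremental_cost(pooled_incremental_use, prices)
    for country, prices in unit_costs_by_country.items()
}

for country, cost in country_costs.items():
    print(f"{country}: incremental cost per patient = {cost:.0f}")
```

The same pooled resource use yields different incremental costs once local price weights are applied, which is why a single aggregate estimate is of limited use to individual reimbursement authorities.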

Researchers engaged in the design and analysis of economic evaluations based on multinational trials are advised to consult relevant guidelines [76].

7 Conclusions

This review highlights selected key challenges in the design, conduct, analysis and reporting of trial-based economic evaluations. It unapologetically focuses on the research interests of the authors, and consequently adopts a UK focus, but through this introduces some concepts and methods that have not yet gained mainstream adoption in trial-based economic evaluations. It should serve to stimulate debate around the appropriate approaches to best serve the evidential needs of decision makers.