FormalPara Key Points for Decision Makers

The current practices in healthcare discrete choice experiments (DCEs) incorporate multiple methodologies to account for preference heterogeneity.

Understanding unexplained preference heterogeneity through the use of mixed models (mixed logit/random parameter logit or latent class) is more prevalent in the health-related DCE literature than investigations into explained differences through the use of interaction terms.

Few researchers have attempted to show alternative specifications (i.e., distributional assumptions) to understand the evidence of unexplained preference heterogeneity or to apply more complex models (such as using correlated distribution for preference parameters or controlling scale to account for preference heterogeneity) to enrich the interpretation of the results.

1 Introduction

Systematic literature reviews of discrete choice experiments (DCEs) in healthcare have shown increases in both the number of applications and the complexity of models; however, these reviews have disregarded the detailed information on the modeling of preference heterogeneity, specifically [1,2,3]. Understanding heterogeneity in preferences for health and healthcare-related goods and services motivates researchers to apply alternative methods to estimate preferences across individuals. Most of the preference heterogeneity analyses in the stated choice experiments have a tendency to explore heterogeneity from different angles, which are guided by the choice modeling research questions, fundamental assumptions of choice behavior (i.e., whether choice heterogeneity varies systematically or randomly, whether heterogeneity varies through observable characteristics or latent characteristics), and model specifications (e.g., model identification, distributional assumption, model fitting and selection criteria, convergence aspects) [4,5,6]. Furthermore, the confounding nature of the preference parameter with the scale parameter motivates many researchers to account for scale heterogeneity (i.e., heteroskedasticity) while estimating preference/taste heterogeneity (i.e., differences in the relative effects of attribute levels) [7]. These make the existing research fragmented, and no systematic reviews have yet been conducted to summarize the existing preference heterogeneity approaches applied in healthcare studies. As published in Pharmacoeconomics [1], Zhou and colleagues conducted a systematic review specifically on the latent class (LC) models applied in healthcare preference studies, and many other systematic reviews on healthcare DCEs have long existed that mostly provided descriptions of the general state of the science [2,3,4, 8]. Hence, the objectives of this manuscript are to systematically summarize the current practices that account for preference heterogeneity, focusing on the published DCEs related to healthcare and identifying methodological gaps to inform future researchers in the DCE health preference heterogeneity research field.

This systematic review was conducted by the authors in collaboration with a strategic working group within The Professional Society for Health Economics and Outcomes Research (ISPOR) Health Preference Research Special Interest Group (HPR SIG), which was established in early 2020. The acknowledgment section of the paper duly credited the contributors of the HPR SIG members. In order to provide a comprehensive overview of the established practices, the working group also performed an online survey among health preference researchers and methodological experts in choice modeling. Excerpts from this literature review were published alongside the survey evidence to provide context for the scientific community [9]. As the thorough discussion on the publication trend was beyond the scope of the HPR SIG’s report, the aim of the current paper is to document the findings of the systematic literature review in detail, emphasizing noteworthy patterns and trends in the examination of preference heterogeneity within health-related DCEs. For example, this systematic review includes in-depth descriptions of the research questions, the samples recruited, and an evaluation of the various approaches to experimental design. The paper also included a critical review of the analytical models applied in analyzing preference heterogeneity. Finally, the results are synthesized into a narrative summary that shows methodological gaps in current research practices for future healthcare DCEs to investigate preference heterogeneity.

2 Methods

2.1 Search Strategy

The literature review employed the search strategy in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist [10] and earlier published systematic reviews of healthcare DCEs [1, 2, 8, 11]. The literature search was performed in the PubMed, OVID (EconLit, Medline, Embase), and Web of Science electronic databases using the joint search terms shown in electronic supplementary material (ESM) Appendix 1, Fig. A1). In addition, supplementary citation searches were performed on two of the latest published systematic reviews on health DCEs [1, 2]. The date of the search period for published articles was from 1 January 2000 to 31 March 2020, as preference heterogeneity is little discussed in articles before 2000 [4]. During the search process, records were removed if (1) author name or abstract was missing; (2) published in non-English journals; (3) published as proceedings or conference paper, abstract only, or as a book section; and (4) systematic reviews and published protocols. Finally, duplicates from the identified articles were removed by hand and with Endnote X9 tools [12]. The final list was shared within ISPOR’s health preference research SIG [13] to identify any of the missing articles on preference heterogeneity that were not retrieved earlier.

2.2 Inclusion and Exclusion Criteria

Articles were included if (1) a DCE was used to elicit preferences for health and healthcare, such as health valuation studies, treatment studies, structure/policy studies (e.g., examine job preferences of health workers) [1, 2]; (2) the analysis was based on a random utility model (RUM) under microeconomic theory [14]; and (3) included analyses of preference heterogeneity either observable difference (e.g., included interaction terms between attributes/attribute levels and respondents’ characteristics in a pooled data set) [15] or unobservable differences (e.g., finite and continuous mixture models with and without covariates effects). For the purposes of this review, informative heterogeneity (i.e., the systematic correlation between unobservable factors associated with choices, e.g., nested logit) is included within unobservable differences; however, we recognize that such correlations may be unrelated to individual factors.

Articles were excluded if (1) studies analyzed preference of food (e.g., high sugar), transportation (e.g., road safety), and environment (e.g., air quality control) that may or may not be related to health unless addressing health and healthcare audiences (published in health, health economics, or methodological journal); (2) studies focused on choice heterogeneity only to evaluate heuristics (e.g., attribute non-attendance), information processing (i.e., differences in utility function), and data mining perspectives; and (3) studies that stratified the data with separate analyses (e.g., multiple countries or studies compared patients and physicians unless a random parameter logit [RPL] ran in subsample). To concentrate the discussion of analytical methods only on DCEs, studies with adaptive or scale-based preference-elicitation tasks, such as best-worst scaling (case 1, case 2, or case 3) [16], conjoint analysis (where respondents were required to rate or rank alternatives or attribute levels) [17], kaizen [18], and ranking [19] were also excluded.

2.3 Study Selection

At the first stage of article selection, all abstracts were examined by a pair of trained reviewers. Sixteen members from the ISPOR preference heterogeneity working group voluntarily participated in the training and screening process and were paired into eight groups. The pooled articles from the initial search were distributed among the eight groups with no overlapping. The detail of the working group is mentioned in the contributor section. Based on the inclusion/exclusion criteria, each reviewer independently classified the abstracts into accepted, rejected, or remained undecided. If both reviewers accepted an abstract, the article advanced to the next step for full-text review; however if reviewers differentiated in opinion, a third reviewer (CV) independently arbitrated the disputed abstract.

After the abstract review, the articles were collected, curated, and distributed again. As with the abstract review, the steps of the dual review were repeated in the full-text screening stage. If disagreement remained after third-party arbitration, the study team (BC, CV, SH, SK) discussed the articles and achieved a consensus-based decision. Once the full list of included articles was finalized, the study team sought feedback from the ISPOR HPR Special Interest Group [13] and the heterogeneity working group to include any missed articles.

2.4 Data Extraction

Given the articles that passed the full-test screening stage, the reviewers extracted data using a Microsoft Excel tool developed in Visual Basic for Applications (VBA). This VBA tool (ESM Appendix 2) was developed through an iterative consultation process with working group members, and each reviewer participated in the beta testing to clarify its terminology and to ensure systematic extraction of the key study features of interest. The VBA tool promoted consistency in extraction across the working group members who had different backgrounds and varying levels of expertise.

Using the VBA tool, reviewers extracted background information such as date of publication, objective (e.g., preferences for treatments, health outcomes, jobs), country of origin, sample type and size, number of alternatives and attributes, and experimental design. Detailed consideration was given to analytical models such as articles that conducted an RPL, data were extracted on number and type or draws, and the type of distribution specified for the random parameters. In studies reporting LC analysis, details about the number of classes and approaches used to select classes (e.g., Bayesian, Akaike information criteria [AIC]) were extracted. Other relevant extracted information was the statistical software name, estimation approach, statistical tests for the presence of heterogeneity, post-estimation calculations (e.g., willingness to pay, maximum acceptable risk, relative importance [RI]), and subjective categorization of the extent of discussion on preference heterogeneity into the article. Inferences were discouraged (for example, if an article did not state certain details, it was not inferred from the software used but rather marked as ‘unreported’). Apart from differences between reviewers, data extraction may vary between articles due to differences in research practices and reporting quality.

2.5 Analysis

To understand the change in the characteristics of publications over time, we analyzed the articles in three different time frames: 2002–2010, 2011–2015, and 2016–2020 (March)Footnote 1. The background information of articles was summarized through counts, percentages, means, and standard deviations (SDs). The major background information was the year of publication, country, study objective, study field and population. Data were also extracted on the study design, including sample size, number of attributes, number of alternatives, and experimental design approach. Finally, information on the presentation of results was also recorded. The review summarized the methodology of each study to understand how preference heterogeneity was addressed. Key outcomes related to methods of each study comprising modeling approaches such as observable and unobservable heterogeneity, consideration of scale, statistical specifications such as details of the distribution of estimated parameters and estimation techniques, software use, application of alternative models, and the interpretation of heterogeneous preferences.

3 Overview of Healthcare Discrete Choice Experiments (DCEs) with Preference Heterogeneity Analyses

The initial search from the electronic databases and citation of relevant articles identified 1202 records, of which 975 were accepted after the first screening. Chronological screening steps of abstracts and then the full texts identified 379 articles for data extraction. During the data extraction phase, another 45 studies were red-flagged and expert guidance incorporated another eight articles. Thus, the final database contained 342 articles (Fig. 1); the full list of articles can be found in ESM Appendix 1.

Fig. 1
figure 1

PRISMA diagram [9]. PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses

3.1 Trends and Sample Size

The trend of published healthcare DCEs shows that preference heterogeneity analyses increased substantially after 2010 (i.e., yearly publication increased by more than 10 since 2011) (Fig. 2). The average number of published preference heterogeneity analyses was 3.6 per year until 2010, increasing to 29.2 between 2011 and 2015, and 33.4 between 2016 and 2020. In 2015, the number was highest (n = 65). Among the 342 articles, the sample sizes ranged from 35 to 4600, with a mean of 583.9 (SD 567.83) and a median of 406 (interquartile range [IQR] 213.8–722.8) (Fig. 2). Only around 16% (n = 59) of the analyses had over 1000 respondents.

Fig. 2
figure 2

Trend of publications from 2000 to 31 March 2020 (left); distribution of sample size of the reviewed articles (right). IQR interquartile range

3.2 Areas of Application, Regions, and Population

Among the 342 articles, Table 1 shows that the preference heterogeneity analyses were largely focused on the preferences for treatment, or medicines (n = 163, 47.8%), followed by preferences for devices, screening, or procedures (n = 64, 18.8%) and health policies (n = 46, 13.5%). Around 7% (n = 24) of the studies were focused on job or education choices, 3% on health states/outcomes (within the quality-adjusted life-year framework), 1% on preferences for trial or program participation (e.g., blood donation, smoking cessation program), and around 9% had other types of objectives (e.g., preference for insurance, lifestyle, services, learning, and information).

Table 1 Background of discrete choice experiment studies

The majority of the preference heterogeneity analyses were performed in Europe (n = 193, 56.4%), followed by North America (n = 94, 27.5%) and Oceania (n = 38, 11.1%), namely Australia (Table 1). Within Europe, 25% of the articles were from the UK (n = 48), 16% were from Germany (n = 31), and others were from other countries, such as France (n = 15), Spain (n = 8) and Italy (n = 12). In North America, all studies are from the US (n = 55, 16.0%) and Canada (n = 38, 11.1%) except one from Central America. In Asia (n = 24, 7.0%), almost half of the studies are from China (n = 11) and 8% of the total extracted papers are from Africa (n = 27). South America produced the least number of studies across the world (n = 3) [20,21,22,23].

Within these areas of application and regions, most studies were conducted among patients (n = 163, 47.5%), followed by the general population (n = 94, 27.4%) and healthcare workers (n = 52, 15.2%) [e.g., physicians, nurses, pharmacists].

3.3 Attributes, Alternatives, Designs, and Estimates

Around half of the reviewed papers (n = 173, 50.58%) had five or six attributes, with the minimum and maximum number of attributes being 2 and 20, respectively (Fig. 3, Table 2). Furthermore, some studies described the choice tasks by presenting a combination of subsets of attributes rather than using the standard practice of presenting all attributes in each choice task [24,25,26,27,28,29,30,31,32]. The majority of studies had two alternatives (n = 271, 80.2%), followed by three alternatives (n = 42, 12.4%), excluding the reference case (e.g., opt-out, status quo). In most of the papers, the alternatives (n = 279, 81.4%) were unlabeled and around 35% of the studies included a reference case in the choice scenarios (Table 2).

Fig. 3
figure 3

Number of attributes in the included studies (left; n = 341); number of alternatives in the included studies excluding the reference case (e.g., opt-out, status-quo) [right; n = 333]. IQR interquartile range

Table 2 Alternatives, attribute and designs of discrete choice experiment studies

In order to retrieve information on the experimental design, we collected data on specific terms that the authors mentioned explaining the study design. In their description, nearly all articles reported the number of choice tasks per respondent (n = 334, 97.6%), which ranged from 3 to 48, with a median of 12 (IQR 8–16) (Table 2). Most studies (n = 304, 88.9%) reported information on set selection, and among them, 61% of the articles (n = 208) reported using some form of efficient design. The most prevalent design type is d-efficient (n = 185, 54.1%) (Table 2). Other prominent design types are fractional (n = 88, 25.7%), orthogonal (n = 45, 21.9%), and balanced (n = 35, 10.2%); however, these terms may not distinctively describe their approach (e.g., an efficient design is fractional too). Thirty-four percent of the studies reported constructing their design accounting for the main effects only, 6% accounted for two-way or higher-order effects, and most did not report the particular information (59%).

The review demonstrated the diversity of software used in experimental design and estimation. The most prevalent software package to design the experiment was Ngene (n = 112, 32.7%), followed by Sawtooth (n = 63, 18.4%) and SAS (n = 43, 12.6%) (Table 2). Around 76% (n = 261) of the reviewed articles reported the software package used for the analysis. Among them, the highest used software package was NLOGIT (n = 95), followed by Stata (n = 93), Sawtooth/Lighthouse Studio (n = 27), and Latent GOLD (n = 23).

In their description of the results, 80.8% reported coefficient estimates, and 28.6% of the studies reported willingness-to-pay (WTP). After 2010, studies also focused on other representation formats such as odds ratio (OR) and maximum/minimum acceptable risk (MAR). In addition, the application of RI and preference share has increased substantially in recent studies after 2016 (Table 2). In addition, a few studies reported the valuation space as WTP space [33,34,35,36,37,38].

4 Current Practices to Explore Preference Heterogeneity in Healthcare DCEs

4.1 Overview

To describe alternative methods used in preference heterogeneity analysis, we classified the models based on whether heterogeneity was explained by observable factors/covariates through subgroup analysis using interaction terms (explained preference heterogeneity) [39] and/or considered heterogeneity as unobserved (i.e., latent), which cannot be explained by observable correlates (unexplained preference heterogeneity). Accounting for unexplained preference heterogeneity needs further assumptions about the structure of latent heterogeneity [40, 41]. The major models identified to explore the unexplained heterogeneity are mixed logit (MXL) models with a continuous distribution of preference parameters (also known as RPL models), LC logit model with a discrete distribution of parametersFootnote 2, and scale heterogeneity models where scale parameter is random either with the continuous or discrete distribution [40, 42].

Among the 342 articles, the most common applied models are the MXL/RPL (n = 205, 59.9%), LC (n = 112, 32.7%), and conditional logit (CL; n = 90, 26.3%) models. Among the 205 analyses using the MXL/RPL, 42% (n = 86) were published between 2011 and 2015, and 50% (n = 104) were published between 2016 and 2020 (Fig. 4). A similar trend was observed in LC models, where 33% (n = 37) were published between 2011 and 2015, and 61% (n = 68) were published between 2015 and 2020 (Fig. 4). The application of LC models increased by 15% between 2011 and 2015 and 2016 and 2020, whereas the application of the MXL/RPL increased by 4%. The review also identified that the application of a particular model is often associated with software usage. Such as, classical MXL models were mostly performed in Stata (n = 62) or NLOGIT (n = 72); Bayesian mixed logit (hierarchical Bayes) models were mostly performed through Sawtooth (n = 18)Footnote 3, and models were mostly analyzed by using Latent GOLD (n = 21), Sawtooth (n = 23), or NLOGIT (n = 31).

Fig. 4
figure 4

Alternative models applied in the included studies

Around 38% (n = 126) of the reviewed paper explored heterogeneity using multiple models. Among them, 72% (n = 91) of the studies used two models, 24.4% (n = 31) used three different models, and only four studies used four different types of models [43,44,45,46]. Among the papers that reported estimation with two different models, 25% (n = 23) of them applied MXL/RPL and LC, 24% (n = 22) used CL and LC, and 16.8% (n = 15) used CL with MXL/RPL. The combination of CL, MXL/RPL, and LC is also the highest (35.4 %, n = 11) among the papers that used three different models for estimation. Among the papers that applied a single model (n = 216), 59.4% (n = 128) of the papers used any form of the MXL model (i.e., classical, Bayesian, or generalized), followed by the LC model (n = 43), CL model (n = 16), and random-effect probit/logit (n = 17).

4.2 Explained Heterogeneity

By holding the assumption that preferences are correlated with observable characteristics, differences in preferences can be explained by observable (e.g., clinical and socioeconomic) factors. In this case, heterogeneity is usually modeled from the pooled sample by estimating a coefficient on the interaction between an individual-level characteristic variable and an attribute variable. The CL/multinomial logit (MNL) models have most frequently been used to identify heterogeneity through observable differences [15]. Around 25% (n = 90) of the reviewed papers used CL/MNL models with such an interaction [47,48,49,50,51]. Some other studies also included interaction within MXL/RPL and random-effects models [52,53,54,55]. Among the total reviewed articles, 38.2% (n = 131) of studies included subgroup analysis (e.g., where an individual-level characteristic variable has interacted with all attribute variables), and 13% (n = 45) of studies also included interaction between attributes.

4.3 Unexplained Heterogeneity

Rather than limiting the source of heterogeneity by deterministic segmentation, the underlying assumption can be that preferences vary across individuals due to latent factors that may or may not be easily observed. Based on this key principle, different applications capture unexplained heterogeneity by specifying the model with random preference coefficients with specific distributional assumptions and correlation structures [56]. In its most simple form, the MXL/RPL model accounts for unexplained heterogeneity through individual deviations from the mean estimate in a continuous distribution specification. Among those reviewed, the majority of the MXL/RPL models (n = 177, 86.3%) were estimated using the classical estimation procedure (i.e., maximum likelihood), and the remainder used the Bayesian estimation technique (commonly known as hierarchical Bayes [HB]). Majority of the studies reported SD or the spread of the random parameters to identify the degree of preference heterogeneity, whereas some studies introduced an interaction between the mean estimate of the random parameter and a covariate to understand preference heterogeneity around the mean on the basis of the observed covariates. Of models that are required to specify distributional assumptions (e.g., MXL, HB, generalized multinomial logit model [GMNL]; n = 212), just over half (n = 117, 55.2%) have reported the specific distribution. Among them, most of the papers reported specifying a normal distribution (n = 110), followed by lognormal (n = 15) and triangular (n = 2) distributions. Many studies also specified distribution(s) for parameters that approximate reality as closely as possible (lognormal distribution for non-negative parameters, imposing constraints to control the spread/SD of the parameters, fixing coefficient[s]) [38, 57,58,59,60,61,62,63]. Only nine of the reported studies explicitly mentioned using the correlated distribution of preference parameters [22, 64,65,66,67,68,69,70]. Two studies had infused the error component model with mixed logit as a model specification [71, 72].

Some studies that used classical MXL/RPL reported further details in the specification; for example, 39.3 % (n = 71) reported the number and types of draws for simulation. Seventy of these used Halton draws with an average of 1528 draws (minimum 50, maximum 10,000), two studies used pseudo-random draws [73, 74], and one study used Modified Latin Hypercube Sampling (MHLS) draw with 500 draws [75]. One study also reported to have multidraws to test the stability and precision of the estimated coefficient [57]. Two studies estimated MXL using both the classical and Bayesian estimation techniques [76, 77]. Other types of random parameter models are the random-effect logit/probit model (applied in 7%, n = 25 studies).

In contrast to the MXL/RPL, the LC model specifies unexplained heterogeneity as classes (clusters) of individuals with similar (but unobserved) preference structures. LC models replace the continuous distribution assumption of MXL/RPL with a discrete distribution in which preference heterogeneity is captured by membership in distinct classes or segments [40]. The assumption of preference homogeneity holds within a class. Among the papers that estimated the LC model (32.2%, n = 110) (Fig. 4), 98% of the studies reported the class size of the best-fitted model. The range of reported best-fitted class sizes varied between two and six. Around 80% (n = 88) of the LC papers mentioned the model selection criteria, and the most common approach to selecting the optimal number of classes was the AIC (58.1%) [78], the Bayesian information criterion (BIC; 48.2%), and the consistent AIC (CAIC; 9.0%) [78]. Forty-one (37.2%) of the studies reported using one model selection criterion, 23 (26.1%) reported two criteria, and 22 papers reported using more than two criteria to select the best-fitted model. Studies that found conflicting recommendations for best-fitted classes by different information criteria tend to choose BIC and CAIC over AIC, as AIC was reported to overfit the model and BIC imposes more stringent criteria [24, 30, 79, 80]. In addition to information criteria, interpretability of class coefficients, sample size, adjusted R2, entropy score, and bootstrapping procedure to obtain the log-likelihood difference between classes have also been used to rationalize the optimal number of segments in LC models [24, 28, 30, 81,82,83].

In addition to the wide applications of the above models, around 7% (n = 25) of the studies used random effect probit/logit models to control for unobserved heterogeneity using panel specification of the data [32, 55, 84,85,86,87].

4.4 Scale Heterogeneity

In addition, to explore heterogeneity only through taste variation among respondents, some papers included the source of heterogeneity from the scale parameter (which relates to variation from the choice behavior and creates differences in the absolute effects of all attributes) and allowed for observable or unobservable differences in scale between individuals (i.e., scale heterogeneity) [7, 42]. For example, the heteroskedastic conditional logit (HCL) model (n = 12) allows for observable differences in scale between individuals [15]. The scale-adjusted LC model (n = 2) specifies scale heterogeneity as classes (clusters) of individuals with similar (but unobserved) scale structures [88, 89]. Overall, 11.7% (n = 38) of the articles explored scale heterogeneity in the data, and the generalized mixed logit model (GMNL) is identified as the most common model in this regard (n = 20).

5 Discussion

Our findings agree with the earlier systematic reviews; before 2012, 15% of the published healthcare DCEs dealt with heterogeneity, which increased to 67% from 2013 to 2017 [2]. This study provides an overview of how preference heterogeneity has been incorporated into healthcare DCEs. Most of the studies were focused on identifying heterogeneous preferences for different treatments/medications followed by preferences for different medical procedures or screening devices. A large number of studies also included heterogeneous preferences in developing health policies. In most of the studies, the targeted population was patients and the general population, followed by healthcare workers and informal caregivers. We have also identified that preference heterogeneity analyses in DCEs are becoming increasingly common across regions. The studies are conducted over a broad range of sample sizes, numbers of attributes, and numbers and types of alternatives. Furthermore, we have found a number of software packages used in analysis where STATA and NLOGIT were the most common. This can be because both software can handle CL, MXL/RPL, and LC models, which are the most frequently used models. On the other hand, some of the software are associated with a particular type of model, such as Latent GOLD with LC and Sawtooth with the hierarchical Bayes.

The trend of shifting heterogeneity analysis through random variation (i.e., discrete distribution or continuous distribution) of taste parameters rather than identifying deterministic segmentation (explained preference heterogeneity) has increased the scope of investigation by capturing a wider extent of preference heterogeneity [56]. This may be possible due to a significant increase in the performance of computational power of personal computers and straightforward implementation of different software. However, researchers should cautiously use the models by understanding the hidden assumptions of each model. Even though all the studies explicitly mentioned their model, many studies did not clearly report key details of the chosen model. For example, in analyzing heterogeneity through MXL/RPL models, a proper investigation should be done on identifying which parameters may consider as random and justification of the distributional assumption of the random parameters, whether to use multiple draws to identify the type and number of draws for simulation, and whether to account for correlation across alternatives or an individual’s multiple-choice situation [90, 91]. Similarly, in the LC model, the optimization algorithm in LC logit models, the covariates to identify class membership, and the type of attribute coding. In MXL models, the typical use of Halton draws is found to be consistent with methodological papers, and even though there is no standard practice of a required number of draws, usually the number of draws increases as models become more complex (i.e., correlation of attributes and alternatives, number of attributes and alternatives) [90, 92]. Furthermore, studies with too few Halton draws might have the possibility to produce bias estimation [91]. The application of multiple models to understand the structure of heterogeneity is observed in many of the reviewed papers and is considered a good practice in different literature [1, 93]. Moreover, authors often use simple models (i.e., CL) to inform subsequent analyses (i.e., specifying utility function, select starting values). In addition, comparing with simple base-case models (i.e., CL) with heterogeneous model output, often facilitates researchers to validate the model (i.e., sign on coefficients) and to identify the merit of the presence of heterogeneity across different attributes [94,95,96].

Although we did not systematically assess how the preference heterogeneity analysis was reflected in discussing its implication at the practical context or policy level, the subjective categorization of reviewers identified 40% of the reviewed papers extensively discussed heterogeneity, another 40% discussed it to some extent, and the remaining 20% did not discuss heterogeneity and its implications. Those papers that discussed the heterogeneity issue extensively were also likely to do further post-estimations such as obtaining individual-level parameters, regression analysis to examine effects of individual characteristics, and estimated distribution of MAR, WTP, or preference shares. However, the literature contained few confirmatory studies [15], even though, in the wider choice modeling field, there is mixed support regarding the ability of researchers to correctly identify and account for scale heterogeneity in their estimations. Data from this review identified a small number of papers that controlled for scale while identifying preference/taste heterogeneity, which is consistent with views from the earlier published review [3]. The usual motivation for accounting for scale heterogeneity is to either improve model fit or to avoid biased results [7, 9, 71, 97]; however, identifying the exact tools to account for scale while identifying preference/taste heterogeneity and the success of those models to separate taste and scale heterogeneity is an area under development.

5.1 Limitations

The study is limited by several weaknesses. First, we have curated only the papers that analyzed preference heterogeneity and extracted information primarily on preference heterogeneity relevant issues. Hence, it was beyond the scope of this study to directly compare any model that does not capture preference heterogeneity. Multiple reviewers were involved at the extraction stage, which may arise the possibility of different interpretations of selected literature, and which might have affected the data extraction and, as a consequence, the results presented. Furthermore, this paper summarized what is reported in reviewed papers, not necessarily what was done. To improve the consistency, the reviewers met biweekly during the data collection period to discuss any confusion, completed the beta test with the extraction tool, and each paper was reviewed by two extractors. Second, if a study included multiple sample types, the sample size showed the aggregate number. Similarly, if multiple models were used in the analysis, each paper is counted as one application of the model. Furthermore, we may have had duplicate studies where different models have been published on the same data set for different journals/audiences, therefore there is the potential for double-counting as it was too difficult to identify and remove. Lastly, the review focused on DCEs, the most popular stated preference method; however, the findings are likely to be applicable to other preference elicitation methods, such as best-worst scaling, which utilizes similar models. Future research may be conducted to understand more complex issues regarding preference heterogeneity, such as the effects of experimental design on heterogeneity analysis, covariate association to identify LC membership or subgroup analysis, accounting for correlation between choice situations or across parameters, and the influence of preference heterogeneity results on practitioners and regulators in decision making.

6 Conclusion

This systematic review has shown that preference heterogeneity analysis is a growing trend in healthcare DCEs. This study provides the current practices of a wider variety of models to account for preference heterogeneity. A large number of studies not reporting methodological details were identified, suggesting the need for greater guidance on good research practice. As heterogeneity can be assessed from different stakeholder perspectives with different methods, the research should focus more on providing complete information on analytical methods, for example, a theoretical framework for model selection, behavioral assumptions for variable selection, distributional assumption, convergence criteria, and other statistical details. The inclusion of preference heterogeneity analysis in healthcare DCEs signifies that developing guidance specifically for the analysis of unexplained heterogeneity, such as MXL/RPL and LC models, might positively impact the quality of health preference research, increase confidence in the results, and improve the ability of decision makers to act on the preference evidence.