Key Points for Decision Makers

The paper provides researchers and decision-makers who have relatively limited experience with stated-preference methods an overview of the opportunities and challenges of statistical estimation of stated-preference importance weights.

The paper introduces methods for analyzing choice data using simulated data from an example choice experiment. It reviews each step in the analysis, including construction of a dataset from the raw survey data, coding categorical data, evaluation of data quality, selection of estimation method, and interpretation of results.

It illustrates various topics using synthetic data from the example DCE. The dataset and both R and Matlab code are provided to allow readers to reproduce the results themselves.

1 Introduction

Data collected from discrete-choice experiments (DCEs) indicate individuals’ preferences over choice sets of constructed alternatives. Quantifying and interpreting the factors that affect the observed choices elicited by the survey instrument require a statistical model [1]. This article discusses the advantages and disadvantages of alternative modeling approaches. It also illustrates applications of each approach using an example DCE.

In the case of discrete-choice data, the researcher or analyst observes only which one of the two or more alternatives in each choice set was chosen. Thus, the dependent variable is the alternative selected as preferred over the other options. The analytical challenge is to understand the influence of the features or attributes describing the alternatives on perceived utility and choices, as well as to assess the role of respondent-specific covariates and other possible factors [2].Footnote 1

The main analytical results are marginal, quantitative preference weights.Footnote 2 These estimates provide information on marginal rates of substitution indicating how much of one attribute respondents would give up in exchange for another attribute. These values are useful for translating utility trade-off values into more intuitive equivalence metrics based on money (willingness to pay), risk (maximum acceptable risk), efficacy (minimum acceptable efficacy), or time (healthy-time equivalents).

2 An Example DCE

In this article we discuss analysis options in the context of a specific empirical example. The data have been constructed for illustrative purposes. The attributes and levels do not necessarily reflect realistic clinical tradeoffs. Table 1 summarizes the attributes and levels for a DCE to elicit preferences for headache treatments. The first attribute is an efficacy measure for how many severe headaches the respondent would experience each month, with treatment-outcome levels ranging from 3 to 0 headaches per month. Thus, fewer headaches indicate more benefit. We define treatment risks as the percentage of people who took the treatment and subsequently had a heart attack within 10 years as a result. Risk levels range from 0 to 3%. The study design also includes attributes for mode of administration (infusion, injection, and tablet) and out-of-pocket cost ranging from $0 to $150 per month. Figure 1 illustrates an example question.

Table 1 Example DCE: attributes and levels
Fig. 1 Example DCE: question format

3 Choice Data

3.1 Dataset Structure

The study design and experimental design determine the attributes and levels, as well as the combinations of attributes and levels that define the treatment-profile alternatives that respondents evaluate. We expect these attribute levels to affect observed choice patterns. In addition, personal characteristics including socio-demographic variables such as age, gender, education, and income often help explain differences in choices among respondents.

In our example there are two generic-alternative treatment profiles, Medicine A and Medicine B. In this “unlabeled” question format, alternative descriptions have no meaning apart from the indicated differences in attributes. Because the two alternatives do not include an opt-out or status-quo alternative, it is a “forced-choice” question. If an opt-out or status-quo alternative is included in the choice question, the data will include an additional alternative, which is a form of labeled design. Many health applications use forced-choice formats because “no treatment” often is not a clinically realistic option. However, studies designed to understand adherence or screening uptake, or in which standard of care is a realistic option, will include an opt-out alternative [3].

Opt-out alternatives are modeled with an alternative-specific constant. The alternative may or may not have attribute levels assigned to it, but if so, the attribute levels are inherently confounded with the constant term. The opt-out attribute levels should not be included in estimating the corresponding levels in the opt-in alternatives. A significant disadvantage of a study design with an opt-out alternative is that opt-out choices provide no statistical information for estimating tradeoffs among the opt-in alternatives. In relatively small samples, this loss of statistical information could make it difficult to obtain acceptable precision for the attributes in the experimental design [4,5,6].

Most choice-data analysis software requires the data be organized in a so-called long format. Table 2 shows the layout for a two-alternative design. In this format each row of the dataset is a choice-question alternative, so one choice question requires two rows of data. Table 2 presents data for five choice questions similar to Fig. 1, so there are 10 rows for each respondent, two rows for each of five pairs of alternatives. The chosen alternative is indicated by a 1 in the Choice column. Some software requires a wide data format, where the information in the second row is placed next to the first row on the same line.

Table 2 Example choice dataset structure

Attribute levels vary between alternatives for each respondent (columns 5–8). However, respondent characteristics such as age, gender, education, income, health history, treatment experience, current health status, and health-related attitudes do not vary between alternatives. Thus, the variable “age” in Table 2 for each choice question is the same for every alternative for each respondent.
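To make the layout concrete, the following minimal R sketch builds the long format of Table 2 from hypothetical wide-format records for a single respondent. All variable names and values are illustrative assumptions, not the article's supplementary code.

```r
# Hypothetical wide-format answers for one respondent: one row per choice
# question, with Medicine A and Medicine B attributes stored side by side.
wide <- data.frame(
  resp_id    = 1, question = 1:5,
  headache_A = c(1, 0, 3, 1, 0),  risk_A = c(3, 1, 0, 0, 3),
  mode_A     = c("infusion", "pill", "injection", "pill", "infusion"),
  cost_A     = c(150, 50, 0, 150, 50),
  headache_B = c(3, 1, 0, 0, 3),  risk_B = c(1, 3, 1, 3, 0),
  mode_B     = c("pill", "infusion", "pill", "injection", "pill"),
  cost_B     = c(50, 150, 50, 0, 150),
  choice     = c("A", "A", "B", "A", "B"),  # alternative actually chosen
  age        = 46                           # respondent characteristic
)

# Reshape to long format: one row per alternative, two rows per question.
long <- reshape(
  wide, direction = "long",
  varying = list(c("headache_A", "headache_B"), c("risk_A", "risk_B"),
                 c("mode_A", "mode_B"), c("cost_A", "cost_B")),
  v.names = c("headache", "risk", "mode", "cost"),
  timevar = "alt", times = c("A", "B"),
  idvar   = c("resp_id", "question")
)
long$chosen <- as.integer(long$alt == long$choice)  # 1 = chosen alternative
long <- long[order(long$question, long$alt), ]      # two rows per choice set
```

Respondent characteristics such as age simply repeat on both rows of each choice set, matching the structure of Table 2.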

3.2 Different Types of Variables

There are four possible variable scale types: nominal, ordinal, interval and ratio. Nominal-scale variables simply are labels with no quantitative meaning. Examples include mode of drug administration, type of side effect, or treatment location. Ordinal-scale variables can be ranked by relative size, but the distances between points on the scale are indeterminate. For example, an ordinal severity scale could consist of four points: mild, moderate, severe, and very severe. While such variables often incorrectly are converted to a continuous 1-2-3-4 scale, the perceived distance between mild and moderate could be very different from the perceived distance between severe and very severe. Interval-scale variables are ordered, and the distances between points are quantified. The distance between 2 and 4 on the scale is the same as the distance between 8 and 10. Temperature is one example. While distances between points are meaningful, interval scales lack a true zero. The zeros for Celsius and Fahrenheit correspond to quite different temperatures, and neither corresponds to “no temperature.” Interval-scale variables can be added and subtracted, but not multiplied or divided. Finally, ratio-scale variables have a true zero, which permits multiplication and division. Examples include height, weight, and money. Thus, it is possible to calculate price per pound.

Headache, risk, and cost in Table 2 are ratio-scale variables. However, we are interested in the utility values of differences among the levels. There is no reason to think that the difference between 0 and 1 headache or 0% and 1% risk is exactly half as important as the difference between 1 and 3 headaches or 1% and 3% risk. Thus, we will treat these attributes as ordinal-scale variables to allow for nonlinear preferences. We discuss two alternative ways to code nominal and ordinal variables below in section 5.

4 Internal Validity

Obtaining valid survey data is increasingly difficult, not least because of the intrusion of bots and artificial intelligence used to complete surveys. Good survey design should include traps to identify non-human respondents [7]. The amount of time respondents take to complete the choice questions can indicate how attentive they are to a careful evaluation of the tradeoffs [8]. A minimum completion time can be used as an exclusion criterion if some respondents clearly did not take enough time to read the questions before answering. Except for extreme cases of rushing, where respondents should be excluded from the analysis, it generally is better to include such behavioral measures as covariates in the analysis rather than use them to exclude respondents from the sample [9,10,11].

Quiz questions can provide an indication of respondent attentiveness or understanding. For example, after introducing the icon array for the risk attribute in our example, respondents could answer a quiz question requiring them to interpret the graphic correctly. The completion rate, i.e., the percentage of choice questions that are validly answered, also can serve as a data-quality check. More complex studies involving larger numbers of attributes or choice alternatives are likely to have higher incompletion rates than those with only a few attributes and choices per task [12]. The statistical properties of the experimental design assume that everyone completes the full series of questions shown to them. Thus, some researchers exclude respondents with incomplete answers. However, for small samples, it could be advisable to use all the answered questions available. The effects of including or excluding respondents should be evaluated in sensitivity analysis. Nevertheless, as suggested previously, it is preferable to model the effects of such quality indicators as respondent covariates rather than delete respondents from the sample.

Janssen et al. [11] recently reviewed DCE applications in health and identified five internal-validity tests: within-set dominated pairs, across-set dominated pairs, attribute dominance, attribute non-attendance, and profile positioning. Jonker, Roudijk, and Maas [13] conducted a review of the DCE literature and found that within-set dominant choice tasks and stability tests (repeated choice tasks) were the most commonly used tests. However, these tests require additional constructed, non-experimental choice tasks, which can increase respondent burden. Moreover, Jonker et al. [13] found that these tests are poor indicators of response-data quality while imposing costs in terms of statistical power, as well as being inconsistent with random-utility theory.

The primary drawback of such internal-validity tests is that there is no principled standard for determining what to do with apparent internal-validity test failures. While there is significant literature on the inclusion of internal-validity tests in DCEs in the health and non-health fields [11,13,14,15,16,17], it is unclear how many unexpected responses on validity tests classify a respondent as having unacceptably poor-quality data, or how best to account for failures in choice models. Veldwijk et al. [17] provide guidance on possible causes and analytical strategies for dealing with apparent internal-validity failures.

A common response pattern is respondents who always choose the alternative with the better level of a single attribute. Such dominated choice patterns could be an indication that respondents did not take the trouble to evaluate the tradeoffs. In that case, the responses are uninformative about their preferences. As argued in Tervonen et al. [18], excluding respondents on the basis of dominance tests is problematic because it could introduce bias and affect the external validity of the experiment if respondents choosing the dominated option are not random in the study population. It also is possible that respondents’ decision criteria are different from those assumed by the researchers. For example, Johnson et al. [19] found that 30% of respondents failed a dominated-pair choice between no treatment and an intervention for a terminal disease with evidence of poor efficacy and high risks. Subsequent analysis confirmed a pro-treatment value independent of evidence on outcomes.

Rather than dropping such respondents, analysts can model their behavior using an interacted dummy variable to test whether their data significantly affect the mean estimates for the rest of the sample. Using only the estimates without the interaction term assumes their preferences are invalid. Combining the estimates without and with the interaction assumes their preferences are valid. The difference between the two results provides a sensitivity analysis of these assumptions. Alternatively, choice modelers have begun employing latent-class analysis (discussed below) to isolate preference patterns that yield no statistically informative information [20]. This strategy avoids subjective criteria for judging internal validity by identifying informative and non-informative latent classes. The preference estimates in the informative preference class are based only on statistically valid information. Good analytical practice includes reporting summary statistics of internal-validity tests and explaining how they were accounted for in the analysis [20].
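As one possible sketch of the interacted-dummy approach, the base-R code below flags respondents who always chose the lowest-cost alternative and interacts that flag with the cost attribute; it assumes the hypothetical long-format data frame sketched earlier and is illustrative rather than the article's own analysis.

```r
# Within each choice set, mark the alternative with the lowest cost.
long$best_cost <- ave(long$cost, long$resp_id, long$question,
                      FUN = function(x) as.integer(x == min(x)))

# Flag respondents whose chosen alternative was always the lowest-cost one.
long$always_best_cost <- ave(as.integer(long$chosen == long$best_cost),
                             long$resp_id,
                             FUN = function(x) as.integer(all(x == 1)))

# Interaction regressor: including both cost and cost_x_flag in the utility
# specification tests whether the flagged group's cost weight differs from
# the rest of the sample.
long$cost_x_flag <- long$cost * long$always_best_cost
```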

5 Model Specification

The statistical analysis of the choice data itself starts with an appropriate model specification. Model specification refers to defining the relationships between respondents’ pattern of choices and various explanatory variables. Correct model specification is essential to drawing correct conclusions from stated-preference data. (For a general introduction to the choice-modeling conceptual framework, see Chapter 2 in Train [21]Footnote 3.)

5.1 Linear Additive Model

The simplest description of choice defines the total utility of alternative i as a linear additive function of attribute levels in a given alternative. This specification is called a main-effects model. For example, the utility of Medicine A, U(A), and Medicine B, U(B), in Fig. 1 can be written as:

$$U\left(A\right)={\beta }_{\text{headache}}\times \left(1\right)+{\beta }_{\text{risk}}\times \left(0.03\right)+{\beta }_{\text{infusion}}+{\beta }_{\text{cost}}\times \left(150\right)+ \varepsilon$$
$$U\left(B\right)={\beta }_{\text{headache}}\times \left(3\right)+{\beta }_{\text{risk}}\times \left(0.01\right)+{\beta }_{\text{pill}}+{\beta }_{\text{cost}}\times \left(50\right)+ \varepsilon ,$$

where the β parameters for headache, risk, and cost are marginal preference weights for continuous variables, and ε denotes the random-error term. Model parameter estimates are empirical estimates of preference weights. Because mode (infusion, injection, or tablet) is a nominal-scale variable, each type of mode has a separate preference-weight parameter. The error terms are added to U(A) and U(B), indicating unobserved effects and measurement error related to individual variability in interpreting attribute levels, effort in evaluating tradeoffs, perceived difficulty of the question, and other individual-specific factors.

Note that no intercept is included for unlabeled designs because the labels Alternative A and Alternative B have no utility significance. Including an intercept will estimate an alternative-specific constant, which we expect to be statistically insignificant. A statistically significant alternative-specific constant indicates a left-right preference regardless of the attribute levels shown and could indicate that many respondents used a simple answer strategy that provided no statistical information about attribute preferences.

Interval-scaled and ratio-scaled attributes are continuous and can be modeled as either linear or nonlinear. The linear additive model imposes the assumption that utility is linear in attribute levels. In our example, this specification requires that the utility difference between three headaches per month and one headache per month is twice as large as the utility difference between one headache and zero headaches. Linearity should not be assumed without testing that specification. The type and amount of curvature over the range of a continuous variable generally is unknown, although empirical research or theory could suggest hypotheses about whether marginal changes are increasing or decreasing.

5.2 Categorical Specifications

Best practice is not to assume any functional form by initially treating continuous variables as if they were categorical. In our example, the Risk attribute has levels 0%, 1%, and 3%, so the values could be treated as “0%,” “1%,” and “3%” labels. A fully categorical model for both nominal and continuous attributes allows each level to have its own preference weight. The specification for the example in Fig. 1 looks like this:

$$U\left(A\right)={\beta }_{1\_\text{headache}}+{\beta }_{3\%\_\text{risk}}+{\beta }_{\text{infusion}}+{\beta }_{\$150}$$
$$U\left(B\right)={\beta }_{3\_\text{headaches}}+{\beta }_{1\%\_\text{risk}}+{\beta }_{\text{pill}}+{\beta }_{\$50}.$$

If the difference between the parameters for 3% and 1% is twice as large as the difference between the parameters for 1% and 0%, then there is support for a linear model using continuous levels of 0, 0.01, and 0.03 with a single slope parameter. Researchers also could choose to report the categorical results to avoid imposing any continuous functional form on the data.

Much confusion can be avoided by remembering that only relative differences matter in choice models. We cannot say that the categorical value of 3% is X times as large as 1%. We can only say that the difference between the categorical values of 3% and 1% is X times the difference between the categorical values of 1% and 0%.

Because only differences matter, different coding scales can convey the same statistical information. There are two possible approaches to coding categorical attributes, or any variable with discrete levels: dummy coding and effect coding [22]. Categorical coding requires omitting one level of each attribute (e.g., the 0% risk level) because, once all levels but one are determined, the remaining level is implied. Including all the levels in the model results in perfect collinearity and causes estimation to fail.

In the headache example, all of the attributes have three levels. For a categorical model, first construct 0/1 dummy variables, i.e., indicator variables for two of the three levels. Table 3 presents indicator variables for no headaches and 1 headache. The omitted category, also called the reference category or level, is three headaches. The difference in the two coding schemes is that the dummy-coded omitted category is coded as 0 in both columns, while the effect-coded omitted category is coded as − 1 in both columns.

Table 3 Comparison of dummy coding and effect coding

Both dummy-coded and effect-coded variables contain exactly the same statistical information because the spaces between levels are the same, no matter how zero is defined. Zero is the dummy-coded parameter for all omitted categories. Parameters for included levels are interpreted as the utility difference relative to an omitted reference category for dummy coding. In contrast, zero is the mean of the effect-coded parameters. If the mean is zero, then the sum of the parameters has to be zero. This will be the case if the omitted-category parameter is set equal to the negative sum of the included-category parameters for each attribute. Thus, there is a separate parameter estimate for every attribute level, but parameters are defined as differences relative to the mean. Thus, some coefficients will have positive signs, and some will have negative signs. Negative signs do not indicate negative utility, just utility levels that are less than the mean. For dummy variables where the omitted category is a true zero, positive and negative signs indicate positive and negative utility.
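The two coding schemes of Table 3 can be produced with base-R contrasts; the sketch below uses the headache attribute with three headaches per month as the omitted category in both schemes (variable names are illustrative).

```r
# Three-level headache attribute; the omitted (reference) level is 3 headaches.
headache <- factor(c(0, 1, 3, 1, 0, 3), levels = c(0, 1, 3))

# Dummy (treatment) coding: the omitted level is coded 0 in both columns.
dummy <- model.matrix(~ headache,
                      contrasts.arg = list(headache = contr.treatment(levels(headache), base = 3)))

# Effect (sum-to-zero) coding: contr.sum codes the last level as -1 in both columns.
effect <- model.matrix(~ headache,
                       contrasts.arg = list(headache = contr.sum(levels(headache))))

dummy[, -1]   # drop the intercept column to see the coding of Table 3
effect[, -1]
```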

Because zero can be defined in different ways, t statistics can be misleading. It usually is not of interest whether a parameter estimate is different from a dummy-coded zero or the effect-coded mean effect. Attribute levels farther from a worst-level zero are more likely to be statistically significant. Likewise, levels on the ends of the range of attribute levels are more likely to be statistically different from the effect-coded zero, while levels in the middle will have small t values. It is more interesting to ask whether parameter estimates are different from each other, rather than relative to an arbitrary zero. If the difference between adjacent attribute-level parameter estimates is significantly different from zero, then we can conclude that respondents discriminated between the relative importance of the two levels and thus the estimates are informative about relevant tradeoffs.

Many choice modelers prefer effect-coded specifications because they directly obtain a coefficient and standard error for each attribute level, which simplifies secondary analysis of the estimates. However, software may or may not provide standard errors for the effect-coded omitted categories. If not, standard errors must be calculated separately using the delta or Krinsky–Robb method [23]. Dummy-coded models require choosing which level to omit. If there is a natural zero, such as no treatment, then dummy coding has the advantage of interpreting all estimates relative to a clinically meaningful omitted category. Tests of statistical significance for dummy-coded coefficients then are relative to an intuitive omitted category. Overall significance for an attribute can be assessed with a likelihood-ratio chi-squared test or a Wald test.
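When software reports only the included effect-coded levels, the omitted-category estimate and its standard error can be recovered directly; the sketch below uses hypothetical estimates and a hypothetical covariance matrix for a three-level attribute (the linear case of the delta method).

```r
# Estimates and covariance matrix for the two included levels (hypothetical).
est <- c(b_0headache = 3.10, b_1headache = 0.25)
vc  <- matrix(c(0.020, 0.004,
                0.004, 0.015), nrow = 2, byrow = TRUE)

# Omitted category = negative sum of the included-level parameters.
b_omitted  <- -sum(est)
# Var(-b1 - b2) = Var(b1) + Var(b2) + 2*Cov(b1, b2) = sum of all vc entries.
se_omitted <- sqrt(sum(vc))
c(estimate = b_omitted, std.error = se_omitted)
```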

Figure 2a and b show the same categorical preference-weight estimates for dummy-coded and effect-coded attributes. Note that the shapes of the attribute groups and the distances between coefficients are the same for both coding approaches.

Fig. 2 a Example dummy-coded preference-weight estimates; fully categorical model. b Example effect-coded preference-weight estimates; fully categorical model

5.3 Continuous Nonlinear Specifications

Figure 3 illustrates an example comparing risk preferences [24]. Benefit–risk analysis commonly assumes that expected utility is linear in probability. Example DCE categorical estimates with 95% confidence intervals are shown as blue triangles, indicating violation of the linear-in-probability assumption. The green line specifies risk as a nonlinear continuous variable using a weighted-probability function proposed by Tversky and Kahneman [25].
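For reference, a widely used single-parameter form of the Tversky–Kahneman probability-weighting function is shown below; whether Fig. 3 uses exactly this parameterization depends on the specification in [24]:

$$w\left(p\right)=\frac{{p}^{\gamma }}{{\left[{p}^{\gamma }+{\left(1-p\right)}^{\gamma }\right]}^{1/\gamma }},$$

where values of γ less than 1 produce the characteristic overweighting of small probabilities and underweighting of large probabilities.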

Fig. 3 Nonlinear risk-preference estimates

Typical nonlinear transformations of continuous variables include natural log and quadratic (squared) functions or spline functions. The number of levels of a continuous attribute in the experiment restricts the type of nonlinearity that can be estimated. For example, a quadratic term requires at least a three-level attribute.
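A minimal sketch of such transformations, assuming the hypothetical long-format variable names used earlier:

```r
# Quadratic term for risk: utility includes b1*risk + b2*risk_sq.
# A quadratic requires at least three design levels for the attribute.
long$risk_sq <- long$risk^2

# Natural-log term for cost: a +1 offset is used here only because the
# design includes a $0 level; a spline is another option.
long$log_cost <- log(long$cost + 1)
```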

5.4 Attribute Interactions

Experimental designs generally require attribute-level variations for one attribute to be uncorrelated with attribute-level variations in all other attributes. However, that does not mean that attribute-level utilities are independent of one another. For example, headache consists of both pain intensity and pain duration. Thus, the (dis)utility of pain depends on its duration and vice versa. Ignoring such interactions would result in biased estimates, so the experimental design must include the ability to identify interactions [26]. Taking interactions between attributes into account implies increasing the number of parameters to be estimated and therefore usually requires more choice questions. Both main effects and interaction effects should be included in the choice model, even if the main effects are not significant.

Such interactions are straightforward to implement and interpret for continuous variables but can be more complicated for categorical variables or continuous variables that are dummy-coded or effect-coded, even though their inclusion could improve the predictive accuracy of the model [27]. Linear or nonlinear continuous specifications therefore are recommended for continuous variables when interactions are suspected.

5.5 Modeling Individual Characteristics

Unlike regression analysis of a continuous dependent variable, choice models do not allow appending respondent-specific variables to the utility function for estimation. The reason is that choice models use the statistical information about utility differences among choice alternatives to estimate model parameters. A variable that is constant between choice alternatives, such as age or income, therefore simply cancels out in the utility difference. A simple solution is to split the sample into groups by age, income, current symptom severity, or other personal characteristic and estimate a separate choice model for each group. However, there are several disadvantages of split-sample models, including:

  • Differences could be imputed to, for example, income, when in fact the difference is a result of a personal characteristic correlated with income, such as education or age.

  • Experimental designs generally are not powered for subsample analysis, so it is harder to reject null hypotheses.

  • It is difficult to compare results between groups because of scale differences. Remember, you cannot compare individual parameter estimates between subsample models, only relative differences [28].

Subsample analysis is best done in a joint model using hypothesis-driven interactions between subsample characteristics and particular attributes [29,30,31]. Interacting attribute levels or labeled alternatives such as opt-out with individual characteristics allows controlling for multiple characteristics simultaneously but can result in an infeasibly large number of parameters to estimate with the relatively small samples available in many preference studies. Specification searches for a parsimonious set of respondent-characteristic interactions should be guided by theoretical or clinical hypotheses rather than simple trial and error based on significance testing.
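A brief sketch of such an interaction, again assuming the hypothetical long-format variable names used earlier and a purely illustrative split at age 50:

```r
# Age is constant across the two alternatives in a choice set, so its main
# effect cancels out of the utility difference; it can enter only through
# interactions with attributes.
long$age50        <- as.integer(long$age >= 50)
long$cost_x_age50 <- long$cost * long$age50  # does cost sensitivity differ by age?
# The utility specification then includes both cost and cost_x_age50.
```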

6 Choice of Estimation Method

After model specification, the next step is to choose the estimation method. Several different approaches are used in practice.

6.1 Conditional (Multinomial) Logit

McFadden [32] demonstrated that the conditional logit model, which is sometimes also called multinomial logit, is consistent with random-utility theory. Key assumptions of this model include that the utility of an alternative depends only on the characteristics of that alternative, and preference weights are constant across choices. Assuming also that the odds of choosing alternative A over alternative B are independent of the other choice sets for all pairs means we do not have to account for the fact that each respondent answers a series of choice questions. We therefore can stack the data as if each respondent answered only one choice question.

By assumption for this model, Medicine A will be chosen over Medicine B if U(A) − U(B) is greater than a random error. Under standard assumptions about the error distribution, namely the standard Gumbel distribution, the probability that alternative A is chosen over B is defined for the simple conditional-logit model as follows:

$$\text{Pr}\left(\text{Choice}=A\right)= \frac{\text{exp}\left[\mu \cdot U(A)\right]}{\text{exp}\left[\mu \cdot U(A)\right]+\text{exp}\left[\mu \cdot U(B)\right]}.$$

The scale parameter µ is related to the inverse of the variance of the error term. It sometimes is described as reflecting the “ability to choose,” or the ability of the attributes to explain choice, and it can vary over alternatives, respondents, or contextual factors such as learning and fatigue effects through the sequence of tasks. Unfortunately, the scale parameter is confounded with the preference parameters in the conditional-logit model and cannot be separately estimated. Hence, it is assumed to be equal to one. In fact, to the extent that some people's choices are more prone to error than others', scale varies among people. Confounding scale with preference parameters means that people with the same preferences can appear to have larger differences among preference parameters than others [33].

The main advantage of the conditional-logit model is that it is simple to estimate and interpret. However, it requires strong assumptions, including that all respondents in the sample have the same preferences. Thus, the main limitation of the conditional-logit model is its inability to capture variability in preferences or scale across people or across choice situations or contexts. Nevertheless, our example below shows that differences in estimates between the conditional-logit model and more general models sometimes are relatively small. Because estimation is easy and fast, the conditional-logit model often is used in exploratory analysis to evaluate alternative model specifications and to guide more advanced subsequent analysis.
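As an illustration, a conditional-logit model can be estimated with several R packages; the sketch below uses clogit() from the survival package on the hypothetical long-format data sketched earlier (a full survey dataset, not the five-question illustration, is needed for the model to be identified) and treats the numeric attributes as linear for brevity.

```r
library(survival)

# One stratum per choice set (respondent x question).
long$qid <- interaction(long$resp_id, long$question)

clogit_fit <- clogit(
  chosen ~ headache + risk + factor(mode) + cost + strata(qid),
  data = long
)
summary(clogit_fit)
```

The categorical coding discussed above can be substituted by entering the attributes as factors with the desired contrasts.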

Table 4 contains estimation results for our headache example from a conditional-logit model. Note that coefficient estimates for ordered attributes are correctly ordered: less-desirable levels have lower coefficient estimates than more-desirable levels. All parameters are precisely estimated, with coefficients more than 2.5 times larger than the corresponding standard errors. That means that respondents discriminated well between attribute levels shown in the study design.

Table 4 Example parameter estimates: conditional-logit model, effect coded

6.2 Uncorrelated or Fully Correlated Mixed Logit

An alternative to the conditional-logit model is the random-parameters or mixed-logit model that relaxes the uniform-preference assumption in the conditional-logit model. Relaxing the restriction of identical preferences is important for accurately measuring preference weights. It is likely that risk tolerability varies in a respondent sample, so a single parameter could misrepresent the preferences of most respondents. Mixed-logit models estimate taste-heterogeneity means and standard deviations for each preference weight [34]. The standard-deviation estimates can be interpreted as the degree of consensus about the relative importance of the mean value.

Estimation requires specifying an assumption about the shape of the distribution of preference variability. Researchers often assume a normal distribution, although the rationale for that assumption can be weak. A normal distribution is symmetric around its mean, so it implies that some share of respondents has preference weights of the opposite sign; we would not expect preference weights for cost, for example, to be positive for anyone. A lognormal distribution could be a better assumption in that case. The mixed-logit model also requires specifying how correlations among attributes will be handled. A model that limits the correlation matrix of the parameters to include only variance terms for each parameter is called uncorrelated or constrained mixed logit. Fully correlated or unconstrained mixed-logit models estimate correlations among all preference parameters. Correlations between preference parameters can occur if respondents who prefer one attribute also tend to prefer (or not prefer) another attribute [21].

Advantages of the fully correlated mixed-logit model are that it describes the variability of preferences of the respondents in the sample and accounts for all forms of correlation: correlation among choices made by a single respondent, correlations among preference weights, and correlations among parameters induced by scale variation [35]. However, the mixed-logit model requires assumptions about the distribution of the preferences and requires more data to estimate the larger number of parameters with the same level of precision. As part of specification testing, analysts could check which standard deviations are insignificantly different from zero. Those parameters could be held fixed in the final model to limit the number of standard deviations that must be estimated with limited statistical information.

In addition, estimating mixed-logit models requires simulation techniques. The simulation algorithms required to estimate the parameters can generate different results depending on the particular algorithm chosen, the number of simulations selected, the method used to implement simulations, and the starting values for the estimation. Finally, fully correlated models often are not feasible with typical health-study sample sizes because they require estimating a large number of parameters. Most published studies report uncorrelated mixed-logit estimates. See Train [21] for additional details on estimating mixed-logit models, including examples and code.
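As an illustration, the sketch below specifies an uncorrelated mixed-logit model with the gmnl package referenced in the Supplementary Material; argument names follow the package documentation, the data setup assumes the hypothetical long-format data frame sketched earlier, mode of administration is omitted for brevity, and the code may differ from the article's own supplementary programs.

```r
library(mlogit)
library(gmnl)

# Index the long-format data by choice set and respondent.
long$chid <- as.integer(interaction(long$resp_id, long$question, drop = TRUE))
db <- mlogit.data(long, choice = "chosen", shape = "long",
                  alt.var = "alt", chid.var = "chid", id.var = "resp_id")

mixl_fit <- gmnl(
  chosen ~ headache + risk + cost | 0,    # "| 0" drops alternative-specific constants
  data  = db,
  model = "mixl",
  ranp  = c(headache = "n", risk = "n"),  # normal taste distributions; cost held fixed
  panel = TRUE,                           # links the repeated choices of each respondent
  R     = 500                             # number of simulation draws
)
summary(mixl_fit)
```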

Table 5 contains estimation results for our headache example from an effect-coded, uncorrelated mixed-logit model. Coefficient distributions are reported as means and standard deviations with separate standard errors. Note that mean coefficient estimates for ordered attributes again are correctly ordered. All estimates also are precisely estimated with small standard errors relative to corresponding coefficients.

Table 5 Example parameter estimates: mixed-logit model, effect coded

There are no standard-deviation estimates for the omitted-category coefficients, because they are the negative sum of included coefficients. The standard deviations indicate the degree of preference consensus in the sample. Standard deviations for 0 headaches and both risk levels are highly significant and large relative to the mean estimates, indicating strongly heterogeneous preferences for those attributes. If the goal is to estimate money-equivalent value, cost often is assumed to not have a taste distribution, so taste heterogeneity relative to cost is absorbed in other attributes.

Table 6 contains model diagnostics of the fitted model. The goal of estimation is to maximize the log likelihood. Thus, values closer to zero are better. The unconstrained log likelihood based on attributes (− 1357.52) is significantly larger than the constrained log likelihood without accounting for attributes (− 3327.11). It is not possible to calculate R2 for choice models in the same way as for regression models. Apollo provides “rho-squared” values on the basis of setting all the coefficients to zero (equal shares).Footnote 4 Rho-squared values for choice models typically are much lower than corresponding regression R2. Likelihood increases with the number of parameters. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) both penalize the likelihood value for the number of parameters to balance fit with parsimony, but BIC uses a larger penalty for more parameters. Smaller values are better and help compare alternative model specifications. For example, the BIC value for the conditional-logit model is 3761.65, while the BIC value for the mixed-logit model is considerably smaller, at 2825.23.

Table 6 Example model diagnostics: mixed-logit model, effect coded
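For reference, these diagnostics are standard functions of the maximized log likelihood and the constant-only (equal-shares) log likelihood; exact conventions, such as whether N counts choices or respondents, can differ across software packages:

$$\rho^{2}=1-\frac{LL\left(\widehat{\beta }\right)}{LL\left(0\right)},\qquad \text{AIC}=2K-2LL\left(\widehat{\beta }\right),\qquad \text{BIC}=K\,\text{ln}\left(N\right)-2LL\left(\widehat{\beta }\right),$$

where K is the number of estimated parameters and N is the number of observations.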

6.3 Comparison of Estimates for Headache Data

While the example results are specific to this particular application, they serve to illustrate that choice of estimator can significantly affect parameter estimates and conclusions that can be supported with the data. Because mixed-logit estimates tend to be numerically larger than conditional-logit estimates, dividing the preference-weight estimates by one of the estimates eliminates possible scale differences between the models [28] and can provide a more intuitive interpretation of preferences. We use the absolute value of the parameter for the continuous-cost variable as the marginal (per-dollar) utility of money. That value serves as an “exchange rate” between utility and money.

Figure 4 compares the resulting money-equivalent values for conditional-logit and mixed-logit models with dummy-coded omitted categories of three headaches, no risk, and pill. Note that the qualitative results are similar for headache and risk attributes, but mode results for conditional logit indicate that pill is logically preferred to injection, while mixed-logit indicates no difference in importance between pill and injection. In addition, the confidence intervals for middle levels of Headache and for Risk are overlapping, while the values of the best levels of both attributes are statistically different.

Fig. 4 Example comparison of money-equivalent estimates, dummy coded

Money-equivalent values rescale attribute-level parameter estimates by dividing each estimate by the absolute value of the cost coefficient, which is the marginal utility of 1 dollar, providing an intuitive measure of how small or large utility differences are. However, the rescaled values themselves are not willingness-to-pay estimates. Willingness to pay is calculated as the difference in money-equivalent values for a utility gain between two attribute levels or two treatment profiles. For example, holding risk constant at zero and mode of administration constant at pill, the willingness to pay for an improvement from one headache to no headache is $87.09 ((3.452 − 0.086)/0.03865) for mixed logit. The corresponding value for conditional logit is $64.73. Hence, choice of estimation method can result in quite different estimates of relative importance.
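The Krinsky–Robb simulation mentioned earlier provides confidence intervals for such ratios; the sketch below uses the point estimates quoted above together with a purely hypothetical covariance matrix (the fitted model's covariance matrix would be used in practice).

```r
library(MASS)  # for mvrnorm
set.seed(123)

# Point estimates from the text (no headaches, 1 headache, cost) and a
# placeholder covariance matrix.
est <- c(b_0headache = 3.452, b_1headache = 0.086, b_cost = -0.03865)
vc  <- diag(c(0.02, 0.01, 0.00001))

# Draw parameter vectors and compute WTP for moving from 1 headache to none.
draws <- mvrnorm(10000, mu = est, Sigma = vc)
wtp   <- (draws[, "b_0headache"] - draws[, "b_1headache"]) / abs(draws[, "b_cost"])

quantile(wtp, c(0.025, 0.5, 0.975))  # median near $87 with a 95% interval
```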

We also can compare improvements in efficacy with risk tolerance. The money-equivalent values of an improvement from three headaches to one headache are $93.76 and $78.12 for mixed-logit and conditional-logit, respectively. Increases in risk result in money-equivalent utility losses. Increases in risk from 0 to 1% result in − $99.47 and − $78.41 money-equivalent value.Footnote 5 Thus, in this example the two models show large differences for changes within attributes, but show quite similar measures of relative importance between attributes, depending on which comparison is chosen.

An alternative way to rescale utility values in a common metric is to convert outcomes to risk equivalents rather than money equivalents. This measure is directly relevant for benefit–risk assessments. In the headache example, the risk increase that would exactly offset the benefit of an improvement from three headaches to one headache is about 1% (for both models). Thus, 1% is the maximum acceptable increase in risk of a heart attack for that benefit. Any increase in therapeutic risk less than 1% would result in a positive net treatment benefit, while any increase in risk greater than 1% would result in a negative net treatment benefit.

6.4 Latent-Class Analysis

An alternative way to characterize variability in preferences among respondents is to assume that differences in preferences take the form of discrete groups or “classes” rather than continuous distributions [37]. By design, there are two distinct groups in the example headache study. The risk-tolerant group places a relatively high value on reducing headaches and a relatively low value on avoiding treatment risks; the risk-averse group places a relatively high value on avoiding treatment risks and a relatively low value on reducing headaches. Combining such contrasting preferences in a conditional-logit or mixed-logit model could result in both attributes appearing equally important on average. Latent-class models can test whether there are such discrete differences in DCE data [38,39,40].

Latent-class or finite-mixture models capture this form of variability by assuming some number of classes containing respondents that have similar preferences. Estimated parameters define the probability of respondents being members of each preference class and the probability of respondents choosing an alternative, given membership in a given class.

$$\text{Pr}\left(\text{Choice}=A\right)= \sum_{k}{\pi }_{k}\frac{\text{exp}\left[{\mu }_{k}\cdot {U}_{k}\left(A\right)\right]}{\text{exp}\left[{\mu }_{k}\cdot {U}_{k}\left(A\right)\right]+\text{exp}\left[{\mu }_{k}\cdot {U}_{k}\left(B\right)\right]},\quad {\pi }_{k}=\frac{\text{exp}\left[{f}_{k}\left(z\right)\right]}{\sum_{j}\text{exp}\left[{f}_{j}\left(z\right)\right]},$$

where k indexes latent classes with utility Uk and πk is the probability of class membership, which can be a function of individual characteristics z. As in other logit models, attributes influence choices of alternatives, but other variables can be used to predict class membership. For example, income could be a significant factor in predicting whether respondents are more likely to be members of a class that is more sensitive to co-pay levels. Also, because the scale parameters μk vary among classes, it is possible to estimate relative scale differences.

A challenge in latent-class models is determining the right number of classes. Usually, several models are estimated, varying the number of classes. A combination of statistical criteria such as stability of model convergence, BIC, and AIC, as well as parsimony and interpretability of results are used to select the best model [41].
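As an illustration, a two-class model can be specified with the gmnl package used above; the fifth part of the formula holds the class-membership model (here a constant only), following the pattern in the package documentation, and the sketch is not the article's own supplementary code.

```r
# Two-class latent-class logit; class membership modeled with constants only.
lc_fit <- gmnl(
  chosen ~ headache + risk + cost | 0 | 0 | 0 | 1,
  data  = db,        # the indexed data built for the mixed-logit sketch
  model = "lc",
  Q     = 2,         # number of latent classes
  panel = TRUE
)
summary(lc_fit)
# Re-estimate with Q = 3, 4, ... and compare BIC, AIC, and interpretability.
```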

Figure 5 presents preference-weight estimates for the two-class latent-class model using the headache example data. Note that it identifies the risk-tolerant and the risk-averse classes as expected. While the mixed-logit model found strong taste heterogeneity in the headache and risk attributes, the assumption that taste heterogeneity is normally distributed is clearly inconsistent with the discrete-distribution pattern shown in Fig. 5. While we intentionally induced these results, it is not unusual for empirical data to be a mixture of continuous and discrete heterogeneity. It is good practice to estimate both mixed-logit and latent-class models to evaluate model fit and conceptual consistency. Some researchers report both types of models to allow readers to compare results [42].

Fig. 5 Example preference-weight estimates; latent-class analysis, two classes, effect coded

A common misperception is that latent-class models sort individual respondents into particular classes. While there is a numerical method called cluster analysis that does that, it is not based on random-utility theory. Latent-class analysis estimates only respondents’ probability of class membership. Rather than describing results as “40% of respondents belonged to Class 1,” one should state that “the average probability of having Class-1 preferences was 40%.” Average money-equivalent or risk-equivalent values can be calculated by weighting class-specific scaled preference weights by membership probabilities.
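A small sketch with hypothetical numbers illustrates the weighting:

```r
# Average membership probabilities and class-specific WTP values (hypothetical).
class_share <- c(class1 = 0.40, class2 = 0.60)
wtp_class   <- c(class1 = 120,  class2 = 45)   # $ per month

sum(class_share * wtp_class)  # probability-weighted average = $75 per month
```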

An attractive feature of latent-class analysis is that the combination of which attributes vary in importance across classes and which covariates are significant determinants of class membership often suggests intuitive characterizations of what kinds of respondents belong to which groups. Identifying such groups can be useful in guiding patient–physician shared decision-making, regulatory assessments, and marketing.

For secondary analysis, individual respondents sometimes are assigned to classes on the basis of the modal class membership probability. If that value for most respondents is close to 1, that could be a reasonable guess. However, if in a two-class model, modal class membership is close to 0.5, then class-membership assignment essentially is arbitrary.

Jonker [20] has proposed using latent-class analysis to identify response patterns that are uninformative. This strategy avoids the judgments required to infer data quality on the basis of internal-validity tests. All the parameters for a so-called “garbage class,” that is, a class that contributes no statistical information to the model, are constrained to be equal to zero. Without dropping any observations from the analysis, classes with statistically informative data thus are isolated from the influence of the uninformative class, and the model estimates the probability that each respondent is a member of that class.

Even within relatively homogenous classes, we can expect some preference variation. As in other logit applications, it is difficult to distinguish how much variation is a result of taste differences and how much is a result of scale differences. Some researchers have advocated use of scale-adjusted latent-class models or combining latent-class and mixed-logit approaches [43]. Other researchers are skeptical about whether these models offer an improvement over fully correlated mixed logit [35].

7 Available Statistical Software

There currently is no consensus on a preferred statistical software package to analyze preference data. However, five commonly used packages and code resources for estimating conditional-logit, mixed-logit, and latent-class models are:

LatentGOLD (https://www.statisticalinnovations.com/latent-gold-6-0/) facilitates estimating both simple and advanced latent-class models, including scale-adjusted latent class analysis, with an accessible user interface.

Sawtooth Software offers DCE data analysis as part of a self-contained suite of programs for designing surveys, constructing experimental designs, hosting survey administration, and analysis. (https://www.sawtoothsoftware.com/products/advanced-analytical-tools/cbc-hierarchical-bayes-module).

Kenneth Train’s Berkeley website (https://eml.berkeley.edu/~train/software.html) offers downloads for his mixed-logit code in MATLAB and R, as well as extensive tutorial and documentation resources.

Mikołaj Czajkowski maintains a full suite of open-source Matlab choice-modeling code (https://github.com/czaj/DCE). It is possible to read in a dataset and set it to sequentially run nearly every kind of choice model.

Stephane Hess’ Apollo suite of choice-modeling software (https://cran.r-project.org/web/packages/apollo/index.html) is written in open-source R. Apollo has a very active user group to help identify problems and solutions. Specifying Apollo models can be complicated, but for advanced users Apollo can estimate almost any imaginable model. The Supplementary Material includes Apollo programs and output for the analysis of the headache example in this article, as well as code from the simpler gmnl package for all the specifications discussed.

As with model specification, these software options come with unique combinations of advantages and disadvantages. Each has an active user base, and it is likely that someone already has answered any question you may have. Note that, especially for estimation procedures based on simulation methods, results can differ between packages because of different random seeds, convergence criteria, and other assumptions and settings, although they generally are close. There also is a risk of reaching a local maximum from a single starting point. It is strongly advised to replicate your final model from different starting points and to report in detail the choices made in the analysis and estimation.

8 Conclusion

Our headache example suggests that results can differ considerably among model approaches, but there is no simple generalization about which approach works best for out-of-sample prediction [44]. Thus, researchers must assess the model structure that accounts best for their research objectives, study design, and sample size. More complex models place heavy demands on data and may not be a feasible option for small samples. Estimation with simulation methods also can take several hours to find a solution, if the algorithm can find one at all. A prudent approach is to start with a conditional-logit model to explore the structure of the data and test model specifications. After identifying one or more plausible model specifications, proceed to estimate either mixed-logit or latent-class models, depending on study objectives, prior knowledge of the likely form of taste heterogeneity, the ability of the sample size to support correlated mixed-logit estimates, and model performance.