1 Introduction

Binary choice models for longitudinal binary data play an important role in the recent statistical and econometric literature (Diggle et al., 2002; Hsiao, 2014). The simplest models of this type are based on a logit or probit transformation of the probability of the binary response being equal to 1, which is assumed to depend on individual intercepts, a linear function of the covariates, and, possibly, the lagged response variable. These models can be formulated following a random- or a fixed-effects approach, according to whether individual intercepts are treated as random variables or fixed parameters to be estimated. The main feature of the latter approach is that it does not require distributional assumptions for the individual-specific parameters, with advantages in terms of robustness of the conclusions that may be reached on the basis of the model.

The Maximum Likelihood (ML) estimator for the parameters of a binary fixed-effects model for panel data is however inconsistent because of the incidental parameters problem (Neyman & Scott, 1948; Lancaster, 2000). The incidental parameters are the individual intercepts, and a substantial bias in their estimation arises whenever sample units are observed over a small number of time occasions T, so that the ML estimator is based only on a few observations per unit. Moreover, this bias, which is of order \(O(T^{-1})\), propagates to the estimator of the regression parameters, as they are not informationally orthogonal to the incidental parameters.

The literature proposes two different approaches to overcome the methodological issues arising within the ML estimation of fixed-effects binary choice models. A first approach is focused on reducing the order of the bias of the ML estimator to \(O(T^{-2})\), by means of bias corrections of the parameter estimator (Hahn & Newey, 2004; Fernández-Val, 2009; Hahn & Kuersteiner, 2011; Dhaene & Jochmans, 2015) or by using a modified likelihood (Bester & Hansen, 2009; Arellano & Hahn, 2016; Bartolucci et al., 2016) or score function (Carro, 2007). This strategy is very general and can be adapted to both probit and logit as well as to other nonlinear models. It can also accommodate dynamic formulations that include the lagged response variable among the covariates, which is sensible for a variety of economic applications. However, these estimators exhibit poor finite-sample performance when T is small, a situation that is likely to occur in practice with rotated surveys or even with long panels, if these are highly unbalanced.

The second approach is represented by conditional inference (Andersen, 1970, 1971), which is based on conditioning the joint probability of the response configuration provided by each subject on a sufficient statistic for the specific intercept, thereby eliminating the incidental parameters problem. Differently from the bias-corrected ones, the resulting Conditional Maximum Likelihood (CML) estimators are consistent as the sample size, n, goes to infinity with T fixed. The viability of this strategy is however tied to the existence of reasonable sufficient statistics for the incidental parameters for the specific model formulations.

With binary panel data, CML estimation has been applied to the static logit (Chamberlain, 1980; Arellano & Honoré, 2001; Wooldridge, 2010) and Quadratic Exponential (QE) model (Cox, 1972a; Bartolucci & Nigro, 2010). Traditionally, CML for binary clustered data has also been adopted in psychometrics to obtain consistent item estimates in the popular Rasch model (1960, 1961) and related latent class models (Lindsay et al., 1991; Agresti, 2003). Conditional inference has also been applied to generalized additive mixed models (Zhang & Davidian, 2004). More recently, simplified versions based on pairwise conditional likelihood approximations have been used to estimate the homophily parameters of network formation models by Charbonneau (2017) and Jochmans (2018), while Graham (2017) advocates the use of CML to estimate the parameters of the tetrad logit for undirected networks; see also de Paula (2020) for a recent review.

Sufficient statistics for the individual intercepts are not available in the case of the Dynamic Logit (DL) model, for which a weighted CML estimator has been derived by Honoré & Kyriazidou (2000) and a Pseudo-CML (PCML) estimator based on an approximating QE model has been proposed by Bartolucci & Nigro (2012); see also Bartolucci & Pennoni (2007).

From a practitioner perspective, CML estimators are appealing for economic applications based on small-T panel data, as well as on long, but highly unbalanced, longitudinal surveys. However, CML estimators require the maximization of peculiar likelihood functions, whose computation in the standard way entails an excessive burden when T is large, thus limiting the applicability of these techniques, which may become infeasible in certain situations. One way to overcome these computational issues is to exploit recursive algorithms, as we propose here.

Several works deal with the recursive computation of the log-likelihood function for a variety of models in the conditional inference framework. The seminal work of Howard (1972) shows that the conditional likelihood function of the model proposed by Cox (1972b) is a symmetric function that can be written recursively. Another popular application of recursive algorithms concerns the analysis of epidemiological stratified case-control studies. Smith et al. (1981) propose an algorithm for the CML estimation of the coefficients of a logistic regression and, since this procedure becomes computationally challenging for large strata, Krailo & Pike (1984) derive a recursive computation of the log-likelihood and its derivatives. Moreover, Levin (1987) extends this result to models for multinomial outcomes. Other recursive solutions come from Item Response Theory (Hambleton & Swaminathan, 1986; Bartolucci et al., 2015), where Andersen (1972) and Gustafsson (1980) show a recursive structure for the conditional estimating equations of the Rasch model (Rasch, 1960).

In this paper, we propose a general recursive algorithm to compute the conditional likelihood of a series of models for binary data in an efficient way, so that the CML approach may be easily applied even when a large number of observations is available for a few sample units. The main contribution is the generalization of the result by Krailo & Pike (1984) to the QE model and its extensions accommodating the dynamic formulations recently proposed by Bartolucci & Nigro (2010), Bartolucci & Nigro (2012), and Bartolucci et al. (2018), whose computation would otherwise be infeasible with a large time dimension. Furthermore, we implement the proposed recursive algorithm in the cquad package, which provides software routines for the CML estimation of the QE model and the related aforementioned contributions (Bartolucci & Pigini, 2017). A Monte Carlo simulation shows how the proposed algorithm avoids the computational burden of the QE model for large-T datasets. This broadens the applicability of conditional inference for dynamic models in many economic applications. As an example, this work includes an empirical application concerning brand loyalty based on real data on yogurt purchases, which is a popular example in economics (Chintagunta et al., 2001).

The present work is organized as follows. Section 2 illustrates the main theoretical aspects of the models and the conditional estimation framework. Section 3 shows the proposed recursive algorithm for the QE model and its extensions. The performance of the algorithm in terms of computational time is evaluated by the Monte Carlo simulation in Sect. 4 and the analysis of brand loyalty data is illustrated in Sect. 5. Finally, Sect. 6 provides the main conclusions.

2 Model Assumptions

This section first introduces the main theoretical aspects of the fixed-effects logit and QE model. Then, the CML estimator for both models is presented.

2.1 Static Logit Model

Consider a sample of n individuals, where each individual i, \(i=1,\ldots ,n\), is observed at \(T_i\) time occasions, with \(T_i\) allowed to vary across individuals in order to account for unbalanced panel structures.

Let \(y_{it}\) be the binary response for subject i at occasion t and \(\varvec{x}_{it}\) a column vector of covariates, where \(t=1,\ldots ,T_i\), with \(\varvec{X}_i = (\varvec{x}_{i1}, \ldots , \varvec{x}_{iT_i})\) being the matrix collecting all such covariates. Consider the logit formulation

$$\begin{aligned} p(y_{it} | \alpha _i, \varvec{X}_{i}) = \frac{\exp \left[ y_{it}\left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) \right] }{1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) }, \end{aligned}$$
(1)

where \(\alpha _i\) is the individual intercept, capturing the time-constant subject-specific unobserved heterogeneity, and \(\varvec{\beta }\) collects the common slope parameters.

As mentioned in Sect. 1, the conditional inference approach (Andersen, 1970) has been applied to derive the conditional likelihood for the static logit model by Chamberlain (1980). Consider the joint probability of the individual response configuration \({\varvec{y}}_i = (y_{i1}, \ldots , y_{iT_i})'\) that, under the logit model, is

$$\begin{aligned} p(\varvec{y}_i | \alpha _i, \varvec{X}_i)= & {} \prod _{t = 1}^{T_i} p(y_{it} | \alpha _i, \varvec{X}_{i}) = \frac{\exp \left[ y_{i+}\alpha _i + \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}\right) '\varvec{\beta }\right] }{\prod _{t=1}^{T_i}\left[ 1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) \right] }, \end{aligned}$$
(2)

where \(y_{i+} = \sum _{t =1}^{T_i}y_{it}\) is the total score, which can be proved to be a sufficient statistic for \(\alpha _i\). In this regard, consider the probability of observing a given total score for the logit model so that, conditional on \(\alpha _i\) and \({\varvec{X}}_i\), we have

$$\begin{aligned} p(y_{i+} | \alpha _i, \varvec{X}_i) = \frac{\displaystyle {\sum _{\varvec{z}:z_+ = y_{i+}}}\exp \left( z_{+}\alpha _i\right) \exp \left[ \left( \sum _{t=1}^{T_i}z_t\varvec{x}_{it}\right) '\varvec{\beta }\right] }{\prod _{t=1}^{T_i}\left[ 1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) \right] }, \end{aligned}$$
(3)

where the numerator corresponds to the sum of all possible binary response vectors \(\varvec{z} = (z_1,...,z_{T_i})'\) such that \(z_+ = \sum _{t=1}^{T_i}z_t\) equals \(y_{i+}\). The probability of \({\varvec{y}}_i\) conditional on \( \alpha _i\), \(\varvec{X}_{i}\), and \(y_{i+}\) is the ratio between expressions (2) and (3), so that

$$\begin{aligned} p(\varvec{y}_i | \alpha _i, \varvec{X}_{i}, y_{i+})= & {} \frac{\exp \left( y_{i+}\alpha _i\right) \exp \left[ \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}\right) '\varvec{\beta }\right] }{\prod _{t=1}^{T_i}\left[ 1+ \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) \right] }\nonumber \\&\times \frac{\prod _{t=1}^{T_i}\left[ 1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta }\right) \right] }{\sum _{\varvec{z}:z_+ = y_{i+}}\exp \left( z_{+}\alpha _i\right) \exp \left[ \left( \sum _{t=1}^{T_i}z_t\varvec{x}_{it}\right) '\varvec{\beta }\right] } \nonumber \\= & {} \frac{q_1({\varvec{X}}_i,{\varvec{y}}_i)}{\sum _{\varvec{z}:z_+ = y_{i+}} q_1({\varvec{X}}_i,{\varvec{z}})} = p(\varvec{y}_i | \varvec{X}_{i}, y_{i+}), \end{aligned}$$
(4)

where

$$\begin{aligned} q_1({\varvec{X}}_i,{\varvec{y}}_i)=\exp \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}'\varvec{\beta }\right) . \end{aligned}$$
(5)

Notice that the conditional probability \(p(\varvec{y}_i | \varvec{X}_{i}, y_{i+})\) no longer depends on the individual intercept.

On the basis of the previous results we obtain the conditional log-likelihood function

$$\begin{aligned} \ell ({\varvec{\beta }}) = \sum _{i = 1}^{n}\mathbbm {1}\{0<y_{i+}<T_i\}\log p(\varvec{y}_i | \varvec{X}_{i}, y_{i+}), \end{aligned}$$

with \(\mathbbm {1}\{ \cdot \}\) being the indicator function, which is maximized by a Newton-Raphson algorithm in order to obtain a CML estimate.

Starting from an initial guess for the parameter vector, denoted by \({\varvec{\beta }}^{(0)}\), at step h the algorithm updates the parameter vector obtained at the previous step through the usual rule

$$\begin{aligned} {\varvec{\beta }}^{(h)}= {\varvec{\beta }}^{(h-1)}-\left[ \frac{\partial ^2 \ell ({\varvec{\beta }}^{(h-1)})}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}'}\right] ^{-1} \frac{\partial \ell ({\varvec{\beta }}^{(h-1)})}{\partial {\varvec{\beta }}}, \end{aligned}$$

where

$$\begin{aligned} \frac{\partial \ell ({\varvec{\beta }})}{\partial {\varvec{\beta }}}= & {} \sum _{i = 1}^{n}\mathbbm {1}\{0<y_{i+}<T_i\}\frac{\partial \log p(\varvec{y}_i | \varvec{X}_{i}, y_{i+})}{\partial {\varvec{\beta }}}, \end{aligned}$$
(6)
$$\begin{aligned} \frac{\partial ^2\ell ({\varvec{\beta }})}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}'}= & {} \sum _{i = 1}^{n}\mathbbm {1}\{0<y_{i+}<T_i\}\frac{\partial ^2\log p(\varvec{y}_i | \varvec{X}_{i}, y_{i+})}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}'}. \end{aligned}$$
(7)

This procedure is repeated until convergence, yielding the CML estimate \(\hat{{\varvec{\beta }}}\). Note also that minus the Hessian matrix obtained at convergence may be used to estimate the variance-covariance matrix of \(\hat{{\varvec{\beta }}}\), and then its standard errors, in the usual way.
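To fix ideas, the update rule can be sketched for a single parameter (a minimal illustrative sketch with our own function names, not the cquad implementation):

```python
def newton_raphson(score, hessian, b0, tol=1e-10, max_iter=100):
    """One-parameter Newton-Raphson: b <- b - hessian(b)^{-1} * score(b)."""
    b = b0
    for _ in range(max_iter):
        step = score(b) / hessian(b)
        b = b - step
        if abs(step) < tol:
            break
    return b

# Toy example: maximize l(b) = -(b - 2)^2, whose score is -2(b - 2) and
# whose Hessian is the constant -2; the maximizer is b = 2.
b_hat = newton_raphson(lambda b: -2.0 * (b - 2.0), lambda b: -2.0, b0=0.0)
```

Since the toy log-likelihood is quadratic, the algorithm converges in a single step; for the conditional log-likelihoods considered here, the score and Hessian are the sums in (6) and (7).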

It is worth noting that the computation of \(p(\varvec{y}_i | \varvec{X}_{i}, y_{i+})\) is feasible only with moderate values of \(T=\max _i T_i\) because expression (4) involves a sum over the binary vectors \({\varvec{z}}\) with a fixed total score, the number of which quickly grows in T, as will be detailed in Sect. 4. This is a typical normalizing-constant problem that occurs in many statistical contexts, such as, to mention just one of the many relevant cases, the models for spatial data described in the seminal paper by Besag (1974).
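As an illustration of why this sum becomes burdensome, the normalizing constant in (4) can be evaluated by brute force, enumerating all binary vectors with a given total score. The following Python sketch (with our own, purely illustrative names) does exactly that:

```python
from itertools import combinations
from math import exp

def q1(phi, z):
    # q1(X_i, z) = exp(sum_t z_t * phi_t), with phi_t standing for x_it' beta
    return exp(sum(p for p, z_t in zip(phi, z) if z_t))

def denom_brute_force(phi, score):
    # Sum q1 over all binary vectors z of length T with total score z_+ = score:
    # there are comb(T, score) of them, one per placement of the ones.
    T = len(phi)
    total = 0.0
    for ones in combinations(range(T), score):
        z = [1 if t in ones else 0 for t in range(T)]
        total += q1(phi, z)
    return total

# With beta = 0 (all phi_t = 0) every summand equals 1, so the constant
# is just the number of such vectors, comb(10, 4) = 210.
print(denom_brute_force([0.0] * 10, 4))
```

The enumeration visits a binomial-coefficient number of vectors, which is precisely the cost the recursive algorithm of Sect. 3 avoids.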

2.2 Dynamic Models

The DL model is a straightforward extension of the static logit model in (1) as it includes as a regressor the lagged dependent variable, \(y_{i,t-1}\), along with the exogenous covariates and the individual intercept. The probability of the response variable under the DL model is

$$\begin{aligned} p(y_{it} | \alpha _i, \varvec{X}_i, y_{i0},\dots ,y_{i,t-1}) = \frac{\exp \left[ y_{it}\left( \alpha _i + \varvec{x}_{it}'\varvec{\beta } + y_{i,t-1}\gamma \right) \right] }{1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta } + y_{i,t-1}\gamma \right) }, \end{aligned}$$
(8)

where the parameter \(\gamma \) measures the so-called state dependence (Heckman, 1981), which represents how much the experience of a certain event affects the probability of experiencing the same event in the future. Here the conditioning set also includes the past values of the response variable, with the initial observation \(y_{i0}\) assumed to be known.

The probability of the response configuration \(\varvec{y}_i\) under the DL model is

$$\begin{aligned} p(\varvec{y}_i | \alpha _i, \varvec{X}_i, y_{i0}) = \frac{\exp \left[ y_{i+}\alpha _i + \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}\right) '\varvec{\beta } + y_{i*}\gamma \right] }{\prod _{t=1}^{T_i}\left[ 1 + \exp \left( \alpha _i + \varvec{x}_{it}'\varvec{\beta } + y_{i,t-1}\gamma \right) \right] }, \end{aligned}$$
(9)

where \(y_{i*} = \sum _{t=1}^{T_i}y_{i,t-1}y_{it}\). Under the DL model, the total score is no longer a sufficient statistic for the individual parameter \(\alpha _i\). Therefore, Chamberlain (1993) derives a CML estimator for the DL model only for the case with no covariates and when \(T = 3\). Moreover, Honoré & Kyriazidou (2000) propose CML estimation of the DL model with explanatory variables by exploiting a weighted conditional log-likelihood. However, their strategy rules out the use of trending regressors and dummies; the resulting estimator has a rate of convergence that is slower than \(\sqrt{n}\).

Among the dynamic formulations for binary panel data, the QE model closely resembles the DL model, as it allows us to include individual specific effects and the lagged dependent variable among the covariates. This model directly formulates the joint probability of \({\varvec{y}}_i\) as

$$\begin{aligned} p(\varvec{y}_i | \delta _i, \varvec{X}_i, y_{i0}) = \frac{\exp \left[ y_{i+}\delta _i + \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}\right) '\varvec{\eta }_1 + y_{iT_i}\left( \phi + \varvec{x}_{iT_i}'\varvec{\eta }_2\right) + y_{i*}\psi \right] }{\sum _{\varvec{z}} \exp \left[ z_+\delta _i + \left( \sum _{t=1}^{T_i}z_t\varvec{x}_{it}\right) '\varvec{\eta }_1 + z_{T_i}\left( \phi + \varvec{x}_{iT_i}'\varvec{\eta }_2\right) + z_{i*}\psi \right] }. \end{aligned}$$

Here \(\delta _i\) denotes the individual specific effects, \(\varvec{\eta }_1\) is the vector of regression parameters, and \(\psi \) measures the true state dependence. In this formulation, \(\phi \) and \(\varvec{\eta }_2\) are considered nuisance parameters. The denominator is based on the sum extended to all the possible binary response vectors \(\varvec{z}\), while \(z_{i*} = y_{i0}z_1 + \sum _{t=2}^{T_i} z_{t-1}z_t\).

Differently from the DL model, in the QE model the total score \(y_{i+}\) is a sufficient statistic for the individual intercept \(\delta _i\), so that the conditional likelihood function is based on the individual contributions

$$\begin{aligned}&p(\varvec{y}_i | \varvec{X}_i, y_{i0}, y_{i+}) \nonumber \\&\quad = \frac{\exp \left[ \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}\right) '\varvec{\eta }_1 + y_{iT_i}\left( \phi + \varvec{x}_{iT_i}'\varvec{\eta }_2\right) + y_{i*}\psi \right] }{\sum _{\varvec{z}:z_+ =y_{i+}} \exp \left[ \left( \sum _{t=1}^{T_i}z_t\varvec{x}_{it}\right) '\varvec{\eta }_1 + z_{T_i}\left( \phi + \varvec{x}_{iT_i}'\varvec{\eta }_2\right) + z_{i*}\psi \right] }. \end{aligned}$$
(10)

The CML estimator can easily be obtained by a Newton-Raphson algorithm. Even in this case, though, estimation is infeasible when T is large, for the same reasons illustrated above for the static logit model.

2.3 Simplified, Modified, and Approximating QE Models

We now focus on three models deriving from the QE model in Eq. (10). These are: (i) the simplified QE model outlined in Bartolucci & Pigini (2017); (ii) the modified QE model put forward in Bartolucci et al. (2018); and (iii) the approximating QE version for the DL model proposed by Bartolucci & Nigro (2012). A common feature of these models is that the conditional probability for an individual response configuration can be written as

$$\begin{aligned} p(\varvec{y}_i | \varvec{X}_i, y_{i0}, y_{i+}) = \frac{q_j({\varvec{X}}_i,{\varvec{y}}_i)}{\sum _{\varvec{z}:z_+ = y_{i+}} q_j({\varvec{X}}_i,{\varvec{z}})}, \quad j = 2,3,4, \end{aligned}$$
(11)

respectively.

The simplified QE model assumes that the regression parameters are the same for all time occasions, implying a reduction of the set of parameters in Eq. (10) from \(({\varvec{\eta }}_1, \phi , {\varvec{\eta }}_2, \psi )\) to \(({\varvec{\eta }}, \psi )\), so that the probability in Eq. (11) can be written as a function of

$$\begin{aligned} q_2({\varvec{X}}_i,{\varvec{y}}_i)= \exp \left( \sum _{t=1}^{T_i}y_{it}\varvec{x}_{it}'\varvec{\eta } + y_{i*}\psi \right) . \end{aligned}$$
(12)

The modified QE model closely resembles the simplified QE version. The main difference is that, in this case, the association rule between the outcome variable and its lag takes into account the pairs of consecutive observations that are equal, regardless of their value being 0 or 1. This formulation is such that a more powerful test for \(H_0: \psi = 0\) (absence of state dependence) can be obtained with respect to the QE and simplified QE models.

Under the modified QE model, the conditional probability of \({\varvec{y}}_i\) given \({\varvec{X}}_i\) and \(y_{i0}\) can be written as a function of

$$\begin{aligned} q_3({\varvec{X}}_i,{\varvec{y}}_i)= \exp \left( \sum _{t=1}^{T_i} y_{it}{\varvec{x}}_{it}'{\varvec{\eta }} + {\tilde{y}}_{i*}\psi \right) , \end{aligned}$$
(13)

where \({\tilde{y}}_{i*} = \sum _{t=1}^{T_i} \mathbbm {1} \{y_{it} = y_{i,t-1}\}\) is the number of pairs of consecutive observations that are equal to each other, whether both 0 or both 1. The corresponding denominator in (11) is the sum of \(q_3({\varvec{X}}_i,{\varvec{z}})\) over all the \(\varvec{z}\) such that \(z_+ = y_{i+}\), which is a function of \({\tilde{z}}_{*} = \mathbbm {1}\{z_1 = y_{i0}\} + \sum _{t=2}^{T_i} \mathbbm {1} \{z_t = z_{t-1}\}\).
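For concreteness, the association statistic \({\tilde{y}}_{i*}\) can be computed directly; the following small Python helper (illustrative only, not part of cquad) makes the definition explicit:

```python
def y_tilde_star(y, y0):
    """Number of consecutive pairs that are equal, i.e.
    1{y_1 = y_0} + sum_{t>1} 1{y_t = y_{t-1}}, with y_0 the initial observation."""
    seq = [y0] + list(y)
    return sum(int(seq[t] == seq[t - 1]) for t in range(1, len(seq)))

# With y0 = 1 and y = (1, 0, 0, 1), the equal pairs are (y0, y1) and
# (y2, y3), so the statistic equals 2.
print(y_tilde_star([1, 0, 0, 1], y0=1))
```

Note that, unlike \(y_{i*}\) in the QE model, the statistic also counts consecutive pairs of zeros, which is what makes the associated test for state dependence more powerful.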

Finally, Bartolucci & Nigro (2012) show that the QE model can serve as an approximation of the DL model. The approximating formulation is based on a first-order Taylor expansion of (9) around \(\alpha = {\bar{\alpha }}\), \({\varvec{\beta }} = \bar{{\varvec{\beta }}}\), and \(\gamma = 0\). The main advantage is that the approximating QE model admits the total scores as sufficient statistics for the individual intercepts. For this model, the conditional probability of \({\varvec{y}}_i\) is a function of

$$\begin{aligned} q_4({\varvec{X}}_i,{\varvec{y}}_i)= \exp \left( \sum _{t=1}^{T_i} y_{it}{\varvec{x}}_{it}'{\varvec{\beta }}-\sum _{t=1}^{T_i}{\bar{r}}_{it}y_{i,t-1}\gamma + y_{i*}\gamma \right) , \end{aligned}$$
(14)

where \({\bar{r}}_{it} = \exp ({\bar{\alpha }}_i + \varvec{x}_{it}'\bar{\varvec{\beta }})/\left[ 1 + \exp ({\bar{\alpha }}_i + \varvec{x}_{it}'\bar{\varvec{\beta }})\right] \).

The model parameters are estimated by PCML in two steps: first, \({\bar{r}}_{it}\) is computed using a CML estimate for \(\bar{\varvec{\beta }}\) and setting \({\bar{\alpha }}_i\) equal to its ML estimate under the static logit model; second, the estimate of \(({\varvec{\beta }}', \gamma )'\) is obtained by maximizing the conditional log-likelihood based on (14).

Overall, all the models discussed here, including the static logit model, may be expressed in the form (11) with different definitions of the function \(q_j({\varvec{X}}_i,{\varvec{y}}_i)\), \(j=1,\ldots ,4\), given in Eqs. (5), (12), (13), and (14), respectively.

3 Recursive Computation of the Conditional Likelihood Functions

A common shortcoming of the aforementioned models is the computational burden in evaluating the normalizing constants \(\sum _{\varvec{z}:z_+ = y_{i+}} q_j({\varvec{X}}_i,{\varvec{z}})\), \(j= 1,\ldots , 4\). This section first introduces the recursive algorithm by Krailo & Pike (1984) for the static logit model. Then it shows the proposed extension for evaluating the normalizing constants for the QE formulations, namely the simplified, modified and approximating QE models.

3.1 Recursion for the Static Logit Model

Consider the denominator of the conditional probability in (4), which is given by

$$\begin{aligned} \sum _{\varvec{z}:z_+ = y_{i+}} q_1({\varvec{X}}_i,{\varvec{z}}), \end{aligned}$$
(15)

where \(q_1({\varvec{X}}_i,{\varvec{z}})\) is defined in (5).

From the computational point of view, the sum above represents a limitation to the applicability of the CML estimator, especially with large T, as will be discussed in Sect. 4. However, as noted by Howard (1972), the computation of the sum in Eq. (15) can be performed recursively. Now define, for individual i, the function \(f_{t,s}(\cdot )\), indexed by the time dimension \(t = 1,\dots ,T_i\) and by the total score \(s = 0,\dots ,y_{i+}\), as

$$\begin{aligned} f_{t,s}({\varvec{\phi }})=\sum _{\varvec{z}:z_+=s}\exp \left( \sum _{u=1}^t z_u\phi _u\right) , \end{aligned}$$

where now \(\varvec{z}=(z_1,\ldots ,z_t)'\) such that \(\sum _{u=1}^t z_u = s\) and \(\phi _u = {\varvec{x}}_{iu}'{\varvec{\beta }}\). Further, define \({\varvec{\phi }} = (\phi _1,\dots ,\phi _t)'\). Note that \(f_{t,s}({\varvec{\phi }})\) corresponds to the quantity defined in (15) for \(t=T_i\) and \(s=y_{i+}\), that is,

$$\begin{aligned} \sum _{\varvec{z}:z_+ = y_{i+}} q_1({\varvec{X}}_i,{\varvec{z}})=f_{T_i,y_{i+}}({\varvec{\phi }}). \end{aligned}$$
(16)

By exploiting the symmetric function properties of \(f_{t,s}({\varvec{\phi }})\), it can be computed recursively in this way:

  1.

    for \(t=1\), let

    \(f_{1,0}({\varvec{\phi }})=1\), \(f_{1,1}({\varvec{\phi }})=\exp (\phi _1)\), and \(f_{1,s}({\varvec{\phi }})=0\) otherwise;

  2.

    for \(t=2,\ldots ,T_i\), compute

    $$\begin{aligned} f_{t,s}({\varvec{\phi }})=f_{t-1,s}({\varvec{\phi }})+f_{t-1,s-1}({\varvec{\phi }})\exp (\phi _t),\quad s=0,\ldots ,t. \end{aligned}$$
    (17)
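The two steps above translate directly into code. The following Python sketch (our own naming; cquad implements the recursion in R) builds the whole table of values \(f_{t,s}\):

```python
from math import exp

def f_table(phi):
    """Recursion (17): builds f[t][s] for t = 1..T and s = 0..t.

    f[t][s] sums exp(sum_u z_u * phi_u) over all binary z of length t
    with total score s, so f[T][score] is the normalizing constant (15).
    """
    T = len(phi)
    f = {1: {0: 1.0, 1: exp(phi[0])}}  # initialization at t = 1
    for t in range(2, T + 1):
        f[t] = {}
        prev = f[t - 1]
        for s in range(0, t + 1):
            # f_{t,s} = f_{t-1,s} + f_{t-1,s-1} * exp(phi_t); absent entries are 0
            f[t][s] = prev.get(s, 0.0) + prev.get(s - 1, 0.0) * exp(phi[t - 1])
    return f

# With phi = 0 the recursion simply counts the vectors, so f[12][5]
# equals comb(12, 5) = 792, at a cost quadratic (not combinatorial) in T.
f = f_table([0.0] * 12)
```

Each table entry is obtained from two entries of the previous row, so the whole table costs a number of elementary operations quadratic in \(T_i\), in contrast with the binomial number of terms of the direct sum.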

The recursive structure can be exploited for the computation of the first and second derivatives of this function as well. In turn, by exploiting (16), these derivatives are used to compute the score vector and the Hessian matrix in (6) and (7), which are involved in the Newton-Raphson algorithm for the maximization of the conditional log-likelihood. For the first and second derivatives and for \(h,j=1,\ldots ,T_i\), \(s=0,\ldots ,t\), and \(t=1,\ldots ,T_i\), we use the notation

$$\begin{aligned} f^{(h)}_{t,s}({\varvec{\phi }})=\frac{\partial f_{t,s}({\varvec{\phi }})}{\partial \phi _h},\quad \text {and} \quad f^{(h,j)}_{t,s}({\varvec{\phi }})=\frac{\partial ^2 f_{t,s}({\varvec{\phi }})}{\partial \phi _h\partial \phi _j}, \end{aligned}$$

respectively. How to compute these derivatives by a recursion, for the static logit model as well as for the other QE models, is described in the Appendix.

3.2 QE Models

The methodology outlined in the previous section for the static logit model cannot be applied directly to the QE models illustrated in Sect. 2.3. In the following, we propose recursive algorithms to deal with these models.

Similarly to the static logit model, given the time dimension, the total score, and the initial observation, the sum at the denominator of Eq. (11) may be expressed as a function of type

$$\begin{aligned} g_{t,a,s}({\varvec{\phi }},\psi ) = g_{t,a,s,v=0}({\varvec{\phi }},\psi )+g_{t,a,s,v=1}({\varvec{\phi }},\psi ), \end{aligned}$$
(18)

where the index v corresponds to the value of the t-th element of the vector \({\varvec{z}}\), the initial observation is \(a=0,1\), while \(s=0,\ldots ,t\) and \(t=1,\ldots ,T_i\) are the total score and the length of the vector configuration, respectively. This general structure can be easily adapted to the three QE models by defining different functions \(g_{t,a,s}({\varvec{\phi }},\psi )\) that may be recursively computed. The main difference with respect to the static case is that, in addition to the total score and the time dimension, we keep track not only of the initial observation but also of the t-th element of the vector \({\varvec{z}}\).

3.2.1 Simplified QE Model

Consider the simplified version of the QE model described by (12) and the corresponding normalizing constant expressed as the denominator of (11). In general, we define a single element of the sum on the right-hand side of Eq. (18) as

$$\begin{aligned} g_{t,a,s,v}({\varvec{\phi }},\psi ) =\sum _{{\varvec{z}}:z_+=s,z_t=v}\exp \left[ \sum _{u=1}^t z_u\phi _u+\left( az_1+\sum _{u=2}^tz_{u-1}z_u\right) \psi \right] ,\nonumber \end{aligned}$$

where \(\phi _u = {\varvec{x}}_{iu}'{\varvec{\eta }}\). Following the formulation above, the recursion to compute \(g_{t,a,s,v}({\varvec{\phi }},\psi )\) is based on the following steps:

  1.

    for \(t=1\), let

    $$ g_{1,a,s,v}({\varvec{\phi }},\psi )=\left\{ \begin{array}{ll} 1,&\quad s=v=0,\\ \exp (\phi _1+a\psi ),&\quad s=v=1,\\ 0, &\quad \text{otherwise}, \end{array}\right.$$

    with \(a=0,1\);

  2.

    for \(t=2,\ldots ,T_i\), recursively compute the quantities

    $$\begin{aligned}&g_{t,a,s,v}({\varvec{\phi }},\psi ) \\&\quad =\left\{ \begin{array}{ll} 1, &{} s=0,\,v=0,\\ g_{t-1,a,s,0}({\varvec{\phi }},\psi ) + g_{t-1,a,s,1}({\varvec{\phi }},\psi ), &{} s=1,\ldots ,t-1,\, v=0,\\ g_{t-1,a,s-1,0}({\varvec{\phi }},\psi )\exp (\phi _t)\\ \quad +g_{t-1,a,s-1,1}({\varvec{\phi }},\psi )\exp (\phi _t+\psi ), &{} s=1,\ldots ,t,\,v=1,\\ 0, &{} \text{otherwise}. \end{array}\right. \end{aligned}$$
    (19)
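The initialization and update above can be sketched in Python as follows (our own translation, not the cquad implementation; `g[t][(s, v)]` stores \(g_{t,a,s,v}\) for a fixed initial observation a):

```python
from math import exp

def g_simplified(phi, psi, a):
    """Recursion (19) for the simplified QE model.

    g[t][(s, v)] sums exp(sum_u z_u phi_u + (a z_1 + sum_{u>1} z_{u-1} z_u) psi)
    over binary z of length t with z_+ = s and z_t = v.
    """
    T = len(phi)
    g = {1: {(0, 0): 1.0, (1, 1): exp(phi[0] + a * psi)}}
    for t in range(2, T + 1):
        prev, g[t] = g[t - 1], {}
        g[t][(0, 0)] = 1.0
        for s in range(1, t):       # v = 0: appending a zero adds no psi term
            g[t][(s, 0)] = prev.get((s, 0), 0.0) + prev.get((s, 1), 0.0)
        for s in range(1, t + 1):   # v = 1: psi enters only after a previous one
            g[t][(s, 1)] = (prev.get((s - 1, 0), 0.0) * exp(phi[t - 1])
                            + prev.get((s - 1, 1), 0.0) * exp(phi[t - 1] + psi))
    return g
```

Summing the two branches, \(g_{t,a,s,0}+g_{t,a,s,1}\), gives the normalizing constant in (18); with \({\varvec{\phi }}=\varvec{0}\) and \(\psi =0\) this sum reduces to the binomial coefficient, a convenient sanity check.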

3.2.2 Modified QE Model

Following the same strategy as above, the recursion for the modified QE model can be obtained by defining

$$\begin{aligned} g_{t,a,s,v}({\varvec{\phi }},\psi )=\sum _{{\varvec{z}}:z_+=s,z_t=v} \exp \left[ \sum _{u=1}^{t} z_u \phi _u + \left( \mathbbm {1}\{z_1=a\}+\sum _{u=2}^{t} \mathbbm {1}\{z_u=z_{u-1}\}\right) \psi \right] .\end{aligned}$$

The recursion for computing \(g_{t,a,s,v}({\varvec{\phi }},\psi )\) is as follows:

  1.

    for \(t=1\), let

    $$\begin{aligned} g_{1,a,s,v}({\varvec{\phi }},\psi )=\left\{ \begin{array}{ll} \exp [(1-a)\psi ],&{} s=v=0,\\ \exp (\phi _1+a\psi ),&{} s=v=1,\\ 0, &{} \text{otherwise}, \end{array}\right. \end{aligned}$$

    with \(a=0,1\);

  2.

    for \(t=2,\ldots ,T_i\), compute

    $$\begin{aligned}&g_{t,a,s,v}({\varvec{\phi }},\psi )\nonumber \\&\quad =\left\{ \begin{array}{ll} g_{t-1,a,s,0}({\varvec{\phi }},\psi )\exp (\psi )\\ \quad +g_{t-1,a,s,1}({\varvec{\phi }},\psi ), &{} s=0,\ldots ,t-1,\, v=0,\\ g_{t-1,a,s-1,0}({\varvec{\phi }},\psi )\exp (\phi _t)\\ \quad +g_{t-1,a,s-1,1}({\varvec{\phi }},\psi )\exp (\phi _t+\psi ), &{} s=1,\ldots ,t,\,v=1,\\ 0, &{} \text{otherwise}.\\ \end{array}\right. \end{aligned}$$
    (20)
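A hedged Python sketch of this recursion, analogous to the simplified case (again our own translation):

```python
from math import exp

def g_modified(phi, psi, a):
    """Recursion (20) for the modified QE model.

    g[t][(s, v)] sums exp(sum_u z_u phi_u
    + (1{z_1 = a} + sum_{u>1} 1{z_u = z_{u-1}}) psi)
    over binary z of length t with z_+ = s and z_t = v.
    """
    T = len(phi)
    g = {1: {(0, 0): exp((1 - a) * psi), (1, 1): exp(phi[0] + a * psi)}}
    for t in range(2, T + 1):
        prev, g[t] = g[t - 1], {}
        for s in range(0, t):       # v = 0: psi enters when the previous value is 0
            g[t][(s, 0)] = (prev.get((s, 0), 0.0) * exp(psi)
                            + prev.get((s, 1), 0.0))
        for s in range(1, t + 1):   # v = 1: psi enters when the previous value is 1
            g[t][(s, 1)] = (prev.get((s - 1, 0), 0.0) * exp(phi[t - 1])
                            + prev.get((s - 1, 1), 0.0) * exp(phi[t - 1] + psi))
    return g
```

The only differences with respect to the simplified QE recursion are in the initialization at \(s=v=0\) and in the v = 0 branch, reflecting the fact that here pairs of consecutive zeros also contribute a \(\psi \) term.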

3.2.3 Approximating QE Model

Finally, for the approximating QE model leading to Eq. (14), we define

$$\begin{aligned} g_{t,a,s,v}({\varvec{\phi }},\gamma ) = \sum _{{\varvec{z}}:z_+=s,z_t=v} \exp \left[ \sum _{u=1}^{t} z_u\phi _u + \sum _{u=1}^{t} \nu _u z_{u-1} + \left( \sum _{u=1}^tz_{u-1}z_u\right) \gamma \right] , \end{aligned}$$

where \(z_0 = a\), \(\phi _u = {\varvec{x}}_{iu}'{\varvec{\beta }}\), and \(\nu _u = -{\bar{r}}_{iu}\gamma \).

Exploiting the same structure of Eq. (18), the recursive computation of \(g_{t,a,s,v}({\varvec{\phi }},\gamma )\) can be performed as follows:

  1.

    for \(t=1\), let

    $$\begin{aligned} g_{1,a,s,v}({\varvec{\phi }},\gamma )=\left\{ \begin{array}{ll} \exp (\nu _1a),&{} s=v=0\\ \exp [\phi _1+a(\nu _1 + \gamma )],&{} s=v=1,\\ 0, &{} \text{otherwise}, \end{array}\right. \end{aligned}$$

    with \(a=0,1\);

  2.

    for \(t=2,\ldots ,T_i\), compute

    $$\begin{aligned}&g_{t,a,s,v}({\varvec{\phi }},\gamma )\nonumber \\&\quad =\left\{ \begin{array}{ll} 1, &{} s=0,\,v=0,\\ g_{t-1,a,s,0}({\varvec{\phi }},\gamma )\\ \quad +g_{t-1,a,s,1}({\varvec{\phi }},\gamma )\exp (\nu _t), &{} s=1,\ldots ,t-1,\, v=0,\\ g_{t-1,a,s-1,0}({\varvec{\phi }},\gamma )\exp (\phi _t)\\ \quad +g_{t-1,a,s-1,1}({\varvec{\phi }},\gamma )\exp (\phi _t+\gamma +\nu _t), &{} s=1,\ldots ,t,\,v=1,\\ 0, &{} \text{otherwise}.\\ \end{array}\right. \end{aligned}$$
    (21)
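A Python sketch of this recursion follows (our own translation). One caveat: we apply the v = 0 update at s = 0 as well, so the all-zeros configuration keeps its weight \(\exp (\nu _1 a)\) instead of being normalized to 1; this makes the table agree term by term with the brute-force sum (configurations with zero total score are in any case dropped from the conditional likelihood):

```python
from math import exp

def g_approx(phi, nu, gamma, a):
    """Recursion (21) for the approximating QE model (our sketch).

    g[t][(s, v)] sums exp(sum_u z_u phi_u + sum_u nu_u z_{u-1}
    + gamma * sum_u z_{u-1} z_u), with z_0 = a, over binary z of
    length t with z_+ = s and z_t = v.
    """
    T = len(phi)
    g = {1: {(0, 0): exp(nu[0] * a), (1, 1): exp(phi[0] + a * (nu[0] + gamma))}}
    for t in range(2, T + 1):
        prev, g[t] = g[t - 1], {}
        for s in range(0, t):       # v = 0: appending a zero adds nu_t z_{t-1}
            g[t][(s, 0)] = (prev.get((s, 0), 0.0)
                            + prev.get((s, 1), 0.0) * exp(nu[t - 1]))
        for s in range(1, t + 1):   # v = 1: adds phi_t + nu_t z_{t-1} + gamma z_{t-1}
            g[t][(s, 1)] = (prev.get((s - 1, 0), 0.0) * exp(phi[t - 1])
                            + prev.get((s - 1, 1), 0.0)
                              * exp(phi[t - 1] + gamma + nu[t - 1]))
    return g
```

Setting \(\gamma =0\) and \({\varvec{\nu }}=\varvec{0}\) removes the correction terms and the recursion again reduces to counting, which provides a quick sanity check.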

4 Computational Complexity

We now focus on the comparison between the different methods to evaluate the conditional probability of a response configuration in terms of computational complexity. The standard method based on Eq. (11), applied to each unit \(i = 1,\dots ,n\), relies on a sum extended to all the vectors \({\varvec{z}}\), whose number depends on the time dimension and on the total score and is given by the binomial coefficient \(k = \left( {\begin{array}{c}T_i\\ y_{i+}\end{array}}\right) \). When \(T_i\) is large (e.g., \(T_i\ge 15\)), the number k of such vectors, and with it the number of operations required to compute the normalizing constant, may become huge. Figure 1 shows the total number of vectors \({\varvec{z}}\) for a small grid of values of \(T_i\) and \(y_{i+}\). Obviously, when k becomes too large, the computation of (11) is no longer feasible.
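To get a sense of this growth, the worst case \(y_{i+} \approx T_i/2\) can be tabulated directly:

```python
from math import comb

# Worst-case number of binary vectors with a fixed total score: the
# central binomial coefficient comb(T, T // 2), which grows roughly
# like 2^T / sqrt(T); e.g., comb(20, 10) = 184756 and
# comb(30, 15) = 155117520.
for T in (10, 15, 20, 25, 30):
    print(T, comb(T, T // 2))
```

Already at \(T_i = 30\) the direct sum would require more than \(1.5\times 10^8\) terms per likelihood evaluation, and several evaluations are needed at each Newton-Raphson step.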

Fig. 1: Numbers of vectors \({\varvec{z}}\) (in thousands) as a function of \(T_i\) and \(y_{i+}\)

The proposed recursion overcomes the computational problem described above. Again, the required operations depend on \(T_i\) and \(y_{i+}\), but their order is reduced because only basic sums and products are computed, at most one for each value of the total score \(s\in \{0,\dots ,t\}\), with t varying in \(\{1,\dots ,T_i\}\), so that the overall cost grows at most quadratically in \(T_i\) rather than combinatorially. This results in a substantial difference in terms of computational time.

For a clearer comparison between the different methods, the computational time of the proposed recursive algorithm is evaluated empirically through a simple Monte Carlo simulation described in this section. Experiments are run on a machine with two Intel Xeon E5-2640 v4 CPUs (2.40 GHz) and 250 GB of RAM, with Debian GNU/Linux “bullseye”/sid as operating system. Results are obtained using R 4.0.3 and the custom code included in the cquad package.

The simulation design is similar to that adopted by Honoré & Kyriazidou (2000). In particular, data are generated from the DL model according to expression (8), where \(x_{it}\) is an exogenous regressor drawn from a Gaussian distribution with zero mean and variance \(\pi ^2/3\), the parameter \(\beta \) is equal to 1, and the state dependence parameter \(\gamma \) is equal to 0.5. Individual effects are generated as \(\alpha _i = \frac{1}{4}\sum _{t=0}^{3}x_{it}\), and 50 Monte Carlo replications are performed. Furthermore, the initial observation has distribution

$$\begin{aligned} p(y_{i0} | \alpha _i, x_{i0}) = \frac{\exp \left[ y_{i0}\left( \alpha _i + x_{i0}\beta \right) \right] }{1 + \exp \left( \alpha _i + x_{i0}\beta \right) }. \end{aligned}$$

Finally, we consider \(n=250\) individuals observed for different numbers of time occasions. To simplify the interpretation of the results, each simulated dataset contains 249 subjects observed for \(T_i = 5\) time occasions and a single subject, the last, observed for \(T>5\) occasions, where T is allowed to vary over different ranges.
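The data generating process just described can be sketched as follows (a Python illustration, not the R code actually used in the experiments; the function name and defaults are ours):

```python
import math
import random

# Honoré & Kyriazidou (2000)-style design: exogenous x_it ~ N(0, pi^2/3),
# alpha_i = (1/4) * sum_{t=0}^{3} x_it, y_i0 drawn from the logistic law
# above, then dynamic logit draws with beta = 1 and gamma = 0.5.
def simulate_panel(n=250, T=5, beta=1.0, gamma=0.5, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(math.pi ** 2 / 3)                     # Var(x_it) = pi^2/3
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, sd) for _ in range(T + 1)]   # x_i0, ..., x_iT
        alpha = 0.25 * sum(x[:4])                        # alpha_i
        p0 = 1.0 / (1.0 + math.exp(-(alpha + beta * x[0])))
        y = [1 if rng.random() < p0 else 0]              # initial observation
        for t in range(1, T + 1):                        # dynamic logit (8)
            eta = alpha + beta * x[t] + gamma * y[-1]
            p = 1.0 / (1.0 + math.exp(-eta))
            y.append(1 if rng.random() < p else 0)
        data.append((x, y))
    return data
```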

Within the simulation study, we measure how the CPU time required by the Newton–Raphson algorithm for CML estimation varies with the time dimension of the panel. In this case, we let T vary over a grid between 16 and 100 with steps of 4. For each value of T, the average CPU time taken by the maximization over the replications is shown in Fig. 2. The CPU time increases with T, as expected, and is similar for the three models considered.

Fig. 2 CPU time of the recursion: QE models

With a second experiment, we aim at comparing the time required to compute the log-likelihood function using the standard algebraic operations and using the recursive algorithm.

In this case, we make use of a single draw from the simulation design described above, with \(T_i=5\) for \(i = 1, \dots, 249\). The outcome for the 250-th subject is arbitrarily set to a sequence alternating 0 and 1, so as to remove the randomness of the data generating process and, at the same time, maximize the number k of vectors for a given total score. Further, the initial observation is set to 1 and T ranges in \(\{8, \dots , 18\}\).

Fig. 3 CPU time comparison of algebraic and recursive computation: Simplified QE

Fig. 4 CPU time comparison of algebraic and recursive computation: Modified QE

Fig. 5 CPU time comparison of algebraic and recursive computation: Approximating QE

Figures 3, 4, and 5 report the computational time required for the log-likelihood maximization of the simplified, modified, and approximating QE models, respectively. In terms of computational load, the results for the modified QE model are entirely comparable with those for the simplified QE model, while the approximating QE is more demanding, since it consists of a two-step CML estimator.

Each figure reports the time required by the algebraic operations (“Standard”) and by the recursion (“Recursive”): the former grows exponentially as T approaches 18, while the latter remains essentially flat. The time taken by the algebraic computation for the “one-step” QE models grows up to 1 minute, while the PCML routine takes more than 4 minutes for \(T=18\). The recursive algorithm, by contrast, always takes less than one second for all the time dimensions considered.

5 Application: Brand Loyalty

The proposed methodology is applied to real data on consumers’ loyalty to two dominant yogurt brands. The analysis performed here replicates the example provided by Chintagunta et al. (2001). Differently from the original work, though, the present application is based on a sample dataset provided by A.C. Nielsen that contains information about yogurt purchases made by individuals observed over a period of about two years. This dataset has already been used by Jain et al. (1994).

Purchases concern four brands: Yoplait, Dannon, Nordica, and Weight. As in the original work, we only keep purchases of the two brands with the largest market share in this dataset, namely Dannon and Yoplait. The sample consists of an unbalanced panel of 100 consumers, for a total of 1,788 observations. Time occasions differ among individuals and range between a minimum of 1 and a maximum of 161.

The response variable for the model considered is set to 1 when a consumer chose Dannon. We also include two exogenous explanatory variables. The first one is the log-difference between the prices of the two products (Price). The second one (Featured) is a categorical variable built as the difference between two dummy variables, one per brand, recording whether the brand is advertised in newspapers. We expect a lower relative price and advertising to have a positive impact on the demand for a good and hence on the probability of purchasing a brand. Moreover, stickiness in consumer habits needs to be taken into account by means of a dynamic specification of the model, that is, by including the lagged dependent variable in the set of covariates. Chintagunta et al. (2001) discuss different model specifications, but only the DL model allows us to properly include individual fixed effects and account for true state dependence.
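The construction of the two regressors can be sketched as follows (a Python illustration with hypothetical field names; the direction of both differences, Dannon minus Yoplait, is our assumption):

```python
import math

# Build the two covariates described above from hypothetical raw fields:
# brand prices and per-brand newspaper-advertisement dummies.
def make_covariates(price_dannon, price_yoplait, feat_dannon, feat_yoplait):
    price = math.log(price_dannon) - math.log(price_yoplait)  # log-difference
    featured = feat_dannon - feat_yoplait                     # in {-1, 0, 1}
    return price, featured
```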

The models considered here are well suited to represent consumers’ behavior and their loyalty to a brand (Chintagunta et al., 2001). First of all, from a statistical perspective, it is straightforward to represent brand choices as a binary variable equal to 1 or 0 according to whether a product is chosen or not. Secondly, the panel structure of the data allows us to take into account individual time-constant unobserved characteristics. Since consumers are observed for a relatively short span of time, in which their purchases are recorded, it is reasonable from a theoretical perspective to treat some individual unobservable characteristics as time-invariant.

In the original example, Chintagunta et al. (2001) specify the DL model and consider different estimation approaches: the ML estimator of the pooled model (with no individual intercepts), the CML estimator where the lagged dependent variable is treated as exogenous, the weighted CML (WCML) estimator by Honoré & Kyriazidou (2000), and different random-effects specifications. The proposed algorithm allows us to extend the analysis to QE models as well. In particular, here we formulate the brand choice model as a simplified QE model and as the QE approximation of the DL model. The parameters of these model formulations are estimated by CML and PCML, respectively. Estimation results are reported in Table 1, together with the pooled logit model, the CML, and the WCML.

Table 1 Estimation results: DL and QE models for brand choice

We observe that the estimated coefficients differ considerably across the proposed model specifications. In general, all the signs are coherent with economic theory: for instance, an increase in the relative price decreases the probability of purchasing the brand Dannon. The variable Featured does not appear to be statistically significant. Furthermore, it is interesting to see how the estimated level of state dependence is sensitive to the specification of the unobservables. In the pooled model, the estimated state dependence parameter is the largest among all the considered specifications. Including heterogeneity dramatically decreases its magnitude, which differs slightly between the DL model estimated by CML and by PCML, due to the different estimation strategies. In any case, the pooled and the CML models are likely to provide biased estimates, because the former ignores potential heterogeneity and neither properly accounts for the dynamic specification. Finally, we observe that the computational time for the QE models is about twenty times larger than for the DL ones, because of the greater complexity of the dynamic formulations considered.

Finally, it is worth stressing that there is no way to compare the computational time required by the algorithm for the QE models with that of their standard matrix computation since, in the latter case, estimation is not feasible due to the large number of time occasions.

6 Conclusions

This work provides a novel way to compute the conditional likelihood functions of the QE model and its extensions. By means of a recursive algorithm, the computational burden of the standard algebraic calculations is lifted for applications involving panel datasets with a large time dimension.

In particular, the algorithm proposed here is relevant for empirical work making use of unbalanced panel datasets. The presence of subjects observed for a few time occasions makes the conditional inference approach more appealing than the ML estimator, which is inconsistent for fixed T. At the same time, the proposed algorithm dramatically limits the burden of computing the conditional log-likelihood, which can be sizable even if only a few subjects are observed over a long time span.

A Monte Carlo experiment shows how the proposed algorithm outperforms the standard computation, which should broaden the applicability of the models considered. In this regard, an application to real data on brand loyalty has been presented, with the additional interest of showing a set of results that would otherwise have been impossible to obtain, thus providing practitioners with a wider range of estimation tools for empirical analysis.

Finally, the conditional inference approach can easily be extended to ordered data, as a fixed-effects ordered logit model can be estimated by relying on a set of fixed-effects binary logit models, which refer to binary variables obtained by dichotomizing the original responses (Chamberlain, 1980; Baetschmann et al., 2015).