Introduction

In categorical data analysis, hypothesis tests about parameters of interest are typically based on three classes of methods: the likelihood-ratio (LR) test (Wilks 1938), the Wald test (Wald 1943), and the score test proposed by Rao (1948). These statistics can also be used to construct confidence intervals (CIs) by inverting the corresponding tests for the parameter.

Wald tests and CIs are typically considered the standard approach because of their computational simplicity and ready availability in software. Inverting the Wald test, the 95% confidence interval is obtained simply by adding to and subtracting from the parameter estimate 1.96 times the estimated standard error. Thanks to this simplicity, early developments of methods in categorical data analysis were Wald-based, such as the weighted least squares methods proposed by Grizzle et al. (1969) and many follow-up articles by Gary Koch and his colleagues. However, the LR test and its inversion for confidence intervals are increasingly available in software. Also, Rao’s score test, which is based on the derivative of the log-likelihood function at the null hypothesis value of the parameter of interest, is commonly used in some settings. In honour of Professor C. R. Rao for this special issue, our article focuses on the score test and score-test-based confidence intervals in categorical data analysis. We also present recent extensions that build on the breakthrough impact of Rao’s contribution.

We begin by summarizing the three methods. For notational simplicity, we present them for the simple case of a statistical model with a single parameter \(\beta\). Denote by \(\ell (\beta )\) the associated log-likelihood function and by \({\widehat{\beta }}\) the maximum likelihood (ML) estimate. The score function is

$$\begin{aligned} u(\beta ) = \partial \ell (\beta )/\partial \beta . \end{aligned}$$

Linked to it is the Fisher information, \(\iota (\beta ) = -\mathbb {E}\left[ \partial ^2 \ell (\beta )/\partial \beta ^2 \right]\), coinciding with the variance of the score function \(u(\beta )\), since \(\mathbb {E}\left[ u(\beta ) \right] = 0\).

Consider a two-sided significance test of \(H_0\): \(\beta = \beta _{0}\) against \(H_a\): \(\beta \ne \beta _{0}\). The squared version of the Wald test statistic is

$$\begin{aligned} \left[ \frac{{\widehat{\beta }} - \beta _{0}}{\text {se}({\widehat{\beta }})} \right] ^2 \; = \; \left( {\widehat{\beta }} - \beta _{0}\right) ^2\iota ({\widehat{\beta }}), \end{aligned}$$

where the standard error \(\text {se}({\widehat{\beta }})\) of \({\widehat{\beta }}\) and the Fisher information \(\iota ({\widehat{\beta }})\) are evaluated at \({\widehat{\beta }}\). The LR test statistic is

$$\begin{aligned} -2\left[ \ell (\beta _0) - \ell ({\widehat{\beta }}) \right] , \end{aligned}$$

comparing the unconstrained log-likelihood function at its maximum \(\ell ({\widehat{\beta }})\) with its value at \(\beta _0\). Rao’s score test statistic is

$$\begin{aligned} \frac{\left[ u(\beta _0)\right] ^2}{\iota (\beta _0)} \; = \; \frac{\left[ \partial \ell (\beta )/\partial \beta \right] ^2_{\beta = \beta _0}}{-\mathbb {E}\left[ \partial ^2 \ell (\beta )/\partial \beta ^2\right] _{\beta = \beta _0}}, \end{aligned}$$

with derivatives evaluated at \(\beta _0\). Its underlying idea is that when \(H_0\) is true, the score function should be relatively near zero at \(\beta _0\). In some literature, especially in econometrics, Rao’s score test is also known as the Lagrange multiplier test, based on Silvey (1959).

Under \(H_0\), all three test statistics have asymptotic chi-squared distributions and are asymptotically equivalent (Cox and Hinkley 1974). When \(H_0\) is false, the three statistics have approximate non-central chi-squared distributions, with different non-centrality parameters. A \(100(1-\alpha )\)% CI can be derived by inverting the tests, constructing the set of \(\beta _0\) values such that the two-sided significance test has p-value > \(\alpha\). An advantage of Rao’s score test is that it applies even when the other two tests cannot be used, for instance when \(\beta\) falls on the boundary of the parameter space under \(H_0\). A disadvantage of the Wald test is its lack of invariance, with results depending on the scale of measurement for \(\beta\). Likewise, the Wald CI for a nonlinear function \(g(\beta )\) of \(\beta\) is not \(g(\cdot )\) applied to the Wald CI for \(\beta\). Thus, in using the Wald method, a wise choice of scale is needed.
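As a concrete illustration, the following minimal R sketch (simulated data, illustrative only) computes the three tests of \(H_0: \beta = 0\) for a logistic regression coefficient; the three statistics are asymptotically equivalent but generally differ numerically in finite samples.

```r
## The three tests of H0: beta = 0 for a logistic regression coefficient
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.4 * x))
fit1 <- glm(y ~ x, family = binomial)  # unconstrained model
fit0 <- glm(y ~ 1, family = binomial)  # model with beta = 0

summary(fit1)$coefficients["x", "z value"]^2  # squared Wald statistic
anova(fit0, fit1, test = "LRT")               # likelihood-ratio test
anova(fit0, fit1, test = "Rao")               # Rao's score test
```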

In this paper, we introduce frameworks in which score-test-based inference is useful for categorical data analysis. “Tests for Categorical Data as Score Tests” summarizes score tests, “Score-Test-Based Confidence Intervals” summarizes score-test-based confidence intervals, and “Small-Sample Score-Test-Based Inference” considers small-sample methods. “Extensions of Score-Test-Based Inference for Categorical Data” discusses further extensions such as a “pseudo-score” method that applies when ordinary score methods are not readily available, generalizations for high-dimensional cases and for complex and longitudinal settings, and potential future research.

Tests for Categorical Data as Score Tests

Many standard significance tests for categorical data can be derived as score tests of the hypothesis that a parameter or a set of parameters equals 0. Methods that construct their estimates of variability under a null hypothesis are often score tests or are closely related to score tests. A landmark example is the Pearson chi-squared test of independence for a two-way contingency table. With cell counts \(\{y_{ij}\}\) for a sample of size n and with expected frequency estimates \(\{\widehat{\mu }_{ij} = y_{i+}y_{+j}/n\}\) based solely on the row and column marginal counts, the Pearson statistic is

$$\begin{aligned} X^2 = \sum _{i}\sum _{j} \frac{\left( y_{ij} - {\widehat{\mu }}_{ij}\right) ^2}{{\widehat{\mu }}_{ij}}. \end{aligned}$$

Some details follow about other statistics for categorical data. See Agresti (2013) for a summary of the methods mentioned.

Smyth (2003) showed that the Pearson chi-squared statistic \(X^2\) for testing independence in a two-way contingency table is a score statistic, under the assumption that the cell counts are independent and Poisson distributed. For multiway contingency tables, Smyth proved that the score test of the hypothesis that any chosen subset of the pairs of faces in the table are independent yields a Pearson-type statistic. For I independent \(\text {Binom}(n_i, \pi _i)\) random variables \(\{Y_i\}\) and a binary linear trend model \(\pi _i = H(\alpha + \beta x_i)\) with a twice differentiable monotone function H, Tarone and Gart (1980) proved that the score statistic for testing \(H_0: \beta = 0\) does not depend on H. It follows that the Cochran-Armitage test, which is the score test of \(H_0: \beta = 0\) in the linear probability model \(\pi _i = \alpha +\beta x_i\), is equivalent to the score statistic for testing \(H_0: \beta = 0\) in the logistic regression model, as the sketch below illustrates.

The Cochran-Mantel-Haenszel test of conditional independence in \(2 \times 2 \times K\) tables that compare two groups on a binary response while adjusting for a categorical covariate is a score test for the logistic model assuming no interaction between the group variable and the categorical covariate (Birch 1964, 1965; Darroch 1981). Day and Byar (1979) demonstrated the equivalence of Cochran-Mantel-Haenszel statistics and score tests for testing independence in case-control studies investigating the risk associated with a dichotomous exposure, with individuals stratified into groups. Another special case of the Cochran-Mantel-Haenszel test is a score test applied to binary responses of n matched pairs displayed in n partial \(2\times 2\) tables, commonly known as McNemar’s test. A generalized Cochran-Mantel-Haenszel test for two multicategory variables is the score test for the null hypothesis of conditional independence in a generalized logistic model (Day and Byar 1979). For testing conditional independence in three-way contingency tables that relate a nominal explanatory variable to an ordinal response variable while adjusting for a categorical variable, Iannario and Lang (2016) presented a generalization of the Cochran-Mantel-Haenszel test by proposing score tests based on first moments and constrained correlation.
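The Tarone-Gart equivalence can be checked numerically. In the sketch below (illustrative grouped data, with arbitrary dose scores), the Cochran-Armitage statistic from `prop.trend.test()` coincides with the Rao score statistic for the logistic model:

```r
## Cochran-Armitage trend test vs. logistic-model score test
events <- c(2, 5, 9, 14)      # successes at each dose level
trials <- c(40, 40, 40, 40)   # group sizes
dose   <- c(0, 1, 2, 3)       # ordered scores

prop.trend.test(events, trials, score = dose)      # Cochran-Armitage test

fit0 <- glm(cbind(events, trials - events) ~ 1,    family = binomial)
fit1 <- glm(cbind(events, trials - events) ~ dose, family = binomial)
anova(fit0, fit1, test = "Rao")    # same chi-squared value
```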

For generalized linear models with the canonical link function, such as binomial logistic regression models and Poisson loglinear models, the likelihood function simplifies with the data reducing to sufficient statistics. For subject i, letting \(y_i\) denote the observed response and \(x_{ij}\) the value of the explanatory variable j for which \(\beta _j\) is the coefficient, the sufficient statistic for \(\beta _j\) is \(\sum _i x_{ij} y_i\). The score test statistic for \(H_0: \beta _j = 0\) can be expressed as a standardization of its sufficient statistic. In this case, Lovison (2005) gave a formula for the score statistic that resembles the Pearson statistic, being a quadratic form comparing fitted values for two models. Let \({\mathbf{X}}\) be the model matrix for the full model and let \(\widehat{\mathbf{V}}_0\) be the diagonal matrix of estimated variances under the null model (e.g., with \(\beta _j = 0\)), with fitted values \(\widehat{\varvec{\mu }}\) for the full model and \(\widehat{\varvec{\mu }}_0\) for the reduced model. Then, the score statistic is

$$\begin{aligned} (\widehat{\varvec{\mu }} - \widehat{\varvec{\mu }}_0)'\mathbf{X}(\mathbf{X}'\widehat{\mathbf{V}}_0 \mathbf{X})^{-1}\mathbf{X}' (\widehat{\varvec{\mu }} - \widehat{\varvec{\mu }}_0). \end{aligned}$$

Lang et al. (1999) gave this formula for the loglinear case.
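As a numerical check of this expression (with simulated Poisson data; variable names are ours), the quadratic form can be evaluated directly and compared with the Rao statistic reported by `anova()`; for the canonical log link the two should agree.

```r
## Lovison's quadratic form vs. the Rao statistic from anova()
set.seed(3)
x1 <- rnorm(60); x2 <- rnorm(60)
y <- rpois(60, exp(0.5 + 0.4 * x1 + 0.3 * x2))
fit1 <- glm(y ~ x1 + x2, family = poisson)  # full model
fit0 <- glm(y ~ x1,      family = poisson)  # null model (beta_2 = 0)

X  <- model.matrix(fit1)
V0 <- diag(fitted(fit0))        # null variances (Poisson: mu)
d  <- fitted(fit1) - fitted(fit0)
drop(t(d) %*% X %*% solve(t(X) %*% V0 %*% X) %*% t(X) %*% d)
anova(fit0, fit1, test = "Rao")$Rao[2]  # same value
```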

Another setting for the Pearson chi-squared statistic occurs in testing model goodness-of-fit. Let \(\{y_i\}\) denote multinomial cell counts for a contingency table of arbitrary dimensions. Let \(\{{\widehat{\mu }}_{i}\}\) be the ML fitted values for a particular model. For testing goodness-of-fit, the score test statistic is the Pearson-type statistic,

$$\begin{aligned} X^2 = \sum _i \frac{\left( y_{i} - {\widehat{\mu }}_{i}\right) ^2}{{\widehat{\mu }}_{i}}. \end{aligned}$$

Cox and Hinkley (1974, p. 326) noted this, and Smyth (2003) extended it to a corresponding statistic for generalized linear models.
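For example, for the independence model in a two-way table, \(X^2\) computed from the loglinear ML fitted values reproduces the classical Pearson statistic (illustrative counts):

```r
## Goodness-of-fit X^2 for the independence loglinear model
counts <- c(25, 15, 10, 20, 30, 35)
rowf <- gl(2, 3); colf <- gl(3, 1, 6)
fit0 <- glm(counts ~ rowf + colf, family = poisson)  # independence model
sum((counts - fitted(fit0))^2 / fitted(fit0))        # score statistic X^2
chisq.test(matrix(counts, 2, 3, byrow = TRUE))$statistic  # same value
```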

Score-Test-Based Confidence Intervals

Although the score test is well established for categorical data, CIs based on the score function are less utilized. The best known and most utilized score CI is Wilson’s CI for a binomial parameter \(\pi\) (Wilson 1927). This CI uses the score test statistic

$$\begin{aligned} z_W = \frac{{\widehat{\pi }} - \pi _0}{\sqrt{\pi _0(1-\pi _0)/n}}, \end{aligned}$$

which has an asymptotic standard normal distribution under \(H_0\): \(\pi = \pi _0\). The \(100(1-\alpha )\%\) CI consists of those \(\pi _0\) values for which the p-value \(> \alpha\). For instance, the endpoints of the 95% CI are the roots of the quadratic equation \(z^2_W = (1.96)^{2}\). By contrast, the Wald CI is based on the z statistic with \(\widehat{\pi }\) in place of \(\pi _0\) in the denominator. It is centered at \(\widehat{\pi }\) and has zero length whenever \(\widehat{\pi } = 0\) or 1, because the estimated standard error is then zero.
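A minimal sketch follows, computing the Wilson interval from the closed-form roots of \(z^2_W = z^2\); `wilson_ci()` is an illustrative name, and `prop.test()` without continuity correction returns the same interval.

```r
## Wilson score CI: closed-form roots, and prop.test() for comparison
wilson_ci <- function(y, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- y / n
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  halfw  <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(center - halfw, center + halfw)
}
wilson_ci(8, 20)
prop.test(8, 20, correct = FALSE)$conf.int  # same interval
```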

In other settings, score CIs are less commonly known and used. Studies often aim to compare two groups (e.g. different treatments) on a binary response (success, failure), and we focus on that case. A \(2 \, \times \, 2\) contingency table, with observed frequencies \(\{y_{11}, y_{12}, y_{21}, y_{22}\}\), shows the results, with rows for the groups and columns for response categories. Let \(n_1 = y_{11}+y_{12}\) and \(n_2 = y_{21}+y_{22}\) denote the sample sizes of the two groups. For a subject in row i, \(i=1,2\), let \(\pi _{i}\) denote the binomial probability that the response is category 1 (success). For relevant parameters such as the difference of probabilities, the ratio of probabilities, and the odds ratio, Agresti (2011) summarized score-test-based CIs. We next summarize them.

Consider a score CI for the difference, \(\delta = \pi _1 - \pi _2\). For \(H_0\): \(\pi _1 - \pi _2 = \delta _0\), let \({\widehat{\pi }}_1\) and \({\widehat{\pi }}_2\) be the unrestricted ML estimates, which are the sample proportions \({\widehat{\pi }}_i = y_{i1}/n_{i}\), \(i=1,2\), and let \({\widehat{\pi }}_1(\delta _{0})\) and \({\widehat{\pi }}_2(\delta _{0})\) be the ML estimates subject to the constraint \(\pi _1 - \pi _2 = \delta _0\). Mee (1984) obtained an asymptotic score CI by inverting the test statistic that is the square of

$$\begin{aligned} z_{\text {diff}} = \frac{({\widehat{\pi }}_1 - {\widehat{\pi }}_2) - \delta _{0}}{\sqrt{[{\widehat{\pi }}_1(\delta _{0})(1 - {\widehat{\pi }}_1(\delta _{0}))/n_1] + [{\widehat{\pi }}_2(\delta _{0})(1 - {\widehat{\pi }}_2(\delta _{0}))/n_2]}}. \end{aligned}$$

The restricted ML estimates \({\widehat{\pi }}_i(\delta _{0})\), \(i=1,2\), have closed form, but the computation of the set of \(\delta _0\) that fall in the CI requires an iterative algorithm. When \(\delta _0 = 0\), \(z_{\text {diff}}^2\) is the Pearson chi-squared statistic for testing independence. Therefore, when zero is included in this \(100(1-\alpha )\%\) score CI, the Pearson test has p-value \(> \alpha\). Miettinen and Nurminen (1985) proposed to multiply \(z_{\text {diff}}\) by \(\left( 1- (n_1 + n_2)^{-1} \right) ^{1/2}\) to improve performance with small samples. Newcombe (1998a) proposed another score-test based CI for \(\pi _1 - \pi _2\). This interval combines Wilson’s individual score CIs for the two proportions. See also Fagerland et al. (2017, Sec. 4.5.4) for details. In large samples, Newcombe’s interval tends to be close to Mee’s asymptotic score interval, and both have higher actual coverage probabilities than the Wald interval (which inverts the z statistic with unrestricted ML estimates in the standard error), particularly in unbalanced samples (\(n_1 \ne n_2\)). Fagerland et al. (2015) recommended Newcombe’s hybrid score intervals as the best when sample sizes are moderate or large. The Newcombe CI and the Miettinen-Nurminen CI perform similarly, with coverage probabilities close to the nominal level, but the Newcombe CI performs better when proportions are close to the boundaries.
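To make the inversion concrete, here is a minimal sketch of Mee's score CI (function names are illustrative, not from a package); the restricted ML estimates are computed numerically here, although closed-form expressions also exist.

```r
## Mee's score CI for delta = pi1 - pi2
score_stat_diff <- function(delta0, y1, n1, y2, n2) {
  p1 <- y1 / n1; p2 <- y2 / n2
  ## restricted ML: maximize the binomial log-likelihood subject to
  ## pi1 - pi2 = delta0, searching over the admissible range of pi2
  loglik <- function(q) {
    clamp <- function(p) min(max(p, 1e-10), 1 - 1e-10)
    dbinom(y1, n1, clamp(q + delta0), log = TRUE) +
      dbinom(y2, n2, clamp(q), log = TRUE)
  }
  q0 <- optimize(loglik, c(max(0, -delta0), min(1, 1 - delta0)),
                 maximum = TRUE)$maximum
  p0 <- q0 + delta0
  (p1 - p2 - delta0) / sqrt(p0 * (1 - p0) / n1 + q0 * (1 - q0) / n2)
}

## 95% CI: the set of delta0 with |z_diff| <= 1.96, found by root search
mee_ci <- function(y1, n1, y2, n2, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  est <- y1 / n1 - y2 / n2
  f <- function(d) abs(score_stat_diff(d, y1, n1, y2, n2)) - z
  c(uniroot(f, c(-1 + 1e-6, est))$root,
    uniroot(f, c(est, 1 - 1e-6))$root)
}

mee_ci(y1 = 7, n1 = 25, y2 = 2, n2 = 25)
```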

Sometimes it is more informative to consider the ratio, \(\phi = \pi _1/\pi _2\), instead of the difference \(\delta = \pi _1 - \pi _2\), especially when both \(\pi _1\) and \(\pi _2\) are near 0. This ratio, sometimes referred to as the relative risk, is estimated by

$$\begin{aligned} {\widehat{\phi }} \, = \, \frac{{\widehat{\pi }}_1}{{\widehat{\pi }}_2} \, = \, \frac{y_{11}/n_{1}}{y_{21}/n_{2}}. \end{aligned}$$

Koopman (1984) proposed an asymptotic score CI for the relative risk. Under \(H_0: \pi _1/\pi _2 = \phi _0\), that is \(H_0: \pi _1 = \phi _0 \pi _2\), the chi-squared statistic is

$$\begin{aligned} u_{\text {RR}}\, =\, \frac{\big (y_{11}-n_{1} \widehat{\pi }_{1}(\phi _0)\big )^{2}}{n_{1} \widehat{\pi }_{1}(\phi _0)\big (1-\widehat{\pi }_{1}(\phi _0)\big )}+ \frac{\big (y_{21}-n_{2} \widehat{\pi }_{2}(\phi _0) \big )^{2}}{n_{2} \widehat{\pi }_{2}(\phi _0)\big (1-\widehat{\pi }_{2}(\phi _0)\big )}, \end{aligned}$$

where \(\widehat{\pi }_{i}(\phi _0)\), \(i=1,2\), denote the ML estimates of \(\pi _{1}\) and \(\pi _{2}\) under \(H_0\), which are

$$\begin{aligned} \widehat{\pi }_{1}(\phi _0) \; = \; \frac{\phi _{0} \left( n_{1} + y_{21}\right) + y_{11} + n_{2} - \sqrt{\Big (\phi _{0} \left( n_{1}+y_{21}\right) + y_{11} + n_{2}\Big )^2 - 4 \phi _{0} \, (n_1+n_2) \left( y_{11}+y_{21}\right) }}{2 (n_1+n_2)}, \end{aligned}$$

and \(\widehat{\pi }_{2}(\phi _0) =\widehat{\pi }_{1}(\phi _0) / \phi _{0}\). The endpoints are the two solutions to \(u_{\text {RR}} = \chi _{1,1-\alpha }^{2}\), equating \(u_{\text {RR}}\) to the \(1-\alpha\) quantile of the chi-squared distribution with one degree of freedom. The CI limits are zero or infinity when a cell count is 0. Miettinen and Nurminen (1985) proposed another asymptotic score CI for \(\phi\), by inverting under \(H_0\),

$$\begin{aligned} z_{\mathrm {MN}}\, =\,\frac{1}{s} \big ( \widehat{\pi }_{1} - \phi _0 \widehat{\pi }_{2} \big ) \sqrt{1-\frac{1}{n_{1}+n_{2}}},\end{aligned}$$

where

$$\begin{aligned} s = \sqrt{\frac{\widehat{\pi }_{1}(\phi _0) \big (1-\widehat{\pi }_{1}(\phi _0)\big )}{n_{1}}+\frac{\phi _{0}^{2} \, \widehat{\pi }_{2}(\phi _0)\big (1-\widehat{\pi }_{2}(\phi _0)\big )}{n_{2}}}. \end{aligned}$$

Also, Zou and Donner (2008) proposed a hybrid score CI combining the Wilson score CIs for the individual parameters \(\pi _1\) and \(\pi _2\), exploiting the fact that the logarithm of a ratio equals the difference of the logarithms. See also Sec. 4.7.5 of Fagerland et al. (2017). Fagerland et al. (2015) recommended Koopman’s asymptotic CI for small, moderate, and large sample sizes. According to a comparative study by Price and Bonett (2008), this CI performs very well, with coverage probabilities always close to the nominal level, and from this point of view it is clearly superior to other non-score-based intervals.
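A sketch of Koopman's interval follows, using the closed-form restricted ML estimates given above; `koopman_ci()` is an illustrative name, and all cell counts are assumed positive so that the limits are finite.

```r
## Koopman's score CI for phi = pi1/pi2
koopman_ci <- function(y11, n1, y21, n2, conf = 0.95) {
  crit <- qchisq(conf, df = 1)
  u_rr <- function(phi0) {
    ## closed-form restricted ML estimates under pi1 = phi0 * pi2
    a  <- phi0 * (n1 + y21) + y11 + n2
    p1 <- (a - sqrt(a^2 - 4 * phi0 * (n1 + n2) * (y11 + y21))) /
      (2 * (n1 + n2))
    p2 <- p1 / phi0
    (y11 - n1 * p1)^2 / (n1 * p1 * (1 - p1)) +
      (y21 - n2 * p2)^2 / (n2 * p2 * (1 - p2))
  }
  est <- (y11 / n1) / (y21 / n2)          # assumes all cell counts > 0
  f <- function(phi0) u_rr(phi0) - crit
  c(uniroot(f, c(1e-6, est))$root, uniroot(f, c(est, 1e6))$root)
}

koopman_ci(y11 = 7, n1 = 25, y21 = 2, n2 = 25)
```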

The odds ratio is a parameter of special interest for categorical data, because it is linked to the coefficient of an explanatory variable in logistic regression via exponentiation. In a \(2 \times 2\) table with two independent binomials, the odds ratio is

$$\begin{aligned} \theta \, = \, \frac{\pi _{1}/(1-\pi _{1})}{\pi _{2}/(1-\pi _{2})}, \end{aligned}$$

estimated by \((y_{11}y_{22})/(y_{12}y_{21})\). To construct a score-test-based CI for \(\theta\), for a given \(\theta _0\), let \(\{{\widehat{\mu }}_{ij}(\theta _0)\}\) be the unique values that have the same row and column margins as \(\{y_{ij}\}\) and such that

$$\begin{aligned} \frac{{\widehat{\mu }}_{11}(\theta _0){\widehat{\mu }}_{22}(\theta _0)}{{\widehat{\mu }}_{12}(\theta _0){\widehat{\mu }}_{21}(\theta _0)} = \theta _0. \end{aligned}$$

The set of \(\theta _0\) satisfying

$$\begin{aligned} X^2 = \sum _{ij} (y_{ij} - {\widehat{\mu }}_{ij}(\theta _0))^2/ {\widehat{\mu }}_{ij}(\theta _0) \le \chi ^2_{1,1-\alpha }, \end{aligned}$$

forms a \(100(1 - \alpha )\%\) conditional score CI for the odds ratio (Cornfield 1956) that also applies for a multinomial sample over the four cells.
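A sketch of this inversion follows, computing \(\{{\widehat{\mu }}_{ij}(\theta _0)\}\) by solving for \({\widehat{\mu }}_{11}\) given the observed margins; `cornfield_ci()` is an illustrative name, and positive cell counts are assumed.

```r
## Score CI for the odds ratio, inverting the Pearson-type statistic
cornfield_ci <- function(y, conf = 0.95) {   # y is a 2x2 matrix of counts
  n <- sum(y); r1 <- sum(y[1, ]); c1 <- sum(y[, 1])
  crit <- qchisq(conf, 1)
  x2 <- function(theta0) {
    ## fitted mu11 with the observed margins and odds ratio theta0
    g <- function(m) m * (n - r1 - c1 + m) - theta0 * (r1 - m) * (c1 - m)
    m11 <- uniroot(g, c(max(0, r1 + c1 - n) + 1e-9,
                        min(r1, c1) - 1e-9))$root
    mu <- matrix(c(m11, r1 - m11, c1 - m11, n - r1 - c1 + m11),
                 2, 2, byrow = TRUE)
    sum((y - mu)^2 / mu)
  }
  est <- (y[1, 1] * y[2, 2]) / (y[1, 2] * y[2, 1])
  f <- function(t) x2(t) - crit
  c(uniroot(f, c(1e-6, est))$root, uniroot(f, c(est, 1e6))$root)
}

cornfield_ci(matrix(c(10, 5, 4, 11), 2, 2, byrow = TRUE))
```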

The research literature suggests that asymptotic score tests and corresponding CIs perform well, usually much better than Wald CIs. Even with small samples, score CIs perform surprisingly well and often out-perform likelihood-ratio-test-based inference. This behavior may be a consequence of the score statistic in canonical models being the standardization of a sufficient statistic that uses standard errors computed under \(H_0\). For evaluations based on simulations, see Fagerland et al. (2015) and Fagerland et al. (2017). In addition, for comparisons specific to CIs for binomial proportions, see Newcombe (1998b) and Agresti and Coull (1998). See Miettinen and Nurminen (1985), Newcombe (1998a), and Agresti and Min (2005a) for comparison of CIs for the difference of proportions and the relative risk, Tango (1998) and Agresti and Min (2005b) for inference about the difference of proportions for dependent samples, Miettinen and Nurminen (1985) and Agresti and Min (2005a) for CIs for the odds ratio, Agresti and Klingenberg (2005) for multivariate comparisons of proportions, Agresti et al. (2008) for simultaneous CIs comparing several binomial proportions, Ryu and Agresti (2008) for effect measures comparing two groups on an ordinal scale, Lang (2008) for logistic regression parameters and generic measures of association, and Tang (2020) for score CIs for stratified comparisons of binomial proportions.

Statistical software provides functions for computing score CIs. The Appendix lists some useful R (R Core Team 2022) functions.

Small-Sample Score-Test-Based Inference

Asymptotic tests based on large-sample approximations may perform poorly for small n, although research suggests that score tests often perform well even in quite small samples. One can instead perform tests and construct CIs by applying relevant small-sample distributions, such as the binomial. For instance, consider inference for a coefficient \(\beta _j\) in a logistic regression model

$$\begin{aligned} \text{ logit }\big [P(Y_i = 1)\big ] = \beta _0 + \beta _1 x_{i1} + \cdots + \beta _k x_{ik}, \end{aligned}$$

for binary response \(Y_i\). With \(x_{i0} = 1\) for the intercept, the score statistic for \(\beta _j\) is based on the sufficient statistic \(T_j = \sum _i x_{ij}y_{i}\). Starting with the binomial likelihood for independent observations, one can base a test on the conditional distribution of \(T_j\), eliminating the nuisance parameters by conditioning on their sufficient statistics. For example, with the equal-tail method, bounds \((\beta _{1L}, \beta _{1U})\) of a 100(\(1 - \alpha\))% CI for \(\beta _1\) are obtained by solving the two equations

$$\begin{aligned}&P(T_1 \ge t_{1,obs} | t_0, t_2, \ldots , t_k) = \alpha /2, \\&P(T_1 \le t_{1,obs} | t_0, t_2, \ldots , t_k) = \alpha /2. \end{aligned}$$

For details, see Mehta and Patel (1995). Software is available for doing this, such as LogXact (Cytel 2005).

Discreteness implies that a significance test cannot have a fixed size \(\alpha\) at all possible null values for a parameter. In rejecting the null hypothesis whenever the p-value \(\le \alpha\), the actual size has \(\alpha\) as an upper bound. Hence, actual confidence levels for small-sample interval estimation inverting such tests do not exactly equal the nominal values. Inferences are conservative, in the sense that coverage probabilities are bounded below by the nominal level and CIs are wider than ideal. The actual coverage probability varies for different parameter values and is unknown.

Agresti (2003) described some remedies to alleviate the conservatism. One approach, feasible when the parameter space is small, uses an unconditional approach to eliminate nuisance parameters, because the conditional approach exacerbates the discreteness. For \(H_0\): \(\beta = \beta _0\) with nuisance parameter \(\psi\), let \(p(\beta _0; \psi )\) be the p-value for a given value of \(\psi\). The unconditional p-value is \(\hbox {sup}_{\psi }p(\beta _0; \psi )\) and the \(100(1-\alpha )\%\) CI consists of the \(\beta _0\) values for which \(\hbox {sup}_{\psi }p(\beta _0; \psi ) > \alpha\). Chan and Zhang (1999) proposed an exact unconditional interval for \(\pi _1 - \pi _2\) by inverting two one-sided exact score tests, each of size at most \(\alpha /2\). Agresti and Min (2001) inverted a single two-sided exact unconditional score test, which results in a narrower interval, available in StatXact software (Cytel 2005). Agresti and Min (2002) found that the unconditional exact approach with a two-sided score statistic also works well for the odds ratio. See Fagerland et al. (2017) for the Chan-Zhang and Agresti-Min forms of CIs for \(\pi _1 - \pi _2\) (pp. 118–119), the relative risk (pp. 139–141), and the odds ratio (pp. 159–160). Coe and Tamhane (1993) proposed an alternative unconditional approach for \(\pi _1 - \pi _2\) and \(\pi _1/\pi _2\) that is more complex but performs well. Santner et al. (2007) reviewed several such methods.

Agresti and Gottard (2007) showed that an alternative way to reduce conservativeness with discrete data is to base tests and CIs on the mid-P-value (Lancaster 1961). For testing \(H_0: \beta = \beta _0\) versus \(H_a: \beta > \beta _0\) based on a discrete test statistic T such as a score statistic, the mid-P-value is

$$\begin{aligned} P(T > t_{obs} \mid H_0) + \frac{1}{2}\,P(T = t_{obs}\mid H_0). \end{aligned}$$

Under \(H_0\), the ordinary p-value is stochastically larger than a uniform(0,1) random variable (which is its exact distribution in the continuous case), whereas the mid-P-value is not; it has the same mean as, and a slightly smaller variance than, a uniform random variable. The sum of the right-tail and left-tail p-values equals \(1 + P(T = t_{obs}\mid H_0)\) for the ordinary p-value but equals 1 for the mid-P-value. Using the small-sample distribution, a 100(\(1 - \alpha\))% mid-P-value CI (\(\beta _L\), \(\beta _U\)) for \(\beta\) is determined by

$$\begin{aligned}&P_{\beta _U}(T < t_{obs}) + (1/2)\times P_{\beta _U}(T = t_{obs}) = \alpha /2, \\&P_{\beta _L}(T > t_{obs}) + (1/2)\times P_{\beta _L}(T = t_{obs}) = \alpha /2. \end{aligned}$$

Its coverage probability is not guaranteed to be \(\ge\) (\(1 - \alpha\)), but it is usually close to that value. Numerical evaluations, such as in Agresti and Gottard (2007), suggest that it tends to be a bit conservative in an average sense.
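For instance, for a binomial parameter \(\pi\) the two defining equations can be solved numerically with binomial tail probabilities (a minimal sketch; `midp_ci()` is an illustrative name):

```r
## Mid-P CI for a binomial parameter pi, with T ~ Binom(n, pi)
midp_ci <- function(t_obs, n, conf = 0.95) {
  alpha <- 1 - conf
  upper_eq <- function(p)   # P(T < t_obs) + P(T = t_obs)/2 = alpha/2
    pbinom(t_obs - 1, n, p) + 0.5 * dbinom(t_obs, n, p) - alpha / 2
  lower_eq <- function(p)   # P(T > t_obs) + P(T = t_obs)/2 = alpha/2
    1 - pbinom(t_obs, n, p) + 0.5 * dbinom(t_obs, n, p) - alpha / 2
  lo <- if (t_obs == 0) 0 else uniroot(lower_eq, c(1e-9, 1 - 1e-9))$root
  hi <- if (t_obs == n) 1 else uniroot(upper_eq, c(1e-9, 1 - 1e-9))$root
  c(lo, hi)
}

midp_ci(3, 20)
```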

Several examples in Fagerland et al. (2017) show the good performance of mid-P-based inference compared with commonly used methods. An example is the Cochran-Armitage mid-P score test and related CIs for trend in logit models for \(I \times 2\) contingency tables with ordered rows and possibly small samples (Fagerland et al. 2017, Table 5.12, p. 221). Other mid-P versions of score-type tests include a mid-P version of the Pearson chi-squared test of independence in unordered \(I \times J\) tables, and the McNemar mid-P test. The latter test did not violate the nominal level in any of the 10,000 scenarios evaluated by Fagerland et al. (2013).

An alternative small-sample approach uses asymptotic statistics but employs a continuity correction, to better align standard normal tail probabilities with the binomial tail probabilities used to construct exact intervals and exact tests. However, because such exact tests are conservative, doing this provides some protection from the actual coverage probability being too low but sacrifices performance in terms of length, with the average coverage probability being too high.

Extensions of Score-Test-Based Inference for Categorical Data

This section describes some extensions of score-test-based inference and potential future research about such methods.

Pseudo-Score Inference with the Pearson Chi-Squared Statistic

Consider a multinomial model for cell counts \(\{y_i\}\) and its ML fitted values \(\{{\widehat{\mu }}_i\}\). Consider a simpler, null model obtained from the full model by imposing a constraint on a model parameter, say \(\beta = \beta _{0}\), with ML fitted values \(\{{\widehat{\mu }}_{i0}\}\). The LR statistic

$$\begin{aligned} G^2 = 2 \sum _i {\widehat{\mu }}_i \log ({\widehat{\mu }}_i/{\widehat{\mu }}_{i0}), \end{aligned}$$

can be used to compare the models and to construct the profile likelihood \(100(1 - \alpha )\%\) CI for \(\beta\), which is the set of \(\{\beta _{0}\}\) such that \(G^2 \le \chi ^2_{1,1-\alpha }\). To provide a score-type CI, Agresti and Ryu (2010) proposed instead inverting a Pearson-type statistic proposed by Rao (1961),

$$\begin{aligned} X^2 = \sum _i \frac{({\widehat{\mu }}_i - {\widehat{\mu }}_{i0})^2}{{\widehat{\mu }}_{i0}}. \end{aligned}$$

This statistic is a quadratic approximation for \(G^2\) and is equivalent to the Pearson statistic for goodness-of-fit testing when the full model is the saturated one. Haberman (1977) showed that under \(H_0: \beta = \beta _{0}\), \(X^2\) has the same limiting distribution as \(G^2\) for large, sparse tables. This includes the case in which the number of cells in the table grows with the sample size, such as occurs with a continuous explanatory variable.

Agresti and Ryu (2010) proposed the asymptotic \(100(1 - \alpha )\%\) CI for a generic parameter \(\beta\) as the set of \(\beta _{0}\) values such that \(\ X^2 \le \chi ^2_{1,1-\alpha }\). This is a pseudo-score CI, as \(X^2\) is the score test statistic only if the full model is saturated. Agresti and Ryu (2010) noted that the pseudo-score CI is available even when the score CI itself is not easily obtainable. In addition, the pseudo-score method generalizes to sampling schemes more complex than simple multinomial sampling and to discrete distributions other than the multinomial, such as Poisson regression models. For generalized linear models with canonical link function and independent observations \(\{y_i\}\) from a specified discrete distribution, Lovison (2005) showed that the score statistic, given by the quadratic form in “Tests for Categorical Data as Score Tests”, is bounded above by the generalized Pearson-type statistic. Consequently, the asymptotic p-values for the ordinary score test are at least as large as those for the pseudo-score test, and CIs based on inverting the score test contain CIs based on inverting the pseudo-score test. Nonetheless, the pseudo-score method is useful when ordinary score methods are not practical, such as in more complex cases or when the link function is not canonical. In these situations, the pseudo-score CIs can be implemented with the same level of difficulty as profile likelihood CIs. Through simulations, Agresti and Ryu (2010) found that the pseudo-score method behaves similarly to the profile likelihood interval, sometimes with even slightly better performance in small samples. Also, as discussed in the next subsection, extensions of the pseudo-score method may apply to settings in which profile likelihood methods are not available.
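A minimal sketch of this inversion for a Poisson loglinear coefficient follows (simulated data; for Poisson counts the null variance equals the null mean, so the statistic below is also an instance of the generalized form given next). The constraint \(\beta = \beta _0\) is imposed through an offset.

```r
## Pseudo-score CI: invert X^2 = sum((mu_hat - mu_hat0)^2 / mu_hat0)
set.seed(1)
x <- rep(0:4, each = 10)
y <- rpois(length(x), exp(0.3 + 0.25 * x))
fit <- glm(y ~ x, family = poisson)
mu_full <- fitted(fit)

x2 <- function(beta0) {
  fit0 <- glm(y ~ 1, offset = beta0 * x, family = poisson)  # beta fixed
  mu0 <- fitted(fit0)
  sum((mu_full - mu0)^2 / mu0)
}
f <- function(b) x2(b) - qchisq(0.95, 1)
est <- unname(coef(fit)["x"])
c(uniroot(f, c(est - 2, est))$root, uniroot(f, c(est, est + 2))$root)
```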

The pseudo-score method generalizes to parameters of generalized linear models for discrete data, for instance in Poisson and negative binomial regression. Suppose \(\{Y_i, i = 1, \ldots , n\}\) are independent observations assumed to have a specified discrete distribution. Let \(v({\hat{\mu }}_{i0})\) denote the estimated variance of \(Y_i\) assuming the null distribution for \(Y_i\) and let \(\hat{\mathbf{V}}_0\) be the diagonal matrix containing such values. A Pearson-type statistic for comparing models in the generalized linear model setting (Lovison 2005) has the form

$$\begin{aligned} X^2 = \sum _i \frac{({\hat{\mu }}_i - {\hat{\mu }}_{i0})^2}{v({\hat{\mu }}_{i0})} = (\hat{\varvec{\mu }} - \hat{\varvec{\mu }}_0)'\hat{\mathbf{V}}_0^{-1}(\hat{\varvec{\mu }} - \hat{\varvec{\mu }}_0). \end{aligned}$$

This statistic also applies to a quasi-likelihood setting, in which one needs only to specify the expected values under the assumed models and the variance function (or more generally a matrix of covariance functions), without specifying a particular distribution (Lovison 2005).

Other Extensions of Score-Test-Based Inference

For longitudinal data and many other forms of clustered data, score tests are not readily available for popular models. A prime example is the set of models for which the likelihood function is not an explicit function of the model parameters, such as marginal models for longitudinal data. A popular approach for marginal modeling uses the method of generalized estimating equations (GEE). Because of the lack of a likelihood function with this method, Wald methods are commonly employed, together with a sandwich estimator of the covariance matrix of model parameter estimators. Boos (1992) and Rotnitzky and Jewell (1990) presented score-type tests for this setting.

In future research, the pseudo-score inference presented in “Pseudo-Score Inference with the Pearson Chi-Squared Statistic” may also extend to marginal modeling of clustered categorical responses. For binary data, let \(y_{it}\) denote observation t in cluster i, for \(t = 1, \ldots , T_i\) and \(i = 1, \ldots , n\). Let \(\mathbf{y}_i\) = (\(y_{i1}, \ldots , y_{iT_{i}})'\) and let \({\varvec{\mu }}_i = \mathbb {E}(\mathbf{Y}_i) = (\mu _{i1}, \ldots , \mu _{iT_{i}})'\). Let \(\mathbf{V}_i\) denote the \(T_i \times T_i\) covariance matrix of \(\mathbf{Y}_i\). For a particular marginal model, let \(\widehat{\varvec{\mu }}_i\) denote an estimate of \({\varvec{\mu }}_i\), such as the ML estimate under the naive assumption that the \(\sum _i T_i\) observations are independent. Let \(\widehat{{\varvec{\mu }}}_{i0}\) denote the corresponding estimate under the constraint that a particular parameter \(\beta\) takes value \(\beta _{0}\). Let \(\widehat{\mathbf{V}}_{i0}\) denote an estimate of the covariance matrix of \(\mathbf{Y}_i\) under this null model. The main diagonal elements of \(\widehat{\mathbf{V}}_{i0}\) are \(\widehat{\mu }_{it0}(1-\widehat{\mu }_{it0})\), \(t = 1, \ldots , T_i\). Separate estimation is needed for the null covariances, which are not part of the marginal model. Now, consider the statistic

$$\begin{aligned} X^2 = \sum _i (\widehat{{\varvec{\mu }}}_i - \widehat{{\varvec{\mu }}}_{i0})' \widehat{\mathbf{V}}_{i0}^{-1}(\widehat{{\varvec{\mu }}}_i - \widehat{{\varvec{\mu }}}_{i0}). \end{aligned}$$

With categorical explanatory variables, \(X^2\) applies to two sets of fitted marginal proportions for the contingency table obtained by cross classifying the multivariate binary response with the various combinations of explanatory variable values. The set of \(\beta _{0}\) values for which \(X^2 \le \chi ^2_{1,1-\alpha }\) is a CI for \(\beta\). Unlike the GEE approach, this method does not require using the sandwich estimator, which can be unreliable unless the number of clusters is large. Even with consistent estimation of \(\mathbf{V}_{i0}\), however, the limiting null distribution of \(X^2\) need not be exactly chi-squared because the fitted values result from inefficient estimates. It is of interest to analyze whether the chi-squared distribution tends to provide a good approximation. Extensions are possible for correlated discrete cases other than correlated categorical responses. As pointed out by Lovison (2005), unlike likelihood ratio test-type statistics, a Pearson-type statistic can be defined for any quasi-likelihood model, needing only to specify expected values under the model and variance-covariance functions.

Many research studies, especially those using surveys, obtain data with a complex sampling scheme. For example, most surveys do not use simple random sampling but instead a multi-stage sample that employs stratification and clustering. One can then replace \(\widehat{V}_0\) in the Pearson-type statistic just mentioned by an appropriately inflated or non-diagonal estimate of the covariance matrix. For such complex sampling designs, profile likelihood CIs are not available and need to be replaced by quasi-likelihood adaptations.

In one approach of this type, Rao and Scott (1981) proposed an extension of the Pearson chi-squared statistic for testing independence in a two-way contingency table when the data result from a complex survey design and the observations cannot be treated as realizations of iid random variables. In particular, they provided a correction of \(\widehat{V}_0\) for stratified random sampling and two-stage sampling. However, their test statistic requires that none of the observed cell counts equals zero. To overcome this limitation, Lipsitz et al. (2015) proposed Wald and score statistics for testing independence based on weighted least squares estimating equations.

Another possible extension of score-based inference concerns constrained statistical inference. Constrained statistical inference problems arise in categorical data analysis when there are inequality constraints on parameters, such as functions of conditional probabilities in an \(I \times J\) table. They are used to specify hypotheses of stochastic dominance, monotone dependence, and positive association in contingency tables (Agresti and Coull 1998; Dardanoni and Forcina 1998; Bartolucci et al. 2007, among others). To test them, the literature on constrained inference for categorical data (see Colombi and Forcina 2016, and the references quoted therein) has concentrated on the LR statistic and its asymptotic chi-bar-squared distribution, a weighted sum of chi-squared variables whose weights can be calculated exactly or sufficiently precisely via simulation (see the R packages ic-infer by Grömping 2010 and hmmm by Colombi et al. 2014).

Silvapulle and Sen (2005) presented an extensive review of testing under inequality restrictions and described two possible ways (global and local) to extend score statistics to inequality-constrained testing problems, giving proofs of the asymptotic equivalence, under some conditions, of these score-type and LR statistics. However, the LR statistic seems more widely used in constrained inference, possibly because of analytical and computational advantages (e.g. Molenberghs and Verbeke 2007). A research challenge is to investigate, including through simulations, the settings in which score-type testing is advantageous.

Inference in High-Dimensional Settings

In high-dimensional settings, the number of parameters can be very large, sometimes even exceeding the sample size. Then, a fundamental issue is to derive the theoretical properties of regularized estimators such as those using a lasso-type (Tibshirani 1996) penalty term. While several properties of regularized point estimators have been established, methods to adequately quantify estimation uncertainty and derive confidence intervals are an important topic under investigation, usually referred to as post-selection inference or selective inference. Classical inferential theory is not valid. Even if interest focuses only on a few parameters, with the others considered a nuisance, the score function is seriously affected by the dimension of the nuisance parameter. Recent developments explore how extensions of Rao’s score test can be utilized both for hypothesis testing and for confidence intervals in high-dimensional generalized linear models.

A key contribution is due to Ning et al. (2017). For a subset of parameters of interest, they proposed a new device, called a decorrelated score function, that can be used with high-dimensional logistic and Poisson regression models, among others. To illustrate, suppose the assumed model is characterized by a set of parameters \(\varvec{\theta }\) that can be partitioned as \(\varvec{\theta }= (\beta , \varvec{\gamma })\), where \(\beta\) is a finite-dimensional parameter of interest and \(\varvec{\gamma }\) is a high-dimensional nuisance parameter. Ning et al. (2017) applied a decorrelation operation to the score function, obtaining a score function for \(\beta\) that is uncorrelated with the nuisance score function. The decorrelated score test can be viewed as an extension of Rao’s score test, to which it is equivalent in a low-dimensional setting. For instance, consider a logistic regression model with covariates \(\varvec{Q} = (z, \varvec{x})' \in \mathbb {R}^p\), where z is the variable of interest with coefficient \(\beta\) and \(\varvec{x}\) contains the other covariates, with coefficients \(\varvec{\gamma }\) assumed to be sparse. Then, the log-likelihood function is

$$\begin{aligned} \ell (\beta , \varvec{\gamma })=\frac{1}{n} \sum _{i=1}^{n}\Big \{y_{i}\left( \beta z_{i}+\varvec{\gamma }' \varvec{x}_{i}\right) -\log \big [1+\exp \left( \beta z_{i}+\varvec{\gamma }' \varvec{x}_{i}\right) \big ]\Big \}. \end{aligned}$$

Ning et al. (2017) showed that when \(\varvec{\gamma }\) is high-dimensional, Rao’s score test statistic loses its asymptotic optimality properties, whether the nuisance parameter is estimated by maximum likelihood or by a regularized estimator. In particular, the score function no longer has a simple limiting distribution. The decorrelated score function of \(\beta\) is defined as

$$\begin{aligned} S(\beta , \varvec{\gamma }) = u_{\beta }(\beta , \varvec{\gamma }) - \varvec{w}' \, u_{\varvec{\gamma }}(\beta , \varvec{\gamma }), \end{aligned}$$

where \(u_{\beta }(\beta , \varvec{\gamma }) = \partial \ell (\beta ,\varvec{\gamma })/\partial \beta\) is the score function with respect to \(\beta\), and \(\varvec{w}' = \iota _{\beta \varvec{\gamma }} \, \iota _{\varvec{\gamma }\varvec{\gamma }}^{-1}\), with \(\iota _{\beta \varvec{\gamma }}\) and \(\iota _{\varvec{\gamma }\varvec{\gamma }}\) the corresponding blocks of the Fisher information matrix. The resulting function \(S(\beta , \varvec{\gamma })\) is uncorrelated with the score function \(u_{\varvec{\gamma }}(\beta , \varvec{\gamma })\) for the nuisance parameters.
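The decorrelation can be checked numerically. The sketch below evaluates per-observation score contributions at the true parameter values in a low-dimensional simulated example; in the actual high-dimensional method, \(\varvec{\gamma }\) and \(\varvec{w}\) are replaced by regularized estimates.

```r
## Numeric check: S is (empirically) uncorrelated with the nuisance scores
set.seed(2)
n <- 5000
z <- rnorm(n); x <- matrix(rnorm(n * 3), n, 3)
beta <- 0.5; gamma <- c(0.3, -0.2, 0.1)
mu <- plogis(beta * z + drop(x %*% gamma))
y <- rbinom(n, 1, mu)

r <- y - mu                       # per-observation score residuals
u_beta  <- z * r                  # score contributions for beta
u_gamma <- x * r                  # score contributions for gamma (n x 3)
V <- mu * (1 - mu)
I_bg <- colMeans(x * (V * z))     # I_{beta gamma}
I_gg <- t(x) %*% (x * V) / n      # I_{gamma gamma}
w <- solve(I_gg, I_bg)            # w = I_{gamma gamma}^{-1} I_{gamma beta}
S <- u_beta - drop(u_gamma %*% w)       # decorrelated score contributions
round(cor(cbind(S, u_gamma)), 3)[1, ]   # correlations of S with u_gamma ~ 0
```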

The score test for \(\beta\) requires estimates of both \(\varvec{\gamma }\) and \(\varvec{w}\) to compute the test statistic \({\widehat{S}} (\beta _0, \widehat{\varvec{\gamma }})\) under \(H_0: \beta =\beta _0\). Ning et al. (2017) proposed an algorithm for this computation and showed that it applies to several models, to several regularized estimators, and also to a multi-dimensional parameter of interest. Under \(H_0\), the test statistic

$$\begin{aligned} z_{DS} = \sqrt{n} \, {\widehat{S}}(\beta _0, \widehat{ \varvec{\gamma }}) \, / \sqrt{{\widehat{\sigma }}_S}, \end{aligned}$$

with \({\widehat{\sigma }}_S\) a consistent estimator of the variance of the decorrelated score function, has an asymptotic standard normal distribution under mild assumptions. In comparison with Wald-type tests for high-dimensional settings, such as the desparsifying method (Van de Geer et al. 2014), the decorrelated score test was shown through simulation to be slightly more powerful. The decorrelated score function can also generate valid confidence intervals for the parameters of interest (Shi et al. 2020).

High-dimensional data typically are sparse data, which can cause problems such as infinite estimates in models for categorical data because of complete or quasi-complete separation. Generally, with sparse data or infinite maximum likelihood estimates, it is popular to use Firth’s penalized-likelihood approach (Firth 1993). Siino et al. (2018) developed a penalized score test for logistic regression in the presence of sparse data by modifying the classical score function to partly remove the bias of the ML estimates due to sparseness. In particular, for logistic regression parameters, the authors showed through simulations that score-based CIs with Firth’s penalization perform better than competitors based on Wald and likelihood-ratio statistics in terms of coverage level and average width, even with small samples, strong sparsity, and sampling zeros, and for any number of covariates in the model.