1 Introduction

Ordinal data are collected in many fields such as Sociology, Psychology, Medicine, Economics and Marketing [1]. In these contexts, quite often, they represent responses of interviewees as ratings, preferences, judgments, opinions, and so on. In this setting, the statistical analysis focusses on regression models able to interpret the subjects’ selection process.

From a general point of view, ordinal data are usually analyzed by means of models based on the multinomial distribution, which do not imply that the probabilities of categories belong to a family of random variables. Different approaches rely on Item Response Theory [10, 63, 70], Underlying Response Variable models [6, 5356] or, more recently, stochastic binary search algorithms [8]. These proposals have been introduced also to pursue clustering or classification objectives [71].

More specifically, the paper considers as paradigms for regression models of ordinal data those based on cumulative functions and discrete mixtures: proportional odds [49] and cub models [58] are the simplest instances of the two approaches, respectively. Though derived from different principles, they are often compared in terms of fitting, parsimony, graphical tools and interpretation. In fact, it will shown that such models are just two specifications of a more general paradigm for the analysis of ordinal data.

Then, a focal point of the discussion concerns the specification of “cutpoints” (or thresholds) which are real values able to transform a continuous phenomenon (whose existence is real or virtual) into a sequence of ordinal categories in a one-to-one correspondence with the first m integers, for some known m. In addition, we emphasize the need to parameterize the role of uncertainty in the responses as an inherent component of human decisions.

Several classifications are legitimate and we set apart models which are generated by a family of random variables characterized by few parameters from those which do not search explicitly for it. In the first instance a close relationship with the data generating process is required whereas in the second case more stringent fitting aptitudes are pursued.

The paper is organized as follows: in the next section, we examine the main components to be considered for the analysis of data generated by a selection of ordered categories. Then, a general framework and its main inferential issues are discussed in Sect.  3. Sections 4 and 5 consider different models according to discrete and latent variable paradigms, respectively. Section 6 investigates a real case study and the interpretative usefulness of the proposed approach. Some final remarks conclude the paper.

2 Data generating process for ordinal data

When data expresses opinions, evaluations, judgments, perceptions, etc. the selection of an ordinal category is the final outcome of a complex activity involving the human mind in a multifacet space filled by knowledge, memory, emotion, instinct, deduction, etc. where several factors related to subjects, items and circumstances affect the answer.

An important point of the debate is the distinction between people who optimize (those who do their best to give accurate responses) and people who satisfice (those who choose to do just enough to give a plausible answer that may not be the optimal one), as emphasized by [43, 45]. In any case, an authoritative model of question answering [66] distinguishes four steps in selecting a category for the response: (i) comprehension of the question; (ii) retrieval of information; (iii) derivation of a judgment; (iv) formulation of a response. This cognitive process includes both the perceptual aspect (the rater’s perception of the content of the item) and the decisional aspect (the rater’s use of the available ordinal scale). In all these steps, a subjective decision is mixed with a varying fuzziness.

To summarize, a simplified version of this mechanism assumes that at least two components drive the process leading to an ordinal choice: a substantive personal attractiveness towards the item and an inherent indecision to select a single category from a list of ordered options. Although these components should be better specified in each case study, hereafter, for simplicity, we qualify them as feeling and uncertainty.

2.1 Feeling

The first reaction of the mind towards an item/question may be positive, negative or indifferent: if a more accurate selection is requested then respondents should evaluate a finer subdivision. This human reaction will be called feeling but, according to different contexts, it may be better specified as attraction, preference, worry, concern, liking, agreement, level of perception, etc.

Feeling is the result of the personal history of the respondent with respect to the given item and, in this respect, genetic factors, emotional facts, family traditions, achieved education, environmental pressures, media influence, etc. are substantial and not independent facts. Thus, a probability distribution should be able to explain the continuous perception which derives from several agents (most of them are unconscious) and manifests itself as a discrete response.

A community of interviewees reacts to questions/items in different manner according to the factors which influence the final selection; however, people are positively, negatively or indifferently oriented in different proportions with some spread of preferences around the modal option. When positive, negative or indifferent proportions are prevalent, we should expect a left-skewed, right-skewed or almost symmetric distribution of the responses, respectively. That is, a homogeneous community should have a unique mode and the responses distribution may be captured by a well specified probability mass function. On the contrary, if subgroups are present (according to gender, nationality, education, marital status, working conditions, income distribution, etc.) different modal values may be expected.

These arguments do not conflict with situations where some interviewees adhere to a specific category by laziness, ignorance, desire of privacy: this is a refuge option named shelter effect and represented as a degenerate distribution collapsed at the shelter category.

The psychological movement which generates an ordinal choice should be explained by a discrete random variable parsimoniously parameterized to include uni or multimodality, shelter effect and different skewness. If the output is conditional on manifest (concomitant) variables, the so-called subjects’ covariates, one should insert these information in the modelling structure so as to improve interpretation and fitting of the observed data.

2.2 Uncertainty

Uncertainty is the result of convergent and related factors of different origins: limited set of information, knowledge/ignorance of properties and/or characteristics of the item to be judged, personal interest/engagement in activities related to the specific or similar items, amount of time devoted to the response, nature of the scale in terms of range and wording, tiredness or fatigue for a correct comprehension of the wording, willingness to joke and fake, lack of self-confidence of the respondent, apathy/boredom in the selection mechanism. Satisficing behaviour generates a varying degree of indecision [64] ranging from a complete lack of satisficing (totally accurate answer) to strong satisficing (totally random responses).

When a community of respondents manifests high uncertainty towards a given item a great heterogeneity in the distribution of responses is observed (the proportions of different categories tend to be similar), whereas if the uncertainty in the responses is extremely low most of the respondents select the same or surrounding categories (the distribution tends to be concentrated on a limited number of options). This component is not related to the variability of the responses (which is computed with the values \(r,\,r=1,\dots ,m\) of the ordinal scores) but to the variability of the numeric values of the probabilities \(p_r,\,r=1,\dots ,m\) (which involves the concept of heterogeneity of the distribution and derives from variability across individuals).

A further aspect is the overdispersion, that is the possibility that respondents manifest an inter-subject variability higher than those expected [36]. Overdispersion may be caused but not exhausted by uncertainty; thus, in some circumstances, a formal specification of the overdispersion may be effectively and parsimoniously achieved [37].

3 A unifying perspective

A framework which encompasses different regression models for ordinal data expresses the probability of a response as a finite mixture of (at least) two components which we have named as feeling and uncertainty (see Fig. 1).

Fig. 1
figure 1

Stochastic mechanism of a discrete ordinal choice

Hereafter, with respect to a given item \(\mathcal {I}\) submitted to a sample of n respondents, we denote as \(R_i\) the final response expressed by the i-th subject and as \(Y_i\) and \(V_i\), \(i=1,\dots ,n\), the corresponding random variables for feeling and uncertainty, respectively. All these random variables are defined over the discrete support \(\{1,\dots ,m\}\), for a known m. Let \(\varvec{\theta }=(\varvec{\beta }, \varvec{\Psi })\) and \(\varvec{\Psi }=(\varvec{\gamma },\mathcal {T}_m)\) the set of parameters characterizing the distribution of \((R_1,\dots ,R_n)\) and of the feeling component, respectively. Here, \(\mathcal {T}_m=(\tau _1,\dots ,\tau _{m-1})\) denotes a vector of parameters, called thresholds and possibly involved in the distribution of \(Y_i\), as specified in the following.

From a formal point of view, the feeling component \(Y_i\) assumes (categorical) values \(c_1, \dots , c_m\) and, for convenience, we score them by letting \(c_j=j\), for \(j=1,\dots ,m\). If the i-th subject is attracted by the j-th category as the preferred response to a given item we will denote this circumstance as:

$$\begin{aligned} Y_i=j\,, \qquad \text {for}\,\, i=1,\dots ,n\,\,\, \text {and}\,\, j=1,\dots ,m. \end{aligned}$$

A Generalized mixture model with uncertainty (gem) is defined as follows:

$$\begin{aligned} Pr\displaystyle \left( R_i=j\mid \varvec{\theta }\right) =\pi _i\, Pr\displaystyle \left( Y_i=j \mid \varvec{t}_i^{(\varvec{\gamma })}, \varvec{\Psi }\right) \,+\,(1-\pi _i)\,Pr\displaystyle \left( V_i=j\right) \,, \end{aligned}$$
(1)

for \(i=1,\dots ,n\) and \(j=1,\dots ,m\), where \(\pi _i=\pi (\varvec{t}_i^{(\varvec{\beta })}, \varvec{\beta })\, \in (0,\,1]\) are introduced to weight the two components and \(\varvec{t}_i^{(\gamma )} \in \varvec{T}^{(\varvec{\gamma })}\)\(\varvec{t}_i^{(\varvec{\beta })} \in \varvec{T}^{(\varvec{\beta })}\) include the selected covariates for the i-th subject. Available information on the subjects’ characteristics (covariates) are collected in a matrix \(\varvec{T}\), and \(\varvec{T}^{(\varvec{\beta })}\) and \(\varvec{T}^{(\varvec{\gamma })}\) are submatrices of \(\varvec{T}\) containing the regressors for \(\pi _i\) and \(Y_i\), respectively.

The probability distribution of the uncertainty component \(V_i\) is assumed known on the basis of a priori assumptions and it does not require parameters. Quite often, \(V_i\) is a discrete Uniform random variable over the support \(\{1,\dots ,m\}\) although some different distributions without parameters may be usefully introduced to take account of different response styles (as discussed by [27], for instance). An increase of \(\pi _i\) implies a reduced impact of the uncertainty component; thus, \((1-\pi _i)\) is a (normalized) measure of the importance of the uncertainty in the model (1). The case \(\pi _i \equiv 0\) is logically possible if we conjecture that the selection process is equivalent to a totally random choice; however, this assumption rules out the identifiability of the parameters concerning the feeling. Thus, we prefer to set \(\pi _i>0,\,\forall i\) and, to simplify notation, we drop in \(\pi _i\) the (possible) dependence on \(\varvec{t}_i^{(\varvec{\beta })}\) and \(\varvec{\beta }\,\).

Turning to the feeling component \(Y_i\), ordinal phenomena may be interpreted as genuinely observed or derived by a continuous variable \(Y^*\) (a latent variable) which for convenience or necessity is examined in a discrete version by means of Y. This dichotomy, expressed by [4], and also discussed by [1] and [67], is a fundamental issue in ordinal data modelling since it may lead to different specifications. In the first case the correspondence with integers is immediate (although the scale is not metric) and the assumptions about a latent variable are not so important because the statistical procedures involve a direct consideration of Y. In the second case the discretization of Y is obtained by means of cutpoints (thresholds) \(\tau _j,\,j=1,\dots ,m-1\) which transform the continuous support of \(Y^*\) into a sequence of ordered bins related to Y.

In general, the probability distribution of the feeling component \(Y_i\) is

$$\begin{aligned} \left\{ \begin{array}{ll} Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })}\right) \,, &{} \text {if specified via a discrete distribution};\\ F_{Y^*_i}(\tau _j; \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })})-F_{Y^*_i}(\tau _{j-1}; \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })})\,, &{} \text {if specified via a latent variable distribution}; \end{array} \right. \end{aligned}$$

where \(F_{Y^*_i}(\tau _j; \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })})=Pr\displaystyle \left( Y^*_i \le \tau _j \mid \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })}\right) \) is the distribution function of the latent variable \(Y_i^*\).

The interpretation of (1) is two-fold:

  • according to the logic of latent class models [31, 46], respondents split into two clusters in proportions \((\pi _i)\) and \((1-\pi _i)\), respectively, and one class consists of people whose selection is completely random;

  • each respondent has propensities \((\pi _i)\) and \((1-\pi _i)\) to select an ordinal category with a meditated and a random choice, respectively.

We strongly support the second interpretation. In fact, finite mixture models for ordinal data have been already discussed [12, 28, 29, 72]; however, all these experiences concern combinations of probability distributions belonging to the same family and they implicitly assume that several subgroups justify such assumptions. As an instance, a proportion of Uniform distribution has been introduced to capture outliers (see: [5] for the continuous case).

A possible classification of regression models for ordinal data derives from the information available on the set of cutpoints:

$$\begin{aligned} \mathcal {T}_m=\{-\infty =\tau _0\,<\,\tau _1\,<\,\cdots \,<\,\tau _{m-1}\,<\,\tau _m=+\infty \}\,. \end{aligned}$$

In this regard, we distinguish between supervised (classes I and II) and unsupervised discretization (class III) and, to simplify discussion and notation, we will assume fixed cutpoints.

  1. I

    There are no cutpoints to define the ordered sequence, thus \(\mathcal {T}_m \equiv \emptyset \). In this case, the probability mass function of \(Y_i\) is directly specified.

  2. II

    Cutpoints are defined on a priori basis and are assumed to be known. A density function for \(Y^*_i\) is required and \(Y^*_i\) is transformed into a discrete version \(Y_i\) by means of some acceptable but arbitrary convention. A simple solution, valid for \(Y^*_i\) with a finite range \((\tau _{min},\,\tau _{max})\), consists in selecting equi-distributed thresholds. It implies that, for each \(i=1,\dots ,n\), the latent variable \(Y^*_i\) is discretized into \(Y_i\) by means of the transformation:

    $$\begin{aligned} \tau _{min}+(j-1)\,d <\,Y^*_i\,\le \,\tau _{min}+j\,d \,\,\,\Longrightarrow \,\,\, Y_i=j\,, \quad j=1,\dots ,m\,, \end{aligned}$$

    where \(d=(\tau _{max}-\tau _{min})/m\). Alternative solutions are possible to take account of rounding effects, for instance.

  3. III

    Cutpoints are unknown and may be located everywhere on the real support of \(Y^*_i\). The mapping from \(Y^*_i\) to \(Y_i\) is obtained by an ad hoc specification of cutpoints which are estimated by data. Notice that, since \(Y^*\) is a latent variable, both the scale and the origin cannot be identified ([2], pp.13–15). Then, for the continuous (latent) variable \(Y^* \in (\tau _{min},\,\tau _{max})\), where one or both \(\tau _{min}\) and \(\tau _{max}\) may be infinite, the correspondence between \(Y^*_i\) and \(Y_i\) becomes:

    $$\begin{aligned} \tau _{j-1}\,<\,Y^*_i\,\le \,\tau _{j} \quad \Longrightarrow \quad Y_i=j\,, \quad \,\, \text {for}\,\, i=1,\dots ,n\,\,\, \text {and}\,\, j=1,\dots ,m. \end{aligned}$$

In all situations, \(Y_i\)  is a discrete random variable characterized by a probability mass function \(Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{t}_i^{(\varvec{\gamma })}\right) \). However, this discretization may be reached in a direct way (case I) or by means of the probability distribution function \(F_{Y^*_i}(.)\) of a continuous latent variable \(Y_i^*\) and the knowledge (classes II) or the estimation (class III) of the cutpoints \(\tau _j \in \mathcal {T}_m\) which are necessary to transform \(Y^*_i\) into \(Y_i\).

Discretization implies a different knowledge about the generating mechanism of the ordinal data. The uniform splitting (class II) reproduces the shape of the continuous variable \(Y^*\); as an instance, if a latent variable is strongly right-skewed we are assigning high probabilities to the first (low) categories. Instead, a general splitting (class III) may change the distribution of the latent variable \(Y^*\) by assigning the concentration of probabilities to selected categories thanks to a convenient subdivision of the original range. In a sense, the option to estimate cutpoints lets the observed data to become the criteria of subdivision of the range of \(Y^*\) (unsupervised choice). However, the main burden of estimable cutpoints is that the number of parameters increases by \((m-1)\), and ceteris paribus this leads to a less parsimonious model.

The deterministic aspects of gem model (1) concern the relationships of the weight of uncertainty (related to \(\pi _i\)) and the intensity of feeling (denoted with \(\xi _i\)) with respect to subjects’ covariates. These links are expressed by:

$$\begin{aligned} \pi _i=g_{\pi }\left( \varvec{t}_i^{(\varvec{\beta })}, \varvec{\beta }\right) ;\qquad \xi _i=g_{\xi }\left( \varvec{t}_i^{(\gamma )}, \varvec{\Psi }\right) \,. \end{aligned}$$
(2)

Then, according to the GLM jargon, the stochastic and systematic components of the response \(R_i\) will be specified by (1) and (2), respectively.

Here, \(\pi _i\) and \(\xi _i\) summarize the statistical content of the model (1) with respect to uncertainty and feeling, respectively. As it will be evident in Sects. 4, 5, the quantities \(\xi _i\) may be related to the expectation or distribution function of \(Y_i\) or directly to the parameters which characterize such a distribution.

Common specifications are the logistic, Gaussian or complementary log-log functions [1], for instance. Formally, we get:

$$\begin{aligned} \left\{ \begin{array}{l} logit(\pi _i)=\varvec{t}_i^{(\varvec{\beta })}\, \varvec{\beta };\\ logit(\xi _i)=\varvec{t}_i^{(\varvec{\gamma })}\, \varvec{\gamma };\qquad \text {or} \qquad H^{-1}\bigg (Pr\displaystyle \left( Y_i \le j \mid \mathcal {T}_m,\,\varvec{\gamma }\right) \bigg )=\tau _j-\varvec{t}_i^{(\varvec{\gamma })}\, \varvec{\gamma }\,. \end{array} \right. \end{aligned}$$

where \(logit(p)=log\left( p/(1-p)\right) \) and H(.) is a monotone increasing function mapping \(\mathbb {R}\) into (0, 1); the logit(.) function and the inverse of the Gaussian distribution function are just two instances of \(H^{-1}(.)\).

For inferential purposes, given a sample \(\varvec{r}=(r_1,\dots ,r_n)'\) of ratings and a matrix \(\varvec{T}\) containing all subjects’ covariates, the likelihood function is:

$$\begin{aligned} \mathcal {L}(\varvec{\theta })=\prod _{i=1}^{n}\,\prod _{j=1}^{m}\,\left[ Pr\displaystyle \left( R_i=j \mid \varvec{\theta }\right) \right] ^{I(r_{i}=j)} \end{aligned}$$

and the corresponding log-likelihood function is

$$\begin{aligned} \ell (\varvec{\theta })=\log \mathcal {L}(\varvec{\theta })=\sum _{i=1}^{n}\,\sum _{j=1}^{m}\,I(r_{i}=j)\,\log \Big [ \pi _i\,Pr\displaystyle \left( Y_i=j \mid \varvec{t}_i^{(\varvec{\gamma })}, \varvec{\Psi }\right) \,+\,(1-\pi _i)\,p^V_j\Big ]\,, \end{aligned}$$

where \(I(r_{i}=j)=1\) if the i-th respondent selects the j category and \(I(r_{i}=j)=0\) otherwise. Hereafter, for simplicity, we let \(p^V_j=Pr\displaystyle \left( V_i=j\right) \).

For this class of models, the EM procedure [25] is an effective algorithm to reach convergence almost everywhere for general mixture distributions [26, 50]. Consistent starting values may be derived for most of the mixtures hereafter examined (as in [34], for instance) and convergence may be achieved when the increase in log-likelihood is less than a prefixed small value. Standard analyses based on deviance and likelihood ratio tests apply and \(AIC,\,BIC\) indexes are useful to compare non-nested models.

We mention the comparison of estimated probabilities \(\hat{p}_j\) with the corresponding relative frequencies \(f_j\) by means of a dissimilarity measure (Diss) or a fitting index \((\mathcal {F}^2)\) defined, respectively, by:

$$\begin{aligned} Diss=\frac{1}{2}\,\sum _{j=1}^{m}\,\mid f_j\,-\,\hat{p}_j\mid \,; \quad \qquad \mathcal {F}^2=1-Diss\,. \end{aligned}$$

Both of them are normalized in [0,1] and are especially useful in gem models without covariates or when covariates have a discrete support.

In the next Sections, we will discuss some specifications of (1), (2) which lead to well known statistical models, so far presented as alternative approaches to the analysis of ordinal data.

4 Specification of models via discrete random variables

In case latent variables are not involved in the specification of the probability distribution, models for ordinal data are derived on the basis of an explicit data generating process without the knowledge of cutpoints (Class I). Considered as a whole, these probability distributions represent a sequential strategy aimed to fit the observed data with increasing complexity.

4.1 IHG models

The generating process of the marginal rank distribution of a given item (which can be considered as an ordinal evaluation of the item) has been studied by means of a discrete random variable [20, 22, 23]. More specifically, the inverse (or negative) hypergeometric distribution (IHG) has been introduced to mimic the waiting time of a white ball in the sequential drawing from an urn which contains black and white balls in some definite proportion [30, 51, 73].

No uncertainty is present (then, \(\pi _i \equiv 1\)), and the stochastic and systematic components of the ihg model are:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma }\right) =\xi _i (1-\xi _i)^{j-1}\,\frac{m-1}{m}\,\prod _{s=1}^{j}\,\frac{m-s+1}{m-s+\xi _i(s-1)}\,;\\ logit(\xi _i)=\varvec{t}_i^{(\varvec{\gamma })}\,\varvec{\gamma }\,; \end{array} \right. \end{aligned}$$

for \(j=1,\dots ,m\) and \(i=1,\dots ,n\). The expectation of this random variable is \(\mathbb {E}\left( Y_i\right) =\displaystyle \frac{m-\xi _i}{1+\xi _i\,(m-2)}\). Then, a different parameterization which involves the mean value is possible.

This model is especially useful in case subjects judge the items as extreme (the worst or the best in a list) since the modal value of the ihg distribution may be 1 or m. The skewness is positive (negative) according to \(\xi _i<1/m\) (\(\xi _i>1/m\)); if \(\xi _i=1/m\) the ihg random variable becomes the discrete uniform distribution. A positive (negative) feeling towards the item is determined by \(\xi _i<1/m\) (\(\xi _i>1/m\)); as a consequence, the interpretation of the parameter depends on the maximum of the support. This characteristic causes some difficulties in comparing ihg models when the number of categories is different.

4.2 Shifted binomial models

To take account of an intermediate modal value, a shifted binomial (SB) random variable has been introduced as a proxy of the generating process of paired comparison of categories [21]. If \(m=2\), the so-called Binomial regression is obtained.

Again, the proposal does not include uncertainty (then, \(\pi _i \equiv 1\)) and the stochastic and systematic components of (1) are:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma }\right) ={m-1 \atopwithdelims ()j-1} \xi _i^{m-j} \left( 1-\xi _i\right) ^{j-1}\,;\\ logit(\xi _i)=\varvec{t}_i^{(\gamma )}\,\varvec{\gamma }; \end{array} \right. \quad j=1,\dots ,m;\,\,\,i=1,\dots ,n. \end{aligned}$$

Since \(\mathbb {E}\left( Y_i\right) =1+(m-1)(1-\xi _i)=m-(m-1)\,\xi _i\), a direct parameterization by using the mean value is possible.

Fig. 2
figure 2

A bi-modal distribution well fitted by a shifted binomial model with a dichotomous covariate: conditional (left panel) and simulated (right panel) distributions

The SB random variable is a unimodal distribution but this model may well be exploited for fitting/interpreting multimodal observations if explanatory covariates (as a dichotomous one, for instance) are able to characterize the clusters. As an instance, for \(m=9\), Fig. 2 shows the implied distributions (conditional to \(D_i=0,1\), respectively) of the following SB model:

$$\begin{aligned} \left\{ \begin{array}{lll} Pr\displaystyle \left( R_i=j\right) &{}=&{}\left( {\begin{array}{c}8\\ j-1\end{array}}\right) \,\xi _i^{8-j}\,(1-\xi _i)^{j-1}\,;\\ logit\big (\xi _i\big )&{}=&{}-1.362\,+\,2.744\,D_i\,; \end{array} \right. \qquad j=1,2,\dots ,9;\,\,\,i=1,2,\dots ,n. \end{aligned}$$

and a simulated frequency distribution generated by this model.

A relationship of the shifted binomial distribution with item response theory (IRT) has been advanced by [3] in the context of Likert-type personality measures to support the idea of a genuinely discrete generating process of the responses to a given item.

4.3 CUB models

The mixture defined cub model is a (convex) Combination of a discrete Uniform and a (shifted) Binomial distribution. It has been introduced by [58] on the basis of psychological arguments and further discussed by [24, 38, 41, 59], among others. A package in R [42] for building cub models and several generalizations [40] is available.

This class of models is very flexible and parsimonious since observed data with different location, variability and skewness are adequately fitted with just two parameters. In addition, given the one-to-one correspondence between the parameters \((\pi _i,\,\xi _i)\) and a probability distribution conditional on the observed covariates for the i-th subject, a cub model allows effective interpretations when parameter estimates are examined as points in the parameter space (i.e., unit square). As a matter of fact, this visualization conveys immediate information in terms of uncertainty and feeling, respectively.

Formally, a cub model is specified by the stochastic and systematic components given, respectively, by:

$$\begin{aligned} \left\{ \begin{array}{lll} Pr\displaystyle \left( R_i=j \mid \varvec{\beta },\,\varvec{\gamma }\right) &{}=&{}\pi _i\,b_{j}(\xi _i)+(1-\pi _i)\,p^V_j\,;\\ logit\left( \pi _i\right) =\varvec{t}_i^{(\varvec{\beta })} \varvec{\beta }\,;&{}&{}\qquad logit\left( \xi _i\right) =\varvec{t}_i^{(\varvec{\gamma })}\varvec{\gamma }\,;\\ \end{array} \right. \quad j=1,\ldots ,m\,;\,\, i=1,\ldots ,n\,. \end{aligned}$$
(3)

We have set \(b_j\left( \xi _i\right) ={m-1 \atopwithdelims ()j-1} \xi _i^{m-j} \left( 1-\xi _i\right) ^{j-1}\) and \(p^V_j=1/m\), \(j=1,\ldots ,m\), for the probability mass functions of the shifted binomial and discrete Uniform random variable, respectively. Here, \((\varvec{t}_i^{(\varvec{\beta })}, \varvec{t}_i^{(\varvec{\gamma })})\) are the information set for the i-th subject extracted from \(\varvec{T}\) to specify the relationships of \(\pi _i\) and \(\xi _i\) with the corresponding subjects’ covariates. Of course, the deterministic part of model (3) is not limited to the logistic link since any monotone mapping \(\mathbb {R}^p\,\leftrightarrow \,(0,1)\) between linear combinations of covariates and parameters is legitimate.

cub models have been also applied to a collection of items related to the same questionnaire to describe, classify or discriminate the behaviour of respondents with respect to the different items: these objectives may be pursued without any reference to subjects’ covariates for clustering objectives [14, 16] or for missing data imputation [19]. Furthermore, multilevel extensions [35] and multivariate generalizations [13, 15] have been advanced since the approach has been interpreted as a new paradigm for modelling ordinal data.

An important variant has been advanced by [48] who proposed non-linear cub models specified by a non-constant transition probability from one category to the next. In this perspective, cub models are the simplest solution for they assume a constant transition probability derived by some approximation.

4.4 CUBE models

To take account of a possible overdispersion in ordinal data, [36] proposed a mixture model, denoted as cube (Combination of Uniform and BEta-Binomial random variable), which has been further extended with the introduction of subjects’ covariates [60]. The stochastic and the systematic components of this model are defined by:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( R_i=j \mid \varvec{\theta }\right) =\pi _i\,h_{j}(\xi _i, \phi _i)+(1-\pi _i)\,p^V_j\,;\\ logit\left( \pi _i\right) =\varvec{t}_i^{(\varvec{\beta })} \varvec{\beta }\,;\,\,\, logit\left( \xi _i\right) =\varvec{t}_i^{(\varvec{\gamma })} \varvec{\gamma }\,;\,\,\, \log \left( \phi _i\right) =\varvec{t}_i^{(\varvec{\alpha })} \varvec{\alpha }\,;\\ \end{array} \right. \end{aligned}$$
(4)

for \(j=1,\dots ,m\) and \(i=1,\dots ,n\). Generally, \(p_j^V=1/m,\,j=1,\dots ,m\). We let \(\varvec{\theta }=(\varvec{\beta },\, \varvec{\gamma },\varvec{\alpha })\) whereas \(h_{j}(\xi _i, \phi _i)\) is the (shifted) Beta-Binomial distribution of the feeling of the i-th subject which has been parameterized as follows:

$$\begin{aligned} h_j(\xi _i,\phi _i)=\left( {\begin{array}{c}m-1\\ j-1\end{array}}\right) \,\displaystyle \frac{\displaystyle \prod _{k=1}^{j}\Big [1-\xi _i+\phi _i\,(k-1)\Big ]\, \displaystyle \prod _{k=1}^{m-j+1}\Big [\xi _i+\phi _i\,(k-1)\Big ]}{\Big [1-\xi _i+\phi _i\,(j-1)\Big ]\,\Big [\xi _i+\phi _i(m-j)\Big ]\, \displaystyle \prod _{k=1}^{m-1}\Big [1+\phi _i\,(k-1)\Big ]}\,, \end{aligned}$$
(5)

for \(j=1,\dots ,m\) and \(i=1,\dots ,n\). If \(\phi _i \equiv 0,\,\forall i\) we get a cub model; thus, cub are nested into cube models and the selection between them is made effective by means of likelihood ratio tests.

Simulations and experimental evidence show that an explicit parameterization of a possible overdispersion allows a more correct analysis of the uncertainty in the estimated model [37].

4.5 Models with varying uncertainty

Hitherto, the discrete Uniform random variable has been considered as the building block to account for a subjective propensity to indecision: this solution is motivated by simplicity and parsimony.

In order to fit real situations generated by different response styles of interviewees, a greater flexibility has been suggested for \(p_j^V\) [27]; it can be achieved by selecting one of alternative discrete distributions over the support \(\{1,\dots ,m\}\), without the addition of further parameters. Formal tests have been introduced in this respect.

cub models with a varying uncertainty (vcub models) belong to the gem models family (1) with a convenient specification of \(p_j^V\) and vice versa, all gem models may be safely considered with a varying uncertainty in \(p_j^V\).

4.6 The family of models with a shelter component

In psychological, sociological, political and marketing surveys an important proportion of respondents may choose a category which represents a sort of “refuge” to avoid a more demanding selection: this circumstance has been named a shelter effect and the corresponding option a shelter category [16]; then, a cub model with a shelter effect as a sort of c-inflated distribution has been considered [32].

4.6.1 GeCUB models

A cub model with a shelter effect which includes subjects’ covariates has been referred to as a g e cub model [39, 41]. The stochastic and the systematic components are specified by:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( R_i=j\mid \varvec{\theta }\right) =\delta _i\,\biggl [\,D_{j}^{(c)}\,\biggr ]\,+\,(1-\delta _i)\,\biggl [\,\pi _i\,b_{j}(\xi _i)+(1-\pi _i)\,p_j^V\,\biggr ];\\ logit\left( \pi _i\right) =\varvec{t}_i^{(\varvec{\beta })} \varvec{\beta }\,;\quad logit\left( \xi _i\right) =\varvec{t}_i^{(\varvec{\gamma })} \varvec{\gamma }\,;\quad logit\left( \delta _i\right) =\varvec{t}_i^{(\varvec{\omega })} \varvec{\omega }\,;\quad i=1,\ldots ,n\,;\\ \end{array} \right. \end{aligned}$$
(6)

where \(\varvec{\theta }=(\varvec{\beta },\,\varvec{\gamma },\,\varvec{\omega })\) and the dummy (degenerate) random variable is defined by: \(D_{j}^{(c)}=I(j=c)\), with I(E) the indicator function of E. The integer \(c \in [1,m]\) is assumed known on the basis of the nature of the selected option. Again, the logistic link may be substituted by a continuous monotone mapping from \(\mathbb {R}\) to (0, 1).

The stochastic component of (6) may be written as

$$\begin{aligned} Pr\displaystyle \left( R_i=j\mid \varvec{\theta }\right) =\pi _i^*\,Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{\omega }\right) +(1-\pi _i^*)\,p_j^V \end{aligned}$$

where

$$\begin{aligned} \left\{ \begin{array}{lll} \pi _i^*&{}=&{}\pi _i+\delta _i\,(1-\pi _i)\,;\\ Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{\omega }\right) &{}=&{}\left( \frac{\delta _i}{\delta _i+\pi _i\,(1-\delta _i)}\right) \,D_{j}^{(c)}+\left( \frac{\pi _i\,(1-\delta _i)}{\delta _i+\pi _i\,(1-\delta _i)}\right) \,\,b_{j}(\xi _i)\,; \end{array} \right. \end{aligned}$$

then, g e cub models fit the gem model family defined by (1), (2).

4.6.2 “Don’t know” models

The family of cub models has been exploited to cope with non-responses motivated by a “Don’t know” option [47]. The empirical results are really convincing and the new proposal captures respondents’ opinions with more fidelity than the simplistic solution to exclude “Don’t know” responses.

The specification of this model is:

$$\begin{aligned} Pr\displaystyle \left( R_i=j\right) =Pr\displaystyle \left( A=0\right) \,Pr\displaystyle \left( R_i=j \mid A=0\right) \,+\,Pr\displaystyle \left( A=1\right) \,p_j^V\,, \end{aligned}$$

where A is a dichotomous random variable indicating whether the respondent is able or not to formulate the requested rating \(R_i\); it assumes values \(0,\,1\) with probabilities \(p_0\) and \(1-p_0\), respectively. Then, Authors argued that \(Pr\displaystyle \left( R_i=j \mid A=0\right) \) may be the probability mass function of a cub model.

These assumptions lead to a class of models included in (1):

$$\begin{aligned} Pr\displaystyle \left( R_i=j \mid \varvec{\theta }\right) =\pi _i^{**}\,Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{\omega }\right) \,+\,(1-\pi _i^{**})\,p^V_j,\quad i=1,\dots ,n\,; \,\,\, j=1,\dots ,m\,, \end{aligned}$$

where

$$\begin{aligned} Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{\omega }\right) =b_j(\xi _i);\qquad p^V_j=1/m,\,\forall j; \qquad \pi _i^{**}=\pi _i\,p_0\,; \end{aligned}$$

and \(\pi _i,\,\xi _i,\) are defined as in (3).

Although this model may be formally included in the family of cub models with a shelter effect (“Don’t know” response is a sort of refuge caused by difficulties to express an opinion), a substantial difference should be noted: the shelter category c belongs to the support \(\{1,\dots ,m\}\) whereas the “Don’t know” option is an added choice for the respondent. This characteristic justifies a two-step procedure for the estimation of \(p_0\) and the model parameters.

A possible extension includes subject’s covariates to characterize clusters of respondents who select the “Don’t know” option. In addition, to investigate the possible causes of the missing selection, we might consider objects’ covariates in a multi-item perspective, as pursued by [61], for instance. Thus, it would be possible to explain the reasons because some items have been considered by respondents more awkward than others.

5 Specification of models via latent variables

Some specifications of (1), which are based on the relationship with a latent variable and whose cutpoints are necessary to set the ordered categories, are now discussed. A difference between supervised (models of Class II: Sects. 5.1 and 5.2) and unsupervised discretization (models of Class III: Sects. 5.3 and 5.4) is considered. For these models, \(\xi _i=logit(\mu _i)\) and \(\xi _i=logit(F_{Y_i}(j))\), respectively.

5.1 CUN model

In a note relating cub models and IRT, a direct discretization of the Normal distribution has been proposed [62] by specifying model (1) with

$$\begin{aligned} Pr\displaystyle \left( Y_i=j \mid \mu ,\,\sigma \right) \,=\,\Phi \left( \frac{j+1/2-\mu }{\sigma }\right) -\Phi \left( \frac{j-1/2-\mu }{\sigma }\right) \,, \end{aligned}$$

where \(\Phi (.)\) is the distribution function of a standard Gaussian random variable and \((\mu \,,\sigma )\) are quantities estimable by data. Here \(\pi _i \equiv 1\) and \(\mathcal {T}_m=(\frac{1}{2},1+\frac{1}{2},\dots ,m-\frac{1}{2})\).

The introduction of the parameter \(\sigma \) for the variability improves the fitting of observed data; however, the uniform splitting and the symmetry of the Gaussian distribution around \(j=\left[ \mu \pm \frac{1}{2}\right] \) are severe constraints to cope with observed data.

Simple generalizations include: (i) an ad hoc splitting of cutpoints by means of thresholds \(\mathcal {T}_m\) to be estimated by data as in probit/logit models; (ii) a link between parameters and subjects’ covariates by means of:

$$\begin{aligned} logit(\mu _i)=\varvec{t}_i^{(\varvec{\gamma })} \varvec{\gamma }\,; \qquad \log (\sigma _i)= \varvec{t}_i^{(\varvec{\lambda })} \varvec{\lambda }\,; \end{aligned}$$

where \((\varvec{t}_i^{(\varvec{\gamma })},\, \varvec{t}_i^{(\varvec{\lambda })})\) are the information set derived from \(\varvec{T}^{(\varvec{\gamma })}\) and \(\varvec{T}^{(\varvec{\lambda })}\), the submatrices of \(\varvec{T}\) containing regressors for the mean level and variability, respectively.

5.2 Discretization of a Beta random variable

The noticeable flexibility of the shape of the Beta random variable over a finite continuous support has driven different Authors to address this distribution to get ordered values (updates references in [65]). Indeed, this random variable achieves “separation effects” in order to eliminate dependence of the variance from the expectation in robust experimental designs [11]. In a recent Ph.D. thesis [69], the split of the (0, 1) range of a Beta distribution into m equi-distributed cutpoints to obtain a probability mass function with a flexible shape has been proposed. The model does not consider uncertainty; thus, \(\pi _i \equiv 1,\, \forall i\) (see also [52]).

The parameterization for \(Y^*\) is:

$$\begin{aligned} h_{Y^*}(y)=\frac{\Gamma (\phi )}{\Gamma (\mu \,\phi )\,\Gamma (\phi \,(1-\phi ))}\,y^{\mu \,\phi -1}\,(1-y)^{(1-\mu )\,\phi -1}\,, \qquad y \in (0,1) \end{aligned}$$

with \(\mu >0\) and \(\phi >0\), as in Beta regression analysis [18]. As a consequence, the first two central moments are:

$$\begin{aligned} \mathbb {E}\left( Y^*\right) =\mu \,;\qquad Var(Y^*)=\frac{\mu \,(1-\mu )}{1+\phi }. \end{aligned}$$

Thus, \(\mu \) and \(\phi \) may be interpreted as location and precision parameter, respectively.

Similarly to [52], \(\mathcal {T}_m=\{0\,,1/m\,,2/m,\dots ,1\}\) has been selected as a supervised splitting. In presence of covariates, let us define:

$$\begin{aligned} \begin{array}{lll} Pr\displaystyle \left( \tau _{j-1}\,<\, Y_i^* \,<\, \tau _{j}\right) &{}=&{}\displaystyle \int _{\tau _{j-1}}^{\tau _j}\,h_{Y^*}(y)\,dy\\ &{}=&{}IB_{\mu _i\phi _i,(1-\mu _i)\phi _i}(\tau _{j})-IB_{\mu _i\phi _i,(1-\mu _i)\phi _i}(\tau _{j-1})\,, \end{array} \end{aligned}$$

for \(j=1,\dots ,m\) and \(i=1,\dots ,n\), where \(\tau _j=j/m\) for \(j=0,1,\dots ,m\) and \(IB_{\mu \phi ,(1-\mu )\phi }(.)\) is the incomplete Beta function with parameters \((\mu _i\phi _i,\,(1-\mu _i)\phi _i)\).

Then, the stochastic and systematic components of this Discrete-Beta model become, for \(j=1,\dots ,m\) and \(i=1,\dots ,n\):

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( Y_i=j \mid \varvec{\gamma },\,\varvec{\lambda }\right) =IB_{\mu _i\phi _i,(1-\mu _i)\phi _i}(\tau _{j})-IB_{\mu _i\phi _i,(1-\mu _i)\phi _i}(\tau _{j-1})\,;\\ logit(\mu _i)=\varvec{t}_i^{(\varvec{\gamma })} \varvec{\gamma };\quad \log (\phi _i)=\varvec{t}_i^{(\varvec{\lambda })} \varvec{\lambda }\,. \end{array} \right. \end{aligned}$$

The proposed distribution may be underdispersed with respect to the (shifted) Binomial one whereas cub models tend to have tails heavier than this discretized Beta random variable.

5.3 Cumulative models

In this class of models, the response \(R_i\) is interpreted as coincident with the expressed feeling \(Y_i\). A distinctive feature of the approach is the presence of (unsupervised) cutpoints in \(\mathcal {T}_m\). Since \(\pi _i \equiv 1\), a possible uncertainty/heterogeneity of data is spread around all categories and captured by the estimated cutpoints.

Assume that \(p\ge 1\) covariates are relevant for explaining the latent regression model by means of

$$\begin{aligned} Y_i^*= \varvec{t}_i^{(\gamma )} \varvec{\gamma }+ \epsilon _i,\qquad i=1, \ldots , n, \end{aligned}$$

where \(\epsilon _i \sim F_{Y^*_i}(\tau _j; \varvec{\gamma })\). Common choices for \(F_{Y^*}(.)\) are Gaussian, logistic or Complementary log-log distributions: see [1, 67], for details. Then, \(Pr\displaystyle \left( Y_i=j\mid \mathcal {T}_m,\,\varvec{\gamma }\right) =Pr\displaystyle \left( \tau _{j-1} < Y_i^*\le \tau _{j}\right) \).

If a logistic random variable for \(\epsilon _i\) is considered, the stochastic and systematic components of the model are:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( Y_i=j \mid \mathcal {T}_m,\,\varvec{\gamma }\right) \,=\,F_{Y^*_i}(\tau _{j}-\varvec{t}_i^{(\gamma )}\varvec{\gamma })-F_{Y^*_i}(\tau _{j-1}-\varvec{t}_i^{(\gamma )}\varvec{\gamma })\,;\\ logit\left[ Pr\displaystyle \left( Y_i \le j \mid \mathcal {T}_m\right) ,\,\varvec{\gamma }\right] \,=\,\tau _{j}-\varvec{t}_i^{(\gamma )} \varvec{\gamma }\,.\\ \end{array} \right. \end{aligned}$$
(7)

Of course,

$$\begin{aligned} Pr\displaystyle \left( Y_i \le j \mid \mathcal {T}_m,\,\varvec{\gamma }\right) =\left[ 1+\exp (-[\tau _{j}-\varvec{t}_i^{(\gamma )} \varvec{\gamma }])\right] ^{-1}\,,\,\,\,\, i=1,\ldots ,n\,;\,\,\, j=1,\ldots ,m. \end{aligned}$$

The parameter vector \((\mathcal {T}_m,\,\varvec{\gamma })\) consists of intercepts \(\mathcal {T}_m=(\tau _1,\dots ,\tau _{m-1})\) and covariate coefficients \(\varvec{\gamma }\). If there are no covariates the cumulative model is a saturated one and, given a sample of ordinal data, empirical and estimated distribution functions coincide.

Intercepts \(\tau _j\) characterize each category but model (7) assumes a constant effect \(\varvec{t}_i^{(\gamma )} \varvec{\gamma }\) for all categories. This restraint has been removed by considering a varying effect \(\varvec{t}_i^{(\gamma )} \varvec{\gamma }_j\) as in stereotype models [4], for instance. However, this extension increases the number of parameters to be estimated and requires a priori constraints to take the order of the categories into account: see ([1], pp 103–115) and ([67], p 260). Finally, a dispersion parameter may be added [49] and more general location-scale cumulative odds models may be considered [17]. Other generalizations include multilevel versions, varying choice of thresholds, multivariate setting, etc.

Ceteris paribus, model (7) presents constant log-odds for consecutive values of a single covariate and it has been denoted [49] as a Proportional Odds Model (pom) or, more correctly, a “proportional odds version of the cumulative logit model” ([1], p. 53). This assumption may be tested in practical applications by a formal score test [57] and/or a graphical device [44]. Other cumulative models (as adjacent category and continuation ratio models, for instance) have been introduced but their usage is more limited. Finally, sequential models support the selection of an ordinal score which implies a stepwise decision process ([67], pp 252–255). Notice that cumulative logit models can be also interpreted following both the IRT and URV approaches.

5.4 CUP models

Tutz et al. [68] have recently proposed to extend the class of cumulative models by introducing a component of uncertainty to improve the fitting. This mixture has been denoted as a cup model (that is, a Combination of a discrete Uniform and a Preference random variable).

Formally, the stochastic and systematic components of a cup model are:

$$\begin{aligned} \left\{ \begin{array}{l} Pr\displaystyle \left( R_i=j \mid \varvec{\theta }\right) =\pi _i\,\Big [F_{Y^*_i}(\tau _{j}-\varvec{t}_i^{(\gamma )}\varvec{\gamma })-F_{Y^*_i}(\tau _{j-1}-\varvec{t}_i^{(\gamma )}\varvec{\gamma })\Big ]\,+\,(1-\pi _i)\,\frac{1}{m}\,;\\ logit\left( \pi _i\right) =\varvec{t}_i^{(\varvec{\beta })} \varvec{\beta }; \qquad \qquad logit\left( Pr\displaystyle \left( Y_i \le j\right) \right) =\tau _{j}-\varvec{t}_i^{(\varvec{\gamma })} \varvec{\gamma }\,;\\ \end{array} \right. \end{aligned}$$

for \(j=1,\dots ,m\) and \(i=1,\dots ,n\). Then, a cup model may be considered a gem model (1) where \(Pr\displaystyle \left( Y_i=j \mid \mathcal {T}_m, \varvec{\gamma }\right) \) coincides with (7) and \(p_j^V\) is the probability mass function of a discrete Uniform random variable.

The systematic part of the model saves the traditional definition of a predictor (as in cumulative models) but considers as well parameters \(\pi _i\) (as in the family of cub models) to weight for the uncertainty component. Then, a plot of log-odds (which are independent of categories when the link is logistic) versus \(1-\pi _i\) has been proposed as a graphical tool to interpret the effect of subjects’ covariates.

Authors emphasize the ability of cup models to capture multimodality and empirical overdispersion with a better fitting. From an abstract point of view, the comparison with more parsimonious models (as those discussed in Sect. 4) should lead to prefer cup models since they achieve more fidelity given a greater number of parameters. Empirical evidence shows that this is not a rule when one considers both fitting and parsimony criteria, as derived by BIC measures, for instance.

6 A real case study

During 2007, a survey has been carried out in Naples, Italy, to score the most serious emergencies of the metropolitan area by using a Likert scale, with \(m=9\) (where 1 = “completely unimportant” and 9 = “absolutely serious”). Hereafter, we will emphasize only the ordinal responses to the “worry for organized crime” item (crime) of \(n=2381\) subjects. Sample data include the subjects’ covariates gender and age, among others.

Table 1 lists the main results based on log-likelihood and dissimilarities measures obtained by fitting parsimonious models based on discrete probability distributions (standard errors of estimated parameters are in parentheses). For the moment, we consider models without subjects’ covariates and for fitting purposes we compare the models which admits an explicit data generating process (see Sect. 4). More specifically, Fig. 3 shows the distribution of the observed responses (left panel) and compares them with the estimated probabilities obtained from a cube model (right panel).

Table 1 Models fitted to the crime response (without covariates)

We notice an almost perfect fit of cube model (dissimilarity index shows that less than \(1\,\%\) of responses should be changed to obtain a complete overlapping of the observed and estimated distributions) and this result is confirmed by a log-likelihood at the maximum which is very close to that of the saturated model (\(=-3994.56\)). Worry for this item is very strong (\(1-\hat{\xi }\) is always greater than 0.83) and all estimated models imply an expected response in \([7.1,\,7.5]\). In addition, uncertainty has been estimated around 0.30 by all models of cub family. However, most of uncertainty may be attributable to overdispersion, as shown by the cube model where the inclusion of \(\phi \) significantly cuts uncertainty by half.

Fig. 3
figure 3

Distribution of responses to crime and probabilities of a fitted cube model

No significant covariates affect uncertainty whereas gender and the transformed covariate lage=\(log(age)-\overline{log(age)}\) are significant for the feeling component. Thus, we estimate different models with gender, lage and lage \(^2\) and present results only for significant covariates. The (more complete) systematic component for the feeling turns out to be:

$$\begin{aligned} logit(\xi _i)=\gamma _0+\gamma _1\,\texttt {lage}_i+\gamma _2\,\texttt {lage}_i^2+\gamma _3\,\texttt {gender}_i\,,\quad i=1,\dots ,n\,. \end{aligned}$$

In addition, for the Discrete-Beta model we found a significant relationship of the dispersion parameter with gender and lage expressed by:

$$\begin{aligned} \log (\sigma _i)=\lambda _0+\lambda _1\,\texttt {lage}_i\,,\quad i=1,\dots ,n\,. \end{aligned}$$
Table 2 Models fitted to the crime response (with covariates)
Fig. 4
figure 4

Expected worry for crime as a function of gender and age according to ihg  (top-left), cub (top-right), cube (bottom-left) and cup (bottom-right) models

Table 2 lists the estimates obtained for the main models discussed so far. Opposite signs in d-beta, pom and cup models are expected given their different parameterizations.

In our case study, the significance of lage and gender is a common feature of all models with a similar impact whereas only ihg, cub and partially cube (with a fairly significant estimate) are able to capture a parabolic effect of lage on the responses. The unsupervised models (class III) improve the fitting and exclude lage \(^2\); then, the parabolic effect is masked by (accounted for) the estimated cutpoints which adapt themselves to the specific shape of the categories. Observe that the maximum of log-likelihood is obtained for a cup model (with 11 parameters) whereas a comparable fit is obtained by a cube model (with just 5 parameters) which turns out to achieve the minimum BIC.

A more effective picture displays the effect on the mean value \(\mathbb {E}\left( R_i\right) \) of a varying age, given the gender, for some of the previous models (Fig. 4). These plots show that the expected worry is regularly inferior for women; in addition, the effect of cube and cup models are quite similar whereas some difference in level is implied by ihg and cub models.

7 Conclusions

The specifications previously discussed may be further extended in several directions, by considering different distributions for the random variables \(Y_i\) and \(V_i\). Given the many possibilities offered by the proposed mixture it seems difficult to be exhaustive in this regard. Some of them are legitimate by a direct relationship with the data generating process and should be supported by empirical evidences.

According to the latent variable approach, possible candidates for the feeling are the negative exponential, Gamma and similar distributions: they are characterized by few parameters of location, variability and/or shape which can be related to subjects’ covariates by adequate link functions. Similarly, if we jointly consider items and subjects, the approach of 3PL model in IRT [9], which implies a “guessing option”, may be explored within the class of gem models.

Finally, the paradigm so far discussed emphasizes the role of a discrete support in modelling ordinal data and the importance to consider discrete states in the selection process, an approach fostered by [7] for longitudinal data and by [3] for personality measures.

A unifying perspective for modelling ordinal data is a useful framework to compare models, to discover unexpected similarities and to introduce new distributions. Most of all, the opportunity to see the same data from different points of view improves interpretation and prediction of ordinal responses by means of statistical models.