
1 Introduction

The spread of a culture of evaluation in Italian Universities has changed the logic and the development of several activities and procedures. As a consequence, Universities perform periodic surveys in order to assess students' satisfaction with the main conditions of teaching and with the environment where teaching takes place. In addition, several projects and research groups have been involved in statistical analyses of University evaluation.

In compliance with procedures established by law, responses are collected among students attending lectures in a period close to the end of courses. This circumstance influences the responses: the students involved in the survey have attended more than half of the lectures, so they are aware of the problems and tend to give substantially positive answers to the items of the questionnaire. In this context, we think that this large mass of data should be used in effective ways in order to discover useful information about the evaluation process.

This work is organized as follows: in Sect. 7.2 we discuss the concept of students' perception of teaching quality; in Sect. 7.3 we emphasize that the transformation from perception to rating is a complex decision, and thus calls for adequate statistical approaches. These considerations are deepened in Sect. 7.4 by examining the role of latent variables in the evaluation process and the main logical framework in which questionnaires are analyzed (Item Response Theory). In Sect. 7.5 we introduce a different model for evaluation data that explicitly aims at interpreting the probability distribution of the ordinal choices for each item. The mixture random variable we will discuss and the related generalizations of CUB models are briefly illustrated with special reference to students' perception and teaching evaluation. Then, in Sect. 7.6, empirical evidence from the survey conducted at the University of Naples Federico II is presented. Some concluding remarks end the chapter.

2 Measurement of Students’ Perception About Teaching Quality

The objective of the survey submitted to students is the measurement of satisfaction with University teaching and related aspects: timing, structures, course consistency, and so on. These measures are not physical characteristics of objects but psychological constructs related to respondents. This condition affects both the planning of the experiment and the analysis of the data.

In this context, the plan of the experiment consists of a list of several questions (items) submitted to students with regard to relevant aspects of University satisfaction. Most of the items are derived from the standard guidelines [24], and each University specifies and qualifies them on the basis of local requirements.

Since satisfaction is a continuous latent variable, responses to items are based on some ordinal scale (generally, high values correspond to high satisfaction). Thus, for each item, respondents are asked to select one of the first m integers of a Likert-type scale. In Italy, several questionnaires are based on a 4-point scale; however, we are strongly convinced that wider rating scales are more effective and convenient, even for dynamic comparisons and selective discussions of the results (in any case, we suggest an odd number of alternatives).

In the statistical literature, the analysis of expressed ratings is usually performed by means of several exploratory and inferential methods. However, in the periodical reports of the “Nuclei di Valutazione” and in Council meetings (“Consigli di Facoltà”, “Corsi di Laurea”), simple indicators are presented as common benchmarks for discussion and decisions. As a rule, they are related to the frequency of positive answers and/or the average of quantified responses (sometimes along with dispersion measures). Most of the critical discussion of the evaluation process is currently based on these measures.

We think that indicators without models may cause misunderstanding. Surely, numerical syntheses simplify patterns and complex considerations and allow a large audience to interact with the results and reach a final judgement. However, the reduction of a large mass of data to just one or two indicators, without reference to a generating process, may often be misleading, even if some indicators seem illuminating. Everybody knows that the average is a correct location measure for a well-balanced, unimodal distribution with no extreme data; however, considerable attention must be paid when averages are applied to mere quantifications of qualitative variables without any consideration of the stochastic nature of human decisions.

In this regard, we show in Fig. 7.1 two hypothetical distributions of preferences/ratings for a given item expressed on a 9-point scale by two sets of respondents (we refer to them as Model 1 and Model 2, respectively); the average is 6 in both cases but the distributions are completely different (Footnote 1). For instance, the preferred options (= modal values) are 9 and 6, respectively; moreover, \(\Pr\left(5 \leq R \leq 7\right)\) is equal to 0.728 and 0.249, respectively. Thus, although the distributions share the same mean value, any comparative decision should be substantially different. This point should not be underestimated since, in such cases, any function based on expectation may hide important information.

Fig. 7.1 Two hypothetical preference/rating distributions

Thus, we adhere to the conclusion that “the indicator exists in a model and that the indicator itself is the product of a model” [12] and support the search for adequate measures [20].

In fact, what is really important in studying a complex phenomenon such as students' satisfaction is the modelling of the evaluation process that transforms a personal perception into an ordinal answer to a specific item. A model thus includes, in a consistent way, the role and the weight of the real uncertainty that pervades any decision process. In addition, modelling allows statistical tests and confidence intervals to be derived for any indicator of interest by means of exact, asymptotic or simulated distributions.

3 Perception and Rating as Complex Decisions

The perception of an object/service/item is a psychological process by which a subject synthesizes sensory data into forms that are meaningful to his/her consciousness. In fact, when we ask a student to answer a specific question on a questionnaire concerning the quality of teaching, we are looking for his/her perception of the problem. Then, we are asking him/her to summarize this perception into a well defined category (chosen from a finite set of ordinal values).

Thus, the expressed evaluation is the final act of complex causes, and the answer we collect is affected both by the real consideration of the problem and by some inherent uncertainty that accompanies human decisions. As a consequence, any expressed perception becomes the realization of a stochastic phenomenon and should be analyzed with statistical methods that make it possible to investigate the data generating process.

Actually, when faced with discrete choices, psychological processes manifest themselves through two main components that explain the final decision:

  • a primary component, generated by the genuine impression of the respondent, related to awareness and full understanding of the problems, personal or previous experience, group membership, and so on;

  • a secondary component, generated by the intrinsic uncertainty of the final choice. This may be due to the amount of time devoted to the answer, the use of a limited set of information, partial understanding of the item, laziness, apathy, and so on.

From the point of view of the interviewer, the first component is hopefully the most important in determining the answer, since it conveys information on the real motivations behind the observed result. Instead, from the point of view of the interviewee, the second component may become considerable if he/she is not really interested in giving a meditated answer.

Moreover, by constraining the choice process to a finite set of ordinal alternatives, we induce a hierarchical procedure, since respondents first orient themselves through a coarse evaluation (negative, indifferent, positive) and then refine their final judgement.

Actually, empirical evidence shows that extreme choices are assessed in a sharper way. On the contrary, when the number of alternatives increases, people tend not to be so extreme even if they are really satisfied with the item.

Finally, it is important to realize that specific circumstances may inflate the observed evaluations in some classes. This happens, for instance, when some category is worded in a way that induces respondents to simplify more elaborate decisions (we call these shelter choices [45]).

4 Latent Variables and Item Response Theory

Since satisfaction and perceived quality are not observable, some remarks are necessary in order to define their role in the modelling approaches we will discuss. Surely, a few latent traits (constructs, variables, factors) are the common features that drive the general pattern of responses to a questionnaire aimed at evaluating a service [9–11, 36]. Empirical evidence confirms that similarities, differences and contrasts among the responses are quite common. Thus, although a huge number of hypothetical patterns could be conjectured, only a limited subgroup of them is observed with significant frequency.

Latent variables drive the evaluation process, and statistical research should focus on substantive models for explaining the observed patterns within a consistent rational framework. In the literature, several independent approaches lead to similar models. Among them, we quote those originating from psychophysical and sensory studies: [17, 51, 52, 78]. Similarly, starting from [57], econometricians refer to “Random Utility Models” (RUM): [2, 28, 39–41, 79]. Then, in the vein of the “Unobservable Variable approach” (UVA), several statisticians have introduced useful models for evaluation data: [18, 23, 48, 60–62, 65]. For an updated survey, see [16].

A common feature of these approaches is that the answers to the items are supposed to be generated by a latent variable that explains their dependence and manifests the most important characteristics (features, constructs, traits) of the survey.

Models related to Item Response Theory (IRT) are widespread in psychological and medical studies, as well as in marketing and political research, and different motivations, usages and notations may obscure their common framework. In fact, some papers aim at defining a recognized taxonomy: [75, 77]. The main distinctions are based on: dichotomous or polytomous responses, number of parameters, ordinal nature of the responses, availability of covariates, number of latent traits, and so on.

From a historical point of view, IRT originated as a critical reaction to classical test theory [49], where responses to several items are assumed to be numerous enough to apply standard methods to the total score. The main critical issues are the non-independence of the items and the existence of common patterns in the responses. In addition, when a set of items is submitted in educational contexts, responses are a function both of the ability of respondents and of the difficulty of items. This is a problem that classical test theory does not tackle in a simple and effective way.

Thus, IRT is based on several assumptions, the most important of which are the following:

  • Unidimensionality. The theory assumes that the questionnaire measures a single continuous latent variable defined on the real line.

  • Local Independence. The theory assumes that any relationship among items is fully explained by a few common latent variables; thus, for a given trait level, item responses are independent.

  • Normality. Often, latent variables are assumed to be Normally distributed.

In this approach, the starting point for new directions was the introduction of the Rasch model [71], with just one parameter, later generalized by Birnbaum – in a series of reports from 1957 onwards, collected in [51] – and by Lord [50], who considered two- and three-parameter models, respectively. Rasch originally proposed a model for dichotomous responses. Bock [15], instead, considered polytomous responses, where the probability of each category is given by an exponential term normalized by the sum over all categories. Specifically, the non-negativity of probabilities suggests the ratio of exponentials; further constraints on the parameters are required to ensure identifiability.

To establish notation, we assume that \(R_{ij}\) is the response random variable of the i-th respondent to the j-th item, for \(i=1,2,\dots,n\) and \(j=1,2,\dots,K\). Then, the sample information is contained in the following \((n \times K)\) matrix, consisting of the observed answers of n respondents to K items:

$$\begin{bmatrix} r_{1,1}&r_{1,2}&\dots&{r_{1,j}}&\dots&r_{1,K}\\ r_{2,1}&r_{2,2}&\dots&{r_{2,j}}&\dots&r_{2,K}\\ \dots &\dots &\dots&{\dots} &\dots&\dots\\ r_{n,1}&r_{n,2}&\dots&{r_{n,j}}&\dots&r_{n,K}\\ \end{bmatrix}.$$

Observed values are the expressed ratings of an evaluation process. They could be \(0,1\) in dichotomous situations (as happens for tests) or belong to \(\{1,2,\dots,m\},\,m>2\), in polytomous cases (as happens in evaluation and preference surveys; Footnote 2).

Each row is the response pattern of a given subject and each column collects the observed evaluations of a given item expressed by the different subjects. It seems evident that information deduced from rows should be related to subjects' ability, while information derived from columns should be related to items' difficulty. This interpretation has historically generated the whole family of Rasch models, as shown by [38]. More specifically, Georg Rasch searched for an item response function implying a complete separation between the person ability parameters and the item difficulty parameters. This requirement has been called specific objectivity and is related to the joint sufficiency property of the parameter estimators [73].

For a single item admitting a dichotomous response (correct: \(R=1\); wrong: \(R=0\)), the standard formulation (Footnote 3) of the original Rasch model for the j-th item,

$$\Pr\left(R_j=1 \mid \theta\right)=\frac{1}{1+e^{-a\,(\theta-\delta_j)}}\,,$$

expresses the probability of a correct answer for a given person’s ability θ with respect to the item difficulty parameter \((\delta_j)\) and the discrimination parameter \((a)\).

A generalization of this formulation leads to a three-parameter (logistic) model defined, for each item \(j=1,2,\dots,K\), by:

$$\Pr\left(R_j=1 \mid \theta\right)=c_j+\frac{1-c_j}{1+e^{-a_j\,(\theta-\delta_j)}}\,.$$

Here, we are assuming that there is a probability \(c_j\) of guessing the correct response to the j-th item even when respondents do not know it; in addition, we allow the discrimination parameters \(a_j\) to vary among the items.
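As a minimal numerical sketch of the two formulas above (the function names and parameter values are ours, chosen purely for illustration), the item response functions can be evaluated directly:

```python
import math

def rasch_prob(theta: float, delta: float, a: float = 1.0) -> float:
    """Rasch model: Pr(R_j = 1 | theta) = 1 / (1 + exp(-a*(theta - delta)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def three_pl_prob(theta: float, delta: float, a: float, c: float) -> float:
    """Three-parameter logistic model: a guessing floor c_j plus the
    remaining probability mass shaped by the logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - delta)))

# When ability equals difficulty, the Rasch probability is exactly 0.5,
# while the 3PL probability is c + (1 - c)/2 because of guessing.
print(rasch_prob(theta=0.0, delta=0.0))                   # 0.5
print(three_pl_prob(theta=0.0, delta=0.0, a=1.2, c=0.2))  # 0.6
```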

When responses are ordinal (as is common in the evaluation context), Samejima [74] proposed a graded response model which expresses the probability that a response is observed in a specific category or above. Similarly, the partial credit model proposed by Masters [54] assumes that the discrimination parameter does not vary among items and that the probability of scoring a given category over the previous one is a function of the parameters. Finally, the rating scale model proposed by Andrich [4] introduces a further parameter, with respect to the partial credit model, to locate the item position on the underlying construct (but it is constrained to the same number of categories across items). These models are related to the proportional odds, adjacent categories logit and continuation ratio models for ordinal data proposed in the Generalized Linear Models framework, as derived by McCullagh [55] and discussed by [1, 35].

In the context of students’ evaluation [3], the ability assessment has been transformed into a quality assessment. Thus, the ability (subjects) and difficulty (items) parameters are now interpreted as satisfaction (subjects) and quality (items), respectively. This approach has been fully discussed in several papers by [29–31] with regard to evaluation and customer satisfaction data; these authors prefer the extended logistic model, which generalizes the rating scale model, as proposed by [5, 6].

We refer to the vast literature (Footnote 4) for further considerations on these and related problems (for instance, multilevel, hierarchical, multidimensional, mixture and nonparametric IRT models: [47, 72, 76, 80]). We only note here that the inclusion of subjects' characteristics as explanatory variables of latent traits is examined by IRT researchers by means of “Differential Item Functioning” (DIF). This representation is useful for showing evidence of significant covariates in subgroups; often, the presence of clusters is treated as a bias in the responses expressed by a limited number of subjects, as in [64].

5 An Alternative Model for the Evaluation Process

We introduce a different paradigm in order to explain the ordinal choices that people routinely perform when faced with an evaluation process. The model we will introduce is parsimonious and flexible with respect to alternative distributions [66]. Here, the reference to latent traits remains valid, but the probability distribution of the ordinal values is explicitly estimated and checked against the data. In addition, it is immediate to add subjects' covariates (even continuous ones) to take the behaviour of the respondents into account. Finally, clustering evaluation data by means of estimated models turns out to be efficient and selective, as shown by Corduas [25–27].

5.1 Rationale for CUB Models

In our model, a rating is interpreted as the final outcome of a psychological process in which the investigated trait is intrinsically continuous but is expressed in a discrete way on a given scale. It is then possible to quantify the impact of individual covariates on the perception of the main aspects of University teaching, and to study how perception changes with students' profiles.

The rationale for CUB models (Footnote 5) stems from the interpretation of the final choices of respondents as weighted combinations of a personal agreement (feeling) and some intrinsic uncertainty (fuzziness).

The first component is parameterized by a shifted Binomial random variable, which is able to map a continuous latent variable (with a unimodal distribution: Normal, Student t, logistic, etc.) into a discrete set of values \(\{1, 2,\dots,m\}\). Its shape depends on the cutpoints assumed for the latent variable.
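To make the cutpoint mechanism concrete, the following sketch (our illustration, assuming a standard Normal latent trait and arbitrary cutpoints) shows how slicing a continuous latent variable yields a discrete unimodal distribution:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def discretize_latent(cutpoints: list[float]) -> list[float]:
    """Category probabilities obtained by slicing a standard Normal
    latent variable at the given increasing cutpoints."""
    edges = [-math.inf] + cutpoints + [math.inf]
    return [norm_cdf(b) - norm_cdf(a) for a, b in zip(edges, edges[1:])]

# m = 5 categories from four cutpoints: shifting the cutpoints moves
# the mode of the resulting discrete distribution.
probs = discretize_latent([-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])  # unimodal, centred on category 3
```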

The second component is a discrete Uniform random variable and describes the inherent uncertainty of an evaluation process constrained to be expressed by discrete choices. Actually, it is a building block for modelling the propensity of a respondent towards the extreme solution of a totally indifferent choice.

Although a mixture distribution may be interpreted as a two-step stochastic choice between two discrete distributions, we are not saying that the population is composed of two subgroups (respondents whose choices are without and with uncertainty, respectively). Instead, we are assuming that each subject acts as if his/her final choice were generated with propensities \(\pi\) and \(1-\pi\) of belonging to one of the two distributions, respectively. In this regard, we observe that \((1-\xi)\) is a measure of agreement/feeling towards the item and \((1-\pi)\) is a measure of the uncertainty that accompanies the choice.
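A minimal simulation sketch of this "as if" reading (our own helper, with illustrative parameter values; the formal definition of the distribution follows in the next subsection):

```python
import random

def cub_draw(m: int, pi: float, xi: float) -> int:
    """Simulate one rating: with probability pi draw from the shifted
    Binomial (feeling), otherwise from the discrete Uniform (uncertainty).
    The two-step scheme is a simulation device, not a claim that the
    population splits into two subgroups."""
    if random.random() < pi:
        # shifted Binomial: 1 + Binomial(m - 1, 1 - xi)
        return 1 + sum(random.random() < (1 - xi) for _ in range(m - 1))
    return random.randint(1, m)

random.seed(1)
print([cub_draw(m=9, pi=0.8, xi=0.3) for _ in range(12)])
```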

5.2 CUB Models

On a more formal basis, for a given \(m>3\), we consider the expressed rating r as the realization of a random variable R with probability distribution:

$$\Pr(R=r)=\pi\,\binom{m-1}{r-1}\xi^{m-r}(1-\xi)^{r-1}+(1-\pi)\,\frac{1}{m}\,,\quad r=1,2,\dots,m.$$

The model, first introduced by [33, 66], is fully specified by the parameters \(\pi \in (0,1]\) and \(\xi \in [0,1]\), which are inversely related to the weight of uncertainty and to the feeling, respectively. Its identifiability has been proved by Iannario [43].
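In Python, the probability distribution above and its expectation can be written down directly (a minimal sketch; the helper names and the parameter values are ours):

```python
from math import comb

def cub_pmf(m: int, pi: float, xi: float) -> list[float]:
    """Pr(R = r) = pi * ShiftedBinomial(m, xi) + (1 - pi)/m, r = 1..m."""
    return [pi * comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            + (1 - pi) / m
            for r in range(1, m + 1)]

def cub_mean(m: int, pi: float, xi: float) -> float:
    """E(R): mixture of the shifted-Binomial mean 1 + (m - 1)(1 - xi)
    and the discrete-Uniform mean (m + 1)/2."""
    return pi * (1 + (m - 1) * (1 - xi)) + (1 - pi) * (m + 1) / 2

probs = cub_pmf(m=7, pi=0.9, xi=0.3)     # low uncertainty, strong feeling
print([round(p, 3) for p in probs])      # mass concentrated on high ratings
print(round(sum(probs), 6), round(cub_mean(7, 0.9, 0.3), 3))  # 1.0, 5.08
```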

Later, Piccolo [67] generalized this mixture random variable by introducing logistic links between the model parameters and the subjects' and objects' covariates (as applied in [68]). This class has been called \(CUB(p,q)\) models, where \(p \geq 0\) and \(q \geq 0\) are the numbers of covariate parameters for \(\pi\) and \(\xi\), respectively. Of course, a \(CUB(0,0)\) model is just a probability mixture distribution for the ratings, that is, a model without covariates.

In this way, the class of \(CUB(p,q)\) models is generated by two components:

  • a stochastic component:

    $$\Pr\left(R=r \mid \mathbf{y}_i,\,\mathbf{w}_i\right)=\pi_i\,\binom{m-1}{r-1}\xi_i^{m-r}(1-\xi_i)^{r-1}+(1-\pi_i)\,\frac{1}{m}\,;$$

    for \(r=1,2,\dots,m\), where the parameters \(\pi_i\) and \(\xi_i\), for each subject \(i=1,2,\dots,n\), are defined by:

  • a systematic component:

    $$\left\{\begin{array}{l} \pi_i={\displaystyle\frac{1}{1+e^{-\mathbf{y}_i\,\boldsymbol\beta}}}\,;\\[2mm] \xi_i={\displaystyle\frac{1}{1+e^{-\mathbf{w}_i\,\boldsymbol\gamma}}}\,. \end{array}\right.$$

Here, \(\mathbf{y}_i\) and \(\mathbf{w}_i\) are the covariate vectors of the i-th subject, explaining \(\pi_i\) and \(\xi_i\), respectively.

Finally, a \(CUB(p,q)\) model with a logistic link is defined, for any \(i=1,2,\dots,n\), by:

$$\Pr\left(R=r_i \mid \boldsymbol\beta,\,\boldsymbol\gamma\right)=\frac{1}{1+e^{-\mathbf{y}_{i}\boldsymbol\beta}}\left[\binom{m-1}{r_i-1}\,\frac{\left(e^{-\mathbf{w}_{i}\boldsymbol\gamma}\right)^{r_i-1}}{\left(1+e^{-\mathbf{w}_{i}\boldsymbol\gamma}\right)^{m-1}}-\frac{1}{m}\right]+\frac{1}{m}\,.$$

A random sample consists of the joint set of expressed evaluations and covariates \((r_i,\,\mathbf{y}_{i},\,\mathbf{w}_{i})'\), for \(i=1,2,\dots,n\), and this information (for moderate and large sample sizes) is sufficient to generate sensible inference on the parameters \((\boldsymbol\beta',\,\boldsymbol\gamma')'\) via the log-likelihood function and related asymptotic results (Footnote 6).
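As a sketch of how this sample information enters the inference (our own illustration, not the authors' software), the \(CUB(p,q)\) log-likelihood under the logistic links above can be coded as follows; in practice it would be maximized numerically, e.g. by an EM or quasi-Newton routine:

```python
import math
from math import comb

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def cub_loglik(beta, gamma, data, m):
    """Log-likelihood of a CUB(p, q) model. Each observation is a triple
    (r_i, y_i, w_i), where y_i and w_i include a leading 1 for the intercept."""
    ll = 0.0
    for r, y, w in data:
        pi_i = logistic(sum(b * v for b, v in zip(beta, y)))
        xi_i = logistic(sum(g * v for g, v in zip(gamma, w)))
        p = (pi_i * comb(m - 1, r - 1) * xi_i ** (m - r) * (1 - xi_i) ** (r - 1)
             + (1 - pi_i) / m)
        ll += math.log(p)
    return ll

# Two toy observations on a 7-point scale (values purely illustrative).
data = [(6, [1.0], [1.0]), (4, [1.0], [1.0])]
print(round(cub_loglik(beta=[0.5], gamma=[-0.8], data=data, m=7), 4))
```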

Notice that CUB models adhere to the logic of Generalized Linear Models (GLM), advocated by [56, 63], since they introduce linear functions of the covariates in order to improve inference on the observed data. However, they do not belong to the GLM class, since the chosen mixture distribution is not in the exponential family and no link between expectations and parameters is required. In fact, our models fall within a more general framework [53].

Furthermore, we quote the extended CUB models proposed by [45], which are able to take into account the possible presence of atypical frequency distributions (Footnote 7) generated by subgroups that select a specific category (shelter choice). This kind of problem is also relevant in the educational context: for instance, with reference to the evaluation survey discussed in Sect. 7.6, we found that the adjective satisfied, positioned just after an indifference option, caused a noticeable shelter effect everywhere in the responses given by students, with an impact on the frequency distribution ranging from 10 to 30%. By using extended CUB models, this effect has been explained with a substantial improvement of the fitting measures, from 2 up to 10 times.

6 Empirical Evidence for University Teaching Evaluation

Several statistical methods and empirical findings have been derived from the evaluation of University teaching in Italy. They are based on principles and foundations [13, 14] as well as on methods and applications [22, 23, 37, 58, 59, 65]. Moreover, the approach proposed in the previous section has been applied to several evaluation data sets [32, 34].

In this chapter, we present some results on the evaluation of students' satisfaction at the University of Naples Federico II, based on data collected during the academic year \(2005/2006\) (the sample consists of \(n=34{,}507\) validated questionnaires), and we limit ourselves to discussing only a few features. Unfortunately, more specific considerations cannot be derived, since the data base was delivered by University offices under the constraint of non-identifiability of the Faculties (as a consequence of privacy rules).

In this survey, the perceived feeling/satisfaction with the different items (quality of lecture halls, objectives and adequacy of courses, instructors' ability and availability, adherence to the timetable, and so on) has been rated from 1 = completely unsatisfied to 7 = completely satisfied. Thus, in the first subsection we examine \(CUB(0,0)\) models for these ratings with respect to some stratification variables: Faculty, gender and attendance. In the second subsection, we estimate CUB models with covariates in order to show how the global satisfaction rating is related to significant subjects' covariates.

6.1 CUB Models Without Covariates

In Fig. 7.2, we present the estimated CUB models for the responses given to the global evaluation item. All Faculties are characterized by low uncertainty (the estimated \(\pi\) imply uncertainty shares always below 3%), but the level of positive evaluation is more variable (the estimated \(\xi\) range from 0.25 to 0.40).

Fig. 7.2 CUB models for global satisfaction of 13 Faculties

For a better understanding of this global assessment, Fig. 7.3 presents the location of the estimated models in the parameter space for the items concerning the evaluation of lecture halls, quality of teaching and global satisfaction. It seems evident that the last item is related to (and almost coincides with) the expressed judgement towards teaching. In any case, responses related to global and teaching evaluations are less uncertain and manifest a more positive feeling than lecture halls evaluations.

Fig. 7.3 CUB models and halls, teaching and global evaluations

Respondents are mostly women (55%), and different profiles arise when we consider the estimated models for the various items with respect to gender, as confirmed by Fig. 7.4. For both genders we observe a common pattern of models in the parameter space: there is a difference between items related to the organization and structure of courses and items related to the personal relationship with the instructors. We register better judgements of instructors, expressed with low uncertainty, whereas judgements of the structural components are lower and more definite. However, women are markedly more resolute in their evaluations.

Fig. 7.4 CUB models and gender of respondents

The expressed results are clearly related to the typology of respondents, since the sample consists of students with generally high attendance: more than 76% declared that they attended more than 80% of the lectures in the term, and only 0.7% attended less than 20%. In fact, judgements are quite similar when CUB models are estimated for subgroups of students characterized by a given attendance rate; remarkable differences in the estimated models are found only for students with occasional attendance (Fig. 7.5). As a matter of fact, low attendance reduces positive evaluation and increases uncertainty in the responses (Footnote 8).

Fig. 7.5 CUB models and lecture attendance of respondents

6.2 CUB Models with Covariates

Taking into account the results of the previous subsection, we look for models which explain the expressed evaluation as a function of selected covariates. We limit ourselves to a large Faculty for which a considerable number \((n=10{,}572)\) of validated questionnaires is available, and we study the behaviour of respondents with reference to the global evaluation item.

We found that positive evaluation is significantly related to Attendance (expressed in four ordinal classes), Age (in years) and the number of Exams passed by respondents. Table 7.1 presents estimated CUB models of increasing complexity with respect to the number of significant covariates (which enter the model in decreasing order of significance). Notice that all parameters are significant, although the effect of the last covariate (= Exams) in the last estimated model seems weak.

Table 7.2 summarizes these results by showing the asymptotic tests for the previous models. The observed test statistics should be compared with the critical values of a \(\chi^2\) random variable, which, for a level \(\alpha=0.05\) and degrees of freedom \(g=1,2,3\), are given by \(\chi^2_{(1)}=3.841\), \(\chi^2_{(2)}=5.991\) and \(\chi^2_{(3)}=7.815\), respectively.

Table 7.1 Estimated \(CUB(p,q)\) models for expressed global evaluation
Table 7.2 Asymptotic tests for the estimated \(CUB(p,q)\) models
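The comparison described above amounts to a likelihood-ratio test between nested models; a minimal sketch (our helper, assuming the fitted log-likelihoods are available):

```python
from scipy.stats import chi2

def lr_test(loglik_restricted: float, loglik_full: float,
            df: int, alpha: float = 0.05) -> bool:
    """Reject the restricted model if twice the log-likelihood gain
    exceeds the chi-square critical value with df extra parameters."""
    statistic = 2.0 * (loglik_full - loglik_restricted)
    return statistic > chi2.ppf(1.0 - alpha, df)

# The critical values quoted in the text:
print([round(chi2.ppf(0.95, g), 3) for g in (1, 2, 3)])
# [3.841, 5.991, 7.815]
```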

Specifically, given the covariates \(\mathbf{w}_i=(Attendance_i,\,Age_i,\,Exams_i)'\) for the i-th respondent and m = 7, the estimated \(CUB(0,3)\) model implies that:

$$\Pr(R=r \mid \mathbf{w}_i)=0.024 + 0.831\,\binom{6}{r-1}(1-\xi_i)^{r-1}\xi_i^{7-r},\quad r=1,2,\dots,7,$$

where the parameters \(\xi_i=\xi(\mathbf{w}_i)\), for \(i=1,2,\dots,n\), are specified by:

$$\begin{array}{lll} \xi_i&=&\frac{1}{1+\exp\{-2.028-0.253\,Attendance_i+0.973\log(Age_i)+0.005\,Exams_i\}}\\ &=&\frac{1}{1\,+\,0.132\,Age_i^{0.973}\,\,0.776^{Attendance_i}\,\, 1.005^{Exams_i}}\,. \end{array}$$

Given that ξ is an inverse measure of satisfaction, the expressed global evaluation increases with Age and rises remarkably with the Attendance rate; to a lesser extent, it also increases with the number of Exams passed. These results, for \(Exams=0\), are summarized in Fig. 7.6, where the expected global evaluation under the estimated \(CUB(0,3)\) model is plotted as a function of the selected covariates.
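A minimal numerical sketch of the fitted relation (the coefficients are taken from the formula above; \(\pi \approx 0.831\) is read off the estimated probability distribution, and the covariate values below are hypothetical, since the exact coding of the Attendance classes is not reported here):

```python
import math

def xi_hat(attendance: float, age: float, exams: float) -> float:
    """Estimated feeling parameter of the fitted CUB(0, 3) model."""
    eta = -2.028 - 0.253 * attendance + 0.973 * math.log(age) + 0.005 * exams
    return 1.0 / (1.0 + math.exp(eta))

def expected_rating(attendance: float, age: float, exams: float) -> float:
    """Expected global evaluation: pi*(1 + 6*(1 - xi)) + (1 - pi)*(m + 1)/2,
    with m = 7 and pi = 0.831."""
    xi = xi_hat(attendance, age, exams)
    return 0.831 * (1 + 6 * (1 - xi)) + (1 - 0.831) * 4

# A hypothetical respondent profile (Exams = 0, as in Fig. 7.6):
print(round(xi_hat(attendance=2, age=22, exams=0), 3))
print(round(expected_rating(attendance=2, age=22, exams=0), 3))
```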

Fig. 7.6 Expected global evaluation as a function of covariates

7 Concluding Remarks

The proposed CUB models are characterized by good fitting performance achieved with few parameters (parsimony) and by the immediate interpretability of results in terms of feeling and uncertainty, as well as by means of covariates.

In this area, we are currently looking for adequate formalizations that allow the CUB model approach to be integrated into the latent trait environment, in order to gain the multivariate dimension of rating data analysis. Similarly, multilevel extensions of CUB models should be introduced to improve the interpretation of real situations where clusters of respondents generate similar evaluations.

In any case, this kind of analysis (both conceptually and empirically based) has convinced us that some operational implications may be suggested to the institutions charged with making the evaluation process of University teaching more effective. These considerations may help to generate a few simple and useful rules:

  • Questionnaires should be considerably simplified, since all analyses confirm that results are driven by just one or two latent variables. These constructs concern satisfaction with a personal component (the teacher's ability, judged favourably even when structures are inadequate) and/or criticism of a structural component of courses (rooms, schedules, availability and adequacy of laboratories, often judged negatively even when teaching is positively evaluated). Thus, a few questions may effectively capture most of the information in the data without wasting time and/or lowering the accuracy of responses.

  • Sample sizes may be reduced in favour of stratified procedures in order to obtain more accurate answers. In fact, collecting data among students in a classroom induces internal correlation, high dispersion among respondents and a large number of useless questionnaires. Of course, any selection mechanism must respect the requirement that the surveyed students attend the lectures and courses being evaluated.

  • Simple outputs for intermediate and final users should be based on effective indicators related to fitted models (reported with goodness-of-fit measures), which are derived from hypotheses on the data generating mechanism.

The final message is that evaluation of University teaching is both a complex task and a difficult challenge for statisticians. As in other fields, knowledge should be based on sound theory and extensive experience that lead to iterative and interactive processes. In any case, simple and effective models should be encouraged for supporting correct decisions.