1 Introduction

Over the last decade, the assessment of university teaching quality has assumed a prominent role in the university system, with the main purpose of improving the quality of services offered to students. Students’ feedback on university teaching plays an important role in this process, enabling university teachers, planners and leaders to monitor teaching processes through internal surveys promoted at the end of each course. Therefore, a substantial body of work has been devoted to the analysis of university teaching evaluation using students’ satisfaction questionnaires, both at the international level (Ramsden 1991; Kember et al. 2002; Marsh 2007) and at the national level (see contributions in: Fabbris 2007; Monari et al. 2009; Attanasio and Capursi 2011; Crescenzi and Mignani 2014).

More recently, in Italy, starting from the activities promoted by the National Evaluation Committee of the University System (CNVSU) and now by the National Agency for the Evaluation of Universities and Research Institutes (ANVUR), student evaluation of teaching (SET) surveys have been carried out each year by means of ad hoc questionnaires. Apart from minor changes allowed at the local level, the latest questionnaire version, established by ANVUR in 2013, is adopted by all Italian universities in order to allow comparisons at the national level.

As a result of this process, a host of studies on the evaluation of university teaching has been devoted to the Italian system, covering different topics and considering both case studies and methodological issues. Among others, various indicators (Capursi and Porcu 2001; Capursi and Librizzi 2008; Cerchiello and Giudici 2012; Marasini and Quatto 2014) and statistical models (Rampichini et al. 2004; Bacci and Caviezel 2011; Iannario 2012; Sulis and Capursi 2013) have been introduced, focusing on methods for the treatment of ordinal data in SET questionnaires and used to summarize the results of students’ ratings at the individual and/or course level in a single statement. In particular, among alternative modelling approaches, the usefulness of Item Response Theory (IRT; De Boeck and Wilson 2004; Toland 2014) and multilevel models (Goldstein 2011) has been extensively exploited.

Based upon this debate, in this paper we propose an integrated strategy of analysis for the treatment of SET data. Specifically, the combined use of IRT and multilevel models is advanced to: (i) obtain measures of students’ satisfaction on a metrical scale; (ii) assess the contribution that each factor related to the process under evaluation provides to students’ perception of university course quality; and, finally, (iii) remove the effects of factors, related to the composition of students, which make comparisons across heterogeneous courses meaningless.

Starting from an overview of the different methods proposed for the analysis of SET surveys, we consider and compare the information provided by different statistical tools, including descriptive indicators and model-based indicators (which rely on the joint use of IRT and multilevel models). The main advantage of using an explanatory rather than a merely descriptive approach is illustrated. First, we discuss indicators based on descriptive methods advanced in Italy to summarize the distributions of students’ responses to the items of SET questionnaires. For each item, we compare the rankings of courses based on some alternative indicators proposed in the literature. Second, we advance the use of model-based indicators of students’ satisfaction with university teaching and discuss how to adjust them to take into account heterogeneity across evaluators (e.g. differences in students’ characteristics across courses).

The proposed strategy of analysis is presented along with a case study concerning data from the SET survey of all undergraduate programs offered by a university located in Southern Italy in the academic year 2013/2014.

The paper is organised as follows. Section 2 includes a brief review of the main methodological approaches proposed in the literature for the analysis of SET data. Section 3 describes the proposed modelling strategy. Section 4 provides details on the case study. Section 5 presents the results in terms of questionnaire validation, ranking comparisons across courses by means of different indicators, and multilevel models. The advantages related to the use of model-based approaches for the analysis of SET data are also discussed. Section 6 includes some final remarks and comments.

2 Methods for the analysis of student evaluation of teaching

2.1 Indicator definition

Over the years, SET has become the practice most widely adopted by universities to gather feedback about their programs. The diffusion of these surveys, and their relevance for university governing bodies, has prompted the interest of many researchers in the definition of suitable statistical tools for the analysis of teaching evaluation data.

As a consequence, a large body of literature has been devoted to proposing and studying the properties of specific indicators for categorical data which take into account the ordinal scale of the questionnaire items. Within this class, two definitions appear well suited to this kind of data. The first is the satisfaction index proposed and discussed in Capursi and Porcu (2001) and Capursi and Librizzi (2008), defined as

$${\textit{IS}}_{R}=1-\left( \frac{1}{m-1}\sum _{i=1}^{m-1}F_{Ai}^{r}\right) ^{1/r}$$

where \(F_{Ai}\) denotes the value of the empirical cumulative distribution function of the generic item A at the i-th ordinal category, and r is a properly chosen exponent (standard choices are \(r=1\) or \(r=0.5\)). The second is the dissimilarity stochastic index proposed in Cerchiello and Giudici (2012), defined as

$${\textit{SDI}} = \sum _{i=1}^{m}F_{Ai}.$$

For a review of the properties of these indicators, see Marasini and Quatto (2011) and the references therein.

Alternatively, indicators can be based on a metrical transformation of students’ ratings, obtained simply as averages of numerical scores (\(x_i\)) assigned to the ordinal categories, weighted by the associated absolute frequencies (\(w_i\)). That is, for an m-level Likert scale, the indicator is defined as

$${\textit{IM}} = \frac{\sum _{i=1}^{m}x_{i}w_{i}}{\sum _{i=1}^{m}w_{i}}.$$

For example, for a 4-level Likert scale (as is customary in the Italian teaching evaluation system), the indexes are obtained by assigning equally spaced (1, 2, 3, 4) (Labovitz 1970) or unequally spaced (2, 5, 7, 10) (MURST 2000) scores to the ordinal categories. To overcome the issues related to the selection of an arbitrary score system for ordinal categories, a better but more complex way to assign scores is by means of the results derived from estimated IRT models, where the scores \(x_{i}\) are functions of the IRT threshold values (Samejima 1969; Baker 2001).
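To make the computations concrete, the following minimal R sketch derives the three families of indicators from the absolute frequencies of a single item; the counts and variable names are purely illustrative, not taken from the case study.

```r
## Descriptive indicators for one item, computed from hypothetical counts of
## the four ordered categories (DN, MN, MY, DY).
w <- c(DN = 40, MN = 110, MY = 380, DY = 270)  # illustrative frequencies
m <- length(w)                                 # number of categories

F_all <- cumsum(w / sum(w))   # empirical cdf F_Ai, i = 1, ..., m
F_low <- F_all[1:(m - 1)]     # F_Ai for i = 1, ..., m-1

## Satisfaction index IS_r (Capursi and Porcu 2001)
IS <- function(Fv, r) 1 - mean(Fv^r)^(1 / r)
IS_1  <- IS(F_low, r = 1)
IS_05 <- IS(F_low, r = 0.5)

## Dissimilarity stochastic index SDI (Cerchiello and Giudici 2012)
SDI <- sum(F_all)

## Weighted-average indicators IM with equally / unequally spaced scores
IM_1 <- sum(c(1, 2, 3, 4) * w) / sum(w)    # Labovitz (1970)
IM_2 <- sum(c(2, 5, 7, 10) * w) / sum(w)   # MURST (2000)
```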

In the current scenario, to the best of our knowledge, the indexes \({\textit{IS}}_{R}\) and \({\textit{SDI}}\) discussed in the last decades have rarely been applied to SET data in the Italian experience. In most Italian universities these methodological proposals have never been transposed and implemented, with the exception of Cagliari, Florence, Palermo and a few other universities that have joined the statistical information system for the evaluation of university teaching (SISValDidat) of the VALMON group.Footnote 1 A recent proposal to introduce and support the use of these indexes for the analysis of SET data is described in Conte et al. (2015), where an interactive software prototype with a strong emphasis on data visualization is implemented in R.

The required adjustment of “regularizing the indicators for comparison, adopting standard methods of production” (Bernardi 2011, p. 13) in SET surveys seems to point toward procedures based both on the percentages of dissatisfaction and satisfaction (negative and positive students’ judgements) and/or on average scores obtained by assigning equally or unequally spaced scores to the ordinal categories of SET questionnaires.Footnote 2

Moreover, there have been few attempts to take into account the heterogeneity of the evaluators (e.g. students’ socio-demographic characteristics) when comparing universities (Rampichini et al. 2004). The validation of the procedures adopted for the analysis of these data is also mainly left to local initiatives, whereas the usefulness of a national dimension for the statistical analysis of teaching evaluation data is highlighted by Carpita and Marasini (2014).

2.2 IRT and multilevel models

IRT models (De Boeck and Wilson 2004) are considered the main methodological approach for measuring individuals’ latent trait values on a metrical scale on the basis of the responses provided to a set of categorical items which measure an underlying variable. Multilevel models are widely adopted in regression analysis when independence between observations does not hold and, thus, responses provided by units belonging to the same group tend to be more similar than responses provided by units in different groups (Goldstein and Spiegelhalter 1996). This frequently arises in educational settings, where students in the same class or educational program share the same environment, the same teachers and the same group of peers.

In IRT models, the probability of providing a response in a given category or greater is modelled as a function of three parameters: the person parameter (\(\theta\)), the item-threshold parameter (\(\tau\)) and the item discrimination parameter (\(\lambda\)). The person parameter measures an individual’s value of the latent trait, the item-threshold parameter identifies the location of the categories on the latent trait, and the item discrimination parameter indicates the item’s capability to detect differences among persons with different values of the latent trait.

Denoting by \(Y_{ij}\) the response of person i \((i = 1,\ldots , n)\) to item j \((j = 1,\ldots ,J)\), with ordinal categories \(c = 1,\ldots ,C\), the Graded Response Model (GRM) specifies a logit link for the cumulative probabilities (Samejima 1969):

$$Pr(Y_{ij}\ge c)=\frac{\exp ({\lambda _{j} (\theta _{i} -\tau _{j_{c}})})}{1+\exp ({\lambda _{j} (\theta _{i} -\tau _{j_{c}})})}. \tag{1}$$

The person parameter (\(\theta\)) is shared by all responses provided by the same individual and is assumed to be a random term following a standard normal distribution. Item-threshold parameters (\(\tau _{j_{c}}\)) and person parameters (\(\theta _i\)) are expressed in the same metric, and both are expected to lie within [−3, +3]. The lower the person parameter \(\theta _i\) with respect to the item-threshold parameter \(\tau _{j_{c}}\), the smaller the probability of endorsing higher categories. The discrimination parameter describes the slope of the logistic functions; thus, low values of the parameter correspond to flat functions with low discrimination power. The item characteristic curves (ICCs) of the response categories describe how the probability of choosing one category rather than another varies across latent trait values. The degree of information provided by items (and categories) varies along the latent trait, being a function of \(\lambda\) and of the category probabilities (Toland 2014). The Test Information Function (TIF) is the sum of the information contained in the single items, as described by the item information curves (IICs). The TIF allows the reliability of individuals’ estimates to be assessed on different segments of the latent trait.

The higher the test information at a point of the latent trait, the greater the precision of the estimates of the latent trait values there (i.e. the smaller the standard errors of \(\theta\)). In analysing SET questionnaires, the GRM allows each pattern of responses to be converted into a metrical measure of students’ perceived quality.
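As an illustration of this step, a GRM can be fitted with the R package mirt and the EAP predictions of \(\theta\) extracted as z-scores; the data frame and column names below are assumptions for the sketch, not the original study’s code.

```r
## Sketch: fit a unidimensional GRM and predict the z-scores.
library(mirt)

items <- ratings[, paste0("I", 2:10)]   # ordinal responses coded 1-4
fit_grm <- mirt(items, model = 1, itemtype = "graded")

## EAP predictions of each student's latent satisfaction (the z-scores)
z_scores <- fscores(fit_grm, method = "EAP")[, 1]
```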

The multilevel model allows us to analyse the relationship between the latent trait value \(z_{ig}\) of individual i attending course g and students’ characteristics (\(\varvec{x}_{ig}\)), as well as variables at other levels of the analysis, such as course characteristics or other compositional variables (\(\varvec{z}_{g}\)):

$$z_{ig}=\alpha +\varvec{x}_{ig}'\varvec{\beta } + \varvec{z}_{g}' \varvec{\gamma } + u_g+ \epsilon _{ig}. \tag{2}$$

In Eq. 2, \(u_g\sim N(0, \sigma ^2_u)\) is a random term at the course level (level-2) shared by students who evaluate the same teacher/course; it captures the deviation of course g from the overall intercept \(\alpha\). \(\epsilon _{ig}\sim N(0, \sigma ^2_\epsilon )\) is the individual (level-1) residual term. The unexplained variance in the z values is split into the between-course variance \(\sigma ^2_u\) (Between) and the within-course variance \(\sigma ^2_\epsilon\) (Within). The share of the first component over the sum of the two (the Intra-Class Correlation Coefficient) provides a measure of the degree of correlation between the responses of two students who evaluate the same course. Dependencies between ratings of students who belong to the same degree program, department or faculty can easily be modelled by generalizing Eq. 2 to consider further levels of clustering of the units, such as degree programs or faculties at level-3. In this way the similarity in the responses is captured by adding further random terms shared by courses belonging to the same degree program or faculty.
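A minimal sketch of Eq. 2 with the R package lme4, under hypothetical variable and identifier names, is reported below; the last line shows the 3-level extension mentioned above.

```r
## Sketch: 2-level random intercept model for the z-scores.
library(lme4)

m2l <- lmer(z ~ gender + age + size_course + (1 | course), data = d)

## Intra-Class Correlation: between-course share of the total variance
vc  <- as.data.frame(VarCorr(m2l))
icc <- vc$vcov[1] / sum(vc$vcov)

## 3-level extension: courses nested in degree programs
m3l <- lmer(z ~ gender + age + size_course + (1 | program/course), data = d)
```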

3 An integrated strategy of analysis

The main advantage of using IRT models is that different latent trait values are estimated for individuals with different response patterns. Thus, this approach overcomes the issues related to the definition of a weighting scheme and a scaling method for combining responses to the ordinal variables into an overall metrical indicator. Furthermore, these models make it possible to: (i) study the properties of the measurement scale; (ii) remove redundant items or categories; (iii) provide values of the latent trait on a continuum while treating the data as categorical; and (iv) assess the degree of reliability of the estimates across the different segments of the latent trait. The indicators based on descriptive measures proposed in the literature in the last decades to analyse SET questionnaires leave some of these aspects open (e.g. the choice of a weighting system, uncertainty, etc.), or focus on just some of them (e.g. the validation of a scaling method).

On the other hand, the use of multilevel analysis for SET data makes it possible to: (i) assess the variability in students’ ratings that is ascribable to the nesting of students in higher-level units (e.g. courses, departments, faculties, etc.); (ii) evaluate how much of this variability is explained by differences across courses in students’ composition with respect to socio-demographic characteristics and previous educational background, and how these characteristics are related to students’ perceived quality; and (iii) provide adjusted measures of the quality of university courses suitable for comparisons among them.

From the literature, two main approaches emerge for the analysis of SET data that exploit the advantages of IRT and multilevel models. The first considers a combined use of multilevel analysis and IRT modelling in an overall model (MLIRT, as in Bacci and Caviezel 2011; Sulis and Capursi 2013); the second uses the two approaches in two separate steps (two-step approach, as in Sani and Grilli 2011; Sulis and Porcu 2015).

The MLIRT model is recommended when the analysis mainly focuses on assessing the properties of the measurement instrument at the course level, when the analysis is more descriptive than explanatory, and when it is confined to courses belonging to the same faculty. However, the complexity of explanatory multilevel IRT models (De Boeck and Wilson 2004; Sulis and Capursi 2013) makes it hard to specify and estimate models that consider further levels of clustering of the units and the effect of confounders at different levels of analysis.

The two-step approach, instead, allows a fully explanatory analysis of the effect of students’ characteristics and other compositional variables. It also allows researchers to easily extend the multilevel model in order to assess the effect of factors which may influence the evaluation process at higher levels of clustering of the units.

In this paper, we adopt an integrated strategy of analysis based on the second approach in order to define an adjusted indicator of students’ satisfaction. The strategy comprises two main steps:

1. the GRM is estimated to predict students’ satisfaction with respect to a course (namely, the z-scores);

2. the z-scores are used as the response variable in a multilevel model which considers the nesting of students in courses and the effect of relevant covariates.

Note that the effect of further levels of clustering of the units is assessed before fixing the number of levels in the multilevel model specification. The posterior predictions of the course-level residuals (level-2 residuals), with the related measures of uncertainty, are used as indicators of course quality as perceived by students. Residuals are indeed considered in the literature as adjusted indicators suitable for making comparisons across courses (Goldstein and Spiegelhalter 1996; Leckie and Goldstein 2009).
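Under the same illustrative names used in the previous sketches, the level-2 residuals and their uncertainty can be extracted from a fitted lme4 model as follows.

```r
## Posterior predictions of the course-level residuals u_g and their
## conditional standard errors (measures of uncertainty).
re   <- ranef(m2l, condVar = TRUE)$course
u_g  <- re[, "(Intercept)"]                 # posterior means of u_g
se_g <- sqrt(attr(re, "postVar")[1, 1, ])   # conditional standard errors

## Approximate 95% intervals around each course's predicted residual
ci <- cbind(lower = u_g - 1.96 * se_g, upper = u_g + 1.96 * se_g)
```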

4 Student evaluation of teaching at the university: a case study

The usefulness of the integrated strategy of analysis is illustrated here on a real data set. We consider the information derived from the on-line questionnaire administered to students attending courses of the degree programs offered at a university located in Southern Italy in the academic year (a.y.) 2013/2014. Students filled in a questionnaire for each university course they attended. The data have a hierarchical structure: courses are nested in degree programs, each department includes different types of programs, and faculties group several departments according to disciplinary affinity. The data gathered are organized in 801 courses, 79 degree programs (35 undergraduate programs, 34 master programs and 10 single-cycle programs), 16 departments and 6 faculties.

For the measurement of students’ satisfaction, Italian universities adopt the guidelines established in 2013 by the ANVUR agency. The latest version of the questionnaire for attending students (i.e. students who declare to attend more than 50% of the course lectures)Footnote 3 comprises 11 items measured on a 4-category ordinal Likert scale (decidedly no [DN], more no than yes [MN], more yes than no [MY], decidedly yes [DY]). The items are divided into three groups concerning course organization (preliminary knowledge [\(I_1\)], credits [\(I_2\)], reading material [\(I_3\)], exam rules [\(I_4\)]), aspects related to the teaching style (punctuality at lectures [\(I_5\)], ability to motivate [\(I_6\)], clear explanation [\(I_7\)], tutorial activity [\(I_8\)], respect of the syllabus [\(I_9\)], punctuality at office hours [\(I_{10}\)]), and the student’s interest in the course topic [\(I_{11}\)].

In the following, we consider the information regarding the 35 undergraduate programs and the 711 courses with at least 10 completed questionnaires, for a total of 50,651 questionnaires of students attending the university courses. In addition to the items related to the teaching domain, students’ socio-demographic characteristics, prior educational attainment at secondary school and university career variables have been included. More specifically, Table 1 reports the main features of the variables selected for the study: (i) the percentages of responses to the categories of items \(I_1\)–\(I_{11}\); (ii) the students’ characteristics (gender, student age in years [Age], type of secondary school [No Lyceum], secondary school grade [GradeSS], enrolment year at university [EnrYear(I)], and the faculty in which the student is enrolled—Engineering [\({\textit{Faculty}}_{{\textit{EN}}}\)], Economics, Communication and Political Science [\({\textit{Faculty}}_{{\textit{ECPS}}}\)], Education and Humanities [\({\textit{Faculty}}_{{\textit{EH}}}\)], Medicine [\({\textit{Faculty}}_{H}\)], Maths [\({\textit{Faculty}}_{M}\)]); (iii) the number of filled-in questionnaires per course [Size course].

Overall, 43.07% of respondents are male and the average age is 21.49. Moreover, 35.49% of respondents did not attend a lyceumFootnote 4 at secondary school, and the final secondary school grade is around 80 on average (out of a maximum of 100). In 42.20% of cases, students are enrolled in their first year. The distribution of respondents across faculties is almost balanced (around 20%) for four of them (Engineering; Economics, Communication and Political Science; Education and Humanities; Maths), with a lower percentage attending courses at the Medicine faculty (12.33%).

With respect to the teaching domain, the distribution of the responses is mainly concentrated on positive ratings (the sum of the percentages of the two positive categories, DY and MY). A slight dissatisfaction (the sum of the percentages of the two negative categories, DN and MN) is registered for the items related to the preliminary knowledge of students (22.72%), the presence of tutorial activity (17.63%), the reading material provided by the lecturer (16.23%), the ability to motivate students (15.73%), the clarity in presenting the exam rules (14.07%), and the credits gained (13.62%).

Table 1 Percentages of responses to the questionnaire items (decidedly no [DN], more no than yes [MN], more yes than no [MY], decidedly yes [DY]) and main characteristics (mean, percentage, and standard deviation (SD, in italics)) of the variables selected for the case study

5 Results

5.1 Questionnaire validation

The first step of the proposed approach is mainly related to the validation of the questionnaire, both in terms of the selected items and of the properties of the measurement scale. We are interested in predicting the individual values of the latent trait (i.e. the student’s overall satisfaction with university teaching) on the basis of students’ response patterns to the items (the z-scores).

For this purpose, we analyse the properties of the scale of items adopted in Italy to measure students’ satisfaction with the quality of university courses, looking at the reliability measures and at the results of the GRM.

In order to assess the properties of the questionnaire items adopted to measure the latent trait z, we consider only those items strictly related to teaching (\(I_2\) to \(I_{10}\)), excluding the items referring to the prior knowledge (\(I_1\)) and the interest in the topic (\(I_{11}\)) declared by respondents.

The estimated value of Cronbach’s coefficient \(\alpha\) (0.88) signals, on average, a high reliability of the questionnaire items for measuring the latent trait. However, an investigation of the IICs and the TIF, which better describe the properties of the measurement instrument, highlights a high level of reliability of the test for medium-low values of the latent trait (see Fig. 1). Furthermore, the figure shows that the most discriminating items are those related to the teachers’ ability to motivate (\(I_6\), \(\lambda\) = 3.00) and the clarity of their explanations (\(I_7\), \(\lambda\) = 3.04) (both items have similar ICCs). The lowest discrimination power is, instead, registered for \(I_2\) (\(\lambda\) = 1.43). The values of the discrimination parameters therefore highlight that the items contribute with different loadings to the measurement of the latent trait z.
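The diagnostics discussed here can be reproduced, under the assumptions of the step-1 sketch above (objects items and fit_grm), with the psych package for Cronbach’s \(\alpha\) and mirt’s plotting utilities for the curves shown in Fig. 1.

```r
## Scale diagnostics (sketch): reliability, item parameters, information.
library(psych)

psych::alpha(items)$total$raw_alpha              # Cronbach's alpha

coef(fit_grm, IRTpars = TRUE, simplify = TRUE)   # lambda and tau per item

plot(fit_grm, type = "trace")       # category characteristic curves (ICCs)
plot(fit_grm, type = "infotrace")   # item information curves (IICs)
plot(fit_grm, type = "info")        # test information function (TIF)
```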

Summarizing, item responses are concentrated on the positive categories (DY and MY), as the ICC curves in Fig. 1 show, so there are no informative item-categories in the medium-high/high areas of the latent trait. Hence, the reliability of the adopted scale declines for medium-high and high levels of the latent trait.

Fig. 1 Graded response model results: ICCs for items \(I_2\) and \(I_{6}\); item information curves and test information function

5.2 Indicators and rankings comparison

One of the aims of the proposed integrated strategy of analysis is to compare, for each item, the course rankings produced by different indicators. To this purpose, we consider the six indicators described in Sect. 2. The first three, based on the ordinal nature of the variables, are the index \(IS_R\) with exponent r equal to 1 and 0.5 (denoted \(IS_1\) and \(IS_{0.5}\), respectively) and the index SDI. The last three indicators are calculated as weighted averages of scores attributed to the four ordinal categories of each item (denoted \(IM_1\) for equally spaced scores, \(IM_2\) for unequally spaced scores, and \(IM_3\) for scores derived as a function of the item-threshold parameters of the previously estimated GRM).

In particular, the comparison between the different rankings is made by using Spearman’s coefficient \(\rho\), calculated for each pair of rankings based on the six indicators. The results are summarized graphically by heat maps of \(\rho\), ranging from a minimum of zero (discordant rankings) to a maximum of one (fully concordant rankings).Footnote 5
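A sketch of this comparison in R is given below, assuming ind is a matrix with one row per course and one column per indicator computed for a given item; the object names are illustrative.

```r
## Pairwise Spearman correlations between the six indicator-based rankings.
rho <- cor(ind, method = "spearman")

## Heat map of the rank correlations (no dendrograms, raw values)
heatmap(rho, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```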

Using the descriptive indicators defined in Sect. 2 for each item, there are basically no substantial changes in the ranking of the 711 courses for most of the indicators. However, we note that a different course ranking may be obtained using the \(IS_{0.5}\) indicator. For illustrative purposes, Fig. 2 shows the heat maps obtained for two courses, considering the three items \(I_2\), \(I_4\) and \(I_6\).

This result points out that using solely descriptive indicators may lead to different course rankings. In this respect, adding a model-based step can be useful to obtain more stable results.

Fig. 2 Heat maps for the three items \(I_2\), \(I_4\) and \(I_6\) comparing the rankings of two courses by means of Spearman’s coefficient \(\rho\) (positive values, from 0 to 1) according to the six indicators \(IS_{0.5}\), \(IS_{1}\), SDI, \(IM_{1}\), \(IM_{2}\), \(IM_{3}\) (on the horizontal and vertical axes)

5.3 Multilevel models results

As the second step of the integrated strategy, we estimate 2-level random intercept modelsFootnote 6 by considering the two hierarchical levels related to the students (level-1) and university courses (level-2).Footnote 7

The response variable is the overall measure of student teaching satisfaction derived from the estimated GRM. The covariates included in the models refer to students’ socio-demographic characteristics, prior educational attainment at secondary school, university career, course size (measured by the number of filled-in questionnaires for each course), the preliminary knowledge declared by the student (\(I_1\)) and the interest in the course topic (\(I_{11}\)). The available covariates are classified into three blocks of predictors in order to assess the variability in students’ responses ascribable to different sources of heterogeneity: the first block monitors the effect of students’ socio-demographic characteristics and course characteristics; the second block detects the role played by differences among faculties; the third block takes into account the effect of students’ self-stated preliminary knowledge and interest in the topics covered in the course. The strategy we follow is to add the different kinds of predictors to the null model, inserting one block at a time and retaining only the relevant predictors.

In the following, the results of four estimated models are presented (Table 2): (i) the null model with the random intercept shared by students in the same course (\(M_1\)); (ii) the model that also includes students’ covariates (gender, age, type and grade of secondary school, enrolment year at university) (\(M_2\)); (iii) the model that additionally considers the faculty effect (\(h-1\) dummy variables, with \(h=6\) faculties) (\(M_3\)); and (iv) the model that adds the students’ self-stated background (preliminary knowledge) and interest in the topic (\(k-1\) dummy variables, with \(k=4\) ordinal categories, for each of the two items \(I_1\) and \(I_{11}\)) (\(M_4\)).
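The model-building sequence can be sketched in lme4 as follows, again with illustrative variable names; the final lines compute the share of variance at each level for every model.

```r
## Models M1-M4 (sketch): one block of predictors added at a time.
m1 <- lmer(z ~ (1 | course), data = d)                       # null model
m2 <- update(m1, . ~ . + gender + age + no_lyceum + grade_ss +
                   enr_year + size_course)                   # + students
m3 <- update(m2, . ~ . + faculty)                            # + faculties
m4 <- update(m3, . ~ . + factor(I1) + factor(I11))           # + background

## Share of variance at course level (row 1) and student level (row 2)
sapply(list(m1, m2, m3, m4),
       function(m) { v <- as.data.frame(VarCorr(m))$vcov; v / sum(v) })
```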

For model \(M_1\), 85% of the total variance lies at the student level (level-1), while 15% lies at the course level (level-2).

With the introduction of covariates related to students’ characteristics (\(M_2\)), the sizes of the variance components remain almost the same, but the composition of the courses with respect to the covariates is taken into account. Specifically, this model provides evidence of a significant effect of gender (males are slightly more satisfied than females), age (older students are slightly more satisfied than their colleagues enrolled just after secondary school), and course size (courses with a high number of respondents show a lower degree of satisfaction towards teaching).

The faculty effect on overall student satisfaction (\(M_3\)) is, instead, relevant: students enrolled in the faculties of Education and Humanities [\({\textit{Faculty}}_{{\textit{EH}}}\)], Maths, Physical and Natural Sciences [\({\textit{Faculty}}_{M}\)] and Economics, Communication and Political Science [\({\textit{Faculty}}_{{\textit{ECPS}}}\)] show a higher level of satisfaction than students attending an undergraduate program at Medicine [\({\textit{Faculty}}_{H}\)] or Engineering [\({\textit{Faculty}}_{{\textit{EN}}}\)] (the reference category). By considering differences across faculties, the proportion of variance at level-2 decreases to 14%.

Finally, the results for \(M_4\) show that the combined effect of the two variables related to the student’s preliminary knowledge (\(I_1\)) and her/his interest in the course topic (\(I_{11}\)) reduces the proportion of variance at level-2 to 11.5%, while increasing the share ascribable to differences across students to 88.5%. This means that students with a prior background and an interest in the course topic are more satisfied than students with lower knowledge and no interest in the topic.

Summarizing, moving from the simplest model (\(M_1\)) to the most complex one (\(M_4\)), a decrease in the unexplained variability of students’ satisfaction towards teaching is observed: approximately a 26% decrease between students (level-1) and approximately a 50% decrease between courses (level-2).

Table 2 Estimated 2-level random intercept models for students (level-1) and university courses (level-2)

5.4 Adjusted versus unadjusted indicator based on students’ characteristics

The results of the multilevel analysis are used to compare the different courses. Figure 3 shows the level-2 course residuals for the estimated models \(M_1\)–\(M_4\).

In multilevel analysis, the expected posterior means of the residual terms \(\hat{u}_g\), obtained from models \(M_1\) and \(M_4\), can be considered unadjusted and adjusted indicators of university course quality, respectively. For both models, a ranking of courses is derived based on the Rating Scale Index (RSI) (Sulis and Porcu 2015). This index is based on pairwise comparisons between courses and uses the information on their expected predictions and their pairwise confidence intervals (Goldstein 2011). Specifically, the RSI compares the pairwise confidence interval of the expected posterior prediction of a course with those of all the other courses under evaluation. The value of the index for a generic course g is equal to the number of courses whose confidence interval lies completely below the confidence interval of course g. The index ranges between 0 and \((n-1)\), with higher values signalling better performance. Courses are ranked by decreasing values of the index, and the average rank is assigned in case of ties (Table 3). The main evidence arising from a comparison between the two rankings is that the RSI indexes related to the two models have a level of agreement equal to 0.85. It is worth highlighting some relevant changes in the rankings of individual courses: e.g. the course labelled 511 moves from rank 1.5 to rank 38.
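A hedged sketch of the RSI computation is given below: for each course it counts how many courses have a pairwise interval lying entirely below its own. The ±1.39·SE half-width follows the pairwise-comparison intervals discussed in Goldstein (2011); the exact multiplier used by Sulis and Porcu (2015) may differ, and the residual and standard-error vectors are assumed to be extracted from \(M_1\) and \(M_4\) as in the earlier residual-extraction sketch.

```r
## Rating Scale Index (sketch): count courses entirely below each course.
rsi <- function(u, se, k = 1.39) {  # k = 1.39 assumed for pairwise intervals
  lo <- u - k * se
  hi <- u + k * se
  sapply(seq_along(u), function(g) sum(hi < lo[g]))
}

rsi_m1 <- rsi(u_g_m1, se_g_m1)   # unadjusted indicator (M1 residuals)
rsi_m4 <- rsi(u_g_m4, se_g_m4)   # adjusted indicator (M4 residuals)

## Rank by decreasing RSI; average ranks are assigned in case of ties
rank_m1 <- rank(-rsi_m1, ties.method = "average")
rank_m4 <- rank(-rsi_m4, ties.method = "average")
```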

Finally, in order to highlight the differences between model-based explanatory procedures and descriptive ones, we also compared the ranking based on the adjusted measures with those obtained by averaging over the questionnaire items \(I_2\)–\(I_{10}\) for the five indicators \({\textit{IS}}_{1}\), \({\textit{IS}}_{0.5}\), \({\textit{SDI}}\), \({\textit{IM}}_{1}\), \({\textit{IM}}_{2}\). The level of agreement between the rankings obtained with \({\textit{RSI}}_{M1}\) and the other indexes is always lower than 0.80. Using \({\textit{RSI}}_{M4}\), which also accounts for the heterogeneity in the characteristics of the evaluators, markedly reduces the level of agreement (see Fig. 4).

Fig. 3 Level-2 course residuals for the estimated multilevel models \(M_1\)–\(M_4\)

Fig. 4 Heat maps comparing the rankings of the 711 courses by means of Spearman’s coefficient \(\rho\) (positive values, from 0 to 1) according to the averages over items \(I_2\)–\(I_{10}\) of the five indicators \({\textit{IS}}_{1}\), \({\textit{IS}}_{0.5}\), \({\textit{SDI}}\), \({\textit{IM}}_{1}\), \({\textit{IM}}_{2}\), and the RSI indexes for models \(M_1\) and \(M_4\) (on the horizontal and vertical axes)

Table 3 List of the first five courses and the last three courses according to the values of RSI indices for \(M_1\) and \(M_4\) multilevel models with their rankings

6 Conclusions

The present study proposed an integrated strategy of analysis for the treatment of student evaluation of teaching data. The combined use of IRT and multilevel models is advanced to carry out a fully explanatory analysis of the effect of students’ characteristics and other compositional variables across courses. Specifically, the advantage of using an explanatory rather than a merely descriptive approach is investigated. The strategy was tested within a case study focusing on 35 undergraduate programs, including 711 courses and 50,651 questionnaires of students attending courses at a university located in Southern Italy.

As a first general finding, the SET questionnaires adopted to measure the quality of teaching appear to have low informative power (and thus low reliability) for high latent trait values.

Indeed, students’ responses are concentrated on the high values of the Likert scale, so there are no informative item-categories in the medium-high/high areas of the latent trait. Second, the two-step procedure made it possible to compare course rankings obtained from model-based explanatory procedures with those from descriptive ones. The empirical analysis clearly shows that different course rankings are found when considering model-based adjusted indicators instead of rankings obtained by taking averages over descriptive indicators of the questionnaire items.

This result points out the weakness of descriptive indicators, as well as of unadjusted indicators, which neglect heterogeneity across courses and students’ characteristics. In this respect, a model-based approach to course ranking appears to be a more effective choice for any informed decision-making process, especially for teacher reward mechanisms based on students’ evaluations.