Introduction

Over the past five decades, higher education systems around the world have been expanding, leading to what is often referred to as the “massification” of higher education (OECD 2008). It has been argued that this broad expansion has come at a cost in terms of the quality of the education provided. As a result, in the last decade a number of academics, government agencies, and international organizations have started to advocate for more accountability toward measuring student learning outcomes (SLOs) in higher education (OECD 2013; Zemsky et al. 2005). The quality of higher education is a complex, contested, and elusive concept.Footnote 1 However, at a time when public budgets for higher education have been decreasing and institutions need to be more accountable to the public, it is important to start developing a set of measures of student learning outcomes in higher education, so that higher education systems can demonstrate whether students are indeed gaining valuable knowledge and skills (Coates 2009, 2014).

As a result of pressures for greater accountability for the outcomes produced by postsecondary institutions, a number of countries around the world, such as England, Brazil, and Colombia, have started to create more comprehensive systems to assess the quality of higher education processes.Footnote 2 Most recently, the Organization for Economic Cooperation and Development (OECD), through its Assessment of Higher Education Learning Outcomes (AHELO)Footnote 3 project, engaged in a more comprehensive process of assessing the quality of education that includes measuring whether students are gaining appropriate general knowledge and subject area skills.

Brazil is one of the countries that have been working for almost two decades to create a more comprehensive system to assess and evaluate higher education institutions. In 1996, the Brazilian government passed Decree 2.026/96, which laid down two types of measures for assessment: (1) an analysis of general performance indicators by characteristics such as state and region, area of knowledge/major field of study, and type of higher education institution, and (2) an institutional assessment by peers covering administration, education, social integration, and technological, cultural, and scientific products. This was institutionalized in 2004, when Brazil passed federal law 10.861/2004 adopting a formal evaluation system: the Brazilian Higher Education Evaluation System, or Sistema Nacional de Avaliação da Educação Superior (SINAES).Footnote 4 As part of SINAES, the country adopted the National Student Performance Exam, or Exame Nacional de Desempenho dos Estudantes (ENADE),Footnote 5 a compulsory college-exit examination designed to measure the general and subject area knowledge of students in different major fields of study (e.g., economics, engineering). Brazil has administered the ENADE annually to students in their senior year since 2004, and from 2004 through 2010 it was also administered to freshmen in different major fields of study.

The purpose of this study is to capitalize on the ENADE to provide initial descriptive measures of gains in SLOs in general and subject area knowledgeFootnote 6 across different major fields of study. Two main research questions guide this study: (1) Are there gains in SLOs, in terms of the general and subject area scores, between freshmen and seniors within each major field of study? and (2) Do the SLOs of freshmen and seniors within each major field of study differ according to specific individual (i.e., the proportion of low- and high-income students enrolled in the programs) and institutional (i.e., public vs. private) characteristics?

The fact that Brazil administered the same instrument (i.e., ENADE) to both freshmen and senior students provided a unique opportunity to get a first approximation of the general and subject area knowledge and skills gained by students enrolled in different majors.

This study contributes to the field by proposing a simple, intuitive, and visually compelling methodology for presenting gains in SLOs in higher education. We use this methodology to estimate SLOs by program and, in addition, to compare estimates by specific individual and institutional characteristics. We also capitalize on meta-analytic techniques to aggregate the results and present differences by specific individual (i.e., by SES) or institutional (i.e., public vs. private) characteristics. The methodology can be used by college administrators and state and federal officials around the world to measure SLOs in higher education. Specifically, this study estimates effect sizes for both the general knowledge and subject area sections of the examination. It is worth noting that this is a descriptive study and that we are not addressing two issues that might bias the estimates. First, we are not controlling for previous academic preparation or for the selection of students into college and into specific programs. Second, we are only partially addressing the problem of non-random attrition. We discuss these problems in more detail in the methodology section below.

Conceptual framework and literature review

Clark (1983) characterized higher education as a triangle in which three main forces (the academic community, the state, and the market) are in constant struggle, each trying to shape the system toward its own particular set of beliefs and goals. The differences in their beliefs and goals for the system also imply that they hold somewhat different definitions of “quality” in higher education and of how best to measure and assess it. Barnett (1992) used Clark’s triangle to illustrate how each of these forces shaped the debate over the quality of higher education, and how each of these groups supports different methods to assess and evaluate higher education institutions. He concluded that the “debate over quality in higher education should be seen for what it is: a power struggle where the use of terms reflects a jockeying for position in the attempt to impose their own definitions of higher education” (Barnett 1992, p. 6).

Barnett (1992) argued that higher education is a process, and that simple ways of measuring quality, such as ranking higher education institutions according to specific outcomes (e.g., degrees awarded or employment of graduates), ignore the core mission of education: to engage students in a process that helps them develop the art of critical thinking and problem solving. He offered the definition of “quality” that we adopt in this study: “a high evaluation accorded to an educative process, where it has been demonstrated that, through the process, the students’ educational development has been enhanced: not only have they achieved the particular objectives set for the course but, in doing so, they have also fulfilled the general educational aims of autonomy, of the ability to participate in reasoned discourse, of critical self-evaluation, and of coming to a proper awareness of the ultimate contingency of all thought and action.”

The field of assessment and evaluation of higher education institutions has identified a number of important dimensions and best practices for conducting evaluations of particular programs or institutions (National Academy for Academic Leadership 2014), but to the best of our knowledge, no comprehensive models have been developed to measure student learning outcomes (SLOs) in higher education. A notable exception is a model recently proposed by Coates (2009) to measure the value added by higher education institutions in Australia. He argued that measuring learning at the higher education level is a very complex issue, but vital for demonstrating the “quality” and “value” provided by higher education institutions. Coates listed and described four approaches that can be combined into a single model and used to measure the “quality” and “value added” by higher education: (1) computation of value-added estimates by comparing predicted against actual performance using data from entrance tests and routine course assessments, (2) comparison of outcomes on objective assessments administered to cohorts in their first and later years of study, (3) comparison of first- and later-year student engagement, and (4) feedback on graduate skills provided by employers, which could offer an independent perspective on the quality of the education provided. This study focuses on the second component of Coates’s model, the comparison of outcomes on objective assessments administered to cohorts in their first and later years of study, and it takes advantage of having student-level scores for three different cohorts of freshmen and seniors enrolled in nineteen different undergraduate programs.

In their book “Academically Adrift,” Arum and Roksa (2011) attempted to measure whether students in the USA were learning valuable skills in higher education. They used the Collegiate Learning AssessmentFootnote 7 (CLA) instrument to test over 2000 freshmen in 24 institutions. The authors concluded that 45 % of students “did not demonstrate any significant improvement in learning” during the first 2 years of college. The main problem, as recognized by Arum and Roksa (2011), is that the instrument lacks construct validityFootnote 8 and can only measure general skills, when in reality students go to college to gain subject area knowledge. In addition, the estimates of the study might be biased because of inappropriate controls for students’ previous academic preparation and a lack of controls for attrition.

A number of studies have recently attempted to measure SLOs in terms of the critical thinking and problem-solving skills gained by students in college while addressing the selection problem implicit in these types of models (Barrera-Osorio and Bayona-Rodríguez 2014; Domingue et al. 2014; Rossefsky-Saavedra and Saavedra 2011; Saavedra 2009; Steedle 2012). Rossefsky-Saavedra and Saavedra (2011) used information from two different cohorts of first-year and last-year college students in Colombia in 2009 to estimate value-added models in higher education. The study used pilot data from the national postsecondary-exit examination,Footnote 9 and the final sample consisted of a selected sample of students in some major fields of study in only 17 of the country’s 177 universities. The authors estimated the value added by institutions using regression analysis adjusting for covariates and propensity score weighting. They concluded that, relative to observationally similar high school graduates, students in the last year of college scored about half a standard deviation higher, with statistically significantly higher scores on every component of the test. A number of recent studies that have also used Colombian data present contradictory findings (Melguizo et al. 2015; Barrera-Osorio and Bayona-Rodríguez 2014; Domingue et al. 2014; Saavedra 2009). Whereas Saavedra (2009), Domingue et al. (2014), and Melguizo et al. (2015) find increases in SLOs as measured by differences between SABER 11 and SABER PRO results, Barrera-Osorio and Bayona-Rodríguez (2014) find no gains. The discrepancies in the findings might be related to differences in model specifications, as well as to the use of cohorts of students who were either part of the pilot study for SABER PRO or took the examination when it was not a compulsory requirement for graduation. The inconsistent findings suggest the need to explore, in a systematic and rigorous way, the types of models that are less subject to bias.

This study contributes to the previous literature by using the ENADE, a compulsory college-exit examinationFootnote 10 to measure both the general knowledge and subject area skills gained by students in higher education.

National Student Performance Exam (ENADE)

The ENADE is a compulsory college-exit examination with two main components: a general component and a subject area component. The general component consists of 10 items: 8 multiple-choice (MC) questions and 2 short essays. The subject area component has 27 MC questions and 3 short essays. Students were given 4 h to complete the test. The general component was common to all programs participating in a given year and was unrelated to the student’s program of study; it tested knowledge of cultural and social aspects of contemporary society.

It is important to clarify that even though the ENADE was a compulsory examination with high stakes for postsecondary institutions (i.e., results are used for budget allocation and accreditation), students’ scores were neither a prerequisite for college graduation nor a measure used by potential employers. The examination was therefore low stakes for students, and probably as a result there was substantial variation in the proportion of students who completed the examination, as well as in the completion rates of its different parts. For example, in 2012, of the over 587,000 students required to take the examination, only about 469,000 (roughly 80 %) actually took it. In terms of the different parts of the examination, in the ENADE 2012 results for economics, about 10 % of students did not answer any question of the multiple-choice general component, compared to about 30 % who did not answer the written questions.Footnote 11 There was also wide variation in completion rates and absenteeism across the six geographic regions and by institutional control (i.e., public and private).

The ENADE has been administered annually since 2004 to students in their senior year, and from 2004 to 2010 it was also administered to freshmen in different programs. Initially, the ENADE was given to a representative sample of students, but in 2009 it moved from a sample to a census approach. The government established three main groups of programs of study: (1) Biological Sciences, (2) Science, Technology, Engineering, and Mathematics (STEM), and (3) Social Sciences, and each group is evaluated every three years. In 2004, the government tested students in 14 programs in the Biological Sciences (e.g., medicine and nursing), in 2005 in 20 different STEM programs (e.g., engineering and computer science), and in 2006 in 16 programs in the Social Sciences and business administration (e.g., sociology, economics, and business). The ENADE was then administered to students enrolled in programs in each of these three categories every three years; for example, students enrolled in the Biological Sciences programs were assessed in 2004, 2007, 2010, and 2013. The fact that Brazil administered the examination to both freshmenFootnote 12 and seniorFootnote 13 students from 2004 until 2010 provides us with a unique opportunity to test differences by program. Finally, in terms of the reliability and validity of the ENADE, only the psychology-specific examination has been evaluated (Primi et al. 2010, 2011).

Methodology

Data

We used the most recent and publicly available data for the ENADE,Footnote 14 the examination that was given to both freshmen and seniors in the three main categories of programs: (1) Science, Technology, Engineering, and Mathematics (STEM), (2) Social Sciences, and (3) Biological Sciences between 2008 and 2010. In 2008, we selected the programs of architecture, computer science, engineering,Footnote 15 physics, mathematics, and chemistry, as representative of the STEM programs. In 2009, the focus was on programs from the Social Sciences, and within this group, we selected programs in business, accounting, economics, communications, law, and tourism. Finally, in 2010 the focus was on programs from the Biological Sciences, and within this group, we chose biology, biomedicine, physical education, nursing, pharmacy, physical therapy, medicine, nutrition, and dentistry.Footnote 16

Sample

The sample was composed of three different cohorts of students who took the ENADE examination between 2008 and 2010. We had ENADE scores on the general and subject area multiple-choice parts of the examination for 484,410 students enrolled in 10,041 different programs (see Tables 1, 2). As mentioned above, although participation in the ENADE is compulsory and is a requirement for graduation in the selected programs of study, a student may return the answer sheet blank and still fulfill the requirement for graduation.Footnote 17 We therefore used a set of variables (TP_PR_X) indicating whether the student effectively participated in the examination (we do not know how INEP determined a student’s level of participation in the examination). In addition, we excluded students who did not have complete information on the list of covariates included in our model. Below we describe the procedure used to estimate the program-level effect sizes and the estimates from the matching models, as well as the limitations implicit in this methodological strategy.
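To make the sample-construction step concrete, the sketch below shows how these restrictions could be applied to the microdata with pandas. It is only an illustrative sketch: the specific column names for the TP_PR_X participation flags, the covariate names, and the code marking effective participation are placeholders, and the actual names and codes would need to be taken from the INEP codebooks.

```python
import pandas as pd

# Placeholder names: the microdata expose a family of TP_PR_* participation
# flags, but the exact columns and the code marking a valid attempt are
# assumptions here, to be replaced with the values from the INEP codebook.
PARTICIPATION_FLAGS = ["TP_PR_GER", "TP_PR_OB_FG", "TP_PR_OB_CE"]
VALID_CODE = 555  # assumed code for "effective participation"
COVARIATES = ["family_income", "father_educ", "mother_educ", "own_income",
              "gender", "race", "hs_control", "hs_diploma"]  # renamed for readability

def build_analysis_sample(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the two sample restrictions described in the text: keep only
    students flagged as effectively participating, and drop students with
    incomplete information on the covariates used in the matching model."""
    participated = (raw[PARTICIPATION_FLAGS] == VALID_CODE).all(axis=1)
    return raw.loc[participated].dropna(subset=COVARIATES)
```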

Table 1 Sample sizes
Table 2 Sample sizes by major field of study

Calculating effect sizes by program level

We first estimated the gains in SLOs for a program as the standardized difference between the mean senior score and the mean freshman score of students enrolled in a specific field of study. Formally, we computed the gain as an effect size, in particular Cohen’s d:

$$d = \frac{\mu(\text{seniors}) - \mu(\text{freshmen})}{\sigma_{\text{p}}}$$

where \(\mu(\text{seniors})\) was the average test score for seniors, \(\mu(\text{freshmen})\) the average test score for freshmen, and \(\sigma_{\text{p}}\) the pooled standard deviation, which was calculated as:

$$\sigma_{\text{p}}^{2} = \frac{\left(N(\text{seniors}) - 1\right)\sigma^{2}(\text{seniors}) + \left(N(\text{freshmen}) - 1\right)\sigma^{2}(\text{freshmen})}{N(\text{seniors}) + N(\text{freshmen}) - 2}$$

The effect size calculated for the general knowledge part of the examination for economics, for example, measured the knowledge gained on the general examination by students enrolled in any economics program in the country. For each major field of study, we computed the effect size for the multiple-choice parts of both the general and the subject area examinations. We also computed 95 % confidence intervals for the effect sizes, based on non-centrality parameters, as implemented in the MBESS R package (Kelley 2007). Below we describe how we addressed the issue of non-random attrition of students.
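To make this calculation concrete, the sketch below re-implements the effect size and its noncentral-t confidence interval in Python; the study itself used the MBESS R package, so this is only an assumed, simplified equivalent (scipy stands in for MBESS, and the bracketing interval passed to the root finder is a pragmatic choice).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import nct

def cohens_d_with_ci(seniors, freshmen, conf=0.95):
    """Cohen's d (seniors minus freshmen) with a confidence interval based
    on the noncentral t distribution, analogous to the MBESS approach."""
    seniors = np.asarray(seniors, dtype=float)
    freshmen = np.asarray(freshmen, dtype=float)
    n1, n2 = len(seniors), len(freshmen)
    # pooled standard deviation, as defined above
    sp2 = ((n1 - 1) * seniors.var(ddof=1) +
           (n2 - 1) * freshmen.var(ddof=1)) / (n1 + n2 - 2)
    d = (seniors.mean() - freshmen.mean()) / np.sqrt(sp2)
    # the observed t statistic follows a noncentral t with df = n1 + n2 - 2
    scale = np.sqrt(n1 * n2 / (n1 + n2))
    t_obs, df = d * scale, n1 + n2 - 2
    alpha = 1 - conf
    # bound the noncentrality parameter by inverting the noncentral t CDF
    nc_lower = brentq(lambda nc: nct.cdf(t_obs, df, nc) - (1 - alpha / 2),
                      t_obs - 20, t_obs + 20)
    nc_upper = brentq(lambda nc: nct.cdf(t_obs, df, nc) - alpha / 2,
                      t_obs - 20, t_obs + 20)
    return d, nc_lower / scale, nc_upper / scale
```

In the analysis described here, such a function would be applied once per major field of study, separately for the general and the subject area scores.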

Propensity score matching estimates

As described above, even though both freshmen and seniors were randomly selected to take the ENADE examination, there was a problem of non-random attrition (Rossefsky-Saavedra and Saavedra 2011). Because less academically prepared and less motivated students might be more likely to drop out of a program before their senior year, both the observed and unobserved characteristics of the freshmen and seniors may differ. This non-random attrition is problematic: it might result in overestimated effect sizes, that is, estimators with an upward bias. One way to address this issue was to account for student dropout. According to Silva Filho et al. (2007), the annual dropout rate of students from public universities in Brazil was around 12 %, while the dropout rate for private universities was 27 %. To address this problem, we used a propensity score matching (PSM) technique (Stuart 2010) to identify, for each individual in the treatment group (seniors), a “similar” individual in the control group (freshmen) based on a distance measure called the “propensity.” Only matched seniors and freshmen were used in the calculation of the effect size, so the matched freshmen were likely “similar” to the seniors when the latter were freshmen themselves. This procedure addressed the selection bias problem to some extent.

The propensity was the probability of a student becoming a senior, given a number of covariates, that is:

$$e_{i} = P\left(\text{senior}_{i} \mid X_{i}\right) = \frac{\exp\left(\beta X_{i}\right)}{1 + \exp\left(\beta X_{i}\right)}$$

where \(X_{i}\) was the vector of covariates for student \(i\), and \(\text{senior}_{i}\) was an indicator of whether student \(i\) was a senior. The formula above gives the propensity score for a student as the fitted probability from a logistic regression of senior status on the covariates.

The distance between two students was the absolute value of the difference between the logits of their propensity scores, and the matching was performed at the program level; that is, for each program we matched the seniors with the most “similar” freshmen.

$$d_{ij} = \left| \operatorname{logit}\left(e_{i}\right) - \operatorname{logit}\left(e_{j}\right) \right|$$

As covariates, we used a number of variables that the literature associates with student persistence and attainment (Melguizo 2011). The following variables were included: student’s family income,Footnote 18 education level of the father and mother (questions 13 and 14 in all years), student’s income and relation to family regarding support (question 6 in 2009 and 2010, question 9 in 2008), gender, student’s self-declared race (question 2 in 2009 and 2010, question 5 in 2008), student’s high school type (private or public) (question 17 in all years), and student’s type of high school diploma (question 18 in all years). The propensity score matching used in this research was “nearest neighbor”; that is, each senior was matched with the unmatched freshman at the closest distance (as described above), as implemented in the MatchIt R package (Ho et al. 2011). A main limitation of the PSM strategy was that we were unable to control for a number of variables that the literature shows to be correlated with students’ persistence and attainment (i.e., previous academic preparation and non-cognitive factors such as motivation). Despite these problems, the estimates provided by this method probably represent an upper bound on the real effect size.
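A minimal sketch of this matching step is given below, assuming a pandas data frame for a single program with a hypothetical “cohort” column and the covariates listed above. It substitutes scikit-learn’s logistic regression for the MatchIt implementation and performs greedy 1:1 nearest-neighbor matching without replacement on the logit of the propensity score, so it approximates, rather than reproduces, the exact MatchIt behavior.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def match_seniors_to_freshmen(program_df: pd.DataFrame, covariates):
    """Greedy 1:1 nearest-neighbor matching of seniors (treatment) to
    freshmen (control) on the logit of the estimated propensity score,
    within one program; returns the index labels of matched pairs."""
    X = pd.get_dummies(program_df[covariates], drop_first=True).to_numpy(dtype=float)
    y = (program_df["cohort"] == "senior").to_numpy(dtype=int)  # hypothetical flag
    e = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    e = np.clip(e, 1e-6, 1 - 1e-6)          # guard the logit transform
    logit = np.log(e / (1 - e))
    seniors = np.flatnonzero(y == 1)
    unmatched = list(np.flatnonzero(y == 0))
    pairs = []
    for i in seniors:
        if not unmatched:
            break
        # closest still-unmatched freshman in logit distance
        j = min(unmatched, key=lambda k: abs(logit[i] - logit[k]))
        pairs.append((program_df.index[i], program_df.index[j]))
        unmatched.remove(j)                 # matching without replacement
    return pairs
```

Only the students appearing in the returned pairs would then enter the effect size calculation described above.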

Limitations

We would like to acknowledge a number of methodological issues implicit in these types of studies. First, although Brazil has a compulsory high school-exit examination (ENEM) and data from the examinations are available, there is no identifying information in either dataset that allows one to link a student’s ENADE and ENEM scores, so we could not include a variable to control for students’ previous academic preparation. This is problematic given that this is a critical covariate for addressing the selection of students into programs and institutions. We tried to ameliorate the problem by using matching techniques, but this is not enough, and the estimates should be considered descriptive and probably subject to an upward bias. Second, even though the examination was compulsory and sitting for it was a prerequisite for receiving the degree, about 20 % of the students did not take it, and there were differences in response rates across parts of the examination. In addition, students in certain programs and regions of the country protested the examination, so their results could not be included in the analyses. This is problematic and probably biases the estimates. Third, with the exception of Primi et al. (2010, 2011), who conducted a psychometric evaluation of the psychology examination, there has been no independent evaluation of the ENADE. As a result, the psychometric properties of the test are unknown. Finally, similar to the threats to validity described in Rossefsky-Saavedra and Saavedra (2011), our study is also subject to maturation bias.

Results

Gains in SLOs in general and subject area by program/major field of study

We present the results of the gains in SLOs for both the general and subject area tests by the three main categories of programs: (1) STEM, (2) Social Sciences, and (3) Biological Sciences. The results in Fig. 1a show the effect size gain between freshmen and seniors enrolled in a STEM program in any university in the country, for both the general and subject area components of the test. The central dot for each major field of study represents the effect size of the gain for all students in that field. The horizontal line, with the whiskers, represents the 95 % confidence interval on that measure.Footnote 19 The results suggest that there were gains for all students in terms of general knowledge, ranging from 0.1 to 0.2 of a standard deviation, with students in physics and computer science presenting the largest gains. In the subject area component, the gains were of a much larger magnitude, ranging from 0.5 to 1 standard deviation; the programs in which students presented the largest gains were physics and architecture.

Fig. 1

Gains in average scores in the general and subject area components of ENADE in terms of effect sizes: STEM (a), Social Sciences (b), and Biological Sciences (c)

We also found gains in both the general and subject area components of the test for students enrolled in programs in the Social Sciences (Fig. 1b). The gains in the general knowledge component ranged from about 0.05 to 0.2, with students in business administration gaining the most. In the subject area component, the gains were larger, ranging from 0.4 to 0.6; students in accounting exhibited the largest gains, with very little variability in the estimates.

Finally, for students enrolled in Biological Sciences programs and evaluated in 2010, we observed the largest gains in both the general and subject area components, compared to students enrolled in STEM and Social Sciences programs (Fig. 1c). For these students, the gains in the general knowledge component of the examination ranged from zero in medicine to 0.3 in pharmacy. In the subject area component, the results ranged from 0.5 in physical education to 2 standard deviations in medicine. The results for medicine suggest that this program takes in highly academically prepared students, who therefore do not gain much in terms of general knowledge but show substantial gains in the specific subject area component.

Gains in SLOs in the general and subject area components for students from the top and bottom income levels

The previous results clearly illustrate that gains appeared to be larger in the subject area component than in the general knowledge component. This was not surprising given that the general component was not aligned with the curricula of the programs of study. We were also interested in testing whether there was variation in gains between low- and high-income students (Fig. 2a–c). The results were consistent with those in Fig. 1a–c. It was noteworthy that there were no major differences (with a couple of exceptions in law, pharmacy, and physical therapy) in the gains exhibited by low- and high-income students.

Fig. 2

Gains in average scores in the general and subject area components of ENADE in terms of effect sizes by lowest and highest income levels: (a) STEM, (b) Social Sciences, (c) Biological Sciences, (d) all major fields of study

The previous results suggested that there were no clear patterns in the gains for students from different income levels by major field of study. However, this did not mean that there were no differences in the overall scores between low- and high-income students who took the test. To test this, we computed the effect size of the gain for all students as a whole, instead of separating them by major field of study. The bar indicates the 95 % confidence interval on the effect size, so if the confidence intervals do not overlap, there is a significant difference between the effect sizes (with 95 % confidence). The results in Fig. 2d clearly show that, for the combined major fields of study, the pattern of relatively higher gains in the subject area part than in the general part of the examination prevailed. It is also worth noting the lack of statistically significant differences when dividing the sample into students from low- and high-income backgrounds. Finally, there was substantial variation in the scores of students attending programs in the Biological Sciences compared to their peers in STEM and the Social Sciences.Footnote 20
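As an illustration of this comparison, the fragment below reuses the cohens_d_with_ci helper sketched in the methodology section to compute the pooled effect size for low- and high-income students and to apply the non-overlapping-interval criterion. The “income_group”, “cohort”, and score column names are placeholders for illustration, not the actual variable names in the ENADE files.

```python
def compare_income_groups(df, score_col="subject_score"):
    """Compute the pooled effect size for low- and high-income students and
    flag whether their 95 % confidence intervals fail to overlap, which is
    the criterion used in the text to call a difference significant."""
    results = {}
    for label, grp in df.groupby("income_group"):        # e.g., "low" / "high"
        seniors = grp.loc[grp["cohort"] == "senior", score_col]
        freshmen = grp.loc[grp["cohort"] == "freshman", score_col]
        results[label] = cohens_d_with_ci(seniors, freshmen)
    (_, lo_l, hi_l), (_, lo_h, hi_h) = results["low"], results["high"]
    significant = hi_l < lo_h or hi_h < lo_l             # disjoint intervals
    return results, significant
```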

Gains in SLOs in the general and subject area components of the test by institutional control

We also tested whether the differences in effect sizes varied by institutional control (i.e., public vs. private institutions). The results in Fig. 3a–c show some differences for specific major fields of study. For students in STEM programs, there were essentially no differences in the gains between students from private and public institutions, and it was difficult to make any generalizations about the variability in scores between these two groups of institutions. There was substantial variation in the scores on the general part for physics students in private institutions compared to public ones, whereas the opposite was true for students enrolled in computer science programs, where the range of scores was much wider in public than in private institutions. One program that stood out was engineering, where the gains in the general component were much higher for students attending private institutions. The same lack of a clear pattern held for students in the Social Sciences programs (Fig. 3b): there was wide variation in the general component scores and very little in the subject area scores. There was also no clear pattern by institutional control, as students in tourism programs in public institutions gained considerably more than their peers in private institutions, whereas students in accounting, business, and law enrolled in private institutions gained more than those in public institutions. Finally, in the Biological Sciences programs there were no observed differences in the gains in the general component by institutional control (Fig. 3c). However, students enrolled in physical therapy and medicine programs at public institutions gained much more in the subject area component than their peers at private institutions.

Fig. 3

Gains in average scores in the general and subject area components of ENADE in terms of effect sizes by institutional controls: (a) STEM, (b) Social Sciences, (c) Biological Sciences, (d) all major fields of study

Figure 3d shows the results of combining the students in all areas. In general, students in private universities achieved larger gains than students in public universities, with the exception of the subject area component of the examination in Biological Sciences.

Conclusions and policy implications

The results of this study provide empirical evidence that students were gaining both general and subject area knowledge in most of the STEM, Social Sciences, and Biological Sciences programs offered by both public and private institutions in Brazil. The results illustrate that gains appear to be larger in the subject area component than in the general knowledge component. We also found that the majority of students enrolled in Biological Sciences fields (with the exception of medicine in the general component) gained more in both general and subject area knowledge than those in STEM and the Social Sciences. Interestingly, we found no major differences in gains between students from the highest and lowest income levels when compared by major field of study. Finally, we could not discern a clear pattern by institutional control within major fields of study.

The findings of this study are in line with the results of Rossefsky-Saavedra and Saavedra (2011) for a single cohort and an analogous sample of students (i.e., two different samples of freshmen and seniors enrolled in the same programs) who participated in a pilot study for the development of a college-exit examination in Colombia. In our case, we found gains for observationally similar students on the two components of the test: general and subject area knowledge. Even though these studies used different types of tests that did not measure the same competencies, it is important to note that students were indeed gaining knowledge and skills. These findings differ from the work of Arum and Roksa (2011). Such contradictory findings illustrate the problems of estimating gains in SLOs without controlling for the selection of students into institutions and the non-random attrition of seniors (Domingue et al. 2014; Melguizo et al. 2015). Future studies that continue to build on this emerging literature and attempt to produce unbiased measures of gains in SLOs in college need to use appropriate instruments and methods to address the methodological issues embedded in these types of studies. Some recommendations include: (1) choose a college-level test with content that is aligned to the programs of study being evaluated, (2) use a college-level test that ideally has some consequences for students, so that they take it seriously, (3) use appropriate statistical techniques to control for factors associated with college persistence and attainment (i.e., previous academic preparation and non-cognitive factors), and (4) address the issue of non-random attrition of students, especially in the first two years, which is when most dropout takes place.

The results of this study have important policy implications for countries interested in developing comprehensive systems to evaluate the quality of higher education institutions. First, the USA could learn from the experiences of Brazil and Colombia and engage in a long-term process of developing a comprehensive evaluation system (Coates 2014), rather than simply trying to develop a ranking system like the one produced by U.S. News and World Report. Second, as countries start to develop instruments to measure the general and subject area knowledge gained by students, they should work with researchers and testing companies to identify instruments appropriate to the competencies they are interested in measuring. Third, as countries create datasets that can be used to measure growth in SLOs, researchers and policy makers will have to deal with the methodological problems inherent to these types of studies, such as the non-random attrition of students. Fourth, governments should also work toward creating K-20 data systems, so that they have ample variables to control for previous academic preparation and the non-cognitive factors associated with college persistence and attainment. Finally, information on growth in SLOs should be used in a formative way, and comparisons among institutions should be avoided. As documented in the pilot studies of AHELO, it is very important that the information be given back to the institutions so that they can continue to work toward improving students’ learning outcomes.