Introduction

Insofar as we live in a fully interconnected world, “global development is now largely a function of enabling countries and their people to function productively in the global economy and the network society” (Castells 2005, p. 19); hence the central role in the knowledge society of innovation, which is ultimately rooted in the quality of the higher education system (Himanen 2005, p. 358). It is therefore widely recognized that a good education is of paramount importance for the success of individuals and countries in the network society.

The high value individuals assign to their education has greatly increased the demand for information about the quality of universities and higher education systems. University rankings constitute a logical response to that demand, and appear to help students, country officials and the public at large in making sense of a remarkably diverse higher education landscape. Among the truly worldwide higher education rankings, the Shanghai Jiao Tong University Institute of Higher Education Academic Ranking of World Universities (ARWU), also known as the Shanghai ranking, which ranks academic institutions on the basis of their research performance, has lately received a great deal of attention, because it is based on solid, transparent numerical data on research quality and quantity (Marginson 2005). The Shanghai ranking focuses exclusively on research and does not rely on subjective data. All its indicators are open to public scrutiny, since they measure either scientific production or individual excellence recognized by very prestigious awards or by a high number of citations. However, in spite of the conscious effort made to rank universities by comparing their research capacity and output, the ranking is increasingly being used as a yardstick for institutions, and not just in relation to research (Docampo 2011). According to the findings of Hazelkorn (2008), by and large higher education officials think that rankings, and particularly ARWU, force them to be more accountable, set strategic planning goals, and provide comparative information to students, parents and other stakeholders. They agree that international rankings “provide a methodology—albeit the quality of the methodology is contested—by which institutions can benchmark their own performance and that of other institutions” (Hazelkorn 2008, p. 199).

The controversy around the Shanghai ranking has produced a number of criticisms of its methodology and its choice of indicators. The ranking is clearly biased in favor of science and technology, favors English-speaking universities since English is the predominant language of academic publications, and uses overly simple bibliometric indicators (van Raan 2005a). This last remark concerning the misuse of bibliometric indicators was contested by the authors of the ranking, but the dispute remains largely unsettled (Liu et al. 2005; van Raan 2005b).

Dehon, McCathie and Verardi (2010) warn us that one-dimensional measures can be misleading due to oversimplification, while Zitt and Filliatreau (2007) point out that the Shanghai ranking essentially measures overall production, which favors large universities since the per capita indicator does not offset the strong size-dependency of the other measures, a shortcoming also noted by Kivinen and Hedman (2008) in their analysis of the Scandinavian universities in the ARWU ranking. The inherent defects that accompany any attempt to measure higher education performance prompted a barrage of criticism from Billaut, Bouyssou, and Vincke (2009), who used multiple criteria decision making theory to conclude that the Shanghai ranking is not a pertinent tool to discuss the “quality” of academic institutions. Ioannidis et al. (2007) warn institutions not to focus just on the numbers that feed the ARWU indicators, since no measurement has perfect construct validity for the many faces of excellence.

It is therefore beyond any reasonable doubt that the Shanghai ranking, like any other attempt to capture in a few figures the essence of a highly complex institution, suffers from a number of shortcomings that are difficult to overcome. As reported by Van Parijs (2009), the website of Shanghai’s ARWU, as displayed in November 2008, candidly recognized that its way of ranking universities suffered from “many methodological and technical problems”. Nevertheless, it is also clear that “despite growing concerns about technical and methodological issues, there is a strong perception among university leaders that rankings help maintain and build institutional position and reputation” (Hazelkorn 2008, p. 195). Rankings are here to stay, and the Shanghai ranking in particular has not seen its popularity diminish in the past few years.

In scientific circles, the most damaging criticism of a classification that claims to be based upon objective and rigorous data is the claim that its results cannot be reproduced. Hence, after the serious concerns raised by Florian (2007) about the reproducibility of the Shanghai ranking results, in a paper that has not been refuted so far, it has become common currency in the published research that the results of ARWU cannot be replicated, further fueling the case against the Shanghai ranking by diminishing its credibility among the recipients of its annual verdict. It is therefore of paramount importance for institutions that worry about their reputation to settle the question for good: are the ARWU results reproducible or not? If the answer is no, the loss of credibility of the ranking results would constitute an open invitation to disregard them as useless. If the answer is yes, then university officials would be able to monitor the results of their institutions; not only the authorities of the institutions that are classified among the five hundred so-called world-class universities, but most importantly, also the policy makers of institutions that did not make the list, who could then properly assess the gap between their universities and the ARWU list.

It is the aim of this paper to give a response beyond reasonable doubt to the concern about the reproducibility of the Shanghai ranking results. In a previous contribution (Docampo 2008) I found that, at least in the case of the numerically simplest indicator—the one concerning the number of highly cited (HiCi) authors—the apparent inconsistency of the ranking scores noted by Florian disappears once the proportions used to compute relative scores are changed: after identifying the highest scoring institution, the one with the largest number of HiCi authors, the relative scores of the other institutions are calculated in proportion not to their numbers of HiCi authors, but to the square roots of those numbers. That was the first successful attempt to decode the ambiguous statement by Liu and Cheng (2005), “… the distribution of data for each indicator is examined for any significant distorting effect and standard statistical techniques are used to adjust the indicator if necessary”. It was an empirical finding that provided a starting point to formulate the hypothesis that the same procedure is used in the rest of the indicators.

This paper will show that we can indeed positively check the validity of the hypothesis that the procedure used to compute relative scores on the HiCi indicator is also used to compute the scores of the remaining five indicators. In addition, other indicator-specific considerations will be discussed. The paper will settle the issue of the reproducibility of the Shanghai ranking results by showing how the six indicators can be accurately replicated.

The rest of the paper is organized as follows. First the background of the paper is covered by reproducing the methodology of the ranking. The five direct indicators (Alumni, Award, HiCi, N&S, and PUB) are then analyzed to show how the ARWU results can be accurately reproduced. Finally, the composite indicator (PCP) is dealt with and the paper closes with the conclusions.

ARWU methodology

According to Liu and Cheng (2005), the six indicators that compose the Shanghai ranking are:

  • Alumni The number of alumni of an institution winning Nobel Prizes and Fields Medals. Alumni are defined as those who obtained a bachelor’s, master’s or doctoral degree from the institution. Different weights are set according to graduation times: the weight is 100 % for alumni obtaining a degree in 2001–2010, 90 % for the period 1991–2000, 80 % for the period 1981–1990, and so on, down to 10 % for alumni obtaining a degree in 1911–1920 (the decade-based weighting is illustrated in the sketch following this list). If a person obtains more than one degree from an institution, the institution is considered only once.

  • Award The number of staff of an institution winning Nobel Prizes in Physics, Chemistry, Medicine and Economics and Fields Medal in Mathematics. Staff is defined as those who work at an institution at the time of winning the prize. Different weights are set according to the periods of winning the prizes. The weight is 100 % for winners in 2001–2010, 90 % for winners in 1991–2000, and so on, and finally 10 % for winners in 1911–1920. If a winner is affiliated with more than one institution, each institution is assigned the reciprocal of the number of institutions. For Nobel prizes, if a prize is shared by more than one person, weights are set for winners according to their proportion of the prize.

  • HiCi The number of highly cited researchers in 21 subject categories. The definition of categories and detailed procedures can be found at the website of Thomson ISI.

  • N&S The number of articles published in Nature and Science between 2006 and 2010. To distinguish the order of author affiliations, a weight of 100 % is assigned to the corresponding author affiliation, 50 % to the first author affiliation (second author affiliation if the first and corresponding authors share the same affiliation), 25 % to the next author affiliation, and 10 % to other author affiliations.

  • PUB Total number of papers indexed in the Science Citation Index-Expanded and the Social Science Citation Index in 2010. Only “Articles” and “Proceedings Papers” are considered. When calculating the total number of papers of an institution, a special weight of two was introduced for papers indexed in the Social Science Citation Index.

  • PCP The weighted scores of the above five indicators divided by the number of full-time equivalent academic staff. If the number of academic staff for the institutions of a country cannot be obtained, the weighted scores of the above five indicators are used. For ARWU 2011, the numbers of full-time equivalent academic staff were obtained for institutions in Australia, Austria, Belgium, Canada, China, the Czech Republic, France, Italy, Japan, the Netherlands, New Zealand, Norway, Saudi Arabia, Slovenia, South Korea, Spain, Sweden, Switzerland, the UK, and the USA, among others.
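
To make the decade-based weighting used by the Alumni and Award indicators concrete, the following minimal sketch computes the weight applied to a single degree (Alumni) or prize (Award); the function name and the treatment of years outside 1911–2010 are assumptions, not part of the ARWU description.

```python
def decade_weight(year: int, edition: int = 2011) -> float:
    """Decade-based weight as described above: 100 % for the decade ending
    the year before the ranking edition (2001-2010 for ARWU 2011), decreasing
    by 10 % per earlier decade, down to 10 % for 1911-1920.
    Years before 1911 are assumed not to count (weight 0)."""
    last_year = edition - 1                   # 2010 for ARWU 2011
    decades_back = (last_year - year) // 10   # 0 for 2001-2010, 1 for 1991-2000, ...
    return max(1.0 - 0.1 * decades_back, 0.0)

# Example: a degree earned in 1985 falls in the 1981-1990 decade, weight 80 %.
print(decade_weight(1985))  # 0.8
```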

Nobel Prizes and Fields Medals

To analyze the results on the indicators related to Nobel Prizes and Fields Medals, let us first point out the three major differences between the Award and Alumni indicators:

  1. Only Laureates in the Sciences are taken into account in the Award indicator, while Literature and Peace prizes are accounted for in the Alumni indicator as well.

  2. The year of the award is what counts for the Award indicator, while the year of graduation is what counts for the Alumni indicator. In case of multiple graduation times from an institution—e.g. BS and PhD from the same university—it is the latest graduation time that counts.

  3. In the computation of the Award indicator all Fields Medalists contribute three points to their institution, while Nobel Laureates contribute three points only when they do not share the prize. When a Nobel Prize is shared, the three points are distributed according to the partition of the prize. Points are also shared in case of multiple affiliations. In the computation of the Alumni indicator, however, all Nobel Laureates and Fields Medalists contribute one point to each institution from which they graduated.

Harvard University achieves the maximum score on the two indicators in the 2011 edition of ARWU, with 37.88 points in the Award indicator and 28.90 points in the Alumni indicator. Let H be the number of points of Harvard, and X the number of points of any other institution. In both cases, Alumni and Award, estimated scores are computed through the same formula:

$$ \hbox{EST}=100\sqrt{{\frac{X}{H}}} $$
(1)
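
As a minimal sketch of the square-root rule in Eq. 1, the function below converts a raw point total into a relative score; Harvard’s raw totals are taken from the text above, while the sample raw value for another institution is purely illustrative.

```python
import math

def relative_score(points: float, top_points: float) -> float:
    """Relative score on an ARWU indicator following Eq. 1:
    100 * sqrt(points / top_points), where top_points belongs to the
    highest scoring institution (Harvard for Award and Alumni in 2011)."""
    return 100 * math.sqrt(points / top_points)

HARVARD_AWARD_PTS = 37.88   # raw Award points of Harvard (from the text)
HARVARD_ALUMNI_PTS = 28.90  # raw Alumni points of Harvard (from the text)

# Illustrative only: an institution with 10 raw Award points.
print(round(relative_score(10.0, HARVARD_AWARD_PTS), 1))  # about 51.4
```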

In the case of the Award indicator the results of the Shanghai ranking have been accurately matched, as Table 1 shows. Acronyms in the table stand for:

  • pts: Total number of points from each institution.

  • EST: Values of formula 1.

  • Award: Scores on the indicator Award from the ARWU web page.

Table 1 Scores on the Award Indicator

Table 1 includes universities with different scores; in the case of universities with the same score, only one of the institutions is shown. Due to the difficulty in judging how the legacy of Paris-La Sorbonne should be distributed among its heirs, Universities Paris 5, 6, 7, 9 and 11 have not been included in the table; inasmuch as inaccuracies in those cases would be due to the incorrect identification of the institutions, their inclusion would not have been useful to test the estimation procedure.

In the case of the Alumni indicator almost all the results of the Shanghai ranking have been properly matched. Table 2 shows the results of the 25 highest scoring institutions. The acronyms 1st, 2nd, 3rd used in Table 2 stand for the cases in which the Fields Medalist or Nobel Laureate was awarded his or her first, second or third degree, respectively. The rest of the acronyms in the table stand for:

  • pts: Sum of the previous three columns, giving the raw score.

  • EST: Values of formula 1.

  • Alumni: Scores on the indicator Alumni from the ARWU web page.

Table 2 Scores on the Alumni Indicator

Table 2 shows only a few cases of clear inaccuracies, undoubtedly related to the demanding task of identifying all the graduates from those institutions, not to a shortcoming of the estimation procedure. Excel files containing the results of all the universities included in the Shanghai ranking on both the Alumni and Award indicators are available upon request.

Computing the HiCi indicator

The major difficulties in computing the scores on this indicator arise from the inaccuracies of the official information provided by Thomson Reuters on the affiliations of the highly cited authors. When searching for information about the highly cited authors of an institution, the following problems have to be dealt with:

  1. Outdated information: researchers move and the new affiliation is not always registered on the official web page.

  2. Some of the authors have unfortunately passed away in the past few years, but their names are still on the list.

  3. Mistaken identities: a few of the authors have been mistakenly assigned to a different institution due to the difficulty of distinguishing researchers with the same last name and initials.

  4. The information about the institution is missing in a great many cases: we find the name of the research unit, hospital or institute, but not the institution with which it is affiliated.

  5. When a reference to a hospital is made, it is not clear whether the author belongs to an academic institution as well: sometimes they do, sometimes they do not.

  6. The trickiest problem is how to deal with fractional appointments, adjunct professorships, double affiliations for consulting purposes—particularly with universities in the Middle East that have generous budgets with which to recruit in the new market for highly cited authors—and positions in external units associated with academic institutions.

The best showcase for all these difficulties is Harvard University. It is of paramount importance to come up with the correct figure for the raw score of Harvard University, since the scores of all the other institutions will be related to it. A quick look at the Thomson Reuters site yields a list of 225 possible highly cited authors related to Harvard University or to its associated teaching or research units. After a careful evaluation of all the cases, the number of highly cited researchers affiliated with Harvard University in the year 2011 was set at 192. Therefore, to proceed further, the value of 100 allocated by ARWU to Harvard was put in correspondence with the square root of 192, the yardstick against which the scores of the other institutions are measured.

Computing the indicator for all the institutions would be a very tedious exercise, since we would have to prune the results from a database of highly cited authors that includes more than 7,000 researchers. Hence, a representative sample of institutions showing different numbers of highly cited authors has been selected to compute the estimated scores on the HiCi indicator.

The results are shown in Table 3, in which the acronyms stand for:

  • NHiCi: Number of highly cited authors from an institution (accounting for shared affiliations in some cases).

  • HiCi: Actual score on the indicator HiCi in ARWU 2011 of the institutions selected for the Table.

  • EST: Estimated score using the procedure explained in this section with H = 192:

    $$ \hbox{EST}=100\sqrt{{\frac{\hbox{NHiCi}}{H}}} $$
    (2)

    The table confirms that this indicator can be computed with absolute accuracy once the highly cited authors data are appropriately checked and corrected as necessary.

Table 3 Scores on the HiCi Indicator

Computing the N&S indicator

In the case of the N&S indicator the methodology appears to be precise and clear; we can proceed by giving one point to the institution of the corresponding author, 0.5 points to the institution of the first author, 0.25 points to the institution of the next author, and finally 0.1 points to the remaining institutions. It is worth pointing out that an institution only scores once on each paper.
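
As an illustration of this weighting, the sketch below accumulates N&S points for one institution over a hypothetical list of papers; the data layout is an assumption, and the corner case in which the first and corresponding authors share an affiliation is omitted for brevity.

```python
# N&S weights as described above: corresponding author 1.0, first author 0.5,
# next author 0.25, any other author 0.1; one score per paper and institution.
WEIGHTS = {"corresponding": 1.0, "first": 0.5, "next": 0.25, "other": 0.1}

def ns_points(papers, institution):
    """Raw N&S points of one institution. `papers` is a hypothetical list of
    dicts mapping author roles to lists of affiliations."""
    total = 0.0
    for paper in papers:
        # Take only the highest applicable weight for this paper.
        applicable = [w for role, w in WEIGHTS.items()
                      if institution in paper.get(role, [])]
        total += max(applicable, default=0.0)
    return total

# Hypothetical usage: one paper whose corresponding author is at "Univ A".
print(ns_points([{"corresponding": ["Univ A"], "first": ["Univ B"]}], "Univ A"))  # 1.0
```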

In order to test the square root hypothesis the first task is to evaluate the score of Harvard University. However, the number of papers coauthored during the period 2006–2010 by authors from that institution is in excess of 700; hence, an attempt to compute the points awarded to Harvard by looking up all those papers is a doomed endeavor. Assuming the hypothesis holds, there is another way to arrive at the desired result: triangulating through the scores of universities with just a few papers in the period 2006–2010. To do so, we first compute the number of points of universities with a score below 7.0 in ARWU 2011. Since the scores are rounded up or down to the first decimal digit, the bounds for the number of points of Harvard University can be estimated as follows:

  1. Suppose that an institution with a score y gets x points. Let H be the number of points of Harvard University. If the square root rule applies, \(y=100\sqrt{x/H}\), and H could then be estimated by reversing the formula:

    $$ H=x\left({\frac{100}{y}}\right)^2. $$
  2. Now, since y has been rounded to the nearest first decimal digit, two bounds can be set for H, namely:

    $$ \begin{aligned} H_{\rm l} &= x\left({\frac{100}{y+0.05}}\right)^2 \\ H_{\rm u} &= x\left({\frac{100}{y-0.05}}\right)^2 \\ \end{aligned} $$

    By computing the bounds for all the universities with a score below 7.0 in ARWU 2011 we arrive at an interval of values in which H must lie: the lower endpoint of the interval, L, is the largest of all the \(H_{\rm l}\) values, and the upper endpoint, U, is the smallest of all the \(H_{\rm u}\) values. Provided that L < U, the center of the interval (L, U) would be an accurate estimation of the value of H. We can then test our hypothesis by computing the number of points of a number of universities in ARWU and evaluating their scores according to the square root formula; depending on the results, we will know whether the rule holds (a sketch of this bounding procedure follows this list).
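
The following minimal sketch implements the bounding procedure just described; the (points, rounded score) pairs are hypothetical and serve only to show the mechanics.

```python
def harvard_bounds(observed, rounding=0.05):
    """Bounds on Harvard's raw N&S points H, assuming the square-root rule.
    `observed` is a list of (raw points, rounded ARWU score) pairs for
    institutions with few papers; the values used below are illustrative."""
    lowers, uppers = [], []
    for x, y in observed:
        lowers.append(x * (100 / (y + rounding)) ** 2)
        uppers.append(x * (100 / (y - rounding)) ** 2)
    L, U = max(lowers), min(uppers)
    return L, U, (L + U) / 2  # interval endpoints and its center (estimate of H)

# Hypothetical pairs: 1.1 points with score 5.0, and 2.0 points with score 6.8.
print(harvard_bounds([(1.1, 5.0), (2.0, 6.8)]))  # center close to 435
```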

Table 4 shows the number of points of institutions covering all the scores from 1.5 to 6.9 in ARWU. The acronyms in the table stand for:

  • pts: Results of the computation once all the papers from those institutions have been looked up.

  • N&S: True scores on the N&S indicator in ARWU 2011.

  • LOWB: Lower bound for the estimation of H, the number of points of Harvard University, as discussed above.

  • UPRB: Upper bound for the estimation of H.

The final row of the table shows the tightest bounds for H (L and U), namely the maximum value of column LOWB and the minimum value of column UPRB, respectively.

Table 4 Bounds for the number of points of Harvard University

Given that L < U, we can proceed further and check the validity of the square root hypothesis. Let H first be the value of the center of the interval (L, U), H = 436.9. We shall use that value to estimate the scores of a number of institutions and check whether the estimates match the true values. The validity of the square root rule will be checked for institutions scoring between 7.0 and 9.5 in ARWU 2011, along with universities from Spain.

Let “pts” be the number of points obtained by an institution. To compute scores on the N&S indicator, the formula will again be

$$ \hbox{EST}=100\sqrt{{\frac{\hbox{pts}}{H}}} $$
(3)

A total of 23 institutions have been analyzed to test the estimation methodology. All the articles in Science or Nature with at least one author from any of those institutions have been identified through the Web of Knowledge.

Table 5 shows the number of papers of the universities under analysis, and the position of the institution affiliation within the list of authors. The acronyms of the columns of Table 5 stand for:

  • NP: Number of articles in Science or Nature between 2006 and 2010.

  • CA: Number of articles as corresponding author.

  • FA: Number of articles as first author.

  • NA: Number of articles as next author.

  • OA: Number of remaining articles.

  • pts: Total number of points of each institution on the indicator N&S.

  • EST: Estimated score on N&S of each institution.

  • N&S: True value of indicator N&S of each institution in ARWU 2011.

As Table 5 shows, the estimated score coincides with the one assigned by ARWU in all the cases. We can safely state that the N&S indicator is indeed reproducible using the procedure described in this section.

Table 5 Estimation of N&S scores of selected institutions

Computing the PUB indicator

It is obvious, by just taking a quick look at the number of articles published every year by each institution, that the method used for assigning scores to the PUB indicator follows the square root rule as well. Using a rough approach we could then get a first estimation of the points awarded by ARWU. There are, however, other problems associated with the computation of the PUB indicator. The problems mainly arise when searching the Web of Knowledge, since a great many universities appear under a variety of affiliation names, and taking them all into account is a very demanding task.

Besides the difficulties that arise when trying to properly identify affiliations in the Web of Knowledge, which constitute a disturbing source of noise in the process of assigning papers to institutions, it is not easy to interpret the meaning of the “special” weight of two introduced by the authors of the ranking for papers indexed in the Social Science Citation Index (SSCI), since we encounter papers indexed in the SSCI that are also indexed in the Science Citation Index Expanded (SCIE), as well as papers listed only in the SSCI.

To overcome the first hurdle a large sample of easily identifiable institutions has been analyzed. To come up with an answer for the meaning of the “special” weight a linear regression analysis was performed, using the number of papers indexed in different ways (as explained below) as predictors of the final score (dependent variable).

Let’s first split the papers of an institution (articles and proceedings papers only) into three different sets, namely:

  • oc: Papers that are listed only in the SCIE.

  • cs: Papers that are listed in both the SCIE and the SSCI.

  • os: Papers that are listed only in the SSCI.

There are two extreme approaches to assigning a special weight to the papers listed in the Social Science Citation Index: either a weight of 2 for just the os papers, or a weight of 2 for all the SSCI listed papers ( os plus cs ). There is a third possibility though, one that results in a different weight for the papers included in os and the papers included in cs. To explore the likelihood of the different hypotheses, a regression analysis was carried out for the two extreme approaches mentioned before.

According to Tabachnick and Fidell (2007, p. 123), the minimum sample size requirement for a linear regression analysis depends on the number of independent variables. Those authors recommend a sample size in excess of 50 + 8N (where N stands for the number of independent variables). Since we have three variables ( oc, cs, os ), at least 74 institutions will have to be included in the analysis. All those institutions are listed below.

1: Aarhus; 2: Beijing Normal; 3: Brandeis; 4: Caltech; 5: Leuven; 6: Columbia; 7: Dalian Tech; 8: Dartmouth; 9: Duke; 10: Erasmus; 11: Jilin; 12: Kanazawa; 13: King Fahd; 14: Kobe; 15: Kyushu; 16: Med Vienna; 17: Michigan State; 18: MIT; 19: Nankai; 20: Nanyang Tech; 21: Natl Cheng Kung; 22: Natl Tsing Hua; 23: Natl Singapore ; 24: New York; 25: Oxford; 26: Peking; 27: Princeton; 28: Shandong ; 29: Sichuan; 30: Stanford; 31: Aberdeen; 32: Antwerp ; 33: Birmingham; 34: Bologna; 35: British Columbia; 36: Buenos Aires; 37: Calgary; 38: Calif Davis; 39: Calif Irvine; 40: Calif Los Angeles; 41: Calif Santa Barbara; 42: Cambridge; 43: Geneva; 44: Helsinki; 45: Kiel; 46: Koln; 47: Leeds; 48: Liverpool; 49: Manchester; 50: Melbourne; 51: Michigan; 52: Milan; 53: Munster; 54: New Mexico; 55: Nottingham; 56: Oslo; 57: Padua; 58: Pisa; 59: Queensland; 60: Rochester; 61: Sao Paulo; 62: Siena; 63: Southern California; 64: Stockholm; 65: Texas Austin; 66: Tubingen; 67: Warwick; 68: Washington; 69: Wuerzburg; 70: Zurich; 71: Uppsala; 72: Vanderbilt; 73: Washington St Louis; 74: Xiamen.

Procedure to test the two extreme cases (a regression sketch follows this list):

  1. Compute the points awarded to Harvard University according to the selected case; let H again be the result of that operation.

  2. Evaluate, according to their ARWU score, the points that would accrue to all the universities in the sample were the selected case the true one. Hence, assuming the square root rule is in place,

    $$ PUB=100\sqrt{{\frac{\hbox{pts}}{H}}}\Rightarrow {\hbox{pts}}=H\left({\frac{\hbox{PUB}}{100}}\right)^2 $$
  3. Perform a linear regression analysis using the three variables ( oc, cs, os ) to predict the points computed in step 2.

  4. Test the validity of the procedure by inspecting the confidence intervals for the regression coefficients.
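
A minimal sketch of steps 2 and 3, assuming hypothetical paper counts and points generated with a (1, 1.5, 2) weighting scheme; the ordinary least squares fit recovers the weights used to generate the points, which mirrors how the regression exposes the true scheme (the published analysis was performed with SPSS and includes confidence intervals).

```python
import numpy as np

# Hypothetical paper counts (oc, cs, os) for six sampled institutions.
oc = np.array([5200.0, 3100.0, 4400.0, 2700.0, 6100.0, 3800.0])
cs = np.array([400.0, 250.0, 300.0, 180.0, 520.0, 290.0])
os_ = np.array([150.0, 90.0, 120.0, 60.0, 210.0, 100.0])

# Stand-in for step 2: points back-computed from the published PUB scores.
# Here they are generated with the (1, 1.5, 2) scheme purely for illustration.
pts = oc + 1.5 * cs + 2.0 * os_

# Step 3: least squares regression (no intercept) of pts on (oc, cs, os).
X = np.column_stack([oc, cs, os_])
coef, *_ = np.linalg.lstsq(X, pts, rcond=None)
print(np.round(coef, 3))  # approximately [1.0, 1.5, 2.0]
```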

Table 6 shows the data gathered from the Web of Knowledge on the three variables ( oc, cs, os ) corresponding to the institutions included in the regression analysis. Column I points to the number assigned to the university in the list of institutions provided before.

Table 6 Scientific Production in 2010 of the institutions included in the regression analysis

A preliminary analysis was conducted to ensure no violation of the assumptions of normality, linearity and homoscedasticity. In the two extreme approaches already mentioned, the Normal P–P plots show points closely aligned with a straight diagonal line, suggesting no major violations of normality. Moreover, no outliers were found, and the residuals follow no systematic pattern and conform to a rectangular distribution. We are now in a position to explore the results of the multiple regression analysis in both cases.

Let’s begin by assigning the weight of 2 to all the papers included in the SSCI to compute the raw scores of the universities (including Harvard University). In that case, the predictor coefficients for the three variables ( oc, cs, os ) should be (1,  2,  2).

The three variables explain 100 % of the variance of the sample, \( \hbox{F}(3,70)=300.56, \,p\ll0.001 \). The values of the predictor coefficients were (1.050,  1.556,  2.106), and the confidence intervals were:

$$ {\tt oc:} (1.045,\;1.055); {\tt cs}: (1.500,\;1.612); {\tt os}: (2.064,\;2.148). $$

These results are not in line with the assumed (1,  2,  2) weighting scheme. In fact, they are very close to (1,  1.5,  2).

Let’s now assign the special weight of 2 just to the papers listed only in the SSCI. In that case, the predictor coefficients for the three variables ( oc, cs, os ) should be (1,  1,  2). The three variables again explain 100 % of the variance of the sample, \(\hbox{F}(3,70)=300.56, \,p\ll0.001\). The values of the predictor coefficients were (0.956,  1.417,  1.917), and the confidence intervals were:

$$ {\tt oc:} (0.951,\;0.960); {\tt cs}: (1.365,\;1.468); {\tt os}: (1.879,\;1.955). $$

Again, these results are not in line with the assumed (1,  1,  2) weighting scheme. The predictors remain very close to (1,  1.5,  2).

We are then entitled to check the possibility of a very special weighting scheme, in which the papers listed only in the SSCI receive a weight of 2, while the papers listed in both the SSCI and the SCIE get a weight of just 1.5. Using those weights to recalculate the value of H and the points awarded to all the universities, the three variables again explain 100 % of the variance of the sample, \(\hbox{F}(3,70)=300.56, p\ll0.001\). The values of the predictor coefficients were (1.003,  1.486,  2.012), and the 95.0 % confidence intervals were:

$$ {\tt oc:} (0.998,\;1.007); {\tt cs}: (1.433,\;1.540); {\tt os}:(1.972,\;2.052). $$

This time, the results are in line with the assumed (1,  1.5,  2) weighting scheme. The solution to our problem is then to assign a weight of 2 to papers listed only in the SSCI and a weight of 1.5 to papers listed in both the SCIE and the SSCI. Let now H be the number of points of Harvard University using those weights, and ( oc, cs, os ) the sizes of the three different sets of papers of an institution. To compute scores we follow the already familiar path:

$$ \hbox{pts}=\hbox{oc}+1.5\hbox{cs}+2\hbox{os};\quad \quad \hbox{EST}=100\sqrt{{\frac{\hbox{pts}}{H}}} $$
(4)

Table 7 shows the results of the estimation procedure for the sample of universities under analysis. Acronyms in the table stand for:

  • pts: Total number of points from Eq. 4.

  • EST: Estimation of the PUB indicator from Eq. 4.

  • PUB: Value of the PUB indicator taken from the ARWU web site.

The results obtained are very accurate, with a mean square error of just 0.014, although small and non-systematic errors in the computation may arise, as Table 7 shows. Those errors can be caused by a number of reasons, ranging from the already mentioned difficulty in checking all the possible affiliations linked to an institution, to the fact that results in the WOK do not remain constant in time but depend on the day the search takes place.

Table 7 Estimated and true PUB scores in ARWU 2011

Computing the PCP indicator

There are a number of issues that make the computation of the PCP indicator very difficult. First of all, there are two ways of computing the indicator: “the weighted scores of the above five indicators divided by the number of full-time equivalent academic staff. If the number of academic staff for institutions of a country cannot be obtained, the weighted scores of the above five indicators are used” (Liu and Cheng 2005). A number of caveats are in order. First, the coefficients chosen for “weighting” the scores. Second, the number of full-time academic staff of an institution. Third, whether or not the square root rule applies in this case.

As for the weights used to compute the scores, it is not difficult to recover them in the case of the countries for which the authors do not make use of the full-time equivalent academic staff data. A multiple regression analysis was carried out to assess the weights applied to the first five indicators to produce the scores on the PCP indicator. The universities available for this regression analysis were the ones from Argentina, Brazil (all universities but Universidade de Sao Paulo), Chile, Croatia, Germany, Egypt, Finland, Greece, Hungary, India, Ireland, Israel, Mexico, Malaysia, Russia, Poland, Portugal, Singapore, and Turkey. They form a sample of 90 universities. Again, a preliminary analysis was conducted to ensure no violation of the assumptions of normality, linearity and homoscedasticity. The Normal P–P plots show points closely aligned with a straight diagonal line, suggesting no major violations of normality. Moreover, no outliers were found, and the residuals follow no systematic pattern and conform to a rectangular distribution.

The minimum sample size requirement for a linear regression analysis in this case is 50 + 8 × 5 = 90, since we have five predictors, so we have the appropriate sample size to carry out the regression analysis. The hypothesis is again that the square root rule applies; however, a first attempt to use it directly resulted in an unlikely set of weights, very different from the expected (0.1, 0.2, 0.2, 0.2, 0.2). Since the indicators had been computed after taking the square root of the raw scores, the rule was tried with the weights applied to the squares of the scores, not to the scores themselves. This time the results do conform to the expected weighting scheme. Multiple regression analysis shows that the five variables explain 100 % of the variance in the PCP indicator, F(5, 84) = 535.84, p ≪ 0.001. The values of the predictor coefficients were (0.105,  0.213,  0.213,  0.213,  0.213), and the 95.0 % confidence intervals were:

$$ {\tt Alumni:} (0.104,\;0.106); {\tt Award:} (0.211,\;0.215); {\tt HiCi:} (0.211,\;0.216); {\tt N\&S:} (0.211,\;0.216); {\tt PUB:} (0.212,\;0.214) $$

The results are in line with the assumptions, although an extra factor must be introduced to correct for the slight deviation from the nominal 0.1 and 0.2 weights; that enables us to estimate the results in the PCP indicator with a high degree of accuracy, in the case of universities from countries for which the authors do not make use of the full-time equivalent academic staff data, using the following formula:

$$ \hbox{EST}^2={\frac{1}{0.94}}\left(0.1\hbox{Alumni}^2+0.2(\hbox{Award}^2+\hbox{Hici}^2+\hbox{N}\&\hbox{S}^2+\hbox{PUB}^2)\right) $$
(5)
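
A minimal sketch of Eq. 5; the indicator scores passed in are hypothetical and serve only to show the computation.

```python
import math

def pcp_without_staff_data(alumni, award, hici, ns, pub):
    """PCP estimate for universities from countries without staff data,
    following Eq. 5: a weighted sum of squared indicator scores, corrected
    by the 1/0.94 factor, then square-rooted."""
    wss = 0.1 * alumni**2 + 0.2 * (award**2 + hici**2 + ns**2 + pub**2)
    return math.sqrt(wss / 0.94)

# Hypothetical indicator scores, for illustration only:
print(round(pcp_without_staff_data(10.0, 0.0, 15.1, 12.3, 35.4), 1))
```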

In the sample, the PCP indicator was predicted correctly in 80 out of the 90 cases. The square error in the other 10 cases was less than 0.005, and the average square error was 0.001. Undoubtedly those errors are caused by the rounding operation to the first decimal unit in the ARWU web site. The authors of the ranking use the true values but only publish the rounded ones.

By and large, there is no easy access to the number of equivalent full-time faculty of all the institutions listed in ARWU. Australian universities have been chosen as a showcase, since there are public and reliable data published by the Department of Education, Employment and Workplace Relations of the Australian Government (2011).

In the case of universities from Australia, the number of Full Time Equivalent Academic Staff used in ARWU seems to be the number of Full Time Equivalent Faculty at the levels above Senior Lecturer and Lecturer (Level C). Data from Australian universities are shown in Table 8, where the acronyms of the three columns stand for:

  • ASL: Full Time Equivalent number of Faculty above Senior Lecturer.

  • LLC: Full Time Equivalent number of Faculty at Lecturer Level C.

  • FTE: Full Time Academic Staff for the calculations of the ARWU data.

Table 8 Full time equivalent staff of Australian universities in ARWU 2011

To get the actual PCP values we first compute the weighted sum of the squares of the first five indicators; let’s call that value WSS.

$$ \hbox{WSS}=0.1\hbox{Alumni}^2+0.2(\hbox{Award}^2+\hbox{Hici}^2+\hbox{N}\&\hbox{S}^2+\hbox{PUB}^2) $$
(6)

Let’s call WSSCT the value of WSS for Caltech, the university with the highest score on the PCP indicator, and FTECT the Full Time Equivalent Staff of Caltech. To estimate the PCP indicator of a university X the following operation is carried out:

$$ \hbox{EST}=100\sqrt{{\frac{{\frac{\rm{WSSX}}{\rm{FTEX}}}}{{\frac{\rm{WSSCT}}{\rm{FTECT}}}}}}=100\sqrt{{\frac{\rm{FTECT}}{\rm{WSSCT}}}}\sqrt{{\frac{\rm{WSSX}}{\rm{FTEX}}}} $$
(7)

To compute formula 7 the value of 276 has been assigned to the Full Time Equivalent Academic Staff of Caltech. Since it follows from the 2011 ranking data that WSSCT = 3,157.47, we have that

$$ \sqrt{{\frac{\rm FTECT}{\rm WSSCT}}}=0.29565\Rightarrow {\rm EST}=29.565\sqrt{{\frac{WSSX}{FTEX}}} $$
(8)
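
The sketch below implements Eqs. 7 and 8, using the Caltech figures quoted above (FTECT = 276, WSSCT = 3,157.47); a second helper reverses Eq. 7 to recover the staff figure implied by a published PCP score. The WSS and FTE values in the usage example are hypothetical.

```python
import math

def pcp_with_staff_data(wss_x, fte_x, wss_ct=3157.47, fte_ct=276.0):
    """PCP estimate when staff data are available, following Eqs. 7-8:
    the per-capita weighted sum of squares, normalized by Caltech's value."""
    return 100 * math.sqrt((wss_x / fte_x) / (wss_ct / fte_ct))

def implied_fte(wss_x, pcp_x, wss_ct=3157.47, fte_ct=276.0):
    """Reverse of Eq. 7: the full-time equivalent staff figure implied by a
    published PCP score (useful to recover the staff numbers used by ARWU)."""
    return wss_x * fte_ct / wss_ct * (100 / pcp_x) ** 2

# Hypothetical WSS and FTE values, for illustration only:
print(round(pcp_with_staff_data(450.0, 900.0), 1))  # about 20.9
print(round(implied_fte(450.0, 20.9)))              # about 900
```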

Table 9 shows, for all the Australian universities in ARWU 2011, the values on the five indicators (Alumni, Award, HiCi, N&S, and PUB). The rest of the acronyms in the table stand for:

  • FTE: Full Time Academic Staff for the calculations of the ARWU data.

  • WSS: WSS sums from formula 6.

  • PCP: Values of the PCP indicator taken from the ARWU website.

  • EST: Estimates of the PCP indicator according to formula 8.

Table 9 Values of the PCP indicator (actual and predicted) of Australian Universities in ARWU 2011

Results from Table 9 show that the PCP indicator was computed correctly in 17 out of 19 cases. The square error in the other 2 cases was less than 0.005, and the average square error was 0.001. It is apparent again that those errors are caused by the rounding operation to the first decimal unit on the ARWU web site. Since the accuracy of the computation has been established, by reversing formula 7 one can obtain the full-time academic staff figure used by the authors of the ranking for any institution.

Conclusions

In spite of the barrage of criticism that the Shanghai ranking has raised due to the multiple shortcomings of its methodology, it would not have attracted so much interest had it not been used so extensively as a rough guide to gauge the quality of research universities.

A painful feeling arises from having to acknowledge that the reputation of an institution is not good enough according to this or that classification while, for sheer lack of information, there is no clear path to assessing why and in what way things should be improved. Rankings not based solely on hard data, which incorporate a certain degree of perceived reputation in the form of opinions, are doomed to replicate the status quo and offer little help to new higher education institutions; meanwhile, rankings based on hard evidence enable universities to better assess their strengths and shortcomings, provided that the results of the ranking are reproducible.

Given that the Shanghai academic ranking fulfils the condition of being based upon hard data, it was of paramount importance to clarify a matter that has been the subject of strong criticism since the first edition of the ranking was released: the reproducibility of its results. As discussed in the paper, the statement that the ARWU results were not reproducible had been taken at face value. That helped fuel the debate on the shortcomings of the ranking, while at the same time diminishing its credibility among the scientific community at large. It is therefore important for the academic world to settle this issue for good so that we can move forward and discuss ways in which the Shanghai ranking could become a useful tool for academic policy makers.

To settle the issue a complete methodology to compute scores of universities on all the indicators that compose the ARWU ranking has been presented. The main finding of this paper is that the accuracy of the computed scores attests to the reproducibility of the results of the Shanghai academic ranking of world universities in all its indicators.

In searching for the solution to the problems posed in reproducing the results of the ranking, some possible sources of error have also been identified, particularly in the case of the indicators related to the scientific output of the universities. In Docampo (2012) the case of universities from Spain has been extensively analyzed. After carefully reviewing the publication databases, a number of examples of incorrect assignment of affiliations to institutions were identified; among them, the most striking was the case of a very well known university that clearly qualified to be placed in the 300–400 bracket of the 2011 edition of ARWU, but did not even make it to the final list due to those identification problems. The acknowledgement of those issues, and the subsequent dialogue established between university authorities and ranking makers, will hopefully result in the repair of the damage done to the reputation of that university.

The findings of this paper have been documented so that they can be used as a guide to replicate the results of any higher education institution. All the necessary computations have been described, including the final equations that render the scores on each indicator. Those guidelines could thus help university officials around the world monitor the results of their institutions on the six indicators of the Shanghai ranking, regardless of whether those institutions are listed among the five hundred world-class universities.

One of the best contributions that can be made to further the academic and political debate about university rankings and league tables is to focus, among other aspects, on reducing the uncertainty and clarifying the possible sources of error when interpreting their results (Leydesdorff 2012). As for the Shanghai ranking, the proof of the reproducibility of its results clearly produces a significant reduction of uncertainty, which could help in making better informed decisions. The identification of particular sources of error (particularly those related to the use of all the affiliations corresponding to the same institution) could also help foster the necessary communication between institutions and ranking makers.

Methods summary

ARWU data on academic institutions were gathered directly from the Shanghai Jiao Tong University ARWU website, http://www.shanghairanking.com; data on the scientific production of the institutions analyzed in the paper were taken from the Web of Knowledge in October 2011. Estimates of the scores were computed using Excel files containing all the data from the ARWU and WOK websites. The multiple regression analyses were performed using SPSS Statistics 18.0. All Excel and SPSS files are available upon request.