Introduction

The advent of the Web and the Internet has transformed scientific communication, calling into question the traditional ways in which scientists interact with one another and how society appreciates research activity. The term “Science 2.0” defines this new form of science (Shneiderman 2008), in which collaborative activities and the free exchange of information are shaping new academic outputs (open access journals, academic repositories, etc.) and an alternative assessment system (altmetrics, webometrics, etc.). In this context, social networking sites such as Academia.edu, ResearchGate and Mendeley have recently arisen as platforms that improve social participation, the sharing of papers and the search for new collaborators. At the same time, academic search engines are broadening the range of publication outlets (repositories, digital libraries, etc.) at the expense of journals, while emphasizing the role of authors and documents (Ortega 2014). These new services pose a challenge for research evaluation: they question the position of some agents (journals, publishers, journal-level indicators, etc.), introduce open access products (repositories, web publishing, etc.) and suggest new ways to measure science (altmetrics, webometrics, etc.). In this framework, studying population dynamics on those platforms would shed light on the representativeness of these sources and their reliability for research evaluation.

Google Scholar Citations (GSC) is a Google Scholar (GS) service that allows researchers to build a short personal page, free of charge, from the papers indexed in its database, together with individual bibliometric indicators computed by the system. The novelty of GSC for research evaluation is that it makes it possible to define specific research units, mainly researchers, which can be compared with others within the same institution or research interest. In addition, the comprehensive coverage of research materials in GS means that these pages offer a wide view of research production and impact. Finally, because these profiles are publicly available, an author can be appreciated for a broader range of academic activities.

However, GSC differs from other academic search engines in one respect: its profiles are created and made public directly by the researchers themselves. As a result, the population of GSC could resemble that of academic social networking sites. This can have important consequences for research evaluation because it could produce unbalanced samples at the disciplinary, country and institutional level, from both a static and a longitudinal perspective. Accordingly, this study aims to observe the dynamics of the use of social sites by researchers and how these services are settled over time. Ultimately, it seeks to determine whether the colonization of GSC (that is, the way in which GSC was taken up from its earliest moments) could show important biases that influence data collection and, consequently, compel the adoption of more precise sampling methods.

Related research

Literature on the demography of social network sites is rather scarce and in many cases consists only of descriptive reports on the geographical distribution of users. The most recent was the Duggan and Smith (2013) report, which points out important demographic differences between the users of one social network site and another, indicating that each platform shapes its own population according to its services. In this sense, Boyd and Ellison (2007) had already noticed the uneven success of different services across countries, genders and interests, which contributes to the changing nature of these sites. For example, Chang et al. (2010) described deep changes in the ethnic composition of the American Facebook population over 3 years, while Garcia et al. (2013) analysed the resilience of these sites in the face of a rapid loss of users. Similar results were found by Mislove et al. (2011) for the United States population signed up to Twitter.

However, literature on the demographic aspects of academic social networks is even scarcer. A few papers have explored the presence of scientists on academic social sites. Haustein et al. (2014) followed the footprint of 57 bibliometricians on the Web, finding that 23 % were in Google Scholar Citations and 16 % had a Twitter account, whereas Mas-Bleda et al. (2014) tracked 1517 researchers across several academic sites, detecting a low adoption rate and limited overlap between those sites. On the other hand, some reports provided by the sites themselves give general statistics that illustrate the unbalanced distribution of researchers. Thus, a global report by Mendeley (2012) shows a strong presence of Biology and Biomedicine users (31 %) as well as a high weight of francophone countries and institutions. ResearchGate (2014) presents a similar disciplinary distribution, with a dominant presence of Bio and Medicine users. Menendez et al. (2012) studied positions and affiliations in Academia.edu, finding that it is populated by young researchers and that the presence of emerging countries is significant. As with generalist social networks, academic ones are populated by different users from different countries, institutions and disciplines. In contrast, most papers on academic social networks focus on their use (Van Eperen and Marincola 2011; Hogan and Sweeney 2013). In this sense, Almousa (2011) observed disciplinary differences in the use of Academia.edu, and Thelwall and Kousha (2014) described differences in the use of this site by gender and discipline. Chakraborty (2012) compared Facebook and ResearchGate to detect the academic motivations for using both sites, and Ebner and Reinhardt (2009) studied the role of Twitter in scientific conferences.

But the most active line of research on academic social networks takes a research evaluation view, exploring the relationship of usage, followers, visits, etc., with citations and papers; in other words, examining the relationship between altmetric/webometric indicators and bibliometric ones. Li et al. (2012) found significant correlations between citations and the numbers of papers bookmarked in Mendeley and CiteULike. Eysenbach (2011) observed that tweet mentions can predict the future impact of highly cited papers. In contrast, other studies did not find a clear relationship between downloaded papers and their subsequent scientific impact (Moed 2005; Watson 2009; Halevi and Moed 2014; Glänzel and Heeffer 2014).

With regard to academic search engines, studies have centred mainly on Google Scholar and Microsoft Academic Search (MAS), the two most relevant engines that include author profiles. A comparative study showed that while MAS presented a balanced population, GSC was biased towards computer-related disciplines (Ortega and Aguillo 2014). Haley (2014) also compared both engines at the journal level, finding correlations between bibliometric indicators (citations and h-index). Regarding GS in particular, some studies focused on its coverage in relation to other citation databases (Bakkalbasi et al. 2006; Meho and Yang 2007), its connection with web citations (Kousha and Thelwall 2007) and its suitability for scientific assessment (Jacsó 2008; Aguillo 2012).

More specifically, GSC profiles have been studied almost since the service began (Pitney and Gilson 2012; Huang and Yuan 2012). Ortega and Aguillo (2012) mapped the labels included in each profile to build a Map of Science. The same authors analysed country and institutional collaboration networks using the co-author lists of these profiles (Ortega and Aguillo 2013). On the other hand, Delgado López-Cózar et al. (2014) demonstrated the possibility of manipulating the bibliometric scores of profiles. However, no previous study has addressed, from a longitudinal view, how this service has been populated since its origins, or discussed the implications for research evaluation. This paper attempts to represent the evolution of users by several demographic attributes (country, organization, subject matter, position, etc.) as a way to illustrate the representativeness of this population for research evaluation studies.

Objectives

The principal objective of this work is to describe the growth of GSC in its initial period (2011–2012) through a set of personal attributes such as bibliometric indicators, positions, disciplines, organizations and countries. The aim is to make clear the biases that could appear in this population and to discuss how they would affect research evaluation. Several research questions can be formulated from this primary objective:

  • How do profiles in GSC grow, and how can the number of profiles be estimated?

  • How have the characteristics that define this population (bibliometric indicators, position, discipline, affiliation and country) evolved during this initial period?

  • What consequences could this distribution of profiles have for research evaluation?

Methods

Data obtaining and processing

The way in which the data were collected and processed was detailed in previous works (Ortega and Aguillo 2012, 2013). Data processing was carried out in two stages. In the first, a SQL script was written to crawl the entire service, querying the 25 letters of the Latin alphabet in groups of three for the first sample (December 2011) and in groups of two for the remaining ones. The objective was to identify as many profiles as possible and extract their author identifiers. Once the crawler finished, a second script harvested the fundamental data from each profile, such as name, affiliation, labels, number of papers and citations. Five quarterly samples were taken from December 2011 to December 2012, each in a single attempt, which together sum to 191,858 unique profiles. The first sample, in December 2011, did not extract the number of papers because the script was not yet fully developed.
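One possible reading of the querying scheme is that every string of two (or three) letters was submitted as a search term. The sketch below only generates such query strings and performs no request. It is written in Python rather than SQL, the function name `letter_queries` is hypothetical, and the 26-letter default alphabet is an assumption, since the text mentions 25 letters without saying which one was left out.

```python
from itertools import product
import string


def letter_queries(group_size, alphabet=string.ascii_lowercase):
    """Return every string of `group_size` letters drawn from `alphabet`.

    Assumption: the crawler queried all ordered letter groups; the default
    alphabet has 26 letters, whereas the text reports 25.
    """
    return ["".join(group) for group in product(alphabet, repeat=group_size)]


# Groups of three for the first sample, groups of two for the later ones.
first_sample_queries = letter_queries(3)
later_sample_queries = letter_queries(2)
```

The number of queries grows as the alphabet size raised to the group size, which may explain why shorter groups were preferred for the later samples.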

However, one of the most important problems of GSC, from a bibliometric view, is that the information in each profile is filled out by the users themselves in natural language. For this reason, the raw data have to be thoroughly cleaned and normalized before any statistical analysis because, for example, the same organization may be written in many different forms. For instance, Universidade de São Paulo could be written in more than 20 different ways, such as University of Sao Paulo, Sao Paulo University, USP, U Sao Paulo, etc. This problem gets worse when positions, departments, faculties, etc., are included in affiliations. Another problem related to affiliations is that a user is sometimes attached to several organizations because he/she is a visiting professor or works for various institutions. In this case the first organization was always adopted as the main affiliation. In instances where no affiliation was detected, the web domain of the e-mail address was taken as the affiliation, although the two did not always coincide.

Similar inconsistencies occur in other fields. Labels can present the same keyword in different languages, abbreviated, or in plural/singular form. Labels with an imprecise meaning, such as control, reliability or assessment, were sometimes left unclassified. On the other hand, duplicated profiles (different profiles that correspond to the same author) are rather scarce because profiles are created and maintained by their own users. A search for similar names returned only 2.1 % of duplicated profiles; note that this figure includes many common names such as Wey Wang, John Smith or José López. Because of this, the real percentage of duplicated profiles could be under 1 %.

To solve these problems, Google Refine (Google Refine 2015) was used, mainly for organisations and labels, to group similar variants of the same name or word.
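The text does not state which grouping method was applied. As an illustration only, the sketch below implements a fingerprint-style key collision, similar in spirit to one of the clustering methods offered by Google Refine (now OpenRefine): names that reduce to the same normalized key are grouped together. The function names `fingerprint` and `cluster_variants` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Normalized key: accents folded, lowercased, punctuation removed,
    tokens de-duplicated and sorted."""
    folded = unicodedata.normalize("NFKD", name)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    tokens = re.sub(r"[^a-z0-9\s]", " ", folded.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster_variants(names):
    """Group names that share the same fingerprint key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return dict(clusters)


variants = [
    "University of São Paulo",
    "University of Sao Paulo",
    "Sao Paulo, University of",
    "USP",
]
clusters = cluster_variants(variants)
```

A key of this kind merges spelling, accent and word-order variants, but abbreviations such as USP or translated names would still need a manual mapping.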

Indicators

To test the reliability of the sample and to estimate the total population of GSC, the Lincoln–Petersen formula was applied (Seber 2002). This equation is widely used in wildlife management and is based on the mark and recapture method. This counting method assumes that a high proportion of repeated items indicates the completeness of the sample. The more samples are tested, the more consistent the population estimate becomes.

$$N = \frac{{\sum {\left( {M_{i} C_{i} } \right)} }}{{\sum {R_{i} } }}$$

where N is the total population to estimate, M is the total number of profiles retrieved by the crawler, C is the number of unique profiles and R is the number of repeated profiles that appear several times during the crawling process.
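As a minimal sketch of the formula above, the function below takes one (M, C, R) triple per crawl and returns the estimate N. The function name `estimate_population` and the figures in the usage line are hypothetical and are not taken from the study.

```python
def estimate_population(samples):
    """Estimate N = sum(M_i * C_i) / sum(R_i).

    `samples` is an iterable of (M, C, R) tuples: profiles retrieved,
    unique profiles, and repeated profiles for each crawl.
    """
    samples = list(samples)
    numerator = sum(m * c for m, c, _ in samples)
    denominator = sum(r for _, _, r in samples)
    if denominator == 0:
        raise ValueError("no repeated profiles: the estimate is undefined")
    return numerator / denominator


# Hypothetical counts for two crawls, for illustration only.
estimate = estimate_population([(120, 90, 30), (200, 150, 50)])
```

When no profile is repeated the denominator is zero and the estimate is undefined, which is why the sketch raises an error in that case.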

Compound annual growth rate (CAGR) was used to measure the rate of increase of the profiles and their attributes. This formula was chosen because it is suitable for models with exponential trends. Here, V_1 is the initial observation, V_n the final one and n is the number of periods between the first and the last observation. The result was then converted to a percentage:

$${\text{CAGR}} = \left[ {\left( {\frac{{V_{n} }}{{V_{1} }}} \right)^{\frac{1}{n}} - 1} \right] \times 100$$
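The formula can be sketched directly in code. The function name `cagr_percent` is hypothetical, and the values in the usage line are illustrative rather than taken from the samples.

```python
def cagr_percent(v_first, v_last, n_periods):
    """Compound growth rate between two observations, as a percentage."""
    if v_first <= 0 or n_periods <= 0:
        raise ValueError("v_first and n_periods must be positive")
    return ((v_last / v_first) ** (1 / n_periods) - 1) * 100


# Illustrative values: a quantity that quadruples over two periods
# doubles in each period, i.e. it grows at 100 % per period.
rate = cagr_percent(100, 400, 2)
```

Because the rate is compounded, the result depends strongly on how many periods n are counted between the first and the last observation.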

In addition, GSC calculates some bibliometric indicators that describe the performance of each profile and are analysed in this paper:

  • Papers: number of items indexed in GS and included in each profile.

  • Citations: total number of citations that those items receive from the papers indexed in GS.

  • H-index: the largest number of papers (h) that have each received at least h citations. For example, an h-index of 5 means that the author has published at least five papers that have each been cited five or more times.
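The h-index definition above can be illustrated with a short sketch that works on a list of per-paper citation counts. The function name `h_index` and the example counts are hypothetical; GSC computes this indicator itself, so the code only makes the definition concrete.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Five papers have five or more citations each, so h = 5.
example = h_index([10, 8, 5, 5, 5, 1])
```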

Results

Samples

This section traces the growth of the successive samples obtained during 2011–2012 and the resulting estimates of the size of GSC in profiles.

Figure 1 and Table 1 describe the evolution of GSC profiles by quarter, from December 2011 to December 2012. During this period, the number of unique profiles grew at a compound rate (CAGR) of 164.9 %, from 26,682 profiles in December 2011 to 187,301 in December 2012. At the same time, the number of estimated profiles increased at a rate of 158.8 %, from 36,325 in December 2011 to 243,435 in December 2012. It is interesting to note that new additions remained stable (approx. 30,000 profiles) until December 2012, when the number of new profiles doubled. Comprehensiveness, which measures the percentage of the full estimate covered by unique profiles, improved from 73.4 to 79.3 %. This high rate of completeness shows that these samples are sufficiently representative of the total population.

Fig. 1 Growth and evolution of GSC by number of profiles

Table 1 Evolution of GSC profiles

Bibliometric indicators

Bibliometric indicators (#papers, #citations and h-index) from each sample are plotted on a log–log scale to describe the evolution of the scaling exponent (α) and the median of each distribution.

Figure 2 plots the frequency distribution of papers, citations and h-indexes for each sample, and Table 2 contains the main parameters that describe these distributions. These parameters were obtained only for descriptive purposes and not for estimation, which is why the distributions were not logarithmically binned (Milojević 2010). In general, the scaling exponents (α) grow over time, mainly from June 2012, when an important leap is observed. This means that the differences between profiles increase in each sample, making the distributions of papers, citations and h-indexes more and more unbalanced. In addition, median values gradually decrease, which indicates that the profiles newly added in each sample correspond to small users in bibliometric terms. This is confirmed by the increasing percentages of profiles with values below 10 for papers, citations and h-index.
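The text does not specify how the scaling exponent was fitted. One simple descriptive option, assumed here purely for illustration, is an ordinary least-squares line through the unbinned frequency distribution on log–log axes, with α taken as the negative of the slope. The function name `scaling_exponent` and the toy data are hypothetical.

```python
import math
from collections import Counter


def scaling_exponent(values):
    """Negative slope of log(frequency) against log(value), unbinned.

    Assumption: plain least squares on the raw frequency distribution;
    zero values are skipped because their logarithm is undefined.
    """
    freq = Counter(v for v in values if v > 0)
    if len(freq) < 2:
        raise ValueError("need at least two distinct positive values")
    xs = [math.log(v) for v in freq]
    ys = [math.log(n) for n in freq.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Toy data whose frequencies follow value ** -2 exactly.
alpha = scaling_exponent([1] * 16 + [2] * 4 + [4] * 1)
```

A larger α means that frequencies fall off faster as the indicator grows, so low values account for a larger share of the profiles.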

Fig. 2 Papers, citations and h-indexes distributions by sample

Table 2 Principal parameters of papers, citations and h-indexes distributions by samples

Academic positions

Of the 191,858 unique profiles, 88,335 (46 %) showed an academic status. The aim is to use the scholarly position as a way to describe the youthfulness or maturity of the population in academic terms. Six professional categories, as close as possible to the academic hierarchy, were defined to group these academic statuses (Table 3). Professor is the most frequent position (38 %), followed by Assistant Professor (18.4 %) and Doctoral Student (16.3 %). These last two categories could correspond to young professional statuses, which suggests that GSC is being settled more by young researchers than by established professionals such as Professors. This is confirmed if Research Fellow is added to this group of young scholars (46.1 %). It also explains the low proportion of Associate Professor (15.2 %), an intermediate rank, and Emeritus Professor (0.7 %). In line with this, the academic positions that rise the most are Doctoral Student (Δ18.84 %) and Assistant Professor (Δ12.48 %). This confirms that young researchers and professors are gaining a considerable presence in this service.

Table 3 GSC profiles grouped by academic statuses

Labels

Labels that describe the research activity of each profile were counted and classified to study the evolution of GSC from a subject matter view. The Scopus Subject Area scheme was used to group the labels and thus show the disciplinary evolution more clearly.

At the subject class level (Table 4), the disciplines with the highest number of labels are Computer Science (15.56 %), followed at a distance by Engineering (7.61 %) and Physics and Astronomy (6.48 %). However, the disciplines that are joining GSC fastest are Environmental Science (Δ18.55 %) and Physics and Astronomy (Δ17.95 %), while Computer Science (Δ2.94 %) is the field that increases most slowly, falling out of step with the rest of the disciplines. This suggests that a disciplinary change could be under way, in which information technology disciplines are giving way to biological and physical subject matters.

Table 4 Evolution of the new labels added in each moment by research classes in GSC

Affiliations

Processing and analysing affiliations makes it possible to know the origin of each profile and, above all, how the workplace influences the settlement of an academic service. Figure 3 and Table 5 describe the number of newly added profiles in each sample by country. Established scientific countries such as the United States (25.78 %) and the United Kingdom (7.85 %) occupy the first positions, while emerging countries such as Brazil (6.6 %) and India (2.8 %) are taking important places. The remaining countries, such as Italy (5.24 %), Australia (4.08 %) and Canada (3.57 %), are important scientific countries that hold relevant positions in most research rankings.

Fig. 3 Distribution of new profiles by country and sample

Table 5 Distribution of new profiles by country and sample

But perhaps the most important fact is that the proportion of profiles from each country changed as samples were taken. The first sample, in December 2011, shows a high proportion of English-speaking countries such as the United Kingdom, Australia and Canada, besides other important scientific countries such as Spain and Germany. Next, the March sample shows the emergence of other European countries such as Italy and France, while the June and September 2012 samples show the explosion of Brazil. This indicates that new profiles are not added at a constant rate but in waves. By growth rate, Italy (Δ52.09 %) and Brazil (Δ43.57 %) are the countries that gained the most new profiles in GSC in this period.

Going into further detail, the distribution by organisation fits even more clearly with the idea that this service is settled in waves and that these can come from particular countries. In general terms, the principal institutions by number of profiles are the Brazilian Universidade de São Paulo (1.83 %) and Universidade Estadual Paulista (0.77 %), followed by Harvard University (0.53 %) from the United States and the Universidade Estadual de Campinas (0.53 %), again a Brazilian university. This ranking confirms the huge increase in Brazilian profiles. However, this process is not gradual but abrupt. Figure 4 and Table 6 illustrate how the first sample is occupied mainly by American universities (Harvard University, Massachusetts Institute of Technology and University of Michigan), whereas in the third and fourth samples the Brazilian universities take off and come to dominate Google’s service. Thus, for example, the universities that increase their profiles the most are Universidade Estadual Paulista (Δ116 %), Universidade Estadual de Campinas (Δ68.6 %) and Universidade de São Paulo (Δ59.2 %). In contrast, it is surprising that important international universities such as Harvard University (Δ−3.61 %) and Massachusetts Institute of Technology (Δ−0.19 %) have slowed down their addition of profiles.

Fig. 4 Distribution of new profiles by institution and sample

Table 6 Distribution of new profiles by institution and sample

Discussion

Methodologically, this work faces the challenge of estimating the population of GSC using a capture–recapture method. The principal weakness of this study is that it has only one sample for each moment, because obtaining and processing the data is technically demanding and time-consuming. This affects the Lincoln–Petersen formula because it produces overestimates when few samples are used (Tilling 2001). It is therefore advisable to take these estimates with caution and to consider lower values. A previous study (Radicchi and Castellano 2013), crawling profiles from labels in common, obtained similar figures: 49,365 for March and 89,786 for July 2012. This suggests that the real population could be slightly below our estimates and close to the number of profiles retrieved by the crawler.

The results point to a strong evolution of this service during 2012, with a CAGR of 159 % in estimated profiles, which represents a sevenfold increase in a year. It should be remembered, however, that these services suffer from high volatility (Garcia et al. 2013). In fact, a recent crawl carried out in December 2013 yielded an annual increase of just 11.7 %, which suggests a growing stabilisation of profiles.

The longitudinal analysis of the population that settled GSC during 2012 makes it possible to build a standard profile of the users of this service. The great majority are researchers with a short curriculum: the medians are 26 articles, 154 citations and an h-index of 6, low numbers that describe an incipient research activity. Moreover, these figures decrease over time, which suggests that the profiles newly added in each sample are mainly researchers with a short career. This observation fits with the academic positions, where more than 34 % of the profiles correspond to young academic categories (Doctoral Students and Assistant Professors) that have just started their academic careers and are also the fastest-growing positions. This youthfulness is characteristic of other academic sites, where “graduated students” prevail (49 %) (Menendez et al. 2012). The same occurs in generalist social network sites (Duggan and Smith 2013), where most users are younger than 30 years old.

In terms of thematic distribution, GSC is dominated by computer science researchers and other professionals related to information technologies and web environments, who account for 15.56 % of the total profiles. This fact was already observed in previous studies on GSC, where a Map of Science showed a core of computer science labels at the centre of the picture (Ortega and Aguillo 2012; Radicchi and Castellano 2013). However, the disciplinary evolution of the service shows that other research fields such as Environmental Science (Δ18.55 %) and Physics and Astronomy (Δ17.94 %) are growing quickly, while Computer Science is stabilising with the lowest growth rate (Δ2.9 %). This suggests that GSC is advancing toward a thematic equilibrium with a fairer proportion of researchers from all disciplines. Even so, subject matter distributions are also unbalanced in other academic services: Mendeley (2012) and ResearchGate (2014) report very different figures, with a strong presence of Bio and Medicine users.

One of the most interesting aspects of the population of this platform is that it arrives in waves of researchers from different countries and institutions. In the first stage (December 2011), the service was settled by researchers from English-speaking countries such as the United States (35.4 %) and the United Kingdom (9.25 %). In the following round (March 2012), European countries such as Italy (5.1 %) and France (3.5 %) emerged strongly (Ortega and Aguillo 2013); and in the last samples (September 2012), emerging countries appear, such as India (3.8 %) and, above all, Brazil (12.4 %), which is one of the countries with the highest growth. These successive waves of users are easier to observe at the institutional level. While the first period (December 2011 to March 2012) is dominated by American universities such as Harvard University (1.14 %) and Massachusetts Institute of Technology (0.8 %), in June 2012 Brazilian universities such as Universidade de São Paulo (3.2 %) and Universidade Estadual Paulista (1.45 %) abruptly appear and take over the service (Ortega 2014). These sudden changes and unexpected distributions of countries and institutions were already reported in early studies on social networks, where the success of these services differs from one country to another (Boyd and Ellison 2007) and where the fast emergence of different groups is usual (Chang et al. 2010). For example, Menendez et al. (2012), analysing Academia.edu, found similar figures for the United States and the United Kingdom but detected important differences regarding Brazil and India. Mendeley’s (2012) fact sheets described a singular presence of francophone countries and institutions. These population biases could be driven by external factors, such as certain institutional policies or habits among scientists within a country, which cause a non-random occupation of these services.

This is evidence of a volatile reality, in which country, institutional and thematic distributions fluctuate frequently over time, producing heterogeneous populations. This has important implications for bibliometric studies because these profiles are not representative of the total population of researchers in the world. On the contrary, they reveal the influence of specific institutional policies on the use and population of these services, which cause an intentional alteration of the population distribution. Consequently, macro studies at the institutional, country or subject matter level cannot be extrapolated to global scientific performance, because GSC represents only a specific group of researchers who joined this platform for particular reasons. In this case, a stratified approach would be recommended to select representative samples, instead of random selection.
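As a hedged illustration of the stratified approach recommended above, the sketch below draws the same number of profiles from each stratum (for example, each country) instead of sampling the whole population at random. The function name `stratified_sample`, the record layout and the quota are hypothetical; a real design would also have to decide how many profiles each stratum should contribute.

```python
import random
from collections import defaultdict


def stratified_sample(profiles, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` profiles at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for profile in profiles:
        strata[stratum_of(profile)].append(profile)
    sample = {}
    for stratum, members in strata.items():
        quota = min(per_stratum, len(members))
        sample[stratum] = rng.sample(members, quota)
    return sample


# Hypothetical, deliberately unbalanced population: 50 profiles from one
# country and 5 from another.
profiles = [{"id": i, "country": "BR"} for i in range(50)]
profiles += [{"id": i, "country": "US"} for i in range(50, 55)]
sample = stratified_sample(profiles, lambda p: p["country"], per_stratum=10)
```

With a fixed quota per stratum, a wave of new profiles from a single country or institution does not dominate the resulting sample.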

Conclusions

Several conclusions can be extracted from the results:

GSC grew very fast during 2012, going from about 26,600 profiles in December 2011 to 187,301 in December 2012, at least according to the harvested data; our estimations suggest 236,000 profiles, which is close to 10 times the initial size.

According to the bibliometric indicators, GSC is being settled by young researchers at the start of their careers, which results in a low bibliometric performance. The low median values and the increasing differences in these parameters over time are evidence of the strong arrival of these new researchers. This is confirmed by the high presence of Assistant Professors and Doctoral Students.

From the subject matter point of view, GSC has been dominated since its beginnings by researchers close to Computer Science and related disciplines. However, the last samples show the emergence of researchers from Physics, Environmental Sciences and Medicine, who balance the thematic distribution of the service.

Both country and institutional distributions show that this service is being populated by waves of researchers: first from English-speaking countries, where Harvard University and Massachusetts Institute of Technology stood out; then from European countries; and finally from emerging countries, notably Brazil and its Universidade de São Paulo and Universidade Estadual Paulista.

Finally, these results have important implications for research evaluation because they show that GSC profiles, created at the scholars’ own will, generate a population that may be biased in any respect (discipline, organization, country, etc.) and subject to rapid and strong fluctuations. This suggests that this source should not be used for research evaluation through random selection, but by selecting precise strata of the population.