Introduction

Since the mid-nineties, webometrics has slowly been playing a role in the description and evaluation of scholarly communication (Thelwall et al. 2005; Orduña-Malea and Aguillo 2014). The lack of reliable data sources for link analysis is still one of the main barriers to its full acceptance by the metrics community (Thelwall 2010). But links are not the only web indicators that can be used for measuring science impact, and several alternatives have been proposed, such as mentions of the names of institutions/authors or of paper/monograph titles (Cronin et al. 1998; Kretschmer and Aguillo 2004, 2005). Problems with name variants, incomplete texts or non-ASCII characters are a formidable obstacle, so it became popular to use link mentions (Ortega et al. 2014), i.e. the strings of the target URLs of links, not necessarily appearing as active links (for example, the domain of a mail server present after the @ in email addresses).

The emergence of social tools in recent years provides new opportunities for metric analysis of scholarly impact. The most successful proposal for profiting from that opportunity, Jason Priem's 'altmetrics' (Priem et al. 2010), which stands for alternative metrics, consists of a large and very heterogeneous set of measures grouped under a rather unfortunate umbrella term.

Social web-based metrics (i.e. altmetrics) exploit a wide range of platforms, from quasi-bibliographic services like Mendeley, ResearchGate or Academia to general social networks like Facebook or Google+. The number of social tools probably exceeds one thousand, but not all of them are equally popular, and services like Twitter, Wikipedia or YouTube host plenty of academic users and content (Holmberg 2015; Kousha et al. 2012; Haustein et al. 2014a, b).

From a webometric point of view, altmetrics are not only useful as article-level metrics; they can also be applied to research-related units like individuals (Mas-Bleda et al. 2014). In the same way, the old webometric issues regarding the inconsistency of results, the understanding of their meaning and the vulnerability to manipulation are equally valid for altmetrics (Shema et al. 2014; Thelwall and Kousha 2015).

In spite of these limitations, we explore altmetrics as a potential tool for measuring research impact beyond the scientific communities, the so-called societal impact. Several authors have found correlations between altmetrics and citation measurements (Bornmann 2014; Erdt et al. 2016; Eysenbach 2011; Hammarfelt 2014; Sud and Thelwall 2014; Mohammadi and Thelwall 2014; Zahedi et al. 2014), but the choice of sources can influence that result. There are also composite indicators, like the Altmetric Attention Score developed by the company Altmetric.com (https://www.altmetric.com/about-our-data/the-donut-and-score/), that are becoming popular for providing article-level metrics to repositories and journals.

The aim of this paper is to use a webometric approach, based on link mentions (URLs appearing in third-party websites), to analyse the presence of the contents of Open Access Institutional Repositories (IRs) in a wide range of social networks and tools.

Methodology

General search engines are not commonly used in bibliometric papers due to the limitations and shortcomings of these tools for quantitative analysis of scholarly communication (Vaughan 2004; Vaughan and Thelwall 2004; Bar-Ilan 2004; Thelwall 2016; van den Bosch et al. 2016). Since 2004, with the introduction of the Ranking Web (Aguillo et al. 2006), our team has developed a series of strategies for reducing the impact of the sometimes erratic behaviour of the major search engines. One of the key issues is usually the level of noise when the terms searched are very common or have many variants (including versions in different languages), so using web domains instead of names of individuals or institutions was the preferred option. However, even in these cases short domains (for example, the domain of the University of Seville in Spain is "us.es") tended to over-estimate the results.

However, Institutional Repositories usually do not use only the institutional web domain but add their own subdomain, so searching for at least three "words" instead of two clearly reduces the mentioned noise.

The obvious choice for extracting data was Google, the largest and most popular search engine. However, when checking the number of records returned by Google for the same request using computers at the same location, or repeating the search after a few moments on the same computer, the figures can be very different. The reason is that requests sent to Google can be answered by different data centres located around the globe to avoid server saturation, so if the closest centre (in a sense that does not necessarily mean geographical proximity) is not available at that moment, another one will fulfil the request. Unfortunately, the data centres are not updated at the same time, and their databases can be (greatly) different during long periods, a fact that can go unnoticed by users.

In order to face this problem, we identified the IPs of several of the Google data centres (Table 1) and sent the same request simultaneously to all of them. When comparing the numeric results, we realised that several groups of data centres gave the same counts, so we assume that the members of each group shared the same Google database. We then chose the largest group and used its IP addresses for later extracting the IRs' altmetrics. Table 1 includes only the IPs that gave the same results during the experiment; the total number of addresses tested was far larger (150). Most of them were unreachable at that moment, but this is a volatile situation, as in previous tests several of them were active, so we recommend checking in advance the availability and results of the candidate IPs.

Table 1 List of IPs of Google data centres
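As an illustration, this consistency check can be sketched in Python as follows. The IP addresses shown are placeholders (the actual ones are listed in Table 1), and the parsing of the result-count string from the returned HTML is fragile, so in practice rate limits, CAPTCHAs and markup changes require more robust handling:

import re
from collections import defaultdict

import requests

CANDIDATE_IPS = ["64.233.160.100", "172.217.16.78"]  # placeholders; see Table 1
TEST_QUERY = 'site:researchgate.net "digital.csic.es"'

def hit_count(ip, query):
    """Estimated result count reported by the data centre at `ip`, or None."""
    try:
        r = requests.get(f"https://{ip}/search",
                         params={"q": query, "num": 1},
                         headers={"Host": "www.google.com",
                                  "User-Agent": "Mozilla/5.0"},
                         timeout=10, verify=False)
        m = re.search(r"About ([\d.,]+) results", r.text)
        return int(re.sub(r"[.,]", "", m.group(1))) if m else None
    except requests.RequestException:
        return None

groups = defaultdict(list)  # hit count -> IPs returning that count
for ip in CANDIDATE_IPS:
    count = hit_count(ip, TEST_QUERY)
    if count is not None:
        groups[count].append(ip)

# keep the largest group of IPs that agree on the same count
consistent_ips = max(groups.values(), key=len) if groups else []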

For extracting the altmetrics figures, we used a webometric approach based on the advanced operators of Google (Aguillo et al. 2010). The syntax includes two parts: the first filters by the web domain of the social network through the "site:" operator, while the second consists of the URL of the host of the repository between quotes, which forces exactly that sequence of characters. The figure obtained is referred to as the number of "URL mentions" or "link mentions". So, if we wish to obtain the link mentions of the items deposited in the CSIC Institutional Repository (http://digital.csic.es/) in the ResearchGate (http://www.researchgate.net) portal, the syntax will be:

site:researchgate.net “digital.csic.es”

Other search engines can be used, but Google has by far the largest coverage. However, even Google does not index the whole contents of the most important social tools, especially those that are highly dynamic and volatile, like Facebook or Twitter.

We extracted the list of IRs from the Ranking Web of Repositories (http://repositories.webometrics.info), excluding portals of journals, disciplinary repositories, and faculty, school or group repositories when the main organization (mostly a university) has its own repository. The master list includes 2296 IRs from all over the world.

We applied the webometric method described above for obtaining altmetrics indicators for those IRs according to the following 28 sources: Academia, Bibsonomy, CiteUlike, CrossRef, Datadryad, Facebook, Figshare, Google+, GitHub, Instagram, LinkedIn, Pinterest, Reddit, RenRen, ResearchGate, Scribd, SlideShare, Tumblr, Twitter, Vimeo, VKontakte, Weibo, Wikipedia (all languages), Wikipedia English, Wikia, Wikimedia, YouTube and Zenodo. Unfortunately, Elsevier's Mendeley is not included, as Google does not index it. In this paper, we propose a grouping of the social tools (see Table 3) only for descriptive purposes, without any taxonomic intention.
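A minimal sketch of how the query matrix can be assembled from this list follows; the mapping of tools to web domains is given only partially and for illustration, not as a verified list:

SOCIAL_TOOL_DOMAINS = {
    "ResearchGate": "researchgate.net",
    "Academia": "academia.edu",
    "Twitter": "twitter.com",
    "Wikipedia English": "en.wikipedia.org",
    "YouTube": "youtube.com",
    # ... the remaining sources follow the same pattern
}

def build_queries(repository_host, tool_domains=SOCIAL_TOOL_DOMAINS):
    """One Google query string per social tool for a given repository host."""
    return {tool: f'site:{domain} "{repository_host}"'
            for tool, domain in tool_domains.items()}

queries = build_queries("digital.csic.es")
# queries["ResearchGate"] -> 'site:researchgate.net "digital.csic.es"'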

The list is not exhaustive, and not all the tools are similarly relevant for researchers, even for members of the metrics community (Haustein et al. 2014a, b). But taking into account that most of the IRs are managed by librarians, we assumed that overall visibility is their main aim, and so we did not restrict the analysis only to the "mainstream" media.

The data was collected during the first two weeks of July 2017. Requests were made twice (two times on the same day) to avoid collection errors. The maximum value of these two attempts was then chosen for each IR/tool request.

The list of IRs was cleaned by excluding the lowest values of the duplicate entries (19) and the 92 repositories with zero counts for every one of the sources. The final list includes 2185 entries from 102 different countries. The top 30 countries according to the number of IRs are shown in Table 2.
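The two-run collection and the subsequent cleaning can be sketched as follows, assuming two hypothetical CSV exports (run1.csv, run2.csv) with the same row order, one row per repository and one column per tool:

import numpy as np
import pandas as pd

run1 = pd.read_csv("run1.csv")
run2 = pd.read_csv("run2.csv")
tools = [c for c in run1.columns if c != "repository"]

# keep the maximum of the two same-day attempts for every IR/tool pair
counts = run1.copy()
counts[tools] = np.maximum(run1[tools].values, run2[tools].values)

# for duplicate repository entries keep only the row with the highest total
counts["total"] = counts[tools].sum(axis=1)
counts = (counts.sort_values("total", ascending=False)
                .drop_duplicates(subset="repository", keep="first")
                .drop(columns="total"))

# drop repositories with zero mentions in every one of the sources
counts = counts[(counts[tools] != 0).any(axis=1)]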

Table 2 Countries with the largest number of Institutional Repositories (IRs) analysed in this study (July 2017)

The top 10 countries include 58% of the IRs analysed. Indonesia, Ukraine and Taiwan are also present in that group, which is probably related more to local university initiatives than to a true national Open Access policy. Among the countries missing from the list, Mexico, Belgium, the Netherlands and Switzerland are perhaps the most surprising absences, although regarding the last three, their relatively small number of universities needs to be taken into account.

Results

Descriptive summary statistics for the coverage of the IRs are provided for the 28 tools in Table 3. The population consists of 2185 IRs, whose web addresses (domain or subdomain), converted into strings of characters, are checked for being mentioned in the cited social tools according to Google. As previously described, the proposed link counting method does not provide exact numbers, but the counts may be accurate relative to each other.

Table 3 Descriptive statistics of the link mentions in the 28 social tools for the web domains/subdomains of all the Institutional Repositories (n = 2185; Google, July 2017)

The last column is very relevant, as it shows that for 19 (68%) of the social tools more than 1000 IRs have zero values, including 7 (25%) with more than 2000 IRs (of 2185) in that situation.
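A Table 3-style summary, including the zero-value column, can be derived from the cleaned data with a few lines; the variable names follow the sketch above and are assumptions:

summary = counts[tools].describe().T                   # per-tool mean, quartiles, max
summary["IRs_with_zero"] = (counts[tools] == 0).sum()  # IRs with no link mentions
print(summary.sort_values("mean", ascending=False))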

Considering that the number of items deposited in IRs worldwide is in the order of several million, and even allowing for overlaps and duplicates, the number of link mentions in all of the social tools is low, even for the academic networks (ResearchGate and Academia, with averages below 300 mentions). In that sense, Scribd (an e-book deposit) could be, relatively speaking, the most successful of the analysed tools. Low results from CrossRef are probably due to the fact that the sources are mainly the journals themselves, so the links used are surely DOIs.

In spite of their huge popularity, even among researchers, neither the main global platforms (Facebook, LinkedIn or Google+) nor the large local Chinese (RenRen) or Russian (VKontakte) ones are being used to disseminate papers deposited in the IRs. However, this method only identifies links shared in public pages, so links exchanged in private groups are not considered.

In the case of Twitter, there is evidence that Google indexes only about 5% of its contents, a figure that decreases year after year (Enge 2018).

Regarding YouTube, the specific characteristics of the video medium can explain its low usage.

However, perhaps the most surprising result refers to Wikipedia. As it does not allow the publication of original research, it could be expected that all scholarly items would include several academic references to recent papers, preferably Open Access full-text versions, which are usually deposited in IRs. The low numbers can be due to several reasons: (1) OA papers may be referred to by their DOI or another pURL different from the IR address. (2) Many citations may refer to canonical sources that are not OA or not yet included in repositories. (3) Many documents may be referred to through other OA sources, global portals like ResearchGate and Academia, or regional portals like SciELO and Redalyc.

The poor indexing of seminal papers by top institutions in their own repositories is a serious concern for the OA community. The situation is clearly illustrated by the low number of records from the Oxbridge universities' repositories included in the Google Scholar database (http://repositories.webometrics.info/en/transparent).

Even though English is by far the main language of academic papers and the English edition of Wikipedia is also the largest one, it appears that, independently of the language of the paper, the link mentions to IR assets are added through entries in the local-language Wikipedia.

In order to check specific situations, we identified the most popular IR for each of the social tools (Table 4). For 12 (43%) of them, the SAO/NASA Astrophysics Data System is the main contributor, although its huge size (about 13 million documents) and the fact that it cannot properly be tagged as a true IR should advise against its inclusion in the analysis.

Table 4 Repositories with the maximum number of link mentions for each one of the social tools (Google, July 2017)

For the rest of the tools, the institutions represented are very diverse. The very large network of institutions of the University of California heads Facebook, while CiteUlike is especially liked by a small Christian Indonesian university. SlideShare is very popular in Latin America, and the local Russian and Chinese social tools are represented by "local" institutions (Belarusian and Hong Kong ones, respectively).

As the institutional patterns are not evident and appear to depend strongly on local initiatives or projects, we decided to focus on geographical aggregations: regions (Table 5) and (selected) countries (Table 6).

Table 5 Global number of link mentions by region of the social tools (Google, July 2017)
Table 6 Global number of link mentions by country (selected) of the social tools (Google, July 2017)
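The geographical aggregations behind Tables 5 and 6 amount to simple group sums, assuming hypothetical region and country columns merged into the cleaned data from the ranking metadata:

by_region = counts.groupby("region")[tools].sum()
by_country = counts.groupby("country")[tools].sum()
# country labels assumed to match those used in Table 6
selected = by_country.loc[["USA", "UK", "Japan", "Australia", "Brazil", "South Africa"]]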

European and North American IRs are virtually tied for every tool, although it should be noted that ResearchGate is far more used in Europe while North Americans prefer Academia. The rest of the regions, including Asia, are far from the figures of the first two. For a few tools, Latin America, home of the large journal portals Redalyc and SciELO that outpaced the development of local IRs, even performs better than the whole of Asia. As, according to Table 2, the Asian countries are well represented in terms of the number of IRs, perhaps the gap can be explained by the size (number of records) and/or the visibility policies of these IRs.

We chose six countries (USA, UK, Japan, Australia, Brazil and South Africa) from different regions for a comparative analysis. Academia mentions fewer items than ResearchGate for every country except the USA, although the Australian figures are very similar. Brazilian IRs are very active on Facebook and Scribd (in the latter with figures close to Australia's).

Discussion and conclusions

There are relevant limitations regarding the results obtained. The documents deposited should be referred to in the social tools using the domain/subdomain of the repository. In fact, the opposite is often true, as many repositories recommend the use of handles (like, for example, http://hdl.handle.net/10261/148387). Handles are a type of pURL (permanent URL) that hides not only the name of the repository and its hosting institution, but even basic information about the paper, like the names of the author(s) or the source, the publishing year or title keywords. This is relevant, as the URL mention in the social tool can be the only piece of information the reader has for deciding whether to click to read or download the paper. Obviously, a prestigious university web domain can be a relevant hint in a tweet for both scientists and non-scholarly users. In the case of Twitter, the use of shortening tools is very popular, so many mentions are lost when the URL does not include the host of the original web address. In fact, handles are commonly long strings of meaningless characters. A similar situation applies to the use of DOIs (another pURL), which look like a good option for papers published in gold Open Access journals, as both versions (published and deposited) are cited at the same time.

The Ranking Web of Repositories explicitly stated that its aim was not only to promote the green model of the Open Access initiative (increasing deposit in repositories), but also to support good practices among depositing authors and their institutions. Therefore, the variables were designed to consider only those backlinks or social link mentions relevant to the repository that explicitly used the institutional web domain. So, since 2016, when altmetrics-based indicators were introduced, the use of pURLs in the IRs has penalized their positions in the cited Ranking.

The results showed that most IR managers (librarians in most cases) are not actively posting their contents to the social tools. It is possible that many items are really mentioned in the academic networks, but according to the data they are not cited by the URL of the repository, which offers information about the institutional authorship, a hint regarding the quality of the documents. However, there are other possibilities: far more authors than librarians are present on social media, and they are surely interested in promoting their research, but when mentioning their papers they probably do not provide the digital location in their IR.

Excluding the most popular tools, local (or even individual) strategies and policies can explain the results for the most specialised tools. The Russian and Chinese services are virtually ignored outside their regional reach, although they have indeed very large audiences. GitHub, Figshare, Zenodo and Datadryad have scarce impact outside Europe and North America.

The recommendation of using pURLs for citing IR items is probably sound, but using neutral or non-institutional web addresses decreases the informative value derived from the identification of the hosting institution. We suggest that this can penalize the usage of the involved OA papers, as prestigious names can attract more visits. It can also be considered a bad practice regarding the institutional (moral) rights, as the institutional authorship is explicitly excluded.

For scholarly communication purposes, researchers themselves are more and more active in both large general and academic tools, like ResearchGate, Facebook or Twitter, but from the results obtained it appears that IR contents play a minor role. Regarding the most specialised tools, the results suggest mostly local or individual initiatives.