Introduction

The structure and intensity of online relationships may be examined using different approaches (Barnett and Park 2014). For example, we can consider the physical network (measuring the bilateral internet bandwidth), the content network (measuring the web traffic from specific websites) or the navigational network (measuring the hyperlink connections among web domains). All these techniques make it possible to approximate phenomena that occur in the offline world from phenomena observed in the online world. Among these different methods, we can highlight the development of link analysis (Thelwall 2004).

Since its inception over a decade ago, link analysis has been widely applied to the study of relationships between organisations on the web (Park and Thelwall 2003; Thelwall 2009), during which time a constantly evolving theoretical framework has been constructed (Thelwall 2006; Orduna-Malea and Aguillo 2014). The rationale behind this method is that relationships between organisations can be approximately quantified through the relationships between the URLs that represent them; so the organisation of the web acts as a mirror of social structure (González-Bailón 2009). Therefore, the construction of networks by means of URL mentions (when the URL that corresponds to website B is mentioned on website A, whether a hyperlink is activated or not) may reflect either expected or unexpected relationships among institutions.

Several authors have used link metrics to map the volume of links exchanged between two or more websites (Lang et al. 2014), estimate the level of collaboration between institutions (Stuart and Thelwall 2006), describe the structure and topology of these navigational relations (Ortega and Aguillo 2009) or even correlate the volume of links received with other indicators, such as citations, in the case of research institutions (Almind and Ingwersen 1998; Thelwall 2001). These analyses may provide information reliable enough to be used for policymaking (Thelwall 2002).

Likewise, hyperlink analyses have been applied to a wide set of environments, which include, among others, business (Vaughan and Wu 2004; Vaughan and You 2006; Orduna-Malea et al. 2015a, b), industry (Stuart and Thelwall 2006; Romero-Frías and Vaughan 2010a; Orduna-Malea and Aytac 2015), and politics (Park 2011; Romero-Frías and Vaughan 2010b). However, these are open areas where link analysis becomes more complicated and encounters both conceptual and methodological problems.

One reason for link analysis being somewhat problematic is that linking has a very low cost barrier (Weber and Monge 2011). Therefore, hyperlink analysis may contain irrelevant information or inconsequential relationships (Barnett and Park 2014). Another is that, if these analyses rely on the use of commercial search engines, we can find a number of technical limitations, especially in the accuracy and precision of web metrics based on hit count estimates (Vaughan and Thelwall 2004; Wilkinson and Thelwall 2013).

The limitations of link analysis have led to its use expanding more in certain academic and controlled environments—such as universities (Aguillo et al. 2006) and academic journals (Vaughan and Thelwall 2003)—than in more open and uncontrolled environments—such as private companies, organisations or public institutions. Moreover, if link metrics are used for relational purposes (to determine whether the relationship between “web domain a” and “web domain b” is stronger than that between “web domain a” and “web domain c”, for example), these limitations are minimised since all URLs are subject to the same error, and thus the error is statistically dispersed.

Nevertheless, and beyond all the well-known advantages and disadvantages, link analysis possesses a special feature: its high sensitivity to cultural, linguistic and geographical differences, unattainable with bibliometric indicators. For example, Thelwall (2002) found that interlinking among universities in a homogeneous country (UK) decreases with distance while Vaughan (2006), who examined how linguistic and cultural factors affect university relationships in a heterogeneous country (Canada) through co-links, concluded that views on French Canada are based more on linguistic or cultural difference than geographical location.

In the case of Europe, despite the finding of Thelwall et al. (2003) that English is the standard web language for linking throughout the EU, Heimeriks et al. (2003) detected cultural and linguistic patterns among the relationships of 220 European universities. Moreover, in the case of national relationships, different behaviour was detected in comparison with international relationships. Ortega (2007) found that the majority of Western European academic websites are international and multilingual, with English and national languages operating in tandem throughout, a fact that explains why English is influencing and attracting more external links to European universities than other languages (Orduna-Malea et al. 2014). Nonetheless, the influence of English may be different in other countries outside Europe (Thelwall and Tang 2003).

Nevertheless, this sensitivity makes link analysis a useful complement to bibliometric studies oriented to the analysis of relationships among institutions (mainly through co-authorship patterns). It is reasonable to assume that if two institutions collaborate closely (as evidenced by a number of jointly published articles), some relationship between them actually exists. This relationship should, to a certain extent, be reflected on the web if the corresponding institutional websites are designed adequately and are suited to webometric analysis.

Nonetheless, the underlying human motivations for link creation—even within such specific and controlled spaces as research institutions—are complex, and professional, research-oriented and informative issues seem to be the main reasons for link creation among these websites (Wilkinson et al. 2003; Bar-Ilan 2005; Seeber et al. 2012). This means that hyperlinks and co-authorship data may reflect both different and complementary aspects of institutional relationships. However, the correlation between these two metrics should be statistically significant if institution websites actually reflect the scientific activities of the corresponding institutions, either directly (openly depositing academic material) and/or indirectly (informing about academic undertakings).

The complementarity of these two institutional relationship views is of special interest when studying systems composed of institutions that are related but with different purposes or activities (i.e., governmental, research, informative or management, or all of these activities combined but with different intensity), where geographical differences among institutions can also determine the degree and nature of their relationships. This is the case for studies centred on Triple-Helix relationships (THR). Specifically, THR analyses of an entire heterogeneous country will be strongly determined by its legal and political structure (more centralised or more distributed) as well as by geographical and language patterns (if different official local languages co-exist, for example).

Triple Helix studies could thus benefit from link analysis techniques to achieve a better understanding of the relationship between different institutions (Khan and Park 2011), offering alternatives to strictly coded and commercially available indicators like co-authorship publications. The application of web metrics to THR has, however, been scarce to date (Leydesdorff and Curran 2000) and applied to excessively aggregated entities with few exceptions (Stuart and Thelwall 2006; García-Santiago and Moya-Anegón 2009). Heimeriks and Van den Besselaar (2006) advise of the need to disaggregate data obtained through link analyses from complex institutions—such as universities—since link patterns depend on scientific fields, so the analysis of lower-level, disaggregated institutions might offer more detailed information.

We have therefore identified a gap in the literature on the analysis of Triple-Helix relationships: taking geographical differences into account and combining hyperlinks and publications in order to achieve a better understanding of the relationships among disaggregated institutions. Moreover, in contrast to the University and Industry Helices, government linking (the analysis of links between government websites or websites depending on public government, including research bodies) has hardly been studied at all, despite its considerable interest. The exceptions are the works by Holmberg (2010) and Holmberg and Thelwall (2009), although these studies apply to homogeneous government entities in very local environments, with results that are difficult to extrapolate. A nationwide study of a set of heterogeneous entities depending on public government has not yet been carried out.

Objectives

The main objective of this study is to determine whether, and how, the combination of link analysis and co-authorship techniques can reveal complementary insights when identifying the institutional relationships within a geographically distributed system composed of a heterogeneous set of public government entities (including research bodies).

To accomplish this objective, the following secondary goals were set out:

  (a) Determine the suitability of the analysed entities for link analysis (mainly by URL syntax).

  (b) Study the web impact of the corresponding institutions (mainly by page count and web visibility metrics).

  (c) Analyse the interlinking patterns between institution web domains.

  (d) Ascertain the correlation between interlinking and co-authorship data.

The first two secondary objectives were necessary prior to the interlinking analysis to characterise the set of web domains studied, since the URL syntax used by institutions and the amount of content (i.e. the relationship between content and the total number of links received) may help to explain the specific number of hyperlinks created between institutions.

Methods

Selecting the sample

The sample of institutions was obtained by studying a specific sector (public health) in a specific country (Spain).

On the one hand, the public health sector constitutes a very special and specific case study within the Government Helix since it is a complex system that brings together a wide range of entities (research institutions, health centres, government bodies, etc.): it provides different relationship patterns for link analysis and co-authorship techniques. On the other hand, Spain is a particularly apposite research subject because it is a nation with a high degree of regional autonomy, large cultural differences and diverse co-official local languages, which means greater complexity in the design and analysis of distributed system data which in turn could determine institutional relationships.

Although the Spanish public health system has been partially analysed from a bibliometric point of view, with special mention of the Spanish bibliometric map in Biomedicine and Health Sciences (Méndez-Vásquez et al. 2008), currently updated to the period 1997–2011 (Méndez-Vásquez et al. 2012), data on co-authorship between all Spanish health system institutions are lacking. Moreover, García-Lacalle et al. (2011) detected an absence of studies on healthcare web presence, which further justifies the choice of this sample for this study.

The political distribution of health authorities in Spain is regulated by: the Spanish Constitution of 1978; General Health Act 14/1986 of 25 April; and Act 16/2003 of 28 May on cohesion and quality in the National Health System (Ministry of Health, Social Services and Equality 2012). According to this distribution, the national government is responsible for the overall coordination of national healthcare, foreign healthcare, and drug legislation, while the regions are responsible for health planning, public health and the management of health services.

Obtaining the sample

In the first phase of the study we recorded all Spanish health entities that belong to the national government or to any of the regional governments. The regions comprise the 17 autonomous communities and two autonomous cities into which Spain is divided (see Appendix I), and correspond to Eurostat NUTS 2 level (2011). We included all public and mixed, but not private, entities.

Hospitals have been excluded from this study as it would have been extremely complicated to include them. These establishments (that may depend directly on public universities, national or regional government, private entities or any combination thereof) currently have a web presence that reflects hierarchies and institutional relationships poorly in their corresponding official URLs and where the existence of consortia further complicates web analysis. Because of this, hospitals are being studied independently. Since this study has a more methodology-based approach, the exclusion of these establishments does not affect the proposed objectives: the combination of interlinking and co-authorship analysis methods to identify institutional relations in a distributed national public sector, rather than a mere description of the sector itself.

Next, the corresponding URL of each institution was obtained. We established that the unit of study would be the most general URL of the entity website (Orduna-Malea and Ontalba-Ruipérez 2013), as this contains all the web pages of the corresponding institution. All entities whose website contained a single web file or a limited number of web files under different URL hierarchies (different subdomains or subdirectories) were excluded, since this web practice jeopardises link analysis. Unfriendly URLs (those including diacritics or non-optimised dynamic URLs) were excluded as well, since search engines cannot retrieve web metrics accurately from them.

The identification and listing of Spanish public health institutions (and their corresponding URLs) was performed manually by browsing and searching all official websites related to health at national and regional level. The process was as follows: we began by browsing the different official national (Ministries) and regional websites (Councils), searching for all hypertext references to any element of the Spanish health system on all levels. The URLs that were identified and validated for these websites were subsequently revisited to repeat the process of identifying all hypertext references to any element of the Spanish health system.

If a health service, research centre or any other institution was not linked to from these seed websites, it was discarded. This process was conducted between June and September 2014.

Obtaining web metrics

Next we obtained the following web impact indicators: page count (number of files hosted within a domain name and indexed in search engines) and web visibility (number of links or sites linking to a website) only for those URLs whose syntax allows these data to be captured. This enabled us to determine not only the suitability of the Spanish online public health system for link analysis but also its impact. Since the different parts of a URL are not always used consistently, the nomenclature used in describing URL syntaxes is described in Fig. 1.

Fig. 1 URL syntax nomenclature

The indicators used, their scope and the source used to obtain them are summarised in Table 1 (web impact metrics). In the case of web page count, the source used was Google (<google.com>) because it is the most comprehensive search engine in the world today. Since the Spanish version (<google.es>) showed similar results for a small sample of data, the international version was used.

Table 1 Web indicators: impact (page count and visibility) and network

Use of the page count indicator can generate noisy data because it includes all kinds of material (images, icons, multimedia, etc.). For this reason, file type filtering may be necessary in some cases to obtain cleaner data. However, filtering also gives rise to various problems:

  • First, different web content generation practices prevent us from knowing a priori the best file type to filter. Some websites may have few HTML files but a lot of rich PDF files or collections of images of interest; conversely, other websites may have many banal HTML files.

  • Second, there is an increasingly widespread practice of hiding extensions in URLs, which means that filtered file quantification cannot be performed with precision for certain file types (HTML in particular).

  • Third, web files also present additional problems due to the already widespread use of content management systems (CMS) that generate dynamic URLs without a file type extension.

  • Finally, automatic searches through query commands depend on the performance of advanced search (in this case Google). For example, it is no longer possible to search for HTML files directly through the “filetype” command.

Since total page count correlates highly with the specific page count of rich files (Orduna-Malea et al. 2014), we decided to use total page count for data correlation despite the noise it generates. In any case, PDF page count was also used to check correlation with overall data size.

For web visibility data, it was not possible to obtain domain-level link indicators with Google (Thelwall and Sud 2011), so the source used was Open Site Explorer (OSE), an online application created in 2010 to extract and analyse link data from the Mozscape Index, a data repository that provides access to over 145 billion URLs (as of 12 November 2015). It also provides free and paid services, search engine or API requests, and generates its own metrics (MozRank, MozTrust and Domain Authority). OSE can be used as a reliable tool for collecting webometric data, as previous research has demonstrated (Orduna-Malea et al. 2015a, b).

Finally, all data were entered into a spreadsheet, where they were subjected to statistical analysis with the XLSTAT application. We obtained the correlation coefficient between the impact indicators to ascertain the possible influence of web content on total links received (an issue that may determine interlinking behaviour). The Spearman coefficient was applied, in light of the generally unequal distribution of web data (Barabasi and Albert 1999). A principal component analysis (PCA) was also conducted in order to visualise the different dimensions of the web indicators used. This statistical analysis was carried out between September and October 2014.

Obtaining interlinking metrics

In the case of selective links (links from a particular domain to another) neither Google nor Open Site Explorer may be used, at present, to obtain this indicator (Sud and Thelwall 2014). Therefore a proxy created from the URL mentions indicator was employed, a method already used in scholarly literature (Ortega et al. 2014), using Google as a data source.

Queries were made by introducing the following syntax in the search field: <“abc.tld” site:xyz.tld>. This query returns the number of times the domain “abc.tld” has been mentioned in the domain <xyz.tld> and is therefore an indicator of the intensity of the relationship between the two web domains.
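The query pattern above can be sketched programmatically. The following Python fragment is only an illustration, assuming straight double quotes around the mentioned domain; the function names are hypothetical, and the three domains are taken from this article purely as examples. It builds one query per ordered pair of URLs, since mentions from A to B and from B to A are counted separately.

```python
from itertools import permutations


def mention_query(mentioned: str, host: str) -> str:
    """Build the query that counts mentions of `mentioned` within `host`."""
    return f'"{mentioned}" site:{host}'


def all_mention_queries(domains):
    """Return one query per ordered pair (hosting site, mentioned domain)."""
    return {
        (host, mentioned): mention_query(mentioned, host)
        for host, mentioned in permutations(domains, 2)
    }


# Illustrative subset of the sample; the full study used many more URLs.
domains = ["isciii.es", "msssi.gob.es", "cnio.es"]
queries = all_mention_queries(domains)

print(len(queries))                            # 6 ordered pairs for 3 domains
print(queries[("msssi.gob.es", "isciii.es")])  # "isciii.es" site:msssi.gob.es
```

For n URLs this yields n × (n − 1) queries, which is why the number of combinations grows quickly with sample size.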

We thus created all the possible combinations between all the URLs recorded in the previous phase, whether they were domains or subdomains (syntax types where visibility indicators function correctly). For each combination, the hit count estimate provided by Google was obtained manually (this task was performed equally by all the authors in order to check and classify possible errors with URL mentions; this task took 4 weeks). Subsequently, a .NET file was generated manually from these data and imported into the Gephi application, which was used to produce a graph of all the relationships.
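Although the file was generated manually in this study, the step can be sketched in code. The fragment below is a minimal illustration that assumes the .NET file follows the Pajek NET text layout (a `*Vertices` block followed by an `*Arcs` block of weighted directed links). The function name is hypothetical; the three non-zero hit counts are those reported for <msssi.gob.es> in the Results, while the zero-valued pair is a made-up placeholder that shows how null combinations are skipped.

```python
def to_pajek(nodes, mentions):
    """Serialise weighted directed mentions as Pajek NET text.

    nodes    -- list of URL labels
    mentions -- dict mapping (source, target) to a hit count
    """
    # Pajek vertex identifiers start at 1.
    index = {label: i for i, label in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Arcs")
    for (source, target), hits in sorted(mentions.items()):
        if hits > 0:  # null combinations produce no arc
            lines.append(f"{index[source]} {index[target]} {hits}")
    return "\n".join(lines) + "\n"


nodes = [
    "msssi.gob.es",
    "pnsd.msssi.gob.es",
    "aesan.msssi.gob.es",
    "ingesa.msssi.gob.es",
]
mentions = {
    ("msssi.gob.es", "pnsd.msssi.gob.es"): 73100,
    ("msssi.gob.es", "aesan.msssi.gob.es"): 56500,
    ("msssi.gob.es", "ingesa.msssi.gob.es"): 9360,
    ("pnsd.msssi.gob.es", "ingesa.msssi.gob.es"): 0,  # hypothetical placeholder
}

net_text = to_pajek(nodes, mentions)
print(net_text)
```

The resulting text could be saved with a `.net` extension before being imported into a network analysis tool.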

The same application was used to obtain various indicators (Table 1) at both node (Degree, Betweenness, Clustering Coefficient) and network level (Average Degree, Average Clustering Coefficient, Average Path Length, Diameter, Density, Eigenvector); these indicators describe the characteristics of the network. Impact indicators and URL mentions were recorded between October and November 2014.

Obtaining co-authorship metrics

The total scientific output for each institution (only those entities with valid URLs for link analysis) was retrieved from Scopus. Then, the identification (ID) provided by Scopus for each institution was used to obtain the number of publications co-authored by each pair of institutions through the following query search: <AF-ID(institution1) and AF-ID(institution2)>. These bibliometric data were retrieved in July 2015 and comprise the total cumulative output of each institution at that time. Given the slower pace of the publication process compared to the creation of hyperlinks, we consider the gap of about 10 months between the retrieval of hyperlink and co-authorship data to be useful for the analysis of academic relationships that already existed when the link data were retrieved.
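As an illustration of the pairwise query described above, the following sketch builds one Scopus query string per unordered pair of institutions, since co-authorship is symmetric. The affiliation identifiers shown are hypothetical placeholders, not real Scopus IDs, and the function name is an assumption made for this example.

```python
from itertools import combinations


def coauthorship_query(id_a: str, id_b: str) -> str:
    """Build the query for publications shared by two affiliation IDs."""
    return f"AF-ID({id_a}) and AF-ID({id_b})"


# Hypothetical affiliation identifiers, used only to show the pattern.
affiliation_ids = {
    "Institution A": "60000001",
    "Institution B": "60000002",
    "Institution C": "60000003",
}

coauthorship_queries = {
    (name_a, name_b): coauthorship_query(affiliation_ids[name_a], affiliation_ids[name_b])
    for name_a, name_b in combinations(affiliation_ids, 2)
}

for pair, query in coauthorship_queries.items():
    print(pair, query)
```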

Finally, the correlation between interlinking and co-authorship data was obtained. Since this information was compiled in two different matrices, the Mantel test (Spearman; two-tailed test) was performed. In view of the fact that interlinking is a non-reciprocal process (the number of links from institution A to institution B is not equal to the number of links from institution B to institution A), all links between each pair of institution URLs were added up prior to the correlation process. In the case of institutions with alternative web domains (more than one valid URL), mentions to/from each of the alternative web domains were considered. This statistical analysis was carried out during July 2015.
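The procedure described above (adding up links in both directions, then comparing the two matrices) can be sketched as follows. This is a simplified pure-Python illustration of a Mantel-type permutation test on Spearman rank correlation, not the XLSTAT implementation used in the study; the 4 × 4 matrices are invented toy data, and the number of permutations and the random seed are arbitrary assumptions.

```python
import random


def symmetrise(directed):
    """Add A->B and B->A counts so each pair has a single undirected value."""
    n = len(directed)
    return [[directed[i][j] + directed[j][i] if i != j else 0 for j in range(n)]
            for i in range(n)]


def upper_triangle(matrix, order=None):
    """Flatten the values above the diagonal, optionally relabelling the nodes."""
    n = len(matrix)
    order = order or list(range(n))
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """Average ranks: tied values share the mean of their positions."""
    ordered = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    start = 0
    while start < len(ordered):
        end = start
        while end + 1 < len(ordered) and values[ordered[end + 1]] == values[ordered[start]]:
            end += 1
        for k in range(start, end + 1):
            result[ordered[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mantel_spearman(links, coauthors, permutations=999, seed=0):
    """Two-tailed permutation test: shuffle the node labels of one matrix."""
    rng = random.Random(seed)
    y = upper_triangle(coauthors)
    observed = spearman(upper_triangle(links), y)
    extreme = 0
    for _ in range(permutations):
        order = list(range(len(links)))
        rng.shuffle(order)
        if abs(spearman(upper_triangle(links, order), y)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)


# Toy data (invented): directed URL mentions and symmetric co-authorship counts.
directed_links = [[0, 12, 0, 3], [5, 0, 1, 0], [0, 2, 0, 0], [1, 0, 0, 0]]
coauthorship = [[0, 9, 0, 2], [9, 0, 1, 0], [0, 1, 0, 0], [2, 0, 0, 0]]

symmetric_links = symmetrise(directed_links)
rho, p_value = mantel_spearman(symmetric_links, coauthorship)
print(round(rho, 3))  # 1.0 for these toy matrices (identical pair rankings)
```

Permuting rows and columns together preserves the structure of each matrix, which is why node labels are shuffled rather than individual cells.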

Results

Suitability of Spanish online public health system for link analysis

A total of 263 URLs were located, of which 25.1 % (66 URLs) are national, while the remaining 74.9 % (197) belong to the 19 regions, of which Catalonia, with 17.5 % (46), Andalusia, with 8.7 % (23) and the Canary Islands, with 7.6 % (20), stand out in particular. Appendix II (see supplementary material) shows all analysed URLs, grouped by area of jurisdiction (national or regional), and URL type (domain, subdomain, subdirectory, combined).

With regard to URL type, domain is the most used syntax, accounting for 46.7 % (123 URLs) of all the websites analysed. Nationally it accounts for 62.1 % of URLs. Regionally, the average percentage of domains is 41.6 %, Catalonia leading the category with domains making up 71.7 % (33) of the total for that region, while for Andalusia, despite it being the region with the second-highest total number of domains (10), these domains make up only 43.5 % of the total. Finally, it is worth noting that Madrid does not have any domains. The subdirectory is the second most used syntax, with 82 URLs (31.2 %); the Canary Islands and Andalusia are the standout regions, with 15 and 13 subdirectories respectively. Finally, the subdomain, with 41 URLs (15.6 %), is the third most used syntax, the region of Valencia having the highest number, with 7 web addresses. Additionally, 17 URLs (8 national and 9 regional) composed of a combination of domain/subdomain and subdirectories were identified.

Moreover, subdomains, subdirectories and combined syntax may be used at different depth levels, thus reflecting a particular knowledge organisation. For each of the 263 URLs located, both their depth and their degree of friendliness were analysed. Full details may be consulted in the supplementary material (Appendix IV).

The general distribution of URL syntaxes according to web depth level is illustrated in Table 2. At regional level, we encountered greater intensity in the design of subdirectories (38 URLs presented more than two hierarchical levels), which greatly hinders their use for not only web metric purposes but also search engine optimisation.

Table 2 Distribution of URL syntaxes and web depth levels

With regard to combined syntax, this URL type was concentrated at national level in the Ministry of Health, Social Services and Equality (specifically in subdirectories within <msssi.gob.es>), and at regional level in the Basque Country (5 URLs) and Valencia (3).

Furthermore, unfriendly URL types (i.e. dynamic URLs that have not been optimised for search engines) were detected in both subdirectories and combined syntax. This effect was observed in 38 URLs (14.4 % of all URLs analysed), all at regional level; it is particularly significant in the case of the Madrid Regional Ministry and the Regional Ministry of Health and Social Services of La Rioja when we consider that they are high-level administrative structures.

Finally, of the initial 263 URLs, 38 unfriendly URLs were discarded, as were two other URLs (<ib-salut.caib.es> and <osakidetza.euskadi.net/r85->) that are not websites per se but rather single web pages. This produced a total of 223 valid URLs for web impact analysis, divided into 131 domains, 44 subdomains and 48 subdirectories.

Web impact of Spanish online public health system

Web impact metrics

Table 3 shows the sites whose URL contained more than 100,000 files (a total of 13 centres), with the National Centre for Biotechnology ranked first (page count of 850,000), followed far behind by the Spanish Agency for Medicines and Health Products (386,000) and the Carlos III Health Institute (376,000); all of these are national centres. Conversely, low values (a page count of fewer than 100) were obtained for 54 URLs (24.21 %), of which 26 were subdirectories; for 6 URLs (2.7 %) no result was obtained. Full data are available in Appendix V.

Table 3 Top health centres according to page count with web visibility metrics

To analyse web visibility, subdirectories were discarded (since link indicators are not precise enough for this URL type), which left us with 175 URLs (domains and subdomains), from which the total number of external links, root domains and MozRank were extracted.

We observed an uneven distribution of external link data. The figure of 1000 links was attained by more than 70 entities (40 %), a figure that is by no means negligible (the Hospital del Mar Medical Research Institute, which ranks first for this indicator, received 110,268 links, followed by the Carlos III Health Institute with 102,762). On the other hand, we identified 29 entities which received fewer than 50 external links (details in Appendix V), with the National Centre for Biotechnology faring particularly poorly (only 2 external links); as mentioned above, this is the entity with the largest web page count. Finally, special mention must be made of the high web visibility of the Catalan centres (in fact, of the 50 centres with the most external links received, 18 pertained to Catalonia, while 15 were national).

However, this high visibility diminishes when the links are aggregated per linking root domain (LRD). For example, the National Plan on Drugs (<pnsd.msssi.gob.es>), despite receiving 15,793 external links, received them from only 102 different websites. We noted 18 centres for which the number of external links per root domain exceeded 100 (see Appendix VI). This indicates that only a few sites provide the bulk of the external links. For this reason the usefulness of external links is limited in this regard, and it is preferable to use alternative document models such as LRDs to more adequately discern the web impact of units.
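The concentration of links can be expressed as the ratio of external links to linking root domains. The short sketch below applies it to the National Plan on Drugs figures given above; the function name is a hypothetical label for this illustration.

```python
def links_per_root_domain(external_links: int, linking_root_domains: int) -> float:
    """Average number of external links contributed by each linking root domain."""
    if linking_root_domains <= 0:
        raise ValueError("at least one linking root domain is required")
    return external_links / linking_root_domains


# Figures reported for <pnsd.msssi.gob.es>.
ratio = links_per_root_domain(15_793, 102)
print(round(ratio, 1))  # 154.8
print(ratio > 100)      # True: above the threshold of 100 links per root domain
```

A high ratio signals that a small number of linking sites account for most of the external links received.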

Linking root domains also show an uneven distribution of data; while 60 entities (34.3 %) had over 100 root domains, 40 (22.9 %) did not attain the figure of 20 root domains, which is considered a low figure given the importance of the centres analysed. The Andalusian School of Public Health (3878), Carlos III Health Institute (3473), Cantabrian Health Service (2131), Galician Health Service (1132) and the National Cancer Research Centre (1046) are the institutions that attained the highest number of root domains.

An even more elaborate step is to quantify the quality of the sites from which the links are received. In this aspect, the data provided by MozRank are generally very scarce. Only 14 of the 175 URLs analysed scored higher than a 5 (maximum score 10): the Carlos III Health Institute (5.76) obtained the highest score, followed by the National Cancer Research Centre (5.48).

Correlation between web impact metrics

To verify the possible association or dissociation between page count and web visibility metrics, the correlation between all web impact indicators was calculated (Table 4).

Table 4 Spearman correlation between web impact indicators

Although all the correlations were statistically significant (α = 0.01), the correlations between page count and the visibility indicators were slightly lower than expected: MozRank (r = 0.42), total external links (r = 0.46) and root domains (r = 0.47). On the other hand, high correlations were obtained among the visibility indicators themselves, especially between root domains and MozRank (r = 0.87).

All correlation data were recalculated a year later (November 2015) in order to test their reliability and to run a control on the PDF page count (HTML page count was omitted for the technical reasons outlined above). Results show that the correlations obtained a year later are very similar, and maintain a clear separation between size and visibility indicators. The correlation between total page count (2014) and total page count (2015) is also very high (r = 0.866), reinforcing the validity of the data obtained.

In order to visualise the relationships between indicators for the sample of units under analysis, a principal component analysis (PCA) was conducted in which the unit level (national or regional) was included as a qualitative variable to detect any patterns in the performance of the indicators based on this variable (Fig. 2).

Fig. 2 Principal component analysis (PCA) of units according to region

Figure 2 shows the separation between the page count variable and the three visibility measurements, where the first two factors of analysis explain 89.6 % of variance, a high value which reinforces the separation of these two indicator dimensions.

Moreover, we observe the existence of outliers, especially in national units, both in page count (National Centre for Biotechnology, <cnb.csic.es>; Spanish Agency for Medicines and Health Products <aemps.gob.es>) and web visibility metrics (Carlos III Health Institute, <isciii.es>). Similarly, values for some regional centres also produced outliers, both in page count, e.g. the Valencia Regional Health Ministry (<san.gva.es>) or the Murcia Health Service (<murciasalud.es>), and visibility (Andalusian School of Public Health, <easp.es> or the Hospital del Mar Medical Research Institute, <imim.es>), as discussed above.

Interlinking patterns between Spanish online public health institutions

Finally, URL mentions from each URL to the others (a total of 30,450 combinations) were calculated from the final sample of 175 URLs corresponding only to domains and subdomains to create a directed graph of the composition of the Spanish health system on the Web.

Characteristics of the network

The results show a network with very low density values (D = 0.06), average degree (10.05) and average clustering coefficient (0.41), while values were relatively high for diameter (6) and average path length (2.35).

Low density values were determined by the low relationship intensity (i.e. the low number of mentions from one URL to another). In fact, 94.2 % of the combinations (28,692) gave a null value, while more than 100 mentions were recorded for only 72 combinations (the combinations with the highest intensity in the network are available in Appendix VII). These data show that the relationships of greatest intensity are linked to the website of the Ministry of Health, Social Services and Equality (<msssi.gob.es>), and in particular to the relationships between entities hosted on subdomains within the Ministry.
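The reported network figures can be related to one another with simple arithmetic. The sketch below assumes that every non-null combination corresponds to one directed arc and that average degree is computed as arcs per node; under those assumptions the values derived from the counts in the text appear consistent with the reported density and average degree.

```python
def ordered_pairs(n: int) -> int:
    """Number of directed combinations among n URLs (self-pairs excluded)."""
    return n * (n - 1)


n_urls = 175
null_pairs = 28_692                 # combinations with no mentions (from the text)

pairs = ordered_pairs(n_urls)       # 30450 combinations
arcs = pairs - null_pairs           # 1758 non-null combinations
density = arcs / pairs              # share of possible arcs that exist
average_degree = arcs / n_urls      # arcs per node (assumed definition)

print(pairs, arcs)
print(round(density, 2))            # 0.06
print(round(average_degree, 2))     # 10.05
print(round(100 * null_pairs / pairs, 1))  # 94.2
```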

In any case, these nodes, despite belonging to the same entity, have been kept separate in the network in order to analyse the influence of each URL separately. Table 5 lists the 20 URLs with the greatest betweenness centrality, with the Carlos III Health Institute (8617.08) in first position, followed at some distance by the Ministry of Health, Social Services and Equality (3470.93). Further details about network metrics are provided in Appendix VIII.

Table 5 Node-level network indicators

Figure 3 shows the online network of the sampled institutions. High centrality may be observed in the nodes corresponding to national units (in red) where, as well as the Ministry of Health, Social Services and Equality (<msssi.gob.es>) and the Carlos III Health Institute (<isciii.es>), other major centres also feature, such as the Spanish Agency for Medicines and Health Products (<aemps.gob.es>), the National Centre for Biotechnology (<cnb.csic.es>) and the National Cancer Research Centre (<cnio.es>). Next to this central zone, dominated by national units, are the regional health centres of Galicia (<sergas.es>), Basque Country (<osakidetza.euskadi.net>), Murcia (<murciasalud.es>) and Valencia (<san.gva.es>), all of which are close to the Ministry node.

Fig. 3 Spanish online network (n = 175; Fruchterman Reingold algorithm)

Around the central zone, containing the national units, the regional units with the highest number of nodes in the network are located: Catalonia (yellow), Andalusia (green) and Extremadura (brown), with a clear cohesion between them. Valencia (blue), on the other hand, is much more widely dispersed. The other nodes, regardless of the region to which they belong (and with very few nodes in the network) appear in grey.

Hierarchical relationships

Data for relationships between the various institutions under the Ministry of Health, Social Services and Equality are highly inflated due to both the institutional relationship and a shared parent domain. For example, from <msssi.gob.es>, 73,100 hits to <pnsd.msssi.gob.es> were recorded, 56,500 to <aesan.msssi.gob.es> and 9360 to <ingesa.msssi.gob.es>.

Subnets

Given the complexity of the overall network, and the evident geographical influence, we disaggregated data for the national subnet (Fig. 4, top), and the regional subnets of Catalonia (Fig. 4, centre left) and Andalusia (Fig. 4, centre right), which comprise the highest number of nodes (38 and 18 respectively).

Fig. 4 National and Regional (Catalonia and Andalusia) networks and metrics

These subnets also obtained very low density values. The national network (formed by 55 nodes) has a density of 0.126 (Fig. 4, bottom), while that of Catalonia, which includes the relationship with the highest intensity in the whole network (that between the Hospital del Mar Medical Research Institute <imim.es> and the Barcelona Biomedical Research Park <prbb.org>, with 15,400 mentions), is slightly higher (0.148).

Furthermore, the two most complex subnets (national and Catalan) are interconnected, due mainly to the relationships between the nodes with the highest eigenvector value in both networks (<isciii.es>, <cnio.es> and <msssi.gob.es> in the national network, and <imim.es>, <iconcologia.net>, <prbb.org> and <ibecbarcelona.eu> in the Catalan network), as can be seen in Fig. 5.

Fig. 5 Interconnection of the national and Catalan subnets

Asymmetry

It should be noted that the relationships established in the different networks and subnets are directed and, therefore, not symmetrical. Moreover, the high degree of some nodes is due in some cases to the high number of mentions made towards other nodes of the network (e.g. the Ministry of Health, Social Services and Equality, which acts as a Hub) and in others to the number of mentions received from other nodes (e.g. the Hospital del Mar Medical Research Institute, which acts as an Authority). Table 6 shows the main Hub and Authority nodes in the network.

Table 6 Main Authority and Hub nodes in the network

These asymmetries in the degree of the nodes are produced by the non-reciprocity of the mentions. For example, 338 mentions were recorded from <msssi.gob.es> towards <isciii.es>, while 505 were recorded in the opposite direction. The most notable case is in the Catalonia subnet, where 15,400 mentions from <prbb.org> towards <imim.es> were recorded, while not a single mention was registered in the opposite direction in the September 2014 data (only 77 were obtained in November 2014).
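The way non-reciprocal mentions separate Hubs from Authorities can be illustrated with a small iterative computation. This is a minimal sketch that assumes the Hub and Authority labels follow a HITS-style scheme, which the text does not state explicitly; the function `hits_scores` is hypothetical, and only the three mention counts quoted above are used as input.

```python
def hits_scores(edges, iterations=20):
    """HITS-style hub and authority scores for weighted directed edges.

    edges: list of (source, target, weight) tuples.
    Scores are rescaled so that the largest value is 1 after each step.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of the mentioning nodes.
        new_auth = {n: 0.0 for n in nodes}
        for s, t, w in edges:
            new_auth[t] += w * hub[s]
        top = max(new_auth.values()) or 1.0
        auth = {n: v / top for n, v in new_auth.items()}
        # Hub: weighted sum of the authority scores of the mentioned nodes.
        new_hub = {n: 0.0 for n in nodes}
        for s, t, w in edges:
            new_hub[s] += w * auth[t]
        top = max(new_hub.values()) or 1.0
        hub = {n: v / top for n, v in new_hub.items()}
    return hub, auth


# Mention counts quoted in the text (September 2014 data).
edges = [
    ("msssi.gob.es", "isciii.es", 338),
    ("isciii.es", "msssi.gob.es", 505),
    ("prbb.org", "imim.es", 15400),
]
hub, auth = hits_scores(edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

On this three-edge toy input the one-directional relationship dominates: the mentioning node scores highest as a hub and the mentioned node as an authority. The full network in Table 6 involves many more edges, so its rankings need not match this illustration.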

Correlation between interlinking and co-authorship data

The Scopus database shows academic output (at least one article published) for a total of 46 institutions. The Catalan Institution for Research and Advanced Studies ranked first (6753 articles published) followed by the Carlos III Health Institute (5454) and the Biological Research Centre (4839). Full details about the academic production of these 46 institutions are provided in the supplementary material (Appendix IX). The correlation between the number of articles published by these 46 institutions and the page count of the corresponding institution web domains is lower than expected but statistically significant (0.4; α = 0.01).

The co-authorship data between each pair of the 46 institutions (454 combinations) were also gathered. The combination with the most activity is that between the Andalusian School of Public Health and the Catalan Institute of Oncology (354 papers co-authored). It is also worth noting that within the top 25 institution combinations according to number of co-authored articles, we found thirteen combinations formed by a national and a Catalan institution and seven formed by two Catalan institutions. Conversely, only three combinations are formed by two national institutions.

A co-authorship matrix was created (see Appendix X) and compared with the interlinking matrix. The Mantel test between these two matrices shows a significant correlation between the number of co-authored papers and the number of links between the corresponding institution web domains (0.490; α = 0.01).
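A Mantel test compares two matrices defined over the same set of institutions by permuting the rows and columns of one of them together. The sketch below is a generic permutation version using the Pearson correlation of the off-diagonal cells; the choice of statistic, the number of permutations and the toy matrices are all assumptions, as the text does not describe the implementation used.

```python
import random
from statistics import mean


def off_diagonal(matrix):
    """Flatten a square matrix, skipping the diagonal (self-relations)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mantel(a, b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(a)
    flat_a = off_diagonal(a)
    observed = pearson(flat_a, off_diagonal(b))
    order = list(range(n))
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel the institutions of b: rows and columns move together.
        permuted = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(flat_a, off_diagonal(permuted)) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Hypothetical 4 x 4 matrices: co-authored papers (symmetric), mentions (directed).
coauthorship = [[0, 12, 3, 0], [12, 0, 5, 1], [3, 5, 0, 0], [0, 1, 0, 0]]
interlinking = [[0, 40, 8, 0], [25, 0, 10, 2], [5, 9, 0, 0], [0, 3, 1, 0]]
r, p = mantel(coauthorship, interlinking)
print(r, p)
```

Permuting whole rows and columns, rather than individual cells, preserves the dependence among the cells that belong to the same institution, which is why an ordinary significance test on the flattened matrices would not be appropriate.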

These correlations (page count vs academic output; interlinks vs co-authored papers) offer complementary views of institutional relationships, as can be seen in Fig. 6, which provides a scatter plot for each comparison.

Fig. 6 Dispersion between page count and academic output data (left) and between interlinking and co-authorship data (right)

Additionally, the co-authorship degree (number of institutions with which a particular institution has collaborated) and interlinking degree (number of web domains to which a particular web domain is linked) were calculated (see Appendix XI). In this case, we obtained a better correlation between these two metrics (0.7; α = 0.01).
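Both degree indicators reduce each matrix to a count of distinct partners per institution. The sketch below shows one way to derive such counts; treating a web domain as linked to another when a mention exists in either direction is an assumption, and the function name and matrix are hypothetical.

```python
def partner_counts(matrix):
    """Number of distinct partners per node.

    A pair counts as connected when either direction holds a non-zero value
    (assumption: direction is ignored when counting partners).
    """
    n = len(matrix)
    return [
        sum(
            1
            for j in range(n)
            if j != i and (matrix[i][j] > 0 or matrix[j][i] > 0)
        )
        for i in range(n)
    ]


# Hypothetical directed mention matrix for three web domains.
mentions = [
    [0, 3, 0],
    [0, 0, 0],
    [2, 0, 0],
]
print(partner_counts(mentions))
```

Because the count ignores how many mentions or papers a pair shares, it captures the breadth of an institution's relationships rather than their intensity.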

We nonetheless found some important exceptions. For example, the Foundation for the Development of Genomic and Proteomic Research only co-authored articles with 2 other institutions whereas it is hyperlinked with 17 different institution web domains. A similar situation occurred with the Thematic Network of Cooperative Cancer Research (which shares academic publications with 4 institutions though it is hyperlinked with 31). Conversely, the Cardiovascular Research Centre shares publications with 17 institutions while it is hyperlinked with only 7; and the Pere Virgili Health Research Institute publishes with 18 different institutions whereas it is connected with only 7 via the web.

Homogeneous set of research centres

The above correlations were obtained from the interlinking and co-authorship patterns of the 46 institutions that met our study requirements, i.e. between institutions not only with different lines of research but also with different institutional goals. In order to observe behaviour amongst a small set of homogeneous institutions, we analysed the institutional relationship patterns between the various Biomedical Research Networking Centres (CIBER), now integrated according to their field of research within the CIBER ISCIII consortium (see Footnote 3). Figure 7 maps the co-authorship (left) and interlinking (right) networks for the different centres.

Fig. 7 Co-authorship (left) and interlinking (right) networks for Spanish CIBER centres (n = 9; Fruchterman Reingold algorithm). Co-authorship node size: papers published; interlinking node size: page count

The Spearman correlation between the number of articles published jointly and the number of links between each pair of centres is very low (r = 0.3, α = 0.01). The correlation in terms of size (number of articles published and page count) is equally modest and, moreover, negative (r = −0.4).
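The Spearman coefficient is the Pearson correlation computed on ranks, which makes it less sensitive to the few very large counts typical of link data. A minimal sketch follows, assuming tied values receive their average rank; the helper names are hypothetical and no data from the study are reproduced.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    centre = (len(x) + 1) / 2          # mean rank, also valid with ties
    cov = sum((a - centre) * (b - centre) for a, b in zip(rx, ry))
    var_x = sum((a - centre) ** 2 for a in rx)
    var_y = sum((b - centre) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical counts for four pairs of centres.
joint_papers = [1, 2, 3, 4]
links = [10, 20, 30, 40]
print(spearman(joint_papers, links))
```

With nine centres there are only 36 unordered pairs, so a coefficient computed on such a sample is sensitive to a handful of pairs.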

Discussion

Web data results should always be treated with caution, especially page count and web visibility metrics. The inconsistencies observed and summarised by Wilkinson and Thelwall (2013) apply to any web research that relies on commercial search engines.

The number of linking root domains seems a more robust and appropriate metric for measuring web visibility than total links received, though we must take into account the fact that the two actually measure complementary views of visibility (breadth and intensity, respectively). As regards MozRank, while it also seems robust, this metric suffers from a lack of international coverage that currently prevents it from being used for worldwide analyses that include subdomains and subdirectories.

Selective linking through URL mentions (via Google) should be given particular attention. The manual gathering of these data permitted the authors to identify and classify up to seven types of errors associated with this metric, some of them previously detected in the literature (Thelwall and Sud 2011; Ortega et al. 2014; Sud and Thelwall 2014). Nonetheless, we detail them below to give a better understanding of our study:

(a) High variability

For example, the difference in the number of mentions from <imim.es> towards <prbb.org> between the September (0 entries) and November (77) data collections is large. This may be the result of collaboration between the institutions, the follow-up of an ad hoc event, or improvements or changes made to the coverage of the search engine.

(b) The detection of email mentions

For example, for the query <“cnic.es” site:iccc.cat>, Google detects the mention “xxx@cnic.es”. Although this is not a false positive, since it is a mention as such, the nature of an email is different to that of a hyperlink to <cnic.es>, which must be taken into account.

(c) The search string of the target does not appear in the source

For example, the query <“irbbarcelona.org” site:idibgi.org> produced 7 results. After a manual check, we found that the string “irbbarcelona.org” did not appear in any of the results, which means that, in this case, the results were false positives. This is because the source (<idibgi.org>) is a dynamic web resource whose contents vary periodically, so the string probably appeared when the crawler indexed the page but had since disappeared, and the index had not yet been updated when the query was made.

(d) Errors in string detection

Despite the use of quotation marks to delimit the strings to be detected, the search engine did not properly identify spaces, dots and other symbols and diacritics, thus returning false positives. This was a particularly problematic issue for domains with the <.es> suffix, since “es” is Spanish for “is” (third-person singular form of the verb “to be”). Hence phrases like “Idibell es un centro (is a centre)…”, “PREDIMED es el acrónimo (is an acronym)…” or “El Instituto de Salud Carlos III (ISCIII) es un (is a)…” are detected as <idibell.es>, <predimed.es> or <isciii.es> respectively. These domains consequently obtained slightly inflated results.

(e) Different language versions

Those websites with versions in different languages (especially in regions with co-official regional languages) generated a result for each of the languages used. For example, the query <“recava.com” site:idibaps.org> generated 3 results, which are actually the same, but repeated in the three language versions of the website <idibaps.org> (Spanish, Catalan and English).

(f) Targets in second- and third-level subdomains

Occasionally, when measuring a particular subdomain, we are actually measuring a subdomain hosted within it. For example, all the results returned for the query <“san.gva.es” site:isglobal.org> are actually hosted on <sp.san.gva.es>, which is an entity for which data were gathered separately.

(g) Target and source hosted by the same web domain

When considering distinct units hosted on different subdomain levels in the same web domain, the system is unable to calculate mentions among them with precision.

For example, when counting mentions to <bioef.org>, the system returned all strings containing these terms, including all existing <xxx.bioef.org> subdomains, e.g. <oiker.bioef.org>. Therefore, it is impossible to correctly determine the number of mentions that are solely directed towards the general unit and, more specifically, those from the lower unit to the general unit (and vice versa). This effect particularly affects the performance of the Ministry of Health, Social Services and Equality (<msssi.gob.es>), which assumes a more central role in the area of public health: it contains a large number of services, centres and institutions, and also includes navigation links between all of them.

These limitations especially affect the use of these indicators to evaluate performance (the exact number of references to a URL). However, if used for relational purposes (to determine whether the relationship between “a” and “b” is greater than that between “a” and “c”, for example), these limitations are minimised since all URLs are subject to the same error, and thus the error is statistically dispersed. Therefore we consider that the method efficiently detects combinations in which the intensity of the relationship is high despite potential specific errors. Moreover, even assuming a minimal rate of error, given the generally low number of URL mentions obtained, the effect on network indicators is practically imperceptible, as they remain unaffected by changes in the intensity of the relationship.
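Some of the error types described above, namely (b), (d) and (g), could in principle be flagged automatically by checking the retrieved text for the literal target string and inspecting what precedes it. The sketch below is only an illustration of that idea and not the manual procedure followed in this study; the function `classify_mention`, its categories and the treatment of a leading “www.” as an exact match are assumptions.

```python
import re


def classify_mention(text, domain):
    """Classify the first literal occurrence of `domain` in `text`.

    Returns 'email', 'subdomain', 'exact' or 'none'.
    """
    match = re.search(
        r"(\S*?)" + re.escape(domain) + r"(?![\w-])", text, re.IGNORECASE
    )
    if match is None:
        return "none"  # e.g. "Idibell es un centro" has no literal "idibell.es"
    # Keep only the host-like characters immediately before the domain.
    host_prefix = re.search(r"[\w.@+-]*$", match.group(1)).group(0).lower()
    if "@" in host_prefix:
        return "email"        # error type (b)
    if host_prefix in ("", "www."):
        return "exact"
    if host_prefix.endswith("."):
        return "subdomain"    # error types (f) and (g)
    return "none"             # the string is part of a longer, different name


print(classify_mention("Idibell es un centro", "idibell.es"))
print(classify_mention("Contact: xxx@cnic.es", "cnic.es"))
print(classify_mention("See http://oiker.bioef.org/home", "bioef.org"))
```

A check of this kind operates on the text of the retrieved pages, so it cannot address error types (a) and (c), which stem from changes in the search engine index over time.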

Conclusions

Institutional relationships between Spanish public health entities have been identified both through link analysis and co-authorship methods. We found a modest but statistically significant correlation between the number of selective links between a pair of entities and the number of academic articles published jointly by them (r = 0.5).

This value suggests that the two methods offer complementary, though not interchangeable, views in the identification of institutional relationships. The existence of institutions with scarce academic collaboration but an intense web relationship (reflecting a relationship based on informational and navigational links) and, conversely, institutions with an intense academic relationship but insufficient web connection (reflecting the existence of inadequate policies for creating web content), reinforces this point.

Nonetheless, if we perform the comparison by means of the degree indicator (interlinking and co-authorship degree), the correlation clearly improves (r = 0.7). This fact suggests that hyperlinks are better suited to identifying entities that, academically, collaborate intensively with many others than to measuring with precision the intensity of a particular relationship.

Moreover, both interlinking and co-authorship methods detect the existence of strong subnets of geographically distributed nodes (especially the Catalan entities) as well as their high connectivity with the main national network nodes (subnet of nodes distributed according to dependence on national government, in this case Spain). Each method is additionally capable of identifying different connection purposes within each subnet. In this specific case, whereas the Catalan network is as well connected through hyperlinks as it is through academic publications, the national network is better connected through hyperlinks. This fact may be due to the role adopted by hyperlinks at a national level, mainly devoted to navigational purposes.

In any case, the general network of web relationships between the analysed Spanish public health entities has very little cohesion, is highly distributed and inadequately connected, and the relationships between nodes are not particularly intense. The dissociation between page count and web visibility metrics provided by the web impact analysis (and reflected in the moderate correlation obtained between them) may suggest that the creation of web content is not influencing the attraction of external hyperlinks, which may explain the generally low cohesion of the network.

The widespread use of unsuitable URL syntaxes (subdirectories and unfriendly forms of subdirectory and combined URL syntaxes) may also have an influence. These URLs, despite functioning on a technical level, do not facilitate the generation of links by users or correct quantification by search engines.

The web relationships between a parent institution and its subsidiaries generate significant distortion because of the large number of existing navigation links. This means that link analysis reinforces institutional relationships based on informative communication and access to content. However, this is the only way to approach a multilevel analysis using webometric indicators (since subdomains reflect relationships of institutional affiliation). Suitable filtering of these links could provide a greater granularity between these parent/subsidiary relationships, although this task is tedious and difficult to automate. Therefore, relationships between institutions with different web domains will always be more methodologically precise.

Finally, the analysis of small samples of specialised research centres (in this case, the CIBER centres) has shown that institutional relationships detected by each method may differ even among thematically (biomedical research) and functionally related centres (all linked to the Spanish National Health System). There may be web relationships (functional, administrative, communicative, etc.) without there being a scholarly relationship (the centres do not publish articles jointly). However, the existence of scholarly relationships that lack correspondence on the web is evidence of gaps in the creation and adequate dissemination of content in the corresponding centres. Hence the combination of the two methods of applied analysis not only serves to identify a wider range of institutional relationships but also to highlight gaps in the generation of content that is indicative of existing relationships.

Greater planning of web policies for the creation and dissemination of the web content of entities related to the Spanish healthcare sector would result in an improvement, not only in web impact indicators, but also in obtaining a richer and more intense network of relationships. This would more accurately reflect the potential relationships between the various entities (including academic relationships) and minimise the limitations of URL mentions. Finally, social metrics reflecting institutional relationships—and not based on hyperlinks—may provide complementary insights; this is an aspect that will therefore be considered in future studies.