1 Introduction

With the rapid digitization of scholarly communication, anyone can now obtain scholarly information easily and immediately through the Web. In such an environment, the Digital Object Identifier (DOI) is essential for identifying each electronic document. The DOI is the best-known international standard infrastructure for assigning persistent, unique identifiers to objects of any type [1]. As of November 2015, the total number of DOIs is approximately 130 million [2]. CrossRef is the largest DOI Registration Agency [3]. It reports that the top four referrers of DOIs assigned by CrossRef (i.e., CrossRef DOIs) are academic literature databases (Web of Knowledge, Serials Solutions, ScienceDirect, and Scopus) and that the fifth-largest referrer is Wikipedia [4]. Wikipedia is a free, collaboratively edited, multilingual online encyclopedia. As of February 2016, there are 246 language Wikipedias, including English (enwiki), Japanese (jawiki), and Chinese (zhwiki) [5]. According to Alexa Internet, Wikipedia was the seventh most viewed website in the world in 2015 [6].

Therefore, as typified by Wikipedia, open websites seem to build and strengthen a bridge between Web users and scholarly information through DOI links. Furthermore, these connections presumably help scholarly information to be put to the best use, not only by researchers and specialists but also by a wider audience such as students and the general public.

However, few studies have attempted to analyze the scholarly information, including DOI links, referenced on Wikipedia. Several questions remain open. Which publishers or academic societies have content that is highly referenced on Wikipedia? How does the referenced content differ among Wikipedia language versions? How and when was the scholarly information added to Wikipedia? These viewpoints are important for understanding the characteristics and significance of DOI links referenced on Wikipedia.

Thus, we aim to answer the following two research questions:

  • RQ1. Which publishers or academic societies have content that is highly referenced on Wikipedia?

  • RQ2. Does the highly referenced content vary among Wikipedia languages, or is it largely similar across languages?

To answer these research questions, the present study analyzes DOI links on enwiki, jawiki, and zhwiki and reveals which kinds of scholarly information are referenced on these Wikipedias. We target enwiki, jawiki, and zhwiki for the following reasons:

  • Because enwiki is the largest language version of Wikipedia, it is meaningful to identify its influence on jawiki.

  • If similarities or common points are observed between jawiki and enwiki, we should check whether such similarities with enwiki are also seen on other language Wikipedias or are peculiar to jawiki.

  • jawiki and zhwiki are similar in that both are written in Asian languages, and they are comparable in number of articles. Thus, we also use zhwiki.

2 Related Work

2.1 Analyses of Academic/scientific Citations on Wikipedia

Nielsen [7] analyzed the journals referenced in English Wikipedia (as of April 2007) and examined their correlation with the Journal Citation Reports Impact Factor, a measure of journal influence. The most referenced journals were Nature and Science, and journals in the field of astronomy were highly referenced. Not all of the referenced journals had high Impact Factors.

Using DOIs, Lin & Fenner [8] analyzed references to articles published in the open access journals of PLOS (Public Library of Science) on the top 25 language versions of Wikipedia (as of March 2014). They found that 4.13 % of all PLOS articles at the time were referenced on Wikipedia, and 47 % of those were referenced on non-English Wikipedias. They argued that “the number of referenced PLOS articles on Wikipedia highly correlates with the number of active users that are associated with that Wikipedia”.

The “Extract academic citations from Wikipedia” tool [9] extracts identifiers (such as DOI, PubMed, ISBN, and arXiv) from Wikipedia. The tool was developed by Halfaker of the Wikimedia Foundation. Halfaker et al. analyzed and reported the number of each identifier on English and Dutch Wikipedias (as of June 2015). The most referenced identifier was ISBN, and the second most referenced was DOI; their numbers change over time [10, 11].

The “Wikipedia DOI citation live stream” [12] is a service that collects DOI links on Wikipedia and shows them as real-time streams. This service displays which DOI links are referenced from which Wikipedia pages.

The “Wikipedia Cite-o-Meter” [13] is a service developed by Wikimedia Tool Labs. This service shows the reference status on a per-prefix basis across 100 language versions of Wikipedia. For example, it illustrates how PLOS content (prefix: 10.1371) is referenced on Japanese and English Wikipedias.

In summary, past studies have investigated scholarly information on Wikipedia from the viewpoint of journal titles [7] and that of specific publisher’s contents [8]. Although these studies showed interesting results, investigations from viewpoints of publishers and academic societies seem to be lacking. On the other hand, existing services focus on the number of DOI links on Wikipedias, but it is not clear how they overlap among different Wikipedia languages.

2.2 DOI Usage Analyses by CrossRef

CrossRef analyzed its DOI access logs and reported the referrers. According to the CrossRef Blog [14], as of 2014, Wikipedia was the eighth-largest referrer, which shows that users actually click DOI links. CrossRef also reported that, as of 2015, Wikipedia was the fifth-largest referrer, following four academic literature databases (Web of Knowledge, Serials Solutions, ScienceDirect, and Scopus) [4]. In addition, the top 10 most frequently accessed Wikipedias were (in decreasing order) English, English (mobile), German, Japanese, Spanish, French, Russian, Chinese, Italian, and Portuguese.

The “DOI Chronograph” [15] is a service supplied by CrossRef Labs that reports on the referrers of CrossRef DOIs. It shows the number of clicks by DOI link, by referrer domain name, and by referrer sub-domain name. However, the service is based not on the full access log but on a small sample of the data.

2.3 Analyses of Wikipedia External Links

Tzekou et al. [16] analyzed external links on English Wikipedia (as of October 2009) to investigate their decay and distribution across articles. Their results showed that roughly 18.3 % of external links were dead. However, they noted that the majority of external links on Wikipedia were reachable, because very few articles contained a considerable number of dead links and approximately 77.3 % of Wikipedia articles had no dead links.

Sato et al. [17] investigated the characteristics of external links and dead links on Japanese Wikipedia (as of April 2011). They found that (1) approximately 11 % of external links were dead, (2) content hosted under the domains edu, co.jp, and go.jp had a high rate of access failures, and (3) many access failures occurred for content hosted on newspaper-company websites.

3 About DOI

The DOI is an infrastructure that provides resolvable, persistent, and interoperable links. Each DOI consists of a prefix, a slash (/), and a suffix.

A prefix is assigned to a particular DOI registrant, such as a publishing company or an academic society. DOI registrants assign suffixes to their content and register DOIs through DOI Registration Agencies (RAs). There are 10 RAs.

Some RAs that handle scholarly resources (such as journal articles, books, and datasets) are CrossRef, JaLC, ISTIC, and DataCite. JaLC is the only RA in Japan, ISTIC is a Chinese RA, and DataCite is an RA for research data. As of April 2016, there are 76,944,396 DOIs registered by CrossRef (CrossRef DOIs); 23,422,068 DOIs by ISTIC (ISTIC DOIs); 6,614,478 DOIs by DataCite (DataCite DOIs); and 1,401,144 DOIs by JaLC (JaLC DOIs).

The DOI also provides hyperlinks (DOI links), which are formed by appending a DOI to “http://doi.org/” or “http://dx.doi.org/”. A DOI link redirects to the URI of the original content.
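The structure described in this section can be illustrated with a minimal Python sketch that splits a DOI link into its prefix and suffix. The helper name `split_doi_link` and the sample DOIs are hypothetical and used only for illustration; they are not part of the tooling used in this study.

```python
from urllib.parse import urlparse, unquote

# Resolver hosts named in the text.
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def split_doi_link(url):
    """Return (prefix, suffix) for a DOI link, or None if it is not one.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    if parsed.hostname not in DOI_HOSTS:
        return None
    doi = unquote(parsed.path.lstrip("/"))
    # The prefix ends at the first slash; the suffix may contain further slashes.
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        return None
    return prefix, suffix


# Made-up suffix under the PLOS prefix mentioned in Sect. 2.1.
print(split_doi_link("http://dx.doi.org/10.1371/example.suffix"))
```

The sketch ignores query strings and fragments, so DOIs that contain characters such as "?" or "#" would need extra handling.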

4 Materials and Methods

4.1 Datasets

In this study, we analyze DOI links on enwiki, jawiki, and zhwiki. To extract DOI links from Wikipedia (together with the page and namespace in which each link appears), we used the dump files of March 4, 2015 (English), March 13, 2015 (Japanese), and March 4, 2015 (Chinese).

Specifically, we extracted external links whose URL in the el_to column of externallinks.sql contained “doi.org”, and interwiki links whose prefix in the iwl_prefix column of iwlinks.sql equaled “doi”. Thereafter, we removed non-DOI links. Table 1 shows an overview of our dataset.

Table 1. Dataset overview
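The two extraction conditions and the removal of non-DOI links might be sketched as follows. This is an illustration under stated assumptions, not the actual script used in this study: the regular expression is a deliberately loose guess at the shape of a DOI, the function names are hypothetical, and the `iwl_title` argument assumes that the link target of an interwiki row is available alongside `iwl_prefix`.

```python
import re

# Loose DOI shape: "10.", a registrant code, a slash, and a non-empty suffix.
# Assumption for illustration; the study's actual filter is not specified.
DOI_PATTERN = re.compile(r"10\.\d{4,9}(?:\.\d+)*/\S+")


def doi_from_external_link(el_to):
    """External-link condition: the URL must contain "doi.org"."""
    if "doi.org" not in el_to:
        return None
    match = DOI_PATTERN.search(el_to)
    return match.group(0) if match else None  # None marks a non-DOI link


def doi_from_interwiki_link(iwl_prefix, iwl_title):
    """Interwiki condition: the prefix must equal "doi"."""
    if iwl_prefix != "doi":
        return None
    match = DOI_PATTERN.fullmatch(iwl_title)
    return match.group(0) if match else None


# Made-up rows: the second URL contains "doi.org" but is not a DOI link.
for url in ["http://dx.doi.org/10.1016/example",
            "http://www.doi.org/index.html"]:
    print(url, "->", doi_from_external_link(url))
```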

4.2 Methods

In this study, we performed a detailed analysis of DOI links on each language Wikipedia through the following three analyses:

Prefix-Level Analysis. We counted each prefix to clarify which registrant’s content is most commonly referenced.
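As a rough illustration of the prefix-level count, the sketch below tallies prefixes with Python's `collections.Counter`. The function name `count_prefixes` and the sample DOIs are made up for the example; only the prefixes themselves appear in this paper.

```python
from collections import Counter


def count_prefixes(dois):
    """Count how many DOI links fall under each registrant prefix."""
    return Counter(doi.split("/", 1)[0] for doi in dois)


# Made-up DOIs; one entry per DOI link, so repeated links count repeatedly.
sample = ["10.1016/a", "10.1016/b", "10.1038/c", "10.1002/d"]
counts = count_prefixes(sample)
total = sum(counts.values())
for prefix, n in counts.most_common(5):
    print(f"{prefix}\t{n}\t{n / total:.1%}")
```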

Overlap Analysis of Unique DOI Links Between Two Language Wikipedias. To analyze the overlap of unique DOI links between two different language Wikipedias, we used their difference set and their product set (i.e., intersection). The former consists of DOI links referenced in only one of the two languages; the latter consists of those referenced in both languages (as Fig. 1 illustrates).

Fig. 1. Overview of the overlap analysis of unique DOI links between two language Wikipedias
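The difference set and product set map directly onto set operations. The following sketch, with a hypothetical function name and made-up DOIs, computes both sets together with their proportion relative to the first-named Wikipedia.

```python
def overlap(first, second):
    """Return the difference set, product set, and their shares of `first`."""
    first, second = set(first), set(second)
    only_first = first - second   # difference set: in `first` but not in `second`
    both = first & second         # product set: in both
    n = len(first)
    return {
        "difference": only_first,
        "product": both,
        "difference_ratio": len(only_first) / n if n else 0.0,
        "product_ratio": len(both) / n if n else 0.0,
    }


# Made-up unique DOI links for two language versions.
jawiki = {"10.1016/a", "10.1016/b", "10.1038/c", "10.1002/d"}
enwiki = {"10.1016/a", "10.1016/b", "10.1038/c", "10.1021/e"}
result = overlap(jawiki, enwiki)  # corresponds to the "jawiki-enwiki" case
print(result["product_ratio"], result["difference_ratio"])
```

Because DOI names are case-insensitive, a real run would probably normalize case before building the sets; the sketch leaves that step out.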

Comparison of DOI Links Through Interlanguage Links and Page-Revision Histories. Some DOI links seemed to have been added to enwiki before they were first added to jawiki or zhwiki pages. Thus, we extracted common DOI links through the following four steps:

  • STEP1: We extracted DOI links, written in main namespace pages on each language Wikipedia (see Fig. 2: “ALL”).

  • STEP2: We extracted the pages that have interlanguage links [18] to enwiki (i.e., pages with a corresponding enwiki page) and the DOI links written on these pages (see Fig. 2: “The pages with a langlink to enwiki”).

  • STEP3: We extracted the pages that share at least one DOI link with the corresponding enwiki page, and the DOI links written on these pages (see Fig. 2: “The pages with one or more common DOI links to enwiki”).

  • STEP4: We extracted the pages that share 10 or more DOI links with the corresponding enwiki page (see Fig. 2: “The pages with common DOI links greater than or equal to 10”). Figure 3 shows an example of a page with common DOI links between jawiki and enwiki. The threshold of 10 or more shared DOI links was set on the basis of data observation.
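STEP2 to STEP4 can be sketched as successive filters over the STEP1 output. The data layout below (dictionaries mapping page titles to sets of DOIs, plus a dictionary of interlanguage links) and the function name are assumptions made for this illustration, not a description of the actual implementation.

```python
def common_doi_pages(local_pages, enwiki_pages, langlinks, threshold=10):
    """Sketch of STEP2 to STEP4 for one non-English Wikipedia.

    local_pages / enwiki_pages: page title -> set of DOIs (STEP1 output).
    langlinks: local page title -> title of the corresponding enwiki page.
    """
    # STEP2: keep pages that have an interlanguage link to enwiki.
    linked = {t: d for t, d in local_pages.items() if t in langlinks}
    # STEP3: keep pages sharing at least one DOI with the corresponding page.
    common = {}
    for title, dois in linked.items():
        shared = dois & enwiki_pages.get(langlinks[title], set())
        if shared:
            common[title] = shared
    # STEP4: keep pages sharing `threshold` or more DOIs.
    heavy = {t: s for t, s in common.items() if len(s) >= threshold}
    return linked, common, heavy


# Made-up pages; a threshold of 2 keeps the example small.
ja = {"A": {"10.1016/a", "10.1016/b"}, "B": {"10.1038/c"}, "C": {"10.1002/d"}}
en = {"A_en": {"10.1016/a", "10.1016/b"}, "B_en": {"10.1021/e"}}
links = {"A": "A_en", "B": "B_en"}
linked, common, heavy = common_doi_pages(ja, en, links, threshold=2)
print(sorted(linked), sorted(common), sorted(heavy))
```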

We then analyzed whether the extracted common DOI links were added to jawiki or zhwiki through translation from enwiki. We used page-revision histories [19] to obtain the edit summary and timestamp of each edit. We judged manually whether the edit summary describes the edit as a translation from enwiki. Moreover, we checked whether each common DOI link was first added to the enwiki page earlier than to the corresponding jawiki or zhwiki page. For this analysis, we used the Wikipedia API [20].
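The timestamp comparison could be sketched as below. The request parameters follow the current MediaWiki Action API (`prop=revisions`), which may differ from the interface available when this study was conducted. The HTTP request itself is omitted, and the flattened revision dictionaries with `timestamp`, `comment`, and `text` keys are an assumed intermediate format, as are all function names.

```python
from datetime import datetime


def revision_query_params(title):
    """Parameters for listing a page's revisions, oldest first."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|comment|content",
        "rvslots": "main",
        "rvdir": "newer",
        "rvlimit": "50",
    }


def first_added(revisions, doi):
    """Return (timestamp, edit summary) of the earliest revision containing the DOI."""
    # ISO 8601 timestamps in UTC sort correctly as plain strings.
    for rev in sorted(revisions, key=lambda r: r["timestamp"]):
        if doi.lower() in rev["text"].lower():
            when = datetime.strptime(rev["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            return when, rev["comment"]
    return None


def added_to_enwiki_first(en_revisions, local_revisions, doi):
    """True if the DOI appeared on the enwiki page before the local page."""
    en = first_added(en_revisions, doi)
    local = first_added(local_revisions, doi)
    if en is None or local is None:
        return None
    return en[0] < local[0]


# Made-up revision histories for one pair of corresponding pages.
en_revs = [{"timestamp": "2010-01-01T00:00:00Z", "comment": "add ref",
            "text": "doi:10.1016/example"}]
ja_revs = [{"timestamp": "2009-06-01T00:00:00Z", "comment": "new page",
            "text": "no references yet"},
           {"timestamp": "2012-05-01T00:00:00Z", "comment": "translated from en",
            "text": "doi:10.1016/EXAMPLE"}]
print(added_to_enwiki_first(en_revs, ja_revs, "10.1016/example"))
```

The edit summary returned by `first_added` is what would then be inspected by hand for a mention of translation.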

Fig. 2. A workflow of comparison of DOI links between different Wikipedia languages

Fig. 3. An example of common DOI links between jawiki and enwiki

5 Results and Discussion

5.1 Overview

Table 2 shows the total number of DOI links for each RA. Most DOI links in these Wikipedias are CrossRef DOIs. The second most-referenced DOI links are mEDRA DOIs in enwiki, JaLC DOIs in jawiki, and ISTIC DOIs in zhwiki. Note that JaLC DOIs are not referenced in zhwiki, and ISTIC DOIs are not referenced in jawiki. In other words, scholarly content from Japan tends to be referenced in jawiki, and content from China tends to be referenced in zhwiki.

Table 2. The number of total DOI links for RAs

5.2 Prefix-Level Analysis

Tables 3, 4, and 5 show the top five prefixes in enwiki, jawiki, and zhwiki, respectively. The top-ranked prefix in all three Wikipedias is 10.1016 (Elsevier BV), which accounts for about 15 % of DOI links. Nature Publishing Group (prefix: 10.1038) and Wiley-Blackwell (prefixes: 10.1002 and 10.1111) are also common registrants among the top five prefixes.

Springer Science+Business Media (prefix: 10.1007) and American Chemical Society (prefix: 10.1021) are common registrants in two of the three languages. These findings suggest that a few common registrants host much of the content referenced in these Wikipedias.

Table 3. Top-5 Prefixes in enwiki (n=1,474,230)
Table 4. Top-5 Prefixes in jawiki (n=28,799)
Table 5. Top-5 Prefixes in zhwiki (n=36,669)

5.3 Overlap Analysis of Unique DOI Links Between Two Language Wikipedias

Table 6 shows the overlap of unique DOI links between each pair of Wikipedias. For instance, “jawiki-enwiki” refers to the pair of jawiki and enwiki. Its difference set consists of the DOI links that are written in jawiki but not in enwiki, and its product set consists of the DOI links that are written in both jawiki and enwiki. In this example, each percentage is the proportion of the set relative to the unique DOI links in jawiki.

According to the product sets, 79 % of the DOI links in jawiki and 93 % of those in zhwiki overlap with enwiki. Although zhwiki has more DOI links than jawiki, its overlap ratio with enwiki is also higher. The product sets “enwiki-jawiki” and “enwiki-zhwiki” are small relative to enwiki (about 5 %). The product sets of “jawiki-zhwiki” relative to jawiki and “zhwiki-jawiki” relative to zhwiki are also small, so the overlap ratios between jawiki and zhwiki are low. These findings indicate that, in the cases of “jawiki-enwiki” and “zhwiki-enwiki,” many DOI links might have been added by translating from enwiki. On the other hand, few DOI links appear to have been added by translating from jawiki to zhwiki, or vice versa.

Table 6. Results of overlapping analysis of unique DOI links between two language Wikipedias

5.4 Comparison of DOI Links Through Interlanguage Links and Page-Revision Histories

Table 7 shows the number of DOI links and pages with common DOI links between enwiki and the other Wikipedia languages, obtained through the workflow described in Fig. 2. Table 8 reveals that about 88 % of the common DOI links in the corresponding pages in jawiki were added by translating from enwiki.

Table 7. The number of DOI links and pages concerning common DOI links between enwiki and other Wikipedia languages

Thus, many DOI links in jawiki were added by translating from enwiki. In zhwiki, by contrast, about 85 % of DOI links were added with no information about translation in the edit summaries, and only approximately 12 % were identified as translations from enwiki. This discrepancy seems to stem from a difference between the translation guidelines of the two Wikipedias.

Table 8. The number of DOI links identified as translations from enwiki or another language page by using edit summaries

Figure 4 is an example of an edit summary recorded on the “Lion” page of jawiki. It states that the edit is a translation of the Lion page of enwiki and specifies the revision of the original page.

Fig. 4. An example of edit summary that mentions translation from enwiki

While the translation guideline of jawiki [21] requires editors to state the source language and article when translating from other Wikipedias, the translation guideline of zhwiki [22] does not. Therefore, in zhwiki, it is difficult to identify DOI links that were added through translation from enwiki.

Table 9. The number of DOI links that were added to enwiki before they were first added to the corresponding page

Table 9 shows the number of DOI links that were added to enwiki before they were first added to the corresponding jawiki or zhwiki page. About 98 % of the DOI links in jawiki and about 99 % of those in zhwiki were added to enwiki first. Thus, the majority of DOI links in zhwiki are also thought to derive from enwiki.

6 Conclusion

In this study, we analyzed DOI links on English, Japanese, and Chinese Wikipedias to answer the following two research questions:

  • RQ1. Which publishers or academic societies have content that is highly referenced on Wikipedia?

Most DOI links in these Wikipedias were CrossRef DOIs. The second most-referenced DOI links in jawiki were JaLC DOIs, whereas those in zhwiki were ISTIC DOIs. JaLC DOIs are uniquely referenced in jawiki, and ISTIC DOIs tend to be referenced in zhwiki. In terms of the analysis of prefixes, Elsevier BV is the largest registrant in all languages.

Also, Nature Publishing Group and Wiley-Blackwell are commonly referenced. The content hosted by these registrants is shared among the Wikipedia communities.

  • RQ2. Does the highly referenced content vary among Wikipedia languages, or is it largely similar across languages?

The overlap analysis showed that jawiki and zhwiki both share DOI links with enwiki at a similarly high rate. The analysis of revision histories showed that the DOI links were added to pages in enwiki before they were added to the corresponding pages in jawiki and zhwiki.

These results suggest that the majority of DOI links in jawiki and zhwiki were added by translating from enwiki. They also imply that DOI links in Wikipedia may be counted multiple times in altmetrics.