Introduction

With the establishment of the digital object identifier (DOI) system in 1997 (Paskin 1999), managed by the International DOI Foundation (IDF) (Chandrakar 2006; Paskin 1999, 2010; Simmonds 1999), DOIs have been uniquely assigned to many kinds of digital objects, such as publications (Boudry and Chartron 2017; Gorraiz et al. 2016), illustrations or tables (Wang 2007), scientific data (Neumann and Brase 2014) and so on. A DOI name is a case-insensitive alphanumeric string consisting of two parts separated by a forward slash (Sidman and Davidson 2001; Simmonds 1999): (a) a prefix beginning with the numeral 10, assigned by the IDF or by DOI registration agencies, and (b) a suffix assigned by the registrant.

It is well known that comprehensive bibliographic databases, such as Scopus and Web of Science (WoS), have greatly promoted the development of scientometrics and informetrics. However, one should keep in mind that these databases are not free of errors (Jacso 2006; Franceschini et al. 2013, 2014, 2016), although data quality has improved significantly over the past decade. Buchanan (2006) divided database errors into two categories, (a) author errors and (b) database mapping errors, and gave illustrative examples of each type.

So far, errors have been found in almost every field of a publication record, such as author names (Buchanan 2006), author addresses (Liu et al. 2018), publication year (Buchanan 2006), omitted citations (Franceschini et al. 2014), funding acknowledgements (Tang et al. 2017), etc. Some documents are even missing from the bibliographic databases altogether (Krauskopf 2019). The DOI field is no exception. Franceschini et al. (2015) revealed that quite a few single DOI names were incorrectly assigned to multiple publications indexed in the Scopus database. Incorrect DOI names in the WoS database have also been reported by Zhu et al. (2019), Zhu et al. (2019) and Huang and Liu (2019).

By definition, each DOI name should be unique and must identify one and only one entity (Paskin 1999). Thus, DOI names can be used to identify and disambiguate scientific publications. However, DOI errors pose challenges for collecting data from different sources without unwanted duplicate entries (Valderrama-Zurián et al. 2015), for applying new metrics such as altmetrics (Jobmann et al. 2014; Haustein et al. 2015), for accurately extracting thematic structures (Xu et al. 2018), and so on. Apart from the DOI errors described in Franceschini et al. (2015), Zhu et al. (2019) and Zhu et al. (2019), it remains unknown whether there are other types of DOI errors, how often each type occurs, and whether these errors can be corrected automatically. In this work, the DOI errors of cited references in the WoS database are analyzed in depth, and a cleaning approach is put forward to reduce them.

Dataset

The bibliographic data in the gene editing field were collected from the WoS core database on 25 January 2018 through the library of Beijing University of Technology. The following search strategy was used: “TS = (gene edit*) OR TS = (crispr) OR TS = (clustered regularly interspaced short palindromic repeats)”. The language was limited to English, the document types to article, proceedings paper and review, and the publication years to 2000–2017. The publications were downloaded as full records with cited references in the tab-delimited file format. Surprisingly, two records in the retrieved results share the same DOI (10.3389/FIMMU.2017.00351) but have different WoS IDs (WOS:000398414900001 and WOS:000399835000001). On closer manual examination, these two records were found to refer to the same publication, so one of them was removed. In total, the dataset contains 13,909 publications; Table 1 reports their distribution by publication year.

Table 1 Distribution of number of publications over year for gene editing dataset

To discover the DOI errors of cited references, one has to delve into the reference list, i.e. the CR field in the WoS database. Each cited publication in the reference list usually shows only the following fields: the first author’s name (family name and given-name initials), publication year, abbreviated publication venue (e.g., journal or conference name), volume number preceded by the character “V”, starting page number preceded by the letter “P”, and DOI name preceded by “DOI ”. Here, “ ” denotes the whitespace character. If a field (such as the starting page number) is unavailable for a publication, it is simply omitted from the cited reference. The fields are separated by “, ”, and the cited references are delimited by “; ”. A snippet of cited references is illustrated in Fig. 1. It is worth mentioning that multiple DOI names can be attached to the same cited reference (cf. the second one in Fig. 1); in this case, the DOI names are enclosed in square brackets and delimited by “, ”. If a cited reference has no DOI name, the prefix “DOI ” is omitted as well (cf. the third one in Fig. 1).

Fig. 1 A snippet of the cited references in the WoS database
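
The layout of the CR field described above can be illustrated with a minimal Python sketch. The function names and the sample references are hypothetical (they are not taken from Fig. 1 or from the dataset), and the sketch assumes that the substrings “; ” and “, DOI ” never occur inside a DOI name.

```python
def split_cited_references(cr_field):
    """Split a WoS CR field into cited references (delimiter "; ")."""
    return [ref.strip() for ref in cr_field.split("; ") if ref.strip()]


def extract_dois(reference):
    """Return the literal DOI name(s) attached to one cited reference."""
    # The DOI part, if any, is the last field and is introduced by "DOI ".
    _head, sep, tail = reference.partition(", DOI ")
    if not sep:
        return []  # no "DOI " prefix: the reference carries no DOI name
    tail = tail.strip()
    if tail.startswith("[") and tail.endswith("]"):
        # Multiple DOI names: "[doi1, doi2, ...]"
        return [doi.strip() for doi in tail[1:-1].split(", ")]
    return [tail]


# Invented references that only mimic the layout of the CR field.
sample = (
    "Doe J, 2015, J HYPOTHETICAL RES, V1, P10, DOI 10.1234/JHR.2015.001; "
    "Roe A, 2016, HYPOTHETICAL LETT, V2, P20, "
    "DOI [10.1234/HL.2016.002, 10.1234/HL.2016.0002]; "
    "Poe B, 2014, THESIS"
)
for ref in split_cited_references(sample):
    print(extract_dois(ref))
```

The DOI part is split off before the bracketed list is broken up because the delimiter “, ” is used both between fields and between DOI names.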

According to whether a DOI name is attached, the cited references are divided into two categories: those with DOIs and those without. There are 341,317 cited references with DOIs and 74,643 without. Because filling in the missing DOI names would be difficult and laborious, the cited references without DOIs are excluded from further analysis in this study.

Cleaning method

Careful analysis shows that various DOI errors of cited references exist in the WoS database; that is, the DOI names of cited references are contaminated to some extent. Owing to the variety of these errors, cleaning DOI names automatically is not trivial. To the best of our knowledge, no publicly available software is yet capable of this cleaning task. Hence, a method for cleaning DOI names is proposed in this work, as shown in Algorithm 1. Built on manually curated rules, the approach consists of one procedure (Cleaning) and three functions (JoinDoi, TrimDoi and IsBracketMatch). To ease understanding, data types and built-in functions from the Java programming language, shown in a sans-serif font, are used explicitly. For a detailed description of these data types and built-in functions, we refer readers to the Java API reference.

Fig. 2 Regular expressions for cleaning various DOI errors

The procedure Cleaning takes the cited reference (CR) field of a publication of interest as input, splits it into individual cited references (Line 2 in Algorithm 1), and then tries to separate the DOI name(s) from the other information, one reference at a time (Lines 3–14 in Algorithm 1). Since this study focuses on DOI errors, cited references without the clue substring “, DOI” are discarded directly. The cited references with DOI name(s) are further grouped into two cases: those with multiple DOI names (Lines 7–10 in Algorithm 1) and those with a single DOI name (Line 11 in Algorithm 1). Note that in the former case (multiple literal DOI names), it is quite possible that only one DOI name is actually output (e.g., id = 1 and 5 in Table 5), because the function JoinDoi removes duplicates among the DOI names processed by the function TrimDoi.
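
The control flow of Cleaning and JoinDoi described above might be sketched in Python as follows. This is not the pseudo-code of Algorithm 1: `trim_doi` is a deliberately simplified, hypothetical stand-in for TrimDoi, the case-insensitive comparison of duplicates is an assumption based on DOI names being case-insensitive, and the sample reference is invented.

```python
def trim_doi(raw):
    """Hypothetical stand-in for TrimDoi: drops whitespace and a leading
    "DOI" label only, then applies a crude plausibility check."""
    doi = "".join(raw.split())
    if doi.upper().startswith("DOI"):
        doi = doi[3:].lstrip(":")
    return doi, doi.startswith("10.") and "/" in doi


def join_doi(raw_dois):
    """Trim every literal DOI name and drop duplicates (case-insensitively)."""
    seen, kept = set(), []
    for raw in raw_dois:
        doi, ok = trim_doi(raw)
        key = doi.upper()
        if key not in seen:
            seen.add(key)
            kept.append((doi, ok))
    return kept


def cleaning(cr_field):
    """Return (other information, cleaned DOI names) per cited reference."""
    results = []
    for ref in cr_field.split("; "):
        head, sep, tail = ref.partition(", DOI ")
        if not sep:
            continue  # no DOI clue: the reference is discarded
        tail = tail.strip()
        if tail.startswith("[") and tail.endswith("]"):
            raw_dois = tail[1:-1].split(", ")  # multiple literal DOI names
        else:
            raw_dois = [tail]                  # single DOI name
        results.append((head, join_doi(raw_dois)))
    return results


# Two literal DOI names that collapse into one after trimming.
print(cleaning("Doe J, 2015, J HYPOTHETICAL RES, V1, P10, "
               "DOI [10.1234/ABC.001, DOI 10.1234/ABC.001]; Poe B, 2014, THESIS"))
```

In this invented example the second literal DOI name differs from the first only by a redundant “DOI ” label, so a single DOI name is output.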

Table 2 Mapping between regular expressions in Fig. 2 and DOI instances in Table 5

The function TrimDoi trims DOI names with several regular expressions. Although most legal Unicode characters are allowed by the ISO standard (ISO 26324:2012–Information and documentation–digital object identifier system), DOI names very seldom contain whitespace characters. Exceptions are nevertheless found in the WoS database, such as id = 18 (2nd one) in Table 5. Hence, before any further cleaning, all whitespace characters are removed (Line 32 in Algorithm 1). Then, prefix-type (Lines 33–36 in Algorithm 1), suffix-type (Lines 37–44 in Algorithm 1) and other-type (Lines 45–46 in Algorithm 1) errors of DOI names are cleaned sequentially with the regular expressions in Fig. 2. The prefix- and suffix-type errors are further grouped into several cases. For ease of understanding, Table 2 maps the regular expressions in Fig. 2 to the DOI instances in Table 5.
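
The order of operations in TrimDoi (whitespace first, then prefix-type, then suffix-type noise) can be sketched in Python as below. The three patterns are illustrative assumptions only: they are not the regular expressions of Fig. 2, and the kinds of prefix and suffix noise they target are guesses made for the sake of the example.

```python
import re

# Illustrative patterns; NOT the regular expressions of Fig. 2.
WHITESPACE = re.compile(r"\s+")
PREFIX_NOISE = re.compile(
    r"^(?:DOI[:=]?|HTTPS?://(?:DX\.)?DOI\.ORG/)+", re.IGNORECASE)
SUFFIX_NOISE = re.compile(
    r"(?:/(?:ABSTRACT|FULL|PDF)|\.PDF|[.,;])+$", re.IGNORECASE)


def trim_doi_sketch(raw):
    """Apply the cleaning steps sequentially and return the trimmed string."""
    doi = WHITESPACE.sub("", raw)    # 1. remove all whitespace characters
    doi = PREFIX_NOISE.sub("", doi)  # 2. strip prefix-type noise
    doi = SUFFIX_NOISE.sub("", doi)  # 3. strip suffix-type noise
    return doi


for raw in ("DOI 10.1234/ABC .001",
            "http://dx.doi.org/10.1234/abc.001/full",
            "DOI:10.1234/ABC.001."):
    print(repr(raw), "->", trim_doi_sketch(raw))
```

Removing whitespace first keeps the later patterns simple, since a label such as “DOI ” and a DOI name broken by a stray space are then handled by the same anchored expressions.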

In addition, the function TrimDoi (Lines 45–46 in Algorithm 1) is also able to deal with several special cases, such as forward slashes (id = 10 in Table 5), double underscores (id = 11 in Table 5), double dots (id = 12 in Table 5), XML tags (id = 13 in Table 5) and so on. Finally, if a trimmed DOI name does not follow the specified characteristics (Sidman and Davidson 2001; Simmonds 1999) (Line 47 in Algorithm 1), the trimmed DOI name is returned with a false status; see, e.g., id = 40–44 and 47 in Table 5. Otherwise, if the trimmed DOI name ends with a hyphen or an underscore (such as id = 32 (1st), 33 and 39 in Table 5), it is also deemed illegal (Lines 48–49 in Algorithm 1). Then, the function IsBracketMatch checks whether the brackets match in the trimmed DOI name, or in the substring obtained by dropping its last character (Lines 50–53 in Algorithm 1); please refer to the DOI instances with id = 6 and 38 in Table 5 for more details. Since IsBracketMatch is simple and easy to implement, its pseudo-code is omitted in this study.
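
Since the pseudo-code of IsBracketMatch is omitted, one possible reading of the final checks is sketched below in Python. Several points are assumptions: the set of bracket pairs, the structural pattern used for “the specified characteristics” (a prefix starting with “10.” followed by a slash and a non-empty suffix), and the choice to return the shortened string when dropping the last character balances the brackets. The DOI names in the usage lines are invented.

```python
import re

# Assumed bracket pairs; the paper does not list which ones are checked.
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}
# Assumed structural pattern: "10." prefix, slash, non-empty suffix.
DOI_SHAPE = re.compile(r"10\.\d+(?:\.\d+)*/\S+")


def is_bracket_match(text):
    """Return True if every bracket in text is properly opened and closed."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


def check_trimmed_doi(doi):
    """Return (DOI name, status) after the final validity checks."""
    if DOI_SHAPE.fullmatch(doi) is None:
        return doi, False          # incoherently described DOI name
    if doi.endswith(("-", "_")):
        return doi, False          # incomplete suffix
    if is_bracket_match(doi):
        return doi, True
    if is_bracket_match(doi[:-1]):
        return doi[:-1], True      # a stray last character was dropped
    return doi, False


print(check_trimmed_doi("10.1234/ABC(2015)001)"))
print(check_trimmed_doi("10.1234/ABC-"))
```

A stack is used so that nested and mixed bracket types are checked in the right order, not merely counted.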

Results and discussions

Table 3 summarizes the distribution of the DOI errors in the gene editing dataset. The vast majority of them are prefix-type errors: the number of DOI errors with the prefix “DOI ” is 4,968, which accounts for 92.39% of all DOI errors. Among the other errors, 154 are incoherently described DOIs (not beginning with the prefix “10.”), as reported in the supplementary material S1. To evaluate the performance of our cleaning method, the number of cited references with multiple DOI names before and after cleaning is shown in Table 4. The number of cited references with two DOI names drops from 9,704 to 1,990 (a reduction of about 79.5%), and that with three DOI names from 45 to 33 (about 26.7%). This indicates that the cleaning method greatly improves the quality of DOI names of cited references in the WoS database. Please check the attached supplementary materials S1 and S2 for more details.

Table 3 Distribution of various DOI errors in the gene editing dataset
Table 4 The number of cited references with multiple DOI names in the gene editing dataset
Table 5 Examples of various DOI errors in the WoS database

It is worth noting that our cleaning method cannot handle the following situations: (a) if similar characters are confused with each other (Zhu et al. 2019), e.g. id = 7, 26 and 34 (1st) in Table 5, incorrect DOI names will be output, such as “10.3892/MC0.2013.131” (id = 7 in Table 5), “10.1038/NC0MMS10548” (id = 22 in Table 5), “10.1182/BL00D-2011-12-396879” (id = 26 in Table 5) and “10.1139/008-008” (the 2nd one with id = 34 in Table 5); (b) if multiple DOI names are incorrectly assigned to the same cited reference, our cleaning method cannot currently tell which one is correct, e.g. id = 34–36 and 49 in Table 5 (the DOI names in blue are correct); (c) a DOI name assigned to a scientific publication cannot be resolved by the DOI system (http://dx.doi.org/), e.g. “10.1007/3-540-48194-X17” (id = 32 in Table 5) and “10.1093/NAR” (id = 48 in Table 5); (d) a DOI name assigned to a scholarly article is resolvable, but it resolves to some knowledge unit within the article rather than to the article itself. For instance, the DOI name “10.7554/ELIFE.25956.001” resolves not to the cited reference per se but to its abstract (id = 37 in Table 5).

In fact, our preliminary analysis suggests that unsupervised anomaly detection algorithms (Goldstein and Uchida 2016) can deal with most of the above cases. Take the prefix “10.7554/ELIFE.” (id = 37 in Table 5) as an example. In our dataset, every DOI name with this prefix ends with five digits, except for those of the following three cited references: “10.7554/ELIFE.08716.001”, “10.7554/ELIFE.11553.001” and “10.7554/ELIFE.25956.001”. Of course, the minority is sometimes right. For example, among the DOI names with the prefix “10.3892/MC” in our dataset (id = 7 in Table 5), only one is correct (“10.3892/MCO.2013.119”). Therefore, one should decide interactively whether the detected abnormal DOI names should be retained. Experimental results and insights from unsupervised anomaly detection will be described in another paper. However, for the case with id = 49 in Table 5, any automatic approach seems to be of no help, since all the DOI names are resolvable and the publications were written by the same author (Jane Bates) and published in the same journal (Nursing Standard) with the same volume number (31) and page numbers (31–31), but with different issue numbers.
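
The eLife example can be mimicked with a very simple frequency heuristic, sketched below in Python. This is not one of the algorithms surveyed by Goldstein and Uchida (2016), only an illustration of the idea: DOI names sharing a prefix are grouped by the character shape of the remainder, and rare shapes are flagged for interactive review. The function names and the threshold `max_share` are hypothetical, and the twenty regular DOI names are synthetic placeholders, not DOI names from the dataset.

```python
import re
from collections import Counter


def suffix_shape(text):
    """Map every digit to '9' and every letter to 'A'; keep punctuation."""
    shape = re.sub(r"[0-9]", "9", text)
    return re.sub(r"[A-Za-z]", "A", shape)


def flag_minority_shapes(dois, prefix, max_share=0.2):
    """Return the DOI names under prefix whose remainder has a rare shape."""
    group = [d for d in dois if d.upper().startswith(prefix.upper())]
    shapes = Counter(suffix_shape(d[len(prefix):]) for d in group)
    total = len(group)
    return [d for d in group
            if shapes[suffix_shape(d[len(prefix):])] / total <= max_share]


# Synthetic five-digit DOI names plus the three exceptions named in the text.
regular = ["10.7554/ELIFE.%05d" % n for n in range(1, 21)]
odd = ["10.7554/ELIFE.08716.001",
       "10.7554/ELIFE.11553.001",
       "10.7554/ELIFE.25956.001"]
print(flag_minority_shapes(regular + odd, "10.7554/ELIFE."))
```

Such a heuristic only nominates candidates. As the “10.3892/MC” example shows, the rare shape can be the correct one, so the flagged DOI names still need a manual decision.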

Algorithm 1 Cleaning method for DOI names (procedure Cleaning and functions JoinDoi and TrimDoi)

Conclusions

As noted by Zhu et al. (2019), there is no simple way to recognize, and thus to evaluate, the extent of DOI errors in the Web of Science database. After careful analysis of the bibliographic data in the gene editing field, several typical DOI errors of cited references, such as prefix-, suffix- and other-type errors, are identified. The other-type errors can be further divided into three subgroups: (a) those containing special characters (such as id = 10–14 and 46 in Table 5), (b) incoherently described DOIs (such as id = 40–44 and 47 in Table 5), and (c) those with an incomplete suffix but a correct DOI prefix (i.e., “10.”) (such as id = 32 (1st), 33 and 39 in Table 5). Then, a cleaning method for DOI names based on regular expressions is put forward in this work.

Although our cleaning approach can largely improve the quality of DOI names of cited references in the WoS database, several situations still cannot be handled: (a) similar characters are confused with each other (Zhu et al. 2019), such as “O” versus “0”, “b” versus “6” and “O” versus “Q”; (b) it is very difficult to distinguish the correct one among multiple DOI names assigned to the same cited reference; (c) a DOI name assigned to some cited reference cannot be resolved by the DOI system; (d) a DOI name is resolvable, but points to some knowledge unit within the cited reference of interest. According to our preliminary analysis, unsupervised anomaly detection algorithms (Goldstein and Uchida 2016) seem able to deal with most of the above cases. In the near future, we will try such algorithms and report our insights in another paper.

Meanwhile, this work argues that similar DOI errors should also exist in other bibliographic databases. For example, the wrong DOI names for id = 40–44 in Table 5 seem to come from the MEDLINE database, since the corresponding publications in the two databases are assigned the same incorrect DOI names. Therefore, it is reasonable to expect that the cleaning method proposed in this study is applicable to other databases. Another interesting phenomenon can be observed in Fig. 3: the correct DOI names can be obtained from the detailed webpages of the corresponding articles, where they are followed by some noise, but in the WoS database the correct DOI names and the noise are mixed together.

Fig. 3 The snapshots of the cited references with id = 27 and 30 in Table 5

From Table 5, in our opinion, the diversity of DOI errors makes it very difficult to figure out their possible sources. However, since high-quality bibliometric data is the ultimate goal of all stakeholders, one feasible solution is to clean all DOI names of the cited references in the databases of interest with our approach, followed by an unsupervised anomaly detection algorithm (Goldstein and Uchida 2016). For publications that still have more than one DOI name or a wrong DOI name after these processing steps, the WoS and other databases should identify and keep the correct DOI names.

Supplementary material

  1. S1

    The cited references with the incoherently described DOI errors http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s1.xlsx

  2. S2

    The cited references with multiple DOI names before cleaning http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s2.xlsx

  3. S3

    The cited references with multiple DOI names after cleaning http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s3.xlsx