Types of DOI errors of cited references in Web of Science with a cleaning method

Xu, Shuo; Hao, Liyuan; An, Xin; Zhai, Dongsheng; Pang, Hongshen

doi:10.1007/s11192-019-03162-4

Published: 11 July 2019

Scientometrics, Volume 120, pages 1427–1437 (2019)

Abstract

Although bibliographic databases such as Web of Science (WoS) have greatly promoted the development of scientometrics and informetrics, they are not free of errors. The main purpose of this work is to determine which types of DOI errors occur in cited references, how often each type occurs, and whether these errors can be corrected automatically. After careful analysis, several classic DOI errors of cited references, such as prefix-, suffix- and other-type errors, are identified. Then, a cleaning method based on regular expressions is put forward. Experimental results on bibliographic data in the gene editing field from the WoS database indicate that our cleaning approach can largely improve the quality of DOI names of cited references.

Introduction

Since the establishment of the digital object identifier (DOI) system in 1997 (Paskin 1999), managed by the International DOI Foundation (IDF) (Chandrakar 2006; Paskin 1999, 2010; Simmonds 1999), DOIs have been assigned uniquely to many digital objects, such as publications (Boundry and Chartron 2017; Gorraiz et al. 2016), illustrations or tables (Wang 2007), scientific data (Neumann and Brase 2014) and so on. A DOI name is a case-insensitive alphanumeric string consisting of two parts separated by a forward slash (Sidman and Davidson 2001; Simmonds 1999): (a) a prefix beginning with the numeral 10, assigned by the IDF or by DOI registration agencies, and (b) a suffix assigned by the registrant.

It is well known that comprehensive bibliographic databases, such as Scopus and Web of Science (WoS), have greatly promoted the development of scientometrics and informetrics. However, one should keep in mind that these databases are not free of errors (Jacso 2006; Franceschini et al. 2013, 2014, 2016), even though data quality has improved significantly over the past decade. Buchanan (2006) divided database errors into two categories, (a) author errors and (b) database mapping errors, and gave illustrative examples of each.

So far, errors have been found in almost every field of publication records, such as author names (Buchanan 2006), author addresses (Liu et al. 2018), publication year (Buchanan 2006), omitted citations (Franceschini et al. 2014), funding acknowledgements (Tang et al. 2017), etc. Some documents are even missing from the bibliographic database altogether (Krauskopf 2019). The DOI field is no exception. Franceschini et al. (2015) revealed that quite a few single DOI names were incorrectly assigned to multiple publications indexed in the Scopus database. Incorrect DOI names in the WoS database were also discovered by Zhu et al. (2019), Zhu et al. (2019) and Huang and Liu (2019).

By definition, each DOI name should be unique and must identify one and only one entity (Paskin 1999). Thus, one can use DOI names to identify and disambiguate scientific publications. However, DOI errors present challenges for collecting data from different sources without unwanted duplicate entries (Valderrama-Zurián et al. 2015), for applying new metrics such as altmetrics (Jobmann et al. 2014; Haustein et al. 2015), for extracting thematic structures accurately (Xu et al. 2018), and so on. Apart from the DOI errors described in Franceschini et al. (2015), Zhu et al. (2019) and Zhu et al. (2019), it remains unknown whether there are other types of DOI errors, how often each type occurs, and whether these errors can be corrected automatically. In this work, various DOI errors of cited references in the WoS database are analyzed in depth, and a cleaning approach is put forward to reduce the extent of DOI errors of cited references.

Dataset

The bibliographic data in the gene editing field was collected from the WoS core database on 25 January 2018 via the library of Beijing University of Technology. The following search strategy is used in this study: “TS = (gene edit*) OR TS = (crispr) OR TS = (clustered regularly interspaced short palindromic repeats)”. The language is limited to English, and the document types include article, proceedings paper and review. The publication years span from 2000 to 2017. The downloaded publications contain the full records and cited references in the tab-delimited file format. Surprisingly, two records in the retrieved results share the same DOI (10.3389/FIMMU.2017.00351) but have different WoS IDs (WOS:000398414900001 and WOS:000399835000001). On closer manual examination, these two records are found to refer to the same publication, so one of them is removed. In total, the dataset contains 13,909 publications, and Table 1 reports their distribution over publication years.

Table 1 Distribution of number of publications over year for gene editing dataset

To discover the various DOI errors of cited references, one should delve into the reference list, i.e. the CR field in the WoS database. Each cited publication in the reference list usually shows only the following fields: the first author’s name (family name and given-name initials), publication year, abbreviated publication venue (e.g., journal or conference name), volume number starting with the character “V”, starting page number beginning with the letter “P”, and DOI name with the prefix “DOI ”. Here, “ ” denotes the whitespace character. If a field (such as the starting page number) is unavailable for an article, it is simply omitted from the cited publication in the reference list. The fields are separated by “, ”, and the cited references are delimited by “; ”. A snippet of the cited references is illustrated in Fig. 1. It is worth mentioning that multiple DOI names can be attached to the same cited reference (cf. the second one in Fig. 1). In this case, the DOI names are enclosed in square brackets and delimited by “, ”. If a cited reference has no DOI name, the prefix “DOI ” is omitted as well (cf. the third one in Fig. 1).
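The CR layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_cited_references` and `extract_raw_dois` are hypothetical, the sample references and the `10.1000/...` DOI names are invented placeholders rather than records from the dataset, and the sketch assumes that “; ” never occurs inside a single cited reference.

```python
def split_cited_references(cr_field):
    """Split a WoS CR field into cited references (delimiter '; ')."""
    return [ref for ref in cr_field.split("; ") if ref]


def extract_raw_dois(reference):
    """Return the literal DOI strings attached to one cited reference."""
    marker = ", DOI "
    pos = reference.find(marker)
    if pos < 0:
        return []  # no DOI name attached to this reference
    tail = reference[pos + len(marker):]
    if tail.startswith("[") and tail.endswith("]"):
        # Multiple DOI names: enclosed in square brackets, delimited by ', '
        return tail[1:-1].split(", ")
    return [tail]


# Invented sample mirroring the three cases: one DOI, several DOIs, no DOI.
sample_cr = (
    "Smith J, 2012, SCIENCE, V337, P816, DOI 10.1000/EXAMPLE.1; "
    "Lee K, 2014, CELL, V157, P1262, DOI [10.1000/EXAMPLE.2, 10.1000/EXAMPLE.2A]; "
    "Wang L, 2010, THESIS"
)

for ref in split_cited_references(sample_cr):
    print(extract_raw_dois(ref))
```

The strings returned here are still the literal, possibly erroneous DOI names; no cleaning has been applied at this stage.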

Depending on whether a DOI name is attached, the cited references are divided into two categories: those with DOIs and those without. There are 341,317 cited references with DOIs and 74,643 without. Because filling in the missing DOI names for the latter is difficult and labor-intensive, the cited references without DOIs are excluded from further analysis in this study.

Cleaning method

Through careful analysis, this study finds that various DOI errors of cited references exist in the WoS database; that is, the DOI names of cited references are contaminated to some extent. Because of the variety of DOI errors, cleaning DOI names automatically is not trivial. To the best of our knowledge, no publicly available software is currently capable of this cleaning task. Hence, a method for cleaning DOI names is proposed in this work, as shown in Algorithm 1. Built on manually curated rules, the approach consists of one procedure (Cleaning) and three functions (JoinDoi, TrimDoi and IsBracketMatch). To ease understanding, data types and built-in functions from the Java programming language, shown in a sans serif font, are used explicitly. For a more detailed description of these data types and built-in functions, we refer readers to the Java API reference.

The procedure Cleaning takes the cited reference (CR) field of a publication of interest as input, splits it into individual cited references (Line 2 in Algorithm 1), and then tries to separate the DOI name(s) from the other information one reference at a time (Lines 3–14 in Algorithm 1). Since this study focuses on DOI errors, cited references without the clue substring “, DOI” are discarded. The cited references with DOI name(s) are further grouped into two cases: those with multiple DOI names (Lines 7–10 in Algorithm 1) and those with a single DOI name (Line 11 in Algorithm 1). Note that in the former case (multiple literal DOI names), it is quite possible that only one DOI name is actually output (e.g., id = 1 and 5 in Table 5). The function JoinDoi removes duplicates among the DOI names after they have been processed by the function TrimDoi.
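The control flow of Cleaning and JoinDoi can be sketched as follows. This is a minimal Python illustration, not a reproduction of Algorithm 1: the names `cleaning`, `join_doi` and `trim_doi` are hypothetical, `trim_doi` is a deliberately small placeholder for the real TrimDoi, the sketch searches for “, DOI ” with a trailing space, and the sample reference is invented.

```python
import re


def trim_doi(raw):
    """Placeholder for TrimDoi: drop whitespace and a leading literal 'DOI'."""
    doi = re.sub(r"\s+", "", raw)
    return re.sub(r"^DOI:?", "", doi, flags=re.IGNORECASE)


def join_doi(raw_dois):
    """Trim each DOI name, then drop duplicates while keeping first-seen order."""
    seen, cleaned = set(), []
    for raw in raw_dois:
        doi = trim_doi(raw)
        key = doi.upper()  # DOI names are case-insensitive
        if doi and key not in seen:
            seen.add(key)
            cleaned.append(doi)
    return cleaned


def cleaning(cr_field):
    """Map each cited reference that carries a DOI to its cleaned DOI list."""
    marker = ", DOI "
    result = []
    for ref in cr_field.split("; "):
        pos = ref.find(marker)
        if pos < 0:
            continue  # references without the clue substring are discarded
        tail = ref[pos + len(marker):]
        if tail.startswith("[") and tail.endswith("]"):
            raw = tail[1:-1].split(", ")  # multiple literal DOI names
        else:
            raw = [tail]  # single DOI name
        result.append((ref[:pos], join_doi(raw)))
    return result


# Invented example: two literal DOI names collapse into one after cleaning.
sample_cr = "Doe J, 2015, NATURE, V1, P1, DOI [10.1000/XYZ.1, DOI 10.1000/xyz.1]; Roe R, 2016, BOOK"
print(cleaning(sample_cr))
```

Deduplication compares upper-cased names because DOI names are case-insensitive; the first spelling encountered is the one kept.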

Table 2 Mapping between regular expressions in Fig. 2 and DOI instances in Table 5

The function TrimDoi trims the DOI names with several regular expressions. Although the ISO standard (ISO 26324:2012, Information and documentation: Digital object identifier system) allows most legal Unicode characters, DOI names very seldom contain whitespace characters. Exceptions are nevertheless found in the WoS database, such as id = 18 (2nd one) in Table 5. Hence, before any further cleaning, all whitespace characters are removed (Line 32 in Algorithm 1). Then, prefix-type (Lines 33–36 in Algorithm 1), suffix-type (Lines 37–44 in Algorithm 1) and other-type (Lines 45–46 in Algorithm 1) errors of DOI names are cleaned sequentially with the regular expressions in Fig. 2. The prefix- and suffix-type errors are further grouped into several cases. For convenience, Table 2 illustrates the mapping between the regular expressions in Fig. 2 and the DOI instances in Table 5.
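The staged trimming described above (whitespace first, then prefix-, suffix- and other-type noise) might look like the following Python sketch. The regular expressions here are illustrative guesses, not the expressions of Fig. 2, which are not reproduced in this text; the function name `trim_doi`, the pattern names and the sample strings are all assumptions made for the example.

```python
import re

# Illustrative patterns only; the paper's actual expressions are those of its Fig. 2.
PREFIX_NOISE = re.compile(r"^(?:HTTPS?://(?:DX\.)?DOI\.ORG/|DOI:?)+", re.IGNORECASE)
SUFFIX_NOISE = re.compile(r"[.,;]+$")      # assumed: stray trailing punctuation
XML_TAG = re.compile(r"<[^>]+>")           # assumed: simple XML/HTML tags
REPEATED = re.compile(r"([/_.])\1+")       # doubled slash, underscore or dot


def trim_doi(raw):
    """Apply the cleaning stages in the order the text describes."""
    doi = re.sub(r"\s+", "", raw)          # 1. remove all whitespace characters
    doi = PREFIX_NOISE.sub("", doi)        # 2. prefix-type noise
    doi = SUFFIX_NOISE.sub("", doi)        # 3. suffix-type noise
    doi = XML_TAG.sub("", doi)             # 4. other-type noise: XML tags
    doi = REPEATED.sub(r"\1", doi)         #    and doubled separators
    return doi


print(trim_doi("DOI 10.1000/ABC.1"))                 # 10.1000/ABC.1
print(trim_doi("http://dx.doi.org/10.1000/ABC.1."))  # 10.1000/ABC.1
```

Running the stages in a fixed order matters: removing whitespace first lets the later patterns anchor on the start and end of the string without having to tolerate embedded blanks.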

In addition, the function TrimDoi (Lines 45–46 in Algorithm 1) also handles several special cases, such as a forward slash (id = 10 in Table 5), double underscores (id = 11 in Table 5), double dots (id = 12 in Table 5), XML tags (id = 13 in Table 5) and so on. Finally, if a trimmed DOI name does not follow the specified characteristics (Sidman and Davidson 2001; Simmonds 1999) (Line 47 in Algorithm 1), the trimmed DOI name is returned together with a false status; see, e.g., id = 40–44 and 47 in Table 5. Otherwise, if the trimmed DOI name ends with a hyphen or an underscore (such as id = 32 (1st), 33 and 39 in Table 5), it is also treated as illegal (Lines 48–49 in Algorithm 1). Then, the function IsBracketMatch checks whether the brackets match in the trimmed DOI name, or in the substring that excludes its last character (Lines 50–53 in Algorithm 1); see the DOI instances with id = 6 and 38 in Table 5 for details. Since IsBracketMatch is simple and easy to implement, its pseudo-code is omitted in this study.
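Because the pseudo-code of IsBracketMatch is omitted, one possible stack-based version is sketched below in Python, together with the final validity checks. Several points are assumptions rather than statements from the paper: the set of bracket pairs, the `DOI_SHAPE` pattern used for the "specified characteristics", the names `is_bracket_match` and `check_doi`, and the interpretation that the name without its last character is returned when only that substring has matching brackets.

```python
import re

# Assumed shape: prefix starting with "10.", a forward slash, a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.[^/]+/.+$")
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}  # assumed bracket set


def is_bracket_match(text):
    """Return True if every bracket in `text` is properly opened and closed."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


def check_doi(doi):
    """Return (doi, status) following the checks described in the text."""
    if not DOI_SHAPE.match(doi):
        return doi, False          # does not look like a DOI name at all
    if doi.endswith(("-", "_")):
        return doi, False          # incomplete suffix
    if is_bracket_match(doi):
        return doi, True
    if is_bracket_match(doi[:-1]):
        return doi[:-1], True      # dropping the last character fixes the brackets
    return doi, False


print(check_doi("10.1000/ABC(1))"))  # ('10.1000/ABC(1)', True)
print(check_doi("10.1000/ABC-"))     # ('10.1000/ABC-', False)
```

Angle brackets are included in the pair table because some older DOI suffixes contain them; whether the original function does the same is not stated.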

Results and discussions

Table 3 summarizes the distribution of the various DOI errors in the gene editing dataset. From Table 3, one can see that the vast majority of DOI errors are prefix-type errors. In fact, the number of DOI errors with the prefix “DOI ” is 4,968, which accounts for 92.39% of all DOI errors. Among the other errors, 154 are incoherently described DOIs (not beginning with the prefix “10.”), as reported in supplementary material S1. To evaluate the performance of our cleaning method, Table 4 shows the number of cited references with multiple DOI names before and after cleaning. The number of cited references with two and three DOI names drops drastically from 9,704 to 1,990 and from 45 to 33, respectively. This indicates that the cleaning greatly improves the quality of DOI names of cited references in the WoS database. Please check the attached supplementary materials S1–S3 for more details.

Table 3 Distribution of various DOI errors in the gene editing dataset

Table 4 The number of cited references with multiple DOI names in the gene editing dataset

Table 5 Examples of various DOI errors in the WoS database

It is worth noting that our cleaning method cannot handle the following situations: (a) if similar characters are confused with each other (Zhu et al. 2019), e.g. id = 7, 26 and 34 (1st) in Table 5, incorrect DOI names will be output, such as “10.3892/MC0.2013.131” (id = 7 in Table 5), “10.1038/NC0MMS10548” (id = 22 in Table 5), “10.1182/BL00D-2011-12-396879” (id = 26 in Table 5) and “10.1139/008-008” (the 2nd one with id = 34 in Table 5); (b) if multiple DOI names are incorrectly assigned to the same cited reference, our cleaning method currently cannot tell which one is correct, e.g. id = 34–36 and 49 in Table 5 (DOI names in blue are correct); (c) a DOI name assigned to a publication cannot be resolved by the DOI system (http://dx.doi.org/), e.g. “10.1007/3-540-48194-X17” (id = 32 in Table 5) and “10.1093/NAR” (id = 48 in Table 5); (d) a DOI name assigned to a scholarly article is resolvable, but the DOI system resolves it to some knowledge unit within the article rather than to the article itself. For instance, the DOI name “10.7554/ELIFE.25956.001” resolves not to the cited reference per se but to its abstract (id = 37 in Table 5).

In fact, our preliminary analysis suggests that unsupervised anomaly detection algorithms (Goldstein and Uchida 2016) can deal with most of the above cases. Take the prefix “10.7554/ELIFE.” (id = 37 in Table 5) as an example. In our dataset, all DOI names with this prefix end with five digits, except for the following three: “10.7554/ELIFE.08716.001”, “10.7554/ELIFE.11553.001” and “10.7554/ELIFE.25956.001”. Of course, the minority is sometimes right. For example, among the DOI names with the prefix “10.3892/MC” in our dataset (id = 7 in Table 5), only one is found to be correct (“10.3892/MCO.2013.119”). Therefore, one should decide interactively whether the detected abnormal DOI names should be retained. Experimental results and insights from unsupervised anomaly detection algorithms will be described in another paper. However, for the case with id = 49 in Table 5, any automatic approach seems helpless, since all DOI names are resolvable and the publications are written by the same author (Jane Bates) and published in the same journal (Nursing Standard) with the same volume number (31) and page numbers (31–31), but with different issue numbers.
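The eLife observation can be illustrated with a very simple frequency heuristic in Python. To be clear, this is not one of the algorithms evaluated by Goldstein and Uchida (2016) and not the authors' method; it only flags DOI names whose remainder after a given prefix has a rarer "shape" than the dominant one. The names `shape` and `flag_minority_shapes` are hypothetical, and all sample DOI names except “10.7554/ELIFE.25956.001” are invented placeholders.

```python
import re
from collections import Counter


def shape(text):
    """Collapse a string into a coarse pattern: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", text))


def flag_minority_shapes(dois, prefix):
    """Return DOI names under `prefix` whose remainder deviates from the dominant shape."""
    upper = prefix.upper()
    group = [d for d in dois if d.upper().startswith(upper)]
    if not group:
        return []
    shapes = {d: shape(d[len(prefix):]) for d in group}
    dominant, _ = Counter(shapes.values()).most_common(1)[0]
    return [d for d in group if shapes[d] != dominant]


# Placeholder names with five-digit endings, plus one name quoted in the text.
sample = [
    "10.7554/ELIFE.11111",
    "10.7554/ELIFE.22222",
    "10.7554/ELIFE.33333",
    "10.7554/ELIFE.44444",
    "10.7554/ELIFE.25956.001",
    "10.1000/OTHER.1",
]
print(flag_minority_shapes(sample, "10.7554/ELIFE."))
```

As the “10.3892/MC” example shows, the dominant shape is not always the correct one, so names flagged this way are candidates for manual review rather than confirmed errors.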

Conclusions

As noted by Zhu et al. (2019), there is no simple way to recognize, and thus to evaluate, the extent of DOI errors in the Web of Science database. After careful analysis of the bibliographic data in the gene editing field, several classic DOI errors of cited references, such as prefix-, suffix- and other-type errors, are identified. The other-type errors can be further divided into three subgroups: (a) those containing special characters (such as id = 10–14 and 46 in Table 5), (b) incoherently described DOIs (such as id = 40–44 and 47 in Table 5), and (c) those with an incomplete suffix but a correct DOI prefix (i.e., “10.”) (such as id = 32 (1st), 33 and 39 in Table 5). Then, a cleaning method for DOI names based on regular expressions is put forward in this work.

Although our cleaning approach can largely improve the quality of DOI names of cited references in the WoS database, several situations still cannot be handled by it: (a) similar characters are confused with each other (Zhu et al. 2019), such as “O” versus “0”, “b” versus “6” and “O” versus “Q”; (b) it is very difficult to distinguish the correct one among multiple DOI names assigned to the same cited reference; (c) a DOI name assigned to a cited reference cannot be resolved by the DOI system; (d) a DOI name is resolvable, but points to some knowledge unit within the cited reference of interest. According to our preliminary analysis, unsupervised anomaly detection algorithms (Goldstein and Uchida 2016) seem able to deal with most of the above cases. In the near future, we will try these algorithms and report our insights in another paper.

Meanwhile, this work argues that similar DOI errors should also exist in other bibliographic databases. For example, the wrong DOI names for id = 40–44 in Table 5 seem to come from the MEDLINE database, since the corresponding publications in the two databases are assigned the same incorrect DOI names. It is therefore reasonable to expect that the cleaning method proposed in this study is applicable to other databases. Another interesting phenomenon, shown in Fig. 3, can be observed: the correct DOI names can be obtained from the detail webpages of the corresponding articles, although they are followed by some noise. In the WoS database, however, the correct DOI names and the noise are mixed together.

In our opinion, Table 5 shows that the diversity of DOI errors makes it very difficult to identify their possible sources. However, since high-quality bibliometric data is the stakeholders’ ultimate goal, one feasible solution is to clean all DOI names of the cited references in the databases of interest with our approach and then with unsupervised anomaly detection algorithms (Goldstein and Uchida 2016). After these processing steps, for those publications that still have more than one distinct DOI name or a wrong DOI name, the WoS and other databases should identify and keep the correct DOI names.

Supplementary material

S1
The cited references with the incoherently described DOI errors http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s1.xlsx
S2
The cited references with multiple DOI names before cleaning http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s2.xlsx
S3
The cited references with multiple DOI names after cleaning http://54xushuo.net/wiki/lib/exe/fetch.php?media=resources:papers:s3.xlsx

References

Boundry, C., & Chartron, G. (2017). Availability of digital object identifiers in publications archived by PubMed. Scientometrics, 110(3), 1453–1469. https://doi.org/10.1007/s11192-016-2225-6.
Buchanan, R. A. (2006). Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4), 292–303. https://doi.org/10.5860/crl.67.4.292.
Chandrakar, R. (2006). Digital object identifier system: An overview. The Electronic Library, 24(4), 445–452. https://doi.org/10.1108/02640470610689151.
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2013). A novel approach for estimating the omitted-citation rate of bibliometric databases with an application to the field of bibliometrics. Journal of the Association for Information Science and Technology, 64(10), 2149–2156. https://doi.org/10.1002/asi.22898.
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2014). Scientific journal publishers and omitted citations in bibliometric databases: Any relationship? Journal of Informetrics, 8(3), 751–765. https://doi.org/10.1016/j.joi.2014.07.003.
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases. Scientometrics, 102(3), 2181–2186. https://doi.org/10.1007/s11192-014-1503-4.
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2016). The museum of errors/horrors in Scopus. Journal of Informetrics, 10(1), 174–182. https://doi.org/10.1016/j.joi.2015.11.006.
Goldstein, M., & Uchida, S. (2016). A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4), e0152173. https://doi.org/10.1371/journal.pone.0152173.
Gorraiz, J., Melero-Fuentes, D., Gumpenberger, C., & Valderrama-Zurián, J.-C. (2016). Availability of digital object identifiers (DOIs) in Web of Science and Scopus. Journal of Informetrics, 10(1), 98–109. https://doi.org/10.1016/j.joi.2015.11.008.
Haustein, S., Costas, R., & Larivière, V. (2015). Characterizing social media metrics of scholarly papers: The effect of document properties and collaboration patterns. PLoS ONE, 10(5), e0127830. https://doi.org/10.1371/journal.pone.0120495.
Huang, M., & Liu, W. (2019). Substantial numbers of easily identifiable illegal DOIs still exist in Scopus. Journal of Informetrics. https://doi.org/10.1016/j.joi.2019.03.019.
Jacso, P. (2006). Deflated, inflated and phantom citation counts. Online Information Review, 30(3), 297–309. https://doi.org/10.1108/14684520610675816.
Jobmann, A., Hoffmann, C. P., Künne, S., Peters, I., Schmitz, J., & Wollnik-Korn, G. (2014). Altmetrics for large, multidisciplinary research groups: Comparison of current tools. Bibliometrie-Praxis und Forschung, 3(1), 1–19. https://doi.org/10.5283/bpf.205.
Krauskopf, E. (2019). Missing documents in Scopus: The case of the journal enfermeria nefrologica. Scientometrics, 119(1), 543–547. https://doi.org/10.1007/s11192-019-03040-z.
Liu, W., Hu, G., & Tang, L. (2018). Missing author address information in Web of Science-an explorative study. Journal of Informetrics, 12(3), 985–997. https://doi.org/10.1016/j.joi.2018.07.008.
Neumann, J., & Brase, J. (2014). DataCite and names for research data. Journal of Computer-Aided Molecular Design, 28(10), 1035–1041. https://doi.org/10.1007/s10822-014-9776-5.
Paskin, N. (1999). The digital object identifier system: Digital technology meets content management. Interlending & Document Supply, 27(1), 13–16. https://doi.org/10.1108/02641619910255829.
Paskin, N. (2010). Digital object identifier (DOI) system. In A. Kent (Ed.), Encyclopedia of library and information sciences (3rd ed., pp. 1586–1592). Milton Park: Taylor and Francis.
Sidman, D., & Davidson, T. (2001). A practical guide to automating the digital supply chain with the digital object identifier (DOI). Publishing Research Quarterly, 17(2), 9–23. https://doi.org/10.1007/s12109-001-0019-y.
Simmonds, A. W. (1999). The digital object identifier (DOI). Publishing Research Quarterly, 15(2), 10–13. https://doi.org/10.1007/s12109-999-0022-2.
Tang, L., Hu, G., & Liu, W. (2017). Funding acknowledgement analysis: Queries and caveats. Journal of the Association for Information Science and Technology, 68(3), 790–794. https://doi.org/10.1002/asi.23713.
Valderrama-Zurián, J.-C., Aguilar-Moya, R., Melero-Fuentes, D., & Aleixandre-Benavent, R. (2015). A systematic analysis of duplicate records in Scopus. Journal of Informetrics, 9(3), 570–576. https://doi.org/10.1016/j.joi.2015.05.002.
Wang, J. (2007). Digital object identifiers and their use in libraries. Serials Review, 33(3), 161–164. https://doi.org/10.1016/j.serrev.2007.05.006.
Xu, S., Liu, J., Zhai, D., An, X., Wang, Z., & Pang, H. (2018). Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. Scientometrics, 117(1), 61–84. https://doi.org/10.1007/s11192-018-2841-4.
Zhu, J., Hu, G., & Liu, W. (2019). DOI errors and possible solutions for Web of Science. Scientometrics, 118(2), 709–718. https://doi.org/10.1007/s11192-018-2980-7.
Zhu, J., Liu, F., & Liu, W. (2019). The secrets behind Web of Science’s search. Scientometrics, 4, 1745–1753. https://doi.org/10.1007/s11192-019-03091-2.

Acknowledgements

Our gratitude goes to the anonymous reviewers and the editor for their valuable comments.

Author information

Authors and Affiliations

Research Base of Beijing Modern Manufacturing Development, College of Economics and Management, Beijing University of Technology, Beijing, 100124, People’s Republic of China
Shuo Xu, Liyuan Hao & Dongsheng Zhai
School of Economics and Management, Beijing Forestry University, Beijing, 100083, People’s Republic of China
Xin An
Library, Shenzhen University, Shenzhen, 518060, People’s Republic of China
Hongshen Pang

Corresponding author

Correspondence to Xin An.

Additional information

This work was supported partially by the Social Science Foundation of Beijing [Grant Number 17GLB074] and Natural Science Foundation of Guangdong Province under Grant Number 2018A030313695.

About this article

Xu, S., Hao, L., An, X. et al. Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics 120, 1427–1437 (2019). https://doi.org/10.1007/s11192-019-03162-4

Received: 23 April 2019
Published: 11 July 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s11192-019-03162-4
