Cleaning different types of DOI errors found in cited references on Crossref using automated methods

Boente, Ricarda; Massari, Arcangelo; Santini, Cristian; Tural, Deniz

doi:10.5281/zenodo.4914003

Published June 8, 2021 | Version 2

Working paper Open

Abstract

Purpose

The purpose of this work is to find an automated process for repairing the invalid DOI names that Silvio Peroni
collected while processing data provided by Crossref (2021).

Design / methodology / approach

The data for this research is a CSV list containing more than 1 million invalid cited DOI names. To determine
an automated process, the errors that characterize the invalid DOI names in the list first need to be classified.
DOI names that have become valid in the meantime are removed, so that the work concentrates exclusively on
factual errors, such as additional or invalid characters. These factual errors are then classified as prefix-,
suffix- or other-type errors. Building on and extending existing research in this field, this research identifies
regular expressions that can be used to clean the different types of invalid DOI names, for example by deleting
additional strings at the beginning or the end. After the cleanup, the cleaned DOI names are checked for validity
again.
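
The cleaning step described above can be illustrated with a minimal sketch. The patterns below are hypothetical examples written for this illustration and are not the regular expressions proposed in the paper; the names PREFIX_NOISE, SUFFIX_NOISE, DOI_SHAPE, clean_doi and looks_like_doi are likewise assumptions. The sketch strips a resolver URL or "doi:" label from the beginning of a string and trailing punctuation or landing-page path segments from the end, and records which kind of fix was applied.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for illustration only; NOT the expressions from the paper.
# Prefix noise: a resolver URL or a "doi:" label in front of the actual DOI name.
PREFIX_NOISE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)+", re.IGNORECASE)
# Suffix noise: trailing punctuation or a landing-page path segment.
SUFFIX_NOISE = re.compile(r"(?:[.,;:]|/(?:abstract|full|pdf|epdf))+$", re.IGNORECASE)
# Rough syntactic shape of a DOI name: "10.", a registrant code, "/", a suffix.
DOI_SHAPE = re.compile(r"10\.\d{4,9}/\S+")


def clean_doi(raw: str) -> Tuple[str, List[str]]:
    """Return the cleaned string and the list of fixes applied ("prefix", "suffix")."""
    fixes: List[str] = []
    doi = raw.strip()

    stripped = PREFIX_NOISE.sub("", doi)
    if stripped != doi:
        fixes.append("prefix")
    doi = stripped

    stripped = SUFFIX_NOISE.sub("", doi)
    if stripped != doi:
        fixes.append("suffix")

    return stripped, fixes


def looks_like_doi(candidate: str) -> bool:
    """Syntactic check only; it does not tell whether the DOI name is registered."""
    return DOI_SHAPE.fullmatch(candidate) is not None
```

For instance, clean_doi("10.1007/s11192-018-2980-7/abstract") returns the DOI name without the "/abstract" segment together with the label "suffix". A syntactic match is not the same as validity: whether a cleaned name actually resolves still has to be checked separately, as the last sentence of the paragraph above states.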

Findings

This research found automated processes based on regular expressions that correct factual errors belonging to
different subclasses. After applying the proposed algorithm to the dataset, around 16% of the DOI names
proved valid. The largest share of those valid DOIs consists of those made valid by cleaning up suffix errors;
however, many DOIs also proved valid without cleaning, having been only temporarily invalid.

Research limitations / implications

Checking whether the DOI names are valid consumes either a lot of time or a large amount of RAM, since the
check has to be executed both before and after the cleaning. The described methods are therefore only
applicable to smaller datasets unless the necessary resources are available. In addition, some DOI names will
always remain that cannot be made valid using automated processes. In these cases, it is important to identify
the publishers responsible for the incorrect references, which is done in a separate related project (Cioffi et al.,
2021).
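
The check-clean-check sequence mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: repair_invalid_dois is a hypothetical name, and the validity check and the cleaning function are passed in as parameters because the abstract does not specify how either is implemented. The dictionary cache shows one place where lookup time can be traded for memory, since every distinct DOI name is checked only once but its result is kept in RAM.

```python
from typing import Callable, Dict, Iterable, List


def repair_invalid_dois(
    dois: Iterable[str],
    is_valid: Callable[[str], bool],  # assumed checker, e.g. a lookup against a DOI resolver
    clean: Callable[[str], str],      # assumed cleaning function, e.g. regex-based
) -> Dict[str, List[str]]:
    """Validate each DOI name, clean the invalid ones, then validate again."""
    cache: Dict[str, bool] = {}

    def check(doi: str) -> bool:
        # Memoise results: fewer (slow) checks at the cost of more memory.
        if doi not in cache:
            cache[doi] = is_valid(doi)
        return cache[doi]

    outcome: Dict[str, List[str]] = {
        "already_valid": [],   # valid before any cleaning (only temporarily invalid)
        "repaired": [],        # valid only after cleaning
        "still_invalid": [],   # not fixable by this automated process
    }
    for raw in dois:
        if check(raw):
            outcome["already_valid"].append(raw)
            continue
        cleaned = clean(raw)
        if cleaned != raw and check(cleaned):
            outcome["repaired"].append(cleaned)
        else:
            outcome["still_invalid"].append(raw)
    return outcome
```

Calling the function with a stub checker, for example is_valid=lambda d: d in {"10.1000/182"} and clean=lambda d: d.rstrip("."), is enough to try the flow without network access. Dropping the cache would reduce memory use but repeat the check for every duplicate DOI name.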

Originality / value

Building on existing research, this study extends and improves regular expressions designed to clean DOI
errors, in order to enhance data quality in the COCI dataset. As the COCI project provides open access to the
reference lists of scientific works, the whole academic community can benefit from this improvement in data
quality. In addition, the proposed methods could serve as a basis for further research in this field, allowing
DOI name errors to be corrected in other datasets as well.

Files

Name	Size
Cleaning different types of DOI errors found in cited references on Crossref using automated methods.pdf (md5:c764ad46f065acfbc30c115819eb39b2)	575.9 kB

Additional details

Documents: Software: 10.5281/zenodo.4723983 (DOI); Dataset: 10.5281/zenodo.4892551 (DOI)
References: Output management plan: 10.5281/zenodo.4733919 (DOI); Software documentation: 10.17504/protocols.io.buuknwuw (DOI)

Boente, R., Massari, A., Santini, C., & Tural, D. (2021a). Classes of errors in DOI names (Data Management Plan) (Version 5). Zenodo. https://doi.org/10.5281/zenodo.4733919
Boente, R., Massari, A., Santini, C., & Tural, D. (2021b). Protocol: Investigating DOIs classes of errors. protocols.io. https://dx.doi.org/10.17504/protocols.io.buuknwuw
Boente, R., Massari, A., Santini, C., & Tural, D. (2021). Classes of errors in DOI names: output dataset (Version v1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4892551
Bostock, M. (2021). D3: Data-Driven Documents. Software Heritage. https://archive.softwareheritage.org/swh:1:dir:35fe697ae5a21e96d9fc01d890b30010e23c16dd
Buchanan, R. A. (2006). Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4), 292–303. https://doi.org/10.5860/crl.67.4.292
Cioffi, A., Coppini, S., Moretti, A., & Shahidzadeh A.N. (2021, May 3). Investigating missing citations in COCI and publishers involved (Version First). Zenodo. http://doi.org/10.5281/zenodo.4735636
Crossref. (2021). January 2021 Public Data File from Crossref. https://doi.org/10.13003/GU3DQMJVG4
Domanskyi, S., Szedlak, A., Hawkins, N. T., Wang, J., Paternostro, G., Piermarocchi, C. (2019). bioRxiv 539833. https://doi.org/10.1101/539833
Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases. Scientometrics, 102(3), 2181–2186. https://doi.org/10.1007/s11192-014-1503-4
García-Alonso, C.R., Pérez-Naranjo, L.M. & Fernández-Caballero, J.C. (2014). Multiobjective evolutionary algorithms to identify highly autocorrelated areas: the case of spatial distribution in financially compromised farms. Ann Oper Res 219, 187–202. https://doi.org/10.1007/s10479-011-0841-3
Heibi, I., Peroni, S., & Shotton, D. (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121(2), 1213–1228. https://doi.org/10.1007/s11192-019-03217-6
International DOI Foundation. (2019). DOI® Handbook. https://doi.org/10.1000/182
Krebs, S.L. (2018) Rhododendron. In: Van Huylenbroeck J. (eds) Ornamental Crops. Handbook of Plant Breeding, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-90698-0_26
Massari, A., Santini, C., & Boente, R. (2021). open-sci/2020-2021-grasshoppers-code: Classes of errors in DOI names (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.4723983
Peroni, S. (2021). Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4625300
Wang, S., Van Huylenbroeck, J. and Zhang, L.-H. (2020). Adaptability of Rhododendron species to climate and growth conditions at Lushan Botanical Garden. Acta Hortic. 1288, 131-138. https://doi.org/10.17660/ActaHortic.2020.1288.20
Xu, S., Hao, L., An, X., Zhai, D., & Pang, H. (2019). Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics, 120(3), 1427–1437. https://doi.org/10.1007/s11192-019-03162-4
Zhu, J., Hu, G. & Liu, W. DOI errors and possible solutions for Web of Science. Scientometrics 118, 709–718 (2019). https://doi.org/10.1007/s11192-018-2980-7
