Cleaning different types of DOI errors found in cited references on Crossref using automated methods
Description
Abstract
Purpose
The purpose of this work is to find an automated process to repair invalid DOI names that have been collected
by Silvio Peroni while processing data provided by Crossref (2021).
Design / methodology / approach
The data for this research is provided as a CSV list containing more than 1 million invalid cited DOI
names. To determine an automated process, the errors that characterize the wrong DOI names in the list
first need to be classified. Because the work concentrates exclusively on factual errors, such as additional or
invalid characters, DOI names that have become valid in the meantime are removed first. The remaining factual
errors are then classified as prefix-, suffix- or other-type errors. By investigating and extending
existing research in this field, this research identifies regular expressions that can be used to clean the different
types of invalid DOI names, for example by deleting additional strings at the beginning or the end. After the
cleanup, the cleaned DOI names are checked for validity again.
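The prefix and suffix cleanup described above can be illustrated with a minimal Python sketch. The two patterns and the function name `clean_doi` are assumptions made for this example; they are not the regular expressions developed in the study, which cover more error subclasses.

```python
import re

# Hypothetical patterns, for illustration only (not the study's expressions).
# Prefix error: any text in front of the "10.<registrant>/" part of the DOI.
PREFIX_NOISE = re.compile(r"^.*?(?=10\.\d{4,9}/)")
# Suffix error: trailing punctuation or a trailing URL fragment such as "/abstract".
SUFFIX_NOISE = re.compile(r"(?:[.,;:]|/(?:abstract|full|pdf|epdf))+$", re.IGNORECASE)


def clean_doi(raw: str) -> tuple[str, list[str]]:
    """Return the cleaned DOI name and the list of error classes that were fixed."""
    doi = raw.strip()
    fixed = []

    without_prefix = PREFIX_NOISE.sub("", doi, count=1)
    if without_prefix != doi:
        fixed.append("prefix")
    doi = without_prefix

    without_suffix = SUFFIX_NOISE.sub("", doi)
    if without_suffix != doi:
        fixed.append("suffix")
    doi = without_suffix

    return doi, fixed


print(clean_doi("https://doi.org/10.1007/s11192-019-03162-4"))
print(clean_doi("10.1007/s11192-018-2980-7/abstract."))
```

A cleaned name is only a candidate: as stated above, it still has to be checked for validity again.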
Findings
This research found automated processes based on regular expressions that correct factual errors
belonging to different subclasses. After applying the proposed algorithm to the dataset described above, around 16% of
the DOI names proved valid. Most of those valid DOIs were made valid by
cleaning up suffix errors; however, many DOIs also proved valid without cleaning, having been only temporarily
invalid.
Research limitations / implications
Checking whether the DOI names are valid consumes either a lot of time or a large amount of RAM, and the check
must be executed both before and after the cleaning. Therefore, the described methods are only applicable to
smaller datasets unless the necessary resources are available. In addition, some DOI names will always
remain that cannot be made valid using automated processes. In these cases, it is important to identify the
publishers responsible for the incorrect references, which is done in a separate related project (Cioffi et al.,
2021).
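The before-and-after validity check, and the time versus RAM trade-off it creates, can be sketched as follows. This is a simplified illustration under stated assumptions: `triage`, `is_valid` and `clean` are hypothetical names, the validity checker is injected by the caller (a real one would query a DOI resolution service), and the in-memory cache stands in for the choice of spending RAM to avoid repeated lookups.

```python
from collections import Counter
from typing import Callable, Iterable


def triage(
    dois: Iterable[str],
    is_valid: Callable[[str], bool],  # assumed checker, e.g. a lookup against a resolver
    clean: Callable[[str], str],      # assumed cleaning function
) -> Counter:
    """Count DOI names that are valid as-is, valid after cleaning, or still invalid."""
    cache: dict[str, bool] = {}  # trades RAM for fewer (slow) validity checks

    def check(doi: str) -> bool:
        if doi not in cache:
            cache[doi] = is_valid(doi)
        return cache[doi]

    outcome: Counter = Counter()
    for raw in dois:
        if check(raw):
            # Only temporarily invalid: no cleaning needed.
            outcome["valid_without_cleaning"] += 1
        elif (cleaned := clean(raw)) != raw and check(cleaned):
            outcome["valid_after_cleaning"] += 1
        else:
            outcome["still_invalid"] += 1
    return outcome


# Toy usage with a stand-in registry instead of a network call.
known = {"10.1000/182"}
result = triage(
    ["10.1000/182", "10.1000/182.", "10.9999/nope"],
    is_valid=known.__contains__,
    clean=lambda d: d.rstrip("."),
)
print(result)
```

Dropping the cache lowers memory use but repeats lookups for duplicated DOI names, which is the slower of the two options described above.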
Originality / value
Building on existing research, this study extends and improves regular expressions designed to clean DOI
errors in order to enhance the data quality of the COCI dataset. As the COCI project provides open access to reference
lists of scientific works, the whole academic community can profit from this improvement in data quality. In
addition, the proposed methods could serve as a basis for further research in this field, allowing the correction of
DOI name errors in other datasets as well.
Files
| Name | Size | Checksum |
|---|---|---|
| Cleaning different types of DOI errors found in cited references on Crossref using automated methods.pdf | 575.9 kB | md5:c764ad46f065acfbc30c115819eb39b2 |
Additional details
Related works
- Documents
  - Software: 10.5281/zenodo.4723983 (DOI)
  - Dataset: 10.5281/zenodo.4892551 (DOI)
- References
  - Output management plan: 10.5281/zenodo.4733919 (DOI)
  - Software documentation: 10.17504/protocols.io.buuknwuw (DOI)
References
- Boente, R., Massari, A., Santini, C., & Tural, D. (2021a). Classes of errors in DOI names (Data Management Plan) (Version 5). Zenodo. https://doi.org/10.5281/zenodo.4733919
- Boente, R., Massari, A., Santini, C., & Tural, D. (2021b). Protocol: Investigating DOIs classes of errors. protocols.io. https://dx.doi.org/10.17504/protocols.io.buuknwuw
- Boente, R., Massari, A., Santini, C., & Tural, D. (2021c). Classes of errors in DOI names: output dataset (Version v1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4892551
- Bostock, M. (2021). D3: Data-Driven Documents. Software Heritage. https://archive.softwareheritage.org/swh:1:dir:35fe697ae5a21e96d9fc01d890b30010e23c16dd
- Buchanan, R. A. (2006). Accuracy of cited references: The role of citation databases. College and Research Libraries, 67(4), 292–303. https://doi.org/10.5860/crl.67.4.292
- Cioffi, A., Coppini, S., Moretti, A., & Shahidzadeh, A. N. (2021, May 3). Investigating missing citations in COCI and publishers involved (Version First). Zenodo. http://doi.org/10.5281/zenodo.4735636
- Crossref. (2021). January 2021 Public Data File from Crossref. https://doi.org/10.13003/GU3DQMJVG4
- Domanskyi, S., Szedlak, A., Hawkins, N. T., Wang, J., Paternostro, G., Piermarocchi, C. (2019). bioRxiv 539833. https://doi.org/10.1101/539833
- Franceschini, F., Maisano, D., & Mastrogiacomo, L. (2015). Errors in DOI indexing by bibliometric databases. Scientometrics, 102(3), 2181–2186. https://doi.org/10.1007/s11192-014-1503-4
- García-Alonso, C. R., Pérez-Naranjo, L. M., & Fernández-Caballero, J. C. (2014). Multiobjective evolutionary algorithms to identify highly autocorrelated areas: the case of spatial distribution in financially compromised farms. Annals of Operations Research, 219, 187–202. https://doi.org/10.1007/s10479-011-0841-3
- Heibi, I., Peroni, S., & Shotton, D. (2019). Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. Scientometrics, 121(2), 1213–1228. https://doi.org/10.1007/s11192-019-03217-6
- International DOI Foundation. (2019). DOI® Handbook. https://doi.org/10.1000/182
- Krebs, S. L. (2018). Rhododendron. In J. Van Huylenbroeck (Ed.), Ornamental Crops. Handbook of Plant Breeding, vol. 11. Springer, Cham. https://doi.org/10.1007/978-3-319-90698-0_26
- Massari, A., Santini, C., & Boente, R. (2021). open-sci/2020-2021-grasshoppers-code: Classes of errors in DOI names (Version 1.1.0). Zenodo. https://doi.org/10.5281/zenodo.4723983
- Peroni, S. (2021). Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4625300
- Wang, S., Van Huylenbroeck, J., & Zhang, L.-H. (2020). Adaptability of Rhododendron species to climate and growth conditions at Lushan Botanical Garden. Acta Horticulturae, 1288, 131–138. https://doi.org/10.17660/ActaHortic.2020.1288.20
- Xu, S., Hao, L., An, X., Zhai, D., & Pang, H. (2019). Types of DOI errors of cited references in Web of Science with a cleaning method. Scientometrics, 120(3), 1427–1437. https://doi.org/10.1007/s11192-019-03162-4
- Zhu, J., Hu, G., & Liu, W. (2019). DOI errors and possible solutions for Web of Science. Scientometrics, 118, 709–718. https://doi.org/10.1007/s11192-018-2980-7