Published September 30, 2022 | Version v1
Conference paper Open

Everything you wish you didn't have to know about metadata matching

  • 1. Crossref

Description

The scholarly community understands very well how important accurate citation links between research outputs are: they provide provenance for the claims in the articles, researchers follow them to extend their domain knowledge, and institutions often use them to estimate the quality and impact of research. Citation links are not the only important relationships between entities in the scholarly ecosystem. The community is increasingly interested in relationships between research outputs and institutions, research outputs and funders, contributors and institutions, preprints and journal articles, and so on.
But where do those links actually come from? Ideally, they would be provided by the authors when submitting a scholarly article, collected by the publisher, and distributed further in a machine-readable format. Indeed, the authors are typically in the best position to provide accurate information about the relationships between the entities mentioned in their article. In practice, however, only about 30% of bibliographic references deposited with Crossref contain the DOI of the cited work, and only about 62% of funding entries contain the funder identifier. For the remaining bibliographic references and funding entries, we try to find the identifier of the referenced item automatically and insert it into the metadata. Both publisher-asserted and Crossref-asserted links are then made available through our APIs, along with information about who asserted each link.
This process of finding the referenced item based on a set of (typically messy) information about it is called matching. In this presentation, I share my experience with different flavours of metadata matching at Crossref and present our future plans.
I also answer frequently asked questions such as: “As time goes by, do we need to do less and less matching?”, “Is a simple title lookup enough to match a citation?” and “Can we be 100% sure that all citation links we see in the data are correct?”.
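
To make the idea of matching concrete, the following is a minimal sketch of matching a messy reference string against a small set of candidate records. It is a toy illustration under stated assumptions, not Crossref's actual algorithm: the records, the DOIs (`10.0000/example.*`), the token-overlap score, and the `0.8` threshold are all hypothetical and chosen only for the example.

```python
import re
from typing import Optional

# Hypothetical candidate records; the DOIs are invented for illustration.
RECORDS = [
    {"doi": "10.0000/example.1", "title": "Introduction", "author": "Smith", "year": 2015},
    {"doi": "10.0000/example.2", "title": "Introduction", "author": "Jones", "year": 2019},
]


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(reference: str, record: dict) -> float:
    """Fraction of the record's title/author/year tokens found in the reference."""
    ref = tokens(reference)
    rec = tokens(f"{record['title']} {record['author']} {record['year']}")
    return len(ref & rec) / len(rec)


def match(reference: str, records: list, threshold: float = 0.8) -> Optional[str]:
    """Return the DOI of the best-scoring record, or None if no score reaches the threshold."""
    best = max(records, key=lambda r: score(reference, r), default=None)
    if best is None or score(reference, best) < threshold:
        return None  # prefer no link over a doubtful link
    return best["doi"]


if __name__ == "__main__":
    print(match("Jones, A. (2019). Introduction. J. Examples 4, 1-3.", RECORDS))
    print(match("Introduction", RECORDS))
```

In this sketch, the full reference string identifies one record because author and year agree as well as the title, while the bare title "Introduction" fits both records equally and is left unmatched. This illustrates why a title lookup alone may not be enough, and why a threshold trades missed links against incorrect ones.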

Files

Tkaczyk-slides-oc-workshop-2022.pdf

Files (421.7 kB)

md5:4bb2ee43a80f85e554dd3ae69c341b8d