Published April 29, 2021 | Version v1
Presentation | Open

Data Citation: Prioritizing Adoption

  • California Digital Library

Description

Presentation at the National Academies of Sciences, Engineering, and Medicine (NASEM) and National Institutes of Health meeting on "Changing the Culture of Data Management and Sharing," which took place on April 28, 2021.

Speaker notes:

Intro
Make Data Count (MDC) is funded by the Alfred P. Sloan Foundation and is run in collaboration with DataCite, Crossref, California Digital Library, DataONE, and bibliometrics institutions.

Slide 2
I was asked to focus on the current state of data citation. I want to zoom out for a second to note that while we are all gathered here to discuss research data because we see the value in open management and sharing, and we would like to credit and assess these efforts, we are not yet at a point where we have dependable and consistent evaluation systems.

Slide 3
Researchers publish their articles, data, software, etc. due to mandates and as more disciplines embrace sharing. As research-supporting stakeholders, we want a way to access all of this information for compliance reasons and also to understand the ROI of open data, the impact and reach of the published work, and to eventually have a credit and incentives system in place -- a space to reward folks for publishing their data. One big piece of this is data citations. Data citations may not be the only indicator used -- but as bibliometrics studies ramp up to understand what the proper metrics are for research data, it’s important to get moving on supporting data citation infrastructure.

Slide 4
I only have eight minutes to go through a topic that deserves a full workshop. So instead of diving into specifics, I want to try to frame the rationale for a way forward in a complicated space.

Slide 5
This is data that DataCite pulled last week. It doesn’t represent all data citations that are out there, but it represents what we can find right now. The top number shows the data citations declared by repositories to DataCite through relation types in metadata, and the third row shows data citations declared by publishers to Crossref through reference lists. Clearly, millions of data citations exist, but the number that are declared and findable is awfully low. These numbers are also lower than they should be because they do not account for biomedical non-DOI citations or for citations that will not be found without text mining. I show this slide not only to show the lack of participation but to acknowledge how important it is for data citations to be accessible. For us to have open and scalable networks of citations, and to understand how many citations exist so we can build on them for evaluation, we need a consistent approach immediately -- for instance, supporting our open infrastructure dependencies, Crossref and DataCite.
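To make "declared" concrete, here is a minimal sketch, assuming the public DataCite REST API and a hypothetical dataset DOI, of how the relation types a repository has registered in a dataset’s metadata can be read back; the exact response fields may differ from the ones used in DataCite’s own pull.

# Minimal sketch, assuming the public DataCite REST API; the DOI below is
# hypothetical and used only for illustration.
import requests

doi = "10.1234/example-dataset"  # hypothetical dataset DOI

resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
resp.raise_for_status()
attrs = resp.json()["data"]["attributes"]

# Repositories declare article-data links with relation types such as
# "IsCitedBy" or "IsSupplementTo" in the relatedIdentifiers metadata field.
for rel in attrs.get("relatedIdentifiers", []):
    print(rel.get("relationType"), rel.get("relatedIdentifier"))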

Slide 7
There seems to be an attitude of waiting until things are perfect for all use cases before jumping in. This has been compounded by many inconsistent approaches in guidance and practice for how to cite data and how citations are extracted for public consumption. Many reinvent-the-wheel approaches have been a distraction. And because of this we are seeing, as with the numbers on the prior slide, that not enough publishers are supporting data citations in consistent and findable ways.

Slide 8
And why is that? Well, data are difficult, and there are many different approaches out there for getting started. But there is also a tension over why publishers should prioritize this resourcing and retooling over other features and initiatives. And we would be wrong not to mention that even if all publishers did so, we would still need resources for text mining and for understanding citations across the larger landscape, like government documents and large cohort datasets. Data citations do not fully paint the picture of data reuse, and right now we can’t even get the full picture of data citations.

Slide 9
But these barriers do not need to be roadblocks. The complexity is not as limiting as it seems. We can take an iterative approach, and we can start by emphasizing that data citation be the goal, rather than data statements that aren’t machine-readable or usable.

Slide 10
We can move in the right direction and take action, and we propose: let's prioritize adoption over perfection.

Slide 11
We should all begin with what we can do. For example, journals can cite data in a reference list. Journal publishing continues to evolve, and these changes can be daunting, but it’s important to do, and the notion of citations is not new; journals are well positioned to accelerate this evolution. Or funders can require persistent identifiers for data that are cited in grant reports and subsequent publications. There are complexities like various types of identifiers and relationships, provenance, large consortia data, dynamic data, and so on. But as the space evolves, so will our use cases and so will our technology, and so should we -- we should not wait to get started.

Slide 12
There’s a theme in this space where discussions spiral into spawning new working groups that repeat past work, and this sort of approach inhibits doing and making change. We need to build on existing open infrastructure. We should also acknowledge that we all have a shared goal: evaluating research data reuse. But we cannot jump the gun and refer to citations as credit, or build faulty metrics, without contextualization and a broad understanding of how data are reused across the disciplines.

Slide 13
This is a call to action for researchers and research-supporting infrastructures to participate now. If we want to get to a state where we can reward researchers for doing what we have all been advocating and building toward for the last decades, we all need to prioritize and act now.

Publishers should include citations to datasets in reference lists. Repositories can continue to declare relationships to articles. Funders and other institutional bodies can mandate that data be cited. We can support and rely on open infrastructure for reporting and aggregating. And researchers: promote and cite your own and your colleagues’ data.

Slide 14
Our infrastructure will continue to improve, and bibliometricians will help identify proper evaluation indicators for data, grounded in an understanding of the context of reuse. But we can’t advance the conversation or have these metrics without the broad community jumping in and taking action now. There’s a clear enough path ahead, so let’s focus on adoption.

Slide 15
It’s an exciting time, and we all can play a part. Let’s get moving. I’ve called attention to the overarching fact that this space is complex, but not as complex as it seems, and there is a way forward. For specifics on how to get involved and what the best practices are, check out these resources or get in touch.

Files

NASEM_MDC_Lowenberg (1).pdf (219.3 kB)

Additional details

Related works

Is supplemented by
10.5281/zenodo.4701079 (DOI)