10 Simple rules for design, provision, and reuse of identifiers for web-based life science data

Julie McMurry; Blomberg, Niklas; Burdett, Tony; Conte, Nathalie; Dumontier, Michel; Fellows, Donal K; Gonzalez-Beltran, Alejandra; Gormanns, Philipp; Hastings, Janna; Haendel, Melissa A; Hermjakob, Henning; Hériché, Jean-Karim; Ison, Jon C; Jimenez, Rafael C; Jupp, Simon; Juty, Nick; Laibe, Camille; Le Novère, Nicolas; Malone, James; Martin, Maria J; McEntyre, Johanna R; Morris, Chris; Muilu, Juha; Müller, Wolfgang; Mungall, Christopher J; Rocca-Serra, Philippe; Sansone, Susanna-Assunta; Sariyar, Murat; Snoep, Jacky L; Stanford, Natalie J; Swainston, Neil; Washington, Nicole; Williams, Alan R; Wolstencroft, Katherine; Goble, Carole; Parkinson, Helen

doi:10.5281/zenodo.31765

Published October 2, 2015 | Version v2

Preprint Open


1. European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
2. ELIXIR Hub, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
3. Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
4. School of Computer Science, The University of Manchester, Manchester, United Kingdom
5. Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom
6. Institute of Experimental Genetics, Helmholtz Centre Munich - German Research Center for Environmental Health (GmbH), Neuherberg, Germany
7. Department of Medical Informatics and Epidemiology and OHSU Library, Oregon Health & Science University, Portland, USA.
8. European Molecular Biology Laboratory, Heidelberg, Germany
9. Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
10. European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom | Babraham Institute, Cambridge, United Kingdom
11. STFC, Daresbury Laboratory, Warrington, United Kingdom
12. Genomics Coordination Center, Department of Genetics, University Medical Center Groningen and Groningen Bioinformatics Center, University of Groningen, Groningen, Netherlands
13. SDBV, HITS, Heidelberg, Germany
14. Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
15. Institute of Pathology, Charite – University Medicine Berlin, Berlin, Germany | TMF – Technologie- und Methodenplattform e. V. Berlin, Germany
16. MIB, University of Manchester, Manchester, UK | Department of Biochemistry, Stellenbosch University, Stellenbosch, South Africa
17. Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), University of Manchester, Manchester, UK.
18. Leiden Institute of Advanced Computer Science, Leiden University, Leiden, Netherlands

Life science data is evolving to be ever larger, more distributed, and more natively web-based. However, our collective handling of identifiers has lagged behind these advances. Diverse identifier issues (for instance “link rot” and “content drift”) have hampered our ability to integrate data and derive new knowledge from it. Optimizing web-based identifiers is harder than it appears and no single scheme is perfect: Identifiers are reused in different ways for different reasons, by different consumers. Moreover, digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Nevertheless, there is substantial room for improvement throughout the life sciences and several other groups have been converging on identifier standards that are broadly applicable.

Building on these efforts and drawing on our experience, we focus on the use case of large-scale data integration: we outline the identifier qualities and best practices that we feel are most important in this context. Specifically, we propose actions that providers of online databases (repositories, registries, and knowledgebases) should take when designing new identifiers or maintaining existing ones (Rules 1-9). In Rule 10, we conclude with guidance to data integrators and redistributors on how best to reference identifiers from these diverse sources. This article may also be useful to data generators and end users as it offers insight into the issues associated with data provision in a web environment. We call upon data providers to take a long-term view of their entities’ scope and lifecycle, and to consider existing identifier platforms and services.

Rule 1. Use established identifiers

Rule 2. Design identifiers for use by others

Rule 3. Help local identifiers travel well: document Prefix and Namespace

Rule 4. Opt for simple durable web resolution

Rule 5. Avoid embedding meaning

Rule 6. Make URIs clear and findable

Rule 7. Implement a version management policy

Rule 8. Do not re-assign or delete identifiers

Rule 9. Document the identifiers you issue and use

Rule 10. Reference responsibly
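
Rules 3 and 4 concern documenting a prefix and namespace so that a local identifier can be turned into a resolvable web address. A minimal sketch of that idea follows; it is not the authors' implementation. The prefix `EX`, the namespace `https://example.org/records/`, and the function names `expand` and `compact` are all hypothetical and chosen only for illustration.

```python
# Hypothetical prefix-to-namespace map. The prefix "EX" and the URL are
# illustrative assumptions, not taken from any real registry.
PREFIX_MAP = {
    "EX": "https://example.org/records/",
}


def expand(compact_id, prefix_map=PREFIX_MAP):
    """Expand 'PREFIX:local_id' into a full URI using a documented prefix map."""
    prefix, sep, local_id = compact_id.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed identifier: {compact_id!r}")
    if prefix not in prefix_map:
        raise KeyError(f"undocumented prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


def compact(uri, prefix_map=PREFIX_MAP):
    """Contract a full URI back to 'PREFIX:local_id' if its namespace is known."""
    for prefix, namespace in prefix_map.items():
        if uri.startswith(namespace) and len(uri) > len(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    raise ValueError(f"no documented namespace matches: {uri!r}")


if __name__ == "__main__":
    uri = expand("EX:12345")
    print(uri)
    print(compact(uri))
```

Keeping the mapping in one documented table means that the local part of the identifier never has to change if the namespace URL does; only the table entry is updated.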

Notes

This manuscript is a revision of doi:10.5281/zenodo.18003 and was recently resubmitted to PLoS Computational Biology.

Files

Files (192.3 kB)

Name	Checksum	Size
10RulesIdentifiers_MS_2015-09-24_Final_Clean.docx	md5:5313278a2210529fdaccbf0799faea84	130.5 kB
10RulesIdentifiers_S1-S6_2015-09-24_Final.docx	md5:b10cc7ba34f117afbd7cee96cd391d30	39.7 kB
10RulesIdentifiersResubmission_Authors_2015-09-23.docx	md5:cd82a44d92ed308739a86098022ea51a	22.1 kB

Additional details

Is new version of: 10.5281/zenodo.18003 (DOI)

European Commission
BIOMEDBRIDGES - Building data bridges between biological and medical infrastructures in Europe 284209
European Commission
ELIXIR - European Life-science Infrastructure for Biological Information 211601
European Commission
DIACHRON - DIACHRON – Managing the Evolution and Preservation of the Data Web 601043
European Commission
ISBE - Infrastructure for Systems Biology - Europe 312455

	All versions	This version
Views	3,923	1,848
Downloads	1,272	511
Data volume	917.6 MB	39.8 MB
