There is a newer version of the record available.

Published September 8, 2023 | Version 0.1
Dataset Open

Catalogue of Life Repackaged and Sorted hash://sha256/e7130fb557d9aee033ac7147f4d5c4c75f12223dd43e53c7cbb141372f9579cd hash://md5/882b8744b3ebd5fae371fa659ee52a2b

Description

Taxonomic name alignment is a necessary, and often time-consuming, task when integrating biodiversity datasets for re-use. This publication aims to facilitate fast (e.g., > 1k names/s), offline-enabled, and reproducible name alignment workflows by repackaging optimized versions of the wealth of names contained in the Catalogue of Life.

Introduction

The Catalogue of Life (Bánki et al. 2023) “[…] is an assembly of expert-based global species checklists with the aim to build a comprehensive catalogue of all known species of organisms on Earth. […]”

This data publication contains a verifiable copy of the Catalogue of Life as well as a reverse sorted version of the NameUsage.tsv table. The aims of this publication are to:

  1. provide a signed citation (Elliott, Poelen, and Fortes 2023) for a copy of the Catalogue of Life.
  2. prepare the Catalogue of Life to be included in the Nomer Corpus of Taxonomic Resources (J. H. Poelen (ed.) 2023).
  3. pre-process the Catalogue of Life to facilitate optimized indexing and offline taxonomic name alignments using tools like Nomer (J. Poelen and Salim 2023).

Overall, this publication aims to facilitate taxonomic name alignment using the wealth of information provided by the Catalogue of Life, enabling fast, reproducible, offline-enabled alignment of name lists with taxonomic resources of known provenance (or origin).

An example of an application facilitated by this publication is the Taxonomic Name Alignment tool provided through https://github.com/globalbioticinteractions/name-alignment-template. This template repository implements an automated workflow using GitHub Actions to align scientific names in csv/tsv files and Darwin Core archives with common taxonomic name lists such as the Catalogue of Life, NCBI Taxonomy, the Integrated Taxonomic Information System (ITIS), and the GBIF Backbone Taxonomy.

Methods

To capture and process the Catalogue of Life, the following steps were taken:

  1. track and archive a copy of Catalogue of Life
  2. reverse sort NameUsage.tsv
  3. assign an alias to the processed resources

Steps 1-3 are captured and documented using Preston, a biodiversity data tracker. Preston not only helps to document the steps, but also includes the digital resources that were used and produced.

Track and Archive

To track and archive a copy of Catalogue of Life, the following command was issued:

preston track https://download.catalogueoflife.org/col/latest_coldp.zip

With this, a copy of https://download.catalogueoflife.org/col/latest_coldp.zip is downloaded and its sha256 checksum (or hash) is calculated. The download process is also captured as machine-readable RDF/N-Quads statements.
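To illustrate how a content id of this form can be derived and checked independently of Preston, here is a minimal sketch using standard tools. The helper names content_id and verify_content_id and the temporary file are hypothetical and not part of Preston; the sketch assumes that sha256sum (or shasum as a fallback) is installed.

```shell
#!/usr/bin/env bash
# Sketch only: content_id and verify_content_id are hypothetical helper
# names, not Preston commands.

# Print the content id (hash://sha256/<hex digest>) of a file.
content_id() {
  local hex
  if command -v sha256sum >/dev/null 2>&1; then
    hex=$(sha256sum < "$1" | cut -d' ' -f1)
  else
    hex=$(shasum -a 256 < "$1" | cut -d' ' -f1)
  fi
  printf 'hash://sha256/%s\n' "$hex"
}

# Succeed only if the file's content id equals the expected one.
verify_content_id() {
  [ "$(content_id "$1")" = "$2" ]
}

# Hypothetical stand-in for a downloaded file: an empty temporary file.
empty_file=$(mktemp)
empty_id=$(content_id "$empty_file")
echo "$empty_id"
```

Because the identifier depends only on the bytes of the file, anyone holding a copy can run the same check without contacting the original download location.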

The content id (or sha256 hash) of the copy included in this publication can be found using:

preston alias\
  --anchor hash://sha256/e7130fb557d9aee033ac7147f4d5c4c75f12223dd43e53c7cbb141372f9579cd\
  --remote https://zenodo.org/deposit/8327611/files\
   https://download.catalogueoflife.org/col/latest_coldp.zip\
  | head -n1

and is

<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/d512d769a3e68b6f3be523b97f9f3c05b10d317607f95cb24ddcda27bded03e1> <urn:uuid:f00d82a2-d965-4b1d-9030-0a2f8833e004> .

meaning that hash://sha256/d512d769a3e68b6f3be523b97f9f3c05b10d317607f95cb24ddcda27bded03e1 is the content id (or sha256 hash) of the content produced by https://download.catalogueoflife.org/col/latest_coldp.zip at the time this publication was compiled.
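The statement above is a single N-Quads line: subject, predicate, object, and graph label, each wrapped in angle brackets. As a minimal sketch, and assuming none of the IRIs contain spaces, the content id can be pulled out of such a line with awk. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# The statement as printed in this publication.
statement='<https://download.catalogueoflife.org/col/latest_coldp.zip> <http://purl.org/pav/hasVersion> <hash://sha256/d512d769a3e68b6f3be523b97f9f3c05b10d317607f95cb24ddcda27bded03e1> <urn:uuid:f00d82a2-d965-4b1d-9030-0a2f8833e004> .'

# Keep only hasVersion statements and print the object (third field)
# with its surrounding angle brackets removed.
version_id=$(printf '%s\n' "$statement" | awk '
  $2 == "<http://purl.org/pav/hasVersion>" {
    gsub(/[<>]/, "", $3)
    print $3
  }')

echo "$version_id"
```

This whitespace-based split is a shortcut for this particular statement shape; a general N-Quads parser would also need to handle literals and blank nodes.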

Reverse Sort NameUsage.tsv

The second step was to pre-process (or transform) part of the tracked Catalogue of Life data package by reverse sorting its content. We used the following bash script to do this processing:

(sed -u 1q; LC_ALL=C sort -r) | gzip

The script first prints the header (i.e., sed -u 1q), then reverse sorts the following content (i.e., LC_ALL=C sort -r). Finally, the output is compressed using gzip.
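To see the effect on a small input, here is a minimal sketch that applies the same header-preserving reverse sort to a hypothetical three-row table. It swaps sed -u 1q for the shell's built-in read, which also consumes exactly one line from a pipe and does not depend on the GNU-specific -u flag; the function name and sample rows are assumptions made for illustration, and gzip compression is left out.

```shell
#!/usr/bin/env bash
# Sketch only: reverse_sort_keep_header is a hypothetical name, and read
# stands in for "sed -u 1q" from the published script.
reverse_sort_keep_header() {
  local header
  # Pass the first line (the header) through untouched ...
  IFS= read -r header
  printf '%s\n' "$header"
  # ... then reverse sort the remaining lines in byte order.
  LC_ALL=C sort -r
}

# Hypothetical sample rows; the real input is NameUsage.tsv.
sorted=$(printf 'col:ID\tcol:scientificName\nb2\tEnhydra lutris\na1\tAbies alba\nc3\tZea mays\n' \
  | reverse_sort_keep_header)

printf '%s\n' "$sorted"
```

Setting LC_ALL=C makes the ordering depend on byte values only, so the result does not vary with the locale of the machine that runs the sort.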

This script is part of this data publication, and can be retrieved via:

preston cat\
 --remote https://zenodo.org/record/8327611/files/\
 hash://sha256/03292e3e40a04c565c83debd26ed521516f158a29c011b57a96d1c720c40b6cd

To help make it easier to reference the script, an alias was created using

preston alias\
 urn:example:reverse-sort.sh\
 hash://sha256/03292e3e40a04c565c83debd26ed521516f158a29c011b57a96d1c720c40b6cd

Following the authoring of the reverse sort script and the documenting of its alias (i.e., urn:example:reverse-sort.sh), we applied the script to the acquired copy of the Catalogue of Life using:

preston cat\
 --remote https://zenodo.org/record/8327611/files/\
 'zip:hash://sha256/d512d769a3e68b6f3be523b97f9f3c05b10d317607f95cb24ddcda27bded03e1!/NameUsage.tsv'\
 | preston bash\
 --anchor hash://sha256/e7130fb557d9aee033ac7147f4d5c4c75f12223dd43e53c7cbb141372f9579cd\
 --remote https://zenodo.org/record/8327611/files/\
 -c urn:example:reverse-sort.sh

The result of this process was the content identified by sha256 hash

hash://sha256/1008433bfa5fe7fb059a720eddfe995e18b3e0e8f25ac0c990f1477c177d18cc

as documented in line 22 of the associated preston processing log in

preston cat\
  --remote https://zenodo.org/record/8327611/files/\
 'line:hash://sha256/8ba35deafc847f0d5d69d357241a431b7fd9b6f2735189575b2b7168d523caa9!/L22' 

Assign An Alias to Processed Resources

The alias ‘col:NameUsage.tsv.gz’ was defined to help make it easier to point to the result using:

preston alias\
  col:NameUsage.tsv.gz\
  hash://sha256/1008433bfa5fe7fb059a720eddfe995e18b3e0e8f25ac0c990f1477c177d18cc

With this, the following command was executed to list the first three lines of the produced resource:

preston cat\
 --anchor hash://sha256/e7130fb557d9aee033ac7147f4d5c4c75f12223dd43e53c7cbb141372f9579cd\
 --remote https://zenodo.org/record/8327611/files/\
 col:NameUsage.tsv.gz\
 | gunzip\
 | head -n3

Here, preston cat ... prints the produced resource, gunzip decompresses the result, and head -n3 selects the first three lines.

The result of the operation is shown below:

col:ID col:alternativeID col:nameAlternativeID col:sourceID col:parentID col:basionymID col:status col:scientificName col:authorship col:rank col:notho col:uninomial col:genericName col:infragenericEpithet col:specificEpithet col:infraspecificEpithet col:cultivarEpithet col:namePhrase col:nameReferenceID col:publishedInYear col:publishedInPage col:publishedInPageLink col:code col:nameStatus col:accordingToID col:accordingToPage col:accordingToPageLink col:referenceID col:scrutinizer col:scrutinizerID col:scrutinizerDate col:extinct col:temporalRangeStart col:temporalRangeEnd col:environment col:species col:section col:subgenus col:genus col:subtribe col:tribe col:subfamily col:family col:superfamily col:suborder col:order col:subclass col:class col:subphylum col:phylum col:kingdom col:sequenceIndex col:branchLength col:link col:nameRemarks col:remarks
ffc77d7d-2ede-49ff-ab12-03410a1c25db     55434 93MTR 4WGYN provisionally accepted [Semiothisa] lapidata Warren, 1906 species                         zoological acceptable               false                                                
ff82a38f-348b-4fd7-891e-a2a68d8edce4     55434 93MTR 78YFQ provisionally accepted [Sabulodes] arnissa Druce, 1891 species                         zoological acceptable               false                                                
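Because the table has many columns, it can help to select a column by its header name instead of counting positions. The following is a minimal sketch on a hypothetical four-column sample (the real table is tab-separated with the full header shown above); the variable names and sample row are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened sample of the tab-separated table.
sample=$(printf 'col:ID\tcol:status\tcol:scientificName\tcol:authorship\nid-1\taccepted\tEnhydra lutris\t(Linnaeus, 1758)\n')

# Find the position of the wanted column in the header line, then print
# that field for every following row.
names=$(printf '%s\n' "$sample" | awk -F '\t' -v want='col:scientificName' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i; next }
  col     { print $col }')

printf '%s\n' "$names"
```

If the requested header is not present, col stays unset and nothing is printed.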

Results

As described in our methods, this publication derived the resource with alias col:NameUsage.tsv.gz and content id hash://sha256/1008433bfa5fe7fb059a720eddfe995e18b3e0e8f25ac0c990f1477c177d18cc. This resource contains a reverse-sorted copy of the NameUsage.tsv provided in the Catalogue of Life data package retrieved from https://download.catalogueoflife.org/col/latest_coldp.zip with content id hash://sha256/d512d769a3e68b6f3be523b97f9f3c05b10d317607f95cb24ddcda27bded03e1.

The following tools were used to process the Catalogue of Life resource:

Tools used in this data publication
tool name
preston
bash
gzip
sed
head
sort

Discussion

This publication is intended to facilitate re-use of the Catalogue of Life data package in taxonomic name alignment workflows. While the primary goal was to generate a resource for use in Nomer v0.5.4 (J. Poelen and Salim 2023), other uses can be imagined, such as:

  1. Lots of Copies Keeps Stuff Safe (LOCKSS; Maniatis et al. 2005): keeping an identical copy of the Catalogue of Life data package outside of the Catalogue of Life infrastructure.
  2. demonstrating how data transformation processes can be documented using Preston
  3. making a streamable, reverse-sorted copy of the Catalogue of Life available via https://zenodo.org/record/8327611/files/1008433bfa5fe7fb059a720eddfe995e18b3e0e8f25ac0c990f1477c177d18cc for use in workflows like looking up the first record that contains Enhydra lutris (Sea otter):
 curl -L 'https://zenodo.org/record/8327611/files/1008433bfa5fe7fb059a720eddfe995e18b3e0e8f25ac0c990f1477c177d18cc'\
   | gunzip\
   | grep "Enhydra lutris"\
   | head -n1
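The lookup above keeps only the first hit from a decompressed stream. A minimal offline sketch of the same pattern follows, assuming a tiny hypothetical sample in place of the Zenodo download and using grep -m 1 -F (stop after the first fixed-string match) as an alternative to piping grep into head -n1. The file and variable names are illustrative.

```shell
#!/usr/bin/env bash
# Build a tiny, hypothetical stand-in for the gzipped, reverse-sorted table.
sample_gz=$(mktemp)
printf 'col:ID\tcol:scientificName\nx3\tLontra canadensis\nx2\tEnhydra lutris nereis\nx1\tEnhydra lutris\n' \
  | gzip > "$sample_gz"

# Decompress as a stream and stop at the first line containing the name.
first_match=$(gunzip < "$sample_gz" | grep -m 1 -F 'Enhydra lutris')

printf '%s\n' "$first_match"
```

Note that a plain substring match also hits longer names that contain the search string, as the sample row for a subspecies shows; matching on a whole tab-delimited field would be stricter.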

Acknowledgements

This work stands on the shoulders of contributors to open source software and openly accessible datasets. Thank you!

References

Bánki, O., et al. 2023. “Catalogue of Life Checklist (Version 2023-08-17).” Catalogue of Life. https://doi.org/10.48580/dft7.

Elliott, Michael J., Jorrit H. Poelen, and José A. B. Fortes. 2023. “Signing Data Citations Enables Data Verification and Citation Persistence.” Scientific Data 10 (1). https://doi.org/10.1038/s41597-023-02230-y.

Maniatis, Petros, Mema Roussopoulos, Thomas J Giuli, David SH Rosenthal, and Mary Baker. 2005. “The LOCKSS Peer-to-Peer Digital Preservation System.” ACM Transactions on Computer Systems (TOCS) 23 (1): 2–50.

Poelen, Jorrit H. (ed.). 2023. “Nomer Corpus of Taxonomic Resources hash://sha256/12051b8aa59930d6561a3ed46b7cf3f67a31a98445a457d78894f6b8a8e81641 hash://md5/1ff6b3628d7afc15b882cc0c9b1c3815.” Zenodo. https://doi.org/10.5281/zenodo.8326175.

Poelen, Jorrit, and José Augusto Salim. 2023. “Globalbioticinteractions/Nomer: 0.5.4.” Zenodo. https://doi.org/10.5281/zenodo.8329422.

Files

Files (645.2 MB)

Name  Size
md5:52742e050f26fbd976c89b41d03bd7e6  37 Bytes
md5:7a767c6c69ba1aad610e1464e4699791  78 Bytes
md5:3ac4f7368439677472f4c5b78f39a7f8  249.7 MB
md5:d548176bdab6e55bff8c5870a2d54d73  78 Bytes
md5:a71288331c96a9deb52784cd1df7e1b3  78 Bytes
md5:c855768414a9cb25608da73ef05aa2dc  78 Bytes
md5:6bf0c01a7399eadc1e884b40816cafef  78 Bytes
md5:393a6beada72f4ae814765aaeaf0bd39  4.1 kB
md5:d279aa8e68fa34ede27de535d46a04b1  4.0 kB
md5:b2613e0f67093c892691f4df0914520a  3.8 kB
md5:a4421eb8f92ec1bbf4ed9f25359be674  78 Bytes
md5:facc490d8483f71b325abb8bf1cb1660  395.4 MB
md5:882b8744b3ebd5fae371fa659ee52a2b  2.7 kB
md5:699290389672021de333d5f300f10554  2.7 kB
md5:883786539eac11b363423cd58e092162  2.8 kB

Additional details

Related works

Is derived from
Dataset: 10.48580/dft7 (DOI)