Presentation given at PIDapalooza, Dublin, 23rd of January 2019
This content is also available on IPFS, packaged using the DataCrate convention: ipfs://QmdhEQrRvvo6mY4iyNz544xChM6nmrJbefvnYNEtVTQzz5 (note: for convenience, this currently links to the ipfs.io HTTP gateway service).
At the risk of causing some respected members of the PIDapalooza community to stab themselves in the eyeball (see https://doi.org/10.6084/m9.figshare.5914312.v1), this session will explore content-addressing of research outputs through the InterPlanetary File System (IPFS) and how this approach relates to existing persistent identifier infrastructures.
Research data is driving some new PID requirements. Linking versioned DOIs has been explored in various working groups and implemented on various popular platforms (Figshare, Zenodo, F1000). There is also renewed interest in the idea of data packaging (https://rd-alliance.org/approaches-research-data-packaging-rda-11th-plenary-bof-meeting) and research objects (http://www.researchobject.org/ro2018/) in RDA and related groups as a practical means of bundling data with its metadata in a way that can be easily cited and transmitted as a single payload. Finally, several groups are exploring how best to directly reference content in PID metadata using cryptographic hashing. For example, the Freya project is looking at how best to allow “direct access to content associated with a DOI” (see https://github.com/datacite/freya/issues/2) and RDA is tackling similar issues in the PID Kernel Information Working Group (https://www.rd-alliance.org/groups/pid-kernel-information-wg). This all suggests an appetite for greater consensus around linking PIDs to versioned (and therefore immutable), self-describing, directly accessible content.
The tl;dr of IPFS is that all content on the web/network can be referenced not by where it is located (a particular server or server farm, found through a DNS/domain lookup), but by cryptographic identifiers derived from the content itself. This allows the protocol to retrieve the desired information from any node on the network that holds it, and removes some of the issues with content moving and drifting on the web. Cryptographic hashing plays a key role here, as it did in BitTorrent, Git and many others before, but IPFS aims to make this a ubiquitous, general-purpose network protocol, comparable to HTTP. Hashing digital content as a means of ensuring fixity/integrity will be familiar to PIDapalooza participants. Moreover, people working in the digital repository and cloud storage space may well already work with content-addressed storage of one form or another. However, the prospect of (relatively) widespread use of peer-to-peer content-addressing at web/network level, as exemplified in particular by the InterPlanetary File System (IPFS; ipfs.io), raises some interesting possibilities for how we manage and cite research data.
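To make the idea of addressing by content rather than by location concrete, here is a minimal Python sketch. It is not the IPFS API: the `ContentStore` class and its `put`/`get` methods are hypothetical names, and the identifier is a plain SHA-256 hex digest, whereas real IPFS identifiers also involve chunking and a self-describing encoding. The sketch only shows that two independent "nodes" derive the same identifier for the same bytes, and that a reader can verify whatever it receives.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not the IPFS API)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The identifier is derived from the bytes alone, not from a location.
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Any node may serve the bytes; the reader re-hashes them to check fixity.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its identifier")
        return data


# Two independent nodes holding the same payload agree on its identifier.
node_a = ContentStore()
node_b = ContentStore()
payload = b"research data, version 1"
key_a = node_a.put(payload)
key_b = node_b.put(payload)
```

Because any change to the bytes yields a different identifier, each identifier refers to exactly one immutable version of the content, which is what makes the approach interesting for versioned research data.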
On the face of it, IPFS bakes *some* of the core PID use-cases, in particular handling location change and/or multi-resolution, right into the fabric of the network. However, by itself, IPFS is not a panacea. For example, although IPFS has built-in ways to update hashing algorithms to deal with future hash collisions, this isn’t much good if you’ve cited something using a now-broken hash. IPFS also has emerging specifications for a metadata registry and for a mutable namespace to cater for updated content, but again, from a PID perspective, these re-introduce some of the fragility of the HTTP web and also some of the familiar requirements for transparent, multi-stakeholder, and relatively centralised governance models that make PID schemes so trusted today.
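The point about hash algorithm upgrades can be illustrated with a self-describing digest, sketched here on the assumption that it follows the multihash layout used by IPFS (a function code, a digest length, then the digest). The helper names `multihash` and `verify` are hypothetical, and the sketch handles only single-byte codes and lengths.

```python
import hashlib

# Function codes as assigned in the multihash table (assumed here):
# 0x12 is sha2-256 and 0x13 is sha2-512.
SHA2_256 = 0x12
SHA2_512 = 0x13
_HASHERS = {SHA2_256: hashlib.sha256, SHA2_512: hashlib.sha512}


def multihash(data: bytes, code: int = SHA2_256) -> bytes:
    """Prefix the digest with the algorithm code and the digest length."""
    digest = _HASHERS[code](data).digest()
    return bytes([code, len(digest)]) + digest


def verify(data: bytes, mh: bytes) -> bool:
    """Re-hash with whichever algorithm the identifier itself names."""
    code, length = mh[0], mh[1]
    digest = _HASHERS[code](data).digest()
    return len(digest) == length and mh[2:] == digest


data = b"research data, version 1"
old_id = multihash(data, SHA2_256)  # what an existing citation would hold
new_id = multihash(data, SHA2_512)  # what a later network default might issue
```

Both identifiers verify the same bytes, but they are different strings. A citation that recorded `old_id` stays bound to the older algorithm, so something outside the hash itself still has to map it to a newer identifier if that algorithm is ever broken.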
This session will introduce the peer-to-peer content-addressed approach, explore its benefits and weaknesses in terms of persistence, and discuss how IPFS and related approaches can both leverage PID infrastructures and in turn be leveraged by them to address particular content distribution and referencing requirements.