The Archive and Package (arcp) URI Scheme

The arcp URI scheme is introduced for location-independent identifiers to consume or reference hypermedia and linked data resources bundled inside a file archive, as well as to resolve archived resources within programmatic frameworks for Research Objects. The Research Object for this article is available at http://s11.no/2018/arcp.html#ro


I. BACKGROUND
Archive formats like BagIt [1] have been recognized as important for the preservation and transfer of datasets and other digital resources [2]. More specific examples include COMBINE archives [3] for systems biology, CDF [4] for astronomy data, as well as the more general HDF5 [5], which is also used for meteorological data. For the purpose of this article, an archive is a collection of data files with related metadata, typically packaged in a compressed file format like .zip or .tar.gz.
One challenge with regard to embedding Linked Data in such archives is how to reliably generate and resolve internal URLs. For instance, <dataset13.zip> may contain an RDF Turtle file <metadata/description.ttl> to describe the CSV file <data/survey.csv>, but in order to correctly reference that file it will either have to use a relative path <../data/survey.csv> or some pre-existing Web URL like <http://example.com/dataset13/survey.csv>.
The Research Object Bundle [6] format suggested re-using the app URI scheme for minting absolute URIs from relative paths of resources within a ZIP file. The app URL scheme [7] was originally intended for packaged web applications, where each application would get its own namespace like <app://c6179148-3cde-4435-8e66-304453f89d59/> with paths resolved from the corresponding application package ZIP file. However, the app URL scheme did not progress further on the W3C Recommendation track, and this approach was abandoned in favour of the combination of Web App Manifest [8] and Service Workers [9]. Together these technologies reuse the http/https origin URL of a downloaded application manifest together with relative links, while also allowing a web application to work offline.
II. THE ARCHIVE AND PACKAGE (ARCP) URI SCHEME
Inspired by the app URL scheme, we defined the Archive and Package (arcp) URI scheme [10], an IETF Internet-Draft which specifies how to mint URIs to reference resources within any archive or package, independent of archive format or location.
The primary use case for arcp is for consuming applications, which may receive an archive in various ways, like file upload from a web browser or by reference to a dataset in a repository like Zenodo or FigShare. In order to parse Linked Data resources (say to expose them for SPARQL queries), they will need to generate a base URL for the root of the archive.
It should be clear that using local file URIs [11] for extracted archives, like <file:///tmp/tmp.cUK6ERfdBe/>, does not serve this purpose well, as they are not universally unique, are difficult to create consistently, and may introduce security risks from path traversal attacks like <../../etc/passwd>. Similarly it may be inappropriate to mint new web-based URIs like <http://repo.example.com/cUK6ERfdBe/>, as web presence should not be a requirement to process a linked data archive, in particular as processing may occur on a laptop or a cloud node with no public IP address.

A. Identifier structure
By definition an arcp identifier is a URI [12] with three parts, as shown in figure 1.
<arcp://prefix,namespace/path>
The arcp Internet-Draft specifies three initial prefix values: uuid, ni and name, each of which defines how to identify a particular archive by a corresponding namespace. These namespaces are not intended to be directly resolvable without prior knowledge of the corresponding archive.
The path is the folder and file path within the archive, represented as a URI path [12], e.g. /file.txt or /my%20project/about/intro.doc, using percent-escaping if needed. The root folder / represents the archive itself.
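The three parts can be separated with a general-purpose URI parser. The following is a minimal sketch using only the Python standard library; the names ArcpParts and split_arcp are hypothetical and are not taken from the arcp Internet-Draft or the arcp Python library.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlsplit


class ArcpParts(NamedTuple):
    prefix: str      # e.g. "uuid", "ni" or "name"
    namespace: str   # identifies the archive, interpreted according to the prefix
    path: str        # percent-decoded folder and file path within the archive


def split_arcp(uri: str) -> ArcpParts:
    """Split an arcp URI into prefix, namespace and path (illustrative helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "arcp":
        raise ValueError(f"not an arcp URI: {uri}")
    # The authority component carries "prefix,namespace".
    prefix, separator, namespace = parts.netloc.partition(",")
    if not separator:
        raise ValueError(f"no prefix found in authority: {parts.netloc}")
    return ArcpParts(prefix, namespace, unquote(parts.path))


print(split_arcp(
    "arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/my%20project/about/intro.doc"
))
```

The sketch does not validate the namespace against its prefix; a stricter parser would, for instance, check that a uuid namespace is a well-formed UUID.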

B. UUID-based identifiers
The simplest case for temporary sandbox processing of an archive with arcp is to generate a new random UUIDv4 [13], e.g.:
c6179148-3cde-4435-8e66-304453f89d59
From this the corresponding arcp URI is:
<arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/>
This base URI can be used when resolving relative URI references, e.g. if <metadata/description.ttl> references <../data/survey.csv> we find the absolute URIs:
<arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/metadata/description.ttl>
<arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/data/survey.csv>
The application is then able to translate from arcp to local paths, using URI parsing libraries to select the URI path and combining it with the locally extracted path. Such arcp identifiers are temporary in nature, but the application can maintain a mapping from the UUID to the archive and perform extraction on demand, or the archive can self-declare its UUID, such as with the External-Identifier header in BagIt [1].
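The minting, resolution and translation steps can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, and reference resolution is approximated with posixpath rather than a full RFC 3986 implementation, so queries, fragments and trailing slashes are not preserved.

```python
import posixpath
import uuid
from pathlib import Path
from urllib.parse import unquote, urlsplit


def mint_base() -> str:
    """Mint a temporary arcp base URI from a new random UUIDv4."""
    return f"arcp://uuid,{uuid.uuid4()}/"


def resolve(base_uri: str, reference: str) -> str:
    """Resolve a relative path reference against an arcp URI (simplified)."""
    base = urlsplit(base_uri)
    folder = posixpath.dirname(base.path)
    # normpath collapses "." and ".." and cannot climb above the root "/".
    path = posixpath.normpath(posixpath.join(folder, reference))
    return f"{base.scheme}://{base.netloc}{path}"


def to_local(arcp_uri: str, extracted_root: Path) -> Path:
    """Map an arcp URI to a path below the locally extracted folder."""
    path = posixpath.normpath(unquote(urlsplit(arcp_uri).path) or "/")
    return extracted_root / path.lstrip("/")


base = "arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/"
description = resolve(base, "metadata/description.ttl")
survey = resolve(description, "../data/survey.csv")
print(survey)
# arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/data/survey.csv
print(to_local(survey, Path("/tmp/tmp.cUK6ERfdBe")))
```

Because the path is normalised against the archive root before it is joined with the extracted folder, a reference such as <../../etc/passwd> stays inside the sandbox in this sketch.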
arcp also suggests how a UUID can be reliably created from the URL location of an archive. For instance, an application may be processing an archive file retrieved from a known URL. The application can calculate the name-based UUIDv5 [13] by SHA-1 hashing the URL string and mint:
<arcp://uuid,d9f0b57d-0504-5e9a-abae-f5f2b8c49b94/>
With this method anyone processing that archive URL will always get the same arcp base URI; however, the application will still need to maintain a mapping to find the original archive URL. Location-based arcp identifiers may also not be ideal for preservation purposes, as the archive might change upstream or move to a different location.
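A location-based identifier can be derived with the standard Python uuid module, as in the sketch below. It assumes that the URL namespace UUID (uuid.NAMESPACE_URL) is the one to use for the UUIDv5 calculation; the function name and the example URL are illustrative only.

```python
import uuid


def arcp_from_location(url: str) -> str:
    """Derive a location-based arcp base URI from an archive URL."""
    # uuid5 computes a SHA-1 name-based UUID; the URL namespace is assumed here.
    return f"arcp://uuid,{uuid.uuid5(uuid.NAMESPACE_URL, url)}/"


first = arcp_from_location("http://example.com/download/archive13.zip")
second = arcp_from_location("http://example.com/download/archive13.zip")
print(first)
print(first == second)  # True: the same URL always gives the same base URI
```

Note that the URL string is hashed as given, so two spellings of the same location (for instance with and without a trailing slash) yield different identifiers.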

C. Hash-based identifiers
For preservation purposes arcp therefore defines a hash-based method, where the bytes of the archive file are used to compute a checksum-based identifier based on the Naming Things With Hashes (ni) URI scheme [14]. For instance, if the sha-256 checksum of a ZIP file is, in hexadecimal:
7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069
then after base64url encoding the ni URI would be:
<ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk>
The corresponding arcp base URI for resources within the archive is thus:
<arcp://ni,sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk/>
With this method, anyone processing a byte-wise equal archive (using the same hash method) will get the same identifier.
Another advantage is that hash-identified archives can be retrieved from an NI resolver [14] using well-known paths [15]:
<http://repo.example.com/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk>
Clients can verify the checksum of the downloaded archive, so any accepting resolver endpoint can be used.
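The hash-based derivation can be sketched with the Python standard library as follows. The function names are hypothetical; the encoding step assumes the ni convention of URL-safe base64 without padding, and the resolver host repo.example.com is the placeholder used above.

```python
import base64
import hashlib


def ni_value(digest: bytes) -> str:
    """Encode a raw digest as URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def arcp_from_digest(digest: bytes, algorithm: str = "sha-256") -> str:
    """Build the hash-based arcp base URI for an archive digest."""
    return f"arcp://ni,{algorithm};{ni_value(digest)}/"


def well_known_ni(authority: str, digest: bytes, algorithm: str = "sha-256") -> str:
    """Build the .well-known/ni URL for retrieving the archive from a resolver."""
    return f"http://{authority}/.well-known/ni/{algorithm}/{ni_value(digest)}"


def arcp_from_archive(data: bytes) -> str:
    """Hash the archive bytes with SHA-256 and mint the arcp base URI."""
    return arcp_from_digest(hashlib.sha256(data).digest())


digest = bytes.fromhex(
    "7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069"
)
print(arcp_from_digest(digest))
# arcp://ni,sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk/
print(well_known_ni("repo.example.com", digest))
# http://repo.example.com/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
```

A client that downloads the archive from the resolver can call arcp_from_archive on the received bytes and compare the result with the identifier it started from.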

D. Name-based identifiers
Finally, paying homage to its origin in app URLs, arcp can use a system-based app name. This is a suggested mechanism for resolving resources of an application package installed in a runtime system, where an application identifier like an Android applicationId or a Java package name can be directly reused in arcp for URIs within that runtime system. For example, to reference the resource styles/resource1.css within the installed package com.example.myapp one can use the URI:
<arcp://name,com.example.myapp/styles/resource1.css>
As application package content does not necessarily correspond to archive file listings, it is open-ended how name-based arcp identifiers can be resolved. Indeed, package content may vary per operating system, device type or application version, and so name-based arcp identifiers should be treated as system-local identifiers similar to file:/// URIs [11], but within a particular programming framework.

A. Archive fragments
Without using arcp one could in theory still reference files within archives at a URL with # fragments:
<http://example.com/download/archive13.zip#data/survey.csv>
Unlike formats like text/html or application/pdf, most archive media formats like application/zip unfortunately do not define a fragment syntax, and some major types like tar.gz are not even listed in the IANA media types registry. Therefore this would be an ad-hoc approach which would still need to clarify details in order to be interoperable, for instance character escaping, whether the root is # or #/, and how to reference nested fragment identifiers in hypermedia within archived resources.

B. File URIs
As argued above, file URLs [11] that represent local directories are fragile and not globally unique. It is perhaps less known that file URLs can specify a host name:
<file://host.example.com/home/alice/extracted/archive13/>
The above references a file path on the machine with the fully qualified domain name (FQDN) host.example.com. The usually empty hostname is equivalent to localhost.
This approach may be used if both the hostname and extracted path are stable (e.g. a repository file server), but this faces the same challenges as minting http/https URLs, which in many cases would be preferable as they are also globally resolvable.
An ad-hoc possibility here could be to use a UUID [13] as "hostname" to represent an archive's internal file system:
<file://8f26cb8c-617e-46b4-bc48-e650bf70f33d/data/survey.csv>
This is technically permitted, as the file: URL scheme [11] does not define any particular connection protocol, and a UUID is unlikely to be a valid hostname in DNS. Such file: URIs could however be confused with file paths on localhost; for instance, Firefox 62.0 opens file://8cd4ce0d-4a41-4b4e-bfdd-1e2d0495f714/ to browse the local file system.

C. JAR URLs
If we restrict usage to ZIP files at a known URL, then they are in theory also valid JAR files, and we can address files with the jar URL scheme:
<jar:http://example.com/download/archive13.zip!/data/survey.csv>
Here relative URIs may not parse well, as it is easy to accidentally climb out of !/, and technically the JAR URI scheme is missing the familiar :// that indicates to URI parser libraries that it is indeed a hierarchical URI scheme [12].
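Both concerns can be observed with Python's standard URI parser, as in the sketch below. Resolving against the embedded http URL is only an illustration of what a naive consumer might do; other.csv is a hypothetical file name.

```python
from urllib.parse import urljoin, urlsplit

jar = "jar:http://example.com/download/archive13.zip!/data/survey.csv"

# Without "//" after the scheme, no authority is recognised and the whole
# remainder, including the embedded URL, is treated as one opaque path.
parts = urlsplit(jar)
print(parts.netloc == "")  # True
print(parts.path)
# http://example.com/download/archive13.zip!/data/survey.csv

# Resolving a relative reference against the embedded URL climbs out of "!/",
# so the result no longer points inside the archive.
embedded = jar[len("jar:"):]
print(urljoin(embedded, "../../other.csv"))
# http://example.com/download/other.csv
```

With an arcp base URI the archive root is the URI root, so the same reference cannot resolve to a location outside the archive.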

D. Object Reuse and Exchange proxies
OAI-ORE [16] defines proxies to represent a resource as aggregated in a collection; these can be used to model archives [17], but ORE proxies face two problems: how to represent the file path, and how to identify the proxy so it can be used as a reference in Linked Data. The resource must be identified using two triples of ore:proxyFor (the archived file) and ore:proxyIn (the archive), but this reduces to the same problem of identifying the file. The ni URI [14] for the file bytes can in theory be used to identify the file, but the other missing information is the file path and name, which usually convey meaning for users.
The Research Object ontology's FolderEntry specializes ore:Proxy to add a property ro:entryName to indicate the filename, as exemplified in figure 2, but to find the full archive file path one would have to traverse the ro:entryName of each parent folder. In either case there is no defined method to predictably generate unique identifiers for the ORE proxies themselves, although the RO Bundle specification recommends that they should be randomly generated urn:uuid URIs, which would not be compatible with relative URIs within an archive.

E. Publishing file systems as Linked Data
F2R [18], using the Nepomuk File Ontology [19], defines a way to publish file systems as Linked Data, where a server endpoint exposes the files and their file system metadata.
The F2R approach has similar disadvantages to JAR and OAI-ORE, in that the URIs do not support relative path resolution, that a web endpoint must be set up, and that the file paths are hidden through multiple steps. In addition one would need to assign a corresponding file system name like mysource, although one may use a single file system as exemplified above and use belongsToContainer to treat archive files as if they are folders.

F. EPUB canonical fragment identifiers
EPUB is a standard for hypermedia eBooks. RO Bundle [6] is based on the EPUB Open Container Format [21]. EPUB Canonical Fragment Identifiers [22] can link to nested XML elements of a publication using a variation of XPath with doubled indexes. The example refers to a paragraph within an ePub book. Here /6 refers to the 3rd child element of the root manifest's <package> element (which in ePub is always <spine>), then /4[chap01ref] is the second element <itemref> with xml:id="chap01ref".
The ! character means the element's reference is followed to open the corresponding XML file, where /4[body01] is the 2nd element with id body01, traversed to find the 5th element with id para05.
While this is quite a powerful construct that can refer to any XML element of nested documents, even sentences or words, it seems rather contrived and inflexible. The major limitation is that ePub archive resources are not identified by file paths, but must be addressable through rather rigid XML structures (the order cannot change); thus this approach is not appropriate for archives without an XML manifest. Even if using an RDF/XML manifest it would be inadvisable to assume a fixed order of its XML elements. It seems however an appropriate reference scheme for ePub documents, which generally have a fixed reading order.
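The doubled indexes can be made concrete with a small sketch that converts each even step number into the ordinal position of the child element it selects. The function name is hypothetical, and the step /10[para05] is an assumption derived from the description of the 5th element with id para05 rather than copied from the original example.

```python
import re

# One step: "/" + number, optionally followed by an id assertion in brackets.
STEP = re.compile(r"/(\d+)(?:\[([^\]]+)\])?")


def describe_steps(cfi_path: str):
    """Return (child element position, optional id) for each even CFI step."""
    steps = []
    for index, ident in STEP.findall(cfi_path):
        number = int(index)
        if number % 2:
            # Odd step numbers address text between elements, not elements.
            raise ValueError(f"step /{number} does not address an element")
        steps.append((number // 2, ident or None))
    return steps


print(describe_steps("/6/4[chap01ref]"))
# [(3, None), (2, 'chap01ref')]
print(describe_steps("/4[body01]/10[para05]"))
# [(2, 'body01'), (5, 'para05')]
```

The positional part of each step is what makes such identifiers sensitive to element order: inserting a sibling before the target changes its step number, even if the id stays the same.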

IV. ARCP IMPLEMENTATIONS
The arcp Python library [23] was developed to help create, parse and validate arcp URIs. In particular it can generate arcp URIs based on random UUIDs, URL locations, names and hashes of archive bytes. The arcp parser recognizes the arcp prefix and can extract UUIDs or hashes, and can generate the corresponding .well-known/ni URI for retrieving the archive. This library is meant to complement the Python 3 urllib.parse module, and so resolving arcp URIs through archive or network access is deemed out of scope for the library.
adding support for arcp URIs in its opening and creation of RO bundles, initially using the arcp UUID format as a replacement for app URIs, with planned support also for hash-based identifiers and opening RO Bundles from a .well-known/ni endpoint.
The CWLProv [24] approach for capturing provenance of Common Workflow Language executions uses arcp in its BagIt metadata file bag-info.txt, where External-Identifier identifies its research object:
External-Identifier: arcp://uuid,d47d3d43-4830-44f0-aa32-4cda74849c63/
For CWLProv the use of arcp is crucial, as it assigns global identifiers for use across resources in the RO bag, including the RO manifest itself and W3C PROV file formats like PROV-N and N-Triples, as neither format supports relative URIs.
In this approach the UUID component of the RO arcp identifier, d47d3d43-4830-44f0-aa32-4cda74849c63, also appears in the workflow provenance as the identifier of the top-level workflow run (a PROV Activity). This showcases how an RO that is the primary representation of a non-information resource (e.g. a process) can be identified using a directly derived arcp URI. While this could in theory also have been achieved with an arcp UUIDv5 derived from hashing the URI "location" of the activity, that would be a confusing hack, as urn:uuid: references by design are not resolvable, and hence technically not URLs. UUIDv5 hashing could however be appropriate for non-information resources if they have a resolvable http/https permalink.
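The relation between the bag's External-Identifier and the workflow run identifier can be sketched as below. This is an illustration only and not CWLProv's implementation: the helper names are hypothetical, bag-info.txt continuation lines are ignored, and the run is assumed to be identified by the urn:uuid form of the same UUID.

```python
import uuid
from urllib.parse import urlsplit


def external_identifiers(bag_info: str):
    """Yield the External-Identifier values found in bag-info.txt content."""
    for line in bag_info.splitlines():
        label, separator, value = line.partition(":")
        # Labels are compared case-insensitively in this sketch.
        if separator and label.strip().lower() == "external-identifier":
            yield value.strip()


def run_urn(arcp_uri: str) -> str:
    """Return the urn:uuid form of the UUID in a UUID-based arcp URI."""
    prefix, _, namespace = urlsplit(arcp_uri).netloc.partition(",")
    if prefix != "uuid":
        raise ValueError(f"not a UUID-based arcp URI: {arcp_uri}")
    return uuid.UUID(namespace).urn


bag_info = "External-Identifier: arcp://uuid,d47d3d43-4830-44f0-aa32-4cda74849c63/\n"
for identifier in external_identifiers(bag_info):
    print(identifier, "->", run_urn(identifier))
# arcp://uuid,d47d3d43-4830-44f0-aa32-4cda74849c63/ -> urn:uuid:d47d3d43-4830-44f0-aa32-4cda74849c63
```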

V. CONCLUSION
This article proposes the arcp identifier scheme for resources within archives using formats like ZIP, tar and BagIt, and suggests that arcp is useful for identifying standalone Research Objects and for processing Linked Data embedded in archives. The Internet-Draft draft-soilandreyes-arcp [10] is under consideration by IETF's Applications and Real-Time Area to progress towards Informational RFC status.