Published October 31, 2024 | Version v1
Presentation Open

Chasing rabbits: how ARGA greatly expands provenance using Conflict-free Data Types and a Hybrid Logical Clock

Authors/Creators

  • 1. Atlas Of Living Australia

Description

The Australian Reference Genome Atlas (ARGA) took its first steps as a publicly accessible service that lets researchers easily find genomic data for all species relevant to Australia. We now face new challenges in maintaining the ARGA index and keeping it as up to date as possible. One such challenge is greatly increasing transparency about how the data was generated, collected, and folded into the index. A related challenge is adding transparency and provenance to existing datasets as a function of our data processing pipeline. ARGA has implemented a novel approach to tracking and storing changes to publicly available datasets such as the National Species List (NSL), OZCAM, and Plazi TreatmentBank. By combining recent work on conflict-free replicated data types (CRDTs), operation logs stored in a PostgreSQL database, and an entity system, ARGA can show granular changes to every record column along with detailed attribution for each change. Furthermore, the approach has the potential to add high availability and eventual consistency to any subset of ARGA's index, allowing for a richer collaborative experience among aggregators and data authors. In this presentation, we demonstrate how leveraging this Highly Available Logically Deterministic Entity System enables us to increase the transparency of our data aggregation pipelines and extend that ability to external datasets that do not provide that level of detail. As a result of this gestalt switch, ARGA makes provenance, a keystone of all sciences, substantially more searchable and actionable.
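
To make the combination of a hybrid logical clock, an operation log, and a CRDT more concrete, the following is a minimal Python sketch of the general technique and not ARGA's actual implementation. All class, function, and field names, the node and dataset labels, and the sample values are hypothetical and chosen only for illustration. Each change is logged as an operation on one column of one entity, stamped with a hybrid logical clock timestamp, and the current value of a column is resolved as a last-writer-wins register, so the result does not depend on the order in which operations arrive.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass(order=True, frozen=True)
class HlcTimestamp:
    """Hybrid logical clock timestamp; compared field by field, in order."""
    wall_ms: int   # highest physical time seen so far, in milliseconds
    counter: int   # logical counter for events sharing the same wall_ms
    node: str      # node identifier, used only to break exact ties


class HybridLogicalClock:
    def __init__(self, node: str) -> None:
        self.node = node
        self.wall_ms = 0
        self.counter = 0

    def now(self, physical_ms: int) -> HlcTimestamp:
        """Stamp a local event; never goes backwards if the wall clock does."""
        if physical_ms > self.wall_ms:
            self.wall_ms, self.counter = physical_ms, 0
        else:
            self.counter += 1
        return HlcTimestamp(self.wall_ms, self.counter, self.node)

    def observe(self, remote: HlcTimestamp, physical_ms: int) -> HlcTimestamp:
        """Merge a timestamp received from another node (HLC receive rule)."""
        new_wall = max(self.wall_ms, remote.wall_ms, physical_ms)
        if new_wall == self.wall_ms and new_wall == remote.wall_ms:
            counter = max(self.counter, remote.counter) + 1
        elif new_wall == self.wall_ms:
            counter = self.counter + 1
        elif new_wall == remote.wall_ms:
            counter = remote.counter + 1
        else:
            counter = 0
        self.wall_ms, self.counter = new_wall, counter
        return HlcTimestamp(self.wall_ms, self.counter, self.node)


@dataclass(frozen=True)
class Operation:
    """One logged change to a single column of a single entity."""
    timestamp: HlcTimestamp
    entity_id: str
    column: str
    value: Any
    source: str    # attribution: which dataset or importer made the change


def reduce_log(ops: Iterable[Operation]) -> Dict[Tuple[str, str], Operation]:
    """Resolve each (entity, column) as a last-writer-wins register."""
    state: Dict[Tuple[str, str], Operation] = {}
    for op in ops:
        key = (op.entity_id, op.column)
        current = state.get(key)
        if current is None or op.timestamp > current.timestamp:
            state[key] = op
    return state


def history(ops: Iterable[Operation], entity_id: str, column: str) -> List[Operation]:
    """Return every logged change to one column, oldest first."""
    matching = [o for o in ops if o.entity_id == entity_id and o.column == column]
    return sorted(matching, key=lambda o: o.timestamp)


# Hypothetical sample log: two importers touching the same record.
clock_a = HybridLogicalClock("importer-a")
clock_b = HybridLogicalClock("importer-b")
ops = [
    Operation(clock_a.now(1000), "taxon:1", "scientific_name",
              "Oryctolagus cuniculus", "dataset-a"),
    Operation(clock_b.now(1000), "taxon:1", "authority",
              "(Linnaeus, 1758)", "dataset-b"),
    Operation(clock_a.now(1000), "taxon:1", "authority",
              "Linnaeus, 1758", "dataset-a"),
]
current = reduce_log(ops)
```

In this sketch, `reduce_log(ops)` and `reduce_log(reversed(ops))` yield the same state, which is the property that makes eventual consistency possible, while `history(ops, "taxon:1", "authority")` returns both changes to that column together with the `source` that made each one. In a setup like the one described above, the operations would live in a database table rather than an in-memory list.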

Files

Null.txt (81 Bytes)
md5:4e1a42f460583a4d97552753d39b4d89