Published February 25, 2024 | Version v1
Report Open

Human Genomes Platform Project: Virtual Cohort Assembly Pilot Phase Report

Description

The Human Genomes Platform Project (HGPP) is a collaborative research project aiming to enhance secure and responsible human genomic data sharing for research purposes. National and international connectivity is important to maximise the utility of these sensitive and valuable assets. The project partners represent many of the largest human genome sequencing and analysis organisations in Australia. 

The goal of the Virtual Cohorts sub-project within the HGPP is to implement systems to identify cohorts of individuals and related genomic data assets across repositories located at each of the partner institutes. In the preceding Virtual Cohort Discovery Phase report we examined the current landscape of systems for data access requests and data sharing, and documented a set of problems, user stories and requirements for further exploration. For full context, this document should be read in conjunction with the Discovery Phase Report. In that report we described:

  • National community needs and stakeholder requirements.

  • The current state of processes and tools for virtual cohort querying.

  • Candidate solutions to enable cross-institutional virtual cohort querying.

  • Recommendations on preferred technology and proposed implementation architecture.

In this Pilot Phase report we describe:

  • Pilot implementations at the partner organisations of the technology recommended in the discovery phase (GA4GH Beacon v2).

  • Assessment of each pilot's performance against the requirements, and of the pilots' strengths relative to one another.

  • A novel user interface for Beacon Network queries developed by one of the partner organisations (CCIA).

  • Work on integration of the Virtual Cohorts and Federated Identity and Access Management (IAM) sub-projects to provide controlled access to data for authenticated users.

  • Technical challenges encountered and outstanding unmet needs.

  • Recommendations regarding best practices for Beacon ETL processes, data annotation, and software deployment.

The Pilot Phase of the HGPP represents a significant step forward in advancing secure and responsible human genomic data sharing. Working with major genome sequencing organisations in Australia, the Virtual Cohorts sub-project implemented systems to identify cohorts of individuals and related genomic data assets across multiple repositories.

A key accomplishment of this work is the successful deployment of Beacon instances by three partners (CCIA, UMCCR, and QIMR), each populated with a shard of the CINECA synthetic dataset. The report evaluates three Beacon implementations and describes the strengths of each: the reference implementation, serverless Beacon (sBeacon) by CSIRO, and the Java-based jBeacon. The assessment informs a recommendation favouring the reference implementation for pilot phase deployment, acknowledging its alignment with GA4GH standards.

A pivotal aspect of the pilot phase is the development and deployment of the Beacon Network, which enhances the effectiveness of individual Beacons by enabling network-wide queries. This report assesses the Beacon Network implementation, highlighting the contributions of CCIA and Australian BioCommons. The associated Beacon Network UI was developed to make querying the network more intuitive for researchers. The report finds that the interface succeeds in abstracting complexities, but notes limitations in the query parameters and ontology choices it offers.
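To illustrate what a network-wide query adds over querying a single Beacon, the sketch below combines per-Beacon response summaries into one network-level answer. It is a minimal illustration under stated assumptions, not the Beacon Network implementation described in the report: the function name, the lowercase Beacon identifiers, and the counts are hypothetical, and the field names `responseSummary`, `exists`, and `numTotalResults` are assumed to follow the Beacon v2 response summary. Summing counts also assumes the Beacons hold non-overlapping shards.

```python
from typing import Dict


def aggregate_network_response(responses: Dict[str, dict]) -> dict:
    """Combine per-Beacon summaries into one network-level summary.

    `responses` maps a (hypothetical) Beacon identifier to the JSON body
    that Beacon returned, already parsed into a dict.
    """
    per_beacon = {}
    total = 0
    for beacon_id, response in responses.items():
        summary = response.get("responseSummary", {})
        exists = bool(summary.get("exists", False))
        # A boolean-only response carries no count; treat it as 0 here.
        count = summary.get("numTotalResults") or 0
        per_beacon[beacon_id] = {"exists": exists, "count": count}
        total += count
    return {
        # The network has a match if any member Beacon has one.
        "exists": any(b["exists"] for b in per_beacon.values()),
        # Assumes shards do not overlap, so counts can simply be added.
        "numTotalResults": total,
        "perBeacon": per_beacon,
    }


# Made-up responses for illustration only; not results from the pilot.
example_responses = {
    "ccia": {"responseSummary": {"exists": True, "numTotalResults": 12}},
    "umccr": {"responseSummary": {"exists": False, "numTotalResults": 0}},
    "qimr": {"responseSummary": {"exists": True}},  # boolean-only response
}
network_summary = aggregate_network_response(example_responses)
```

Keeping the per-Beacon breakdown alongside the totals lets a user interface show which repositories hold matching records, which is the information a researcher needs before making a data access request.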

In a technical evaluation, we address critical aspects of the pilot such as the choice of data model, ontologies, reference genome, and ETL processes. While the default data model and prescribed ontologies receive positive assessments, challenges arise in aligning ontologies where none is specified. The requirements evaluation finds that Beacon v2 aligns with the user stories, but acknowledges that some cases could not be evaluated because of the CINECA test dataset's narrow scope and features missing from the Beacon and Beacon Network implementations. The report also describes ongoing efforts toward integration with other HGPP sub-projects, notably the Federated IAM and REMS, with successful CILogon integration providing authenticated access to the Beacon Network.

Files

Virtual Cohorts_ Pilot Implementation Report.pdf (521.7 kB, md5:83b9e8762f49c66ac9a1edb50f0e1c5a)

Additional details

Related works

References
Report: 10.5281/zenodo.7439885 (DOI)

Dates

Available
2024-02-28
Project Report