INDRA assembly Benchmark Corpus

John A. Bachman; Benjamin M. Gyori; Peter K. Sorger

doi:10.5281/zenodo.7559353

Published January 22, 2023 | Version v3

Dataset Open

1. Harvard Medical School

This dataset accompanies the manuscript "Automated assembly of molecular mechanisms at scale from text mining and curated databases", which describes the assembly methodology implemented in the INDRA system (https://github.com/sorgerlab/indra). The manuscript applies an example assembly pipeline to ~570k publications to create the INDRA Benchmark Corpus.

This dataset provides INDRA Statements constituting the INDRA Benchmark Corpus as well as a set of curations on the corpus:

- indra_benchmark_corpus.pkl: A Python pickle file of INDRA Statement objects. Loading it requires a Python environment with INDRA installed.

- indra_benchmark_corpus.json.gz: A gzipped JSON export of INDRA Statements.

- indra_assembly_curations.json: The Curated Corpus, a set of curated mentions for Statements in the Benchmark Corpus, as a JSON file. The file contains a list in which each element is one curation. Each curation entry has a `pa_hash` and a `source_hash` attribute, which identify the Statement and the mention, respectively, to which the curation applies in the Benchmark Corpus.
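
The lookup described in the last item can be sketched roughly as follows, working from the gzipped JSON export rather than the pickle so that INDRA is not needed. This is a minimal illustration under several assumptions that the description above does not confirm: that the export is a JSON list of Statement dictionaries, that each Statement stores its hash under a key such as `matches_hash`, and that each mention sits in an `evidence` list with its own `source_hash`. The helper names are hypothetical, and hashes are compared as strings in case the two files store them with different types.

```python
import gzip
import json
from collections import defaultdict


def load_json(path):
    """Load a JSON file, handling a .gz suffix transparently."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def index_curations(curations):
    """Group curation entries by (pa_hash, source_hash), both as strings."""
    index = defaultdict(list)
    for cur in curations:
        key = (str(cur["pa_hash"]), str(cur["source_hash"]))
        index[key].append(cur)
    return index


def curated_mentions(statements, curations, stmt_hash_key="matches_hash"):
    """Yield (statement, mention, curations) for every curated mention.

    ASSUMPTION: `stmt_hash_key`, "evidence" and the per-mention
    "source_hash" key describe the layout of the Statement JSON; inspect
    the export and adjust these names if they differ.
    """
    index = index_curations(curations)
    for stmt in statements:
        stmt_hash = str(stmt.get(stmt_hash_key))
        for mention in stmt.get("evidence", []):
            key = (stmt_hash, str(mention.get("source_hash")))
            if key in index:
                yield stmt, mention, index[key]


if __name__ == "__main__":
    # File names are taken from the list above; paths are local assumptions.
    curations = load_json("indra_assembly_curations.json")
    statements = load_json("indra_benchmark_corpus.json.gz")
    for stmt, mention, curs in curated_mentions(statements, curations):
        print(stmt.get("type"), len(curs))
        break
```

Reading the whole 460.0 MB export into memory at once is the simplest approach but may be heavy on small machines; a streaming JSON parser would be a reasonable substitute.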

Files

Files (2.0 GB)

Name	MD5	Size
indra_assembly_curations.json	4a5b39066458e2112607bcb19e7a4d29	1.7 MB
indra_benchmark_corpus.json.gz	82c01e9e100b296ba5b8fbbb2e39f5dc	460.0 MB
indra_benchmark_corpus.pkl	30725fad35c915f042d5103c65b82f0f	1.6 GB
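
Since the listing gives an MD5 checksum for each file, a download can be checked before use. The following is a small sketch using only the Python standard library; the function names and the assumption that the files sit in the current directory are illustrative, while the checksums are copied from the listing above.

```python
import hashlib
import os

# Checksums as published in the file listing.
EXPECTED_MD5 = {
    "indra_assembly_curations.json": "4a5b39066458e2112607bcb19e7a4d29",
    "indra_benchmark_corpus.json.gz": "82c01e9e100b296ba5b8fbbb2e39f5dc",
    "indra_benchmark_corpus.pkl": "30725fad35c915f042d5103c65b82f0f",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5sum(path) == expected.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "ok" if verify(name, expected) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found")
```

Reading in chunks keeps memory use flat even for the 1.6 GB pickle file.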

