Published January 22, 2023 | Version v3
Dataset Open

INDRA assembly Benchmark Corpus

  • 1. Harvard Medical School

Description

This dataset accompanies the manuscript "Automated assembly of molecular mechanisms at scale from text mining and curated databases", which describes the assembly methodology implemented in the INDRA system (https://github.com/sorgerlab/indra). The manuscript applies an example assembly pipeline to ~570k publications to create the INDRA Benchmark Corpus.

This dataset provides the INDRA Statements constituting the INDRA Benchmark Corpus, as well as a set of curations on the corpus:

- indra_benchmark_corpus.pkl: A Python pickle file of INDRA Statement objects. Loading it requires INDRA to be installed in the Python environment.

- indra_benchmark_corpus.json.gz: A gzipped JSON export of INDRA Statements.

- indra_assembly_curations.json: The Curated Corpus of curated mentions for Statements in the Benchmark Corpus, as a JSON file. The file contains a list in which each element corresponds to one curation. Each curation entry has a `pa_hash` and a `source_hash` attribute, which identify, respectively, the Statement and the mention in the Benchmark Corpus to which the curation applies.
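
As a rough illustration of how these files might be read and how curations could be looked up by their two hashes, here is a minimal Python sketch. The helper names (`load_pickle`, `load_json`, `index_curations`) are hypothetical and not part of INDRA. The sketch assumes that the gzipped JSON export is a single JSON document and that the pickle holds a collection of Statement objects; neither detail is spelled out above. Only the `pa_hash` and `source_hash` keys are taken from the description.

```python
import gzip
import json
import pickle
from collections import defaultdict


def load_pickle(path):
    """Unpickle a file such as indra_benchmark_corpus.pkl.

    INDRA must be importable in the environment, because unpickling
    has to reconstruct INDRA Statement objects.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def load_json(path):
    """Load a JSON file, handling a .gz suffix transparently.

    Assumption: the file holds one JSON document (not JSON Lines).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def index_curations(curations):
    """Group curation entries by their (pa_hash, source_hash) pair.

    pa_hash identifies the Statement and source_hash the mention,
    so the pair points at one mention of one Statement.
    """
    index = defaultdict(list)
    for curation in curations:
        key = (curation["pa_hash"], curation["source_hash"])
        index[key].append(curation)
    return dict(index)
```

A possible use is `index = index_curations(load_json("indra_assembly_curations.json"))`, followed by `index.get((pa_hash, source_hash), [])` for a given mention. With INDRA installed, `stmt.get_hash()` and the `source_hash` attribute of each entry in `stmt.evidence` should supply the two keys, assuming the curation hashes were computed the same way.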

Files (2.0 GB)

Checksum                              Size
md5:4a5b39066458e2112607bcb19e7a4d29  1.7 MB
md5:82c01e9e100b296ba5b8fbbb2e39f5dc  460.0 MB
md5:30725fad35c915f042d5103c65b82f0f  1.6 GB
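
Because the larger files run to hundreds of megabytes or more, it can be worth checking a download against its listed MD5 checksum before loading it. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_listed_md5` are hypothetical, and the caller is expected to pass the checksum listed for the specific file that was downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's MD5 with a listed value.

    The listed value may carry an 'md5:' prefix, as in the
    file listing, or be a bare hex digest.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5(local_path, "md5:...")` with the checksum copied from the listing returns `True` only if the local file is byte-for-byte identical to the published one.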