INDRA Assembly Benchmark Corpus
Description
This dataset accompanies the manuscript "Automated assembly of molecular mechanisms at scale from text mining and curated databases", which describes the assembly methodology implemented in the INDRA system (https://github.com/sorgerlab/indra). The manuscript applies an example assembly pipeline to ~570k publications to create the INDRA Benchmark Corpus.
This dataset provides INDRA Statements constituting the INDRA Benchmark Corpus as well as a set of curations on the corpus:
- indra_benchmark_corpus.pkl: A Python pickle file of INDRA Statement objects. Loading it requires INDRA to be installed in the Python environment.
- indra_benchmark_corpus.json.gz: A gzipped JSON export of INDRA Statements.
- indra_assembly_curations.json: The Curated Corpus, a JSON file of curated mentions for Statements in the Benchmark Corpus. The file contains a list in which each element is one curation. Each curation entry has a `pa_hash` and a `source_hash` attribute, which identify the Statement and the mention, respectively, to which the curation applies in the Benchmark Corpus.
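The lookup described for `pa_hash` and `source_hash` can be sketched with the standard library alone, working on the JSON exports rather than the pickle. This is a minimal sketch under stated assumptions, not the authors' own tooling: it assumes the gzipped export is a single JSON list of Statement dictionaries, that each Statement dictionary carries its hash under a `matches_hash` key, and that each entry of its `evidence` list carries a `source_hash` key. Those key names are assumptions about the export layout and should be checked against the actual files. The helper names (`load_json`, `index_curations`, `curated_mentions`) are hypothetical.

```python
import gzip
import json
from collections import defaultdict


def load_json(path):
    """Load one JSON document; a .gz suffix is decompressed on the fly."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def index_curations(curations):
    """Group curation entries by their (pa_hash, source_hash) pair."""
    index = defaultdict(list)
    for cur in curations:
        # Hashes are normalized to int so string and integer forms compare equal.
        key = (int(cur["pa_hash"]), int(cur["source_hash"]))
        index[key].append(cur)
    return index


def curated_mentions(statements, index):
    """Yield (statement, evidence, curations) for every curated mention.

    ASSUMPTION: each statement dict has a "matches_hash" key and each
    item of its "evidence" list has a "source_hash" key.
    """
    for stmt in statements:
        pa_hash = int(stmt["matches_hash"])
        for ev in stmt.get("evidence", []):
            key = (pa_hash, int(ev["source_hash"]))
            if key in index:
                yield stmt, ev, index[key]
```

A typical call would be `index = index_curations(load_json("indra_assembly_curations.json"))` followed by iterating over `curated_mentions(load_json("indra_benchmark_corpus.json.gz"), index)`. Loading the full corpus this way holds every Statement in memory at once, so a machine with several GB of free RAM is a reasonable precaution.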
Files
Total size: 2.0 GB

| md5 | Size |
|---|---|
| 4a5b39066458e2112607bcb19e7a4d29 | 1.7 MB |
| 82c01e9e100b296ba5b8fbbb2e39f5dc | 460.0 MB |
| 30725fad35c915f042d5103c65b82f0f | 1.6 GB |
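The md5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify_md5` are hypothetical, and the file is read in chunks so that the larger files need not fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's md5 digest with an expected hex string."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `verify_md5(local_path, "4a5b39066458e2112607bcb19e7a4d29")` should return `True` for an intact copy of the 1.7 MB file. The listing above does not show which file name belongs to which checksum, so pair each checksum with its file on the download page before checking.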