Data for "RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature"
Contributors
Data curator:
Researchers:
Description
RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" contains the corpus divided into the train/dev/test sets used to train the relation extraction system.
RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
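For readers who want to load the corpus programmatically, the following is a minimal sketch of reading a BRAT standoff .ann file. It only handles text-bound annotations (T lines) and binary relations (R lines) as defined by the generic BRAT standoff format. The sample annotation and the type names "GGP" and "Regulation" are illustrative placeholders, not taken from the corpus; the actual types are listed in the annotation documentation.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    id: str
    type: str
    arg1: str
    arg2: str


def parse_ann(ann_text):
    """Parse the T (entity) and R (relation) lines of a BRAT .ann file."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Layout: T1<TAB>Type start end<TAB>covered text
            ann_type, span = fields[1].split(" ", 1)
            # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds.
            offsets = span.replace(";", " ").split()
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(
                ann_id, ann_type, int(offsets[0]), int(offsets[-1]), text
            )
        elif ann_id.startswith("R"):
            # Layout: R1<TAB>Type Arg1:T1 Arg2:T2
            rel_type, arg1, arg2 = fields[1].split()[:3]
            relations.append(
                Relation(ann_id, rel_type, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
        # Other annotation kinds (attributes, notes, events) are skipped here.
    return entities, relations


# Made-up annotation for the text "MDM2 regulates TP53" (placeholder type names).
SAMPLE_ANN = (
    "T1\tGGP 0 4\tMDM2\n"
    "T2\tGGP 15 19\tTP53\n"
    "R1\tRegulation Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
for rel in relations:
    print(entities[rel.arg1].text, rel.type, entities[rel.arg2].text)
```

Because the relations are directed, the order of Arg1 and Arg2 is kept as written in the file rather than normalized.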
The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the relation extraction (RE) system is:
gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv
Input documents for large-scale execution: the large-scale run covers all of PubMed (as of March 2024) and the PMC Open Access articles (as of November 2023) in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).
Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz
Tagger output: before feeding the tagger results to our RE system, we keep only gene/protein hits and only documents with more than one hit, since a relation needs at least two entities. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
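The filtering step described above can be sketched as follows. This is not the script used to produce the released file; it is a minimal illustration under stated assumptions: the matches are tab-delimited with the document ID in column 0 and the entity type in column 6, gene/protein hits carry a positive integer type (an NCBI taxonomy ID), rows of the same document are contiguous, and "more than one hit" is counted over gene/protein hits. The sample rows and identifiers are made up.

```python
from itertools import groupby


def is_gene_or_protein(type_field):
    # ASSUMPTION: gene/protein matches have a positive integer type
    # (an NCBI taxonomy ID); other entity types are not positive integers.
    try:
        return int(type_field) > 0
    except ValueError:
        return False


def filter_matches(rows, doc_col=0, type_col=6, min_hits=2):
    """Yield gene/protein rows from documents with at least `min_hits` such rows.

    ASSUMPTION: rows belonging to one document are contiguous, so the input
    can be streamed without holding the whole file in memory.
    """
    ggp_rows = (row for row in rows if is_gene_or_protein(row[type_col]))
    for _, group in groupby(ggp_rows, key=lambda row: row[doc_col]):
        hits = list(group)
        if len(hits) >= min_hits:
            yield from hits


# Made-up rows in the assumed layout:
# document, paragraph, sentence, start, end, text, type, identifier
SAMPLE_ROWS = [
    ["101", "1", "1", "0", "3", "TP53", "9606", "ID_A"],
    ["101", "1", "1", "10", "13", "MDM2", "9606", "ID_B"],
    ["101", "1", "1", "20", "26", "aspirin", "-1", "ID_C"],
    ["102", "1", "1", "0", "3", "TP53", "9606", "ID_A"],
]

for row in filter_matches(SAMPLE_ROWS):
    print("\t".join(row))
```

With a real file, the rows could come from `csv.reader(gzip.open(path, "rt"), delimiter="\t")`; the column positions should be checked against the actual tagger output before relying on the defaults.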
Relation extraction system input: combined_input_for_re.tar.gz contains the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
Relation extraction models: the Transformer-based model used for large-scale relation extraction and for prediction on the test set is available in relation_extraction_multi-label-best_model.tar.gz
The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.
Relation extraction system output: the tab-delimited outputs of the relation extraction system are found in large_scale_relation_extraction_results.tar.gz. ATTENTION: this file is approximately 1 TB in size, so make sure you have enough disk space before downloading it.
The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
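As an illustration of how these files can be consumed, the sketch below reads one output file, takes the class names from the header, and reports the classes whose score reaches a cutoff for each entity pair. It assumes that the first three columns are the identifiers described above and that all remaining columns are numeric scores. The 0.5 cutoff, the function name, and the two-class sample are assumptions made for the example, not values from the released system.

```python
import csv
import io


def predicted_relations(handle, threshold=0.5):
    """Yield (pmid, id1, id2, labels) for rows with at least one score >= threshold.

    ASSUMPTION: columns 0-2 are PMID and the two entity BRAT IDs; every later
    column is a per-class score named by the header line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    class_names = header[3:]
    for row in reader:
        pmid, id1, id2 = row[:3]
        scores = [float(value) for value in row[3:]]
        labels = [name for name, score in zip(class_names, scores) if score >= threshold]
        if labels:
            yield pmid, id1, id2, labels


# Made-up miniature file with two placeholder classes instead of the full set.
SAMPLE_TSV = (
    "pmid\tid1\tid2\tClassA\tClassB\n"
    "123\tT1\tT2\t0.91\t0.02\n"
    "123\tT1\tT3\t0.10\t0.20\n"
)

for hit in predicted_relations(io.StringIO(SAMPLE_TSV)):
    print(hit)
```

Since the model is multi-label, a pair can receive several classes at once, which is why the sketch returns a list instead of a single best class.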
Files
(55.7 GB)
md5 | Size |
---|---|
3be6dfe85826492e76823c6d5e3e452f | 184.1 kB |
248b3dc539b70ff37589d4d0d642c340 | 45.9 GB |
20a9edc6931a964ddf579f385ca6f309 | 143.0 kB |
0e272a23583bbd393bfc53e50cd48707 | 2.3 MB |
473a44090242199a368f78b52766ba65 | 1.3 GB |
c39e643a1f8ff89427b17ea62271f6e9 | 1.8 GB |
c99d4f17031db47befe6bedf7bad6159 | 6.8 GB |
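Given the size of some of these files, it may be worth checking a download against its listed md5 value. The following is a minimal sketch using Python's standard hashlib module; the function names and the file name in the usage comment are hypothetical, and the file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex md5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def normalize_checksum(listed):
    """Accept a checksum written as 'md5:<hex>' or as plain '<hex>'."""
    listed = listed.strip().lower()
    return listed[4:] if listed.startswith("md5:") else listed


def verify_file(path, listed_checksum):
    """Return True if the file at `path` matches the listed md5 checksum."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == normalize_checksum(listed_checksum)


# Usage (hypothetical local file name; take the checksum from the listing):
# ok = verify_file("downloaded_archive.tar.gz", "md5:<checksum from the listing>")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before unpacking.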