Data for "RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature"

doi:10.5281/zenodo.10808330

Published April 23, 2024 | Version v1

Dataset Open


Nastou, Katerina (Researcher)¹

1. Københavns Universitet

Contributors

Data curator:

Ohta, Tomoko³

Researchers:

1. University of Turku
2. University of Copenhagen
3. Textimi

RegulaTome corpus: RegulaTome-corpus.tar.gz contains the RegulaTome corpus in BRAT format. The "splits" directory holds the corpus divided into the train/dev/test sets used to train the relation extraction system.
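For readers unfamiliar with BRAT standoff files, the following is a minimal sketch of how the entity ("T") and binary relation ("R") lines of a .ann file can be read. It is not the authors' code. The sample annotation, the entity type "GGP" and the relation type "Example_relation" are illustrative assumptions; the actual type inventory is defined in the annotation configuration files.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


# Illustrative sample only; the type names are assumptions, not corpus labels.
SAMPLE_ANN = (
    "T1\tGGP 0 4\tMDM2\n"
    "T2\tGGP 19 22\tp53\n"
    "R1\tExample_relation Arg1:T1 Arg2:T2\n"
)


def parse_ann(ann_text):
    """Parse text-bound (T) and binary relation (R) lines of BRAT standoff."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Layout: T1<TAB>Type start end<TAB>covered text
            ann_type, span = fields[1].split(" ", 1)
            # Discontinuous spans look like "0 5;10 15": keep first start, last end.
            offsets = span.replace(";", " ").split()
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(
                ann_id, ann_type, int(offsets[0]), int(offsets[-1]), text
            )
        elif ann_id.startswith("R"):
            # Layout: R1<TAB>Type Arg1:T1 Arg2:T2
            parts = fields[1].split()
            args = dict(part.split(":", 1) for part in parts[1:])
            relations.append(Relation(ann_id, parts[0], args["Arg1"], args["Arg2"]))
        # Other annotation kinds (attributes, notes, events) are ignored here.
    return entities, relations


entities, relations = parse_ann(SAMPLE_ANN)
for rel in relations:
    print(rel.type, entities[rel.arg1].text, "->", entities[rel.arg2].text)
```

Because the relations are directed, the order of Arg1 and Arg2 is kept as written in the file.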

RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/

The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:

gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv

Input documents: large-scale execution is run on all of PubMed (as of March 2024) and on the PMC Open Access articles (as of November 2023), both in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).

Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz

Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than one hit (relation extraction needs at least two entities) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
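The filtering step described above can be sketched as follows. This is an illustration, not the script the authors ran. It assumes the tagger matches are tab-separated with the document ID in the first column and the entity type in the seventh, and that gene/protein hits carry a positive integer type (an NCBI taxonomy ID) while other entity types are negative; both column positions are exposed as parameters so they can be adjusted.

```python
from collections import defaultdict


def filter_ggp_multi_hit(rows, doc_col=0, type_col=6):
    """Keep gene/protein hits from documents that have more than one such hit.

    rows: iterable of lists of strings, one list per tagger match line.
    doc_col, type_col: assumed column positions (see the lead-in).
    """
    hits_by_doc = defaultdict(list)
    for row in rows:
        # Assumption: a positive type value marks a gene/protein hit.
        if int(row[type_col]) > 0:
            hits_by_doc[row[doc_col]].append(row)
    kept = []
    for doc_rows in hits_by_doc.values():
        if len(doc_rows) > 1:
            kept.extend(doc_rows)
    return kept


# Made-up example rows; identifiers are placeholders.
example_rows = [
    ["100", "1", "1", "0", "3", "TP53", "9606", "ID_A"],
    ["100", "1", "1", "10", "13", "MDM2", "9606", "ID_B"],
    ["100", "1", "2", "20", "26", "aspirin", "-1", "ID_C"],
    ["200", "1", "1", "0", "3", "TP53", "9606", "ID_A"],
    ["200", "1", "1", "5", "9", "human", "-2", "ID_D"],
]
for row in filter_ggp_multi_hit(example_rows):
    print("\t".join(row))
```

This version holds all hits in memory, which is fine for a small sample but not for the full tagger output; a streaming variant would be needed at that scale.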

Relation extraction system input: combined_input_for_re.tar.gz contains the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger TSV output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.

Relation extraction models. The Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz

The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.

Relation extraction system output: the tab-delimited outputs of the relation extraction system are found in large_scale_relation_extraction_results.tar.gz. Attention: this file is approximately 1 TB in size, so make sure you have enough disk space on your machine before downloading it.
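One way to check the available space before starting the download is sketched below. The target directory and the 1 TB figure (taken here as 10^12 bytes) are assumptions for illustration; unpacking the archive will need additional space beyond the download itself.

```python
import shutil

ONE_TB = 10**12  # approximate archive size, as stated above


def has_enough_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has enough free bytes."""
    return shutil.disk_usage(directory).free >= required_bytes


if __name__ == "__main__":
    target = "."  # hypothetical download location
    if has_enough_space(target, ONE_TB):
        print("Enough free space for the download.")
    else:
        print("Not enough free space for the download.")
```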

The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
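A minimal sketch of reading one of these output files is given below. It relies only on the layout stated above (three identifier columns followed by per-class score columns, with a header row). The class names in the sample and the 0.5 threshold are assumptions for illustration, not values from the released model.

```python
import csv
import io


def labels_above_threshold(tsv_text, threshold=0.5):
    """Return (pmid, id1, id2, {class: score}) for each entity pair.

    Only classes whose score reaches `threshold` are kept, which suits a
    multi-label model where a pair may have several relation types.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    class_names = header[3:]  # everything after the three identifier columns
    results = []
    for row in reader:
        scores = [float(value) for value in row[3:]]
        selected = {
            name: score
            for name, score in zip(class_names, scores)
            if score >= threshold
        }
        results.append((row[0], row[1], row[2], selected))
    return results


# Tiny made-up sample with two hypothetical classes; real files have 86 columns.
SAMPLE_TSV = (
    "PMID\tEntity BRAT ID1\tEntity BRAT ID2\tclass_a\tclass_b\n"
    "123\tT1\tT2\t0.91\t0.12\n"
    "123\tT2\tT3\t0.20\t0.30\n"
)

for pmid, id1, id2, labels in labels_above_threshold(SAMPLE_TSV):
    print(pmid, id1, id2, labels)
```

For the full-size files the same logic can be applied line by line to an open file handle instead of a string.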

Files (55.7 GB)

Name	MD5	Size
annodoc+config.tar.gz	3be6dfe85826492e76823c6d5e3e452f	184.1 kB
combined_input_for_re.tar.gz	248b3dc539b70ff37589d4d0d642c340	45.9 GB
Error analysis full results - RegulaTome.xlsx	20a9edc6931a964ddf579f385ca6f309	143.0 kB
RegulaTome-corpus.tar.gz	0e272a23583bbd393bfc53e50cd48707	2.3 MB
relation_extraction_multi-label-best_model.tar.gz	473a44090242199a368f78b52766ba65	1.3 GB
tagger_dictionary_files.tar.gz	c39e643a1f8ff89427b17ea62271f6e9	1.8 GB
tagger_matches_ggp_only_gt_1_hit.tsv.gz	c99d4f17031db47befe6bedf7bad6159	6.8 GB
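
The MD5 checksums listed above can be used to confirm that a download completed intactly. The sketch below hashes a file in chunks so that the larger archives do not need to fit in memory; the local file paths are assumed to match the names in the list.

```python
import hashlib

# Checksums copied from the file list above.
EXPECTED_MD5 = {
    "annodoc+config.tar.gz": "3be6dfe85826492e76823c6d5e3e452f",
    "combined_input_for_re.tar.gz": "248b3dc539b70ff37589d4d0d642c340",
    "Error analysis full results - RegulaTome.xlsx": "20a9edc6931a964ddf579f385ca6f309",
    "RegulaTome-corpus.tar.gz": "0e272a23583bbd393bfc53e50cd48707",
    "relation_extraction_multi-label-best_model.tar.gz": "473a44090242199a368f78b52766ba65",
    "tagger_dictionary_files.tar.gz": "c39e643a1f8ff89427b17ea62271f6e9",
    "tagger_matches_ggp_only_gt_1_hit.tsv.gz": "c99d4f17031db47befe6bedf7bad6159",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("RegulaTome-corpus.tar.gz", EXPECTED_MD5["RegulaTome-corpus.tar.gz"])` should return True for an intact copy in the current directory.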

Additional details

Funding: European Commission, grant 101023676, DeepTextNet – Deep learning-based text mining for interpretation of omics data

