Published April 23, 2024 | Version v1
Dataset Open

Data for "RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature"

  • 1. Københavns Universitet
  • 1. University of Turku
  • 2. University of Copenhagen
  • 3. Textimi

Description

RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The "splits" directory contains the corpus divided into the train/dev/test sets used to train the relation extraction system.
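For readers unfamiliar with BRAT standoff files, the following is a minimal sketch of how the entity (T) and relation (R) lines of an .ann file can be read. It assumes the standard BRAT standoff layout; the function name parse_brat_ann, the dataclasses, and the sample annotations (including the type names "Protein" and "Regulation") are illustrative and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_brat_ann(ann_text: str) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse text-bound (T) and relation (R) annotations from BRAT standoff text."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            etype, _, offsets = fields[1].partition(" ")
            # Discontinuous spans are written "s1 e1;s2 e2"; keep the outer bounds.
            spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, etype, spans[0][0], spans[-1][1], fields[2])
        elif ann_id.startswith("R") and len(fields) >= 2:
            rtype, arg1, arg2 = fields[1].split()[:3]
            relations.append(
                Relation(ann_id, rtype, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Made-up sample; the type names are placeholders, not the corpus label set.
sample = (
    "T1\tProtein 0 4\tTP53\n"
    "T2\tProtein 15 19\tMDM2\n"
    "R1\tRegulation Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_brat_ann(sample)
```

Other annotation kinds that BRAT supports (events, attributes, notes) are skipped by this sketch.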

RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/ 

The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:

gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv

Input documents: the large-scale execution was done on the entire PubMed (as of March 2024) and PMC Open Access (as of November 2023) articles in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).

Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz 

Tagger output: we filter the results of the tagger run down to gene/protein hits and keep only documents with more than one hit (since relation extraction needs at least two entities) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
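The filtering step described above could be sketched roughly as follows. This is not the script used to produce the published file. It assumes that each match row is tab-delimited with the document identifier in the first column and the entity type in the seventh, and that gene/protein hits carry a positive type value (an organism taxonomy identifier); both assumptions should be checked against the tagger documentation. The function name and sample rows are hypothetical.

```python
from collections import Counter
from typing import List

DOC_COL = 0   # assumed position of the document identifier
TYPE_COL = 6  # assumed position of the entity type


def filter_matches(rows: List[List[str]], min_hits: int = 2) -> List[List[str]]:
    """Keep gene/protein rows, then drop documents with fewer than min_hits of them."""
    # Assumption: a positive type value marks a gene/protein match.
    ggp = [r for r in rows if len(r) > TYPE_COL and int(r[TYPE_COL]) > 0]
    counts = Counter(r[DOC_COL] for r in ggp)
    return [r for r in ggp if counts[r[DOC_COL]] >= min_hits]


# Made-up rows: document 100 has two gene/protein hits and one other hit,
# document 200 has a single gene/protein hit.
raw = (
    "100\t1\t1\t0\t3\tTP53\t9606\tID_A\n"
    "100\t1\t1\t15\t18\tMDM2\t9606\tID_B\n"
    "100\t1\t2\t5\t10\tcancer\t-26\tID_C\n"
    "200\t1\t1\t0\t3\tTP53\t9606\tID_A\n"
)
rows = [line.split("\t") for line in raw.splitlines()]
kept = filter_matches(rows)
```

For the full-size output, a streaming two-pass approach (count per document first, then filter) would avoid holding all rows in memory.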

Relation extraction system input: combined_input_for_re.tar.gz contains the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.

Relation extraction models. The Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz

The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.

Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz. ATTENTION: this file is approximately 1 TB in size, so make sure you have enough space on your machine before downloading it.

The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
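As an illustration of how these files might be consumed, the sketch below reads a tab-delimited file with a header, treats the first three columns as identifiers and every remaining column as a per-class score, and reports the classes whose score reaches a cutoff. The cutoff of 0.5, the function name, and the toy header and class names are assumptions made for the example; the real files have 83 score columns whose names are given in their header.

```python
import csv
import io
from typing import Iterator, List, TextIO, Tuple


def labels_above_threshold(
    handle: TextIO, threshold: float = 0.5, n_id_cols: int = 3
) -> Iterator[Tuple[Tuple[str, ...], List[str]]]:
    """Yield (identifier columns, class names scoring at or above threshold) per row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    class_names = header[n_id_cols:]
    for row in reader:
        ids = tuple(row[:n_id_cols])
        scores = [float(x) for x in row[n_id_cols:]]
        labels = [name for name, s in zip(class_names, scores) if s >= threshold]
        yield ids, labels


# Toy input with two hypothetical classes instead of the real 83.
toy = io.StringIO(
    "pmid\tid1\tid2\tclass_a\tclass_b\n"
    "12345\tT1\tT2\t0.91\t0.02\n"
    "12345\tT1\tT3\t0.10\t0.20\n"
)
results = list(labels_above_threshold(toy))
```

Because the scores are per class, a pair can receive several labels or none; the appropriate cutoff depends on the use case.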

Files (55.7 GB)

Checksum Size
md5:3be6dfe85826492e76823c6d5e3e452f 184.1 kB
md5:248b3dc539b70ff37589d4d0d642c340 45.9 GB
md5:20a9edc6931a964ddf579f385ca6f309 143.0 kB
md5:0e272a23583bbd393bfc53e50cd48707 2.3 MB
md5:473a44090242199a368f78b52766ba65 1.3 GB
md5:c39e643a1f8ff89427b17ea62271f6e9 1.8 GB
md5:c99d4f17031db47befe6bedf7bad6159 6.8 GB
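
Given the size of some of these archives, it may be worth checking a download against its listed MD5 checksum. A minimal sketch using Python's standard hashlib module follows; the function names and the chunk size are choices made for this example, and the path and checksum passed in are whatever the reader has downloaded and looked up in the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte files.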

Additional details

Funding

DeepTextNet – Deep learning-based text mining for interpretation of omics data 101023676
European Commission