
Published 2024 | Version v2
Journal article (Open Access)

Data for "STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature"

  • 1. Novo Nordisk Foundation Center for Protein Research, Denmark
  • 2. University of Turku, Finland
  • 3. Textimi

Description

ComplexTome.tar.gz: this file contains the ComplexTome corpus in BRAT standoff format. The corpus is provided in two different directory organizations: the "splits" directory holds the train/dev/test split used for training the relation extraction system, and the "data_source" directory splits the corpus by the source of the data, as described in the Methods section of the manuscript. The annotation guidelines, along with the annotation configuration files for BRAT, are provided in the root directory.

trigger_word_corpus.tar.gz: this file contains the trigger word corpus in BRAT standoff format. The corpus is split into devel and test sets. The annotation guidelines for trigger word detection are at the bottom of the relation annotation guidelines provided above.


The command used to run the tagger before large-scale execution of the RE system is:

gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv

Input documents for large-scale execution: all PubMed abstracts (as of August 2022) and all full texts available in the PubMed Central BioC text mining collection (as of April 2022). The documents are converted to a tab-delimited format so that the tagger output can later be converted to a format compatible with the RE system (see below).

Input dictionary files: all the files necessary to execute the command above are available in dictionary-files-tagger-STRINGv12.tar.gz.

Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than one hit (since we are doing relation extraction) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz.
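A minimal sketch of this filtering step, for orientation only: the column layout assumed here (document ID, paragraph, sentence, start offset, end offset, matched text, entity type, serial number) and the convention that gene/protein hits carry a positive taxonomy ID as their type are assumptions, not part of this record; only the file names come from the command and description above.

import gzip
from collections import defaultdict

# group gene/protein (GGP) hits by document
hits_per_doc = defaultdict(list)
with open("all_matches.tsv") as infile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        doc_id, entity_type = fields[0], fields[6]
        # keep only gene/protein hits (positive types in this sketch)
        if entity_type.isdigit() and int(entity_type) > 0:
            hits_per_doc[doc_id].append(line)

# keep only documents with more than one hit, since relation extraction
# needs at least two entities per document
with gzip.open("tagger_matches_ggp_only_gt_1_hit.tsv.gz", "wt") as outfile:
    for doc_lines in hits_per_doc.values():
        if len(doc_lines) > 1:
            outfile.writelines(doc_lines)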


Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger TSV output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
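For orientation, in the BRAT standoff format each document is a plain-text .txt file accompanied by an .ann file with one tab-separated line per tagged entity (ID, then type with character offsets, then surface text). The entity type label and the example sentence below are illustrative, not taken from the corpus:

example.txt:
BRCA1 interacts with BARD1 in the nucleus.

example.ann:
T1    GGP 0 5    BRCA1
T2    GGP 21 26    BARD1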

Relation extraction models: the TensorFlow model used for large-scale relation extraction for STRING v12 is in relation_extraction_string_v12_best_model.tar.gz, while the PyTorch model used for the relation extraction underlying trigger word detection is in relation_extraction_for_trigger_detection_best_model.tar.gz.

The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.

Relation extraction system output: large_scale_relation_extraction_results.tar.gz: this is the output of the relation extraction system, which includes both negative and positive predictions. The file has 5 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, prediction (positive or negative), and a list with the positive and negative scores coming from the relation extraction model. For example:

10092099 T1 T2 neg [1.0, 5.017947320683225e-13]
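A minimal sketch for reading these predictions and keeping only the positively predicted pairs. The input path, the whitespace-separated column handling, and the "pos" label for positive predictions are assumptions (the record only shows a negative example line):

import json

with open("large_scale_relation_extraction_results.tsv") as infile:
    for line in infile:
        # split into 5 fields; the last field is the bracketed score list
        pmid, ent1, ent2, label, scores = line.rstrip("\n").split(maxsplit=4)
        first_score, second_score = json.loads(scores)
        if label == "pos":
            print(pmid, ent1, ent2, first_score, second_score)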


Trigger word detection system input: combined_input_for_triggers.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the trigger word detection system. They contain only the pairs predicted as positive in the relation extraction system's large-scale predictions.

Trigger word detection system output: trigger_word_model_predictions.tar.gz contains the output of the large-scale execution pipeline, with 9 columns: PubMed ID, Entity ID1, Entity ID2, negative logit for the complex formation relationship, positive logit for the complex formation relationship, trigger score, start offset, end offset, trigger word match.
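A similar sketch for reading the trigger word predictions; the input path and the tab-separated column handling are assumptions, and the softmax over the two logits is shown only as a worked example of how they can be turned into a probability:

import math

with open("trigger_word_model_predictions.tsv") as infile:
    for line in infile:
        (pmid, ent1, ent2, neg_logit, pos_logit,
         trigger_score, start, end, trigger_word) = line.rstrip("\n").split("\t")
        # two-class softmax: probability of a complex formation relationship
        pos_prob = 1.0 / (1.0 + math.exp(float(neg_logit) - float(pos_logit)))
        print(pmid, trigger_word, start, end, trigger_score, round(pos_prob, 3))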

If you have the input in CoNLL format, please use the conll2standoff.py script from the BRAT GitHub repository to convert from CoNLL to the BRAT standoff format so that you can run our system.

Files

Files (86.2 GB)

Size       MD5
35.7 GB    md5:4bfc3b39d223d1cd844e731f4b024a21
24.9 GB    md5:591dfddbcc53f55e2d3ba4a101f65392
2.6 MB     md5:9ee12d329c384abcaf4cab27cd58b0a3
2.6 GB     md5:e5a39c719739c0076d4e2b51f806c9d1
13.6 GB    md5:b4cd3df840ccee854d85237c3c869c45
1.3 GB     md5:a69ba9f738481a901be5fc06aac70b15
2.5 GB     md5:27603075446c3ef8efff4ac8096c1cb9
5.4 GB     md5:881d20995d89af1989e125c9a69e1bc3
290.4 kB   md5:bedc55fb2ad2dd89818f361b61093d28
173.1 MB   md5:1e054e880a4366162e0ffb70292d694c

Additional details

Funding

European Commission
DeepTextNet – Deep learning-based text mining for interpretation of omics data (grant 101023676)