
Published 2024 | Version v2
Journal article (Open Access)

Data for "STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature"

  • 1. Novo Nordisk Foundation Center for Protein Research, Denmark
  • 2. University of Turku, Finland
  • 3. Textimi

Description

ComplexTome.tar.gz: this file contains the ComplexTome corpus in BRAT standoff format. The corpus is provided in two different directory organizations: the "splits" directory holds the train/dev/test split used for training the relation extraction system, and the "data_source" directory splits the corpus by the source of the data, as described in the Methods section of the manuscript. The annotation guidelines, along with the annotation configuration files for BRAT, are provided in the root directory.

trigger_word_corpus.tar.gz: this file contains the trigger word corpus in BRAT standoff format. The corpus is split into devel and test sets. The annotation guidelines for trigger word detection are at the bottom of the relation annotation guidelines provided above.


The command used to run the tagger before large-scale execution of the RE system is:

gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv

Input documents for large-scale execution: all PubMed abstracts (as of August 2022) and all full texts available in the PubMed Central BioC text mining collection (as of April 2022). The documents are converted to a tab-delimited format so that the tagger output can later be converted to a format compatible with the RE system (see below).

Input dictionary files: all the files necessary to execute the command above are available in dictionary-files-tagger-STRINGv12.tar.gz.

Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than one hit (since we are doing relation extraction) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz.
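A minimal sketch of this filtering step, for orientation only: the column layout assumed here (document ID, paragraph, sentence, start offset, end offset, matched text, entity type, serial number) and the convention that gene/protein hits carry a positive taxonomy ID as their type are assumptions, not part of this record; only the file names come from the command and description above.

import gzip
from collections import defaultdict

# group gene/protein (GGP) hits by document
hits_per_doc = defaultdict(list)
with open("all_matches.tsv") as infile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        doc_id, entity_type = fields[0], fields[6]
        # keep only gene/protein hits (positive types in this sketch)
        if entity_type.isdigit() and int(entity_type) > 0:
            hits_per_doc[doc_id].append(line)

# keep only documents with more than one hit, since relation extraction
# needs at least two entities per document
with gzip.open("tagger_matches_ggp_only_gt_1_hit.tsv.gz", "wt") as outfile:
    for doc_lines in hits_per_doc.values():
        if len(doc_lines) > 1:
            outfile.writelines(doc_lines)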


Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger TSV output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
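For orientation, in the BRAT standoff format each document is a plain-text .txt file accompanied by an .ann file with one tab-separated line per tagged entity (ID, then type with character offsets, then surface text). The entity type label and the example sentence below are illustrative, not taken from the corpus:

example.txt:
BRCA1 interacts with BARD1 in the nucleus.

example.ann:
T1    GGP 0 5    BRCA1
T2    GGP 21 26    BARD1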

Relation extraction models: the TensorFlow model used for large-scale relation extraction for STRING v12 is in relation_extraction_string_v12_best_model.tar.gz, while the PyTorch model used for the relation extraction underlying trigger word detection is in relation_extraction_for_trigger_detection_best_model.tar.gz.

The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.

Relation extraction system output: large_scale_relation_extraction_results.tar.gz: this is the output of the relation extraction system, which includes both negative and positive predictions. The file has 5 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, prediction (positive or negative), and a list with the positive and negative scores coming from the relation extraction model. For example:

10092099 T1 T2 neg [1.0, 5.017947320683225e-13]
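A minimal sketch for reading these predictions and keeping only the positively predicted pairs. The input path, the whitespace-separated column handling, and the "pos" label for positive predictions are assumptions (the record only shows a negative example line):

import json

with open("large_scale_relation_extraction_results.tsv") as infile:
    for line in infile:
        # split into 5 fields; the last field is the bracketed score list
        pmid, ent1, ent2, label, scores = line.rstrip("\n").split(maxsplit=4)
        first_score, second_score = json.loads(scores)
        if label == "pos":
            print(pmid, ent1, ent2, first_score, second_score)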


Trigger word detection system input: combined_input_for_triggers.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the trigger word detection system. They contain only the pairs predicted as positive in the relation extraction system's large-scale predictions.

Trigger word detection system output: trigger_word_model_predictions.tar.gz contains the output of the large-scale execution pipeline, with 9 columns: PubMed ID, Entity ID1, Entity ID2, negative logit for the complex formation relationship, positive logit for the complex formation relationship, trigger score, start offset, end offset, trigger word match.
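A similar sketch for reading the trigger word predictions; the input path and the tab-separated column handling are assumptions, and the softmax over the two logits is shown only as a worked example of how they can be turned into a probability:

import math

with open("trigger_word_model_predictions.tsv") as infile:
    for line in infile:
        (pmid, ent1, ent2, neg_logit, pos_logit,
         trigger_score, start, end, trigger_word) = line.rstrip("\n").split("\t")
        # two-class softmax: probability of a complex formation relationship
        pos_prob = 1.0 / (1.0 + math.exp(float(neg_logit) - float(pos_logit)))
        print(pmid, trigger_word, start, end, trigger_score, round(pos_prob, 3))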

If you have the input in CoNLL format, please use the conll2standoff.py script from the BRAT GitHub repository to convert from CoNLL to the BRAT standoff format so that you can run our system.

Files

Files (86.2 GB)

Size       MD5
35.7 GB    md5:4bfc3b39d223d1cd844e731f4b024a21
24.9 GB    md5:591dfddbcc53f55e2d3ba4a101f65392
2.6 MB     md5:9ee12d329c384abcaf4cab27cd58b0a3
2.6 GB     md5:e5a39c719739c0076d4e2b51f806c9d1
13.6 GB    md5:b4cd3df840ccee854d85237c3c869c45
1.3 GB     md5:a69ba9f738481a901be5fc06aac70b15
2.5 GB     md5:27603075446c3ef8efff4ac8096c1cb9
5.4 GB     md5:881d20995d89af1989e125c9a69e1bc3
290.4 kB   md5:bedc55fb2ad2dd89818f361b61093d28
173.1 MB   md5:1e054e880a4366162e0ffb70292d694c

Additional details

Funding

European Commission
DeepTextNet – Deep learning-based text mining for interpretation of omics data (grant 101023676)