Data for "STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature"
- 1. Novo Nordisk Foundation Center for Protein Research, Denmark
- 2. University of Turku, Finland
- 3. Textimi
Description
ComplexTome.tar.gz: this file contains the corpus in BRAT format. The corpus is provided in two different directory organizations. The directory "splits" has the corpus split based on the train/dev/test used for the training of the relation extraction system, and the "data_source" directory has the corpus split based on the source of the data as described in the Methods section of the manuscript. The annotation guidelines along with the annotation configuration files for BRAT are provided in the root directory.
trigger_word_corpus.tar.gz: this file contains the corpus in BRAT format. The corpus is split into devel and test sets. The annotation guidelines for trigger word detection are at the bottom of the relation annotation guidelines provided above.
The command used to run tagger before large-scale execution of the RE system is:
gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv
Input documents for large-scale execution: all PubMed abstracts (as of August 2022) and all full texts available in the PubMed Central BioC text mining collection (as of April 2022). The files are converted to a tab-delimited format so that the tagger output can in turn be converted to a format compatible with the RE system (see below).
Input dictionary files: all the files necessary to execute the command above are available in dictionary-files-tagger-STRINGv12.tar.gz
Tagger output: we filter the results of the tagger run down to gene/protein hits, and to documents with more than one hit (since we are doing relation extraction), before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
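The filtering step described above could be sketched roughly as follows. This is not the script used to produce tagger_matches_ggp_only_gt_1_hit.tsv.gz. The column positions (document ID in column 0, entity type in column 6) and the rule that gene/protein hits carry a positive integer type are assumptions about the tagger output layout, and `filter_matches` and `read_tsv_gz` are hypothetical helper names. The sketch also holds all rows in memory, so it only suits small samples.

```python
import csv
import gzip
from collections import defaultdict


def read_tsv_gz(path):
    """Yield rows (lists of strings) from a gzipped tab-delimited file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def filter_matches(rows, doc_col=0, type_col=6, is_ggp=lambda t: int(t) > 0):
    """Keep gene/protein rows from documents with more than one such hit.

    ASSUMPTIONS: doc_col, type_col and the is_ggp rule are guesses about
    the tagger output layout; adjust them to the actual file.
    """
    by_doc = defaultdict(list)
    for row in rows:
        if is_ggp(row[type_col]):
            by_doc[row[doc_col]].append(row)
    # A relation needs two entities, so single-hit documents are dropped.
    return [row for hits in by_doc.values() if len(hits) > 1 for row in hits]
```

A call such as `filter_matches(read_tsv_gz("all_matches.tsv.gz"))` would then return the retained rows, provided the assumed layout holds.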
Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger TSV output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
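For readers unfamiliar with the .ann files, the following is a minimal sketch of reading the entity (text-bound) annotations from a BRAT standoff file. It relies only on the general BRAT standoff layout (`ID<TAB>Type start end<TAB>text`, with discontinuous spans separated by `;`). The name `parse_entities` is hypothetical, and the sketch makes no assumption about which entity type labels the files here use.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ent_id: str                    # e.g. "T1"
    ent_type: str                  # annotation type label
    spans: List[Tuple[int, int]]   # character offsets into the .txt file
    text: str                      # covered text


def parse_entities(ann_text: str) -> List[Entity]:
    """Extract text-bound annotations (IDs starting with 'T') from .ann content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, notes and other annotation kinds are skipped
        ent_id, type_and_spans, text = line.split("\t", 2)
        ent_type, span_str = type_and_spans.split(" ", 1)
        spans = [
            (int(start), int(end))
            for start, end in (part.split() for part in span_str.split(";"))
        ]
        entities.append(Entity(ent_id, ent_type, spans, text))
    return entities
```

The offsets index into the .txt file with the same base name, so `txt[start:end]` should reproduce the covered text for a contiguous span.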
Relation extraction models. The TensorFlow model used for large-scale relation extraction for STRING v12 is at relation_extraction_string_v12_best_model.tar.gz, while the PyTorch model used to do the relation extraction for trigger word detection is at relation_extraction_for_trigger_detection_best_model.tar.gz
The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learnt from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.
Relation extraction system output: large_scale_relation_extraction_results.tar.gz: this is the output of the relation extraction system, which includes both negative and positive predictions. The file has 5 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, prediction (positive or negative), and a list of the positive and negative scores from the relation extraction model. E.g.:
10092099 T1 T2 neg [1.0, 5.017947320683225e-13]
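A line in this format could be parsed along the following lines. This is a minimal sketch: `parse_prediction` and `Prediction` are hypothetical names, and splitting on any whitespace is an assumption made so that the parser works whether the columns are separated by tabs or by spaces. The scores are returned as an unlabeled list because the order of the two values is not stated explicitly; in the example above the first value appears to belong to the negative class, but that is an inference.

```python
import ast
from typing import List, NamedTuple


class Prediction(NamedTuple):
    pmid: str
    entity1: str        # BRAT ID of the first entity, e.g. "T1"
    entity2: str        # BRAT ID of the second entity
    label: str          # prediction label as written in the file
    scores: List[float]  # the two model scores, in file order


def parse_prediction(line: str) -> Prediction:
    """Split one output line into its five columns."""
    # maxsplit=4 keeps the bracketed score list (which contains a space) whole.
    pmid, entity1, entity2, label, raw_scores = line.strip().split(None, 4)
    scores = [float(value) for value in ast.literal_eval(raw_scores)]
    return Prediction(pmid, entity1, entity2, label, scores)
```

Applied to the example line, this yields a record whose `label` is `neg` and whose `scores` list holds the two floats shown.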
Trigger word detection system input: combined_input_for_triggers.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the trigger word detection system. They contain only the pairs predicted as positive in the relation extraction system's large-scale predictions.
Trigger word detection system output: trigger_word_model_predictions.tar.gz contains the output of the large scale execution pipeline, with 9 columns: PubMed id, Entity ID1, Entity ID2, Negative logit for complex formation relationship, Positive logit for complex formation relationship, trigger score, start offset, end offset, trigger word match.
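The nine columns listed above could be read into a typed record roughly as follows. This is a sketch under stated assumptions: the file is taken to be tab-delimited, the logits and trigger score are taken to be floats and the offsets integers, and `TriggerPrediction` and `parse_trigger_line` are hypothetical names.

```python
from typing import NamedTuple


class TriggerPrediction(NamedTuple):
    pmid: str
    entity1: str
    entity2: str
    negative_logit: float  # complex formation, negative class
    positive_logit: float  # complex formation, positive class
    trigger_score: float
    start: int             # start offset of the trigger word
    end: int               # end offset of the trigger word
    trigger: str           # matched trigger word


def parse_trigger_line(line: str) -> TriggerPrediction:
    """Parse one line, ASSUMING nine tab-separated columns in the listed order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return TriggerPrediction(
        fields[0], fields[1], fields[2],
        float(fields[3]), float(fields[4]), float(fields[5]),
        int(fields[6]), int(fields[7]),
        fields[8],
    )
```

Raising on an unexpected column count makes a wrong delimiter assumption visible immediately rather than producing misaligned records.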
If your input is in CoNLL format, use the conll2standoff.py script from the BRAT GitHub repository to convert it from CoNLL to BRAT so that you can run our system.
Files
(86.2 GB)
md5 checksum | Size |
---|---|
md5:4bfc3b39d223d1cd844e731f4b024a21 | 35.7 GB |
md5:591dfddbcc53f55e2d3ba4a101f65392 | 24.9 GB |
md5:9ee12d329c384abcaf4cab27cd58b0a3 | 2.6 MB |
md5:e5a39c719739c0076d4e2b51f806c9d1 | 2.6 GB |
md5:b4cd3df840ccee854d85237c3c869c45 | 13.6 GB |
md5:a69ba9f738481a901be5fc06aac70b15 | 1.3 GB |
md5:27603075446c3ef8efff4ac8096c1cb9 | 2.5 GB |
md5:881d20995d89af1989e125c9a69e1bc3 | 5.4 GB |
md5:bedc55fb2ad2dd89818f361b61093d28 | 290.4 kB |
md5:1e054e880a4366162e0ffb70292d694c | 173.1 MB |
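Given the size of these archives, it may be worth checking a download against the md5 value listed for it. A minimal sketch using Python's standard `hashlib` follows; `md5_of_file` and `matches_listing` are hypothetical helper names, and the file path and listed checksum have to be supplied by the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use flat even for multi-GB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:4bfc...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A mismatch usually indicates an incomplete or corrupted download rather than a problem with the archive itself.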