Published May 20, 2024 | Version v3
Journal article Open

Data for "STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature"

  • 1. Novo Nordisk Foundation Center for Protein Research, Denmark
  • 2. University of Turku, Finland
  • 3. Textimi

Description

ComplexTome.tar.gz: this file contains the corpus in BRAT format. The corpus is provided in two different directory organizations. The directory "splits" has the corpus split based on the train/dev/test used for the training of the relation extraction system, and the "data_source" directory has the corpus split based on the source of the data as described in the Methods section of the manuscript. The annotation guidelines along with the annotation configuration files for BRAT are provided in the root directory.
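For readers who have not worked with BRAT before, the sketch below reads the entity ("T") and binary relation ("R") lines of a BRAT standoff .ann file. It is a minimal illustration, not the loader used by the authors. The sample annotation, including the labels GGP and Complex_formation, is an assumption made for the example; the actual labels are defined in the annotation configuration files shipped with the corpus.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    id: str
    type: str
    arg1: str
    arg2: str


def parse_ann(ann_text):
    """Parse entity (T) and relation (R) lines of BRAT standoff text."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # "Type start end"; discontinuous spans look like "0 4;10 15",
            # in which case we keep the first start and the last end.
            ent_type, span = fields[1].split(" ", 1)
            offsets = span.replace(";", " ").split()
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(
                ann_id, ent_type, int(offsets[0]), int(offsets[-1]), text
            )
        elif ann_id.startswith("R"):
            # "Type Arg1:T1 Arg2:T2"
            rel_type, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(ann_id, rel_type, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Illustrative annotation for the sentence "TP53 interacts with MDM2".
# The labels are assumptions, not taken from the corpus itself.
sample = (
    "T1\tGGP 0 4\tTP53\n"
    "T2\tGGP 20 24\tMDM2\n"
    "R1\tComplex_formation Arg1:T1 Arg2:T2\n"
)
entities, relations = parse_ann(sample)
for rel in relations:
    print(rel.type, entities[rel.arg1].text, entities[rel.arg2].text)
```

Other BRAT annotation kinds (events, attributes, notes) are skipped by this sketch.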

trigger_word_corpus.tar.gz: this file contains the trigger word corpus in BRAT format. The corpus is split into development and test sets. The annotation guidelines for trigger word detection are at the bottom of the relation annotation guidelines provided above.


The command used to run tagger before large-scale execution of the RE system is:

gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv

Input documents for large-scale execution: all PubMed abstracts (as of August 2022) and all full texts available in the PubMed Central BioC text mining collection (as of April 2022). The files are converted to a tab-delimited format so that the tagger output can in turn be converted to a format compatible with the RE system (see below).

Input dictionary files: all the files necessary to execute the command above are available in dictionary-files-tagger-STRINGv12.tar.gz

Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than one hit (since relation extraction needs at least two entities) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
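The filtering step can be sketched as follows. This is an illustration of the idea only, not the script the authors used. It assumes each match is a tuple whose first element is the document ID, and that a caller-supplied predicate (here the hypothetical is_gene_or_protein) identifies gene/protein hits; the real column layout of the tagger output is not described here.

```python
from collections import Counter


def keep_multi_hit_documents(matches, is_gene_or_protein):
    """Keep gene/protein matches from documents that have more than one such match.

    `matches` is an iterable of tuples whose first element is the document ID
    (an assumed layout). `is_gene_or_protein` is a caller-supplied predicate.
    """
    ggp_matches = [m for m in matches if is_gene_or_protein(m)]
    hits_per_document = Counter(m[0] for m in ggp_matches)
    return [m for m in ggp_matches if hits_per_document[m[0]] > 1]


# Made-up rows: (document ID, matched text, entity kind).
example_matches = [
    ("100", "TP53", "ggp"),
    ("100", "MDM2", "ggp"),
    ("100", "aspirin", "chemical"),
    ("200", "BRCA1", "ggp"),
]
kept = keep_multi_hit_documents(example_matches, lambda m: m[2] == "ggp")
print(kept)
```

Document 200 is dropped because a single gene/protein mention cannot form a pair.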


Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger TSV output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.

Relation extraction models. The TensorFlow model used for large-scale relation extraction for STRING v12 is at relation_extraction_string_v12_best_model.tar.gz, while the PyTorch model used to do the relation extraction for trigger word detection is at relation_extraction_for_trigger_detection_best_model.tar.gz

The RoBERTa model pre-trained on PubMed, PMC, and MIMIC-III with a BPE vocabulary learnt from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here.

Relation extraction system output: large_scale_relation_extraction_results.tar.gz: this is the output of the relation extraction system, which includes both negative and positive predictions. The file has 5 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, prediction (positive or negative), and a list with the negative and positive scores from the relation extraction model. E.g.:

10092099 T1 T2 neg [1.0, 5.017947320683225e-13]
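A row like the one above can be read with a few lines of Python. This is a minimal sketch that assumes the columns are separated by whitespace (tabs or spaces) and that the last column is a Python-style list literal, as in the example row. The function name parse_re_line is made up for illustration.

```python
import ast


def parse_re_line(line):
    """Split one output row into its five columns.

    Splitting at most four times keeps the score list (which contains a
    space after the comma) together as the fifth field.
    """
    pmid, entity1, entity2, prediction, raw_scores = line.strip().split(None, 4)
    scores = ast.literal_eval(raw_scores)  # safely evaluates the list literal
    return {
        "pmid": pmid,
        "entity1": entity1,
        "entity2": entity2,
        "prediction": prediction,
        "scores": scores,
    }


row = parse_re_line("10092099 T1 T2 neg [1.0, 5.017947320683225e-13]")
# For this "neg" row the first score is the large one, which suggests the
# list is ordered [negative, positive]; that order is inferred from the
# example rather than documented.
print(row["pmid"], row["prediction"], row["scores"])
```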

Relation extraction system - ComplexTome test set predictions: test-set-predictions-RE.tar.gz: this file contains the BRAT-formatted predictions of the model on the ComplexTome test set, which can be viewed after setting up a BRAT server following the instructions here.

Trigger word detection system input: combined_input_for_triggers.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the trigger word detection system. They contain only the pairs predicted as positive in the relation extraction system's large-scale predictions.

Trigger word detection system output: trigger_word_model_predictions.tar.gz contains the output of the large-scale execution pipeline, with 9 columns: PubMed ID, Entity ID1, Entity ID2, negative logit for the complex formation relationship, positive logit for the complex formation relationship, trigger score, start offset, end offset, and trigger word match.
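As a rough illustration of how these nine columns might be consumed, the sketch below parses one row and turns the two logits into a probability with a two-class softmax. It assumes the file is tab-delimited, which is not stated above. The field names and the sample row are invented for the example and are not real output.

```python
import math

FIELDS = (
    "pmid", "entity1", "entity2", "neg_logit", "pos_logit",
    "trigger_score", "start", "end", "trigger",
)


def parse_trigger_line(line):
    """Parse one row, assuming nine tab-separated columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    row = dict(zip(FIELDS, values))
    for key in ("neg_logit", "pos_logit", "trigger_score"):
        row[key] = float(row[key])
    for key in ("start", "end"):
        row[key] = int(row[key])
    return row


def positive_probability(neg_logit, pos_logit):
    """Two-class softmax; subtracting the maximum avoids overflow."""
    top = max(neg_logit, pos_logit)
    exp_neg = math.exp(neg_logit - top)
    exp_pos = math.exp(pos_logit - top)
    return exp_pos / (exp_neg + exp_pos)


# Made-up row for illustration only.
sample = "12345678\tT1\tT2\t-2.0\t3.0\t0.9\t10\t15\tbinds"
row = parse_trigger_line(sample)
print(row["trigger"], positive_probability(row["neg_logit"], row["pos_logit"]))
```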

Trigger word detection system - trigger word corpus test set predictions: test-set-predictions-trigger.tar.gz: this file contains the BRAT-formatted predictions of the model on the trigger word test set, which can be viewed after setting up a BRAT server following the instructions here.

If your input is in CoNLL format, use the conll2standoff.py script from the BRAT GitHub repository to convert it to BRAT format before running our system.

BRAT screenshots of the ComplexTome test set predictions: RE_error_analysis_BRAT.zip

BRAT screenshots of the trigger word test set predictions: trigger_error_analysis_BRAT.zip

Evaluation of large-scale relation extraction: We randomly selected 1000 papers from the literature and assessed the acceptability of relation extraction in those papers. The documents and the results of this evaluation can be found in 1000_random_docs_check_ComplexTome.tar.gz

Files (86.3 GB)

MD5 checksum                            Size
md5:a5322fa0d401aa93358d40d3a162ac50    2.0 MB
md5:4bfc3b39d223d1cd844e731f4b024a21    35.7 GB
md5:591dfddbcc53f55e2d3ba4a101f65392    24.9 GB
md5:488bc98d368c4661d648fe302cae58b4    2.2 MB
md5:e5a39c719739c0076d4e2b51f806c9d1    2.6 GB
md5:b4cd3df840ccee854d85237c3c869c45    13.6 GB
md5:e92bdb6c6931c89a2ab20e44af99e527    37.8 MB
md5:a69ba9f738481a901be5fc06aac70b15    1.3 GB
md5:27603075446c3ef8efff4ac8096c1cb9    2.5 GB
md5:881d20995d89af1989e125c9a69e1bc3    5.4 GB
md5:ba876df75ea0dfcdf8675906d51c1cb4    196.3 kB
md5:ef3efca46ca380dfefbd1abd43b07ee2    55.6 kB
md5:c02db32cc68d467e124c2ca4c53800c0    17.4 MB
md5:bedc55fb2ad2dd89818f361b61093d28    290.4 kB
md5:1e054e880a4366162e0ffb70292d694c    173.1 MB
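
Because several of the archives are tens of gigabytes, it can be worth checking a download against its listed MD5 checksum. The sketch below shows one way to do that with Python's standard library. The function names and the file path in the usage note are placeholders, and the listing does not show which checksum belongs to which file, so the pairing has to come from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_checksum("path/to/download.tar.gz", "md5:...") returns True only when the file on disk hashes to the listed value; both arguments here are placeholders.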

Additional details

Funding

DeepTextNet – Deep learning-based text mining for interpretation of omics data 101023676
European Commission