A global network of biomedical relationships derived from text

Percha, Bethany (1); Altman, Russ B. (2)

doi:10.5281/zenodo.1495808

Published November 26, 2018 | Version v5

Dataset Open

1. Icahn School of Medicine at Mount Sinai
2. Stanford University

This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.

PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme, and indicators of whether or not the path is part of the flagship path set for each theme (meaning that it was manually reviewed and determined to reflect that theme). The themes themselves are listed below and are in our paper (reference below).
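As a rough illustration of how a Part I record might be read, the sketch below parses a path, its per-theme scores, and its flagship indicators. The file layout is an assumption, not something stated above: it assumes a tab-separated file whose header row starts with "path", followed by one score column per theme and a matching "<theme>.ind" indicator column. The function name read_theme_scores and the sample row are made up for illustration; check the header of the actual file before relying on this.

```python
import io


def read_theme_scores(handle):
    """Yield (path, scores, flagship) tuples from a Part I style file.

    Assumed layout (not confirmed by the description above): tab-separated,
    header row of the form: path, T, T.ind, Sa, Sa.ind, ...
    """
    header = handle.readline().rstrip("\n").split("\t")
    # Theme columns are every non-indicator column after the path column.
    themes = [name for name in header[1:] if not name.endswith(".ind")]
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        record = dict(zip(header, fields))
        scores = {t: float(record[t]) for t in themes}
        flagship = {t: float(record[t + ".ind"]) > 0 for t in themes}
        yield fields[0], scores, flagship


# Invented sample data: one path scored against two chemical-disease themes.
SAMPLE = (
    "path\tT\tT.ind\tSa\tSa.ind\n"
    "start_entity|nmod|treat treat|nmod|end_entity\t12.0\t1\t3.0\t0\n"
)

records = list(read_theme_scores(io.StringIO(SAMPLE)))
path, scores, flagship = records[0]
best_theme = max(scores, key=scores.get)  # highest-scoring theme for this path
```

For a real file, the same function could be called on an open text handle instead of the in-memory sample.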

PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:

PubMed ID
Sentence number (0 = title)
First entity name, formatted
First entity name, location (characters from start of abstract)
Second entity name, formatted
Second entity name, location
First entity name, raw string
Second entity name, raw string
First entity name, database ID(s)
Second entity name, database ID(s)
First entity type (Chemical, Gene, Disease)
Second entity type (Chemical, Gene, Disease)
Dependency path
Sentence, tokenized
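The fourteen fields above can be mapped onto a small record type. The sketch below assumes each record is a single tab-separated line with the fields in the order listed; the delimiter, the class name PathRecord, the function parse_record, and the sample line are all assumptions for illustration. Entity locations are kept as strings because their exact format is not specified here.

```python
from typing import NamedTuple


class PathRecord(NamedTuple):
    """One Part II record, with fields in the order listed above."""
    pubmed_id: str
    sentence_number: int      # 0 = title
    entity1_name: str         # formatted
    entity1_location: str     # left as text; exact format not specified
    entity2_name: str         # formatted
    entity2_location: str
    entity1_raw: str
    entity2_raw: str
    entity1_ids: str          # database ID(s)
    entity2_ids: str
    entity1_type: str         # Chemical, Gene, or Disease
    entity2_type: str
    dependency_path: str
    sentence: str             # tokenized sentence


def parse_record(line):
    """Parse one line, assuming tab-separated fields (an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PathRecord._fields):
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    fields[1] = int(fields[1])
    return PathRecord(*fields)


# Invented sample line with placeholder IDs, for illustration only.
SAMPLE_LINE = "\t".join([
    "0000000", "0", "aspirin", "0,7", "headache", "17,25",
    "Aspirin", "headache", "ID1", "ID2", "Chemical", "Disease",
    "START_ENTITY|nsubj|relieves relieves|dobj|END_ENTITY",
    "Aspirin relieves headache .",
])

record = parse_record(SAMPLE_LINE)
is_title = record.sentence_number == 0
```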

The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.

This release contains the annotated network for the October 19, 2018 version of PubTator. The version discussed in our paper (see REFERENCES below) is an older one, from April 30, 2016. If you're interested in that network, it can be found in Version 1 of this repository. We will release updated networks periodically, as the PubTator community continues to release new versions of its named entity annotations for Medline roughly every month.

------------------------------------------------------------------------------------
REFERENCES

Percha B, Altman RB (2017) A global network of biomedical relationships derived from text. Bioinformatics, 34(15): 2614-2624.
Percha B, Altman RB (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.

This project depends on named entity annotations from the PubTator project:
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Reference:
Wei CH et al., PubTator: a Web-based text mining tool for assisting biocuration, Nucleic Acids Research, 2013, 41 (W1): W518-W522.

Dependency parsing was provided by the Stanford CoreNLP toolkit (version 3.9.1):
https://stanfordnlp.github.io/CoreNLP/index.html

Reference:
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

------------------------------------------------------------------------------------
THEMES

chemical-gene
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(N) inhibits

gene-chemical
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity

chemical-disease
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis

disease-chemical
(Mp) biomarkers (of disease progression)

gene-disease
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression

disease-gene
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease

gene-gene
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Rg) regulation
(Q) production by cell population

------------------------------------------------------------------------------------
FORMATTING NOTE

A few users have mentioned that the dependency paths in the "part-i" files are all lowercase text, whereas those in the "part-ii" files maintain the case of the original sentence. This complicates mapping between the two sets of files.

We kept the part-ii files in the same case as the original sentence to facilitate downstream debugging: it's easier to tell which words in a particular sentence are contributing to the dependency path if their original case is maintained. When working with the part-ii "with-themes" files, converting the dependency path to lowercase is guaranteed to match one of the paths in the corresponding part-i file, which lets you look up the theme scores.
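The lowercase mapping described above can be sketched as a dictionary lookup. This is a minimal illustration, assuming the part-i scores have already been loaded into a dict keyed by the (lowercase) dependency path; the function name theme_scores_for and the example paths and scores are made up. For paths from the plain ".txt" part-ii files, which are not guaranteed to have a part-i entry, the lookup returns None.

```python
def theme_scores_for(part_ii_path, part_i_scores):
    """Return the part-i theme scores for a part-ii dependency path.

    part-i paths are all lowercase, so the part-ii path is lowercased
    before the lookup. Returns None when the path has no part-i entry.
    """
    return part_i_scores.get(part_ii_path.lower())


# Invented example data, for illustration only.
part_i_scores = {
    "start_entity|nsubj|inhibits inhibits|dobj|end_entity": {"N": 8.0, "B": 1.0},
}
part_ii_path = "START_ENTITY|nsubj|Inhibits Inhibits|dobj|END_ENTITY"

scores = theme_scores_for(part_ii_path, part_i_scores)
missing = theme_scores_for("START_ENTITY|dep|END_ENTITY", part_i_scores)
```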

Apologies for the additional complexity, and please reach out to us if you have any questions (see correspondence information in the Bioinformatics manuscript, above).

Files (7.7 GB)

Name	MD5	Size
part-i-chemical-disease-path-theme-distributions.txt.zip	6fa31b2ead783e2db11c877c7ef92d92	73.5 MB
part-i-chemical-gene-path-theme-distributions.txt.zip	3455ead931cedb502a42747bd257b3c3	26.0 MB
part-i-gene-disease-path-theme-distributions.txt.zip	8a3d4f0d640c6d010f2cae36a3e4310d	67.7 MB
part-i-gene-gene-path-theme-distributions.txt.zip	86c44e2eac2c571f3f0f74ce27938461	55.6 MB
part-ii-dependency-paths-chemical-disease-sorted-with-themes.txt.zip	63d7bb63b9b70ac558bb32b4fc7d290a	415.7 MB
part-ii-dependency-paths-chemical-disease-sorted.txt.zip	fb0896ce1a9194e5a691d5e463a9ad73	1.5 GB
part-ii-dependency-paths-chemical-gene-sorted-with-themes.txt.zip	623383d5af1b9d40121b3a8780331cba	157.5 MB
part-ii-dependency-paths-chemical-gene-sorted.txt.zip	7e32f969592f9826e2ec9f1b6ab70949	888.9 MB
part-ii-dependency-paths-gene-disease-sorted-with-themes.txt.zip	f8b58493ed8330157bdc26f33de3581f	332.6 MB
part-ii-dependency-paths-gene-disease-sorted.txt.zip	9ced2cdaefc69b544efab594abb3f10f	1.1 GB
part-ii-dependency-paths-gene-gene-sorted-with-themes.txt.zip	bef8039257e18712c1f403eb0cdb2e1e	404.0 MB
part-ii-dependency-paths-gene-gene-sorted.txt.zip	c4f246a9a556fac565c85238b07c64d4	2.6 GB
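Since several of these archives are large, it may be worth checking a download against its listed MD5 checksum before unzipping. The sketch below uses Python's standard hashlib module and reads the file in chunks so that multi-gigabyte archives do not need to fit in memory. The helper names md5_of_file and verify_download are hypothetical, only one checksum from the listing is included as an example, and the file is assumed to sit in the current working directory.

```python
import hashlib
import os

# One checksum copied from the file listing; add the others as needed.
EXPECTED_MD5 = {
    "part-i-chemical-gene-path-theme-distributions.txt.zip":
        "3455ead931cedb502a42747bd257b3c3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed checksum."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    name = "part-i-chemical-gene-path-theme-distributions.txt.zip"
    if os.path.exists(name):
        print(name, "OK" if verify_download(name) else "checksum mismatch")
    else:
        print(name, "not found; download it first")
```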

