A global network of biomedical relationships derived from text

doi:10.5281/zenodo.1035253

Published October 23, 2017 | Version v1

Dataset Open

1. Icahn School of Medicine at Mount Sinai
2. Stanford University

This repository contains labeled, weighted networks of chemical-gene, gene-gene, gene-disease, and chemical-disease relationships based on single sentences in PubMed abstracts. All raw dependency paths are provided in addition to the labeled relationships.

PART I: Connects dependency paths to labels, or "themes". Each record contains a dependency path followed by its score for each theme and an indicator of whether the path belongs to the flagship path set for that theme (meaning that it was manually reviewed and judged to reflect the theme). The themes are listed below and described in our paper (see REFERENCES).
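
A minimal sketch of reading a Part I file, offered as an illustration only. It assumes the file is tab-delimited with a header row, that the first column is the dependency path, and that the remaining columns alternate between a theme score and a flagship indicator. The column names in the sample (`path`, `T`, `T.ind`) and the function name `parse_theme_distributions` are hypothetical; check the header of the downloaded file before relying on this layout.

```python
import io


def parse_theme_distributions(lines):
    """Yield (path, scores, flagship) for each Part I record.

    Assumed layout (not confirmed by the dataset description):
    tab-delimited, header row first, column 0 = dependency path,
    then one (score, indicator) column pair per theme.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    themes = header[1::2]  # every other column after the path names a theme
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            continue  # skip malformed rows rather than guess at their meaning
        path = fields[0]
        scores = {t: float(v) for t, v in zip(themes, fields[1::2])}
        flagship = {t: float(v) != 0.0 for t, v in zip(themes, fields[2::2])}
        yield path, scores, flagship


# Synthetic two-theme example; the path and values are made up.
sample = io.StringIO(
    "path\tT\tT.ind\tC\tC.ind\n"
    "start_entity|nmod|treat|dobj|end_entity\t12.0\t1\t0.5\t0\n"
)
records = list(parse_theme_distributions(sample))
```

Streaming the file line by line keeps memory use flat, which matters for the larger theme-distribution files.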

PART II: Connects sentences to dependency paths. It consists of sentences and associated metadata, entity pairs found in the sentences, and dependency paths connecting those entity pairs. Each record contains the following information:

PubMed ID
Sentence number (0 = title)
First entity name, formatted
First entity name, location (characters from start of abstract)
Second entity name, formatted
Second entity name, location
First entity name, raw string
Second entity name, raw string
First entity name, database ID(s)
Second entity name, database ID(s)
First entity type (Chemical, Gene, Disease)
Second entity type (Chemical, Gene, Disease)
Dependency path
Sentence, tokenized
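
The field list above can be mapped onto a typed record as follows. This is a sketch under two assumptions that the description does not state: that each record is one tab-delimited line, and that the fields appear in the order listed. The names `PathRecord` and `parse_path_record` are illustrative. Entity locations are kept as strings because their exact encoding is not specified here.

```python
from typing import NamedTuple


class PathRecord(NamedTuple):
    pubmed_id: str
    sentence_number: int  # 0 = title
    entity1_name: str
    entity1_location: str  # characters from start of abstract
    entity2_name: str
    entity2_location: str
    entity1_raw: str
    entity2_raw: str
    entity1_ids: str
    entity2_ids: str
    entity1_type: str  # Chemical, Gene, or Disease
    entity2_type: str
    dependency_path: str
    sentence: str  # tokenized sentence text


def parse_path_record(line):
    """Parse one Part II line, assuming 14 tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PathRecord._fields):
        raise ValueError(f"expected 14 fields, found {len(fields)}")
    fields[1] = int(fields[1])
    return PathRecord(*fields)


# Synthetic record; every value here is invented for illustration.
example_line = "\t".join([
    "12345", "0", "drug_x", "10,16", "disease_y", "30,39",
    "drug X", "disease Y", "ID1", "ID2", "Chemical", "Disease",
    "start_entity|nmod|treat|dobj|end_entity",
    "drug X was used to treat disease Y .",
]) + "\n"
record = parse_path_record(example_line)
```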

The "with-themes.txt" files only contain dependency paths with corresponding theme assignments from Part I. The plain ".txt" files contain all dependency paths.
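
The relationship between the two kinds of Part II file can be sketched as a filter: keep only those Part II records whose dependency path also appears in the matching Part I file. The code below assumes tab-delimited files, a header row in Part I, the dependency path as the 13th Part II field, and that paths can be matched after lowercasing; none of these details is confirmed by the description, and the function names are hypothetical.

```python
def load_themed_paths(part1_lines):
    """Collect the dependency paths listed in a Part I file."""
    lines = iter(part1_lines)
    next(lines, None)  # assumed header row
    return {
        line.split("\t", 1)[0].strip().lower()
        for line in lines
        if line.strip()
    }


def keep_themed(part2_lines, themed, path_column=12):
    """Yield Part II lines whose dependency path has a theme assignment."""
    for line in part2_lines:
        fields = line.rstrip("\n").split("\t")
        # Lowercasing is a precaution; the files may already agree on case.
        if len(fields) > path_column and fields[path_column].lower() in themed:
            yield line


def _row(path):
    """Build a synthetic 14-field Part II line with the given path."""
    fields = ["x"] * 14
    fields[12] = path
    return "\t".join(fields) + "\n"


# Synthetic data; the paths and scores are made up for illustration.
part1 = ["path\tT\tT.ind\n", "a|nmod|b\t1.0\t1\n"]
part2 = [_row("A|nmod|B"), _row("c|dobj|d")]
themed = load_themed_paths(part1)
kept = list(keep_themed(part2, themed))
```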

This release contains the annotated network for the April 30, 2016 version of PubTator, which is described in our paper (below). We will release updated versions of the network periodically, as PubTator itself is updated roughly once a month.

------------------------------------------------------------------------------------
REFERENCES

Percha B, Altman RB (2017) A global network of biomedical relationships derived from text. (Submitted to Bioinformatics; currently in revision.)
Percha B, Altman RB (2015) Learning the structure of biomedical relationships from unstructured text. PLoS Computational Biology, 11(7): e1004216.

This project depends on named entity annotations from the PubTator project:
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Reference:
Wei CH et al., PubTator: a Web-based text mining tool for assisting biocuration, Nucleic Acids Research, 2013, 41 (W1): W518-W522. doi: 10.1093/nar/gkt44

Dependency parsing was provided by the Stanford CoreNLP toolkit:
https://stanfordnlp.github.io/CoreNLP/index.html

Reference:
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

------------------------------------------------------------------------------------
THEMES

chemical-gene
(A+) agonism, activation
(A-) antagonism, blocking
(B) binding, ligand (esp. receptors)
(E+) increases expression/production
(E-) decreases expression/production
(E) affects expression/production (neutral)
(N) inhibits

gene-chemical
(O) transport, channels
(K) metabolism, pharmacokinetics
(Z) enzyme activity

chemical-disease
(T) treatment/therapy (including investigatory)
(C) inhibits cell growth (esp. cancers)
(Sa) side effect/adverse event
(Pr) prevents, suppresses
(Pa) alleviates, reduces
(J) role in disease pathogenesis

disease-chemical
(Mp) biomarkers (of disease progression)

gene-disease
(U) causal mutations
(Ud) mutations affecting disease course
(D) drug targets
(J) role in pathogenesis
(Te) possible therapeutic effect
(Y) polymorphisms alter risk
(G) promotes progression

disease-gene
(Md) biomarkers (diagnostic)
(X) overexpression in disease
(L) improper regulation linked to disease

gene-gene
(B) binding, ligand (esp. receptors)
(W) enhances response
(V+) activates, stimulates
(E+) increases expression/production
(E) affects expression/production (neutral)
(I) signaling pathway
(H) same protein or complex
(Rg) regulation
(Q) production by cell population

Files (9.1 GB)

Name	MD5	Size
part-i-chemical-disease-theme-distributions.txt	54e91b0951f79dc4a26ee22a04f185fe	392.9 MB
part-i-chemical-gene-theme-distributions.txt	59bc363a79a6b2fa3a4c3bdc4aaf300c	7.9 MB
part-i-gene-disease-theme-distributions.txt	fe4c741b45fc874e2f790bdf0ac2ee30	29.2 MB
part-i-gene-gene-theme-distributions.txt	c142a08cde56f5213ac052039cdaf7ac	8.2 MB
part-ii-dependency-paths-chemical-disease-sorted-with-themes.txt	cfd8555d5ba72ae601bcf72c618662ce	1.5 GB
part-ii-dependency-paths-chemical-disease-sorted.txt	e51f502d362e1b427e91bfa9c6964bbd	6.0 GB
part-ii-dependency-paths-chemical-gene-sorted-with-themes.txt	fbd14d2506ba769e4bafbc34e7ada823	29.1 MB
part-ii-dependency-paths-chemical-gene-sorted.txt	25d8f5e55d27b544f13722c0dac5888b	190.0 MB
part-ii-dependency-paths-gene-disease-sorted-with-themes.txt	7f03a03839927f28184b837eb5392bb0	107.3 MB
part-ii-dependency-paths-gene-disease-sorted.txt	d6c92a0f30c2654e9552f39a3777ee32	364.6 MB
part-ii-dependency-paths-gene-gene-sorted-with-themes.txt	397cc28e0f620481f09e8d2bc76c56fe	43.0 MB
part-ii-dependency-paths-gene-gene-sorted.txt	5c244a3ab1019a0d0a2e950a8e31682d	389.0 MB
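
Because several of these files are large, it can be worth checking a download against its published MD5 before parsing it. The sketch below uses only Python's standard `hashlib`; the function names `md5_of_file` and `verify_download` are illustrative, and the local file path in the commented usage line is an assumption about where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """True if the file's MD5 matches the checksum from the file listing."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example usage, assuming the file sits in the current directory:
# verify_download(
#     "part-i-chemical-gene-theme-distributions.txt",
#     "59bc363a79a6b2fa3a4c3bdc4aaf300c",
# )
```

Reading in chunks avoids loading a multi-gigabyte file into memory at once.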

