Biolinks, datasets and algorithms supporting semantic-based distribution and similarity for scientific publications

Garcia Castro, Leyla Jael; Berlanga, Rafael; Garcia, Alexander

doi:10.5281/zenodo.829920

Published July 15, 2017 | Version v2

Dataset Open

1. Universidad Jaume I
2. Universidad Politécnica de Madrid

Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide a list of similar articles. There, similarity is commonly calculated based on the title, the abstract and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score according to semantic groups, as well as a semantic similarity based either on all identified annotations or on annotations narrowed to one or more particular semantic groups. Biolinks supports both title-and-abstract only and full-text articles.

Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The title and abstract of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; these annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full-text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations, the FT-dataset, under the pmc folder. The articles with title-and-abstract are listed in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full-text are listed in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file.
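As a quick sanity check on the three datasets, the JSON files in JSON-LD.zip can be counted per folder. The following is a minimal sketch, not part of the Biolinks code: the function names are hypothetical, and it assumes that pubmed, pubmed-pmc and pmc appear as directory components of the archive member paths, which has not been verified against the archive itself.

```python
import zipfile
from collections import Counter

# Folder names come from the description above; where exactly they sit in
# the archive paths is an assumption.
DATASET_FOLDERS = {
    "pubmed": "TA-dataset",
    "pubmed-pmc": "TAFT-dataset",
    "pmc": "FT-dataset",
}


def count_by_dataset(member_names):
    """Count .json members per dataset, keyed by the dataset label."""
    counts = Counter()
    for name in member_names:
        if not name.lower().endswith(".json"):
            continue
        # Look only at directory components, never at the file name itself.
        for part in name.split("/")[:-1]:
            if part in DATASET_FOLDERS:
                counts[DATASET_FOLDERS[part]] += 1
                break
    return dict(counts)


def count_in_archive(zip_path="JSON-LD.zip"):
    """Open the archive and count its JSON members per dataset."""
    with zipfile.ZipFile(zip_path) as archive:
        return count_by_dataset(archive.namelist())
```

If the archive follows the layout described above, the counts should be close to 4,240 for the TA-dataset and 62 for each of the TAFT and FT datasets.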

Here we include the annotations on title and abstract as well as those on full-text for all our datasets (Profiles.zip). We also provide the global similarity matrices (Similarity.zip).

Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups; see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks; see biolinks_groups.tsv. The IG was calculated for the Biolinks groups as well (IG_biolinks_groups.PMID.xlsx), showing an improvement of around 5%.
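For readers unfamiliar with Information Gain, the sketch below shows the textbook definition, IG = H(Y) - H(Y | X), for a single binary feature. It is illustrative only: treating the feature as "the article has at least one annotation in a given semantic group" and the class label as a topic assignment are assumptions, and the spreadsheets above may have been produced with a different formulation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, has_group):
    """IG = H(labels) - H(labels | has_group) for one binary feature.

    labels:    class label per article (assumed here: a topic identifier)
    has_group: True/False per article (assumed here: group present or not)
    """
    total = len(labels)
    if total == 0:
        return 0.0
    with_group = [y for y, flag in zip(labels, has_group) if flag]
    without_group = [y for y, flag in zip(labels, has_group) if not flag]
    # Conditional entropy: entropy of each split, weighted by its size.
    conditional = sum(
        (len(subset) / total) * entropy(subset)
        for subset in (with_group, without_group)
    )
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly yields an IG equal to the entropy of the labels, while a feature that is independent of the labels yields an IG of zero.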

To assess the similarity metric with regard to the cohesion of the TREC-05 groups, we used Silhouette Coefficient analyses. An additional dataset, the Stem-TAFT-dataset, was used and compared to the TAFT and FT datasets.
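The Silhouette Coefficient of an item compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b): s = (b - a) / max(a, b). The sketch below computes it from a similarity matrix. It is a generic illustration under stated assumptions: similarities are taken to lie in [0, 1] and are converted to distances as 1 - similarity, and the cluster labels stand in for the TREC-05 groups. The analysis reported here may have used a different conversion or tool.

```python
def silhouette_scores(similarity, clusters):
    """Per-item silhouette s(i) = (b - a) / max(a, b), with distance = 1 - similarity.

    similarity: square matrix (list of lists), values assumed in [0, 1]
    clusters:   cluster label per item, in matrix order
    """
    n = len(clusters)
    scores = []
    for i in range(n):
        # Accumulate distances from item i to every cluster, excluding i itself.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            c = clusters[j]
            sums[c] = sums.get(c, 0.0) + (1.0 - similarity[i][j])
            counts[c] = counts.get(c, 0) + 1
        own = clusters[i]
        others = [sums[c] / counts[c] for c in sums if c != own]
        if own not in sums or not others:
            # Singleton cluster, or no other cluster to compare with:
            # the score is set to 0 here by convention.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # cohesion: mean distance within own cluster
        b = min(others)              # separation: nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

Scores close to 1 indicate that an article is much closer to its own group than to any other, scores near 0 indicate overlap between groups, and negative scores suggest the article sits closer to another group.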

Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open-access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. To facilitate the analysis of our datasets based on the annotations as well as on the distribution and similarity scores, some web-based visualization components were created. All of them are open-access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; this demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.
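To make the two ideas concrete, the sketch below computes a per-group distribution as the share of annotation frequency in each semantic group, and a similarity restricted to selected groups. It does not reproduce the Biolinks algorithms: the (concept, group, frequency) tuple layout, the function names and the group labels used in the example are hypothetical, and plain cosine similarity is swapped in for the metric based on PubMed related articles [3]. The actual implementation is in biotea-io-parser-master.zip.

```python
import math
from collections import defaultdict


def group_distribution(annotations):
    """Share of total annotation frequency falling in each semantic group.

    annotations: iterable of (concept_id, group, frequency) tuples
                 (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(float)
    for _concept, group, freq in annotations:
        totals[group] += freq
    overall = sum(totals.values())
    if overall == 0:
        return {}
    return {group: value / overall for group, value in totals.items()}


def concept_vector(annotations, groups=None):
    """Map concept -> frequency, optionally keeping only the selected groups."""
    vector = defaultdict(float)
    for concept, group, freq in annotations:
        if groups is None or group in groups:
            vector[concept] += freq
    return vector


def narrowed_cosine(doc_a, doc_b, groups=None):
    """Cosine similarity over concept frequencies (a stand-in, not the Biolinks metric)."""
    va = concept_vector(doc_a, groups)
    vb = concept_vector(doc_b, groups)
    dot = sum(va[c] * vb[c] for c in va if c in vb)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Illustrative data only; concept identifiers and group labels are placeholders.
doc_a = [("C1", "GENE", 2), ("C2", "DISO", 2)]
doc_b = [("C1", "GENE", 1), ("C3", "DISO", 3)]
gene_only = narrowed_cosine(doc_a, doc_b, groups={"GENE"})
```

Passing groups=None compares the documents on all annotations, while passing a set of group labels restricts the comparison to concepts in those groups, which mirrors the narrowing described above.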

Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures similarity for a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is more relevant to them.

[1] Garcia Castro, L.J., R. Berlanga, and A. Garcia, In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218

[2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt

[3] Lin, J. and W.J. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423

Files (281.1 MB)

Name	MD5	Size
biolinks_groups.tsv	b3426b530b5d64ec990fd24de7eeaf26	5.9 kB
biotea-biolinks-gh-pages.zip	588fc16644bb01ec11d4b826672ebada	19.9 MB
biotea-biolinks-master.zip	7c25dd6f6cebf3ccbd0795820af9ea8c	3.4 MB
biotea-io-parser-master.zip	dafbb2de2087fe882b7676f2693ec39f	163.4 kB
biotea-vis-annotation-master.zip	e1474dd8214c7127c1bf43a91e719d5e	52.0 kB
biotea-vis-biolinks-master.zip	6f81f21e883c21fddd3461c2a2ad1782	240.7 kB
biotea-vis-similarity-master.zip	a9db641b43fe1362edaecba0d1d9cacd	119.9 kB
biotea-vis-tooltip-master.zip	22fcac0c98d58cec2ab4cfe20f613a4d	12.5 kB
biotea-vis-topicDistribution-master.zip	89bbd413c92c086378464227742064ea	38.2 kB
genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv	188424e2b3eab6cebe2ce5d972ae458f	66.4 kB
genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv	5f835e696f342f84a99f9564d487a892	3.1 MB
IG_biolinks_groups.PMID.xlsx	8351f2181d9758a76b1b94bc2a253602	105.1 kB
IG_umls_groups.PMID.xlsx	acfb56d683995d98656dcaae99669a51	81.2 kB
JSON-LD.zip	59362dd9d32b7eca4c25dbb03f6df2f9	11.6 MB
Profiles.zip	5ac7d261c84764140d5944dadda64f4b	2.9 MB
Silhouette for Stem TAFT FT similarity 62.xlsx	d4334d03bd4840116c8a22d95149ea84	556.0 kB
Similarity.zip	324bb76b5252c26d2aa327015e45a7db	238.8 MB
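
The MD5 checksums listed above can be used to confirm that a download is complete and unmodified. A minimal sketch using Python's standard hashlib module follows; the helper names are hypothetical, and the file is read in chunks so that large archives such as Similarity.zip are not loaded into memory at once.

```python
import hashlib


def md5_hex(chunks):
    """MD5 hex digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path, size=1 << 20):
    """Yield a file's content in fixed-size chunks (1 MiB by default)."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                break
            yield chunk


def verify(path, expected_md5):
    """True when the file's MD5 matches the checksum from the file list."""
    return md5_hex(read_chunks(path)) == expected_md5.lower()
```

For example, verify("biolinks_groups.tsv", "b3426b530b5d64ec990fd24de7eeaf26") should return True for an intact copy of that file, assuming it was saved under its original name in the working directory.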
