Published December 19, 2014 | Version v1
Dataset Open

A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA

  • 1. Universitat Jaume I
  • 2. Linking Data LLC

Description

We performed a series of in-silico experiments to determine a semantic similarity metric based on UMLS annotations for PubMed Central Open Access (PMC-OA). This record stores the data used for and obtained from those experiments. We worked with relevant and partially relevant articles from the TREC-2005 Genomics Track Collection, hereafter referred to as the initial collection, comprising a total of 4240 unique PubMed articles. Of those 4240 articles, only 62 had full text publicly available; those 62 articles form the full-text collection.

Our data comprise tab-separated flat files and one Excel sheet. Tab-separated files always include a first row with headings:

  • Stems extracted from title and abstract for articles in the initial collection. Each row contains a stem with its inverse-document-frequency (IDF) within the initial collection. Stems were calculated following the Porter algorithm (available at http://tartarus.org/martin/PorterStemmer/java.txt)
    • stems.TA.tsv
  • Article profiles, i.e., terms (either word stems or UMLS concepts) found in the articles, with term frequency (TF) and IDF. The first two columns correspond to the PubMed Identifier (PMID) and the PubMed Central identifier (PMC). The PMC identifier is set to 0 whenever full text was not available.
    • profiles.TA.tsv: Profiles according to word stems in title and abstract for the initial collection
    • profiles.PMID.tsv: Profiles according to UMLS concepts in title and abstract for the initial collection
    • profiles.PMC_TA.tsv: Profiles according to UMLS concepts in title and abstract for the full-text collection
    • profiles.PMC.tsv: Profiles according to UMLS concepts in the full-text for the full-text collection
  • Similarity matrixes calculated on the article profiles with the PubMed Related Article metric (PMRA), BM25, and Cosine. There are matrixes for terms found in title-and-abstract as well as in full-text. In a similarity matrix, each row corresponds to a reference article (one for which an interest has already been expressed), while the columns correspond to all the other articles for which the similarity was calculated.
    • Matrixes for our initial collection
      • similarity.PMRA.TA.profiles.TA.tsv: Similarity matrix for profiles.TA.tsv following the algorithm PMRA. This matrix is considered the baseline for further analyses
      • similarity.PMRA.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm PMRA
      • similarity.BM25_1.2_0.75.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm BM25 with k=1.2 and b=0.75
      • similarity.COSINE.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm Cosine
    • Matrixes for our full-text collection
      • similarity.PMRA.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm PMRA
      • similarity.PMRA.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm PMRA
      • similarity.BM25.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm BM25 with k=1.2 and b=0.75
      • similarity.BM25.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm BM25 with k=1.2 and b=0.75
      • similarity.COSINE.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm Cosine
      • similarity.COSINE.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm Cosine
  • Correlation matrixes for similarities calculated for title-and-abstract taking as reference the similarity values obtained with PMRA for word stems on title-and-abstract.
    • pearsonCorrelation.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv
    • pearsonCorrelationTopic.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv discriminated by TREC topics
    • pearsonCorrelation.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv
    • pearsonCorrelationTopic.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv discriminated by TREC topics
    • pearsonCorrelation.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv
    • pearsonCorrelationTopic.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv discriminated by TREC topics
  • Precision and recall summaries for the similarities calculated based on title-and-abstract.
    • StatsAllSummary.xlsx: Precision and recall at a global level, i.e., without considering TREC topics. This file includes information for BM25 with multiple values for constants k and b
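
As a rough illustration of how a similarity value can be derived from two article profiles, the sketch below computes a cosine similarity over TF×IDF weights and an Okapi-style BM25 score with k=1.2 and b=0.75, using the reference article's terms as the query. The term identifiers, TF and IDF values, and function names are hypothetical, and the weighting is the textbook formulation; the exact formulas used to build the published matrixes may differ.

```python
import math


def cosine_similarity(profile_a, profile_b, idf):
    """Cosine between two profiles, each a dict {term: TF}, weighted by TF*IDF."""
    weights_a = {t: tf * idf.get(t, 0.0) for t, tf in profile_a.items()}
    weights_b = {t: tf * idf.get(t, 0.0) for t, tf in profile_b.items()}
    dot = sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or zero-weight profile is similar to nothing
    return dot / (norm_a * norm_b)


def bm25_score(reference, candidate, idf, avg_len, k=1.2, b=0.75):
    """Textbook BM25 score of `candidate`, using the terms of `reference` as query."""
    candidate_len = sum(candidate.values())
    # Length normalisation: longer-than-average candidates are penalised.
    length_norm = k * (1.0 - b + b * candidate_len / avg_len)
    score = 0.0
    for term in reference:
        tf = candidate.get(term, 0)
        if tf:
            score += idf.get(term, 0.0) * tf * (k + 1.0) / (tf + length_norm)
    return score


# Hypothetical profiles: term -> TF (terms stand in for stems or UMLS concepts).
reference_article = {"term_a": 2, "term_b": 1}
candidate_article = {"term_a": 1, "term_c": 3}
# Hypothetical IDF values for the collection.
idf_values = {"term_a": 2.0, "term_b": 1.5, "term_c": 0.5}

cosine_value = cosine_similarity(reference_article, candidate_article, idf_values)
bm25_value = bm25_score(reference_article, candidate_article, idf_values, avg_len=4.0)
print(cosine_value, bm25_value)
```

Repeating such a calculation for one reference article against every other article would yield one row of a similarity matrix as described above.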

Visualizations of the correlation matrixes, as well as scatter plots for full-text based similarity, are available at http://ljgarcia.github.io/semsim.benchmark
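
The Pearson correlation behind the correlation files can be illustrated with a minimal sketch. It assumes that, for one reference article, the baseline similarities (PMRA on word stems) and the similarities from another method are available as two equally long lists; the values and the function name below are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # correlation is undefined for a constant sequence
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical similarity values for one reference article against four others.
baseline_row = [0.10, 0.40, 0.35, 0.80]   # e.g. PMRA on word stems
compared_row = [0.12, 0.38, 0.30, 0.75]   # e.g. PMRA on UMLS concepts

print(pearson(baseline_row, compared_row))
```

A value close to 1 would indicate that the compared method ranks and scales articles much like the baseline does.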

Files

Files (1.4 GB)

Checksum Size
md5:ca13a566e70af8a68ef77fe13d9e4db1 126.6 kB
md5:5d24e7005c4dcf5b751b58555faba05c 124.3 kB
md5:6442eedbfe935c652ae1bfc159dd1a9b 119.7 kB
md5:fce957cf6c4240016e4f592cdf66a928 123.5 kB
md5:3c7d2aab9e00ce5a82addef891650004 121.2 kB
md5:9812c2380d4854115735839ca279c8d4 116.7 kB
md5:48b7316b76fa49570b6fa23cd49273c0 378.3 kB
md5:0b2a63b0b6ec6332bbea96681f2470e3 68.7 kB
md5:8fbc4dd20c6ac0104ef488152164700d 5.3 MB
md5:cf9556dd8ba72819615b37b74a6f11cb 7.9 MB
md5:de10ed31ee1f69eb123021186dff00f4 74.7 kB
md5:287397ca6aede462ebed475ac4a339f9 73.0 kB
md5:4f35f8b02b15847bb8cc82cde4d7d066 329.1 MB
md5:70aba8d742dcc4605d909b998164c3d7 72.0 kB
md5:28c47ddb336b7c3c024d6a5561f386c8 72.2 kB
md5:2199edccde935c9723b3fdc6a967c415 336.9 MB
md5:34666087ad30f7dcca20ee60471697ea 75.3 kB
md5:1266bd071e12e16a52c2f64e173c81ca 73.6 kB
md5:f9f910ccda435b3fc3ec86e74313e4ea 346.1 MB
md5:a8c5f46a3e18b8e2b5bb130d774d7e39 345.7 MB
md5:d0291e49454f36cfdc681a73f888e4db 44.4 kB
md5:df0ac4829eecabd5d600a2020ae51b98 275.9 kB
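
The MD5 checksums listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the file path in the usage comment are hypothetical, and since this listing does not pair checksums with file names, the pairing shown in the comment is only a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for the files of several hundred MB.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:ca13a5...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local path; which file carries which checksum is an assumption):
# matches_checksum("downloads/some_file.tsv", "md5:ca13a566e70af8a68ef77fe13d9e4db1")
```

MD5 is adequate here for detecting accidental corruption during download; it is not meant as a security guarantee.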