A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA
- 1. Universitat Jaume I
- 2. Linking Data LLC
Description
We performed a series of in-silico experiments to determine a semantic similarity metric based on UMLS annotations for PubMed Central Open Access. This record stores the data used for and obtained from those experiments. We worked with the relevant and partially relevant articles from the TREC-2005 Genomics Track Collection, hereafter referred to as the initial collection, comprising a total of 4240 unique PubMed articles. Of those 4240 articles, only 62 had publicly available full text; those 62 articles form the full-text collection.
Our data comprises flat files using tabs as separators and one Excel sheet. Tab-separated files always include a first row with headings:
- Stems extracted from title and abstract for articles in the initial collection. Each row contains a stem with its inverse-document-frequency (IDF) within the initial collection. Stems were calculated following the Porter algorithm (available at http://tartarus.org/martin/PorterStemmer/java.txt)
- stems.TA.tsv
- Article profiles, i.e., terms (either word stems or UMLS concepts) found in the articles with term frequency (TF) and IDF. The first two columns correspond to PubMed Identifier (PMID) and PubMed Central identifier (PMC). PMC identifier was set to 0 whenever full-text was not available.
- profiles.TA.tsv: Profiles according to word stems in title and abstract for the initial collection
- profiles.PMID.tsv: Profiles according to UMLS concepts in title and abstract for the initial collection
- profiles.PMC_TA.tsv: Profiles according to UMLS concepts in title and abstract for the full-text collection
- profiles.PMC.tsv: Profiles according to UMLS concepts in the full-text for the full-text collection
- Similarity matrixes calculated on the article profiles with PubMed Related Article metric (PMRA), BM25, and Cosine. There are matrixes for terms found in title-and-abstract as well as in full-text. In a similarity matrix, each row corresponds to a reference article (one for which an interest has already been expressed), while the columns correspond to all the other articles for which the similarity was calculated.
- Matrixes for our initial collection
- similarity.PMRA.TA.profiles.TA.tsv: Similarity matrix for profiles.TA.tsv following the algorithm PMRA. This matrix is considered the baseline for further analyses
- similarity.PMRA.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm PMRA
- similarity.BM25_1.2_0.75.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm BM25 with k=1.2 and b=0.75
- similarity.COSINE.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm Cosine
- Matrixes for our full-text collection
- similarity.PMRA.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm PMRA
- similarity.PMRA.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm PMRA
- similarity.BM25.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm BM25 with k=1.2 and b=0.75
- similarity.BM25.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm BM25 with k=1.2 and b=0.75
- similarity.COSINE.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm Cosine
- similarity.COSINE.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm Cosine
- Correlation matrixes for similarities calculated for title-and-abstract taking as reference the similarity values obtained with PMRA for word stems on title-and-abstract.
- pearsonCorrelation.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv
- pearsonCorrelationTopic.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv discriminated by TREC topics
- pearsonCorrelation.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv
- pearsonCorrelationTopic.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv discriminated by TREC topics
- pearsonCorrelation.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv
- pearsonCorrelationTopic.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv discriminated by TREC topics
- Precision and recall summaries for the similarities calculated based on title-and-abstract.
- StatsAllSummary.xlsx: Precision and recall at a global level, i.e., without considering TREC topics. This file includes information for BM25 with multiple values for constants k and b
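To illustrate how a similarity value can be derived from two article profiles, here is a minimal sketch of Cosine and BM25 over TF and IDF values. It is not the code used to produce the files above: the function names, the toy concept identifiers, and the `avg_len` parameter are assumptions made for illustration, and the BM25 shown is the textbook Okapi form with the k=1.2 and b=0.75 constants mentioned in the file list. PMRA is not sketched here.

```python
import math


def cosine_similarity(profile_a, profile_b, idf):
    """Cosine similarity between two term->TF profiles, weighted by IDF."""
    weights_a = {t: tf * idf.get(t, 0.0) for t, tf in profile_a.items()}
    weights_b = {t: tf * idf.get(t, 0.0) for t, tf in profile_b.items()}
    dot = sum(w * weights_b.get(t, 0.0) for t, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bm25_score(reference, candidate, idf, avg_len, k=1.2, b=0.75):
    """Textbook Okapi BM25: score the candidate against the reference's terms."""
    cand_len = sum(candidate.values())  # candidate length as total term count
    score = 0.0
    for term in reference:
        tf = candidate.get(term, 0)
        if tf == 0:
            continue
        denom = tf + k * (1 - b + b * cand_len / avg_len)
        score += idf.get(term, 0.0) * tf * (k + 1) / denom
    return score


# Toy data: hypothetical identifiers and values, not taken from the dataset.
idf = {"C1": 1.0, "C2": 2.0, "C3": 0.5}
reference = {"C1": 2, "C2": 1}
candidate = {"C2": 1, "C3": 4}

print(cosine_similarity(reference, candidate, idf))
print(bm25_score(reference, candidate, idf, avg_len=4.0))
```

Note that Cosine is symmetric while BM25 is not, which is consistent with a matrix layout where rows are reference articles and columns are the articles compared against them.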
Visualizations of the correlation matrixes, as well as scatter plots for full-text based similarity, are available at http://ljgarcia.github.io/semsim.benchmark
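The correlation files compare each UMLS-based similarity against the PMRA word-stem baseline. As a rough sketch of that kind of comparison, the Pearson coefficient between two rows of similarity values for the same reference article could be computed as follows. The function name and the toy rows are assumptions for illustration, not values or code from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")  # correlation is undefined for a constant row
    return cov / math.sqrt(var_x * var_y)


# Toy rows: similarities of one reference article to four other articles
# under a baseline metric and under an alternative metric (made-up numbers).
baseline_row = [0.10, 0.40, 0.35, 0.80]
alternative_row = [0.05, 0.50, 0.30, 0.90]

print(pearson(baseline_row, alternative_row))
```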
Files
(1.4 GB)
Checksum | Size |
---|---|
md5:ca13a566e70af8a68ef77fe13d9e4db1 | 126.6 kB |
md5:5d24e7005c4dcf5b751b58555faba05c | 124.3 kB |
md5:6442eedbfe935c652ae1bfc159dd1a9b | 119.7 kB |
md5:fce957cf6c4240016e4f592cdf66a928 | 123.5 kB |
md5:3c7d2aab9e00ce5a82addef891650004 | 121.2 kB |
md5:9812c2380d4854115735839ca279c8d4 | 116.7 kB |
md5:48b7316b76fa49570b6fa23cd49273c0 | 378.3 kB |
md5:0b2a63b0b6ec6332bbea96681f2470e3 | 68.7 kB |
md5:8fbc4dd20c6ac0104ef488152164700d | 5.3 MB |
md5:cf9556dd8ba72819615b37b74a6f11cb | 7.9 MB |
md5:de10ed31ee1f69eb123021186dff00f4 | 74.7 kB |
md5:287397ca6aede462ebed475ac4a339f9 | 73.0 kB |
md5:4f35f8b02b15847bb8cc82cde4d7d066 | 329.1 MB |
md5:70aba8d742dcc4605d909b998164c3d7 | 72.0 kB |
md5:28c47ddb336b7c3c024d6a5561f386c8 | 72.2 kB |
md5:2199edccde935c9723b3fdc6a967c415 | 336.9 MB |
md5:34666087ad30f7dcca20ee60471697ea | 75.3 kB |
md5:1266bd071e12e16a52c2f64e173c81ca | 73.6 kB |
md5:f9f910ccda435b3fc3ec86e74313e4ea | 346.1 MB |
md5:a8c5f46a3e18b8e2b5bb130d774d7e39 | 345.7 MB |
md5:d0291e49454f36cfdc681a73f888e4db | 44.4 kB |
md5:df0ac4829eecabd5d600a2020ae51b98 | 275.9 kB |
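Since each file is listed with an md5 checksum, a downloaded copy can be checked against it. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the chunk size are assumptions, and the path and expected value must be supplied by the reader, because the listing here does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use flat, which matters for the larger similarity matrixes of several hundred megabytes.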