Quantifying domain-specific relevance of computational biology Wikipedia articles using TF-IDF and cosine similarity

P Menon, Arya; Davis, Trenton; Hegde, Megha; Näpflin, Nicolas; Anjum, Audra; Barman, Sarah; DeBlasio, Dan; Eranti, Pradeep; Nebel, Jean-Christophe; Sélem-Mojica, Nelly; Shome, Sayane; Welch, Lonnie; Kilpatrick, Alastair; Rahman, Farzana

doi:10.5281/zenodo.18311878

Published January 20, 2026 | Version v1

Dataset Embargoed

1. Kingston University
2. Ohio University
3. University of Zurich
4. Carnegie Mellon University
5. Université Paris Cité
6. Universidad Nacional Autónoma de México
7. Stanford University
8. University of Edinburgh

Wikipedia is one of the world’s most visited websites and serves as the principal open educational resource for computational biology. However, identifying which articles are most relevant to distinct sub‑disciplines of computational biology remains largely subjective.

This study collected short descriptions for 22 Communities of Special Interest (COSI) groups maintained by the International Society for Computational Biology and downloaded 1,536 computational biology articles from English Wikipedia. Following standard text preprocessing, COSI descriptions and Wikipedia articles were embedded in a common TF-IDF vector space. Semantic relatedness was quantified using cosine similarity, yielding a real-valued relevance matrix that maps each COSI to the most pertinent computational biology articles. The resulting scores, typically low in absolute value, captured nuanced differences: general-interest pages such as “Computational biology” and “Bioinformatics” ranked highest, whereas niche pages showed high relevance only for specific COSIs. Unsupervised analysis using principal component analysis, k‑nearest neighbours, and Leiden community detection revealed clusters of articles corresponding to particular COSIs and highlighted inter‑COSI relationships. This automated pipeline reduces bias compared with manual tagging and enables more precise curation of domain‑specific educational resources.
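The scoring step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the record does not specify the preprocessing, the TF-IDF weighting variant, or the tooling, so the tokenizer, the smoothed IDF formula, the function names, and the toy texts below are all assumptions made for illustration. The sketch places group descriptions and articles in one shared TF-IDF space and fills a relevance matrix with cosine similarities.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (an assumed stand-in for the study's preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant); terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: n * idf[t] for t, n in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_matrix(groups, articles):
    """Map each group name to {article title: cosine score} in a shared TF-IDF space."""
    group_names, article_titles = list(groups), list(articles)
    # Fit the weights on descriptions and articles together so both share one vocabulary.
    vectors = tfidf_vectors([groups[g] for g in group_names] +
                            [articles[a] for a in article_titles])
    group_vecs = vectors[:len(group_names)]
    article_vecs = vectors[len(group_names):]
    return {g: {a: cosine(gv, av) for a, av in zip(article_titles, article_vecs)}
            for g, gv in zip(group_names, group_vecs)}


# Toy placeholder texts, invented for this example (not the study's data).
toy_groups = {
    "toy_structure_group": "protein structure prediction and folding",
    "toy_sequence_group": "genome sequence alignment and assembly",
}
toy_articles = {
    "Toy article A": "protein folding determines the three dimensional structure of a protein",
    "Toy article B": "sequence alignment arranges genome sequences to identify similar regions",
}

scores = relevance_matrix(toy_groups, toy_articles)
for group, row in scores.items():
    best = max(row, key=row.get)
    print(group, "->", best)
```

Because TF-IDF weights are non-negative, every score falls between 0 and 1, and a short description sharing only a few terms with a long article will tend to score low in absolute terms. That is consistent with the record's remark that scores are typically low, so rankings within each group are more informative than raw values.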

Files

Embargoed

The files will be made publicly available on July 1, 2026.

Reason: Study under peer review

