Quantifying domain-specific relevance of computational biology Wikipedia articles using TF-IDF and cosine similarity
Description
Wikipedia is one of the world’s most visited websites and serves as the principal open educational resource for computational biology. However, identifying which articles are most relevant to distinct sub‑disciplines of computational biology remains largely subjective.
This study collected short descriptions of the 22 Communities of Special Interest (COSIs) maintained by the International Society for Computational Biology and downloaded 1,536 computational biology articles from English Wikipedia. Following standard text preprocessing, COSI descriptions and Wikipedia articles were embedded in a common TF-IDF vector space. Semantic relatedness was quantified using cosine similarity, yielding a real-valued relevance matrix that maps each COSI to its most pertinent computational biology articles. The resulting scores, although typically low in absolute value, captured nuanced differences: general-interest pages such as “Computational biology” and “Bioinformatics” ranked highest, whereas niche pages showed high relevance only for specific COSIs. Unsupervised analysis using principal component analysis, k‑nearest neighbours, and Leiden community detection revealed clusters of articles corresponding to particular COSIs and highlighted inter‑COSI relationships. This automated pipeline reduces bias compared with manual tagging and enables more precise curation of domain‑specific educational resources.
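The scoring step described above can be illustrated with a minimal sketch. It is not the study's implementation: the tokenizer, the smoothed IDF weighting, the function names, and the toy texts below are all assumptions made for illustration, since the description does not state the preprocessing or weighting details. The sketch embeds COSI descriptions and articles in one shared TF-IDF space and fills a relevance matrix with cosine similarities.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for the preprocessing step)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant; the study's weighting is not stated).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance_matrix(cosi_descriptions, articles):
    """Map each COSI name to {article title: cosine similarity}."""
    names, titles = list(cosi_descriptions), list(articles)
    # Fit one vocabulary over both collections so all vectors share a space.
    vectors = tfidf_vectors(
        [cosi_descriptions[n] for n in names] + [articles[t] for t in titles]
    )
    cosi_vecs, article_vecs = vectors[: len(names)], vectors[len(names):]
    return {
        name: {title: cosine(cv, av) for title, av in zip(titles, article_vecs)}
        for name, cv in zip(names, cosi_vecs)
    }


# Hypothetical toy inputs, not data from the study.
cosis = {
    "Toy COSI A": "protein structure prediction and folding",
    "Toy COSI B": "gene regulatory networks and pathway analysis",
}
articles = {
    "Toy article 1": "Protein structure prediction infers the folding of a protein from its sequence",
    "Toy article 2": "A gene regulatory network describes how genes interact in a pathway",
}

scores = relevance_matrix(cosis, articles)
# Article titles per COSI, ordered from most to least relevant.
ranked = {name: sorted(row, key=row.get, reverse=True) for name, row in scores.items()}
```

In this toy case each COSI ranks the article that shares its vocabulary first, and an article with no shared terms scores exactly zero. With real article text the vectors are long and sparse, which is consistent with the low absolute scores noted above.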