Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study
Description
General-purpose text embedding models are trained on broad web-scale corpora, encoding semantic
variation across all human knowledge. When applied to retrieval in a narrow domain, most embedding
dimensions carry cross-domain noise irrelevant to the task. We investigate whether applying Principal
Component Analysis (PCA) to a domain-specific corpus, fitting the projection on document embeddings
alone, recovers a subspace that improves retrieval performance. Using OpenAI text-embedding-3-small
(1536 dimensions) over a 20-topic medical corpus (300 documents, 20 queries), we find that PCA-32
(32 retained components) with corpus-only fitting achieves a mean average precision (MAP) of 0.9203
versus a baseline of 0.8750 (+5.2% relative), while also increasing the similarity gap 2.5× and
reducing storage 48× (1536 to 32 dimensions). Through five controlled experiments, we show
that domain-directed axes are essential (random projection fails), corpus-only PCA fitting outperforms
fitting on queries and corpus jointly, and PCA gain increases rather than decreases as corpus diversity
grows. Our findings suggest a simple, fine-tuning-free strategy for improving domain-specific retrieval
on top of any pre-trained embedding model.
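
The corpus-only fitting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the linked repository: the function names are hypothetical, the embeddings are synthetic Gaussian stand-ins rather than text-embedding-3-small vectors, and the Gaussian random projection is only one assumed way to build the random-projection comparison. The numbers it prints therefore say nothing about the results reported in the paper.

```python
import numpy as np

def fit_pca(corpus_emb, k):
    """Fit PCA on corpus embeddings only; returns (mean, components of shape (k, d))."""
    mean = corpus_emb.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(corpus_emb - mean, full_matrices=False)
    return mean, vt[:k]

def project(emb, mean, components):
    """Center with the corpus mean, then project onto the retained axes."""
    return (emb - mean) @ components.T

def cosine_scores(queries, docs):
    """Cosine similarity between every query and every document."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T

def mean_average_precision(scores, relevant):
    """scores: (n_queries, n_docs) floats; relevant: boolean array of the same shape."""
    aps = []
    for s, rel in zip(scores, relevant):
        hits = rel[np.argsort(-s)]          # relevance flags in ranked order
        if hits.sum() == 0:
            continue                        # skip queries with no relevant documents
        precision_at_hit = np.cumsum(hits)[hits] / (np.flatnonzero(hits) + 1)
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps))

# Synthetic stand-in data (assumption): 20 topics x 15 documents, one query per topic.
rng = np.random.default_rng(0)
n_topics, docs_per_topic, dim, k = 20, 15, 1536, 32
centers = rng.normal(size=(n_topics, dim))
doc_topic = np.repeat(np.arange(n_topics), docs_per_topic)
docs = centers[doc_topic] + 3.0 * rng.normal(size=(len(doc_topic), dim))
queries = centers + 3.0 * rng.normal(size=(n_topics, dim))
relevant = doc_topic[None, :] == np.arange(n_topics)[:, None]

# Corpus-only PCA: the projection never sees the queries during fitting.
mean, comps = fit_pca(docs, k)
docs_k, queries_k = project(docs, mean, comps), project(queries, mean, comps)

# Random-projection comparison (assumed Gaussian matrix of the same output size).
rand = rng.normal(size=(dim, k))

map_full = mean_average_precision(cosine_scores(queries, docs), relevant)
map_pca = mean_average_precision(cosine_scores(queries_k, docs_k), relevant)
map_rand = mean_average_precision(cosine_scores(queries @ rand, docs @ rand), relevant)
print(f"baseline={map_full:.4f}  PCA-{k}={map_pca:.4f}  random-{k}={map_rand:.4f}")
print(f"storage reduction: {dim // k}x")
```

To try this on real data, replace `docs` and `queries` with arrays of actual embeddings and `relevant` with the corresponding relevance judgments; the fitting and projection steps stay the same.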
Files
paper.pdf (551.3 kB)
md5:847331d884f5abfda85e44f3d191624a
Additional details
Software
- Repository URL: https://github.com/cloudxlab/pca_embeddings
- Programming language: Python
- Development status: Active