Published May 21, 2026 | Version v1
Preprint Open

Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study

  • 1. Cloudxlab
  • 2. Terno AI

Description

General-purpose text embedding models are trained on broad web-scale corpora, encoding semantic
variation across all human knowledge. When applied to retrieval in a narrow domain, most embedding
dimensions carry cross-domain noise irrelevant to the task. We investigate whether applying Principal
Component Analysis (PCA) to a domain-specific corpus, fitting the projection on document embeddings
alone, recovers a subspace that improves retrieval performance. Using OpenAI text-embedding-3-small
(1536 dimensions) over a 20-topic medical corpus (300 documents, 20 queries), we find that PCA to 32
dimensions (PCA-32) with corpus-only fitting achieves a MAP of 0.9203 versus a baseline of 0.8750
(a 5.2% relative gain), while also increasing the similarity gap 2.5× and reducing storage 48×
(1536 to 32 dimensions). Through five controlled experiments, we show that domain-directed axes are
essential (random projection fails), that corpus-only PCA fitting outperforms fitting on queries and
corpus jointly, and that the PCA gain increases rather than decreases as corpus diversity grows. Our
findings suggest a simple, fine-tuning-free strategy for improving domain-specific retrieval on top
of any pre-trained embedding model.
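
The corpus-only fitting described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked repository: the function names are hypothetical, random vectors stand in for real text-embedding-3-small output, and mean-centering plus re-normalizing before cosine similarity are assumptions that the abstract does not confirm.

```python
import numpy as np

def fit_corpus_pca(doc_emb, k):
    """Fit PCA on document embeddings only; return (mean, top-k axes)."""
    mean = doc_emb.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(doc_emb - mean, full_matrices=False)
    return mean, vt[:k]

def project(emb, mean, components):
    """Center with the corpus mean, project, then L2-normalize each row."""
    z = (emb - mean) @ components.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def rank_documents(query_vecs, doc_vecs):
    """Return document indices per query, sorted by descending cosine similarity."""
    sims = query_vecs @ doc_vecs.T  # rows are unit-length, so this is cosine
    return np.argsort(-sims, axis=1)

def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Placeholder data with the shapes quoted in the abstract (assumption:
# real embeddings would come from the embedding model, not a random generator).
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(300, 1536))
query_emb = rng.normal(size=(20, 1536))

k = 32
mean, components = fit_corpus_pca(doc_emb, k)   # queries are never seen here
doc_low = project(doc_emb, mean, components)
query_low = project(query_emb, mean, components)  # reuse the corpus-fitted axes
ranking = rank_documents(query_low, doc_low)

# Per-vector storage shrinks by the ratio of dimensions.
storage_ratio = doc_emb.shape[1] // k
```

MAP would then be the mean of `average_precision` over all queries, given each query's set of relevant document ids. The point of the sketch is that `fit_corpus_pca` receives only `doc_emb`; the queries are projected afterwards with the same mean and axes rather than contributing to the fit.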

Files

paper.pdf (551.3 kB)
md5:847331d884f5abfda85e44f3d191624a

Additional details

Software

Repository URL: https://github.com/cloudxlab/pca_embeddings
Programming language: Python
Development status: Active