Impact of Retrieval Store Size Relative to Pretraining Corpus on Multimodal LATEX Benchmark Performance Under Fixed Data Budget
Description
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM
Research goal: What is the impact of retrieval store size relative to pretraining corpus size on the performance of multimodal models on the LATEX benchmarks under a fixed data budget?
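As a minimal sketch of what a fixed data budget means here: each token is assigned either to the pretraining corpus or to the retrieval store, so growing one shrinks the other. The function name `budget_splits`, the fractions, and the 100B-token total below are illustrative assumptions, not values or code taken from the report.

```python
def budget_splits(total_tokens, retrieval_fractions):
    """Split a fixed token budget between pretraining and retrieval.

    Hypothetical helper for illustration only. Each fraction is the share
    of the total budget assigned to the retrieval store; the remainder
    goes to the pretraining corpus.
    """
    splits = []
    for frac in retrieval_fractions:
        if not 0.0 <= frac < 1.0:
            raise ValueError("retrieval fraction must be in [0, 1)")
        retrieval_tokens = int(round(total_tokens * frac))
        pretrain_tokens = total_tokens - retrieval_tokens
        splits.append(
            {
                "retrieval_fraction": frac,
                "pretrain_tokens": pretrain_tokens,
                "retrieval_tokens": retrieval_tokens,
                # Retrieval store size relative to pretraining corpus size.
                "store_to_pretrain_ratio": retrieval_tokens / pretrain_tokens,
            }
        )
    return splits


if __name__ == "__main__":
    # Assumed total budget of 100B tokens, chosen only for illustration.
    for split in budget_splits(100_000_000_000, [0.0, 0.1, 0.25, 0.5]):
        print(split)
```

Sweeping the retrieval fraction while holding the total constant gives the axis along which benchmark performance would be compared.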
Autonomous synthesis report generated by Assignee Research. Tribunal consensus score: 8.5/10.
Notes
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 94.8 kB | md5:74d46da4b1dd3eae2316c3fc19de7595 |