Tri-REx 1.0
Description
Instead of extracting text from Wikipedia, Tri-REx synthesizes short subject-predicate-object sentences with Mistral 7B using few-shot prompting. For example, the triple (Albert Einstein, facial hair, walrus moustache) might generate "Dr. Albert Einstein wore a bushy walrus moustache." Each generated sentence is automatically filtered for coherence, correct mention of both the subject and the object, and accurate preservation of the S-P-O relationship, yielding high-quality synthetic data; a small sketch of the mention check follows.
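The mention check can be pictured with a minimal sketch. The function below is an assumption about the shape of such a filter, not the authors' implementation, and it covers only the subject/object mention test (the coherence and relation-preservation checks would need a model in the loop).

```python
# Minimal sketch of the mention filter described above (assumed, not the
# authors' code): keep a generated sentence only if it mentions both the
# subject and the object of the source triple, case-insensitively.
def passes_mention_filter(sentence: str, subject: str, obj: str) -> bool:
    text = sentence.lower()
    return subject.lower() in text and obj.lower() in text

# Example with the triple from the description:
sentence = "Dr. Albert Einstein wore a bushy walrus moustache."
print(passes_mention_filter(sentence, "Albert Einstein", "walrus moustache"))  # True
```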
Tri-REx comprises 21.5 million training sentences, 0.9 million test sentences, and 1.7 million validation sentences, each of which is typically under 30 tokens. This collection stands out because it is intentionally free of pretraining overlap: models cannot simply rely on memorized Wikipedia text. Instead, they must learn or leverage newly provided knowledge sources to recover the correct object tokens during next-token prediction. Researchers can thus verify whether a RAG technique or knowledge-injection approach genuinely conveys facts to a model, rather than merely triggering recall of memorized text.
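The object-recovery test can be made concrete: truncate a Tri-REx sentence just before the object and check whether a causal language model greedily decodes the object tokens. The sketch below illustrates this idea only; the model choice (gpt2) and the function name are illustrative assumptions, not part of the dataset's tooling.

```python
# Hedged sketch of object-token recovery via next-token prediction.
# Assumes a Hugging Face causal LM; gpt2 is a stand-in model choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def recovers_object(prefix: str, obj: str) -> bool:
    """Greedily decode as many tokens as the object has and check
    whether the continuation reproduces the object string."""
    obj_ids = tok(" " + obj, return_tensors="pt").input_ids
    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=obj_ids.shape[1],
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    continuation = tok.decode(out[0, inputs.input_ids.shape[1]:])
    return continuation.strip().startswith(obj)

# A model that has not seen the fact should usually fail this check:
print(recovers_object("Dr. Albert Einstein wore a bushy", "walrus moustache"))
```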
Files (8.1 GB)
Name | Size
---|---
md5:a2907b166454f80a194285eb2ad99fc3 | 8.0 GB
md5:30ef2828f869c2fa44c6745ab4f795b9 | 44.5 MB