Tri-REx 1.0
Description
Instead of extracting text from Wikipedia, Tri-REx synthesizes short subject-predicate-object sentences with Mistral 7B using few-shot prompting. For example, the triple (Albert Einstein, facial hair, walrus moustache) might generate "Dr. Albert Einstein wore a bushy walrus moustache." Each generated sentence is automatically filtered for coherence, correct mention of both the subject and the object, and accurate preservation of the S-P-O relationship, yielding high-quality synthetic data; a small sketch of the mention check follows.
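The mention check can be pictured with a minimal sketch. The function below is an assumption about the shape of such a filter, not the authors' implementation, and it covers only the subject/object mention test (the coherence and relation-preservation checks would need a model in the loop).

```python
# Minimal sketch of the mention filter described above (assumed, not the
# authors' code): keep a generated sentence only if it mentions both the
# subject and the object of the source triple, case-insensitively.
def passes_mention_filter(sentence: str, subject: str, obj: str) -> bool:
    text = sentence.lower()
    return subject.lower() in text and obj.lower() in text

# Example with the triple from the description:
sentence = "Dr. Albert Einstein wore a bushy walrus moustache."
print(passes_mention_filter(sentence, "Albert Einstein", "walrus moustache"))  # True
```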
Tri-REx comprises 21.5 million training sentences, 0.9 million test sentences, and 1.7 million validation sentences, each of which is typically under 30 tokens. This collection stands out because it is intentionally free of pretraining overlap: models cannot simply rely on memorized Wikipedia text. Instead, they must learn or leverage newly provided knowledge sources to recover the correct object tokens during next-token prediction. Researchers can thus verify whether a RAG technique or knowledge-injection approach genuinely conveys facts to a model, rather than merely triggering recall of memorized text.
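The object-recovery test can be made concrete: truncate a Tri-REx sentence just before the object and check whether a causal language model greedily decodes the object tokens. The sketch below illustrates this idea only; the model choice (gpt2) and the function name are illustrative assumptions, not part of the dataset's tooling.

```python
# Hedged sketch of object-token recovery via next-token prediction.
# Assumes a Hugging Face causal LM; gpt2 is a stand-in model choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def recovers_object(prefix: str, obj: str) -> bool:
    """Greedily decode as many tokens as the object has and check
    whether the continuation reproduces the object string."""
    obj_ids = tok(" " + obj, return_tensors="pt").input_ids
    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=obj_ids.shape[1],
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )
    continuation = tok.decode(out[0, inputs.input_ids.shape[1]:])
    return continuation.strip().startswith(obj)

# A model that has not seen the fact should usually fail this check:
print(recovers_object("Dr. Albert Einstein wore a bushy", "walrus moustache"))
```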
Files (8.1 GB)
Name | Size
---|---
md5:a2907b166454f80a194285eb2ad99fc3 | 8.0 GB
md5:30ef2828f869c2fa44c6745ab4f795b9 | 44.5 MB