Can you Predict the Data? A Workshop on Reproducibility for Language Modelling

Saleem, Qasid; Brandizzi, Nicolo'; Janz, Alicia; Sandfeld, Stefan; Leveling, Johannes

doi:10.5281/zenodo.19705473

Published April 23, 2026 | Version v1

Poster | Open

1. Fraunhofer Institute for Intelligent Analysis and Information Systems
2. University of Cologne
3. Forschungszentrum Jülich, Institute for Advanced Simulation
4. Fraunhofer IAIS

The poster “Can you Predict the Data? A Workshop on Reproducibility for Language Modelling” presents an interactive challenge that addresses reproducibility issues in foundation model training, focusing on the often-overlooked role of training data quality and filtering methodologies. It was developed by researchers from the Rhine-Ruhr Center for Scientific Data Literacy.

The central motivation is that public web-scale datasets such as Common Crawl contain substantial noise, which can significantly limit model performance. While much attention is typically given to model architectures and training strategies, the poster argues that undocumented or poorly reported data curation pipelines are a major source of irreproducibility in AI research. To make this issue tangible, the workshop introduces a hands-on challenge in which participants work with a 462 GB corpus containing 132 million documents and attempt to design effective filtering strategies that improve downstream benchmark performance.
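
To make the idea of a filtering strategy concrete, the following is a minimal sketch of a heuristic document filter in plain Python. It is not the workshop's pipeline and does not use the DataTrove API; the class `FilterConfig`, the function `keep_document`, and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterConfig:
    """Hypothetical thresholds; none of these values come from the poster."""
    min_words: int = 50
    max_symbol_ratio: float = 0.1
    max_duplicate_line_ratio: float = 0.3


def keep_document(text: str, cfg: FilterConfig) -> bool:
    """Return True if a document passes three simple quality heuristics."""
    if not text:
        return False

    # Heuristic 1: drop very short documents.
    if len(text.split()) < cfg.min_words:
        return False

    # Heuristic 2: drop documents dominated by non-alphanumeric symbols.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > cfg.max_symbol_ratio:
        return False

    # Heuristic 3: drop documents that mostly repeat the same lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        duplicate_ratio = 1 - len(set(lines)) / len(lines)
        if duplicate_ratio > cfg.max_duplicate_line_ratio:
            return False

    return True
```

Keeping the thresholds in a single frozen configuration object means the exact filter settings can be reported alongside the results, which is the kind of documentation the poster argues is often missing.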

The workflow mirrors a realistic large-scale language model pipeline: participants configure filters, process data, index and tokenize text, train models on HPC systems, evaluate performance on benchmarks such as HellaSwag, and document their methodology. The technical setup uses JUSUF for CPU-intensive preprocessing tasks, JURECA with NVIDIA H100 GPUs for training, SLURM for workload management, and software frameworks such as Modalities and DataTrove.
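
The final workflow step, documenting the methodology, could be supported by a small machine-readable run record. The sketch below is an assumption about how such a record might look, not something taken from the poster or from Modalities or DataTrove: the function names, the record fields, and the example filter settings are all hypothetical. It fingerprints a filter configuration with SHA-256 so that a filtered dataset can be tied to the exact settings that produced it.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a configuration in a key-order-independent way."""
    # Canonical JSON: sorted keys and fixed separators give a stable string.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_run_record(config: dict, docs_in: int, docs_kept: int) -> dict:
    """Bundle the configuration, its fingerprint, and basic filter statistics."""
    return {
        "config": config,
        "config_sha256": config_fingerprint(config),
        "docs_in": docs_in,
        "docs_kept": docs_kept,
        "keep_rate": docs_kept / docs_in if docs_in else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical filter settings; docs_kept is an invented example value.
    example_config = {"min_words": 50, "max_symbol_ratio": 0.1}
    record = make_run_record(example_config, docs_in=132_000_000, docs_kept=66_000_000)
    print(json.dumps(record, indent=2))
```

Storing such a record next to the tokenized data and the benchmark scores would let another group check whether they reran the same filter configuration before comparing results.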

Files

Can you Predict the Data - A Workshop on Reproducibility for Language Modelling.pdf (587.1 kB, md5:83f0a71316bc51b09926fc2fd86a3633)
