Published April 23, 2026 | Version v1
Poster Open

Can you Predict the Data? A Workshop on Reproducibility for Language Modelling

  • 1. Fraunhofer Institute for Intelligent Analysis and Information Systems
  • 2. University of Cologne
  • 3. Forschungszentrum Jülich, Institute for Advanced Simulation
  • 4. Fraunhofer IAIS

Description

The poster “Can You Predict the Data? A Workshop on Reproducibility for Language Modelling” presents an interactive challenge designed to address reproducibility issues in foundation model training, with a particular focus on the often-overlooked role of training data quality and filtering methodologies. It was developed by researchers from the Rhine-Ruhr Center for Scientific Data Literacy.

The central motivation is that public web-scale datasets such as Common Crawl contain substantial noise, which can significantly limit model performance. While much attention is typically given to model architectures and training strategies, the poster argues that undocumented or poorly reported data curation pipelines are a major source of irreproducibility in AI research. To make this issue tangible, the workshop introduces a hands-on challenge in which participants work with a 462 GB corpus containing 132 million documents and attempt to design effective filtering strategies that improve downstream benchmark performance.
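To give a sense of what a filtering strategy might look like in practice, the following is a minimal sketch in plain Python. It is not the workshop's pipeline and does not use DataTrove's own filter classes. The class FilterConfig, the three heuristics, and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterConfig:
    """Hypothetical thresholds, chosen only for illustration."""
    min_words: int = 50
    max_symbol_ratio: float = 0.1
    max_duplicate_line_ratio: float = 0.3


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def duplicate_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0
    return 1.0 - len(set(lines)) / len(lines)


def keep_document(text: str, cfg: FilterConfig = FilterConfig()) -> bool:
    """Return True if the document passes all three heuristic checks."""
    if len(text.split()) < cfg.min_words:
        return False  # too short to carry much content
    if symbol_ratio(text) > cfg.max_symbol_ratio:
        return False  # dominated by punctuation or markup residue
    if duplicate_line_ratio(text) > cfg.max_duplicate_line_ratio:
        return False  # mostly repeated lines, e.g. navigation menus
    return True
```

Keeping every threshold in a single configuration object means the exact filter settings can be reported alongside the benchmark results, which is the kind of documentation the poster argues is often missing.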

The workflow mirrors a realistic large-scale language model pipeline: participants configure filters, process data, index and tokenize text, train models on HPC systems, evaluate performance on benchmarks such as HellaSwag, and document their methodology. The technical setup uses JUSUF for CPU-intensive preprocessing tasks, JURECA with NVIDIA H100 GPUs for training, SLURM for workload management, and software frameworks such as Modalities and DataTrove.
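The final step of this workflow, documenting the methodology, could be supported by a small run record such as the sketch below. It uses only the Python standard library. The function names config_fingerprint and run_record and the record fields are hypothetical and are not part of Modalities, DataTrove, or the workshop material.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 hash of a canonical JSON form of the config.

    Keys are sorted so that the same settings always yield the same
    fingerprint, regardless of the order in which they were written.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(config: dict, docs_in: int, docs_kept: int) -> dict:
    """Bundle the filter settings with basic counts for later reporting."""
    return {
        "config": config,
        "config_sha256": config_fingerprint(config),
        "docs_in": docs_in,
        "docs_kept": docs_kept,
        "kept_fraction": docs_kept / docs_in if docs_in else 0.0,
    }
```

Storing such a record next to each trained model makes it possible to state which filter settings produced which benchmark score, and to check whether two runs really used identical settings.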

Files

Can you Predict the Data - A Workshop on Reproducibility for Language Modelling.pdf