Published January 9, 2025 | Version v1
Presentation Open

Croissant: Metadata for Machine Learning Systems

  • 1. Royal Netherlands Academy of Arts and Sciences
  • 2. Universitat Oberta de Catalunya
  • 3. Universitat Autònoma de Barcelona

Description

Data is vital for machine learning (ML), yet managing it remains a challenge. We present Croissant, a metadata format that standardizes dataset representation across ML tools, frameworks, and platforms. Croissant improves dataset discoverability, portability, and interoperability, and it already supports hundreds of thousands of datasets in popular repositories. It integrates with widely used ML frameworks regardless of where the data is stored. In human evaluations, Croissant metadata was judged readable, concise, and complete.
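To make the idea of a standardized dataset description more concrete, the following is a minimal sketch of what a Croissant-style JSON-LD record for a single CSV file might look like, built with the Python standard library only. It is a simplified illustration, not a validated Croissant document: the helper `build_croissant_like_metadata`, the dataset name, the column names, and the `example.org` URL are all hypothetical, and the context and property set are reduced well below what the specification requires.

```python
import json


def build_croissant_like_metadata(name, description, csv_url, columns):
    """Return a simplified, Croissant-style JSON-LD dict for one CSV file.

    `columns` maps column names to data type strings (e.g. "sc:Text").
    This is an illustrative subset, not a spec-complete document.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "dataType": "cr:dataType",
            "source": "cr:source",
            "fileObject": "cr:fileObject",
            "extract": "cr:extract",
            "column": "cr:column",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resource layer: where the raw file lives and what format it has.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure layer: how columns of the file map to typed fields.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "rows",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"rows/{col}",
                        "name": col,
                        "dataType": dtype,
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": col},
                        },
                    }
                    for col, dtype in columns.items()
                ],
            }
        ],
    }


# Hypothetical dataset used only for illustration.
example = build_croissant_like_metadata(
    name="toy-sentiment",
    description="Placeholder dataset with one text column and one label column.",
    csv_url="https://example.org/toy-sentiment/data.csv",
    columns={"text": "sc:Text", "label": "sc:Integer"},
)
print(json.dumps(example, indent=2))
```

Because the record is plain JSON-LD layered on schema.org vocabulary, a repository could in principle embed it in a dataset landing page, and a loader could read the same record to locate the file and interpret its columns. A real record should be checked against the Croissant specification and its validator.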

The vision is a shared Data Lake enabling federated search across platforms like Dataverse, Kaggle, and Hugging Face. A centralized approach focuses on standardization and repository-level harmonization, while a distributed approach advocates agile, Linked Data-based solutions that empower diverse communities to integrate within a Distributed Data Network using Croissant ML and AI technologies.

Files

Croissant ML at Dagstuhl 2024.pptx.pdf (9.6 MB)
md5:76a43fa766af1cfcf307284349ac389e