There is a newer version of the record available.

Published December 6, 2024 | Version v0.4.0
Software Open

DataTrove: large scale data processing

Description

DataTrove is a library to process, filter, and deduplicate text data at very large scale. It provides a set of prebuilt, commonly used processing blocks together with a framework for adding custom functionality. DataTrove processing pipelines are platform-agnostic and run out of the box locally or on a Slurm cluster. Its (relatively) low memory usage and multi-step design make it well suited to large workloads, such as processing an LLM's training data.
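
To make the idea of chained processing blocks concrete, here is a minimal, self-contained sketch in plain Python. It is not DataTrove's API: the names `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical and exist only for this illustration. It assumes each block is a function that consumes a stream of documents and yields the ones it keeps, so documents flow through the steps one at a time.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# Hypothetical type for illustration: a block maps a document stream to a document stream.
Block = Callable[[Iterable[str]], Iterator[str]]


def min_length_filter(min_chars: int) -> Block:
    """Build a block that drops documents shorter than min_chars characters."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for text in docs:
            if len(text) >= min_chars:
                yield text
    return block


def exact_dedup() -> Block:
    """Build a block that drops documents whose normalized text was already seen."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        seen = set()
        for text in docs:
            # Normalize lightly (trim, lowercase) before hashing; only digests are stored.
            digest = hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield text
    return block


def run_pipeline(docs: Iterable[str], blocks: List[Block]) -> List[str]:
    """Chain the blocks lazily, then materialize the surviving documents."""
    stream: Iterable[str] = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)


docs = [
    "Hello world, this is a document.",
    "too short",
    "hello world, this is a document.  ",
    "Another sufficiently long document.",
]
result = run_pipeline(docs, [min_length_filter(20), exact_dedup()])
```

Because every block is a generator, the documents themselves are streamed rather than held in memory between steps. The deduplication step still keeps one digest per unique document, so its memory use grows with the number of distinct texts.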

Notes

If you use this software, please cite it using the metadata from this file.

Files (17.3 MB)

Name: huggingface/datatrove-v0.4.0.zip
Size: 17.3 MB
Checksum: md5:a4d3d17b3e67c8a09591ed117be4f4a5

Additional details

Related works