Published December 6, 2024 | Version v0.4.0
Software
Open
DataTrove: large scale data processing
Authors/Creators
Description
DataTrove is a library for processing, filtering, and deduplicating text data at very large scale. It provides a set of prebuilt, commonly used processing blocks, along with a framework for adding custom functionality. DataTrove processing pipelines are platform-agnostic and run out of the box either locally or on a Slurm cluster. Its (relatively) low memory usage and multi-step design make it well suited to large workloads, such as processing an LLM's training data.
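To make the idea of a pipeline built from composable blocks concrete, here is a minimal standard-library sketch. It does not use DataTrove's API: the names `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical, and the exact-hash deduplication is a simple stand-in chosen for illustration, not a description of how DataTrove deduplicates.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# In this sketch a "block" is any callable that maps a stream of documents
# to another stream of documents. This is an assumption for illustration,
# not DataTrove's actual block interface.
Block = Callable[[Iterable[str]], Iterator[str]]


def min_length_filter(min_chars: int) -> Block:
    """Hypothetical filter block: keep documents with at least min_chars characters."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc) >= min_chars:
                yield doc
    return block


def exact_dedup() -> Block:
    """Hypothetical dedup block: drop documents whose normalized text was already seen."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        seen = set()  # stores one short digest per unique document, not the text itself
        for doc in docs:
            digest = hashlib.sha1(doc.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return block


def run_pipeline(docs: Iterable[str], blocks: List[Block]) -> List[str]:
    """Chain the blocks as generators so documents stream through one at a time."""
    stream = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)


corpus = [
    "Hello world, this is a document.",
    "too short",
    "hello world, this is a document.  ",  # duplicate after normalization
    "Another sufficiently long document.",
]
cleaned = run_pipeline(corpus, [min_length_filter(20), exact_dedup()])
print(cleaned)
```

Because each block only consumes and yields a stream, new steps can be added to the list without changing the others, which is the property the description attributes to custom processing blocks.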
Notes
Files
| Name | Checksum | Size |
|---|---|---|
| huggingface/datatrove-v0.4.0.zip | md5:a4d3d17b3e67c8a09591ed117be4f4a5 | 17.3 MB |
Additional details
Related works
- Is supplement to
  - Software: https://github.com/huggingface/datatrove/tree/v0.4.0 (URL)
Software
- Repository URL
  - https://github.com/huggingface/datatrove