huggingface/datasets: 2.8.0

Quentin Lhoest; Albert Villanova del Moral; Patrick von Platen; Thomas Wolf; Mario Šaško; Yacine Jernite; Abhishek Thakur; Lewis Tunstall; Suraj Patil; Mariama Drame; Julien Chaumond; Julien Plu; Joe Davison; Simon Brandeis; Victor Sanh; Teven Le Scao; Kevin Canwen Xu; Nicolas Patry; Steven Liu; Angelina McMillan-Major; Philipp Schmid; Sylvain Gugger; Nathan Raw; Sylvain Lesage; Anton Lozhkov; Matthew Carrigan; Théo Matussière; Leandro von Werra; Lysandre Debut; Stas Bekman; Clément Delangue

doi:10.5281/zenodo.7457269

Published December 19, 2022 | Version 2.8.0

1. Hugging Face

Important

Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
- From now on, datasets pushed to the Hub that use ClassLabel store their feature types with a new YAML model.
- The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
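
The key change described above can be pictured with a small sketch. It is not the library's code: the helper name `stringify_label_ids` and the label names are hypothetical, and the snippet only illustrates turning integer ids into string ids in a label name mapping.

```python
def stringify_label_ids(names_by_id):
    """Return a copy of an id -> name mapping with every id converted to a string."""
    return {str(label_id): name for label_id, name in names_by_id.items()}


# Hypothetical label names, keyed by integer ids as in the old model.
old_style = {0: "negative", 1: "positive"}

# The same mapping with string ids, as in the new model (0 -> "0").
new_style = stringify_label_ids(old_style)
print(new_style)  # {'0': 'negative', '1': 'positive'}
```

The label names themselves are unchanged; only the type of the keys differs.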

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
- Datasets in streaming mode now update their features after column renaming or removal
Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239
- Use multiprocessing to load multiple files in parallel
Add features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
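
As a rough illustration of what requesting a fixed number of shards implies, the sketch below splits a row count into contiguous, near-equal ranges. This is an assumed scheme for illustration only, not necessarily how the library partitions rows, and the function name `contiguous_shard_bounds` is hypothetical.

```python
def contiguous_shard_bounds(num_rows, num_shards):
    """Split range(num_rows) into num_shards contiguous (start, end) ranges.

    The first (num_rows % num_shards) shards receive one extra row, so shard
    sizes differ by at most one.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(num_rows, num_shards)
    bounds = []
    start = 0
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


bounds = contiguous_shard_bounds(10, 4)
print(bounds)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because each range is independent of the others, a layout like this lends itself to writing shards in separate worker processes, which is the role num_proc plays in the note above.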
Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252
Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset instead of downloading it in full
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

Docs

Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248

General improvements and bug fixes

typo by @WrRan in https://github.com/huggingface/datasets/pull/5253
typo by @WrRan in https://github.com/huggingface/datasets/pull/5254
remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257
fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256
Fix max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266
Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250
Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279
Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210
Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285
Use correct dataset type in from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294
Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297
Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299
Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065
Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211
Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310
Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313
Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319
[Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320
Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321
- This was affecting datasets like wikipedia or natural_questions
Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328
Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318
fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333
Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329
Close stream in ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
Use same num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
Set IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234
Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340
Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334
Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341
Support topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350
Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349
Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344
Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355
Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322
Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356
ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366
Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373
Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375

New Contributors

@WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253
@eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318
@vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234
@Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0

Files

huggingface/datasets-2.8.0.zip (2.3 MB, md5:0253251f70971ffa645125faabcca3ec)

Additional details

Is supplement to: https://github.com/huggingface/datasets/tree/2.8.0
