Published December 19, 2022
| Version 2.8.0
Software
Open
huggingface/datasets: 2.8.0
Creators
- Quentin Lhoest (1)
- Albert Villanova del Moral (1)
- Patrick von Platen (1)
- Thomas Wolf (1)
- Mario Šaško (1)
- Yacine Jernite (1)
- Abhishek Thakur (1)
- Lewis Tunstall (1)
- Suraj Patil (1)
- Mariama Drame (1)
- Julien Chaumond (1)
- Julien Plu (1)
- Joe Davison (1)
- Simon Brandeis (1)
- Victor Sanh (1)
- Teven Le Scao (1)
- Kevin Canwen Xu (1)
- Nicolas Patry (1)
- Steven Liu (1)
- Angelina McMillan-Major (1)
- Philipp Schmid (1)
- Sylvain Gugger (1)
- Nathan Raw (1)
- Sylvain Lesage (1)
- Anton Lozhkov (1)
- Matthew Carrigan (1)
- Théo Matussière (1)
- Leandro von Werra (1)
- Lysandre Debut (1)
- Stas Bekman (1)
- Clément Delangue (1)
- (1) Hugging Face
Description
Important
- Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
  - From now on, datasets pushed to the Hub that use ClassLabel will store their feature types in a new YAML model.
  - The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
  - Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
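The key-type change described for the ClassLabel YAML model (0 -> "0") can be illustrated with a minimal pure-Python sketch. This is not the library's own code: the function name `stringify_label_ids` and the sample labels are hypothetical, and the sketch only shows integer ids becoming string ids in a label name mapping.

```python
def stringify_label_ids(names_by_id):
    """Return a copy of a label-name mapping whose ids are strings.

    Hypothetical helper for illustration only; it is not part of the
    datasets library.
    """
    return {str(label_id): name for label_id, name in names_by_id.items()}


# Old-style mapping: integer ids (sample labels are made up).
old_style = {0: "negative", 1: "positive"}

# New-style mapping: the same ids, stored as strings.
new_style = stringify_label_ids(old_style)
print(new_style)
```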
- Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
  - Datasets in streaming mode now update their features after column renaming or removal
- Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239
  - Use multiprocessing to load multiple files in parallel
- Add features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
- Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268
  - Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
  - Pass num_proc to use multiprocessing.
- Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252
- Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357
  - You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:

        from datasets import load_dataset
        from torch.utils.data import DataLoader

        ds = load_dataset("c4", "en", streaming=True, split="train")
        dataloader = DataLoader(ds, batch_size=32, num_workers=4)

- Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248
- typo by @WrRan in https://github.com/huggingface/datasets/pull/5253
- typo by @WrRan in https://github.com/huggingface/datasets/pull/5254
- remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257
- fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256
- Fix max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
- Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266
- Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250
- Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279
- Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210
- Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285
- Use correct dataset type in from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
- Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294
- Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297
- Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299
- Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065
- Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211
- Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310
- Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313
- Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319
- [Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320
- Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321
  - This was affecting datasets like wikipedia or natural_questions
- Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328
- Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318
- fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333
- Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329
- Close stream in ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
- Use same num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
- Set IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
- fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234
- Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340
- Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334
- Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341
- Support topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
- Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
- Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350
- Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349
- Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344
- Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355
- Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322
- Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356
- ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366
- Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373
- Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375
New Contributors
- @WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253
- @eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318
- @vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234
- @Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373
Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0
Files
Name | Size | Checksum |
---|---|---|
huggingface/datasets-2.8.0.zip | 2.3 MB | md5:0253251f70971ffa645125faabcca3ec |
Additional details
Related works
- Is supplement to
- https://github.com/huggingface/datasets/tree/2.8.0 (URL)