Published November 26, 2021 | Version 1.16.0
Software
Open
huggingface/datasets: 1.16.0
Creators
- Quentin Lhoest¹
- Albert Villanova del Moral¹
- Patrick von Platen¹
- Thomas Wolf¹
- Mario Šaško¹
- Yacine Jernite¹
- Abhishek Thakur¹
- Lewis Tunstall¹
- Suraj Patil¹
- Mariama Drame¹
- Julien Chaumond¹
- Julien Plu¹
- Joe Davison¹
- Simon Brandeis¹
- Victor Sanh¹
- Teven Le Scao¹
- Kevin Canwen Xu¹
- Nicolas Patry¹
- Steven Liu¹
- Angelina McMillan-Major¹
- Philipp Schmid¹
- Sylvain Gugger¹
- Nathan Raw¹
- Sylvain Lesage¹
- Anton Lozhkov¹
- Matthew Carrigan¹
- Théo Matussière¹
- Leandro von Werra¹
- Lysandre Debut¹
- Stas Bekman¹
- Clément Delangue¹

¹ Hugging Face
Description
Datasets Changes
- New: riddle_sense by @ziyiwu9494 in https://github.com/huggingface/datasets/pull/3161
- New: Multi-Lingual LibriSpeech by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3198
- New: XCSR by @yangxqiao in https://github.com/huggingface/datasets/pull/3074
- New: CMU Hinglish DoG by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3149
- New: Multidoc2dial by @sivasankalpp in https://github.com/huggingface/datasets/pull/3205
- New: IndoNLI by @afaji in https://github.com/huggingface/datasets/pull/3307
- Update: DaNE - updated URL for download by @MalteHB in https://github.com/huggingface/datasets/pull/3203
- Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in https://github.com/huggingface/datasets/pull/3254
- Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in https://github.com/huggingface/datasets/pull/3225
- Update: KILT - update metadata JSON by @albertvillanova in https://github.com/huggingface/datasets/pull/3276
- Update: Covost 2 - update download instructions by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3281
- Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in https://github.com/huggingface/datasets/pull/3290
- Fix: tuple_ie - fix download url by @mariosasko in https://github.com/huggingface/datasets/pull/3213
- Fix: id_newspapers_2018 - fix streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3249
- Fix: bookcorpusopen - fix RAM usage by @lhoestq in https://github.com/huggingface/datasets/pull/3280
- Fix: Scielo - fix ConnectionError by @mariosasko in https://github.com/huggingface/datasets/pull/3260
- Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in https://github.com/huggingface/datasets/pull/3321
- Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in https://github.com/huggingface/datasets/pull/3098:
  - upload your dataset to the Hugging Face Hub with the push_to_hub() method!
- Dataset streaming improvements:
  - Stream TAR-based dataset using iter_archive by @lhoestq in https://github.com/huggingface/datasets/pull/3110
  - Stream from Google Drive and other hosts by @lhoestq in https://github.com/huggingface/datasets/pull/3248
  - Support Audio feature in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3133
  - Support Audio feature for TAR archives in sequential access by @albertvillanova in https://github.com/huggingface/datasets/pull/3129
- Resolve data_files by split name automatically by @lhoestq in https://github.com/huggingface/datasets/pull/3221
- Filter method for batched=True by @thomasw21 in https://github.com/huggingface/datasets/pull/3244
- Adding with_rank arg to pass process rank to map by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314
- Add full tagset to conll2003 README by @BramVanroy in https://github.com/huggingface/datasets/pull/3230
- Fix some contact information formats by @lhoestq in https://github.com/huggingface/datasets/pull/3274
- Add wikipedia tags by @lhoestq in https://github.com/huggingface/datasets/pull/3301
- Updating details of IRC disentanglement data by @jkkummerfeld in https://github.com/huggingface/datasets/pull/3259
- New: OpenAI's pass@k code evaluation metric by @lvwerra in https://github.com/huggingface/datasets/pull/2916
- Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in https://github.com/huggingface/datasets/pull/3235
- Update: CER - update to support latest release by @mariosasko in https://github.com/huggingface/datasets/pull/3252
- Update: WER - update to the documentation by @wooters in https://github.com/huggingface/datasets/pull/3278
- Add docs for to_tf_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/3175
- Small updates to to_tf_dataset documentation by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3215
- Update link to Datasets Tagging app in Spaces by @albertvillanova in https://github.com/huggingface/datasets/pull/3194
- Improve repository structure docs by @lhoestq in https://github.com/huggingface/datasets/pull/3233
- Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3241
- Add docs for audio processing by @stevhliu in https://github.com/huggingface/datasets/pull/3222
- Add push_to_hub docs by @lhoestq in https://github.com/huggingface/datasets/pull/3319
- Catch token invalid error in CI by @lhoestq in https://github.com/huggingface/datasets/pull/3200
- Pin keras version until TF fixes its release by @albertvillanova in https://github.com/huggingface/datasets/pull/3208
- Fix disable_nullable default value to False by @lhoestq in https://github.com/huggingface/datasets/pull/3211
- Fix code quality in riddle_sense dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3218
- Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160
- Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3121
- Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in https://github.com/huggingface/datasets/pull/3216
- Group tests in multiprocessing workers by test file by @albertvillanova in https://github.com/huggingface/datasets/pull/3231
- Fix load_from_disk temporary directory by @lhoestq in https://github.com/huggingface/datasets/pull/3245
- [tiny] fix typo in stream docs by @nollied in https://github.com/huggingface/datasets/pull/3246
- Avoid PyArrow type optimization if it fails by @mariosasko in https://github.com/huggingface/datasets/pull/3234
- Remove redundant isort module placement by @mariosasko in https://github.com/huggingface/datasets/pull/3243
- asserts replaced by exception for text classification task with test. by @manisnesan in https://github.com/huggingface/datasets/pull/3256
- Add os.listdir for streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3270
- asserts replaced with exception for image classification task, csv, json by @manisnesan in https://github.com/huggingface/datasets/pull/3262
- Force data files extraction if download_mode='force_redownload' by @mariosasko in https://github.com/huggingface/datasets/pull/3275
- Minor Typo Fix - Precision to Recall by @SebastinSanty in https://github.com/huggingface/datasets/pull/3279
- Decode audio from remote by @lhoestq in https://github.com/huggingface/datasets/pull/3271
- Fix build_docs CI by @lhoestq in https://github.com/huggingface/datasets/pull/3286
- Allow datasets with indices table when concatenating along axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3288
- f-string formatting by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3277
- Unpin markdown for build_docs now that it's fixed by @lhoestq in https://github.com/huggingface/datasets/pull/3289
- Pin version exclusion for Markdown by @albertvillanova in https://github.com/huggingface/datasets/pull/3293
- Use f-strings in the dataset scripts by @Carlosbogo in https://github.com/huggingface/datasets/pull/3291
- fix old_val typo in f-string by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3302
- asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305
- fix: files counted twice in inferred structure by @borisdayma in https://github.com/huggingface/datasets/pull/3309
- Finish transition to PyArrow 3.0.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3318
- Removing query params for dynamic URL caching by @anton-l in https://github.com/huggingface/datasets/pull/3315
- Update BibTeX entry by @albertvillanova in https://github.com/huggingface/datasets/pull/3223
- Fix paper BibTeX citation with proceedings reference by @albertvillanova in https://github.com/huggingface/datasets/pull/3226
- Add CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3228
- Fix URL in CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3229
- Deprecate prepare_module by @albertvillanova in https://github.com/huggingface/datasets/pull/3166
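The TAR streaming entries above rely on reading archive members one after another rather than seeking inside the file. The idea can be sketched with the Python standard library alone. This is only an illustration under that assumption, not the code behind iter_archive; the names iter_tar_members and build_demo_archive are hypothetical.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_members(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) pairs from a TAR stream in archive order.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Mode "r|*" treats the input as a forward-only stream, so no seeking
    # is needed and the source could just as well be a network response.
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            # Read the member fully before advancing: in stream mode the
            # data of earlier members cannot be revisited later.
            yield member.name, extracted.read()


def build_demo_archive() -> io.BytesIO:
    """Create a tiny in-memory TAR archive to stand in for a remote file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in [("train/a.txt", b"hello"), ("train/b.txt", b"world")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for path, content in iter_tar_members(build_demo_archive()):
        print(path, len(content))
```

Because each member is consumed before the next one is requested, memory use stays bounded by the largest single file rather than by the size of the archive.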
Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0
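For readers unfamiliar with the pass@k metric listed among the changes above, here is a minimal sketch of the unbiased estimator described in OpenAI's Codex evaluation, assuming n generated samples per problem of which c pass the unit tests. The function name pass_at_k and its argument checks are illustrative choices, not the API of the metric added in this release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated
    Computes 1 - C(n - c, k) / C(n, k). Illustrative sketch only.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k samples
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

In a full evaluation the per-problem estimates would be averaged over all problems; that averaging step is left out here to keep the sketch short.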
Files
Name | Size | Checksum |
---|---|---|
huggingface/datasets-1.16.0.zip | 43.3 MB | md5:263b4a9bdb9e34f74508a1f9b66fa5c4 |
Additional details
Related works
- Is supplement to: https://github.com/huggingface/datasets/tree/1.16.0 (URL)