Published December 21, 2021 | Version 1.17.0
Software
Open
huggingface/datasets: 1.17.0
Authors/Creators
- Quentin Lhoest1
- Albert Villanova del Moral1
- Patrick von Platen1
- Thomas Wolf1
- Mario Šaško1
- Yacine Jernite1
- Abhishek Thakur1
- Lewis Tunstall1
- Suraj Patil1
- Mariama Drame1
- Julien Chaumond1
- Julien Plu1
- Joe Davison1
- Simon Brandeis1
- Victor Sanh1
- Teven Le Scao1
- Kevin Canwen Xu1
- Nicolas Patry1
- Steven Liu1
- Angelina McMillan-Major1
- Philipp Schmid1
- Sylvain Gugger1
- Nathan Raw1
- Sylvain Lesage1
- Anton Lozhkov1
- Matthew Carrigan1
- Théo Matussière1
- Leandro von Werra1
- Lysandre Debut1
- Stas Bekman1
- Clément Delangue1
- 1. Hugging Face
Description
Dataset Changes
- New: The Pile
- Add The Pile dataset and PubMed Central subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3287
- Add The Pile Free Law subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3359
- Add The Pile USPTO subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3360
- Add The Pile subsets by @albertvillanova in https://github.com/huggingface/datasets/pull/3378
- Add The Pile Enron Emails subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3427
- New: British Library Books Genre by @davanstrien in https://github.com/huggingface/datasets/pull/3312
- New: Americas NLI by @fdschmidt93 in https://github.com/huggingface/datasets/pull/3371
- New: Speech commands by @polinaeterna in https://github.com/huggingface/datasets/pull/3335
- New: eli5_category by @jingshenSN2 in https://github.com/huggingface/datasets/pull/3420
- New: OneStopQa by @scaperex in https://github.com/huggingface/datasets/pull/3436
- Update: LABR - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3352
- Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in https://github.com/huggingface/datasets/pull/3376
- Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in https://github.com/huggingface/datasets/pull/3362
- Update: CC100 - add Georgian data by @AnzorGozalishvili in https://github.com/huggingface/datasets/pull/3383
- Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in https://github.com/huggingface/datasets/pull/3426
- Update: swahili_news - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3463
- Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in https://github.com/huggingface/datasets/pull/3266
- Fix: QED - fix type of bridge field by @mariosasko in https://github.com/huggingface/datasets/pull/3417
- Fix: ASSET - fix dataset data URLs by @tianjianjiang in https://github.com/huggingface/datasets/pull/3342
- Add Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/3163
- to_tf_dataset() refactor by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3356
- More robust None handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195
- Add cast_column to IterableDataset by @mariosasko in https://github.com/huggingface/datasets/pull/3439
- Support streaming zipped dataset repo by passing only repo name by @albertvillanova in https://github.com/huggingface/datasets/pull/3375
- Extend support for streaming datasets that use pd.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/3355
- Extend iter_archive to support file object input by @albertvillanova in https://github.com/huggingface/datasets/pull/3443
- Extend text to support yielding lines, paragraphs or documents by @albertvillanova in https://github.com/huggingface/datasets/pull/3442
- Push dataset infos.json to Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3467
- Change TriviaQA license (#3313) by @avinashsai in https://github.com/huggingface/datasets/pull/3330
- Add missing tags to XTREME by @mariosasko in https://github.com/huggingface/datasets/pull/3322
- Remove duplicate name from dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3354
- Fix typos in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3386
- Fix duplicated tag in wikicorpus dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/3458
- Create Language Modeling task by @albertvillanova in https://github.com/huggingface/datasets/pull/3387
- BLEURT: Match key names to correspond with filename by @jaehlee in https://github.com/huggingface/datasets/pull/3348
- Fix links in metrics description by @albertvillanova in https://github.com/huggingface/datasets/pull/3461
- Fix METEOR missing NLTK's omw-1.4 by @lhoestq in https://github.com/huggingface/datasets/pull/3469
- Add ArrayXD docs by @stevhliu in https://github.com/huggingface/datasets/pull/3344
- Document a training loop for streaming dataset by @lhoestq in https://github.com/huggingface/datasets/pull/3370
- Fix formatting in IterableDataset.map docs by @mariosasko in https://github.com/huggingface/datasets/pull/3395
- Correctly indent builder config in dataset script docs by @mariosasko in https://github.com/huggingface/datasets/pull/3432
- Update BLEURT hyperlink by @lewtun in https://github.com/huggingface/datasets/pull/3437
- Quick fix error formatting by @NouamaneTazi in https://github.com/huggingface/datasets/pull/3328
- Fix error message and add extension fallback by @mariosasko in https://github.com/huggingface/datasets/pull/3332
- Avoid content-encoding issue while streaming datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/3350
- Fix JSON ClassLabel casting for integers by @lhoestq in https://github.com/huggingface/datasets/pull/3340
- Better error message when download fails by @lhoestq in https://github.com/huggingface/datasets/pull/3343
- Fix dict source_datasets tagset validator by @albertvillanova in https://github.com/huggingface/datasets/pull/3368
- Fix typo in other-structured-to-text task tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3367
- Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in https://github.com/huggingface/datasets/pull/3296
- Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in https://github.com/huggingface/datasets/pull/3388
- More robust first elem check in encode/cast example by @mariosasko in https://github.com/huggingface/datasets/pull/3402
- Fix module inference for archive with a directory by @albertvillanova in https://github.com/huggingface/datasets/pull/3406
- Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in https://github.com/huggingface/datasets/pull/3410
- Pass new_fingerprint in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/3409
- Fix flaky test again for s3 serialization by @lhoestq in https://github.com/huggingface/datasets/pull/3412
- Skip None encoding (line deleted by accident in #3195) by @mariosasko in https://github.com/huggingface/datasets/pull/3414
- Clean squad dummy data by @lhoestq in https://github.com/huggingface/datasets/pull/3428
- #3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in https://github.com/huggingface/datasets/pull/3382
- Make cast cacheable (again) on Windows by @mariosasko in https://github.com/huggingface/datasets/pull/3429
- Use max number of data files to infer module by @albertvillanova in https://github.com/huggingface/datasets/pull/3407
- Fix iter_archive generator by @albertvillanova in https://github.com/huggingface/datasets/pull/3454
- [Staging] Update dataset repos automatically on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3451
- Update supported versions of Python in setup.py by @mariosasko in https://github.com/huggingface/datasets/pull/3438
- raise exception instead of using assertions. by @manisnesan in https://github.com/huggingface/datasets/pull/3349
- @avinashsai made their first contribution in https://github.com/huggingface/datasets/pull/3330
- @NouamaneTazi made their first contribution in https://github.com/huggingface/datasets/pull/3328
- @davanstrien made their first contribution in https://github.com/huggingface/datasets/pull/3312
- @francisco-perez-sorrosal made their first contribution in https://github.com/huggingface/datasets/pull/3296
- @LashaO made their first contribution in https://github.com/huggingface/datasets/pull/3266
- @fdschmidt93 made their first contribution in https://github.com/huggingface/datasets/pull/3371
- @polinaeterna made their first contribution in https://github.com/huggingface/datasets/pull/3335
- @AnzorGozalishvili made their first contribution in https://github.com/huggingface/datasets/pull/3383
- @tianjianjiang made their first contribution in https://github.com/huggingface/datasets/pull/3342
- @jingshenSN2 made their first contribution in https://github.com/huggingface/datasets/pull/3420
- @scaperex made their first contribution in https://github.com/huggingface/datasets/pull/3436
Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0
Files
| Name | Size |
|---|---|
| huggingface/datasets-1.17.0.zip (md5:8dc8fc15e8b67e8c7500567eae7a8ae3) | 50.2 MB |
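The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the local file name `datasets-1.17.0.zip` and the helper names `md5_of_file` and `matches_published_md5` are assumptions for illustration and are not part of this record or of the `datasets` library.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in the Files table of this record.
EXPECTED_MD5 = "8dc8fc15e8b67e8c7500567eae7a8ae3"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading the ~50 MB archive into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local file name; adjust to wherever the archive was saved.
    archive = Path("datasets-1.17.0.zip")
    if archive.exists():
        print("checksum OK" if matches_published_md5(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a defense against deliberate tampering.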
Additional details
Related works
- Is supplement to: https://github.com/huggingface/datasets/tree/1.17.0 (URL)