Published September 21, 2022 | Version 2.5.0
Software
Open
huggingface/datasets: 2.5.0
Authors/Creators
- Quentin Lhoest1
- Albert Villanova del Moral1
- Patrick von Platen1
- Thomas Wolf1
- Mario Šaško1
- Yacine Jernite1
- Abhishek Thakur1
- Lewis Tunstall1
- Suraj Patil1
- Mariama Drame1
- Julien Chaumond1
- Julien Plu1
- Joe Davison1
- Simon Brandeis1
- Victor Sanh1
- Teven Le Scao1
- Kevin Canwen Xu1
- Nicolas Patry1
- Steven Liu1
- Angelina McMillan-Major1
- Philipp Schmid1
- Sylvain Gugger1
- Nathan Raw1
- Sylvain Lesage1
- Anton Lozhkov1
- Matthew Carrigan1
- Théo Matussière1
- Leandro von Werra1
- Lysandre Debut1
- Stas Bekman1
- Clément Delangue1
- 1. Hugging Face
Description
Important
- Drop Python 3.6 support by @mariosasko in https://github.com/huggingface/datasets/pull/4460
- Deprecate metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4739
- Metrics are now deprecated and have been moved to evaluate: run pip install evaluate, then import evaluate and load a metric with metric = evaluate.load("accuracy")
- Load GitHub datasets from Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4059
- Datasets with no namespace, like "squad", were previously loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets
- Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
- The latest version of torchaudio (0.12) now requires ffmpeg to read MP3 files; please downgrade to torchaudio<0.12.0 for now or use librosa
- Use HTTP requests to access data and metadata through the Datasets REST API
- Add AudioFolder packaged loader by @polinaeterna in https://github.com/huggingface/datasets/pull/4530
- Add support for CSV metadata files to ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/4837
- Add support for parsing JSON files in array form by @mariosasko in https://github.com/huggingface/datasets/pull/4997
- Add Dataset.from_list by @sanderland in https://github.com/huggingface/datasets/pull/4890
- Add Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/4957
- Add oversampling strategies to interleave datasets by @ylacombe in https://github.com/huggingface/datasets/pull/4831
- Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971
- Add fn_kwargs param to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4975
- More rigorous shape inference in to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4763
- Download and prepare as Parquet for cloud storage by @lhoestq in https://github.com/huggingface/datasets/pull/4724
- Shard parquet in download_and_prepare by @lhoestq in https://github.com/huggingface/datasets/pull/4747
- Embed image/audio data in dl_and_prepare parquet by @lhoestq in https://github.com/huggingface/datasets/pull/4987
- Update: natural questions - Add long answer candidates by @seirasto in https://github.com/huggingface/datasets/pull/4368
- Update: opus_paracrawl - update version by @albertvillanova in https://github.com/huggingface/datasets/pull/4816
- Update: ReCoRD - Include entity positions as feature by @richarddwang in https://github.com/huggingface/datasets/pull/4479
- Update: swda - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4914
- Update: Enwik8 - update broken link and information by @mtanghu in https://github.com/huggingface/datasets/pull/4
- Update: compguesswhat - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4968
- Update: nli_tr - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4970
- Update: IndicGLUE - update download links by @sumanthd17 in https://github.com/huggingface/datasets/pull/4978
- Update: iwslt2017 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4992
- Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4788
- Fix: mkqa - Update data URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4823
- Fix: exams - fix bug and checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/4853
- Fix: trec - use fine classes by @albertvillanova in https://github.com/huggingface/datasets/pull/4801
- Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in https://github.com/huggingface/datasets/pull/4871
- Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4904
- Fix: compguesswhat - fix data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4959
- Fix: vivos - fix data URL and metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4969
- Fix: MBPP - Add splits by @cwarny in https://github.com/huggingface/datasets/pull/4943
- Add language_bcp47 tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753
- Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in https://github.com/huggingface/datasets/pull/4701
- Remove "unkown" language tags by @lhoestq in https://github.com/huggingface/datasets/pull/4754
- Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in https://github.com/huggingface/datasets/pull/4712
- Added dataset information in clinic oos dataset card by @Arnav-Ladkat in https://github.com/huggingface/datasets/pull/4751
- Fix opus_gnome dataset card by @gojiteji in https://github.com/huggingface/datasets/pull/4806
- Complete the mlqa dataset card by @eldhoittangeorge in https://github.com/huggingface/datasets/pull/4809
- Fix loading example in opus dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4813
- Add missing language tags to resources by @albertvillanova in https://github.com/huggingface/datasets/pull/4819
- Fix titles in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4824
- Fix language tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4826
- Add license metadata to pg19 by @julien-c in https://github.com/huggingface/datasets/pull/4827
- Fix task tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4830
- Fix tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4832
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4833
- Fix documentation card of recipe_nlg dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4834
- Fix documentation card of ethos dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4835
- Update documentation card of miam dataset by @PierreColombo in https://github.com/huggingface/datasets/pull/4846
- Update stackexchange license by @cakiki in https://github.com/huggingface/datasets/pull/4842
- Update ted_talks_iwslt license to include ND by @cakiki in https://github.com/huggingface/datasets/pull/4841
- Fix documentation card of adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4838
- Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/4867
- Fix license tag and Source Data section in billsum dataset card by @kashif in https://github.com/huggingface/datasets/pull/4851
- Fix documentation card of covid_qa_castorini dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4877
- Fix Citation Information section in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4879
- Fix documentation card of math_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4884
- Added names of less-studied languages by @BenjaminGalliot in https://github.com/huggingface/datasets/pull/4880
- Fix language tags resource file by @albertvillanova in https://github.com/huggingface/datasets/pull/4882
- Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4892
- Add citation information to makhzan dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4894
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4891
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4896
- Re-add code and und language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4899
- Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/4887
- Update GLUE evaluation metadata by @lewtun in https://github.com/huggingface/datasets/pull/4909
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4908
- Add license and citation information to cosmos_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4913
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4921
- Add cc-by-nc-2.0 to list of licenses by @albertvillanova in https://github.com/huggingface/datasets/pull/4930
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4931
- Add Papers with Code ID to scifact dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4941
- Fix license information in qasc dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4951
- Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4940
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4979
- Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4991
- Update map docs by @stevhliu in https://github.com/huggingface/datasets/pull/4743
- Add image classification processing guide by @stevhliu in https://github.com/huggingface/datasets/pull/4748
- Fix train_test_split docs by @NielsRogge in https://github.com/huggingface/datasets/pull/4821
- Update local loading script docs by @stevhliu in https://github.com/huggingface/datasets/pull/4778
- Docs for creating a loading script for image datasets by @stevhliu in https://github.com/huggingface/datasets/pull/4783
- Docs for creating an audio dataset by @stevhliu in https://github.com/huggingface/datasets/pull/4872
- Use CI unit/integration tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4738
- Fix multiprocessing in map_nested by @albertvillanova in https://github.com/huggingface/datasets/pull/4740
- Add 2.4.0 version added to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4767
- Update CI badge by @mariosasko in https://github.com/huggingface/datasets/pull/4764
- Fix version in map_nested docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4765
- fix typo by @xwwwwww in https://github.com/huggingface/datasets/pull/4770
- Unpin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4768
- Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4780
- Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in https://github.com/huggingface/datasets/pull/4777
- Remove dummy data generation docs by @stevhliu in https://github.com/huggingface/datasets/pull/4771
- Require torchaudio<0.12.0 in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/4785
- Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/4812
- Fix typo in streaming docs by @flozi00 in https://github.com/huggingface/datasets/pull/4843
- Fix test of _get_extraction_protocol for TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/4850
- Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/
- Mark CI tests as xfail if Hub HTTP error by @albertvillanova in https://github.com/huggingface/datasets/pull/4845
- [Windows] Fix Access Denied when using os.rename() by @DougTrajano in https://github.com/huggingface/datasets/pull/4825
- [docs] Some tiny doc tweaks by @julien-c in https://github.com/huggingface/datasets/pull/4874
- Document loading from relative path by @stevhliu in https://github.com/huggingface/datasets/pull/4773
- Fix CI reporting by @albertvillanova in https://github.com/huggingface/datasets/pull/
- Add 'val' to VALIDATION_KEYWORDS. by @akt42 in https://github.com/huggingface/datasets/pull/4844
- Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in https://github.com/huggingface/datasets/pull/
- feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in https://github.com/huggingface/datasets/pull/4919
- Fixes a typo in loading documentation by @sighingnow in https://github.com/huggingface/datasets/pull/4929
- Remove main branch rename notice by @lhoestq in https://github.com/huggingface/datasets/pull/4938
- Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4939
- Remove deprecated identical_ok by @lhoestq in https://github.com/huggingface/datasets/pull/4937
- Pin TensorFlow temporarily by @albertvillanova in https://github.com/huggingface/datasets/pull/4954
- Fix minor typo in error message for missing imports by @mariosasko in https://github.com/huggingface/datasets/pull/4948
- Fix TF tests for 2.10 by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4956
- fix BLEU metric card by @antoniolanza1996 in https://github.com/huggingface/datasets/pull/4927
- Update doc upload_dataset.mdx by @mishig25 in https://github.com/huggingface/datasets/pull/4789
- Improve features resolution in streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4762
- Fix label renaming and add a battery of tests by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4781
- Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in https://github.com/huggingface/datasets/pull/4967
- Introduce regex check when pushing as well by @LysandreJik in https://github.com/huggingface/datasets/pull/4946
- [doc] Fix broken snippet that had too many quotes by @tomaarsen in https://github.com/huggingface/datasets/pull/4986
- Fix map batched with torch output by @lhoestq in https://github.com/huggingface/datasets/pull/4972
- fix: avoid casting tuples after Dataset.map by @szmoro in https://github.com/huggingface/datasets/pull/4993
- decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923
- Don't add a tag on the Hub on release by @lhoestq in https://github.com/huggingface/datasets/pull/4998
- Add EmptyDatasetError by @lhoestq in https://github.com/huggingface/datasets/pull/4999
- @seirasto made their first contribution in https://github.com/huggingface/datasets/pull/4368
- @sbroadhurst-hf made their first contribution in https://github.com/huggingface/datasets/pull/4712
- @nawarhalabi made their first contribution in https://github.com/huggingface/datasets/pull/4701
- @Arnav-Ladkat made their first contribution in https://github.com/huggingface/datasets/pull/4751
- @xwwwwww made their first contribution in https://github.com/huggingface/datasets/pull/4770
- @gojiteji made their first contribution in https://github.com/huggingface/datasets/pull/4806
- @eldhoittangeorge made their first contribution in https://github.com/huggingface/datasets/pull/4809
- @flozi00 made their first contribution in https://github.com/huggingface/datasets/pull/4843
- @fl-lo made their first contribution in https://github.com/huggingface/datasets/pull/4869
- @BenjaminGalliot made their first contribution in https://github.com/huggingface/datasets/pull/4880
- @DougTrajano made their first contribution in https://github.com/huggingface/datasets/pull/4825
- @ylacombe made their first contribution in https://github.com/huggingface/datasets/pull/4831
- @osanseviero made their first contribution in https://github.com/huggingface/datasets/pull/4887
- @akt42 made their first contribution in https://github.com/huggingface/datasets/pull/4844
- @sanderland made their first contribution in https://github.com/huggingface/datasets/pull/4890
- @sighingnow made their first contribution in https://github.com/huggingface/datasets/pull/4929
- @mtanghu made their first contribution in https://github.com/huggingface/datasets/pull/4950
- @antoniolanza1996 made their first contribution in https://github.com/huggingface/datasets/pull/4927
- @apohllo made their first contribution in https://github.com/huggingface/datasets/pull/4967
- @cwarny made their first contribution in https://github.com/huggingface/datasets/pull/4943
- @tomaarsen made their first contribution in https://github.com/huggingface/datasets/pull/4986
- @szmoro made their first contribution in https://github.com/huggingface/
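The entry on parsing JSON files in array form refers to files whose top level is a single array of records, as opposed to JSON Lines with one object per line. As a rough, standard-library-only illustration of the difference, both layouts can be normalised into a list of row dictionaries. This is not the library's parser, and the helper name read_json_records is hypothetical:

```python
import json


def read_json_records(text):
    """Return a list of row dicts from array-form JSON or JSON Lines text.

    Hypothetical helper for illustration only; it is not part of datasets.
    """
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        # Array form: the whole document is one JSON array of objects.
        records = json.loads(stripped)
    else:
        # JSON Lines: one JSON object per non-empty line.
        records = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    if not all(isinstance(record, dict) for record in records):
        raise ValueError("expected every record to be a JSON object")
    return records


array_form = '[{"text": "a", "label": 0}, {"text": "b", "label": 1}]'
json_lines = '{"text": "a", "label": 0}\n{"text": "b", "label": 1}'

# Both layouts describe the same two rows.
assert read_json_records(array_form) == read_json_records(json_lines)
```

With the library itself, such files would normally go through the packaged JSON loader, for example load_dataset("json", data_files=...), rather than a hand-written reader like this one.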
Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0
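The oversampling entry for interleaved datasets concerns when interleaving stops: either as soon as the shortest dataset runs out, or only once every dataset has been fully seen, with shorter ones restarted in the meantime. A minimal pure-Python sketch of those two behaviours on plain lists follows. The function interleave_lists and its strategy names are illustrative assumptions rather than the library's implementation, and the sketch covers only the round-robin case without sampling probabilities:

```python
def interleave_lists(sources, strategy="first_exhausted"):
    """Round-robin over several lists (hypothetical helper, not datasets code).

    first_exhausted: stop once the shortest list has been consumed.
    all_exhausted:   keep going until the longest list has been consumed,
                     restarting shorter lists from the beginning (oversampling).
    """
    if not sources or any(len(source) == 0 for source in sources):
        return []
    lengths = [len(source) for source in sources]
    if strategy == "first_exhausted":
        rounds = min(lengths)
    elif strategy == "all_exhausted":
        rounds = max(lengths)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    result = []
    for i in range(rounds):
        for source in sources:
            # The modulo wraps around, so exhausted lists start over.
            result.append(source[i % len(source)])
    return result


d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]

print(interleave_lists([d1, d2, d3]))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]
print(interleave_lists([d1, d2, d3], strategy="all_exhausted"))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
```

In the second call the shorter lists are reused until the longest one is finished, which is why the values 0, 1 and 10 appear twice.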
Files
| Name | Size |
|---|---|
| huggingface/datasets-2.5.0.zip (md5:cf929b3e51e4495d1ef918f5086584fd) | 55.9 MB |
Additional details
Related works
- Is supplement to: https://github.com/huggingface/datasets/tree/2.5.0 (URL)