Published October 13, 2022 | Version 2.6.0
Software
Open
huggingface/datasets: 2.6.0
Authors/Creators
- Quentin Lhoest1
- Albert Villanova del Moral1
- Patrick von Platen1
- Thomas Wolf1
- Mario Šaško1
- Yacine Jernite1
- Abhishek Thakur1
- Lewis Tunstall1
- Suraj Patil1
- Mariama Drame1
- Julien Chaumond1
- Julien Plu1
- Joe Davison1
- Simon Brandeis1
- Victor Sanh1
- Teven Le Scao1
- Kevin Canwen Xu1
- Nicolas Patry1
- Steven Liu1
- Angelina McMillan-Major1
- Philipp Schmid1
- Sylvain Gugger1
- Nathan Raw1
- Sylvain Lesage1
- Anton Lozhkov1
- Matthew Carrigan1
- Théo Matussière1
- Leandro von Werra1
- Lysandre Debut1
- Stas Bekman1
- Clément Delangue1
- 1. Hugging Face
Description
Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in https://github.com/huggingface/datasets/pull/4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
- Add ability to read-write to SQL databases. by @Dref360 in https://github.com/huggingface/datasets/pull/4928
  - Read from sqlite file:

        from datasets import Dataset
        dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")

  - Allow connection objects in `from_sql` + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091

        from datasets import Dataset
        from sqlite3 import connect
        con = connect(...)
        dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)

- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in https://github.com/huggingface/datasets/pull/5072
  - return numpy/torch/tf/jax tensors with:

        from datasets import load_dataset
        ds = load_dataset("imagenet-1k").with_format("torch")  # or numpy/tf/jax
        ds[0]["image"]

- Fast dataset iter by @mariosasko in https://github.com/huggingface/datasets/pull/5030
- speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in https://github.com/huggingface/datasets/pull/4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add `kwargs` to `Dataset.from_generator` by @mariosasko in https://github.com/huggingface/datasets/pull/5049
- Support `converters` in `CsvBuilder` by @mariosasko in https://github.com/huggingface/datasets/pull/5057
- added `from_generator` method to `IterableDataset` class. by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052
- Restore saved format state in `load_from_disk` by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073
- Update: hendrycks_test - support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/5041
- Update: swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5019
- Update swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in https://github.com/huggingface/datasets/pull/5022
- Fix: sbu_captions - fix URLs by @donglixp in https://github.com/huggingface/datasets/pull/5020
- Fix: xcsr - fix string features by @albertvillanova in https://github.com/huggingface/datasets/pull/5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in https://github.com/huggingface/datasets/pull/5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in https://github.com/huggingface/datasets/pull/5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in https://github.com/huggingface/datasets/pull/3715
- Add description to hellaswag dataset by @julien-c in https://github.com/huggingface/datasets/pull/4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5010
- Update languages in aeslc dataset card by @apergo-ai in https://github.com/huggingface/datasets/pull/3357
- Update license to bookcorpus dataset card by @meg-huggingface in https://github.com/huggingface/datasets/pull/3526
- Update paper link in medmcqa dataset card by @monk1337 in https://github.com/huggingface/datasets/pull/4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in https://github.com/huggingface/datasets/pull/5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5054
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/5003
- Add some note about running the transformers ci before a release by @lhoestq in https://github.com/huggingface/datasets/pull/5007
- Remove license tag file and validation by @albertvillanova in https://github.com/huggingface/datasets/pull/5004
- Re-apply input columns change by @mariosasko in https://github.com/huggingface/datasets/pull/5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in https://github.com/huggingface/datasets/pull/5026
- Fix typo in error message by @severo in https://github.com/huggingface/datasets/pull/5027
- Fix import in `ClassLabel` docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029
- Remove redundant code from some dataset module factories by @albertvillanova in https://github.com/huggingface/datasets/pull/5033
- Fix typos in load docstrings and comments by @albertvillanova in https://github.com/huggingface/datasets/pull/5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in https://github.com/huggingface/datasets/pull/4985
- Fix tar extraction vuln by @lhoestq in https://github.com/huggingface/datasets/pull/5016
- Support hfh 0.10 implicit auth by @lhoestq in https://github.com/huggingface/datasets/pull/5031
- Fix `flatten_indices` with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in https://github.com/huggingface/datasets/pull/5037
- Revert task removal in folder-based builders by @mariosasko in https://github.com/huggingface/datasets/pull/5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in https://github.com/huggingface/datasets/pull/5055
- Fix typo by @stevhliu in https://github.com/huggingface/datasets/pull/5059
- Fix CI hfh token warning by @albertvillanova in https://github.com/huggingface/datasets/pull/5062
- Mark CI tests as xfail when 502 error by @albertvillanova in https://github.com/huggingface/datasets/pull/5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in https://github.com/huggingface/datasets/pull/5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5067
- Fix header level in Audio docs by @stevhliu in https://github.com/huggingface/datasets/pull/5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in https://github.com/huggingface/datasets/pull/5071
- Support streaming gzip.open by @albertvillanova in https://github.com/huggingface/datasets/pull/5066
- adding keep in memory by @Mustapha-AJEGHRIR in https://github.com/huggingface/datasets/pull/5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in https://github.com/huggingface/datasets/pull/5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in https://github.com/huggingface/datasets/pull/5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5064
- Fix filter with empty indices by @Mouhanedg56 in https://github.com/huggingface/datasets/pull/5087
- Fix tutorial (#5093) by @riccardobucco in https://github.com/huggingface/datasets/pull/5095
- Use HTML relative paths for tiles in the docs by @lewtun in https://github.com/huggingface/datasets/pull/5092
- Fix loading how to guide (#5102) by @riccardobucco in https://github.com/huggingface/datasets/pull/5104
- url encode hub url (#5099) by @riccardobucco in https://github.com/huggingface/datasets/pull/5103
- Free the "hf" filesystem protocol for `hffs` by @lhoestq in https://github.com/huggingface/datasets/pull/5101
- Fix task template reload from dict by @lhoestq in https://github.com/huggingface/datasets/pull/5106
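
The SQL reading support listed above can be tried end to end without an existing database. The following is a minimal sketch, not taken from the release notes: the `reviews` table, the query, and the helper names `build_demo_db` and `load_long_reviews` are hypothetical and chosen only for illustration. It assumes `datasets` 2.6.0 or later with pandas installed for the `Dataset.from_sql` call; the SQLite part uses only the standard library.

```python
import sqlite3

# Hypothetical table and column names, used for illustration only.
QUERY = "SELECT text FROM reviews WHERE length(text) > 10 LIMIT 10"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory SQLite database with a small 'reviews' table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, text TEXT)")
    con.executemany(
        "INSERT INTO reviews (text) VALUES (?)",
        [("short",), ("a noticeably longer review",), ("another long review text",)],
    )
    con.commit()
    return con


def load_long_reviews(con: sqlite3.Connection):
    """Load the query result as a Dataset by passing a connection object."""
    # Imported lazily so the SQLite part of this sketch runs without `datasets`.
    from datasets import Dataset

    return Dataset.from_sql(QUERY, con)


con = build_demo_db()
# Run the query directly to inspect the rows it selects.
rows = con.execute(QUERY).fetchall()
print(rows)
```

Calling `load_long_reviews(con)` should then return a `Dataset` with a single `text` column. Passing a query together with a connection object mirrors the second snippet in the notes above; the URI form (`"sqlite:///..."`) is the alternative shown in the first snippet.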
- @Wauplin made their first contribution in https://github.com/huggingface/datasets/pull/5026
- @donglixp made their first contribution in https://github.com/huggingface/datasets/pull/5020
- @Timothyxxx made their first contribution in https://github.com/huggingface/datasets/pull/3715
- @hamid-vakilzadeh made their first contribution in https://github.com/huggingface/datasets/pull/5052
- @Mustapha-AJEGHRIR made their first contribution in https://github.com/huggingface/datasets/pull/5082
- @galbwe made their first contribution in https://github.com/huggingface/datasets/pull/5079
- @rahulXs made their first contribution in https://github.com/huggingface/datasets/pull/5076
- @Mouhanedg56 made their first contribution in https://github.com/huggingface/datasets/pull/5087
- @riccardobucco made their first contribution in https://github.com/huggingface/datasets/pull/5095
- @asofiaoliveira made their first contribution in https://github.com/huggingface/datasets/pull/5073
Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0
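
The oversampling strategy added for interleaving iterable datasets (pull/5036) can be illustrated in plain Python. This is a sketch of the idea only, not the library's implementation: the function name `interleave_all_exhausted` is hypothetical, and it assumes round-robin order over non-empty sources with no sampling probabilities. In the library itself, the behavior is selected with `interleave_datasets(..., stopping_strategy="all_exhausted")`.

```python
from typing import Iterator, List, Sequence


def interleave_all_exhausted(sources: Sequence[Sequence[int]]) -> Iterator[int]:
    """Round-robin over sources, restarting exhausted ones (oversampling).

    Iteration stops once every source has been fully consumed at least once.
    Assumes every source is non-empty.
    """
    positions = [0] * len(sources)
    exhausted = [False] * len(sources)
    while True:
        for i, source in enumerate(sources):
            yield source[positions[i]]
            positions[i] += 1
            if positions[i] == len(source):
                exhausted[i] = True
                positions[i] = 0  # restart: this source is now oversampled
            if all(exhausted):
                return


d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22, 23, 24]
result: List[int] = list(interleave_all_exhausted([d1, d2, d3]))
print(result)
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
```

Shorter sources are repeated until the longest one has been seen in full, so no example is dropped, at the cost of duplicates from the shorter sources.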
Files
| Name | Size |
|---|---|
| huggingface/datasets-2.6.0.zip (md5:b2e3ec8d54a196408a1da5cf1869c065) | 2.2 MB |
Additional details
Related works
- Is supplement to
- https://github.com/huggingface/datasets/tree/2.6.0 (URL)