Published July 25, 2022 | Version 2.4.0
Software
Open
huggingface/datasets: 2.4.0
Authors/Creators
- Quentin Lhoest [1]
- Albert Villanova del Moral [1]
- Patrick von Platen [1]
- Thomas Wolf [1]
- Mario Šaško [1]
- Yacine Jernite [1]
- Abhishek Thakur [1]
- Lewis Tunstall [1]
- Suraj Patil [1]
- Mariama Drame [1]
- Julien Chaumond [1]
- Julien Plu [1]
- Joe Davison [1]
- Simon Brandeis [1]
- Victor Sanh [1]
- Teven Le Scao [1]
- Kevin Canwen Xu [1]
- Nicolas Patry [1]
- Steven Liu [1]
- Angelina McMillan-Major [1]
- Philipp Schmid [1]
- Sylvain Gugger [1]
- Nathan Raw [1]
- Sylvain Lesage [1]
- Anton Lozhkov [1]
- Matthew Carrigan [1]
- Théo Matussière [1]
- Leandro von Werra [1]
- Lysandre Debut [1]
- Stas Bekman [1]
- Clément Delangue [1]

[1] Hugging Face
Description
Dataset Features
- Add `concatenate_datasets` for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500
- Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in https://github.com/huggingface/datasets/pull/4625
- Support using PCM audio files (#4323) by @YooSungHyun in https://github.com/huggingface/datasets/pull/4409
- [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in https://github.com/huggingface/datasets/pull/4633
- Support extract 7-zip compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4672
- Support extract lz4 compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4700
- Support `metadata.jsonl` from parent directories in `imagefolder` by @mariosasko in https://github.com/huggingface/datasets/pull/4576
- Update: allocine - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4563
- Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4585
- Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4586
- Update: financial_phrasebank - Host data on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4598
- Update: cfq - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4579
- Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4588
- Update: bookcorpus - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4564
- Update: fever - Refactor and add metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4503
- Update: mlsum - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4574
- Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/4523
- Fix: conll2003 - fix empty example by @lhoestq in https://github.com/huggingface/datasets/pull/4662
- Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in https://github.com/huggingface/datasets/pull/4554
- Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in https://github.com/huggingface/datasets/pull/4706
- Fix: crd3 - fix splits that were containing the same data by @lhoestq in https://github.com/huggingface/datasets/pull/4705
- Add action names in schema_guided_dstc8 dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/4559
- Add evaluation data to acronym_identification by @lewtun in https://github.com/huggingface/datasets/pull/4561
- Update WinoBias README by @sashavor in https://github.com/huggingface/datasets/pull/4631
- Support "tags" yaml tag by @lhoestq in https://github.com/huggingface/datasets/pull/4716
- Fix POS tags by @lhoestq in https://github.com/huggingface/datasets/pull/4715
- AESLC dataset: Add summarization tags by @hobson in https://github.com/huggingface/datasets/pull/4517
- Update docs around audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4440
- Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in https://github.com/huggingface/datasets/pull/4513
- Remove multiple config section by @stevhliu in https://github.com/huggingface/datasets/pull/4600
- Create new sections for audio and vision in guides by @stevhliu in https://github.com/huggingface/datasets/pull/4519
- Document installation of sox OS dependency for audio by @albertvillanova in https://github.com/huggingface/datasets/pull/4713
- Add regression test for `ArrowWriter.write_batch` when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510
- Support all negative values in ClassLabel by @lhoestq in https://github.com/huggingface/datasets/pull/4511
- Add uppercased versions of image file extensions for automatic module inference by @mariosasko in https://github.com/huggingface/datasets/pull/4515
- Patch tests for hfh v0.8.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/4518
- Replace deprecated logging.warn with logging.warning by @hugovk in https://github.com/huggingface/datasets/pull/4539
- [CI] Fix upstream hub test url by @lhoestq in https://github.com/huggingface/datasets/pull/4543
- Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4541
- [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in https://github.com/huggingface/datasets/pull/4546
- Tell users to upload on the hub directly by @lhoestq in https://github.com/huggingface/datasets/pull/4552
- Add `batch_size` parameter when calling `add_faiss_index` and `add_faiss_index_from_external_arrays` by @alvarobartt in https://github.com/huggingface/datasets/pull/4535
- Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4545
- Properly raise FileNotFound even if the dataset is private by @lhoestq in https://github.com/huggingface/datasets/pull/4536
- Fix hashing for python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/4516
- [CI] Fix some warnings by @lhoestq in https://github.com/huggingface/datasets/pull/4547
- Validate new_fingerprint passed by user by @lhoestq in https://github.com/huggingface/datasets/pull/4587
- Update CI Windows orb by @albertvillanova in https://github.com/huggingface/datasets/pull/4604
- Perform hidden file check on relative data file path by @mariosasko in https://github.com/huggingface/datasets/pull/4551
- Align more metadata with other repo types (models,spaces) by @julien-c in https://github.com/huggingface/datasets/pull/4607
- Align/fix license metadata info by @julien-c in https://github.com/huggingface/datasets/pull/4613
- Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in https://github.com/huggingface/datasets/pull/4611
- Add authentication tip to `load_dataset` by @mariosasko in https://github.com/huggingface/datasets/pull/4577
- Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4553
- fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in https://github.com/huggingface/datasets/pull/4630
- Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in https://github.com/huggingface/datasets/pull/4608
- Rename master to main by @lhoestq in https://github.com/huggingface/datasets/pull/4643
- Set HF_SCRIPTS_VERSION to main by @lhoestq in https://github.com/huggingface/datasets/pull/4645
- [Minor fix] Typo correction by @cakiki in https://github.com/huggingface/datasets/pull/4644
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in https://github.com/huggingface/datasets/pull/4627
- Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4590
- Fix time type `_arrow_to_datasets_dtype` conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628
- Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in https://github.com/huggingface/datasets/pull/4660
- Replace `assertEqual` with `assertTupleEqual` in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496
- Fix `embed_storage` on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615
- Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in https://github.com/huggingface/datasets/pull/4512
- Transfer CI to GitHub Actions by @albertvillanova in https://github.com/huggingface/datasets/pull/4659
- Fix mock fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/4685
- Trigger CI also on push to main by @albertvillanova in https://github.com/huggingface/datasets/pull/4687
- Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in https://github.com/huggingface/datasets/pull/4622
- Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4688
- Test extractors for all compression formats by @albertvillanova in https://github.com/huggingface/datasets/pull/4689
- Refactor base extractors by @albertvillanova in https://github.com/huggingface/datasets/pull/4690
- Update create dataset card docs by @stevhliu in https://github.com/huggingface/datasets/pull/4683
- Add text decorators by @stevhliu in https://github.com/huggingface/datasets/pull/4663
- Skip tests only for lz4/zstd params if not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4704
- Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in https://github.com/huggingface/datasets/pull/4614
- Docs: Fix same-page hashlinks by @mishig25 in https://github.com/huggingface/datasets/pull/4722
- Fix broken link to the Hub by @stevhliu in https://github.com/huggingface/datasets/pull/4726
- Refactor conftest fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/4723
- Add object detection processing tutorial by @nateraw in https://github.com/huggingface/datasets/pull/4710
- Fix require torchaudio and refactor test requirements by @albertvillanova in https://github.com/huggingface/datasets/pull/4708
- docs: ✏️ fix TranslationVariableLanguages example by @severo in https://github.com/huggingface/datasets/pull/4731
- Pin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4735
- Fix named split sorting and remove unnecessary casting by @albertvillanova in https://github.com/huggingface/datasets/pull/4714
- Make cast in `from_pandas` more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703
- Make Extractor accept Path as input by @albertvillanova in https://github.com/huggingface/datasets/pull/4718
- Refactor Hub tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4729
- Fix to dict conversion of `DatasetInfo`/`Features` by @mariosasko in https://github.com/huggingface/datasets/pull/4741
New Contributors
- @hugovk made their first contribution in https://github.com/huggingface/datasets/pull/4539
- @VijayKalmath made their first contribution in https://github.com/huggingface/datasets/pull/4545
- @gugarosa made their first contribution in https://github.com/huggingface/datasets/pull/4630
- @benlipkin made their first contribution in https://github.com/huggingface/datasets/pull/4627
- @YooSungHyun made their first contribution in https://github.com/huggingface/datasets/pull/4409
- @hobson made their first contribution in https://github.com/huggingface/datasets/pull/4517
- @khushmeeet made their first contribution in https://github.com/huggingface/datasets/pull/4554
- @dtuit made their first contribution in https://github.com/huggingface/datasets/pull/4614
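The data files disambiguation entry above (pull/4633) says that split names are matched only when they sit between separators. The following is a minimal sketch of that idea using only the standard library. The separator set, the keyword list, and the helper name `infer_split` are assumptions made for illustration; they are not taken from the library's source, whose actual patterns may differ.

```python
import re

# Hypothetical separator set, chosen for illustration only.
SEPARATORS = r"[-._ 0-9/]"
# Hypothetical list of split keywords to look for.
SPLIT_KEYWORDS = ("train", "test", "validation")


def infer_split(path: str):
    """Return the first split keyword found between separators in `path`, else None."""
    for keyword in SPLIT_KEYWORDS:
        # The keyword must start the path or follow a separator,
        # and must end the path or be followed by a separator.
        pattern = rf"(?:^|{SEPARATORS}){keyword}(?:{SEPARATORS}|$)"
        if re.search(pattern, path):
            return keyword
    return None


for candidate in ("data/train.csv", "my_test_file.csv", "contest.csv", "training.csv"):
    print(candidate, "->", infer_split(candidate))
```

Under these assumptions, a name such as `contest.csv` is not assigned to the test split, because `test` is preceded by a letter rather than a separator.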
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0
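For the `imagefolder` entry in the feature list (pull/4576), a `metadata.jsonl` file can now live in a parent directory of the images it describes. The sketch below builds such a layout in a temporary directory with the standard library only. It assumes that each line is a JSON object with a `file_name` field and that `file_name` is resolved relative to the directory holding `metadata.jsonl`; the `caption` column, the file names, and the helper functions are hypothetical. Loading with the library itself is shown only as a comment and is not executed here.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(root: Path, rows: list) -> Path:
    """Write one JSON object per line to root/metadata.jsonl and return its path."""
    metadata_path = root / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return metadata_path


def read_metadata(metadata_path: Path) -> list:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with metadata_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    for name in ("cat.png", "dog.png"):
        (root / "train" / name).touch()  # empty placeholder files, not real images

    # metadata.jsonl sits in the parent directory of the image files.
    rows = [
        {"file_name": "train/cat.png", "caption": "a cat"},  # "caption" is a hypothetical column
        {"file_name": "train/dog.png", "caption": "a dog"},
    ]
    metadata_path = write_metadata(root, rows)
    loaded_rows = read_metadata(metadata_path)
    all_files_exist = all((root / row["file_name"]).is_file() for row in loaded_rows)

    # With real images in place, the folder could then be loaded with, for example:
    #   from datasets import load_dataset
    #   dataset = load_dataset("imagefolder", data_dir=str(root))
```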
Files
| Name | Size |
|---|---|
| huggingface/datasets-2.4.0.zip (md5:dc2bf8b814d446021214bc3807b7937c) | 54.9 MB |
Additional details
Related works
- Is supplement to: https://github.com/huggingface/datasets/tree/2.4.0 (URL)