huggingface/datasets: 1.12.0

doi:10.5281/zenodo.5504237

Published September 13, 2021 | Version 1.12.0

Software Open

1. Hugging Face

New documentation

New documentation structure #2718 (@stevhliu):
- New: Tutorials
- New: How-to guides
- New: Conceptual guides
- Update: Reference

Datasets changes

New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
New: The Pile books3 #2801 (@richarddwang)
New: The Pile stack exchange #2803 (@richarddwang)
New: The Pile openwebtext2 #2802 (@richarddwang)
New: Food-101 #2804 (@nateraw)
New: Beans #2809 (@nateraw)
New: cedr #2796 (@naumov-al)
New: cats_vs_dogs #2807 (@nateraw)
New: MultiEURLEX #2865 (@iliaschalkidis)
New: BIOSSES #2881 (@bwang482)
Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
Update: SUPERB - Add SD task #2661 (@albertvillanova)
Update: SUPERB - Add KS task #2783 (@anton-l)
Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
Update: Openwebtext - update size #2857 (@lhoestq)
Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
Fix: linnaeus - fix url #2852 (@lhoestq)
Fix: ToTTo - fix data URL #2864 (@albertvillanova)
Fix: wikicorpus - fix keys #2844 (@lhoestq)
Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)

Datasets features

Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
Add multiprocessing to to_json #2747 (@bhavitvyamalik)
Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
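The idea behind the Dataset.filter optimization (#2836) can be illustrated with a small sketch: rather than rewriting the rows that pass the predicate, only their positions are recorded and lookups go through that list. The IndexedView class and its attributes below are hypothetical names invented for this illustration; this is not the library's implementation, only a toy model of "compute the indices to keep".

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, object]


class IndexedView:
    """Toy stand-in for a table plus an indices mapping (hypothetical, not the datasets API)."""

    def __init__(self, rows: Sequence[Row], indices: Optional[List[int]] = None):
        self._rows = rows  # shared with the parent view, never copied
        self._indices = list(range(len(rows))) if indices is None else indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Row:
        # Every lookup is redirected through the indices mapping.
        return self._rows[self._indices[i]]

    def filter(self, predicate: Callable[[Row], bool]) -> "IndexedView":
        # Only the positions of the kept rows are computed; row data is not rewritten.
        kept = [idx for idx in self._indices if predicate(self._rows[idx])]
        return IndexedView(self._rows, kept)


rows = [
    {"text": "a", "label": 0},
    {"text": "b", "label": 1},
    {"text": "c", "label": 1},
]
positives = IndexedView(rows).filter(lambda row: row["label"] == 1)
```

In this sketch the filtered view shares its storage with the original, so the cost of filtering is one list of integers rather than a second copy of the data.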

Dataset streaming - better support for compression:

Fix streaming zip files #2798 (@albertvillanova)
Support streaming tar files #2800 (@albertvillanova)
Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
Add url prefix convention for many compression formats #2822 (@lhoestq)
Support streaming datasets that use pathlib #2874 (@albertvillanova)
Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
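As a rough illustration of what streaming a compressed file involves, the sketch below reads lines lazily from a binary stream and decompresses on the fly, picking the decompressor from the stream's magic number. It uses only the Python standard library, so it covers gzip, bz2 and xz; lz4 and zst need third-party packages and are left out. The iter_lines helper is a hypothetical name, the stream is assumed to be seekable, and none of this reflects how the library itself is implemented.

```python
import bz2
import gzip
import io
import lzma
from typing import BinaryIO, Iterator

# Magic-number prefixes for the formats the standard library can decompress.
_OPENERS = [
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bz2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
]


def iter_lines(raw: BinaryIO) -> Iterator[str]:
    """Yield text lines from a possibly compressed, seekable binary stream (hypothetical helper)."""
    head = raw.read(6)
    raw.seek(0)
    opener = next((o for magic, o in _OPENERS if head.startswith(magic)), None)
    # Wrap the stream in a decompressor when a known magic number is found.
    stream = opener(raw, "rb") if opener else raw
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        yield line.rstrip("\n")


payload = "first\nsecond\n".encode("utf-8")
blobs = [payload, gzip.compress(payload), bz2.compress(payload), lzma.compress(payload)]
decoded = [list(iter_lines(io.BytesIO(blob))) for blob in blobs]
```

Because iter_lines is a generator, lines are decompressed only as they are consumed, which is the property that matters when the underlying stream is a remote file rather than an in-memory buffer.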

Metrics changes

Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)

Dataset cards

Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
Update ELI5 README.md #2848 (@odellus)

General improvements and bug fixes

Update release instructions #2740 (@albertvillanova)
Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
Allow PyArrow from source #2769 (@patrickvonplaten)
fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
Fix typo in test_dataset_common #2790 (@nateraw)
Fix type hint for data_files #2793 (@albertvillanova)
Bump tqdm version #2814 (@mariosasko)
Use packaging to handle versions #2777 (@albertvillanova)
Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
Rename The Pile subsets #2817 (@lhoestq)
Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
Fix extraction protocol inference from urls with params #2843 (@lhoestq)
Fix caching when moving script #2854 (@lhoestq)
Fix windows CI CondaError #2855 (@lhoestq)
fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
Update column_names shown as :func: in exploring.rst #2851 (@ClementRomac)
Fix s3fs version in CI #2858 (@lhoestq)
Fix three typos in two files for documentation #2870 (@leny-mi)
Move checks from _map_single to map #2660 (@mariosasko)
fix regex to accept negative timezone #2847 (@jadermcs)
Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
Fix null sequence encoding #2900 (@lhoestq)
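The rule named in #2856 (drop a URL's query string only when it is exactly ?dl=1) can be sketched with the standard library as follows. The strip_dl_query function is a hypothetical helper written for this illustration and the example URLs are made up; the actual change in the library may be structured differently.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_dl_query(url: str) -> str:
    """Remove the query string only if it is exactly 'dl=1' (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.query == "dl=1":
        parts = parts._replace(query="")
    # Any other query string is passed through untouched.
    return urlunsplit(parts)


cleaned = strip_dl_query("https://example.com/data.zip?dl=1")
untouched = strip_dl_query("https://example.com/download?id=42&export=1")
```

Restricting the rewrite to this single case keeps query parameters that carry meaning, such as file identifiers, intact.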

Files

Files (41.3 MB)

Name	Size
huggingface/datasets-1.12.0.zip (md5:d063aafad8b8cefe011755804eaafcd6)	41.3 MB

Additional details

Is supplement to: https://github.com/huggingface/datasets/tree/1.12.0 (URL)

