Published September 13, 2021 | Version 1.12.0

huggingface/datasets: 1.12.0

Description

New documentation

  • New documentation structure #2718 (@stevhliu):
    • New: Tutorials
    • New: How-to guides
    • New: Conceptual guides
    • Update: Reference

Datasets changes

  • New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
  • New: The Pile books3 #2801 (@richarddwang)
  • New: The Pile stack exchange #2803 (@richarddwang)
  • New: The Pile openwebtext2 #2802 (@richarddwang)
  • New: Food-101 #2804 (@nateraw)
  • New: Beans #2809 (@nateraw)
  • New: cedr #2796 (@naumov-al)
  • New: cats_vs_dogs #2807 (@nateraw)
  • New: MultiEURLEX #2865 (@iliaschalkidis)
  • New: BIOSSES #2881 (@bwang482)
  • Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
  • Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
  • Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
  • Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
  • Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
  • Update: SUPERB - Add SD task #2661 (@albertvillanova)
  • Update: SUPERB - Add KS task #2783 (@anton-l)
  • Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
  • Update: Openwebtext - update size #2857 (@lhoestq)
  • Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
  • Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
  • Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
  • Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
  • Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
  • Fix: linnaeus - fix url #2852 (@lhoestq)
  • Fix: ToTTo - fix data URL #2864 (@albertvillanova)
  • Fix: wikicorpus - fix keys #2844 (@lhoestq)
  • Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
  • Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)

Datasets features

  • Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
  • Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
  • Add multiprocessing in to_json #2747 (@bhavitvyamalik)
  • Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
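
The idea behind the Dataset.filter optimization (#2836) can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: `indices_to_keep` and `IndexedView` are hypothetical names, and the list of dicts stands in for an on-disk table. The point is only that filtering can record which row positions pass the predicate and read through those positions, instead of copying the surviving rows.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def indices_to_keep(rows: List[Row], predicate: Callable[[Row], bool]) -> List[int]:
    """Return the positions of the rows that satisfy the predicate."""
    return [i for i, row in enumerate(rows) if predicate(row)]


class IndexedView:
    """Hypothetical read-only view exposing only the selected row positions."""

    def __init__(self, rows: List[Row], indices: List[int]) -> None:
        self._rows = rows        # underlying data, never copied
        self._indices = indices  # positions that passed the filter

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Row:
        # Translate the view position into a position in the original data.
        return self._rows[self._indices[i]]


# Toy data standing in for a dataset.
rows = [
    {"text": "a", "label": 0},
    {"text": "b", "label": 1},
    {"text": "c", "label": 1},
]
kept = indices_to_keep(rows, lambda r: r["label"] == 1)
view = IndexedView(rows, kept)
```

In this sketch the filter stores one integer per surviving row, so its cost does not depend on how large each row is.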

Dataset streaming - better support for compression:

  • Fix streaming zip files #2798 (@albertvillanova)
  • Support streaming tar files #2800 (@albertvillanova)
  • Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
  • Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
  • Add url prefix convention for many compression formats #2822 (@lhoestq)
  • Support streaming datasets that use pathlib #2874 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
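
As a rough illustration of what streaming a compressed file means, the sketch below reads a compressed text file line by line without decompressing it to disk first. It uses only the Python standard library and is not how the library implements streaming: `stream_lines` and the extension table are assumptions made for this example, and lz4 and zst are left out because the standard library has no readers for them.

```python
import bz2
import gzip
import lzma
import os
import tempfile
from typing import Iterator

# Assumed mapping from file extension to a standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def stream_lines(path: str) -> Iterator[str]:
    """Yield lines lazily, decompressing on the fly based on the extension."""
    ext = os.path.splitext(path)[1]
    opener = OPENERS.get(ext, open)  # fall back to a plain-text file
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


# Small demo: write a gzip-compressed JSON Lines file, then stream it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n')
    demo_lines = list(stream_lines(demo_path))
```

Because `stream_lines` is a generator, a consumer can stop after the first few examples without reading the rest of the file.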

Metrics changes

  • Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
  • Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)

Dataset cards

  • Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
  • Update ELI5 README.md #2848 (@odellus)

General improvements and bug fixes

  • Update release instructions #2740 (@albertvillanova)
  • Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
  • Allow PyArrow from source #2769 (@patrickvonplaten)
  • Fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
  • Fix typo in test_dataset_common #2790 (@nateraw)
  • Fix type hint for data_files #2793 (@albertvillanova)
  • Bump tqdm version #2814 (@mariosasko)
  • Use packaging to handle versions #2777 (@albertvillanova)
  • Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
  • Rename The Pile subsets #2817 (@lhoestq)
  • Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
  • Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
  • Fix extraction protocol inference from urls with params #2843 (@lhoestq)
  • Fix caching when moving script #2854 (@lhoestq)
  • Fix windows CI CondaError #2855 (@lhoestq)
  • fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
  • Update column_names shown as :func: in exploring.rst #2851 (@ClementRomac)
  • Fix s3fs version in CI #2858 (@lhoestq)
  • Fix three typos in two files for documentation #2870 (@leny-mi)
  • Move checks from _map_single to map #2660 (@mariosasko)
  • Fix regex to accept negative timezone #2847 (@jadermcs)
  • Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
  • Fix null sequence encoding #2900 (@lhoestq)

Files

huggingface/datasets-1.12.0.zip (41.3 MB)
md5:d063aafad8b8cefe011755804eaafcd6
