Published May 27, 2021 | Version 1.7.0
Software Open

huggingface/datasets: 1.7.0

Description

Dataset Changes

  • New: NLU evaluation data #2238 (@dkajtoch)
  • New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
  • New: Bbaw egyptian #2290 (@phiwi)
  • New: GooAQ #2260 (@bhavitvyamalik)
  • New: SubjQA #2302 (@lewtun)
  • New: Ascent KB #2341, #2349 (@phongnt570)
  • New: HLGD #2325 (@tingofurro)
  • New: Qasper #2346 (@cceyda)
  • New: ConvQuestions benchmark #2372 (@PhilippChr)
  • Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
  • Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
  • Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
  • Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
  • Update: GEM - Update the DART file checksums in GEM #2334 (@yjernite)
  • Update: web_science - fixed download link #2338 (@bhavitvyamalik)
  • Update: SNLI, MNLI - README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
  • Update: conll2003 - correct labels #2369 (@philschmid)
  • Update: offenseval_dravidian - update citations #2385 (@adeepH)
  • Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
  • Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
  • Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
  • Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
  • Fix: head_qa - Fix keys #2408 (@lhoestq)

Dataset Features

  • Implement Dataset add_item #1870 (@albertvillanova)
  • Implement Dataset add_column #2145 (@albertvillanova)
  • Implement Dataset to JSON #2248, #2352 (@albertvillanova)
  • Add rename_columns method #2312 (@SBrandeis)
  • Add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
  • Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)

Metric Changes

  • New: CUAD metrics #2273 (@bhavitvyamalik)
  • New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
  • Update: CER - Docs, CER above 1 #2342 (@borisdayma)
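
The "CER above 1" entry is easier to follow with the usual definition of character error rate in mind: the character-level edit distance between prediction and reference, divided by the reference length. Because insertions count as errors, the ratio is not bounded by 1. The following pure-Python sketch illustrates this; it is not the library's implementation, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(reference: str, prediction: str) -> int:
    """Levenshtein distance (substitutions, deletions, insertions) between two strings."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the prediction.
    prev = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, pred_char in enumerate(prediction, start=1):
            cost = 0 if ref_char == pred_char else 1
            curr.append(min(prev[j] + 1,          # delete ref_char
                            curr[j - 1] + 1,      # insert pred_char
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, prediction) / len(reference)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
print(cer("ab", "abxyz"))        # 3 insertions over 2 reference characters
```

In the second call the prediction contains more spurious characters than the reference has characters, so the rate exceeds 1.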

General improvements and bug fixes

  • Update black #2265 (@lhoestq)
  • Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
  • Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
  • Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
  • Fix query table with iterable #2269 (@lhoestq)
  • Perform minor refactoring: use config #2253 (@albertvillanova)
  • Update format, fingerprint and indices after add_item #2254 (@lhoestq)
  • Always update metadata in arrow schema #2274 (@lhoestq)
  • Make tests run faster #2266 (@lhoestq)
  • Fix metadata validation with config names #2286 (@lhoestq)
  • Fixed typo seperate->separate #2292 (@laksh9950)
  • Allow collaborators to self-assign issues #2289 (@albertvillanova)
  • Mapping in the distributed setting #2298 (@TevenLeScao)
  • Fix conda release #2309 (@lhoestq)
  • Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
  • Set default name in init_dynamic_modules #2320 (@albertvillanova)
  • Fix duplicate keys #2333 (@lhoestq)
  • Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
  • Metadata validation #2107 (@theo-m)
  • Add Validation For README #2121 (@gchhablani)
  • Fix overflow issue in interpolation search #2336 (@mariosasko)
  • Datasets cli improvements #2315 (@mariosasko)
  • Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
  • More consistent copy logic #2340 (@mariosasko)
  • Update README validation rules #2353 (@gchhablani)
  • Normalized TOCs and titles in data cards #2355 (@yjernite)
  • Simplify faiss index save #2351 (@Guitaricet)
  • Allow "other-X" in licenses #2368 (@gchhablani)
  • Improve ReadInstruction logic and update docs #2261 (@mariosasko)
  • Disallow duplicate keys in yaml tags #2379 (@lhoestq)
  • Maintain YAML structure reading from README #2380 (@bhavitvyamalik)
  • Add dataset card title #2381 (@bhavitvyamalik)
  • Add tests for dataset cards #2348 (@gchhablani)
  • Improve example in rounding docs #2383 (@mariosasko)
  • Paperswithcode dataset mapping #2404 (@julien-c)
  • Free datasets with cache file in temp dir on exit #2403 (@mariosasko)

Experimental and work in progress: Cast a dataset for specific tasks

  • Task casting for text classification & question answering #2255 (@SBrandeis)
  • Add check for task templates on dataset load #2390 (@lewtun)
  • Add args description to DatasetInfo #2384 (@lewtun)
  • Improve task api code quality #2376 (@mariosasko)

Files

huggingface/datasets-1.7.0.zip (34.2 MB)
md5:5f56ea7bc252c8ec0d728ead71c6d1ef
