Published December 4, 2023 | Version v0.4.0
Software | Open Access

EleutherAI/lm-evaluation-harness: Major refactor

Contributor affiliations: EleutherAI; Booz Allen Hamilton; @ClarosAI; Indraprastha Institute of Information Technology Delhi; Peking University; Hugging Face; @azurro; HiTZ Zentroa UPV/EHU; @ufal; Ivy Natal

Description

What's Changed

  • Replace stale triviaqa dataset link by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/364
  • Update actions/setup-python in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365
  • Bump triviaqa version by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/366
  • Update lambada_openai multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370
  • Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
  • Added ToxiGen task by @Thartvigsen in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
  • Added CrowSPairs by @aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
  • Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380
  • hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/384
  • Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
  • Upstream hf-causal and hf-seq2seq model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381
  • Hosting arithmetic dataset on HuggingFace by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/391
  • Hosting wikitext on HuggingFace by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/396
  • Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
  • Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407
  • feat: evaluation using peft models with CLM by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
  • Update setup.py dependencies by @ret2libc in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
  • fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418
  • Add support for load_in_8bit and trust_remote_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
  • Hotfix: patch issues with the huggingface.py model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427
  • Continuing work on refactor [WIP] by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/425
  • Document task name wildcard support in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/435
  • Add non-programmatic BIG-bench-hard tasks by @yurodiviy in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
  • Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
  • [WIP, Refactor] Staging more changes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/465
  • [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467
  • Configurable-Tasks by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/438 (see the YAML task-config sketch after this list)
  • single GPU automatic batching logic by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/394
  • Fix bugs introduced in #394 #406 and max length bug by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
  • Sort task names to keep the same order always by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/474
  • Set PAD token to EOS token by @nikhilpinnaparaju in https://github.com/EleutherAI/lm-evaluation-harness/pull/448
  • [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486
  • fix adaptive batch crash when there are no new requests by @jquesnelle in https://github.com/EleutherAI/lm-evaluation-harness/pull/490
  • Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/426
  • Create output path directory if necessary by @janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
  • Add results of various models in json and md format by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/477
  • Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501
  • P3 prompt task by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/493
  • Evaluation Against Portion of Benchmark Data by @kenhktsui in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
  • Add option to dump prompts and completions to a JSON file by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/492
  • Add perplexity task on arbitrary JSON data by @janEbert in https://github.com/EleutherAI/lm-evaluation-harness/pull/481
  • Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/520
  • Data Parallelism by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/488
  • Fix mgpt fewshot by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/522
  • Extend dtype command line flag to HFLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/523
  • Add support for loading GPTQ models via AutoGPTQ by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/519
  • Change type signature of quantized and its default value for python < 3.11 compatibility by @passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
  • Fix LLaMA tokenization issue by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/531
  • [Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/542
  • Move spaces from context to continuation by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/546
  • Use max_length in AutoSeq2SeqLM by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/551
  • Fix typo by @kwikiel in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
  • Add load_in_4bit and fix peft loading by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/556
  • Update task_guide.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/564
  • [Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/559
  • Dataset metric log [WIP] by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/560
  • Add Anthropic support by @zphang in https://github.com/EleutherAI/lm-evaluation-harness/pull/562
  • Add MultipleChoiceExactTask by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/537
  • Revert "Add MultipleChoiceExactTask" by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/568
  • [Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/567
  • Remove the registration of "GPT2" as a model type by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/574
  • [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577
  • Better docs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/576
  • Update evaluator.py cache_db argument str if model is not str by @poedator in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
  • Add --max_batch_size and --batch_size auto:N by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/572
  • [Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/581
  • Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/582
  • Fix non-callable attributes in CachingLM by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/584
  • Add error handling for calling .to(device) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585
  • fixes some minor issues on tasks. by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/580
  • Add 4bit-related args by @SONG-WONHO in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
  • Fix triviaqa task by @seopbo in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
  • [Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/578
  • Logging Samples by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
  • Merge master into big-refactor by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/590
  • [Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/596
  • fixes for multiple_choice by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/598
  • add openbookqa config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/600
  • [Refactor] Model guide docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/606
  • [Refactor] More MCQA fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/599
  • [Refactor] Hellaswag by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
  • [Refactor] Seq2Seq Models with Multi-Device Support by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/565
  • [Refactor] CachingLM support via --use_cache by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/619
  • [Refactor] batch generation better for hf model ; deprecate hf-causal in new release by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/613
  • [Refactor] Update task statuses on tracking list by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/629
  • [Refactor] device_map options for hf model type by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/625
  • [Refactor] Misc. cleanup of dead code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/609
  • [Refactor] Log request arguments to per-sample json by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/624
  • [Refactor] HellaSwag YAML fix by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/639
  • [Refactor] Add caveats to parallelize=True docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/638
  • fixed super_glue and removed unused yaml config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/645
  • [Refactor] Fix sample logging by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/646
  • Add PEFT, quantization, remote code, LLaMA fix by @gakada in https://github.com/EleutherAI/lm-evaluation-harness/pull/644
  • [Refactor] Handle cuda:0 device assignment by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/647
  • [refactor] Add prost config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/640
  • [Refactor] Misc. bugfixes ; edgecase quantized models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/648
  • Update __init__.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/650
  • [Refactor] Add Lambada Multilingual by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/658
  • [Refactor] Add: SWAG,RACE,Arithmetic,Winogrande,PubmedQA by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/627
  • [refactor] Add qa4mre config by @farzanehnakhaee70 in https://github.com/EleutherAI/lm-evaluation-harness/pull/651
  • Update generation_kwargs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/657
  • [Refactor] Move race dataset on HF to EleutherAI group by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/661
  • [Refactor] Add Headqa by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/659
  • [Refactor] Add Unscramble ; Toxigen ; Hendrycks_Ethics ; MathQA by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/660
  • [Refactor] Port TruthfulQA (mc1 only) by @nopperl in https://github.com/EleutherAI/lm-evaluation-harness/pull/666
  • [Refactor] Miscellaneous fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/676
  • [Refactor] Patch to revamp-process by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/678
  • Revamp process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/671
  • [Refactor] Fix padding ranks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/679
  • [Refactor] minor edits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/680
  • [Refactor] Migrate ANLI tasks to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
  • edited output_path and added help to args by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/684
  • [Refactor] Minor changes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/685
  • [Refactor] typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/687
  • [Test] fix test_evaluator.py by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/675
  • Fix dummy model not invoking super class constructor by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/688
  • [Refactor] Migrate webqs task to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/689
  • [Refactor] Fix tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/693
  • [Refactor] Migrate xwinograd tasks to yaml by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/695
  • Early stop bug of greedy_until (primary_until should be a list of str) by @ZZR0 in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
  • Remove condition to check for winograd_schema by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/690
  • [Refactor] Use console script by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/703
  • [Refactor] Fixes for when using num_fewshot by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/702
  • [Refactor] Updated anthropic to new API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/710
  • [Refactor] Cleanup for big-refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/686
  • Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/720
  • [Refactor] Benchmark scripts by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/612
  • [Refactor] Fix Max Length arg by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/723
  • Add note about MPS by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/728
  • Update huggingface.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/730
  • Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/732
  • [Refactor] Port over Autobatching by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/673
  • [Refactor] Fix Anthropic Import and other fixes by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/724
  • [Refactor] Remove Unused Variable in Make-Table by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/734
  • [Refactor] logiqav2 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/711
  • [Refactor] Fix task packaging by @yeoedward in https://github.com/EleutherAI/lm-evaluation-harness/pull/739
  • [Refactor] fixed openai by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/736
  • [Refactor] added some typehints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/742
  • [Refactor] Port Babi task by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/752
  • [Refactor] CrowS-Pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/751
  • Update README.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/745
  • [Refactor] add xcopa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/749
  • Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/764
  • [Refactor] Add Blimp by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/763
  • [Refactor] Use evaluation mode for accelerate to prevent OOM by @tju01 in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
  • Patch Blimp by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/768
  • [Refactor] Speedup hellaswag context building by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/774
  • [Refactor] Patch crowspairs higher_is_better by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/766
  • [Refactor] XNLI by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/776
  • [Refactor] Update Benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/777
  • [WIP] Update API docs in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/747
  • [Refactor] Real Toxicity Prompts by @aflah02 in https://github.com/EleutherAI/lm-evaluation-harness/pull/725
  • [Refactor] XStoryCloze by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/759
  • [Refactor] Glue by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/761
  • [Refactor] Add triviaqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/758
  • [Refactor] Paws-X by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/779
  • [Refactor] MC Taco by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/783
  • [Refactor] Truthfulqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/782
  • [Refactor] fix doc_to_target processing by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/786
  • [Refactor] Add README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/757
  • [Refactor] Don't always require Perspective API key to run by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/788
  • [Refactor] Added HF model test by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/791
  • [Big refactor] HF test fixup by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/793
  • [Refactor] Process Whitespace for greedy_until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/781
  • [Refactor] Fix metrics in Greedy Until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/780
  • Update README.md by @Wehzie in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
  • Merge Fix metrics branch by @uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
  • [Refactor] Update docs by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/744
  • [Refactor] Superglue T5 Parity by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/769
  • Update main.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/817
  • [Refactor] Coqa by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/820
  • [Refactor] drop by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/821
  • [Refactor] Asdiv by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/813
  • [Refactor] Fix IndexError by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/819
  • [Refactor] toxicity: API inside function by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/822
  • [Refactor] wsc273 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/807
  • [Refactor] Bump min accelerate version and update documentation by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/812
  • Add mypy baseline config by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
  • [Refactor] Fix wikitext task by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/833
  • [Refactor] Add WMT tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/775
  • [Refactor] consolidated tasks tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/831
  • Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/838
  • [Refactor] mgsm by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/784
  • [Refactor] Add top-level import by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/830 (see the Python usage sketch after this list)
  • Add pyproject.toml by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/810
  • [Refactor] Additions to docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/799
  • [Refactor] Fix MGSM by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/845
  • [Refactor] float16 MPS works in torch nightly by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/853
  • [Refactor] Update benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/850
  • Switch to pyproject.toml based project metadata by @ethanhs in https://github.com/EleutherAI/lm-evaluation-harness/pull/854
  • Use Dict to make the code python 3.8 compatible by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
  • [Refactor] NQopen by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/859
  • [Refactor] NQ-open by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/798
  • Fix "local variable 'docs' referenced before assignment" error in write_out.py by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/856
  • [Refactor] 3.8 test compatibility by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/863
  • [Refactor] Cleanup dependencies by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/860
  • [Refactor] Qasper, MuTual, MGSM (Native CoT) by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/840
  • undefined type and output_type when using promptsource fixed by @Hojjat-Mokhtarabadi in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
  • [Refactor] Deactivate select GH Actions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/871
  • [Refactor] squadv2 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/785
  • [Refactor] Set python3.8 as allowed version by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/862
  • Fix positional arguments in HF model generate by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/877
  • [Refactor] MATH by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/861
  • Create cot_yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/870
  • [Refactor] Port CSATQA to refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/865
  • [Refactor] CMMLU, C-Eval port ; Add fewshot config by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/864
  • [Refactor] README.md for Asdiv by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/878
  • [Refactor] Hotfixes to big-refactor by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/880
  • Change Python Version to 3.8 in .pre-commit-config.yaml and GitHub Actions by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/895
  • [Refactor] Fix PubMedQA by @tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/890
  • [Refactor] Fix error when calling lm-eval by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/899
  • [Refactor] bigbench by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/852
  • [Refactor] Fix wildcards by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/900
  • Add transformation filters by @chrisociepa in https://github.com/EleutherAI/lm-evaluation-harness/pull/883
  • [Refactor] Flan benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/816
  • [Refactor] WIP: Add MMLU by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/753
  • Added notable contributors to the citation block by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/907
  • [Refactor] Improve error logging by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/908
  • [Refactor] Add _batch_scheduler in greedy_until by @AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
  • add belebele by @ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
  • Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/917
  • [Refactor] Precommit formatting for Belebele by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/926
  • [Refactor] change all mentions of greedy_until to generate_until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/927
  • [Refactor] Squadv2 updates by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/923
  • [Refactor] Verbose by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/910
  • [Refactor] Fix Unit Tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/905
  • Fix generate_until rename by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/929
  • [Refactor] Generate_until rename by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/931
  • Fix "'tqdm' object is not subscriptable" error in huggingface.py when batch size is auto by @jasonkrone in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
  • [Refactor] Fix Default Metric Call by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/935
  • Big refactor write out adaption by @MicPie in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
  • Update pyproject.toml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/915
  • [Refactor] Fix whitespace warning by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/949
  • [Refactor] Update documentation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/954
  • [Refactor] Fix two bugs when run with qasper_bool and toxigen by @AndyWolfZwei in https://github.com/EleutherAI/lm-evaluation-harness/pull/934
  • [Refactor] Describe local dataset usage in docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/956
  • [Refactor] Update README, documentation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/955
  • [Refactor] Don't load MMLU auxiliary_train set by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/953
  • [Refactor] Patch for Generation Until by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/957
  • [Refactor] Model written eval by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/815
  • [Refactor] Bugfix: AttributeError: 'Namespace' object has no attribute 'verbose' by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/966
  • [Refactor] Mmlu subgroups and weight avg by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/922
  • [Refactor] Remove deprecated gold_alias task YAML option by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/965
  • [Refactor] Logging fixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/952
  • [Refactor] fixes for alternative MMLU tasks. by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/981
  • [Refactor] Alias fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/987
  • [Refactor] Minor cleanup on base Task subclasses by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/996
  • [Refactor] add squad from master by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/971
  • [Refactor] Squad misc by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/999
  • [Refactor] Fix CI tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/997
  • [Refactor] will check if group_name is None by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1001
  • [Refactor] Bugfixes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1002
  • [Refactor] Verbosity rework by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/958
  • add description on task/group alias by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/979
  • [Refactor] Upstream ggml from big-refactor branch by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/967
  • [Refactor] Improve Handling of Stop-Sequences for HF Batched Generation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1009
  • [Refactor] Update README by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1020
  • [Refactor] Remove examples/ folder by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1018
  • [Refactor] vllm support by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1011
  • Allow Generation arguments on greedy_until reqs by @uSaiPrashanth in https://github.com/EleutherAI/lm-evaluation-harness/pull/897
  • Social iqa by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1030
  • [Refactor] BBH fixup by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1029
  • Rename bigbench.yml to default.yml by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1032
  • [Refactor] Num_fewshot process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/985
  • [Refactor] Use correct HF model type for MBart-like models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1024
  • [Refactor] Urgent fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1033
  • [Refactor] Versioning by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1031
  • fixes for sampler by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1038
  • [Refactor] Update README.md by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1046
  • [refactor] mps requirement by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1037
  • [Refactor] Additions to example notebook by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1048
  • Miscellaneous documentation updates by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1047
  • [Refactor] add notebook for overview by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1025
  • Update README.md by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1049
  • [Refactor] Openai completions by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1008
  • [Refactor] Added support for OpenAI ChatCompletions by @DaveOkpare in https://github.com/EleutherAI/lm-evaluation-harness/pull/839
  • [Refactor] Update docs ToC by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1051
  • [Refactor] Fix fewshot cot mmlu descriptions by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1060
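
Taken together, the changes above retire the old hf-causal/hf-seq2seq model types in favor of a unified hf model type and a library-first API (see #613 and #830). As a quick orientation, a minimal sketch of the v0.4.0 Python usage follows; the entry points and argument names shown (lm_eval.simple_evaluate, lm_eval.tasks.initialize_tasks, batch_size="auto") reflect our reading of this release and should be checked against the README rather than taken as a guaranteed interface.

```python
# Minimal sketch, assuming the v0.4.0 API surface; model name,
# tasks, and argument values here are illustrative only.
import lm_eval
import lm_eval.tasks

# Register the YAML-defined tasks packaged with the library (#596);
# we assume initialize_tasks() is the v0.4.0 registration hook.
lm_eval.tasks.initialize_tasks()

# "hf" supersedes the deprecated hf-causal/hf-seq2seq model types (#613);
# model_args uses the same comma-separated key=value syntax as the CLI,
# so flags from this release (dtype, load_in_8bit, trust_remote_code,
# peft, ...) can be passed through it as well.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,dtype=float32",
    tasks=["lambada_openai", "hellaswag"],
    num_fewshot=0,
    batch_size="auto",  # automatic batch-size search (#394, #572)
    limit=10,           # evaluate only a small slice, per #480
)
print(results["results"])
```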
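
The heart of the refactor is the declarative task system (#438, #486, #596): tasks are YAML configs packaged with the library rather than Python subclasses. A hypothetical config under the v0.4.0 schema might look like the following; the field names mirror the bundled task configs, while the task name, dataset, and Jinja templates are illustrative only.

```yaml
# Hypothetical task config following the v0.4.0 YAML schema;
# the task name, dataset, and templates are illustrative.
task: demo_copa
dataset_path: super_glue       # a Hugging Face datasets path
dataset_name: copa
output_type: multiple_choice   # or generate_until / loglikelihood_rolling
training_split: train
validation_split: validation
doc_to_text: "{{premise}} Therefore:"    # Jinja template over each doc
doc_to_choice: "{{[choice1, choice2]}}"  # candidate continuations
doc_to_target: label                     # index of the gold choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Placing such a file under lm_eval/tasks/ (or a directory included via the external-task mechanism) registers the task under its task: name; this is the pattern the task ports listed above (ANLI, XWinograd, webqs, and the rest) follow.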

New Contributors

  • @fattorib made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/373
  • @Thartvigsen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/377
  • @aflah02 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/379
  • @sxjscience made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/390
  • @Jeffwan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/403
  • @zanussbaum made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/414
  • @ret2libc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/416
  • @philwee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/422
  • @yurodiviy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/406
  • @nikhilpinnaparaju made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/447
  • @lintangsutawika made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/438
  • @juletx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/472
  • @janEbert made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/483
  • @kenhktsui made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/480
  • @passaglia made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/532
  • @kwikiel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/557
  • @poedator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/575
  • @SONG-WONHO made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/579
  • @seopbo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/525
  • @farzanehnakhaee70 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/563
  • @nopperl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/608
  • @yeoedward made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/682
  • @ZZR0 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/700
  • @tju01 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/770
  • @Wehzie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/803
  • @uSaiPrashanth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/802
  • @ethanhs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/809
  • @chrisociepa made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/857
  • @Hojjat-Mokhtarabadi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/842
  • @AndyWolfZwei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/912
  • @ManuelFay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/885
  • @jasonkrone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/916
  • @MicPie made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/937
  • @DaveOkpare made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/839

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.3.0...v0.4.0

Files (1.8 MB)

  • EleutherAI/lm-evaluation-harness-v0.4.0.zip (1.8 MB, md5:1edd09dc16ff3deda2f77b9efdddd670)
