EleutherAI/lm-evaluation-harness: v0.4.8
Creators
- Lintang Sutawika [1]
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman [2]
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite [3]
- Fazz
- Aflah [4]
- Niklas Muennighoff
- Thomas Wang [5]
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz [6]
- Chris [7]
- Hanwool Albert Lee [8]
- Leonid Sinev
- Zdeněk Kasner [9]
- Khalid
- KonradSzafer
- Jeffrey Hsu [10]
- Anjor Kanekar [11]
- Pawan Sasanka Ammanamanchi
- Pawan Sasanka Ammanamanchi
- 1. @EleutherAI
- 2. Booz Allen Hamilton, EleutherAI
- 3. sitebrew.ai
- 4. Max Planck Institute for Software Systems: MPI SWS
- 5. MistralAI
- 6. Hitz Zentroa UPV/EHU
- 7. @azurro
- 8. Shinhan Securities Co.
- 9. Charles University
- 10. Ivy Natal
- 11. Platypus Tech
Description
lm-eval v0.4.8 Release Notes
Key Improvements
New Backend Support:
- Added SGLang as a new evaluation backend (see the usage sketch below this list)!
- Enabled model steering with steering vectors via `sparsify` or `sae_lens`.
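A minimal usage sketch of the new backend through the Python API, assuming the SGLang backend is registered under the model type name `sglang` and takes a `pretrained` entry in `model_args` like the existing HF and vLLM backends; the exact argument names should be checked against the README.

```python
# Minimal sketch: run a task through the new SGLang backend via the Python API.
# Assumption: the backend is registered as "sglang" and accepts a `pretrained`
# model argument like the HF and vLLM backends; verify against the README.
import lm_eval

results = lm_eval.simple_evaluate(
    model="sglang",
    model_args="pretrained=Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```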
Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.
Added support for `gen_prefix` in task configs, allowing you to append text after the <|assistant|> token (or at the end of non-chat prompts). This is particularly effective for evaluating instruct models.
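As an illustration, here is a hedged sketch of where `gen_prefix` could sit in a task YAML; `gen_prefix` is the only key new in this release, the remaining fields are standard placeholders, and the exact behavior should be verified against the task configuration guide.

```yaml
# Hypothetical task config sketch: only gen_prefix is new in this release;
# the other fields are illustrative placeholders.
task: my_instruct_task
dataset_path: my_org/my_dataset
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"
# Text placed after the <|assistant|> token (or appended to a non-chat prompt)
# before the model begins generating:
gen_prefix: "Let's think step by step."
```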
New Benchmarks & Tasks
Code Evaluation
- HumanEval by @hjlee1371 in #1992
- MBPP by @hjlee1371 in #2247
- HumanEval+ and MBPP+ by @bzantium in #2734
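As a usage sketch for the new code-generation tasks: the task names below (`humaneval`, `mbpp`) follow the harness naming convention but should be confirmed with `lm_eval --tasks list`, and code-execution benchmarks ask for an explicit opt-in before running model-generated code (the `confirm_run_unsafe_code` switch, assuming your install exposes it).

```python
# Sketch: evaluating the newly added code-generation tasks via the Python API.
# Assumptions: the task names (humaneval, mbpp), the example model, and the
# confirm_run_unsafe_code switch; confirm against `lm_eval --tasks list`
# and the task READMEs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoder2-3b",  # illustrative model
    tasks=["humaneval", "mbpp"],
    # Code benchmarks execute generated programs, so the harness requires an
    # explicit opt-in before running untrusted code:
    confirm_run_unsafe_code=True,
)
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```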
Multilingual Expansion
Global Coverage:
- Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
- MLQA multilingual question answering by @KahnSvaer in #2622
Asian Languages:
- HRM8K benchmark for Korean and English by @bzantium in #2627
- Updated KorMedMCQA to version 2.0 by @GyoukChu in #2540
- Fixed TMLU Taiwan-specific tasks tag by @nike00811 in #2420
European Languages:
- Added Evalita-LLM benchmark by @m-resta in #2681
- BasqueBench with Basque translations of ARC and PAWS by @naiarapm in #2732
- Updated Turkish MMLU configuration by @ArdaYueksel in #2678
Middle Eastern Languages:
- Arabic MMLU by @bodasadallah in #2541
- AraDICE task by @firojalam in #2507
Ethics & Reasoning
- Moral Stories by @upunaprosk in #2653
- Histoires Morales by @upunaprosk in #2662
Others
- MMLU Pro Plus by @asgsaeid in #2366
- GroundCocoa by @HarshKohli in #2724
We extend our thanks to all contributors who made this release possible, and to our users for their continued support and feedback.
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- drop python 3.8 support by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2575
- Add Global MMLU Lite by @shivalika-singh in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
- add warning for truncation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2585
- Wandb step handling bugfix and feature by @sjmielke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
- AraDICE task config file by @firojalam in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
- fix extra_match low if batch_size > 1 by @sywangyi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2595
- fix model tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2604
- update scrolls by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2602
- some minor logging nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2609
- Fix gguf loading via Transformers by @CL-ModelCloud in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
- Fix Zeno visualizer on tasks like GSM8k by @pasky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
- Fix the format of mgsm zh and ja. by @timturing in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
- Add HumanEval by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
- Add MBPP by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2247
- Add MLQA by @KahnSvaer in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
- assistant prefill by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2615
- fix gen_prefix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2630
- update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2632
- add hrm8k benchmark for both Korean and English by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
- New arabicmmlu by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
- Add `global_mmlu` full version by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2636
- Update KorMedMCQA: ver 2.0 by @GyoukChu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
- fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
- fixed mmlu generative response extraction by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
- revise mbpp prompt by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2645
- aggregate by group (total and categories) by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2643
- Fix max_tokens handling in vllm_vlms.py by @jkaniecki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
- separate category for `global_mmlu` by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2652
- Add Moral Stories by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
- add TransformerLens example by @nickypro in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
- fix multiple input chat template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2576
- Add Aggregation for Kobest Benchmark by @tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/2446
- update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2660
- remove `group` from bigbench task configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2663
- Add Histoires Morales task by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2662
- MMLU Pro Plus by @asgsaeid in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
- fix early return for multiple dict in task process_results by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2673
- Turkish mmlu Config Update by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2678
- Fix typos by @omahs in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
- remove cuda device assertion by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2680
- Adding the Evalita-LLM benchmark by @m-resta in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
- Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2687
- Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2684
- change ensure_ascii to False for JsonChatStr by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2691
- Set defaults for BLiMP scores by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/2692
- Update remaining references to `assistant_prefill` in docs to `gen_prefix` by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2683
- Update README.md by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2694
- fix `construct_requests` kwargs in python tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2700
- `arithmetic`: set target delimiter to empty string by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2701
- fix vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2708
- add math_verify to some tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2686
- Logging by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2203
- Replace missing `lighteval/MATH-Hard` dataset with `DigitalLearningGmbH/MATH-lighteval` by @f4str in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
- remove unused import by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2728
- README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2729
- add o3-mini support by @HelloJocelynLu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
- add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2732
- Add cocoteros_es task in spanish_bench by @sgs97ua in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
- Fix the import source for eval_logger by @kailashbuki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
- add humaneval+ and mbpp+ by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2734
- Support SGLang as Potential Backend for Evaluation by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
- fix log condition on main by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2737
- fix vllm data parallel by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2746
- [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2738
- Groundcocoa by @HarshKohli in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
- fix doc: generate_until only outputs the generated text! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2755
- Enable steering HF models by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749
- Add test for a simple Unitxt task by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2742
- add debug log by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2757
- increment version to 0.4.8 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2760
New Contributors
- @shivalika-singh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
- @sjmielke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
- @firojalam made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
- @CL-ModelCloud made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
- @pasky made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
- @timturing made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
- @hjlee1371 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
- @KahnSvaer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
- @bzantium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
- @bodasadallah made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
- @GyoukChu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
- @nike00811 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
- @RawthiL made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
- @jkaniecki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
- @upunaprosk made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
- @nickypro made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
- @asgsaeid made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
- @omahs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
- @m-resta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
- @f4str made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
- @HelloJocelynLu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
- @sgs97ua made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
- @kailashbuki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
- @Monstertail made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
- @HarshKohli made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
- @luciaquirke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.7...v0.4.8
Files (5.1 MB)
- EleutherAI/lm-evaluation-harness-v0.4.8.zip (5.1 MB, md5:cb854773e84e40e1f342a1d14b8518c6)
Additional details
Related works
- Is supplement to: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.8 (Software)
Software
- Repository URL: https://github.com/EleutherAI/lm-evaluation-harness