
Published March 5, 2025 | Version v0.4.8

EleutherAI/lm-evaluation-harness: v0.4.8


Description

lm-eval v0.4.8 Release Notes

Key Improvements

  • New Backend Support:

    • Added SGLang as a new evaluation backend (see the usage sketch after this list)!
    • Enabled steering of models with steering vectors loaded via sparsify or sae_lens
  • Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.

  • Added support for gen_prefix in task configs, allowing you to append text after the <|assistant|> token (or at the end of non-chat prompts) - particularly useful for evaluating instruct models. A minimal config sketch follows below.
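As a rough sketch of how the new SGLang backend can be invoked: the backend name `sglang` follows the pull request that added it, while the model, task, and argument values below are placeholders chosen to match the harness's usual CLI conventions, not a canonical command.

```bash
# Hypothetical invocation of the SGLang backend; the pretrained model,
# task, and batch size are placeholders for illustration only.
lm_eval --model sglang \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks gsm8k \
    --batch_size auto
```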
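And a minimal sketch of where gen_prefix might sit in a task YAML: the task and dataset names here are hypothetical, and only the `gen_prefix` key is the feature being illustrated; the surrounding keys follow the harness's standard task-config layout.

```yaml
# Hypothetical task config illustrating gen_prefix; task and dataset
# names are placeholders, not shipped tasks.
task: my_instruct_task
dataset_path: my_org/my_dataset
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"
# Text inserted after the assistant token (or appended to the end of
# the prompt for non-chat models) before generation begins:
gen_prefix: "Let's think step by step."
```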

New Benchmarks & Tasks

Code Evaluation

  • HumanEval by @hjlee1371 in #1992
  • MBPP by @hjlee1371 in #2247
  • HumanEval+ and MBPP+ by @bzantium in #2734

Multilingual Expansion

  • Global Coverage:

    • Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
    • MLQA multilingual question answering by @KahnSvaer in #2622
  • Asian Languages:

    • HRM8K benchmark for Korean and English by @bzantium in #2627
    • Updated KorMedMCQA to version 2.0 by @GyoukChu in #2540
    • Fixed TMLU Taiwan-specific tasks tag by @nike00811 in #2420
  • European Languages:

    • Added Evalita-LLM benchmark by @m-resta in #2681
    • BasqueBench with Basque translations of ARC and PAWS by @naiarapm in #2732
    • Updated Turkish MMLU configuration by @ArdaYueksel in #2678
  • Middle Eastern Languages:

    • Arabic MMLU by @bodasadallah in #2541
    • AraDICE task by @firojalam in #2507

Ethics & Reasoning

  • Moral Stories by @upunaprosk in #2653
  • Histoires Morales by @upunaprosk in #2662

Others

  • MMLU Pro Plus by @asgsaeid in #2366
  • GroundCocoa by @HarshKohli in #2724

We extend our thanks to all contributors who made this release possible and to our users for their continued support and feedback.

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

  • drop python 3.8 support by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2575
  • Add Global MMLU Lite by @shivalika-singh in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
  • add warning for truncation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2585
  • Wandb step handling bugfix and feature by @sjmielke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
  • AraDICE task config file by @firojalam in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
  • fix exact_match low if batch_size > 1 by @sywangyi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2595
  • fix model tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2604
  • update scrolls by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2602
  • some minor logging nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2609
  • Fix gguf loading via Transformers by @CL-ModelCloud in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
  • Fix Zeno visualizer on tasks like GSM8k by @pasky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
  • Fix the format of mgsm zh and ja. by @timturing in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
  • Add HumanEval by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
  • Add MBPP by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2247
  • Add MLQA by @KahnSvaer in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
  • assistant prefill by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2615
  • fix gen_prefix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2630
  • update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2632
  • add hrm8k benchmark for both Korean and English by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
  • New arabicmmlu by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
  • Add global_mmlu full version by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2636
  • Update KorMedMCQA: ver 2.0 by @GyoukChu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
  • fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
  • fixed mmlu generative response extraction by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
  • revise mbpp prompt by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2645
  • aggregate by group (total and categories) by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2643
  • Fix max_tokens handling in vllm_vlms.py by @jkaniecki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
  • separate category for global_mmlu by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2652
  • Add Moral Stories by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
  • add TransformerLens example by @nickypro in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
  • fix multiple input chat template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2576
  • Add Aggregation for Kobest Benchmark by @tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/2446
  • update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2660
  • remove group from bigbench task configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2663
  • Add Histoires Morales task by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2662
  • MMLU Pro Plus by @asgsaeid in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
  • fix early return for multiple dict in task process_results by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2673
  • Turkish mmlu Config Update by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2678
  • Fix typos by @omahs in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
  • remove cuda device assertion by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2680
  • Adding the Evalita-LLM benchmark by @m-resta in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
  • Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2687
  • Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2684
  • change ensure_ascii to False for JsonChatStr by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2691
  • Set defaults for BLiMP scores by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/2692
  • Update remaining references to assistant_prefill in docs to gen_prefix by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2683
  • Update README.md by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2694
  • fix construct_requests kwargs in python tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2700
  • arithmetic: set target delimiter to empty string by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2701
  • fix vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2708
  • add math_verify to some tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2686
  • Logging by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2203
  • Replace missing lighteval/MATH-Hard dataset with DigitalLearningGmbH/MATH-lighteval by @f4str in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
  • remove unused import by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2728
  • README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2729
  • add o3-mini support by @HelloJocelynLu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
  • add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2732
  • Add cocoteros_es task in spanish_bench by @sgs97ua in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
  • Fix the import source for eval_logger by @kailashbuki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
  • add humaneval+ and mbpp+ by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2734
  • Support SGLang as Potential Backend for Evaluation by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
  • fix log condition on main by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2737
  • fix vllm data parallel by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2746
  • [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2738
  • Groundcocoa by @HarshKohli in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
  • fix doc: generate_until only outputs the generated text! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2755
  • Enable steering HF models by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749
  • Add test for a simple Unitxt task by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2742
  • add debug log by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2757
  • increment version to 0.4.8 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2760

New Contributors

  • @shivalika-singh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
  • @sjmielke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
  • @firojalam made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
  • @CL-ModelCloud made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
  • @pasky made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
  • @timturing made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
  • @hjlee1371 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
  • @KahnSvaer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
  • @bzantium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
  • @bodasadallah made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
  • @GyoukChu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
  • @nike00811 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
  • @RawthiL made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
  • @jkaniecki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
  • @upunaprosk made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
  • @nickypro made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
  • @asgsaeid made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
  • @omahs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
  • @m-resta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
  • @f4str made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
  • @HelloJocelynLu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
  • @sgs97ua made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
  • @kailashbuki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
  • @Monstertail made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
  • @HarshKohli made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
  • @luciaquirke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.7...v0.4.8

Files

  • EleutherAI/lm-evaluation-harness-v0.4.8.zip (5.1 MB, md5:cb854773e84e40e1f342a1d14b8518c6)
