EleutherAI/lm-evaluation-harness: v0.4.12

Lintang Sutawika; Hailey Schoelkopf; Leo Gao; Baber Abbasi; Stella Biderman; Jonathan Tow; ben fattori; Charles Lovering; farzanehnakhaee70; Jason Phang; Anish Thite; Fazz; Aflah; Niklas; Thomas Wang; sdtblck; nopperl; gakada; researcher2; tttyuntian; Julen Etxaniz; James A. Michaelov; Chris; Chessing234; Hanwool Albert Lee; Janna; Leonid Sinev; Khalid; Kiersten Stokes; Zdeněk Kasner

doi:10.5281/zenodo.20122284

Published May 11, 2026 | Version v0.4.12

1. Language Technologies Institute, CMU
2. Booz Allen Hamilton, EleutherAI
3. playscape.gg
4. Max Planck Institute for Software Systems: MPI SWS
5. MistralAI
6. Hitz Zentroa EHU
7. MIT
8. @azurro
9. Shinhan Securities Co.
10. Open Source Developer @ IBM
11. Charles University

New release with four new model backends, tensor-parallel support for transformers-based models (hf), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.

Highlights

New Model Backends

TensorRT-LLM (trt-llm) — NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
Megatron-LM (megatron-lm) — Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
Intel Gaudi — Gaudi support via optimum-habana by @12010486 in #3550
LiteLLM AI gateway (litellm) — Use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
Native Tensor Parallelism for HF backend — multi-GPU TP for transformers models via tp_plan by @YangKai0616 in #3692

`TaskManager` Refactor (#3549)

TaskManager.load(...) returns a flat {tasks, groups} dict instead of the legacy nested {ConfigurableGroup: {name: Task}}. evaluate() accepts both shapes; load_task_or_group(...) and get_task_dict(...) are deprecated shims that return the old shape.
New Group class directly holds its child tasks; ConfigurableGroup is now a deprecated wrapper around it.
Duplicate task/group configs within the same root are skipped with a log message instead of silently overwritten. (Custom include_path entries still override defaults.)
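To make the shape change concrete, here is a minimal sketch in plain Python. It does not call the harness: `flatten_legacy`, the string stand-ins for task objects, and the names `group_a`, `task_x`, etc. are all hypothetical. Only the key names `tasks` and `groups` come from the note above, and the legacy shape is approximated with string keys where the real one used ConfigurableGroup objects.

```python
from typing import Any, Dict, List


def flatten_legacy(nested: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Convert a legacy-style nested mapping into a flat {tasks, groups} dict.

    Hypothetical helper for illustration only; not part of lm_eval.
    """
    flat: Dict[str, Dict[str, Any]] = {"tasks": {}, "groups": {}}

    def walk(node: Dict[str, Any]) -> List[str]:
        children = []
        for name, value in node.items():
            children.append(name)
            if isinstance(value, dict):
                # A dict value marks a group; record the names of its direct children.
                flat["groups"][name] = walk(value)
            else:
                # Anything else is treated as a task object.
                flat["tasks"][name] = value
        return children

    walk(nested)
    return flat


# Legacy-style nesting: groups contain sub-groups and tasks.
legacy = {
    "group_a": {"group_b": {"task_x": "X"}, "task_y": "Y"},
    "task_z": "Z",
}
flat = flatten_legacy(legacy)
print(flat["tasks"])
print(flat["groups"])
```

In the flat form every task is reachable by name at the top level, and group membership is kept separately instead of being encoded in the nesting.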

Breaking Changes

SteeredHF renamed to SteeredModel — update imports if you're using the steering backend by @adrian-sauter in #3592
vLLM minimum bumped to >=0.18 as part of the data-parallel-with-Ray fixes by @baberabb in #3725
enable_thinking is now disallowed for multiple_choice / loglikelihood tasks, and think_end_token is now required when enable_thinking=True; configurations that combined these used to fail silently and now raise an error. By @fxmarty-amd in #3675
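The two enable_thinking rules can be sketched as a small standalone check. This is an illustration of the behavior described above, not the harness's code: `check_thinking_config`, its signature, the error messages, and the example token `</think>` are assumptions.

```python
from typing import Optional

# Output types for which thinking is rejected, per the note above.
NON_GENERATIVE = {"multiple_choice", "loglikelihood"}


def check_thinking_config(
    output_type: str, enable_thinking: bool, think_end_token: Optional[str]
) -> None:
    """Hypothetical validator mirroring the two rules; raises ValueError on violation."""
    if not enable_thinking:
        return
    if output_type in NON_GENERATIVE:
        raise ValueError(
            f"enable_thinking is not allowed for output_type={output_type!r}"
        )
    if not think_end_token:
        raise ValueError("think_end_token is required when enable_thinking=True")


# A generative task with an end-of-thinking token passes the check.
check_thinking_config("generate_until", True, "</think>")

# A multiple-choice task with thinking enabled is rejected.
try:
    check_thinking_config("multiple_choice", True, "</think>")
except ValueError as err:
    print(err)
```

Failing loudly at configuration time replaces the earlier silent failure, so affected configs need to be updated before upgrading.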

New Logger

Trackio logger with per-sample Trace logging by @abidlabs in #3733

New Benchmarks & Tasks

InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570

Fixes & Improvements

Task Fixes

Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
Fixed RACE doc_to_text keeping a blank marker and dropping the question body by @Chessing234 in #3716
Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
Fixed HeadQA doc_to_decontamination_query pointing at a nonexistent query field by @Chessing234 in #3718
Fixed french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent texte field by @Chessing234 in #3719
Fixed TruthfulQA-gen dataset_path by @zhngstl in #3723
Fixed NorEval/NorIdiom !function imports to use absolute module paths by @Anai-Guo in #3731
Fixed IFEval RephraseChecker.strip_changes greedy-regex bug by @Chessing234 in #3737
Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
Updated BLiMP dataset path by @jmichaelov in #3596
Replaced all references to the CohereForAI org with CohereLabs by @juliafalcao in #3631
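Several of the fixes above come down to regex behavior. As a generic illustration of the greedy versus non-greedy distinction behind the IFEval fix, the sketch below uses a made-up `<del>...</del>` marker; it is not the actual pattern used by RephraseChecker.strip_changes.

```python
import re

# Hypothetical input with two marked spans.
text = "keep <del>one</del> middle <del>two</del> end"

# Greedy: ".*" runs to the LAST closing marker and swallows the text in between.
greedy = re.sub(r"<del>.*</del>", "", text)

# Non-greedy: ".*?" stops at the FIRST closing marker, so each span is removed on its own.
lazy = re.sub(r"<del>.*?</del>", "", text)

print(repr(greedy))  # 'keep  end'
print(repr(lazy))    # 'keep  middle  end'
```

With the greedy pattern the word "middle" is lost even though it was never marked, which is the kind of over-matching a non-greedy quantifier avoids.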

What's Changed

refactor(Taskmanager)! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3549
fix(cli): --cache_requests always fails due to argparse type/choices conflict by @maxidl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in https://github.com/EleutherAI/lm-evaluation-harness/pull/3601
[fix] Add missing tokenization progress bar by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3605
fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3607
Update vLLM import of resolve_hf_chat_template by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3595
Add docstring for HFLM init keyword arguments by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
Update all mentions of the CohereForAI organization to CohereLabs by @juliafalcao in https://github.com/EleutherAI/lm-evaluation-harness/pull/3631
Skip caching None responses in async generation path by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3633
Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
fix(evaluate tests) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3634
fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
chore(ci-updates) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3635
Update BLiMP dataset path by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3596
Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
Rename SteeredHF to SteeredModel in lm_eval/models/__init__.py by @adrian-sauter in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
fix: Update WatsonxLLM class mapping and errors by @Rafal-Chrzanowski-IBM in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
Add Intel Gaudi support by @12010486 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
[fix] Disallow enable_thinking with output_type: multiple_choice tasks / loglikelihood tasks; raise error in case think_end_token is not provided with enable_thinking=True by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3675
fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3725
refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3728
fix(vllm): fix vllm tokenizer for Mistral; rm default gpu_memory_utilization=0.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3732
Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3724
fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3737
fix: bound request cache filename length by @princepal9120 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
fix codeowners by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3738
Fix dataset_path for truthfulqa_gen by @zhngstl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
Add Trackio logger with per-sample Trace logging by @abidlabs in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3718
Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3719
fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3711
Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3708
fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3742
Fix zeno_visualize discarding tasks intersection result by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3739
fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3716
Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3702
Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
fix(mmlu_pro_plus): sync fixes from mmlu_pro by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3747
chore: clean up deps; fix ci lint by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3748
Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3714
Fix median aggregation returning arbitrary element instead of median by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3696
fix(api): chat payload leaking top-level text type by @felixmr1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
[BUGFIX] Consistent handling of None answers and cache by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3656
Adding Cruxeval by @ThomasHeap in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
[Task] NEREL-bench by @bond005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3650
Added Toksuite Benchmark by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3669
Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
feat: add TRT-LLM backend. by @Tracin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
[Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
feat(release): 0.4.12 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3763
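For readers wondering how a median aggregation can return "an arbitrary element" (#3696), the sketch below shows one way such a bug can arise: indexing the middle of an unsorted list. The changelog does not say what the original code looked like, so the buggy form here is an assumption; the correct form uses the standard library.

```python
import statistics

# Hypothetical per-sample scores, deliberately unsorted.
scores = [0.9, 0.1, 0.7, 0.5, 0.3]

# Assumed buggy form: the middle position of an unsorted list is not the median.
middle_unsorted = scores[len(scores) // 2]

# Correct form: statistics.median sorts internally and averages the two
# middle values when the number of samples is even.
median = statistics.median(scores)

print(middle_unsorted)  # 0.7
print(median)           # 0.5
print(statistics.median([1, 2, 3, 4]))  # 2.5
```

Any metric that used median aggregation before this fix may be worth re-running, since the reported value could depend on sample order.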

New Contributors

@maxidl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
@shangxiaokang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
@ManasVardhan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
@joshuaswanson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
@RinZ27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
@s-zx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
@ajtgjmdjp made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
@adrian-sauter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
@Rafal-Chrzanowski-IBM made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
@12010486 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
@Chessing234 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
@princepal9120 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
@zhngstl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
@FazeelUsmani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
@abidlabs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
@Anai-Guo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
@jwmacd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
@RheagalFire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
@Robby955 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
@kiwaku made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
@felixmr1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
@ThomasHeap made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
@siddhant-rajhans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
@nevertmr made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
@Tracin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
@YangKai0616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.11...v0.4.12

Files

EleutherAI/lm-evaluation-harness-v0.4.12.zip (10.8 MB, md5:7224b30588305def8be28e57a1493b9a)

Additional details

Is supplement to: Software: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.12 (URL)

Repository URL: https://github.com/EleutherAI/lm-evaluation-harness
