Published May 11, 2026 | Version v0.4.12
Software
Open
EleutherAI/lm-evaluation-harness: v0.4.12
Authors/Creators
- Lintang Sutawika (1)
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman (2)
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite (3)
- Fazz
- Aflah (4)
- Niklas
- Thomas Wang (5)
- sdtblck
- nopperl
- gakada
- researcher2
- tttyuntian
- Julen Etxaniz (6)
- James A. Michaelov (7)
- Chris (8)
- Chessing234
- Hanwool Albert Lee (9)
- Janna
- Leonid Sinev
- Khalid
- Kiersten Stokes (10)
- Zdeněk Kasner (11)

1. Language Technologies Institute, CMU
2. Booz Allen Hamilton, EleutherAI
3. playscape.gg
4. Max Planck Institute for Software Systems: MPI SWS
5. MistralAI
6. Hitz Zentroa EHU
7. MIT
8. @azurro
9. Shinhan Securities Co.
10. Open Source Developer @ IBM
11. Charles University
Description
New release with four new model backends, tensor-parallel support for transformers-based models (`hf`), new benchmarks, a TaskManager refactor, and a long tail of task correctness fixes.
Highlights
New Model Backends
- TensorRT-LLM (`trt-llm`): NVIDIA TensorRT-LLM backend for optimized GPU inference by @Tracin in #3628
- Megatron-LM (`megatron-lm`): Megatron-LM backend with TP/EP/DP support by @shangxiaokang in #3521 (with follow-up hardening in #3607)
- Intel Gaudi: Gaudi support via `optimum-habana` by @12010486 in #3550
- LiteLLM AI gateway (`litellm`): use LiteLLM as a unified API gateway for 100+ providers by @RheagalFire in #3721
- Native Tensor Parallelism for HF backend: multi-GPU TP for `transformers` models via `tp_plan` by @YangKai0616 in #3692
TaskManager Refactor (#3549)
- `TaskManager.load(...)` returns a flat `{tasks, groups}` dict instead of the legacy nested `{ConfigurableGroup: {name: Task}}`. `evaluate()` accepts both shapes; `load_task_or_group(...)` and `get_task_dict(...)` are deprecated shims that return the old shape.
- New `Group` class directly holds its child tasks; `ConfigurableGroup` is now a deprecated wrapper around it.
- Duplicate task/group configs within the same root are skipped with a log message instead of silently overwritten. (Custom `include_path` entries still override defaults.)
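To make the shape change concrete, here is a minimal sketch that converts a legacy-style nested mapping into a flat one. It is an illustration only: plain strings stand in for the library's `Task` and `Group` objects, the helper name `flatten_legacy` is hypothetical, and the exact layout of the values under `tasks` and `groups` is an assumption rather than something the release notes specify.

```python
from typing import Any, Dict, List


def flatten_legacy(nested: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Turn {group: {task_name: task}} into {"tasks": {...}, "groups": {...}}.

    Hypothetical helper: the real library returns its own Task/Group objects.
    """
    tasks: Dict[str, Any] = {}
    groups: Dict[str, List[str]] = {}
    for group_name, children in nested.items():
        # Record which task names belong to the group (assumed layout).
        groups[group_name] = sorted(children)
        # Hoist every child task to the top-level task mapping.
        tasks.update(children)
    return {"tasks": tasks, "groups": groups}


# Strings stand in for real task objects.
legacy = {"my_group": {"task_a": "TaskA()", "task_b": "TaskB()"}}
flat = flatten_legacy(legacy)
print(flat["tasks"])
print(flat["groups"])
```

Code that walks the old nested shape would need an adjustment along these lines, or can keep using the deprecated shims until they are removed.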
Breaking Changes
- `SteeredHF` renamed to `SteeredModel`; update imports if you're using the steering backend, by @adrian-sauter in #3592
- vLLM minimum bumped to `>=0.18` as part of the data-parallel-with-Ray fixes by @baberabb in #3725
- `enable_thinking` is now disallowed for `multiple_choice` / loglikelihood tasks, and `think_end_token` is now required when `enable_thinking=True` (configurations that combined these previously failed silently) by @fxmarty-amd in #3675
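The `enable_thinking` rules above can be sketched as a small standalone check. This is not the library's implementation: the function name `validate_thinking_config`, the exact set of output types treated as loglikelihood-style, and the error messages are all assumptions made for illustration.

```python
from typing import Optional

# Assumption: these output types count as "loglikelihood-style" for this sketch.
LOGLIKELIHOOD_TYPES = {"multiple_choice", "loglikelihood", "loglikelihood_rolling"}


def validate_thinking_config(
    output_type: str,
    enable_thinking: bool,
    think_end_token: Optional[str] = None,
) -> None:
    """Raise ValueError for combinations the release notes describe as invalid."""
    if not enable_thinking:
        return  # Nothing to check when thinking is off.
    if output_type in LOGLIKELIHOOD_TYPES:
        raise ValueError(
            f"enable_thinking is not allowed for output_type={output_type!r}"
        )
    if not think_end_token:
        raise ValueError("think_end_token is required when enable_thinking=True")


# A generation task with an explicit end-of-thinking marker passes.
validate_thinking_config("generate_until", True, think_end_token="</think>")

# A multiple-choice task with thinking enabled is rejected.
try:
    validate_thinking_config("multiple_choice", True, think_end_token="</think>")
except ValueError as err:
    print(err)
```

Failing loudly at configuration time is the point of the change: the same combinations used to run and fail silently.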
New Logger
- Trackio logger with per-sample `Trace` logging by @abidlabs in #3733
New Benchmarks & Tasks
- InfiniteBench — long-context evaluation beyond 100K tokens (12 sub-tasks: code debug/run, KV retrieval, longbook QA/summarization, math find, passkey, etc.) by @siddhant-rajhans in #3662
- CRUXEval — Python code reasoning benchmark with input/output prediction variants (incl. CoT and pass@k variants) by @ThomasHeap in #3699
- Toksuite — multilingual tokenization-robustness benchmark (Chinese, English, and more) by @gsaltintas in #3669
- NEREL-bench — Russian named-entity / relation-extraction benchmark by @bond005 in #3650
- JFinQA — Japanese Financial Numerical Reasoning QA (1000 questions, with consistency / numerical / temporal splits) by @ajtgjmdjp in #3570
Fixes & Improvements
Task Fixes
- Fixed GPQA preprocessing regex that corrupted answer text containing brackets by @Robby955 in #3691 and @Chessing234 in #3735
- Fixed MMLU-Pro and MMLU-Pro-Plus few-shot answers leaking into the user role under chat templates by @kiwaku in #3693, #3747
- Fixed RACE `doc_to_text` keeping a blank marker and dropping the question body by @Chessing234 in #3716
- Fixed BigBench multiple-choice tasks crashing on mixed-format examples (filtered out free-form examples) by @Chessing234 in #3702
- Fixed HeadQA `doc_to_decontamination_query` pointing at a nonexistent `query` field by @Chessing234 in #3718
- Fixed french_bench_topic_based_nli `doc_to_decontamination_query` pointing at a nonexistent `texte` field by @Chessing234 in #3719
- Fixed TruthfulQA-gen `dataset_path` by @zhngstl in #3723
- Fixed NorEval/NorIdiom `!function` imports to use absolute module paths by @Anai-Guo in #3731
- Fixed IFEval `RephraseChecker.strip_changes` greedy-regex bug by @Chessing234 in #3737
- Fixed correctness issues in Arabic normalization and prompt loading by @RinZ27 in #3589
- Updated BLiMP dataset path by @jmichaelov in #3596
- Replaced all references to the `CohereForAI` org with `CohereLabs` by @juliafalcao in #3631
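Several of the fixes above come down to regex behavior. As a generic illustration of the greedy-versus-non-greedy class of bug mentioned for IFEval, the sketch below strips `*...*` spans from a string. The pattern and input are assumptions chosen for clarity; the actual pattern in `RephraseChecker.strip_changes` is not shown in these notes.

```python
import re

text = "keep *first* middle *second* end"

# Greedy: `.*` runs to the LAST asterisk, swallowing the text between spans.
greedy = re.sub(r"\*.*\*", "", text)

# Non-greedy: `.*?` stops at the NEXT asterisk, removing each span separately.
lazy = re.sub(r"\*.*?\*", "", text)

print(repr(greedy))  # 'keep  end'
print(repr(lazy))    # 'keep  middle  end'
```

The greedy version silently drops `middle`, which is the kind of data loss that is hard to spot in aggregate scores.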
What's Changed
- refactor(Taskmanager)! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3549
- fix(cli): `--cache_requests` always fails due to argparse `type`/`choices` conflict by @maxidl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
- feat: Add Megatron-LM backend with TP/EP/DP support by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
- Fix: #3293 (pybass UnboundLocalError on outputs in Exception Logging) by @lucafossen in https://github.com/EleutherAI/lm-evaluation-harness/pull/3601
- [fix] Add missing tokenization progress bar by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3605
- fix: improve model_args type coercion in handle_arg_string by @ManasVardhan in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
- fix: harden Megatron GPT layer spec setup for eval by @shangxiaokang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3607
- Update vLLM import of `resolve_hf_chat_template` by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3595
- Add docstring for HFLM init keyword arguments by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
- Update all mentions of the `CohereForAI` organization to `CohereLabs` by @juliafalcao in https://github.com/EleutherAI/lm-evaluation-harness/pull/3631
- Skip caching None responses in async generation path by @joshuaswanson in https://github.com/EleutherAI/lm-evaluation-harness/pull/3633
- Fix correctness issues in Arabic normalization and prompt loading by @RinZ27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
- fix(evaluate tests) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3634
- fix: propagate custom aggregation to dict-valued metric result keys by @s-zx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
- chore(ci-updates) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3635
- Update BLiMP dataset path by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3596
- Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by @ajtgjmdjp in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
- Rename SteeredHF to SteeredModel in `lm_eval/models/__init__.py` by @adrian-sauter in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
- fix: Update `WatsonxLLM` class mapping and errors by @Rafal-Chrzanowski-IBM in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
- Add Intel Gaudi support by @12010486 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
- [fix] Disallow `enable_thinking` with `output_type: multiple_choice` tasks / loglikelihood tasks; raise error in case `think_end_token` is not provided with `enable_thinking=True` by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3675
- fix(vllm): fix dp with ray. remove mp distribution; pin vllm >=0.18 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3725
- refactor(utils): fix mistral tokenizer error; improve doc-strings by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3728
- fix(vllm): fix vllm tokenizer for Mistral; rm default `gpu_memory_utilization=0.9` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3732
- Fix GPQA preprocess stripping mathematical bracket expressions by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
- Guard vLLM tok_encode against prefix_token_id being None by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3724
- fix(ifeval): use non-greedy regex in RephraseChecker.strip_changes by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3737
- fix: bound request cache filename length by @princepal9120 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
- fix codeowners by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3738
- Fix dataset_path for truthfulqa_gen by @zhngstl in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
- fix(vllm): disallow data_parallel with enable_expert_parallel by @FazeelUsmani in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
- Add Trackio logger with per-sample Trace logging by @abidlabs in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
- Fix headqa doc_to_decontamination_query pointing at nonexistent 'query' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3718
- Fix french_bench_topic_based_nli doc_to_decontamination_query pointing at nonexistent 'texte' field by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3719
- fix(noreval/noridiom): use absolute module paths for !function imports (#3624) by @Anai-Guo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
- Fix DummyLM.generate_until printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3711
- Fix MultiChoiceRegexFilter.find_match IndexError on all-empty capture groups by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3708
- fix(model_comparator): fix ImportError from scipy.stats.norm import by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3742
- Fix zeno_visualize discarding tasks intersection result by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3739
- fix: don't pass task stop sequences to vLLM for reasoning models by @jwmacd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
- feat: Add [ LiteLLM AI gateway ] as model backend by @RheagalFire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
- Fix RACE doc_to_text keeping blank marker and dropping the question body by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3716
- Fix BigBench multiple-choice crash on mixed-format tasks by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3702
- Fix GPQA preprocessing: remove bracket-stripping regex that corrupts answer text by @Robby955 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
- Fix mmlu_pro fewshot answers leaking into user role under chat template by @kiwaku in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
- fix(mmlu_pro_plus): sync fixes from `mmlu_pro` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3747
- chore: cleap up deps; fix ci lint by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3748
- Fix DummyLM.generate_until write_out printing context as gen_kwargs by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3714
- Fix median aggregation returning arbitrary element instead of median by @Chessing234 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3696
- fix(api): chat payload leaking top-level text type by @felixmr1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
- [BUGFIX] Consistent handling of None answers and cache by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3656
- Adding Cruxeval by @ThomasHeap in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
- [Task] NEREL-bench by @bond005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3650
- Added Toksuite Benchmark by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3669
- Add InfiniteBench: long-context evaluation beyond 100K tokens by @siddhant-rajhans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
- fix: Reset batch_sizes cache before each _loglikelihood_tokens call by @nevertmr in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
- feat: add TRT-LLM backend. by @Tracin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
- [Feat] Add native Tensor Parallelism support for HF backend by @YangKai0616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
- feat(release): 0.4.12 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3763
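One entry in the list above (#3696) concerns median aggregation returning an arbitrary element. The sketch below contrasts a correct median with one plausible way such a bug can arise, namely indexing the middle of an unsorted list. Both function names are hypothetical, and the buggy variant is an illustrative guess, not the code that was actually fixed.

```python
import statistics
from typing import Sequence


def unsorted_middle(values: Sequence[float]) -> float:
    """Hypothetical buggy variant: picks the middle slot without sorting."""
    return values[len(values) // 2]


def median_aggregate(values: Sequence[float]) -> float:
    """Correct median: sorts internally and averages the two middle values
    when the number of items is even."""
    if not values:
        raise ValueError("cannot take the median of an empty sequence")
    return statistics.median(values)


scores = [1.0, 0.0, 0.5, 0.25]
print(unsorted_middle(scores))   # 0.5, depends on input order
print(median_aggregate(scores))  # 0.375, order-independent
```

Because per-sample scores arrive in dataset order, an order-dependent aggregate can change when the data is shuffled, while the true median cannot.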
New Contributors
- @maxidl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3588
- @shangxiaokang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3521
- @ManasVardhan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3608
- @joshuaswanson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3630
- @RinZ27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3589
- @s-zx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3626
- @ajtgjmdjp made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3570
- @adrian-sauter made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3592
- @Rafal-Chrzanowski-IBM made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3591
- @12010486 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3550
- @Chessing234 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3735
- @princepal9120 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3729
- @zhngstl made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3723
- @FazeelUsmani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3734
- @abidlabs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3733
- @Anai-Guo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3731
- @jwmacd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3700
- @RheagalFire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3721
- @Robby955 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3691
- @kiwaku made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3693
- @felixmr1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3745
- @ThomasHeap made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3699
- @siddhant-rajhans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3662
- @nevertmr made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3654
- @Tracin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3628
- @YangKai0616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3692
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.11...v0.4.12
Files
| Name | Size |
|---|---|
| EleutherAI/lm-evaluation-harness-v0.4.12.zip (md5:7224b30588305def8be28e57a1493b9a) | 10.8 MB |
Additional details
Related works
- Is supplement to
- Software: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.12 (URL)
Software
- Repository URL
- https://github.com/EleutherAI/lm-evaluation-harness