Published August 4, 2025 | Version v0.4.9.1
Software Open

EleutherAI/lm-evaluation-harness: v0.4.9.1

  • 1. Language Technologies Institute, CMU
  • 2. Booz Allen Hamilton, EleutherAI
  • 3. sitebrew.ai
  • 4. Max Planck Institute for Software Systems: MPI SWS
  • 5. MistralAI
  • 6. Hitz Zentroa UPV/EHU
  • 7. @azurro
  • 8. Shinhan Securities Co.
  • 9. Charles University
  • 10. Open Source Developer @ IBM
  • 11. Ivy Natal
  • 12. Platypus Tech

Description

lm-eval v0.4.9.1 Release Notes

This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!

Enhanced Reasoning Model Handling

  • Better support for reasoning models with a think_end_token argument to strip intermediate reasoning from outputs for the hf, vllm, and sglang model backends. A related enable_thinking argument was also added for specific models that support it (e.g., Qwen).
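As a hedged illustration (not taken verbatim from the repo docs), the sketch below passes the new arguments through model_args to the Python API. The argument names think_end_token and enable_thinking come from this release; the model name, token string, and task are placeholders.

```python
# Minimal sketch, assuming a vLLM backend and a Qwen-style reasoning model.
# The model, token value, and task below are placeholders, not canonical values.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": "Qwen/Qwen3-8B",   # placeholder reasoning model
        "enable_thinking": True,         # let the chat template emit a reasoning block
        "think_end_token": "</think>",   # strip everything up to this token from outputs
    },
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```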

New Benchmarks & Tasks

  • EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
  • LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • Multilingual TruthfulQA in Spanish, Basque, and Galician by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
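A new benchmark is run like any other registered task. This is a minimal sketch: the task name "multiblimp" is an assumption and may differ from the registered name (check `lm_eval --tasks list` for the canonical names).

```python
# Minimal sketch: evaluating a small model on one of the newly added benchmarks.
# "multiblimp" is an assumed task/group name; the registered name may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # small placeholder model
    tasks=["multiblimp"],
)
print(results["results"])
```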

Fixes & Improvements

Tasks & Benchmarks:

  • Aligned HumanEval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, and @idantene. (#3201, #3092, #3102)
  • Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
  • Removed redundant "Let's think step by step" text from bbh_cot_fewshot prompts by @philipdoldo. (#3140)
  • Increased max_gen_toks to 2048 for HRM8K math benchmarks by @shing100. (#3124)
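The generation-length change can also be reproduced per run: simple_evaluate accepts a gen_kwargs string that overrides a task's generation settings. A hedged sketch follows; the group name "hrm8k" is an assumption and may not match the registered task names exactly.

```python
# Minimal sketch: overriding max_gen_toks at evaluation time rather than in the task YAML.
# The group name "hrm8k" is an assumption; 2048 mirrors the new default from #3124.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hrm8k"],
    gen_kwargs="max_gen_toks=2048",
)
```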

Backend & Stability:

  • Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
  • Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced a DISABLE_MULTIPROC envar by @ankitgola005 and @neel04 (see the sketch after this list). (#3135, #3106)
  • Added image hashing and an LMEVAL_HASHMM envar by @artemorloff. (#2973)
  • TaskManager: custom include paths now take precedence over the default directory by @parkhs21. (#3068)
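Both environment variables can be set before the harness is imported or invoked. This is a hedged sketch: the accepted values are assumptions; see #3135, #3106, and #2973 for the exact semantics.

```python
# Minimal sketch of the environment toggles referenced above. "1" is an assumed
# truthy value; consult the linked PRs for the exact values each variable accepts.
import os

os.environ["DISABLE_MULTIPROC"] = "1"  # skip mp.Pool in bootstrap_stderr (the reported hang)
os.environ["LMEVAL_HASHMM"] = "1"      # toggle image hashing for multimodal requests

import lm_eval  # import after setting the variables so they take effect

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],
)
```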

Housekeeping:

  • Pinned datasets < 4.0.0 temporarily to maintain compatibility with trust_remote_code by @baberabb. (#3172)
  • Removed Neural Magic models and other unneeded files by @baberabb. (#3112, #3113, #3108)

What's Changed

  • llama3 task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3074
  • Fix Anthropic API compatibility issues in chat completions by @NourFahmy in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
  • Ensure backwards compatibility in fewshot_context by using kwargs by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3079
  • [vllm] remove system message if TemplateError for chat_template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3076
  • feat / fix: Properly make use of subfolder from HF models by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3072
  • [HF] fix quantization config by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3039
  • FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
  • Truthfulqa multi harness by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
  • Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
  • Humaneval - fix regression by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3102
  • Bugfix/hf tokenizer gguf override by @ankush13r in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
  • [FIX] Initial code to disable multi-proc for stderr by @neel04 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
  • fix deps; update hooks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3107
  • delete unneeded files by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3108
  • Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
  • add image hashing and LMEVAL_HASHMM envar by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2973
  • removal of Neural Magic models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3112
  • Neuralmagic by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3113
  • check pil dep when hashing images by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3114
  • warning for "chat" pretrained; disable buggy evalita configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3127
  • fix: remove warning by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3128
  • Adding EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3138
  • Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
  • Fix errors when using vllm with LoRA by @Jacky-MYQ in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
  • truncate thinking tags in generations by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3145
  • bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
  • Fix medical benchmarks import by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
  • fix request hanging when request api by @mmmans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
  • Custom request headers | trust_remote_code param fix by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3069
  • Bugfix: update path for GLUE by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3159
  • Add the MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
  • multiblimp - readme by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3162
  • [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3163
  • Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
  • feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • Added chat_template_args to vllm by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3164
  • Pin datasets < 4.0.0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3172
  • Remove "device" from vllm_causallms.py by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3176
  • remove trust-remote-code in configs; fix escape sequences by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3180
  • Fix vllm test issue that call pop() from None by @weireweire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
  • [hotfix] vllm: pop device from kwargs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3181
  • Update vLLM compatibility by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
  • Fix mmlu_continuation subgroup names to fit Readme and other variants by @lamalunderscore in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
  • Fix humaneval_instruct by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3201
  • Update README.md for mlqa by @newme616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
  • improve include-path precedence handling by @parkhs21 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068
  • Bump version to 0.4.9.1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3208

New Contributors

  • @NourFahmy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
  • @userljz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
  • @BlancaCalvo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
  • @stakodiak made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
  • @ankush13r made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
  • @neel04 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
  • @DebjyotiRay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
  • @houdaipha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • @ankitgola005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
  • @Jacky-MYQ made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
  • @philipdoldo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
  • @idantene made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
  • @mmmans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
  • @shing100 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
  • @karimovaSvetlana made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • @weireweire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
  • @DarkLight1337 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
  • @lamalunderscore made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
  • @newme616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
  • @parkhs21 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1

Files

EleutherAI/lm-evaluation-harness-v0.4.9.1.zip (9.2 MB, md5:c9963fc62b221f792bfdbdc69681b33f)
