Published August 4, 2025 | Version v0.4.9.1
Software Open

EleutherAI/lm-evaluation-harness: v0.4.9.1

  • 1. Language Technologies Institute, CMU
  • 2. Booz Allen Hamilton, EleutherAI
  • 3. sitebrew.ai
  • 4. Max Planck Institute for Software Systems: MPI SWS
  • 5. MistralAI
  • 6. Hitz Zentroa UPV/EHU
  • 7. @azurro
  • 8. Shinhan Securities Co.
  • 9. Charles University
  • 10. Open Source Developer @ IBM
  • 11. Ivy Natal
  • 12. Platypus Tech

Description

lm-eval v0.4.9.1 Release Notes

This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!

Enhanced Reasoning Model Handling

  • Better support for reasoning models with a think_end_token argument to strip intermediate reasoning from outputs for the hf, vllm, and sglang model backends. A related enable_thinking argument was also added for specific models that support it (e.g., Qwen).
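As a hedged illustration (not taken verbatim from the repo docs), the sketch below passes the new arguments through model_args to the Python API. The argument names think_end_token and enable_thinking come from this release; the model name, token string, and task are placeholders.

```python
# Minimal sketch, assuming a vLLM backend and a Qwen-style reasoning model.
# The model, token value, and task below are placeholders, not canonical values.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": "Qwen/Qwen3-8B",   # placeholder reasoning model
        "enable_thinking": True,         # let the chat template emit a reasoning block
        "think_end_token": "</think>",   # strip everything up to this token from outputs
    },
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```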

New Benchmarks & Tasks

  • EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
  • LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • Multilingual TruthfulQA in Spanish, Basque, and Galician by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
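A new benchmark is run like any other registered task. This is a minimal sketch: the task name "multiblimp" is an assumption and may differ from the registered name (check `lm_eval --tasks list` for the canonical names).

```python
# Minimal sketch: evaluating a small model on one of the newly added benchmarks.
# "multiblimp" is an assumed task/group name; the registered name may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # small placeholder model
    tasks=["multiblimp"],
)
print(results["results"])
```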

Fixes & Improvements

Tasks & Benchmarks:

  • Aligned HumanEval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, and @idantene. (#3201, #3092, #3102)
  • Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
  • Removed redundant "Let's think step by step" text from bbh_cot_fewshot prompts by @philipdoldo. (#3140)
  • Increased max_gen_toks to 2048 for HRM8K math benchmarks by @shing100. (#3124)
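The generation-length change can also be reproduced per run: simple_evaluate accepts a gen_kwargs string that overrides a task's generation settings. A hedged sketch follows; the group name "hrm8k" is an assumption and may not match the registered task names exactly.

```python
# Minimal sketch: overriding max_gen_toks at evaluation time rather than in the task YAML.
# The group name "hrm8k" is an assumption; 2048 mirrors the new default from #3124.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hrm8k"],
    gen_kwargs="max_gen_toks=2048",
)
```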

Backend & Stability:

  • Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
  • Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced a DISABLE_MULTIPROC envar by @ankitgola005 and @neel04 (see the sketch after this list). (#3135, #3106)
  • Added image hashing and an LMEVAL_HASHMM envar by @artemorloff. (#2973)
  • TaskManager: custom include paths now take precedence over the default directory by @parkhs21. (#3068)
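Both environment variables can be set before the harness is imported or invoked. This is a hedged sketch: the accepted values are assumptions; see #3135, #3106, and #2973 for the exact semantics.

```python
# Minimal sketch of the environment toggles referenced above. "1" is an assumed
# truthy value; consult the linked PRs for the exact values each variable accepts.
import os

os.environ["DISABLE_MULTIPROC"] = "1"  # skip mp.Pool in bootstrap_stderr (the reported hang)
os.environ["LMEVAL_HASHMM"] = "1"      # toggle image hashing for multimodal requests

import lm_eval  # import after setting the variables so they take effect

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],
)
```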

Housekeeping:

  • Pinned datasets < 4.0.0 temporarily to maintain compatibility with trust_remote_code by @baberabb. (#3172)
  • Removed Neural Magic models and other unneeded files by @baberabb. (#3112, #3113, #3108)

What's Changed

  • llama3 task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3074
  • Fix Anthropic API compatibility issues in chat completions by @NourFahmy in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
  • Ensure backwards compatibility in fewshot_context by using kwargs by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3079
  • [vllm] remove system message if TemplateError for chat_template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3076
  • feat / fix: Properly make use of subfolder from HF models by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3072
  • [HF] fix quantization config by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3039
  • FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
  • Truthfulqa multi harness by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
  • Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
  • Humaneval - fix regression by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3102
  • Bugfix/hf tokenizer gguf override by @ankush13r in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
  • [FIX] Initial code to disable multi-proc for stderr by @neel04 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
  • fix deps; update hooks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3107
  • delete unneeded files by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3108
  • Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
  • add image hashing and LMEVAL_HASHMM envar by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2973
  • removal of Neural Magic models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3112
  • Neuralmagic by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3113
  • check pil dep when hashing images by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3114
  • warning for "chat" pretrained; disable buggy evalita configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3127
  • fix: remove warning by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3128
  • Adding EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3138
  • Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
  • Fix errors when using vllm with LoRA by @Jacky-MYQ in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
  • truncate thinking tags in generations by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3145
  • bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
  • Fix medical benchmarks import by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
  • fix request hanging when request api by @mmmans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
  • Custom request headers | trust_remote_code param fix by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3069
  • Bugfix: update path for GLUE by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3159
  • Add the MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
  • multiblimp - readme by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3162
  • [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3163
  • Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
  • feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • Added chat_template_args to vllm by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3164
  • Pin datasets < 4.0.0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3172
  • Remove "device" from vllm_causallms.py by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3176
  • remove trust-remote-code in configs; fix escape sequences by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3180
  • Fix vllm test issue that call pop() from None by @weireweire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
  • [hotfix] vllm: pop device from kwargs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3181
  • Update vLLM compatibility by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
  • Fix mmlu_continuation subgroup names to fit Readme and other variants by @lamalunderscore in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
  • Fix humaneval_instruct by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3201
  • Update README.md for mlqa by @newme616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
  • improve include-path precedence handling by @parkhs21 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068
  • Bump version to 0.4.9.1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3208

New Contributors

  • @NourFahmy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
  • @userljz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
  • @BlancaCalvo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
  • @stakodiak made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
  • @ankush13r made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
  • @neel04 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
  • @DebjyotiRay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
  • @houdaipha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
  • @ankitgola005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
  • @Jacky-MYQ made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
  • @philipdoldo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
  • @idantene made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
  • @mmmans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
  • @shing100 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
  • @karimovaSvetlana made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
  • @weireweire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
  • @DarkLight1337 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
  • @lamalunderscore made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
  • @newme616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
  • @parkhs21 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1

Files

EleutherAI/lm-evaluation-harness-v0.4.9.1.zip (9.2 MB, md5:c9963fc62b221f792bfdbdc69681b33f)
