Published August 4, 2025 | Version v0.4.9.1
Software | Open
EleutherAI/lm-evaluation-harness: v0.4.9.1
Creators
- Lintang Sutawika1
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman2
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite3
- Fazz
- Aflah4
- Niklas
- Thomas Wang5
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz6
- Chris7
- Hanwool Albert Lee8
- Leonid Sinev
- Zdeněk Kasner9
- Kiersten Stokes10
- Khalid
- KonradSzafer
- Jeffrey Hsu11
- Anjor Kanekar12
- 1. Language Technologies Institute, CMU
- 2. Booz Allen Hamilton, EleutherAI
- 3. sitebrew.ai
- 4. Max Planck Institute for Software Systems: MPI SWS
- 5. MistralAI
- 6. Hitz Zentroa UPV/EHU
- 7. @azurro
- 8. Shinhan Securities Co.
- 9. Charles University
- 10. Open Source Developer @ IBM
- 11. Ivy Natal
- 12. Platypus Tech
Description
lm-eval v0.4.9.1 Release Notes
This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!
Enhanced Reasoning Model Handling
- Better support for reasoning models with a `think_end_token` argument to strip intermediate reasoning from outputs for the `hf`, `vllm`, and `sglang` model backends. A related `enable_thinking` argument was also added for specific models that support it (e.g., Qwen).
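As an illustration, here is a minimal sketch of how these arguments might be passed through the harness's Python API, assuming dict-style `model_args`; the checkpoint name and the delimiter token below are placeholders for illustration, not defaults shipped with this release.

```python
import lm_eval

# Hedged sketch: strip intermediate reasoning from generations and opt in to
# "thinking" mode for a model that supports it. The checkpoint and the exact
# end-of-thought delimiter are assumptions, not values mandated by the release.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args={
        "pretrained": "Qwen/Qwen3-8B",   # assumed reasoning-capable model
        "think_end_token": "</think>",   # text up to this token is stripped from outputs
        "enable_thinking": True,         # only honored by models that support it (e.g., Qwen)
    },
    tasks=["gsm8k"],
)
```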
New Benchmarks & Tasks
- EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
- MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
- LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
- Multilingual Truthfulqa in Spanish, Basque and Galician by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
Fixes & Improvements
Tasks & Benchmarks:
- Aligned HumanEval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, and @idantene. (#3201, #3092, #3102)
- Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
- Removed redundant "Let's think step by step" text from `bbh_cot_fewshot` prompts by @philipdoldo. (#3140)
- Increased `max_gen_toks` to 2048 for HRM8K math benchmarks by @shing100. (#3124)
Backend & Stability:
- Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
- Fixed a process hang caused by `mp.Pool` in `bootstrap_stderr` and introduced the `DISABLE_MULTIPROC` environment variable by @ankitgola005 and @neel04 (a usage sketch follows this list). (#3135, #3106)
- Added image hashing and the `LMEVAL_HASHMM` environment variable by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2973
- TaskManager: `include-path` precedence handling to prioritize the custom directory over the default by @parkhs21 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068
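For reference, a sketch of how these environment variables might be set before running an evaluation; the truthy values shown are assumptions, since the release notes do not spell out the expected format, and the checkpoint and task are placeholders.

```python
import os

# Assumed usage: export the new flags before the harness runs. The value "1"
# is an illustrative guess at a truthy setting, not a documented default.
os.environ["DISABLE_MULTIPROC"] = "1"  # skip mp.Pool when computing bootstrap_stderr
os.environ["LMEVAL_HASHMM"] = "1"      # hash images in multimodal requests

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # small placeholder checkpoint
    tasks=["lambada_openai"],
)
```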
Housekeeping:
- Pinned `datasets < 4.0.0` temporarily to maintain compatibility with `trust_remote_code` by @baberabb. (#3172)
- Removed models from Neural Magic and other unneeded files by @baberabb. (#3112, #3113, #3108)
What's Changed
- llama3 task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3074
- Fix Anthropic API compatibility issues in chat completions by @NourFahmy in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
- Ensure backwards compatibility in `fewshot_context` by using kwargs by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3079
- [vllm] remove system message if `TemplateError` for chat_template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3076
- feat / fix: Properly make use of `subfolder` from HF models by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3072
- [HF] fix quantization config by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3039
- FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
- Truthfulqa multi harness by @BlancaCalvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
- Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
- Humaneval - fix regression by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3102
- Bugfix/hf tokenizer gguf override by @ankush13r in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
- [FIX] Initial code to disable multi-proc for stderr by @neel04 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
- fix deps; update hooks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3107
- delete unneeded files by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3108
- Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
- add image hashing and `LMEVAL_HASHMM` envar by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2973
- removal of Neural Magic models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3112
- Neuralmagic by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3113
- check pil dep when hashing images by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3114
- warning for "chat" pretrained; disable buggy evalita configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3127
- fix: remove warning by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3128
- Adding EgyMMLU and EgyHellaSwag by @houdaipha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
- Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3138
- Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
- Fix errors when using vllm with LoRA by @Jacky-MYQ in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
- truncate thinking tags in generations by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3145
- bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
- Fix medical benchmarks import by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
- fix request hanging when requesting the API by @mmmans in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
- Custom request headers | trust_remote_code param fix by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/3069
- Bugfix: update path for GLUE by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3159
- Add the MultiBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3155
- multiblimp - readme by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3162
- [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3163
- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
- feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
- Added `chat_template_args` to vllm by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3164
- Pin datasets < 4.0.0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3172
- Remove "device" from vllm_causallms.py by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3176
- remove trust-remote-code in configs; fix escape sequences by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3180
- Fix vllm test issue that call pop() from None by @weireweire in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
- [hotfix] vllm: pop `device` from kwargs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3181
- Update vLLM compatibility by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
- Fix `mmlu_continuation` subgroup names to fit Readme and other variants by @lamalunderscore in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
- Fix humaneval_instruct by @idantene in https://github.com/EleutherAI/lm-evaluation-harness/pull/3201
- Update README.md for mlqa by @newme616 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
- improve include-path precedence handling by @parkhs21 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068
- Bump version to 0.4.9.1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3208
New Contributors
- @NourFahmy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3054
- @userljz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3092
- @BlancaCalvo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3062
- @stakodiak made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3099
- @ankush13r made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3098
- @neel04 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3106
- @DebjyotiRay made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3097
- @houdaipha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3063
- @ankitgola005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3135
- @Jacky-MYQ made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3132
- @philipdoldo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3140
- @idantene made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3151
- @mmmans made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3090
- @shing100 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3124
- @karimovaSvetlana made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2943
- @weireweire made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3182
- @DarkLight1337 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3024
- @lamalunderscore made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3137
- @newme616 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3117
- @parkhs21 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3068
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1
Files (9.2 MB)
| Name | Size |
|---|---|
| EleutherAI/lm-evaluation-harness-v0.4.9.1.zip (md5:c9963fc62b221f792bfdbdc69681b33f) | 9.2 MB |
Additional details
Related works
- Is supplement to
- Software: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.9.1 (URL)
Software
- Repository URL
- https://github.com/EleutherAI/lm-evaluation-harness