EleutherAI/lm-evaluation-harness: v0.4.7
Creators
- Lintang Sutawika (@EleutherAI)
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman (Booz Allen Hamilton, EleutherAI)
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite (@ClarosAI)
- Fazz
- Aflah (Max Planck Institute for Software Systems: MPI SWS)
- Niklas Muennighoff
- Thomas Wang (MistralAI)
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz (Hitz Zentroa UPV/EHU)
- Chris (@azurro)
- Hanwool Albert Lee (NCSOFT)
- Leonid Sinev
- Zdeněk Kasner (Charles University)
- Khalid
- KonradSzafer
- Jeffrey Hsu (Ivy Natal)
- Anjor Kanekar (Platypus Tech)
- Pawan Sasanka Ammanamanchi
Description
lm-eval v0.4.7 Release Notes
This release includes several bug fixes, minor improvements to model handling, and task additions.
⚠️ Python 3.8 End of Support Notice
Support for Python 3.8 will be dropped in upcoming releases, as Python 3.8 has reached end of life. Users are encouraged to upgrade to Python 3.9 or newer.
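If your launch scripts still pin Python 3.8, a small guard such as the sketch below (illustrative only, not part of the harness) can fail fast with an actionable message:

```python
# Minimal sketch: fail fast on an end-of-life interpreter before the
# harness itself starts breaking on newer syntax or dependencies.
import sys

if sys.version_info < (3, 9):
    raise RuntimeError(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "lm-eval is dropping Python 3.8 support, please upgrade to 3.9+."
    )
```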
Backwards Incompatibilities
Chat Template Delimiter Handling (in v0.4.6)
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
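As a rough illustration of why this matters (a sketch using the transformers API, not the harness's internal request-construction code; the model and strings are placeholders): the chat template itself supplies formatting around the context, so the harness now lets that formatting stand rather than unconditionally inserting the task's plain-text delimiter between context and continuation.

```python
# Illustrative sketch only; the harness's actual request construction
# lives inside lm_eval, not in this snippet.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

context = "Question: Which liquid is best for soaking stains?\nAnswer:"
choice = " soapy water"  # multiple-choice target, classically joined by a leading space

# The chat template wraps the context in the model's own role tags and
# whitespace, so the template's conventions, rather than the task's default
# delimiter, now determine how the continuation is joined.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": context}],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt), repr(choice))
```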
New Benchmarks & Tasks
- Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558
This release also includes several small fixes and changes to existing tasks, as reflected in their incremented version numbers.
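For example, the new Basque task can be run like any other through the harness's Python API (a minimal sketch; the model choice here is illustrative, any HF causal LM works):

```python
# Minimal sketch: evaluate the newly added piqa_eu task via lm_eval's
# Python entry point. The pretrained model is just an example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["piqa_eu"],
    batch_size=8,
)
print(results["results"]["piqa_eu"])
```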
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Score tasks by @rimashahbazyan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2452
- Filters bugfix; add `metrics` and `filter` to logged sample by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2517
- skip casting if predict_only by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2524
- make utility function to handle `until` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2518
- Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in https://github.com/EleutherAI/lm-evaluation-harness/pull/2514
- add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2531
- avoid timeout errors with high concurrency in api_model by @dtrawins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2307
- Update README.md by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2534
- better doc_to_test testing by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2535
- Support pipeline parallel with OpenVINO models by @sstrehlk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2349
- Super little tiny fix doc by @fzyzcjy in https://github.com/EleutherAI/lm-evaluation-harness/pull/2546
- [API] left truncate for generate_until by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2554 (see the sketch after this list)
- Update Lightning import by @maanug-nv in https://github.com/EleutherAI/lm-evaluation-harness/pull/2549
- add optimum-intel ipex model by @yao-matrix in https://github.com/EleutherAI/lm-evaluation-harness/pull/2566
- add warning to readme by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2568
- Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2558
- batch `loglikelihood_rolling` across requests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2559
- fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2560
- increment version to 4.6.7 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2574
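On the left-truncation change (#2554) above: the idea, sketched below under assumed names (this is not the harness's internal code), is that when a prompt exceeds the context window, tokens are dropped from the left so the most recent context survives and room for generation is preserved.

```python
# Hypothetical helper illustrating left-truncation for generate_until-style
# requests; the name and signature are assumptions, not the harness's API.
def left_truncate(token_ids: list[int], max_ctx_len: int, max_gen_toks: int) -> list[int]:
    # Reserve room for generation, then keep only the rightmost tokens.
    budget = max_ctx_len - max_gen_toks
    return token_ids[-budget:] if len(token_ids) > budget else token_ids

# e.g. a 4096-token window with 256 tokens reserved for generation keeps
# the last 3840 prompt tokens.
assert len(left_truncate(list(range(5000)), 4096, 256)) == 3840
```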
New Contributors
- @rimashahbazyan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2452
- @naiarapm made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2531
- @dtrawins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2307
- @sstrehlk made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2349
- @fzyzcjy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2546
- @maanug-nv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2549
- @yao-matrix made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2566
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.6...v0.4.7
Files
- EleutherAI/lm-evaluation-harness-v0.4.7.zip (3.5 MB, md5:379d6e427a03e6cdaa9de8979e60fd74)
Additional details
Related works
- Is supplement to: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.7 (Software)
Software
- Repository URL: https://github.com/EleutherAI/lm-evaluation-harness