
Published November 25, 2024 | Version v0.4.6

EleutherAI/lm-evaluation-harness: v0.4.6

Description

lm-eval v0.4.6 Release Notes

This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.

Backwards Incompatibilities

Chat Template Delimiter Handling

Delimiter handling during request construction has changed when chat templates are applied, most notably for multiple-choice tasks. Requests now respect a chat model's native formatting conventions instead of the harness's default delimiters, improving compatibility with chat models.

๐Ÿ“ For detailed documentation, please refer to docs/chat-template-readme.md

New Benchmarks & Tasks

Multilingual Expansion

  • Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
  • Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439

New Task Collections

  • Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
  • Metabench: New benchmark contributed by @kozzy97 in #2357

This release also includes several minor fixes and changes to existing tasks, as noted via incremented task versions. A quick way to verify that the new tasks are registered is sketched below.
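
The following sketch queries lm-eval's task registry; `TaskManager` is part of the public API, but the task names below are guesses inferred from the PR titles above and may differ from the exact registered names.

```python
# Sketch: check newly added tasks against lm_eval's task registry.
# The names below are inferred from PR titles and may not match exactly.
from lm_eval.tasks import TaskManager

tm = TaskManager()
for name in ["metabench", "japanese_leaderboard", "xquad"]:
    status = "registered" if name in tm.all_tasks else "not found"
    print(f"{name}: {status}")
```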

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

  • Add Unitxt Multimodality Support by @elronbandel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2364
  • Add new tasks to spanish_bench and fix duplicates by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2390
  • fix typo bug for minerva_math by @renjie-ranger in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
  • Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2393
  • fix storycloze datanames by @t1101675 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
  • Update NoticIA prompt by @ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2421
  • [Fix] Replace generic exception classes with more specific ones by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1989
  • Support for IBM watsonx_llm by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
  • Fix package extras for watsonx support by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
  • Fix lora requests when dp with vllm by @ckgresla in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
  • Add xquad task by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2435
  • Add verify_certificate argument to local-completion by @sjmonson in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
  • Add GPTQModel support for evaluating GPTQ models by @Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
  • Add missing task links by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
  • Update CODEOWNERS by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2453
  • Add real process_docs example by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2456
  • Modify label errors in catcola and paws-x by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
  • Add Japanese Leaderboard by @sitfoxfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
  • Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
  • use global multi_choice_filter for mmlu_flan by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2461
  • typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2465
  • pass device_map other than auto for parallelize by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2457
  • OpenAI ChatCompletions: switch max_tokens by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2443
  • Ifeval: Download punkt_tab on rank 0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2267
  • Fix chat template; fix leaderboard math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2475
  • change warning to debug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2481
  • Updated wandb logger to use new_printer() instead of get_printer(...) by @alex-titterton in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
  • IBM watsonx_llm fixes & refactor by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2464
  • Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
  • update pre-commit hooks and git actions by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2497
  • kbl-v0.1.1 by @whwang299 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
  • Add mamba hf to mamba_ssm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2496
  • remove duplicate arc_ca tag by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2499
  • Add metabench task to LM Evaluation Harness by @kozzy97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357
  • Nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2500
  • [API models] parse tokenizer_backend=None properly by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2509
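
Several entries above touch the OpenAI-compatible API model path, e.g. the max_tokens switch in #2443 and the tokenizer_backend=None parsing fix in #2509. As a sketch of where these apply, the snippet below points the harness at such an endpoint; the endpoint URL and model name are placeholders.

```python
# Sketch: evaluating against a local OpenAI-compatible completions endpoint.
# `model`, `base_url`, and `tokenizer_backend` are model_args accepted by the
# `local-completions` backend; the URL and model name are placeholders.
# tokenizer_backend=None (the string form whose parsing #2509 fixes) skips
# client-side tokenization, which suits generative tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=my-model,"
        "base_url=http://localhost:8000/v1/completions,"
        "tokenizer_backend=None"
    ),
    tasks=["gsm8k"],  # a generative task; no local tokenizer needed
)
```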

New Contributors

  • @renjie-ranger made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
  • @t1101675 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
  • @Medokins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
  • @kiersten-stokes made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
  • @ckgresla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
  • @sjmonson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
  • @Qubitium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
  • @Sypherd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
  • @sitfoxfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
  • @RobGeada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
  • @alex-titterton made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
  • @OyvindTafjord made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
  • @whwang299 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
  • @kozzy97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.5...v0.4.6

Files

EleutherAI/lm-evaluation-harness-v0.4.6.zip (3.5 MB)
md5:c8a94b792c0d02fddd7e643b651b410b
