EleutherAI/lm-evaluation-harness: v0.4.6
Creators
- Lintang Sutawika [1]
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman [2]
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite [3]
- Fazz
- Aflah [4]
- Niklas Muennighoff
- Thomas Wang [5]
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz [6]
- Chris [7]
- Hanwool Albert Lee [8]
- Leonid Sinev
- Zdeněk Kasner [9]
- Khalid
- KonradSzafer
- Jeffrey Hsu [10]
- Anjor Kanekar [11]
- Pawan Sasanka Ammanamanchi
- [1] @EleutherAI
- [2] Booz Allen Hamilton, EleutherAI
- [3] @ClarosAI
- [4] Indraprastha Institute of Information Technology Delhi
- [5] MistralAI
- [6] HiTZ Zentroa, UPV/EHU
- [7] @azurro
- [8] NCSOFT
- [9] Charles University
- [10] Ivy Natal
- [11] Platypus Tech
Description
lm-eval v0.4.6 Release Notes
This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.
Backwards Incompatibilities
Chat Template Delimiter Handling
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
For detailed documentation, please refer to docs/chat-template-readme.md
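To illustrate why delimiter handling matters here, the sketch below uses the Hugging Face tokenizer API rather than the harness's internal request-construction code; the checkpoint name is only an example:

```python
# Minimal sketch: a chat template supplies its own role markers and
# separators, so the harness should not also inject hard-coded delimiters
# (e.g. a leading space before each multiple-choice continuation).
# The checkpoint name is illustrative, not prescribed by this release.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "Question: 2 + 2 = ?\nAnswer:"}]

# The rendered prompt already ends with the template's own delimiters;
# multiple-choice continuations such as " 4" are scored against this
# formatting rather than against a fixed "\n" or " " separator.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(prompt))
```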
New Benchmarks & Tasks
Multilingual Expansion
- Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
- Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439
New Task Collections
- Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
- Metabench: New benchmark contributed by @kozzy97 in #2357
This release also includes several minor fixes and changes to existing tasks, as noted by incremented task versions.
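The new tasks can be exercised through the harness's Python entry point. The sketch below assumes the task registers under the name metabench (per #2357) and uses an illustrative checkpoint; available task names can be confirmed with `lm_eval --tasks list`:

```python
# Sketch: evaluating a model on one of the newly added task collections.
# The task name ("metabench") and checkpoint are assumptions made for
# illustration, not fixed by these release notes.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["metabench"],
    batch_size=8,
)
print(results["results"])
```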
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Add Unitxt Multimodality Support by @elronbandel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2364
- Add new tasks to spanish_bench and fix duplicates by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2390
- fix typo bug for minerva_math by @renjie-ranger in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
- Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2393
- fix storycloze datanames by @t1101675 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
- Update NoticIA prompt by @ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2421
- [Fix] Replace generic exception classes with more specific ones by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1989
- Support for IBM watsonx_llm by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
- Fix package extras for watsonx support by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
- Fix lora requests when dp with vllm by @ckgresla in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
- Add xquad task by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2435
- Add verify_certificate argument to local-completion by @sjmonson in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
- Add GPTQModel support for evaluating GPTQ models by @Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
- Add missing task links by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
- Update CODEOWNERS by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2453
- Add real process_docs example by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2456
- Modify label errors in catcola and paws-x by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
- Add Japanese Leaderboard by @sitfoxfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
- Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
- use global `multi_choice_filter` for mmlu_flan by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2461
- typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2465
- pass device_map other than auto for parallelize by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2457 (see the sketch after this list)
- OpenAI ChatCompletions: switch `max_tokens` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2443
- Ifeval: Download `punkt_tab` on rank 0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2267
- Fix chat template; fix leaderboard math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2475
- change warning to debug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2481
- Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by @alex-titterton in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
- IBM watsonx_llm fixes & refactor by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2464
- Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
- update pre-commit hooks and git actions by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2497
- kbl-v0.1.1 by @whwang299 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
- Add mamba hf to `mamba_ssm` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2496
- remove duplicate `arc_ca` tag by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2499
- Add metabench task to LM Evaluation Harness by @kozzy97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357
- Nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2500
- [API models] parse tokenizer_backend=None properly by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2509
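As an example of one change above, #2457 allows a device_map other than auto to accompany parallelize. A hedged sketch of what that looks like through the Python API; the argument spellings are inferred from the PR title and the HF loader, and the checkpoint and map choice are illustrative:

```python
# Sketch: passing a non-"auto" device_map alongside parallelize=True
# (see #2457). Accepted values follow Accelerate's device_map strings;
# the checkpoint and "balanced_low_0" here are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1b,parallelize=True,device_map=balanced_low_0",
    tasks=["lambada_openai"],
)
```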
New Contributors
- @renjie-ranger made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
- @t1101675 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
- @Medokins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
- @kiersten-stokes made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
- @ckgresla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
- @sjmonson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
- @Qubitium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
- @Sypherd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
- @sitfoxfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
- @RobGeada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
- @alex-titterton made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
- @OyvindTafjord made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
- @whwang299 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
- @kozzy97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.5...v0.4.6
Files (3.5 MB)

| Name | Size |
|---|---|
| EleutherAI/lm-evaluation-harness-v0.4.6.zip (md5:c8a94b792c0d02fddd7e643b651b410b) | 3.5 MB |
Additional details
Related works
- Is supplement to: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.6 (Software)