
Published November 25, 2024 | Version v0.4.6

EleutherAI/lm-evaluation-harness: v0.4.6

Description

lm-eval v0.4.6 Release Notes

This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.

Backwards Incompatibilities

Chat Template Delimiter Handling

Delimiter handling during request construction has changed when chat templates are applied, most notably for multiple-choice tasks. Requests now respect a chat model's native formatting conventions instead of the harness's default delimiters, improving compatibility with chat models.

๐Ÿ“ For detailed documentation, please refer to docs/chat-template-readme.md

New Benchmarks & Tasks

Multilingual Expansion

  • Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
  • Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439

New Task Collections

  • Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
  • Metabench: New benchmark contributed by @kozzy97 in #2357

This release also includes several minor fixes and changes to existing tasks, as noted via incremented task versions. A quick way to verify that the new tasks are registered is sketched below.
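
The following sketch queries lm-eval's task registry; `TaskManager` is part of the public API, but the task names below are guesses inferred from the PR titles above and may differ from the exact registered names.

```python
# Sketch: check newly added tasks against lm_eval's task registry.
# The names below are inferred from PR titles and may not match exactly.
from lm_eval.tasks import TaskManager

tm = TaskManager()
for name in ["metabench", "japanese_leaderboard", "xquad"]:
    status = "registered" if name in tm.all_tasks else "not found"
    print(f"{name}: {status}")
```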

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

  • Add Unitxt Multimodality Support by @elronbandel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2364
  • Add new tasks to spanish_bench and fix duplicates by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2390
  • fix typo bug for minerva_math by @renjie-ranger in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
  • Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2393
  • fix storycloze datanames by @t1101675 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
  • Update NoticIA prompt by @ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2421
  • [Fix] Replace generic exception classes with more specific ones by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1989
  • Support for IBM watsonx_llm by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
  • Fix package extras for watsonx support by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
  • Fix lora requests when dp with vllm by @ckgresla in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
  • Add xquad task by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2435
  • Add verify_certificate argument to local-completion by @sjmonson in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
  • Add GPTQModel support for evaluating GPTQ models by @Qubitium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
  • Add missing task links by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
  • Update CODEOWNERS by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2453
  • Add real process_docs example by @Sypherd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2456
  • Modify label errors in catcola and paws-x by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2434
  • Add Japanese Leaderboard by @sitfoxfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
  • Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
  • use global multi_choice_filter for mmlu_flan by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2461
  • typo by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2465
  • pass device_map other than auto for parallelize by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2457
  • OpenAI ChatCompletions: switch max_tokens by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2443
  • Ifeval: Download punkt_tab on rank 0 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2267
  • Fix chat template; fix leaderboard math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2475
  • change warning to debug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2481
  • Updated wandb logger to use new_printer() instead of get_printer(...) by @alex-titterton in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
  • IBM watsonx_llm fixes & refactor by @Medokins in https://github.com/EleutherAI/lm-evaluation-harness/pull/2464
  • Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
  • update pre-commit hooks and git actions by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2497
  • kbl-v0.1.1 by @whwang299 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
  • Add mamba hf to mamba_ssm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2496
  • remove duplicate arc_ca tag by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2499
  • Add metabench task to LM Evaluation Harness by @kozzy97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357
  • Nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2500
  • [API models] parse tokenizer_backend=None properly by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2509
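
Several entries above touch the OpenAI-compatible API model path, e.g. the max_tokens switch in #2443 and the tokenizer_backend=None parsing fix in #2509. As a sketch of where these apply, the snippet below points the harness at such an endpoint; the endpoint URL and model name are placeholders.

```python
# Sketch: evaluating against a local OpenAI-compatible completions endpoint.
# `model`, `base_url`, and `tokenizer_backend` are model_args accepted by the
# `local-completions` backend; the URL and model name are placeholders.
# tokenizer_backend=None (the string form whose parsing #2509 fixes) skips
# client-side tokenization, which suits generative tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=my-model,"
        "base_url=http://localhost:8000/v1/completions,"
        "tokenizer_backend=None"
    ),
    tasks=["gsm8k"],  # a generative task; no local tokenizer needed
)
```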

New Contributors

  • @renjie-ranger made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2404
  • @t1101675 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2409
  • @Medokins made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2397
  • @kiersten-stokes made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2426
  • @ckgresla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2433
  • @sjmonson made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2440
  • @Qubitium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2217
  • @Sypherd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2449
  • @sitfoxfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2439
  • @RobGeada made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2459
  • @alex-titterton made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2484
  • @OyvindTafjord made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2492
  • @whwang299 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2493
  • @kozzy97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2357

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.5...v0.4.6

Files

EleutherAI/lm-evaluation-harness-v0.4.6.zip (3.5 MB)
md5:c8a94b792c0d02fddd7e643b651b410b
