
Published March 5, 2025 | Version v0.4.8

EleutherAI/lm-evaluation-harness: v0.4.8


Description

lm-eval v0.4.8 Release Notes

Key Improvements

  • New Backend Support:

    • Added SGLang as a new evaluation backend (see the usage sketch after this list)!
    • Enabled steering of models with steering vectors loaded via sparsify or sae_lens
  • Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.

  • Added support for gen_prefix in task configs, allowing you to append text after the <|assistant|> token (or at the end of non-chat prompts) - particularly useful for evaluating instruct models. A minimal config sketch follows below.
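As a rough sketch of how the new SGLang backend can be invoked: the backend name `sglang` follows the pull request that added it, while the model, task, and argument values below are placeholders chosen to match the harness's usual CLI conventions, not a canonical command.

```bash
# Hypothetical invocation of the SGLang backend; the pretrained model,
# task, and batch size are placeholders for illustration only.
lm_eval --model sglang \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks gsm8k \
    --batch_size auto
```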
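And a minimal sketch of where gen_prefix might sit in a task YAML: the task and dataset names here are hypothetical, and only the `gen_prefix` key is the feature being illustrated; the surrounding keys follow the harness's standard task-config layout.

```yaml
# Hypothetical task config illustrating gen_prefix; task and dataset
# names are placeholders, not shipped tasks.
task: my_instruct_task
dataset_path: my_org/my_dataset
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{answer}}"
# Text inserted after the assistant token (or appended to the end of
# the prompt for non-chat models) before generation begins:
gen_prefix: "Let's think step by step."
```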

New Benchmarks & Tasks

Code Evaluation

  • HumanEval by @hjlee1371 in #1992
  • MBPP by @hjlee1371 in #2247
  • HumanEval+ and MBPP+ by @bzantium in #2734

Multilingual Expansion

  • Global Coverage:

    • Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
    • MLQA multilingual question answering by @KahnSvaer in #2622
  • Asian Languages:

    • HRM8K benchmark for Korean and English by @bzantium in #2627
    • Updated KorMedMCQA to version 2.0 by @GyoukChu in #2540
    • Fixed TMLU Taiwan-specific tasks tag by @nike00811 in #2420
  • European Languages:

    • Added Evalita-LLM benchmark by @m-resta in #2681
    • BasqueBench with Basque translations of ARC and PAWS by @naiarapm in #2732
    • Updated Turkish MMLU configuration by @ArdaYueksel in #2678
  • Middle Eastern Languages:

    • Arabic MMLU by @bodasadallah in #2541
    • AraDICE task by @firojalam in #2507

Ethics & Reasoning

  • Moral Stories by @upunaprosk in #2653
  • Histoires Morales by @upunaprosk in #2662

Others

  • MMLU Pro Plus by @asgsaeid in #2366
  • GroundCocoa by @HarshKohli in #2724

We extend our thanks to all contributors who made this release possible and to our users for their continued support and feedback.

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

  • drop python 3.8 support by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2575
  • Add Global MMLU Lite by @shivalika-singh in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
  • add warning for truncation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2585
  • Wandb step handling bugfix and feature by @sjmielke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
  • AraDICE task config file by @firojalam in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
  • fix exact_match low if batch_size > 1 by @sywangyi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2595
  • fix model tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2604
  • update scrolls by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2602
  • some minor logging nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2609
  • Fix gguf loading via Transformers by @CL-ModelCloud in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
  • Fix Zeno visualizer on tasks like GSM8k by @pasky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
  • Fix the format of mgsm zh and ja. by @timturing in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
  • Add HumanEval by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
  • Add MBPP by @hjlee1371 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2247
  • Add MLQA by @KahnSvaer in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
  • assistant prefill by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2615
  • fix gen_prefix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2630
  • update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2632
  • add hrm8k benchmark for both Korean and English by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
  • New arabicmmlu by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
  • Add global_mmlu full version by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2636
  • Update KorMedMCQA: ver 2.0 by @GyoukChu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
  • fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
  • fixed mmlu generative response extraction by @RawthiL in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
  • revise mbpp prompt by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2645
  • aggregate by group (total and categories) by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2643
  • Fix max_tokens handling in vllm_vlms.py by @jkaniecki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
  • separate category for global_mmlu by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2652
  • Add Moral Stories by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
  • add TransformerLens example by @nickypro in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
  • fix multiple input chat template by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2576
  • Add Aggregation for Kobest Benchmark by @tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/2446
  • update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2660
  • remove group from bigbench task configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2663
  • Add Histoires Morales task by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2662
  • MMLU Pro Plus by @asgsaeid in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
  • fix early return for multiple dict in task process_results by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2673
  • Turkish mmlu Config Update by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2678
  • Fix typos by @omahs in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
  • remove cuda device assertion by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2680
  • Adding the Evalita-LLM benchmark by @m-resta in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
  • Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2687
  • Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2684
  • change ensure_ascii to False for JsonChatStr by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2691
  • Set defaults for BLiMP scores by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/2692
  • Update remaining references to assistant_prefill in docs to gen_prefix by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2683
  • Update README.md by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/2694
  • fix construct_requests kwargs in python tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2700
  • arithmetic: set target delimiter to empty string by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2701
  • fix vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2708
  • add math_verify to some tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2686
  • Logging by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2203
  • Replace missing lighteval/MATH-Hard dataset with DigitalLearningGmbH/MATH-lighteval by @f4str in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
  • remove unused import by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2728
  • README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2729
  • add o3-mini support by @HelloJocelynLu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
  • add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in https://github.com/EleutherAI/lm-evaluation-harness/pull/2732
  • Add cocoteros_es task in spanish_bench by @sgs97ua in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
  • Fix the import source for eval_logger by @kailashbuki in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
  • add humaneval+ and mbpp+ by @bzantium in https://github.com/EleutherAI/lm-evaluation-harness/pull/2734
  • Support SGLang as Potential Backend for Evaluation by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
  • fix log condition on main by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2737
  • fix vllm data parallel by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2746
  • [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in https://github.com/EleutherAI/lm-evaluation-harness/pull/2738
  • Groundcocoa by @HarshKohli in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
  • fix doc: generate_until only outputs the generated text! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2755
  • Enable steering HF models by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749
  • Add test for a simple Unitxt task by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/2742
  • add debug log by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2757
  • increment version to 0.4.8 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2760

New Contributors

  • @shivalika-singh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2567
  • @sjmielke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2580
  • @firojalam made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2507
  • @CL-ModelCloud made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2596
  • @pasky made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2599
  • @timturing made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2587
  • @hjlee1371 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1992
  • @KahnSvaer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2622
  • @bzantium made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2627
  • @bodasadallah made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2541
  • @GyoukChu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2540
  • @nike00811 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2420
  • @RawthiL made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2503
  • @jkaniecki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2637
  • @upunaprosk made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2653
  • @nickypro made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2651
  • @asgsaeid made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2366
  • @omahs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2679
  • @m-resta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2681
  • @f4str made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2719
  • @HelloJocelynLu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2697
  • @sgs97ua made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2721
  • @kailashbuki made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2735
  • @Monstertail made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2703
  • @HarshKohli made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2724
  • @luciaquirke made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2749

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.7...v0.4.8

Files

  • EleutherAI/lm-evaluation-harness-v0.4.8.zip (5.1 MB, md5:cb854773e84e40e1f342a1d14b8518c6)
