
Published July 1, 2024 | Version v0.4.3
Software Open

EleutherAI/lm-evaluation-harness: v0.4.3

  • 1. @EleutherAI
  • 2. EleutherAI
  • 3. Booz Allen Hamilton, EleutherAI
  • 4. @ClarosAI
  • 5. Indraprastha Institute of Information Technology Delhi
  • 6. Peking University
  • 7. MistralAI
  • 8. Hitz Zentroa UPV/EHU
  • 9. @azurro
  • 10. NCSOFT
  • 11. @ufal
  • 12. Ivy Natal
  • 13. Platypus Tech

Description

lm-eval v0.4.3 Release Notes

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer, @clefourrier, and @NathanHB, and also worked on by a number of other awesome contributors!

You can now evaluate using a model's chat template with --apply_chat_template, and set a system prompt of your choosing with --system_instruction "my sysprompt here". The --fewshot_as_multiturn flag controls whether each few-shot example in context is presented as a new conversational turn.

This feature is currently supported only for the hf and vllm model types, but we intend to gather feedback on improvements and to extend it to other relevant models, such as API-based ones.
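As a minimal sketch of how these flags combine, the helper below assembles an lm_eval command line for a chat-templated run. The helper name build_chat_eval_cmd and the model name my-org/my-chat-model are hypothetical; --apply_chat_template, --system_instruction, and --fewshot_as_multiturn come from the notes above, while --model, --model_args, --tasks, and --num_fewshot are the harness's standard CLI options. Rejecting --fewshot_as_multiturn without a non-zero --num_fewshot is our assumption about sensible usage, not a documented rule.

```python
import shlex
from typing import List, Optional


def build_chat_eval_cmd(
    model: str,
    model_args: str,
    tasks: List[str],
    num_fewshot: Optional[int] = None,
    system_instruction: Optional[str] = None,
    fewshot_as_multiturn: bool = False,
) -> List[str]:
    """Return an lm_eval argv list that evaluates using the model's chat template.

    Hypothetical helper for illustration: it only builds the command, it does not run it.
    """
    # Per these release notes, chat templating currently supports only hf and vllm.
    if model not in ("hf", "vllm"):
        raise ValueError(f"chat templating is not supported for model type {model!r}")
    # Assumption: presenting few-shot examples as turns needs at least one example.
    if fewshot_as_multiturn and not num_fewshot:
        raise ValueError("fewshot_as_multiturn requires num_fewshot > 0")

    cmd = [
        "lm_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--apply_chat_template",
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if system_instruction is not None:
        cmd += ["--system_instruction", system_instruction]
    if fewshot_as_multiturn:
        cmd.append("--fewshot_as_multiturn")
    return cmd


if __name__ == "__main__":
    cmd = build_chat_eval_cmd(
        "hf",
        "pretrained=my-org/my-chat-model",  # hypothetical model name
        ["gsm8k"],
        num_fewshot=5,
        system_instruction="my sysprompt here",
        fewshot_as_multiturn=True,
    )
    # Print a shell-safe rendering; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Returning a list rather than one string lets subprocess.run pass a multi-word system prompt through unchanged, without shell-quoting surprises.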

There's a lot more to check out, including:

  • Optional logging of results to the HF Hub via --hf_hub_log_args, by @KonradSzafer and team!

  • NeMo model support by @sergiopperez !

  • Anthropic Chat API support by @tryumanshow !

  • DeepSparse and SparseML model types by @mgoin !

  • Handling of delta-weights in HF models, by @KonradSzafer !

  • LoRA support for vLLM, by @bcicc !

  • Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !

  • Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !

  • Support for custom Sampler subclasses in tasks, by @LSinev !

  • The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !

  • Support for Ascend NPUs (--device npu) by @statelesshz, @zhabuye, @jiaqiw09 and others!

  • Logging of higher_is_better in results tables, making clear which direction is better for each metric, by @zafstojano !

  • Extra info logged about models, including tokenizer details, chat templating, and more, by @artemorloff, @djstrong, and others!

  • Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
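The --hf_hub_log_args value is, as we understand the interface documentation, a single comma-separated list of key=value pairs. The sketch below serializes such options. format_hub_log_args is a hypothetical helper, and the key names in the example (hub_results_org, push_results_to_hub, push_samples_to_hub) are given as we recall them from the harness's docs/interface.md, so check them against your installed version before relying on them.

```python
def format_hub_log_args(**options: object) -> str:
    """Join keyword options into a 'key=value,key=value' string (hypothetical helper)."""
    parts = []
    for key, value in options.items():
        text = str(value)
        # Commas and '=' would be ambiguous inside a comma-separated key=value list.
        if "," in text or "=" in text:
            raise ValueError(f"value for {key!r} must not contain ',' or '=': {text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


if __name__ == "__main__":
    # Hypothetical organization name; key names should be checked against docs/interface.md.
    print(format_hub_log_args(
        hub_results_org="my-org",
        push_results_to_hub=True,
        push_samples_to_hub=True,
    ))
```

Keyword order is preserved, so the printed string lists the options in the order they were passed.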

New Tasks

A number of new tasks were contributed. A listing of subfolders, with a brief description of the tasks each one contains, can now be found at lm_eval/tasks/README.md. We hope this makes the definitions of relevant tasks easier to locate: first visit this page, then consult the README.md in a given folder for further information on each task it contains. Thank you to @anthony-dipofi, @Harryalways317, @nairbv, @sepiatone, and others for working on this and giving feedback!

Without further ado, the tasks:

  • ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
  • BasqueGlue and EusExams, two Basque-language tasks by @juletx
  • TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
  • XNLIeu, a Basque version of XNLI, by @juletx
  • Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
  • FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
  • Re-added the hendrycks_math task, which evaluates MATH using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
  • COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
  • tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
  • Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
  • New FLD (formal logic) task variants by @MorishT
  • Improved translations of Lambada Multilingual tasks, added by @zafstojano
  • NoticIA, a Spanish summarization dataset by @ikergarcia1996
  • The Paloma perplexity benchmark, added by @zafstojano
  • We've removed the AMMLU dataset due to concerns about auto-translation quality.
  • Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
  • BertaQA, a Basque cultural knowledge benchmark, by @juletx
  • New machine-translated ARC-C datasets by @jonabur !
  • CommonsenseQA, in a prompt format following Llama, by @murphybrendan
  • ...

Backwards Incompatibilities

The save format for logged results has changed.

Output files will now be written to:

  • {output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json if --output_path is set, and
  • {output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl for each task's samples if --log_samples is set.

For example: outputs/gpt2/results_2024-06-28T00-00-00.00001.json and outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl.

See https://github.com/EleutherAI/lm-evaluation-harness/pull/1926 for utilities which may help to work with these new filenames.
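Scripts that previously read a fixed results filename now need to locate files by pattern. A minimal sketch, assuming the timestamp always has the YYYY-MM-DDTHH-MM-SS.fraction shape shown in the examples above and that a task name never itself ends in something resembling a timestamp: parse_output_filename and latest_results are hypothetical helpers, not the utilities from PR #1926.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import NamedTuple, Optional

# Timestamps as in the examples above, e.g. 2024-06-28T00-00-00.00001.
_OUTPUT_RE = re.compile(
    r"^(?P<kind>results|samples)_"
    r"(?:(?P<task>.+)_)?"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}\.\d{1,6})"
    r"\.(?P<ext>jsonl?)$"
)


class OutputFile(NamedTuple):
    kind: str            # "results" or "samples"
    task: Optional[str]  # task name for samples files, None for results files
    timestamp: datetime


def parse_output_filename(name: str) -> Optional[OutputFile]:
    """Parse a results/samples filename in the new layout; return None if it does not match."""
    m = _OUTPUT_RE.match(name)
    if m is None:
        return None
    kind, task, ext = m["kind"], m["task"], m["ext"]
    # Results files are .json with no task; samples files are .jsonl with a task name.
    if kind == "results" and (ext != "json" or task is not None):
        return None
    if kind == "samples" and (ext != "jsonl" or task is None):
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H-%M-%S.%f")
    return OutputFile(kind, task, ts)


def latest_results(model_dir: Path) -> Optional[Path]:
    """Return the newest results_*.json file in an {output_path}/{model} directory."""
    if not model_dir.is_dir():
        return None
    found = []
    for path in model_dir.iterdir():
        parsed = parse_output_filename(path.name)
        if parsed is not None and parsed.kind == "results":
            found.append((parsed.timestamp, path))
    return max(found, key=lambda item: item[0])[1] if found else None


if __name__ == "__main__":
    print(parse_output_filename("samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl"))
    print(latest_results(Path("outputs/gpt2")))
```

Comparing parsed datetimes, rather than raw filename strings, keeps the ordering correct even if the fractional-second part ever varies in width.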

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

  • The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in v0.4.4 on PyPI!

  • By default, groups of tasks attempt to report a score aggregated across their constituent subtasks, which has been a sharp edge. We are finishing up some internal reworking to distinguish between groups, which do report aggregate scores (think mmlu), and tags, which are simply a convenient shortcut for calling a set of tasks one might want to run at once (think the pythia grouping: a collection of tasks whose individual results are worth gathering in one run, but where averaging them doesn't make sense).

  • We'd also like to improve support for API-based models in the Eval Harness.

  • More to come!
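To make the planned group/tag distinction concrete, here is an illustrative sketch rather than the harness's implementation: a group reports one aggregate alongside its subtasks' scores, while a tag only names a list of tasks and reports each result separately. Weighting the group average by subtask sample count is an assumption made for the example, and the class names Group and Tag, as well as the task names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Group:
    """Named set of subtasks that reports one aggregate score (think mmlu)."""
    name: str
    subtasks: List[str]

    def report(self, scores: Dict[str, float], sizes: Dict[str, int]) -> Dict[str, float]:
        # Assumption: average weighted by the number of samples in each subtask.
        total = sum(sizes[t] for t in self.subtasks)
        aggregate = sum(scores[t] * sizes[t] for t in self.subtasks) / total
        out = {t: scores[t] for t in self.subtasks}
        out[self.name] = aggregate
        return out


@dataclass
class Tag:
    """Shortcut naming several tasks to run together (think pythia); no aggregate."""
    name: str
    tasks: List[str]

    def report(self, scores: Dict[str, float], sizes: Dict[str, int]) -> Dict[str, float]:
        # Each task keeps its own score; averaging unrelated metrics is not meaningful.
        return {t: scores[t] for t in self.tasks}


if __name__ == "__main__":
    scores = {"sub_a": 0.5, "sub_b": 1.0, "task_x": 0.3, "task_y": 12.7}
    sizes = {"sub_a": 100, "sub_b": 300, "task_x": 50, "task_y": 80}
    print(Group("my_group", ["sub_a", "sub_b"]).report(scores, sizes))  # my_group -> 0.875
    print(Tag("my_tag", ["task_x", "task_y"]).report(scores, sizes))
```

Here task_y could be a perplexity and task_x an accuracy, which is exactly the case where a tag should not produce an average.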

Thank you to everyone who's contributed to or used the library!

Thanks, @haileyschoelkopf and @lintangsutawika

What's Changed

  • use BOS token in loglikelihood by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1588
  • Revert "Patch for Seq2Seq Model predictions" by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1601
  • fix gen_kwargs arg reading by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1607
  • fix until arg processing by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1608
  • Fixes to Loglikelihood prefix token / VLLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1611
  • Add ACLUE task by @haonan-li in https://github.com/EleutherAI/lm-evaluation-harness/pull/1614
  • OpenAI Completions -- fix passing of unexpected 'until' arg by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1612
  • add logging of model args by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1619
  • Add vLLM FAQs to README (#1625) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1633
  • peft Version Assertion by @LameloBally in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
  • Seq2seq fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1604
  • Integration of NeMo models into LM Evaluation Harness library by @sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
  • Fix conditional import for Nemo LM class by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1641
  • Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by @orsharir in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
  • Add Latxa paper evaluation tasks for Basque by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1654
  • Fix CLI --batch_size arg for openai-completions/local-completions by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1656
  • Patch QQP prompt (#1648 ) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1661
  • TMMLU+ implementation by @ZoneTwelve in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
  • Anthropic Chat API by @tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
  • correction bug EleutherAI#1664 by @nicho2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
  • Signpost potential bugs / unsupported ops in MPS backend by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1680
  • Add delta weights model loading by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
  • Add neuralmagic models for sparseml and deepsparse by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1674
  • Improvements to run NVIDIA NeMo models on LM Evaluation Harness by @sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1699
  • Adding retries and rate limit to toxicity tasks by @sator-labs in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
  • reference --tasks list in README by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1726
  • Add XNLIeu: a dataset for cross-lingual NLI in Basque by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1694
  • Fix Parameter Propagation for Tasks that have include by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1749
  • Support individual scrolls datasets by @giorgossideris in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
  • Add filter registry decorator by @lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
  • remove duplicated num_fewshot: 0 by @chujiezheng in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
  • Pile 10k new task by @mukobi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
  • Fix m_arc choices by @jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1760
  • upload new tasks by @simran-arora in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
  • vllm lora support by @bcicc in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
  • Add option to set OpenVINO config by @helena-intel in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
  • evaluation tracker implementation by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1766
  • eval tracker args fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1777
  • limit fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1785
  • remove echo parameter in OpenAI completions API by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1779
  • Fix README: change----hf_hub_log_args to --hf_hub_log_args by @MuhammadBinUsman03 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
  • Fix bug in setting until kwarg in openai completions by @ciaranby in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
  • Provide ability for custom sampler for ConfigurableTask by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1616
  • Update --tasks list option in interface documentation by @sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
  • Fix Caching Tests ; Remove pretrained=gpt2 default by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1775
  • link to the example output on the hub by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1798
  • Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1793
  • Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1791
  • Initial integration of the Unitxt to LM eval harness by @yoavkatz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
  • add task for mmlu evaluation in arc multiple choice format by @jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1745
  • Update flag --hf_hub_log_args in interface documentation by @sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1806
  • Copal task by @Erland366 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
  • Adding tinyBenchmarks datasets by @LucWeber in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
  • interface doc update by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1807
  • Fix links in README guiding to another branch by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1838
  • Fix: support PEFT/LoRA with added tokens by @mapmeld in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
  • Fix incorrect check for task type by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
  • Fixing typos in docs by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1863
  • Update polemo2_out.yaml by @zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
  • Unpin vllm in dependencies by @edgan8 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
  • Fix outdated links to the latest links in docs by @oneonlee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
  • [HFLM]Use Accelerate's API to reduce hard-coded CUDA code by @statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
  • Fix batch_size=auto for HF Seq2Seq models (#1765) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1790
  • Fix Brier Score by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1847
  • Fix for bootstrap_iters = 0 case (#1715) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1789
  • add mmlu tasks from pile-t5 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1710
  • Bigbench fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1686
  • Rename lm_eval.logging -> lm_eval.loggers by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1858
  • Updated vllm imports in vllm_causallms.py by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1890
  • [HFLM]Add support for Ascend NPU by @statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1886
  • higher_is_better tickers in output table by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1893
  • Add dataset card when pushing to HF hub by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1898
  • Making hardcoded few shots compatible with the chat template mechanism by @clefourrier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
  • Try to make existing tests run little bit faster by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1905
  • Fix fewshot seed only set when overriding num_fewshot by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1914
  • Complete task list from pr 1727 by @anthony-dipofi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1901
  • Add chat template by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1873
  • Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data by @maximegmd in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
  • Modify pre-commit hook to check merge conflicts accidentally committed by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1927
  • [add] fld logical formula task by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1931
  • Add new Lambada translations by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1897
  • Implement NoticIA by @ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
  • Add The Arabic version of the PICA benchmark by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1917
  • Fix social_iqa answer choices by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1909
  • Update basque-glue by @zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1913
  • Test output table layout consistency by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1916
  • Fix a tiny typo in __main__.py by @sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
  • Add the Arabic version with refactor to Arabic pica to be in alghafa … by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1940
  • Results filenames handling fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1926
  • Remove AMMLU Due to Translation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1948
  • Add option in TaskManager to not index library default tasks ; Tests for include_path by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1856
  • Force BOS token usage in 'gemma' models for VLLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1857
  • Fix a tiny typo in docs/interface.md by @sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1955
  • Fix self.max_tokens in anthropic_llms.py by @lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1848
  • samples is newline delimited by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1930
  • Fix --gen_kwargs and VLLM (temperature not respected) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1800
  • Make scripts.write_out error out when no splits match by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1796
  • fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' by @johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
  • add trust_remote_code for piqa by @changwangss in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
  • Fix self assignment in neuron_optimum.py by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1990
  • [New Task] Add Paloma benchmark by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1928
  • Fix Paloma Template yaml by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1993
  • Log fewshot_as_multiturn in results files by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1995
  • Added ArabicMMLU by @Yazeed7 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
  • Fix Datasets --trust_remote_code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1998
  • Add BertaQA dataset tasks by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1964
  • add tokenizer logs info by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1731
  • Hotfix breaking import by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2015
  • add arc_challenge_mt by @jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1900
  • Remove LM dependency from build_all_requests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2011
  • Added CommonsenseQA task by @murphybrendan in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
  • Factor out LM-specific tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1859
  • Update interface.md by @johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1982
  • Fix trust_remote_code-related test failures by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2024
  • Fixes scrolls task bug with few_shot examples by @xksteven in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003
  • fix cache by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2037
  • Add chat template to vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2034
  • Fail gracefully upon tokenizer logging failure (#2035) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2038
  • Bundle exact_match HF Evaluate metric with install, don't call evaluate.load() on import by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2045
  • Update package version to v0.4.3 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2046

New Contributors

  • @LameloBally made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
  • @sergiopperez made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
  • @orsharir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
  • @ZoneTwelve made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
  • @tryumanshow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
  • @nicho2 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
  • @KonradSzafer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
  • @sator-labs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
  • @giorgossideris made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
  • @lozhn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
  • @chujiezheng made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
  • @mukobi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
  • @simran-arora made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
  • @bcicc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
  • @helena-intel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
  • @MuhammadBinUsman03 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
  • @ciaranby made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
  • @sepiatone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
  • @yoavkatz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
  • @Erland366 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
  • @LucWeber made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
  • @mapmeld made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
  • @zafstojano made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
  • @zhabuye made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
  • @edgan8 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
  • @oneonlee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
  • @statelesshz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
  • @clefourrier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
  • @maximegmd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
  • @ikergarcia1996 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
  • @sadra-barikbin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
  • @johnwee1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
  • @changwangss made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
  • @Yazeed7 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
  • @murphybrendan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
  • @xksteven made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.2...v0.4.3

Files

EleutherAI/lm-evaluation-harness-v0.4.3.zip (2.6 MB, md5:a826f0b46e36cecd7e156a66f4c90b91)
