Published July 1, 2024 | Version v0.4.3

EleutherAI/lm-evaluation-harness: v0.4.3


Description

lm-eval v0.4.3 Release Notes

We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future.

New Additions

The big new feature is the often-requested Chat Templating, contributed by @KonradSzafer, @clefourrier, and @NathanHB, and worked on by a number of other awesome contributors!

You can now run with a chat template applied via --apply_chat_template, and supply a system prompt of your choosing with --system_instruction "my sysprompt here". The --fewshot_as_multiturn flag controls whether each few-shot example in context is presented as a separate conversational turn.

This feature is currently supported only for the hf and vllm model types, but we intend to gather feedback on improvements and to extend it to other relevant model types, such as API-based models.
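As a sketch, an invocation combining these flags might look something like the following (the model name, task, and system prompt here are placeholders for illustration, not recommendations):

    lm_eval --model hf \
        --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
        --tasks gsm8k \
        --num_fewshot 5 \
        --apply_chat_template \
        --fewshot_as_multiturn \
        --system_instruction "You are a helpful assistant." \
        --batch_size auto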

There's a lot more to check out, including:

  • Logging results to the HF Hub if desired using --hf_hub_log_args, by @KonradSzafer and team! (A sketch of an invocation is shown after this list.)

  • NeMo model support by @sergiopperez !

  • Anthropic Chat API support by @tryuman !

  • DeepSparse and SparseML model types by @mgoin !

  • Handling of delta-weights in HF models, by @KonradSzafer !

  • LoRA support for VLLM, by @bcicc !

  • Fixes to PEFT modules which add new tokens to the embedding layers, by @mapmeld !

  • Fixes to handling of BOS tokens in multiple-choice loglikelihood settings, by @djstrong !

  • The use of custom Sampler subclasses in tasks, by @LSinev !

  • The ability to specify "hardcoded" few-shot examples more cleanly, by @clefourrier !

  • Support for Ascend NPUs (--device npu) by @statelesshz, @zhabuye, @jiaqiw09 and others!

  • Logging of higher_is_better in results tables for clearer understanding of eval metrics by @zafstojano !

  • Extra info logged about models, including info about tokenizers, chat templating, and more, by @artemorloff, @djstrong, and others!

  • Miscellaneous bug fixes! And many more great contributions we weren't able to list here.
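As promised above, here is a rough sketch of logging results to the HF Hub. --hf_hub_log_args takes a comma-separated list of key=value pairs passed through to the evaluation tracker; the key names below are illustrative placeholders, so check docs/interface.md or lm_eval --help for the exact argument names. Pushing to the Hub also assumes you are authenticated with Hugging Face (e.g. via an HF token).

    # Evaluate and push results (plus per-sample outputs, since --log_samples is set) to the Hub.
    # "my-org" and the repo/key names are placeholders.
    lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks lambada_openai \
        --output_path results \
        --log_samples \
        --hf_hub_log_args "hub_results_org=my-org,details_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False"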

New Tasks

We had a number of new tasks contributed. A listing of task subfolders, each with a brief description of the tasks it contains, can now be found at lm_eval/tasks/README.md. We hope this makes it easier to locate the definitions of relevant tasks: start at that page, then consult the README.md within a given folder for further info on each task it contains. Thank you to @AnthonyDipofi, @Harryalways317, @nairbv, @sepiatone, and others for working on this and giving feedback!

Without further ado, the tasks:

  • ACLUE, a benchmark for Ancient Chinese understanding, by @haonan-li
  • BasqueGlue and EusExams, two Basque-language tasks by @juletx
  • TMMLU+, an evaluation for Traditional Chinese, contributed by @ZoneTwelve
  • XNLIeu, a Basque version of XNLI, by @juletx
  • Pile-10K, a perplexity eval taken from a subset of the Pile's validation set, contributed by @mukobi
  • FDA, SWDE, and Squad-Completion zero-shot tasks by @simran-arora and team
  • Added back the hendrycks_math task, the MATH task using the prompt and answer parsing from the original Hendrycks et al. MATH paper rather than Minerva's prompt and parsing
  • COPAL-ID, a natively-Indonesian commonsense benchmark, contributed by @Erland366
  • tinyBenchmarks variants of the Open LLM Leaderboard 1 tasks, by @LucWeber and team!
  • Glianorex, a benchmark for testing performance on fictional medical questions, by @maximegmd
  • New FLD (formal logic) task variants by @MorishT
  • Improved translations of Lambada Multilingual tasks, added by @zafstojano
  • NoticIA, a Spanish summarization dataset by @ikergarcia1996
  • The Paloma perplexity benchmark, added by @zafstojano
  • We've removed the AMMLU dataset due to concerns about auto-translation quality.
  • Added the localized, not translated, ArabicMMLU dataset, contributed by @Yazeed7 !
  • BertaQA, a Basque cultural knowledge benchmark, by @juletx
  • New machine-translated ARC-C datasets by @jonabur !
  • CommonsenseQA, in a prompt format following Llama, by @murphybrendan
  • ...

Backwards Incompatibilities

The save format for logged results has now changed.

Output files will now be written to:

  • {output_path}/{sanitized_model_name}/results_YYYY-MM-DDTHH-MM-SS.xxxxx.json if --output_path is set, and
  • {output_path}/{sanitized_model_name}/samples_{task_name}_YYYY-MM-DDTHH-MM-SS.xxxxx.jsonl for each task's samples if --log_samples is set.

e.g. outputs/gpt2/results_2024-06-28T00-00-00.00001.json and outputs/gpt2/samples_lambada_openai_2024-06-28T00-00-00.00001.jsonl.
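As a sketch, the run below (using the same model and task as the example filenames above) would produce files of this form:

    # Writes outputs/gpt2/results_<timestamp>.json and, because --log_samples is set,
    # outputs/gpt2/samples_lambada_openai_<timestamp>.jsonl
    lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks lambada_openai \
        --output_path outputs \
        --log_samples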

See https://github.com/EleutherAI/lm-evaluation-harness/pull/1926 for utilities which may help when working with these new filenames.

Future Plans

In general, we'll be doing our best to keep up with the strong interest and large number of contributions we've seen coming in!

  • The official Open LLM Leaderboard 2 tasks will be landing soon in the Eval Harness main branch and subsequently in v0.4.4 on PyPI!

  • The fact that groups of tasks attempt, by default, to report an aggregated score across their constituent subtasks has been a sharp edge. We are finishing up some internal reworking to distinguish between groups, which do report aggregate scores (think mmlu), and tags, which are simply a convenient shortcut for calling a bunch of tasks one might want to run at once (think the pythia grouping, which merely collects tasks one might want results on all at once, but where averaging across them doesn't make sense).

  • We'd also like to improve the API model support in the Eval Harness from its current state.

  • More to come!

Thank you to everyone who's contributed to or used the library!

Thanks, @haileyschoelkopf @lintangsutawika

What's Changed

  • use BOS token in loglikelihood by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1588
  • Revert "Patch for Seq2Seq Model predictions" by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1601
  • fix gen_kwargs arg reading by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1607
  • fix until arg processing by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1608
  • Fixes to Loglikelihood prefix token / VLLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1611
  • Add ACLUE task by @haonan-li in https://github.com/EleutherAI/lm-evaluation-harness/pull/1614
  • OpenAI Completions -- fix passing of unexpected 'until' arg by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1612
  • add logging of model args by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1619
  • Add vLLM FAQs to README (#1625) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1633
  • peft Version Assertion by @LameloBally in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
  • Seq2seq fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1604
  • Integration of NeMo models into LM Evaluation Harness library by @sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
  • Fix conditional import for Nemo LM class by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1641
  • Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring by @orsharir in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
  • Add Latxa paper evaluation tasks for Basque by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1654
  • Fix CLI --batch_size arg for openai-completions/local-completions by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1656
  • Patch QQP prompt (#1648 ) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1661
  • TMMLU+ implementation by @ZoneTwelve in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
  • Anthropic Chat API by @tryumanshow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
  • correction bug EleutherAI#1664 by @nicho2 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
  • Signpost potential bugs / unsupported ops in MPS backend by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1680
  • Add delta weights model loading by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
  • Add neuralmagic models for sparseml and deepsparse by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1674
  • Improvements to run NVIDIA NeMo models on LM Evaluation Harness by @sergiopperez in https://github.com/EleutherAI/lm-evaluation-harness/pull/1699
  • Adding retries and rate limit to toxicity tasks by @sator-labs in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
  • reference --tasks list in README by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1726
  • Add XNLIeu: a dataset for cross-lingual NLI in Basque by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1694
  • Fix Parameter Propagation for Tasks that have include by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1749
  • Support individual scrolls datasets by @giorgossideris in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
  • Add filter registry decorator by @lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
  • remove duplicated num_fewshot: 0 by @chujiezheng in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
  • Pile 10k new task by @mukobi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
  • Fix m_arc choices by @jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1760
  • upload new tasks by @simran-arora in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
  • vllm lora support by @bcicc in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
  • Add option to set OpenVINO config by @helena-intel in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
  • evaluation tracker implementation by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1766
  • eval tracker args fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1777
  • limit fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1785
  • remove echo parameter in OpenAI completions API by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1779
  • Fix README: change----hf_hub_log_args to --hf_hub_log_args by @MuhammadBinUsman03 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
  • Fix bug in setting until kwarg in openai completions by @ciaranby in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
  • Provide ability for custom sampler for ConfigurableTask by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1616
  • Update --tasks list option in interface documentation by @sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
  • Fix Caching Tests ; Remove pretrained=gpt2 default by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1775
  • link to the example output on the hub by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1798
  • Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1793
  • Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1791
  • Initial integration of the Unitxt to LM eval harness by @yoavkatz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
  • add task for mmlu evaluation in arc multiple choice format by @jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1745
  • Update flag --hf_hub_log_args in interface documentation by @sepiatone in https://github.com/EleutherAI/lm-evaluation-harness/pull/1806
  • Copal task by @Erland366 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
  • Adding tinyBenchmarks datasets by @LucWeber in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
  • interface doc update by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1807
  • Fix links in README guiding to another branch by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1838
  • Fix: support PEFT/LoRA with added tokens by @mapmeld in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
  • Fix incorrect check for task type by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
  • Fixing typos in docs by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1863
  • Update polemo2_out.yaml by @zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
  • Unpin vllm in dependencies by @edgan8 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
  • Fix outdated links to the latest links in docs by @oneonlee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
  • [HFLM]Use Accelerate's API to reduce hard-coded CUDA code by @statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
  • Fix batch_size=auto for HF Seq2Seq models (#1765) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1790
  • Fix Brier Score by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1847
  • Fix for bootstrap_iters = 0 case (#1715) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1789
  • add mmlu tasks from pile-t5 by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1710
  • Bigbench fix by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1686
  • Rename lm_eval.logging -> lm_eval.loggers by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1858
  • Updated vllm imports in vllm_causallms.py by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1890
  • [HFLM]Add support for Ascend NPU by @statelesshz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1886
  • higher_is_better tickers in output table by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1893
  • Add dataset card when pushing to HF hub by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1898
  • Making hardcoded few shots compatible with the chat template mechanism by @clefourrier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
  • Try to make existing tests run little bit faster by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1905
  • Fix fewshot seed only set when overriding num_fewshot by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1914
  • Complete task list from pr 1727 by @anthony-dipofi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1901
  • Add chat template by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1873
  • Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data by @maximegmd in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
  • Modify pre-commit hook to check merge conflicts accidentally committed by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1927
  • [add] fld logical formula task by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1931
  • Add new Lambada translations by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1897
  • Implement NoticIA by @ikergarcia1996 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
  • Add The Arabic version of the PICA benchmark by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1917
  • Fix social_iqa answer choices by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1909
  • Update basque-glue by @zhabuye in https://github.com/EleutherAI/lm-evaluation-harness/pull/1913
  • Test output table layout consistency by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1916
  • Fix a tiny typo in __main__.py by @sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
  • Add the Arabic version with refactor to Arabic pica to be in alghafa … by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1940
  • Results filenames handling fix by @KonradSzafer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1926
  • Remove AMMLU Due to Translation by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1948
  • Add option in TaskManager to not index library default tasks ; Tests for include_path by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1856
  • Force BOS token usage in 'gemma' models for VLLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1857
  • Fix a tiny typo in docs/interface.md by @sadra-barikbin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1955
  • Fix self.max_tokens in anthropic_llms.py by @lozhn in https://github.com/EleutherAI/lm-evaluation-harness/pull/1848
  • samples is newline delimited by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1930
  • Fix --gen_kwargs and VLLM (temperature not respected) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1800
  • Make scripts.write_out error out when no splits match by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1796
  • fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' by @johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
  • add trust_remote_code for piqa by @changwangss in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
  • Fix self assignment in neuron_optimum.py by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1990
  • [New Task] Add Paloma benchmark by @zafstojano in https://github.com/EleutherAI/lm-evaluation-harness/pull/1928
  • Fix Paloma Template yaml by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1993
  • Log fewshot_as_multiturn in results files by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1995
  • Added ArabicMMLU by @Yazeed7 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
  • Fix Datasets --trust_remote_code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1998
  • Add BertaQA dataset tasks by @juletx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1964
  • add tokenizer logs info by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1731
  • Hotfix breaking import by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2015
  • add arc_challenge_mt by @jonabur in https://github.com/EleutherAI/lm-evaluation-harness/pull/1900
  • Remove LM dependency from build_all_requests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2011
  • Added CommonsenseQA task by @murphybrendan in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
  • Factor out LM-specific tests by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1859
  • Update interface.md by @johnwee1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1982
  • Fix trust_remote_code-related test failures by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2024
  • Fixes scrolls task bug with few_shot examples by @xksteven in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003
  • fix cache by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2037
  • Add chat template to vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2034
  • Fail gracefully upon tokenizer logging failure (#2035) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2038
  • Bundle exact_match HF Evaluate metric with install, don't call evaluate.load() on import by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2045
  • Update package version to v0.4.3 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2046

New Contributors

  • @LameloBally made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1635
  • @sergiopperez made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1598
  • @orsharir made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1647
  • @ZoneTwelve made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1394
  • @tryumanshow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1594
  • @nicho2 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1670
  • @KonradSzafer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1712
  • @sator-labs made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1620
  • @giorgossideris made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1740
  • @lozhn made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1750
  • @chujiezheng made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1769
  • @mukobi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1758
  • @simran-arora made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1728
  • @bcicc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1756
  • @helena-intel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1730
  • @MuhammadBinUsman03 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1776
  • @ciaranby made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1784
  • @sepiatone made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1792
  • @yoavkatz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1615
  • @Erland366 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1803
  • @LucWeber made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1545
  • @mapmeld made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1828
  • @zafstojano made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1865
  • @zhabuye made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1871
  • @edgan8 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1874
  • @oneonlee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1876
  • @statelesshz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1880
  • @clefourrier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1895
  • @maximegmd made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1867
  • @ikergarcia1996 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1912
  • @sadra-barikbin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1939
  • @johnwee1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1956
  • @changwangss made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1983
  • @Yazeed7 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1987
  • @murphybrendan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1721
  • @xksteven made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2003

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.2...v0.4.3

Files

EleutherAI/lm-evaluation-harness-v0.4.3.zip (2.6 MB) · md5:a826f0b46e36cecd7e156a66f4c90b91
