Published January 31, 2024 | Version v0.4.1

EleutherAI/lm-evaluation-harness: v0.4.1

Description

Release Notes

This release contains all changes since v0.4.0, and is partly a test of our release automation, provided by @anjor.

At a high level, some of the changes include:

  • Data-parallel inference using vLLM (contributed by @baberabb; see the first sketch after this list)
  • A major fix to Hugging Face model generation: previously, in v0.4.0, a bug in stop-sequence handling sometimes caused generations to be cut off too early.
  • Miscellaneous documentation updates
  • A number of new tasks, and bugfixes to old tasks!
  • Support for OpenAI-compatible API models via local-completions or local-chat-completions (thanks to @veekaybee, @mgoin, @anjor, and others; see the second sketch after this list)!
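As a quick illustration, here is a minimal sketch of data-parallel vLLM evaluation through the Python API. The model name, replica count, and memory setting are placeholder values, not recommendations; data_parallel_size and gpu_memory_utilization follow the vLLM model_args documented in the README.

```python
import lm_eval
from lm_eval import tasks

tasks.initialize_tasks()  # registers the built-in task configs (required in v0.4.1)

# data_parallel_size splits requests across independent vLLM replicas.
# All model_args values below are illustrative placeholders.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=EleutherAI/pythia-1.4b,"
        "data_parallel_size=2,"
        "gpu_memory_utilization=0.8"
    ),
    tasks=["lambada_openai"],
    batch_size="auto",
)
print(results["results"])
```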
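And a sketch of evaluating against a local OpenAI-compatible completions server. The server URL, served model name, and argument values here are assumptions for illustration; any OpenAI-style completions endpoint (e.g. one served by vLLM) should work the same way.

```python
import lm_eval
from lm_eval import tasks

tasks.initialize_tasks()

# Assumes an OpenAI-compatible /v1/completions endpoint is already
# serving the model locally; the URL and model name are placeholders.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=facebook/opt-125m,"
        "base_url=http://localhost:8000/v1/completions,"
        "tokenizer_backend=huggingface"
    ),
    tasks=["gsm8k"],
)
print(results["results"])
```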

We may make more frequent (minor) version releases in the future, to make life easier for PyPI users!

We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.

In the next version release, we hope to include:

  • Chat Templating + System Prompt support, for locally-run models
  • Improved Answer Extraction for many generative tasks
  • General speedups and quality-of-life fixes to the non-inference portions of LM-Evaluation-Harness, including faster startup and speedups when num_fewshot is large!
  • A new TaskManager object and the deprecation of lm_eval.tasks.initialize_tasks(), to make it easier to register many tasks and configure new task groups (the current pattern this would supersede is sketched below)
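For reference, the v0.4.1 pattern that the planned TaskManager would supersede looks like the sketch below. Only the current, documented API is shown, since the TaskManager interface is not finalized in this release; the model name is a placeholder.

```python
import lm_eval
from lm_eval import tasks

# v0.4.1: built-in tasks must be registered explicitly before evaluating;
# this global registration step is what the next release plans to deprecate.
tasks.initialize_tasks()

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"])
```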

What's Changed

  • Announce v0.4.0 in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1061
  • remove commented planned samplers in lm_eval/api/samplers.py by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1062
  • Confirming links in docs work (WIP) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1065
  • Set actual version to v0.4.0 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1064
  • Updating docs hyperlinks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1066
  • Fiddling with READMEs, Reenable CI tests on main by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1063
  • Update _cot_fewshot_template_yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1074
  • Patch scrolls by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1077
  • Update template of qqp dataset by @shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
  • Change the sub-task name from sst to sst2 in glue by @shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1099
  • Add kmmlu evaluation to tasks by @h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
  • Fix stderr by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1106
  • Simplified evaluator.py by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1104
  • [Refactor] vllm data parallel by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1035
  • Unpack group in write_out by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1113
  • Revert "Simplified evaluator.py" by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1116
  • qqp, mnli_mismatch: remove unlabeled test sets by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1114
  • fix: bug of BBH_cot_fewshot by @Momo-Tori in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
  • Bump BBH version by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1120
  • Refactor hf modeling code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1096
  • Additional process for doc_to_choice by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1093
  • doc_to_decontamination_query can use function by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1082
  • Fix vllm batch_size type by @xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
  • fix: passing max_length to vllm engine args by @NanoCode012 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
  • Fix Loading Local Dataset by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1127
  • place model onto mps by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1133
  • Add benchmark FLD by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
  • fix typo in README.md by @lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
  • add correct openai api key to README.md by @lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1138
  • Update Linter CI Job by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1130
  • add utils.clear_torch_cache() to model_comparator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1142
  • Enabling OpenAI completions via gooseai by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
  • vllm clean up tqdm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1144
  • openai nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1139
  • Add IFEval / Instruction-Following Eval by @wiskojo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
  • set --gen_kwargs arg to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1145
  • Add shorthand flags by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1149
  • fld bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1150
  • Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1154
  • Add docs on adding a multiple choice metric by @polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
  • Simplify evaluator by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1126
  • Generalize Qwen tokenizer fix by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1146
  • self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1172
  • Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
  • feat: add option to upload results to Zeno by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
  • Switch Linting to ruff by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1166
  • Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in https://github.com/EleutherAI/lm-evaluation-harness/pull/1178
  • Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1174
  • Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
  • Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1183
  • Add tokenizer backend by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1186
  • Correctly Print Task Versioning by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1173
  • update Zeno example and reference in README by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1190
  • Remove tokenizer for openai chat completions by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1191
  • Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1181
  • disable mypy by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1193
  • Generic decorator for handling rate limit errors by @zachschillaci27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
  • Refer in README to main branch by @BramVanroy in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
  • Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1189
  • Upstream Mamba Support (mamba_ssm) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1110
  • Update cuda handling by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1180
  • Fix documentation in API table by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1203
  • Consolidate batching by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1197
  • Add remove_whitespace to FLD benchmark by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1206
  • Fix the argument order in utils.divide doc by @xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1208
  • [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1212
  • fix unbounded local variable by @onnoo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
  • nits + fix siqa by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1216
  • add length of strings and answer options to Zeno metadata by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1222
  • Don't silence errors when loading tasks by @polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1148
  • Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1195
  • Update race's README.md by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1230
  • batch_schedular bug in Collator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1229
  • Update openai_completions.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1238
  • vllm: handle max_length better and substitute Collator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1241
  • Remove self.dataset_path post_init process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1243
  • Add multilingual HellaSwag task by @JorgeDeCorte in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
  • Do not escape ascii in logging outputs by @passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/1246
  • fixed fewshot loading for multiple input tasks by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1255
  • Revert citation by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1257
  • Specify utf-8 encoding to properly save non-ascii samples to file by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1265
  • Fix evaluation for the belebele dataset by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
  • Call "exact_match" once for each multiple-target sample by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1266
  • MultiMedQA by @tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/1198
  • Fix bug in multi-token Stop Sequences by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1268
  • Update Table Printing by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1271
  • add Kobest by @jp1924 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
  • Apply process_docs() to fewshot_split by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1276
  • Fix whitespace issues in GSM8k-CoT by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1275
  • Make parallelize=True vs. accelerate launch distinction clearer in docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1261
  • Allow parameter edits for registered tasks when listed in a benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1273
  • Fix data-parallel evaluation with quantized models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1270
  • Rework documentation for explaining local dataset by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1284
  • Update CITATION.bib by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1285
  • Update nq_open / NaturalQs whitespacing by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1289
  • Update README.md with custom integration doc by @msaroufim in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
  • Update nq_open.yaml by @Hannibal046 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
  • Update task_guide.md by @daniellepintz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
  • Pin datasets dependency at 2.15 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1312
  • Fix polemo2_in.yaml subset name by @lhoestq in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
  • Fix datasets dependency to >=2.14 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1314
  • Fix group register by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1315
  • Update task_guide.md by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
  • Update polemo2_in.yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1318
  • Fix: Mamba receives extra kwargs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1328
  • Fix Issue regarding stderr by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1327
  • Add local-completions support using OpenAI interface by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1277
  • fallback to classname when LM doesn't have config by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
  • fix a trailing whitespace that breaks a lint job by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1335
  • skip "benchmarks" in changed_tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1336
  • Update migrated HF dataset paths by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1332
  • Don't use get_task_dict() in task registration / initialization by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1331
  • manage default (greedy) gen_kwargs in vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1341
  • vllm: change default gen_kwargs behaviour; prompt_logprobs=1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1345
  • Update links to advanced_task_guide.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1348
  • Filter docs not offset by doc_id by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1349
  • Add FAQ on lm_eval.tasks.initialize_tasks() to README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1330
  • Refix issue regarding stderr by @thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
  • Add causalLM OpenVino models by @NoushNabi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
  • Apply some best practices and guideline recommendations to code by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363
  • serialize callable functions in config by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1367
  • delay filter init; remove *args by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1369
  • Fix unintuitive --gen_kwargs behavior by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1329
  • Publish to pypi by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1194
  • Make dependencies compatible with PyPI by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1378

New Contributors

  • @shiweijiezero made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
  • @h-albert-lee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
  • @Momo-Tori made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
  • @xTayEx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
  • @NanoCode012 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
  • @MorishT made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
  • @lennijusten made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
  • @veekaybee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
  • @wiskojo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
  • @polm-stability made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
  • @seungduk-yanolja made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
  • @Sparkier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
  • @anjor made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
  • @zachschillaci27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
  • @BramVanroy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
  • @onnoo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
  • @JorgeDeCorte made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
  • @jmichaelov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
  • @jp1924 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
  • @msaroufim made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
  • @Hannibal046 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
  • @daniellepintz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
  • @lhoestq made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
  • @djstrong made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
  • @nairbv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
  • @thnkinbtfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
  • @NoushNabi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
  • @LSinev made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.0...v0.4.1

Files

EleutherAI/lm-evaluation-harness-v0.4.1.zip (1.9 MB, md5:4cf86130077362d0f5d044ca232d2653)
