EleutherAI/lm-evaluation-harness: v0.4.5
Creators
- Lintang Sutawika [1]
- Hailey Schoelkopf [1]
- Leo Gao
- Baber Abbasi
- Stella Biderman [2]
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite [3]
- Fazz
- Thomas Wang [4]
- Niklas Muennighoff
- Aflah [5]
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz [6]
- Chris [7]
- Hanwool Albert Lee [8]
- Khalid
- Zdeněk Kasner [9]
- LSinev
- KonradSzafer
- Jeffrey Hsu [10]
- Anjor Kanekar [11]
- Pawan Sasanka Ammanamanchi
- [1] @EleutherAI
- [2] Booz Allen Hamilton, EleutherAI
- [3] @ClarosAI
- [4] MistralAI
- [5] Indraprastha Institute of Information Technology Delhi
- [6] HiTZ Zentroa, UPV/EHU
- [7] @azurro
- [8] NCSOFT
- [9] Charles University
- [10] Ivy Natal
- [11] Platypus Tech
Description
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, via the new model types `hf-multimodal` and `vllm-vlm`. This allows for the evaluation of models that process text and image inputs and produce text outputs. Currently we have added support for the MMMU validation set (`mmmu_val`) task, and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within `--model_args` to support their specific requirements:
- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default), images are interleaved with the text; when `False`, all images are placed at the front of the text. This is model dependent.

`hf-multimodal`-specific arguments:

- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
- `convert_img_format` (bool): Whether to convert the images to RGB format.
Example usage:
- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`
- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`
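The same evaluation can also be launched programmatically. Below is a minimal sketch using the library's Python entry point, assuming `lm_eval.simple_evaluate` accepts the same options as the CLI flags above (argument names may differ slightly between versions):

```python
# Illustrative sketch: evaluate a hf-multimodal model on mmmu_val via the
# Python API, mirroring the CLI invocation shown above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args={
        "pretrained": "llava-hf/llava-1.5-7b-hf",
        "max_images": 1,            # maximum number of images per prompt
        "interleave": True,         # interleave image placeholders with the text
        "image_string": "<image>",  # Llava-style image placeholder string
    },
    tasks=["mmmu_val"],
    apply_chat_template=True,  # most VLMs require their chat template
)
print(results["results"])
```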
Important considerations
- Chat Template: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
- Some VLMs are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
- Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.
Tested VLM Models
So far, we have most notably tested the implementation with the following models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` installed from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232
- MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243
- TurkishMMLU by @ArdaYueksel in #2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others
There have also been several minor fixes and changes to existing tasks, as noted by their incremented version numbers.
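If you are unsure of the exact registered names for these new tasks in your installed version, they can be listed from the Python API. This is a small sketch, assuming `lm_eval.tasks.TaskManager` and its `all_tasks` listing are available as in recent 0.4.x releases:

```python
# Illustrative sketch: list registered task/group/tag names to locate newly
# added benchmarks, e.g. anything whose name mentions "mmlu" or "bench".
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
matches = [
    name
    for name in task_manager.all_tasks
    if "mmlu" in name or "bench" in name.lower()
]
print(matches)
```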
Backwards Incompatibilities
Finalizing the `group` versus `tag` split

We've now fully deprecated the use of `group` keys directly within a task's configuration file. In many cases the appropriate key to use is now solely `tag`, so migrating typically amounts to renaming a task YAML's `group:` entry to `tag:`. See the v0.4.4 patch notes for more information on migration if you maintain a set of task YAMLs outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want to use causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.

As a result, users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` themselves during initialization.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see https://github.com/EleutherAI/lm-evaluation-harness/pull/2353 for the full set of changes.
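For illustration, a minimal sketch of such a subclass is shown below. The class name, its model choice, and its initialization flow are assumptions for the sake of example, not code from the release; the only point demonstrated is the explicit `self.backend` assignment:

```python
# Minimal, illustrative sketch only: a HFLM subclass that performs its own
# setup rather than calling HFLM.__init__() must now declare its backend.
import transformers

from lm_eval.models.huggingface import HFLM


class MyVision2SeqLM(HFLM):
    # Hypothetical example: causal-style behavior with a non-causal factory class.
    AUTO_MODEL_CLASS = transformers.AutoModelForVision2Seq

    def __init__(self, pretrained: str, **kwargs) -> None:
        # ... custom model/tokenizer loading would go here in place of
        #     HFLM.__init__(); omitted in this sketch ...
        # As of v0.4.5, input handling dispatches on self.backend rather than
        # on AUTO_MODEL_CLASS, so set it explicitly:
        self.backend = "causal"  # or "seq2seq" for encoder-decoder models
```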
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
- Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by @Malikeh97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2232
- Multimodal prototyping by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/2243
- Update README.md by @SYusupov in https://github.com/EleutherAI/lm-evaluation-harness/pull/2297
- remove comma by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2315
- Update neuron backend by @dacorvo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2314
- Fixed dummy model by @Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/2339
- Add a note for missing dependencies by @eldarkurtic in https://github.com/EleutherAI/lm-evaluation-harness/pull/2336
- squad v2: load metric with `evaluate` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2351
- fix writeout script by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2350
- Treat tags in python tasks the same as yaml tasks by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2288
- change group to tags in `eus_exams` task configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2320
- change glianorex to test split by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2332
- mmlu-pro: add newlines to task descriptions (not leaderboard) by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2334
- Added TurkishMMLU to LM Evaluation Harness by @ArdaYueksel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2283
- add mmlu readme by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2282
- openai: better error messages; fix greedy matching by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2327
- fix some bugs of mmlu by @eyuansu62 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2299
- Add new benchmark: Portuguese bench by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2156
- Fix missing key in custom task loading. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/2304
- Add new benchmark: Spanish bench by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2157
- Add new benchmark: Galician bench by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2155
- Add new benchmark: Basque bench by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2153
- Add new benchmark: Catalan bench by @zxcvuser in https://github.com/EleutherAI/lm-evaluation-harness/pull/2154
- fix tests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2380
- Hotfix! by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2383
- Solution for CSAT-QA tasks evaluation by @KyujinHan in https://github.com/EleutherAI/lm-evaluation-harness/pull/2385
- LingOly - Fixing scoring bugs for smaller models by @am-bean in https://github.com/EleutherAI/lm-evaluation-harness/pull/2376
- Fix float limit override by @cjluo-omniml in https://github.com/EleutherAI/lm-evaluation-harness/pull/2325
- [API] tokenizer: add trust-remote-code by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2372
- HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2353
- max_images are passed on to vllm's `limit_mm_per_prompt` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2387
- Fix Llava-1.5-hf; Update to version 0.4.5 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2388
- Bump version to v0.4.5 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/2389
New Contributors
- @Malikeh97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2232
- @SYusupov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2297
- @dacorvo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2314
- @eldarkurtic made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2336
- @giuliolovisotto made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2288
- @ArdaYueksel made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2283
- @zxcvuser made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2156
- @KyujinHan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2385
- @cjluo-omniml made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/2325
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.4...v0.4.5
Files (3.4 MB)

| Name | Size |
|---|---|
| EleutherAI/lm-evaluation-harness-v0.4.5.zip (md5:3d59962a5722542907bdf6dbef1e873c) | 3.4 MB |
Additional details
Related works
- Is supplement to: Software: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.5