CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion
Description
CPIQA is a large-scale QA dataset focused on figures extracted from scientific research papers from various peer-reviewed venues in the climate science domain. The extracted figures include tables, graphs and diagrams, which inform the generation of questions using large language models (LLMs). Notably, the dataset includes questions for three audiences: general public, climate skeptic and climate expert. Four types of questions are generated for each audience, with different focuses: figures, numerical, text-only and general. This results in 12 questions per scientific paper. Alongside the figures, descriptions of the figures generated using multimodal LLMs are included and used.
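As a rough illustration of how three audiences and four question types combine into 12 questions per paper, the sketch below enumerates the combinations. The label strings and the `question_slots` helper are assumptions made for this example. They are not taken from the dataset's actual schema or field names.

```python
from itertools import product

# Illustrative labels only; the dataset's own field names may differ.
AUDIENCES = ("general public", "climate skeptic", "climate expert")
QUESTION_TYPES = ("figure", "numerical", "text-only", "general")


def question_slots():
    """Return every (audience, question type) pair for one paper."""
    return list(product(AUDIENCES, QUESTION_TYPES))


if __name__ == "__main__":
    slots = question_slots()
    for audience, question_type in slots:
        print(f"{audience}: {question_type}")
    print(len(slots))  # 3 audiences x 4 types = 12
```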
This work was funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT.
Mutalik, R., Panchalingam, A., Loitongbam, G., Osborn, T. J., Hawkins, E., Middleton, S. E. CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion, ClimateNLP-2025, ACL, 31st July 2025, https://nlp4climate.github.io/
Files
Name | Size | MD5 |
---|---|---|
cpiqa.zip | 43.4 GB | 87f0a9e3f91f28473bdc2bb06a949f87 |
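Given the size of the archive, it may be worth checking a completed download against the MD5 checksum listed for cpiqa.zip. The following is a minimal sketch using Python's standard `hashlib` module. The local path `cpiqa.zip` and the `md5_of_file` helper are assumptions for illustration and are not part of the dataset or its repository.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for cpiqa.zip.
EXPECTED_MD5 = "87f0a9e3f91f28473bdc2bb06a949f87"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading a 43.4 GB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("cpiqa.zip")  # assumed download location
    if archive.exists():
        matches = md5_of_file(archive) == EXPECTED_MD5
        print("checksum OK" if matches else "checksum mismatch")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again before extraction.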
Additional details
Funding
- UK Research and Innovation
- Global Surface Air Temperature (GloSAT) NE/S015604/1
Software
- Repository URL: https://github.com/RudraMutalik/CPIQA
- Programming language: Python