There is a newer version of the record available.

Published May 9, 2025 | Version 1.0
Dataset Open

CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion

  • 1. University of Southampton

Description

CPIQA is a large-scale QA dataset focused on figures extracted from scientific research papers from various peer-reviewed venues in the climate science domain. The extracted figures include tables, graphs and diagrams, which inform the generation of questions using large language models (LLMs). Notably, the dataset includes questions for three audiences: general public, climate skeptic and climate expert. Four types of questions are generated, each with a different focus: figures, numerical, text-only and general. Crossing the three audiences with the four question types gives 12 questions per scientific paper. Alongside the figures, descriptions of the figures generated using multimodal LLMs are included and used.
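The count of 12 questions per paper follows from pairing each audience with each question type. The short sketch below only illustrates that arithmetic. The label strings are copied from the description above and the function name `question_slots` is hypothetical; neither is guaranteed to match the field names or layout used inside cpiqa.zip.

```python
from itertools import product

# Labels taken from the record description. The exact strings are
# illustrative assumptions, not confirmed field names from the dataset.
AUDIENCES = ("general public", "climate skeptic", "climate expert")
QUESTION_TYPES = ("figures", "numerical", "text-only", "general")


def question_slots():
    """Return every (audience, question type) pair for a single paper."""
    return list(product(AUDIENCES, QUESTION_TYPES))


slots = question_slots()
print(len(slots))  # 3 audiences x 4 question types = 12
for audience, question_type in slots:
    print(f"{audience}: {question_type}")
```

Each pair corresponds to one generated question, so a paper with complete coverage would have one entry per pair.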

This work was funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT.

Mutalik, R., Panchalingam, A., Loitongbam, G., Osborn, T. J., Hawkins, E., Middleton, S. E. CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion, ClimateNLP-2025, ACL, 31st July 2025, https://nlp4climate.github.io/

Files

cpiqa.zip

Files (43.4 GB)

Name       Size
cpiqa.zip  43.4 GB
md5:87f0a9e3f91f28473bdc2bb06a949f87
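Because the archive is large, it may be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The local path `cpiqa.zip` and the helper names `md5_of_file` and `verify_archive` are assumptions made for illustration and are not part of the dataset or its repository.

```python
import hashlib
from pathlib import Path

# Checksum as listed on the record page for cpiqa.zip.
EXPECTED_MD5 = "87f0a9e3f91f28473bdc2bb06a949f87"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Stream the file so a 43.4 GB archive never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("cpiqa.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.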

Additional details

Funding

UK Research and Innovation
Global Surface Air Temperature (GloSAT) NE/S015604/1

Software

Repository URL
https://github.com/RudraMutalik/CPIQA
Programming language
Python