CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion

Mutalik, Rudra; Middleton, Stuart E.

doi:10.5281/zenodo.15374870

Published May 9, 2025 | Version 1.0

Dataset Open

1. University of Southampton

CPIQA is a large-scale QA dataset focused on figures extracted from scientific research papers published in various peer-reviewed venues in the climate science domain. The extracted figures include tables, graphs and diagrams, which inform the generation of questions using large language models (LLMs). Notably, the dataset includes questions for three audiences: general public, climate skeptic and climate expert. Four types of question are generated, each with a different focus: figures, numerical, text-only and general. This results in 12 questions per scientific paper (three audiences × four question types). Alongside the figures, descriptions of the figures generated using multimodal LLMs are included and used.
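The count of 12 questions per paper follows from pairing each audience with each question type. A minimal sketch of that arithmetic is given here; the label strings paraphrase the description and are illustrative only, not confirmed field names or values in the CPIQA files.

```python
from itertools import product

# Labels paraphrase the dataset description; they are illustrative
# assumptions, not confirmed field names or values in CPIQA.
AUDIENCES = ["general public", "climate skeptic", "climate expert"]
QUESTION_TYPES = ["figure", "numerical", "text-only", "general"]


def question_slots():
    """Return every (audience, question type) pair for one paper."""
    return list(product(AUDIENCES, QUESTION_TYPES))


slots = question_slots()
print(len(slots))  # 3 audiences x 4 question types = 12
```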

This work was funded through the WCSSP South Africa project, a collaborative initiative between the Met Office, South African and UK partners, supported by the International Science Partnership Fund (ISPF) from the UK's Department for Science, Innovation and Technology (DSIT). It is also supported by the Natural Environment Research Council (grant NE/S015604/1) project GloSAT.

Mutalik, R., Panchalingam, A., Loitongbam, G., Osborn, T. J., Hawkins, E., Middleton, S. E. CPIQA: Climate Paper Image Question Answering Dataset for Retrieval-Augmented Generation with Context-based Query Expansion, ClimateNLP-2025, ACL, 31st July 2025, https://nlp4climate.github.io/

Files (43.4 GB)

Name	Size	MD5
cpiqa.zip	43.4 GB	87f0a9e3f91f28473bdc2bb06a949f87
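Because the archive is large, it may be worth checking a downloaded copy against the published MD5 checksum before unpacking it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_archive` are hypothetical helpers introduced for illustration, and the local path `cpiqa.zip` is an assumption about where the file was saved.

```python
import hashlib

# MD5 checksum published for cpiqa.zip in the file listing.
EXPECTED_MD5 = "87f0a9e3f91f28473bdc2bb06a949f87"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5."""
    return md5_of_file(path) == expected.lower()


# Example call (assumes the archive was saved as ./cpiqa.zip):
# verify_archive("cpiqa.zip")
```

Reading in chunks keeps memory use constant, which matters for a 43.4 GB file.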

Additional details

URL: https://huggingface.co/datasets/RudraMutalik/CPIQA

UK Research and Innovation
Global Surface Air Temperature (GloSAT) NE/S015604/1

Repository URL: https://github.com/RudraMutalik/CPIQA
Programming language: Python

	All versions	This version
Views	105	101
Downloads	20	20
Data volume	911.6 GB	911.6 GB
