KnowAir-V2: A Benchmark Dataset for Air Quality Forecasting with PCDCNet
Authors/Creators
- 1. School of Systems Science, Beijing Normal University, Beijing, China
- 2. D-ITET, ETH Zurich, Zurich, Switzerland
- 3. Swiss Data Science Center, ETH Zurich, Zurich, Switzerland
- 4. ColorfulClouds Technology Co., Ltd., Beijing, China
- 5. Graz University of Technology, Graz, Austria
- 6. Complexity Science Hub, Vienna, Austria
- 7. Institute of Nonequilibrium Systems, Beijing Normal University, Beijing, China
- 8. Potsdam Institute for Climate Impact Research, Potsdam, Germany
Description
Abstract
This document describes the KnowAir-V2 dataset, a high-quality, large-scale data resource first introduced in our paper, "PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints" (arXiv:2505.19842). The dataset is designed as a benchmark for developing and validating deep learning surrogate models that incorporate physical-chemical principles. Traditional numerical models are computationally expensive, while purely data-driven models often lack physical consistency; this dataset is intended to support work that bridges that gap. The data has been preprocessed, including imputation of missing values, and has been used in operational online forecasting systems. It serves as a foundation for building and evaluating interpretable, physically consistent AI models for environmental science.
Dataset Description
Geographic and Temporal Coverage
- Regions: The dataset covers two major, densely populated regions in China:
  - The Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), with data from 228 monitoring stations.
  - The Yangtze River Delta (YRD), with data from 127 monitoring stations.
- Temporal Span: The data spans from January 1, 2016, to December 31, 2023, providing an extensive timeline for training and evaluation.
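As a quick sanity check on the temporal span, the number of hourly time steps implied by the stated dates can be computed directly. The sketch below assumes the series starts at 00:00 on January 1, 2016, ends at 23:00 on December 31, 2023, and has no gaps; the time zone and exact first and last timestamps are not stated in this description, and `expected_hourly_steps` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumed first and last hourly timestamps of the stated span.
START = datetime(2016, 1, 1, 0)
END = datetime(2023, 12, 31, 23)


def expected_hourly_steps(start, end):
    """Count hourly timestamps from start to end, both inclusive."""
    return (end - start) // timedelta(hours=1) + 1


n_steps = expected_hourly_steps(START, END)
print(n_steps)
```

Under these assumptions the span covers 2,922 days (two of the eight years are leap years), so the length of the time coordinate in each NetCDF file can be compared against 70,128.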
Data Content
This dataset contains hourly time-series data for a total of 10 variables:
- Air Quality Variables (from CNEMC):
  - PM2.5 (Fine particulate matter)
  - O3 (Ozone)
- Meteorological Variables (from ERA5 Reanalysis):
  - t2m: 2-meter air temperature
  - d2m: 2-meter dew point temperature
  - tp: Total precipitation
  - sp: Surface pressure
  - blh: Boundary layer height
  - msdwswrf: Mean surface downward short-wave radiation flux
  - u100: U-component of wind at 100m
  - v100: V-component of wind at 100m
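The two wind components listed above are often combined into a wind speed and direction before being used as model features. The following is a minimal sketch, assuming u100 and v100 follow the usual ERA5 convention (u is the eastward component, v is the northward component, both in m/s). The function names are illustrative and are not part of the dataset or of PCDCNet.

```python
import math


def wind_speed(u, v):
    """Horizontal wind speed from eastward (u) and northward (v) components."""
    return math.hypot(u, v)


def wind_direction_from(u, v):
    """Meteorological wind direction in degrees.

    This is the direction the wind blows FROM, measured clockwise from
    north (0 = northerly, 90 = easterly).
    """
    return math.degrees(math.atan2(-u, -v)) % 360.0


# Example: a wind blowing toward the west (u < 0, v = 0) comes from the east.
print(wind_speed(3.0, 4.0))
print(wind_direction_from(-1.0, 0.0))
```

Whether a model benefits more from raw components or from speed and direction is a modelling choice; the components avoid the discontinuity at 0/360 degrees.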
File Inventory
- stations_bthsa.csv & stations_yrd.csv: These files contain metadata for the monitoring stations. Each station has a unique station_id and includes geographic coordinates (lon, lat).
- dataset_bthsa.nc & dataset_yrd.nc: These are the primary data files in NetCDF format.
  - Coordinates: station and time.
  - Data variables: The air quality and meteorological variables listed above.
  - Usage: These files can be opened and explored with tools such as Python's xarray library, or visualized with software such as Panoply.
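The files described above can be read along the following lines. This is a sketch under stated assumptions rather than an official loader: the CSV header is assumed to use exactly the column names station_id, lon and lat; the spelling of the two air-quality variable names inside the NetCDF files is assumed; `read_stations` and `open_region` are hypothetical helper names; and the sample rows are made up so the snippet runs without the real files. Opening the NetCDF files requires xarray with a NetCDF backend installed.

```python
import csv
import io

# Variable names as listed in the dataset description. The exact spelling of
# the two air-quality variables inside the NetCDF files is an assumption.
AIR_QUALITY_VARS = ["PM2.5", "O3"]
METEO_VARS = ["t2m", "d2m", "tp", "sp", "blh", "msdwswrf", "u100", "v100"]


def read_stations(handle):
    """Parse station metadata into {station_id: (lon, lat)}."""
    reader = csv.DictReader(handle)
    return {
        row["station_id"]: (float(row["lon"]), float(row["lat"]))
        for row in reader
    }


def open_region(nc_path):
    """Open one regional NetCDF file; needs xarray and a NetCDF backend."""
    import xarray as xr  # imported lazily so the CSV helper works without it

    return xr.open_dataset(nc_path)


# In-memory stand-in for stations_bthsa.csv (IDs and coordinates are made up).
sample = io.StringIO(
    "station_id,lon,lat\n"
    "S001,116.40,39.90\n"
    "S002,117.20,39.13\n"
)
stations = read_stations(sample)
print(stations)

# With the real files downloaded, usage might look like this:
# ds = open_region("dataset_bthsa.nc")
# print(ds)  # shows dimensions, coordinates and data variables
# first_day = ds["t2m"].isel(station=0, time=slice(0, 24))
```

Printing the opened dataset is the quickest way to confirm the actual variable names and coordinate layout before relying on the assumed ones.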
Data Quality and Significance
This is an operational-level dataset that has powered online forecasting services. A statistical comparison in "Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network" (arXiv:2504.10014) highlights that, among public air quality datasets (including our prior work, KnowAir/PM2.5-GNN described in arXiv:2002.12898), this dataset offers the longest and most recent temporal coverage. It focuses on two of China's most significant and closely watched regions, which exhibit strong spatio-temporal correlations, making them an excellent benchmark and testbed for advanced spatio-temporal prediction models targeting a complex, real-world problem.
Usage Notes
To fully reproduce the results of the PCDCNet model, emission inventory data is also required. Due to licensing restrictions, this component is not distributed here. Researchers must register an account on the official website of the Multi-resolution Emission Inventory for China (MEIC) at http://meicmodel.org.cn to download the necessary data for pollutants such as NOx, VOC, SO2, NH3, PM10 and PM2.5. However, a valid model can still be trained and run using only the meteorological and air quality data provided in this dataset, though including emissions data is recommended for achieving optimal performance.
License and Citation
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This means you are free to share and adapt the dataset for any purpose, provided that you give appropriate credit by citing both the original paper and this dataset.
How to Cite
If you use this dataset in your research, please cite the following:
- The Paper (arXiv:2505.19842):
  Wang, S., Cheng, Y., Meng, Q., Saukh, O., Zhang, J., Fan, J., Zhang, Y., Yuan, X., & Thiele, L. (2025). PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints. arXiv preprint arXiv:2505.19842.
- This Dataset:
  Please cite this dataset using the Zenodo DOI provided on this page.
Additional details
Related works
- Is published in: 10.48550/arXiv.2505.19842 (DOI)