Published June 7, 2025 | Version v1
Dataset Open

KnowAir-V2: A Benchmark Dataset for Air Quality Forecasting with PCDCNet

  • 1. School of Systems Science, Beijing Normal University, Beijing, China
  • 2. D-ITET, ETH Zurich, Zurich, Switzerland
  • 3. Swiss Data Science Center, ETH Zurich, Zurich, Switzerland
  • 4. ColorfulClouds Technology Co., Ltd., Beijing, China
  • 5. Graz University of Technology, Graz, Austria
  • 6. Complexity Science Hub, Vienna, Austria
  • 7. Institute of Nonequilibrium Systems, Beijing Normal University, Beijing, China
  • 8. Potsdam Institute for Climate Impact Research, Potsdam, Germany

Description

Abstract

This document describes the KnowAir-V2 dataset, a large-scale data resource first introduced in our paper, "PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints" (arXiv:2505.19842). The dataset is designed as a benchmark for developing and validating deep learning surrogate models that incorporate physical-chemical principles. Traditional numerical models are computationally expensive, while purely data-driven models often lack physical consistency; this dataset aims to bridge that gap. The data has been preprocessed, including the imputation of missing values, and has been used in operational online forecasting systems. It provides a foundation for building and evaluating interpretable, physically consistent AI models for environmental science.

Dataset Description

Geographic and Temporal Coverage

  • Regions: The dataset covers two major, densely populated regions in China:

    • The Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), with data from 228 monitoring stations.

    • The Yangtze River Delta (YRD), with data from 127 monitoring stations.

  • Temporal Span: The data spans from January 1, 2016, to December 31, 2023, providing an extensive timeline for training and evaluation.
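As a quick sanity check on the time axis, the stated span implies a fixed number of hourly steps. The sketch below is only an illustration: it assumes the series starts at 00:00 on January 1, 2016, ends at 23:00 on December 31, 2023, and has no gaps, none of which is confirmed above (the time zone is also not stated).

```python
from datetime import datetime, timedelta

# Assumption: hourly axis from 2016-01-01 00:00 to 2023-12-31 23:00, inclusive.
START = datetime(2016, 1, 1, 0)
END = datetime(2023, 12, 31, 23)


def expected_hourly_steps(start: datetime, end: datetime) -> int:
    """Number of hourly timestamps from start to end, both inclusive."""
    return (end - start) // timedelta(hours=1) + 1


n_steps = expected_hourly_steps(START, END)
print(n_steps)  # 70128 (2922 days * 24 hours; 2016 and 2020 are leap years)
```

Comparing this number with the length of the time coordinate in the NetCDF files is one way to confirm that a download is complete.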

Data Content

This dataset contains hourly time-series data for a total of 10 variables:

  • Air Quality Variables (from CNEMC):

    • PM2.5 (Fine particulate matter)

    • O3 (Ozone)

  • Meteorological Variables (from ERA5 Reanalysis):

    • t2m: 2-meter air temperature

    • d2m: 2-meter dew point temperature

    • tp: Total precipitation

    • sp: Surface pressure

    • blh: Boundary layer height

    • msdwswrf: Mean surface downward short-wave radiation flux

    • u100: U-component of wind at 100m

    • v100: V-component of wind at 100m
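Several commonly used quantities can be derived from the variables listed above. The following is a minimal sketch, not part of the dataset or of PCDCNet: it assumes that t2m and d2m are in Kelvin and that u100 and v100 are in m/s, as in ERA5 (the units in the distributed files should be checked), and it uses the Magnus approximation for relative humidity. The function names are hypothetical.

```python
import math
from typing import Tuple


def wind_speed_and_direction(u: float, v: float) -> Tuple[float, float]:
    """Return (speed, direction) from eastward (u) and northward (v) wind.

    Direction uses the meteorological convention: degrees clockwise from
    north, giving the direction the wind blows FROM.
    """
    speed = math.hypot(u, v)
    direction = math.degrees(math.atan2(-u, -v)) % 360.0
    return speed, direction


def relative_humidity(t2m_kelvin: float, d2m_kelvin: float) -> float:
    """Approximate relative humidity (%) via the Magnus formula.

    Assumption: both inputs are in Kelvin, as in ERA5.
    """
    def saturation_vapour_pressure(celsius: float) -> float:
        # Magnus approximation over water, result in hPa.
        return 6.1094 * math.exp(17.625 * celsius / (243.04 + celsius))

    t = t2m_kelvin - 273.15
    td = d2m_kelvin - 273.15
    return 100.0 * saturation_vapour_pressure(td) / saturation_vapour_pressure(t)


speed, direction = wind_speed_and_direction(3.0, 4.0)
print(speed)  # 5.0
```

A wind with u = 5 and v = 0 blows from the west, so the returned direction is 270 degrees; when the dew point equals the air temperature, the relative humidity is 100%.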

File Inventory

  • stations_bthsa.csv & stations_yrd.csv: These files contain metadata for the monitoring stations. Each station has a unique station_id and geographic coordinates (lon, lat).

  • dataset_bthsa.nc & dataset_yrd.nc: These are the primary data files in NetCDF format.

    • Coordinates: station and time.

    • Data variables: The air quality and meteorological variables listed above.

    • Usage: These files can be easily opened and explored with tools like Python's xarray library or visualized with software like Panoply.
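The files above might be read along the following lines. This is a sketch under assumptions: the helper names load_stations and open_region are hypothetical, the sample rows are made up for illustration, and the real CSV files may contain additional columns. Only the column names station_id, lon and lat and the file names come from the description above.

```python
import csv
import io


def load_stations(fileobj):
    """Map station_id -> (lon, lat) from an open stations_*.csv file."""
    reader = csv.DictReader(fileobj)
    return {
        row["station_id"]: (float(row["lon"]), float(row["lat"]))
        for row in reader
    }


def open_region(nc_path):
    """Open dataset_bthsa.nc or dataset_yrd.nc with xarray."""
    # Third-party import kept local; xarray also needs a NetCDF backend
    # such as netCDF4 or h5netcdf to be installed.
    import xarray as xr
    return xr.open_dataset(nc_path)


# Illustration with made-up rows (not real station metadata).
sample = io.StringIO("station_id,lon,lat\nS001,116.40,39.90\nS002,121.47,31.23\n")
stations = load_stations(sample)
print(stations["S001"])  # (116.4, 39.9)

# With the downloaded files in the working directory:
#   with open("stations_bthsa.csv", newline="") as f:
#       stations = load_stations(f)
#   ds = open_region("dataset_bthsa.nc")
#   print(ds)  # lists the dimensions, coordinates and data variables
```

Printing the opened dataset is the simplest way to confirm the exact variable names and units before writing any downstream code.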

Data Quality and Significance

This is an operational-level dataset that has powered online forecasting services. A statistical comparison in "Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network" (arXiv:2504.10014) highlights that, among public air quality datasets (including our prior work, KnowAir/PM2.5-GNN described in arXiv:2002.12898), this dataset offers the longest and most recent temporal coverage. It focuses on two of China's most significant and closely watched regions, which exhibit strong spatio-temporal correlations, making them an excellent benchmark and testbed for advanced spatio-temporal prediction models targeting a complex, real-world problem.

Usage Notes

To fully reproduce the results of the PCDCNet model, emission inventory data is also required. Due to licensing restrictions, this component is not distributed here. Researchers must register an account on the official website of the Multi-resolution Emission Inventory for China (MEIC) at http://meicmodel.org.cn to download the necessary data for pollutants such as NOx, VOC, SO2, NH3, PM10 and PM2.5. However, a valid model can still be trained and run using only the meteorological and air quality data provided in this dataset, though including emissions data is recommended for achieving optimal performance.

License and Citation

This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This means you are free to share and adapt the dataset for any purpose, provided that you give appropriate credit by citing both the original paper and this dataset.

How to Cite

If you use this dataset in your research, please cite the following:

  1. The Paper (arXiv:2505.19842):

    Wang, S., Cheng, Y., Meng, Q., Saukh, O., Zhang, J., Fan, J., Zhang, Y., Yuan, X., & Thiele, L. (2025). PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints. arXiv preprint arXiv:2505.19842.

  2. This Dataset:

    Please cite this dataset using the Zenodo DOI provided on this page.

Files

Files (997.0 MB)

Checksum and size of each file:
  • md5:7c738d99246683c2711ac48f3a737aa9, 640.2 MB
  • md5:7c7dcef8b5070cbff3ec9abc198a0fb9, 356.8 MB
  • md5:a7fd2302a5b21276276e78c9355af764, 12.1 kB
  • md5:1a23d70eb49fd0a350d475518089d45e, 6.6 kB
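
Because the two NetCDF files are large, it can be worth verifying a download against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the expected checksum for each file should be copied from this page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, matches_checksum("dataset_yrd.nc", "md5:...") returns True only if the local file is byte-identical to the published one, with the checksum string taken from the entry for that file.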

Additional details

Related works

Is published in
10.48550/arXiv.2505.19842 (DOI)