Published September 3, 2025 | Version 1.0.0
Dataset Open

Air Quality Data from Regulatory AQMS and Low-Cost Sensors in Molí del Sol, Valencia (May 31, 2024 - January 23, 2025)

Description

This dataset contains parallel time-series data for air quality and meteorological parameters, collected to calibrate low-cost air quality sensors (LCS) against a regulatory-grade reference monitoring station. The data were collected continuously over 238 days, from May 31, 2024, to January 23, 2025, at the Molí del Sol air quality station in the Valencian Community, Spain.

The dataset is divided into two main sources:

  1. Reference Data (GVA): The gva_*.csv files originate from an official Valencian Air Quality Monitoring Network (VAQMN) station managed by the Generalitat Valenciana (GVA). These data are the high-accuracy "ground truth" measurements from professional, regulatory-grade instruments. The selected reference station is 'Molí del Sol' (station code 46250048), located at lat: 39.48113875, lon: -0.40855865.

  2. Low-Cost Sensor Data (LCS): The n1_*.csv and n2_*.csv files contain raw data collected by two IoT nodes (Node 1 and Node 2) equipped with ZPHS01B multi-sensor modules. These nodes were co-located with the official GVA station to ensure measurements were taken under identical ambient conditions.

The raw sensor data from all sources has been processed and aggregated into synchronized time intervals of 10, 30, and 60 minutes to facilitate direct comparison and the training of machine learning models. This dataset is designed for developing and evaluating machine learning algorithms to improve the accuracy of raw pollutant readings (e.g., Ozone, Nitrogen Dioxide) from the low-cost sensors, a process detailed in the associated research publication.
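As a point of reference for the calibration task described above, the following sketch fits an ordinary least-squares line that maps raw LCS readings onto reference readings. It is only a baseline illustration and is not the method of the associated publication; the function names are hypothetical and the sample values are synthetic placeholders, not measurements from this dataset.

```python
def fit_line(raw, reference):
    """Ordinary least squares fit of: reference ~ slope * raw + intercept."""
    n = len(raw)
    mean_x = sum(raw) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in raw)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_calibration(raw, slope, intercept):
    """Apply the fitted line to raw sensor readings."""
    return [slope * x + intercept for x in raw]


# Synthetic placeholder values (NOT taken from the dataset).
lcs_o3 = [10.0, 20.0, 30.0, 40.0, 50.0]
ref_o3 = [25.0, 45.0, 65.0, 85.0, 105.0]

slope, intercept = fit_line(lcs_o3, ref_o3)
calibrated = apply_calibration(lcs_o3, slope, intercept)
print(slope, intercept)
```

A linear fit like this is mainly useful as a yardstick against which more flexible machine learning models can be compared.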

Technical info

Files Description

gva_10.csv, gva_30.csv, gva_60.csv: Data from the official Molí del Sol reference station, aggregated at 10-, 30-, and 60-minute intervals, respectively.

n1_10.csv, n1_30.csv, n1_60.csv: Data from the first low-cost sensor node (Node 1), aggregated at 10-, 30-, and 60-minute intervals, respectively.

n2_10.csv, n2_30.csv, n2_60.csv: Data from the second low-cost sensor node (Node 2), aggregated at 10-, 30-, and 60-minute intervals, respectively.
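Because the reference and LCS files share the same aggregation intervals, a reference file and a node file of matching resolution can be paired on the rounded_datetime column. The sketch below shows one possible way to do this with the Python standard library. The helper names and column prefixes are hypothetical, the in-memory CSV text uses placeholder values rather than real measurements, and the tolerance for float-notation timestamps is an assumption about how the column may be serialized.

```python
import csv
import io

KEY = "rounded_datetime"


def parse_ns(text):
    """Parse a nanosecond timestamp written as an integer or in float notation."""
    try:
        return int(text)
    except ValueError:
        return int(float(text))


def read_keyed(handle):
    """Read CSV rows from an open file handle into {timestamp_ns: row}."""
    return {parse_ns(row[KEY]): row for row in csv.DictReader(handle)}


def join_on_timestamp(ref, lcs, ref_prefix="gva_", lcs_prefix="n1_"):
    """Inner join: keep only the intervals present in both tables."""
    joined = []
    for ts in sorted(ref.keys() & lcs.keys()):
        row = {KEY: ts}
        row.update({ref_prefix + k: v for k, v in ref[ts].items() if k != KEY})
        row.update({lcs_prefix + k: v for k, v in lcs[ts].items() if k != KEY})
        joined.append(row)
    return joined


# Tiny in-memory stand-ins with placeholder values (NOT real measurements).
# With the extracted archive, open("gva_60.csv", newline="") and
# open("n1_60.csv", newline="") could be passed instead (paths are assumed).
gva_demo = io.StringIO(
    "O3,rounded_datetime\n60.0,1717113600000000000\n62.0,1717117200000000000\n"
)
n1_demo = io.StringIO(
    "O3,rounded_datetime\n41.0,1717117200000000000\n43.0,1717120800000000000\n"
)

pairs = join_on_timestamp(read_keyed(gva_demo), read_keyed(n1_demo))
print(pairs)
```

An inner join drops intervals that are missing from either source, which is usually what a calibration pipeline needs; prefixing the columns keeps the identically named reference and LCS variables (for example O3) apart.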

Variable Descriptions

GVA Files (gva_*.csv)

  • SO2: Sulfur Dioxide concentration (µg/m³)

  • CO: Carbon Monoxide concentration (ppm)

  • O3: Ozone concentration (µg/m³) - This is the primary reference variable.

  • NOx: Total Nitrogen Oxides concentration (µg/m³)

  • NO: Nitric Oxide concentration (µg/m³)

  • NO2: Nitrogen Dioxide concentration (µg/m³)

  • PM10: Particulate Matter < 10 µm (µg/m³)

  • PM2,5: Particulate Matter < 2.5 µm (µg/m³)

  • PM1: Particulate Matter < 1 µm (µg/m³)

  • SPL: Sound Pressure Level (dB)

  • PM10_S/C, PM2,5_S/C, PM1_S/C: Additional Particulate Matter readings from a secondary or control system.

  • Temp_Int: Internal Temperature of the station enclosure (°C)

  • HR_Int: Internal Relative Humidity of the station enclosure (%)

  • HR: Relative Humidity (%)*

  • Vac_Min: Minimum vacuum reading**

  • rounded_datetime: Unix timestamp in nanoseconds, marking the start of the measurement interval.

* This column appears to be an erroneous duplicate of the PM1 column and should be used with caution.
** This column is present in the 30 and 60-minute files but contains no values for this time period.
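The rounded_datetime values are nanosecond Unix timestamps, so they need to be converted before they can be read as calendar dates. The sketch below shows one way to do this with the standard library. The function name is hypothetical, and interpreting the result as UTC is an assumption, since the time zone of the timestamps is not stated here.

```python
from datetime import datetime, timezone


def ns_to_datetime(ns):
    """Convert a Unix timestamp in nanoseconds to a timezone-aware datetime.

    Assumes UTC; sub-microsecond precision is discarded because
    datetime objects resolve to microseconds only.
    """
    seconds, remainder_ns = divmod(int(ns), 1_000_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=remainder_ns // 1000)


# Example value corresponding to the first day of the collection period.
start = ns_to_datetime(1717113600000000000)
print(start.isoformat())  # 2024-05-31T00:00:00+00:00
```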

LCS Files (n1_*.csv, n2_*.csv)

  • Temp: Ambient Temperature (°C)

  • HR: Relative Humidity (%)

  • PM1: Particulate Matter < 1 µm (µg/m³)

  • PM2_5: Particulate Matter < 2.5 µm (µg/m³)

  • PM10: Particulate Matter < 10 µm (µg/m³)

  • VOC: Volatile Organic Compounds (level-based)

  • CH2O: Formaldehyde concentration (mg/m³)

  • CO2: Carbon Dioxide concentration (ppm)

  • CO: Carbon Monoxide concentration (ppm)

  • O3: Raw Ozone concentration from the LCS (µg/m³, converted from ppm for the study) - This is the primary variable to be calibrated.

  • NO2: Raw Nitrogen Dioxide concentration from the LCS (ppm)

  • rounded_datetime: Unix timestamp in nanoseconds, marking the start of the measurement interval.
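The LCS NO2 column is reported in ppm, whereas the reference NO2 column is in µg/m³, so a unit conversion is needed before the two can be compared directly. The sketch below uses the ideal gas law for this. The function name is hypothetical, and the default conditions (20 °C, 101.325 kPa) are an assumption; the conditions used for the O3 conversion in the study are not stated here and may differ.

```python
R = 8.314462618  # molar gas constant, J/(mol*K)

# Molar masses in g/mol.
MOLAR_MASS = {"NO2": 46.0055, "O3": 47.9982, "CO": 28.010}


def ppm_to_ugm3(ppm, gas, temp_c=20.0, pressure_pa=101_325.0):
    """Convert a volume mixing ratio (ppm) to mass concentration (ug/m3).

    Uses the ideal gas molar volume V_m = R * T / P. The default
    temperature and pressure are assumed, not taken from the dataset.
    """
    molar_volume_m3 = R * (temp_c + 273.15) / pressure_pa  # m3/mol
    # ppm * 1e-6 (mol/mol) * M (g/mol) / V_m (m3/mol) * 1e6 (ug/g)
    return ppm * MOLAR_MASS[gas] / molar_volume_m3


no2_ugm3 = ppm_to_ugm3(0.05, "NO2")
print(round(no2_ugm3, 1))
```

Where co-located temperature readings are available, passing the measured temperature instead of the default gives a conversion that tracks ambient conditions more closely.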

 

Data Quality and Missing Values

Users should be aware that this dataset contains missing values and potential data quality issues that require careful consideration.

  • GVA Data (gva_*.csv):

    • While the 10-minute file appears complete, the descriptive statistics reveal the presence of extreme outliers and negative values (e.g., in NOx, NO2), indicating potential sensor malfunctions or data processing errors.

    • The aggregated 30-minute and 60-minute files contain missing values across most pollutant variables.

    • The Vac_Min column is entirely empty.

    • The HR column appears to be a direct copy of the PM1 column and is likely an error.

  • LCS Data (n1_*.csv, n2_*.csv):

    • The low-cost sensor data is generally complete at the 10-minute resolution, but the aggregated 30 and 60-minute files may have occasional missing rows due to aggregation logic.

    • Users should note sensor-specific behavior. For example, the NO2 sensor frequently reports a static maximum value, and the HR (Relative Humidity) sensor occasionally reports values exceeding 100%.

We strongly recommend performing an initial check for missing data and outliers, and implementing an appropriate handling and filtering strategy before analysis.
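The checks listed above could be started along the following lines. This is a minimal sketch using only the standard library: the helper names are hypothetical, the sample rows are placeholders rather than real measurements, and the thresholds (negative concentrations, humidity above 100%) simply mirror the issues named in this section.

```python
import csv
import io


def to_float(text):
    """Return a float, or None for empty, non-numeric, or NaN cells."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if value != value else value  # NaN is the only value != itself


def column_report(rows, column):
    """Count missing and negative entries in one column."""
    values = [to_float(row.get(column)) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "negative": sum(1 for v in present if v < 0),
    }


def count_above(rows, column, limit):
    """Count entries that exceed a physical upper limit (e.g. HR > 100)."""
    values = (to_float(row.get(column)) for row in rows)
    return sum(1 for v in values if v is not None and v > limit)


def columns_identical(rows, a, b):
    """True when two columns hold the same text in every row."""
    return all(row.get(a) == row.get(b) for row in rows)


# Placeholder rows shaped like a GVA file (NOT real measurements).
gva_rows = list(csv.DictReader(io.StringIO(
    "NO2,PM1,HR,Vac_Min\n12.0,5.0,5.0,\n-3.0,7.0,7.0,\n,6.0,6.0,\n"
)))
# Placeholder rows shaped like an LCS file.
lcs_rows = [{"HR": "101.5"}, {"HR": "55.0"}]

no2_report = column_report(gva_rows, "NO2")
vac_report = column_report(gva_rows, "Vac_Min")
hr_is_copy = columns_identical(gva_rows, "HR", "PM1")
hr_over = count_above(lcs_rows, "HR", 100.0)
print(no2_report, vac_report, hr_is_copy, hr_over)
```

Whether flagged values are dropped, clipped, or imputed depends on the analysis; the sketch only counts them so that the decision can be made deliberately.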

Files

moli-sol-dataset.zip (5.9 MB)
md5:3bda941929caf4090819569399326f16
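The MD5 checksum listed above can be used to confirm that the archive downloaded intact. The sketch below computes the digest in chunks with the standard library. The function name is hypothetical, and the commented usage line assumes the archive sits in the current working directory.

```python
import hashlib

EXPECTED_MD5 = "3bda941929caf4090819569399326f16"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive is in the current directory:
# if md5_of_file("moli-sol-dataset.zip") != EXPECTED_MD5:
#     raise ValueError("Checksum mismatch: download may be corrupted")
```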

Additional details

Related works

Is supplement to
Journal article: 10.5194/amt-2024-127 (DOI)

Funding

Ministerio de Ciencia, Innovación y Universidades
PID2021-126823OB-I00
Ministerio de Ciencia, Innovación y Universidades
TED2021-131040B-C33
Ministerio de Educación y Formación Profesional
PRX23/00589
Generalitat Valenciana
CIAICO/2022/179
Generalitat Valenciana
CIAEST/2022/64
Generalitat Valenciana
CIACIF/2023/416
Generalitat Valenciana
CIAEST/2024/71

Dates

Collected
2024-05-31/2025-01-23