Air Quality Data from Regulatory AQMS and Low-Cost Sensors in Molí del Sol, Valencia (May 31, 2024 - January 23, 2025)
Description
This dataset contains parallel time-series data for air quality and meteorological parameters, collected for the purpose of calibrating low-cost air quality sensors (LCS) against a regulatory-grade reference monitoring station. The data was collected continuously for 238 days, from May 31, 2024, to January 23, 2025, at the Moli del Sol air quality station in the Valencian Community, Spain.
The dataset is divided into two main sources:
- Reference Data (GVA): The gva_*.csv files originate from an official Valencian AQ Monitoring Network (VAQMN) station managed by the Generalitat Valenciana (GVA). These data are the high-accuracy "ground truth" measurements from professional, regulatory-grade instruments. The selected reference station is 'Moli del Sol' (station code 46250048), located at lat 39.48113875, lon -0.40855865.
- Low-Cost Sensor Data (LCS): The n1_*.csv and n2_*.csv files contain raw data collected by two IoT nodes (Node 1 and Node 2) equipped with ZPHS01B multi-sensor modules. These nodes were co-located with the official GVA station to ensure measurements were taken under identical ambient conditions.
The raw sensor data from all sources has been processed and aggregated into synchronized time intervals of 10, 30, and 60 minutes to facilitate direct comparison and the training of machine learning models. This dataset is designed for developing and evaluating machine learning algorithms to improve the accuracy of raw pollutant readings (e.g., Ozone, Nitrogen Dioxide) from the low-cost sensors, a process detailed in the associated research publication.
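Because the reference and LCS files share the rounded_datetime key, a calibration workflow can start by pairing readings that fall in the same interval. The sketch below is a minimal illustration using only the Python standard library: it pairs O3 readings by timestamp and fits a one-variable least-squares line as a baseline. The function names (`align_on_timestamp`, `fit_linear`) and all sample values are hypothetical, and the linear fit is a stand-in baseline, not the machine learning method of the associated publication.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def align_on_timestamp(ref_rows: List[Row], lcs_rows: List[Row],
                       column: str = "O3",
                       key: str = "rounded_datetime") -> List[Tuple[float, float]]:
    """Return (lcs_value, ref_value) pairs for intervals present in both sources."""
    ref_by_time = {r[key]: r[column] for r in ref_rows if r.get(column) is not None}
    pairs = []
    for row in lcs_rows:
        value = row.get(column)
        if value is not None and row[key] in ref_by_time:
            pairs.append((value, ref_by_time[row[key]]))
    return pairs


def fit_linear(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares for ref ~ slope * lcs + intercept."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("LCS readings are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic values for illustration only (NOT taken from the dataset).
T0 = 1_717_113_600_000_000_000      # 2024-05-31 00:00:00 UTC, in nanoseconds
STEP = 600 * 1_000_000_000          # one 10-minute interval, in nanoseconds
ref_rows = [{"rounded_datetime": T0 + i * STEP, "O3": v}
            for i, v in enumerate([40.0, 50.0, 60.0, 70.0])]
lcs_rows = [{"rounded_datetime": T0 + i * STEP, "O3": v}
            for i, v in enumerate([15.0, None, 25.0, 30.0, 35.0])]

pairs = align_on_timestamp(ref_rows, lcs_rows)   # missing and unmatched rows are dropped
slope, intercept = fit_linear(pairs)
print(f"O3_ref ~ {slope:.2f} * O3_lcs + {intercept:.2f}")
```

Pairing only intervals present in both sources avoids silently comparing readings from different time windows when one file has gaps.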
Technical info
Files Description
gva_10.csv, gva_30.csv, gva_60.csv: Data from the official Moli del Sol reference station, aggregated at 10, 30, and 60-minute intervals, respectively.
n1_10.csv, n1_30.csv, n1_60.csv: Data from the first low-cost sensor node (Node 1), aggregated at 10, 30, and 60-minute intervals, respectively.
n2_10.csv, n2_30.csv, n2_60.csv: Data from the second low-cost sensor node (Node 2), aggregated at 10, 30, and 60-minute intervals, respectively.
Variable Descriptions
GVA Files (gva_*.csv)
- SO2: Sulfur Dioxide concentration (µg/m³)
- CO: Carbon Monoxide concentration (ppm)
- O3: Ozone concentration (µg/m³) - This is the primary reference variable.
- NOx: Total Nitrogen Oxides concentration (µg/m³)
- NO: Nitric Oxide concentration (µg/m³)
- NO2: Nitrogen Dioxide concentration (µg/m³)
- PM10: Particulate Matter < 10 µm (µg/m³)
- PM2,5: Particulate Matter < 2.5 µm (µg/m³)
- PM1: Particulate Matter < 1 µm (µg/m³)
- SPL: Sound Pressure Level (dB)
- PM10_S/C, PM2,5_S/C, PM1_S/C: Additional Particulate Matter readings from a secondary or control system.
- Temp_Int: Internal Temperature of the station enclosure (°C)
- HR_Int: Internal Relative Humidity of the station enclosure (%)
- HR: Relative Humidity (%)*
- Vac_Min: Minimum vacuum reading**
- rounded_datetime: Unix timestamp in nanoseconds, marking the start of the measurement interval.
* This column appears to be an erroneous duplicate of the PM1 column and should be used with caution.
** This column is present in the 30 and 60-minute files but contains no values for this time period.
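Two of the notes above lend themselves to a quick programmatic check: decoding the nanosecond rounded_datetime values and testing whether HR really duplicates PM1. The following is a minimal standard-library sketch. The helper names and sample rows are hypothetical, and interpreting the timestamps as UTC is an assumption, since the time zone is not stated in this description.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

NS_PER_SECOND = 1_000_000_000


def ns_to_datetime(ns: int) -> datetime:
    """Convert a Unix timestamp in nanoseconds to a datetime (UTC assumed)."""
    seconds, remainder_ns = divmod(int(ns), NS_PER_SECOND)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only resolves microseconds, so sub-microsecond digits are dropped.
    return moment.replace(microsecond=remainder_ns // 1000)


def columns_identical(rows: List[Dict[str, Any]], col_a: str, col_b: str) -> bool:
    """True when the two columns hold equal values in every row (and rows exist)."""
    return bool(rows) and all(row.get(col_a) == row.get(col_b) for row in rows)


# Synthetic rows for illustration only (NOT taken from the dataset).
T0 = 1_717_113_600_000_000_000
sample_rows = [
    {"rounded_datetime": T0, "PM1": 4.0, "PM10": 12.0, "HR": 4.0},
    {"rounded_datetime": T0 + 600 * NS_PER_SECOND, "PM1": 6.5, "PM10": 20.0, "HR": 6.5},
]

for row in sample_rows:
    print(ns_to_datetime(row["rounded_datetime"]).isoformat())
print("HR duplicates PM1:", columns_identical(sample_rows, "HR", "PM1"))
```

If the check confirms the duplication on the real files, the GVA HR column is best excluded from any model features.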
LCS Files (n1_*.csv, n2_*.csv)
- Temp: Ambient Temperature (°C)
- HR: Relative Humidity (%)
- PM1: Particulate Matter < 1 µm (µg/m³)
- PM2_5: Particulate Matter < 2.5 µm (µg/m³)
- PM10: Particulate Matter < 10 µm (µg/m³)
- VOC: Volatile Organic Compounds (level-based)
- CH2O: Formaldehyde concentration (mg/m³)
- CO2: Carbon Dioxide concentration (ppm)
- CO: Carbon Monoxide concentration (ppm)
- O3: Raw Ozone concentration from the LCS (µg/m³, converted from ppm for the study) - This is the primary variable to be calibrated.
- NO2: Raw Nitrogen Dioxide concentration from the LCS (ppm)
- rounded_datetime: Unix timestamp in nanoseconds, marking the start of the measurement interval.
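Note that the LCS NO2 values are in ppm while the reference NO2 values are in µg/m³, so a unit conversion is needed before the two can be compared. The sketch below applies the ideal-gas relation C[µg/m³] = C[ppm] · M / V_m, with V_m = R·T/P in m³/mol. The reference conditions used here (20 °C, 101.325 kPa) are an assumption; this description does not state which conditions were used for the O3 conversion, so results may differ slightly from the authors' values. The function name `ppm_to_ugm3` is hypothetical.

```python
GAS_CONSTANT = 8.314462618  # J/(mol*K)

# Molar masses in g/mol.
MOLAR_MASS = {"O3": 47.998, "NO2": 46.0055, "CO": 28.010}


def ppm_to_ugm3(ppm: float, gas: str,
                temp_c: float = 20.0,
                pressure_pa: float = 101_325.0) -> float:
    """Convert a mixing ratio in ppm to a mass concentration in µg/m³.

    The default temperature and pressure are ASSUMED reference conditions,
    not values documented for this dataset.
    """
    molar_volume_m3 = GAS_CONSTANT * (temp_c + 273.15) / pressure_pa  # m³/mol
    # ppm is µmol/mol, so ppm * (g/mol) / (m³/mol) yields µg/m³ directly.
    return ppm * MOLAR_MASS[gas] / molar_volume_m3


# Illustrative input only (NOT a reading from the dataset).
no2_ugm3 = ppm_to_ugm3(0.05, "NO2")
print(f"0.05 ppm NO2 is about {no2_ugm3:.1f} µg/m³ at 20 °C and 101.325 kPa")
```

Passing the co-located Temp reading as `temp_c` would give a conversion at ambient rather than reference conditions; which choice is appropriate depends on how the reference station reports its values.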
Data Quality and Missing Values
Users should be aware that this dataset contains missing values and potential data quality issues that require careful consideration.
- GVA Data (gva_*.csv):
  - The 10-minute file appears complete, but its descriptive statistics reveal extreme outliers and negative values (e.g., in NOx and NO2), which indicate potential sensor malfunctions or data processing errors.
  - The aggregated 30-minute and 60-minute files contain missing values across most pollutant variables.
  - The Vac_Min column is entirely empty.
  - The HR column appears to be a direct copy of the PM1 column and is likely an error.
- LCS Data (n1_*.csv, n2_*.csv):
  - The low-cost sensor data is generally complete at the 10-minute resolution, but the aggregated 30 and 60-minute files may have occasional missing rows due to aggregation logic.
  - Users should note sensor-specific behavior. For example, the NO2 sensor frequently reports a static maximum value, and the HR (Relative Humidity) sensor occasionally reports values exceeding 100%.
- We strongly recommend performing an initial check for missing data and outliers and implementing an appropriate handling and filtering strategy before analysis.
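The recommended initial check could look roughly like the standard-library sketch below. It counts missing, negative, and at-maximum readings per column (the last as a crude indicator of a saturated sensor), then masks negative pollutant values and humidity outside 0 to 100%. The function names, the sample rows, and the choice to mask rather than clip or interpolate are all assumptions for illustration; the right strategy depends on the analysis.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(rows: List[Row], columns: List[str]) -> Dict[str, Dict[str, int]]:
    """Count missing, negative, and at-maximum readings for each column."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        col_max = max(present) if present else None
        report[col] = {
            "missing": len(values) - len(present),
            "negative": sum(1 for v in present if v < 0),
            # Many readings stuck at the maximum can hint at sensor saturation.
            "at_max": sum(1 for v in present if v == col_max),
        }
    return report


def mask_implausible(rows: List[Row], pollutant_cols: List[str],
                     humidity_col: str = "HR") -> List[Row]:
    """Return copies of the rows with implausible values replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        for col in pollutant_cols:
            value = new_row.get(col)
            if not is_missing(value) and value < 0:
                new_row[col] = None
        humidity = new_row.get(humidity_col)
        if not is_missing(humidity) and not 0 <= humidity <= 100:
            new_row[humidity_col] = None
        cleaned.append(new_row)
    return cleaned


# Synthetic rows for illustration only (NOT taken from the dataset).
rows = [
    {"NO2": 0.1, "HR": 55.0},
    {"NO2": 9.9, "HR": 101.5},
    {"NO2": 9.9, "HR": None},
    {"NO2": -0.2, "HR": 80.0},
    {"NO2": float("nan"), "HR": 60.0},
]
report = quality_report(rows, ["NO2", "HR"])
cleaned = mask_implausible(rows, ["NO2"])
print(report)
```

Running the report before and after masking makes it easy to see how many readings each rule removes.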
Files
moli-sol-dataset.zip (5.9 MB)
md5:3bda941929caf4090819569399326f16
Additional details
Related works
- Is supplement to: Journal article, DOI 10.5194/amt-2024-127
Funding
- Ministerio de Ciencia, Innovación y Universidades: PID2021-126823OB-I00
- Ministerio de Ciencia, Innovación y Universidades: TED2021-131040B-C33
- Ministerio de Educación y Formación Profesional: PRX23/00589
- Generalitat Valenciana: CIAICO/2022/179
- Generalitat Valenciana: CIAEST/2022/64
- Generalitat Valenciana: CIACIF/2023/416
- Generalitat Valenciana: CIAEST/2024/71
Dates
- Collected: 2024-05-31/2025-01-23