Water-quality data imputation with a high percentage of missing values: a machine learning approach

Rafael Rodríguez; Marcos Pastorini; Lorena Etcheverry; Christian Chreties; Mónica Fossati; Alberto Castro; Angela Gorgoglione

doi:10.5281/zenodo.4731169

Published April 30, 2021 | Version v1

Dataset Open

1. Department of Fluid Mechanics and Environmental Engineering (IMFIA), School of Engineering, Universidad de la República, Uruguay
2. Department of Computer Science (InCo), School of Engineering, Universidad de la República, Uruguay

The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables, combined with the high rate of missing values (between 50% and 70%), raises significant challenges.

To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms covered both univariate and multivariate imputation: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).
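As an illustration of the simplest method in the list, the following is a minimal sketch of inverse distance weighting. It assumes that distance is measured along a single numeric axis (for example, days between samples) and that the weight exponent is 2; this description does not state either choice, so both are assumptions, and the function name `idw_impute` is hypothetical.

```python
from typing import List, Optional, Sequence


def idw_impute(
    times: Sequence[float],
    values: Sequence[Optional[float]],
    power: float = 2.0,  # assumed exponent; not specified in the dataset description
) -> List[float]:
    """Fill None entries with an inverse-distance-weighted mean of observed entries.

    Distance is |t_missing - t_observed| along one axis (an assumption).
    """
    observed = [(t, v) for t, v in zip(times, values) if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")

    filled: List[float] = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)  # keep original observations untouched
            continue
        # If an observation shares the same position, use it directly
        # (avoids division by zero).
        exact = [v_obs for t_obs, v_obs in observed if t_obs == t]
        if exact:
            filled.append(sum(exact) / len(exact))
            continue
        weights = [1.0 / abs(t - t_obs) ** power for t_obs, _ in observed]
        weighted_sum = sum(w * v_obs for w, (_, v_obs) in zip(weights, observed))
        filled.append(weighted_sum / sum(weights))
    return filled
```

With `times = [0, 1, 3]` and `values = [10.0, None, 20.0]`, the gap at position 1 is closer to the first observation, so the estimate is pulled toward 10.0 rather than sitting at the midpoint.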

IDW outperformed the other methods, achieving very good performance (Nash-Sutcliffe efficiency, NSE, greater than 0.8) in most cases.
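For readers unfamiliar with the metric, the sketch below computes NSE using its standard definition: one minus the ratio of the squared error of the estimates to the variance of the observations around their mean. A value of 1 indicates a perfect match, and 0 means the estimates are no better than the observed mean. The function name `nse` is hypothetical, and how the held-out values were selected for evaluation is described in the paper, not here.

```python
from typing import Sequence


def nse(observed: Sequence[float], estimated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - sum((o - e)^2) / sum((o - mean(o))^2)."""
    if len(observed) != len(estimated) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    variance_term = sum((o - mean_obs) ** 2 for o in observed)
    if variance_term == 0:
        raise ValueError("NSE is undefined when all observed values are equal")
    return 1.0 - squared_error / variance_term
```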

In this dataset, we include the original and imputed values for the following variables:

Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)

Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
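The naming pattern above can be split into its parts with a regular expression. The following is a sketch only: the exact spacing in the CSV headers has not been checked, the unit is treated as optional, and the station code `ST1` in the usage note is a made-up placeholder, not a real station identifier from the dataset.

```python
import re
from typing import Dict, Optional

# Assumed layout: "[STATION] Full name (Short) [unit]"; the unit part is optional here.
COLUMN_PATTERN = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"
    r"(?P<full_name>.+?)\s*"
    r"\((?P<short_name>[^)]+)\)\s*"
    r"(?:\[(?P<unit>[^\]]*)\])?\s*$"
)


def parse_column(name: str) -> Dict[str, Optional[str]]:
    """Split a column header into station, full name, short name, and unit."""
    match = COLUMN_PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"column name does not follow the expected pattern: {name!r}")
    return match.groupdict()
```

For example, `parse_column("[ST1] Dissolved oxygen (DO) [mg/L]")` yields the station `ST1`, the full name `Dissolved oxygen`, the short name `DO`, and the unit `mg/L`.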

More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

Files

Files (61.9 kB)

Name	Size	MD5
Imputation.csv	49.1 kB	1d424b0492aca90c11b29ef5c65b9182
Original.csv	12.8 kB	49fcc98875149ded1e77b92f20237b53
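The MD5 checksums listed for the two files can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the code assumes the files keep their original names after download.

```python
import hashlib
from pathlib import Path

# Checksums as listed on this record.
EXPECTED_MD5 = {
    "Imputation.csv": "1d424b0492aca90c11b29ef5c65b9182",
    "Original.csv": "49fcc98875149ded1e77b92f20237b53",
}


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a downloaded file against the checksum listed for its file name."""
    expected = EXPECTED_MD5[Path(path).name]  # KeyError for unknown file names
    return md5_of_file(path) == expected
```

Calling `verify_download("Imputation.csv")` in the download directory returns `True` only if the file matches the listed checksum.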

Additional details

Is derived from: Preprint: 10.20944/preprints202105.0105.v1 (DOI); Journal article: 10.3390/su13116318 (DOI)

	All versions	This version
Views	1,185	1,180
Downloads	662	659
Data volume	38.3 MB	38.1 MB
