Water-quality data imputation with a high percentage of missing values: a machine learning approach
Creators
- 1. Department of Fluid Mechanics and Environmental Engineering (IMFIA), School of Engineering, Universidad de la República, Uruguay
- 2. Department of Computer Science (InCo), School of Engineering, Universidad de la República, Uruguay
Description
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms covered both univariate and multivariate imputation: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).
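The record does not include the implementation used in the study, so the following is only a minimal sketch of the general idea behind inverse distance weighting, assuming a univariate, time-based variant in which each gap is filled with a weighted mean of the measured values and closer samples weigh more. The function name `idw_impute`, the `power` parameter, and the sample values are hypothetical and do not come from the dataset.

```python
from math import isnan


def idw_impute(times, values, power=2.0):
    """Fill NaN entries with an inverse-distance-weighted mean of observed values.

    Illustrative only: the distance is the time difference between samples,
    which is an assumption and not necessarily the setup used in the paper.
    """
    # Keep only the samples that were actually measured.
    observed = [(t, v) for t, v in zip(times, values) if not isnan(v)]
    if not observed:
        raise ValueError("at least one observed value is required")

    filled = []
    for t, v in zip(times, values):
        if not isnan(v):
            filled.append(v)  # measured values are left untouched
            continue
        num = 0.0
        den = 0.0
        exact = None
        for t_obs, v_obs in observed:
            dist = abs(t - t_obs)
            if dist == 0:
                # A measurement at the same time stamp is used directly.
                exact = v_obs
                break
            weight = 1.0 / dist ** power  # closer samples get larger weights
            num += weight * v_obs
            den += weight
        filled.append(exact if exact is not None else num / den)
    return filled


# Hypothetical example: sampling days and water temperature with two gaps.
days = [0, 1, 2, 5]
tw = [18.0, float("nan"), 20.0, float("nan")]
print(idw_impute(days, tw))
```

A larger `power` makes the estimate follow the nearest measurements more closely, while a smaller one pulls it toward the overall mean of the observed values.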
IDW outperformed the other methods, achieving very good performance (Nash-Sutcliffe efficiency, NSE, greater than 0.8) in most cases.
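For readers unfamiliar with the metric, NSE compares the squared error of the estimates with the variance of the observations: a value of 1 means a perfect match, and 0 means the estimates are no better than the observed mean. The sketch below uses the standard definition; the function name `nse` and the sample numbers are hypothetical, and the evaluation protocol of the study is described in the paper, not here.

```python
def nse(observed, estimated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - e)^2) / sum((o - mean(o))^2)."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Squared error between observations and estimates.
    residual = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    # Spread of the observations around their own mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1.0 - residual / spread


# Hypothetical values: held-out measurements versus their imputed counterparts.
held_out = [7.9, 8.4, 6.8, 9.1]
imputed = [8.0, 8.1, 7.0, 8.8]
print(round(nse(held_out, imputed), 3))
```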
In this dataset, we include the original and imputed values for the following variables:
- Water temperature (Tw)
- Dissolved oxygen (DO)
- Electrical conductivity (EC)
- pH
- Turbidity (Turb)
- Nitrite (NO2-)
- Nitrate (NO3-)
- Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
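As an illustration of how such labels could be split into their parts, the sketch below parses a label of the form described above with a regular expression. The station code `SLC01` and the unit in the example are hypothetical placeholders, not values taken from the file, and labels that deviate from the pattern (for instance a variable without a unit) simply return `None`.

```python
import re

# Pattern for "[STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC]".
LABEL_RE = re.compile(
    r"^\[(?P<station>[^\]]+)\]\s*"   # [STATION]
    r"(?P<full_name>.+?)\s*"         # VARIABLE FULL NAME
    r"\((?P<short_name>[^()]+)\)\s*" # (VARIABLE SHORT NAME)
    r"\[(?P<unit>[^\]]*)\]$"         # [UNIT METRIC]
)


def parse_label(label):
    """Return the parts of a variable label as a dict, or None if it does not match."""
    match = LABEL_RE.match(label.strip())
    return match.groupdict() if match else None


# Hypothetical label; check the actual header of Imputation.csv before relying on it.
print(parse_label("[SLC01] Water temperature (Tw) [°C]"))
```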
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Files
Imputation.csv
Total size: 61.9 kB
Checksum | Size |
---|---|
md5:1d424b0492aca90c11b29ef5c65b9182 | 49.1 kB |
md5:49fcc98875149ded1e77b92f20237b53 | 12.8 kB |
Additional details
Related works
- Is derived from
- Preprint: 10.20944/preprints202105.0105.v1 (DOI)
- Journal article: 10.3390/su13116318 (DOI)
References
- Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318