CSSI Elements: Development of Assumption-Free Parallel Data Curing Service for Robust Machine Learning and Statistical Predictions
Description
Missing data is an intrinsic problem across science and engineering. In the emerging era of big data and machine learning (ML), missing data can substantially damage the reliability and accuracy of ML predictions and statistical inference. Researchers are often uncertain about how incomplete data affect their final ML and statistical analyses, and about how to handle large, complex incomplete data sets. Existing data-curing (imputation) methods are difficult for non-specialist researchers to use and are often unsuitable for large, complex data.
To resolve these challenges, this project aims to develop a new community-level data-curing service, running on NSF cyberinfrastructure and local high-performance computing (HPC) facilities, for researchers across science and engineering. The proposed service requires few expert-level statistical assumptions and places no restrictions on the size, dimension, type, or complexity of the data. The service pursues generality, reliability, accuracy, and scalability. To tackle large incomplete data sets, the project seeks to establish parallel data-curing cores with ideal scalability and to provide uncertainty measures for the cured data. The project will also develop supplementary algorithms that report the positive or negative influence of the cured data on subsequent ML and SL-based predictions. With the resulting general data-curing service, researchers across science and engineering can readily handle their incomplete data and use them with confidence for subsequent ML and statistical inference.
Files
Name | Size | Checksum |
---|---|---|
CSSI Poster Cho and Kim (Iowa State University) 2022.pdf | 2.3 MB | md5:a9ad4deb61769adecae946ea54680262 |
Additional details
References
- Song et al. (2019), IEEE TKDE (doi: 10.1109/TKDE.2019.2922638)
- Yicheng et al. (2020), IEEE TKDE (doi: 10.1109/TKDE.2020.3029146)
- Im et al. (2018), The R Journal (https://journal.r-project.org/archive/2018/RJ-2018-020/index.html)