Data-centric ML pipeline for data drift and data preprocessing
Description
The main MLOps challenges in hardware verification stem from severe data heterogeneity and frequent data drift in both the feature space and the type space. This study proposes a multi-purpose data schema, inferred bottom-up from the data, that can be used for data monitoring, type casting, and preprocessing. The schema serves as a data ingestion step in an ML pipeline and increases the transparency and flexibility of data preprocessing. Using this flexibility, we also demonstrate that tuning the data preprocessing can further improve model performance, which emphasizes the importance of data handling and data quality in building ML products.
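To make the idea of a bottom-up schema more concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: the function names (`infer_schema`, `detect_drift`, `cast_record`), the field names (`cycles`, `seed`, `coverage`), and the rule of taking the most common observed type as the expected type are all assumptions made for illustration. It shows how one inferred schema could serve monitoring (drift in fields and types) and type casting.

```python
from collections import Counter, defaultdict


def infer_schema(records):
    """Infer a schema bottom-up: count the value types seen for each field."""
    type_counts = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            type_counts[field][type(value).__name__] += 1
    # Assumption: the most common observed type is the expected type.
    return {f: counts.most_common(1)[0][0] for f, counts in type_counts.items()}


def detect_drift(schema, records):
    """Report drift in the feature space (fields) and type space (types)."""
    seen, new_fields, type_changes = set(), set(), {}
    for record in records:
        for field, value in record.items():
            seen.add(field)
            observed = type(value).__name__
            if field not in schema:
                new_fields.add(field)
            elif observed != schema[field]:
                type_changes.setdefault(field, set()).add(observed)
    return {
        "new_fields": sorted(new_fields),
        "missing_fields": sorted(set(schema) - seen),
        "type_changes": {f: sorted(t) for f, t in type_changes.items()},
    }


# Hypothetical mapping from schema type names to casting functions.
CASTERS = {"int": int, "float": float, "str": str}


def cast_record(record, schema):
    """Cast known fields to the schema type; keep values that cannot be cast."""
    out = {}
    for field, value in record.items():
        caster = CASTERS.get(schema.get(field))
        try:
            out[field] = caster(value) if caster else value
        except (TypeError, ValueError):
            out[field] = value
    return out


# Made-up example data, not taken from the study.
baseline = [{"cycles": 100, "seed": "a1"}, {"cycles": 250, "seed": "b2"}]
incoming = [{"cycles": "300", "seed": "c3", "coverage": 0.8}]

schema = infer_schema(baseline)
report = detect_drift(schema, incoming)
casted = cast_record(incoming[0], schema)
```

In this sketch the same `schema` object drives both the drift report and the casting step, which is the sense in which a single schema can be multi-purpose. A real pipeline would likely also track value ranges, missing-value rates, and nested structures.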
Files
Name | Size |
---|---|
SciPy_2023_Hongsup_Shin.pdf (md5:8bf7f93762eea0e4569cabba66654eb2) | 410.7 kB |