Data-centric ML pipeline for data drift and data preprocessing

Hongsup Shin

doi:10.5281/zenodo.8136049

Published July 12, 2023 | Version v1

Poster Open

Hongsup Shin¹

1. Arm Ltd

The main MLOps challenges in hardware verification stem from severe data heterogeneity and frequent data drift in both the feature space and the type space. This study proposes a multi-purpose data schema, inferred in a bottom-up fashion, that can be used for data monitoring, type casting, and preprocessing. The approach adds a data ingestion step to the ML pipeline that makes data preprocessing more transparent and flexible. Building on this flexibility, we also demonstrate that tuning the data preprocessing can further improve model performance, which underscores the importance of data handling and data quality in building ML products.
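The abstract does not give implementation details, so the following is only a minimal sketch of what a bottom-up schema could look like, not the poster's actual method. It assumes records arrive as dictionaries of raw strings (for example, parsed log or CSV rows), and every name in it (`infer_type`, `infer_schema`, `detect_drift`, `cast_rows`, and the example columns) is hypothetical. The same inferred schema is reused for two of the purposes the abstract lists: monitoring (comparing schemas across batches) and type casting.

```python
from collections import Counter

# Hypothetical helpers: none of these names come from the poster.

def infer_type(value):
    """Return the narrowest type name that a raw value parses as."""
    if value is None or value == "":
        return "null"
    text = str(value)  # assumption: raw values are strings or string-like
    for name, parser in (("int", int), ("float", float)):
        try:
            parser(text)
            return name
        except ValueError:
            continue
    return "str"

def infer_schema(records):
    """Infer a column -> type mapping from the observed values (bottom-up)."""
    seen = {}
    for row in records:
        for col, value in row.items():
            seen.setdefault(col, Counter())[infer_type(value)] += 1
    schema = {}
    for col, counts in seen.items():
        kinds = set(counts) - {"null"}
        if not kinds:
            schema[col] = "null"
        elif kinds <= {"int"}:
            schema[col] = "int"
        elif kinds <= {"int", "float"}:
            schema[col] = "float"  # widen mixed int/float columns
        else:
            schema[col] = "str"    # fall back to string for anything else
    return schema

def detect_drift(reference, current):
    """Compare two schemas: drift in feature space and in type space."""
    shared = set(reference) & set(current)
    return {
        "added": sorted(set(current) - set(reference)),
        "removed": sorted(set(reference) - set(current)),
        "type_changed": sorted(c for c in shared if reference[c] != current[c]),
    }

CASTERS = {"int": int, "float": float, "str": str}

def cast_rows(records, schema):
    """Cast raw values to the schema's types; missing or empty become None."""
    out = []
    for row in records:
        typed = {}
        for col, dtype in schema.items():
            raw = row.get(col)
            if raw is None or raw == "" or dtype == "null":
                typed[col] = None
            else:
                typed[col] = CASTERS[dtype](raw)
        out.append(typed)
    return out

# Illustrative batches with made-up column names.
batch_a = [{"cycles": "10", "status": "pass"}, {"cycles": "12", "status": "fail"}]
batch_b = [{"cycles": "10.5", "status": "pass", "seed": "7"}]

schema_a = infer_schema(batch_a)
schema_b = infer_schema(batch_b)
print(detect_drift(schema_a, schema_b))
# {'added': ['seed'], 'removed': [], 'type_changed': ['cycles']}
```

In this sketch a new column shows up as drift in the feature space and a column that changes from integer to float shows up as drift in the type space. Because the casting step is driven by the schema instead of hard-coded per column, preprocessing choices stay visible and easy to adjust.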

Files

Name	Size	Checksum
SciPy_2023_Hongsup_Shin.pdf	410.7 kB	md5:8bf7f93762eea0e4569cabba66654eb2

	All versions	This version
Views	120	119
Downloads	103	103
Data volume	46.0 MB	46.0 MB

DOI: 10.5281/zenodo.8136049
Resource type: Poster
Publisher: Zenodo

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: July 11, 2023
Modified: July 11, 2024
