Conceptual Model and Framework for Collaborative Data Cleaning
Description
Data cleaning and preparation are essential parts of data curation lifecycles and scientific workflows. Exploratory data mining and data cleaning are often reported to take 80% of the scientific research pipeline. A data cleaning task can be very tedious for a single user: it involves a great deal of exploration and iteration and is prone to error, especially when a curator finds various problems in the dataset. Single-user data cleaning can also introduce bias, because the cleaning quality will only be as good as that curator's knowledge. We can therefore assign a data cleaning task to multiple data curators who collaborate on curating the dataset. However, involving multiple users can introduce new problems, such as disagreements over data changes and conflicting process dependencies. Understanding this variation in changes and analyzing the merging workflow is important for data curation, both to evolve the data cleaning workflow and to improve the dataset's quality. In line with the reusability theme for IDCC 2022, this approach can help improve the data curation pipeline by improving the data cleaning pipeline through collaboration.
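As a rough illustration of what a data change disagreement can look like, the sketch below flags cells for which two or more curators proposed different values. It is not the model described in the paper: the edit record layout `(curator, row_id, column, new_value)`, the function name `find_conflicts`, and the sample curators and values are all assumptions made for this example.

```python
from collections import defaultdict


def find_conflicts(edits):
    """Return cells where curators proposed differing values.

    `edits` is an iterable of (curator, row_id, column, new_value) tuples.
    This record layout is a hypothetical one chosen for illustration.
    """
    # Group proposals by cell; a later edit by the same curator
    # replaces that curator's earlier proposal for the cell.
    proposals = defaultdict(dict)
    for curator, row_id, column, new_value in edits:
        proposals[(row_id, column)][curator] = new_value

    # A cell is in conflict when its proposals are not all identical.
    conflicts = {}
    for cell, by_curator in proposals.items():
        if len(set(by_curator.values())) > 1:
            conflicts[cell] = dict(by_curator)
    return conflicts


# Made-up sample edits from two hypothetical curators.
edits = [
    ("alice", 1, "city", "Chicago"),
    ("bob", 1, "city", "Chicago, IL"),   # disagrees with alice
    ("alice", 2, "year", "2021"),
    ("bob", 2, "year", "2021"),          # agrees with alice
    ("bob", 3, "name", "Smith"),         # only one curator touched this cell
]
conflicts = find_conflicts(edits)
```

Here only the `city` cell of row 1 would be reported, since the curators agree on row 2 and only one curator edited row 3. Conflicting process dependencies, where one curator's operation depends on a column another curator has transformed, would need information about the order and inputs of operations and are not covered by this sketch.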
Files
Name | Size | Checksum |
---|---|---|
IDCC22_Parulian_ConceptualModel.pdf | 289.2 kB | md5:ec332c576d2469541f34172c911b243e |