There is a newer version of this record available.

Presentation Open Access

Where has your data come from? Data ancestry and other tales

Allard, Tania

Over the last few years, great improvements have been made in reproducible scientific computing and FAIR (findable, accessible, interoperable and reusable) data. As a consequence, data scientists and researchers alike have started to incorporate modern software development practices, such as version control and testing, into their workflows. More and more emphasis has been placed on the quality and validity of the software developed. But what about the data? Data validation and integrity are just as important as the adequacy of the code ingesting and processing the datasets. In this talk, I will take a high-level look at concepts such as data lineage, provenance and continuous data validation, and present examples in which these concepts have been applied to real-world data pipelines, increasing not only confidence in the results obtained but also the efficiency and integrity of the workflows themselves.

Files (388.3 MB)
Name: 2019-05-08-csv-data-lineage.pdf
Size: 388.3 MB
md5: 273b5f3fb63c8ddceee3cbe1525687b8
                  All versions   This version
Views             105            14
Downloads         60             16
Data volume       22.3 GB        6.2 GB
Unique views      91             12
Unique downloads  38             7
