Presentation Open Access

csv-detective: solving some of the mysteries of open data

Anthony Auffret; Geoffrey Aldebert; Pavel Soriano

Over the last few years, the emphasis on data quality has evolved from being a nice to have to an absolute necessity. As already stated in past editions of this venue, the quality of open data is essential to *truly fulfill its promise as a channel of empowerment*. In practical terms, data quality for tabular data entails, among other tasks, checking the structural integrity and schematic consistency of its contents. In order to satisfy these requirements, we need to look into the files content and first determine whether our files are indeed CSV files, and secondly, and more importantly, we want to discover the type of data we have in order to properly validate it and thus evaluate and then hopefully improve the quality of our datasets. In this talk, we present our work on the automatic detection of data types within the columns of CSV files. Going beyond classic computer science data types (float, integer, date), we are also interested in detecting more specific in-domain data categories. Specifically, given a CSV file, we look into its columns and determine whether it contains postal codes, enterprise identifiers, geographic coordinates, days of the week, and so on. We will show how applied to data from the French open data platform that we maintain (data.gouv.fr.), these specific types allow users to better control data in order to join or link datasets between them. We see our work as a first step and important stepping stone towards data integrity and validation checks while also facilitating the discoverability and contextualization of open datasets. Our approaches are evaluated by annotating and testing over thousands of columns extracted from more than 15 000 CSV files found in data.gouv.fr.

Files (1.8 MB)
Name Size
csv_detective_short (2).pdf
md5:60b3575fa13fed771b82a249c7554722
1.8 MB Download
48
43
views
downloads
All versions This version
Views 4848
Downloads 4343
Data volume 78.2 MB78.2 MB
Unique views 4646
Unique downloads 4040

Share

Cite as