Presentation Open Access
Anthony Auffret; Geoffrey Aldebert; Pavel Soriano
Over the last few years, the emphasis on data quality has evolved from being a nice to have to an absolute necessity. As already stated in past editions of this venue, the quality of open data is essential to *truly fulfill its promise as a channel of empowerment*. In practical terms, data quality for tabular data entails, among other tasks, checking the structural integrity and schematic consistency of its contents. In order to satisfy these requirements, we need to look into the files content and first determine whether our files are indeed CSV files, and secondly, and more importantly, we want to discover the type of data we have in order to properly validate it and thus evaluate and then hopefully improve the quality of our datasets. In this talk, we present our work on the automatic detection of data types within the columns of CSV files. Going beyond classic computer science data types (float, integer, date), we are also interested in detecting more specific in-domain data categories. Specifically, given a CSV file, we look into its columns and determine whether it contains postal codes, enterprise identifiers, geographic coordinates, days of the week, and so on. We will show how applied to data from the French open data platform that we maintain (data.gouv.fr.), these specific types allow users to better control data in order to join or link datasets between them. We see our work as a first step and important stepping stone towards data integrity and validation checks while also facilitating the discoverability and contextualization of open datasets. Our approaches are evaluated by annotating and testing over thousands of columns extracted from more than 15 000 CSV files found in data.gouv.fr.