Presentation Open Access
Today, open data platforms host a wide and heterogeneous catalog of datasets. However, these datasets are often neglected in Machine Learning (ML) and related tasks. This happens mainly because few open data catalogs specialize in ML applications, and because it is often unclear whether ML algorithms would be adequate for, and perform well on, such datasets. As a result, many open datasets go unused, although the ML community could leverage them to explain, evaluate, and challenge existing methods on real open data. For instance, these real-world data could be used by professors teaching ML courses, by students taking these courses, and by researchers testing current and novel ML approaches, and they could help promote the intersection of open data, ML, and public policy. In this talk, we will show how we are tackling this issue by working on datasets from data.gouv.fr (DGF), the French government open data platform. We aim to answer the question of what makes a dataset suitable for ML tasks, and likely to yield good performance, by leveraging open source tools. Our goal is to establish a first, small empirical assessment of the characteristics of a dataset (size, balance of its categorical variables, and so on) that make it a “good fit” for ML algorithms. Specifically, we first manually select an adequate subset of datasets from DGF. Second, we perform statistical profiling on each of these datasets. Third, we automatically train and validate a set of ML algorithms on them, and we cluster the datasets according to their evaluation results. These steps help us better understand the nature of each dataset and thus determine which ones seem suitable for ML applications. Based on these datasets, and inspired by existing resources, we build the first version of a catalog of open datasets for ML. We hope that this platform will be a first stepping stone towards the reuse of open datasets in Machine Learning contexts.
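To make the profiling step more concrete, here is a minimal sketch of how characteristics such as size and the balance of categorical variables might be computed for one dataset. It is an illustration under stated assumptions, not the tooling used in this work: the function names `profile` and `balance`, the choice of normalized Shannon entropy as the balance measure, the set of strings treated as missing, and the toy CSV are all hypothetical.

```python
import csv
import io
import math
from collections import Counter


def balance(values):
    """Normalized Shannon entropy of a categorical column.

    Returns 1.0 when all categories are equally frequent and 0.0 when
    fewer than two categories are present. (Assumed balance measure.)
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def profile(csv_text, missing=("", "NA")):
    """Compute a small profile of a CSV dataset given as a string.

    The `missing` markers are an assumption; real datasets may encode
    missing values differently.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    report = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for name in columns:
        # Keep only cells that are present and not marked as missing.
        present = [
            r[name] for r in rows
            if r[name] is not None and r[name] not in missing
        ]
        report["columns"][name] = {
            "missing_rate": 1 - len(present) / len(rows),
            "n_distinct": len(set(present)),
            "balance": balance(present),
        }
    return report


# Hypothetical toy data standing in for a dataset downloaded from a portal.
SAMPLE = "city,label\nParis,yes\nLyon,no\nParis,yes\nNice,\n"

if __name__ == "__main__":
    for column, stats in profile(SAMPLE)["columns"].items():
        print(column, stats)
```

A profile like this gives one row of descriptive features per dataset, which could then be compared against the evaluation results of the trained models.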