Conference paper Open Access
Big Data Science, which combines large data sets with techniques from statistics and machine learning, is beginning to reach the social sciences. The promise of this approach to investigation is considerable, allowing researchers to establish correlations between variables over huge numbers of participants using data that has been gathered in a non-invasive fashion and in natural settings. Unlike large-data projects in the physical sciences, however, use of these data sets in the social sciences requires that the subjects generating the data be treated in a fair and ethical fashion. This is often taken as requiring either compliance with the Common Rule, or that the data be de-identified to ensure the privacy of the subjects. But de-identification turns out to be far more difficult than one might think. In particular, the ability to re-identify subjects from a set of attributes that can be linked to other data sets has led to a number of mechanisms, such as k-anonymity or l-diversity, that attempt to define technical solutions to the de-identification problem. But these mechanisms are not without their cost. Recent work has shown that de-identification of a data set can introduce statistical bias into that data, making the results extracted by analysis of the de-identified set differ significantly from the results of the same analyses applied to the original set. In this paper, we will look at how this bias is introduced when a particular form of de-identification, k-anonymity, is applied to a particular large data set generated by the Massive Open On-line Courses (MOOCs) offered by Harvard and MIT. We will discuss some of the tensions that arise between privacy and Big Data science as a result of this bias, and look at some of the ways that have been proposed to avoid the trade-off between accurate science and privacy. Finally, we will outline a promising new approach to de-identification which appears to avoid much of the bias introduction, at least on the data set in question.
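To make the k-anonymity property concrete, the following is a minimal illustrative sketch (not taken from the paper): a data set satisfies k-anonymity when every combination of quasi-identifier values is shared by at least k records, so no individual can be singled out by those attributes alone. The record fields and the `is_k_anonymous` helper below are hypothetical examples, not the authors' code or data.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (the k-anonymity property)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical MOOC-style records; country and year of birth
# serve as the quasi-identifiers an attacker could link on.
records = [
    {"country": "US", "yob": 1990, "grade": 0.9},
    {"country": "US", "yob": 1990, "grade": 0.7},
    {"country": "FR", "yob": 1985, "grade": 0.8},
]

print(is_k_anonymous(records, ["country", "yob"], 2))  # False: the FR/1985 row is unique
```

Achieving k-anonymity in practice requires generalizing or suppressing records until the check passes, and it is exactly this coarsening and deletion that can bias subsequent analyses of the de-identified set.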