Conference paper Open Access

Big Data and the Social Sciences: Can Accuracy and Privacy Co-Exist?

Waldo, Jim

MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="">
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">big data</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">privacy</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">k-anonymity</subfield>
  </datafield>
  <controlfield tag="005">20200120165037.0</controlfield>
  <controlfield tag="001">832070</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">15 - 16 September 2016</subfield>
    <subfield code="g">Data for Policy</subfield>
    <subfield code="a">Data for Policy 2016 - Frontiers of Data Science for Government: Ideas, Practices and Projections</subfield>
    <subfield code="c">Cambridge, United Kingdom</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">532709</subfield>
    <subfield code="z">md5:6a4499296b4b157c142bdc19ba730f55</subfield>
    <subfield code="u"> Waldo-v2.docx</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="y">Conference website</subfield>
    <subfield code="u"></subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2017-05-04</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire</subfield>
    <subfield code="p">user-dfp17</subfield>
    <subfield code="o"></subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Harvard University</subfield>
    <subfield code="a">Waldo, Jim</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Big Data and the Social Sciences: Can Accuracy and Privacy Co-Exist?</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-dfp17</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u"></subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2"></subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Big Data Science, which combines large data sets with techniques from statistics and machine learning, is beginning to reach the social sciences. The promise of this approach to investigation is considerable, allowing researchers to establish correlations between variables over huge numbers of participants using data that has been gathered in a non-invasive fashion and in natural settings. Unlike large-data projects in the physical sciences, however, use of these data sets in the social sciences requires that the subjects generating the data be treated in a fair and ethical fashion. This is often taken as requiring either compliance with the Common Rule, or that the data be de-identified to ensure the privacy of the subjects. But de-identification turns out to be far more difficult than one might think. In particular, the ability to re-identify subjects from a set of attributes that can be linked to other data sets has led to a number of mechanisms, such as k-anonymity or l-diversity, that attempt to define technical solutions to the de-identification problem. But these mechanisms are not without their cost. Recent work has shown that de-identification of a data set can introduce statistical bias into that data, making the results extracted by analysis of the de-identified set vary significantly from those same analyses applied to the original set. In this paper, we will look at how this bias is introduced when a particular form of de-identification, k-anonymity, is applied to a particular large data set generated by the Massive Open On-line Courses (MOOCs) offered by Harvard and MIT. We will discuss some of the tensions that arise between privacy and Big Data science as a result of this bias, and look at some of the ways that have been proposed to avoid the trade-off between accurate science and privacy.
Finally, we will outline a promising new approach to de-identification which appears to avoid much of the bias introduction, at least on the data set in question.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.604803</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.832070</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">publication</subfield>
    <subfield code="b">conferencepaper</subfield>
  </datafield>
</record>
                  All versions   This version
Views             256            213
Downloads         96             67
Data volume       51.4 MB        35.7 MB
Unique views      240            198
Unique downloads  95             67

