Published September 28, 2022 | Version v1
Conference paper Open

De-Identification of Student Writing in Technologically Mediated Educational Settings

  • 1. Vanderbilt University
  • 2. Georgia State University
  • 3. University of Virginia
  • 4. Politehnica University of Bucharest

Description

When conducting research with data from smart learning systems, user identities must be protected: the risk of releasing personally identifiable information (PII) is a significant barrier to analyzing data and creating open datasets. Massive open online courses (MOOCs) are a good example of learning systems where PII concerns may hamper data analysis, the well-being of users, and system innovation. PII is particularly hard to locate and clean because of the variations in formatting, texts, and assignments found in unstructured data. In particular, identifying and removing students’ names has proven difficult (Bosch et al. 2020). This study examines the potential of large, pre-trained language models to de-identify MOOC data and compares the performance of these models to human annotations. On a validation set, a pre-trained language model fine-tuned using spaCy default hyperparameters achieved 97% recall of student names, including partial matches, and 30% precision. On a larger, unseen test set (n = 3,077), the model achieved 93% recall and 24% precision. The majority of the false positives, which lowered precision in the test set, were less sensitive names belonging to authors and/or lecturers. The results of the ensemble approach used here show considerable promise on a difficult de-identification task and indicate that automated de-identification is likely mature enough for use on some education datasets. Clearing PII from smart learning systems would protect the learners within them and allow the release of large datasets that could be mined for insights to advance innovation in smart learning systems.
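The recall and precision figures above count partial matches as hits. The following is a minimal sketch of how such span-level metrics could be computed, not the authors' evaluation code. It assumes that names are represented as character-offset spans and that any overlap between a predicted span and an annotated span counts as a partial match; the function names and the toy spans are hypothetical.

```python
from typing import List, Tuple

# Assumption: a name is a (start, end) character-offset pair, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Span-level precision and recall with partial (overlap) matching.

    Precision: share of predicted spans that overlap some annotated name.
    Recall: share of annotated names that overlap some predicted span.
    """
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy example (hypothetical spans, not data from the study):
# two annotated student names, four model predictions.
gold = [(0, 10), (25, 36)]
predicted = [(0, 5), (25, 36), (50, 60), (70, 80)]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case both annotated names are found, one of them only partially, while two predictions match no annotated name. Recall is therefore high and precision is low, the same pattern the reported results show: false positives reduce precision without affecting recall.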
