There is a newer version of the record available.

Published April 1, 2024 | Version v1
Dataset Open

Korean Dialogic Dataset for Personal Identifiable Information De-identification

Description

KDPII: A New Korean Dialogic Dataset for the Deidentification of Personally Identifiable Information


The rapid growth of social media in the era of big data and artificial intelligence has raised significant safety concerns about the communication of sensitive personal information. As awareness of the importance of privacy grows, there is increasing advocacy for adopting language modeling technology to mitigate the risk of personal information leakage and to deidentify sensitive information as the situation requires. Several theoretical analyses of privacy protection have been conducted in Korea. However, the development of language model training resources for Korean has been slower than for widely spoken languages such as English and Chinese. To address this problem, we developed a comprehensive and organized framework for classifying Korean personally identifiable information (PII) by investigating pertinent examples from within and outside Korea, such as “Text Anonymization Benchmark” and “Network Intrusion Detection Dataset.” We then created KDPII, a new Korean dataset for PII deidentification, which consists of many conversational texts containing abundant Korean PII. Using this dataset, we examined the Korean PII processing performance of many representative language models available on the market. We found that although performance in identifying PII varied by model size, model architecture, and training source, most models were significantly better at recognizing universal PII than language-specific PII, which points to expanding training data as a direction for future Korean-specific PII deidentification.

Files

test.json

Files (62.2 MB)

Checksum Size
md5:08eebd98593c1bad23fbf29053b3f9d2 6.1 MB
md5:fa633403055e4e5918b92b2d11726434 49.8 MB
md5:6f89923f873c840ca773d255d058150c 6.3 MB
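
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch, not part of the record itself: the function names are hypothetical, and the local file path is an assumption, since the record as shown here does not pair each checksum with a file name.

```python
import hashlib

# MD5 digests copied from the file list above (the "md5:" prefix is dropped).
KNOWN_MD5 = {
    "08eebd98593c1bad23fbf29053b3f9d2",
    "fa633403055e4e5918b92b2d11726434",
    "6f89923f873c840ca773d255d058150c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 equals one of the digests listed above."""
    return md5_of_file(path) in KNOWN_MD5
```

For example, `matches_listed_checksum("test.json")` would check a local copy saved under that name (the path is an assumption about where the file was downloaded). MD5 is suitable here only for detecting accidental corruption, not for verifying authenticity.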