Published December 2024 | Version v2
Dataset Open

KDPII DATASET REVISED

Description

KDPII: A New Korean Dialogic Dataset for the Deidentification of Personally Identifiable Information


The rapid growth of social media in the era of big data and artificial intelligence has raised significant safety concerns related to the communication of sensitive personal information. As awareness of the importance of preserving privacy grows, there is increasing advocacy for adopting language modeling technology to mitigate the risk of personal information leakage and to deidentify sensitive information depending on the situation. Thus far, several theoretical analyses of privacy protection in Korea have been conducted. However, the technical development of language model training resources for Korean has been slower than that for widely spoken languages such as English and Chinese. To address this problem, we developed a comprehensive and organized framework for classifying Korean personally identifiable information (PII) by investigating pertinent examples, such as “Text Anonymization Benchmark” and “Network Intrusion Detection Dataset,” from within and outside Korea. Subsequently, we created a new Korean dataset for PII deidentification, KDPII, which consists of many conversational texts containing abundant Korean PII. Using this dataset, we examined the Korean PII processing performance of many representative language models available on the market. We found that although the performance of language models in identifying PII varied by model size, model architecture, and training source, most of them were significantly better at recognizing universal PII than language-specific PII. This points to expanding training data as a promising direction for implementing Korean-specific PII deidentification in the future.

Notes

"KDPII DATASET REVISED" is an integrated and revised version of the original dataset. This version has no separate train/dev/test subsets; instead, some ambiguous entities have been reassessed to improve identification performance with LLMs.
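
Because this version ships without train/dev/test subsets, users who need them have to create their own. The following is a minimal sketch of a reproducible split, not the authors' original partition. It assumes the records have already been loaded into a Python list; the function name `split_records`, the 80/10/10 ratios, and the seed are all hypothetical choices made for illustration.

```python
import random


def split_records(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle records with a fixed seed and cut them into train/dev/test.

    Assumption: `records` is a list-like collection of dataset entries.
    The ratios and seed are illustrative defaults, not values from the authors.
    """
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be three numbers that sum to 1")

    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable

    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])

    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        # The test set takes whatever remains, so no record is dropped.
        "test": shuffled[n_train + n_dev:],
    }
```

Recording the seed alongside any reported results lets others regenerate the same partition from the single integrated file.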

Files

PII_dataset_V3.json (33.3 MB)
md5:419921a5e153e8346b6d320d5f084a4a
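
The MD5 checksum listed for the file can be used to confirm that a download is intact before parsing it. The sketch below is one way to do that with the Python standard library. The helper names `md5_of_file` and `load_if_verified` are hypothetical, and nothing is assumed about the internal structure of the JSON beyond it being valid UTF-8 JSON.

```python
import hashlib
import json

# Checksum as listed on this record for PII_dataset_V3.json.
EXPECTED_MD5 = "419921a5e153e8346b6d320d5f084a4a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_if_verified(path, expected_md5=EXPECTED_MD5):
    """Parse the JSON file only if its MD5 digest matches the expected value."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch for {path}: got {actual}")
    # Assumption: the file is UTF-8 encoded JSON; its schema is not assumed here.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With the file saved locally, `data = load_if_verified("PII_dataset_V3.json")` would either return the parsed content or raise `ValueError` on a corrupted download. Inspecting `type(data)` is a sensible first step, since the record does not describe the schema.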