Published February 5, 2026 | Version v1
Conference paper Open

Privacy Preservation in Textual Data: A Systematic Mapping Study on Differential Privacy and Semantic Similarity

  • 1. Universidade de Brasília

Description

Background: Artificial Intelligence and Machine Learning solutions rely heavily on extracting value from data, often in textual form. Ethical considerations and data protection regulations have intensified the focus on safeguarding sensitive information. Disclosure risks in textual datasets, especially when analyzed through the lens of differential privacy, are influenced by text frequency, semantic similarity, and the presence of rare events. Goal: This work aims to identify state-of-the-art techniques for privacy-preserving processing of textual data. The focus is on enabling the application of privacy-enhancing methods to unstructured data, particularly text, and on approaches to semantic similarity. Method: A Systematic Mapping Study (SMS) was conducted to investigate state-of-the-art privacy preservation techniques. Peer-reviewed studies published between 2010 and 2025 were retrieved from the ACM Digital Library, IEEE Xplore, Scopus, and Web of Science. Techniques highlighted in a significant number of studies were selected for deeper analysis and potential application in software engineering. The methodology incorporates concepts from differential privacy, vector databases, semantic similarity, and rare event detection. Results: This study identifies state-of-the-art techniques for privacy-preserving textual data analysis and text similarity. It also investigates how data science methods, large language models, and agent-based AI systems support the implementation of privacy-preserving mechanisms. Additionally, it highlights techniques used for semantic similarity and rare event detection in text-based contexts. The identified techniques provide a foundation for defining guidelines, best practices, and validated methods that can enhance software engineering maturity throughout the lifecycle of textual data (collection, storage, and processing) while addressing privacy risks and regulatory compliance requirements.

Files

Search-String.pdf

Files (2.5 MB)
