Published March 3, 2026 | Version v1
Poster Open

mailcom: Pseudonymization Tool for Textual Data

Description

The rapid growth of data and its use by Artificial Intelligence applications has heightened concerns about data privacy. Researchers often need to analyze datasets that contain personal information, sometimes paired with sensitive attributes such as medical records or political views. To support such analyses without exposing identifiable content, the Scientific Software Center (SSC) of Heidelberg University developed the mailcom package for pseudonymization. This capability is especially important when web-hosted Large Language Models are used for downstream analysis.

As a use case, we applied mailcom to a multilingual email corpus in Spanish, French, and Portuguese contributed by multiple donors as part of a pilot study, in collaboration with the research group of Sybille Große (Department of Romance Studies, Heidelberg University). To protect donor privacy, sensitive information such as names, email addresses, and numbers is extracted and pseudonymized. The package processes text from email subjects and bodies in eml and html formats, as well as from csv rows, making it applicable to a wide range of textual data beyond email.
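
The replacement of email addresses and numbers described above can be illustrated with a minimal, standard-library sketch. The regular expressions, the placeholder strings, and the function name `pseudonymize_text` are assumptions made for illustration and are not taken from mailcom itself; the package's actual rules may differ.

```python
import re

# Hypothetical patterns and placeholders, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+")


def pseudonymize_text(text: str) -> str:
    """Replace email addresses first, then any remaining runs of digits."""
    # Addresses go first so that digits inside them are not replaced separately.
    text = EMAIL_RE.sub("[email]", text)
    return NUMBER_RE.sub("[number]", text)


sample = "Escríbeme a ana.perez@example.org o llama al 5551234."
print(pseudonymize_text(sample))
```

Handling addresses before numbers keeps an address such as `ana1@example.org` from being split into a partly replaced string.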

mailcom is built entirely on open-source libraries and is designed for configurability and extensibility. Its core features are: (i) language identification, (ii) named-entity recognition, (iii) extraction of temporal expressions, and (iv) de-identification of sensitive data via pseudonyms. The three aforementioned languages (Spanish, French, and Portuguese) are supported by default, with options to add further languages and change back-end libraries via configuration.
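
As a rough illustration of feature (iv), the sketch below replaces person names with pseudonyms once entity spans are known. It assumes the spans come from a named-entity recognizer and do not overlap; here a regular expression over two known names stands in for that step. The `Entity` type, the `PER` label, the pseudonym list, and `replace_persons` are hypothetical names, not mailcom's API.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    label: str  # entity type, e.g. "PER" for a person


# Hypothetical pseudonym pool; it wraps around if more names occur than entries.
PSEUDONYMS = ("Claude", "Dominique", "Camille")


def replace_persons(text: str, entities: list[Entity]) -> tuple[str, dict[str, str]]:
    """Replace person spans so that the same name always gets the same pseudonym."""
    persons = sorted((e for e in entities if e.label == "PER"), key=lambda e: e.start)
    mapping: dict[str, str] = {}
    for ent in persons:
        name = text[ent.start:ent.end]
        if name not in mapping:
            mapping[name] = PSEUDONYMS[len(mapping) % len(PSEUDONYMS)]
    out = text
    # Work from right to left so that earlier offsets stay valid.
    for ent in reversed(persons):
        out = out[:ent.start] + mapping[text[ent.start:ent.end]] + out[ent.end:]
    return out, mapping


text = "Marie a écrit à Paul. Paul a répondu à Marie."
# Stand-in for a recognizer: locate two known names with a regular expression.
entities = [Entity(m.start(), m.end(), "PER") for m in re.finditer(r"Marie|Paul", text)]
result, mapping = replace_persons(text, entities)
print(result)
```

Keeping one mapping per document preserves who wrote to whom, which matters for downstream linguistic analysis, while hiding the original names.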

We present these features in end-to-end processing pipelines using examples from our use case. The main parts include:

(1) General workflow from raw text to pseudonymized output,
(2) Default libraries and techniques (e.g. eml-parser, spaCy, langid, langdetect, transformers, and rule-based methods),
(3) Mechanisms for adapting to new languages, transformer pipelines, and spaCy models with minimal effort.
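
One way to picture point (3) is a lookup that merges default per-language models with user-supplied overrides. This is only a sketch under stated assumptions: the spaCy model names below are published spaCy pipelines, but whether mailcom uses them as defaults is an assumption, and `DEFAULT_MODELS` and `resolve_model` are hypothetical names rather than part of the package.

```python
from typing import Optional

# Assumed defaults for the three supported languages (illustrative only).
DEFAULT_MODELS = {
    "es": "es_core_news_md",
    "fr": "fr_core_news_md",
    "pt": "pt_core_news_md",
}


def resolve_model(lang: str, overrides: Optional[dict[str, str]] = None) -> str:
    """Return the model name for a language, preferring user overrides."""
    models = {**DEFAULT_MODELS, **(overrides or {})}
    if lang not in models:
        raise ValueError(f"No model configured for language '{lang}'")
    return models[lang]


# Adding German only requires one extra configuration entry.
print(resolve_model("fr"))
print(resolve_model("de", {"de": "de_core_news_md"}))
```

Because overrides are merged over the defaults, a new language or a different model for an existing language can be configured without touching the processing code.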

Since pseudonymized outputs still require human review to guarantee full anonymization, the package serves as a scalable pre-processing layer that reduces manual work while establishing a principled baseline of privacy protection. This reproducible, privacy-aware tool enables empirical research on digital text under current data-ethics and governance standards.

Files

deRSE26_mailcom_SSC_Heidelberg.pdf (491.7 kB)
md5:cb95037f25715995ea3f57f194a898bb

Additional details

Software

Repository URL: https://github.com/ssciwr/mailcom
Programming language: Python
Development Status: Active

References

  • L. Bothe, S. Große, Datensammlung in der Romanistik–Eine Analyse von Normierung und Standardisierung in E-Mails, E-Science-Tage (2023) 132–139. URL https://doi.org/10.11588/heibooks.1288.c18070
  • G. Toth, eml-parser: Python EML parser library (version 2.0.0) (2024). URL https://pypi.org/project/eml-parser/
  • M. Lui, langid.py: A standalone Language Identification (LangID) tool (version 1.1.6) (2016). URL https://github.com/saffsd/langid.py
  • M. M. Danilak, langdetect: Language detection library ported from Google's language-detection (version 1.0.9) (2021). URL https://pypi.org/project/langdetect/
  • L. Papariello, xlm-roberta-base-language-detection (Revision 9865598) (2024). doi:10.57967/hf/2064. URL https://huggingface.co/papluca/xlm-roberta-base-language-detection
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, arXiv preprint arXiv:1911.02116 (2019).
  • Scrapinghub, dateparser: Python parser for human readable dates (version 1.2.1) (2025). URL https://pypi.org/project/dateparser/
  • M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.