Published November 23, 2025 | Version v1
Preprint Open

ConfidensIA: A fine-grained pseudonymization system for the French social and healthcare (médico-social) sector

  • 1. ConfidensIA

Description

[FRENCH VERSION]

The French social and healthcare (médico-social) sector produces large volumes of professional documents containing highly sensitive data. No existing pseudonymization system is adapted to its vocabulary, its organizations, and the requirements of the Code de l'Action Sociale et des Familles (CASF).

This article presents ConfidensIA, a hybrid system combining an optimized CamemBERT NER model (distillation to 11 layers, magnitude-based pruning, FP16 quantization), 338 expert rules, and 25,916 gazetteer entries. The main contribution is a fine-grained taxonomy of 100 categories of identifying entities, covering social and healthcare establishments, public bodies, identifiers, addresses, associations, and service units.

On a gold standard corpus of 330 sentences (448 entities), ConfidensIA achieves an overall F1 of 86.1%, and 95.6% on critical entities (NIR, birth dates, complete addresses). All performance targets by criticality level are met or exceeded: CRIT ≥ 95% (achieved: 95.6%), ELEV ≥ 85% (achieved: 97.7%), MOY ≥ 70% (achieved: 91.1%), FAIB ≥ 60% (achieved: 82.9%).

The reversible pseudonymization mechanism enables GDPR-compliant use of large language models (LLMs) without exposing sensitive data: texts are pseudonymized locally, processed by external APIs with typed tokens, then de-pseudonymized locally. This architecture follows privacy-by-design principles.

The distilled (student) model is published under the MIT license on Hugging Face: https://huggingface.co/jmdanto/titibongbong_camemBERT_NER

The complete pipeline remains private for commercial development reasons.

 

Abstract (English)

[ENGLISH VERSION]

This paper presents ConfidensIA, an automatic pseudonymization system designed for professional documents in the French social and healthcare (médico-social) sector. To address GDPR requirements and the lack of dedicated solutions, we propose a hybrid approach combining a Named Entity Recognition (NER) model based on distilled CamemBERT, 338 expert rules, and 25,916 gazetteer entries.
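As an illustration of how three detectors might be combined, the sketch below merges candidate spans from rules, gazetteers, and the NER model into one non-overlapping set. The abstract does not describe the actual merge strategy, so the `Span` type, the `PRIORITY` order, and `merge_spans` are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity category, e.g. "NIR" or "EHPAD"
    source: str  # "rule", "gazetteer", or "ner"


# Assumed priority order (lower wins on overlap); not taken from the paper.
PRIORITY = {"rule": 0, "gazetteer": 1, "ner": 2}


def merge_spans(candidates):
    """Keep a non-overlapping subset of spans, preferring
    higher-priority sources, then longer spans."""
    ranked = sorted(
        candidates,
        key=lambda s: (PRIORITY[s.source], -(s.end - s.start), s.start),
    )
    kept = []
    for span in ranked:
        # Accept the span only if it does not overlap anything already kept.
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)
```

Giving deterministic rules precedence over model predictions is one common design choice for structured identifiers; a real system could equally resolve overlaps by confidence score.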

The main contribution is a fine-grained taxonomy of 100 identifying entity categories covering the specificities of the French social services sector: specialized establishments (EHPAD, MECS, IME, SESSAD, SAVS, SAMSAH, FAM, MAS, ESAT, etc.), public organizations (ASE, MDPH, CCAS, CAF, CPAM, Conseil Départemental, etc.), associations, identifiers, addresses, service units, dates, and persons. This granularity enables distinguishing between generic organizations and sector-specific entities absent from standard NER taxonomies (PER, ORG, LOC).

The NER module is based on an optimized version of CamemBERT through distillation (11 layers), magnitude-based pruning (15.14%), and FP16 quantization. This optimization reduces model size by 53% (419 MB → 196 MB) while maintaining performance within 1.3% F1 of the original model. The system integrates 338 expert rules covering structured identifiers (NIR, phone numbers, emails), French postal addresses, public organizations, and healthcare establishments, along with 25,916 gazetteer entries for French cities, first names, departments, national associations, and known establishments.
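To make the notion of an expert rule for structured identifiers concrete, here is a minimal sketch of a NIR detector. It relies on the public structure of the NIR (13 digits followed by a two-digit control key equal to 97 minus the number modulo 97). The pattern, the function name `find_nirs`, and the decision to skip Corsican department codes (2A/2B) are assumptions for illustration, not the rules shipped in ConfidensIA.

```python
import re

# 1 digit (sex), then year, month, department, commune, order number, key.
# Optional single spaces between groups; Corsican codes 2A/2B are not
# handled in this simplified sketch.
NIR_PATTERN = re.compile(
    r"\b([12])\s?(\d{2})\s?(\d{2})\s?(\d{2})\s?(\d{3})\s?(\d{3})\s?(\d{2})\b"
)


def find_nirs(text):
    """Return (start, end, "NIR") for every candidate whose control key
    matches 97 - (first 13 digits mod 97)."""
    hits = []
    for m in NIR_PATTERN.finditer(text):
        body = "".join(m.groups()[:6])  # the 13 digits before the key
        key = int(m.group(7))
        if key == 97 - int(body) % 97:
            hits.append((m.start(), m.end(), "NIR"))
    return hits
```

Validating the control key lets a rule of this kind reject most random 15-digit sequences that merely look like a NIR.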

On a gold standard corpus of 330 sentences containing 448 entities, ConfidensIA achieves an overall F1-score of 86.1% and 95.6% on critical entities (NIR, birth dates, complete addresses). All performance targets by criticality level (CRIT: critical, ELEV: high, MOY: medium, FAIB: low) are met or exceeded: CRIT ≥ 95% (achieved: 95.6%), ELEV ≥ 85% (achieved: 97.7%), MOY ≥ 70% (achieved: 91.1%), FAIB ≥ 60% (achieved: 82.9%).
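The per-level check above can be reproduced mechanically. The sketch below defines the standard F1 from true positives, false positives, and false negatives, then compares the reported scores with their targets. The threshold and score values are copied from the abstract; the function names are hypothetical, and the abstract does not state whether scores are micro- or macro-averaged or how entity matches are counted.

```python
def f1_score(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Targets and achieved scores (in %) as reported in the abstract.
TARGETS = {"CRIT": 95.0, "ELEV": 85.0, "MOY": 70.0, "FAIB": 60.0}
ACHIEVED = {"CRIT": 95.6, "ELEV": 97.7, "MOY": 91.1, "FAIB": 82.9}


def unmet_targets(targets, achieved):
    """Return the criticality levels whose score falls below its target."""
    return [level for level, t in targets.items() if achieved[level] < t]
```

With the reported figures, `unmet_targets(TARGETS, ACHIEVED)` returns an empty list, consistent with the statement that every target is met.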

The reversible pseudonymization mechanism enables GDPR-compliant use of Large Language Models (LLMs) without exposing sensitive data: texts are pseudonymized locally, processed by external APIs with typed tokens, then de-pseudonymized locally. This architecture follows privacy-by-design principles.
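The local round trip described here can be sketched as follows, assuming entity spans have already been detected. The `Pseudonymizer` class, the `[LABEL_n]` token format, and the in-memory mapping are assumptions for illustration; the abstract states only that tokens are typed and that the mapping stays local.

```python
import re


class Pseudonymizer:
    """Replaces detected entities with typed tokens and keeps the
    mapping in local memory so the substitution can be reversed."""

    def __init__(self):
        self._forward = {}   # (label, surface form) -> token
        self._reverse = {}   # token -> surface form
        self._counters = {}  # label -> last index used

    def _token_for(self, label, surface):
        key = (label, surface)
        if key not in self._forward:
            n = self._counters.get(label, 0) + 1
            self._counters[label] = n
            token = f"[{label}_{n}]"  # hypothetical token format
            self._forward[key] = token
            self._reverse[token] = surface
        return self._forward[key]

    def pseudonymize(self, text, spans):
        """spans: non-overlapping (start, end, label) tuples."""
        out, cursor = [], 0
        for start, end, label in sorted(spans):
            out.append(text[cursor:start])
            out.append(self._token_for(label, text[start:end]))
            cursor = end
        out.append(text[cursor:])
        return "".join(out)

    def depseudonymize(self, text):
        """Restore original values; unknown tokens are left untouched."""
        return re.sub(
            r"\[[A-Z_]+_\d+\]",
            lambda m: self._reverse.get(m.group(0), m.group(0)),
            text,
        )
```

Because the same surface form always maps to the same token, an external model can still tell that two mentions refer to the same person, and its reply can be passed to `depseudonymize` on the local machine to restore the real values.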

The distilled NER model is published under the MIT license on Hugging Face: https://huggingface.co/jmdanto/titibongbong_camemBERT_NER

The complete pipeline remains proprietary for commercial development reasons. The methodology described enables conceptual reproduction by the research community.

This work paves the way for contextual and reversible pseudonymization of French social and healthcare documents, facilitating data exploitation by language models while ensuring GDPR compliance.

Files

Article_ConfidensIA_Corrige_v2.pdf

Size: 36.0 kB
md5:b83b495b1d22a3f871e734ca7f08041c

Additional details

Dates

Issued
2025-11-23