Telenor Nordics Customer Service Self-Help Corpus
Description
This multilingual customer service self-help corpus comprises 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents were sourced from the public self-help pages of four Nordic telecommunications operators and then filtered for personally identifiable information (PII) and relevance through a combined LLM and human annotation pipeline. The accompanying paper has been submitted to the Nordic Machine Intelligence Journal and is pending peer review.
Version 1.1
- Added a derived metadata.topic_classification field to every document
(zero-shot category, similarity score, model, text source, prompt language).
- Corpus size is now reported in spaCy word tokens and characters (previously
subword tokens); added per-language linguistic statistics and a length figure.
- Updated and simplified the reproduction code (analysis/lingcount, analysis/topicclass)
and the documentation.
- Document text, filtering, PII and content spans are unchanged from v1.0.
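A minimal sketch of reading the new field, assuming each document is available as a parsed JSON-like dictionary in which the dotted name metadata.topic_classification denotes a nested object. The storage format, the helper name `get_topic_classification`, and the sub-key used in the usage note are assumptions for illustration, not taken from the corpus documentation.

```python
from typing import Any, Optional


def get_topic_classification(document: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the nested topic classification object, or None if absent.

    Assumes (hypothetically) that a document is a dict with a "metadata"
    dict that may contain a "topic_classification" dict.
    """
    metadata = document.get("metadata")
    if not isinstance(metadata, dict):
        return None
    topic = metadata.get("topic_classification")
    # Only return well-formed (dict) entries; anything else counts as missing.
    return topic if isinstance(topic, dict) else None
```

For example, `get_topic_classification({"metadata": {"topic_classification": {"category": "<placeholder>"}}})` returns the inner dictionary, while a document without the field yields `None`. The key `category` here is a placeholder; the real sub-key names should be checked against the repository documentation.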
Files
| Name | Size | Checksum |
|---|---|---|
| tn_selfhelp_corpus-v1.1.zip | 6.2 MB | md5:e3114eefdbfa3d58342e05bcc85e41e5 |
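The MD5 checksum listed for tn_selfhelp_corpus-v1.1.zip can be used to confirm that a download is intact. The following is a small sketch using only the Python standard library; the function names `md5_of_file` and `verify_archive` and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as listed on this record for tn_selfhelp_corpus-v1.1.zip.
EXPECTED_MD5 = "e3114eefdbfa3d58342e05bcc85e41e5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("tn_selfhelp_corpus-v1.1.zip")` on the downloaded archive should return `True` if the file matches the listed checksum. MD5 is suitable here only as a check for transfer errors, not as a security guarantee.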
Additional details
Identifiers
- URL
- http://www.telenor.com
- arXiv
- arXiv:2605.26891
Dates
- Submitted
- 2026-04-10: Submitted to Zenodo
- Updated
- 2026-04-10: Updated version with topic metadata
Software
- Repository URL
- https://github.com/tnresearch/tn_selfhelp_corpus