Published June 17, 2026 | Version 1.1

Telenor Nordics Customer Service Self-Help Corpus

  • 1. Telenor (Norway)

Description

This is a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents were sourced from the public self-help pages of four Nordic telecommunications operators and then filtered for personally identifiable information (PII) and relevance through a combined LLM and human annotation pipeline. The accompanying paper has been submitted to the Nordic Machine Intelligence Journal and is pending peer review.

Version 1.1
  - Added a derived metadata.topic_classification field to every document
    (zero-shot category, similarity score, model, text source, prompt language).
  - Corpus size is now reported in spaCy word tokens and characters (previously
    subword tokens); added per-language linguistic statistics and a length figure.
  - Updated and simplified the reproduction code (analysis/lingcount, analysis/topicclass)
    and the documentation.
  - Document text, filtering, PII and content spans are unchanged from v1.0.
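
As a rough illustration of how the new metadata.topic_classification field might be consumed, the sketch below counts documents per zero-shot category above a similarity threshold. It assumes each document loads as a dictionary with a nested "metadata" object, and that the category and similarity score are stored under the keys "category" and "similarity_score"; these key names, the sample records, and the helper functions are hypothetical and should be checked against the documentation shipped in the archive.

```python
from collections import Counter


def topic_of(doc):
    """Return the topic_classification dict of a document, or None if absent.

    Assumes the (hypothetical) layout doc["metadata"]["topic_classification"].
    """
    return doc.get("metadata", {}).get("topic_classification")


def category_counts(docs, min_score=0.0):
    """Count documents per category, keeping only scores >= min_score.

    The key names "category" and "similarity_score" are assumptions.
    """
    counts = Counter()
    for doc in docs:
        topic = topic_of(doc)
        if topic is None:
            continue  # document carries no topic metadata
        if topic.get("similarity_score", 0.0) >= min_score:
            counts[topic.get("category", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Made-up records for illustration only; not taken from the corpus.
    sample = [
        {"metadata": {"topic_classification": {"category": "billing", "similarity_score": 0.71}}},
        {"metadata": {"topic_classification": {"category": "billing", "similarity_score": 0.32}}},
        {"metadata": {"topic_classification": {"category": "roaming", "similarity_score": 0.64}}},
        {"metadata": {}},
    ]
    print(category_counts(sample, min_score=0.5))
```

Thresholding on the similarity score is one way to trade coverage for label confidence when using the derived categories; the cut-off of 0.5 here is arbitrary.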

Files

tn_selfhelp_corpus-v1.1.zip (6.2 MB)
md5:e3114eefdbfa3d58342e05bcc85e41e5
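
The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the local file path and the helper names md5_of_file and verify_download are assumptions for illustration, not part of the corpus tooling.

```python
import hashlib
from pathlib import Path

# Checksum published on this record for tn_selfhelp_corpus-v1.1.zip.
EXPECTED_MD5 = "e3114eefdbfa3d58342e05bcc85e41e5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("tn_selfhelp_corpus-v1.1.zip")  # assumed local download path
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is suitable here for detecting transfer corruption, not for guarding against deliberate tampering.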

Additional details

Dates

Submitted
2026-04-10
Submitted to Zenodo
Updated
2026-04-10
Updated version with Topic metadata