Telenor Nordics Customer Service Self-Help Corpus

Riess, Mike

doi:10.5281/zenodo.20732652

Published June 17, 2026 | Version 1.1

Dataset Open

Riess, Mike (Researcher)¹

1. Telenor (Norway)

This is a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents were sourced from the public self-help pages of four Nordic telecommunications operators and then filtered for personally identifiable information (PII) and relevance through a combined LLM and human annotation pipeline. The accompanying paper has been submitted to the Nordic Machine Intelligence Journal and is pending peer review.

Version 1.1
- Added a derived metadata.topic_classification field to every document
(zero-shot category, similarity score, model, text source, prompt language).
- Corpus size is now reported in spaCy word tokens and characters (previously
subword tokens); added per-language linguistic statistics and a length figure.
- Updated and simplified the reproduction code (analysis/lingcount, analysis/topicclass)
and the documentation.
- Document text, filtering, PII and content spans are unchanged from v1.0.

Files

Name	Size
tn_selfhelp_corpus-v1.1.zip (md5:e3114eefdbfa3d58342e05bcc85e41e5)	6.2 MB
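
The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local path assumes the archive was saved to the current working directory under its original name.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for tn_selfhelp_corpus-v1.1.zip.
EXPECTED_MD5 = "e3114eefdbfa3d58342e05bcc85e41e5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("tn_selfhelp_corpus-v1.1.zip")
    if archive.exists():
        print("checksum OK" if matches_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting a corrupted or truncated download; it should not be treated as a guarantee against deliberate tampering.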

Additional details

URL: http://www.telenor.com
arXiv: arXiv:2605.26891

Submitted: 2026-04-10 (Submitted to Zenodo)
Updated: 2026-04-10 (Updated version with Topic metadata)

Repository URL: https://github.com/tnresearch/tn_selfhelp_corpus
