Published November 17, 2024 | Version v1
Dataset Open

Northern Sami YleAreena Subtitle Corpus

  • 1. Helsinki Metropolia University of Applied Sciences

Contributors

Data collector:

  • 1. Helsinki Metropolia University of Applied Sciences

Description

A corpus of Northern Sami subtitle sentences collected from YLE Sápmi broadcasts (March 31, 2021 to November 15, 2024).

The corpus includes:
- Full sentences, cleaned and collected from the subtitles
- Sentence IDs
- Complete source metadata
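
The record does not document the JSON layout, so a first step is to inspect the file without assuming any field names. The following is a minimal sketch; the helper name `describe_json` is hypothetical, and the code only reports whatever top-level structure it finds.

```python
import json


def describe_json(path):
    """Summarize the top-level structure of a JSON file.

    No schema is assumed: the function only reports the container type,
    its size, and up to ten keys (of the object itself, or of the first
    element when the top level is a list of objects).
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

    if isinstance(data, dict):
        return {"type": "dict", "size": len(data), "keys": list(data)[:10]}
    if isinstance(data, list):
        first = data[0] if data else None
        keys = list(first)[:10] if isinstance(first, dict) else []
        return {"type": "list", "size": len(data), "keys": keys}
    return {"type": type(data).__name__, "size": None, "keys": []}
```

Calling `describe_json("sami_subtitles.json")` on the downloaded file should show which keys hold the sentence text, the sentence ID, and the source metadata.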


=== Basic Statistics ===
Total Words: 7776
Unique Words: 3033
Total Sentences: 835
Average Sentence Length: 9.31 words
Median Sentence Length: 8.0 words
Type-Token Ratio: 0.39
Sentence Length Std Dev: 5.62 words
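
The figures above can be approximated from a list of sentence strings. This is a sketch under stated assumptions, not the procedure used to produce the published numbers: it assumes whitespace tokenization, lowercasing before counting unique words, and the sample standard deviation. The function name `basic_stats` is hypothetical.

```python
import statistics


def basic_stats(sentences):
    """Compute corpus statistics from a list of sentence strings.

    Assumptions (not confirmed by the dataset record): tokens are split
    on whitespace, unique words are counted case-insensitively, and the
    standard deviation is the sample standard deviation.
    """
    lengths = [len(s.split()) for s in sentences]
    tokens = [w.lower() for s in sentences for w in s.split()]
    total = len(tokens)
    unique = len(set(tokens))
    return {
        "total_words": total,
        "unique_words": unique,
        "total_sentences": len(sentences),
        "avg_sentence_length": total / len(sentences),
        "median_sentence_length": statistics.median(lengths),
        "type_token_ratio": unique / total,
        "sentence_length_std_dev": statistics.stdev(lengths),
    }


# The derived figures are consistent with the published totals:
print(round(7776 / 835, 2))   # average sentence length: 9.31
print(round(3033 / 7776, 2))  # type-token ratio: 0.39
```

A different tokenizer or punctuation handling would shift the word counts, so small deviations from the published values are to be expected.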

=== Top 10 Most Frequent Words ===
lea: 239
ja: 227
dat: 202
leat: 186
ahte: 166
mun: 124
go: 105
ii: 71
dan: 70
mii: 69
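
A frequency list like the one above can be produced with a simple counter. The sketch below assumes whitespace tokenization and lowercasing and does not strip punctuation; how the published list handled these details is not stated. The function name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(sentences, n=10):
    """Return the n most frequent tokens as (word, count) pairs.

    Tokens are split on whitespace and lowercased (an assumption);
    punctuation attached to a word is left in place.
    """
    counts = Counter(w.lower() for s in sentences for w in s.split())
    return counts.most_common(n)
```

For example, `top_words(["lea ja lea", "ja lea"], 2)` returns `[("lea", 3), ("ja", 2)]` on this toy input.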

Files

Files (135.2 kB)

Name: sami_subtitles.json
Size: 135.2 kB
Checksum: md5:d5e130f4c2c7fff2b8fd11f389a2b813
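
After downloading, the file can be checked against the MD5 checksum listed here. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the local path are assumptions.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "d5e130f4c2c7fff2b8fd11f389a2b813"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the download is intact, `md5_of_file("sami_subtitles.json") == EXPECTED_MD5` evaluates to `True`, assuming the file was saved under that name in the working directory.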

Additional details

Dates

Collected
2024-11-15