Published November 17, 2024 | Version v1
Dataset
Open
Northern Sami YleAreena Subtitle Corpus
Description
A corpus of Northern Sami subtitle sentences collected from YLE Sápmi broadcasts aired between March 31, 2021 and November 15, 2024.
The corpus includes:
- Full sentences, cleaned and collected from the subtitles
- Sentence IDs
- Complete source metadata
=== Basic Statistics ===
Total Words: 7776
Unique Words: 3033
Total Sentences: 835
Average Sentence Length: 9.31 words
Median Sentence Length: 8.0 words
Type-Token Ratio: 0.39
Sentence Length Std Dev: 5.62
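The figures above are consistent with the usual definitions: 7776 / 835 ≈ 9.31 words per sentence and 3033 / 7776 ≈ 0.39 for the type-token ratio. A minimal sketch of how such statistics could be recomputed from a list of sentence strings follows. The tokenization (lowercasing and splitting on whitespace) and the use of the sample standard deviation are assumptions, since the record does not state how the numbers were produced, and `basic_stats` is a hypothetical helper name.

```python
import statistics


def basic_stats(sentences):
    """Compute corpus statistics from a list of sentence strings.

    Assumption: tokens are lowercased, whitespace-separated strings.
    Requires at least two sentences (statistics.stdev needs two values).
    """
    tokenized = [s.lower().split() for s in sentences]
    lengths = [len(tokens) for tokens in tokenized]
    all_tokens = [tok for tokens in tokenized for tok in tokens]
    unique_tokens = set(all_tokens)
    return {
        "total_words": len(all_tokens),
        "unique_words": len(unique_tokens),
        "total_sentences": len(sentences),
        "avg_sentence_length": statistics.mean(lengths),
        "median_sentence_length": statistics.median(lengths),
        "type_token_ratio": len(unique_tokens) / len(all_tokens),
        # Assumption: sample standard deviation; the record may use the
        # population form (statistics.pstdev) instead.
        "sentence_length_std_dev": statistics.stdev(lengths),
    }
```

A different tokenizer (for example one that strips punctuation) would change the word counts, so small deviations from the published figures would not be surprising.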
=== Top 10 Most Frequent Words ===
lea: 239
ja: 227
dat: 202
leat: 186
ahte: 166
mun: 124
go: 105
ii: 71
dan: 70
mii: 69
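A frequency list like the one above can be sketched with `collections.Counter`. As before, lowercased whitespace tokenization is an assumption rather than the documented method, and `top_words` is a hypothetical helper name.

```python
from collections import Counter


def top_words(sentences, n=10):
    """Return the n most frequent tokens as (word, count) pairs.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts.most_common(n)
```

Calling `top_words(sentences)` on the corpus sentences returns pairs in descending order of count, which can be printed in the `word: count` layout used above.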
Files
| Name | Size | MD5 |
|---|---|---|
| sami_subtitles.json | 135.2 kB | d5e130f4c2c7fff2b8fd11f389a2b813 |
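The MD5 checksum listed for sami_subtitles.json can be used to confirm that a downloaded copy is intact. The sketch below uses Python's standard `hashlib` module; the local file path passed to `verify` is an assumption, and `md5_of_file` and `verify` are hypothetical helper names.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "d5e130f4c2c7fff2b8fd11f389a2b813"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

For example, `verify("sami_subtitles.json")` should return `True` for an unmodified download saved under that name in the current directory.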
Additional details
Dates
- Collected: 2024-11-15