Published August 29, 2021 | Version 1.0
Dataset Open

Data for Finnish Dialect Detection

  • 1. Rootroo Ltd
  • 2. University of Helsinki

Description

The data used in the paper "Finnish Dialect Identification: The Effect of Audio and Text".

If you use the data, please cite:

Hämäläinen, Mika; Alnajjar, Khalid; Partanen, Niko & Rueter, Jack (2021). Finnish Dialect Identification: The Effect of Audio and Text. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).

The metadata.json file contains the dialectal and normalized transcriptions, the length of each wav file in milliseconds, the path to the wav file, the role of the speaker, and the dialect.

The data.zip* files contain the wav files. They are split into partial zip files to make uploading to Zenodo easier.

code_models.zip contains the code for training the bimodal model and the final trained model presented in the paper. There is also an OpenNMT model, which is the text-only model described in the paper. You can use it by running:

    python3 -m onmt.bin.translate -model nmt-model_step_100000.pt -src source_test.txt -output pred.txt -replace_unk
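As a rough illustration of how the metadata might be used, the sketch below totals the amount of audio per dialect from the lengths given in milliseconds. The exact layout of metadata.json is not documented here, so everything about its shape is an assumption: the sketch assumes a list of objects (or an object whose values are such objects), and the key names "dialect" and "length" are hypothetical placeholders that should be replaced after inspecting the actual file.

```python
import json
from collections import defaultdict


def load_records(path):
    """Load metadata records from a JSON file.

    Assumption: the top level is either a list of record objects or an
    object whose values are the records. Adjust after inspecting the file.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return list(data.values()) if isinstance(data, dict) else list(data)


def hours_per_dialect(records, dialect_key="dialect", length_key="length"):
    """Sum clip lengths (milliseconds) per dialect and convert to hours.

    The key names are hypothetical; pass the real ones as arguments.
    """
    totals_ms = defaultdict(int)
    for rec in records:
        totals_ms[rec[dialect_key]] += rec[length_key]
    # 3,600,000 milliseconds in one hour
    return {dialect: ms / 3_600_000 for dialect, ms in totals_ms.items()}
```

A call such as hours_per_dialect(load_records("metadata.json")) would then return a mapping from dialect label to hours of audio, provided the assumed keys match the real ones.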

Our dataset is licensed under CC BY-NC-ND 4.0. Academic use only.

The corpus is based on Suomenkielen näytteitä (CC BY, Kotimaisten kielten keskus).

Files

Files (25.5 GB)

Checksum                                Size
md5:02a52219b8e41900a67522c601e65221    4.8 GB
md5:d818136c7f83e39d4c9345bbbd275cad    1.0 GB
md5:29172159604e30cad48ede07459d4e24    1.0 GB
md5:58cc1b931f4a98a48b88e54339482585    1.0 GB
md5:78f634bca3ae407b477b4d8dc72a2277    1.0 GB
md5:a518f6b8bdfc17ac1f579c3be475ba3e    1.0 GB
md5:8b5ec4f39f2dd5ffee712b2f33f544b0    1.0 GB
md5:50b504928d9e8968882ecacf0e1efdbe    1.0 GB
md5:04c4c664897fbe74220cdc567f653ca1    1.0 GB
md5:26205deb2c850f4da96edd6267aec86d    1.0 GB
md5:89e0c48892c84d5982364a801538c202    1.0 GB
md5:97faee5fada4b8fa8f15ce86501ab6c1    1.0 GB
md5:ef90892e21a92b5576bb0c6588e2082c    1.0 GB
md5:45d19d350513b3fe28fcee8775f6e672    1.0 GB
md5:51946111ab3ff762680da9d8141e4496    1.0 GB
md5:48e50d7aaee8bbfdc98a553ef922f869    1.0 GB
md5:26e2e5ad26a8b40ac6380e31ff286443    1.0 GB
md5:e259dd23ac8f12d953aaed5d0cf7b6b1    1.0 GB
md5:170b85263638781311f6644ac190ce5b    1.0 GB
md5:0b7b8395e4c2236a65bb9aee20a6450a    1.0 GB
md5:d75a413688439a57adce0e3f2d70dd40    1.0 GB
md5:3ad7a2877856962b3c6f078169ad8866    685.3 MB
md5:8d5e60d9742afe8228acacc31b1143ca    19.6 MB
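
Each file in the listing above carries an MD5 checksum, which can be used to confirm that a multi-gigabyte download arrived intact. The following is a minimal sketch of such a check using only Python's standard hashlib module. The function names md5_of_file and matches_listing are illustrative and not part of the dataset's code, and the local file path is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a local file against a checksum as listed on the record
    page, e.g. 'md5:8d5e60d9742afe8228acacc31b1143ca'."""
    # Accept the value with or without the leading 'md5:' label.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listing("metadata.json", "md5:...") with the checksum copied from the row belonging to that file should return True for an uncorrupted download.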

Additional details

References

  • Hämäläinen, Mika; Alnajjar, Khalid; Partanen, Niko & Rueter, Jack (2021). Finnish Dialect Identification: The Effect of Audio and Text. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).