Published October 26, 2025 | Version v1.0.0
Dataset Open

German word2vec embeddings trained on OpenSubtitles Part 2

  • 1. Harrisburg University of Science and Technology

Description

This dataset contains the subs2vec embeddings for German, as presented in https://zenodo.org/records/17243814. The embeddings were trained on large-scale subtitle corpora from the OpenSubtitles 2018 datasets (https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) and represent semantic vector spaces derived from naturalistic language use in films and television.

For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:

  • Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
  • Window size: varying context windows (e.g., 2, 5, 10, …)

Each file corresponds to a unique configuration (dimension × window size).

In each file, column 1 holds the vocabulary entry (the German word) and columns 2 through d + 1 hold the embedding values, where d is the dimensionality of that file.
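The column layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: it assumes the files are plain text with whitespace-separated columns and that a word2vec-style header line ("<n_words> <dim>") may or may not be present. The function name `load_embeddings` and the sample rows are hypothetical and used only for illustration.

```python
import io

import numpy as np


def load_embeddings(handle, max_words=None):
    """Parse lines of the form: word v1 v2 ... vd (whitespace-delimited, assumed)."""
    words, vectors = [], []
    for line_no, line in enumerate(handle):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Assumption: an optional header "<n_words> <dim>" on the first line.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        words.append(parts[0])                         # column 1: vocabulary entry
        vectors.append([float(x) for x in parts[1:]])  # columns 2 .. d + 1
        if max_words is not None and len(words) >= max_words:
            break
    return words, np.array(vectors, dtype=np.float32)


# Tiny in-memory stand-in for a real file (made-up values, d = 3).
sample = io.StringIO("2 3\nhaus 0.1 0.2 0.3\nkatze 0.4 0.5 0.6\n")
words, matrix = load_embeddings(sample)
print(words, matrix.shape)
```

For a downloaded file, the same function can be given an open text handle, for example `open(path, encoding="utf-8")`. Since each file is several gigabytes, passing `max_words` is a cheap way to inspect the first rows before loading everything into memory.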

If you use this dataset, please cite:

Files (49.0 GB)

Checksum                                Size
md5:71519a5034b1fffb3f842ef7cbbfce4d    2.7 GB
md5:9a5f21e271c02bd56fa5e304fbde0cd4    2.7 GB
md5:9290e5296a592d9dfefb2c1ba0caec14    2.7 GB
md5:a7272ac1e4ba7090a2dc36a25ed28f82    4.1 GB
md5:97d164fe87abfccb073baab66894ec95    4.1 GB
md5:af987cff0ead93ed1fb873d0472c4307    4.1 GB
md5:1654c922a08b6e1b3c442095359d1a6b    4.1 GB
md5:2c38e9e7ef880b8bfb1e22b6cd3bc29a    4.1 GB
md5:412604a1c93c5ad22fc1987a489f98e6    4.1 GB
md5:69961c53536c9da76f3e2b18beb83685    4.1 GB
md5:fffeb94c78f6a0c7847b31a874d638ef    4.1 GB
md5:1f9b65cb090f7fb54cea14c81ee05513    4.1 GB
md5:8360f49714958bdb21f3a2b5e59e3a2e    4.1 GB
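Because each file is several gigabytes, it may be worth checking a download against its listed MD5 checksum before use. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the file is read in chunks so it never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a local file against a listed value such as 'md5:7151...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call would look like `matches_listing(local_path, "md5:71519a5034b1fffb3f842ef7cbbfce4d")`, where `local_path` is a placeholder for wherever the corresponding file was saved; the listing above does not show the file names.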

Additional details

Related works

Is supplement to
Standard: 10.5281/zenodo.17243812 (DOI)

Software