German word2vec embeddings trained on OpenSubtitles Part 2
Description
This dataset contains the subs2vec embeddings for German, as presented in https://zenodo.org/records/17243814. The embeddings were trained on the large-scale subtitle corpora of the OpenSubtitles 2018 collection (https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) and represent semantic vector spaces derived from naturalistic language use in films and television.
For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:
- Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
- Window size: varying context windows (e.g., 2, 5, 10, …)
Each file corresponds to one unique configuration (dimension × window size).
Within a file, column 1 holds the German vocabulary and columns 2 through d + 1 hold the embedding values, where d is the dimensionality of that configuration.
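As a minimal sketch of how one of these files might be read, the Python below parses each row into a word (column 1) and its vector (the remaining columns). The single-space delimiter, the absence of a header row, UTF-8 encoding, and the names `load_embeddings` and `cosine_similarity` are assumptions made for illustration and are not stated in this record; inspect the first few lines of a downloaded file and adjust the arguments accordingly.

```python
import math


def load_embeddings(path, delimiter=" ", skip_header=False, limit=None):
    """Read word vectors from a delimited text file into a dict.

    Assumed layout (hypothetical, check against the real files):
    column 1 is the word, every following column is one embedding value.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        if skip_header:
            # Skip one leading line, e.g. a "vocab_size dimension" header.
            next(handle, None)
        for line in handle:
            parts = line.rstrip("\r\n").split(delimiter)
            if len(parts) < 2:
                continue  # blank or malformed row
            # Ignore empty fields produced by trailing delimiters.
            vectors[parts[0]] = [float(value) for value in parts[1:] if value]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because each file is several gigabytes, passing a `limit` (for example, the first 50,000 rows) keeps memory use manageable while exploring; every vector in a given file should have the same length, equal to that file's dimensionality.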
If you use this dataset, please cite:
- Manuscript: https://doi.org/10.5281/zenodo.17243812
- Data: This Zenodo dataset (using the DOI provided here)
Files
(49.0 GB)
| MD5 checksum | Size |
|---|---|
| 71519a5034b1fffb3f842ef7cbbfce4d | 2.7 GB |
| 9a5f21e271c02bd56fa5e304fbde0cd4 | 2.7 GB |
| 9290e5296a592d9dfefb2c1ba0caec14 | 2.7 GB |
| a7272ac1e4ba7090a2dc36a25ed28f82 | 4.1 GB |
| 97d164fe87abfccb073baab66894ec95 | 4.1 GB |
| af987cff0ead93ed1fb873d0472c4307 | 4.1 GB |
| 1654c922a08b6e1b3c442095359d1a6b | 4.1 GB |
| 2c38e9e7ef880b8bfb1e22b6cd3bc29a | 4.1 GB |
| 412604a1c93c5ad22fc1987a489f98e6 | 4.1 GB |
| 69961c53536c9da76f3e2b18beb83685 | 4.1 GB |
| fffeb94c78f6a0c7847b31a874d638ef | 4.1 GB |
| 1f9b65cb090f7fb54cea14c81ee05513 | 4.1 GB |
| 8360f49714958bdb21f3a2b5e59e3a2e | 4.1 GB |
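Since each download is several gigabytes, it may be worth checking a file against its MD5 checksum listed above before use. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `matches_checksum` and the path in the usage comment are hypothetical, chosen only for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:71519a...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local path; substitute the file you downloaded):
# matches_checksum("path/to/downloaded_file", "md5:71519a5034b1fffb3f842ef7cbbfce4d")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.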
Additional details
Related works
- Is supplement to
- Standard: 10.5281/zenodo.17243812 (DOI)
Software
- Repository URL
- https://github.com/SemanticPriming/word2manylanguages
- Programming language
- Python, R