English word2vec embeddings trained on OpenSubtitles Part 6

Grim, Philip; Buchanan, Erin

doi:10.5281/zenodo.17618269

Published November 15, 2025 | Version v1.0.0

Dataset Open

1. Harrisburg University of Science and Technology

This dataset contains the subs2vec embeddings for English, as presented in https://zenodo.org/records/17243814. The embeddings were trained on large-scale subtitle corpora and represent semantic vector spaces derived from naturalistic language use in films and television from the OpenSubtitles 2018 datasets: https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles.

For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:

Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
Window size: varying context windows (e.g., 2, 5, 10, …)
Each file corresponds to a unique configuration (dimension × window size).
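The file names listed on this record appear to encode the configuration. The sketch below assumes the pattern `<language>_<dimension>_<window>_<algorithm>_wxd`, where `cbow` and `sg` are read as the two standard word2vec training algorithms (continuous bag-of-words and skip-gram). That reading of the name fields is an assumption inferred from the names themselves, not something the description states, and `parse_config` and `EmbeddingConfig` are hypothetical helper names.

```python
import re
from typing import NamedTuple


class EmbeddingConfig(NamedTuple):
    language: str
    dimension: int
    window: int
    algorithm: str  # "cbow" or "sg" (assumed meaning: CBOW / skip-gram)


# Assumed layout: <lang>_<dimension>_<window>_<algorithm>_wxd, optionally
# followed by ".csv.bz2" (full archive) or "_part_xx" (one chunk of it).
_NAME = re.compile(
    r"^(?P<lang>[a-z]+)_(?P<dim>\d+)_(?P<win>\d+)_(?P<algo>cbow|sg)_wxd"
    r"(?:\.csv\.bz2|_part_[a-z]+)?$"
)


def parse_config(filename: str) -> EmbeddingConfig:
    """Split a file name into its assumed configuration fields."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename!r}")
    return EmbeddingConfig(
        language=match["lang"],
        dimension=int(match["dim"]),
        window=int(match["win"]),
        algorithm=match["algo"],
    )
```

For example, `parse_config("en_500_1_cbow_wxd.csv.bz2")` would yield a 500-dimensional CBOW configuration with window size 1 under this reading.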

In each file, column 1 holds the vocabulary word for that language, and columns 2 through d + 1 hold the embedding values, where d is the file's dimensionality.
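Given that column layout, a reassembled archive could be read roughly as follows. This is a minimal sketch using only the Python standard library. It assumes the file is comma-separated UTF-8 text, and it skips any row whose value columns are not numeric in case a header row is present; neither detail is confirmed by the description. `read_vectors` and `load_embeddings` are hypothetical helper names.

```python
import bz2
import csv
from typing import Dict, Iterable, List


def read_vectors(lines: Iterable[str], dimension: int) -> Dict[str, List[float]]:
    """Map each word (column 1) to its vector (columns 2 .. dimension + 1)."""
    vectors: Dict[str, List[float]] = {}
    for row in csv.reader(lines):
        if len(row) != dimension + 1:
            continue  # blank or malformed line
        try:
            values = [float(cell) for cell in row[1:]]
        except ValueError:
            continue  # non-numeric values, most likely a header row
        vectors[row[0]] = values
    return vectors


def load_embeddings(path: str, dimension: int) -> Dict[str, List[float]]:
    """Stream a .csv.bz2 file without decompressing it to disk first."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return read_vectors(handle, dimension)
```

Holding a full vocabulary as Python lists is memory-hungry, so for real use it may be preferable to filter to the words of interest while iterating, or to load the rows into a numeric array library.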

If you use this dataset, please cite:

Manuscript: https://doi.org/10.5281/zenodo.17243812
Data: This Zenodo dataset (using the DOI provided here)

sha256sums:

en_300_6_cbow_wxd.csv.bz2 72a94830d81ebbe28e7fa78465e02ad2bd7771ef5414f8f30f6de94565050167
en_300_6_sg_wxd.csv.bz2 b0d7db822f181a124758e55b0c33b47e0a249c37f8a778806f6166a5baf96cb3
en_500_1_cbow_wxd.csv.bz2 4196d84670045dc3cb65195f4045e543c78c1c35d28530b6c24e7711b8cbf23b
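The checksums above are for the complete `.csv.bz2` archives, whereas the downloads are split into `_part_aa`, `_part_ab`, ... chunks. The sketch below assumes the chunks are plain byte slices that can be concatenated in alphabetical suffix order to rebuild each archive; the record does not document how `FileChunker.ps1` splits files, so treat that as an assumption. `join_parts` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def join_parts(directory: str, stem: str, output_path: str) -> str:
    """Concatenate <stem>_part_* in sorted order; return the SHA-256 hex digest."""
    # Collect the chunk list before opening the output file.
    parts = sorted(Path(directory).glob(f"{stem}_part_*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {stem!r} in {directory}")
    digest = hashlib.sha256()
    with open(output_path, "wb") as out:
        for part in parts:
            with part.open("rb") as handle:
                while True:
                    block = handle.read(1 << 20)  # 1 MiB at a time
                    if not block:
                        break
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()
```

A call such as `join_parts(".", "en_300_6_cbow_wxd", "en_300_6_cbow_wxd.csv.bz2")` should then return the value listed for that archive above if the download is intact and the concatenation assumption holds.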

Files (34.2 GB)

Name	MD5	Size
en_300_6_cbow_wxd_part_aa	0f8b4d5cb647029bb63b74a63eb9f9e7	1.1 GB
en_300_6_cbow_wxd_part_ab	6d32afc9d8e06621e618fd7a1cd959b8	1.1 GB
en_300_6_cbow_wxd_part_ac	54786e7ef540247c5894a4f7df18ce7b	1.1 GB
en_300_6_cbow_wxd_part_ad	9af908c91bfaeb85c72c740e8ff37aed	1.1 GB
en_300_6_cbow_wxd_part_ae	1c34014d61e2fad1cee329870280732c	1.1 GB
en_300_6_cbow_wxd_part_af	43f4c6e65af39662a5c35ba91df1afca	1.1 GB
en_300_6_cbow_wxd_part_ag	ae155f92c575eb1d9b851fd2c1c9ae59	1.1 GB
en_300_6_cbow_wxd_part_ah	1d17e86a324a3f405ada39b0bb325268	1.1 GB
en_300_6_cbow_wxd_part_ai	8db28826d6b0fb4760b1be58e298e04f	767.0 MB
en_300_6_sg_wxd_part_aa	b1e14f00e8a4094d14af3a3567573d86	1.1 GB
en_300_6_sg_wxd_part_ab	eb9381b14109bf5eb1a46975a86405e2	1.1 GB
en_300_6_sg_wxd_part_ac	d7ea5450d8c820bca194a6709be0426d	1.1 GB
en_300_6_sg_wxd_part_ad	0702f20ee8e8bcd0fa46fc451ad931cb	1.1 GB
en_300_6_sg_wxd_part_ae	af3146314b32780f47659df8d3e75caf	1.1 GB
en_300_6_sg_wxd_part_af	4271252da0e1a79431230a93c8e74516	1.1 GB
en_300_6_sg_wxd_part_ag	68ed622610da52e8e9ca39eed5dd313d	1.1 GB
en_300_6_sg_wxd_part_ah	65f79415cfc6d0f7a070befc9ec56467	1.1 GB
en_300_6_sg_wxd_part_ai	c5836c20c9f904192b22821a5a2fe760	750.0 MB
en_500_1_cbow_wxd_part_aa	5790e50ae0091c2d3f39e27a7e57eaa4	1.1 GB
en_500_1_cbow_wxd_part_ab	193a4e19b3813a43bf2dd26e7d4d564a	1.1 GB
en_500_1_cbow_wxd_part_ac	6f329da8efeb78dd346ead7843ebcfbb	1.1 GB
en_500_1_cbow_wxd_part_ad	6b63b5879cf4ddcfc0199bd12d23b7f7	1.1 GB
en_500_1_cbow_wxd_part_ae	887f2d21b70a050307b26e319e7519f9	1.1 GB
en_500_1_cbow_wxd_part_af	9685475303cf0ddd2a481e875a52d15a	1.1 GB
en_500_1_cbow_wxd_part_ag	183c93d00695405c29ae5c64d8a0aa76	1.1 GB
en_500_1_cbow_wxd_part_ah	dc04e0593d00100daac20560e03c5b03	1.1 GB
en_500_1_cbow_wxd_part_ai	56679287f3ca026937e4fe1ceb8cc7d1	1.1 GB
en_500_1_cbow_wxd_part_aj	2967b468fc1e74eee98fdb04bf43cbbd	1.1 GB
en_500_1_cbow_wxd_part_ak	bec53ee2b1d80899c34697fb27ab02c7	1.1 GB
en_500_1_cbow_wxd_part_al	f325d18300ab8e26d14e5c3cbdfc9c9b	1.1 GB
en_500_1_cbow_wxd_part_am	611b8e5098899d08f21a460d47383a4a	1.1 GB
en_500_1_cbow_wxd_part_an	e76b1787bfba6a490f6b224ebefdb02d	1.1 GB
en_500_1_cbow_wxd_part_ao	95360629230eabdfa7f791daa1abd273	515.0 MB
FileChunker.ps1	826f5465e694cf140b7a48209d422620	7.1 kB
README.md	8864201e5e8f85f9bb348ad1be636f17	2.7 kB
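Before reassembling anything, each downloaded file can be checked against the MD5 value in the listing above. The following is a minimal sketch; `EXPECTED_MD5` holds only two entries copied from the listing as an illustration, and `md5_of_file` and `find_bad_files` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Two entries copied from the file listing; extend with the remaining rows.
EXPECTED_MD5 = {
    "en_300_6_cbow_wxd_part_aa": "0f8b4d5cb647029bb63b74a63eb9f9e7",
    "FileChunker.ps1": "826f5465e694cf140b7a48209d422620",
}


def md5_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large chunks never sit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_bad_files(directory: str, expected: Dict[str, str]) -> List[str]:
    """Return the names that are missing or whose MD5 does not match."""
    bad = []
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != checksum:
            bad.append(name)
    return bad
```

An empty list from `find_bad_files("downloads", EXPECTED_MD5)` would indicate that every listed file is present and intact; any returned name is a candidate for re-downloading. The directory name `downloads` is only a placeholder.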

Additional details

Is supplement to: Publication: 10.5281/zenodo.17243812 (DOI)

Repository URL: https://github.com/SemanticPriming/word2manylanguages
Programming language: Python, R
