Published October 11, 2025 | Version v1.0.0
Dataset Open

Arabic word2vec embeddings trained on OpenSubtitles Part 1

  • 1. Harrisburg University of Science and Technology

Description

This dataset contains the subs2vec embeddings for Arabic, as presented in https://zenodo.org/records/17243814. The embeddings were trained on large-scale subtitle corpora from the OpenSubtitles 2018 datasets (https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) and represent semantic vector spaces derived from naturalistic language use in films and television.

For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:

  • Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
  • Window size: varying context windows (e.g., 2, 5, 10, …)

Each file corresponds to a unique configuration (dimension × window size).

Each file contains the vocabulary for that language (column 1), followed by the embedding values (columns 2 through d + 1, where d is the vector dimensionality of that file).
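The column layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: the function names `load_embeddings` and `cosine` are hypothetical, and it assumes a plain-text file with one word per row, whitespace-separated columns, and no header row. The record does not state the delimiter, so pass `delimiter=","` (or similar) if the files turn out to be comma-separated.

```python
import math


def load_embeddings(lines, delimiter=None):
    """Parse rows of 'word v1 v2 ... vd' into a {word: [floats]} dict.

    Assumption: column 1 is the word, the remaining columns are the vector.
    delimiter=None splits on any whitespace (an assumption about the files).
    """
    vectors = {}
    dim = None
    for line in lines:
        parts = line.strip().split(delimiter)
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)  # first row fixes the dimensionality
        elif len(values) != dim:
            raise ValueError(f"row for {word!r} has {len(values)} values, expected {dim}")
        vectors[word] = values
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because `load_embeddings` accepts any iterable of lines, an open file handle works directly, for example `with open(path, encoding="utf-8") as fh: vectors = load_embeddings(fh)`, where `path` points at one downloaded file. The largest files are over 2 GB, so a dictionary of Python lists may need considerably more memory than the file size suggests.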

If you use this dataset, please cite:

Files (49.7 GB)

Checksum (MD5)                          Size
md5:94cd3796dc115bea1ebe501a643ac55c    721.2 MB
md5:ba52817503d051cf75127d0b96f70e8c    723.1 MB
md5:6247002e0a36bc6fd27f4961bbc92dde    720.7 MB
md5:d66687af1353bb614fda89331260a063    723.1 MB
md5:db6d61131a1bdd355bf273761aad3e3c    720.6 MB
md5:547e54267e576581f8d26605a5addd20    723.5 MB
md5:c339516616cd1f8b6e3fa62383ec96aa    721.0 MB
md5:fedd4b41e860171b287fd614c6319655    723.9 MB
md5:66e75426a581c058afbdb2ca41a90bd7    721.8 MB
md5:8dd5a277e70264daa0f5d7a8259b30a1    723.0 MB
md5:1f9430492de0b110011aeb856919dcb5    721.8 MB
md5:22b8b3fd2ce8b543fb43936b46a768f1    723.9 MB
md5:88a749954080fba024a25483a3de763b    1.4 GB
md5:c13ee883e5505995575691a96ced5d92    1.4 GB
md5:4ba1b19617dff57fc633b53507b91c7f    1.4 GB
md5:c5022a4f6526ca42be2e393492da5b0f    1.4 GB
md5:5c5f5fca04b3b2d82e9c151b1ce979ae    1.4 GB
md5:4699cd017eaf676211c609463c32944e    1.4 GB
md5:1c9f08819a413d11911855d24e9d0b38    1.4 GB
md5:01961aa338c1626d3f4275b973711df0    1.4 GB
md5:689f9f0db5ddcf7b19eb48c001edbcf0    1.4 GB
md5:bd8f0d0679b1a8f860320797cd30ce0c    1.4 GB
md5:6dc01cd380a050fcf53fd452f2da7073    1.4 GB
md5:62a66b9dc66d7bd78af4b30f319dec61    1.4 GB
md5:c6e625e48276b94160501b4bea43b3ba    2.2 GB
md5:e170b3ebc0ac05336f3268e0c76196a3    2.2 GB
md5:3332f7916f9369fc9299ef800781ecf0    2.2 GB
md5:a82a668ae7df2d572371b28f4431c883    2.2 GB
md5:941834297a9bf87090bf5c227ff3c914    2.2 GB
md5:0497a383055351d34f2815acf94d639f    2.2 GB
md5:ab2256a79b406b3e47e993e152e1bd04    2.2 GB
md5:91f3ad9281b6f1e9bec883a42a4a0231    2.2 GB
md5:e58e374f834e7a2deb1f6e36f5b5f4ef    2.2 GB
md5:1e11f2eff8efe632c02a6c3f96a7cbe0    365.4 MB
md5:1b7fb82c84d440b5c11a0ce7f5c0d264    366.5 MB
md5:445b82bc18cdcd81667d42283f92f450    365.7 MB
md5:5d4b3947e496bc00b541b132537adde6    366.5 MB
md5:5b295546ad31b5b37743f05bbb38b7ed    365.8 MB
md5:919978954204e01839fe699a135bcaa1    366.8 MB
md5:04b2107d3456b5906564e7610fd1fc58    365.7 MB
md5:abb7c84abd5288d8061e94417b2a63d5    366.6 MB
md5:e8b161d3413d8ebee4c6e7214edb5a1a    365.7 MB
md5:fd9db8d2415ff6853ce91cc89343c33c    366.6 MB
md5:6924cc6f9e7d089b92a32d43139f8e1f    366.0 MB
md5:8ea65d89e0a6935a2578a7cddb3582eb    366.4 MB
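
Given the size of these files, it may be worth checking a download against its listed MD5 value before use. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the path passed in is whatever local name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use flat even for multi-gigabyte files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:94cd37...'."""
    expected = listed.split(":", 1)[-1].strip().lower()  # drop the 'md5:' prefix
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(local_path, "md5:94cd3796dc115bea1ebe501a643ac55c")` returns `True` only if the local file is byte-identical to the published one. MD5 is adequate here for detecting truncated or corrupted downloads, not for security purposes.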

Additional details

Related works

Is supplement to
Standard: 10.5281/zenodo.17243812 (DOI)

Software

Repository URL
https://github.com/SemanticPriming/word2manylanguages
Programming language
Python, R