Bilingual English-German word embedding models for scientific text

doi:10.5281/zenodo.4467633

Published January 26, 2021 | Version v1

Dataset Open

Donner, Paul¹

1. DZHW

This data set contains three word embedding models, constructed from the same training corpus of English and German parallel scientific texts (abstracts and research project descriptions). All text was pre-processed by language-specific stemming with the Porter stemming algorithm, removing numbers, and lower-casing.

The first model is a 1000-dimensional Latent Semantic Analysis (LSA) model, constructed by concatenating the English and German texts. The input data was an m×n (297,852×923,864) document-term matrix of tf-idf weights, which was processed with truncated SVD. There are two files: the word vectors in lsa_1000_Vmat.csv (the V* term-by-latent-factors matrix of right singular vectors) and the dimension weights in lsa_1000_d_weights.csv (the 1000 values on the diagonal of the \(\Sigma\) matrix).
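For orientation, the decomposition behind these two files can be written out as below. This is a sketch under assumptions: the rows of the input matrix are taken to be documents and the columns terms (as "document-term matrix" suggests), and the symbols \(X\), \(U_k\), \(V_k\) and \(k\) are notation introduced here, not names used in the data set. \(V_k^{\top}\) plays the role of V* in the description.

```latex
X \approx U_k \,\Sigma_k\, V_k^{\top}, \qquad k = 1000,
\\[4pt]
X \in \mathbb{R}^{m \times n}, \quad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{n \times k},
\\[4pt]
m = 297{,}852, \qquad n = 923{,}864 .
```

Under this reading, each row of \(V_k\) is the vector of one term (one record in lsa_1000_Vmat.csv) and \(\sigma_1, \dots, \sigma_k\) are the values in lsa_1000_d_weights.csv. Whether to compare terms on the raw rows of \(V_k\) or on the scaled rows of \(V_k \Sigma_k\) is a choice left to the user; the description does not prescribe one.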

lsa_1000_Vmat.csv has two fields, the term and its vector representation in LSA space, separated by a "|" character. The structure looks like this:

tarifplural|{5.00599733151825e-08,-1.43071379136936e-08,8.32862290483082e-08,-6.08010721687266e-08,1.15831140150142e-07,-2.46470313387358e-08,3.43215595753282e-07,6.24301666802575e-07,-2.62907158945831e-07,-1.04120313981517e-07,4.5864574355164e-07,-2.31799632277312e-07,8.37354377858843e-07,8.22507467711628e-07,4.07585381069368e-07,-4.26358988941922e-08,-8.38652991154651e-07,1.98091851171759e-07,-3.94768548759816e-08,-4.28802181962385e-07, ...}
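A record in this layout can be parsed with the Python standard library alone. The following is a minimal sketch, assuming that the vector part never contains a "|" character and that the braces always enclose a plain comma-separated list of numbers. The function names `parse_lsa_line` and `cosine` are hypothetical helpers, and the sample record is the one shown above truncated to its first three values (real records carry 1000).

```python
import math

def parse_lsa_line(line):
    """Split one 'term|{v1,v2,...}' record into (term, list of floats)."""
    # Split on the last "|" so that the vector part is always isolated.
    term, sep, raw = line.rstrip("\r\n").rpartition("|")
    body = raw.strip()
    if not sep or not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"unexpected record format: {line[:40]!r}")
    return term, [float(x) for x in body[1:-1].split(",")]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Truncated illustration record: only the first three of the 1000 values.
sample = "tarifplural|{5.00599733151825e-08,-1.43071379136936e-08,8.32862290483082e-08}"
term, vec = parse_lsa_line(sample)
self_similarity = cosine(vec, vec)
```

Cosine similarity is shown only as one common way to compare two parsed term vectors; the description does not state which similarity measure the author used.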

The other two models are a basic Random Indexing (RI) model and a Reflective Random Indexing (RRI) model, contained in the same file, RI_training.csv. Both models have 1000 dimensions. The data structure is as follows:

language: either "en" (English) or "de" (German), the language of the term
term: the term as a character string
term_collection_count: integer, number of times the term occurred in the training data
c_vector: vector of 1000 reals, RI context vector of the term, formatted like this: "{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12309149,0,0,-0.12309149,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ...}"
n_docs: integer, number of different documents which contained the term
c_vector_o2: vector of 1000 reals, RRI context vector of the term, formatted like c_vector above

The file contains 1,034,860 rows.
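The field list above can be turned into a small typed reader. This is a sketch under several assumptions that the description does not confirm: that RI_training.csv is comma-delimited, that the two vector fields are wrapped in double quotes (as the sample suggests), and that the file starts with a header row using exactly the field names listed. The helper names and the sample row are made up for illustration, with vectors shortened to four values.

```python
import csv
import io

def parse_braced_vector(text):
    """Turn a '{v1,v2,...}' string into a list of floats."""
    body = text.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    return [float(x) for x in body.split(",")]

def read_ri_rows(stream):
    """Yield one dict per row, with counts as ints and vectors as float lists."""
    # Assumption: a header row names the six fields described above.
    for row in csv.DictReader(stream):
        yield {
            "language": row["language"],
            "term": row["term"],
            "term_collection_count": int(row["term_collection_count"]),
            "c_vector": parse_braced_vector(row["c_vector"]),
            "n_docs": int(row["n_docs"]),
            "c_vector_o2": parse_braced_vector(row["c_vector_o2"]),
        }

# Made-up sample with 4-dimensional vectors instead of 1000.
sample = io.StringIO(
    "language,term,term_collection_count,c_vector,n_docs,c_vector_o2\n"
    'en,exampl,42,"{0,0.12309149,0,-0.12309149}",17,"{0.5,0,-0.25,0.1}"\n'
)
rows = list(read_ri_rows(sample))
```

If the real file uses a different delimiter or has no header, `csv.DictReader` accepts `delimiter=` and `fieldnames=` arguments that can be set accordingly.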

All files are aggressively compressed with GNU gzip and will require much more disk space when uncompressed. Note the special formatting of the numeric vector fields, which differs between lsa_1000_Vmat.csv and RI_training.csv.
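Because the uncompressed files are much larger, it may be convenient to read them line by line straight from the gzip archives instead of unpacking them to disk first. A minimal sketch, assuming the files are UTF-8 encoded text (the description does not state the encoding); `iter_gzip_lines` is a hypothetical helper, and the demo archive is a tiny throwaway file rather than one of the real downloads.

```python
import gzip
import os
import tempfile

def iter_gzip_lines(path, encoding="utf-8"):
    """Yield text lines from a .gz file without writing uncompressed data to disk."""
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\r\n")

# Demo on a small temporary archive with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as out:
        out.write("alpha|{0.1,0.2}\nbeta|{0.3,0.4}\n")
    demo_lines = list(iter_gzip_lines(demo_path))
```

Since the function is a generator, only one line is held in memory at a time, which matters for archives of several gigabytes.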

Notes

Funding was provided by the German Federal Ministry of Education and Research [grant numbers 01PQ16004 and 01PQ17001].

Files (13.3 GB)

Name	MD5	Size
lsa_1000_d_weights.csv.gz	de6580b0dd70f6685e9cbd8b95fd97d0	7.9 kB
lsa_1000_Vmat.csv.gz	7b3352fd7b6b28c7f712a25fb5cb50f1	8.3 GB
RI_training.csv.gz	8b05209c9fe14e2bb8f1283abdc155fd	5.0 GB
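The MD5 checksums listed for the three files can be used to check a download for corruption. The sketch below hashes a file in chunks so that multi-gigabyte archives need not fit in memory. `md5_of_file` and `matches_listing` are hypothetical helper names, and the demo runs on a small throwaway file, not on one of the real archives.

```python
import hashlib
import os
import tempfile

# Checksums as listed for the three files of this record.
EXPECTED_MD5 = {
    "lsa_1000_d_weights.csv.gz": "de6580b0dd70f6685e9cbd8b95fd97d0",
    "lsa_1000_Vmat.csv.gz": "7b3352fd7b6b28c7f712a25fb5cb50f1",
    "RI_training.csv.gz": "8b05209c9fe14e2bb8f1283abdc155fd",
}

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path, name):
    """True if the file at `path` has the MD5 listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]

# Demo: a throwaway file, which should not match any listed checksum.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    payload = b"not one of the real archives"
    with open(demo_path, "wb") as out:
        out.write(payload)
    demo_digest = md5_of_file(demo_path, chunk_size=8)
    demo_ok = matches_listing(demo_path, "RI_training.csv.gz")
```

For a real download, a call such as `matches_listing("RI_training.csv.gz", "RI_training.csv.gz")` would return True only if the local file is byte-identical to the published one.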
