Published January 26, 2021 | Version v1
Dataset Open

Bilingual English-German word embedding models for scientific text

  • 1. DZHW

Description

This data set contains three word embedding models, constructed from the same training corpus of English and German parallel scientific texts (abstracts and research project descriptions). All text was pre-processed by language-specific stemming with the Porter stemming algorithm, removing numbers, and lower-casing.
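To look up a word in these models, a query term presumably needs the same pre-processing as the training text. The following is a minimal sketch of the three steps named above. The tokenization rule, the order of the steps, the reading of "removing numbers" as dropping all-digit tokens, and the pluggable `stem` callable are all assumptions for illustration; the record does not specify them, and a real Porter stemmer for the relevant language would have to be supplied by the user.

```python
import re

def normalize_tokens(text, stem=lambda token: token):
    """Lower-case, drop purely numeric tokens, then stem each remaining token.

    `stem` is a hypothetical hook: pass a Porter stemmer for the right
    language here. The default leaves tokens unchanged.
    """
    # Assumption: tokens are runs of word characters (letters, digits, underscore).
    tokens = re.findall(r"\w+", text.lower())
    # Assumption: "removing numbers" means discarding all-digit tokens.
    kept = [token for token in tokens if not token.isdigit()]
    return [stem(token) for token in kept]

# Example with a toy stand-in for a stemmer (NOT the Porter algorithm).
example = normalize_tokens("The 3 Models, 2021", stem=lambda t: t.rstrip("s"))
```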

The first model is a 1000-dimensional Latent Semantic Analysis (LSA) model, constructed by concatenating the English and German texts. The input data was an m×n (297,852×923,864) document-term matrix of tf-idf weights, which was processed with truncated SVD. There are two files: the word vectors in lsa_1000_Vmat.csv (the V* term-by-latent-factor matrix of right singular vectors) and the dimension weights in lsa_1000_d_weights.csv (the 1000 values on the diagonal of the \(\Sigma\) matrix).
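As an illustration of how the two LSA files might be combined, the sketch below scales a term vector by the dimension weights before comparing terms with cosine similarity. Scaling by the singular values is one common LSA convention, not something this record prescribes, and the three-dimensional vectors and weights are toy stand-ins for the 1000-dimensional data.

```python
import math

def scale_by_weights(vector, weights):
    """Multiply each latent dimension of a term vector by its weight."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    return [v * w for v, w in zip(vector, weights)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins: in practice `weights` would hold the 1000 values from
# lsa_1000_d_weights.csv and each term vector a row of lsa_1000_Vmat.csv.
weights = [3.0, 2.0, 1.0]
term_a = [0.5, 0.1, 0.0]
term_b = [0.4, 0.0, 0.3]

similarity = cosine(scale_by_weights(term_a, weights),
                    scale_by_weights(term_b, weights))
```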

lsa_1000_Vmat.csv has two fields, the term and its vector representation in LSA space, separated by a "|" character. The structure looks like this:

tarifplural|{5.00599733151825e-08,-1.43071379136936e-08,8.32862290483082e-08,-6.08010721687266e-08,1.15831140150142e-07,-2.46470313387358e-08,3.43215595753282e-07,6.24301666802575e-07,-2.62907158945831e-07,-1.04120313981517e-07,4.5864574355164e-07,-2.31799632277312e-07,8.37354377858843e-07,8.22507467711628e-07,4.07585381069368e-07,-4.26358988941922e-08,-8.38652991154651e-07,1.98091851171759e-07,-3.94768548759816e-08,-4.28802181962385e-07, ...}
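A line in this format can be split with plain string operations. The sketch below is one possible parser; the function name `parse_lsa_line` is hypothetical, and the sample line is shortened to the first three values of the example above. Whether the file starts with a header row is not stated, so that case is not handled here.

```python
def parse_lsa_line(line):
    """Split 'term|{v1,v2,...}' into the term and a list of floats."""
    # Split on the last "|" so that a "|" inside a term would not break parsing.
    term, sep, raw = line.rstrip("\r\n").rpartition("|")
    if not sep:
        raise ValueError("expected a '|' between term and vector")
    inner = raw.strip().strip("{}")
    vector = [float(value) for value in inner.split(",") if value.strip()]
    return term, vector

# Shortened sample: a real row carries 1000 values.
sample = "tarifplural|{5.00599733151825e-08,-1.43071379136936e-08,8.32862290483082e-08}"
term, vector = parse_lsa_line(sample)
```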

The other two models are a basic Random Indexing (RI) model and a Reflective Random Indexing (RRI) model, both contained in the same file, RI_training.csv. Both models have 1000 dimensions. The data structure is as follows:

  • language: either "en" (English) or "de" (German), the language of the term
  • term: the term as a character string
  • term_collection_count: integer, number of times the term occurred in the training data
  • c_vector: vector of 1000 reals, RI context vector of the term, formatted like this: "{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.12309149,0,0,-0.12309149,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ...}"
  • n_docs: integer, number of different documents which contained the term
  • c_vector_o2: vector of 1000 reals, RRI context vector of the term, formatted like c_vector above
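The field list above can be mapped onto a small record type. The sketch below assumes that RI_training.csv is comma-delimited, that the vector fields are double-quoted, and that a header row carries the field names listed above; none of this is confirmed by the record, so adjust the reader if the actual file differs. The sample row is invented for illustration and uses 4-dimensional vectors instead of 1000.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class RITerm:
    language: str
    term: str
    term_collection_count: int
    c_vector: list      # RI context vector
    n_docs: int
    c_vector_o2: list   # RRI context vector

def parse_vector(text):
    """Turn a '{v1,v2,...}' string into a list of floats."""
    inner = text.strip().strip('"').strip("{}")
    return [float(value) for value in inner.split(",") if value.strip()]

def read_ri_rows(handle):
    """Yield RITerm records from an open text handle (assumed CSV layout)."""
    for row in csv.DictReader(handle):
        yield RITerm(
            language=row["language"],
            term=row["term"],
            term_collection_count=int(row["term_collection_count"]),
            c_vector=parse_vector(row["c_vector"]),
            n_docs=int(row["n_docs"]),
            c_vector_o2=parse_vector(row["c_vector_o2"]),
        )

# Invented sample data, not taken from the real file.
sample = io.StringIO(
    "language,term,term_collection_count,c_vector,n_docs,c_vector_o2\n"
    'en,model,42,"{0,0.12309149,0,-0.12309149}",17,"{0.5,0,-0.25,0}"\n'
)
rows = list(read_ri_rows(sample))
```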

The file contains 1,034,860 rows.

All files are aggressively compressed with GNU gzip and will require much more disk space when uncompressed. Note the special formatting of the numeric vector fields, which differs between the models.
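Because the uncompressed files are large, it can be convenient to stream them directly from the gzip archive instead of unpacking them first. The sketch below previews the first few lines of a compressed file; the helper name `head_lines`, the UTF-8 encoding, and the demo file are assumptions for illustration.

```python
import gzip
import itertools
import os
import tempfile

def head_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a gzip-compressed text file without unpacking it."""
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        return [line.rstrip("\r\n") for line in itertools.islice(handle, n)]

# Demo on a tiny throwaway file; with the real data, pass the path of a
# downloaded archive instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("a|{1,2}\nb|{3,4}\nc|{5,6}\n")
    preview = head_lines(demo_path, n=2)
```

The same `gzip.open(..., "rt")` handle can be passed line by line to a parser, so that only one row is held in memory at a time.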

Notes

Funding was provided by the German Federal Ministry of Education and Research [grant numbers 01PQ16004 and 01PQ17001].

Files

Files (13.3 GB)

  • md5:de6580b0dd70f6685e9cbd8b95fd97d0 (7.9 kB)
  • md5:7b3352fd7b6b28c7f712a25fb5cb50f1 (8.3 GB)
  • md5:8b05209c9fe14e2bb8f1283abdc155fd (5.0 GB)
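The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below hashes a file in chunks so that multi-gigabyte archives do not have to fit in memory; the function names are hypothetical, and the demo uses a small temporary file, not one of the data files.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected

# Demo on a throwaway file containing the bytes b"hello".
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as out:
        out.write(b"hello")
    demo_digest = md5_of_file(demo_path)
    demo_ok = matches_listing(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
```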