Published March 18, 2019 | Version v2
Dataset Open

Crosslingual Document Embedding as Reduced-Rank Ridge Regression (Cr5)

  • 1. EPFL
  • 2. MIT
  • 3. BlackRock

Description

This repository contains the pre-trained models released with the following paper.

Martin Josifoski, Ivan S. Paskov, Hristo S. Paskov, Martin Jaggi, and Robert West. "Crosslingual Document Embedding as Reduced-Rank Ridge Regression". Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 2019.

Cr5 embeds the words of several languages (2, 4, or 28, depending on the model) in a shared latent space. The repository contains the following models:

  • English - Italian model, trained in a pairwise setting (prefixed pairwise_2_en-it)
  • Danish - English model, trained in a pairwise setting (prefixed pairwise_2_da-en)
  • Danish - Vietnamese model, trained in a pairwise setting (prefixed pairwise_2_da-vi)
  • Danish - English - Italian - Vietnamese model, trained in a joint setting of 4 languages (prefixed joint_4)
  • 28-language model, trained in a joint setting of 28 languages (prefixed joint_28; the languages correspond to the following ISO 639 codes: bg, ca, cs, da, de, el, en, es, et, fi, fr, hr, hu, id, it, mk, nl, no, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi)
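
To illustrate what a shared latent space makes possible, here is a minimal sketch in plain Python. It does not use cr5.py, whose interface is shown in example.py. The word vectors are made-up toy values, and both the function names and the aggregation rule (a document is the normalized sum of its word vectors) are assumptions chosen for illustration.

```python
import math

def embed_document(tokens, word_vectors):
    """Sum the vectors of all known tokens and L2-normalize the result.

    Assumption for illustration: a document vector is the sum of its
    word vectors. Tokens without a vector are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total] if norm > 0 else total

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy 3-dimensional vectors (invented values, not taken from any model).
english = {"dog": [0.9, 0.1, 0.0], "music": [0.0, 0.2, 0.9]}
italian = {"cane": [0.8, 0.2, 0.1], "musica": [0.1, 0.1, 0.9]}

doc_en = embed_document(["dog"], english)
doc_it_dog = embed_document(["cane"], italian)
doc_it_music = embed_document(["musica"], italian)

# Because both languages live in the same space, vectors can be compared directly.
print(cosine(doc_en, doc_it_dog) > cosine(doc_en, doc_it_music))
```

With real models, the toy dictionaries would be replaced by the vectors loaded from the downloaded files.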

There is a tradeoff between the number of languages embedded in a single latent space and the performance achieved for each individual language. Thus, in addition to the pre-trained models, we provide the results of the document-level evaluation task (cf. Table 1 in the paper) for the 28-language model, evaluated on the 4 languages considered in the paper (joint_28_full_performance.pdf). To choose the most suitable model for your use case, compare these results with the performance of the first four models, which are already evaluated in the paper.
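
As a starting point for that comparison, the sketch below lists which of the released models cover a given set of languages. The prefixes and language codes are those listed above; the function name and the ordering (models with fewer languages first) are assumptions for illustration, and the reported evaluation results should drive the final choice.

```python
# Released models, keyed by the languages they cover (prefixes as listed above).
PAIRWISE = {
    frozenset({"en", "it"}): "pairwise_2_en-it",
    frozenset({"da", "en"}): "pairwise_2_da-en",
    frozenset({"da", "vi"}): "pairwise_2_da-vi",
}
JOINT_4 = frozenset({"da", "en", "it", "vi"})
JOINT_28 = frozenset(
    "bg ca cs da de el en es et fi fr hr hu id it mk nl no pl pt "
    "ro ru sk sl sv tr uk vi".split()
)

def candidate_prefixes(languages):
    """Return the prefixes of all models covering `languages`.

    Hypothetical helper: candidates are ordered from the model with the
    fewest languages to the one with the most.
    """
    wanted = frozenset(languages)
    if not wanted:
        raise ValueError("at least one language code is required")
    candidates = [prefix for pair, prefix in PAIRWISE.items() if wanted <= pair]
    if wanted <= JOINT_4:
        candidates.append("joint_4")
    if wanted <= JOINT_28:
        candidates.append("joint_28")
    return candidates

print(candidate_prefixes(["en", "it"]))
print(candidate_prefixes(["en", "fr"]))
```

An empty result means that none of the released models covers all requested languages.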

To use the pre-trained models, follow these steps:

1. Download the model of choice for the desired languages (the dataset format is described in the readme)

2. Download the helper library cr5.py

3. Look at example.py for an example of how to use the library

If you found the provided resources useful, please cite the above paper. Here's a BibTeX entry you may use:

@inproceedings{josifoski-wsdm2019-cr5,
  title={Crosslingual Document Embedding as Reduced-Rank Ridge Regression},
  author={Josifoski, Martin and Paskov, Ivan S. and Paskov, Hristo S. and Jaggi, Martin and West, Robert},
  booktitle={Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining},
  organization={ACM},
  year={2019}
}


Any questions or suggestions?
Contact martin.josifoski@epfl.ch. 

Files

Files (35.6 GB)

Checksum                              Size
md5:ce3d0069218cb76011293e1408887421  2.1 kB
md5:d851f85aa6975c18c6a104569d044f6b  522 Bytes
md5:ae99b293b127a8c1eaf2b8ccf31eff47  586.1 MB
md5:93f27eaefed1ee88a14607f7a84b5b51  584.4 MB
md5:6ab0bd65407e1c5a0ed40e2e32ba43e8  584.0 MB
md5:c4563f3a65c3d0d8067ba7b348d9e168  509.2 MB
md5:5158651f22307e156229be648007c7b9  584.0 MB
md5:4d9b3f75074437a4a39f9a95fffd1933  534.1 MB
md5:dc2540985608d1b79d1a3daa1f574431  584.6 MB
md5:1224ea2bab473cd93f9b8f0c23207327  584.1 MB
md5:4468b8ba41336575bc99fd833920318b  570.2 MB
md5:f8601f53696e21225669ad22dd00b99d  584.5 MB
md5:8b79c80ea89960e8e637bb38ebae10b8  583.9 MB
md5:4b5578bff2968c3b1d37d6b212353057  15.4 GB
md5:97d2aadb76a0711b7abc89acf209eec6  158.8 kB
md5:9d79fadc03e283f1c531b9c3a517ad9e  585.0 MB
md5:5a1023494a94c71c96aa9a769589a202  584.4 MB
md5:985bb812dd4dd8e6515307e82368c03e  444.8 MB
md5:2e3d4b27f5bc01db54926c20098e4373  583.8 MB
md5:8fd2575b9b24a82ee7f4a726132390aa  298.2 MB
md5:c1c9751f45ee7524b2ae08c0992c392a  584.6 MB
md5:5649c66a78ac151b949c5d783da050f1  585.1 MB
md5:6235f14c6f2a00a5cba85eb4857f21e5  583.4 MB
md5:0e4afec3d1399e7dcb8f2a5b215963fb  584.2 MB
md5:fc821c7e5e74782c3666c0ba4c1bdab1  568.0 MB
md5:4724af9acad0da96fdef71c3c56fc2a3  584.6 MB
md5:98e2879b88169d55492e9da894aa6b9a  585.3 MB
md5:43a39cba0471db74a21a5b7b1630894a  504.7 MB
md5:4ccd88156ef0402ad0ab748f8e961bec  585.1 MB
md5:420b9d0ada7d2fa8743aea689c19cb1f  585.1 MB
md5:59efac91f278038de61cdc8226b9e047  585.3 MB
md5:f64043075f90df4f27246ebbcdc5e46a  310.5 MB
md5:a2806f31e8f66611d48cddb10648367f  565.2 MB
md5:387b6fceb7035a952faaf2935ff6f3b8  577.2 MB
md5:7067a66ddbe766da20f7a17961e6c662  576.7 MB
md5:bf908f9734018d7a67595deba37ec278  372.8 MB
md5:9ff335dd19120dc596c25d204ab19ab2  502.6 MB
md5:91979c811eb6993fc40d2861398d2c06  577.2 MB
md5:204e528de0cd7a6d3f29cf7202a2315c  276.3 MB
md5:bfc5575967db2dff1cb1fc03001a9aa8  161.2 MB
md5:7ec7483b0caf5af428d8ba80dd951bd4  574.1 MB
md5:3726babaa090058af8254f4163770e62  572.9 MB
md5:aa8673ee793295daae1ed201a9a2e65a  310 Bytes
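
Since several of the files are large, it may be worth checking a download against its listed MD5 checksum before use. The following is a minimal sketch using Python's standard hashlib module; the function names and the file path in the usage note are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:ce3d...'.

    The optional 'md5:' prefix and surrounding whitespace are ignored.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum("downloaded_model.zip", "md5:<value listed for that file>")` returns True only if the local file is byte-identical to the published one.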