Crosslingual Document Embedding as Reduced-Rank Ridge Regression (Cr5)
Authors/Creators
- 1. EPFL
- 2. MIT
- 3. BlackRock
Description
This repository contains the pre-trained models released with the following paper:
"Crosslingual Document Embedding as Reduced-Rank Ridge Regression". Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 2019.
Cr5 embeds the words of several languages (2, 4, or 28, depending on the model) in a shared latent space. The repository contains the following models:
- English - Italian model, trained in a pairwise setting (prefixed pairwise_2_en-it)
- Danish - English model, trained in a pairwise setting (prefixed pairwise_2_da-en)
- Danish - Vietnamese model, trained in a pairwise setting (prefixed pairwise_2_da-vi)
- Danish - English - Italian - Vietnamese model, trained in a joint setting of 4 languages (prefixed joint_4)
- 28-language model, trained in a joint setting of 28 languages (prefixed joint_28); the languages correspond to the following ISO 639 codes: bg, ca, cs, da, de, el, en, es, et, fi, fr, hr, hu, id, it, mk, nl, no, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi.
There is a tradeoff between the number of languages embedded in a single latent space and the performance achieved for each individual language. For this reason, in addition to the pre-trained models, the repository provides the results of the document-level evaluation task (cf. Table 1 in the paper) for the 28-language model, evaluated on the 4 languages considered in the paper (joint_28_full_performance.pdf). To choose the most suitable model for your use case, compare these results with the performance of the first four models, which are already evaluated in the paper.
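As a rough aid for shortlisting models before consulting the evaluation results, the sketch below lists which of the models above cover a given set of languages, smallest language set first. The prefixes and language codes are copied from the list above; the function name `candidate_models` and the "fewest languages first" ordering are illustrative assumptions, not part of the released code, and the ordering is no substitute for the reported performance numbers.

```python
# Language coverage per model prefix, copied from the model list above.
MODELS = {
    "pairwise_2_en-it": {"en", "it"},
    "pairwise_2_da-en": {"da", "en"},
    "pairwise_2_da-vi": {"da", "vi"},
    "joint_4": {"da", "en", "it", "vi"},
    "joint_28": set(
        "bg ca cs da de el en es et fi fr hr hu id it mk nl no "
        "pl pt ro ru sk sl sv tr uk vi".split()
    ),
}


def candidate_models(languages):
    """Return prefixes of models covering every requested language.

    Hypothetical helper: models are ordered by the number of languages
    they embed (fewest first), reflecting the tradeoff described above.
    """
    wanted = set(languages)
    covering = [
        (len(langs), prefix)
        for prefix, langs in MODELS.items()
        if wanted <= langs
    ]
    return [prefix for _, prefix in sorted(covering)]


print(candidate_models(["da", "en"]))
print(candidate_models(["fr", "de"]))
```

The shortlist only tells you which models can be used at all; the final choice should still rest on the evaluation results in the paper and in joint_28_full_performance.pdf.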
To use the pre-trained models, follow these steps:
1. Download the model of choice for the desired languages (the dataset format is described in the readme)
2. Download the helper library cr5.py
3. See example.py for an example of how to use the library
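To illustrate what a shared latent space makes possible, here is a minimal, self-contained sketch that does not use cr5.py at all. It assumes a document is embedded as the count-weighted sum of its word vectors, normalized to unit length, and that documents are compared by cosine similarity. The tiny 3-dimensional vectors, the `EMBEDDINGS` table, and the function names are all hypothetical; real vectors come from the downloaded model files, and example.py shows the actual library usage.

```python
import math
from collections import Counter

# Hypothetical toy word vectors in a shared space (NOT real Cr5 vectors).
EMBEDDINGS = {
    "en": {"dog": [0.9, 0.1, 0.0], "barks": [0.7, 0.3, 0.1], "stock": [0.0, 0.2, 0.9]},
    "it": {"cane": [0.8, 0.2, 0.0], "abbaia": [0.6, 0.4, 0.1], "borsa": [0.1, 0.1, 0.9]},
}


def embed_document(tokens, lang):
    """Sum the vectors of known words (weighted by count), then L2-normalize."""
    vectors = EMBEDDINGS[lang]
    counts = Counter(token for token in tokens if token in vectors)
    if not counts:
        raise ValueError("no token of the document is in the vocabulary")
    dim = len(next(iter(vectors.values())))
    doc = [0.0] * dim
    for word, count in counts.items():
        for i, value in enumerate(vectors[word]):
            doc[i] += count * value
    norm = math.sqrt(sum(value * value for value in doc))
    return [value / norm for value in doc]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


english = embed_document(["dog", "barks"], "en")
italian_related = embed_document(["cane", "abbaia"], "it")
italian_unrelated = embed_document(["borsa"], "it")

# The related Italian document should score higher than the unrelated one.
print(cosine(english, italian_related) > cosine(english, italian_unrelated))
```

Because both languages live in the same space, no translation step is needed before comparing documents; that is the property the pre-trained models are meant to provide.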
If you found the provided resources useful, please cite the above paper. Here's a BibTeX entry you may use:
@inproceedings{josifoski-wsdm2019-cr5,
title={Crosslingual Document Embedding as Reduced-Rank Ridge Regression},
author={Josifoski, Martin and Paskov, Ivan S. and Paskov, Hristo S. and Jaggi, Martin and West, Robert},
booktitle={Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining},
organization={ACM},
year={2019}
}
Any questions or suggestions?
Contact martin.josifoski@epfl.ch.
Files (35.6 GB in total)
| MD5 checksum | Size |
|---|---|
| md5:ce3d0069218cb76011293e1408887421 | 2.1 kB |
| md5:d851f85aa6975c18c6a104569d044f6b | 522 Bytes |
| md5:ae99b293b127a8c1eaf2b8ccf31eff47 | 586.1 MB |
| md5:93f27eaefed1ee88a14607f7a84b5b51 | 584.4 MB |
| md5:6ab0bd65407e1c5a0ed40e2e32ba43e8 | 584.0 MB |
| md5:c4563f3a65c3d0d8067ba7b348d9e168 | 509.2 MB |
| md5:5158651f22307e156229be648007c7b9 | 584.0 MB |
| md5:4d9b3f75074437a4a39f9a95fffd1933 | 534.1 MB |
| md5:dc2540985608d1b79d1a3daa1f574431 | 584.6 MB |
| md5:1224ea2bab473cd93f9b8f0c23207327 | 584.1 MB |
| md5:4468b8ba41336575bc99fd833920318b | 570.2 MB |
| md5:f8601f53696e21225669ad22dd00b99d | 584.5 MB |
| md5:8b79c80ea89960e8e637bb38ebae10b8 | 583.9 MB |
| md5:4b5578bff2968c3b1d37d6b212353057 | 15.4 GB |
| md5:97d2aadb76a0711b7abc89acf209eec6 | 158.8 kB |
| md5:9d79fadc03e283f1c531b9c3a517ad9e | 585.0 MB |
| md5:5a1023494a94c71c96aa9a769589a202 | 584.4 MB |
| md5:985bb812dd4dd8e6515307e82368c03e | 444.8 MB |
| md5:2e3d4b27f5bc01db54926c20098e4373 | 583.8 MB |
| md5:8fd2575b9b24a82ee7f4a726132390aa | 298.2 MB |
| md5:c1c9751f45ee7524b2ae08c0992c392a | 584.6 MB |
| md5:5649c66a78ac151b949c5d783da050f1 | 585.1 MB |
| md5:6235f14c6f2a00a5cba85eb4857f21e5 | 583.4 MB |
| md5:0e4afec3d1399e7dcb8f2a5b215963fb | 584.2 MB |
| md5:fc821c7e5e74782c3666c0ba4c1bdab1 | 568.0 MB |
| md5:4724af9acad0da96fdef71c3c56fc2a3 | 584.6 MB |
| md5:98e2879b88169d55492e9da894aa6b9a | 585.3 MB |
| md5:43a39cba0471db74a21a5b7b1630894a | 504.7 MB |
| md5:4ccd88156ef0402ad0ab748f8e961bec | 585.1 MB |
| md5:420b9d0ada7d2fa8743aea689c19cb1f | 585.1 MB |
| md5:59efac91f278038de61cdc8226b9e047 | 585.3 MB |
| md5:f64043075f90df4f27246ebbcdc5e46a | 310.5 MB |
| md5:a2806f31e8f66611d48cddb10648367f | 565.2 MB |
| md5:387b6fceb7035a952faaf2935ff6f3b8 | 577.2 MB |
| md5:7067a66ddbe766da20f7a17961e6c662 | 576.7 MB |
| md5:bf908f9734018d7a67595deba37ec278 | 372.8 MB |
| md5:9ff335dd19120dc596c25d204ab19ab2 | 502.6 MB |
| md5:91979c811eb6993fc40d2861398d2c06 | 577.2 MB |
| md5:204e528de0cd7a6d3f29cf7202a2315c | 276.3 MB |
| md5:bfc5575967db2dff1cb1fc03001a9aa8 | 161.2 MB |
| md5:7ec7483b0caf5af428d8ba80dd951bd4 | 574.1 MB |
| md5:3726babaa090058af8254f4163770e62 | 572.9 MB |
| md5:aa8673ee793295daae1ed201a9a2e65a | 310 Bytes |
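Since several of the files are hundreds of megabytes or larger, it can be worth checking a download against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` and the file path in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns b"" so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical path; take the checksum from the listing for that file):
# ok = matches_checksum("path/to/downloaded_file", "md5:<checksum from the listing>")
```

A mismatch usually means the download was truncated or corrupted, in which case downloading the file again is the simplest fix.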