WTC1.0 (WikiTailor corpus v. 1.0)
- 1. DFKI GmbH
- 2. Università di Bologna
- 3. Amazon
Description
Content:
List of the 743 domains, their term vocabularies in 10 languages, and the Wikipedia articles associated with each domain, as extracted by the best model described in:
Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. “Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction.” ArXiv abs/2005.01177 (2020)
https://github.com/cristinae/WikiTailor
Files Description:
- commonCats2015.enesdefrcaareuelrooc.tsv
Multilingual domains, listed one per line. Languages are separated by a tab, in the order en, es, de, fr, ca, ar, eu, el, ro and oc. For each language we include the pair "ID categoryName", separated by a blank space.
- [LAN].0.tar.bz
A folder per domain for language [LAN], containing the vocabulary and the IDs of the articles extracted by the WikiTailor model 50-WT100.
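The line format described for commonCats2015.enesdefrcaareuelrooc.tsv can be read with a few lines of Python. This is a minimal sketch, not part of the WikiTailor software: the function names `parse_domain_line` and `load_domains` are hypothetical, and it assumes that the ID contains no blank space, that category names may contain blanks (so only the first blank is treated as the separator), and that the file is UTF-8 encoded. The sample line at the end is invented for illustration and does not come from the corpus.

```python
# Language order as stated in the file description.
LANGS = ["en", "es", "de", "fr", "ca", "ar", "eu", "el", "ro", "oc"]


def parse_domain_line(line):
    """Map each language code to an (ID, categoryName) pair for one domain."""
    fields = line.rstrip("\n").split("\t")
    domain = {}
    for lang, field in zip(LANGS, fields):
        field = field.strip()
        if not field:
            # Defensive assumption: skip a language whose field is empty.
            continue
        # Split on the first blank only; the rest is the category name.
        cat_id, _, name = field.partition(" ")
        domain[lang] = (cat_id, name)
    return domain


def load_domains(path):
    """Read the whole TSV file into a list with one dict per domain."""
    with open(path, encoding="utf-8") as handle:  # UTF-8 is an assumption
        return [parse_domain_line(line) for line in handle if line.strip()]


# Invented sample line (IDs and names are placeholders, not corpus data).
sample = "\t".join(f"{100 + i} Category {lang}" for i, lang in enumerate(LANGS))
parsed = parse_domain_line(sample)
print(parsed["en"], parsed["oc"])
```

The IDs are kept as strings so that nothing is lost if they are not plain integers; convert them with `int()` only after checking the actual file.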
Files
(369.4 MB)
Checksum | Size |
---|---|
md5:6180a94fdd59d3084ec01dcc369e132c | 12.6 MB |
md5:bfc519f381510dfa9feb587c311928b0 | 5.2 MB |
md5:07591e64ec729f28bb023a4c579cd57b | 151.3 kB |
md5:24621e95a6aa108e85ddd0ff526054f4 | 13.3 MB |
md5:c42d2a81e09178b4418317adac4b2b3f | 1.7 MB |
md5:6880fa386f33ae72033ac88b9ba4a5b3 | 259.5 MB |
md5:6bd29b8477deba2d63d371bcd3b98224 | 30.1 MB |
md5:0846ce241fdec98ac3046a8a1cf36565 | 1.7 MB |
md5:6c5d057f1e6fc67c3a775e1d6f823a68 | 38.8 MB |
md5:f053b86aa5133aac10d31fcbce61ffc1 | 486.8 kB |
md5:909d63b62a13ac98bc8d4e54132166c3 | 5.7 MB |
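A downloaded file can be checked against the md5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the file path in the usage note is a placeholder, since the listing does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file with a listed value such as 'md5:6180a9...'."""
    # Drop the optional 'md5:' prefix and normalise the case.
    expected = listed.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("some_download", "md5:6180a94fdd59d3084ec01dcc369e132c")` returns `True` only if the file at that (placeholder) path is the 12.6 MB entry and was transferred intact.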
Additional details
Related works
- Is compiled by
- Software: https://github.com/cristinae/WikiTailor (URL)
- Is supplement to
- Preprint: arXiv:2005.01177 (arXiv)