WTC1.1 (WikiTailor corpus v. 1.1)
- 1. DFKI GmbH
- 2. Università di Bologna
- 3. Amazon
Description
Content:
List of the 743 domains, their term vocabularies in 10 languages, and the Wikipedia articles associated with each domain, as extracted by the best model described in:
Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. "Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction." Knowledge and Information Systems, Volume 65, pages 1365-1397. 2023. Springer-Verlag, London Ltd. https://doi.org/10.1007/s10115-022-01767-5
https://github.com/cristinae/WikiTailor
Files Description:
- commonCats2015.enesdefrcaareuelrooc.tsv
Multilingual domains, listed one per line. The language fields are separated by tabs, in the order en, es, de, fr, ca, ar, eu, el, ro and oc. Each language field holds the pair "ID categoryName", separated by a single space.
- [LAN].0.tar.bz
A folder per domain for language [LAN], containing the vocabulary and the IDs of the articles extracted by the WikiTailor model 50-WT100.
- extraction[LAN]0.tar.bz
A folder per domain for language [LAN], containing the text of the extracted articles. Each file name corresponds to an article ID listed in [LAN].0.tar.bz.
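The line layout described for commonCats2015.enesdefrcaareuelrooc.tsv can be read with a few lines of Python. The following is a minimal sketch, not part of the WikiTailor software: the names `LANGS` and `parse_domain_line` are hypothetical, the IDs are kept as strings, and the sample line is invented for illustration rather than taken from the corpus. It also assumes every line carries all ten language fields.

```python
# Language order as given in the file description.
LANGS = ["en", "es", "de", "fr", "ca", "ar", "eu", "el", "ro", "oc"]


def parse_domain_line(line):
    """Map each language code to an (ID, categoryName) pair for one domain line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LANGS):
        raise ValueError(
            f"expected {len(LANGS)} tab-separated fields, got {len(fields)}"
        )
    domain = {}
    for lang, field in zip(LANGS, fields):
        # Split on the first blank only, in case a category name contains spaces.
        cat_id, _, name = field.partition(" ")
        domain[lang] = (cat_id, name)
    return domain


# Invented sample line; these IDs and names do not come from the corpus.
sample = "\t".join(f"{100 + i} Name_{lang}" for i, lang in enumerate(LANGS))
parsed = parse_domain_line(sample)
print(parsed["en"])
```

Splitting on the first blank only is a defensive choice: the description says the ID and the category name are separated by a blank space, but does not say whether category names themselves may contain spaces.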
Files
(7.1 GB)
MD5 checksum | Size |
---|---|
md5:6180a94fdd59d3084ec01dcc369e132c | 12.6 MB |
md5:bfc519f381510dfa9feb587c311928b0 | 5.2 MB |
md5:07591e64ec729f28bb023a4c579cd57b | 151.3 kB |
md5:24621e95a6aa108e85ddd0ff526054f4 | 13.3 MB |
md5:c42d2a81e09178b4418317adac4b2b3f | 1.7 MB |
md5:6880fa386f33ae72033ac88b9ba4a5b3 | 259.5 MB |
md5:6bd29b8477deba2d63d371bcd3b98224 | 30.1 MB |
md5:0846ce241fdec98ac3046a8a1cf36565 | 1.7 MB |
md5:7ac7019908d4919c0836354d1b0da9db | 150.6 MB |
md5:a59cfea2ffdc5e3718767121fafe5fb2 | 222.9 MB |
md5:43a2d081f1c02d74fad3c31f70fe86ec | 1.3 GB |
md5:5681729035faecee19d4aa35b13fbb87 | 84.6 MB |
md5:80a0a4f734b7ceab134069f8b337d1b0 | 3.1 GB |
md5:4f44ef2701b5643601b17d73daf49702 | 752.5 MB |
md5:ff98d659a69a440d96274daeab9a0e7a | 47.7 MB |
md5:aca086b46fecd6f943908e8d4237c9a5 | 896.7 MB |
md5:68f96e2d373fd8b56822cc7b4780312f | 19.7 MB |
md5:7d4d95edb151973c2a731d240052a199 | 99.9 MB |
md5:6c5d057f1e6fc67c3a775e1d6f823a68 | 38.8 MB |
md5:f053b86aa5133aac10d31fcbce61ffc1 | 486.8 kB |
md5:909d63b62a13ac98bc8d4e54132166c3 | 5.7 MB |
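
The md5: values listed for the files can be used to check a download for corruption. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the file is read in chunks because some archives are several GB in size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed checksum, with or without the 'md5:' prefix."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local path; pair it with the checksum listed for that file):
# ok = matches_listed_md5("downloads/some_archive.tar.bz", "md5:<listed value>")
```

A mismatch most often indicates an incomplete or corrupted download, in which case fetching the file again is the usual remedy.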
Additional details
Related works
- Is compiled by
- Software: https://github.com/cristinae/WikiTailor (URL)
- Is published in
- Dataset: 10.1007/s10115-022-01767-5 (DOI)
- Is supplement to
- Preprint: arXiv:2005.01177 (arXiv)
- Publication: 10.1007/s10115-022-01767-5 (DOI)