Published May 12, 2020 | Version v1.1
Dataset Open

WTC1.1 (WikiTailor corpus v. 1.1)

  • 1. DFKI GmbH
  • 2. Università di Bologna
  • 3. Amazon

Description

 

Content:

List of the 743 domains, their term vocabularies in 10 languages, and the Wikipedia articles associated with each domain, as extracted by the best model described in:

  Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. "Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction." Knowledge and Information Systems, Volume 65, pages 1365-1397. 2023. Springer-Verlag, London Ltd. https://doi.org/10.1007/s10115-022-01767-5

  https://github.com/cristinae/WikiTailor

 

Files Description:

  • commonCats2015.enesdefrcaareuelrooc.tsv

Multilingual domains, listed one per line. Within a line, languages are separated by tabs in the order en, es, de, fr, ca, ar, eu, el, ro and oc. Each language field holds the pair "ID categoryName", separated by a single space.
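As a minimal sketch of how one line of this file could be read, assuming exactly ten tab-separated fields per line and that only the first space in a field separates the ID from the category name (the name itself may contain spaces; this is an assumption, not something the description states). The names LANGS and parse_domain_line are illustrative only.

```python
from typing import Dict, Tuple

# Language order as stated in the file description.
LANGS = ["en", "es", "de", "fr", "ca", "ar", "eu", "el", "ro", "oc"]


def parse_domain_line(line: str) -> Dict[str, Tuple[str, str]]:
    """Parse one line into {language: (category_id, category_name)}."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(LANGS):
        raise ValueError(
            f"expected {len(LANGS)} tab-separated fields, got {len(fields)}"
        )
    domain = {}
    for lang, field in zip(LANGS, fields):
        # Assumption: split on the first space only, so that a category
        # name containing spaces is kept whole.
        cat_id, _, name = field.partition(" ")
        domain[lang] = (cat_id, name)
    return domain
```

Applied to every line of commonCats2015.enesdefrcaareuelrooc.tsv, this would give one dictionary per domain, keyed by language code.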

  • [LAN].0.tar.bz

One folder per domain for language [LAN], containing the vocabulary and the IDs of the articles extracted by the WikiTailor model 50-WT100.

  • extraction[LAN]0.tar.bz

One folder per domain for language [LAN], containing the text of the extracted articles. Each file name corresponds to an article ID listed in [LAN].0.tar.bz.
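The following sketch shows one way to inspect such an archive without unpacking it, by grouping the contained files under their parent folder (one folder per domain, as described above). The exact directory layout and the compression behind the .tar.bz extension are not specified here, so the code lets tarfile detect the compression and only assumes that each file sits directly inside its domain folder. The function name files_per_domain is hypothetical.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def files_per_domain(archive_path):
    """Group regular files in the archive by the name of their parent folder."""
    groups = defaultdict(list)
    # "r:*" asks tarfile to detect the compression rather than assuming bzip2.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                path = PurePosixPath(member.name)
                # Assumption: the immediate parent folder is the domain folder.
                groups[path.parent.name].append(path.name)
    return dict(groups)
```

Running it on both a [LAN].0.tar.bz archive and the matching extraction archive would allow the article IDs and the text file names to be compared per domain.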

 

 

Files (7.1 GB)

Checksum (md5)                         Size
md5:6180a94fdd59d3084ec01dcc369e132c   12.6 MB
md5:bfc519f381510dfa9feb587c311928b0   5.2 MB
md5:07591e64ec729f28bb023a4c579cd57b   151.3 kB
md5:24621e95a6aa108e85ddd0ff526054f4   13.3 MB
md5:c42d2a81e09178b4418317adac4b2b3f   1.7 MB
md5:6880fa386f33ae72033ac88b9ba4a5b3   259.5 MB
md5:6bd29b8477deba2d63d371bcd3b98224   30.1 MB
md5:0846ce241fdec98ac3046a8a1cf36565   1.7 MB
md5:7ac7019908d4919c0836354d1b0da9db   150.6 MB
md5:a59cfea2ffdc5e3718767121fafe5fb2   222.9 MB
md5:43a2d081f1c02d74fad3c31f70fe86ec   1.3 GB
md5:5681729035faecee19d4aa35b13fbb87   84.6 MB
md5:80a0a4f734b7ceab134069f8b337d1b0   3.1 GB
md5:4f44ef2701b5643601b17d73daf49702   752.5 MB
md5:ff98d659a69a440d96274daeab9a0e7a   47.7 MB
md5:aca086b46fecd6f943908e8d4237c9a5   896.7 MB
md5:68f96e2d373fd8b56822cc7b4780312f   19.7 MB
md5:7d4d95edb151973c2a731d240052a199   99.9 MB
md5:6c5d057f1e6fc67c3a775e1d6f823a68   38.8 MB
md5:f053b86aa5133aac10d31fcbce61ffc1   486.8 kB
md5:909d63b62a13ac98bc8d4e54132166c3   5.7 MB
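
The md5 values in this listing can be used to check a download for corruption. A minimal sketch using Python's standard hashlib module follows; the helper names md5_of_file and matches_listing are illustrative, and the file is read in chunks because some archives are several GB in size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a value written as in the listing,
    e.g. 'md5:6180a94fdd59d3084ec01dcc369e132c'."""
    expected = listed.strip().split(":", 1)[-1].lower()
    return md5_of_file(path) == expected
```

For example, matches_listing("downloaded.tar.bz", "md5:6180a94fdd59d3084ec01dcc369e132c") would return True only if the local file (the path here is a placeholder) has that checksum.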

Additional details

Related works

Is compiled by
Software: https://github.com/cristinae/WikiTailor (URL)
Is published in
Dataset: 10.1007/s10115-022-01767-5 (DOI)
Is supplement to
Preprint: arXiv:2005.01177 (arXiv)
Publication: 10.1007/s10115-022-01767-5 (DOI)