There is a newer version of the record available.

Published May 12, 2020 | Version v1.0
Dataset Open

WTC1.0 (WikiTailor corpus v. 1.0)

  • 1. DFKI GmbH
  • 2. Università di Bologna
  • 3. Amazon

Description

Content:

List of the 743 domains, their term vocabularies in 10 languages, and the Wikipedia articles associated with each domain, as extracted by the best model described in:

  Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. “Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction.” ArXiv abs/2005.01177 (2020)

  https://github.com/cristinae/WikiTailor

Files Description:

  • commonCats2015.enesdefrcaareuelrooc.tsv

Multilingual domains, listed one per line. Languages are separated by a tab, in the order en, es, de, fr, ca, ar, eu, el, ro and oc. For each language, the field holds the pair "ID categoryName", separated by a blank space.

  • [LAN].0.tar.bz

One folder per domain for language [LAN], containing the vocabulary and the IDs of the articles extracted by the WikiTailor model 50-WT100.
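
As an illustration of the line layout described for commonCats2015.enesdefrcaareuelrooc.tsv, the following Python sketch splits one line into its ten language fields. It assumes that every line carries exactly ten tab-separated fields and that the ID is everything before the first blank space, so that category names may themselves contain spaces. The function name parse_domain_line and the constant LANGS are hypothetical and are not part of WikiTailor.

```python
# Language order as stated in the file description.
LANGS = ["en", "es", "de", "fr", "ca", "ar", "eu", "el", "ro", "oc"]


def parse_domain_line(line):
    """Return {language: (category_id, category_name)} for one line.

    Assumption: each field has the form "ID categoryName", and only the
    first blank space separates the ID from the name.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LANGS):
        raise ValueError(
            f"expected {len(LANGS)} tab-separated fields, got {len(fields)}"
        )
    domain = {}
    for lang, field in zip(LANGS, fields):
        # partition() splits on the first space only; IDs are kept as
        # strings because the description does not state their type.
        category_id, _, category_name = field.partition(" ")
        domain[lang] = (category_id, category_name)
    return domain
```

Reading the whole file would then amount to opening it with encoding="utf-8" and calling parse_domain_line on each line. Lines that do not match the assumed layout raise a ValueError instead of being silently skipped.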

Files (369.4 MB)

MD5 checksum                          Size
md5:6180a94fdd59d3084ec01dcc369e132c  12.6 MB
md5:bfc519f381510dfa9feb587c311928b0  5.2 MB
md5:07591e64ec729f28bb023a4c579cd57b  151.3 kB
md5:24621e95a6aa108e85ddd0ff526054f4  13.3 MB
md5:c42d2a81e09178b4418317adac4b2b3f  1.7 MB
md5:6880fa386f33ae72033ac88b9ba4a5b3  259.5 MB
md5:6bd29b8477deba2d63d371bcd3b98224  30.1 MB
md5:0846ce241fdec98ac3046a8a1cf36565  1.7 MB
md5:6c5d057f1e6fc67c3a775e1d6f823a68  38.8 MB
md5:f053b86aa5133aac10d31fcbce61ffc1  486.8 kB
md5:909d63b62a13ac98bc8d4e54132166c3  5.7 MB
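
The MD5 checksums in the listing above can be used to check a downloaded file for corruption. The following Python sketch shows one way to do this with the standard hashlib module. The function names md5_of_file and matches_listing are hypothetical, and the mapping between checksums and file names is not given in this listing, so it has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    # Drop the optional 'md5:' prefix used in the listing.
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as matches_listing(local_path, "md5:909d63b62a13ac98bc8d4e54132166c3") returns True only if the file at local_path, a placeholder for wherever the download was saved, has exactly that digest.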

Additional details

Related works

Is compiled by
Software: https://github.com/cristinae/WikiTailor (URL)
Is supplement to
Preprint: arXiv:2005.01177 (arXiv)