10.5281/zenodo.3339152
https://zenodo.org/records/3339152
oai:zenodo.org:3339152
Al-Khatib, Khalid
Khalid
Al-Khatib
Bauhaus-Universität Weimar
Wachsmuth, Henning
Henning
Wachsmuth
0000-0003-2792-621X
Bauhaus-Universität Weimar
Hagen, Matthias
Matthias
Hagen
0000-0002-9733-2890
Martin-Luther-Universität Halle-Wittenberg
Stein, Benno
Benno
Stein
0000-0001-9033-2217
Bauhaus-Universität Weimar
Lang, Kevin
Kevin
Lang
0000-0003-4084-514X
Bauhaus-Universität Weimar
Herpel, Jakob
Jakob
Herpel
Bauhaus-Universität Weimar
Webis-WikiDiscussions-18
Zenodo
2018
wikipedia
talk pages
discussions
tags
shortcuts
links
templates
2018-07-17
eng
10.5281/zenodo.3339151
https://zenodo.org/communities/webis
Creative Commons Attribution 4.0 International
The Webis-WikiDiscussions-18 corpus is the output of parsing the entire set of Wikipedia talk pages. It contains about six million discussions, which together comprise about 20 million turns. The turns include around 74,000 different tags with a total of about 100,000 instances, around 7,000 different shortcuts with about 400,000 instances, and around 51,000 different inline templates with about 3.3 million instances.
The database has the following structure (each table name is followed by its columns):
PAGES: PAGE-ID, URL, TITLE
DISCUSSIONS: DISCUSSION-ID, PAGE-ID, TITLE
COMMENTS: COMMENT-ID, DISCUSSION-ID, PARENT-ID, TEXT-RAW, TEXT-CLEAN, USER
TAGS: TAG-ID, COMMENT-ID, TAG-TEXT, TAG-CLASS
TEMPLATES: TEMPLATE-ID, DISCUSSION-ID, TEMPLATE-TEXT
SHORTCUTS: SHORTCUT-ID, COMMENT-ID, SHORTCUT-TEXT, SHORTCUT-CLASS
LINKS: LINK-ID, COMMENT-ID, LINK-TEXT
INLINE-TEMPLATES: IL-TEMPLATE-ID, COMMENT-ID, IL-TEMPLATE-TEXT, TYPE, DESCRIPTION
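
The structure above suggests that a discussion can be rebuilt as a reply tree by following PARENT-ID within COMMENTS. The following is a minimal sketch of that idea, not the corpus's own tooling. It makes several assumptions that the record does not confirm: the tables are loaded into SQLite, hyphens in table and column names are replaced by underscores, a top-level turn has a NULL PARENT-ID, and COMMENT-ID order reflects posting order. The rows inserted here are toy data for illustration only, and `build_demo_db` and `thread` are hypothetical helper names.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create an in-memory database mirroring PAGES, DISCUSSIONS, COMMENTS.

    Assumption: underscore names stand in for the hyphenated names in the
    record; the storage engine of the real corpus is not stated there.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pages (
            page_id INTEGER PRIMARY KEY, url TEXT, title TEXT);
        CREATE TABLE discussions (
            discussion_id INTEGER PRIMARY KEY,
            page_id INTEGER REFERENCES pages(page_id),
            title TEXT);
        CREATE TABLE comments (
            comment_id INTEGER PRIMARY KEY,
            discussion_id INTEGER REFERENCES discussions(discussion_id),
            parent_id INTEGER,          -- assumed NULL for top-level turns
            text_raw TEXT, text_clean TEXT, "user" TEXT);
        """
    )
    # Toy rows, invented for this example only.
    conn.execute("INSERT INTO pages VALUES (1, 'Talk:Example', 'Example')")
    conn.execute("INSERT INTO discussions VALUES (1, 1, 'Merge proposal')")
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, 1, None, "I propose a merge.", "I propose a merge.", "UserA"),
            (2, 1, 1, ":Agreed.", "Agreed.", "UserB"),
            (3, 1, 1, ":I disagree.", "I disagree.", "UserC"),
            (4, 1, 2, "::Thanks.", "Thanks.", "UserA"),
        ],
    )
    return conn


def thread(conn, discussion_id):
    """Return the turns of one discussion as (depth, user, text), depth-first."""
    rows = conn.execute(
        'SELECT comment_id, parent_id, "user", text_clean '
        "FROM comments WHERE discussion_id = ? ORDER BY comment_id",
        (discussion_id,),
    ).fetchall()

    # Group each turn under the turn it replies to.
    children = defaultdict(list)
    for comment_id, parent_id, user, text in rows:
        children[parent_id].append((comment_id, user, text))

    ordered = []

    def walk(parent_id, depth):
        for comment_id, user, text in children.get(parent_id, []):
            ordered.append((depth, user, text))
            walk(comment_id, depth + 1)

    walk(None, 0)  # start from the top-level turns
    return ordered


if __name__ == "__main__":
    for depth, user, text in thread(build_demo_db(), 1):
        print("  " * depth + f"{user}: {text}")
```

The other tables (TAGS, SHORTCUTS, LINKS, INLINE-TEMPLATES) reference COMMENT-ID, so under the same assumptions they could be joined to individual turns in the same way. The recursive walk is fine for short threads; for very deep reply chains an explicit stack would avoid Python's recursion limit.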