Dataset Open Access

Webis-WikiDiscussions-18

Al-Khatib, Khalid; Wachsmuth, Henning; Hagen, Matthias; Stein, Benno; Lang, Kevin; Herpel, Jakob

The Webis-WikiDiscussions-18 corpus is the output of parsing the entire set of Wikipedia talk pages. It contains about six million discussions comprising about 20 million turns. The turns include around 74,000 different tags with a total of about 100,000 instances, around 7,000 different shortcuts with about 400,000 instances, and around 51,000 different inline templates with about 3.3 million instances.

The database has the following structure:

  • PAGES: PAGE-ID, URL, TITLE
  • DISCUSSIONS: DISCUSSION-ID, PAGE-ID, TITLE
  • COMMENTS: COMMENT-ID, DISCUSSION-ID, PARENT-ID, TEXT-RAW, TEXT-CLEAN, USER
  • TAGS: TAG-ID, COMMENT-ID, TAG-TEXT, TAG-CLASS
  • TEMPLATES: TEMPLATE-ID, DISCUSSION-ID, TEMPLATE-TEXT
  • SHORTCUTS: SHORTCUT-ID, COMMENT-ID, SHORTCUT-TEXT, SHORTCUT-CLASS
  • LINKS: LINK-ID, COMMENT-ID, LINK-TEXT
  • INLINE-TEMPLATES: IL-TEMPLATE-ID, COMMENT-ID, IL-TEMPLATE-TEXT, TYPE, DESCRIPTION
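
The relations above can be read with standard TSV tooling. The following is a minimal sketch, not the corpus authors' loader: it assumes the COMMENTS table is stored as tab-separated rows without a header, in the column order listed above, and that PARENT-ID is empty for top-level comments. None of these details are confirmed by this record, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the COMMENTS entry in the structure list.
# ASSUMPTION: the TSV file has no header row and uses this order.
COMMENT_COLUMNS = [
    "COMMENT-ID", "DISCUSSION-ID", "PARENT-ID",
    "TEXT-RAW", "TEXT-CLEAN", "USER",
]


def read_tsv(handle, columns):
    """Read tab-separated rows into dicts keyed by the given column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader]


def group_by_discussion(comments):
    """Map each DISCUSSION-ID to its comments, in file order."""
    threads = defaultdict(list)
    for comment in comments:
        threads[comment["DISCUSSION-ID"]].append(comment)
    return dict(threads)


def replies_by_parent(comments):
    """Map each PARENT-ID to the COMMENT-IDs that reply to it.

    ASSUMPTION: an empty PARENT-ID marks a top-level comment.
    """
    replies = defaultdict(list)
    for comment in comments:
        if comment["PARENT-ID"]:
            replies[comment["PARENT-ID"]].append(comment["COMMENT-ID"])
    return dict(replies)


# Hypothetical sample rows, not taken from the corpus.
SAMPLE = (
    "1\t10\t\traw a\tclean a\tAlice\n"
    "2\t10\t1\traw b\tclean b\tBob\n"
    "3\t11\t\traw c\tclean c\tCarol\n"
)

comments = read_tsv(io.StringIO(SAMPLE), COMMENT_COLUMNS)
threads = group_by_discussion(comments)
replies = replies_by_parent(comments)
```

For the real files, `io.StringIO(SAMPLE)` would be replaced by an open file handle. With about 20 million turns, loading everything into a list may be too memory-hungry, so iterating over the reader row by row is likely the safer choice.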

Files (4.8 GB)

  • Webis-WikiDiscussions-18-TSV.tar.gz (4.8 GB), md5:3b008055bd84f5d4808931ce0797bdf8
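
The MD5 checksum listed for the archive can be checked after download. The sketch below streams the file in chunks so that the 4.8 GB archive does not have to fit in memory. The function name `md5_of_file` and the local path in the usage comment are illustrative assumptions.

```python
import hashlib

# Checksum as listed in this record for Webis-WikiDiscussions-18-TSV.tar.gz.
EXPECTED_MD5 = "3b008055bd84f5d4808931ce0797bdf8"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the archive was saved under this name in the working directory):
# assert md5_of_file("Webis-WikiDiscussions-18-TSV.tar.gz") == EXPECTED_MD5
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.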
