Dataset Open Access

Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

The corpus has the following structure:

  • wikipedia.tar.gz: Each line represents one Wikipedia article and contains a JSON array of article_id, article_title, and article_body.
  • within-wikipedia-tr-01.gz: Each line represents one text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
  • within-wikipedia-tr-02.gz: Each line represents one text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
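
The line-per-record layout described above can be read with a streaming loop. The following is a minimal sketch, not the authors' tooling: it assumes that each line is a JSON array holding exactly the four fields in the order listed (s_id, t_id, s_text, t_text) and that the files are UTF-8 encoded. The names ReuseCase and iter_reuse_cases are hypothetical helpers introduced only for illustration.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class ReuseCase(NamedTuple):
    """One text reuse case (hypothetical container, field order assumed)."""
    s_id: object   # source article id
    t_id: object   # target article id
    s_text: str    # source text
    t_text: str    # target text


def iter_reuse_cases(path: str) -> Iterator[ReuseCase]:
    """Yield one ReuseCase per non-empty line of a gzip-compressed file.

    Assumption: every line is a JSON array of exactly four elements.
    A line with a different length raises ValueError on unpacking.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            s_id, t_id, s_text, t_text = json.loads(line)
            yield ReuseCase(s_id, t_id, s_text, t_text)


# Example use (not executed here; the path is a local download):
# for case in iter_reuse_cases("within-wikipedia-tr-01.gz"):
#     print(case.s_id, case.t_id, len(case.s_text), len(case.t_text))
```

Reading line by line keeps memory use flat, which matters for files of several gigabytes. If the lines turn out to be JSON objects rather than arrays, the unpacking step would need to be adapted.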

The datasets were extracted in the work by Alshomary et al. 2018, which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

Files (12.6 GB)

Name                        Size     MD5
wikipedia.zip               3.8 GB   16c86f1b90cbf5b550d46eea29e4b206
within-wikipedia-tr-01.gz   4.7 GB   5214d3f942db3ae4de61c98f4d9a341b
within-wikipedia-tr-02.gz   4.0 GB   a9e12fc02a9701ce6d524f2ea6bde7b1
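
The MD5 checksums listed above can be used to check a download for corruption. The sketch below is one possible way to do this with Python's standard library. The EXPECTED_MD5 table copies the values from the file listing, while md5_of_file and is_intact are hypothetical helper names, and the local paths are assumed to be wherever the files were saved.

```python
import hashlib

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "wikipedia.zip": "16c86f1b90cbf5b550d46eea29e4b206",
    "within-wikipedia-tr-01.gz": "5214d3f942db3ae4de61c98f4d9a341b",
    "within-wikipedia-tr-02.gz": "a9e12fc02a9701ce6d524f2ea6bde7b1",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, name: str) -> bool:
    """Compare a local file against the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


# Example use (not executed here; the path is a local download):
# print(is_intact("within-wikipedia-tr-01.gz", "within-wikipedia-tr-01.gz"))
```

Chunked reading avoids loading a multi-gigabyte file into memory at once.
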
  • Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, and Martin Potthast. Wikipedia Text Reuse: Within and Without. In Leif Azzopardi et al., editors, Advances in Information Retrieval. 41st European Conference on IR Research (ECIR 2019), volume 11437 of Lecture Notes in Computer Science, pages 747-754, Berlin Heidelberg New York, April 2019. Springer.
