Published June 16, 2023 | Version v1
Dataset Open

Webis Wikipedia Innovation History 2023

  • 1. Leipzig University and ScaDS.AI
  • 2. DZHW Berlin
  • 3. Technische Universität Berlin
  • 4. Bauhaus-Universität Weimar

Description

This corpus contains the history sections of science and technology articles on Wikipedia, extracted from the Wikimedia dump of 1 January 2022. Articles were retrieved using Wikipedia's category network. History sections were extracted using a combination of section-heading-based heuristics and classifiers trained on articles with designated history sections.

If you use this corpus, please cite the following paper:

Wolfgang Kircheis, Marion Schmidt, Arno Simons, Benno Stein, and Martin Potthast. Mining the History Sections of Wikipedia Articles on Science and Technology. In 23rd ACM/IEEE Joint Conference on Digital Libraries (JCDL 2023), June 2023.

@InProceedings{kircheis:2023,
  author    = {Wolfgang Kircheis and Marion Schmidt and Arno Simons and Benno Stein and Martin Potthast},
  booktitle = {23rd {ACM/IEEE} Joint Conference on Digital Libraries (JCDL 2023)},
  codeurl   = {https://github.com/webis-de/JCDL-23},
  keywords  = {nlp, natural language processing},
  month     = jun,
  title     = {{Mining the History Sections of Wikipedia Articles on Science and Technology}},
  year      = 2023
}

Files

webis-WikiSciTech-23.json (26.2 MB)
md5:6d060feac286ff7bc38290451f9aa818
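
After downloading, the file can be checked against the MD5 checksum listed above before it is parsed. The following is a minimal sketch, assuming the file was saved under its original name in the current working directory. The helper names `md5_of_file` and `load_verified` are hypothetical, and nothing is assumed about the structure of the JSON beyond it being valid JSON.

```python
import hashlib
import json
from pathlib import Path

# Checksum and file name as given in the file listing of this record.
EXPECTED_MD5 = "6d060feac286ff7bc38290451f9aa818"
# Assumption: the corpus was downloaded into the current working directory.
CORPUS_PATH = Path("webis-WikiSciTech-23.json")


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path, expected_md5):
    """Parse a JSON file only if its MD5 digest matches the expected value."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise ValueError(f"checksum mismatch: expected {expected_md5}, got {actual}")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if CORPUS_PATH.exists():
    corpus = load_verified(CORPUS_PATH, EXPECTED_MD5)
    # The schema is not described in this record, so only the top-level type is reported.
    print("loaded corpus, top-level type:", type(corpus).__name__)
else:
    print(f"{CORPUS_PATH} not found; download it first")
```

A checksum mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest fix.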