Published June 28, 2019 | Version v1
Dataset Open

A Wikipedia dataset of 5 categories

Authors/Creators

  • 1. L3i, La Rochelle University

Description

A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The number of articles per category is:

  • Economy: 44,876 articles
  • History: 92,041 articles
  • Informatics: 25,408 articles
  • Health: 22,143 articles
  • Law: 9,964 articles
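
The per-category counts above can be checked against a local copy of the archive. The following is a minimal sketch, not a documented procedure. It assumes the zip groups articles in one top-level folder per category, which this record does not state; the function name `count_files_per_folder` is hypothetical. If the archive is laid out differently (for example, everything nested under a single root folder), the grouping key would need to be adapted.

```python
import zipfile
from collections import Counter

# Article counts as published in the description above.
PUBLISHED_COUNTS = {
    "Economy": 44876,
    "History": 92041,
    "Informatics": 25408,
    "Health": 22143,
    "Law": 9964,
}


def count_files_per_folder(zip_path):
    """Count regular files per top-level folder of a zip archive.

    Assumption: the archive stores articles in one folder per category.
    A file stored at the archive root is counted under its own name.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)


# Sum of the published per-category counts. This equals the number of
# distinct articles only if no article is listed in two categories.
total_published = sum(PUBLISHED_COUNTS.values())
```

The result of `count_files_per_folder("data_Wikipedia_Categorized_UTF8.zip")` can then be compared by eye with `PUBLISHED_COUNTS`; folder names inside the archive may be the French category names rather than the English ones used here.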

Files

data_Wikipedia_Categorized_UTF8.zip

Name: data_Wikipedia_Categorized_UTF8.zip
Size: 664.9 MB
md5: 8b4dbe468344de1a78c1501326bd8fe0
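
The md5 checksum listed for the file can be used to confirm that a download is complete and uncorrupted. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum published on this record for data_Wikipedia_Categorized_UTF8.zip.
EXPECTED_MD5 = "8b4dbe468344de1a78c1501326bd8fe0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MiB chunks
    so the 664.9 MB archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("data_Wikipedia_Categorized_UTF8.zip")` should return `True` for an intact copy saved under that name in the current directory.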