Published June 28, 2019 | Version v1
Dataset
Open
A Wikipedia dataset of 5 categories
Description
A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The dataset has the following characteristics:
- Economy: 44,876 articles
- History: 92,041 articles
- Informatics: 25,408 articles
- Health: 22,143 articles
- Law: 9,964 articles
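The per-category counts listed above can be checked against the downloaded archive. The following is a minimal sketch, not the authors' tooling. It assumes the archive stores each category's articles under its own top-level folder; the internal layout is not documented on this page, so that is an assumption, and the function name `count_articles_by_folder` is hypothetical.

```python
import zipfile
from collections import Counter

# Article counts as stated in the dataset description.
STATED_COUNTS = {
    "Economy": 44876,
    "History": 92041,
    "Informatics": 25408,
    "Health": 22143,
    "Law": 9964,
}


def count_articles_by_folder(zip_path):
    """Count file entries per top-level folder of a zip archive.

    Assumption: the first path component of each entry names the
    category. Entries stored at the archive root are grouped
    under "(root)".
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = info.filename.split("/")
            top = parts[0] if len(parts) > 1 else "(root)"
            counts[top] += 1
    return counts
```

Calling `count_articles_by_folder("data_Wikipedia_Categorized_UTF8.zip")` returns a `Counter` that can be compared by hand with `STATED_COUNTS`. If the archive wraps everything in a single outer folder, the grouping depth would need to be adjusted.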
Files
| Name | Size | MD5 |
|---|---|---|
| data_Wikipedia_Categorized_UTF8.zip | 664.9 MB | 8b4dbe468344de1a78c1501326bd8fe0 |
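The MD5 checksum published for the archive can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that the 664.9 MB archive is not loaded into memory at once. The helper names `md5_of_file` and `matches_published_md5` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed on this page for data_Wikipedia_Categorized_UTF8.zip.
PUBLISHED_MD5 = "8b4dbe468344de1a78c1501326bd8fe0"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path):
    """Return True if the file's MD5 digest equals the published one."""
    return md5_of_file(path) == PUBLISHED_MD5
```

For example, `matches_published_md5("data_Wikipedia_Categorized_UTF8.zip")` should return `True` for an intact download. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.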