The Million Authors Corpus (MAC)
Description
We introduce the Million Authors Corpus (MAC), a novel dataset of Wikipedia contributions in dozens of languages. It includes only long, contiguous chunks of text taken from Wikipedia edits, and it links each text to its author.
The zipped file contains 60 directories, one per language. Within each directory, we share a CSV file with meta-information about the content. This file contains one row for each Wikipedia page we processed.
In addition, each directory contains multiple zipped jsonl files with the row-level content. Each row in a jsonl file holds the 12 fields of a record, such as id, timestamp, page, and new_text. The last of these, new_text, is the text the user added to the Wikipedia page.
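As a rough illustration of the layout described above, the following sketch reads the records from one compressed jsonl file. It assumes the files are gzip-compressed, which the description does not state; the function name `iter_records` is ours and not part of the dataset.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed jsonl file."""
    # Assumption: the file is gzip-compressed. If the archive uses another
    # format, replace gzip.open with the matching opener.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Calling `iter_records(Path(...))` on one of the files in a language directory should then yield dictionaries with keys such as `id`, `timestamp`, `page`, and `new_text`. The function reads lazily, so large files do not have to fit in memory.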
A concise version of the dataset, which was used for training and testing our proposed algorithms, can be found in a HuggingFace repository: https://huggingface.co/datasets/Blablablab/MAC
Here are a few examples taken from the South African language ('za'). We show a subset of the keys included per instance:
id | timestamp | user | page | sha1 | new_text |
---|---|---|---|---|---|
99223 | 2023-01-25T13:12:48Z | {'id': 18564, 'text': 'Laasry'} | {'id': 607, 'title': 'Ikhasi Elikhulu', 'namespace': 1, 'restrictions': []} | bww0jvlrrfvemlf0x5zxx7rkzh0n8jm | [\':"uNomaxhama" should be the format as it is the capitalisation style adopted by lexicography units and...] |
11315 | 2009-01-20T13:43:23Z | {'id': 88, 'text': 'Andre Engels'} | {'id': 794, 'title': 'Uphiko Lwezilimi Kuzwelonke', 'namespace': 0, 'restrictions': []} | 80fh22mjubxw556yhbzsr4et3wgpi8f | ['Uphiko lweziLimi kuZweIonke (NLS) lugqugquzela futhi lwenza lula ukuxoxisana ngezilimi ezahlukene\...] |
65078 | 2020-08-31T20:55:00Z | {'text': '197.77.175.185'} | {'id': 1031, 'title': 'IsiZulu', 'namespace': 0, 'restrictions': []} | ep8r9ii1jri1zjjpfvj46zpwtdyio2d | ['Lolu limi lusukela noma luqanjwe ngowayeyiSilo samabandla onke iNkosi uShaka Zulu\\nLokhu kungenxa yegalelo noma iqhaza alibamba ekubumbeni isizwe esasihlukene...] |
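In the three examples above, the user field carries both an id and a name for two of the authors, and only an IP address under text for the third. The sketch below derives a stable author key on that basis. It assumes this pattern holds across the dataset, which the description does not confirm; `author_key` is a hypothetical helper, not part of the dataset.

```python
from typing import Any


def author_key(record: dict) -> tuple:
    """Return a (kind, identifier) pair describing the author of a record."""
    user: Any = record.get("user") or {}
    # Assumption: a numeric 'id' marks a registered account, as in the
    # first two example rows.
    if "id" in user:
        return ("registered", str(user["id"]))
    # Assumption: without an 'id', 'text' holds the IP address of an
    # anonymous editor, as in the third example row.
    return ("anonymous", str(user.get("text", "")))
```

Grouping records by this key is one way to collect all texts by the same author. Keeping the kind in the key prevents a numeric user id from colliding with an identical string in the text field.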
Files
Name | Size |
---|---|
MAC-full-dataset.zip (md5:7fa18958cca2de5000f7ea3cc4e4d6d1) | 24.0 GB |
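Because the archive is 24.0 GB, it may be worth checking the download against the published md5 checksum before unpacking it. The sketch below does this with Python's standard `hashlib` module, reading the file in chunks so that it never has to fit in memory. The function names and the chunk size are illustrative choices.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing for MAC-full-dataset.zip.
EXPECTED_MD5 = "7fa18958cca2de5000f7ea3cc4e4d6d1"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: Path) -> bool:
    """Return True if the file's md5 equals the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

If `matches_published_checksum` returns False for the downloaded archive, the file is likely truncated or corrupted and should be downloaded again.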