Published May 28, 2025 | Version v1
Dataset Open

The Million Authors Corpus (MAC)

  • 1. University of Southern California
  • 2. University of Michigan–Ann Arbor

Description

We introduce the Million Authors Corpus (MAC), a novel dataset encompassing Wikipedia contributions in dozens of languages. It includes only long, contiguous chunks of text taken from Wikipedia edits and links each text to its author.

The zipped file contains 60 directories, one per language. Within each directory, we share a csv file with meta-information about the content. This file contains one row for each Wikipedia page we process.
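As a rough sketch of how the per-language metadata files might be inspected after unzipping, the snippet below walks the top-level directories and counts the data rows in every csv file it finds. The `*.csv` glob and the idea that the first csv row is a header are assumptions; the actual file names and column names are not stated here, so the code does not rely on any of them.

```python
import csv
from pathlib import Path


def count_pages_per_language(root):
    """Return {language directory name: number of csv data rows}.

    Assumes each language directory holds one or more *.csv files
    whose first row is a header (an assumption, not documented here).
    """
    counts = {}
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files at the top level
        total = 0
        for csv_path in sorted(lang_dir.glob("*.csv")):
            with open(csv_path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)  # header row is consumed here
                total += sum(1 for _ in reader)
        counts[lang_dir.name] = total
    return counts
```

Calling `count_pages_per_language("MAC-full-dataset")`, where the directory name is a hypothetical extraction path, would give one page count per language directory.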

In addition, each folder contains multiple zipped jsonl files with the raw content. Each row in a jsonl file holds the 12 fields of one record, such as id, timestamp, page, and new_text. The last of these, new_text, is the new text added by the user to the Wikipedia page.
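A minimal sketch of reading one of these files follows. It assumes the files are gzip-compressed (for example `*.jsonl.gz`); the compression format is not specified here, so a different opener would be needed if the files are stored another way. The function names are illustrative only.

```python
import gzip
import json
from itertools import islice


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file.

    Assumption: the file is gzip-compressed UTF-8 text.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def first_records(path, limit=3):
    """Return the first `limit` records without reading the whole file."""
    return list(islice(iter_records(path), limit))
```

Streaming line by line keeps memory use flat, which matters for an archive of this size; `first_records` is handy for checking which of the 12 fields a given file actually carries.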

A concise version of the dataset, which was used for training and testing our proposed algorithms, can be found in a HuggingFace repository: https://huggingface.co/datasets/Blablablab/MAC

Here are a few examples, taken from a South African language ('za'). We show a subset of the keys included in each instance:

id: 99223
timestamp: 2023-01-25T13:12:48Z
user: {'id': 18564, 'text': 'Laasry'}
page: {'id': 607, 'title': 'Ikhasi Elikhulu', 'namespace': 1, 'restrictions': []}
sha1: bww0jvlrrfvemlf0x5zxx7rkzh0n8jm
new_text: [\':"uNomaxhama" should be the format as it is the capitalisation style adopted by lexicography units and...]

id: 11315
timestamp: 2009-01-20T13:43:23Z
user: {'id': 88, 'text': 'Andre Engels'}
page: {'id': 794, 'title': 'Uphiko Lwezilimi Kuzwelonke', 'namespace': 0, 'restrictions': []}
sha1: 80fh22mjubxw556yhbzsr4et3wgpi8f
new_text: ['Uphiko lweziLimi kuZweIonke (NLS) lugqugquzela futhi lwenza lula ukuxoxisana ngezilimi ezahlukene\...]

id: 65078
timestamp: 2020-08-31T20:55:00Z
user: {'text': '197.77.175.185'}
page: {'id': 1031, 'title': 'IsiZulu', 'namespace': 0, 'restrictions': []}
sha1: ep8r9ii1jri1zjjpfvj46zpwtdyio2d
new_text: ['Lolu limi lusukela noma luqanjwe ngowayeyiSilo samabandla onke iNkosi uShaka Zulu\\nLokhu kungenxa yegalelo noma iqhaza alibamba ekubumbeni isizwe esasihlukene...]
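In the examples above, the user field carries both an id and a text (user name) in the first two records, but only a text holding an IP address in the third. The sketch below shows one possible way to turn that field into an author key and to group texts by author. Treating a missing id as an anonymous edit, and treating new_text as a list of strings, are interpretations of the sample records rather than documented guarantees, and the function names are hypothetical.

```python
from collections import defaultdict


def author_key(user):
    """Map a `user` field to a hashable author key.

    Records with an 'id'    -> ("registered", id)
    Records without an 'id' -> ("anonymous", text), e.g. an IP address
    (This split is an assumption based on the sample records.)
    """
    if "id" in user:
        return ("registered", user["id"])
    return ("anonymous", user.get("text"))


def group_texts_by_author(records):
    """Collect every new_text chunk under its author key.

    Assumes `new_text` is a list of strings, as the samples suggest.
    """
    by_author = defaultdict(list)
    for rec in records:
        by_author[author_key(rec["user"])].extend(rec["new_text"])
    return dict(by_author)
```

Keying registered users by id rather than by name avoids splitting one author across renamed accounts; whether IP-only edits should count as distinct authors at all is a choice left to the reader's task.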

 

Files

MAC-full-dataset.zip (24.0 GB)
md5:7fa18958cca2de5000f7ea3cc4e4d6d1
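Since the archive is large, it may be worth checking a download against the md5 checksum listed above before unzipping. The following is a minimal sketch using Python's standard library; the local file path passed to the functions is whatever location the archive was saved to, and the function names are illustrative.

```python
import hashlib

# Checksum listed for MAC-full-dataset.zip in the Files section.
EXPECTED_MD5 = "7fa18958cca2de5000f7ea3cc4e4d6d1"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 matches the expected checksum."""
    return md5_of_file(path) == expected
```

Reading in 1 MiB chunks keeps memory use constant regardless of the 24.0 GB file size.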