Wikipedia Multilingual Vandalism Detection Dataset

Mykola Trokhymovych; Muniza Aslam; Ai-Jou Chou; Ricardo Baeza-Yates; Diego Saez-Trumper

doi:10.5281/zenodo.8174336

Published July 22, 2023 | Version 0.1

Dataset Open

1. Pompeu Fabra University
2. Wikimedia Foundation
3. EAI, Northeastern University

This dataset accompanies the research paper "Fair Multilingual Vandalism Detection System for Wikipedia" (KDD '23), which introduces a system designed to help the Wikipedia community combat vandalism on the platform. The dataset was prepared to improve the accuracy and efficiency of Wikipedia patrolling across 47 languages.

The dataset is released to encourage further research on vandalism detection techniques and to foster a safer, more inclusive environment for the Wikipedia community. Researchers and practitioners can use it to train and validate vandalism detection models and to help improve content moderation strategies on online platforms.

Dataset Details:

Number of Languages: 47
Observation period: six months for training, one week for hold-out testing
Use Case: The dataset is primarily intended for training and evaluating vandalism detection systems.
Features: Each record describes one revision of a Wikipedia page, including revision metadata, user details, the text inserted, removed, or changed, and features derived from masked language models (MLMs).
Data Filtering and Feature Engineering: Filtering and feature engineering, including multilingual masked language modeling, were applied to build the training data from human-generated data.
Files: Training and hold-out test sets, each provided in two variants: anonymous users only and all users.
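
Because the largest file in this release is several gigabytes, it may be more practical to inspect a file by streaming it than by loading it whole. The following is a minimal sketch using only the Python standard library. It assumes the files are UTF-8, comma-delimited, and start with a header row, none of which this page confirms; the function name `peek_csv` is hypothetical.

```python
import csv
import itertools


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows records) without reading the whole file.

    Assumption: the file is UTF-8, comma-delimited, and has a header row.
    """
    # Revision text fields may be long, so raise the csv module's
    # per-field size limit above its default.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # islice stops after n_rows records, so only the start of the file is read.
        rows = list(itertools.islice(reader, n_rows))
        return reader.fieldnames, rows
```

Calling `peek_csv("test_anon_users.csv")` on a downloaded file would return the column names and a handful of records, which is a cheap way to check the schema before committing to a full pass over the training data.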

Related paper citation:

@inproceedings{10.1145/3580305.3599823,
author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
title = {Fair Multilingual Vandalism Detection System for Wikipedia},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599823},
doi = {10.1145/3580305.3599823},
abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {4981--4990},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}

Notes

This work has been funded by MCIN/AEI/10.13039/501100011033 under the Maria de Maeztu Units of Excellence Programme (CEX2021-001195-M).

Files (15.3 GB)

Name	MD5	Size
test_all_users.csv	59e5d168d580b5f834e0e2c47de3b0cb	1.1 GB
test_anon_users.csv	b0c154de34bcc36ac93be4cbd0cefc34	1.1 GB
train_all_users.csv	08617c687951be41dbaf2f4aa3f77961	9.2 GB
train_anon_users.csv	da36a824d1698238b01f1c4d75d88ba0	3.9 GB
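
The MD5 digests listed above can be used to check that a multi-gigabyte download completed without corruption. The sketch below is one way to do that with the Python standard library. The digests are copied from this page; the helper names `md5_of_file` and `verify_download` are hypothetical, and the files are assumed to sit in the local directory under their original names.

```python
import hashlib
import os

# MD5 digests as listed in the file table on this page.
EXPECTED_MD5 = {
    "test_all_users.csv": "59e5d168d580b5f834e0e2c47de3b0cb",
    "test_anon_users.csv": "b0c154de34bcc36ac93be4cbd0cefc34",
    "train_all_users.csv": "08617c687951be41dbaf2f4aa3f77961",
    "train_anon_users.csv": "da36a824d1698238b01f1c4d75d88ba0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's digest matches the one listed for its base name."""
    name = os.path.basename(path)
    return md5_of_file(path) == expected[name]
```

Reading in chunks keeps memory use constant regardless of file size, which matters for the 9.2 GB training file. MD5 is used here only because it is the checksum this page publishes; it detects transfer errors but is not a safeguard against deliberate tampering.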

Additional details

Is supplement to: Conference paper: 10.1145/3580305.3599823 (DOI); Preprint: 10.48550/arXiv.2306.01650 (DOI)

