Published July 22, 2023 | Version 0.1
Dataset Open

Wikipedia Multilingual Vandalism Detection Dataset

  • 1. Pompeu Fabra University
  • 2. Wikimedia Foundation
  • 3. EAI, Northeastern University

Description

This dataset accompanies a research paper that introduces a novel system designed to support the Wikipedia community in combating vandalism on the platform. The dataset has been prepared to enhance the accuracy and efficiency of Wikipedia patrolling in multiple languages.

This dataset is released to encourage further research and development in vandalism detection, fostering a safer and more inclusive environment for the Wikipedia community. Researchers and practitioners can use it to train and validate vandalism detection models and to help improve content moderation strategies on online platforms.

Dataset Details:

  • Number of Languages: 47
  • Observation period: 6 months for training, 1 week for hold-out testing
  • Use Case: The dataset is primarily intended for training and evaluating vandalism detection systems.
  • Features: Each record describes one revision of a Wikipedia page, including revision metadata, user details, the text inserted, removed, or changed, and corresponding features based on masked language models (MLMs).
  • Data Filtering and Feature Engineering: Advanced filtering and feature engineering techniques were applied to ensure the dataset's quality and relevance for effectively training the vandalism detection system.
  • Files: Training and hold-out testing datasets, each provided for anonymous users only and for all users.
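
Because the individual files range from about 1 GB to about 9 GB, loading one fully into memory may not be practical. The following is a minimal sketch of streaming a file row by row with Python's standard library. It assumes the files are comma-separated, UTF-8 encoded, and begin with a header row; none of this is stated on this page, so check it against the actual files. The function names `iter_rows` and `summarize` are illustrative only.

```python
import csv
import sys


def iter_rows(path, encoding="utf-8"):
    """Yield one dict per CSV record without loading the whole file.

    Assumption: comma-delimited file with a header row.
    """
    # Revision text fields can be long, so raise the default field size cap.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding=encoding) as fh:
        yield from csv.DictReader(fh)


def summarize(path, max_rows=None):
    """Return (column_names, rows_read), stopping early at max_rows if set."""
    columns, count = [], 0
    for row in iter_rows(path):
        if count == 0:
            columns = list(row.keys())
        count += 1
        if max_rows is not None and count >= max_rows:
            break
    return columns, count
```

Calling `summarize("test_all_users.csv", max_rows=1000)` would list the column names and read only the first 1000 records, which is a quick way to inspect the schema before committing to a full pass.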

 

Related paper citation:

@inproceedings{10.1145/3580305.3599823,
author = {Trokhymovych, Mykola and Aslam, Muniza and Chou, Ai-Jou and Baeza-Yates, Ricardo and Saez-Trumper, Diego},
title = {Fair Multilingual Vandalism Detection System for Wikipedia},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599823},
doi = {10.1145/3580305.3599823},
abstract = {This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient to a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {4981–4990},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}

 

Notes

This work has been funded by MCIN/AEI/10.13039/501100011033 under the Maria de Maeztu Units of Excellence Programme (CEX2021-001195-M).

Files

test_all_users.csv

Files (15.3 GB)

  • md5:59e5d168d580b5f834e0e2c47de3b0cb (1.1 GB)
  • md5:b0c154de34bcc36ac93be4cbd0cefc34 (1.1 GB)
  • md5:08617c687951be41dbaf2f4aa3f77961 (9.2 GB)
  • md5:da36a824d1698238b01f1c4d75d88ba0 (3.9 GB)
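
With multi-gigabyte downloads, it is worth checking each file against the MD5 checksum listed above before use. A minimal sketch using Python's `hashlib` follows; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the file is read in fixed-size blocks so memory use stays small.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Pass the path of a downloaded file together with the checksum listed for that same file; a `False` result suggests an incomplete or corrupted download.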

Additional details

Related works

Is supplement to
Conference paper: 10.1145/3580305.3599823 (DOI)
Preprint: 10.48550/arXiv.2306.01650 (DOI)