Published May 22, 2025 | Version v1
Dataset Open

Wikidata Vandalism Detection Dataset

  • 1. Pompeu Fabra University

Description

This dataset accompanies a research paper that introduces a new system designed to support the Wikidata community in combating vandalism on the platform. 

Keywords: Wikidata, content differences, vandalism detection, data mining, content analysis, computational social science, NLP.

Dataset Details:

  • Number of files: 20 (6.85 GB)
  • Format: CSV
  • License: CC BY 4.0
  • Use Case: data mining, vandalism detection and analysis, content moderation. The dataset is primarily intended for training and evaluating vandalism detection systems for Wikidata.
  • Observation period: 21 months of training, 3 months hold-out testing. Data from 01.09.2021 to 01.09.2023 (Snapshot from 2024-04)
  • Features: Each row characterizes one revision of a Wikidata record, including revision metadata, user details, content modifications (insert, remove, or change), and the corresponding MLM-based features.
  • Data Filtering and Feature Engineering: Filtering and feature engineering were applied to improve the dataset's quality and relevance for training the vandalism detection system.
  • Files: 
    • 2024-04_content_batch_{i}.csv - Revision content features (split into 15 batches).
    • 2024-04_metadata.csv - Revision metadata features.
    • expert_scores.csv - Revisions labeled by experts (the column label holds the expert label).
    • full_labels_2024-04_text_en.csv - Wikidata ID to English label mapping.
    • mlm_text_features.csv - Pretrained MLM scores.
    • ores_scores.csv - Scores from ORES (the model previously in production).
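
The content features are spread over 15 batch files, so a reader will usually want to process them as one stream. The following is a minimal sketch, not the authors' loading code. It assumes the batches are comma-delimited, UTF-8 encoded, and start with a header row; none of this is stated above. The function names `find_content_batches` and `iter_content_rows` are hypothetical, and the batch index range is discovered from the directory rather than assumed.

```python
import csv
import re
from pathlib import Path

# File name pattern taken from the listing: 2024-04_content_batch_{i}.csv
BATCH_PATTERN = re.compile(r"^2024-04_content_batch_(\d+)\.csv$")


def find_content_batches(data_dir):
    """Return the content batch files in data_dir, sorted by numeric batch index."""
    indexed = []
    for path in Path(data_dir).iterdir():
        match = BATCH_PATTERN.match(path.name)
        if match:
            indexed.append((int(match.group(1)), path))
    indexed.sort(key=lambda pair: pair[0])
    return [path for _, path in indexed]


def iter_content_rows(data_dir):
    """Yield one dict per row across all batches, in batch order.

    Assumption: each batch is a UTF-8 CSV with a header row.
    Rows are streamed, so the batches are never all held in memory.
    """
    for path in find_content_batches(data_dir):
        with path.open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)
```

Sorting by the numeric index keeps batch 10 after batch 2, which a plain lexicographic sort of the file names would not. For in-memory analysis, the same file list could be passed to a dataframe library instead of the `csv` module.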

Attribution

The dataset was compiled from the Wikidata dump. All structured data from the Wikidata main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License.

Related paper citation:

TBD

Files

Checksum                              Size
md5:2dc670d560ab5ab1d0b5cb004aa9b310  266.2 MB
md5:f532ca47916acd360cf1cc7fec1b56a2  264.6 MB
md5:b84a4723d0e0891bfb5262fe6af119d3  266.6 MB
md5:fc15eba69e73e22b183a0e56a6a36ef2  266.1 MB
md5:d655c3bf327eaab635ea78b43b6e0ccc  265.6 MB
md5:0fd0b3a31e6cd4821b66f1b0aa0c5ceb  265.0 MB
md5:4905b333d10585ecadf6f73a7f35256f  266.1 MB
md5:ec9da2e94ecf43041173f92cd15883d7  264.1 MB
md5:e32cd259804133e8ef17f140e889acea  265.6 MB
md5:29163d8e1951674dbc6ed277e4cb7152  265.3 MB
md5:244939bf04213232bbdd449a3fe207ce  265.3 MB
md5:f6f6dab034bb3d431a3e80106a5892d3  264.6 MB
md5:748932a8b08fb35ab73fa558d539505d  265.5 MB
md5:077056e97b42035b51b55d15433e1463  264.9 MB
md5:97cce640b23703efe912d80251c69e47  264.5 MB
md5:11a0004cdd7873a7afbd85f3104d9ca1  2.2 GB
md5:86c5c3602685f519ff9b31af5a694f7e  42.0 kB
md5:3c11b1556c7d59f16cefb24fb0e36dc9  132.9 MB
md5:3f7e06e30910106127fe49da9cd55555  328.8 MB
md5:f0dc2fd9ab8a34103e7c9a451c26f88a  180.0 MB
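
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The sketch below is one way to do that with the Python standard library. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and because the listing above does not pair checksums with file names, the caller has to supply the checksum that belongs to each file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Chunked reads keep memory use flat even for the multi-gigabyte file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed checksum such as 'md5:2dc6...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

MD5 is adequate here as an integrity check against transfer errors; it is not a defense against deliberate tampering.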