Published May 22, 2025 | Version v1
Dataset
Open
Wikidata Vandalism Detection Dataset
Description
This dataset accompanies a research paper that introduces a new system designed to support the Wikidata community in combating vandalism on the platform.
Keywords: Wikidata, content differences, vandalism detection, data mining, content analysis, computational social science, NLP.
Dataset Details:
- Number of files: 20 (6.85 GB)
- Format: CSV
- License: CC BY 4.0
- Use Case: data mining, vandalism detection and analysis, content moderation. The dataset is primarily intended for training and evaluating vandalism detection systems for Wikidata.
- Observation period: 21 months of training data and 3 months of hold-out test data, covering 01.09.2021 to 01.09.2023 (snapshot from 2024-04).
- Features: Each record describes one revision of a Wikidata record, including revision metadata, user details, content modifications (insert, remove, or change), and the corresponding MLM-based features.
- Data Filtering and Feature Engineering: Advanced filtering and feature engineering techniques were applied to ensure the dataset's quality and relevance for effectively training the vandalism detection system.
- Files:
  - 2024-04_content_batch_{i}.csv: Revision content features (split into 15 batches).
  - 2024-04_metadata.csv: Revision metadata features.
  - expert_scores.csv: Revisions labeled by experts (the column label corresponds to the expert label).
  - full_labels_2024-04_text_en.csv: Wikidata ID to English label mapping.
  - mlm_text_features.csv: Pretrained MLM scores.
  - ores_scores.csv: ORES (previous model in production) scores.
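The file layout and observation period described above could be handled along the following lines. This is a minimal sketch, not the authors' pipeline. It assumes that 01.09.2021 means 1 September 2021 (day.month.year), so the 21 training months end on 1 June 2023 and the 3 hold-out months end on 1 September 2023. The column name `revision_timestamp` is hypothetical, as is the assumption that timestamps are ISO 8601 strings; check the actual CSV headers before use. Only the standard library is used so that the large files can be streamed row by row.

```python
import csv
import re
from datetime import datetime, timezone
from pathlib import Path

# Assumed boundaries: 21 training months from 2021-09-01 end on 2023-06-01,
# and the 3 hold-out months end on 2023-09-01 (day.month.year reading).
TRAIN_START = datetime(2021, 9, 1, tzinfo=timezone.utc)
TEST_START = datetime(2023, 6, 1, tzinfo=timezone.utc)
TEST_END = datetime(2023, 9, 1, tzinfo=timezone.utc)

# Matches the documented pattern 2024-04_content_batch_{i}.csv.
BATCH_RE = re.compile(r"2024-04_content_batch_(\d+)\.csv$")


def list_content_batches(data_dir):
    """Return the content batch files in numeric batch order."""
    found = []
    for path in Path(data_dir).iterdir():
        match = BATCH_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def assign_split(timestamp):
    """Map a revision time to 'train', 'test', or None (outside the window)."""
    if TRAIN_START <= timestamp < TEST_START:
        return "train"
    if TEST_START <= timestamp < TEST_END:
        return "test"
    return None


def iter_rows_with_split(csv_path, timestamp_column="revision_timestamp"):
    """Stream rows of one CSV file together with their split label.

    timestamp_column is a hypothetical name, not taken from the dataset.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row, assign_split(parse_timestamp(row[timestamp_column]))
```

Sorting batches by their numeric index avoids the lexicographic trap where batch 10 would sort before batch 2. The split uses half-open intervals, so a revision stamped exactly at the boundary falls into the hold-out set rather than into both.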
Attribution
The dataset was compiled from the Wikidata dump. All structured data from the Wikidata main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License.
Related paper citation:
TBD
Files
(6.9 GB)
Checksum | Size |
---|---|
md5:2dc670d560ab5ab1d0b5cb004aa9b310 | 266.2 MB |
md5:f532ca47916acd360cf1cc7fec1b56a2 | 264.6 MB |
md5:b84a4723d0e0891bfb5262fe6af119d3 | 266.6 MB |
md5:fc15eba69e73e22b183a0e56a6a36ef2 | 266.1 MB |
md5:d655c3bf327eaab635ea78b43b6e0ccc | 265.6 MB |
md5:0fd0b3a31e6cd4821b66f1b0aa0c5ceb | 265.0 MB |
md5:4905b333d10585ecadf6f73a7f35256f | 266.1 MB |
md5:ec9da2e94ecf43041173f92cd15883d7 | 264.1 MB |
md5:e32cd259804133e8ef17f140e889acea | 265.6 MB |
md5:29163d8e1951674dbc6ed277e4cb7152 | 265.3 MB |
md5:244939bf04213232bbdd449a3fe207ce | 265.3 MB |
md5:f6f6dab034bb3d431a3e80106a5892d3 | 264.6 MB |
md5:748932a8b08fb35ab73fa558d539505d | 265.5 MB |
md5:077056e97b42035b51b55d15433e1463 | 264.9 MB |
md5:97cce640b23703efe912d80251c69e47 | 264.5 MB |
md5:11a0004cdd7873a7afbd85f3104d9ca1 | 2.2 GB |
md5:86c5c3602685f519ff9b31af5a694f7e | 42.0 kB |
md5:3c11b1556c7d59f16cefb24fb0e36dc9 | 132.9 MB |
md5:3f7e06e30910106127fe49da9cd55555 | 328.8 MB |
md5:f0dc2fd9ab8a34103e7c9a451c26f88a | 180.0 MB |
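The MD5 checksums listed above can be used to confirm that a download completed intact. The sketch below is one way to do this with Python's standard `hashlib` module. The function names are illustrative, and because this listing does not pair checksums with file names, the expected value has to be supplied by the caller from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum such as 'md5:<hex>' or plain hex."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[4:]
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use flat, which matters for the 2.2 GB file. A mismatch usually indicates a truncated or corrupted download, so fetching the file again is the first thing to try.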