Webis Clickbait Spoiling Corpus 2022

Matthias Hagen; Maik Fröbe; Artur Jurk; Martin Potthast

doi:10.5281/zenodo.6362726

Published March 16, 2022 | Version 1.0.0

Dataset Open

1. Martin-Luther-Universität Halle-Wittenberg
2. Leipzig University

# Webis Clickbait Spoiling Corpus 2022

The Webis Clickbait Spoiling Corpus 2022 (Webis-Clickbait-22) contains 5,000 spoiled clickbait posts crawled from Facebook, Reddit, and Twitter.
The corpus supports the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post.

The dataset contains the clickbait posts, manually cleaned versions of the linked documents, and the extracted spoilers for each clickbait post.
Additionally, the spoilers are categorized into three types: short phrase spoilers, longer passage spoilers, and spoilers made up of multiple non-consecutive pieces of text.

We plan to organize a shared task on clickbait spoiling. Hence, we omit the 1,000 test posts from this version of the dataset and will publish them later.

# Overview

The dataset comes with predefined train/validation/test splits:

- training.jsonl contains 3,200 posts for training
- validation.jsonl contains 800 posts for validation
- test.jsonl contains 1,000 posts for testing
  - The test set is omitted from this version of the dataset because we want to organize a shared task on clickbait spoiling and keep the test set private until the shared task ends.
- clickbait-spoiling-21.jsonl contains the complete corpus with 5,000 clickbait posts
  - The clickbait-spoiling-21.jsonl file is omitted from this version of the dataset for the same reason: it includes the test posts, which stay private until the shared task ends.
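
The released splits can be loaded and sanity-checked against the sizes listed above. The following is a minimal sketch, not part of the official release: it assumes the archive has been extracted so that training.jsonl and validation.jsonl sit together in one directory (`data_dir` is a hypothetical path), and that each line holds one JSON object. The helper names `read_jsonl` and `check_splits` are illustrative only, and the sketch makes no assumption about the fields inside each post.

```python
import json
from pathlib import Path

# Split sizes as stated in the overview; test.jsonl is not part of this release.
EXPECTED_SIZES = {"training.jsonl": 3200, "validation.jsonl": 800}


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_splits(data_dir):
    """Map each released split file to (number of posts, matches expected size)."""
    report = {}
    for name, expected in EXPECTED_SIZES.items():
        posts = read_jsonl(Path(data_dir) / name)
        report[name] = (len(posts), len(posts) == expected)
    return report
```

Calling `check_splits` on the directory that holds the extracted files returns a small report per split, which makes a truncated download or a wrong directory easy to spot before training.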

Files

webis-clickbait-22.zip (9.2 MB), md5:fe802000121d0cb1a36e3c4b9c1eb4fe
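
The MD5 checksum listed for the archive can be used to confirm that a download is complete. The sketch below is one way to do this with Python's standard `hashlib` module; the function names `md5_of_file` and `is_intact` and the local file path are assumptions made for illustration.

```python
import hashlib

# Checksum listed for webis-clickbait-22.zip on this page.
EXPECTED_MD5 = "fe802000121d0cb1a36e3c4b9c1eb4fe"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `is_intact("webis-clickbait-22.zip")` should return `True` for an undamaged copy of the archive, assuming it was saved under that name in the current directory.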

