Webis-TLDR-17 Corpus

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno

doi:10.5281/zenodo.1043504

Published November 7, 2017 | Version v1

Dataset | Open

1. Bauhaus-Universität Weimar

This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The corpus is a single file in which each line is a JSON object representing one post. The schema of each post is shown below:

author: string (nullable = true)
body: string (nullable = true)
normalizedBody: string (nullable = true)
content: string (nullable = true)
content_len: long (nullable = true)
summary: string (nullable = true)
summary_len: long (nullable = true)
id: string (nullable = true)
subreddit: string (nullable = true)
subreddit_id: string (nullable = true)
title: string (nullable = true)
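
Because every line is an independent JSON object, the file can be streamed one post at a time rather than loaded whole. The following is a minimal sketch of that idea, not the authors' own loader. The function name `iter_posts` and the two sample records are invented for illustration; for the real corpus one would pass an open handle to the extracted file instead of the in-memory sample.

```python
import io
import json
from typing import Iterator, TextIO


def iter_posts(lines: TextIO) -> Iterator[dict]:
    """Yield one post (as a dict) per non-empty line of a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Two made-up records that follow the schema above (not real corpus data).
sample = io.StringIO(
    '{"author": "user_a", "content": "some long post text", '
    '"summary": "short", "title": null}\n'
    '{"author": "user_b", "content": "another post", '
    '"summary": "tl;dr", "title": "A title"}\n'
)

# Collect (content, summary) pairs, the two fields used for summarization.
pairs = [(post["content"], post["summary"]) for post in iter_posts(sample)]
print(pairs)
```

Since every field is declared nullable, code that consumes the real file should be prepared for `None` values, for example in `title`.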

Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,848,330 posts with an average length of 270 words for the content and 28 words for the summary. The dataset combines both Submissions and Comments, merged on the common schema. As a result, most of the comments that do not belong to any submission have null as their title.

Note: This corpus does not contain a separate test set. It is therefore up to users to divide the corpus into appropriate training, validation, and test sets.
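
One possible way to make such a division reproducible is to derive the split from a hash of each post's id, so that the same post always lands in the same set regardless of file order. This is only a sketch of one option, not a split endorsed by the corpus authors. The function name `assign_split` and the 5% validation and 5% test proportions are assumptions chosen for illustration, and the approach presumes the id field is non-null.

```python
import hashlib


def assign_split(post_id: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a post id to 'train', 'validation', or 'test' deterministically.

    The id is hashed into one of 100 buckets; the lowest buckets go to the
    test set, the next ones to validation, and the rest to training.
    """
    digest = hashlib.md5(post_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical ids, for illustration only.
for example_id in ["id_0001", "id_0002", "id_0003"]:
    print(example_id, assign_split(example_id))
```

A hash-based assignment avoids having to shuffle or hold all 3,848,330 posts in memory, at the cost of split sizes that are only approximately equal to the requested percentages.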

Files

Files (3.1 GB)

Name	Size	MD5
corpus-webis-tldr-17.zip	3.1 GB	e2fb1d5026cdb895ea640bdb134d0398
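
After downloading, the archive can be checked against the MD5 value listed above. The sketch below computes the digest in chunks so that the 3.1 GB file does not need to fit in memory. The helper name `md5_of_file` and the local path in the usage comment are assumptions; only the checksum itself comes from this record.

```python
import hashlib

# Checksum as listed for corpus-webis-tldr-17.zip on this record.
EXPECTED_MD5 = "e2fb1d5026cdb895ea640bdb134d0398"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved under this (hypothetical) local path:
# if md5_of_file("corpus-webis-tldr-17.zip") != EXPECTED_MD5:
#     raise ValueError("Checksum mismatch: the download may be corrupted")
```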

Additional details

Is documented by: https://www.uni-weimar.de/en/media/chairs/computer-science-and-media/webis/corpora/corpus-webis-tldr-17/ (URL)

	All versions	This version
Views	6,520	6,482
Downloads	7,782	7,761
Data volume	70.1 TB	70.0 TB
