Dataset · Open Access

Dataset for generating TL;DR

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.1168855", 
  "language": "eng", 
  "title": "Dataset for generating TL;DR", 
  "issued": {
    "date-parts": [
  "abstract": "<p>This is the dataset for the TL;DR challenge containing posts&nbsp;from the Reddit corpus, suitable for abstractive summarization using deep learning. The format is a json file where each line is a JSON object representing a post. The schema of each post is shown below:</p>\n\n<ul>\n\t<li>author: string (nullable = true)</li>\n\t<li>body: string (nullable = true)</li>\n\t<li>normalizedBody: string (nullable = true)</li>\n\t<li>content: string (nullable = true)</li>\n\t<li>content_len: long (nullable = true)</li>\n\t<li>summary: string (nullable = true)</li>\n\t<li>summary_len: long (nullable = true)</li>\n\t<li>id: string (nullable = true)</li>\n\t<li>subreddit: string (nullable = true)</li>\n\t<li>subreddit_id: string (nullable = true)</li>\n\t<li>title: string (nullable = true)</li>\n</ul>\n\n<p>Specifically, the <strong>content</strong> and <strong>summary</strong> fields can be directly used as inputs to a deep learning model (e.g. Sequence to Sequence model ). The dataset consists of 3,084,410 posts with an average length of 211 words for content, and 25&nbsp;words for the summary.</p>\n\n<p><strong>Note :&nbsp;</strong>As this is the complete dataset for the challenge, it is up to the participants to split it into training and validation sets accordingly.</p>", 
  "author": [
      "family": "Syed, Shahbaz"
      "family": "Voelske, Michael"
      "family": "Potthast, Martin"
      "family": "Stein, Benno"
  "type": "dataset", 
  "id": "1168855"
                  All versions   This version
Views                      730            731
Downloads                  540            540
Data volume             1.2 TB         1.2 TB
Unique views               669            670
Unique downloads           440            440

