Dataset Open Access

Dataset for generating TL;DR

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.1168855", 
  "language": "eng", 
  "title": "Dataset for generating TL;DR", 
  "issued": {
    "date-parts": [
      [
        2018, 
        2, 
        8
      ]
    ]
  }, 
  "abstract": "<p>This is the dataset for the TL;DR challenge containing posts&nbsp;from the Reddit corpus, suitable for abstractive summarization using deep learning. The format is a JSON file where each line is a JSON object representing a post. The schema of each post is shown below:</p>\n\n<ul>\n\t<li>author: string (nullable = true)</li>\n\t<li>body: string (nullable = true)</li>\n\t<li>normalizedBody: string (nullable = true)</li>\n\t<li>content: string (nullable = true)</li>\n\t<li>content_len: long (nullable = true)</li>\n\t<li>summary: string (nullable = true)</li>\n\t<li>summary_len: long (nullable = true)</li>\n\t<li>id: string (nullable = true)</li>\n\t<li>subreddit: string (nullable = true)</li>\n\t<li>subreddit_id: string (nullable = true)</li>\n\t<li>title: string (nullable = true)</li>\n</ul>\n\n<p>Specifically, the <strong>content</strong> and <strong>summary</strong> fields can be directly used as inputs to a deep learning model (e.g. a sequence-to-sequence model). The dataset consists of 3,084,410 posts with an average length of 211 words for the content and 25&nbsp;words for the summary.</p>\n\n<p><strong>Note:</strong> As this is the complete dataset for the challenge, it is up to the participants to split it into training and validation sets.</p>", 
  "author": [
    {
      "family": "Syed",
      "given": "Shahbaz"
    }, 
    {
      "family": "Voelske",
      "given": "Michael"
    }, 
    {
      "family": "Potthast",
      "given": "Martin"
    }, 
    {
      "family": "Stein",
      "given": "Benno"
    }
  ], 
  "type": "dataset", 
  "id": "1168855"
}
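
The abstract describes a line-delimited JSON file whose content and summary fields serve as model input and target, and it leaves the training/validation split to participants. A minimal sketch of one way to read such a file and split it reproducibly is shown here. The helper names (iter_pairs, is_validation, split), the 10% validation fraction, and the two sample records are assumptions made for illustration and are not part of the dataset or the challenge rules.

```python
import hashlib
import io
import json


def iter_pairs(lines):
    """Yield (id, content, summary) from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        content, summary = post.get("content"), post.get("summary")
        # Every field in the schema is nullable, so skip incomplete posts.
        if content and summary:
            yield post.get("id"), content, summary


def is_validation(post_id, fraction=0.1):
    """Assign a post to the validation set deterministically by hashing its id."""
    digest = hashlib.sha1(str(post_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000 < fraction * 10_000


def split(lines, fraction=0.1):
    """Return (train, valid) lists of (content, summary) pairs."""
    train, valid = [], []
    for post_id, content, summary in iter_pairs(lines):
        bucket = valid if is_validation(post_id, fraction) else train
        bucket.append((content, summary))
    return train, valid


# Two made-up records standing in for the real file (assumption).
sample = io.StringIO(
    '{"id": "a1", "content": "long post text", "summary": "short"}\n'
    '{"id": "a2", "content": null, "summary": "orphan"}\n'
)
train, valid = split(sample)
```

With the real data, an open file handle can be passed in place of the in-memory sample, since the functions only iterate over lines. Hashing the id keeps the split stable across runs without storing a separate index; for the full 3,084,410 posts, writing each bucket to disk instead of collecting lists in memory may be preferable.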
                  All versions   This version
Views             638            638
Downloads         494            494
Data volume       1.1 TB         1.1 TB
Unique views      584            584
Unique downloads  405            405
