Dataset (Open Access)

Dataset for generating TL;DR

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>This is the dataset for the TL;DR challenge containing posts&nbsp;from the Reddit corpus, suitable for abstractive summarization using deep learning. The format is a json file where each line is a JSON object representing a post. The schema of each post is shown below:</p>\n\n<ul>\n\t<li>author: string (nullable = true)</li>\n\t<li>body: string (nullable = true)</li>\n\t<li>normalizedBody: string (nullable = true)</li>\n\t<li>content: string (nullable = true)</li>\n\t<li>content_len: long (nullable = true)</li>\n\t<li>summary: string (nullable = true)</li>\n\t<li>summary_len: long (nullable = true)</li>\n\t<li>id: string (nullable = true)</li>\n\t<li>subreddit: string (nullable = true)</li>\n\t<li>subreddit_id: string (nullable = true)</li>\n\t<li>title: string (nullable = true)</li>\n</ul>\n\n<p>Specifically, the <strong>content</strong> and <strong>summary</strong> fields can be directly used as inputs to a deep learning model (e.g. Sequence to Sequence model ). The dataset consists of 3,084,410 posts with an average length of 211 words for content, and 25&nbsp;words for the summary.</p>\n\n<p><strong>Note :&nbsp;</strong>As this is the complete dataset for the challenge, it is up to the participants to split it into training and validation sets accordingly.</p>", 
  "license": "http://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Bauhaus-Universit\u00e4t Weimar", 
      "@type": "Person", 
      "name": "Syed, Shahbaz"
    }, 
    {
      "affiliation": "Bauhaus-Universit\u00e4t Weimar", 
      "@type": "Person", 
      "name": "Voelske, Michael"
    }, 
    {
      "affiliation": "Bauhaus-Universit\u00e4t Weimar", 
      "@type": "Person", 
      "name": "Potthast, Martin"
    }, 
    {
      "affiliation": "Bauhaus-Universit\u00e4t Weimar", 
      "@type": "Person", 
      "name": "Stein, Benno"
    }
  ], 
  "url": "https://zenodo.org/record/1168855", 
  "citation": [
    {
      "@id": "https://doi.org/10.5281/zenodo.1043504", 
      "@type": "CreativeWork"
    }
  ], 
  "datePublished": "2018-02-08", 
  "keywords": [
    "tl;dr challenge", 
    "abstractive summarization", 
    "social media", 
    "user-generated content"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/651566fd-7e6f-470b-acb6-0158d65da6d8/tldr-challenge-dataset.zip", 
      "@type": "DataDownload", 
      "fileFormat": "zip"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.1168855", 
  "@id": "https://doi.org/10.5281/zenodo.1168855", 
  "@type": "Dataset", 
  "name": "Dataset for generating TL;DR"
}
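
As described in the metadata above, the dataset is a JSON Lines file: each line is one JSON object whose content and summary fields form an input/target pair for abstractive summarization, and the train/validation split is left to participants. The following is a minimal Python sketch of how such a file might be read and split; the file name "tldr-challenge-dataset.jsonl" is an assumption (the actual name inside tldr-challenge-dataset.zip may differ), and the 95/5 split ratio is only an example.

import json
import random

def read_posts(path):
    """Yield (content, summary) pairs, skipping posts with missing fields."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            content, summary = post.get("content"), post.get("summary")
            if content and summary:
                yield content, summary

# Example train/validation split. Note: with ~3 million posts, holding all
# pairs in memory is expensive; for full-scale training a streaming split
# (e.g. by hashing post ids) may be preferable.
pairs = list(read_posts("tldr-challenge-dataset.jsonl"))  # assumed file name
random.seed(0)
random.shuffle(pairs)
cut = int(0.95 * len(pairs))
train, valid = pairs[:cut], pairs[cut:]
print(len(train), "training pairs,", len(valid), "validation pairs")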