Dataset (Open Access)

Dataset for generating TL;DR

Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">tl;dr challenge</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">abstractive summarization</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">social media</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">user-generated content</subfield>
  </datafield>
  <controlfield tag="005">20190409133313.0</controlfield>
  <controlfield tag="001">1168855</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Voelske, Michael</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Potthast, Martin</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Stein, Benno</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">2157862847</subfield>
    <subfield code="z">md5:28951b6f3d5c6fd6f97e1f6314be3661</subfield>
    <subfield code="u">https://zenodo.org/record/1168855/files/tldr-challenge-dataset.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2018-02-08</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:1168855</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Syed, Shahbaz</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Dataset for generating TL;DR</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">http://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This is the dataset for the TL;DR challenge, containing posts from the Reddit corpus and suitable for abstractive summarization using deep learning. The dataset is a JSON Lines file: each line is a JSON object representing one post. The schema of each post is shown below:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;author: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;body: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;normalizedBody: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;content: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;content_len: long (nullable = true)&lt;/li&gt;
	&lt;li&gt;summary: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;summary_len: long (nullable = true)&lt;/li&gt;
	&lt;li&gt;id: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;subreddit: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;subreddit_id: string (nullable = true)&lt;/li&gt;
	&lt;li&gt;title: string (nullable = true)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specifically, the &lt;strong&gt;content&lt;/strong&gt; and &lt;strong&gt;summary&lt;/strong&gt; fields can be used directly as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,084,410 posts, with an average length of 211 words for the content and 25 words for the summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As this is the complete dataset for the challenge, it is up to the participants to split it into training and validation sets as they see fit.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">cites</subfield>
    <subfield code="a">10.5281/zenodo.1043504</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.1168854</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.1168855</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
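The schema described in the record above can be consumed line by line, since each line of the dataset file is a standalone JSON object. A minimal sketch in Python, assuming the archive has been extracted; the sample post below is invented for illustration and is not taken from the dataset:

```python
import json

# A made-up line in the shape described by the record's schema
# (author, body, normalizedBody, content, content_len, summary,
#  summary_len, id, subreddit, subreddit_id, title).
sample_line = json.dumps({
    "author": "example_user",
    "body": "Long post body ... TL;DR: short summary.",
    "normalizedBody": "Long post body ... TL;DR: short summary.",
    "content": "Long post body ...",
    "content_len": 4,
    "summary": "short summary.",
    "summary_len": 2,
    "id": "abc123",
    "subreddit": "AskReddit",
    "subreddit_id": "t5_2qh1i",
    "title": "An example post",
})

def read_post(line):
    """Parse one JSON Lines record and return the (content, summary)
    pair, i.e. the two fields the description says can be fed directly
    to a sequence-to-sequence model."""
    post = json.loads(line)
    return post["content"], post["summary"]

content, summary = read_post(sample_line)
```

In practice one would iterate over the extracted file (`for line in open(path)`) and apply `read_post` to each line, splitting the resulting pairs into training and validation sets as the note above suggests.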
Statistics          All versions  This version
Views               646           646
Downloads           500           500
Data volume         1.1 TB        1.1 TB
Unique views        592           592
Unique downloads    411           411
