autogoal/datasets: CNN dailymail

Ernesto Luis Estevanell Valladares; Alejandro Piad

doi:10.5281/zenodo.15359887

Published May 7, 2025 | Version cnn-dailymail

Software Open

1. University of Havana (@matcom)

CNN/Daily Mail is a dataset for text summarization. Human-written abstractive summary bullets from news stories on the CNN and Daily Mail websites were used as questions (with one of the entities hidden), and the stories as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract, and generate pairs of passages and questions from these websites.
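As a rough illustration of the fill-in-the-blank construction described above, the sketch below hides one entity in a summary bullet to form a question and keeps the entity as the answer. It is not the released crawling scripts: the function name `make_cloze`, the `@placeholder` token, and the example sentence are all assumptions made for illustration.

```python
def make_cloze(bullet, entity, placeholder="@placeholder"):
    """Turn a summary bullet into a (question, answer) pair.

    The first occurrence of `entity` is replaced by `placeholder`.
    Both the helper and the placeholder token are hypothetical and
    are not taken from the dataset's own scripts.
    """
    if entity not in bullet:
        raise ValueError(f"entity {entity!r} does not occur in the bullet")
    question = bullet.replace(entity, placeholder, 1)
    return question, entity


if __name__ == "__main__":
    # Invented example sentence, not drawn from the corpus.
    question, answer = make_cloze("Alice met Bob in Paris", "Bob")
    print(question)
    print(answer)
```

The story text would then be paired with the question as the passage in which the system looks for the hidden entity.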

In all, the corpus has 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs, as defined by their scripts. On average, the source documents in the training set have 766 words spanning 29.74 sentences, while the summaries consist of 53 words and 3.72 sentences.

@article{nallapati2016abstractive,
  title={Abstractive text summarization using sequence-to-sequence rnns and beyond},
  author={Nallapati, Ramesh and Zhou, Bowen and Gulcehre, Caglar and Xiang, Bing and others},
  journal={arXiv preprint arXiv:1602.06023},
  year={2016}
}

Files

Files (1.5 kB)

Name	Size	MD5
autogoal/datasets-cnn-dailymail.zip	1.5 kB	294ec39a00eb8199143aba18a8ad1935

Additional details

Is supplement to (software): https://github.com/autogoal/datasets/tree/cnn-dailymail

Repository URL: https://github.com/autogoal/datasets

	All versions	This version
Views	1,377	353
Downloads	201	21
Data volume	299.6 kB	37.9 kB
