Published January 10, 2019 | Version 1.0
Dataset Open

RepLab Summarization Dataset

  • 1. Universidad Nacional de Educación a Distancia (UNED)



This package contains the dataset generated in the research published in the paper:

"Javier Rodríguez-Vidal, Jorge Carrillo-de-Albornoz, Enrique Amigó, Laura Plaza, Julio Gonzalo and Felisa Verdejo. 2019. Automatic Generation of Entity-Oriented Summaries for Reputation Management. Ambient Intelligence & Humanized Computing."

The dataset is available for research purposes. If you use it, please cite the paper above.

This README file contains: 

1) A brief description of the corpus
2) A description of the contents of each directory in this package.

1. Description of RepLab Summarization Dataset

The RepLab Summarization Dataset contains company data from the RepLab 2013 dataset, in which Twitter users discuss different topics related to the companies.
Each topic consists of a varying number of tweets posted by Twitter users.

The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.

For each entity, tweets are grouped into topics, and for each topic three different summaries were manually generated: an abstractive summary in English, an abstractive summary in Spanish, and an extractive summary.

Please see the paper for further details.


2. Description of the contents of this package


This directory contains, for each organization, the information needed to create a summary. Each .xml file corresponds to one entity and includes the following information:

     -"Corpus entity": ID of the entity.
     -"cluster": each of the topics of the entity.
        -"label": name of the topic.
        -"priority": level of relevance of the topic: Alert (the highest priority: a reputation alert, i.e., an issue that requires an immediate response from the entity), Midly_important (an intermediate priority: relevant for the entity)
                     or unimportant (the lowest priority).
        -"tweet": information about each tweet.
            -"id": ID of the tweet.
            -"date": date the tweet was written.
            -"followers": number of followers of the tweet's author.
            -"polarity": polarity of the tweet.
            -"text": text of the tweet.
        -"summary": information about the summary:
            -"abstract_EN": abstractive summary in English.
            -"abstract_ES": abstractive summary in Spanish.
            -"tweet": ID of the tweet(s) selected for the extractive summary (if it is not filled, the extractive summary is the one of the tweets in the topic).
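
As a rough illustration of how the fields listed above might be read, here is a minimal Python sketch. It is not part of the original package. The README does not say which fields are XML attributes and which are child elements, so the SAMPLE document, the helper names (field, load_topics) and the exact nesting are all assumptions; the field() helper accepts either form, but the code may still need adjusting against the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT taken from the dataset: the real files may
# name or nest these fields differently.
SAMPLE = """<corpus entity="E000">
  <cluster label="New model launch" priority="unimportant">
    <tweet id="1" date="2012-06-01" followers="120" polarity="positive">
      <text>Example tweet one</text>
    </tweet>
    <tweet id="2" date="2012-06-02" followers="45" polarity="neutral">
      <text>Example tweet two</text>
    </tweet>
    <summary>
      <abstract_EN>Users discuss the launch.</abstract_EN>
      <abstract_ES>Los usuarios comentan el lanzamiento.</abstract_ES>
      <tweet>2</tweet>
    </summary>
  </cluster>
</corpus>"""


def field(elem, name):
    """Read `name` from an attribute or, failing that, from a child element."""
    if name in elem.attrib:
        return elem.attrib[name].strip()
    child = elem.find(name)
    return child.text.strip() if child is not None and child.text else None


def load_topics(xml_text):
    """Return one dict per topic (cluster) of a single entity file."""
    root = ET.fromstring(xml_text)
    entity_id = field(root, "entity")  # assumed to sit on the root element
    topics = []
    for cluster in root.iter("cluster"):
        # findall() only sees direct children, so the tweet ids listed
        # inside <summary> are not mixed up with the topic's tweets.
        tweets = {field(t, "id"): field(t, "text") for t in cluster.findall("tweet")}
        summary = cluster.find("summary")
        selected = []
        if summary is not None:
            selected = [t.text.strip() for t in summary.findall("tweet")
                        if t.text and t.text.strip()]
        topics.append({
            "entity": entity_id,
            "label": field(cluster, "label"),
            "priority": field(cluster, "priority"),
            "n_tweets": len(tweets),
            "abstract_EN": field(summary, "abstract_EN") if summary is not None else None,
            "extractive": [tweets[i] for i in selected if i in tweets],
        })
    return topics


if __name__ == "__main__":
    for topic in load_topics(SAMPLE):
        print(topic["label"], topic["priority"], topic["extractive"])
```

To use it on a real entity file, read the file contents into a string and pass it to load_topics(); an empty "extractive" list then corresponds to the case where the summary's "tweet" field is not filled.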

