Published January 10, 2019 | Version 1.0
Dataset Open

RepLab Summarization Dataset

  • 1. Universidad Nacional de Educación a Distancia (UNED)

Description

RepLab Summarization Dataset

This package contains the dataset generated in the research published in the paper:

"Javier Rodríguez-Vidal, Jorge Carrillo-de-Albornoz, Enrique Amigó, Laura Plaza, Julio Gonzalo and Felisa Verdejo. 2019. Automatic Generation of Entity-Oriented Summaries for Reputation Management. Ambient Intelligence & Humanized Computing."

The dataset is available for research purposes. If you use it, please cite the paper above.

This README file contains: 

1) A brief description of the corpus.
2) A description of the contents of each directory in this package.


1. Description of RepLab Summarization Dataset


The RepLab Summarization Dataset contains company data from the RepLab 2013 dataset (http://nlp.uned.es/replab2013/), in which Twitter users discuss different topics related to each company.
Each topic consists of a varying number of tweets posted by Twitter users.

The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.

For each entity, tweets are grouped into topics, and for each topic three different summaries are manually generated: abstractive English, abstractive Spanish, and extractive.

Please see the paper for further details.

 

2. Description of the contents of this package

./entities:

This directory contains the information about each organization that is needed to create a summary. Each .xml file corresponds to one entity and includes the following information:

     -"Corpus entity": Id of the entity.
     -"cluster": each one of the topics of the entity.
        -"label": name of the topic.
        -"priority": level of relevance of the topic: Alert (the highest priority, being a reputation alert, i.e., an issue that requires an immediate response from the entity), Midly_important (relevant for the entity, an intermediate priority) or unimportant (the lowest priority).
        -"tweet": Information about the tweets.
            -"id": Id of the tweet.
            -"date": When the tweet was written.
            -"followers": Number of followers of the author of the tweet.
            -"polarity": Polarity of the tweet.
            -"text": Text of the tweet.
        -"summary": Information about the summary:
            -"abstract_EN": Abstractive summary in English.
            -"abstract_ES": Abstractive summary in Spanish.
            -"tweet": Id of the tweet(s) selected for the extractive summary (if it is not filled, the extractive summary is the one of the tweets in the topic).
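
The field list above can be read with a standard XML parser. The following is a minimal Python sketch, not the authors' tooling. This README does not say whether each field is stored as an XML attribute or as a child element, so the hypothetical helper `field` tries both. The root element name `Corpus` with an `entity` attribute, the layout of `SAMPLE`, and all values in it are assumptions made up for illustration; check them against a real file in ./entities before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real files in ./entities may be laid out differently.
SAMPLE = """
<Corpus entity="E0">
  <cluster label="example topic" priority="unimportant">
    <tweet id="1" date="2012-01-01" followers="10" polarity="neutral">
      <text>first placeholder tweet</text>
    </tweet>
    <tweet id="2" date="2012-01-02" followers="20" polarity="neutral">
      <text>second placeholder tweet</text>
    </tweet>
    <summary>
      <abstract_EN>placeholder summary</abstract_EN>
      <abstract_ES>resumen de ejemplo</abstract_ES>
      <tweet>2</tweet>
    </summary>
  </cluster>
</Corpus>
"""

TWEET_FIELDS = ("id", "date", "followers", "polarity", "text")


def field(elem, name):
    """Return `name` from an attribute if present, else from a child element's text."""
    if name in elem.attrib:
        return elem.attrib[name]
    child = elem.find(name)
    if child is not None and child.text is not None:
        return child.text.strip()
    return None


def parse_entity(xml_text):
    """Turn one entity file (given as a string) into plain dicts and lists."""
    root = ET.fromstring(xml_text.strip())
    entity = {"entity": field(root, "entity"), "clusters": []}
    for cluster in root.iter("cluster"):
        # findall() only sees direct children, so the tweet ids listed
        # inside <summary> are not mixed up with the topic's tweets.
        tweets = [{k: field(t, k) for k in TWEET_FIELDS}
                  for t in cluster.findall("tweet")]
        summary = cluster.find("summary")
        parsed_summary = None
        if summary is not None:
            parsed_summary = {
                "abstract_EN": field(summary, "abstract_EN"),
                "abstract_ES": field(summary, "abstract_ES"),
                "extractive_ids": [
                    field(t, "id") or (t.text or "").strip()
                    for t in summary.findall("tweet")
                ],
            }
        entity["clusters"].append({
            "label": field(cluster, "label"),
            "priority": field(cluster, "priority"),
            "tweets": tweets,
            "summary": parsed_summary,
        })
    return entity


if __name__ == "__main__":
    data = parse_entity(SAMPLE)
    for c in data["clusters"]:
        print(c["label"], c["priority"], len(c["tweets"]), c["summary"]["extractive_ids"])
```

To process a real file, read its contents and pass the string to `parse_entity`; if the actual tag or attribute names differ from the assumed ones, only `field` and the name constants should need adjusting.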

Files

RepLab_summarization_dataset-V1.0.zip (1.4 MB)
md5:914afd7955c110add8f94027426d3148
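
The MD5 checksum listed above can be used to check that the archive was downloaded intact. A minimal Python sketch, assuming the archive was saved under its original file name in the current directory; the function names `md5_of_file` and `verify` are illustrative and not part of the package:

```python
import hashlib

# Checksum as listed for RepLab_summarization_dataset-V1.0.zip.
EXPECTED_MD5 = "914afd7955c110add8f94027426d3148"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    print(verify("RepLab_summarization_dataset-V1.0.zip"))
```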