RepLab Summarization Dataset

Rodríguez-Vidal, Javier; Carrillo-de-Albornoz, Jorge; Amigó, Enrique; Plaza, Laura; Gonzalo, Julio; Verdejo, Felisa

doi:10.5281/zenodo.2536801

Published January 10, 2019 | Version 1.0

Dataset Open


1. Universidad Nacional de Educación a Distancia (UNED)


This package contains the dataset generated in the research published in the paper:

"Javier Rodríguez-Vidal, Jorge Carrillo-de-Albornoz, Enrique Amigó, Laura Plaza, Julio Gonzalo and Felisa Verdejo. 2019. Automatic Generation of Entity-Oriented Summaries for Reputation Management. Ambient Intelligence & Humanized Computing."

The dataset is available for research purposes. If you use it, please cite the paper above.

This README file contains:

1) A brief description of the corpus
2) A description of the contents of each directory in this package.

1. Description of RepLab Summarization Dataset

The RepLab summarization dataset contains company data from the RepLab 2013 dataset (http://nlp.uned.es/replab2013/), in which Twitter users discuss different topics related to the companies.
The number of tweets varies from topic to topic.

The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.

For each entity, tweets are grouped into topics, and for each topic three different summaries are manually generated: abstractive English, abstractive Spanish and extractive.

Please see the paper for further details.

2. Description of the contents of this package

./entities:

This directory contains the information about each organization that is needed to create a summary. Each .xml file corresponds to one entity and includes the following information:

-"Corpus entity": Id of the entity.
   -"cluster": each one of the topics of the entity.
       -"label": name of the topic.
       -"priority": level of relevance of the topic: Alert (the highest priority: a reputation alert, i.e., an issue that requires an immediate response from the entity), Midly_important (an intermediate priority: relevant for the entity) or unimportant (the lowest priority).
       -"tweet": Information about the tweets.
           -"id": Id of the tweet.
           -"date": When the tweet was written.
           -"followers": Number of followers of the author of the tweet.
           -"polarity": Polarity of the tweet.
           -"text": Text of the tweet.
       -"summary": Information about the summary:
           -"abstract_EN": Abstractive summary in English.
           -"abstract_ES": Abstractive summary in Spanish.
           -"tweet": Id of the tweet(s) selected for the extractive summary (if it is empty, the extractive summary is taken from the tweets in the topic).
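
The field list above can be read with a short script. The following is a minimal sketch in Python, not the authors' tooling. The README does not say whether fields such as "label", "priority" or "text" are stored as XML attributes or as child elements, so the helper `field` (a hypothetical name) accepts either layout. The `SAMPLE` document, its ids, dates, polarity values and texts are invented for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Invented sample. The tag/attribute layout is an ASSUMPTION inferred from the
# field list in this README; check it against a real file in ./entities.
SAMPLE = """<Corpus entity="E001">
  <cluster label="example topic" priority="Alert">
    <tweet id="101" date="2012-06-01" followers="250" polarity="NEGATIVE" text="Example tweet A"/>
    <tweet id="102" date="2012-06-02" followers="40" polarity="NEUTRAL" text="Example tweet B"/>
    <summary>
      <abstract_EN>Example English summary.</abstract_EN>
      <abstract_ES>Resumen de ejemplo.</abstract_ES>
      <tweet>101</tweet>
    </summary>
  </cluster>
</Corpus>"""


def field(elem, name):
    """Return a field stored either as an attribute or as a child element."""
    if elem is None:
        return None
    if name in elem.attrib:
        return elem.attrib[name]
    child = elem.find(name)
    return child.text if child is not None else None


def read_entity(xml_text):
    """Collect the topics, tweets and summaries of one entity file."""
    root = ET.fromstring(xml_text)
    topics = []
    for cluster in root.iter("cluster"):
        # findall() only returns direct children, so the <tweet> entries
        # inside <summary> are not mixed with the topic's own tweets.
        tweets = {field(t, "id"): field(t, "text") for t in cluster.findall("tweet")}
        summary = cluster.find("summary")
        selected = []
        if summary is not None:
            selected = [t.text.strip() for t in summary.findall("tweet")
                        if t.text and t.text.strip()]
        topics.append({
            "label": field(cluster, "label"),
            "priority": field(cluster, "priority"),
            "abstract_EN": field(summary, "abstract_EN"),
            "abstract_ES": field(summary, "abstract_ES"),
            "extractive_ids": selected,
            "tweets": tweets,
        })
    return {"entity": root.get("entity"), "topics": topics}


if __name__ == "__main__":
    entity = read_entity(SAMPLE)
    for topic in entity["topics"]:
        print(entity["entity"], topic["label"], topic["priority"], topic["extractive_ids"])
```

To read a real file, pass its contents to `read_entity`, for example `read_entity(open(path, encoding="utf-8").read())`, and adjust the tag names if the actual files differ from the layout assumed here.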

Files

RepLab_summarization_dataset-V1.0.zip (1.4 MB, md5:914afd7955c110add8f94027426d3148)
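
The MD5 checksum listed for the archive can be used to check that a download is complete. The following is a minimal sketch in Python using only the standard library. The function names `md5_of` and `is_intact` are hypothetical, and the location of the downloaded file is an assumption left to the caller.

```python
import hashlib

# Checksum as given in the file listing for RepLab_summarization_dataset-V1.0.zip.
EXPECTED_MD5 = "914afd7955c110add8f94027426d3148"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of(path) == expected.lower()
```

Calling `is_intact` with the path of the downloaded archive returns `True` only when the file matches the published checksum.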
