Dataset Open Access
Rodríguez-Vidal, Javier;
Carrillo-de-Albornoz, Jorge;
Amigó, Enrique;
Plaza, Laura;
Gonzalo, Julio;
Verdejo, Felisa
RepLab Summarization Dataset
This package contains the dataset generated in the research published in the paper:
"Javier Rodríguez-Vidal, Jorge Carrillo-de-Albornoz, Enrique Amigó, Laura Plaza, Julio Gonzalo and Felisa Verdejo. 2019. Automatic Generation of Entity-Oriented Summaries for Reputation Management. Ambient Intelligence & Humanized Computing."
The dataset is available for research purposes. If you use it, please cite the paper above.
This README file contains:
1) A brief description of the corpus
2) A description of the contents of each directory in this package.
1. Description of RepLab Summarization Dataset
The RepLab summarization dataset contains company data from the RepLab 2013 dataset (http://nlp.uned.es/replab2013/), in which Twitter users discuss different topics related to the companies.
Each topic consists of a different number of tweets posted by Twitter users.
The collection comprises tweets about 31 entities from two domains: automotive and banking. In total, our subset of RepLab 2013 comprises 71,303 English and Spanish tweets.
For each entity, tweets are grouped into topics, and for each topic three different summaries are manually generated: abstractive English, abstractive Spanish and extractive.
Please see the paper for further details.
2. Description of the contents of this package
./entities:
This directory contains the information about each organization that is needed to create a summary. Each .xml file corresponds to one entity and includes the following information:
-"Corpus entity": Id of the entity.
-"cluster": each one of the topics of the entity.
    -"label": name of the topic.
    -"priority": level of relevance of the topic: Alert (the highest priority: a reputation alert, i.e., an issue that requires an immediate response from the entity), Midly_important (an intermediate priority: relevant for the entity) or unimportant (the lowest priority).
-"tweet": Information about the tweets.
    -"id": Id of the tweet.
    -"date": Date when the tweet was written.
    -"followers": Number of followers of the author of the tweet.
    -"polarity": Polarity of the tweet.
    -"text": Text of the tweet.
-"summary": Information about the summary:
    -"abstract_EN": Abstractive summary in English.
    -"abstract_ES": Abstractive summary in Spanish.
    -"tweet": Id of the tweet(s) selected for the extractive summary (if this field is empty, the extractive summary is the one of the tweets in the topic).
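The fields listed above could be read with Python's standard xml.etree.ElementTree module, as in the minimal sketch below. The README does not say whether each field is stored as an XML attribute or as a child element, nor how the elements are nested, so the sample document, the nesting of "tweet" and "summary" inside "cluster", and the helper names field and read_entity are all assumptions made for illustration. Check them against a real file from ./entities before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with made-up values. The real files in ./entities may
# nest or encode these fields differently.
SAMPLE_XML = """
<Corpus entity="example_entity">
  <cluster label="Example topic" priority="Alert">
    <tweet id="1" date="placeholder" followers="120" polarity="placeholder">
      <text>First made-up tweet</text>
    </tweet>
    <tweet id="2" date="placeholder" followers="45" polarity="placeholder">
      <text>Second made-up tweet</text>
    </tweet>
    <summary>
      <abstract_EN>Made-up English summary</abstract_EN>
      <abstract_ES>Resumen inventado</abstract_ES>
      <tweet>2</tweet>
    </summary>
  </cluster>
</Corpus>
"""


def field(elem, name, default=""):
    """Read `name` from an attribute if present, else from a child element's text."""
    if name in elem.attrib:
        return elem.attrib[name]
    child = elem.find(name)
    if child is not None and child.text:
        return child.text.strip()
    return default


def read_entity(root):
    """Collect topics, tweets and summaries from one parsed entity document."""
    topics = []
    for cluster in root.iter("cluster"):
        # Direct children only, so the "tweet" entries inside "summary" are skipped.
        tweets = {field(t, "id"): field(t, "text") for t in cluster.findall("tweet")}
        summary = cluster.find("summary")
        extractive_ids = []
        if summary is not None:
            extractive_ids = [
                s.text.strip() for s in summary.findall("tweet")
                if s.text and s.text.strip()
            ]
        topics.append({
            "label": field(cluster, "label"),
            "priority": field(cluster, "priority"),
            "tweets": tweets,
            "abstract_EN": field(summary, "abstract_EN") if summary is not None else "",
            "abstract_ES": field(summary, "abstract_ES") if summary is not None else "",
            # Left empty when the field is not filled; see the README note on that case.
            "extractive_ids": extractive_ids,
        })
    return {"entity": field(root, "entity"), "topics": topics}


entity = read_entity(ET.fromstring(SAMPLE_XML))
for topic in entity["topics"]:
    print(topic["label"], topic["priority"], topic["extractive_ids"])
```

For a real file, replacing ET.fromstring(SAMPLE_XML) with ET.parse(path).getroot() should be enough, provided the layout matches the assumptions above.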
Name | Size |
---|---|
RepLab_summarization_dataset-V1.0.zip (md5:914afd7955c110add8f94027426d3148) | 1.4 MB |
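The MD5 checksum listed for the archive can be used to confirm that a download is complete. The sketch below uses Python's standard hashlib module. It assumes the archive was saved locally under the file name shown in the listing, and the function names md5_of_file and archive_is_intact are illustrative.

```python
import hashlib

# Checksum as listed for RepLab_summarization_dataset-V1.0.zip.
EXPECTED_MD5 = "914afd7955c110add8f94027426d3148"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling archive_is_intact("RepLab_summarization_dataset-V1.0.zip") from the download directory returns False if the file was truncated or altered in transit.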