Published October 29, 2020 | Version 1
Dataset Open

Portuguese Comparative Sentences: A Collection of Labeled Sentences on Twitter and Buscapé

Description

More and more customers rely on online product reviews and comments on the Web when deciding to buy one product over another. In this context, sentiment analysis techniques are the traditional way to summarize users' opinions that criticize or highlight the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors.

Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences labeled by three human annotators. This repository consists of three files: (1) a lexicon containing words frequently used to make comparisons in Portuguese; (2) a Twitter dataset with labeled comparative sentences; and (3) a Buscapé dataset with labeled comparative sentences.

The lexicon is a set of 176 words frequently used to express a comparative opinion in the Portuguese language. It is used as a filter to build two datasets of comparative sentences from two important contexts: (1) online social networks; and (2) product reviews.

For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and kept those that contained at least one keyword from the lexicon, obtaining 130,459 tweets. Our work operates at the sentence level. Thus, all sentences were extracted and a sample of 2,053 sentences was created, which was manually labeled by three human annotators, reaching an agreement of 83.2% according to the Fleiss' Kappa coefficient. For Buscapé (https://www.buscape.com.br/), a Brazilian website used to compare product prices on the web, the same methodology was applied, creating a set of 2,754 labeled sentences obtained from comments made in 2013. This dataset was also labeled by three humans, reaching an agreement of 83.46% according to the Fleiss' Kappa coefficient.
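The lexicon-based filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the regex tokenization, the function names, and the three-word `toy_lexicon` are assumptions made for the example (the real lexicon has 176 entries).

```python
import re

def contains_keyword(text: str, lexicon: set[str]) -> bool:
    """Return True if at least one token of `text` appears in `lexicon`."""
    # Assumption: lowercase the text and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    return any(token in lexicon for token in tokens)

def filter_by_lexicon(texts: list[str], lexicon: set[str]) -> list[str]:
    """Keep only the texts that contain at least one lexicon keyword."""
    return [text for text in texts if contains_keyword(text, lexicon)]

# Hypothetical stand-in for the lexicon file.
toy_lexicon = {"melhor", "pior", "igual"}

# Invented example texts, for illustration only.
toy_texts = [
    "Esse celular é melhor que o antigo",
    "Bom dia a todos",
    "A bateria ficou PIOR depois da atualização",
]

kept = filter_by_lexicon(toy_texts, toy_lexicon)
```

Matching on whole tokens rather than substrings avoids keeping a text merely because a keyword occurs inside a longer word.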

The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative.

The datasets contain these labeled properties:

  • text: the sentence extracted from the review comment.

  • entity_s1: the first entity compared in the sentence.

  • entity_s2: the second entity compared in the sentence.

  • keyword: the comparative keyword used in the sentence to express comparison.

  • preferred_entity: the preferred entity.

  • id_start: the keyword's initial position in the sentence.

  • id_end: the keyword's final position in the sentence.

  • type: the sentence label, which specifies whether the phrase is a comparison.
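As a rough sketch of how these properties might be checked after loading, the snippet below parses a JSON string and reports any missing property. It assumes each dataset file is a top-level JSON array of objects keyed by the property names listed above; that layout is not confirmed here. The `example` record is invented for illustration, and its `id_start`/`id_end` values assume character offsets, which is also an assumption.

```python
import json
from typing import Any

# Property names as listed in the description above.
EXPECTED_FIELDS = (
    "text", "entity_s1", "entity_s2", "keyword",
    "preferred_entity", "id_start", "id_end", "type",
)

def parse_records(raw_json: str) -> list[dict[str, Any]]:
    """Parse a JSON string, assuming a top-level array of record objects."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return data

def missing_fields(record: dict[str, Any]) -> list[str]:
    """List the expected properties that are absent from `record`."""
    return [name for name in EXPECTED_FIELDS if name not in record]

# Invented record, NOT taken from the dataset.
example = {
    "text": "O iPhone é melhor que o Galaxy",
    "entity_s1": "iPhone",
    "entity_s2": "Galaxy",
    "keyword": "melhor",
    "preferred_entity": "iPhone",
    "id_start": 11,  # assumed character offset of the keyword
    "id_end": 17,    # assumed end offset (exclusive)
    "type": 1,
}
```

With a downloaded file, the string passed to `parse_records` would come from reading it with `encoding="utf-8"`, for example `dataset_buscape.json`.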

Additional Information:

1 - The sentences were separated using a sentence tokenizer.

2 - If a compared entity is not specified, the corresponding field is set to "__".

3 - The property "type" can take one of five values:

  • 0: Non-comparative (Não Comparativa).

  • 1: Non-Equal-Gradable (Gradativa com Predileção).

  • 2: Equative (Equitativa).

  • 3: Superlative (Superlativa).

  • 4: Non-Gradable (Não Gradativa).
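For convenience, the five codes can be mapped to readable names. The enum below is a small sketch: the class and function names are hypothetical, the member name for value 4 follows the Portuguese label "Não Gradativa", and the `int(...)` conversion is there only because the stored type of the field (number or string) is not stated.

```python
from enum import IntEnum

class ComparisonType(IntEnum):
    NON_COMPARATIVE = 0     # Não Comparativa
    NON_EQUAL_GRADABLE = 1  # Gradativa com Predileção
    EQUATIVE = 2            # Equitativa
    SUPERLATIVE = 3         # Superlativa
    NON_GRADABLE = 4        # Não Gradativa

def is_comparative(type_value) -> bool:
    """True for any label other than 0 (non-comparative)."""
    # int() accepts both 3 and "3"; an unknown code raises ValueError.
    return ComparisonType(int(type_value)) is not ComparisonType.NON_COMPARATIVE
```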

 

If you use this data, please cite our paper as follows: 

"Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa, Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia ’20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"

--------------

Further Information:

We make the raw sentences available in the dataset so that future work can test different pre-processing steps. Therefore, if you want to obtain the exact sentences used in the paper above, you must reproduce the pre-processing step described in the paper (Figure 2).

For each sentence in the dataset with more than one keyword:

  • Extract three words before and three words after each comparative keyword, creating a new sentence that receives the existing value of the “type” field as its label.
  • The original sentence is thus divided into n new sentences, where n is the number of keywords in the sentence.
  • Stopwords are not counted as part of this three-word range.

Note: the final processed sentence can have more than six words because stopwords are not counted as part of the range.
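The windowing procedure above can be sketched as follows. This is one possible reading of the steps, not the paper's implementation: whitespace tokenization, the tiny `STOPWORDS` set, the decision to keep stopwords in the output while not counting them, and all function names are assumptions for illustration.

```python
# Hypothetical, tiny stopword list; the list used in the paper is not given here.
STOPWORDS = {"o", "a", "os", "as", "de", "do", "da", "que", "e", "é", "um", "uma"}

def _extend(tokens: list[str], start: int, step: int, limit: int) -> int:
    """Walk from `start` in direction `step` (+1 or -1) until `limit`
    non-stopword tokens have been passed or the sentence edge is reached.
    Returns the index of the last token visited."""
    counted = 0
    idx = start
    while 0 <= idx + step < len(tokens) and counted < limit:
        idx += step
        if tokens[idx].lower() not in STOPWORDS:
            counted += 1  # stopwords are kept but do not count
    return idx

def keyword_windows(sentence: str, keywords: set[str], label: int,
                    limit: int = 3) -> list[tuple[str, int]]:
    """Create one (window, label) pair per keyword occurrence."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    windows = []
    for i, token in enumerate(tokens):
        if token.lower() in keywords:
            left = _extend(tokens, i, -1, limit)
            right = _extend(tokens, i, +1, limit)
            windows.append((" ".join(tokens[left:right + 1]), label))
    return windows

# Invented sentence and keyword set, for illustration only.
windows = keyword_windows(
    "o celular novo da samsung é melhor que o iphone antigo e mais barato que o motorola",
    keywords={"melhor", "mais"},
    label=1,
)
```

Here the sentence contains two keywords, so it yields two new sentences, each carrying the original label. Both windows are longer than seven words because the skipped stopwords remain in the text.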

Files

dataset_buscape.json

Files (2.3 MB)

  • md5:f68a9659743de0f6aecf80cdadaf295e (1.6 MB)

  • md5:e83cadc99a8d8363e46a75b4089bdd5d (698.3 kB)

  • md5:02569345e500e95237c01f4d53ace25f (1.8 kB)