A Benchmark Study on Sentiment Analysis for Software Engineering Research

1. Does the paper propose a new opinion mining approach?

No

2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?

This is a benchmark study. The authors assess the performance of three SE-specific polarity classifiers, namely SentiCR, Senti4SD, and SentiStrength-SE, using the general-purpose tool SentiStrength as a baseline. SentiCR and Senti4SD are supervised classifiers based on machine learning: SentiCR exploits bag-of-words (BoW) features, while Senti4SD combines lexical, semantic, and keyword-based features. SentiStrength-SE is a rule-based classifier that enhances SentiStrength by adapting the positive/negative scores of the words in its lexicon to the software engineering domain.
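
For illustration only, below is a minimal sketch (Python, scikit-learn) of the kind of bag-of-words supervised pipeline that SentiCR represents. It is not the authors' implementation; the example texts, labels, and model choice are assumptions.

```python
# Minimal sketch of a bag-of-words supervised polarity classifier in the style
# of SentiCR (NOT the authors' implementation; the texts and labels below are
# hypothetical placeholders for an annotated SE gold standard).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = [
    "This API is great, saved me hours",        # positive
    "Works exactly as documented, thanks",      # positive
    "Terrible documentation, a waste of time",  # negative
    "This library crashes constantly",          # negative
    "Returns a list of tokens",                 # neutral
    "The method takes two arguments",           # neutral
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Bag-of-words features (TF-IDF over uni/bigrams) feeding a supervised learner.
model = Pipeline([
    ("bow", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(texts, labels)

print(model.predict(["the build fails and the error messages are useless"]))
```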

3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.

The tools used in the benchmark study are not original contributions of this specific study. All of the evaluated tools are publicly available:
- Senti4SD: https://github.com/collab-uniba/Senti4SD
- SentiCR: https://github.com/senticr/SentiCR
- SentiStrength-SE: https://laser.cs.uno.edu/Projects/Projects.html
- SentiStrength (baseline): http://sentistrength.wlv.ac.uk/

4. What is the main goal of the whole study?

The main goal of the study is to assess the performance and reliability of SE-specific sentiment analysis tools on four SE gold-standard datasets.

5. What the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?

NA

6. Which dataset(s) the technique is applied on?

Four SE-specific datasets. Two of them are annotated following a model-driven approach, i.e., by adopting guidelines inspired by a theoretical model of emotion, after preliminary training of the raters. This is the case for the Jira (Ortu et al., 2016) and Stack Overflow (see paper 19 in our collection) datasets.
The other two are annotated using an ad hoc approach, i.e., the raters are asked to provide polarity labels according to their subjective perception of the semantic orientation of the text. The authors refer to these two datasets as 'Code Review' (developed by the authors of SentiCR) and 'Java Libraries'.

7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.

Yes, all four datasets are publicly available:
- Stack Overflow dataset (paper 15): https://github.com/collab-uniba/Senti4SD
- Jira dataset (paper 29): http://ansymore.uantwerpen.be/system/files/uploads/artefacts/alessandro/MSR16/archive3.zip
- Code Review (paper 93): https://github.com/senticr/SentiCR/
- Java Libraries (paper 1): https://sentiment-se.github.io/replication.zip

8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?

No

9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?

Yes. The performance of the three SE-specific tools and the SentiStrength baseline is evaluated in terms of precision, recall, and F1. The authors also compute the weighted Cohen's Kappa to assess 1) the agreement between tools (in pairs) and 2) the agreement between the manual labels and the output of each tool.
Main findings:
1) Reliable sentiment analysis in software engineering is possible, provided that the manual annotation of gold standards is inspired by theoretical models of affect.
2) Regardless of the annotation approach, SE-specific customization provides a boost in classification performance with respect to the baseline represented by the off-the-shelf, general-purpose tool SentiStrength. The best performance is observed for supervised approaches trained in a within-platform setting, i.e., when the training and test sets come from the same data source.
3) The authors recommend custom retraining of classifiers whenever a gold standard is available; when retraining is not possible because no gold standard is available, the lexicon-based approach provides comparable performance.
The quantitative assessment of performance is complemented by a qualitative error analysis.
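
For reference, a minimal sketch of how the reported metrics can be computed with scikit-learn; the label sequences below are hypothetical stand-ins and do not come from the paper.

```python
# Sketch of the metrics used in the study (precision, recall, F1, and weighted
# Cohen's Kappa), computed with scikit-learn on hypothetical labels: `gold`
# stands in for the manual annotation, `tool` for a classifier's output.
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

gold = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
tool = ["positive", "neutral", "neutral", "negative", "negative", "neutral"]

# The paper reports precision/recall/F1 per polarity class; macro averaging is
# used here only for brevity.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, tool, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")

# Weighted Cohen's Kappa quantifies agreement beyond chance, either between two
# tools or between a tool and the manual labels. With string labels, scikit-learn
# orders classes alphabetically (negative < neutral < positive), which here
# matches the ordinal polarity scale assumed by linear weighting.
print(f"weighted kappa={cohen_kappa_score(gold, tool, weights='linear'):.2f}")
```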

10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).

Yes, the paper replicates the study by Jongeling et al. (EMSE, 2017 - paper 94) to assess the performance and reliability of three sentiment analysis tools that have been specifically tuned for the software development domain and were not available at the time of the original study. The original study compared the predictions of widely used off-the-shelf sentiment analysis tools, showing not only that these tools disagree with the human annotation of developers' communication channels, but also that they disagree with each other. The current study advances the state of the art by investigating to what extent fine-tuning sentiment analysis tools for the software engineering (SE) domain succeeds in improving the accuracy of emotion detection, compared with general-purpose tools trained on data collected from generic social media (see the main findings above).

11. What success metrics are used?

Precision, Recall, F1, and weighted Cohen's Kappa

12. Write down any other comments/notes here.

-