Sentiment Analysis for Software Engineering: How Far Can We Go?

1. Does the paper propose a new opinion mining approach?

No

2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?

SentiStrength, NLTK, StanfordCoreNLP, SentiStrength-SE

3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.

SentiStrength: http://sentistrength.wlv.ac.uk/
NLTK: https://www.nltk.org/
StanfordCoreNLP: https://stanfordnlp.github.io/CoreNLP/
SentiStrength-SE: https://laser.cs.uno.edu/Projects/Projects.html

4. What is the main goal of the whole study?

Determine how accurately commonly used sentiment analysis tools classify the sentiment of SE texts, and how the choice of dataset affects each tool's performance.

5. What do the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?

Classify the sentiment polarity of SE texts (opinion mining).

6. Which dataset(s) is the technique applied on?

Stack Overflow discussions, mobile app reviews, JIRA issue comments.

7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.

Mobile app reviews: Lorenzo Villarroel, Gabriele Bavota, Barbara Russo, Rocco Oliveto, and Massimiliano Di Penta. 2016. Release planning of mobile apps based on user reviews. In Proceedings of ICSE 2016 (38th International Conference on Software Engineering). 14–24.
JIRA issue comments: Marco Ortu, Alessandro Murgia, Giuseppe Destefanis, Parastou Tourani, Roberto Tonelli, Michele Marchesi, and Bram Adams. 2016. The emotional side of software developers in JIRA. In Proceedings of MSR 2016 (13th International Conference on Mining Software Repositories). IEEE, 480–483.

8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?

Yes. Some tools (NLTK, StanfordCoreNLP, SentiStrength) were not designed specifically for SE texts.

9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?

Yes. The tools were evaluated on three different datasets. In some cases (for instance, neutral app reviews), all tools achieved very low precision and recall.

10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).

-

11. What success metrics are used?

Precision and recall
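
For reference, a minimal sketch of how per-class precision and recall can be computed for a sentiment classifier. The function name, the three labels, and the sample data are illustrative assumptions and are not taken from the paper; the paper's own evaluation scripts may differ.

```python
from collections import Counter


def per_class_precision_recall(gold, predicted,
                               labels=("positive", "neutral", "negative")):
    """Return {label: (precision, recall)} for paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    scores = {}
    for label in labels:
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        # Report 0.0 when a class is never predicted or never present.
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up labels for illustration only (not data from the study).
gold = ["positive", "neutral", "negative", "neutral", "positive", "negative"]
predicted = ["positive", "positive", "negative", "negative", "positive", "neutral"]
scores = per_class_precision_recall(gold, predicted)
```

In this made-up sample no neutral item is classified correctly, so the neutral class gets zero precision and zero recall even though the other two classes score higher, which is why a per-class breakdown is more informative than a single aggregate number.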

12. Write down any other comments/notes here.

-