On negative results when using sentiment analysis tools for software engineering research

1. Does the paper propose a new opinion mining approach?

No

2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?

SentiStrength: Thelwall M, Buckley K, Paltoglou G, Cai D, Kappas A (2010) Sentiment strength detection in short informal text. J Am Soc Inf Sci Technol 61(12):2544–2558
Alchemy (AlchemyAPI; no reference given)
NLTK: Bird S, Loper E, Klein E (2009) Natural language processing with Python. O'Reilly Media Inc
StanfordNLP: Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Empirical Methods in Natural Language Processing, pp 1631–1642. Association for Computational Linguistics

3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.

Alchemy: http://www.alchemyapi.com/products/alchemylanguage/sentiment-analysis/
SentiStrength: http://sentistrength.wlv.ac.uk/
StanfordNLP: https://stanfordnlp.github.io/CoreNLP/
NLTK: https://www.nltk.org/
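
As a rough illustration of how one of these publicly available tools can be applied, the sketch below runs NLTK's VADER sentiment analyzer on a developer comment. This shows only one possible NLTK-based setup; the exact analyzer and configuration used in the paper may differ, and the example comment and thresholds are made up for illustration.

```python
# Sketch: applying NLTK's VADER sentiment analyzer to a developer comment.
# Illustration only; not necessarily the NLTK configuration used in the paper.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
comment = "This patch breaks the build again, really frustrating."  # made-up example
scores = analyzer.polarity_scores(comment)  # dict with 'neg', 'neu', 'pos', 'compound'

# Map the compound score to a polarity label
# (thresholds are the commonly suggested VADER defaults, not the paper's).
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"

print(scores, label)
```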

4. What is the main goal of the whole study?

Compare different sentiment analysis tools on SE datasets to determine whether the tools agree with each other.

5. What the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?

Determine whether sentiment mining tools agree with each other.

6. Which dataset(s) the technique is applied on?

Manually labeled developer comments from Murgia et al.; Android issue tracker data; GNOME issue tracker data; GNOME-related Stack Overflow (SO) discussions; Apache Software Foundation (ASF) issue tracker data.

7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.

All datasets are available here: http://ow.ly/HvC5302N4oK

8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?

Yes; none of the techniques was designed for SE, let alone for SE-specific tasks.

9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?

The comparison uses Cohen's kappa, the Adjusted Rand Index (ARI), and the F1 score. The results show that the tools do not agree with each other, and that the F1 score is generally low for some, or all, classes. Moreover, even when considering only the data points on which all tools agree with each other, the scores for those remaining data points still tend to be low.
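
As a minimal sketch of how such an agreement comparison can be computed, the snippet below derives Cohen's kappa, the ARI, and per-class F1 scores with scikit-learn. The label vectors are invented placeholders, not the paper's data, and the paper's exact evaluation setup may differ.

```python
# Sketch: tool-vs-manual and tool-vs-tool agreement with the metrics named above.
# The label vectors below are made up for illustration; they are not the paper's data.
from sklearn.metrics import cohen_kappa_score, adjusted_rand_score, f1_score

manual_labels = ["negative", "neutral", "positive", "neutral", "negative"]
tool_a_labels = ["negative", "neutral", "negative", "neutral", "neutral"]
tool_b_labels = ["neutral", "neutral", "positive", "positive", "negative"]

# Agreement of tool A with the manual labels, corrected for chance
kappa_a = cohen_kappa_score(manual_labels, tool_a_labels)

# Agreement between the two tools, independent of which label names each uses
ari_ab = adjusted_rand_score(tool_a_labels, tool_b_labels)

# Per-class F1 of tool A against the manual labels
f1_per_class = f1_score(manual_labels, tool_a_labels,
                        labels=["negative", "neutral", "positive"], average=None)

print(kappa_a, ari_ab, f1_per_class)
```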

10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).

Pletea et al. find that comments and discussions related to security tend to be more negative, and that security-related discussions tend to be more emotional. The work confirms the first finding but finds no support for the second when different tools are used. Guzman et al. find that the emotional tone of commit messages varies with day, time, and programming language. The authors find no support for the claims made by Guzman et al. when using different tools.

11. What success metrics are used?

-

12. Write down any other comments/notes here.

-