Sentiment and Politeness Analysis Tools on Developer Discussions Are Unreliable, but So Are People
1. Does the paper propose a new opinion mining approach?
No
2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?
SentiStrength, Alchemy, NLTK, Stanford NLP, Senti4SD, SentiCR, ConvoKit (Politeness Tool)
3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.
SentiStrength, Alchemy, NLTK, Stanford NLP, Senti4SD, SentiCR, and ConvoKit (Politeness Tool). ConvoKit (Politeness Tool): Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. arXiv preprint arXiv:1306.6078. http://www.cs.cornell.edu/~cristian/Politeness.html
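For illustration only (not from the paper or its replication package): a minimal sketch of annotating a single, hypothetical comment with ConvoKit's politeness strategies module, assuming ConvoKit and spaCy's en_core_web_sm model are installed.

```python
# Minimal sketch (assumptions: convokit and spaCy's en_core_web_sm are installed;
# the comment text and IDs below are hypothetical, not from the paper's dataset).
from convokit import Corpus, Speaker, Utterance, TextParser, PolitenessStrategies

# Build a one-utterance corpus from a hypothetical GitHub comment.
corpus = Corpus(utterances=[
    Utterance(id="u0",
              text="Could you please rebase this branch before merging?",
              speaker=Speaker(id="dev1")),
])

corpus = TextParser().transform(corpus)             # dependency-parse the utterances
corpus = PolitenessStrategies().transform(corpus)   # annotate politeness strategies per utterance

# Counts of politeness strategies (e.g., "please", indirect requests) for the utterance.
print(corpus.get_utterance("u0").meta["politeness_strategies"])
```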
4. What is the main goal of the whole study?
To study the reliability of popular sentiment analysis and politeness tools in the context of developer discussions
5. What do the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?
To calculate the sentiment polarity of GitHub comments: SentiStrength, Alchemy, NLTK, Stanford NLP, Senti4SD, SentiCR. To calculate the politeness of GitHub comments: ConvoKit (Politeness Tool).
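As a concrete illustration of the sentiment-polarity use case (not taken from the paper), a minimal sketch that scores a hypothetical GitHub comment with NLTK's VADER analyzer; whether the paper used this exact NLTK module is an assumption.

```python
# Minimal sketch (assumption: NLTK's VADER analyzer stands in for the paper's NLTK setup;
# the comment text is hypothetical).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()
comment = "Thanks, this fix works, but the error message is still confusing."
print(analyzer.polarity_scores(comment))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}; the compound score gives overall polarity
```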
6. Which dataset(s) the technique is applied on?
589 GitHub comments
7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.
589 GitHub comments: https://github.com/DeveloperLiberationFront/AffectAnalysisToolEvaluation
8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?
The GitHub comment domain is close to that of SentiCR, which is trained on code review comments, but differs from the domains the other sentiment tools were designed for. The politeness tool was trained on requests from Wikipedia editors' talk pages and the Stack Exchange question-answering communities.
9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?
The researchers checked consistency between human coders and then applied the sentiment analysis tools and the politeness tool to the 589 manually labeled GitHub comments. Human ratings showed low consistency for both sentiment and politeness on GitHub comments; the sentiment analysis tools showed low sentiment reliability on GitHub comments, and the politeness tool likewise showed low politeness reliability.
10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).
No
11. What success metrics are used?
Inter-rater agreement (weighted Cohen's Kappa), F-measure
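A short sketch of how weighted Cohen's Kappa can be computed for two raters' ordinal sentiment labels with scikit-learn; the labels below are made up for illustration and this is not the paper's analysis code.

```python
# Minimal sketch (the rater labels are hypothetical, not the paper's data).
from sklearn.metrics import cohen_kappa_score

# Ordinal sentiment labels from two human coders: -1 = negative, 0 = neutral, 1 = positive.
rater_a = [1, 0, -1, 0, 1, -1, 0, 0]
rater_b = [1, 0,  0, 0, 1, -1, -1, 0]

# Linear weighting penalizes disagreements by how far apart the ordinal labels are.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Weighted Cohen's Kappa: {kappa:.2f}")
```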
12. Write down any other comments/notes here.
-