Evaluating the performance of machine learning sentiment analysis algorithms in software engineering
1. Does the paper propose a new opinion mining approach?
No
2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?
The paper compares three techniques: Logistic Regression, Support Vector Machine (SVM), and a Naive Bayes classifier.
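The paper does not include code, but a minimal sketch of how such a three-classifier comparison is typically set up with scikit-learn is shown below. The inline sample texts, TF-IDF features, and default hyperparameters are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: comparing Logistic Regression, SVM, and Naive Bayes on
# sentiment-labelled text. The tiny inline dataset and TF-IDF features are
# assumptions for illustration, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical Stack Overflow-style snippets with sentiment labels.
texts = [
    "This answer finally fixed my build, thank you so much!",
    "Great explanation, the example code works perfectly.",
    "This library is a nightmare, the docs are useless.",
    "Worst API design I have ever had to work with.",
    "Thanks, the regex suggestion solved it immediately.",
    "The compiler error messages are incredibly frustrating.",
]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
}

for name, clf in classifiers.items():
    # Each classifier gets the same TF-IDF representation of the text.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    print(f"{name}: weighted F1 = {f1_score(y_test, preds, average='weighted'):.3f}")
```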
3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.
None.
4. What is the main goal of the whole study?
To determine how the dataset influences the accuracy of the three chosen methods.
5. What the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?
To evaluate the accuracy of the three opinion mining techniques on the chosen dataset.
6. Which dataset(s) the technique is applied on?
The Stack Overflow gold standard emotion annotation dataset (Novielli et al., 2018).
7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.
Yes. Nicole Novielli, Fabio Calefato, and Filippo Lanubile. "A gold standard for emotion annotation in Stack Overflow." arXiv preprint arXiv:1803.02300, 2018. https://arxiv.org/abs/1803.02300
8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?
No.
9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?
Yes. The authors report precision, recall, and F1 score for both classification tasks.

For detecting the presence of emotion:
Algorithm            Precision  Recall  F1 Score
Logistic Regression  0.752      0.750   0.750
SVM                  0.609      0.606   0.601
Naive Bayes          0.709      0.705   0.705

For determining positive vs. negative polarity:
Algorithm            Precision  Recall  F1 Score
Logistic Regression  0.896      0.892   0.893
SVM                  0.671      0.671   0.669
Naive Bayes          0.819      0.815   0.815
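The paper does not state how these per-algorithm scores were aggregated across classes; the sketch below shows one plausible way to compute them with scikit-learn, where the example labels, predictions, and weighted averaging are assumptions.

```python
# Minimal sketch of how the reported metrics could be computed from
# predictions. The example labels and the weighted averaging scheme are
# assumptions, since the paper does not describe its evaluation code.
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and classifier predictions.
y_true = ["positive", "negative", "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```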
10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).
No.
11. What success metrics are used?
Precision, Recall, F-Measure
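For reference, these are the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-Measure (F1) = 2 · Precision · Recall / (Precision + Recall), where TP, FP, and FN denote true positives, false positives, and false negatives.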
12. Write down any other comments/notes here.
I suspect there are significant problems with the paper; for instance, it does not describe how the dataset is actually preprocessed.