A Gold Standard for Emotion Annotation in Stack Overflow
1. Does the paper propose a new opinion mining approach?
No
2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?
The paper does not propose a new opinion mining technique: this is a data paper describing the creation and characteristics of a gold standard dataset for emotion annotation. However, the authors report that they used SentiStrength (a lexicon-based sentiment analysis tool) to build the annotation sample through opportunistic sampling.
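The paper does not spell out the sampling procedure in code, but the general idea can be sketched as follows, assuming SentiStrength scores have already been computed for each post. The file name, column names, thresholds, and sample split below are all hypothetical, for illustration only:

    import pandas as pd

    # Hypothetical input: Stack Overflow posts with SentiStrength scores,
    # a positive score in 1..5 and a negative score in -1..-5 per post.
    posts = pd.read_csv("so_posts_with_sentistrength.csv")

    # Treat a post as emotionally loaded if either score leaves the
    # neutral band (threshold chosen here for illustration).
    loaded = posts[(posts["positive"] >= 2) | (posts["negative"] <= -2)]
    neutral = posts.drop(loaded.index)

    # Oversample the emotionally loaded posts so the annotation sample
    # is not dominated by neutral text (the 2:1 split is illustrative).
    sample = pd.concat([
        loaded.sample(n=3200, random_state=42),
        neutral.sample(n=1600, random_state=42),
    ])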
3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.
None
4. What is the main goal of the whole study?
Create and distribute a gold standard dataset for emotion annotation in the software development domain. The authors release a dataset of 4,800 questions, answers, and comments from Stack Overflow, manually annotated for emotions. The annotation is performed according to the Shaver framework of emotions.
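For context, the annotation uses the six basic emotion categories at the top level of Shaver's hierarchical model, and a single post may carry more than one emotion. A minimal sketch of the resulting label structure follows; the field names and encoding in the released files may differ:

    # The six basic emotions at the top of Shaver's tree, used as labels.
    SHAVER_BASIC_EMOTIONS = ["love", "joy", "surprise", "anger", "sadness", "fear"]

    # Annotation is multi-label: a post can express several emotions at once.
    # Field names here are hypothetical.
    annotation = {"post_id": 123, "labels": {"joy", "love"}}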
5. What the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?
NA
6. Which dataset(s) the technique is applied on?
This paper does not propose a new technique. The authors release a gold standard dataset for public use.
7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.
Yes. The gold standard dataset and guidelines for annotation are available for download at: https://github.com/collab-uniba/EmotionDatasetMSR18
8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?
No
9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?
Does not apply. However, the authors assess the inter-rater agreement using standard metrics. Specifically, the observed agreement ranges from .86 for Joy to .98 for Surprise, while Fleiss' Kappa ranges from .30 for Surprise to .66 for Love.
10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).
No, but it uses the Shaver framework for the annotation of emotions, which was already adopted by Ortu et al. (MSR 2014).
11. What success metrics are used?
Observed agreement, i.e., the percentage of times the human judges agree with each other, and Fleiss' Kappa, the variant of Cohen's Kappa used to compute agreement when there are more than two raters per document.
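For reference, both metrics can be computed from an N-items by k-categories count matrix, where each cell holds how many raters assigned that category to that item. The sketch below is illustrative Python, not the authors' code, and the toy data is hypothetical:

    import numpy as np

    def observed_agreement(counts: np.ndarray) -> float:
        """Mean fraction of agreeing rater pairs per item.

        counts: N x k matrix; each row sums to n, the raters per item.
        """
        n = counts.sum(axis=1)[0]
        per_item = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
        return float(per_item.mean())

    def fleiss_kappa(counts: np.ndarray) -> float:
        """Fleiss' Kappa: observed agreement corrected for chance."""
        N = counts.shape[0]
        n = counts.sum(axis=1)[0]
        p_bar = observed_agreement(counts)
        p_j = counts.sum(axis=0) / (N * n)   # marginal category proportions
        p_e = float(np.sum(p_j ** 2))        # expected (chance) agreement
        return (p_bar - p_e) / (1 - p_e)

    # Toy example: 5 documents, 3 raters, binary label (emotion present/absent).
    counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]])
    print(observed_agreement(counts), fleiss_kappa(counts))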
12. Write down any other comments/notes here.
This dataset can support sentiment analysis research in the software engineering domain. It has been used by the same authors to train Senti4SD and EmoTxt, both part of the emotion mining toolkit described in paper 6 of our collection.