On the automatic classification of app reviews

1. Does the paper propose a new opinion mining approach?

No

2. Which opinion mining techniques are used (list all of them, clearly stating their name/reference)?

Text classification (comparing Naive Bayes, Decision Tree, and Maximum Entropy classifiers) for review type classification; SentiStrength (Thelwall et al. 2010) for sentiment analysis
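
For illustration, a minimal sketch of the review-type classification step as a scikit-learn pipeline with Multinomial Naive Bayes (one of the three compared algorithms); the sample reviews, labels, and feature choices are hypothetical stand-ins, not the paper's actual data or tooling:

    # Minimal sketch of review-type classification with a bag-of-words
    # Naive Bayes pipeline. Reviews and labels below are hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    reviews = [
        "The app crashes every time I open the camera",  # bug report
        "Please add a dark mode option",                 # feature request
        "I use this app daily to track my runs",         # user experience
        "Great app, five stars!",                        # rating
    ]
    labels = ["bug report", "feature request", "user experience", "rating"]

    # Unigrams + bigrams, since the paper also evaluates bigram features.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(reviews, labels)
    print(clf.predict(["It freezes when I upload a photo"]))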

3. Which opinion mining approaches in the paper are publicly available? Write down their name and links. If no approach is publicly available, leave it blank or None.

SentiStrength: http://sentistrength.wlv.ac.uk/

4. What is the main goal of the whole study?

To compare the accuracy of the techniques for classifying app reviews into four types: bug reports, feature requests, user experiences, and text ratings

5. What the researchers want to achieve by applying the technique(s) (e.g., calculate the sentiment polarity of app reviews)?

Text classification: to automatically assign each review to one of the four review types. SentiStrength: to compute sentiment scores of the reviews, which serve as classification features.
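
SentiStrength itself is a Java tool; below is a hedged sketch of calling it from Python via its command-line interface. The jar name, lexicon directory, and output parsing are assumptions to be checked against the SentiStrength documentation:

    # Hedged sketch: invoking the SentiStrength Java tool from Python.
    # The jar name, data directory, and output format are assumptions;
    # see http://sentistrength.wlv.ac.uk/ for the options your copy supports.
    import subprocess

    def sentiment_scores(text, jar="SentiStrength.jar",
                         data_dir="./SentiStrength_Data/"):
        """Return (positive, negative) scores, e.g. (3, -2), for one review."""
        # SentiStrength's 'text' mode expects '+' in place of spaces.
        out = subprocess.run(
            ["java", "-jar", jar, "sentidata", data_dir,
             "text", text.replace(" ", "+")],
            capture_output=True, text=True, check=True,
        ).stdout
        # Assumed: the two scores are the first integers printed.
        nums = [t for t in out.split() if t.lstrip("-").isdigit()]
        return int(nums[0]), int(nums[1])

    print(sentiment_scores("love this app but it keeps crashing"))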

6. Which dataset(s) the technique is applied on?

4,400 app reviews sampled from 1,303,182 reviews collected from the Apple App Store and Google Play

7. Is/Are the dataset(s) publicly available online? If yes, please indicate their name and links.

Yes, the sampled reviews: https://mast.informatik.uni-hamburg.de/app-review-analysis/

8. Is the application context (dataset or application domain) different from that for which the technique was originally designed?

Text classification: no, the classifiers were trained on new data (labeled app reviews). SentiStrength: yes, it was originally designed for short social web texts rather than app reviews.

9. Is the performance (precision, recall, run-time, etc.) of the technique verified? If yes, how did they verify it and what are the results?

They evaluated the introduced techniques while varying the classification features and the machine learning algorithms. Results were obtained with Monte Carlo cross-validation using 10 runs and a random 70:30 training/test split. When review metadata is combined with simple text classification and natural language preprocessing of the text (particularly bigrams and lemmatization), the classification precision for all review types reached 88–92% and the recall 90–99%.
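
A sketch of this evaluation protocol (Monte Carlo cross-validation with 10 random 70:30 splits, scoring macro-averaged precision, recall, and F1) using scikit-learn; the classifier and the toy labeled reviews are placeholders that only make the sketch runnable, not the paper's originals:

    # Monte Carlo cross-validation: 10 random 70:30 train/test splits,
    # scoring macro precision, recall, and F1 on a stand-in classifier.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import ShuffleSplit, cross_validate
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    reviews = [
        "The app crashes when I open the camera",
        "Login fails with an error after the update",
        "Please add a dark mode option",
        "It would be great to support offline maps",
        "I use this app every day to track my runs",
        "Been using it for a year to plan my meals",
        "Great app, five stars!",
        "Love it, works perfectly",
    ]
    labels = ["bug", "bug", "feature", "feature",
              "experience", "experience", "rating", "rating"]

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    mc_cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)  # 10 runs, 70:30
    scores = cross_validate(clf, reviews, labels, cv=mc_cv,
                            scoring=["precision_macro", "recall_macro", "f1_macro"])
    for metric in ("precision_macro", "recall_macro", "f1_macro"):
        print(metric, round(scores["test_" + metric].mean(), 2))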

10. Does the paper replicate the results of previous work? If yes, leave a summary of the findings (confirm/partially confirms/contradicts).

No, but it extends the authors' previous paper: Maalej, W., Nabil, H. (2015). Bug report, feature request, or simply praise? On automatically classifying app reviews. In: 2015 IEEE 23rd International Requirements Engineering Conference (RE), pp. 116–125.

11. What success metrics are used?

Precision, recall, and F1-score
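
For reference, the standard definitions of these metrics, computed per review type from true positives (TP), false positives (FP), and false negatives (FN):

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}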

12. Write down any other comments/notes here.

-