Published May 16, 2015 | Version v1
Dataset Open

Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators"

  • 1. University of Lille / Inria

Description

This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.

Dataset Collection
We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014.
Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we captured it at two points in time.
The dataset D1 contains the apps available in the Google Play Store in January 2014.
The dataset D2 is a new snapshot of the Google Play Store taken in March 2014.

The apps belong to the 27 different categories defined by Google (at the time of writing the paper), and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g., tools-free, tools-paid, sports-new_free), we collected a maximum of 500 samples, resulting in a median of 1,978 apps per category.
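
The sampling limits above imply simple upper bounds on the dataset size. The short sketch below only works through that arithmetic; the constant names are illustrative and are not taken from the authors' crawler.

```python
# Upper bounds implied by the sampling scheme described above.
# Constant names are illustrative, not taken from the original tooling.
CATEGORIES = 27
SUBCATEGORIES = ("free", "paid", "new_free", "new_paid")
MAX_PER_PAIR = 500

pairs = CATEGORIES * len(SUBCATEGORIES)                # category-subcategory pairs
max_per_category = len(SUBCATEGORIES) * MAX_PER_PAIR   # cap for a single category
max_total = pairs * MAX_PER_PAIR                       # cap for the whole crawl

print(pairs, max_per_category, max_total)  # 108 2000 54000
```

These are caps, not expected sizes: a pair may offer fewer than 500 apps, so the number of apps in a snapshot can be lower than the total cap.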

For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star count, and the set of permission requests.

In addition, for each app, we collected up to the 500 most recent reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, so several reviews lack some of these details.
From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews. This resulted in a corpus of 1,402,717 reviews (Jan. 2014).
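
The filtering rule for reviews can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' collection code: the dictionary layout and the key `app_version` are hypothetical, chosen only to mirror the review metadata listed above.

```python
from typing import Dict, List, Optional

# A review is modelled as a plain dict; "app_version" is a hypothetical key
# that is None (or absent) when the user review carries no version.
Review = Dict[str, Optional[str]]


def keep_latest_version_reviews(reviews: List[Review], latest_version: str) -> List[Review]:
    """Keep only the reviews written for the latest version of an app.

    Unversioned reviews and reviews of older versions are discarded.
    """
    return [r for r in reviews if r.get("app_version") == latest_version]


if __name__ == "__main__":
    sample = [
        {"title": "Crashes on start", "app_version": "2.1"},
        {"title": "Worked fine before", "app_version": "2.0"},
        {"title": "No version given", "app_version": None},
    ]
    print(keep_latest_version_reviews(sample, "2.1"))
```

Comparing version strings for equality is the simplest reading of "associated with the latest version"; no ordering of versions is needed because every non-matching review is dropped.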

 

Dataset Stats
Some stats about the datasets:

- D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.

- D2 (Mar. 2014) contains 46,644 apps requesting 9,319 different permissions, and 1,361,319 user reviews.

Additional stats about the datasets are available here.


Dataset Description
To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).

In particular, our dataset graph contains six types of nodes:

- APP nodes containing metadata of each app
- PERMISSION nodes describing permission types
- CATEGORY nodes describing app categories
- SUBCATEGORY nodes describing app subcategories
- USER_REVIEW nodes storing user reviews
- TOPIC nodes describing topics mined from user reviews (using LDA)

Furthermore, there are five types of relationships connecting these nodes:

- USES_PERMISSION between APP and PERMISSION nodes
- HAS_REVIEW between APP and USER_REVIEW nodes
- HAS_TOPIC between USER_REVIEW and TOPIC nodes
- BELONGS_TO_CATEGORY between APP and CATEGORY nodes
- BELONGS_TO_SUBCATEGORY between APP and SUBCATEGORY nodes
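
As a quick way to explore this schema, the sketch below builds one Cypher counting query per relationship type. It rests on two assumptions that should be checked against the actual databases: that the node types are stored as Neo4j labels with exactly these names, and that each edge points from the first node type to the second. The helper `count_query` is hypothetical and not part of the dataset.

```python
# Relationship type -> (source node type, target node type), as listed above.
# Assumption: node types are Neo4j labels and edges point from source to target.
SCHEMA = {
    "USES_PERMISSION": ("APP", "PERMISSION"),
    "HAS_REVIEW": ("APP", "USER_REVIEW"),
    "HAS_TOPIC": ("USER_REVIEW", "TOPIC"),
    "BELONGS_TO_CATEGORY": ("APP", "CATEGORY"),
    "BELONGS_TO_SUBCATEGORY": ("APP", "SUBCATEGORY"),
}


def count_query(rel_type: str) -> str:
    """Build a Cypher query that counts the edges of one relationship type."""
    source, target = SCHEMA[rel_type]
    return f"MATCH (:{source})-[r:{rel_type}]->(:{target}) RETURN count(r) AS n"


if __name__ == "__main__":
    # Print the queries so they can be pasted into the Neo4j browser.
    for rel in SCHEMA:
        print(count_query(rel))
```

If a query returns zero, the direction or label assumption is probably wrong for that relationship; dropping the arrowhead (`-[r:...]-`) makes the pattern direction-agnostic.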


Dataset Files Info

  • Neo4j 2.0 Databases
    • googlePlayDB1-Jan2014_neo4j_2_0.rar
    • googlePlayDB2-Mar2014_neo4j_2_0.rar
      We provide two Neo4j databases containing the two snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. They were created with Neo4j 2.0, specifically with 'Neo4j 2.0.0-M06 Community Edition' (the latest version available when the paper was implemented in 2014).
       
  • Neo4j 3.5 Databases
    • googlePlayDB1-Jan2014_neo4j_3_5_28.rar
    • googlePlayDB2-Mar2014_neo4j_3_5_28.rar
      Neo4j 2.0 is now deprecated and is no longer available for download from the official Neo4j Download Center. We have therefore migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28.
      The databases can be opened with 'Neo4j Community Edition 3.5.28'.
      The tool can be downloaded from the official Neo4j Download page.

      To open the databases with more recent versions of Neo4j, they must first be migrated to the corresponding version. Instructions for the migration process can be found in the Neo4j Migration Guide.

      The first time you connect to the Neo4j database, it may request credentials. The username and password are: neo4j/neo4j

       

 

Files

Files (1.4 GB)

  • md5:816dd11b5d59e33fbcef71362b2849c5 (161.0 MB)
  • md5:5f0812f36addb6679720052c8b6b939a (450.8 MB)
  • md5:5fefafd0968d68b97372aaee5e62ccaf (154.0 MB)
  • md5:1e7db5d827304d12c7ee3aaba6ca1ead (590.5 MB)
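
After downloading an archive, its integrity can be checked against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `matches` are illustrative, and since this listing does not say which checksum belongs to which archive, the expected value has to be taken from the record page for the file in question.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with an expected value, with or without an 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected_hex
```

For example, `matches(Path("googlePlayDB1-Jan2014_neo4j_3_5_28.rar"), "md5:...")` returns `True` only when the downloaded archive is byte-for-byte identical to the published one, with `...` replaced by that file's checksum.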