Published May 26, 2019 | Version v1
Dataset Open

Mining Rule Violations in JavaScript Code Snippets

Description

Content of this repository
This repository contains the scripts and dataset for the MSR 2019 Mining Challenge.

GitHub repository with the software used: here.

DATASET
The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. The original, untreated file is called jsanswers.csv and contains the following information:
1. The Id of the question (PostId)
2. The Content (in this case the code block)
3. The length of the code block
4. The line count of the code block
5. The score of the post
6. The title

A quick look at this file shows that a single PostId can have multiple rows related to it; this is how multiple code blocks are saved in the database.

Filtered Dataset:

Extracting code from CSV
We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks of each post into a single JavaScript file named after its PostId. This resulted in 336 thousand files.
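The merging step described above can be sketched as follows. This is a minimal illustration, not the actual "ExtractCodeFromCSV.py": the column headers `PostId` and `Content`, the newline used to join blocks, and both function names are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def merge_code_blocks(csv_file, id_column="PostId", code_column="Content"):
    """Group code blocks by post ID, keeping the row order within each post."""
    blocks = defaultdict(list)
    for row in csv.DictReader(csv_file):
        blocks[row[id_column]].append(row[code_column])
    # Join the blocks of one post with a newline (assumed separator).
    return {post_id: "\n".join(parts) for post_id, parts in blocks.items()}


def write_js_files(merged, out_dir):
    """Write one <PostId>.js file per post into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for post_id, code in merged.items():
        (out / f"{post_id}.js").write_text(code, encoding="utf-8")


# Tiny in-memory stand-in for jsanswers.csv (hypothetical rows).
sample = io.StringIO(
    "PostId,Content\n"
    "101,var a = 1;\n"
    "101,console.log(a);\n"
    "202,alert('hi');\n"
)
merged = merge_code_blocks(sample)
```

For the real file, one would open jsanswers.csv with `newline=""` (as the `csv` module recommends) and pass the result of `merge_code_blocks` to `write_js_files`.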

Running ESLint
Because ESLint is single-threaded, running it on 336 thousand files in one process took a huge toll on the machine, so we created a script named "ESlintRunnerScript.py" to run it. The script splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports; as such, it generates 20 JSON files.
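A rough sketch of this split-and-run approach is shown below. It is not the actual "ESlintRunnerScript.py": the function names, the `report_<i>.json` naming, and the use of `npx eslint` are assumptions. Only the `--format json` and `--output-file` options are standard ESLint CLI flags.

```python
import subprocess
from pathlib import Path


def split_evenly(items, parts):
    """Split items into `parts` lists whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def run_eslint_in_parallel(js_dir, report_dir, parts=20):
    """Start one ESLint process per chunk and wait for all of them."""
    files = sorted(str(p) for p in Path(js_dir).glob("*.js"))
    Path(report_dir).mkdir(parents=True, exist_ok=True)
    procs = []
    for i, chunk in enumerate(split_evenly(files, parts)):
        if not chunk:
            continue  # fewer files than parts
        report = str(Path(report_dir) / f"report_{i}.json")  # assumed naming
        cmd = ["npx", "eslint", "--format", "json",
               "--output-file", report, *chunk]
        procs.append(subprocess.Popen(cmd))
    # ESLint exits non-zero when it finds violations, so return codes
    # are deliberately not treated as failures here.
    for proc in procs:
        proc.wait()
```

Passing many thousands of paths on one command line can exceed the operating system's argument-length limit; placing each chunk in its own directory and linting that directory is one way around this.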

Number of Violations per Rule
This information was extracted using the script named "parser.py". It generated the file "NumberofViolationsPerRule.csv", which contains the number of violations found in the dataset for each rule used in the linter configuration.
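As an illustration of how such counts can be derived, the sketch below tallies the `ruleId` field of every message in ESLint's JSON formatter output. It is not the actual "parser.py"; the function name and the sample report are hypothetical.

```python
import json
from collections import Counter


def count_violations_per_rule(reports):
    """Count messages per rule across several ESLint JSON reports."""
    counts = Counter()
    for report in reports:            # one report per JSON file
        for file_result in report:    # one entry per linted file
            for message in file_result.get("messages", []):
                rule_id = message.get("ruleId")
                # Fatal parse errors are reported with a null ruleId.
                if rule_id is not None:
                    counts[rule_id] += 1
    return counts


# Hypothetical report in the shape produced by `eslint --format json`.
sample_report = json.loads("""
[
  {"filePath": "101.js", "messages": [
    {"ruleId": "no-undef", "severity": 2, "message": "'a' is not defined."},
    {"ruleId": "semi", "severity": 2, "message": "Missing semicolon."}
  ]},
  {"filePath": "202.js", "messages": [
    {"ruleId": "no-undef", "severity": 2, "message": "'b' is not defined."},
    {"ruleId": null, "fatal": true, "severity": 2, "message": "Parsing error"}
  ]}
]
""")
rule_counts = count_violations_per_rule([sample_report])
```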

Number of Violations per Category
To provide relevant statistics on the dataset, we generated the number of violations per rule category, as defined on the ESLint website. This information was extracted using the same "parser.py" script.
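Rolling per-rule counts up into categories could look like the sketch below. The function name, the counts, and the small rule-to-category mapping are illustrative assumptions; the full mapping used for the dataset is the one in the Rules with categories file.

```python
from collections import Counter


def count_violations_per_category(rule_counts, rule_categories,
                                  default="Uncategorized"):
    """Sum per-rule violation counts into their rule categories."""
    totals = Counter()
    for rule, count in rule_counts.items():
        totals[rule_categories.get(rule, default)] += count
    return totals


# Illustrative numbers and a partial mapping, not taken from the dataset.
example_rule_counts = {"no-undef": 5, "no-unused-vars": 2, "semi": 7}
example_categories = {
    "no-undef": "Variables",
    "no-unused-vars": "Variables",
    "semi": "Stylistic Issues",
}
category_counts = count_violations_per_category(
    example_rule_counts, example_categories
)
```

Rules missing from the mapping fall into a separate bucket rather than being dropped, so the category totals always add up to the per-rule totals.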

Individual Reports
This information was extracted from the JSON reports. It is a CSV file with the PostID and the violations per rule.
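One possible way to build such a per-post table from a report is sketched below. The exact column layout of the published file is not described here, so the one-row-per-(PostID, rule) layout, the column names, and the function name are assumptions. The post ID is recovered from the file name, since each JavaScript file is named after its post.

```python
import csv
import io
from collections import Counter
from pathlib import Path


def per_post_rows(report):
    """Return (post_id, rule, count) rows from one ESLint JSON report."""
    rows = []
    for file_result in report:
        # Files are named <PostId>.js, so the stem is the post ID.
        post_id = Path(file_result["filePath"]).stem
        counts = Counter(
            m["ruleId"]
            for m in file_result.get("messages", [])
            if m.get("ruleId")
        )
        for rule, count in sorted(counts.items()):
            rows.append((post_id, rule, count))
    return rows


# Hypothetical report entry for a single post.
example_report = [
    {"filePath": "/tmp/js/101.js", "messages": [
        {"ruleId": "semi"}, {"ruleId": "no-undef"}, {"ruleId": "semi"},
    ]},
]
rows = per_post_rows(example_report)

# Write the rows as CSV (to memory here; a real run would use a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["PostID", "Rule", "Violations"])  # assumed header
writer.writerows(rows)
```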

Rules
The file Rules with categories contains all the rules used and their categories.

 

Files

IndividualReportPerPost.csv

Files (504.7 MB)

md5:913630170212bbb88291773823cd41c3 (50.5 MB)
md5:3d7e9c830aea22cf111ffc3b7d5924e9 (454.2 MB)
md5:fb709bcb6316e355d5b949bce7a57b19 (2.9 kB)
md5:256aeddbd6f9956b51340b506efad07d (145 Bytes)
md5:0f2589cf070d9f2fb01262d3c794afc8 (31.4 kB)
md5:d789a91578f11283de38dedcb4c44d07 (12.5 kB)

Additional details

Related works