Dataset Open Access

Replication package for the paper "What do Developers Discuss about Code Comments?"

Anonymous

# RP-commenting-practices-multiple-sources
Replication package for the paper "What do Developers Discuss about Code Comments?"

## Structure
```
Appendix.pdf
Tags-topics.md
Stack-exchange-query.md

RQ1/
    LDA_input/
        combined-so-quora-mallet-metadata.csv
        topic-input.mallet

    LDA_output/
        Mallet/
            output_csv/
                docs-in-topics.csv
                topic-words.csv
                topics-in-docs.csv
                topics-metadata.csv
            output_html/
                all_topics.html
                Docs/
                Topics/

RQ2/
    datasource_rawdata/
        quora.csv
        stackoverflow.csv
    manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
```

## Contents of the Replication Package
---
- **Appendix.pdf** - appendix of the paper, containing supplementary tables

- **Tags-topics.md** - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2)

- **Stack-exchange-query.md** - the query used in the Stack Exchange Data Explorer interface to extract the posts

- **RQ1/** - contains the data used to answer RQ1
  - **LDA_input/** - input data used for the LDA analysis
    - `combined-so-quora-mallet-metadata.csv` - Stack Overflow and Quora questions used to perform the LDA analysis
    - `topic-input.mallet` - input file for the MALLET tool
  - **LDA_output/**
    - **Mallet/** - contains the LDA output generated by the MALLET tool
      - **output_csv/**
        - `docs-in-topics.csv` - documents per topic
        - `topic-words.csv` - most relevant topic words
        - `topics-in-docs.csv` - topic probability per document
        - `topics-metadata.csv` - metadata per document and topic probability
      - **output_html/** - browsable results of the MALLET output
        - `all_topics.html`
        - `Docs/`
        - `Topics/`

- **RQ2/** - contains the data used to answer RQ2
  - **datasource_rawdata/** - contains the raw data for each source
    - `quora.csv` - contains the processed Quora dataset (for example, with HTML tags removed). For details on the preprocessing steps, please refer to the reproducibility section of the paper. The data was preprocessed using the Makar tool.
    - `stackoverflow.csv` - contains the processed Stack Overflow dataset. For details on the preprocessing steps, please refer to the reproducibility section of the paper. The data was preprocessed using the Makar tool.
  - **manual_analysis_output/**
    - `stackoverflow_quora_taxonomy.xlsx` - contains the classified Stack Overflow and Quora datasets and a description of the taxonomy.
      - `Taxonomy` - contains the description of the first-dimension and second-dimension categories. Second-dimension categories are further divided into levels, separated by the `|` symbol.
      - `stackoverflow-posts` - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.
      - `quora-posts` - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.
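
The `topics-in-docs.csv` file under `RQ1/` is described as holding the topic probability per document. A minimal sketch of how such data could be reduced to one dominant topic per document follows. The column layout of the real file is not documented here, so the sketch runs on hypothetical in-memory rows (a document id followed by one probability per topic); the helper name `dominant_topics` and the sample values are assumptions for illustration only.

```python
import csv
import io

# Hypothetical sample in an ASSUMED layout: doc id, then one probability per topic.
# The real topics-in-docs.csv may use different columns; adapt the parsing accordingly.
SAMPLE = """doc_id,topic_0,topic_1,topic_2
d1,0.70,0.20,0.10
d2,0.05,0.15,0.80
"""


def dominant_topics(csv_text):
    """Map each document id to (index of its most probable topic, probability)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    result = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        doc_id = row[0]
        probs = [float(p) for p in row[1:]]
        best = max(range(len(probs)), key=probs.__getitem__)
        result[doc_id] = (best, probs[best])
    return result


if __name__ == "__main__":
    for doc, (topic, prob) in dominant_topics(SAMPLE).items():
        print(doc, topic, prob)
```

To apply the same idea to the packaged file, the text of `topics-in-docs.csv` would be read from disk and the row parsing adjusted to whatever columns the file actually contains.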
---
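
The `Taxonomy` sheet is described as dividing second-dimension categories into levels separated by the `|` symbol. The sketch below shows one way such a label could be split into its levels. The function name `split_levels` and the example label are hypothetical and are not taken from the spreadsheet.

```python
def split_levels(category):
    """Split a second-dimension category into its levels, trimming whitespace."""
    return [level.strip() for level in category.split("|") if level.strip()]


if __name__ == "__main__":
    # Hypothetical label, used only to illustrate the separator.
    example = "Level one | Level two | Level three"
    print(split_levels(example))  # ['Level one', 'Level two', 'Level three']
```

Empty segments are dropped, so a label with a trailing separator still yields only its non-empty levels.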

 

Files (47.6 MB)
SCAM-developer-commenting-practices-discussions.zip (47.6 MB)
md5:878b495cc182bfd689c2e32092e2e33b
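
The archive is listed with an MD5 checksum, so a downloaded copy can be checked before use. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_expected` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed for the archive on this page.
EXPECTED_MD5 = "878b495cc182bfd689c2e32092e2e33b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Check whether the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("SCAM-developer-commenting-practices-discussions.zip")` would return `True` for an intact download saved under that name in the current directory.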