There is a newer version of the record available.

Published January 26, 2021 | Version v1.0.0
Dataset Open

Replication package for the paper "What do Developers Discuss about Code Comment Conventions?"

Creators

  • 1. Anonymous

Description

# RP-commenting-conventions-multiple-sources
Replication Package for the paper "What do Developers Discuss about Code Comment Conventions?"

## Structure
```
Appendix.pdf

RQ1/
    LDA_input/
        stackoverfow_raw_dataset.csv

    LDA_output/
        Mallet/
            output_csv/
                docs-in-topics.csv
                topic-words.csv
                topics-in-docs.csv
                topics-metadata.csv
            output_html/
                all_topics.html
                Docs/
                Topics/

RQ2/
    datasource_rawdata/
        mailing_lists_selection_criteria.csv
        quora.csv
        stackoverflow.csv
    manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
```
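
After extracting the archive, a quick check against the listing above can catch an incomplete download. The following is a minimal sketch and not part of the package: the paths in `EXPECTED_FILES` are copied from the listing, while the function name `missing_files` and the extraction directory passed to it are assumptions made for illustration. Whether the archive unpacks into an extra top-level folder is not stated here, so the root directory is left as a parameter.

```python
from pathlib import Path

# Relative paths copied from the structure listing above.
EXPECTED_FILES = [
    "Appendix.pdf",
    "RQ1/LDA_input/stackoverfow_raw_dataset.csv",
    "RQ1/LDA_output/Mallet/output_csv/docs-in-topics.csv",
    "RQ1/LDA_output/Mallet/output_csv/topic-words.csv",
    "RQ1/LDA_output/Mallet/output_csv/topics-in-docs.csv",
    "RQ1/LDA_output/Mallet/output_csv/topics-metadata.csv",
    "RQ1/LDA_output/Mallet/output_html/all_topics.html",
    "RQ2/datasource_rawdata/mailing_lists_selection_criteria.csv",
    "RQ2/datasource_rawdata/quora.csv",
    "RQ2/datasource_rawdata/stackoverflow.csv",
    "RQ2/manual_analysis_output/stackoverflow_quora_taxonomy.xlsx",
]


def missing_files(root):
    """Return the expected relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files` with the directory the archive was extracted into returns an empty list when every listed file is present.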

## Contents of the Replication Package
---
- **Appendix.pdf** - Appendix of the paper, containing supplementary tables

- **RQ1/** - contains the data used to answer RQ1
  - **LDA_input/** - input data used for the LDA analysis
    - `stackoverfow_raw_dataset.csv` - Stack Overflow questions used to perform the LDA analysis
  - **LDA_output/**
    - **Mallet/** - contains the LDA output generated by the MALLET tool
      - **output_csv/**
        - `docs-in-topics.csv` - documents per topic
        - `topic-words.csv` - most relevant topic words
        - `topics-in-docs.csv` - topic probability per document
        - `topics-metadata.csv` - metadata per document and topic probability
      - **output_html/** - browsable results of the MALLET output
        - `all_topics.html`
        - `Docs/`
        - `Topics/`

- **RQ2/** - contains the data used to answer RQ2
  - **datasource_rawdata/** - contains the raw data for each source
    - `mailing_lists_selection_criteria.csv` - criteria used to select the mailing lists.
    - `quora.csv` - contains the processed Quora dataset (for example, with HTML tags removed). For details on the preprocessing steps, refer to the reproducibility section of the paper. The data was preprocessed using the [Makar](https://github.com/maethub/makar) tool.
    - `stackoverflow.csv` - contains the processed Stack Overflow dataset. For details on the preprocessing steps, refer to the reproducibility section of the paper. The data was preprocessed using the [Makar](https://github.com/maethub/makar) tool.
  - **manual_analysis_output/**
    - `stackoverflow_quora_taxonomy.xlsx` - contains the classified Stack Overflow and Quora datasets and a description of the taxonomy.
        - `Taxonomy` - contains the descriptions of the first dimension and second dimension categories. Second dimension categories are further divided into levels, separated by the `|` symbol.
        - `stackoverflow-posts` - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.
        - `quora-posts` - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.
---
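
The description of the `Taxonomy` sheet says that second dimension categories are divided into levels separated by the `|` symbol. A minimal sketch of splitting such a label into its levels might look as follows; the function name `split_levels` and the sample label are hypothetical and are not taken from the spreadsheet.

```python
def split_levels(category):
    """Split a second dimension category label into its levels.

    Levels are separated by the `|` symbol. Surrounding whitespace is
    dropped and empty segments are ignored.
    """
    return [level.strip() for level in category.split("|") if level.strip()]


# Hypothetical label, for illustration only.
print(split_levels("Level A | Level B|Level C"))
# prints ['Level A', 'Level B', 'Level C']
```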

 

Files

ICPC-developer-comment-convention-discussions.zip

Files (36.0 MB)