# FSE'23 Artifact

> This is the artifact for the FSE'23 paper: An Exploratory Empirical Study of Trust & Safety Engineering in Open-Source Social Media Platforms

# About

This repository contains all of the artifacts for our FSE 2023 conference submission.

Our artifact includes the following (ordered by sections):

| Item | Description | Corresponding content in the paper |
|------|-------------|------------------------------------|
| Mining Tool | How to set up and use the tool used to collect issues. | - |
| Reproduce Results - Issue Collection | Steps to collect the initial set of Mastodon and Diaspora issues. | §4.2 |
| Reproduce Results - Keyword Development | Steps to reproduce the keyword development process. | §4.2.1, §4.2.2 |
| Reproduce Results - Issue Filtering | Steps to collect a filtered set of issues and sample them. | §4.2.3 |
| Reproduce Results - Issue Analysis | Steps to analyze the filtered set of issues. | §4.3.1, §4.3.2 |
| Reproduce Results - Interrater Agreement | Discusses data used during the agreement process. | §4.3.3 |
| Reproduce Results - Results | Datasets that were used in the final results. | §5 |

# Mining Tool

The tool used to collect and filter data is located in the `mining_tool/` folder.

## Tool Setup

To set up the tool, we recommend using Python 3.9+ and installing the requirements listed in `requirements.txt`.

A few additional commands need to be run to set up dependencies:
```
python -m spacy download en_core_web_sm
python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')
>>> nltk.download('omw-1.4')
>>> nltk.download('punkt')
```
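
The NLTK downloads can also be scripted instead of typed into the interpreter. The helper below is a minimal sketch, not part of the tool: the function name `download_nltk_resources` and its `downloader` parameter are hypothetical, and the resource names are the four listed above.

```python
# Non-interactive equivalent of the NLTK downloads listed above.
NLTK_RESOURCES = ["wordnet", "stopwords", "omw-1.4", "punkt"]


def download_nltk_resources(resources=NLTK_RESOURCES, downloader=None):
    """Download each resource and report whether each call succeeded.

    `downloader` is a hypothetical hook so the helper can be exercised
    without network access; by default it is `nltk.download`.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper loads without NLTK

        downloader = nltk.download
    return {name: bool(downloader(name)) for name in resources}
```

Calling `download_nltk_resources()` once after installing the requirements should fetch the same data as the interactive session.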

# Reproduce Results

The steps in this section reproduce the results reported in our paper.

## Issue Collection
We provide commands below to extract data for both Diaspora and Mastodon.
The two commands differ only in the repository name (`diaspora` versus `mastodon`).

To get the issue lists and comments, run:
```
python -m mining_tool --github diaspora/diaspora --exp data/diaspora/issues_all.csv --comments-exp data/diaspora/issues/issue
python -m mining_tool --github mastodon/mastodon --exp data/mastodon/issues_all.csv --comments-exp data/mastodon/issues/issue
```

## Keyword Development

The initial keywords from the T&S journal are in `data/filtering/tsjournal/`.
The initial list of 12 keywords is in `data/filtering/tsjournal/keywords_ts_regex`.

Data used to tailor keywords to each repository is in `data/REPO/ts_test/`.
Each subfolder corresponds to one round of the keyword tailoring process.
Round 1 used the keywords from the T&S list of 12 words (located in `data/filtering/tsjournal/keywords_ts_regex`).

The sample batch is located in the round's folder along with calculations of the recall rate after each round.
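
For readers unfamiliar with the metric, the sketch below shows how precision and recall could be computed from a manually labeled sample. It is an illustration under assumptions, not the calculation stored in the round folders: the function name and the `(is_ts, matched)` pair layout are hypothetical, and the sample values are made up.

```python
def precision_recall(labeled_sample):
    """Compute precision and recall of keyword matching on a labeled sample.

    Each item is an (is_ts, matched) pair of booleans: `is_ts` is the manual
    label and `matched` says whether the keyword list flagged the issue.
    """
    tp = sum(1 for is_ts, matched in labeled_sample if is_ts and matched)
    fp = sum(1 for is_ts, matched in labeled_sample if not is_ts and matched)
    fn = sum(1 for is_ts, matched in labeled_sample if is_ts and not matched)
    # Guard against empty denominators when nothing matched or nothing is T&S.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sample: 2 true positives, 1 false positive, 1 false negative.
sample = [(True, True), (True, True), (True, False), (False, True), (False, False)]
precision, recall = precision_recall(sample)
```

A low recall in a round indicates that T&S issues are being missed, which is the signal for extending the keyword list.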

The command used to sample between rounds is of the form:
```
python -m mining_tool --imp data/diaspora/issues_all.csv \
    --sample-amount 100 \
    --exp data/diaspora/ts_test/roundX/issues_sample.csv
```

The command used to detect which issues matched the keyword list is:
```
python -m mining_tool --imp data/diaspora/ts_test/roundX/issues_sample.csv \
    --comments-imp data/diaspora/issues/issue \
    --comments-wl data/diaspora/ts_test/roundX/keywords.txt \
    --comments-bl data/filtering/all.txt \
    --exp data/diaspora/ts_test/roundX/issues_ts.csv
```

The keyword list from the last round's folder was copied to `data/REPO/keywords.txt` and used for the subsequent analysis.

## Issue Filtering
We provide commands below to filter data for Diaspora.
To filter the data for Mastodon, replace `diaspora` with `mastodon` in the commands.

To extract T&S issues, run:
```
python -m mining_tool --imp data/diaspora/issues_all.csv \
    --comments-imp data/diaspora/issues/issue \
    --comments-wl data/diaspora/keywords.txt \
    --comments-bl data/filtering/all.txt \
    --exp data/diaspora/issues_ts.csv
```

To filter the comments for only T&S issues, run:
```
python -m mining_tool --imp data/diaspora/issues_ts.csv \
    --comments-imp data/diaspora/issues/issue \
    --comments-exp data/diaspora/issues_ts/issue
```

To split issue comments into sentences, run:
```
python -m mining_tool --comments-imp data/diaspora/issues_ts/issue \
    --comments-utterances-exp data/diaspora/issues_ts_utterances/issue
```
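
To give a feel for what sentence splitting produces, here is a simplified stand-in that splits on whitespace following sentence-ending punctuation. It is a plain regular-expression splitter, not the tool's method (the setup steps suggest the tool relies on NLTK or spaCy), and the function name `split_utterances` is hypothetical.

```python
import re

# Split on whitespace that follows '.', '!' or '?'. This is a naive rule
# that will mis-split abbreviations such as "e.g. this".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_utterances(comment):
    """Split a comment into sentence-like utterances, dropping empty parts."""
    return [part.strip() for part in _SENTENCE_END.split(comment) if part.strip()]


utterances = split_utterances("This is abusive. Can we block the user? Yes!")
```

Each resulting utterance would then occupy its own row, which is what makes sentence-level coding possible in the analysis step.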

To randomly order the issues and enforce minimum comments, run:
```
python -m mining_tool --imp data/diaspora/issues_ts.csv \
    --min-comments 5 \
    --randomize \
    --exp data/diaspora/issues_ts_min5_rnd.csv
```
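
The effect of `--min-comments` and `--randomize` can be pictured with the sketch below. It assumes that "minimum comments" means at least that many comments and that each issue record carries a `comments` count; the field name, the function name, and the `seed` parameter are hypothetical.

```python
import random


def filter_and_shuffle(issues, min_comments=5, seed=None):
    """Keep issues with at least `min_comments` comments, then shuffle them.

    `issues` is a list of dicts with a hypothetical `comments` count field.
    Passing a `seed` makes the random order repeatable.
    """
    kept = [issue for issue in issues if int(issue["comments"]) >= min_comments]
    random.Random(seed).shuffle(kept)
    return kept


# Made-up issues whose comment count equals their id.
issues = [{"id": i, "comments": i} for i in range(10)]
ordered = filter_and_shuffle(issues, min_comments=5, seed=0)
```

Fixing the random order once and then working from the top of the list gives a random sample of any size without re-sampling.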

When processing issues, we started from the top of `data/diaspora/issues_ts_min5_rnd.csv` and copied issue comment files from `data/diaspora/issues_ts_utterances/` to `data/diaspora/issues_ts_utterances_analyzed/`.

## Issue Analysis

We collected all issue IDs for both projects into `data/analysis/queries/issues.xlsx`.
The codebook that was developed is in `data/analysis/codebook.xlsx`.

First, issue-level codes were assigned to each issue in `data/analysis/queries/issues.xlsx` using the `Issue Codes` sheet.

Next, issue discussions were coded using the `Discussion Model` sheet of the codebook and stored in `data/diaspora/issues_ts_utterances_analyzed/`.

After issues were modeled, we performed queries on the data for `risks`, `options`, and `rationales`.

Commands below are shown for Diaspora, but the corresponding Mastodon commands can be derived by swapping the repo names.

Risk statements were collected across all issue comments:
```
python -m mining_tool --imp data/diaspora/issues_ts_min5_rnd_diaspora.xls \
    --comments-imp data/diaspora/issues_ts_utterances_analyzed/issue \
    --comments-query "lambda x: x['RID']" \
    --comments-query-exp data/analysis/queries/rtv.xls
```
We used the `risk memos` sheet of the codebook to memo and code these statements.

Options statements were collected across all issue comments as well:
```
python -m mining_tool --imp data/diaspora/issues_ts_min5_rnd_diaspora.xls \
    --comments-imp data/diaspora/issues_ts_utterances_analyzed/issue \
    --comments-query "lambda x: x['OPT']" \
    --comments-query-exp data/analysis/queries/patterns.xls
```
We used the `pattern memos` sheet of the codebook to memo and code these statements.

Finally, rationales were collected across all issue comments:
```
python -m mining_tool --imp data/diaspora/issues_ts_min5_rnd_diaspora.xls \
    --comments-imp data/diaspora/issues_ts_utterances_analyzed/issue \
    --comments-query "lambda x: x['RAT']" \
    --comments-query-exp data/analysis/queries/rationales.xls
```
We used the `rationale memos (merged)` and `rationale memos (no action)` sheets of the codebook to memo and code these statements.
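
The query expressions passed to `--comments-query` above are Python lambdas over a coded row. The sketch below shows one way such an expression could select rows; it is an assumption about the behavior, not the tool's implementation. The function name `run_query` is hypothetical, the rows are made up, and the column names `RID`, `OPT`, and `RAT` are taken from the commands.

```python
def run_query(rows, query_source):
    """Return the rows for which the query expression yields a truthy value.

    `query_source` is Python source for a one-argument function, such as
    "lambda x: x['RID']". `eval` is used for brevity and must only be
    given trusted input.
    """
    predicate = eval(query_source)
    return [row for row in rows if predicate(row)]


# Made-up coded utterances; an empty string means the code was not assigned.
rows = [
    {"text": "This could expose private posts.", "RID": "R1", "OPT": "", "RAT": ""},
    {"text": "We could add a block button.", "RID": "", "OPT": "O1", "RAT": ""},
    {"text": "Thanks for the report.", "RID": "", "OPT": "", "RAT": ""},
]
risk_rows = run_query(rows, "lambda x: x['RID']")
```

Under this reading, a query keeps every utterance whose column is non-empty, so uncoded utterances are dropped from the exported sheet.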

## Interrater Agreement
The data for the interrater agreement analysis is located in `data/interrater_agreement/`:
* The discussion model agreement data is in `data/interrater_agreement/discussion_model/`.
* The pattern agreement data is in `data/interrater_agreement/patterns/`.
* The rationale agreement data is in `data/interrater_agreement/rationales/`.
* The risk agreement data is in `data/interrater_agreement/rtv/`.

Each round of agreement is stored in its own subfolder, and the `eval.xls` file in each folder contains the kappa score calculations for that round.
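
For reference, a kappa score can be recomputed from two raters' labels as sketched below. This assumes Cohen's kappa for two raters, which may differ from the exact variant used in `eval.xls`; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Compute Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels: the raters agree on 3 of 4 items.
kappa = cohen_kappa(["risk", "risk", "none", "none"], ["risk", "none", "none", "none"])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the plain percentage of matching labels.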

## Results

For Table 2:
* The final keyword list size comes from `data/REPO/keywords.txt`.
* The final precision and recall rates come from `data/REPO/ts_test/roundX/issues_sample.xlsx`.
* The final number of issues comes from `data/REPO/issues_ts_min5.csv`.
* The percent of issues analyzed comes from `data/REPO/issues_ts_min5_rnd.xls`.

Figure 2 data was taken from `data/mastodon/issues_ts_analyzed/issue9791.xls`.

For RQ1:
* SMP feature frequencies, issue result counts, and issue closure times are on `Sheet2` in the `data/analysis/queries/issues.xls` file.
* Figure 3 is derived from `Sheet3` in the `data/analysis/queries/issues.xls` file.
* Table 3 is derived from `Sheet6` in the `data/analysis/queries/issues.xls` file.
* Data about feature presence over time is in `Sheet5`.

For RQ2: 
* Table 4 data is on `Sheet6` in the `data/analysis/queries/rtv.xls` file.
* Risk statements over time are on `Sheet5` in the `data/analysis/queries/rtv.xls` file.

For RQ3:
* Table 5 data is on `Sheet2` in the `data/analysis/queries/patterns.xls` file.
* Table 6 data is on `Sheet2` in the `data/analysis/queries/rationales.xls` file.
* Figure 5 is derived from `Sheet4` in the `data/analysis/queries/issues.xls` file.
* Issue status and age calculations are on `Sheet2` in the `data/analysis/queries/issues.xls` file.