# About

PySStuBs is a dataset of 73,013 single-statement bugs (also known as simple stupid bugs, or SStuBs) matching 23 different patterns, collected from 1,000 popular open-source Python projects on GitHub (popularity is measured by number of stars in January 2021).

We mined these SStuBs from 1,844,369 bug-fixing commits collected from the 1,000 projects using [World of Code (WoC)](https://github.com/woc-hack/tutorial)[[1]](#1). We automatically identified single-statement changes in 148,450 file pairs by comparing the diverging nodes in their Abstract Syntax Trees (ASTs), and manually classified the types of the diverging nodes into SStuB patterns. We used the patterns previously defined by Karampatsis and Sutton [[2]](#2) for Java projects, and characterized 7 new patterns, some of them unique to Python. Finally, we removed trivial refactoring changes, such as function renamings and changes to string values. The dataset thus comprises the 73,013 changes (58% of the 126,912 non-refactoring single-statement changes we identified) that fit one of the 23 SStuB patterns.
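
The AST comparison described above can be illustrated with Python's standard `ast` module. This is a minimal sketch under stated assumptions, not the mining tool used to build the dataset: the function name `diverging_nodes` and the rule of stopping at the shallowest differing node are choices made here for illustration.

```python
import ast


def _own_values(node):
    """Field values that are not child nodes (names, constants, attribute names)."""
    values = []
    for _, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            continue
        if isinstance(value, list) and any(isinstance(v, ast.AST) for v in value):
            continue
        values.append(value)
    return values


def diverging_nodes(before_src, after_src):
    """Return (before_node, after_node) pairs where the two ASTs diverge.

    Both trees are walked in parallel; the walk stops descending at the
    shallowest node whose type, own values, or number of children differ.
    """
    pairs = []

    def walk(a, b):
        kids_a = list(ast.iter_child_nodes(a))
        kids_b = list(ast.iter_child_nodes(b))
        if (type(a) is not type(b)
                or _own_values(a) != _own_values(b)
                or len(kids_a) != len(kids_b)):
            pairs.append((a, b))
            return
        for child_a, child_b in zip(kids_a, kids_b):
            walk(child_a, child_b)

    walk(ast.parse(before_src), ast.parse(after_src))
    return pairs


# Hypothetical before/after versions of a one-line file.
for old, new in diverging_nodes("x = person.name\n", "x = person.age\n"):
    print(type(old).__name__, "->", type(new).__name__)
```

Under this simplified reading, a file pair that yields exactly one diverging pair would be a candidate single-statement change, and the types of the two nodes are what a pattern classification would start from.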

# SStuB patterns

The following are the descriptions and statistics of the SStuB patterns present in the dataset.

## New SStuBs

- **Change Attribute Used** - When developers change the attribute accessed from an object. For example, `person.name` changes to `person.age`.
    
- **Add Function Around Expression** - When developers put an expression inside a function call, often for modifying the returned value. For example, `human = person` changes to `human = is_human(person)`.
    
- **Add Elements to Iterable** - When developers add an element to a hard-coded iterable, such as a `list` or a `tuple`. For example, `info = (name, age)` changes to `info = (name, age, height)`.
    
- **Change Keyword Argument Used** - When developers change the keyword argument used in a function call or object instantiation. For example, `Person(name=20)` changes to `Person(age=20)`.
    
- **Add Method Call** - When developers add a method call to an expression which references an object, changing the return value. For example, `year = person` changes to `year = person.birth_year()`.
    
- **Change Constant Type** - When developers change the type of a hard-coded constant. For example, `person.age = '10'` changes to `person.age = 10`.
    
- **Add Attribute Access** - When developers access the attribute of an object instead of the object itself. For example, `say_hello_to(person)` changes to `say_hello_to(person.name)`.
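
Several of the patterns above can be recognized from the shape of the changed expression alone. The sketch below is illustrative only and is not the classification procedure used for the dataset (which was manual): the helper `classify_expression_change` is a hypothetical name, it covers just four of the seven new patterns, and it assumes the changed expression has already been isolated.

```python
import ast


def _expr(src):
    """Parse a single Python expression into its AST node."""
    return ast.parse(src, mode="eval").body


def _same(a, b):
    """Structural equality of two AST subtrees (source positions ignored)."""
    return ast.dump(a) == ast.dump(b)


def classify_expression_change(before_src, after_src):
    """Return the name of the matching pattern, or None if none matches."""
    before, after = _expr(before_src), _expr(after_src)
    # person.name -> person.age: same object, different attribute
    if (isinstance(before, ast.Attribute) and isinstance(after, ast.Attribute)
            and _same(before.value, after.value) and before.attr != after.attr):
        return "Change Attribute Used"
    # person -> person.name: the old expression becomes the attribute's owner
    if isinstance(after, ast.Attribute) and _same(after.value, before):
        return "Add Attribute Access"
    if isinstance(after, ast.Call):
        func = after.func
        # person -> person.birth_year(): a method is called on the old expression
        if isinstance(func, ast.Attribute) and _same(func.value, before):
            return "Add Method Call"
        # person -> is_human(person): the old expression becomes an argument
        if any(_same(arg, before) for arg in after.args):
            return "Add Function Around Expression"
    return None


print(classify_expression_change("person", "is_human(person)"))
```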

## Java SStuBs

As described by Karampatsis and Sutton [[2]](#2):

- **Change Identifier Used** - Checks whether an identifier appearing in some expression in the statement was replaced with another one. It is easy for developers to accidentally use a different identifier of the same type instead of the intended one. Copy-pasting code is a potential source of such errors, and similarly named identifiers may further contribute to them.

- **Change Numeric Literal** - Checks whether a numeric literal was replaced with another one. It is easy for developers to mix up two numeric values in their program.

- **Change Boolean Literal** - Checks whether a Boolean literal was replaced, i.e., `True` was replaced with `False` or vice versa. In many cases developers use the opposite of the intended Boolean value.

- **Wrong Function Name** - Checks whether a function with the same parameter list but the wrong name was called. This is a common pitfall.

- **Same Function More Args** - Checks whether an overloaded version of the function with more arguments was called. Functions with multiple overloads can often confuse developers.

- **Same Function Less Args** - Checks whether an overloaded version of the function with fewer arguments was called. For instance, a developer can forget to specify one of the arguments and not realize it if the code still compiles due to function overloading.

- **Same Function Change Caller** - Checks whether the caller object in a function call expression was replaced with another one. When there are multiple variables of the same type, a developer can accidentally perform an operation on the wrong one. Copy-pasting code or mixing up similar variables are common causes of such errors.

- **Same Function Swap Args** - Checks whether a function was called with two of its arguments swapped. When multiple function arguments are of the same type, developers can easily swap two of them without realizing it.

- **Change Binary Operator** - Checks whether a binary operator was accidentally replaced with another one of the same type. For example, developers very often mix up comparison operators in expressions.

- **Change Unary Operator** - Checks whether a unary operator was accidentally replaced with another one of the same type (e.g., developers often forget the `!` operator, `not` in Python, in a Boolean expression).

- **Change Operand** - Checks whether one of the operands in a binary operation was wrong.

- **More Specific If** - Checks whether an extra condition (`&&` operator, `and` in Python) was added to an if statement’s condition.

- **Less Specific If** - Checks whether an extra condition, such that either it or the original one needs to hold (`||` operator, `or` in Python), was added to an if statement’s condition.
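
The call-related patterns above are phrased in Java terms such as overloading. As a rough illustration of how they might look on Python call expressions, the sketch below compares two calls with the standard `ast` module. It is an assumption-laden example rather than the dataset's classifier: `classify_call_change` is a hypothetical helper, "more" and "less" arguments are taken to mean more or fewer positional arguments to the same callee (Python has no overloading), and keyword arguments are ignored.

```python
import ast


def _call(src):
    """Parse a single call expression into its ast.Call node."""
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a call expression")
    return node


def classify_call_change(before_src, after_src):
    """Return the name of the matching call pattern, or None."""
    before, after = _call(before_src), _call(after_src)
    # Positional arguments only; keyword arguments are ignored in this sketch.
    old_args = [ast.dump(a) for a in before.args]
    new_args = [ast.dump(a) for a in after.args]
    f, g = before.func, after.func

    if ast.dump(f) != ast.dump(g):
        if old_args != new_args:
            return None
        if isinstance(f, ast.Attribute) and isinstance(g, ast.Attribute):
            if f.attr == g.attr:
                return "Same Function Change Caller"  # a.send(m) -> b.send(m)
            if ast.dump(f.value) == ast.dump(g.value):
                return "Wrong Function Name"  # os.remove(p) -> os.unlink(p)
            return None
        if isinstance(f, ast.Name) and isinstance(g, ast.Name):
            return "Wrong Function Name"  # min(x) -> max(x)
        return None

    if len(new_args) > len(old_args):
        return "Same Function More Args"
    if len(new_args) < len(old_args):
        return "Same Function Less Args"
    # Exactly two positions differ and their contents are exchanged.
    diff = [i for i, (a, b) in enumerate(zip(old_args, new_args)) if a != b]
    if len(diff) == 2:
        i, j = diff
        if old_args[i] == new_args[j] and old_args[j] == new_args[i]:
            return "Same Function Swap Args"
    return None


print(classify_call_change("copy(src, dst)", "copy(dst, src)"))
```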

## SStuB counts

Below is the distribution of SStuB types in the dataset. Patterns in bold are the new patterns characterized for this dataset.

| Pattern name                       | Count  | %   |
|------------------------------------|--------|-----|
| Same Function More Args            | 9,958  | 14  |
| Wrong Function/Method Name         | 9,091  | 12  |
| Change Identifier Used             | 8,973  | 12  |
| **Add Function Around Expression** | 6,363  | 9   |
| **Change Attribute Used**          | 5,229  | 7   |
| Change Numeric Literal             | 4,775  | 7   |
| Change Operand                     | 4,657  | 6   |
| Same Function Less Args            | 3,381  | 5   |
| **Add Method Call**                | 3,338  | 5   |
| **Add Elements to Iterable**       | 2,541  | 3   |
| More Specific If                   | 2,443  | 3   |
| **Change Constant Type**           | 2,199  | 3   |
| Change Unary Operator              | 2,187  | 3   |
| **Change Keyword Argument Used**   | 1,554  | 2   |
| Change Boolean Literal             | 1,466  | 2   |
| **Add Attribute Access**           | 1,439  | 2   |
| Same Function Wrong Caller         | 1,163  | 2   |
| Change Binary Operator             | 976    | 1   |
| Less Specific If                   | 943    | 1   |
| Same Function Swap Args            | 336    | <1  |
| Total                              | 73,013 | 100 |
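
The percentage column can be reproduced from the counts and the reported total. The snippet below is a small sketch; the helper name `share_label` and the convention of printing a non-zero share that rounds to zero as "<1" are assumptions made here.

```python
TOTAL_SSTUBS = 73_013  # total reported in the table above


def share_label(count, total=TOTAL_SSTUBS):
    """Format a pattern's share of the dataset as a whole-number percentage."""
    percent = round(100 * count / total)
    # A pattern that is present but rounds to 0% is reported as "<1".
    return "<1" if percent == 0 and count > 0 else str(percent)


for name, count in [("Same Function More Args", 9_958),
                    ("Change Binary Operator", 976),
                    ("Same Function Swap Args", 336)]:
    print(f"{name}: {share_label(count)}%")
```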

# Use

If you use this dataset in a research publication, please cite:

```
@inproceedings{athur2021pysstubs,
  author = {Arthur V. Kamienski and Luisa Palechor and Cor-Paul Bezemer and Abram Hindle},
  title = {PySStuBs: Characterizing Single-Statement Bugs in Popular Open-Source Python Projects},
  booktitle = {MSR Mining Challenge},
  year = {2021},
  pages = {1--5}
}
```

# Data

| Column                | Type      | Description                                                                                                                |
|-----------------------|-----------|----------------------------------------------------------------------------------------------------------------------------|
| `project_name`        | `String`  | The name of the project from where the SStuB was collected on GitHub, given as *{author}/{repository}*.                    |
| `commit`              | `String`  | The SHA-1 hash identifying the commit that originated the SStuB.                                                           |
| `file_before_woc_sha` | `String`  | The SHA-1 hash in World of Code (WoC) identifying the file before the change that originated the SStuB.                    |
| `file_after_woc_sha`  | `String`  | The SHA-1 hash in World of Code (WoC) identifying the file after the change that originated the SStuB.                     |
| `line_changed`        | `Integer` | The number of the line where the change was made in the file.                                                              |
| `line_before`         | `String`  | The content of the line of code before the change that originated the SStuB.                                               |
| `line_after`          | `String`  | The content of the line of code after the change that originated the SStuB.                                                |
| `is_java_sstub`       | `Boolean` | Boolean indicating if the SStuB fits a pattern previously identified in Java projects by Karampatsis and Sutton [[2]](#2). |
| `sstub_pattern`       | `String`  | The name/type of pattern that describes the SStuB.                                                                         |
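
As a rough guide to working with these columns, the sketch below parses rows with Python's standard `csv` module and counts the patterns that are not Java SStuBs. Several things here are assumptions: this README does not state the file format, so a CSV export with a header row is assumed; Booleans are assumed to be serialized as `True`/`False`; and the two sample rows are invented placeholders, not records from the dataset.

```python
import csv
import io
from collections import Counter

# Placeholder rows invented for illustration; they are NOT taken from the dataset.
SAMPLE_CSV = """project_name,commit,file_before_woc_sha,file_after_woc_sha,line_changed,line_before,line_after,is_java_sstub,sstub_pattern
example/repo,aaaa,bbbb,cccc,12,x = person.name,x = person.age,False,Change Attribute Used
example/repo,dddd,eeee,ffff,40,flag = True,flag = False,True,Change Boolean Literal
"""


def load_sstubs(handle):
    """Read SStuB rows from a CSV file object, converting typed columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["line_changed"] = int(row["line_changed"])
        # Assumed serialization: the strings "True" / "False".
        row["is_java_sstub"] = row["is_java_sstub"].strip().lower() == "true"
        rows.append(row)
    return rows


rows = load_sstubs(io.StringIO(SAMPLE_CSV))
new_patterns = Counter(
    row["sstub_pattern"] for row in rows if not row["is_java_sstub"]
)
print(new_patterns)
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the data is stored.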

# References

<a id="1">[1]</a>
```
@inproceedings{ma2019world,
  title={World of code: an infrastructure for mining the universe of open source VCS data},
  author={Ma, Yuxing and Bogart, Chris and Amreen, Sadika and Zaretzki, Russell and Mockus, Audris},
  booktitle={2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)},
  pages={143--154},
  year={2019},
  organization={IEEE}
}
```

<a id="2">[2]</a>
```
@inproceedings{karampatsis2020often,
  title={How often do single-statement bugs occur? The manysstubs4j dataset},
  author={Karampatsis, Rafael-Michael and Sutton, Charles},
  booktitle={Proceedings of the 17th International Conference on Mining Software Repositories},
  pages={573--577},
  year={2020}
}
```

