Published August 31, 2024 | Version v2
Dataset
Open
Online Appendix of "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance"
Authors/Creators
Description
Abstract
This is the dataset of the paper entitled "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance", presented at SCAM 2024. It contains information related to the target commits and their attributes, as well as simulation results pertaining to the research questions in the paper.
The target projects and commits in the dataset are taken from a prior study: Nugroho et al., "How different are different diff algorithms in Git?: Use --histogram for code changes", Empirical Software Engineering, 2020, https://doi.org/10.1007/s10664-019-09772-z
Survey Overview
1. Filtration
We collected the following attributes of each change, both to filter the commits and to answer the research questions (RQs).
- Number of lines
- Number of changed lines
- Similarity distance
- Number of mismatch diff areas
2. RQ1
We investigated the minimum number of feedback actions needed to correct the initial diffs to the target diffs. To reduce the cost of the simulation, we used two types of heuristic functions: a non-admissible one and an admissible one. When the search from the initial state with the non-admissible heuristic function did not match the ideal optimal result, i.e., when there was room for improvement in the number of feedback actions, we ran another A* search with the admissible heuristic function.
We obtained the following results through search:
- Number of feedback actions (A* search with non-admissible heuristic)
- Number of feedback actions (A* search with admissible heuristic)
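The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the heuristic tables, and the names `a_star` and `two_stage_search` are all hypothetical, and the sketch assumes that the "ideal optimal result" can be approximated by the admissible heuristic's value at the initial state (a lower bound on the true cost), which the dataset description does not confirm.

```python
import heapq
from itertools import count

def a_star(start, is_goal, neighbors, heuristic):
    """Return the number of unit-cost actions on the path A* finds, or None."""
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper path to this state was found later
        if is_goal(state):
            return g
        for nxt in neighbors(state):
            ng = g + 1  # every feedback action is assumed to cost 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + heuristic(nxt), next(tie), ng, nxt))
    return None

def two_stage_search(start, is_goal, neighbors, h_fast, h_admissible):
    """Run the cheap search first; rerun with the admissible heuristic only
    when the cheap result is not already provably minimal (assumed criterion)."""
    fast = a_star(start, is_goal, neighbors, h_fast)
    if fast is not None and fast == h_admissible(start):
        return fast, None  # matches the lower bound, so no second search
    return fast, a_star(start, is_goal, neighbors, h_admissible)

# Hypothetical toy problem: the shortest path S -> A -> G takes 2 actions.
GRAPH = {"S": ["A", "B"], "A": ["G"], "B": ["C"], "C": ["G"], "G": []}
H_FAST = {"S": 0, "A": 10, "B": 0, "C": 0, "G": 0}        # overestimates at A
H_ADMISSIBLE = {"S": 2, "A": 1, "B": 2, "C": 1, "G": 0}   # never overestimates

result = two_stage_search(
    "S", lambda s: s == "G", lambda s: GRAPH[s], H_FAST.get, H_ADMISSIBLE.get
)
```

In this toy case the non-admissible search is misled into the three-action path, so the second search is triggered and finds the two-action path. The pair of values loosely corresponds to the two result types listed above.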
3. RQ2
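We investigated the effects that individual feedback actions have on the diffs by examining the diffs at depth 1 of the search tree, i.e., the diffs obtained by applying a single feedback action to the initial diff. For each search problem, the dataset records the maximum, minimum, median, mean, and standard deviation of the following metrics.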
The study yielded the following results:
- Similarity distance
- Number of mismatch diff areas
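As an illustration of how such per-problem summaries could be computed, the sketch below aggregates one metric over the depth-1 diffs of a single search problem. The function name `summarize` and the input values are hypothetical. The source does not state whether the standard deviation is the population or the sample variant, so the sketch assumes the population form.

```python
import statistics

def summarize(values):
    """Summarize one metric over the depth-1 diffs of a single search problem."""
    return {
        "min": min(values),
        "max": max(values),
        "ave": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: population standard deviation; swap in statistics.stdev
        # if the sample standard deviation is the intended one.
        "sd": statistics.pstdev(values),
    }

# Hypothetical metric values for the children of one initial state.
summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
```

The dictionary keys mirror the `_min`, `_max`, `_ave`, `_median`, and `_sd` suffixes used by the RQ2 columns.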
Dataset Columns
The following are the contents represented by the columns in the CSV file and their descriptions.
| Column name | Description |
|---|---|
| project_name | Name of the project associated with the data. |
| filename | Name of the file being analyzed. |
| filepath | Path to the file within the project. |
| commit_id | Commit hash representing the new version of the file. |
| parent_commit | Commit hash representing the old version of the file. |
| error_commit | Error that occurred when retrieving the commit from the repository. |
| error_setup | Error that occurred when generating the new and old versions of the file. |
| error_analyze | Error that occurred when collecting information for filtering. |
| new_loc | Number of lines in the new version of the source code. |
| old_loc | Number of lines in the old version of the source code. |
| histogram_len | Path length of the diff when using the Histogram algorithm. |
| myers_len | Path length of the diff when using the Myers algorithm. |
| histogram-myers#edge | Number of difference edges between Histogram and Myers diff. |
| histogram-dp#edge | Number of difference edges between Histogram and initial diff. |
| myers-dp#edge | Number of difference edges between Myers and initial diff. |
| histogram-myers#area | Number of mismatch diff areas between Histogram and Myers diff. |
| histogram-dp#area | Number of mismatch diff areas between Histogram and initial diff. |
| myers-dp#area | Number of mismatch diff areas between Myers and initial diff. |
| dp#candidate | Number of feedback candidates of initial diff (similarity distance). |
| #insert | Number of lines added in the change. |
| #delete | Number of lines deleted in the change. |
| #change | Total number of changed lines (#insert + #delete). |
| error_Asearch | An error occurring during A* search with a non-admissible heuristic. |
| #feedback_A | Number of feedback actions for A* search with a non-admissible heuristic. |
| time_A | Time taken for A* search with a non-admissible heuristic (ms). |
| RQ1_error_iteration | An error when iterations exceed the limit (10,000,000) during A* search with an admissible heuristic. |
| RQ1_error_timeout | An error when the search time exceeds the limit (1,800 seconds). |
| RQ1_error_other | Other errors encountered during A* search with an admissible heuristic. |
| RQ1#feedback | Number of feedback actions for A* search with an admissible heuristic. |
| RQ1#iter | Number of iterations for A* search with an admissible heuristic. |
| RQ1_time | Time taken for A* search with an admissible heuristic. |
| RQ2_error_exceed | An error due to exceeding time or iteration limits in RQ2. |
| RQ2_error_other | Other errors encountered during RQ2. |
| RQ2#children | Number of child nodes of the initial state (= similarity distance). |
| RQ2#candidate_min | Minimum similarity distance among the generated diffs. |
| RQ2#candidate_max | Maximum similarity distance among the generated diffs. |
| RQ2#candidate_ave | Average similarity distance among the generated diffs. |
| RQ2#candidate_median | Median similarity distance among the generated diffs. |
| RQ2#candidate_sd | Standard deviation of similarity distance among the generated diffs. |
| RQ2#area_min | Minimum number of mismatch diff areas among the generated diffs. |
| RQ2#area_max | Maximum number of mismatch diff areas among the generated diffs. |
| RQ2#area_ave | Average number of mismatch diff areas among the generated diffs. |
| RQ2#area_median | Median number of mismatch diff areas among the generated diffs. |
| RQ2#area_sd | Standard deviation of the number of mismatch diff areas among the generated diffs. |
| is_used | Indicates whether this data is used in the results of RQ1 and RQ2. |
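A short sanity check on the columns can look like the following sketch, which uses only Python's standard `csv` module and verifies the documented relation `#change = #insert + #delete`. The function names and the sample rows are hypothetical. The sketch assumes the counts are stored as plain integers and skips any row where they are missing or not parseable as integers.

```python
import csv
import io

def read_rows(fileobj):
    """Read the CSV into a list of dicts keyed by the column names above."""
    return list(csv.DictReader(fileobj))

def inconsistent_change_rows(rows):
    """Return rows whose #change differs from #insert + #delete."""
    bad = []
    for row in rows:
        try:
            ins, dele, chg = (int(row[k]) for k in ("#insert", "#delete", "#change"))
        except (KeyError, ValueError, TypeError):
            continue  # counts missing or not integers (assumed format): skip
        if ins + dele != chg:
            bad.append(row)
    return bad

# Hypothetical sample standing in for the real file; for the real data, use
# open("scam2024_diff_dataset.csv", newline="") instead of io.StringIO.
SAMPLE = io.StringIO(
    "project_name,#insert,#delete,#change\n"
    "demo-a,3,2,5\n"
    "demo-b,1,1,3\n"
    "demo-c,,,\n"
)
rows = read_rows(SAMPLE)
bad = inconsistent_change_rows(rows)
```

Rows flagged by the check would indicate either a parsing problem or a mismatch with the column description, so they are worth inspecting before any analysis.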
Files
| Name | Size | Checksum |
|---|---|---|
| scam2024_diff_dataset.csv | 2.5 MB | md5:bc91e02019c7576055eaebc0585b8d1e |
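After downloading, the file can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the local path in the usage comment are assumptions.

```python
import hashlib

# Checksum published with the file listing above.
EXPECTED_MD5 = "bc91e02019c7576055eaebc0585b8d1e"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage, assuming the file was saved under its original name:
#   assert md5_of_file("scam2024_diff_dataset.csv") == EXPECTED_MD5
```

A mismatch usually means an incomplete download or a different version of the record.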