Published August 31, 2024 | Version v2
Dataset Open

Online Appendix of "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance"

Description

Abstract
This is the dataset for the paper "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance", presented at SCAM 2024. It contains the target commits and their attributes, as well as the simulation results for the research questions in the paper.
 
The target projects and commits in the dataset are based on a prior study: Nugroho et al., "How different are different diff algorithms in Git?: Use --histogram for code changes", Empirical Software Engineering, 2020, https://doi.org/10.1007/s10664-019-09772-z
 
Survey Overview
1. Filtration
We collected the following attributes of each change, both for filtering the commits and for answering the research questions.
  • Number of lines
  • Number of changed lines
  • Similarity distance
  • Number of mismatch diff areas
2. RQ1
We investigated the minimum number of feedback actions needed to correct the initial diffs to the target diffs. To reduce the cost of the simulation, we used two types of heuristic functions in A* search: a non-admissible one and an admissible one. When the result of the search with the non-admissible heuristic function from the initial state did not match the ideal optimal result, i.e., when there was room for improvement in the number of feedback actions, we ran another A* search with the admissible heuristic function.
 
We obtained the following results through search:
  • Number of feedback actions (A* search with non-admissible heuristic)
  • Number of feedback actions (A* search with admissible heuristic)
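
The two-stage search described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy graph, the heuristic tables, and the `lower_bound` parameter (standing in for the "ideal optimal result") are all assumptions made for illustration, and every feedback action is assumed to cost 1.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return the number of unit-cost actions from start to goal, or None."""
    open_heap = [(heuristic(start), 0, start)]
    best_g = {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was found later
        for nxt in neighbors(node):
            new_g = g + 1  # assumption: each feedback action costs 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


# Hypothetical toy state graph: each edge stands for one feedback action.
GRAPH = {"S": ["A", "B"], "A": ["G"], "B": ["C"], "C": ["G"], "G": []}

# Exact remaining distances: never overestimates, hence admissible.
ADMISSIBLE = {"S": 2, "A": 1, "B": 2, "C": 1, "G": 0}
# Overestimates the remaining cost from "A", hence non-admissible.
NON_ADMISSIBLE = {"S": 0, "A": 10, "B": 0, "C": 0, "G": 0}


def min_feedback(start, goal, lower_bound):
    """Try the non-admissible heuristic first; fall back to the admissible one."""
    fast = a_star(start, goal, GRAPH.__getitem__, NON_ADMISSIBLE.__getitem__)
    if fast is not None and fast == lower_bound:
        return fast  # already matches the assumed ideal result
    return a_star(start, goal, GRAPH.__getitem__, ADMISSIBLE.__getitem__)
```

In this toy graph the non-admissible heuristic steers the search away from the shorter route, so only the second search with the admissible heuristic is guaranteed to return the minimum number of actions.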
 
3. RQ2
We investigated the effects that individual feedback actions have on the diffs by examining the diffs at depth 1 of the search tree. For each search problem, the dataset records the maximum, minimum, median, mean, and standard deviation of the following measures.

The study yielded the following results:
  • Similarity distance
  • Number of mismatch diff areas
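
As a rough illustration of how such per-problem summaries can be computed, the sketch below aggregates one measure over the depth-1 diffs of a single search problem. The function name `summarize` and the input values are hypothetical, and the use of the sample standard deviation is an assumption; the dataset description does not say whether the sample or population form was used.

```python
import statistics


def summarize(values):
    """Summarize one measure over the depth-1 diffs of a search problem."""
    return {
        "min": min(values),
        "max": max(values),
        "ave": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: sample standard deviation; needs at least two values.
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical similarity distances of four child diffs.
summary = summarize([3, 5, 4, 8])
```

The dictionary keys mirror the suffixes of the RQ2 columns (`_min`, `_max`, `_ave`, `_median`, `_sd`).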
 
Dataset Columns
The following are the contents represented by the columns in the CSV file and their descriptions.
 
Column name Description
project_name Name of the project associated with the data.
filename Name of the file being analyzed.
filepath Path to the file within the project.
commit_id Commit hash representing the new version of the file.
parent_commit Commit hash representing the old version of the file.
error_commit An error that occurred when retrieving the commit from the repository.
error_setup An error that occurred when generating the new and old versions of the file.
error_analyze An error that occurred when collecting information for filtering.
new_loc Number of lines in the new version of the source code.
old_loc Number of lines in the old version of the source code.
histogram_len Path length of the diff when using the Histogram algorithm.
myers_len Path length of the diff when using the Myers algorithm.
histogram-myers#edge Number of difference edges between Histogram and Myers diff.
histogram-dp#edge Number of difference edges between Histogram and initial diff.
myers-dp#edge Number of difference edges between Myers and initial diff.
histogram-myers#area Number of mismatch diff areas between Histogram and Myers diff.
histogram-dp#area Number of mismatch diff areas between Histogram and initial diff.
myers-dp#area Number of mismatch diff areas between Myers and initial diff.
dp#candidate Number of feedback candidates of initial diff (similarity distance).
#insert Number of lines added in the change.
#delete Number of lines deleted in the change.
#change Total number of changed lines (#insert + #delete).
error_Asearch An error occurring during A* search with a non-admissible heuristic.
#feedback_A Number of feedback actions for A* search with a non-admissible heuristic.
time_A Time taken for A* search with a non-admissible heuristic (ms).
RQ1_error_iteration An error raised when the number of iterations exceeds the limit (10,000,000) during A* search with an admissible heuristic.
RQ1_error_timeout An error raised when the search time exceeds the limit (1,800 seconds).
RQ1_error_other Other errors encountered during A* search with an admissible heuristic.
RQ1#feedback Number of feedback actions for A* search with an admissible heuristic.
RQ1#iter Number of iterations for A* search with an admissible heuristic.
RQ1_time Time taken for A* search with an admissible heuristic.
RQ2_error_exceed An error due to exceeding time or iteration limits in RQ2.
RQ2_error_other Other errors encountered during RQ2.
RQ2#children Number of child nodes of the initial state (= similarity distance).
RQ2#candidate_min Minimum similarity distance among the generated diffs.
RQ2#candidate_max Maximum similarity distance among the generated diffs.
RQ2#candidate_ave Average similarity distance among the generated diffs.
RQ2#candidate_median Median of similarity distance among the generated diffs.
RQ2#candidate_sd Standard deviation of similarity distance among the generated diffs.
RQ2#area_min Minimum number of mismatch diff areas among the generated diffs.
RQ2#area_max Maximum number of mismatch diff areas among the generated diffs.
RQ2#area_ave Average number of mismatch diff areas among the generated diffs.
RQ2#area_median Median number of mismatch diff areas among the generated diffs.
RQ2#area_sd Standard deviation of the number of mismatch diff areas among the generated diffs.
is_used Indicates whether this data is used in the results of RQ1 and RQ2.
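
A minimal sketch of how the columns above might be read follows. It keeps only the rows flagged by `is_used` and checks that `#change` equals `#insert + #delete`. The encoding of `is_used` is not documented here, so the `TRUTHY` set is an assumption, and the in-memory sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io

# Assumption: is_used is stored as one of these strings (case-insensitive).
TRUTHY = {"true", "1", "yes"}


def load_used_rows(fileobj):
    """Return the rows whose is_used flag is set, as dictionaries."""
    reader = csv.DictReader(fileobj)
    return [row for row in reader if row["is_used"].strip().lower() in TRUTHY]


def change_is_consistent(row):
    """Check that #change equals #insert + #delete for one row."""
    return int(row["#change"]) == int(row["#insert"]) + int(row["#delete"])


# Hypothetical sample with a subset of the documented columns.
SAMPLE = io.StringIO(
    "project_name,#insert,#delete,#change,is_used\n"
    "demo,3,2,5,True\n"
    "demo,1,0,1,False\n"
)
rows = load_used_rows(SAMPLE)
```

For the real data, the same function could be called on a file opened with `open("scam2024_diff_dataset.csv", newline="")`, provided the flag encoding matches the assumption above.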

Files

scam2024_diff_dataset.csv (2.5 MB)
md5:bc91e02019c7576055eaebc0585b8d1e
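
The MD5 checksum listed above can be used to confirm that a downloaded copy is intact. The following is a small sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `is_intact` are illustrative, and the file is assumed to have been saved locally under its listed name.

```python
import hashlib

# Checksum as listed for scam2024_diff_dataset.csv.
EXPECTED_MD5 = "bc91e02019c7576055eaebc0585b8d1e"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `is_intact("scam2024_diff_dataset.csv")` on the downloaded file should return `True` if the download was not corrupted.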