Online Appendix of "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance"

Yagi, Tsukasa; Hayashi, Shinpei

doi:10.5281/zenodo.13622697

Published August 31, 2024 | Version v2

Dataset Open

Abstract

This is the dataset of the paper entitled "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance", presented at SCAM 2024. It contains information related to the target commits and their attributes, as well as simulation results pertaining to the research questions in the paper.

The target projects and commits in the dataset are based on a prior study: Nugroho, et al.: "How different are different diff algorithms in Git?: Use --histogram for code changes", Empirical Software Engineering, 2020, https://doi.org/10.1007/s10664-019-09772-z

Survey Overview

1. Filtration

We collected the following attributes of each change. They are used both for filtering the commits and for answering the research questions.

Number of lines
Number of changed lines
Similarity distance
Number of mismatch diff areas

2. RQ1

We investigated the minimum number of feedback actions needed to correct the initial diffs into the target diffs. To reduce the cost of the simulation, we used two types of heuristic functions: non-admissible and admissible. We first ran A* search from the initial state with the non-admissible heuristic function. When its result did not match the ideal optimal result, i.e., when there was room for improvement in the number of feedback actions, we ran another A* search with the admissible heuristic function.

The search produced the following results:

Number of feedback actions (A* search with non-admissible heuristic)
Number of feedback actions (A* search with admissible heuristic)
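The two-stage procedure above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy graph, both heuristic tables, the function names, and the `ideal` parameter are hypothetical, and how the paper determines the ideal optimal result is not described on this page. Each edge stands for one feedback action with cost 1.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, int]]]
Heuristic = Callable[[str], int]


def a_star(graph: Graph, start: str, goal: str, h: Heuristic) -> Optional[int]:
    """Return the cost of the first path to `goal` that A* finds, or None."""
    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, node)
    best_g = {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best_g[node]:
            continue  # stale entry: a cheaper path to this node was found later
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None


def two_phase_search(graph: Graph, start: str, goal: str,
                     h_fast: Heuristic, h_admissible: Heuristic,
                     ideal: int) -> Tuple[Optional[int], Optional[int]]:
    """Run the non-admissible search first; rerun with the admissible
    heuristic only when the first result does not match `ideal`."""
    fast = a_star(graph, start, goal, h_fast)
    if fast == ideal:
        return fast, None  # no room for improvement, second search skipped
    return fast, a_star(graph, start, goal, h_admissible)


# Hypothetical toy problem: the shortest path S -> A -> G needs 2 actions.
GRAPH: Graph = {
    "S": [("A", 1), ("B", 1)],
    "A": [("G", 1)],
    "B": [("C", 1)],
    "C": [("G", 1)],
}
H_ADMISSIBLE = {"S": 2, "A": 1, "B": 2, "C": 1, "G": 0}    # never overestimates
H_OVERESTIMATE = {"S": 0, "A": 5, "B": 0, "C": 0, "G": 0}  # overestimates A

result = two_phase_search(GRAPH, "S", "G",
                          H_OVERESTIMATE.__getitem__,
                          H_ADMISSIBLE.__getitem__,
                          ideal=2)
print(result)  # (3, 2)
```

In this toy problem the non-admissible heuristic steers the search away from node A, so the first search returns 3 actions, and the admissible rerun recovers the optimum of 2. The two numbers correspond in spirit to the two result types listed above.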

3. RQ2

We investigated the effects that individual feedback actions have on the diffs by examining the diffs at depth 1 of the search tree. For each search problem, the dataset records the maximum, minimum, median, mean, and standard deviation of the measures listed below.

The study yielded the following results:

Similarity distance
Number of mismatch diff areas
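As a rough illustration of the per-problem summary described above, the following sketch computes the five statistics for one list of values, such as the similarity distances of the depth-1 diffs. The function name and the sample values are hypothetical. Whether the dataset uses the population or the sample standard deviation is not stated here, so the use of `statistics.pstdev` is an assumption.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one measure over the diffs at depth 1 of a search tree."""
    if not values:
        raise ValueError("at least one child diff is required")
    return {
        "min": min(values),
        "max": max(values),
        "ave": statistics.fmean(values),
        "median": statistics.median(values),
        # Assumption: population standard deviation. Use statistics.stdev
        # instead if the sample standard deviation is intended.
        "sd": statistics.pstdev(values),
    }


# Hypothetical similarity distances of four depth-1 diffs.
summary = summarize([3, 5, 4, 4])
print(summary)
```

The keys mirror the suffixes of the RQ2 columns (`_min`, `_max`, `_ave`, `_median`, `_sd`).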

Dataset Columns

The columns of the CSV file and their descriptions are as follows.

Column name	Description
project_name	Name of the project associated with the data.
filename	Name of the file being analyzed.
filepath	Path to the file within the project.
commit_id	Commit hash representing the new version of the file.
parent_commit	Commit hash representing the old version of the file.
error_commit	Error that occurred when retrieving the commit from the repository.
error_setup	Error that occurred when generating the new and old versions of the file.
error_analyze	Error that occurred when collecting information for filtering.
new_loc	Number of lines in the new version of the source code.
old_loc	Number of lines in the old version of the source code.
histogram_len	Path length of the diff when using the Histogram algorithm.
myers_len	Path length of the diff when using the Myers algorithm.
histogram-myers#edge	Number of difference edges between Histogram and Myers diff.
histogram-dp#edge	Number of difference edges between Histogram and initial diff.
myers-dp#edge	Number of difference edges between Myers and initial diff.
histogram-myers#area	Number of mismatch diff areas between Histogram and Myers diff.
histogram-dp#area	Number of mismatch diff areas between Histogram and initial diff.
myers-dp#area	Number of mismatch diff areas between Myers and initial diff.
dp#candidate	Number of feedback candidates of initial diff (similarity distance).
#insert	Number of lines added in the change.
#delete	Number of lines deleted in the change.
#change	Number of changed lines (#insert + #delete).
error_Asearch	Error that occurred during A* search with a non-admissible heuristic.
#feedback_A	Number of feedback actions for A* search with a non-admissible heuristic.
time_A	Time taken for A* search with a non-admissible heuristic (ms).
RQ1_error_iteration	Error raised when the number of iterations exceeds the limit (10,000,000) during A* search with an admissible heuristic.
RQ1_error_timeout	Error raised when the search time exceeds the limit (1,800 seconds).
RQ1_error_other	Other errors encountered during A* search with an admissible heuristic.
RQ1#feedback	Number of feedback actions for A* search with an admissible heuristic.
RQ1#iter	Number of iterations for A* search with an admissible heuristic.
RQ1_time	Time taken for A* search with an admissible heuristic.
RQ2_error_exceed	Error raised when the time or iteration limit is exceeded in RQ2.
RQ2_error_other	Other errors encountered during RQ2.
RQ2#children	Number of child nodes of the initial state (= similarity distance).
RQ2#candidate_min	Minimum similarity distance among the generated diffs.
RQ2#candidate_max	Maximum similarity distance among the generated diffs.
RQ2#candidate_ave	Average similarity distance among the generated diffs.
RQ2#candidate_median	Median of similarity distance among the generated diffs.
RQ2#candidate_sd	Standard deviation of similarity distance among the generated diffs.
RQ2#area_min	Minimum number of mismatch diff areas among the generated diffs.
RQ2#area_max	Maximum number of mismatch diff areas among the generated diffs.
RQ2#area_ave	Average number of mismatch diff areas among the generated diffs.
RQ2#area_median	Median number of mismatch diff areas among the generated diffs.
RQ2#area_sd	Standard deviation of the number of mismatch diff areas among the generated diffs.
is_used	Indicates whether this data is used in the results of RQ1 and RQ2.
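The column layout above suggests a simple way to read the file. The sketch below uses only the Python standard library and rests on several assumptions that should be checked against the actual CSV: the file is comma-delimited with a header row, `is_used` is encoded as a truthy string such as `True` or `1`, and the count columns parse as numbers. The function names and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import Dict, List, TextIO

# Assumption: these strings mark a row as used; adjust after inspecting the file.
TRUTHY = {"true", "1", "yes"}


def read_used_rows(source: TextIO) -> List[Dict[str, str]]:
    """Return the rows whose `is_used` column holds a truthy value."""
    reader = csv.DictReader(source)
    return [row for row in reader
            if (row.get("is_used") or "").strip().lower() in TRUTHY]


def change_is_consistent(row: Dict[str, str]) -> bool:
    """Check that #change equals #insert + #delete for one row."""
    try:
        inserted = float(row["#insert"])
        deleted = float(row["#delete"])
        changed = float(row["#change"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or non-numeric value, e.g. on an error row
    return inserted + deleted == changed


# Invented sample with a subset of the columns. For the real file, use:
#   with open("scam2024_diff_dataset.csv", newline="", encoding="utf-8") as f:
#       rows = read_used_rows(f)
sample = io.StringIO(
    "project_name,#insert,#delete,#change,is_used\n"
    "demo,3,2,5,True\n"
    "demo,1,1,2,False\n"
)
rows = read_used_rows(sample)
print(len(rows), all(change_is_consistent(r) for r in rows))  # 1 True
```

Filtering on `is_used` first keeps the analysis restricted to the rows that contribute to the RQ1 and RQ2 results.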

Files

Name	Size
scam2024_diff_dataset.csv (md5:bc91e02019c7576055eaebc0585b8d1e)	2.5 MB