A Large-Scale Empirical Study of Real-Life Performance Issues in Open Source Projects

Yutong Zhao; Lu Xiao; Andre B. Bondi; Bihuan Chen; Yang Liu

doi:10.5281/zenodo.6383167

Published March 24, 2022 | Version v1

Journal article Open


1. Stevens Institute of Technology
2. Fudan University
3. Nanyang Technological University

1. The spreadsheet "Perf Issue Empirical Data Package.xlsx" contains the details of data extraction and annotation of the performance issues.

The three tabs in the above spreadsheet, i.e., “Java Projects Issues”, “Python Projects Issues”, and “C++ Projects Issues”, contain the performance issues in each programming language.

In the following, we elaborate on the data extraction and analysis process for each column in the above spreadsheet:

D1 (Issue ID) shows the ID of the performance issue. It is directly extracted from the issue tracking system, e.g., Apache JIRA.
D4 (Root Cause) shows our categorization of the root cause of the issue. This categorization is based on the following information:
- D2 (Root Cause Text), which shows the text extracted from the Apache JIRA issue report that describes the root cause of the performance issue;
- D3 (Open Coding), which shows the result of the manual open coding process for summarizing the root cause;
- D7 (PC Summary), which summarizes the production code revision based on manual inspection, to gain an in-depth understanding of the root causes.
D6 (Optimization Level) shows the optimization level, i.e., localized or design-level, of the code revision. The level of optimization is first distinguished by:
- D5 (# PC Files), which shows the number of revised production code files. It is extracted from the project version repository on GitHub by searching for the issue ID in the commit messages.

Specifically, 1) a localized optimization revises a single production code file; 2) a design-level optimization simultaneously revises a group of related production code files. Admittedly, simultaneously revising a group of files does not always imply a design-level optimization. For instance, developers may combine multiple change requests, e.g., fixing a functional bug, with the performance optimization. Thus, we manually verify the code revisions and exclude those in which the group of source files is not revised simultaneously for the purpose of performance optimization.
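The automated part of this step (finding the commits that mention an issue ID, counting the revised files, and assigning a preliminary level) can be sketched as follows. This is a minimal illustration and not the authors' tooling: the `Commit` record, the `is_test_file` naming heuristic, and the function names are all assumptions made for the example, and the manual verification described above is still required afterwards.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical record of one commit: its message and the paths it changed."""
    message: str
    files: list[str]


def is_test_file(path: str) -> bool:
    """Assumed heuristic: a file is test code if it sits in a test directory
    or follows a common test naming convention."""
    parts = path.split("/")
    stem = parts[-1].split(".")[0]
    in_test_dir = any(p in ("test", "tests") for p in parts[:-1])
    return (in_test_dir
            or stem.startswith(("Test", "test_"))
            or stem.endswith(("Test", "_test")))


def revised_files(commits: list[Commit], issue_id: str) -> tuple[list[str], list[str]]:
    """Return (production files, test files) revised by commits mentioning issue_id."""
    # Word boundaries keep "HADOOP-123" from matching "HADOOP-1234".
    pattern = re.compile(rf"\b{re.escape(issue_id)}\b")
    production, test = set(), set()
    for commit in commits:
        if pattern.search(commit.message):
            for path in commit.files:
                (test if is_test_file(path) else production).add(path)
    return sorted(production), sorted(test)


def preliminary_level(num_pc_files: int) -> str | None:
    """Preliminary level only; multi-file revisions still need manual verification."""
    if num_pc_files == 1:
        return "localized"
    if num_pc_files > 1:
        return "design-level"
    return None


# Made-up commit history used purely for illustration.
commits = [
    Commit("HADOOP-123. Speed up lookup",
           ["src/main/java/A.java", "src/test/java/TestA.java"]),
    Commit("HADOOP-1234. Unrelated change", ["src/main/java/B.java"]),
    Commit("Follow-up for HADOOP-123", ["src/main/java/C.java"]),
]
pc_files, tc_files = revised_files(commits, "HADOOP-123")
level = preliminary_level(len(pc_files))
```

Here `len(pc_files)` plays the role of D5 and `len(tc_files)` the role of the test-file count, while `level` is only a starting point for the manually verified D6.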

D8 (Design-Level Pattern) shows the categorized design-level optimization patterns, such as Classic Design Pattern, Change Propagation, Optimization Clone, and Parallel Optimization. This categorization is based on:
- D7 (PC Summary), which summarizes the production code revision based on manual review, to gain an in-depth understanding of the logic of the revision.
D9 (# TC Files) shows the number of revised test code files. It is directly extracted from the project version repository on GitHub by searching for the issue ID in the commit messages. This data is used to determine whether a performance issue involves test code changes.
D10 (PC-TC Linkage) shows the relationship between the production code revision and the test code revision. It is identified based on 1) the revision logic in the test code and production code, and 2) the revised test case name and the production code name.
D12 (Co-Change Pattern) shows the categorized test-and-production code co-change patterns. This categorization is based on:
- D11 (TC Summary), which summarizes the nature of the key changes to the test code.
D12 (Improvement Factor) shows the improvement factor after performance optimization. The improvement factor is calculated by comparing the profiling data, retrieved from the performance issue reports, before and after the code revision.
D13 (# Developers) shows the number of involved developers for resolving the performance issue.

This information is retrieved from the Apache JIRA issue tracking system. We implemented a program to download each issue report, including the description and the comments (discussions) by the developers, as an .xml file. Then, we used an XML parsing program to extract the number of involved authors based on the <assignee> tag in the downloaded .xml files.

D14 (# Discussions) shows the number of discussions in the performance issue. Similar to D13 (# Developers), it is used to measure the human effort for resolving a performance issue. This information is also retrieved from the Apache JIRA issue tracking system and extracted by the XML parsing program, which counts the comments submitted by the involved developers.
D15 (Other Concerns) shows other concerns raised while resolving the performance issue, such as maintainability, readability, etc. This information is manually extracted from the description and discussion in the issue reports retrieved from the Apache JIRA issue tracking system.
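The XML-based extraction behind D13 (# Developers) and D14 (# Discussions) can be sketched as follows. This is an illustrative sketch only, not the authors' parsing program: the sample document, the `summarize_issue` function, and the assumption that each comment is stored in a `<comment>` element are hypothetical, and only the `<assignee>` tag is named in the description above. The sketch counts every comment, which simplifies the "comments submitted by the involved developers" criterion.

```python
import xml.etree.ElementTree as ET

# Made-up issue report; the element layout is an assumption for illustration.
SAMPLE_ISSUE_XML = """
<rss>
  <channel>
    <item>
      <title>[EXAMPLE-1] Slow lookup in cache</title>
      <assignee username="alice">Alice</assignee>
      <comments>
        <comment author="bob">Profiling shows a linear scan.</comment>
        <comment author="alice">Patch attached.</comment>
      </comments>
    </item>
  </channel>
</rss>
"""


def summarize_issue(xml_text: str) -> dict:
    """Count distinct <assignee> values and <comment> elements in one issue report."""
    root = ET.fromstring(xml_text)
    # Distinct, non-empty assignee names found anywhere in the report.
    assignees = {
        el.text.strip()
        for el in root.iter("assignee")
        if el.text and el.text.strip()
    }
    # Every <comment> element is counted as one discussion entry.
    num_comments = sum(1 for _ in root.iter("comment"))
    return {"developers": len(assignees), "discussions": num_comments}


summary = summarize_issue(SAMPLE_ISSUE_XML)
```

Running `summarize_issue` over each downloaded .xml file would yield one row of effort measures per issue, which could then be joined to the spreadsheet by issue ID.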

2. The tab named “Literature Review” contains the following information for each paper:

PD1 (Title) shows the title of the paper.
PD2 (Year) shows the year in which the study was published.
PD3 (Tool) shows the name of the proposed tool.
PD4 (Root Cause) shows the root cause of involved performance issues.
PD5 (Usage) shows if the tool can automatically detect and/or fix performance issues.
PD6 (Language) shows the programming language of performance issues.
PD7 (Link) shows the web link for downloading or installing the tool if available.

3. The .zip file contains the Diff-DSM files.

Files (1.0 MB)

Name	MD5	Size
Diff-DSM Files.zip	fc6e824ff0eb61d15f362c8d51e8c042	885.2 kB
Perf Issue Empirical Data Package.xlsx	55d3f7ebf1245e4de18deae8eb2c4917	152.2 kB

