The Sampling Problem when Mining Inter-Library Usage Patterns

Anonymous

doi:10.5281/zenodo.11526062

Published June 7, 2024 | Version v1

Dataset Open

Tool support in software engineering often depends on relationships, regularities, patterns, or rules mined from sampled code. Examples are approaches to bug prediction, code recommendation, and code autocompletion. Sampling is needed to scale the analysis of data. Many such samples consist of software projects taken from GitHub; however, the specifics of sampling might influence how well the mined patterns generalize.

In this paper, we focus on how to sample software projects that are clients of libraries and frameworks when mining for inter-library usage patterns. We observe that when the sample is limited to clients of one specific library, inter-library patterns in the form of implications from one library to another may not generalize well. Using a simulation and a real case study, we analyze different sampling methods. Most importantly, our simulation shows that the implication generalizes well only when sampling for the disjunction of both libraries involved in it. Second, we show that real empirical data sampled from GitHub does not behave as we would expect from our simulation. This identifies a potential problem with using such an API for studying inter-library usage patterns.
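
The effect of the sampling criterion can be illustrated with a small sketch. It is not the authors' simulation: the synthetic population, the probabilities, the library names A and B, the helper names, and the use of rule confidence as the measure of an implication are all assumptions made for illustration only.

```python
import random


def confidence(projects, antecedent, consequent):
    """Fraction of projects using `antecedent` that also use `consequent`."""
    with_antecedent = [p for p in projects if antecedent in p]
    if not with_antecedent:
        return float("nan")
    hits = sum(consequent in p for p in with_antecedent)
    return hits / len(with_antecedent)


def make_population(n, p_a, p_b_given_a, p_b_given_not_a, seed=0):
    """Build a hypothetical population of projects, each a set of library names."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        libs = set()
        if rng.random() < p_a:
            libs.add("A")
        p_b = p_b_given_a if "A" in libs else p_b_given_not_a
        if rng.random() < p_b:
            libs.add("B")
        population.append(frozenset(libs))
    return population


# Assumed parameters; they are not taken from the paper or its dataset.
population = make_population(10_000, p_a=0.3, p_b_given_a=0.6, p_b_given_not_a=0.2)

# Four ways of selecting projects: no filter, one library only, or the disjunction.
samples = {
    "all projects": population,
    "uses A": [p for p in population if "A" in p],
    "uses B": [p for p in population if "B" in p],
    "uses A or B": [p for p in population if p & {"A", "B"}],
}

for name, sample in samples.items():
    a_to_b = confidence(sample, "A", "B")
    b_to_a = confidence(sample, "B", "A")
    print(f"{name:>12}: conf(A -> B) = {a_to_b:.3f}, conf(B -> A) = {b_to_a:.3f}")
```

In this sketch, filtering on B alone forces the confidence of A → B to 1, and filtering on A alone does the same for B → A. Only the disjunctive filter leaves both confidences equal to their values in the full population, because every project that uses A or uses B is still included.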

Files

Files (6.9 MB)

Name	Size
Archive.zip md5:6f4313def8ced0a8997e4f9ce0f1eef9	6.9 MB

	All versions	This version
Views	50	22
Downloads	11	4
Data volume	76.6 MB	27.8 MB
