Published February 26, 2021 | Version 1
Thesis Open

Large-scale Java GitHub analysis of test case presence and testing frameworks usage

  • 1. Technical University of Košice

Contributors

Supervisor:

  • 1. Technical University of Košice

Description

Dataset from a large-scale GitHub analysis based on the GHTorrent list of repositories from May 2019. The dataset contains 6.6M repositories, of which 6.3M were analyzed successfully. Each repository was downloaded and scanned by the proposed script, which collected:

  • number of "test" occurrences vs. executable test cases in each file of the project containing "test" in the file content
  • number of imports of the 26 analyzed testing frameworks in such files
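
The per-file counting described above can be illustrated with a minimal sketch. It is not the script used to build the dataset: the regular expressions, the case-insensitive matching of "test", and the function name `scan_source` are all assumptions made for illustration only.

```python
import re

# Hypothetical patterns; the actual rules of the original script are not
# given in this record.
TEST_WORD = re.compile(r"test", re.IGNORECASE)       # assumption: case-insensitive
TEST_ANNOTATION = re.compile(r"@Test\b")             # annotation-based detection
TEST_PREFIX_METHOD = re.compile(r"\bpublic\s+void\s+(test\w*)\s*\(")  # "test" prefix


def scan_source(source: str) -> dict:
    """Count "test" occurrences and candidate test cases in one source file."""
    return {
        "test_words": len(TEST_WORD.findall(source)),
        "annotated": len(TEST_ANNOTATION.findall(source)),
        "prefixed": len(TEST_PREFIX_METHOD.findall(source)),
    }


# Illustrative input, not taken from the dataset.
sample = """
import org.junit.Test;

public class CalculatorTest {
    @Test
    public void addsNumbers() {}

    public void testSubtract() {}
}
"""
result = scan_source(sample)
print(result)
```

In this sample the word "test" appears more often than there are test cases, which is the kind of gap the ratio of occurrences to identified test cases is meant to capture.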

The dataset was obtained between 2021-01-22 and 2021-02-24. It is a MySQL dump of one table containing the following columns:

  • id - internal table ID
  • full_name - full name of GitHub repository
  • java_kt_processed_files - how many files analyzed in total
  • searched_test_words_in_java_kt - sum of "test" occurrences in the whole project
  • real_tests_java_kt - number of identified test cases
  • real_tests_java_kt_execution_time - time of test case identification in ms
  • loc - lines of code in the files processed during test case identification
  • cloc_output - full json output of cloc
  • frameworks_occurrence - import occurrences per framework in json format, e.g. {"junit4": 2}, meaning 2 imports found in the whole project. The search was performed only in files containing "test".
  • ratios - ratios in json format of "test" occurrence and number of test cases for each file, e.g. ["2": 1], meaning 2x "test" occurrences, 1x test case
  • prediction_from_annotations - number of test cases detected by annotation
  • prediction_from_starts_with_test - number of test cases detected by "test" at the beginning of the test case method name
  • prediction_from_public_methods_in_root - number of test cases detected by count of public methods
  • prediction_from_apache_cactus - number of test cases detected via the Apache Cactus framework
  • unable_to_download - unavailable, changed visibility, deleted, etc.
  • created_at - date of data fetch
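
As a hedged illustration of how the frameworks_occurrence column might be consumed once the dump has been loaded, the sketch below sums import counts per framework across rows. The rows, the `testng` key, and the helper name `total_framework_imports` are hypothetical; only the column names and the {"junit4": 2} shape come from the description above. It also assumes the column may be empty for some repositories.

```python
import json
from collections import Counter


def total_framework_imports(rows):
    """Sum per-framework import counts over the frameworks_occurrence column."""
    totals = Counter()
    for row in rows:
        raw = row.get("frameworks_occurrence")
        if not raw:
            # Assumption: the column can be empty, e.g. for repositories
            # that could not be downloaded.
            continue
        totals.update(json.loads(raw))  # adds the counts of each framework key
    return totals


# Hypothetical rows; the values are illustrative, not taken from the dataset.
rows = [
    {"full_name": "example/a", "frameworks_occurrence": '{"junit4": 2}'},
    {"full_name": "example/b", "frameworks_occurrence": '{"junit4": 1, "testng": 3}'},
    {"full_name": "example/c", "frameworks_occurrence": None},
]
totals = total_framework_imports(rows)
print(totals)
```

In practice the rows would come from a query against the restored table rather than from an in-memory list.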

Notes

This work was supported by project VEGA No. 1/0762/19: Interactive pattern-driven language development.

Files

Files (1.0 GB)

md5:e205743b005b159435dd5e6b9ce5ba7b (1.0 GB)