Published August 17, 2020 | Version v1
Software | Open Access

Artifacts for Hansie: Hybrid and Consensus Regression Test Prioritization

  • Indian Institute of Technology Madras

Description

Traditionally, given a test-suite and the underlying system-under-test, existing test-case prioritization heuristics report a permutation of the original test-suite that is seemingly best according to their criteria. However, we observe that no single heuristic performs optimally in all possible scenarios, given the diverse nature of software and its changes; different heuristics are effective in different scenarios. Interestingly, taken together, the heuristics have the potential to improve overall regression test prioritization across scenarios. In this paper, we pose test-case prioritization as a rank aggregation problem from social choice theory. Our solution approach, named Hansie, is two-flavored: one flavor involves priority-aware hybridization, and the other involves priority-blind computation of a consensus ordering from individual prioritizations. To speed up test-execution, Hansie executes the aggregated test-case orderings in a parallel, multi-processed manner, leveraging regular (uniform) parallelization windows in the absence of ties and irregular (non-uniform) windows in the presence of ties. We show the benefit of test-execution after prioritization and introduce a cost-cognizant metric (EPL) to quantify the overall timeline latency due to load-imbalance arising from uniform or non-uniform parallelization windows. We evaluate Hansie on 20 open-source subjects totaling 287,530 lines of source code and 69,305 test-cases, with parallelization support of up to 40 logical CPUs.
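
As a concrete illustration of the priority-blind flavor, consider Borda-count consensus (the (borda) operator in the results hierarchy described below): each individual heuristic's ordering awards a test at rank r (0-based) a score of n - r, and tests are re-ranked by total score across heuristics. The following is a minimal self-contained C++ sketch under that scoring scheme; the identifiers are hypothetical and not taken from Hansie's sources.

    // Borda-count consensus over k individual prioritizations (hypothetical sketch).
    // Test-ids are assumed to be 0-based and < numTests in every ordering.
    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    std::vector<int> bordaConsensus(const std::vector<std::vector<int>>& orderings,
                                    std::size_t numTests) {
        std::vector<long> score(numTests, 0);
        for (const auto& ordering : orderings)                 // one per heuristic
            for (std::size_t rank = 0; rank < ordering.size(); ++rank)
                score[ordering[rank]] += static_cast<long>(numTests - rank);

        std::vector<int> consensus(numTests);
        std::iota(consensus.begin(), consensus.end(), 0);
        std::stable_sort(consensus.begin(), consensus.end(),   // equal scores = ties
                         [&](int a, int b) { return score[a] > score[b]; });
        return consensus;
    }

Tests left with equal total scores constitute the ties that Hansie handles with irregular parallelization windows.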

We evaluate Hansie on an Intel Xeon CPU E5-2640 v4 system with 20 physical cores (40 logical cores) clocked at 2.40GHz and 64GB RAM, running the CentOS Linux release 7.5.1804 (Core) 64-bit operating system. The subjects in our evaluation were compiled using the clang 3.9.0 frontend of LLVM. Hansie itself was compiled using g++ 5.3.1 with support for OpenMP parallel programming. We also installed the libraries required for building 32-bit executables on 64-bit systems (via the -m32 flag), as two benchmarks {c4, xc} are 32-bit applications.
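
To illustrate the execution model, here is a minimal OpenMP sketch (hypothetical code, not Hansie's actual source) of running the parallelization windows of a prioritized test-suite: windows execute one after another, while the tests inside a window run concurrently. Such code requires the -fopenmp flag with g++.

    // Hypothetical sketch: executing parallelization windows with OpenMP.
    // Build with: g++ -fopenmp windows.cpp
    #include <cstddef>
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    // Placeholder for invoking the system-under-test on one test-case.
    void runTest(int testId) {
        std::printf("thread %d ran test %d\n", omp_get_thread_num(), testId);
    }

    // Windows execute sequentially; tests within one window run in parallel.
    void executeWindows(const std::vector<std::vector<int>>& windows) {
        for (const auto& window : windows) {
            #pragma omp parallel for schedule(dynamic)
            for (std::size_t i = 0; i < window.size(); ++i)
                runTest(window[i]);
        }
    }

    int main() {
        // Regular (uniform) windows of size 2 plus a remainder; irregular
        // windows would instead be sized by the number of ties at each rank.
        executeWindows({{0, 1}, {2, 3}, {4}});
        return 0;
    }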

This is the replication package for experiments conducted in the manuscript.

===================================================================================================
 Minimum system requirements for Host:
  [1] Total logical count of CPUs: 8
  [2] Physical Memory (RAM): 8GB
  [3] Hard-disk space: 500GB
  [4] Oracle VM VirtualBox {https://www.virtualbox.org/wiki/Downloads}

 Minimum system requirements for VM:
  [1] Total logical count of CPUs: 4
  [2] Physical Memory (RAM): 5000MB
  [3] Hard-disk space: 60GB

 Recommended system requirements for Host:
  [1] Total logical count of CPUs: >= 40
  [2] Physical Memory (RAM): >= 64GB
  [3] Hard-disk space: >= 500GB
  [4] Oracle VM VirtualBox {https://www.virtualbox.org/wiki/Downloads}

 Recommended system requirements for VM:
  [1] Total logical count of CPUs: >= 36
  [2] Physical Memory (RAM): >= 8GB
  [3] Hard-disk space: >= 60GB

 NOTE: While importing [Hansie_VM.ova], the CPU-count and RAM can be set via the GUI provided by Oracle VM VirtualBox. The default setting is 4 CPUs and 5000MB of RAM. When using the recommended settings, set these values to >= 36 CPUs and >= 8GB of RAM, respectively.

===================================================================================================
 Directory structure:

  |hansie_2|: {root folder}.
   ->|benchmarks|: {Benchmarks (dataset #1 and #2) hierarchically organized for ease of interfacing with our C/C++ implementation}.
   ->|raw_data_scripts|:
     -> |generic| (initial JSS submission): {Generic source-codes, eval-scripts, and directories to replicate and store detailed results (current machine) in folders (hansie_2/consensus) and (hansie_2/results), and summarized results (current machine) in folder (hansie_2/compiled)}.
     -> |p100| (initial JSS submission): {Contains source-codes, raw-data, and logs from program execution on the machine denoted as "p100", used for our evaluation. Detailed results (basic-block level) from p100 are stored in (hansie_2/raw_data_scripts/p100/cons_gran_bb/consensus) and (hansie_2/raw_data_scripts/p100/results). Summarized results (p100) are stored in (hansie_2/raw_data_scripts/p100/compiled)}.
     -> |generic2| (JSS submission R1): {Generic source-codes, eval-scripts, and directories to replicate and store detailed results (current machine) in folders (hansie_2/consensus) and (hansie_2/results), and summarized results (current machine) in folder (hansie_2/compiled)}.
     -> |p100_2| (JSS submission R1): {Contains source-codes, raw-data, and logs from program execution on the machine denoted as "p100", used for our evaluation. Detailed results (basic-block level) from p100 are stored in (hansie_2/raw_data_scripts/p100_2/consensus) and (hansie_2/raw_data_scripts/p100_2/results). Summarized results (p100) are stored in (hansie_2/raw_data_scripts/p100_2/compiled)}.


 NOTE: 
  ~ (initial JSS submission) means raw_data_scripts as submitted during our initial JSS submission.
  ~ (JSS submission R1) means the extended raw_data_scripts as submitted during our revised JSS submission R1. We have also included a comparison with Adaptive Random (ART) prioritization (geometric-mean taken over 30 runs), located under (p100_2/[benchmark]/ART). The directory (p100_2/[benchmark]/normal) contains p100's results when Hansie is operated without ART prioritization. Under (p100_2/.../ART) or (p100_2/.../normal), three sub-directories are present: (i) results, (ii) consensus, and (iii) compiled, each of which has the same structure as explained in the directory structure of our root folder (hansie_2).
  ~ The directory (hansie_2/raw_data_scripts) now contains consolidated artifacts across both the JSS submissions.
  ~ The directory (hansie_2/raw_data_scripts/p100_2/faults_failures) contains detailed calculations for computing APFD in terms of faults versus failures (see the APFD formula below).
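  ~ For reference, the standard APFD (Average Percentage of Faults Detected) metric for a prioritized suite of n test-cases exposing m faults is
        APFD = 1 - (TF_1 + TF_2 + ... + TF_m) / (n * m) + 1 / (2 * n),
    where TF_i is the 1-based position in the prioritized order of the first test-case that exposes fault i. Computing APFD over failures instead of faults changes only what each TF_i counts.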

 Usage instructions:

  ~ To test a subject, "tcas" using Hansie (all except ART prioritization), run [make -s hansie_tcas] from root folder (hansie_2). The Makefile is designed to provide the user with messages for locating replicated results. Evaluation of any other benchmark may be subsequently made by running [make -s hansie_{benchmark}] from (hansie_2). Please note that doing so would further populate the sub-directories: (compiled), (consensus), and (results). To evaluate all benchmarks (shown below) run [make -s hansie_all] from (hansie_2). In all cases: (i) (hansie_2/compiled) contains summarized results, and (ii) (hansie_2/consensus) and (hansie_2/results) contain detailed results, for the current machine at basic-block granularity.

  ~ To test a subject, "tcas", using state-of-the-art ART prioritization (30 repetitions) only, run [make -s hansie_tcas_art] from root folder (hansie_2). The Makefile is designed to provide the user with messages for locating replicated results. Evaluation of any other benchmark may be subsequently made by running [make -s hansie_{benchmark}_art] from (hansie_2). Please note that doing so would further populate the sub-directories: (compiled), (consensus), and (results). To evaluate all benchmarks (shown below) run [make -s hansie_all_art] from (hansie_2). In all cases: (i) (hansie_2/compiled) contains summarized results, and (ii) (hansie_2/consensus) and (hansie_2/results) contain detailed results, for the current machine at basic-block granularity.
   
  ~ After evaluating at least one benchmark, (hansie_2/consensus) will contain detailed consensus-prioritization results, and
    (hansie_2/results) will contain detailed results from the individual prioritizations that feed the consensus. However, (hansie_2/results) is provided for comparison purposes only; our intended results reside in (hansie_2/consensus) and (hansie_2/compiled).

  ~ Example hierarchy after evaluating [tcas] on p100 with {sequential, 10, 20, 30} threads.

compiled
    └── tcas                        // results for [tcas]
        ├── borda                   // results for borda-consensus
        │   ├── borda-consensus.res.ods
        │   ├── load1.prof.ods      // test execution cost at version [v1/tcas.c]
        │   ├── load2.prof.ods 
        │   ├── load3.prof.ods 
        │   ├── load4.prof.ods
        │   ├── load5.prof.ods
        │   ├── report_par_10.ods   // summarized report for consensus-prioritized test-execution with 10-threaded (uniform) parallel-windows 
        │   ├── report_par_20.ods
        │   ├── report_par_30.ods 
        │   ├── report_par_resp.ods // summarized report for consensus-prioritized test-execution with non-uniform parallel-windows,
        │   │                       // each of size equal to the number of ties for a rank
        │   ├── report_seq.ods      // summarized report for consensus-prioritized sequential test-execution
        │   ├── shuffles1.perm.ods
        │   ├── shuffles2.perm.ods  // test-ids for the resulting test-suite permutation after consensus prioritization for version [v2/tcas.c]
        │   ├── shuffles3.perm.ods
        │   ├── shuffles4.perm.ods
        │   ├── shuffles5.perm.ods
        │   ├── timeline1.fails.ods
        │   ├── timeline2.fails.ods
        │   ├── timeline3.fails.ods // test-outcomes denoted as '0' for pass and '1' for failure along the sequential-timeline of
        │   │                       // consensus-prioritized test-execution for version [v3/tcas.c]
        │   ├── timeline4.fails.ods
        │   └── timeline5.fails.ods
        ├── gm ---                  // results for geometric-mean-consensus (**contents suppressed**)
        ├── hm ---                  // results for harmonic-mean-consensus (**contents suppressed**)
        ├── ky ---                  // results for kemeny-young-consensus (**contents suppressed**)
        ├── median ---              // results for median-consensus (**contents suppressed**)
        ├── sm ---                  // results for arithmetic-mean-consensus (**contents suppressed**)
        │   
        ├── tcas_10_90.relcon.ods
        ├── tcas_20_80.relcon.ods
        ├── tcas_30_70.relcon.ods
        ├── tcas_40_60.relcon.ods   // summarized report for sequential test-execution due to hybrid-prioritization
        │                           // {weighted sum of (40%--relevance-score, 60%--confinedness-score)}
        ├── tcas_50_50.relcon.ods   // these hybrids form a spectrum with {relevance} and {confinedness} at extreme ends. 
        ├── tcas_60_40.relcon.ods   // these are also treated as individual prioritizations taking part in the consensus
        ├── tcas_70_30.relcon.ods
        ├── tcas_80_20.relcon.ods
        ├── tcas_90_10.relcon.ods
        ├── tcas.con.ods            // summarized report for sequential test-execution due to individual prioritization by {confinedness}
        ├── tcas.cost.ods           // summarized report for sequential test-execution due to individual prioritization by {cost-only
        │                           // (cost determined by cachegrind)}
        ├── tcas.ga.ods             // summarized report for sequential test-execution due to individual prioritization by {greedy additional}
        └── tcas.rel.ods            // summarized report for sequential test-execution due to individual prioritization by {relevance}

     NOTE: For each report (except the load*.prof, timeline*.fails, and shuffles*.perm files), all content below the line "summary (geomean-across-versions)" summarizes the entire report by geometric-mean data across versions, making each report self-contained.
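
     NOTE: Here, the geometric mean of k per-version values x_1, ..., x_k is (x_1 * x_2 * ... * x_k)^(1/k); compared with the arithmetic mean, it dampens the influence of extreme versions.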

  ~ Possible values of {benchmark} in [make -s hansie_{benchmark}] and their corresponding URLs (for unprocessed versions) are as follows:
       
    [1] tcas (https://sir.csc.ncsu.edu/content/sir.php)
    [2] totinfo (https://sir.csc.ncsu.edu/content/sir.php)
    [3] schedule (https://sir.csc.ncsu.edu/content/sir.php)
    [4] schedule2 (https://sir.csc.ncsu.edu/content/sir.php)
    [5] printtokens (https://sir.csc.ncsu.edu/content/sir.php)
    [6] grep (https://sir.csc.ncsu.edu/content/sir.php)
    [7] flex (https://sir.csc.ncsu.edu/content/sir.php)
    [8] sed (https://sir.csc.ncsu.edu/content/sir.php)
    [9] gzip (https://sir.csc.ncsu.edu/content/sir.php)
   [10] printtokens2 (https://sir.csc.ncsu.edu/content/sir.php)
   [11] replace (https://sir.csc.ncsu.edu/content/sir.php)
   [12] space (https://sir.csc.ncsu.edu/content/sir.php)

   [13] c4 (https://github.com/rswier/c4)
   [14] xc (https://github.com/lotabout/write-a-C-interpreter)
   [15] mlisp (https://github.com/rui314/minilisp)
   [16] cf (https://github.com/begeekmyfriend/CuckooFilter)
   [17] gravity (https://github.com/marcobambini/gravity)
   [18] scd (https://github.com/cr-marcstevens/sha1collisiondetection)
   [19] slre (https://github.com/cesanta/slre)
   [20] xxhash (https://github.com/Cyan4973/xxHash)

  ~ To begin a fresh evaluation from scratch, run [make -s destroy_all] from (hansie_2).

Files (17.3 GB)

md5:d84c160934566e7802bb34d05e68ff5c