Getting Started Guide
==============================================================

Com-CAS is a compiler-directed cache apportioning system that provides dynamic cache allocations for co-executing applications. It consists of:

1. a "backend" compiler and training part
2. a "frontend" runtime system containing the scheduler that uses Intel CAT

We provide the source code for both the backend and the frontend, together with our executables and the outputs we obtained by running them on our machine. However, you can re-compile everything using the provided scripts and rerun all experiments if you want.

We have provided a docker container that is stable on one of our machines. Successfully tested on:

  Ubuntu 18.04.6 LTS
  Docker version 20.10.17, build 100c701

First, make sure you have docker installed. We followed the instructions here:
https://docs.docker.com/install/linux/docker-ce/ubuntu/

Once you have docker installed and the daemon is running (perhaps by verifying via the install guide that the hello-world example works), load the pact:review docker image that we provided:

$ tar xvzf 12.tar.gz
$ cd 12
$ docker import pact_review.tar pact:review

You should see it loading the layers. When it finishes, the image should appear in your images list:

$ docker images

Now run the pact:review docker container, jumping into its shell:

$ docker run -it --privileged pact:review bash

Navigate to the top-level folder of the project and set the environment variables:

$ cd /root
$ source /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler/setenv.sh
$ cp /root/benchmarks/polybench-c-4.2.1-beta/RegCoeffGen.py /root/benchmarks/gapbs/.
$ cp /root/benchmarks/polybench-c-4.2.1-beta/RegCoeffGen.py /root/benchmarks/rodinia/rodinia_3.1/openmp/.

In the top-level folder, you should have five folders:

1) our_outputs: all our timings and outputs from our runs for all the schemes
2) libraries: the libraries that we added and installed into the docker image for Bcache and KPart to work
3) benchmarks: all three benchmark suites (Gaps, Polybench, Rodinia), which you can run with any scheme
4) kpart: the KPart implementation that we compare against our Bcache implementation
5) llvm3.8: our Bcache code and the LLVM executables needed to run everything
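As a quick sanity check before continuing, you can confirm that the five folders are in place and that the pqos tool used by the run scripts is reachable (a minimal sketch; the exact listing assumes the image matches ours):

$ ls /root
benchmarks  kpart  libraries  llvm3.8  our_outputs
$ which pqos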
Mix Number Mapping
==============================================================

In the paper, we have mixes ranging from 1 to 35. We excluded Spec because our main paper showed the benchmarks from Polybench, Gaps, and Rodinia. Below is the mapping from the mix number to the folder name, which we tried to keep the same across all five schemes.

- Mix 1  : mix0   [PolyBench]
- Mix 2  : mix1   [PolyBench]
- Mix 3  : mix2   [PolyBench]
- Mix 4  : mix3   [PolyBench]
- Mix 5  : mix4   [PolyBench]
- Mix 6  : mix5   [PolyBench]
- Mix 7  : mix6   [PolyBench]
- Mix 8  : mix7   [PolyBench]
- Mix 9  : mix8   [PolyBench]
- Mix 10 : mix9   [PolyBench]
- Mix 11 : mix10  [PolyBench]
- Mix 12 : mix11  [PolyBench]
- Mix 13 : mix12  [PolyBench]
- Mix 14 : mix13  [PolyBench]
- Mix 15 : mix14  [PolyBench]
- Mix 16 : mixg1  [Gaps]
- Mix 17 : mixg2  [Gaps]
- Mix 18 : mixg3  [Gaps]
- Mix 19 : mixg4  [Gaps]
- Mix 20 : mixg5  [Gaps]
- Mix 21 : mixg6  [Gaps]
- Mix 22 : mixg7  [Gaps]
- Mix 23 : mixg8  [Gaps]
- Mix 24 : mixg9  [Gaps]
- Mix 25 : mixg10 [Gaps]
- Mix 26 : mixr1  [Rodinia]
- Mix 27 : mixr2  [Rodinia]
- Mix 28 : mixr3  [Rodinia]
- Mix 29 : mixr4  [Rodinia]
- Mix 30 : mixr5  [Rodinia]
- Mix 31 : mixr6  [Rodinia]
- Mix 32 : mixr7  [Rodinia]
- Mix 33 : mixr8  [Rodinia]
- Mix 34 : mixr9  [Rodinia]
- Mix 35 : mixr10 [Rodinia]

Step-by-Step Instructions
==============================================================

A. Build LLVM and Benchmarks (Optional)
------------------------

For Reactive and Bcache, we built everything based on our machine specifications, so you can run the schemes below immediately. The only caveat is that the timings and improvements will differ, since they are affected by the different cache in your system. However, if you want to rebuild everything (which will take a few hours), you need to build llvm and then build each benchmark suite.

For llvm:

$ cd /root/llvm3.8
$ mkdir build
$ cd build
$ cmake ../llvm

We recommend you start with 'make' and then switch to 'make -jX', because there is a slight dependency issue when you run 'make -jX' directly.

$ make (let it run to about 10%, then you can interrupt it)
$ make -jX (where X is the number of cores your machine can handle without any issue)
$ source /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler/setenv.sh

We made our own scheduler for both Perf-Counter and Bcache, so you can build it too:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ bash make.sh

Each benchmark suite has the same four scripts that need to be run: btrain.sh, regress.sh, beacon.sh, and normal.sh. Currently, if you run all four scripts, they will run every benchmark in each suite. If you only want to run a few, you can open the scripts and edit the 'benchs' variable in each of them. For example, in Gaps you have the code below:

| declare -a benchs=( bc bfs cc cc_sv pr sssp tc )
| #declare -a benchs=( bc )

which you can edit by commenting out the first line and editing the second line, or by editing the first line directly. If you commented out the first line and edited the second, it would look something like this:

| #declare -a benchs=( bc bfs cc cc_sv pr sssp tc )
| declare -a benchs=( bc pr )

Before going to each benchmark suite to run these scripts, you need to build the libraries we use during compilation:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/runtime/gt_regress
$ make
$ cd ../gtb
$ make cfs
$ make bes

For Gaps:

$ cd /root/benchmarks/gapbs
$ ./btrain.sh
$ ./regress.sh
$ ./beacon.sh
$ ./normal.sh

For PolyBench:

$ cd /root/benchmarks/polybench-c-4.2.1-beta
$ ./btrain.sh
$ ./regress.sh
$ ./beacon.sh
$ ./normal.sh

For Rodinia:

$ cd /root/benchmarks/rodinia/rodinia_3.1/openmp
$ ./btrain.sh
$ ./run_regress.sh
$ ./beacon.sh
$ ./normal.sh
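If you want to rebuild all three suites in one pass, a small wrapper along these lines should work (a sketch, assuming each suite's scripts are run from its own directory as above; note that Rodinia's regression script is named run_regress.sh rather than regress.sh):

$ for d in /root/benchmarks/gapbs /root/benchmarks/polybench-c-4.2.1-beta; do
>     (cd "$d" && ./btrain.sh && ./regress.sh && ./beacon.sh && ./normal.sh)
> done
$ cd /root/benchmarks/rodinia/rodinia_3.1/openmp
$ ./btrain.sh && ./run_regress.sh && ./beacon.sh && ./normal.sh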
B. Run UnPartitioned Cache Scheme
------------------------

In this scheme we do not partition the cache at all and simply run the mixes. For this, we have a script 'run_all.sh' that runs all the mixes, or you can run specific mixes by invoking their scripts directly. You can also edit 'run_all.sh' if you want to run only a few mixes. If running a particular mix, always run 'pqos -R' before it to reset the cache allocation. This scheme is meant for the Figure 3 comparison.

For Gaps:

$ cd /root/benchmarks/gapbs/unpartitioned_executables
$ ./run_all.sh (for many mixes)
$ pqos -R && ./run_mixgY.sh (for a particular mix, where Y is the mix number)

For Polybench:

$ cd /root/benchmarks/polybench-c-4.2.1-beta/unpartitioned_executables
$ ./run_all.sh (for many mixes)
$ pqos -R && ./run_mixY.sh (for a particular mix, where Y is the mix number)

For Rodinia:

$ cd /root/benchmarks/rodinia/rodinia_3.1/openmp/unpartitioned_executables
$ ./run_all.sh (for many mixes)
$ pqos -R && ./run_mixY.sh (for a particular mix, where Y is the mix number)

C. Run Max Ways Scheme
----------------------

In this scheme we apportion the cache to give each process the maximum number of ways it needs. For this, we have a script 'run_sp_all.sh' that runs all the mixes, or you can run specific mixes by invoking their scripts directly. You can also edit 'run_sp_all.sh' if you want to run only a few mixes. If running a particular mix, always run 'pqos -R' before it to reset the cache allocation. This scheme is meant for the Figure 4 comparison.

For Gaps:

$ cd /root/benchmarks/gapbs/max_ways_executables
$ ./run_sp_all.sh (for many mixes)
$ pqos -R && ./run_mixY_sp1.sh (for a particular mix, where Y is the mix number)

For Polybench:

$ cd /root/benchmarks/polybench-c-4.2.1-beta/max_ways_executables
$ ./run_sp_all.sh (for many mixes)
$ pqos -R && ./run_mixY_sp1.sh (for a particular mix, where Y is the mix number)

For Rodinia:

$ cd /root/benchmarks/rodinia/rodinia_3.1/openmp/max_ways_executables
$ ./run_sp_all.sh (for many mixes)
$ pqos -R && ./run_mixY_sp1.sh (for a particular mix, where Y is the mix number)
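Instead of editing 'run_all.sh' or 'run_sp_all.sh', you can also drive a subset of mixes directly from the shell. For example, to run Gaps mixes 1 through 3 under the unpartitioned scheme with a cache reset before each (a sketch, assuming the per-mix script names shown above):

$ cd /root/benchmarks/gapbs/unpartitioned_executables
$ for y in 1 2 3; do pqos -R && ./run_mixg${y}.sh; done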
D. Kpart
---------------------------------------

This scheme is based on "KPart: A hybrid cache partitioning-sharing technique for commodity multicores" [6], mentioned in the paper. We took their github code, found at https://github.com/Nosayba/kpart, and modified it to work with our cache ways, cores, and dual sockets. For simplicity, we also created scripts for this. This scheme is meant for the Figure 6 comparison.

To rebuild Kpart:

$ cd /root/kpart/lltools
$ make
$ cd /root/kpart/src
$ make

To run more than one mix in one run, you should use the 'run_all.sh' script. You need to edit the benches variable to include more folders so it can run those mixes. You can work out the folder-to-mix mapping from the table above.

$ cd /root/kpart/tests
$ ./run_all.sh
$ cd mixXY && ./example.sh (run a particular mix, where mixXY is the mix folder name)

E. Reactive Scheme
---------------------------------------

This is the perf-counter based scheme, which changes the ways based on IPC and cache misses. We built our own scheduler, so make sure everything from Section A is built. This scheme is meant for the Figure 5 comparison.

Currently, if you run the script, it will run all mixes from all three suites based on our machine's configuration. Our machine has 2 sockets with 14 cores each and a 19 MB L3 cache. To adapt it to your configuration, the configs need to be changed; they can be found in '/root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler/res_configs'. Our config can be seen below:

| NumSockets = 2
| NumCoresPerSocket = 14
| NumThreadsPerCore = 2
| ThreadDistance = 14
| NumJobs = 7
| Clos = 16
| Ways = 11
| Sets = 154
| GFactor = 4

You will only need to edit CoresPerSocket, ThreadsPerCore, ThreadDistance, Ways, and Sets. CoresPerSocket, ThreadsPerCore, and ThreadDistance can be obtained using 'likwid-topology', 'lscpu', or 'cat /proc/cpuinfo'. Ways can be taken from 'getconf -a | grep CACHE', and Sets can be calculated from the cache size and ways (see the sketch after Section F). Once you have the values, you can run our script, which edits all the configs rather than you editing each file:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ python3 updateConfigs.py --NumCoresPerSocket 14 --NumThreadsPerCore 2 --ThreadDistance 14 --Ways 11 --Sets 154

As mentioned, the script is set up to run all mixes. We recommend editing it to run only a few of them; if you run many (more than about 12), it starts getting wonky due to the shared memory between the scheduler and the processes. All the outputs are in the 'res_outputs' folder. Steps:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ ./run_res.sh (after editing it)
$ cd res_outputs

If you want to see the metrics, you can run our script 'parseConfig.py' to parse the log for each process's information:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ python3 parseConfig.py -file res_outputs/gaps_configs/mix0.config

F. Run Bcache Scheme
---------------------------------------

This scheme is our main contribution; its inner workings can be found in the paper. Currently, if you run the script, it will run all mixes from all three suites based on our machine's configuration. The machine configuration works exactly as in the Reactive scheme (Section E): the configs can be found in '/root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler/bcache_configs', they have the same fields and values as shown above, you will only need to edit the same five fields, and the same updateConfigs.py invocation updates all the configs at once.

As before, the script is set up to run all mixes, and we recommend editing it to run only a few of them; with many (more than about 12) it starts getting wonky due to the shared memory between the scheduler and the processes. All the outputs are in the 'bcache_outputs' folder. Steps:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ ./run_bcache.sh (after editing it)
$ cd bcache_outputs

If you want to see the metrics, you can run our script 'parseConfig.py' to parse the log for each process's information:

$ cd /root/llvm3.8/llvm/lib/Transforms/beacons/lib/Scheduler
$ python3 parseConfig.py -file bcache_outputs/gaps_configs/mix0.config
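Both the Reactive and Bcache schemes need these machine values before running. For reference, here is one way to pull them on a Linux box (a sketch; the exact lscpu field names and getconf variable names are assumptions that can vary slightly across systems):

$ lscpu | grep -E 'Socket|Core|Thread'   # sockets, cores per socket, threads per core
$ getconf -a | grep LEVEL3_CACHE         # LEVEL3_CACHE_ASSOC gives Ways; derive Sets from LEVEL3_CACHE_SIZE and Ways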
Possible Issues
==============================================================

(1) Perf is not working correctly

Perf might not work if you do not set kernel.perf_event_paranoid to -1 (you can change it back after you are done). This must be set on your host machine, not in the docker container. You can also set it to 0 or 1, but then you should follow the steps available online for allowing docker to read the perf counters.

To check your kernel.perf_event_paranoid:

$ sysctl kernel.perf_event_paranoid

To set kernel.perf_event_paranoid:

$ sudo sysctl kernel.perf_event_paranoid=-1

(2) Executable has incorrect path

When copying into the docker image, we had to rename a lot of paths to make things work. We might have missed a few, either in the scripts or in the configs. You can fix them based on the path substitutions below:

- /home/bodhisatwa/Research_Repos/benchmarks -> /root/benchmarks
- /home/bodhisatwa/kpart -> /root/kpart
- /home/bodhisatwa/Research_Repos/softwares/llvm3.8 -> /root/llvm3.8
- /root/build/lib -> /root/llvm3.8/build/lib
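If you run into such a stale path, a recursive search-and-replace along these lines can fix it in bulk (a sketch; spot-check or back up the affected files first, and adjust the search root to wherever the stale path appears):

$ grep -rl '/home/bodhisatwa/Research_Repos/benchmarks' /root/benchmarks \
      | xargs sed -i 's|/home/bodhisatwa/Research_Repos/benchmarks|/root/benchmarks|g'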