Published January 29, 2021 | Version v1
Dataset (Open Access)

Replication Package for: Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures

  • Kiel University

Description

This repository contains a replication package and experimental results for our study Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures.

The following description can also be found in the README.md file.

Repeating Benchmark Execution

The following instructions describe how to repeat our scalability experiments. If you plan to conduct your own studies, we suggest using the latest version of Theodolite, which offers significantly improved usability.

The Apache Kafka Streams scalability experiments of our study were executed with Theodolite v0.1.2. To repeat our Kafka Streams experiments:

  1. Clone Theodolite v0.1.2 and install it according to the official documentation located in its execution directory.
  2. Copy the file repeat-kstream.sh into Theodolite's execution directory.
  3. Run the repetition file with ./repeat-kstream.sh from within the execution directory.
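Assuming Theodolite is hosted at github.com/cau-se/theodolite (verify against the official documentation), the three steps can be sketched as a small shell function:

```shell
#!/bin/sh
# Sketch of the Kafka Streams repetition steps.
# The repository URL is an assumption; check the official documentation.
THEODOLITE_VERSION="v0.1.2"

repeat_kstreams() {
  # Step 1: clone the tagged release (installation of its dependencies,
  # e.g. the Kubernetes setup, still follows the official documentation).
  git clone --depth 1 --branch "$THEODOLITE_VERSION" \
    https://github.com/cau-se/theodolite.git
  cp repeat-kstream.sh theodolite/execution/  # step 2: copy the repetition script
  cd theodolite/execution || return 1
  ./repeat-kstream.sh                         # step 3: run it
}
```

Calling repeat_kstreams from the directory containing repeat-kstream.sh performs steps 1 to 3.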

Our Apache Flink benchmark implementations are currently being migrated to the latest version of Theodolite. Theodolite's apache-flink branch provides the basis for our Flink scalability experiments. To repeat them:

  1. Clone Theodolite's apache-flink branch and install Theodolite according to the official documentation located in its execution directory (this installation should be identical to the one for Kafka Streams, see above).
  2. Copy the files repeat-flink-without-checkpointing.sh and repeat-flink-with-checkpointing.sh into Theodolite's execution directory.
  3. Switch to the execution directory.
  4. Run the first repetition file with ./repeat-flink-with-checkpointing.sh.
  5. Disable checkpointing by reconfiguring the Kubernetes resources jobmanager-job.yaml and taskmanager-job-deployment.yaml for each benchmark (uc{1,2,3,4}-application) by setting the environment variable CHECKPOINTING to "false".
  6. Run the second repetition file with ./repeat-flink-without-checkpointing.sh.
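Step 5 amounts to flipping a single environment variable in two Kubernetes manifests per benchmark. The sketch below demonstrates the edit with sed on a stand-in manifest fragment; the real files are jobmanager-job.yaml and taskmanager-job-deployment.yaml in each uc{1,2,3,4}-application directory, and their exact layout may differ:

```shell
#!/bin/sh
# Demonstrate the CHECKPOINTING toggle on a stand-in manifest fragment.
# The real manifests live in the uc{1,2,3,4}-application directories.
cat > jobmanager-job.yaml <<'EOF'
        env:
          - name: CHECKPOINTING
            value: "true"
EOF

# Set CHECKPOINTING to "false": on the line following the name field,
# replace the value. (GNU sed; on BSD/macOS use `sed -i ''`.)
sed -i '/name: CHECKPOINTING/{n;s/value: "true"/value: "false"/;}' jobmanager-job.yaml

grep -A1 'CHECKPOINTING' jobmanager-job.yaml
```

The same edit applies to taskmanager-job-deployment.yaml before running the second repetition file.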

Please note that the naming of our benchmarks recently changed. While our publication already uses the new naming, the corresponding Theodolite versions still use the old one. Specifically:

  • UC1 in the publication is UC1 in Theodolite,
  • UC2 in the publication is UC3 in Theodolite,
  • UC3 in the publication is UC4 in Theodolite, and
  • UC4 in the publication is UC2 in Theodolite.
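When scripting over the measurement files, the mapping stated above can be captured in a small POSIX shell lookup function:

```shell
#!/bin/sh
# Map a benchmark name as used in the publication to the name used
# in the corresponding Theodolite versions (per the mapping above).
pub_to_theodolite() {
  case "$1" in
    UC1) echo UC1 ;;
    UC2) echo UC3 ;;
    UC3) echo UC4 ;;
    UC4) echo UC2 ;;
    *)   return 1 ;;
  esac
}

pub_to_theodolite UC2   # prints: UC3
```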

Raw Measurements

The results of the benchmark executions described above can be found in the measurements directory. These are CSV files containing the measured lag trend over time for a certain subexperiment. Theodolite creates a number of additional files for debugging and preliminary interpretation. As these files are not required for replication, we did not include them in this package.

The CSV files are named according to the schema exp{id}_{uc}_{load}_{inst}_totallag.csv, where {id} represents the experiment ID, assigned by Theodolite, {uc} the benchmark name, {load} the generated load, and {inst} the number of evaluated instances.
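For scripted post-processing, this schema can be split with standard shell parameter expansion; a minimal sketch (the sample filename is made up):

```shell
#!/bin/sh
# Split a measurement filename of the form
#   exp{id}_{uc}_{load}_{inst}_totallag.csv
# into its four components. The example filename is hypothetical.
f="exp42_uc2_100000_8_totallag.csv"

base=${f%_totallag.csv}           # drop the fixed suffix
id=${base%%_*}; base=${base#*_}   # experiment ID (still carries the "exp" prefix)
uc=${base%%_*}; base=${base#*_}   # benchmark name
load=${base%%_*}                  # generated load
inst=${base#*_}                   # number of evaluated instances

echo "id=${id#exp} uc=$uc load=$load inst=$inst"
# prints: id=42 uc=uc2 load=100000 inst=8
```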

The CSV table experiments.csv provides an overview of the configuration used in each experiment.

Reproducing Scalability Analysis

The following instructions describe how to repeat our scalability analysis, either with our measurements or with your own. If you plan to conduct your own studies, we suggest using the latest version of Theodolite, which offers significantly improved usability.

Theodolite's measurements are analyzed using two Jupyter notebooks. In general, these notebooks can be run by any Jupyter server. Python 3.7 or 3.8 is required (e.g., in a virtual environment), as well as some Python libraries, which can be installed via pip install -r requirements.txt. See the Theodolite documentation for additional installation guidance.

Obtaining a Scalability Graph as a CSV File

The scalability-graph.ipynb notebook combines the measurements (i.e., the totallag.csv files) of one experiment. It produces a CSV file, which provides a mapping of load intensities to minimum required resources for that load (i.e., the scalability graph). The CSV files are named according to the schema exp{id}_min-suitable-instances.csv, where {id} represents the experiment ID. Additional guidance is provided in the notebook.

Resulting Scalability Graph CSV Files

The results directory provides the scalability graphs for all our executed experiments.

Visualization of the Scalability Graph

The scalability-graph-plotter.ipynb notebook creates PDF plots of a scalability graph and allows combining multiple scalability graphs into one plot. It can be adjusted to match the desired visualization.

Acknowledgments

This research is funded by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IS17084 and is part of the Titan project.

Files

theodolite-replication-package.zip (1.8 MB)
md5:ebe7d41031949aca872f0a138ea99e60

Additional details

Related works

Is supplement to
Journal article: 10.1016/j.bdr.2021.100209 (DOI)