Supporting and Verifying Transient Behavior Specifications in Chaos Engineering - Supplementary Materials
Contributors
Supervisors:
- 1. University of Hamburg
- 2. University of Stuttgart
Description
Context: Chaos Engineering is an approach for investigating the resilience of software systems, i.e., their ability to withstand unexpected events, adapt accordingly, and continue providing functionality. An integral part of the approach is continuous experimentation in the form of repeatedly executed Chaos Experiments. A Chaos Experiment consists of two crucial elements, namely a steady-state hypothesis and an anomaly injection. Traditionally, the steady-state hypothesis is verified at the start of an experiment, after which the anomaly is injected, followed by a second evaluation of the steady-state hypothesis. This experimentation process is well suited for identifying whether a system is in a steady state after a failure is introduced.
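The traditional process described above can be illustrated with a minimal Python sketch. It is not taken from the thesis or from any Chaos Engineering tool; the function `run_traditional_experiment`, the class `ToyService`, and the verdict strings are hypothetical names chosen for illustration only.

```python
from typing import Callable


def run_traditional_experiment(
    steady_state_holds: Callable[[], bool],
    inject_anomaly: Callable[[], None],
) -> str:
    """Run the three steps of a traditional Chaos Experiment once."""
    # Step 1: verify the steady-state hypothesis before doing anything.
    if not steady_state_holds():
        return "aborted"
    # Step 2: inject the anomaly.
    inject_anomaly()
    # Step 3: evaluate the steady-state hypothesis a second time.
    return "passed" if steady_state_holds() else "failed"


class ToyService:
    """Hypothetical stand-in for a system under test."""

    def __init__(self, self_healing: bool) -> None:
        self.self_healing = self_healing
        self.healthy = True

    def kill_instance(self) -> None:
        # Assumption: a self-healing service has already recovered
        # by the time the second check runs; a fragile one has not.
        self.healthy = self.self_healing

    def is_healthy(self) -> bool:
        return self.healthy


resilient = ToyService(self_healing=True)
fragile = ToyService(self_healing=False)

resilient_verdict = run_traditional_experiment(
    resilient.is_healthy, resilient.kill_instance
)
fragile_verdict = run_traditional_experiment(
    fragile.is_healthy, fragile.kill_instance
)
```

In this sketch the verdict depends only on the two point-in-time checks. Nothing is recorded about what the system did between the injection and the second check.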
Problem: The traditional chaos experimentation approach can only verify whether the system is in a steady state; it provides no information about what happens between the state changes, e.g., during the recovery of the system. The experimentation process does not explicitly allow the specification of hypotheses regarding the transient behavior, i.e., the behavior exhibited during the transition between steady states after a failure has been introduced. As a result, the process also does not support explicit verification of requirements on the transient behavior during the experiment, e.g., whether the response time stays below a given threshold at all times or whether a circuit breaker opens within a given time. Knowledge about such transient behaviors is beneficial for the stakeholders of the system. For example, if high availability is of utmost importance for the business model of an application, a long recovery period during which the application is unavailable after an unexpected failure could lead to considerable losses for various stakeholders.
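The two example requirements mentioned above can be sketched as checks over timestamped observations collected during an experiment. This is a simplified illustration, not the formalism or prototype of the thesis; the helper names `always_below` and `reached_within`, the sample data, and the state labels are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# (seconds since anomaly injection, observed response time in ms)
Sample = Tuple[float, float]
# (seconds since anomaly injection, observed component state)
Event = Tuple[float, str]


def always_below(samples: Sequence[Sample], threshold: float) -> bool:
    """True if every observed value stays strictly below the threshold."""
    return all(value < threshold for _, value in samples)


def first_time_reached(events: Sequence[Event], target: str) -> Optional[float]:
    """Timestamp of the first event in the target state, or None."""
    for timestamp, state in events:
        if state == target:
            return timestamp
    return None


def reached_within(events: Sequence[Event], target: str, deadline: float) -> bool:
    """True if the target state is observed no later than the deadline."""
    timestamp = first_time_reached(events, target)
    return timestamp is not None and timestamp <= deadline


# Made-up observations for illustration only.
response_times = [(0.0, 120.0), (1.0, 480.0), (2.0, 950.0), (3.0, 300.0), (4.0, 130.0)]
breaker_states = [(0.0, "closed"), (1.5, "closed"), (2.5, "open"), (6.0, "half_open")]

stays_below_1000_ms = always_below(response_times, threshold=1000.0)
stays_below_500_ms = always_below(response_times, threshold=500.0)
opens_within_3_s = reached_within(breaker_states, target="open", deadline=3.0)
opens_within_2_s = reached_within(breaker_states, target="open", deadline=2.0)
```

With the sample data, the response time requirement holds for a 1000 ms threshold but not for 500 ms, and the circuit breaker requirement holds for a 3 s deadline but not for 2 s. A check of the steady state before and after the injection alone could not distinguish these cases.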
Objective: The first objective of the thesis is to examine how a transient behavior requirement can be specified in the context of Chaos Engineering and Chaos Experiments. The next goal of the thesis is to study how the Chaos Experimentation process can include a verification of transient behavior requirements. A further goal is to create a concept for an extended Chaos Engineering approach, which supports the specification of transient behavior hypotheses and their verification, and to implement the concept into a working prototype. After the prototype is developed, the last objective of the thesis is to evaluate the extended Chaos Engineering approach and its implementation.
Method: In order to achieve the first objective of the thesis, formalisms capable of describing transient behaviors are examined with regard to their integration into Chaos Experiments. To accomplish the second objective, state-of-the-art Chaos Engineering tools are studied, and additionally, three expert interviews are conducted in order to elicit the requirements for such an extension of the Chaos Engineering approach. To accomplish the goal of creating a concept for the approach, the research results from the previous goals and the requirements elicited during the interviews are combined into a concept which is then implemented into a prototype. To reach the last goal of the thesis, the approach and the prototype are examined in three separate types of evaluation.
Result: The results of the thesis comprise the requirements for an extended Chaos Engineering approach that supports the specification and verification of transient behavior requirements, a concept for realizing the extension, and a functioning prototype of the proposed approach together with its evaluation.
Conclusion: The proposed extension of the Chaos Engineering approach allows a deeper and more precise analysis of the resilience of software systems by enabling the specification and evaluation of more detailed and stricter resilience requirements, including transient behavior specifications. Furthermore, various stakeholders such as customers, operators, and developers benefit from the extended approach, which gives them stricter guarantees regarding the resilience of the application. Moreover, the prototype implementing the extended approach allows software engineers to easily adopt it and possibly extend it.
Files
supplementary-materials.zip
(162.6 MB)
Size | Checksum |
---|---|
160.2 MB | md5:1c9349beca2481150299857b84a78bcd |
2.4 MB | md5:745b2a78132a0f041721648bfbb9a8c4 |