Impact of Service- and Cloud-Based Dynamic Routing Architectures on System Reliability

Abstract. Various kinds of dynamic routing architectures are used in today’s service- and cloud-based architectures, including sidecar-based routing, routing through a central entity such as an event store, or architectures with multiple dynamic routers. We propose an analytical model of request loss during router and service crashes, as well as an empirical validation of that model. The comparison of the empirical data to the values predicted by our model shows an error rate that is low enough, and converges with an increasing number of runs, to support using the model during system architecting. Our model predicts that, given the same crash probability, decentralized routing results in losing a higher number of requests than more centralized approaches. To the best of our knowledge, our study is the first to empirically study the reliability trade-off in such architectural decisions.


Introduction
Many distributed system architecture patterns [3,10,15] have been suggested for dynamic routing [8]. Some dynamic routing architectures require a single dynamic request routing decision, e.g., when using load balancing. More complex request routing decisions or combinations of decisions, such as routing to the right branch of a company or checking for compliance with privacy regulations, often require multiple runtime checks during one sequence of requests.
In our prior work [1], we studied representative service- and cloud-based system architecture patterns for dynamic request routing. A typical cloud-native architecture pattern is the sidecar pattern [10,12], in which the sidecar of each service handles incoming and outgoing traffic [6]. In contrast, a central entity, e.g., an API Gateway, an event streaming platform [15], or any kind of central service bus [3], can be used to process the request routing decisions. These two extremes are often combined so that multiple routers are used; this is called dynamic routers in this paper. Consider, for example, an API Gateway, two event streaming platforms, and a number of sidecars, all making routing decisions in a cloud-based architecture.
At present, the impact of such architectures and their different configurations on system reliability has not been studied. More is known about other qualities relevant for this decision. For instance, our prior work [1] has shown that more distributed approaches for dynamic data routing offer better performance than more centralized solutions. As reliability is a core consideration in service and cloud architectures [14], a reasonably accurate failure prediction for the feasible architecture design options in a certain design situation would help architects to better design system architectures considering quality trade-offs. We study two research questions:
RQ1: What is the impact of choosing a dynamic routing architecture, in particular central entity, sidecar-based, or dynamic routers, on system reliability?
RQ2: How can we predict this impact when making architectural design decisions regarding system reliability?
We model request loss during router and service crashes in an analytical model based on Bernoulli processes; request loss is used as the externally visible metric indicating the severity of a crash's impact. The model abstracts central entities, dynamic routers, and sidecars in a common router abstraction. To validate our analytical model, we designed an experiment in which we studied 36 representative experimental cases (i.e., different experiment configurations) for the three kinds of architectures. Our results show that the error steadily decreases with a higher number of experimental runs, converging at a prediction error of 8.1%. Given the common target prediction accuracy of up to 30% in the cloud performance domain [11], these results are more than reasonable. Our model predicts, and our experiment confirms, that more decentralized routing results in losing a higher number of requests than more centralized approaches.

Related Work
Architecture-Based Reliability Prediction. To predict the reliability of a system and to identify reliability-critical elements of its architecture, various approaches such as fault tree analysis or methods based on continuous-time Markov chains have been proposed [17]. Architecture-based approaches, like ours, are often based on the observation that the reliability of a system depends not only on the reliability of each component but also on the probabilistic distribution of the utilization of its components, e.g., as captured in a Markov model [4].
Empirical Reliability or Resilience Assessment. Today, many software organizations use large-scale experimentation in production systems to assess the reliability of their systems, which is called chaos/resilience engineering [2]. A crucial aspect in the resilience assessment of software systems is efficiency [13].
To reduce the number of experiments needed, knowledge about the relationships among resilience patterns, anti-patterns, suitable fault injections, and the system's architecture can be exploited to generate experiments [18].
Service-Specific Reliability Studies. Some related works introduce service-specific reliability models. For example, Wang et al. [19] propose a discrete-time Markov chain model for analyzing system reliability based on constituent services. Grassi and Patella [7] propose an approach for reliability prediction that considers the decentralized and autonomous nature of services. However, none of these approaches studies and compares major architecture patterns in service and cloud architectures; they are based on a very generic notion of service.
Fig. 1: Dynamic Routing Architecture Patterns (adapted from [1])

Background: Dynamic Routing Architecture Patterns
Central Entity (CE). In a CE architecture, as shown in Figure 1, the central entity manages all request flow decisions. One benefit of this architecture is that it is easy to manage, understand, and change, as all control logic regarding request flow is implemented in one component. A drawback, however, is that the design of the internals of the central entity component is a complex task. CE can be implemented using an API Gateway, an event store, an event streaming platform [15], or a service bus [3].
Sidecar Architecture (SA). Figure 1 presents an SA example. Sidecars [6,10,12] offer benefits whenever decisions need to be made structurally close to the service logic. One advantage of this architecture is that, in comparison to a central entity, sidecars are usually easier to implement since they require less complex logic to control the request flow; however, it is not always possible to add sidecars, e.g., when services are off-the-shelf products.
Dynamic Routers (DR). Figure 1 shows a specific dynamic router [8] configuration. One benefit of using DR is that dynamic routers can use local information regarding request routing amongst their connected services. For instance, if a set of services depend on one another as steps of processing a request, DR can be used to facilitate the dynamic routing; nonetheless, dynamic routers introduce an implementation overhead regarding control logic, deployment, and so on, since they are usually distributed over multiple hosts.
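To make the structural difference between the three patterns concrete, the following TypeScript sketch describes each pattern by its router allocation, i.e., the number of services directly linked to each router, anticipating the allocation set A used in Section 4. All names, and the split of services across three routers for DR, are our own illustrations, not part of the studied prototype.
```typescript
// Illustrative model of the three patterns as router allocations.

type Allocation = number[]; // services directly linked to each router

interface Architecture {
  name: string;
  allocation: Allocation;
}

// Central Entity: one router controls all n_serv services.
const centralEntity = (nServ: number): Architecture => ({
  name: "CE",
  allocation: [nServ],
});

// Dynamic Routers: here three routers, with services split among them.
const dynamicRouters = (nServ: number): Architecture => {
  const base = Math.floor(nServ / 3);
  const rest = nServ % 3;
  return {
    name: "DR",
    allocation: [0, 1, 2].map((i) => base + (i < rest ? 1 : 0)),
  };
};

// Sidecar Architecture: one sidecar (router) per service.
const sidecars = (nServ: number): Architecture => ({
  name: "SA",
  allocation: Array.from({ length: nServ }, () => 1),
});

console.log(centralEntity(5));  // { name: "CE", allocation: [ 5 ] }
console.log(dynamicRouters(5)); // { name: "DR", allocation: [ 2, 2, 1 ] }
console.log(sidecars(5));       // { name: "SA", allocation: [ 1, 1, 1, 1, 1 ] }
```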

Model of Request Loss During Crashes
We use the common term router for all request flow control logic, i.e., central entities, dynamic routers, and sidecars.

Definition of Internal and External Loss
In Figure 1, routers and services send internal requests amongst one another to complete the processing of one external request received from clients. In case of a crash, external requests will not be processed fully. We define external and internal loss as the number of lost external and internal requests, respectively.
Internal Loss. In case of a crash, for each external loss, the internal loss is the total number of internal requests (IR_T) minus the ones that have already been executed. Let IL_c, EL_c, and n_exec_c be the internal loss, the external loss, and the number of executed internal requests for the crash of a component c:
$IL_c = EL_c \cdot (IR_T - n^{exec}_c)$  (1)
Note that IR_T and n_exec_c need to be parameterized based on the application. An example of this parameterization is given in Section 4.
External Loss. Let d_c be the expected average downtime after a component c crashes and cf the incoming call frequency, i.e., the frequency at which external requests are received. Then, the external loss per crash of a component c is:
$EL_c = d_c \cdot cf$  (2)
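As a minimal numeric illustration of Equations (1) and (2), the following sketch computes the per-crash losses; all input values (3 s downtime, 10 external requests per second, IR_T = 10, n_exec_c = 4) are placeholders, not measurements.
```typescript
// Per-crash loss per Equations (1) and (2); inputs are example values.

// Equation (2): EL_c = d_c * cf.
const externalLossPerCrash = (downtimeSec: number, callFreq: number): number =>
  downtimeSec * callFreq;

// Equation (1): IL_c = EL_c * (IR_T - n_exec_c).
const internalLossPerCrash = (
  elc: number,
  irTotal: number,
  nExec: number
): number => elc * (irTotal - nExec);

const elc = externalLossPerCrash(3, 10); // 30 lost external requests
console.log(elc, internalLossPerCrash(elc, 10, 4)); // 30 180
```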

Bernoulli Process to Model Request Loss
In this section, we model request loss based on Bernoulli processes [17]. We only model the crash of the routers and services in Figure 1, because we assume the API Gateway to be stable and reliable. Moreover, a crash of a Client results in external requests not being generated; as a result, no external requests are lost. Hence, from now on, we use the common term components for all routers and services.
Number of Crash Tests. During T, all components can crash with certain failure distributions. Here, T should be interpreted as the time interval in which these failure distributions are observed (e.g., the failure distributions of a day or a week). We model this behavior by checking for a crash of any of the system's components every crash interval CI. That is, our model "knows" about crashes in discrete time intervals only, as would be the case, e.g., if the Heartbeat pattern [9] is used for checking system health. Let n_crash be the number of times we check for a crash of components during T, i.e., the number of crash tests:
$n_{crash} = T / CI$  (3)
Expected Number of Crashes. Each crash test is a Bernoulli trial in which success is defined as "component crashed". Assuming CI > d_c (justifiable because a component that is down cannot crash again), all n_crash crash tests of a component c are independent. The binomial distribution of each Bernoulli process gives us the number of successes. Let P_c be the crash probability of a component c every time we check for a crash, and E[C_c] the expected number of its crashes, i.e., the expected value of its binomial distribution during T:
$E[C_c] = P_c \cdot n_{crash}$  (4)
Total Internal and External Loss. The total internal loss (IL_T) is the sum of the internal loss per crash of each component. Let C be the set of all components that can crash, i.e., routers and services. Using Equations (1) to (4):
$IL_T = \sum_{c \in C} E[C_c] \cdot IL_c$  (5)
The total external loss (EL_T) is the sum of the external loss per crash of each component. Using Equations (2) to (4):
$EL_T = \sum_{c \in C} E[C_c] \cdot EL_c$  (6)
Total Number of Crashes. The total number of crashes (C_T) is the sum of the expected number of crashes of each component. Using Equations (3) and (4):
$C_T = \sum_{c \in C} E[C_c]$  (7)
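The following sketch chains Equations (1) to (7) over a set of components. The parameter values are illustrative; they happen to mirror the CE configuration with three services used in Section 4.
```typescript
// Expected totals per Equations (3)-(7); all parameters are illustrative.

interface Component {
  crashProb: number;   // P_c: crash probability per crash test
  downtimeSec: number; // d_c: expected average downtime after a crash
  nExec: number;       // n_exec_c: internal requests executed before the crash
}

function expectedTotals(
  components: Component[],
  tSec: number,     // observation interval T
  ciSec: number,    // crash interval CI
  callFreq: number, // cf: external requests per second
  irTotal: number   // IR_T: internal requests per external request
) {
  const nCrash = tSec / ciSec; // Equation (3)
  let ilT = 0;
  let elT = 0;
  let cT = 0;
  for (const c of components) {
    const expCrashes = c.crashProb * nCrash; // Equation (4)
    const elc = c.downtimeSec * callFreq;    // Equation (2)
    const ilc = elc * (irTotal - c.nExec);   // Equation (1)
    ilT += expCrashes * ilc;                 // Equation (5)
    elT += expCrashes * elc;                 // Equation (6)
    cT += expCrashes;                        // Equation (7)
  }
  return { ilT, elT, cT };
}

// CE with n_serv = 3: one central router (nExec 0) and services 1..3.
const comps: Component[] = [0, 0, 2, 4].map((nExec) => ({
  crashProb: 0.005,
  downtimeSec: 3,
  nExec,
}));
console.log(expectedTotals(comps, 600, 15, 10, 6));
// => { ilT: 108, elT: 24, cT: 0.8 }
```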

Empirical Validation
Experimental Planning
Goals. We aim to empirically validate our model's accuracy with regard to the number of crashes as well as the total external and internal loss represented by Equations (5) and (6). We realized the architectures in a prototypical implementation, instantiated and ran them on a cloud infrastructure, measured the empirical results, and compared them with our model's predictions.
Technical Details. We used a private cloud with three physical nodes, each having two identical Intel® Xeon® E5-2680 CPUs. On top of the cloud nodes, we installed Virtual Machines (VMs) with eight CPU cores and 60 GB system memory running Ubuntu Server 18.04.1 LTS. Docker containerization is used to run the cloud services, which are implemented in Node.js. We utilized five desktop computers to generate load, each with an Intel® Core™ i3-2120T CPU @ 2.60 GHz and 8 GB of system memory, running Ubuntu 18.10. They generate load using Apache JMeter, which sends HTTP/1.1 requests to the cloud nodes.
Specific Model Formulae. In our example configurations, each service receives an internal request, processes it, and sends it back either to a router or the API gateway, so we can calculate IR_T based on the number of services (n_serv):
$IR_T = 2 \cdot n_{serv}$  (8)
In order to calculate n_exec_c, we need to differentiate between service and router crashes. In case of a service crash, all internal requests up until the last router will be executed. Let s_crashed be the label number of the crashed service:
$n^{exec}_{s_{crashed}} = 2 \cdot (s_{crashed} - 1)$  (9)
In case of a router crash, we need to know the allocation of routers (A), a set indicating the number of directly linked services of each router. Let r_crashed be the label number of the crashed router:
$n^{exec}_{r_{crashed}} = 2 \cdot \sum_{i=1}^{r_{crashed}-1} A_i$  (10)
Experimental Cases. We chose different levels for cf and n_serv to study their effects on IL_T. We selected cf based on a study of related works, e.g., [5,16], as 10, 25, 50, and 100 requests per second. Based on our experience and a survey of existing cloud applications in the literature and industry [1], the number of cloud services that are directly dependent on each other in a call sequence is usually rather low. As a result, we chose 3, 5, and 10 as values for n_serv. We simulated a node crash by separately generating a random number for each cloud component. If the generated random number for a component was below its crash probability, we stopped the component's Docker container and started it again after a time interval d = 3 seconds. We chose T = 10 minutes, during which we checked for a crash of all components simultaneously every CI = 15 seconds, resulting in n_crash = 40 (Equation (3)). Each component had a uniform crash probability of 0.5%; akin to related works, we chose a relatively high crash probability to have a high enough likelihood of observing a few crashes during T.
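As a quick sanity check of this parameterization, the following sketch evaluates Equations (8) to (10) for a hypothetical DR case with n_serv = 5 and an assumed allocation A = [2, 2, 1]; the concrete allocation is our assumption, not taken from the experiment.
```typescript
// Parameterization per Equations (8)-(10) for a hypothetical case.

const nServ = 5;

// Equation (8): each service receives one internal request and sends one back.
const irTotal = 2 * nServ;

// Equation (9): service crash -- services 1..s_crashed-1 completed their
// receive/send pairs before the crash.
const nExecService = (sCrashed: number): number => 2 * (sCrashed - 1);

// Equation (10): router crash -- all services of routers 1..r_crashed-1
// completed their receive/send pairs.
const nExecRouter = (rCrashed: number, allocation: number[]): number =>
  2 * allocation.slice(0, rCrashed - 1).reduce((sum, a) => sum + a, 0);

console.log(irTotal);                   // 10
console.log(nExecService(3));           // 4: services 1 and 2 finished
console.log(nExecRouter(2, [2, 2, 1])); // 4: both services of router 1 finished
```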
Data Set Preparation. For each experimental case, we instantiated the architectures and ran the experiment for exactly ten minutes (excluding setup time). We studied three architectures, three levels of n_serv, and four levels of cf, resulting in a total of 36 experimental cases; a single run of our experiment therefore takes exactly six hours (36 × 10 minutes) of runtime. Since our model revolves around expected values of Bernoulli processes, we repeated this process 200 times (1200 hours of runtime) and report the arithmetic mean of the results.
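A minimal sketch of the crash-injection loop described above, assuming the components run as Docker containers; the container names and the orchestration details are our assumptions, as the experiment tooling is not included here.
```typescript
// Crash injection: every CI seconds, draw a random number per component;
// on a "crash", stop the component's Docker container and restart it after
// d seconds. Requires Node.js and a local Docker daemon.
import { execSync } from "child_process";

const CRASH_PROB = 0.005;  // uniform crash probability per crash test
const CI_MS = 15_000;      // crash interval CI
const DOWNTIME_MS = 3_000; // downtime d
const T_MS = 600_000;      // observation interval T (10 minutes)
const components = ["router-1", "service-1", "service-2"]; // placeholder names

const sleep = (ms: number) => new Promise((res) => setTimeout(res, ms));

async function injectCrashes(): Promise<void> {
  for (let elapsed = 0; elapsed < T_MS; elapsed += CI_MS) {
    for (const name of components) {
      if (Math.random() < CRASH_PROB) {
        execSync(`docker stop ${name}`);  // simulate the crash
        await sleep(DOWNTIME_MS);         // keep the component down for d
        execSync(`docker start ${name}`); // recover
      }
    }
    await sleep(CI_MS); // wait until the next crash test
  }
}

injectCrashes().catch(console.error);
```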

Results
Experimental Results Analysis. Based on Equation (5), IL_T is a model element that incorporates the crashes of all components, and it includes all model views, e.g., architecture configurations, expected average downtime, etc. Therefore, we conduct our analysis mainly based on IL_T. It can be observed from Table 1 that, when we keep n_serv constant, increasing cf results in a rise of EL_T (predicted by Equation (6)) in all cases, which leads to a higher value of IL_T (Equation (5)).
Since in our experiment we instantiated the DR architecture with three dynamic routers, it is interesting to consider the experimental case of n_serv = 3. In this case, SA and DR have the same number of components, i.e., routers and services. Note that SA uses a sidecar per cloud service; as a result, with n_serv = 3, we also have three sidecars. The difference between the two architectures in this experimental case is that in DR the dynamic routers are placed on a different VM than their directly linked services, whereas in SA the sidecars are placed on the same VM as their corresponding cloud services. For this reason, it can be observed that the reported values for SA and DR closely resemble each other when we vary cf but keep n_serv constant at three.
Considering the cases with five or ten cloud services, we almost always observe a higher IL_T when we change the architecture from CE to DR or from DR to SA while keeping the same configuration, i.e., constant n_serv and cf. This is because, in our experiment, CE has only one control logic component (the central entity), DR has three (the dynamic routers), and SA has n_serv (the sidecars). Consequently, the number of crashes of control logic components goes up from CE to DR and then to SA. This increases C_T, which results in losing more requests.

Discussion and Conclusions
Evaluation of the Prediction Error. We measure the prediction error by calculating the Mean Absolute Percentage Error (MAPE) [17]. Let model_i and empirical_i be the result of the model and the measured empirical data for experimental case i, respectively, and let n_case be the number of cases (36 in this study):
$MAPE = \frac{100\%}{n_{case}} \sum_{i=1}^{n_{case}} \left| \frac{empirical_i - model_i}{empirical_i} \right|$
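The MAPE computation transcribes directly into code; the example inputs below are made up for illustration.
```typescript
// MAPE over the experimental cases; the example arrays are placeholders.

function mape(model: number[], empirical: number[]): number {
  const nCase = model.length; // n_case
  const sum = model.reduce(
    (acc, m, i) => acc + Math.abs((empirical[i] - m) / empirical[i]),
    0
  );
  return (100 / nCase) * sum;
}

// Two hypothetical cases with 5% error each => MAPE of 5%.
console.log(mape([95, 210], [100, 200])); // 5
```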
Table 2 reports the prediction error measurements of our model for different numbers of runs. As the table shows, the prediction error is reduced with a higher number of experimental runs, which indicates a converging error rate. After 200 runs, the final prediction error regarding IL_T is 8.1%. As mentioned before, the common target prediction accuracy in the cloud performance domain is 30% [11].
Threats to Validity. While injecting crashes is a commonly taken approach (see Section 2.1), a threat remains that measuring internal and external loss based on these crashes might not measure reliability well; e.g., cascading effects of crashes [14] are not covered in our experiment. We collected an extensive amount of data to validate our model; however, we did so in limited experiment time and with injected crashes, simulated by stopping Docker containers. We avoided factors such as other load on the experiment machines; much of the related literature takes a similar approach. To increase internal validity, we decided not to run the experiment on a public cloud where, e.g., other load on the experiment machines might have had a significant impact on the results. As a consequence, there is the threat that generalization to a public cloud setting might be limited. As our private cloud setting uses very similar hardware and software stacks to many public cloud offerings, we believe this threat to be small. As the statistical method to compare our model's predictions to the empirical data, we used the MAPE metric, as it is widely used and offers good interpretability in our research context. To mitigate the threat that this statistical method might have issues, we double-checked with three other error measures, which led to similar results.
Conclusions. We investigated the impact of architectural design decisions for dynamic routing on system reliability. Regarding RQ1, our study concludes that more decentralized routing results in losing a higher number of requests in comparison to more centralized approaches. Regarding RQ2, we derived an analytical model for predicting request loss in the studied architectures and empirically validated this model using 36 representative experimental cases. Our results indicate that the prediction error steadily decreases with a higher number of experimental runs, converging at a prediction error of 8.1%.

Table 1: Results of the Model and the Experiment

Table 2: Prediction Error of Experimental Runs (columns: Number of Runs, C_T (%), EL_T (%), IL_T (%))