On systemic risk in the cloud computing model

The major benefits of the emerging cloud computing infrastructure include elimination of the fixed cost and reduction in the marginal cost for the users due to the economy of scale and dynamic resource sharing. This paper argues that these unquestionable economic benefits of dynamic resource sharing are inherently associated with systemic risks and drawbacks. This possibility necessitates a shift in the cloud architecture design and operation paradigm from maximizing the economic benefits to managing and optimizing the inherent systemic risk/benefit tradeoffs. This paper is a first step towards exposing and quantifying the inherent tradeoffs involving system economic efficiency and systemic risk of cascading overload in the cloud model of shared resources.


I. INTRODUCTION
The NIST definition lists five essential characteristics of the cloud computing model: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service [1].
The economic and convenience advantages of the shared cloud computing architecture as compared to the conventional model of owning computer resources are unquestionable. Among the economic advantages are elimination of the fixed cost and reduction of the marginal cost for users of cloud computing infrastructure due to the economy of scale. One can easily identify significant similarities between the current emergence and evolution of the cloud computing model and the earlier emergence and evolution of the Internet. Of course, these similarities are not accidental; they are the result of the same economic pressures. Indeed, convenience for the end users and economic efficiency, which are due to the end-to-end principle and packet switching, resulted in the Internet's success in competition with the telephone network [2].
Assuming that historic parallels between the Internet and Cloud models are rooted in reality, one may attempt to learn major Cloud evolution trends from the historic lessons of the Internet evolution. Probably the most important lesson is that the core principles of the Internet model, e.g., the end-to-end principle and packet switching, are inherently associated with serious risks and drawbacks, which have led to numerous proposals for modification of the Internet model for the purpose of mitigating and managing the corresponding risk/benefit tradeoffs [2]. One may expect that the cloud computing model of shared computing and communication resources inherits both benefits and drawbacks of the Internet model of shared communication resources.
However, current research is mostly concentrated on the benefits of the cloud model [3]-[5]. A notable exception to this research trend is [6]. Due to the highly dynamic operating environment, static load balancing, which attempts to match the expected demand to the system resources, may be unable to accommodate unavoidable fluctuations in the exogenous demand even with available buffering. Demand fluctuations can be accommodated with the dynamic load balancing allowed by the cloud computing model. However, assuming infinite buffers, [6] rigorously proves that these benefits of dynamic load balancing are achieved at the cost of "much longer" queues and delays as compared to static load balancing. While important, these rigorous results rely on a number of unrealistic assumptions, including infinite buffers.
This paper proposes a mean-field approximation, which overcomes this and some other limitations of model [6] at the cost of losing mathematical rigor. Our results indicate that in a practically relevant parameter region, benefits of dynamic load balancing in the "normal operating regime" come with risks of the system abruptly transitioning to the "persistently congested mode" through the process of cascading overload. A similar phenomenon has been predicted in a rather unrealistic model of a cloud without buffering [7], and has been observed in networks with dynamic routing [8]-[9]. This paper can be viewed as a first step toward a paradigm shift in Cloud design and operation: from maximizing economic benefits to identifying and managing the relevant risk/benefit tradeoffs. In particular, we demonstrate that economic pressures push the system to the stability boundary of the normal operating regime, making the tradeoff between "normal system performance" and the systemic risk of transitioning to the congested regime more acute. We also suggest using the Perron-Frobenius eigenvalue [10] of the "reduced" system performance model as a measure of the systemic risk of cascading overload. Applicability of the Perron-Frobenius theory [10] in systems with dynamic resource sharing is due to the positive feedback in the system component overload, which can spill over to the "neighboring" components.
The paper is organized as follows. Section II introduces a cloud Markov performance model. Since this Markov model is computationally intractable for realistic-size clouds, Section III proposes tractable mean-field and fluid approximate performance models. Based on these approximations, Section IV analyzes the performance of a symmetric cloud, for which the dimension of the mean-field performance model does not depend on the system size. Section V quantifies systemic risk and the risk/performance tradeoff for a symmetric system, and proposes an approach to such a quantification for a general system. Finally, Section VI briefly summarizes and outlines directions of future research.

U.S. Government work not protected by U.S. copyright

We assume a service strategy which either rejects or accepts an arriving job to some service group where the job stays until service is completed. We also assume work-conserving service disciplines which do not allow idle server(s) in a group which has at least one buffered job. We consider dynamic admission and routing strategies, which are based on the resource availability characterized by the vector

II. CLOUD MODELS
[...] if server group $i$ has available resources, i.e., a server, buffering space, or both. If, however, server group $i$ has neither a server nor buffering space available, then [...]. We consider admission and routing strategies which immediately either reject an arriving class $i$ job with probability $q_{i0}$, or route this job to a "native" service group $i$ with probability [...], the request is finally lost.
In the above admission and routing strategies, the admission probabilities $q_{ii}$ control the expected exogenous demand in order to match it to the system capacity, sets $J_i$ determine the system "topology", probabilities [...] determine the level of resource sharing (pooling), and probability matrix [...] determines the load balancing/routing strategy.
Assuming that each serviced request of class $i$ brings revenue $w_i$, the system performance can be characterized by the overall revenue rate [...] where the portion of lost revenue is [...] and the loss probability for a class $i$ job is [...]. Our goal is to quantify the ability of dynamic resource sharing (resource pooling) to mitigate the mismatch between the uncertain (due to limited reliability) system capacity and fluctuating exogenous demand, assuming that the systematic mismatches have already been eliminated through admission control, capacity planning, pricing, or their combination. In order to isolate the effect of load fluctuations, we further assume $q_{ii} = 1$, which allows us to simplify (4) as follows: [...]

B. Markov Performance Model
Introduce the indicators [...]. Since different service groups fail independently of each other, the joint probability distribution of [...]. Given vector [...], [...] is the number of class $i$ requests served by group $j$, and $m_{ij}(t)$ is the number of class $i$ requests waiting in the group $j$ buffer at moment $t$.
In the case of finite buffers [...]. This steady-state distribution can be determined from the corresponding steady-state Kolmogorov equations, which we omit here due to limited space.
After solving this steady-state Kolmogorov system, one can calculate the conditional loss probabilities [...] where distribution [...] is given by (6). Unfortunately, the astronomically high dimension of the Kolmogorov system, which has to be solved for each of [...], makes the Markov performance model computationally infeasible even for moderate-size systems. The source of this difficulty is the correlations between states of different service groups due to resource sharing, which allows non-native service.
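To give a rough feel for this growth, the following sketch counts only the joint total-occupancy vectors of the service groups. This is an illustrative assumption rather than the paper's state space: the full Markov state also tracks per-class counts and failure indicators, so the count below is only a crude lower bound. The function name and the sample sizes are hypothetical.

```python
def joint_occupancy_states(groups: int, servers: int, buffers: int) -> int:
    """Crude lower bound on the joint state count.

    Assumption: each group is described only by its total number of jobs,
    which ranges over 0..servers+buffers; per-class detail is ignored.
    """
    per_group = servers + buffers + 1
    return per_group ** groups


# Hypothetical sizes: 10 servers and 10 buffer places per group.
for groups in (2, 5, 10, 20):
    print(groups, joint_occupancy_states(groups, servers=10, buffers=10))
```

Even this simplified count grows exponentially in the number of groups, which is consistent with the intractability noted above.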

III. APPROXIMATIONS
Subsection A proposes a mean-field approximation, which is computationally tractable for medium-size systems. Subsection B combines this mean-field approximation with fluid approximations in the cases of large service groups, or of a large number of service groups and a high level of resource sharing.

A. Mean-field Approximation
Steady-state Kolmogorov equations can be explicitly solved in the particular case when only native services are allowed: [...] In (12), server group $i$ utilization by native requests is [...] (7), we obtain the conditional losses [...] (13).

Similar to [7], consider a mean-field approximation, which drastically reduces the problem dimension by postulating that (10)-(11) hold approximately: [...] (16). Self-consistency requires the "effective utilizations" in (16) to satisfy the following "conservation laws": [...] where the load to be routed, subject to the admission strategy, from server group $i$ to server group $j$ is [...]. Substituting (19) into (18) and then into (17) and (16), we obtain a closed system of $I$ fixed-point equations for $I$ unknowns [...] (20). After solving system (20), the conditional losses (7) can be approximated as follows: [...] Thus, the mean-field approximation has a much lower dimension than the Markov description, at the cost of nonlinearity.
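As a minimal sketch of the native-only case, assume that an operational group with N servers and B buffer places sees Poisson arrivals and exponential service times, i.e., behaves as an M/M/N/(N+B) queue. The Markov model suggests this, but the assumption is not confirmed by the formulas available here, and the function name is hypothetical. The loss probability then follows from the birth-death balance equations:

```python
def native_loss(load: float, servers: int, buffers: int) -> float:
    """Loss probability of an M/M/N/(N+B) queue.

    `load` is the offered load in Erlangs (arrival rate / service rate).
    In state k the departure rate is min(k, servers) service rates, so the
    stationary weights satisfy p_k / p_{k-1} = load / min(k, servers).
    """
    if load <= 0.0:
        return 0.0
    # s holds sum_{j<=k} p_j / p_k, which avoids large powers and factorials.
    s = 1.0
    for k in range(1, servers + buffers + 1):
        s = 1.0 + s * min(k, servers) / load
    # An arrival is lost only when the group is full (state N + B).
    return 1.0 / s


print(native_loss(load=8.0, servers=10, buffers=5))
print(native_loss(load=12.0, servers=10, buffers=5))
```

In a mean-field treatment of this kind, the same single-group formula would be evaluated at an "effective" load that includes non-native traffic, which is what makes the resulting fixed-point system nonlinear.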

B. Fluid Approximation
Fluid approximation, which replaces probability distributions by their averages, can be justified if these averages are much greater than the corresponding standard deviations. In other words, fluid approximation neglects fluctuations of the random variables, assuming that these fluctuations are small compared to the corresponding averages.
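As a hedged illustration of this idea, assume a single operational group behaves as an M/M/N/(N+B) queue with per-server utilization $\rho$ (an assumption that may differ in detail from the paper's own formulas; the symbols $L_N$ and $\rho$ are introduced here only for illustration). As the group grows at fixed $\rho$, fluctuations become negligible relative to the mean, and the loss probability tends to the deterministic overflow fraction:

```latex
\lim_{N \to \infty} L_N(\rho) \;=\; \max\!\left(0,\; 1 - \frac{1}{\rho}\right)
```

That is, a large group is lossless for $\rho < 1$, and for $\rho > 1$ it loses exactly the share of load that exceeds its capacity.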
Usually, fluid approximation is associated with the case of "large" service groups [...], when formula (16) takes the following form: [...] Combining (22) [...], a sufficient condition for existence of this lossless system equilibrium is that the total utilization of each server group $j = 1, \dots, I$ by native and non-native jobs is less than one: [...]

In the case of a large number of service groups, $I \gg 1$, and a high level of resource sharing, i.e., large sets $J_i$, $i = 1, \dots, I$, fluid approximation is applicable to the aggregate loads conditioned on the vectors [...] and [...]. Indeed, in this case the aggregate load on a server group is a sum, i.e., composition, of the native load and a "large" number of "small" non-native loads from other service groups. At any given moment, the aggregate load is a function of the corresponding vectors. The fluid approximation replaces the average of this function over the random vectors (17) with the corresponding function of the average vectors: [...]

In the case of both large server groups and a large degree of resource sharing, further simplification is possible, resulting in the following condition for existence of a "normal" operational state with lossless performance: [...] Note that more accurate approximations of (16) than (22)

IV. SYMMETRIC SYSTEM
This section analyzes the persistent behavior of a symmetric system, for which the dimension of the mean-field system is independent of the system size. Subsection A defines a symmetric system and derives the corresponding mean-field and fluid approximations. Subsection B analyzes the persistent performance of a symmetric system, which is identified with equilibria of the corresponding mean-field equation. Subsection C discusses the phase diagram of a symmetric system under fluid approximation. A phase diagram separates regions in the space of system parameters with different persistent system behavior.

A. Mean-field Equations
Consider a symmetric system, where [...] where the exogenous load is [...]. Substituting (28) into the right-hand side of (27), we obtain a single fixed-point equation [...] (29). Figure 1 shows the solution to the fixed-point equation (29). [...] depend on the system parameters. Following accepted practice, e.g., see [7]-[9], we interpret globally stable equilibria of (29) as describing stable system states, and locally stable equilibria of (29) as describing metastable system states. Note that the metastability is the result of the positive feedback in the effective load due to resource sharing, which allows congestion to spill over to the neighboring service groups.
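The following sketch illustrates how positive feedback in the effective load can produce multiple equilibria. It does not reproduce equation (29). Instead it uses a toy map, which is an assumption made only for illustration: the effective utilization x equals the native load rho plus overflow work proportional to the fluid loss fraction max(0, 1 - 1/x), amplified by a hypothetical factor c that lumps together the sharing level and the inefficiency of non-native service.

```python
def fluid_loss(x: float) -> float:
    """Fraction of load lost by a large group at effective utilization x."""
    return max(0.0, 1.0 - 1.0 / x) if x > 0.0 else 0.0


def effective_load(x: float, rho: float, c: float) -> float:
    """Toy feedback map (assumption): native load plus amplified overflow."""
    return rho * (1.0 + c * fluid_loss(x))


def fixed_points(rho: float, c: float, x_max: float = 10.0, steps: int = 100_000):
    """Solve x = effective_load(x) on (0, x_max] by grid scan plus bisection."""
    def g(x: float) -> float:
        return effective_load(x, rho, c) - x

    roots = []
    prev_x, prev_g = None, None
    for k in range(1, steps + 1):
        x = x_max * k / steps
        gx = g(x)
        if gx == 0.0:
            roots.append(x)  # grid point is itself an equilibrium
        elif prev_g is not None and prev_g * gx < 0.0:
            lo, hi = prev_x, x
            for _ in range(60):  # bisection on the bracketing interval
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        prev_x, prev_g = x, gx
    return roots


def is_stable(x: float, rho: float, c: float, h: float = 1e-6) -> bool:
    """An equilibrium of the iteration x -> F(x) is locally stable if |F'(x)| < 1."""
    slope = (effective_load(x + h, rho, c) - effective_load(x - h, rho, c)) / (2.0 * h)
    return abs(slope) < 1.0


for x in fixed_points(rho=0.9, c=3.0):
    print(round(x, 4), "stable" if is_stable(x, 0.9, 3.0) else "unstable")
```

For these hypothetical parameters the toy map has a lossless equilibrium below full utilization, a congested equilibrium well above it, and an unstable equilibrium separating the two, which mirrors the qualitative picture of coexisting normal and congested states.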

B. Persistent Performance
After solving equation (29), the unconditional persistent loss rate can be determined as follows: [...] Figure 2 shows the persistent loss rate (30) as a function of the "slowly" changing exogenous load for different values of the parameter representing the level of resource sharing.
[...] As the load "slowly" increases, the loss $L$ follows curve [...] correspond to the globally stable "normal" and "congested" system equilibria, respectively. [...] Thus, an increase in resource sharing increases the "spread" between the normal and congested metastable regimes by reducing loss in the normal regime and increasing loss in the congested metastable regime. Figure 3 demonstrates this dual effect of dynamic resource sharing on the system performance by sketching the persistent loss rate (30) as a function of the "slowly" changing level of resource sharing for sufficiently large service groups [...] and sufficiently inefficient non-native service, i.e., small [...].
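The dependence of the persistent loss on the history of a "slowly" changing load can be illustrated with a small sweep. The sketch uses a toy feedback map, not equation (29) or (30): effective utilization equals the native load rho plus overflow proportional to the fluid loss fraction, amplified by a hypothetical factor c. At each load level the map is iterated from the previously reached state, so the system stays on whichever branch it was already on.

```python
def fluid_loss(x: float) -> float:
    """Fraction of load lost by a large group at effective utilization x."""
    return max(0.0, 1.0 - 1.0 / x) if x > 0.0 else 0.0


def effective_load(x: float, rho: float, c: float) -> float:
    """Toy feedback map (assumption): native load plus amplified overflow."""
    return rho * (1.0 + c * fluid_loss(x))


def settle(x: float, rho: float, c: float, iters: int = 2000) -> float:
    """Iterate the map from state x until it has effectively converged."""
    for _ in range(iters):
        x = effective_load(x, rho, c)
    return x


def sweep(loads, c: float, x0: float):
    """Follow the persistent state as the load changes slowly along `loads`."""
    x = x0
    path = []
    for rho in loads:
        x = settle(x, rho, c)  # start from the state reached at the previous load
        path.append((rho, fluid_loss(x)))
    return path


rising = sweep([0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2], c=3.0, x0=0.5)
falling = sweep([1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5], c=3.0, x0=5.0)
for (rho, loss) in rising:
    print("up  ", rho, round(loss, 4))
for (rho, loss) in falling:
    print("down", rho, round(loss, 4))
```

With these hypothetical parameters, the rising sweep stays lossless until the load exceeds one, while the falling sweep remains congested over a range of loads at which the rising sweep was lossless, i.e., the two branches form a hysteresis loop.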

C. Fluid Approximation
A system phase diagram separates regions in the space of system parameters with different topological structure of the mean-field equation (29). This Subsection analyzes the phase diagram of a symmetric system under fluid approximation (22), when [...] introducing the normalized exogenous load [...], which is the sum of the native load and the non-native load redirected from non-operational service groups. Solution to the system (28), (31)-(32) yields the following result. In the normal persistent state, losses occur only for requests which are native for non-operational groups and not redirected to operational groups: [...] (33). The normal persistent state is stable if operational service groups can handle all native and non-native traffic which is rerouted from non-operational service groups: [...] The congested persistent state with loss [...] is stable if the operational service groups cannot handle all their native and non-native traffic once fully loaded: [...] The normal and the congested persistent states coexist if [...] (37). Figure 4 shows a section of the system phase diagram by the plane [...]. In Figure 4, [...] and [...]. Inside regions [...] and [...], the system has a unique "normal" and "congested" equilibrium with loss (33) and (35), respectively.
Inside region [...], these two equilibria coexist as metastable.

V. SYSTEMIC RISK
Based on the similarity with discontinuous phase transitions, one may expect that system transition from normal to congested persistent mode occurs through the process of cascading overload.
This Section discusses and quantifies the corresponding systemic risk and the risk/performance tradeoff. Subsection A analyzes a symmetric system using direct solution of the corresponding mean-field equation. Subsection B suggests that Perron-Frobenius theory provides an adequate analytical framework for quantitative evaluation of the systemic risk and the risk/performance tradeoff. Figure 3 indicates a tradeoff between system performance in the persistent "normal" state and the risk of system transition to the persistent overloaded state. Indeed, conventional system design aims at optimizing the system performance in the persistent "normal" state, assuming that this state exists and the system remains in this state. In practice, due to inherent uncertainties, the optimal and critical levels of resource sharing are subject to variability. This variability may result in the resource sharing parameter exceeding the critical threshold, which would cause the abrupt system transition from the

A. Symmetric System
[...] where [...] is the system loss rate in the persistent normal state. While the resource sharing parameter is controlled, the stability threshold is subject to various inherent uncertainties. We model these uncertainties by assuming that the threshold is a random variable, and measure risk by the expected "catastrophic loss" [...] where [...] is the system loss rate in the persistent overloaded state. Note that similar analysis holds for other risk measures. Figure 6 sketches the risk/loss tradeoff (40)-(41), which is the Pareto optimal frontier separating the feasible (upper-right) and infeasible (lower-left) regions on the (Risk, Loss) plane. Since it is impossible to simultaneously reduce Risk and Loss beyond the Pareto optima, systemic risk management involves two steps: first, reaching the (Risk, Loss) Pareto optimal frontier, and second, maintaining the optimal operating point on this frontier.
In the particular case of a normally distributed threshold [...], Figure 6 indicates that the Risk/Loss tradeoff becomes more acute for less reliable systems, i.e., systems with a higher probability $f$ of service group failures.
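A minimal numerical sketch of such a tradeoff follows. It rests on several assumptions that are not taken from formulas (40)-(41): risk is read as the probability that a normally distributed threshold falls below the chosen sharing level, multiplied by the loss in the congested state, and both loss functions are hypothetical stand-ins chosen only to have the qualitative shape described in the text (normal-state loss decreasing, congested-state loss increasing in the sharing level).

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_state_loss(share: float, fail_prob: float) -> float:
    """Hypothetical stand-in: load of failed groups that is not redirected."""
    return fail_prob * (1.0 - share)


def congested_state_loss(share: float) -> float:
    """Hypothetical stand-in: congested loss grows with the sharing level."""
    return 0.2 + 0.5 * share


def risk(share: float, threshold_mean: float, threshold_std: float) -> float:
    """Assumed reading: P(threshold < share) times the congested-state loss."""
    p_cross = normal_cdf((share - threshold_mean) / threshold_std)
    return p_cross * congested_state_loss(share)


def tradeoff(fail_prob: float, threshold_mean: float, threshold_std: float, points: int = 11):
    """Tabulate (share, loss, risk) over sharing levels in [0, 1]."""
    curve = []
    for k in range(points):
        share = k / (points - 1)
        curve.append((share,
                      normal_state_loss(share, fail_prob),
                      risk(share, threshold_mean, threshold_std)))
    return curve


for share, loss, r in tradeoff(fail_prob=0.1, threshold_mean=0.6, threshold_std=0.1):
    print(f"share={share:.1f}  loss={loss:.3f}  risk={r:.4f}")
```

Under these assumptions, raising the sharing level lowers the normal-state loss while raising the risk, so each sharing level corresponds to one point on a (Risk, Loss) curve.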

B. General System
Assuming that each arriving class $i$ request brings revenue $w_i$, the aggregate unconditional revenue loss in the normal persistent state is given by (3), with the unconditional loss probabilities derived from (5): [...] where [...] (49). After solving this optimization problem, the Loss/Risk tradeoff is given by [...] where systemic risk is identified with [...].
The solution to the risk-aware performance optimization problem (48)-(49) depends on the system parameters. However, unavoidable variability in the system parameters results in variability of the Perron-Frobenius eigenvalue around the expected eigenvalue [...] where [...] is the aggregate revenue loss in the corresponding persistent congested state.
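For a nonnegative matrix, the Perron-Frobenius eigenvalue can be estimated by power iteration, as in the sketch below. The matrix is a hypothetical stand-in for a linearization of the reduced model around the normal equilibrium, and the reading that values approaching one signal a shrinking stability margin is an assumption made for illustration.

```python
def perron_eigenvalue(matrix, iters: int = 1000, tol: float = 1e-12) -> float:
    """Estimate the Perron-Frobenius eigenvalue of a nonnegative square matrix.

    Power iteration on a positive vector normalized to sum to one: if
    A v = lambda v with sum(v) = 1, then sum(A v) = lambda.
    """
    n = len(matrix)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(w)
        if new_lam == 0.0:
            return 0.0
        v = [x / new_lam for x in w]
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Hypothetical spillover matrix: entry (i, j) is the extra load induced on
# group i per unit of overload at group j.
spillover = [[0.5, 0.25],
             [0.1, 0.3]]
gamma = perron_eigenvalue(spillover)
print("eigenvalue:", round(gamma, 6), "margin to one:", round(1.0 - gamma, 6))
```

Power iteration of this form converges for primitive matrices, e.g., matrices with all entries positive; other nonnegative matrices may need a different method.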

VI. CONCLUSION AND FUTURE RESEARCH
This paper has reported on work in progress on exposure and quantification of the inherent tradeoff between economic efficiency and systemic risk of cascading overload in the cloud computing model. Due to the intractability of the conventional Markov model, we have suggested employing methodologies of "Complex Systems" for the quantitative assessment of the corresponding tradeoffs. Specifically, we have proposed a mean-field approximation, which greatly reduces the complexity of the system description as compared to the conventional Markov model.
We have followed a conventional interpretation of multiplicity of local equilibria of the corresponding mean-field equations as describing the metastable, i.e., persistent, system states. We have identified the stability margin of a metastable system state with the stability margin of the corresponding equilibrium. In particular, our analysis has indicated that dynamic resource sharing allows for mitigation of the limited system reliability. However, this mitigation occurs at the cost of increased risk of cascading overload.
Numerous questions deserve further investigation, e.g., accuracy of the proposed mean-field approximation and nature of the predicted systemic instability. More broadly, future work should address the practicality of the proposed systemic risk measure at the system design and operational stages. Of particular interest is a potential ability of online measurements of the corresponding Perron-Frobenius eigenvalue [11] to provide "early warning signals" of the system approaching the instability/breaking point [12].