Latency-Preserving Self-Optimizing Placement at the Edge

The Internet is experiencing a fast expansion at its edges. The wide availability of heterogeneous resources at the Edge is pivotal in the definition and extension of traditional Cloud solutions toward supporting the development of new applications. However, the dynamic and distributed nature of these resources poses new challenges for the optimization of the behaviour of the system. New decentralized and self-organizing methods are needed to meet the needs of the Edge/Cloud scenario and to optimize the exploitation of Edge resources. In this paper we propose a distributed and adaptive solution that reduces the number of replicas of application services executed throughout the system, while ensuring that the latency constraints of applications, and thus the end users' QoS requirements, are met. Experimental evaluations through simulation show the effectiveness of the proposed approach.


INTRODUCTION
The adoption of Cloud computing is spreading at an incredible pace [1]. The transition is not only cost-effective for application providers, which avoid the set-up and most maintenance costs of the server-side infrastructure of services, but is also typically convenient for end-users, who exploit always up-to-date applications via lightweight clients. However, not all modern applications are easily migrated onto Cloud computing infrastructures, as some application features and constraints, such as latency sensitivity, can prevent a straightforward Cloud-based deployment and execution of services. Virtual reality applications, for instance, can hardly be moved far from their end-users without impairing the perceived QoE. Other examples of latency and bandwidth constraints are found among IoT-based applications, where a flow of raw data from a huge number of usually low-cost, low computational power sensors must be classified and processed in real time to react to changes in the environment, as in domotics [2].

Edge Computing is an attempt to overcome these limitations: it is a service computing paradigm that aims at bringing the computation as close as possible to the data producers and/or consumers (e.g., end-users). The direct effect of this approach is to strongly limit the latency between the data producers/consumers and the computing resources. In the Edge Computing perspective, the infrastructure is envisioned as revolving around a vertical computing continuum that goes from one (or more) Cloud(s) to a potentially large number of edge resources, geographically distributed over a usually broad area. The management of such a complex set of distributed, typically heterogeneous resources can be very burdensome and poses practical problems, as well as some interesting research challenges. Among these is the non-trivial task of selecting the most adequate resources for a given application service, conditioned by the set of its users and their dynamically varying locations. Many approaches have been proposed so far to tackle the problem, some based on traditional optimization techniques, others exploiting data-driven approaches (e.g., machine learning). There are solutions based on centralized solvers and others with a distributed nature.

This paper proposes a decentralized, latency-aware solution for application placement at the edge. Application instances on the edge interact with one another in order to perform workload exchanges, lowering the overall resource usage while keeping the latency between the users and the edge under an application-dependent preset threshold. The remaining part of this paper is organized as follows. Section 2 contextualizes this work in the related scientific literature. Section 3 presents our formal definition of the problem and illustrates the approach we propose. Section 4 describes the experimental evaluation of the proposed solution. Finally, Section 5 draws concluding remarks and highlights future work directions.

RELATED WORK
In recent years, many solutions have been proposed for optimizing the behaviour of Edge-based systems [3,4]. The common feature of these approaches is to limit the communications to centralized Cloud servers by moving services and/or data closer to users, thus enhancing the quality of the services offered to them. To achieve this result, decentralized and/or self-organizing methods are exploited.
Some solutions are focused on data-driven mechanisms. In this case, the data is moved in the system to make it easier for the users to access it. In particular, a common strategy is to shorten the distance between data producers or storage and consumers to reduce the latency needed to respond to the users' requests. Suitable distributed strategies for data placement and replication are proposed in works like Aral and Ovatman [30] and Li et al. [31]. Differently from these approaches, we deal with computing requests and with finding an optimal placement/replication of computing services.
Other solutions are more concerned with the placement of applications closer to the data to use and/or their respective users. For instance, the work described in [32] exploits a self-organizing clustering of the Edge data producer devices. Clusters are used to identify the characteristics of the data produced by the various groups of devices and, as a consequence, to optimally place the users' data analysis applications. Maia et al. [33] propose two methods, one based on genetic algorithms and one based on a Mixed-Integer Linear Programming heuristic. However, these two solutions are designed to work offline and with complete knowledge of the set of potential applications and computing nodes, i.e., a simpler scenario than the one we consider in this work. Ning et al. [34] propose a distributed, online solution for service placement at the Edge. Their solution is based on a probabilistic optimization method for computing the utility (in terms of cost, storage capability, and latency) of service migration and placement to determine the service placement configurations. De Lira et al. [35] implement a genetic algorithm that exploits the concept of user coverage to place latency-sensitive services according to the estimated location of end-users, in order to maximize their quality of experience.
Different from the previous set of works, the method we propose in this paper is related to the optimization of the Edge resource consumption levels. One of the solutions available in this field is proposed by Kavalionak et al. [36,37]. The authors design a distributed solution, where geographically dispersed Edge devices are able to coordinate, both among themselves and with a remote server. In this way, the devices (surveillance cameras, in their use case scenario) are able to fulfill their activities more efficiently by sharing and balancing the required computational costs. Beraldi et al. propose CooLoad [38], a distributed, cooperative load-balancing scheme. In this model, an Edge data center stops processing incoming requests when they exceed a given threshold. To balance the load, new requests are re-directed to an adjacent data center. Carlini et al. [39] have designed a completely decentralized system, where autonomous entities in a Cloud Federation are able to communicate to exchange services. The objective of the proposed approach is to maximize the profit of the whole Federation by mixing collaborative and selfish behaviours of the entities of the system. Indeed, the devices collaborate by mutually exchanging data and services. However, they make only the exchanges that increase their own subjective revenues, eventually leading the system to optimize its overall profits. Rather than balancing the load or maximizing revenues, in this paper we propose a solution that optimizes the resource usage on the Edge by limiting the number of redundant replicas of the applications that are executed in the system. As we explain in detail in the following sections, to achieve this result, point-to-point interactions between edge mini-clouds are used to move the users' requests from one mini-cloud to another, hence allowing some redundant replicas of the same application to be shut down. User requests are moved only if the users' latency QoS requirements are guaranteed. In this way, we avoid wasting resources on useless replicas without violating the QoS constraints.

PROPOSED SOLUTION
In this section, we provide a detailed description of our proposed approach. We first give a formal definition of the problem we face in this paper. Then, we describe how we have tackled the problem, using autonomous interactions between entities at the Edge.

Problem Definition and Model
In the context of this paper, we consider that the system at its edges is made of entities termed edge mini-clouds (EMCs). An EMC is an entity located at the Edge that is in charge of supervising a set of smaller, simpler devices. In this paper, the resources handled by an EMC are the sum of all the resources available within it. The only direct interactions in the system are those between different EMCs. Users of the system request the services provided by a set of applications. Each application offers a different type of service. Several instances (or replicas) of an application can be deployed on several EMCs, on the basis of users' requests and QoS constraints. Each EMC is in charge of supervising the execution (and all the related optimization aspects) of the application replicas it receives.
On the basis of this description, we can define that the system that we study in this paper is made of the following entities:
• a set of $M$ edge mini-clouds $E_1, \dots, E_M$;
• a set of $N$ application types $A_1, \dots, A_N$.
In addition, we consider that these entities have the following characteristics:
• Each edge mini-cloud $E_j$ has a specific capacity of resources (or maximum admittable execution cost) $C_j$, $j \in [1, \dots, M]$.
• Each application $A_i$ has an associated set of users $U_i$, with $|U_i| = u_i \ge 1$.
• Each application type $A_i$ is characterized by a fixed cost $c^f_i$ and a variable cost $c^v_i$. We assume that $A_i$ pays a cost $c^v_i$ for each of the users it serves, so that the total variable cost is computed as $c^v_i \cdot u_i$.
• To respect the QoS agreed with its users, each application $A_i$ has a specific maximum latency requirement $L_i$.
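As an illustration of this cost model (a worked example with purely hypothetical numbers, not values from our experiments), the cost that a replica of $A_i$ serving $u_{ij}$ users imposes on the EMC hosting it can be written and instantiated as follows:

```latex
\mathrm{cost}_{ij} = c^f_i + c^v_i \cdot u_{ij},
\qquad \text{e.g., } c^f_i = 4~\text{VCPUs},\ c^v_i = 0.5~\text{VCPU/user},\ u_{ij} = 10
\;\Rightarrow\; \mathrm{cost}_{ij} = 9~\text{VCPUs}.
```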
As highlighted in the Introduction, in this paper we face the problem of minimizing the total number of running replicas in the system. This should be done by properly choosing the replicas to keep active and by selecting the edge mini-clouds where they have to be allocated. The selection and allocation processes should ensure that all the users are served, without violating their latency requirements. Moreover, the edge mini-clouds' capacity limits should be respected.
Using the notation introduced in this section, this problem can be formally described as an optimization problem, using the following mathematical formulation:

\begin{align*}
\min \quad & \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{N} x_{ij} \left( c^f_i + c^v_i \cdot u_{ij} \right) \le C_j && \forall j \in [1, \dots, M] \\
& \sum_{j=1}^{M} x_{ij} \cdot u_{ij} = u_i && \forall i \in [1, \dots, N] \\
& x_{ij} \cdot l_{ij} \le L_i && \forall i \in [1, \dots, N],\ \forall j \in [1, \dots, M] \\
& x_{ij} \in \{0, 1\}
\end{align*}

where $x_{ij}$ indicates a replica of application $A_i$ on $E_j$ (active = 1, inactive/turned off = 0), $u_i$ is the total number of users of application $A_i$, $C_j$ is the maximum capacity of $E_j$, $u_{ij}$ is the number of users of application $A_i$ handled by the replica running on $E_j$, $l_{ij}$ is the latency experienced by the replica $x_{ij}$, while $L_i$ is the maximum allowable latency for any replica of $A_i$.
The objective function minimizes the total number of replicas of applications that are running in the system. The first constraint ensures that the total cost brought by the applications (and their users) running on each EMC does not exceed the EMC's maximum allowable cost. The second constraint guarantees that all the applications' users are served by the active replicas, while the third constraint imposes that the active replicas should respect the maximum latency assigned to their respective applications. Note that, by assuming $u_i \ge 1$, the second constraint also implies that $\sum_{j=1}^{M} x_{ij} \ge 1$ for every $i$, i.e., there is at least one active replica for each application $A_i$.

Solution Description and Pseudo-code
In the previous section, the problem we face in this paper has been described as an optimization problem. It is characterized by high complexity, and a general solution would require global knowledge of the system. Both these points force us to adopt a different approach, one that is also capable of taking into account the distributed and dynamic nature of the computing environment at the Edge.
In the rest of this section, we detail our proposed solution. This solution is a self-optimization scheme: it is based on the autonomous actions of the entities (the EMCs) that compose the system. The collective behaviour of these entities eventually leads the whole system to an optimized configuration.
The description proposed in this section follows the pseudo-code given in Algorithm 1. The algorithm describes the steps performed by a generic EMC $E_j$. Specifically, we consider that $E_j$ has a set $\mathcal{N}_j$ of neighboring EMCs. These neighbors are the EMCs that are in the communication range of $E_j$. We assume that the latency between $E_j$ and any other $E_k \in \mathcal{N}_j$ is $\lambda(E_j, E_k)$. Moreover, let $\mathcal{A}_j$ be the set of applications running on $E_j$. Each application $A_i \in \mathcal{A}_j$ is characterized by a maximum agreed latency $L_i$ and a set of users $U_{ij}$, which have a maximum experienced latency $l_{ij}$. In order to contribute to a solution of the overall problem, at regular intervals $E_j$ performs the following actions. It randomly chooses one of its applications $A_i \in \mathcal{A}_j$. Then, it selects the subset $\mathcal{N}'_j \subseteq \mathcal{N}_j$ of its neighbors that respect the condition $\lambda(E_j, E_k) + l_{ij} \le L_i$. This condition allows $E_j$ to determine which neighbors are eligible for receiving the users of $A_i$ without violating the latency constraint $L_i$.
At this point, $E_j$ randomly selects and contacts a neighbor $E_k \in \mathcal{N}'_j$.
$E_k$ responds by communicating to $E_j$ the list of the application replicas it is running, its current resource occupancy $O_k$, and its maximum capacity $C_k$. $E_j$ checks whether $E_k$ runs a replica of $A_i$, and whether the cost of adding the $u_{ij}$ users to $E_k$ does not exceed $C_k$, i.e., $O_k + c^v_i \cdot u_{ij} \le C_k$. In case these conditions are satisfied, $E_j$ redirects all its users of $A_i$ to the replica on $E_k$. Then, it shuts down its own replica of $A_i$.
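To make the behaviour of Algorithm 1 more concrete, the following is a minimal Python sketch of the per-iteration step run by each EMC. The class and attribute names (Replica, EMC, occupancy, step, and so on) are hypothetical and only illustrate the decision logic described above; they are not part of the paper's pseudo-code or of any simulator API.

```python
import random

class Replica:
    """A running replica of an application type on an EMC (illustrative only)."""
    def __init__(self, app_id, fixed_cost, var_cost, max_latency, users, max_user_latency):
        self.app_id = app_id
        self.fixed_cost = fixed_cost                # c^f_i
        self.var_cost = var_cost                    # c^v_i (cost per served user)
        self.max_latency = max_latency              # L_i
        self.users = set(users)                     # users currently served here
        self.max_user_latency = max_user_latency    # l_ij

class EMC:
    """Edge mini-cloud acting as an autonomous agent (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity                    # C_j
        self.replicas = {}                          # app_id -> Replica
        self.neighbors = {}                         # neighbor EMC -> latency lambda(E_j, E_k)

    def occupancy(self):
        """Current resource occupancy: fixed plus variable costs of hosted replicas."""
        return sum(r.fixed_cost + r.var_cost * len(r.users) for r in self.replicas.values())

    def step(self):
        """One iteration of the self-optimization scheme, initiated by this EMC."""
        if not self.replicas:
            return
        replica = random.choice(list(self.replicas.values()))
        # Neighbors that could serve these users without violating L_i.
        eligible = [emc for emc, lat in self.neighbors.items()
                    if lat + replica.max_user_latency <= replica.max_latency]
        if not eligible:
            return
        target = random.choice(eligible)
        remote = target.replicas.get(replica.app_id)
        extra_cost = replica.var_cost * len(replica.users)
        # Redirect users only if the neighbor already runs the application
        # and has enough spare capacity, then shut down the local replica.
        if remote is not None and target.occupancy() + extra_cost <= target.capacity:
            remote.users |= replica.users
            del self.replicas[replica.app_id]
```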
In our algorithm, the entities in the system collaboratively try to detect application replicas that can be considered redundant, and then regroup the users of those applications in order to use fewer replicas. The applications' latency constraints are never violated by the regrouping, while the progressive elimination of redundant replicas and the limits imposed by the EMCs' capacities eventually lead the system to converge.
In this paper, such convergence is not formally demonstrated; we leave it as future work. A possible technique is to represent the algorithm as a workflow, using modelling languages such as StateFlow/StateCharts. Then, such workflows can be automatically converted into linear temporal logic formulae and, by exploiting a model checker tool as in [40], one can formally verify a property of the whole system, namely that a terminal state is always eventually reached along every computation path.
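As an illustration only (this formula is ours, not taken from [40]), under the usual LTL semantics, which quantifies over all computation paths, such a liveness property could be expressed as:

```latex
\Diamond\, \mathit{terminal}
\qquad \text{(every computation path eventually reaches a terminal state)}
```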

EXPERIMENTAL EVALUATION
In order to validate our approach, we simulated a target scenario using PureEdgeSim [41], a discrete-event simulator for Edge environments. The simulated scenario consists of a federation of EMCs. Each EMC is composed of a heterogeneous set of resource-constrained edge devices and servers, able to host various types of applications on behalf of a set of users. Addressing such complexity and heterogeneity is beyond the scope of this work, so each EMC is simulated as a single aggregated entity, i.e., a PureEdgeSim Datacenter object with a quantity of resources that is the sum of the resources of the devices and servers that compose it. Each user device is simulated with a PureEdgeSim EdgeDevice object (e.g., a tablet or a smartphone). At the beginning of the simulation, each user device offloads a single application instance (represented by a Task object in PureEdgeSim) to the closest EMC. If an instance of the same application type already exists on that EMC, the user is added to the set of users served by the pre-existing instance, increasing its footprint. In this work, mobility of user devices, energy consumption of devices, and dynamic churn of users are not simulated. The number of users is fixed at the beginning of each experiment, varying in the set {60, 120, 180}. Each user device is generated at the beginning of each experiment and placed randomly in a simulated bi-dimensional rectangular space of 200 × 200 metres.
In our experiments the number of EMCs is fixed to 4. They are placed at predefined locations inside the simulation space, at the vertices of a square with coordinates {50,50}, {50,150}, {150,50} and {150,150}. An investigation of the behaviour of our proposed solution when varying the number of EMCs is beyond the aim of this paper; similarly, for this preliminary work, a comparison of our algorithm with the state-of-the-art algorithms in the literature is not performed, since our aim is to show that our algorithm is effective and actually reaches the goals expressed in its objective function. Such a comparison will be performed as future work. A set of assumptions are made, in accordance with our model: (1) any application type can be run on any EMC, (2) there is at most a single instance of each application type on each EMC, and (3) all the EMCs belonging to the federation are able to communicate with each other and with all user devices, providing full connectivity. The types of resources simulated for EMCs and applications are limited to three: VCPUs, RAM and bandwidth. For an application, they represent, respectively, the number of VCPUs, the amount of RAM and the amount of network bandwidth required by the VMs/containers of a certain application type. For an EMC, instead, they represent, respectively, the maximum declared capacity of VCPUs, the maximum declared capacity of RAM, in millions of bytes, and the maximum amount of aggregated bandwidth over the communication links between the EMC and other EMCs or user devices.
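For concreteness, the simulated topology and the initial "offload to the closest EMC" placement described above can be sketched as follows. This is an illustration under the stated assumptions (4 EMCs at the vertices of a square, users placed uniformly at random in the 200 × 200 m area); the function names are hypothetical and this is not PureEdgeSim code.

```python
import math
import random

# The four EMCs are placed at the vertices of a square inside the 200 x 200 m area.
EMC_POSITIONS = [(50, 50), (50, 150), (150, 50), (150, 150)]

def place_users(num_users, area_side=200.0, seed=None):
    """Place user devices uniformly at random in the bi-dimensional simulation space."""
    rng = random.Random(seed)
    return [(rng.uniform(0, area_side), rng.uniform(0, area_side)) for _ in range(num_users)]

def closest_emc(device_pos):
    """Initial placement: each user device offloads its application to the closest EMC."""
    return min(range(len(EMC_POSITIONS)),
               key=lambda j: math.dist(EMC_POSITIONS[j], device_pos))

# Example with one of the experiment sizes (60, 120 or 180 users).
users = place_users(120, seed=42)
initial_assignment = [closest_emc(pos) for pos in users]
```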
There are 4 different application types, each characterised by a different resource footprint, composed of a fixed cost (the amount of resources required to start an application instance) and a variable cost (for each additional user). The different application types and their costs are described in Table 1 and Table 2.
Computational Bound, Memory Bound and I/O Bound application types simulate a computationally intensive, memory intensive and networking intensive application, respectively. In this perspective, "intensive" means having double the requirements of VCPUs, RAM or bandwidth with respect to the basic Balanced application type. For each application type, we have also specified 3 different values for the maximum network latency: such values vary in the set {0.2, 0.3, 0.5} seconds, and are aimed at testing the behaviour of the experiments with applications of different latency sensitivity. Thus, the number of distinct types of applications used in our experiments is 12 (4 types of applications times 3 different latency constraints). Consequently, each EMC has an equivalent amount of resources, allowing it to host 80-100% of the replicas of the different types of applications at the beginning of each experiment; the capacity of each EMC is {VCPU = 24, RAM = 6 Gbytes, Bandwidth = 600 Mbit/s}.

A simulated latency is calculated for each user's device $d_u$, using the function

$\text{latency}(E_j, d_u) = l_{ch} + k \cdot \text{dist}(E_j, d_u)$

This function is composed of two parts:
• a fixed part, $l_{ch}$, which depends on the communication channel type; in our experiments, it is fixed at 0.1 seconds and also accounts for the part of the latency that depends on the bandwidth of the channel and on the size of the sent packet, which is negligible in our experiments;
• a linear part, proportional to the Euclidean distance $\text{dist}(E_j, d_u)$ between the EMC hosting the instance of the serving application and the user's device $d_u$. The predefined constant $k$ represents the latency cost for each unit of distance and is fixed in our experiments at 0.0005.

Our simulation is divided into iterations occurring at discrete time intervals. Each EMC is an agent that, once per iteration, acts as initiator for the algorithm. In our scenario, an iteration is started every 30 seconds, and every experiment has a fixed simulated duration.

The graph in Figure 1 shows the speed of convergence and the effectiveness of our algorithm in optimizing the objective function, with a varying number of users. As we expected, with the small federation used in our experiments, even with a limited number of users (120-180) the number of instances initially hosted by the federation is close to the maximum the whole platform can bear, that is 48; only with 60 users is the initial number of instances 34. The graph shows a high speed of convergence, also for the scenario with a low number of users, mainly due to the limited number of EMCs and the high number of initial application instances, which increase the probability that the randomly chosen EMC actually hosts an instance of the same randomly chosen application type and that the swap does not violate latency constraints. Convergence is reached within 12-15 iterations, with a reduction in the number of instances of about 45% on average for 120 and 180 users, and of 25% for 60 users. This is an almost ideal scenario, as we expect a smaller reduction in the number of instances and a slower convergence for scenarios with (1) a greater number of EMCs, keeping the number of users fixed, or (2) lower maximum latencies. Both conditions decrease the probability of being able to move users to another EMC in order to coalesce application instances.
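As a small, self-contained illustration of the simulated latency model presented earlier in this section (the constant names are ours; this is a sketch, not simulator code):

```python
import math

# Constants from the experimental setup described above.
FIXED_CHANNEL_LATENCY = 0.1   # l_ch, in seconds (channel-dependent fixed part)
LATENCY_PER_METRE = 0.0005    # k, in seconds per metre of Euclidean distance

def simulated_latency(emc_pos, device_pos):
    """Latency between the EMC hosting the serving instance and a user device."""
    distance = math.dist(emc_pos, device_pos)
    return FIXED_CHANNEL_LATENCY + LATENCY_PER_METRE * distance

# Example: a device at (10, 10) served by the EMC at (50, 50):
# distance is about 56.6 m, so the latency is about 0.1 + 0.0005 * 56.6 ~= 0.128 s,
# below even the strictest 0.2 s application latency constraint.
print(simulated_latency((50, 50), (10, 10)))
```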
The graph in Figure 2 shows the reduction in resource consumption for all resource types and the entire federation, varying the number of users. Analyzing the graph, we can see that the reduction is more pronounced during the first 10-15 iterations of the algorithm: this is an expected result, due to the corresponding behaviour of the number of application instances that are switched off. We remark that the asymptotic resource cost is not linear in the number of users, as the fixed costs are proportional to the number of EMCs and the allocation is constrained by the actual set of communication latencies, which will generally prevent full utilization of instances.
Finally, the graph in Figure 3 shows the average measured network link latency, as a percentage of the average maximum network link latency, for all the application types. We know from the simulator that no latency constraint is exceeded, hence it makes sense to look at the results aggregated with respect to the varying number of users. Analyzing the results, we can observe that, at the beginning of the simulation, the average measured latency is about 36% of the average maximum latency, as every user offloads its application to the closest EMC. While choosing the locally minimal latency ensures the minimum average latency overall, it does not lead to a minimization of the overall resource consumption or of the number of application instances. The two metrics of average communication latency and resource usage are actually in conflict, and optimizing the latter subject to the constraint on the former requires a trade-off. The graph shows this trade-off in the simulation: the average network latency increases as users are moved to other EMCs by the resource optimization algorithm, as opposed to the previous two graphs, which show a reduction in resource consumption and in the number of application instances.
Analyzing the experimental results, we can conclude that our algorithm is able to reach a sound trade-off, at least in the most common scenarios. The results show a good speed of convergence, achieving most of the resource savings within the first 15 iterations in our experiments.

CONCLUSIONS AND FUTURE WORK
This paper presents a solution for application placement in an edge computing environment based on a fully decentralized approach. It works by performing inter-edge exchanges with the objective of reducing the resource usage, while safeguarding the QoE of applications by keeping the communication latency below application-dependent thresholds. The paper gives a formal definition of the constraints and the objectives, as well as the pseudo-code of the proposed approach. An experimental evaluation via simulation shows the viability and validity of the solution in the most common scenarios, where the number of instances and the overall resource consumption of a set of application types in an Edge Federation are minimized without violating the latency constraints.
While the solution is quite promising, there is room to improve the results in the near future. It is worth, for example, considering alternative local search criteria and heuristics for the selection of the EMC and of the application involved in the swap proposal, which is currently a plain random choice. This may improve the asymptotic cost savings and is likely to improve both the achieved savings and the convergence speed of our algorithm. We also plan to provide a complexity analysis of the approach, a more detailed experimental evaluation including user mobility and variations in the edge resources, and an evaluation of the impact on energy consumption.