Darly: Optimizing serverless video analytics via Deep Reinforcement Learning for QoS-aware scheduling under resource heterogeneity

Today, video analytics are becoming extremely popular due to the increasing need to extract valuable information from videos available in public sharing services and camera-driven streams. Typically, video analytics are organized as a set of separate tasks, each with different resource requirements (e.g., computation- vs. memory-intensive tasks). The serverless computing paradigm is a very promising approach for mapping such applications, as it enables fine-grained deployment and management on a per-function basis. However, modern serverless frameworks suffer from performance variability issues, due to i) the interference introduced by the co-location of third-party workloads with the serverless functions and ii) the increasing hardware heterogeneity of public clouds. To this end, this work introduces Darly, a QoS- and heterogeneity-aware Deep Reinforcement Learning-based scheduler for serverless video analytics deployments. The proposed framework incorporates a DRL agent that exploits low-level performance counters to identify the level of interference and the degree of heterogeneity in the underlying infrastructure, and combines this information with user-defined QoS requirements to dynamically optimize resource allocations by deciding the placement, migration, or horizontal scaling of serverless functions. Our experiments produce promising results, which form a groundwork we intend to build upon further.


I. INTRODUCTION
Video traffic is already high and is projected to increase further over the next years [1]. Video analytics are typically offloaded to the Cloud, due to the quasi-unlimited computing capacity it offers. Serverless computing is an emerging paradigm offering a very high-level abstraction of the cloud infrastructure to end-users. However, it comes with decreased control over the infrastructure itself, leading to limitations in the efficient management of resources that often result in Quality of Service (QoS) violations, due to the high degree of performance variability [5].
Serverless workloads are managed by open-source runtimes such as OpenFaaS and OpenWhisk, which leverage container orchestrators such as Kubernetes for scheduling and deployment. Nonetheless, workload orchestrators (e.g., the native Kubernetes scheduler) usually apply resource management decisions once, at the beginning of each job, neglecting future system states or QoS targets. Stateless scheduling approaches for meeting QoS invest in resource over-provisioning, application workload modelling [2] and horizontal/vertical scaling [6], but occasionally fail due to i) the heterogeneity of hardware resources [4], ii) interference phenomena in multi-tenant environments [7], where, as a consequence of resource sharing and contention, a workload's execution interferes with the execution of other applications, and iii) unawareness of workload-specific features at both function and workflow scope [8].
Improving scheduling efficiency in modern cloud environments requires continuous feedback loops that boost the algorithm's heterogeneity, interference and performance awareness, and thus enable informed decision-making for serving multiple QoS requirements. Deep Reinforcement Learning (DRL) is a very effective approach to modelling environmental variability so as to derive orchestration strategies for the online management of serverless workflows.
We present Darly, a DRL-based scheduling framework for managing video analytics pipelines in serverless infrastructures, which by design can also be applied to granular serverless workflows from other domains. Darly exploits low-level system metrics to identify interference in the underlying cluster and, along with user-defined QoS requirements, aims to regulate the end-to-end latency of video analytics pipelines through horizontal scaling and migration of the pipeline's functions. Our solution dynamically orchestrates the deployed functions under various run-time conditions, i.e., system-level resource fluctuations due to interference and/or dynamically changing QoS criteria.

II. MOTIVATION

We develop a video-analytics workflow reflecting a real-world pipeline [9] that performs computer vision (CV) inference on the frames of a video. We explore the influence of interference and heterogeneity on its performance under various scenarios, in a cluster of four Virtual Machines (VMs) deployed on top of an on-premise, highly heterogeneous, high-end server infrastructure. Our pipeline is represented as a DAG in Fig. 1 and consists of 5 separate functions: i) Framer, which extracts frames from the input mp4 video file; ii) Face-detector, which detects whether a human face exists in a frame and forwards the processed frame to iii) or iv); iii) Face-analyzer, which performs emotion recognition on a detected human face; iv) Object-recognition, which classifies objects existing in the frame; and v) Uploader, which aggregates the results from iii) and iv) and uploads them to remote storage. We invoke our pipeline with four distinct input sizes and measure the average execution latency of the individual functions, which sum up to the end-to-end latency.
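The five-stage pipeline above can be sketched as a small DAG description. The function names follow the paper; the dictionary layout and the latency helper are illustrative assumptions, not part of the actual implementation:

```python
# Illustrative sketch of the video-analytics pipeline as a DAG.
# Each key maps a function to its predecessors in the workflow.
from graphlib import TopologicalSorter

PIPELINE = {
    "framer": [],                             # extracts frames from the mp4 input
    "face-detector": ["framer"],              # detects whether a frame has a face
    "face-analyzer": ["face-detector"],       # emotion recognition on detected faces
    "object-recognition": ["face-detector"],  # classifies objects in the frame
    "uploader": ["face-analyzer", "object-recognition"],  # aggregates, uploads
}

def end_to_end_latency(per_function_latency: dict) -> float:
    """Per the paper, per-function latencies sum to the end-to-end latency."""
    return sum(per_function_latency.values())

# Framer must run first, Uploader last.
order = list(TopologicalSorter(PIPELINE).static_order())
```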
Impact of interference: We spawn varying numbers of CPU micro-benchmarks from the iBench suite [3] and apply four levels of interference to w01: 0%, 25%, 50% and 75% of the total available cores, as portrayed in Fig. 2. Significant, non-linear performance variability is observed w.r.t. CPU interference, reaching up to 57.6% for the 16-frames input in the Framer function and up to 47.2% for the 32-frames input in the CV model functions.
Impact of heterogeneity: Fig. 3 shows the performance variation of the examined functions w.r.t. hardware heterogeneity. VMs with fewer cores provide less multi-threading capacity to the hosted functions, resulting in higher latency. For the Framer function we find deltas with a maximum of 23% and a minimum of 10% performance variation for the 16-frames and 65-frames inputs respectively. For the ML-

III. Darly: A DYNAMIC DRL-BASED SCHEDULER
We design a dynamic scheduler for modern cloud environments that is aware of node heterogeneity and of resource interference from third-party co-located workloads, and resistant to fluctuations caused by unpredictable user demand. Based on the findings outlined in Section II, we leverage DRL to implement Darly, a dynamic DRL-based scheduler for the widely adopted open-source serverless platform OpenFaaS.
Our scheduler receives a video request with a QoS constraint and, after scanning the cluster state, orchestrates the topology of the workflow functions in order to optimize resource utilization and regulate end-to-end latency without exceeding the user-defined QoS. The proposed framework, shown in Fig. 4, consists of four components: a System Monitor, which collects low-level metrics (i.e., IPC, L3-cache misses, Memory Reads/Writes, C-states) depicting the system's state; a DRL-based Agent, which reads the system metrics and calculates the next action to be performed on the deployed functions with respect to the specified QoS; a Runtime Engine, which accommodates the execution of a workflow instance given the current placement of the functions; and a function Mapper, which maps a function to a node according to the agent's latest decision.
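The counters listed above must be packed into a fixed-size observation before they can feed the agent. The metric names come from the paper; the per-node layout, normalization and the inclusion of the QoS target in the state are illustrative assumptions:

```python
# Sketch: packing the System Monitor's low-level counters into a fixed
# state vector for the agent. Metric names come from the paper; the
# vector layout and the appended QoS target are assumptions.
METRICS = ("ipc", "l3_misses", "mem_reads", "mem_writes", "c_states")

def build_state(per_node_counters: list, qos_target: float) -> list:
    """Flatten per-node counters plus the QoS target into one vector."""
    state = []
    for counters in per_node_counters:
        state.extend(float(counters[m]) for m in METRICS)
    state.append(qos_target)
    return state
```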
The DRL-based agent, utilizing a Deep Q-Network (DQN), interacts with the environment at discrete time steps t and aims to maximize the received reward over time by choosing an action A_t from a discrete set of available actions, forcing the transition of the environment to a new state.
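A single decision step of such an agent commonly follows an epsilon-greedy rule over the Q-values. The sketch below abstracts the Q-network away (any model mapping a state to one value per action would do) and is a generic DQN idiom, not Darly's exact policy:

```python
import random

# Minimal sketch of one epsilon-greedy DQN decision step.
# q_values[a] is the network's value estimate for action a in the
# current state; the network itself is abstracted away here.
def select_action(q_values: list, epsilon: float) -> int:
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```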
Action set A: Includes i) per-function horizontal scaling, ii) function migration to different nodes, and iii) inactivity, i.e., preserving the function topology as is.
Rewarding Strategy: The incentive behind the reward function (Eq. 1) is the regulation of the execution latency (L_a) by striving not to violate the latency threshold set by the user (L_t), while attempting to minimize both the number of utilized servers (sp) and the replica count (r) for each function.
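Eq. 1 itself is not reproduced in this excerpt, so the sketch below encodes only the stated incentives in one plausible shape: a penalty that grows with the overshoot when L_a exceeds L_t, and otherwise a reward discounted by server and replica usage. The weights w_sp and w_r are assumptions:

```python
# One plausible reward shape for the stated incentives; Eq. 1 is not
# reproduced in this excerpt, and the weights w_sp, w_r are assumed.
def reward(l_a: float, l_t: float, sp: int, r: int,
           w_sp: float = 0.1, w_r: float = 0.05) -> float:
    if l_a > l_t:
        # QoS violation: negative reward growing with the overshoot.
        return -(l_a / l_t)
    # Within QoS: reward closeness to the target while penalizing the
    # number of utilized servers (sp) and the replica count (r).
    return (l_a / l_t) - w_sp * sp - w_r * r
```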
IV. EVALUATION

We evaluate our framework w.r.t. its efficiency in identifying the appropriate actions for satisfying the pre-defined latency constraint while allocating the least amount of resources. We quantify the scheduler's performance by the QoS quotient (i.e., the achieved execution latency divided by the user-defined QoS), the agent's cumulative reward over time and the QoS violations, which are depicted in Fig. 5.
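The QoS quotient defined above reduces to a one-line ratio; values at or below 1 satisfy the constraint, and values close to 1 indicate little wasted resource slack. A minimal sketch, with the helper names being our own:

```python
def qos_quotient(achieved_latency: float, qos_target: float) -> float:
    """Achieved end-to-end latency divided by the user-defined QoS
    (per the paper's definition); <= 1 means the constraint holds."""
    return achieved_latency / qos_target

def is_violation(achieved_latency: float, qos_target: float) -> bool:
    """A QoS violation occurs when the quotient exceeds 1."""
    return qos_quotient(achieved_latency, qos_target) > 1.0
```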
Experimental Conditions: We set two QoS levels based on the performance characterization presented in Sec. II, i.e., 35 and 26 seconds, corresponding to a looser and a stricter constraint respectively. Training on discrete levels of QoS is essential for exposing the DRL-based agent to a wide enough spectrum of states, facilitating its ability to identify patterns among (state, action) pairs. During the experiments we dynamically change the underlying interference on the cluster by randomly altering the number of CPU micro-benchmarks per VM in the same way as explained in Sec. II.
Examined Schedulers: We examine four different schedulers, so as to determine the inter-relationship between the DRL agent's proposed actions and the employed scheduling mechanism. We aim to quantify the impact of i) the scheduling granularity when migrating functions and ii) heterogeneity- and interference-awareness in our proposed framework. Specifically, we developed four distinct schedulers, all differing in their action space: Fullmap-based, Custom-based, Kubernetes-based and Profile-based. The former two decide both the migration and the destination of a function, while the latter two decide only whether a function should be migrated or not; a third-party scheduler (Kubernetes and the Profile-based policy, respectively) then places the migrating function on a node according to its own policy. The Profile-based scheduler leverages offline profiling information (Sec. II) on the performance of the deployed functions and decides the optimal scheduling policy accordingly.
Results: As depicted in Fig. 5a, the Custom- and Profile-based schedulers manage to adapt effectively to all changes in resource stress levels while regularizing the QoS quotient (i.e., remaining close to the upper bound of 1). A similar but less stable result is achieved by the Fullmap-based scheduler which, due to its larger action space, is less prone to convergence. Last comes the Kubernetes-based approach, which, due to its heterogeneity- and interference-unawareness, fails to adjust its migration decisions to the prevailing conditions.

Fig. 5: Comparative evaluation of different schedulers during the training of the DRL agent.