Dynamic Resource Aware VNF Placement with Deep Reinforcement Learning for 5G Networks

The increasing demand for fast, reliable, and robust network services has driven the telecommunications industry to design novel network architectures that employ Network Functions Virtualization and Software Defined Networking. Despite the advancements in cellular networks, there is still a need for an automatic, self-adapting orchestration mechanism that can manage the placement of resources. Deep Reinforcement Learning can perform such tasks dynamically, without any prior knowledge. In this work, we leverage a Deep Deterministic Policy Gradient Reinforcement Learning algorithm to fully automate the Virtual Network Functions deployment process between edge and cloud network nodes. We evaluate the performance of our implementation and compare it with alternative solutions to demonstrate its superiority, with results that pave the way for Experiential Network Intelligence and fully automated, Zero-touch network Service Management.


I. INTRODUCTION
The explosive growth of Fifth Generation (5G) networks and their complexity surpasses the limits of manual administration. The increasing demand for agile and robust network services has driven the telecommunication industry to design innovative network architectures that employ Network Functions Virtualization (NFV) and Software Defined Networking (SDN). NFV is the core enabling concept that proposes to decouple network functions from proprietary hardware appliances and emulate them in virtualized containers. Compared to traditional network functions implemented by dedicated hardware appliances, NFV has the potential to significantly reduce operating and capital expenses and improve service agility. SDN aims to enable intelligent and centralized network control using software running on commodity hardware instead of proprietary hardware appliances. According to both academia and industry, the combination of the two concepts is expected to revolutionize network operations, not only by substantially reducing costs [1], but also by introducing new possibilities for enterprises, carriers, and service providers.
The advances in NFV and cloud technologies have led to the emerging paradigm of Multi-access Edge Computing (MEC), which brings cloud computing capabilities and services close to the 5G base stations. Hence, it creates a new ecosystem that supports the flexible deployment of new applications and services close to the end-users, thus significantly reducing the overall delay. This opens up new possibilities in advanced use case scenarios such as ultra-Reliable Low-Latency Communication (uRLLC) and Enhanced Mobile Broadband (eMBB).
The strategic placement of Virtual Network Functions (VNFs), which can be composed of one or more Virtual Machines (VMs), to the most suitable location is directly tied to the availability of physical resources. It can greatly affect network performance, service latency, overall expenses, power consumption, and even carbon emissions. Furthermore, advanced VNF migration techniques support the spatial relocation of VMs from one Data Center (DC) to another with virtually no downtime [2]. This new feature, known as Live Migration, provides unique opportunities for the dynamic placement and life-cycle management of VNFs [3] [4].
The wide variety of challenges introduced by the disruptive deployment of 5G triggers the need for a drastic transformation in the way services are managed and orchestrated. Orchestrating services can become time-consuming due to the manual configurations required for VNF integration. The intention is to have all operational processes and tasks, such as delivery, deployment, configuration, and optimization, executed automatically [5]. Since the world is already moving towards data-driven automation, technologies like Machine Learning (ML) can be employed to tackle part of the workload. New advancements in Deep Reinforcement Learning (DRL) will pave the way for a more intelligent and self-organizing network [6].
DRL has recently provided important breakthroughs, not only by defeating human players in a plethora of board [7] and video games [8], but also through numerous industrial applications, such as self-driving cars [9].
Despite the significant advancements in networking brought by the introduction of 5G technologies, deployment and orchestration tools still require additional research and development to reach an acceptable level of automation. Since inefficient resource utilization might lead to severe performance degradation [10], there is a need for an automated, self-adapting orchestrating instrument that can manage resources wisely, taking into account as many relevant variables as possible [11]. Static scaling and migration, based solely on host system resources, overlook important factors such as End-to-End (E2E) latency and network performance that are crucial for uRLLC services. An auspicious solution is the use of traditional Deep Learning (DL) algorithms for the prediction of future network states. However, due to their high complexity, the need for offline training, and the increased power consumption, operators may face increased operational expenses and network performance issues. Thus, there is a need for a novel and flexible mechanism for intelligent VNF placement [12] to fully exploit the use of MEC for uRLLC services.
In this paper, we leverage a Deep Deterministic Policy Gradient (DDPG) Reinforcement Learning (RL) algorithm to solve the NFV placement problem between MEC DCs to minimize latency for uRLLC services. Specifically, we follow the definition of ETSI Experiential Network Intelligence (ENI) and Zero-touch network Service Management (ZSM) standards [13]. Our contribution is threefold: (i) we automate live migration with DRL, (ii) we introduce a system capable of adapting to the current network conditions, and (iii) we manage edge computing and link resources in a way beneficial to both the vendors and the end-users.
The remainder of this paper is organized as follows. Section II provides a discussion of the related work. Section III presents the network architecture of our implementation. Section IV introduces the theoretical background of DDPG and defines in detail the solution that we have developed. Section V showcases the experimental setup and the results produced. Finally, Section VI provides a conclusion of this work and our future intentions.

II. RELATED WORK
Although a plethora of research works have addressed the dynamic VM migration between DCs for orchestration purposes, only a handful of them studied the use of DRL algorithms as a potential solution to the problem. In the following, we explore relevant works in the literature to provide a wider view of the subject.
Cziva et al. [14] utilize Integer Linear Programming and the Gurobi Solver to minimize the E2E latency between users and their VNFs. The results seem encouraging, but the complexity of the problem requires additional factors to be taken into consideration. In [15], Zhang et al. propose an online adaptive control mechanism that aims to reduce operating costs and enable resource allocation through reconfiguration. The main objective is the minimization of latency by taking into account Service Level Agreements (SLAs) and the resource cost, avoiding any resource optimization technique. Although their work provides considerable value for the research community, there is a need for a solution that incorporates resource provisioning to foresee and adapt in advance to the forthcoming network traffic. L. Yala et al. [16] propose a Genetic Algorithm to solve the VNF placement problem between MEC servers for uRLLC services. Their objective is to minimize latency and maximize service availability by solving a multi-objective optimization problem that captures the trade-off between service access latency and availability; their experimental results reveal that the solution is close to the optimal. Ben Jemaa et al. [17] introduce VNF placement and provisioning optimization strategies, taking into account Quality of Service (QoS) requirements. However, additional metrics besides Central Processing Unit (CPU) utilization and link capacity should be carefully taken into consideration. Moreover, by exclusively employing VMs, they do not fully exploit the potential of MEC, as with container technology more VNFs can share the same resources.

III. SYSTEM MODEL
We consider a 5G-capable network consisting of a set of DCs in a star topology [18]. The DCs can be located either at the network edge, close to the end users, or in a remote cloud service. The topology consists of (i) a single Cloud DC, denoted as C, that hosts numerous VNFs, denoted as V , (ii) multiple MEC DCs M , that offer computational power near the edge, (iii) network switches W , and (iv) User Equipment (UE) devices denoted as U . The links that interconnect all entities in the network are denoted as L and defined with physical limitations l = {delay, bandwidth}. The network offers a uRLLC service composed of multiple VNFs, each made up of multiple VMs in the DCs. We assume that all MEC DCs have the same capacity c = {cpu, ram, storage}, which limits the number of VNFs that a node can host. All entities are connected to an overlaying controller node that orchestrates the entire SDN and hosts the proposed algorithm. The topology is described as an undirected graph G = (C, L, W, M, U ). Both C and M DCs can host multiple VNFs at any given time.
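For illustration, the star topology above can be sketched as a plain adjacency structure. The single central switch, the node names, and the link limit values in this sketch are assumptions, not the paper's exact configuration.

```python
def build_topology(n_mec=5):
    """Sketch of the graph G = (C, L, W, M, U): one Cloud DC, n_mec MEC DCs,
    and a single central switch, joined in a star. Each link carries the
    physical limits l = {delay, bandwidth}; the numbers are placeholders."""
    nodes = ["C"] + [f"M{i}" for i in range(n_mec)] + ["W0"]
    # every DC connects to the central switch, forming the star
    links = {("W0", n): {"delay": 1.0, "bandwidth": 1000.0}
             for n in nodes if n != "W0"}
    return nodes, links

nodes, links = build_topology()
```

A full model would also attach the UE devices U and the controller node; they are omitted here for brevity.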
We define as VNF profile a set of variables that declare the total computational resources, the SLA, and the average service round-trip latency needed by the VNFs of the service. It is denoted with P = {cpu, memory, storage, latency, sla}.
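The profile P can be carried around as a small record. This is a minimal sketch; the field types and units (cores, GB, milliseconds) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class VNFProfile:
    """P = {cpu, memory, storage, latency, sla} as defined above."""
    cpu: float      # requested CPU (cores)
    memory: float   # requested RAM (GB)
    storage: float  # requested storage (GB)
    latency: float  # average service round-trip latency (ms)
    sla: float      # SLA latency requirement (ms)

profile = VNFProfile(cpu=0.5, memory=0.25, storage=1.0, latency=12.0, sla=15.0)
```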
The objective is to calculate an appropriate placement of the VNFs on the available MEC DCs, at any given time. The placement should minimize the average E2E latency between the users and the uRLLC service VNFs while considering the distribution of available computational resources at the network edge.
The overall procedure repeats indefinitely and will be analyzed extensively in the following sections.

IV. DEEP REINFORCEMENT LEARNING APPROACH
In this section, we will make a brief introduction to the background of DRL, DDPG and the relation between them. Later on, we will focus on Actor-Critic architecture and how it applies to our proposed architecture.

A. Markov Decision Process
First, we formalize the VNF placement problem as a Markov Decision Process. It is a decision-making process that allows us to mathematically represent an environment through Actions, States and Rewards.
In this problem, as State we define: (i) the computational resource load of all MEC DCs M , (ii) the load of all links L, and (iii) the VNF profile P . As Action we define an integer that indicates the ID of the MEC where the VNF should be placed, and as Reward the feedback value that quantifies the change the given Action produced in the environment, which in this case is the service latency.
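The State above can be flattened into a single input vector for the learning algorithm. The concatenation order and the sizes below are assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def build_state(mec_loads, link_loads, vnf_profile):
    """Concatenate (i) MEC DC resource loads, (ii) link loads, and
    (iii) the VNF profile into one flat State vector."""
    return np.concatenate([
        np.asarray(mec_loads, dtype=np.float32).ravel(),
        np.asarray(link_loads, dtype=np.float32).ravel(),
        np.asarray(vnf_profile, dtype=np.float32).ravel(),
    ])

# Illustrative sizes: 5 MEC DCs x 3 resource metrics, 5 links, 5 profile fields
state = build_state(np.random.rand(5, 3), np.random.rand(5),
                    [0.5, 0.25, 1.0, 12.0, 15.0])
```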
To reduce the large discrete action space, we devise a serialization method that calculates the placement of each VNF one at a time. This notably decreases the execution time, the complexity of the algorithm and the total training time. Given that, the action is a single integer that corresponds to the ID of a MEC DC M .

B. DRL & DDPG Background
RL is a type of ML in which a software agent acts in an environment and receives a Reward that evaluates its Actions, learning how to maximize this Reward. DRL is a subset of RL methods that exploits multi-layer Neural Networks (NNs) instead of a state-action table. NNs are composed of interconnected memory units, termed neurons, and a set of neurons that perform a similar task is called a layer. Each neuron is followed by an Activation Function that decides whether it should be activated or not. Instead of employing labeled input-output data tuples like supervised learning algorithms, DRL algorithms focus on finding the balance between exploration of all possible actions and their outcomes, and exploitation of current actions, to maximize the cumulative Reward.
DDPG is a DRL algorithm that belongs to the family of Policy-Gradient algorithms. It is an off-policy, model-free algorithm, meaning that it learns directly from raw data observed in its environment with no a priori knowledge of the domain dynamics. This advantage makes it a versatile option for automation across a wide variety of environments compared to static planning algorithms.

C. Actor-Critic Architecture
There are two main types of RL algorithms: (i) Value-Based and (ii) Policy-Based. Value-Based algorithms map the input to real numbers whose value represents the long-term reward achieved, given a state and an action. Policy-Based algorithms, on the other hand, instead of calculating a value function of expected rewards, directly learn the policy function that maps states to actions. The DDPG algorithm is an amalgamation of both methods, aiming to tackle the drawbacks of each one simultaneously. It is composed of two entirely separate NNs that share the same input layer, as depicted in Fig. 2: (i) the Actor and (ii) the Critic.

1) Actor:
The Actor NN directly maps the States to the VNF placement decisions, creating a Policy between the input and the output. We use a One-vs-Rest scheme in the output of the Actor NN that selects the decision with the maximum probability of returning a higher Reward, pointing at the MEC ID that the VNF should be migrated to.
2) Critic: The Critic NN, on the other hand, is responsible for approximating the Q-value function of the Actor NN in order to assess the Actor's current policy, similar to the Deep Q-Network algorithm [19]. The Q-value is essentially an approximation of the maximum future Reward. The Critic NN is similar to the Actor NN, except for the final layer, which is a fully connected layer that maps states and actions to the Q-value of the Actor NN.
The Reward injects positive or negative adjustments into the weights of both the Actor and Critic NNs, steering them towards decisions that minimize the service latency.

D. TD-Error, Loss & Mini-batch
The NNs of both the Actor and the Critic generate a Temporal-Difference (TD) error signal at each time step. The TD target is computed by:

y_t = r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′})

where r_t expresses the immediate reward, t ∈ ℕ, and θ^μ and θ^Q are the parameters of the Actor and Critic NNs respectively, with θ^{μ′} and θ^{Q′} denoting their target copies. The weights θ^Q of the Critic NN are updated with the gradients obtained from the loss function:

L(θ^Q) = (1/N) Σ_{t=1}^{N} ( y_t − Q(s_t, a_t | θ^Q) )²

where N is the size of a mini-batch sampled from the Replay Buffer (RB), which is a list of all the actions and rewards gained from the environment so far. The RB allows the software agent to learn by sampling past experiences of the environment, and a mini-batch is a smaller list holding experiences drawn from the RB for processing. Random mini-batches from the RB are used to improve stability [20]. Q(s_t, a_t | θ^Q) is the Critic NN's Q-value estimate for the sample with index t.
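In NumPy terms, the TD target and the mini-batch loss can be sketched as follows; the discount factor γ and the batch values are illustrative, and the next-state Q-values are assumed to come from the target Critic NN.

```python
import numpy as np

def critic_targets(rewards, next_q_values, gamma=0.99):
    # TD target: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
    return rewards + gamma * next_q_values

def critic_loss(q_values, targets):
    # Mean-squared TD error over a mini-batch of size N
    return float(np.mean((targets - q_values) ** 2))

# Illustrative mini-batch of N = 4 transitions
r = np.array([1.0, -0.5, 0.0, 2.0])
q_next = np.array([0.8, 0.2, 0.5, 1.0])   # from the target Critic NN
y = critic_targets(r, q_next)
loss = critic_loss(np.array([1.5, -0.3, 0.4, 2.5]), y)
```

In training, the gradient of this loss with respect to θ^Q drives the Critic update, while the Actor is updated with the policy gradient through the Critic.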

E. Ornstein-Uhlenbeck Process
We also employ the Ornstein-Uhlenbeck process to generate random decisions that are temporally correlated. This stochastic procedure, termed Exploration, allows the algorithm to explore the network behavior under the current network traffic. According to DeepMind [19], forcing Exploration during the first stages of operation can improve convergence speed and boost the frequency of higher rewards in RL algorithms. The process is described by:

dx_t = θ(μ − x_t) dt + σ dW_t

where θ is the speed at which the variable reverts towards the mean, μ represents the equilibrium value, σ represents the volatility of the process, and W_t is a Wiener process. The Ornstein-Uhlenbeck process offers a way to prevent the algorithm from converging to local minima of the problem, which also reduces the training time required for convergence.
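A discretized version of the process is straightforward to implement; the default θ, μ, σ and time step below are common choices from the DDPG literature, not the paper's values.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, discretized as
    x <- x + theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""
    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2, x0=0.0):
        self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
        self.x = x0

    def sample(self):
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn())
        return self.x

noise = OUNoise()
perturbed_action = 0.3 + noise.sample()  # noise added to the Actor's raw output
```

With σ = 0 the process decays deterministically towards μ, which makes the mean-reversion behaviour easy to verify.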

F. Algorithm Flow & Procedures
As shown in Algorithm 1, the training procedure of our algorithm begins by initializing all main variables.

Algorithm 1 DDPG-assisted VNF placement
1: Initialize Actor and Critic NNs and the Replay Buffer
2: while Episodes remain do
3:   while Steps remain do
4:     Request State from the controller
5:     Feed State to the Actor and Critic NNs
6:     Select the target MEC/Cloud ID from the Actor NN output
7:     Instruct the orchestrator to migrate the VNF
8:     Request the new State and compute the Reward
9:     Store transition
10:    if memory >= memory capacity then
11:      Decrease exploration by exploration rate
12:      Learn the transition
13:    end if
14:  end while
15: end while

The first loop iterates through the Episodes and the enclosed loop through the Steps. Every Step begins by requesting a new State from the controller. Then, we use the State as input to the Actor and Critic NNs. Based on the current conditions described in the State, the Actor NN generates the ID of the MEC or Cloud DC to which the VNF should be migrated, while the Critic NN estimates the Reward that this output will return. Finally, the algorithm instructs the orchestrator about the migration decision. The migration procedure begins by copying and initiating the VNFs in their new hosts while shutting them down in the old ones. After the migration procedure is completed, the algorithm requests another State to calculate the impact of the action on the network. The impact shapes the Reward value and is calculated by comparing the State before and after the migration.
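This training flow can be sketched as a compact loop; `env` and `agent` are hypothetical interfaces standing in for the controller/orchestrator and the DDPG networks, and the exploration decay rate and batch size are assumptions.

```python
import random

def train(env, agent, episodes=5, steps=20, memory_capacity=32, batch_size=8):
    """Outer loop over Episodes, inner loop over Steps."""
    memory, exploration = [], 1.0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps):
            action = agent.act(state, exploration)   # Actor NN + exploration noise
            next_state, reward = env.step(action)    # migrate VNF, observe impact
            memory.append((state, action, reward, next_state))
            if len(memory) >= memory_capacity:
                exploration *= 0.995                 # decrease exploration
                agent.learn(random.sample(memory, batch_size))
            state = next_state
    return memory
```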
In addition, the algorithm has to comply with a high-level set of instructions and rules set by the operator. This is called the Operator Rules Set and describes with high-level language a target policy, basic instructions that the algorithm uses to evaluate its actions. This approach offers a fail-safe way to adjust the algorithm to gradually meet the requirements of the network operator and prevent potential bottlenecks, packet loss or performance drops [6].

V. EXPERIMENTAL EVALUATION
In this section, we conduct several simulation experiments to study the performance of the proposed algorithm in a realistic network topology.

A. Actor-Critic Configuration
The Actor NN has one hidden layer with 30 neurons and ReLU activation functions, as this design returned the highest reward during testing. The output layer uses a tanh activation function that outputs the final MEC ID where the VNF should be migrated. ReLU activation functions form a ramp function and offer a threshold that must be passed for a neuron to activate, whereas tanh forms a smooth S-shaped curve, offering a smoother value update.

The Critic NN is built with 30 neurons in its first hidden layer and 30 more in its second hidden layer. It also uses ReLU activation functions, and its final layer estimates the Q-value.
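The stated layer sizes can be expressed with TensorFlow's Keras API (TensorFlow being the framework used in Section V). The state dimension and the number of MEC DCs below are assumed sizes, and emitting one tanh score per MEC is one plausible reading of the One-vs-Rest output scheme.

```python
import tensorflow as tf

STATE_DIM, N_MEC = 25, 5  # assumed sizes for this sketch

def build_actor():
    """Actor: one hidden layer of 30 ReLU neurons, tanh output layer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(STATE_DIM,)),
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(N_MEC, activation="tanh"),  # argmax -> target MEC ID
    ])

def build_critic():
    """Critic: two hidden layers of 30 ReLU neurons, scalar Q-value output."""
    state_in = tf.keras.Input(shape=(STATE_DIM,))
    action_in = tf.keras.Input(shape=(N_MEC,))
    x = tf.keras.layers.Concatenate()([state_in, action_in])
    x = tf.keras.layers.Dense(30, activation="relu")(x)
    x = tf.keras.layers.Dense(30, activation="relu")(x)
    q = tf.keras.layers.Dense(1)(x)  # final fully connected layer -> Q-value
    return tf.keras.Model([state_in, action_in], q)
```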

B. Hyper-parameter Configuration
In ML terminology, hyper-parameters are the adjustable values that are defined before the learning process and are directly linked to the performance of the algorithm. The hyper-parameters selected for this simulation study are shown in Table I. In our case, manual hyper-parameter tuning through trial-and-error was the key to finding the optimal balance.

C. Simulation Setup
In our experimental topology, we employed Containernet [21], an advanced branch of Mininet [22] network emulator. It enables us to divide the physical resources into individual Containers with Docker and build an SDN layer on top. We utilized the VideoLAN media player for real-time video streaming between the Containers and Wireshark to inspect the network traffic. Also, a Network Level POX controller with in-house built migration functionality is used to interact with the underlying SDN OpenFlow switches. The DDPG algorithm is implemented with a Python-based framework called TensorFlow [23].

D. Topology
In our performance evaluation, we consider a network topology that consists of a Cloud DC and 5 MEC DCs. The Cloud DC is considered to possess abundant computational resources. It also acts as a host for both the proposed DDPG algorithm orchestrator and the SDN controller. All MEC DCs have a finite amount of resources, 8 CPU cores, and 8 GB of RAM each.
In the following, we assume that the number of VNFs V ranges in [25, 250], and each one of them serves one user. Each VNF broadcasts a 1280 × 720 pixel, highly compressed H.264 video stream over the Real-time Transport Protocol to its assigned user. The traffic distribution is designed such that 90% of the users generate less than 5% of the total MEC DC resources combined, whereas the remaining 10% produce more than 25%. SLA latency requirements were assigned randomly following a normal distribution, with values ranging between 5 and 30 milliseconds, to simulate the nature of different services.
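The SLA assignment above can be reproduced with a clipped normal draw; the mean and standard deviation are assumptions, since only the 5-30 ms range is stated.

```python
import numpy as np

def generate_slas(n_vnfs, mean=17.5, std=5.0, lo=5.0, hi=30.0, seed=0):
    """Per-VNF SLA latency requirements (ms), normally distributed and
    clipped to the [lo, hi] range."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.normal(mean, std, n_vnfs), lo, hi)

slas = generate_slas(250)
```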

E. Baseline & Cloud Approaches
As a Baseline, we consider an algorithm that rejects any VNF placement request to a MEC if the given node has reached 90% of its total utilization capacity in any given metric. The rejected VNFs are hosted at the cloud DC instead of the network edge nodes, with their SLA requirements ignored.
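A minimal sketch of this Baseline policy, assuming per-metric utilization tracking; the function and field names are illustrative.

```python
def baseline_place(mec_loads, capacity, threshold=0.9):
    """Return the ID of the first MEC below 90% utilization in every metric,
    or -1 to fall back to the cloud DC (where SLA requirements are ignored)."""
    for mec_id, load in enumerate(mec_loads):
        if all(load[m] / capacity[m] < threshold
               for m in ("cpu", "ram", "storage")):
            return mec_id
    return -1  # every MEC is saturated: host the VNF at the cloud

# Illustrative call: 8 CPU cores / 8 GB RAM per MEC, as in the topology above
cap = {"cpu": 8.0, "ram": 8.0, "storage": 100.0}
loads = [{"cpu": 7.5, "ram": 7.5, "storage": 50.0},
         {"cpu": 1.0, "ram": 1.0, "storage": 5.0}]
mec = baseline_place(loads, cap)
```

Here the first MEC is rejected (93.75% CPU utilization exceeds the 90% threshold), so the VNF lands on the second one.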
As a Cloud approach, we consider the use of one remote Cloud DC with a fixed delay value. We consider this static value as the total delay caused by the wide-area transport network that includes multiple physical switches and links. Every possible hop and queue introduces an additional delay to the propagation of the information from the edge of the network to the cloud DC.

F. Performance Evaluation
First, we carried out experiments to study the average round-trip latency from the users to their VNFs, for different numbers of users.
Figure 3. Comparison of the average latency from end-users to their VNF between the Cloud, Baseline, and DDPG Assisted systems.

Fig. 3 outlines the average round-trip service latency that users perceive. As expected, increasing the number of users in the network also increases the average service latency. This phenomenon is observed because a greater number of services have to compete for the same, insufficient MEC resources.
We can observe that our proposed algorithm greatly outperforms both the baseline and cloud options by offering considerably lower average round-trip latency, particularly at lower numbers of users. All options gradually increase until they reach a threshold that indicates the saturation of MEC resources.
As presented in Fig. 4, with optimal placement in the same topology, both the average SLA violations and the VNF rejections remain at lower levels than the baseline option at any traffic scale. This indicates that our proposed algorithm can learn how to prioritize the placement of VNFs in the MEC servers according to their latency requirements and orchestrate them according to their VNF profiles.

VI. CONCLUSION & FUTURE WORK
In this paper, we proved the viability of DDPG RL for automated spatial resource allocation by migrating VNFs between a cloud DC and several MEC DCs. Our work is flexible and dynamic, discovering the best trade-offs between SLA requirements, latency and network resources, which is beneficial for both operators and users. The presented results show clear benefits over alternative solutions.
As future work, we intend to build a proactive state prediction algorithm based on Long Short-Term Memory (LSTM) Recurrent Neural Networks, which will provide better insights into the usage trend of each entity in the network.