Continuous Multi-objective Zero-touch Network Slicing via Twin Delayed DDPG and OpenAI Gym

Artificial intelligence (AI)-driven zero-touch network slicing (NS) is a new paradigm enabling the automation of resource management and orchestration (MANO) in multi-tenant beyond 5G (B5G) networks. In this paper, we tackle the problem of cloud-RAN (C-RAN) joint slice admission control and resource allocation by first formulating it as a Markov decision process (MDP). We then invoke an advanced continuous deep reinforcement learning (DRL) method called twin delayed deep deterministic policy gradient (TD3) to solve it. To this end, we introduce a multi-objective approach that makes the central unit (CU) learn how to re-configure computing resources autonomously while minimizing latency, energy consumption and virtual network function (VNF) instantiation cost for each slice. Moreover, we build a complete 5G C-RAN network slicing environment using the OpenAI Gym toolkit which, thanks to its standardized interface, can be easily tested with different DRL schemes. Finally, we present extensive experimental results to showcase the gain of TD3 as well as the adopted multi-objective strategy in terms of achieved slice admission success rate, latency, energy saving and CPU utilization.


I. INTRODUCTION
Network slicing is a key feature in 5G networks. It enables fully or partly isolated logical networks (or tenants) to run on the same physical network, thereby offering a concrete resource multiplexing gain between slice instances. To this end, network softwarization and virtualization technologies such as software defined networking (SDN) and network functions virtualization (NFV) provide the necessary programmability and flexibility to operate NS by dynamically creating, scaling and terminating chained virtual network functions (VNFs). Multi-access Edge Computing (MEC) [1] is also an important component that, co-located with C-RAN, can bring high-performance computing resources to the edge, paving the way to accommodate low-latency slices. Consequently, zero-touch and fully automated operations and management have become essential to harness the potential gain of dynamic resource allocation in SDN/NFV-enabled NS. Besides ETSI's architecture standardization efforts [2], many algorithms have been presented in the literature to enable the automation of B5G networks, as detailed in the sequel.

A. Related Work
In [3], the authors have studied autonomous MANO of VNFs, where the central unit learns to re-configure resources, deploy new VNF instances or offload to a central cloud. They have proposed a DRL-based solution dubbed parameterized action twin (PAT) Deep Deterministic Policy Gradient (DDPG), which leverages the actor-critic method to learn to provision network resources to the VNFs in an online manner, given the current network state and the requirements of the deployed VNFs. The proposed solution outperforms all benchmark DRL schemes as well as heuristic greedy allocation in a variety of network scenarios. In [4], the authors have developed a data-driven approach based on DRL for dynamic resource scheduling in network slicing. Thanks to its model-free and dynamic online learning features, it solves the slicing resource management challenge in an asymmetric information scenario without using user-related data. Correspondingly, [5] has proposed a dynamic resource reservation and DRL-based autonomous virtual resource slicing framework for the next generation radio access network. At light load, autonomous radio resource management with the deep Q-network (DQN) algorithm has achieved 100% satisfaction and up to about 80% saturation, which is the best result among the considered benchmarks. Li et al. have studied the application of DRL in some typical resource management scenarios of NS. Their results have shown that, compared with demand prediction-based and some other intuitive solutions, DRL can implicitly capture a deeper relationship between demand and supply in resource-constrained scenarios [6]. In [7], the authors have proposed vrAIn, a DRL-based dynamic resource controller for virtualized RANs (vRAN) that dynamically learns the optimal allocation of computing and radio resources. The proposed solution meets the desired performance targets while minimizing CPU usage and gracefully adapts to shortages of computing resources.

B. Contributions
In this paper, we present the following contributions:
• Given that the DDPG algorithm is a limiting case of the stochastic policy gradient in actor-critic approaches used for solving continuous tasks, this work adopts and fine-tunes an alternative way of updating the actor (policy) in DDPG, based on the TD3 method [9], to speed up convergence and achieve a stable and robust learning process [8].
• We introduce a multi-objective approach in the NS environment to maximize cumulative rewards while minimizing network costs.
• We develop a complete 5G NS environment based on OpenAI Gym to ensure reproducible comparison of DRL algorithms.

II. SYSTEM MODEL
As depicted in Figure 1, we consider a C-RAN architecture according to the 3GPP CU-DU functional split. The underlying $N$ single-antenna small-cells ($n = 1, \ldots, N$) are connected to a virtual baseband unit pool (i.e., CUs) that runs as a set of VNFs. A total number of $J$ VNFs ($j = 1, \ldots, J$) can be deployed on top of the C-RAN datacenter endowed with $I$ active central processing units (CPUs), where each processor $i$ ($i = 1, \ldots, I$) has a computing capability of $P_i$ million operations per time slot (MOPTS) [10]. At each time step $t$, $M$ UEs ($m = 1, \ldots, M$) can connect to the $N$ small-cells according to the maximum received power criterion. Each UE $m$ requests a slice and starts its activity, wherein the packet arrival to the CU VNF follows a Poisson distribution with mean rate $\lambda_m$. The mean arrival data rate of all UEs to the CU VNFs is $\Omega/j$, where $j$ is the number of active VNFs.

Computation cost ($K^{(t)}_{Net}$): The baseband processing procedure at a VNF consists of coding, Fast Fourier Transform (FFT) and modulation. The corresponding computing resources follow the model in [11], in which $\theta$ is an experimental parameter, $\delta_m$ denotes the signal-to-interference-plus-noise ratio (SINR) of UE $m$, and $K_0$ includes the computing resources for the FFT function which, according to [12], imposes a constant base processing load on the system. Based on the experimental results of [13], we assume that, in each cell $n$, the computing resource requirements for coding, modulation and FFT are 50%, 10% and 40%, respectively. Moreover, we assume that VNFs have first in first out (FIFO) queues, where $\mu^*$ is the mean service rate for cloud processing and $r_m$ is the wireless transmission rate, which satisfies $r_m = B_m \log(1 + \delta_m)$, where $B_m$ is the wireless transmission bandwidth for UE $m$. In this respect, we further suppose that the service times of the cloud processing and wireless transmission queues follow exponential distributions with means $1/\mu^*$ and $1/r_m$, respectively [14]. Next, we explain each cost function of this paper.

Latency ($L^{(t)}_{Net}$): According to queuing theory [15], the mean processing delay at time step $t$ is $L$, where $d$ denotes the latency for creating, booting up and loading new VNFs and $j$ denotes the total number of active VNFs to be deployed. We suppose that $m$ is a predefined maximum network delay for UEs, which can be viewed as a quality of service (QoS) requirement [14].
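To make the queuing argument concrete, the following is a minimal sketch that assumes each queue behaves as a stable M/M/1 queue (Poisson arrivals, exponential service), for which the mean time in the system is 1 / (service rate − arrival rate). The function name and the numerical values are hypothetical and are not taken from the paper; how the individual delays and the VNF boot latency $d$ combine into the overall latency cost is not modelled here.

```python
def mm1_mean_delay(service_rate: float, arrival_rate: float) -> float:
    """Mean sojourn time of a stable M/M/1 queue: 1 / (service_rate - arrival_rate)."""
    if arrival_rate >= service_rate:
        # The queue grows without bound when arrivals match or exceed service capacity.
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical rates (packets per time slot), chosen only for illustration.
processing_delay = mm1_mean_delay(service_rate=10.0, arrival_rate=6.0)
transmission_delay = mm1_mean_delay(service_rate=8.0, arrival_rate=6.0)
print(processing_delay, transmission_delay)
```

The guard against an unstable queue matters in practice: an allocation that lets the arrival rate reach the service rate has no finite mean delay, so an implementation would typically treat it as a constraint violation.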

Energy ($E^{(t)}_{Net}$): The energy consumption is incurred by VNF instantiation, running processors and wireless transmission power, where $E^{(t)}_v = \psi_j$ refers to the energy consumption associated with the deployment of the $j$-th VNF instance and $\psi_j$ is a constant value. The energy consumed by processor $i$, in watts, is governed by $\sigma^*$, a parameter determined by the processor structure. The wireless transmission power for UE $m$ is determined by $W_m$, the precoding vector from all cells to UE $m$, and by $\rho$, the efficiency of the power amplifier [10] at the cells. Finally, these three terms make up the total energy consumption. We define the overall network cost as all costs incurred at each time step, where $\omega^*_1, \omega^*_2, \omega^*_3 \in \mathbb{R}$ are fixed weights which can be set based on the operator preferences.
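As an illustration of how fixed operator weights can combine several costs into one scalar, the sketch below assumes a plain weighted sum of the computation, latency and energy costs. The weighted-sum form, the function name and the numbers are assumptions made for this example; the exact expression used in the paper is not reproduced here.

```python
def network_cost(computation: float, latency: float, energy: float,
                 weights=(1.0, 1.0, 1.0)) -> float:
    """Scalarize three per-step costs with fixed weights (assumed weighted sum)."""
    w1, w2, w3 = weights
    return w1 * computation + w2 * latency + w3 * energy


# Hypothetical per-step costs and operator weights.
cost = network_cost(computation=2.0, latency=0.5, energy=4.0,
                    weights=(1.0, 2.0, 0.5))
print(cost)
```

Raising one weight relative to the others shifts the trade-off toward that objective, which is how operator preferences enter a multi-objective formulation of this kind.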

III. MDP AND BUILDING NS ENVIRONMENT BY GYM
In this section, we formulate the resource allocation optimization problem as a Markov Decision Process (MDP). We consider an autonomous CU whose goal is to improve the average return. To this end, we define the observation space and the action space available to the CU at each time step.
The MDP for a single agent is often defined by a 5-tuple $(S, A, P, \gamma, R)$ consisting of a set of states $S$ (state space), a set of actions $A$ (action space), the state transition probability $P$ for state $s$ and action $a$, a discount factor $\gamma$ and a reward function $R$. The key term of an MDP is decision: the way the agent decides which action to take in which state is called a policy, denoted by $\pi$. The return $G_t$ refers to the total discounted reward from time step $t$, and the main goal is to maximize this return. Here $\gamma \in [0, 1]$ is a real-valued discount factor that weights future rewards, i.e., it expresses how much we value immediate rewards relative to future rewards, ranging from short-sighted ($\gamma = 0$) to far-sighted ($\gamma = 1$).
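The role of the discount factor can be checked numerically. The sketch below assumes the standard definition of the return as the sum of rewards discounted by increasing powers of $\gamma$, computed over a finite episode; the function name and the reward sequence are hypothetical.

```python
def discounted_return(rewards, gamma: float) -> float:
    """Total discounted reward of a finite episode, accumulated backwards in time."""
    g = 0.0
    for r in reversed(rewards):
        # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
balanced = discounted_return(rewards, gamma=0.5)
far_sighted = discounted_return(rewards, gamma=1.0)    # all rewards count equally
print(short_sighted, balanced, far_sighted)
```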
The value function tells the agent how good it is to be in a given state, i.e., how much reward it can expect from that state, and is denoted $V(s_t)$. Another value function, the action-value function $Q$, depends not only on the state $s$ but also on the action $a$. The policy determines the relation between $Q$ and $V$. The optimal policy in Reinforcement Learning (RL) is the policy for which no other policy attains a greater value function, so for the optimal value function and the optimal action-value function we have $\forall s \in S$, $V^{\star}(s) = \max_{\pi} \{V^{\pi}(s)\}$ and $\forall s \in S, a \in A$, $Q^{\star}(s, a) = \max_{\pi} \{Q^{\pi}(s, a)\}$, respectively.
In OpenAI Gym, a continuous state and action space means that both the actions an agent can take and the inputs it receives are continuous values.
1) State Space: We use the Box space, a multidimensional continuous space with bounds. In a telecom environment, the state space is the set of possible network configurations.
We consider that the state at time step $t$ consists of:
• The number of new UEs which connect to the network and request services for each slice ($X^{(t)}$).
• Computing resources allocated to each VNF ($C^{(t)}$).
• Delay status with respect to the latency cost for each slice ($L^{(t)}$).
• Energy status with respect to the energy cost for each slice ($E^{(t)}$).
• Number of users being served in each slice ($m^{(t)}$).
• Number of VNF instantiations in each slice ($V^{(t)}$).
The network state, or input, is characterized by these quantities.
2) Action Space: We consider a vertical scaling action space.
Vertical scaling can be classified into scale up and scale down, which correspond to increasing and decreasing capacity, respectively. The CU selects a continuous-valued action with respect to traffic fluctuations and learns to decide whether to increase or decrease the computing resources allocated to each VNF. In OpenAI Gym, the environment takes an action as input and provides an observation, a reward, a done flag and an optional info object as output at each step. We denote the vertical scaling action for CPU resources by $CPU$; it changes the allocated resources from one time slot to the next. One may note that vertical scaling is limited by the amount of free computing resources available on the physical server hosting the virtual machine [16].
3) Reward: The main objective of this work is to minimize the total network cost, where the agent learns to increase the expected return, and the return is defined accordingly. We pursue an experimental approach because the total network cost ($NT$) may be too general and imprecise a metric to guide the agent toward good results. Consequently, tuning the hyperparameters, choosing the Deep Neural Network (DNN) architecture and designing the training steps are very tricky.
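The interaction loop described above can be sketched with a small stand-in environment. To stay dependency-free, the class below does not import Gym; it only mirrors the classic reset/step interface (observation, reward, done, info). The class name, its parameters, the bounded lists that play the role of Box spaces and the toy cost (mismatch between allocated and demanded CPU) are all hypothetical and are not the paper's environment or reward.

```python
import random


class ToySliceEnv:
    """Dependency-free stand-in for a Gym-style environment (reset/step only)."""

    def __init__(self, num_vnfs=2, max_cpu=10.0, max_step=1.0, horizon=50, seed=0):
        self.num_vnfs = num_vnfs
        self.max_cpu = max_cpu      # upper bound of the CPU allocation per VNF
        self.max_step = max_step    # largest scale up/down allowed per step
        self.horizon = horizon      # episode length in time steps
        self.rng = random.Random(seed)

    def _new_demand(self):
        return [self.rng.uniform(0.0, self.max_cpu) for _ in range(self.num_vnfs)]

    def _observe(self):
        # Observation: current allocation followed by current demand.
        return list(self.cpu) + list(self.demand)

    def reset(self):
        self.t = 0
        self.cpu = [self.max_cpu / 2.0] * self.num_vnfs
        self.demand = self._new_demand()
        return self._observe()

    def step(self, action):
        for j in range(self.num_vnfs):
            # Clip the scaling action, then keep the allocation within its bounds.
            delta = max(-self.max_step, min(self.max_step, action[j]))
            self.cpu[j] = max(0.0, min(self.max_cpu, self.cpu[j] + delta))
        # Toy cost: total mismatch between allocated and demanded CPU.
        cost = sum(abs(c - d) for c, d in zip(self.cpu, self.demand))
        reward = -cost  # minimizing the cost is equivalent to maximizing the reward
        self.demand = self._new_demand()
        self.t += 1
        done = self.t >= self.horizon
        return self._observe(), reward, done, {}


env = ToySliceEnv()
policy_rng = random.Random(1)
obs, done = env.reset(), False
total_reward = 0.0
while not done:
    # Random policy: one continuous scaling action per VNF.
    action = [policy_rng.uniform(-1.0, 1.0) for _ in range(env.num_vnfs)]
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)
```

Because the reward is the negative of a cost, the cumulative reward of such an episode is never positive, which is consistent with learning curves that take negative values.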

IV. TWIN DELAYED DDPG
The basic idea behind policy-based algorithms is to adjust the parameters $\phi$ of the policy in the direction of the performance gradient $\nabla_{\phi} J(\pi_{\phi})$. The fundamental result underlying these algorithms is the policy gradient theorem [17]. We can parameterize the policy like a value function, and the goal is to find the optimal policy $\pi_{\phi}$, where $\phi$ denotes the weights of the policy. The expected return can be approximated in many ways. We calculate the gradient of the expected return with respect to the parameters $\phi$, $\nabla_{\phi} J(\phi)$, and use gradient ascent, as opposed to gradient descent, for updating the parameters: $\phi_{t+1} = \phi_t + \alpha \nabla_{\phi} J(\pi_{\phi})|_{\phi_t}$. In the actor-critic method, two models work concurrently: the actor is a policy taking the state as input and delivering actions as output, while the critic takes states and actions concatenated together and returns the Q-value. The policy can be updated through the deterministic policy gradient [9]. Initially, we store random experience in the buffer $\beta$; in other words, we store transitions $(s_t, a_t, r_t, s_{t+1})$ to train the Deep Q-Network. We take a random batch $B$ and, for all transitions $(s_{t_B}, a_{t_B}, r_{t_B}, s_{t_B+1})$ of $\beta$, the predictions are $Q(s_{t_B}, a_{t_B})$ and the targets are the optimal immediate return, which is exactly the first part of the temporal difference (TD) learning error, $R(s_{t_B}, a_{t_B}) + \gamma \max_a Q(s_{t_B+1}, a)$. Over the whole batch $B$, we calculate the loss between the predictions and the targets.
To make the learning algorithm more stable, a separate target network is used to calculate the target instead of the Q-network itself. As shown in Figure 2, TD3 is based on the actor-critic model and leverages three tricks to improve the algorithm:
1) Clipped double Q-learning with a pair of critic networks: We use two DNNs as actor networks and denote them by $\phi$ (actor network) and $\phi'$ (actor target). In addition, we create two pairs of critic networks and denote them by $\theta_1, \theta_2$ for the parameterization of the value networks and $\theta'_1, \theta'_2$ for the critic targets. Indeed, two learning processes happen simultaneously, namely Q-learning and policy learning, and together they address approximation error, reduce the bias and find the highest Q-value. This was inspired by the Double Q-learning technique in [18]. For each transition of the batch, the actor target plays $a'$ based on $s'$, and we add Gaussian noise to this $a'$. The critic targets take the pair $(s', a')$ and return two Q-values, $Q'_{t1}$ and $Q'_{t2}$, as output. Then $\min(Q'_{t1}, Q'_{t2})$ is taken as the approximated value for the critic networks. The authors of [19] have proposed using the target network as one of the value estimates. We then calculate the final target of the two value networks, and the two critic networks return two Q-values, $Q_1(s, a)$ and $Q_2(s, a)$. Next, we calculate the loss of the two critic networks with the Mean Squared Error (MSE). To minimize the loss over iterations via back-propagation, we use an efficient optimizer called Adaptive Moment Estimation (Adam) [20] in our code. In the next step, we explain how we update the target networks.
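The clipped double Q-learning step can be illustrated with scalar values. The sketch below assumes the standard TD3 target, i.e., the reward plus the discounted minimum of the two critic-target estimates, and a critic loss equal to the sum of the two mean squared errors. The function names, the masking of terminal transitions with a done flag and all numbers are assumptions for this example; in a PyTorch implementation the same arithmetic would be applied to batched tensors.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target using the smaller of the two critic-target estimates."""
    not_done = 0.0 if done else 1.0  # assumed: no bootstrapping past a terminal state
    return reward + gamma * not_done * min(q1_next, q2_next)


def critic_loss(q1_values, q2_values, targets):
    """Sum of the two mean squared errors against a shared target."""
    n = len(targets)
    mse1 = sum((q - y) ** 2 for q, y in zip(q1_values, targets)) / n
    mse2 = sum((q - y) ** 2 for q, y in zip(q2_values, targets)) / n
    return mse1 + mse2


# Hypothetical values for a single transition.
target = clipped_double_q_target(reward=1.0, done=False,
                                 q1_next=5.0, q2_next=3.0, gamma=0.9)
loss = critic_loss(q1_values=[3.0], q2_values=[4.0], targets=[target])
print(target, loss)
```

Taking the minimum biases the target downward, which counteracts the overestimation that a single critic tends to accumulate.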
2) Delayed policy updates and target networks: The main idea is to update the policy network less frequently than the value network, since the value needs to be estimated with lower variance [21]. The target parameters are updated by Polyak averaging, where $\tau \leq 1$ is a hyperparameter that tunes the speed of updating.
3) Target policy smoothing and noise regularisation: When updating the critic, a learning target using a deterministic policy is highly susceptible to inaccuracies induced by function approximation error, increasing the variance of the target. This induced variance can be reduced through regularization [9], which also helps to explore all possible continuous parameters. We add Gaussian noise to the next action $a'$ to prevent overly large actions from being played and disturbing the state of the environment. The noise $\epsilon$ is sampled from a Gaussian distribution with zero mean and a given standard deviation, and is clipped to the range between $-c$ and $c$ to encourage exploration. To avoid errors caused by impossible action values, we clip the noisy action to the range of possible actions (min_action, max_action). The TD3-based NS method is summarized in Algorithm 1.
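The last two tricks can be sketched on plain Python values. The code below assumes the usual Polyak rule, where each target parameter moves a fraction $\tau$ toward its online counterpart, and the usual smoothing rule, where clipped zero-mean Gaussian noise is added to the target action before clipping to the valid action range. The function names, the delay of two critic updates per actor update and all numbers are hypothetical.

```python
import random


def polyak_update(target_params, online_params, tau):
    """Soft update: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def smoothed_target_action(action, sigma, noise_clip, min_action, max_action, rng):
    """Add clipped Gaussian noise, then clip to the range of possible actions."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-noise_clip, min(noise_clip, noise))
    return max(min_action, min(max_action, action + noise))


rng = random.Random(0)
target_params = [0.0, 10.0]   # hypothetical target-network weights
online_params = [1.0, 0.0]    # hypothetical online-network weights
policy_delay = 2              # assumed: one actor update every two critic updates
actor_updates = 0
for iteration in range(1, 11):
    # (critic update would happen here at every iteration)
    if iteration % policy_delay == 0:
        # Delayed step: update the actor, then softly update the targets.
        actor_updates += 1
        target_params = polyak_update(target_params, online_params, tau=0.1)

noisy_actions = [smoothed_target_action(0.95, sigma=0.2, noise_clip=0.5,
                                        min_action=-1.0, max_action=1.0, rng=rng)
                 for _ in range(100)]
print(actor_updates, target_params, max(noisy_actions))
```

A small $\tau$ makes the targets change slowly, which keeps the regression target of the critics stable between actor updates.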

V. NUMERICAL RESULTS
To evaluate the method described in Section IV, we generate six deep neural networks which work together based on the actor-critic model. The implementation is written in PyTorch and is followed by the experimental parameters used in the simulations. The values of these parameters depend highly on the capability, scenario and technology used. We measure the performance on our customized NS environment, interfaced through OpenAI Gym to ensure reproducible comparison. In this environment, the mobile network operator (MNO) collects the free and unused resources from the tenants, and slices that need more resources can receive new ones. This is done either periodically, to avoid overhead, or based on requests from the tenants. We consider a two-tenant scenario, i.e., two slices with different QoS requirements in terms of latency and CPU constraints. At each time step, user packets arrive into the network and the algorithm computes the computing resources to allocate to the relevant VNF. We compare the performance of the TD3 method against a fine-tuned version of the DDPG presented in [21] and [9] as well as Soft Actor-Critic (SAC) [22], keeping all algorithms consistent. Table I presents the number of DNNs for each method. As shown in Figure 3, the learning curve of TD3 outperforms all other algorithms in final performance on our large and complex state space. Although the problem formulation is general, we implement the constraints as penalties to guide the agent toward good results, which explains the negative values in the learning curves. In Figure 4, we present the comparison with respect to the network performance and the cost functions of Section II for each slice under similar traffic patterns, as well as network measures based on different weights, which are shown in Figures 4-(k) and 4-(l).
Admission rate: Figures 4-(a) and 4-(b) show that the re-tuned TD3 algorithm outperforms the other approaches with respect to resource availability and constraints. The algorithm learns over iterations by interacting with the environment under different network configurations, which explains the high fluctuation. The main goal of the algorithm is to find the best policy over time. The results of slice 1 and slice 2 are similar because the MNO trades off between slices while resource availability is high. This approach enables slices to request more resources, which results in provisioning services to more users and increasing the admission rate.
Latency: As shown in Figures 4-(c) and 4-(d), our solution leads to a lower average delay per user compared to DDPG and SAC.
Delay QoS violation: The comparison between the delay QoS metric of TD3 and the other schemes is presented in Figures 4-(e) and 4-(f).
Energy consumption: Figures 4-(g) and 4-(h) show the performance of our scheme and the other methods, where the agent learns to satisfy another objective and minimize power consumption by decreasing VNF instantiations and tuning the wireless transmission power.
CPU utilization: As depicted in Figures 4-(i) and 4-(j), the TD3 deployment leads to more efficient usage of the CPU compared to the other methods.

VI. CONCLUSION
In this paper, we have presented a new continuous multi-objective zero-touch NS solution. To this end, we have developed an OpenAI Gym NS environment and used an advanced DRL-based algorithm for the resource allocation problem to enable the CU to learn how to re-configure computing resources autonomously, aiming at minimizing the latency, energy consumption and VNF instantiation cost for each slice. The method leverages three techniques to make the learning algorithm more stable. In this respect, we have compared the network performance and costs between TD3 and other DRL benchmarks, and have shown that the proposed solution outperforms the other DRL methods.

Ω and the transmission latency in the wireless transmission queue is $L^{(t)}_{trans} = \frac{1}{r_m - \lambda^{(t)}_m}$, so we have:

Figure 1: Proposed distributed and hierarchical architecture.

Figure 3: Learning curves for the gym NS environment.

Table I: Number of networks for algorithms.