Coordinated Load Control of Renewable Powered Small Base Stations Through Layered Learning

The massive deployment of Small Base Stations (SBSs) represents one of the most promising solutions adopted by 5G cellular networks to meet the foreseen huge traffic demand. The usage of renewable energies for powering the SBSs has attracted particular attention as a means of reducing the energy footprint, thus mitigating the environmental impact of mobile networks and enabling cost savings for the operators. The complexity of the system and the variability of the harvesting process suggest the adoption of learning methods. Here, we investigate techniques based on the Layered Learning paradigm to control dense networks of SBSs powered solely by solar energy. In the first layer, SBSs locally select switch ON/OFF policies according to their energy income and traffic demand based on a Heuristically Accelerated Reinforcement Learning method. The second layer relies on an Artificial Neural Network that estimates the network load conditions to implement a centralized controller enforcing local agent decisions. Simulation results show that the control of the proposed framework mimics the behavior of the upper bound obtained offline with Dynamic Programming. Moreover, the proposed layered framework outperforms both a greedy and a distributed Reinforcement Learning solution in terms of throughput and energy efficiency under different traffic conditions.

required to manage the network for a Western European MNO in 2007 [1]. Moreover, mobile traffic is forecast to grow at an annual rate of 53%, which will further increase the energy consumed by telecommunication infrastructures and, thus, their environmental footprint. The Information and Communication Technology (ICT) ecosystem consumed 1500 TWh of energy in 2013 [2], corresponding to about 10% of the worldwide electricity generation. Considering the mobile traffic trend, this share is expected to reach up to 51% by 2030 [3]. As a result, energy sustainability is one of the key pillars in the design of next generation cellular networks for OPEX and carbon footprint reduction. Many standardization bodies have already started working on this aspect, e.g., ETSI [4] and 3GPP [5].
In the last decade, the research community has been paying close attention to the Energy Efficiency (EE) of radio communication networks. The dynamic ON/OFF switching of Base Stations (BSs) has been identified as one of the most promising EE techniques, as testified by the rich literature [6]. However, this solution has been received with reservations by MNOs, since it might generate coverage holes and possible failures of network equipment due to frequent ON/OFF switches. Consequently, the introduction of Energy Harvesting (EH) capabilities represents an interesting approach to further increase energy savings, simultaneously mitigating both the costs and the environmental impact of mobile communication systems. Many studies proposed solutions using BSs powered with renewable energies [7], not only for regions with limited access to the power grid, e.g., rural areas, but also for deployments in urban environments. Furthermore, 5G will bring densified Heterogeneous Networks (HetNets) comprising a high number of Small BSs (SBSs) within the coverage of Macro BSs (MBSs), especially for satisfying the high traffic demand in urban scenarios [8]. The reduced energy requirements of SBSs encourage the adoption of renewable energy for their power supply. The reference renewable system for a SBS is composed of a Photo-Voltaic (PV) panel and a battery that stores the surplus energy that cannot be directly used and makes it available for the periods when the PV source is not generating energy. The intermittent and erratic nature of renewable energies has to be considered to manage the high variations in the incoming energy. In fact, even during the summer and in good weather conditions, the harvested energy in the peak irradiation hour can vary by up to 85%, as shown in [9].
Similarly, considering that the solar radiation intensity and the daylight duration vary significantly across the months [9], seasons have a strong impact on the amount of harvested energy and have to be taken into account to keep the renewable system working during the whole year. Therefore, an energy management system is needed to adapt the energy demand to the energy supply availability and avoid service shortages. The SBSs shall be coordinated by a dynamic traffic load control framework enabling traffic offloading from the MBS based on the energy reserves and load variations, while guaranteeing service continuity. Dynamic load control is a Demand Response method commonly adopted in Smart Grids, which directly switches ON/OFF the loads of a micro-grid [10], i.e., the SBSs in our case, and provides effective load balancing for the renewable energy system.
The Self-Organized Network (SON) paradigm is expected to provide intelligence and autonomous adaptability to mobile network elements and may be used as the reference model for implementing dynamic traffic load control. In particular, softwarization and Artificial Intelligence (AI) have been identified as the main technologies for implementing the SON paradigm. Software Defined Networking (SDN) [11] and Network Function Virtualization (NFV) [12] provide a flexible infrastructure for collecting the necessary system information and reconfiguring the network elements [13]. AI provides, through Machine Learning (ML), the tools for automatic and intelligent system (re-)configuration [14]. ML can be used to extract models that reflect user and network behaviors, and to solve interactive decision making problems in real time, at short time scales and with minimal a priori information about the system.
In this paper, we investigate a control architecture and an online algorithm based on SDN/NFV and ML for controlling the traffic load of a densified HetNet powered with solar energy. Each SBS may offload the MBS (ON state) or be in a sleep mode (OFF) and save energy to be used in a different time period. The scenario is studied through a multi-agent system model based on Multi-agent Reinforcement Learning (MRL), where each SBS implements an agent. In particular, we concentrate on the problem of simultaneously finding a solution among all the agents that is good for the whole system. The main goal of our proposal is to incorporate stability of the learning process on the one hand, and adaptation to the dynamic behavior of the other agents on the other. Stability means convergence to stationary policies, whereas adaptation ensures that performance is maintained or improved. We decompose the general problem into two subtasks and use Layered Learning (LL) [15] to find the global solution. The first layer implements Heuristically Accelerated MRL (HAMRL) [16] and is in charge of the local online control at the SBS level as a function of the local traffic demand and the energy incomes. The second layer is in charge of network-wide control and is based on Artificial Neural Networks (ANNs) aimed at estimating the level of congestion of the overall network. The architecture implementing the two levels and enabling their interaction is based on SDN. To the best of our knowledge, this is the first work in the literature that proposes an online coordinated control system for a densified HetNet with EH capabilities based on a complete learning solution and with realistic environmental conditions for both traffic load and energy arrivals, as will also be discussed in Section II. In particular, this work is the first to demonstrate that LL is a valid framework to efficiently manage mobile networks.
Moreover, this is the first proposal that uses HAMRL as the tool enabling the interactions between the algorithms at the different layers to improve agents' coordination. As a result, the innovative contributions of this paper can be summarized as follows: 1) Design of an online optimization framework based on LL for the traffic control of densified HetNets powered with renewable energies, tailored to an SDN/NFV architecture. 2) Characterization of the temporal behavior of the learned control policies. 3) Characterization of the performance of the LL framework and of the specific algorithms at each layer. 4) Comparison with an upper bound evaluated offline and with a distributed MRL scheme. We note here that the LL solution was previously introduced in our work [17]. However, that analysis was carried out only in a limited scenario, without studying the temporal behavior of the learned policies and without the comparison against the offline solution.
The rest of this paper is organized as follows. In Section II, we present the state of the art in EE, with a focus on online solutions. Section III defines the problem statement and the architecture of the control framework. Sections IV and V describe the solutions adopted in the two layers in detail. Section VI presents the simulation scenario in which the proposed approach has been evaluated and discusses the corresponding numerical results. Finally, Section VII concludes by summarizing the main achievements of the work.

II. RELATED WORK
In general, renewable energy systems are dimensioned to guarantee that the BS can operate autonomously and perpetually. However, in this case, the resulting PV panel sizes are large and not suitable for the deployment of SBSs in urban areas [18]. In [19], the authors relaxed this design assumption by introducing the concept of outage, defined as the fraction of time during which a BS cannot satisfy the traffic demand due to an energy shortage. The authors evaluate the size of harvesters and batteries as a function of the outage probability for different geographical locations. The main outcome is that, considering the cost and dimension of the energy harvesting hardware and of the grid energy, the system may be feasible and cost-effective in locations with relatively high solar irradiation, when adopting an outage of 1% as the design constraint.
Several offline solutions have been proposed for studying the behavior of EH in mobile networks, providing useful guidelines for the design of the network and its harvesting system. In [20], the authors presented a work on energy saving in a k-tier HetNet with EH capabilities and sleep mode strategies. In this model, the authors used stochastic geometry to define a metric called availability ρ_k, which represents the fraction of time the k-th tier of BSs has enough energy reserve to stay switched ON. The method provides the optimal values of the battery and the PV power capacity for the deployment of self-powered HetNets. Nonetheless, it considers a Poisson energy arrival process and continuous transmissions, which may provide non-realistic approximations of these phenomena, as described in [9] and [21], respectively.
In [22], the authors introduced the concept of Zero grid Energy Networking (ZEN) for a mesh network of BSs powered only with Renewable Energy Sources (RES). A rural scenario is considered, where there is no connection to the electric grid and, therefore, the BSs need to be energy self-sufficient. The renewable energy system is designed by considering the typical daily traffic and harvested energy profiles for the cities of Aswan, Palermo and Torino, generated with a simulator called PVWatts [23]. In addition, the assumption of energy self-sustainability is relaxed and an algorithm enabling SBS sleep mode is introduced to reduce the dimension of the RES equipment. The proposed solution assumes that half the BSs in a given area can enter sleep mode when the traffic is below 50% of the peak. The energy consumption and the PV panel dimension can be reduced by 50%, but the effects on the service provided by the network are not evaluated.
The ski rental problem has been proposed in [24] to optimize the switch ON/OFF scheduling of EH SBSs. Using this approach, each SBS can decide to switch ON/OFF autonomously, without any prior information on future energy arrivals. The decision is taken every 10 s, which implies that UEs might have to perform frequent handovers and the SBSs may be subject to numerous ON/OFF switches. The resulting failure probability is high and cannot be ignored, since it represents the main obstacle for MNOs to implementing switch ON/OFF techniques in operational networks [25]. Moreover, the authors in [26] show that only a coarser time quantization (e.g., 1 hour) allows a correct dimensioning of the solar power system for cellular base stations.
Reinforcement Learning (RL) has been used in [27] to study a point-to-point wireless communication system in which the transmitter is equipped with an energy harvesting device and a rechargeable battery. The authors have shown that the RL approach is able to reach the optimal performance of the online optimization problem, solved using the policy iteration algorithm. This proposal lays the theoretical basis for the application of RL to energy harvesting networks; however, it is not applicable to multi-agent systems. RL has also been used in [28] to control a single EH SBS as a function of the local harvesting process and storage conditions. The authors design the SBS control with a Fuzzy Q-learning algorithm aimed at improving the battery lifetime and minimizing the electricity expenditure. However, the effect of simultaneous switch-OFFs by multiple SBSs on the overall network performance is not studied.
In [29], we considered a scenario where grid powered MBSs are supported by SBSs equipped with solar panels and batteries. The solution adopted is based on distributed Q-learning. Each SBS implements an agent that, by taking autonomous decisions, independently learns switch ON/OFF policies according to the energy income and the traffic demand. This solution, despite showing encouraging results, presents problems when scaling up the number of agents: their behaviors conflict, since they experience difficulties in coordinating.
In conclusion, the LL solution proposed in this paper represents the first attempt to coordinate the ON/OFF switching of SBSs in a densified HetNet with renewable energy capabilities.

III. PROBLEM STATEMENT AND CONTROL ARCHITECTURE

A. Problem Statement
We consider a two-tier network composed of clusters of one MBS and N non-overlapping SBSs. The MBS provides baseline connectivity and is powered by the electric grid. The SBSs are deployed for increasing the capacity in a hot-spot manner (e.g., shopping malls, city centers, etc.) [30]. SBSs are solely powered through the energy harvested by a solar panel and are equipped with rechargeable batteries.
We consider a stationary discrete-time dynamic system that evolves in slots according to the variation of the traffic demand and the energy arrivals in time. The time granularity ΔT is the time difference between two consecutive slots. The traffic load experienced in slot t by the SBSs, generated by the User Equipments (UEs) in their coverage, is defined as L^t = [L^t_1, L^t_2, ..., L^t_N], the harvested energy arrivals as E^t = [E^t_1, E^t_2, ..., E^t_N], and the energy stored by the SBSs in slot t as B^t = [B^t_1, B^t_2, ..., B^t_N]. In particular, the energy stored in the batteries B^{t+1} at the beginning of the next slot is evaluated according to the following formula:

B^{t+1} = min( B^t + E^t − P^t ΔT, B_cap ),   (1)

where P^t = [P^t_1, P^t_2, ..., P^t_N] is the power consumed by the SBSs in slot t and B_cap is the maximum battery capacity. The amount of harvested energy that exceeds the battery capacity is wasted, since it cannot be stored. At each slot t, the configuration of the SBSs in terms of ON/OFF states is decided based on their available energy budget and the traffic demand of the cluster. The total cost minimization problem is defined by a Markov Decision Process (MDP) with system dynamics

x^{t+1} = f( x^t, a^t, w^t ),   (2)

where x^t = [x^t_1, x^t_2, ..., x^t_N] is the state of the SBSs in slot t, a^t = [a^t_1, a^t_2, ..., a^t_N] is the control, and w^t = [w^t_1, w^t_2, ..., w^t_N] is the random disturbance. In particular, each element x^t_i, with i = 1, ..., N, characterizes the environment and is defined as x^t_i = (E^t_i, B^t_i, L^t_i). The system control a^t_i ∈ {ON, OFF} selects the operative mode of SBS i. When a SBS is switched OFF, the associated UEs have to connect to the MBS. However, in case the MBS is reaching its capacity limit, some of the UEs may not be served and may be dropped. The probability of being in this situation, defined as system outage, must be minimized. Moreover, each battery at the SBS has to be maintained in a proper State Of Charge (SOC) to avoid a rapid reduction of its lifetime [31]. In fact, an erroneous battery utilization can permanently damage the storage capabilities and entail huge replacement expenses.
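For concreteness, the per-SBS storage recursion can be sketched in a few lines of Python (an illustrative toy with scalar quantities and names of our choosing; the lower clipping at zero is our added guard, not part of the stated formula):

```python
def battery_update(B, E, P, dt, B_cap):
    """Energy stored at the next slot: current level plus harvested
    energy, minus the energy consumed (power P over a slot of length
    dt), clipped to the battery capacity (surplus energy is wasted)."""
    return min(max(B + E - P * dt, 0.0), B_cap)
```

For instance, a nearly full battery wastes part of a large harvest: `battery_update(3.5, 2.0, 1.0, 1.0, 4.0)` saturates at the capacity 4.0.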
Therefore, the optimization goal is to minimize the grid energy consumption of each cluster, while serving all the traffic requests (or minimizing the losses) and maintaining the battery level of each SBS above a security threshold. The general optimization problem may be formulated as follows:

P1:  min_{a^0, ..., a^{K−1}}  Σ_{t=0}^{K−1} g( x^t, a^t, w^t )
     s.t.  B^t_i ≥ B^OFF_th,  i = 1, ..., N,  t = 0, ..., K − 1,   (3)

where B^OFF_th is the battery threshold level and K is the time horizon, i.e., the number of times the control is applied; g(x^t, a^t, w^t) is the weighted cost function in slot t, which is defined as:

g( x^t, a^t, w^t ) = ω_1 P_m( x^t, a^t, w^t ) + ω_2 D( x^t, a^t, w^t ),   (4)

where P_m(x^t, a^t, w^t) and D(x^t, a^t, w^t) are, respectively, the normalized grid power consumption and the system drop rate in the cluster, given the operative modes of the SBSs in slot t. The grid power consumption in slot t is due to the power drained by the MBS. The system drop rate D(x^t, a^t, w^t) is the fraction of the total traffic demand that cannot be served by the system in slot t, i.e., the sum of the traffic of all the UEs that are served neither by their SBS, which is switched OFF, nor by the MBS, which is overloaded. The weights ω_1 and ω_2 provide flexibility in the cost function to emphasize one part of the cost over the other. They must always sum to 1, that is, ω_1 + ω_2 = 1. In this work, we consider ω_1 = ω_2 = 0.5 to give balanced importance to the two components.
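The weighted per-slot cost can be mirrored directly in code (a minimal sketch; the function and argument names are ours):

```python
def slot_cost(P_m, D, w1=0.5, w2=0.5):
    """Weighted slot cost g = w1*P_m + w2*D, where P_m is the normalized
    grid power consumption and D the system drop rate; the weights must
    sum to 1, as required by the cost definition."""
    assert abs(w1 + w2 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * P_m + w2 * D
```

With the balanced weights used in this work, a slot with 40% normalized grid consumption and a 20% drop rate yields a cost of 0.3.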
An offline solution to this problem is proposed in our works [32] and [33] using Dynamic Programming (DP), with a priori knowledge of the environmental variables. In detail, the problem is represented as a graph and stated as a shortest path search. A Label Correcting method is used to explore the graph and find the optimal ON/OFF policy. The results achieved may be considered as system performance limits and are compared with the LL solution in this paper.

B. Control Architecture
We study the problem introduced above through a multi-agent system based on MRL, where an agent is located at each SBS.
Multi-agent systems are an effective way to treat complex, large and unpredictable problems, since they offer modularity in distributing the implementation of the solution across different agents. The agents can either be endowed with an offline behavior or learn new behaviors online, such that the performance of the single agent or of the whole system is improved iteratively. The offline scheme is usually addressed via game theory. However, sometimes the complexity of the environment makes it difficult or impossible to design the agent behavior a priori. This is the case of our densified HetNet scenario, in which the number of variables and operations is too high and makes the problem hard to model. Therefore, the online MRL solution represents a viable approach for optimizing the agents' behaviors without any prior knowledge of the system dynamics. In this approach, at each slot t, the agents take an action according to the state observed from their local environment. The action causes the environment to transit to a new state and allows the agents to evaluate the benefit obtained in the transition, the so-called reward. By trying different actions, the agents can learn the optimal behavior of the system through the cumulative rewards. This phase is called exploration and is aimed at training the algorithm for the stable phase, called exploitation, in which the agents use the learned policies. The exploration is a highly sensitive task, since the agents explore to obtain information not only about their local environment, but also about the other agents, and to adapt to their behavior. In fact, the effect of any agent's action on the environment also depends on the actions taken by the other agents. Each agent is, therefore, faced with a moving-target learning problem: the best policy changes as the policies of the other agents change [34].
Our proposal for agent coordination is based on the LL paradigm [15]. The global solution is obtained in a hierarchical fashion by decomposing the multi-agent problem into different learning subtasks at each layer, each of which, in turn, aims at facilitating the learning process of the next higher-level subtask. A heuristic function, inspired by [16], is used for interfacing the algorithms at the different layers and guiding the agents' exploration of the state-action space in a more efficient way. In particular, the proposed LL solution is based on two layers in charge of local and network-wide EE control, respectively. In this way, the coordination of the agents at the local layer is assisted by a centralized entity at a higher layer that has a global view of the effect of the agents' actions on the overall environment. The goal is to help the agents in their coordination with a heuristic function that guides the exploration from a centralized and global perspective.
The first layer is a set of local agents in the SBSs, each in charge of learning switching ON/OFF policies according to the harvested energy arrivals, the available energy budget, the user traffic demand and the energy consumption of the SBS. To address this objective, the agents implement an algorithm based on Heuristically Accelerated MRL (HAMRL) [16]. HAMRL has already been successfully applied in the wireless communication domain to the Inter-Cell Interference Coordination (ICIC) problem in [35]. In that work, HAMRL was used to implement a decentralized ICIC controller aimed at reducing the interference in the LTE downlink channel of a network of MBSs.
Layer 2 is a central manager, which assists the learning process from a network-wide perspective. It is in charge of collecting the local agents' state information and assisting Layer 1 through the heuristic function of HAMRL. In particular, the second layer is composed of:
• A Multilayer Feedforward Neural Network (MFNN) to forecast the MBS load based on the environmental variables of each SBS.
• A SBS Centralized Controller that decides whether to enforce the local policies of a specific set of SBSs as a function of the level of congestion of the MBS.
The MFNN is used to estimate the load of the MBS with only partial knowledge of the amount of resources needed to serve the users, which saves signalling overhead. In fact, the traffic load at the MBS depends on the position of the users within the cell, which differs depending on whether they are served by a SBS or the MBS. As a result, the agents at each SBS in Layer 1 can keep a local view of the system and avoid large state spaces, maintaining the algorithm complexity at reasonable levels. This also prevents an excessively long exploration phase, which, in turn, reduces the possibility of generating conflicts among the agents. The algorithms implemented in the two layers are detailed in Sections IV and V.
The two layers interact thanks to a SDN framework that provides the infrastructure for collecting the needed parameters and distributing the control policies. Fig. 1 shows the architecture proposed in this paper. An example of such SDN solution is the Energy Monitoring & Management Application (EMMA) defined in the 5G-Crosshaul project [36]. EMMA is an infrastructure-related application based on the SDN paradigm aimed at monitoring the status of the RAN, fronthaul and backhaul elements and triggering reactions to minimize the energy footprint.

C. Energy Consumption and Harvesting Models
The BS power consumption model adopted is P = P_0 + βρ, where ρ ∈ [0, 1] is the BS traffic load, normalized with respect to its maximum capacity, P_0 is the baseline power consumption, and β is the slope of the load-dependent consumption. This model is supported by real measurements [37] and closely matches the real power profile of LTE BSs.
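The linear load-proportional model translates directly into code (a sketch; the default parameter values below are placeholders, not the measured figures of [37]):

```python
def bs_power(rho, P0=400.0, beta=300.0):
    """Load-proportional BS power model P = P0 + beta*rho, with rho the
    traffic load normalized to the maximum cell capacity.
    Default P0/beta values are illustrative placeholders (in watts)."""
    assert 0.0 <= rho <= 1.0, "load must be normalized to [0, 1]"
    return P0 + beta * rho
```

At zero load only the baseline P0 is drained; at full load the consumption grows by the slope β.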
The energy harvesting traces used for simulation purposes are generated with a Markov model that provides accurate statistics on a per-month basis by processing hourly solar energy arrival data over 20 years [9]. In detail, the 24 hours are divided into a number N_s ≥ 2 of time slots of constant duration, equal to T_i hours, i = 1, ..., N_s. Each slot is a state of the Markov model, and the pdf of the energy arrivals is evaluated through kernel smoothing on a per-month basis, considering the empirical data measured for all days in the dataset for the month under consideration.
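As a rough illustration of how such slot-based traces can be drawn, the toy sketch below replaces the kernel-smoothed monthly pdfs of [9] with placeholder truncated Gaussians, one per daily state; all numbers and names are hypothetical:

```python
import random

rng = random.Random(0)

def sample_arrival(slot_means, slot_idx):
    """Draw a non-negative energy arrival for daily time slot slot_idx.
    Each Markov state has its own distribution; here a truncated
    Gaussian stands in for the empirical kernel-smoothed pdf."""
    mean = slot_means[slot_idx]
    if mean == 0.0:                       # night slots: no irradiation
        return 0.0
    return max(0.0, rng.gauss(mean, 0.25 * mean))

# four slots per day: night, morning, midday peak, evening (toy values)
slot_means = [0.0, 2.0, 5.0, 1.0]
trace = [sample_arrival(slot_means, t % 4) for t in range(24)]
```

The resulting trace reproduces the qualitative day/night pattern while remaining random within each slot.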

D. Traffic Model
The UE resource allocation scheme uses the methodology defined in [38]. This includes a detailed wireless channel model and the dynamic selection of the Modulation and Coding Scheme (MCS) for each UE as a function of its Signal to Interference plus Noise Ratio (SINR), which is given by:

SINR = P_{t,0} |h_0|^2 / ( Σ_{i=1}^{N_I} P_{t,i} |h_i|^2 + σ^2_0 ),   (5)

where P_{t,0} and h_0 are the transmission power and the channel gain of the useful transmission, respectively, N_I is the number of interferers, |h_i|^2 and P_{t,i} represent the channel gain and the transmission power of the i-th interferer, and σ^2_0 is the power of the thermal noise. The UEs are classified as in [37] into heavy and ordinary users according to the amount of traffic they request. The traffic demand of each UE in a cycle has been dimensioned to mimic the realistic traffic profiles presented in [21], which are derived by combining time, location and frequency information of thousands of cellular towers. The analysis in [21] demonstrates that urban mobile traffic usage can be described by only five basic time-domain patterns corresponding to functional regions, i.e., residential, office, transportation, entertainment and comprehensive. We will consider the first two regions (residential and office), which represent the most frequent patterns in the studied area.
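The SINR expression above can be computed with a one-line helper (a sketch with names of our choosing; interferers are passed as (power, gain) pairs):

```python
def sinr(p_tx, h, interferers, noise_power):
    """SINR of the useful link with transmit power p_tx and channel gain
    h, over the aggregate interference of (power, gain) pairs plus the
    thermal noise power sigma^2."""
    interference = sum(p * abs(g) ** 2 for p, g in interferers)
    return p_tx * abs(h) ** 2 / (interference + noise_power)
```

With a single equal-power interferer and unit noise, the useful signal is halved by the interference term, as expected from the formula.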

A. Distributed Q-Learning and HAMRL Overview
The first layer is composed of a set of distributed agents implementing HAMRL with the goal of dynamically switching the SBSs ON and OFF according to the available harvested energy budget, the user traffic demand and the energy consumption of the SBS. The control decisions are made by multiple intelligent and uncoordinated agents, which can only partially observe the overall scenario. Therefore, the local environment may differ from agent to agent, since their observations come from spatially distributed sources of information. To this end, the HAMRL extension of the distributed Q-learning [39] technique has been considered in this paper, and is briefly introduced next.
The decision making problem in distributed Q-learning is based on the MDP defined in Section III-A. In detail, each SBS implements an agent i in charge of maintaining a local policy and a local Q-function Q(x^t_i, a^t_i) representing the level of convenience of taking action a^t_i in state x^t_i, with t being the decision epoch (slot). As a result of the execution of this action, the environment returns an agent dependent reward r^t_i, which allows the local update of the Q-value Q(x^t_i, a^t_i). The Q-value is computed according to the rule:

Q(x^t_i, a^t_i) ← (1 − α) Q(x^t_i, a^t_i) + α [ r^t_i + γ max_a Q(x^{t+1}_i, a) ],   (6)

where α is the learning rate, γ is the discount factor, x^{t+1}_i is the next state for agent i and a spans the actions available in that state. For more details on RL and Q-learning the reader is referred to, e.g., [39]. Consequently, thanks to Eq. (6), the control policy can be learned by exploring the environment. The most common exploration procedure in Q-learning is the ε-greedy [39], which consists in having a learning agent policy Π(x^t_i) that randomly chooses a sub-optimal action with probability ε, and the one with the highest Q-value with probability 1 − ε, i.e., Π(x^t_i) = argmax_a Q(x^t_i, a). The choice of the maximum Q-value makes Q-learning an off-policy algorithm, since the policy used to generate the behavior is unrelated to the policy being evaluated: the greedy choice is used for estimating the long-term reward, while the Q-values are updated according to Eq. (6).
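The update of Eq. (6) is a one-line operation on a tabular Q-function; a minimal sketch (Q stored as a dict of per-state action dicts, names ours):

```python
def q_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, Eq. (6):
    Q(x,a) <- (1-alpha)*Q(x,a) + alpha*(r + gamma * max_a' Q(x',a'))."""
    best_next = max(Q[x_next].values())
    Q[x][a] = (1 - alpha) * Q[x][a] + alpha * (r + gamma * best_next)
    return Q[x][a]

# toy example: two states, two actions (ON/OFF)
Q = {"s0": {"on": 0.0, "off": 0.0}, "s1": {"on": 1.0, "off": 0.0}}
q_update(Q, "s0", "on", 1.0, "s1")
```

After one step with r = 1, α = 0.1 and γ = 0.9, Q("s0","on") moves from 0 toward the bootstrapped target 1 + 0.9·1 = 1.9, landing at 0.19.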
Alternatively, HAMRL guides the exploration by introducing a heuristic function H(x^t_i, a^t_i) in the learning agent policy Π(x^t_i) to influence its action choices. In particular, H(x^t_i, a^t_i) contains additional knowledge not included in the state variables (provided by Layer 2 in our case). The resulting HAMRL policy selection formula is:

Π(x^t_i) = argmax_a [ Q(x^t_i, a) + H(x^t_i, a) ]  with probability 1 − ε, and a random action with probability ε.   (7)

H(x^t_i, a^t_i) has to assume values comparable with those in the agent Q-tables to properly influence the action to be taken and the policy Π(x^t_i). If H(x^t_i, a^t_i) = 0, the algorithm behaves like regular Q-learning for the i-th SBS.
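The heuristically accelerated selection rule can be sketched as follows (a toy implementation; Q and H are per-state dicts of action values, names ours):

```python
import random

def hamrl_action(q_row, h_row, eps=0.1, rng=random.Random(42)):
    """Heuristically accelerated epsilon-greedy selection: exploit
    argmax_a [Q(x,a) + H(x,a)] with probability 1 - eps, explore a
    random action otherwise. With H = 0 this collapses to the plain
    epsilon-greedy rule of standard Q-learning."""
    actions = list(q_row)
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_row[a] + h_row.get(a, 0.0))
```

A sufficiently negative heuristic on the locally preferred action flips the greedy choice, which is exactly how Layer 2 steers the agents.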

B. HAMRL for Energy Harvesting SBSs
We define the local state of agent i in slot t as x^t_i = (E^t_i, B^t_i, L^t_i). The three variables characterizing the environment represent the instantaneous harvested energy, the battery level and the SBS load, respectively. Since these parameters assume continuous values, they have been normalized and quantized to obtain a reasonable number of states to be explored; in detail, E^t_i, B^t_i and L^t_i are quantized into 2, 5 and 3 levels, respectively. The possible actions are switching the SBS ON or OFF.
The reward of an agent i in slot t is defined as follows:

r^t_i = 0,          if B^t_i < B^OFF_th and D^t < D_th,
r^t_i = κ T^t_i,    if a^t_i = ON,
r^t_i = 1/B^t_i,    if a^t_i = OFF,   (8)

where T^t_i is the normalized throughput of SBS i in slot t and D^t is the instantaneous system drop rate, defined as the ratio between the total amount of traffic dropped and the traffic demand in the entire cluster. D_th is a design parameter that represents the maximum tolerable system drop rate. Finally, B^OFF_th is the threshold on the battery level needed to maintain it in the proper SOC.
The rationale behind Eq. (8) is to mimic the optimization problem in Eq. (3). The condition in the first line implies a zero reward when the battery level falls below B^OFF_th (B^t_i < B^OFF_th) while the system drop rate is still below D_th (D^t < D_th), and aims at penalizing the action that has been taken, since it has led to a critical condition. This incentivizes the SBS to turn itself OFF to save energy, as this implies a higher reward; the SBS can then be switched ON later to offload the MBS. Conversely, when D^t > D_th, the system performance is deemed insufficient and the SBS is encouraged to stay ON and support the MBS in serving the traffic requests. In the second and third lines of Eq. (8), the reward is proportional to the throughput when the SBS is ON and, instead, proportional to the inverse of the energy buffer level when the SBS is OFF. Note that the SBS, after a training phase, will choose to remain ON (and offload the MBS) when the reward in the second line is higher, i.e., when κ T^t_i > 1/B^t_i. Note also that 1/B^t_i may dominate over κ T^t_i in case battery level and throughput are both low. In this case, the SBS switches OFF to save energy. The constant κ is used to balance the impact of the two terms in the reward function.
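The reward logic of Eq. (8) can be sketched as follows (threshold values and names are illustrative assumptions; the critical case is read here as low battery combined with an acceptable cluster drop rate, per the rationale above):

```python
def agent_reward(is_on, T, B, D, B_th=0.2, D_th=0.05, kappa=1.0):
    """Per-agent reward mirroring Eq. (8): zero in the critical case
    (battery below threshold while the cluster drop rate is still
    acceptable); otherwise proportional to the normalized throughput T
    when ON, and to the inverse battery level 1/B when OFF."""
    if B < B_th and D < D_th:
        return 0.0
    return kappa * T if is_on else 1.0 / B
```

With a healthy battery, staying ON pays off in proportion to the carried throughput; with a scarce battery, switching OFF earns the larger inverse-battery reward.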

V. LAYER 2: CENTRAL MANAGER
The task of this layer is to guide the agents of Layer 1 toward more energy efficient switch ON/OFF policies. To achieve this, we advocate using a network-wide view to reduce the system traffic drop rate and, concurrently, to manage the harvested energy more efficiently. In particular, the second layer is in charge of deciding which local agent(s) to influence and returns the most appropriate set of heuristic values H^t = [H(x^t_1, a^t_1), H(x^t_2, a^t_2), ..., H(x^t_N, a^t_N)] according to network-wide parameters that are not present in the HAMRL optimization. Layer 2 comprises an MBS load estimator based on a MFNN, which provides the input to a SBS Centralized Controller, in charge of evaluating the system load conditions of the whole cluster and selecting the SBSs to be influenced.

A. MBS Load Estimator
An MFNN estimates the normalized MBS load L_MBS^t in slot t as a function of the load of each SBS L_i^t and of their ON/OFF policy Π^t = [Π(x_1^t, a_1^t), Π(x_2^t, a_2^t), . . . , Π(x_N^t, a_N^t)]. We denote the estimated load of the MBS in slot t by L̂_MBS^t. A supervised approach has been adopted, i.e., a training set of input-output pairs is used to train the MFNN according to the backpropagation algorithm [40].
The basic element of an MFNN is the neuron (also called perceptron), which applies a fixed non-linear function to a linear combination of its inputs. In detail, for an input vector x_i, i = 1, . . . , N, it takes the form y = p(Σ_{i=1}^N ϑ_i x_i), where ϑ_i are the weights associated with each input and p(·) is a non-linear activation function, typically the sigmoid function. In order to approximate a target function, it is necessary to determine the values of the weights corresponding to that function (a process known as the training phase). Further details will be provided in Section VI-B.
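A minimal numeric sketch of the neuron just described, assuming the sigmoid activation:

```python
import math

def perceptron(x, weights, bias=0.0):
    """Single neuron: sigmoid of a linear combination of the inputs,
    y = p(sum_i theta_i * x_i), with p(.) the sigmoid activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid p(z) in (0, 1)
```

An MFNN stacks layers of such neurons; training adjusts the weights ϑ_i of every neuron so that the network output approximates the target function (here, the normalized MBS load).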

B. SBS Centralized Controller
We identify two different critical cases based on the load of the MBS: 1) the system is under-dimensioned when L̂_MBS^t is above the threshold L_MBS^thrHigh; 2) the system is over-dimensioned when L̂_MBS^t is below the threshold L_MBS^thrLow. The first case occurs when many SBSs are OFF simultaneously. In this case, the switching OFF of some SBSs can be delayed, provided their battery levels allow such operation, to increase the system throughput. More in detail, the SBS Centralized Controller defines the set of candidate SBSs that can be switched ON (SBS_ON^t) among those that are in OFF state and have enough battery reserves. Alternatively, in the second case, the traffic demand can be served by a lower number of SBSs and managed by the MBS with negligible impact on the grid energy consumption. In detail, the set of candidate SBSs to be switched OFF (SBS_OFF^t) is defined among those that are in ON state and have scarce energy reserves (i.e., B_i^t ≤ B_th^LOW). The number of SBSs to be influenced and their relevant heuristic values H^t are derived based on Algorithm 1. For each SBS i that has been activated or deactivated in this process, the correspondent heuristic value H(x_i^t, a_i^t) is set to the lowest value that can influence the choice of action, in order to minimize the distortion in the Q-value function due to the use of heuristics [16]. Therefore, for influencing the choice of action of SBS i in state x_i^t, H(x_i^t, a_i^t) should be negative and larger in magnitude than the maximum Q-value in x_i^t (i.e., Q_{i,MAX}^t).
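The two critical cases and the construction of the candidate sets can be sketched as follows. The threshold values are illustrative assumptions, and the subsequent choice of how many candidates to actually influence is delegated to Algorithm 1.

```python
def central_control(L_mbs_est, sbs_states, batteries,
                    L_thr_high=0.85, L_thr_low=0.3, B_th_low=0.4):
    """Sketch of the SBS Centralized Controller.

    L_mbs_est  -- MFNN estimate of the normalized MBS load
    sbs_states -- dict {sbs_id: 'ON' or 'OFF'}
    batteries  -- dict {sbs_id: normalized battery level}
    Returns the candidate sets (SBS_ON, SBS_OFF)."""
    on_candidates, off_candidates = [], []
    if L_mbs_est > L_thr_high:
        # Under-dimensioned: wake up OFF SBSs with enough battery.
        on_candidates = [i for i, s in sbs_states.items()
                         if s == 'OFF' and batteries[i] > B_th_low]
    elif L_mbs_est < L_thr_low:
        # Over-dimensioned: switch OFF ON SBSs with scarce reserves.
        off_candidates = [i for i, s in sbs_states.items()
                          if s == 'ON' and batteries[i] <= B_th_low]
    return on_candidates, off_candidates
```

For each SBS finally selected from these candidate sets, Layer 2 then assigns the heuristic value H(x_i^t, a_i^t) that biases the local action choice as described above.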

A. Simulation Scenario
The scenario considered in this analysis is composed of a single cluster with 1 MBS placed in the middle of a 1 × 1 km² area and a varying number of SBSs randomly placed and non-overlapping. We consider medium scale "metro cells" as SBSs, featuring a maximum transmission power of 38 dBm, which corresponds approximately to 50 meters of coverage range. The values of β and P_0 of the energy model presented in Section III-C for the MBS (SBS) are 600 (39), and each SBS is equipped with an energy storage of 1.5 kWh. This dimensioning of the RES equipment has been proved to be optimal for the winter season through simulations using the algorithm in [32]. The solar energy arrivals are generated according to Section III-C for the city of Los Angeles considering N_s = 12 states. The traffic demand is modeled as in Section III-D. In detail, the office and residential traffic profiles have been considered, respectively termed "Off" and "Res" in the figures. Both present high activity during the day, but they differ in the profile: in the office case the traffic is concentrated during the daylight hours (e.g., from 10 am to 6 pm), while the residential one has a single peak during the early night hours (e.g., from 6 pm to midnight). User traffic has been classified in two classes, namely heavy and ordinary, according to [37]. The main simulation parameters are given in Table I.

B. MFNN Training Analysis
We analyze here the behavior of the MFNN in Layer 2 during the training phase. Based on our simulation analysis, the best number of neurons per layer is I_1 = 3N/2, I_2 = 2N/3 and I_3 = 2N/3. Fig. 2 presents the overall mean squared error (mse) of this configuration for MFNNs with two and three hidden layers ("2L" and "3L" in what follows, respectively) as a function of the day, which includes 24 system evolution slots. The tests have been carried out for a cluster of 10 SBSs, which represents a complex scenario, as the dimension of the domain R^N of the approximated function grows with the number of SBSs. The 2L starts with a lower mse and performs better already at 50 days, reaching mse values below 0.2. Alternatively, the 3L presents a higher mse until 100 days; after that, its mse starts decreasing faster and reaches lower values asymptotically.
As an additional illustrative result, we evaluate the statistical measure called sensitivity, which is defined as the proportion of positive cases that are correctly identified as such, in detail: sensitivity = (true positives) / (true positives + false negatives),
where we define the false negatives as the cases when the MFNN does not estimate that the system is under-dimensioned (i.e., L̂_MBS^t ≤ L_MBS^thrHigh) but it is really in outage. Fig. 3a provides the sensitivity as a function of the day. From Fig. 3a, we can observe that the 2L takes approximately 50 days to reach a stable behavior, whereas the 3L takes 10 times longer, exceeding 500 days. Besides, the specificity measures the proportion of negative cases that are correctly identified as such, which corresponds to: specificity = (true negatives) / (true negatives + false positives),
where false positives have been defined as the cases when the MFNN estimates that the system is under-dimensioned (i.e., L̂_MBS^t > L_MBS^thrHigh) but it is not in outage. Fig. 3b depicts the specificity as a function of the day. In this case, the MFNNs reach a stable behavior at 1500 days. However, the 2L presents less variance in the specificity. We can also note that the asymptotic value of the specificity is lower than that of the sensitivity. This is due to the fact that we have adopted a guard margin to prevent the MBS from being overloaded (i.e., L_MBS^thrHigh = 0.85). Therefore, some false positives are MFNN estimations that fall between L_MBS^thrHigh and 1, which do not represent a system outage.
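For reference, the two measures above can be computed from per-slot estimation outcomes as follows (a minimal sketch; "positive" means the system is estimated to be under-dimensioned):

```python
def sensitivity_specificity(predicted, actual):
    """predicted/actual: lists of booleans, True = under-dimensioned.
    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    return tp / (tp + fn), tn / (tn + fp)
```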
Based on the above analysis, an MFNN with two hidden layers has been used in the analysis in the following subsections due to its better sensitivity and specificity and its faster training phase.

C. Layered Learning Training Analysis
The training phase of the LL framework has been evaluated in comparison with a distributed Q-learning solution based on our proposal in [29], "QL" in the following. QL is a standard MRL solution defined with the same environmental model described in Section IV-B, which uses the standard Q-learning policy selection formula (i.e., without the heuristic function). The training has been analyzed considering the stability of the system when starting with all the Q-values initialized to zero, to avoid conditions of battery degradation, which can occur when the battery level drops below the security SOC threshold B_th^OFF. An example of the behavior of the LL and QL algorithms is shown in Fig. 4 and Fig. 5, where the hourly battery level of a SBS is plotted for a scenario with 3 SBSs and different traffic profiles. The simulations start with the month of January and run for 400 days, spanning across the corresponding months. In both cases, the system starts with a short-sighted approach, since it uses the energy only according to the instantaneous availability, and drops frequently below the threshold B_th^OFF. In fact, during this period, the agent is at the beginning of the exploration phase and has to gather information from the environment to update its Q-values accordingly. The training phase for the office traffic profile turns out to be the shortest, taking about 40 days for both 20% and 50% of heavy users. With the residential traffic profile, the LL takes 50 and 80 days to reach a stable behavior for the case of 20% and 50% of heavy users, respectively. After these points, the battery level drops less often below B_th^OFF and the density of points becomes more prominent above the battery threshold. Similarly to what was observed for the duration of the training phase, the number of points falling below the threshold is smallest for the office traffic profile.
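For clarity, a minimal sketch of the tabular Q-value update performed at each slot by both QL and the Layer 1 agents of LL (the learning rate and discount factor shown are illustrative assumptions):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_b Q(s',b)).
    Q is a dict keyed by (state, action); missing entries are 0
    (i.e., the all-zero initialization discussed above)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) \
                + alpha * (r + gamma * best_next)
    return Q
```

Starting from all-zero Q-values, many such updates are needed before the ON/OFF policy stabilizes, which is why the early exploration phase shows frequent battery drops below B_th^OFF.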
This phenomenon is due to the fact that the traffic in the office scenario is concentrated during the daylight hours, as is the energy in the harvesting process, which implies that SBSs have more energy when it is needed. Alternatively, the residential traffic profile has its peak during the early night hours (e.g., from 8 pm to midnight), which is when the scavenged solar energy starts being scarce. In such a case, the improvement of the LL approach with respect to QL is more evident, as depicted in Fig. 5a and Fig. 5b. In fact, in Fig. 5a the LL is able to prevent the battery level from falling outside the ideal SOC window (i.e., below B_th^OFF), while QL presents many points below the battery security threshold starting from 300 days, which is approximately the beginning of the winter season. The effect of the traffic demand can be appreciated in both Fig. 4 and Fig. 5. In the case of the office traffic profile, the minimum average battery level decreases from 0.6 to 0.4. In the case of the residential traffic profile, only the LL is able to guarantee the minimum battery level, and only for the case of 20% of heavy users. It is to be noted that the energy harvesting equipment is dimensioned for the worst case scenario of operating in the winter season. Thus, during the summer the energy reserves are abundant and, consequently, the learning processes more easily find a solution for maintaining the battery level above the B_th^OFF threshold. Finally, it is to be noted that, even after the training phase has finished, a good trade-off between exploration and exploitation has been adopted in order to guarantee a constant update of the Q-table and the selection of the proper ON/OFF policy across the different seasons, where the energy harvesting process can have substantial variations.
These changes in the environment are well managed by the learning algorithm, as presented in [29], where we proved that the difference in network performance between QL and an offline-trained QL is negligible. In detail, the offline QL uses pre-calculated Q-tables on a per-month basis to avoid their update due to the variations of the energy process across the months.

D. ON/OFF Policies
In this section, we analyze the behavior of the switch ON/OFF policies of the LL solution. The policies obtained offline with a Direct Load Control based on DP presented in Section III-A, "Offline" in the following, are reported for the sake of comparison. The policies of LL have been evaluated across 365 days of simulation with the training already performed offline. The results are presented separately for the winter and the summer periods, respectively termed "Win" and "Sum" in the plots, since the harvesting process substantially differs between seasons. January, February, October, November and December are considered winter months. The network scenario studied is based on a cluster of 3 SBSs.
The daily average switch OFF rate of the SBSs for the office and residential traffic profiles is reported in Fig. 6 and Fig. 7, respectively. The total traffic requested by the 3 SBSs is also reported in those figures. It is to be noted that the two traffic profiles differ considerably in the amount of traffic requested during the day. In fact, the office daily traffic demand reaches 61 GB/h and 115 GB/h for 20% and 50% of heavy users, respectively, while in the residential traffic profile the UEs jointly require up to 116 GB/h and 218 GB/h for 20% and 50% of heavy users, respectively. Fig. 6 shows that the policies substantially converge in having a high switch OFF rate during the night to save energy for the daily peak of traffic. However, the LL algorithm starts switching OFF at a high rate earlier, at 8 pm and at midnight for 20% and 50% of heavy users, respectively. As can be noted, the total amount of traffic in the network influences the policies of the LL algorithm, reducing the duration of the high switch OFF rate window from 10 to 6 hours. Therefore, the main difference between the Offline and the LL solutions is in the extent of the high switch OFF rate window, where the LL presents a more conservative approach and needs to switch OFF with higher intensity to be able to reach the design goals. This phenomenon is mainly due to the random nature of the energy harvesting process, which may provide very diverse energy incomes across different days. Fig. 7 shows that the policies of the two schemes have a similar behavior during the night and differ during the day. In fact, considering the case of 50% of heavy users in Fig. 7b, the Offline solution presents an extra switch OFF peak during the winter afternoon, from 2 to 3 pm, in order to save energy for the peak of traffic during the night. On the contrary, LL maintains the behavior of the case of 20% of heavy users, with a single switch OFF window during the night.
However, LL reacts to the higher traffic demand during the night by reducing the duration of the switch OFF rate window by 50% with respect to the case of the office traffic profile. It is to be noted that this is the hardest case for LL stability, as testified by Fig. 5b.

E. Network Performance
In this section, the LL framework is compared against the QL solution introduced in Section VI-C and a greedy (GR) algorithm. The GR switches OFF a SBS when its battery is below the security threshold B_th^OFF and reactivates it when the battery returns above the threshold. Results are obtained by averaging simulations spanning different months, for an overall duration of 365 simulated days, with LL and QL already trained. It is worth noting that the performance evaluated including the training phase does not change substantially, since the training phase is relatively short with respect to the assessment window, as presented in Section VI-C, and it will not be reported in the following. Despite the fact that the training is performed offline, the exploration phase is not stopped, in order to be able to follow the slower dynamics of the energy harvesting process across the seasons. As in Section VI-D, the results are presented for the two representative periods "Win" and "Sum". We considered a high-traffic scenario involving 70 UEs with 50% of heavy users at each SBS. The results with low traffic present a similar behavior and are not reported for space reasons. Fig. 8 presents the system average percentage gain in throughput of the LL and QL schemes with respect to GR. The LL framework always achieves a higher throughput. Moreover, the LL throughput gain increases with the number of SBSs, while with QL the gain starts decreasing from 6 SBSs. This phenomenon is particularly pronounced in the case of the residential traffic profile, where QL has a throughput lower than GR in the summer period, as depicted in Fig. 8a. This originates from the typical problem of MRL, discussed in Sections I and III, where the lack of coordination may generate conflicting behaviors among the agents. This issue occurs with higher probability when the number of agents grows, as clearly demonstrated by the QL performance in Fig. 8.
Alternatively, LL is able to improve the coordination of the agents and better exploits the increased capacity resources. It is to be noted that during summer the gain in throughput is lower, since the renewable energy system has been dimensioned to provide the necessary energy in the winter season. This implies that the harvested energy in summer is generous and both LL and QL have less margin for policy optimization. Fig. 9 reports the average system drop rate of the three schemes, as defined in Section III-A. LL always presents the lowest system drop rate and maintains it, almost always, below the 3% threshold D_th. The only exception is the case of the residential traffic profile in the winter season, where the system drop rate reaches 8%. Alternatively, the QL solution presents a higher system drop rate than GR in several cases. For the residential traffic profile, GR outperforms QL with 10 SBSs during the summer, and for the office traffic profile both in summer, starting from 8 SBSs, and in winter, in the case of 10 SBSs. This confirms the problem of QL when scaling up the number of SBSs. We now analyze the average daily performance during the summer and winter periods in the scenario with 10 SBSs to highlight the differences between QL and LL in the most sensitive hours. In Fig. 10, we report the system drop rate of the LL, QL and GR solutions in a cluster of 10 SBSs while varying the number of UEs per SBS. The LL solution is able to reduce the system drop rate by more than 50% with respect to GR. On the contrary, QL always has the worst drop rate, both in the summer and winter periods, starting from 60 UEs per SBS and considering the office traffic profile. Finally, Fig. 11 presents the average hourly system drop rate for the case of 70 UEs per SBS. It is clear that LL outperforms the other solutions throughout the day for both traffic profiles.
In detail, in the case of the office traffic profile, LL is able to stay below the 3% threshold D_th during the whole day, while GR and QL present high peaks in the early morning (25% at 9 am) and early night (15% from 8 pm to midnight). Regarding the residential traffic profile, LL exceeds the threshold D_th, since it experiences difficulties in managing the high traffic peaks at late night and early morning, which are two sensitive periods as, in both of them, the system has very limited energy reserves. Table II presents the footprint of the two learning-based methods (LL and QL) together with a baseline solution where both the MBS and the SBSs are powered by the grid. In particular, the comparison is performed in terms of grid energy consumption for a scenario with 50% of heavy users. In addition, the column named "excess energy" reports the values of the harvested energy that can neither be used by the SBSs nor stored in the batteries, since the harvesting/storage system is dimensioned for the worst case of the winter season.

F. Energy Assessment
The LL solution can reach energy savings of up to 50% during the summer, as in the scenarios with 10 SBSs. The savings are affected by the number of SBSs deployed: for smaller numbers of SBSs (e.g., 5), the energy savings are in the range of 20-30%. The traffic profile is another important factor that influences the footprint, since it determines the total amount of data exchanged in the network and its temporal dynamics, as discussed in Section VI-D. The energy savings for the scenarios with the office traffic profile are in general 10% higher than those with the residential traffic profile. The reason is that the residential traffic profile has a peak of traffic during the night (12 am), which is when the energy reserves are scarce and the learning solutions need to rely more on the MBS, as can be noticed from the longer high switch OFF rate window depicted in Fig. 7. Considering the two learning methods, the amount of traffic delivered influences the energy consumption, as expected: LL consumes more energy since it drops less traffic. However, the gap between LL and QL is almost null in scenarios with 10 SBSs, which is where QL experiences the highest drop rate, as presented in Fig. 9.
Finally, looking at the excess energy values in Table II, we can appreciate how abundant the harvested energy is in summer, when the energy that cannot be used is considerably higher. As for the energy savings, this circumstance occurs with particular intensity with the residential traffic profile, which is the scenario in which the learning algorithms experience the worst network performance. The behaviors of QL and LL are similar, except for the case of summer with 10 SBSs, where QL consumes 10% more harvested energy despite dropping more traffic. This behavior confirms that LL performs better when scaling up the number of SBSs and is able to use the available energy reserves more efficiently, since it utilizes less solar energy than QL and, concurrently, delivers more traffic.
In the light of the above results, future work should further investigate the decision-making process, to enable the system to reduce the excess energy and to achieve the same performance under different traffic conditions. Moreover, given its abundance, the excess energy can be taken into consideration in the control framework, either for sharing it with other network elements and/or for trading it, as done by the prosumers in the smart grid architecture.

VII. CONCLUSION
In this paper, we have presented a framework for the traffic control of a densified HetNet powered with solar energy, based on SDN/NFV and MRL. In particular, the coordination problem of simultaneously finding a solution among all the agents that is good for the whole system has been solved through Layered Learning. The first layer implements a HAMRL algorithm to learn the control policies locally at each SBS and to adapt to the dynamic conditions of the environment, in terms of energy inflow and traffic demand. Layer 2 is in charge of improving the Layer 1 policies through an MFNN-based solution, by considering network-wide parameters and mitigating the effects of the conflicting behavior of the agents.
We have analyzed the training phase of the LL algorithm, which presents improvements compared to a distributed MRL solution thanks to the introduction of the layered coordination. Then, we compared the resulting policies with those of an offline solution based on DP, finding that they act very similarly. The LL algorithm has also been contrasted against a greedy and a distributed MRL solution from a network performance perspective. Simulation results show that the proposed solution outperforms the others in terms of throughput and drop rate. The energy savings achieved with respect to a baseline solution, where both MBS and SBSs are powered by the grid, are considerable, reaching up to 50%. As a result, we can conclude that our proposal succeeds in incorporating stability of the learning process on the one hand, and adaptation to the dynamic behavior of the agents on the other hand, which was our primary goal.
There are several ways in which this work can be extended. The traffic model used could incorporate accurate geographical and spatial information. In fact, densified HetNet scenarios are characterized by specific spatial distributions of users (i.e., hotspot coverage) that vary during the day according to the zone (e.g., residential, office, transportation). The proposed solution has shown some stability problems for networks of very densely deployed SBSs and high traffic. The LL scheme adopted represents a good starting point, since it combines the flexibility of a distributed learning solution with the efficiency of a centralized approach. Further work can be done on a deeper integration between the solutions adopted in the two layers, e.g., using deep learning methods. Finally, the results on the excess energy encourage integrating its control in the optimization problem. In fact, the presence of surplus energy perfectly matches the Demand Response methods of the smart grid, where the energy can be shared among elements and/or traded with the operators.