Dynamic Control of Functional Splits for Energy Harvesting Virtual Small Cells: a Distributed Reinforcement Learning Approach

In this paper, we propose a network scenario where the baseband processes of the virtual small cells powered solely by energy harvesters and batteries can be opportunistically executed in a grid-connected edge computing server, co-located at the macro base station site. We state the corresponding energy minimization problem and propose multi-agent Reinforcement Learning (RL) to solve it. Distributed Fuzzy Q-Learning and Q-Learning on-line algorithms are tailored for our purposes. Coordination among the multiple agents is achieved by broadcasting system level information to the independent learners. The evaluation of the network performance confirms that coordination via broadcasting may achieve higher system level gains than un-coordinated solutions and cumulative rewards closer to the off-line bounds. Finally, our analysis allows us to evaluate the benefits of continuous state/action representation for the learning algorithms in terms of faster convergence, higher cumulative reward and better adaptation to changing environments.


I. INTRODUCTION
Mobile traffic demand is growing exponentially [1]. To cope with this, Mobile Network Operators (MNOs) are deploying dense networks, which are composed of multi-tier Base Stations (BSs) in the same coverage area. This involves a network setup of many small BSs (SBSs) for satisfying the traffic demand from hot-spots (e.g. shopping malls, offices, entertainment areas) and Macro Base Stations (MBSs) to ensure mobility and reliable coverage [2]. The resulting mobile network architecture is known as Heterogeneous Networks (HetNets). On the other hand, as mobile networks become densified, their electrical power requirements are also rapidly increasing. As a result, power consumption is playing a major part in the operational expenditures of MNOs. Moreover, there is an increasing concern regarding the environmental impact of Information and Communication Technology (ICT). It is estimated that ICT consumes about 10% of worldwide electricity generation, and it is forecast that this share might reach 53% in a decade [3]. Hence, energy sustainability is identified as one of the key requirements in the design and operation of mobile networks in order to ensure cost effectiveness and reduce the impact on the environment.
In recent years, the Cloud Radio Access Network (CRAN) architecture has been proposed for enabling an efficient resource utilization via centralized processing of Baseband (BB) functions [4]. In CRAN, BB processing takes place at a centralized Baseband Unit (BBU) pool.
As a result, base stations are reduced to simple Remote Radio Heads (RRHs). The connection between the RRHs and the BBU pool is provided by the fronthaul links. CRAN ensures simplified base stations at cell sites and more efficient resource utilization due to centralized processing.
The main drawback of CRAN is the need for a high capacity and very low latency fronthaul. As a result of efforts to relax the fronthaul requirements, a flexible functional split between local BS sites and a central BBU pool has been proposed [5]. Hence, part of the BB processing is executed at the local BS sites, while maintaining many of the centralization advantages of CRAN. Moreover, with the advent of Network Function Virtualization (NFV), network functions can be executed on general purpose computing hardware as virtual functions, with Software Defined Networking (SDN) applied as a tool to realize the management and control of such functions [6]. By combining Flexible Functional Splits with NFV and SDN, network functions of SBSs (e.g. BB functions) can be virtualized and placed at different sites of the network. These small base stations are known as Virtual Small Cells (vSCs) and enable flexibility in resource allocation and management. More recently, Multi-Access Edge Computing (MEC) has been introduced to enable the convergence of IT and telecommunication networking [7]. Thanks to MEC, BSs can leverage cloud computing capabilities and share part of their computational processes.
As a means of ensuring energy sustainability, Energy Harvesting (EH) technology is becoming widely applicable in mobile networks. EH allows both cost and environmental impact reductions [8]. However, EH comes with its own unique challenge, mainly due to intermittent energy sources, which make the energy supply unreliable. Hence, Energy Harvesting Base Stations (EHBSs) can benefit from flexible Functional Split options. The vSCs that completely rely on EH can opportunistically use the BB processing units available at the MEC server, which can be co-located at the MBS site. This is particularly important since BB processing accounts for a large share of the total power consumption of smaller base stations. Consequently, MEC-enabled energy-aware placement of BB processes according to functional split options is a promising technique to enable energy efficient operation of vSCs powered by EH.
Overcoming the challenges that arise from EH and ensuring intelligent energy management require the design of dedicated control methods. In our previous work [9], we have studied the performance bounds of dynamic placement of different functional split options in an off-line manner for vSCs powered by EH. These performance bounds are determined by solving a joint grid energy consumption and system outage minimization problem, based on a-priori knowledge of the system dynamics (traffic and energy arrivals) subject to battery constraints.
These results prove that dynamically adapting functional split options can provide significant grid energy savings as opposed to static configuration options. However, the off-line solution relies on a-priori knowledge and scales poorly with the number of vSCs due to its high computational complexity. On the other hand, Machine Learning (ML) tools can be used to extract models that reflect the user, energy harvesting and network behaviors, and to solve interactive decision making problems in real-time, at short time scales and with minimum a-priori information of the system [10]. In particular, a Reinforcement Learning (RL) based algorithm for dynamic placement of functional split options is proposed in our previous work [11]. It is based on Temporal Difference (TD) learning methods, namely Q-learning and SARSA [12], for online learning of control policies of a vSC powered by EH with flexible operative modes. The control has been implemented with an agent placed in the vSC and we investigated results for the case of single and autonomous vSC deployment [11]. In this case, RL allows learning of the optimal strategy through interaction with the system environment for achieving system wide objectives, i.e. efficient utilization of the harvested energy. However, when considering the case of various vSCs operating simultaneously, RL is expected to face more problems. Centralized solutions might experience long convergence and training phases due to the high number of state/action pairs needed to model the environment. Alternatively, a distributed approach may reduce the complexity by dividing the problem among the multiple agents. Nevertheless, multi-agent systems can experience issues due to the conflicting interests of the agents [13].
In fact, when each vSC is allowed to learn the best energy management policy independently, there is a risk that its actions affect other vSCs' policies, which in turn would have a negative effect on the overall performance of the network, e.g. the system drop rate. Hence, a multi-agent RL based strategy should ensure coordination among the learning agents, i.e. the vSCs, towards achieving system wide gains. This paper proposes multi-agent RL based on-line algorithms for dynamic placement of functional split options in MEC-enabled RAN with energy harvesting capabilities.
Both Fuzzy Q-Learning (FQL) and Q-Learning (QL) based on-line algorithms are proposed with performance comparisons and evaluation against an off-line bound. Coordination among the multiple agents is achieved by broadcasting system level information to the independent learners. A comparison with an implementation of the learning algorithms without coordination is also analyzed.
The main contributions of the paper may be summarized in the following items:
• Edge Computing Platforms for Energy Saving: we propose a network scenario where part of the computational processes of vSCs powered solely by energy harvesters may be shared with a grid-connected central MEC server at the MBS.
• Grid Energy Minimization Problem Statement: we formulate a network wide grid energy optimization problem while avoiding system outage for the proposed network scenario.
• Coordinated Multi-agent RL Solutions: we propose multi-agent RL controllers to solve the grid energy optimization problem. In particular, distributed FQL and QL based solutions are tailored for our purposes, including different levels of coordination among the vSCs.
• Characterization of the Learning Algorithms: we analyze the complexity and the convergence of the proposed learning algorithms by giving insights into the hyperparameter setup in both simulative training and run-time scenarios. We study the effects of the quantization of states and actions on the stability and system performance. We characterize the selected policy of the coordinated solutions with respect to the off-line bound.
• Network Performance Evaluation: we evaluate the network performance (in terms of energy consumption and traffic drop rate) achieved by our multi-agent RL solutions with different levels of coordination and compare them against the off-line performance bound and static solutions.
We state here that the list of contributions of our paper represents the first effort on controlling MEC-enabled RAN with energy harvesting capabilities through coordinated multi-agent RL. For more details, the reader is referred to Section II, in which we survey the related work on the topic.
The rest of the paper is organized as follows. Section II describes the related literature and Section III shows the reference architecture considered in this work. Section IV describes the overall system model including power consumption, network, traffic and energy arrival models.
Both FQL and QL based control designs are explained in Section V. Section VI shows the simulation scenario and numerical results of simulations including the comparison among QL, FQL and off-line solutions. Finally, we draw our conclusions in Section VII.

II. RELATED WORK
Recently, intelligent energy management in EHBSs has been the focus of many studies in the research community due to the increasing significance of energy sustainability combined with dense deployment of BSs. Most of this literature analyzes hierarchical multi-tier networks, the so called HetNets, with an intelligent switching on/off scheduling of BSs. The authors in [14] apply Dynamic Programming (DP) to determine the optimal switch on/off policy in a two-tier HetNet with a baseline MBS and hot-spot deployed SBSs. The solution shows the performance bound of an intelligent switch on/off policy when all the system dynamics information is known a-priori. Minimizing the grid energy consumption for hybrid powered base stations is also studied in [15]. Here, the authors applied a two-stage DP method designed to achieve energy saving gains while maintaining the probability of blocking. The authors in [16] apply a ski-rental framework based on-line algorithm for optimal switch on/off scheduling for minimizing the operational costs of a network composed of self powered base stations. Moreover, the authors in [17] studied sleep mode coordination between base stations powered by EH and grid energy using DP. However, the DP based solution is shown to entail high computational complexity.
On the other hand, the authors in [18] apply RL, in particular the QL algorithm, to optimize the harvested energy utilization. The work is based on distributed Q-learning where each renewable powered base station takes an autonomous decision on whether to switch on/off according to energy arrival, energy storage and traffic demand. In addition, multi-armed bandit based distributed learning is studied in [19] to allow each SBS to learn its own energy-efficient policy. The authors in [20] applied layered learning for system wide harvested energy allocation through decomposition of the problem into two layers. The first layer, based on RL, is in charge of local control at each SBS and the second layer, based on artificial neural networks, ensures network wide coordination among the SBSs. Moreover, renewable energy allocation in edge computing devices with EH is studied in [21]. Here, the authors propose RL based on-line solutions for offloading and auto-scaling in edge computing devices that are powered by EH. On the other hand, the authors in [22] proposed an RL based energy controller for an SBS powered by energy harvesting, battery and smart grid by considering battery aging effects. This work is based on FQL and is shown to provide a significant extension to the lifetime of a small cell battery.
However, this work is limited to a single SBS, and coordinated energy management among base stations within a mobile network remains an open issue.
Nevertheless, there is a gap in the literature in integrating EH and flexible functional split options in MEC-enabled RAN: to the best of our knowledge, no solution has analyzed the possibility of dynamically sharing the BB processes of vSCs with a MEC server co-located with the MBS, which is the main topic of our paper. Here, we claim that functional splits make it possible to consider more operative modes of small BSs, in addition to switching on and off, and enable higher grid energy savings. Moreover, from a methodological perspective, most of the work has used RL to solve a single agent problem. Instead, in this paper, we propose a multi-agent solution for a multi-cell scenario using multi-agent RL algorithms with different levels of coordination among agents. In particular, we propose to communicate system level information to the multiple agents and compare such coordination with solutions based on independent learners. Moreover, we tailor QL and FQL algorithms to our network scenario and evaluate the effects of the quantization of the states on the system performance.
The first tier consists of the MBS with its MEC server, which is responsible for providing baseline coverage, mobility support and baseband processing resources. The MBS site is fully powered by energy from the grid, thus assuring reliable communications and computing. The second tier is composed of vSCs, which are deployed in a hot-spot manner for capacity enhancement and do not overlap in coverage [23]. They are completely powered by solar panels and batteries and they are fully or partially dependent on the MEC server at the MBS for BB processing. The reference architecture is shown in Fig. 1.
The proposed MEC-enabled architecture, together with the SDN and NFV paradigms [24], is an enabler for automated network management, flexibility and cost reductions. In MEC deployments, multi-tier MEC servers are co-located at the BSs and have different computational and transmission capabilities (i.e., MBSs are high-power and high-computing nodes) [25]. In this work, we are interested in the case where MBSs may support vSCs and enable computational offloading of some of their networking functions. Hence, the vSCs opportunistically use the central BBU pool, i.e. the MEC server at the MBS, for full or partial BB processing requirements. To do so, a standardized interface (e.g. OpenFlow [26]) can be used to implement the interactions between the MBSs and vSCs. In this way, vSC transmission functions are decoupled from proprietary hardware-dependent implementations and may be executed in a different hardware resource of the network. In this respect, 3GPP has defined different functional splits between the distributed and the centralized unit [27], in our case the vSCs and the MBS, respectively. The vSCs in our scenario can opportunistically operate in one of the functional split configuration options, which are based on [27] and are explained below. Therefore, MEC deployment relies on a reconfigurable fronthaul, since the bandwidth and latency requirements become more stringent when more functions are placed in the centralized unit [5].

A. Network Model
We consider a two-tier mobile network composed of clusters of one MBS with co-located BBU pool and $N$ vSCs. The system evolves in time slots based on the variation of the traffic demand and the energy arrivals. The traffic load at slot $t$ generated by the users in the coverage of the vSCs is defined as $L^t = [L^t_1, L^t_2, \ldots, L^t_N]$, the harvested energy as $H^t = [H^t_1, H^t_2, \ldots, H^t_N]$ and the energy stored by each vSC in its battery in slot $t$ as $B^t = [B^t_1, B^t_2, \ldots, B^t_N]$. Moreover, the energy stored in the batteries at the beginning of the next slot, $B^{t+1}$, is evaluated according to the following formula:
$$B^{t+1} = \min\left(B^t + H^t - P^t \Delta_t,\; B_{cap}\right) \qquad (1)$$
where $P^t = [P^t_1, P^t_2, \ldots, P^t_N]$ is the power consumed by the vSCs in slot $t$ (and described in Section IV-B), $B_{cap}$ is the maximum battery capacity and $\Delta_t$ is the time difference between two consecutive slots. The operative state of the vSCs in slot $t$ is defined by $S^t = [S^t_1, S^t_2, \ldots, S^t_N]$. At each slot, intelligent decisions are made to determine the optimal configuration of the vSCs in the mobile cluster to serve the traffic demand based on their available energy budget, energy arrival information and the traffic request. The optimization problem is defined by a Markov Decision Process (MDP), in which the next state depends on the current state $X^t$, the control action $A^t$ and the random disturbance of the environmental variables. In particular, each vSC $i$ has a state $X^t_i$ and a control action $A^t_i$, where the action selects the operative mode of the vSC (switched off, PHY-RF split or MAC-PHY split).
The optimization objective is to minimize the energy consumption from the grid while avoiding system outage. We define system outage as the event in which the traffic demand is not satisfied due to battery energy depletion or wrong configuration decisions, which may overload the MBS with the traffic of the switched off vSCs. Hence, the general optimization problem can be formulated as follows:
$$\min_{A^1, \ldots, A^K} \sum_{t=1}^{K} f(A^t, t) \quad \text{subject to} \quad B^t_i \geq B_{th}, \;\; \forall i, t \qquad (3)$$
where $B_{th}$ is the battery threshold level and $K$ is the time horizon or the number of times the energy control is applied; $f(A^t, t)$ is the weighted cost function in slot $t$, which is defined as:
$$f(A^t, t) = \omega_1 E_m(A^t, t) + \omega_2 D(A^t, t) \qquad (4)$$
where $E_m(A^t, t)$ and $D(A^t, t)$ are respectively the normalized grid energy consumption and the traffic drop rate in the cluster, given the operative modes of the vSCs in slot $t$.
The grid energy consumption in slot $t$ is equivalent to the energy consumption at the MBS site, including the MEC server used for computational offloading. The grid energy consumption is then computed as:
$$E_m(A^t, t) = P_m(A^t, t)\, \Delta_t \qquad (5)$$
where $P_m(A^t, t)$ is the power consumption of the MBS given the operative modes of the vSCs.
The details on the power consumption models are described in Section IV-B. The traffic drop rate in slot $t$, $D(A^t, t)$, is the fraction of the total traffic demand that cannot be served by the system in slot $t$. Additionally, each battery at the vSCs has to be maintained in the proper State Of Charge (SOC) (i.e., above the battery level threshold $B_{th}$) to avoid a rapid reduction of its lifetime [28]. Finally, the weights $\omega_1$ and $\omega_2$ provide flexibility in the cost function to emphasize one part of the cost over the other. They must always be positive and sum to 1, that is, $\omega_1 + \omega_2 = 1$. In this work, we will consider $\omega_1 = \omega_2 = 0.5$ to give balanced importance to the two components.
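As an illustration of the slot-by-slot bookkeeping described above, the following minimal Python sketch applies the battery recursion and the weighted cost to plain lists. The function names `next_battery` and `weighted_cost` are hypothetical, the numeric inputs are placeholders, and flooring the battery level at zero is an added assumption that the text does not state.

```python
from typing import List, Sequence


def next_battery(b: Sequence[float], h: Sequence[float], p: Sequence[float],
                 delta_t: float, b_cap: float) -> List[float]:
    """Battery level of each vSC at the beginning of the next slot."""
    levels = []
    for b_i, h_i, p_i in zip(b, h, p):
        # Stored energy plus harvested energy minus consumed energy (power * slot length).
        stored = b_i + h_i - p_i * delta_t
        # Cap at the battery capacity; the floor at 0 is an assumption of this sketch.
        levels.append(max(0.0, min(stored, b_cap)))
    return levels


def weighted_cost(e_grid_norm: float, drop_rate: float,
                  w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted sum of normalized grid energy and traffic drop rate."""
    if w1 <= 0 or w2 <= 0 or abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return w1 * e_grid_norm + w2 * drop_rate


if __name__ == "__main__":
    # Placeholder values for two vSCs, in arbitrary energy units.
    print(next_battery([1.0, 9.5], [0.5, 1.0], [0.2, 0.1], delta_t=1.0, b_cap=10.0))
    print(weighted_cost(0.4, 0.2))
```

With equal weights, a slot that halves the grid energy but doubles the drop rate can leave the cost unchanged, which is why the choice of weights matters for the learned policy.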
An off-line solution of this problem is proposed in our work [9] using DP and with a priori knowledge of the environmental variables. The problem of finding optimal configuration options is represented as a graph and stated as a Shortest Path search, while the Label Correcting Method is used to explore the graph and find the shortest path. The results obtained are considered as system performance bounds and are used as a benchmark for the control methods proposed in this paper.
In this work, we propose an on-line solution based on multi-agent RL. In particular, we use approximated DP methods, known as TD learning, to determine optimal policies [12]. Our proposal is based on distributed and coordinated decision making: each vSC takes actions by itself, which makes the solution scalable with the number of vSCs. In order to coordinate the decision making process, we rely on communicating system wide information, e.g. the traffic load at the MBS, to each vSC. Section V describes the proposed QL and FQL based control solutions to the MDP described here.

B. Power model
The power consumption of each split option is estimated based on the model introduced in [29]. The total BS power consumption is given by:
$$P_{BS} = P_{BB} + P_{RF} + P_{PA} + P_{overhead} \qquad (6)$$
where $P_{BB}$ is the power consumption due to the baseband processing, $P_{RF}$ is the power consumption due to RF circuitry, $P_{PA}$ is the power consumption of the power amplifier and $P_{overhead}$ is the overhead power consumption (e.g., cooling system).
Baseband power consumption, $P_{BB}$, is given by:
$$P_{BB} = P_{CPU} + P_{OFDM} + P_{filter} + P_{FD} + P_{FEC} \qquad (7)$$
where $P_{CPU}$ is the idle mode power consumption, $P_{OFDM}$ is the power consumption due to OFDM processes, $P_{filter}$ is the power consumption due to filtering, $P_{FD}$ is the frequency domain processing power consumption and $P_{FEC}$ is the power consumption due to forward error correction (FEC) processes. In accordance with [29], all the terms in (7) are dependent on the number of antennas and bandwidth. Moreover, $P_{FD}$ and $P_{FEC}$ are the only components that depend on the traffic load.
When the vSCs are in PHY-RF split mode, their power consumption model does not include the corresponding $P_{BB}$, since the baseband processing takes place at the MBS site. Instead, the corresponding $P_{BB}$ term is added to the MBS. On the other hand, in MAC-PHY split mode, the vSC power consumption includes the baseband power consumption term and is given in (6). Considering the aforementioned model description, the grid power consumed by the MBS is computed as:
$$P_m(A^t, t) = P^{MBS}_{BS} + \sum_{i \in G} P^i_{BB} \qquad (8)$$
where $P^{MBS}_{BS}$ is the power consumption of the MBS computed as in (6), $P^i_{BB}$ is the baseband power consumption of the i-th vSC and $G$ is the set containing the vSCs in PHY-RF split mode.
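The additive structure of this power model can be sketched in a few lines of Python. The class and function names below are hypothetical and the wattages in the usage example are arbitrary placeholders, not values from [29]; the point is only to show how the baseband term of a vSC moves to the MBS when that vSC is in PHY-RF split mode.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class BasebandPower:
    """Baseband power components of one base station (placeholder units: W)."""
    cpu: float      # idle mode power
    ofdm: float     # OFDM processing
    filt: float     # filtering
    fd: float       # frequency domain processing (load dependent)
    fec: float      # forward error correction (load dependent)

    def total(self) -> float:
        return self.cpu + self.ofdm + self.filt + self.fd + self.fec


def bs_power(p_bb: float, p_rf: float, p_pa: float, p_overhead: float) -> float:
    """Total BS power: baseband + RF circuitry + power amplifier + overhead."""
    return p_bb + p_rf + p_pa + p_overhead


def mbs_grid_power(p_mbs_bs: float, vsc_baseband: Dict[int, float],
                   phy_rf_set: Iterable[int]) -> float:
    """MBS grid power: its own BS power plus the baseband power of every
    vSC currently operating in PHY-RF split mode."""
    return p_mbs_bs + sum(vsc_baseband[i] for i in phy_rf_set)


if __name__ == "__main__":
    bb = BasebandPower(cpu=1.0, ofdm=2.0, filt=3.0, fd=4.0, fec=5.0)
    print(bs_power(bb.total(), p_rf=1.0, p_pa=2.0, p_overhead=0.5))
    # vSCs 0 and 2 offload their baseband processing to the MBS.
    print(mbs_grid_power(100.0, {0: 10.0, 1: 20.0, 2: 30.0}, {0, 2}))
```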

C. Energy Harvesting and Demand Profiles
Hourly energy generation traces from a solar source have been obtained for the city of Los Angeles (CA, USA). The solar raw irradiance data have been collected from the National Renewable Energy Laboratory and converted into harvested energy traces using the SolarStat tool [30]. The energy harvesting traces are generally bell-shaped with a peak around midday, whereas the energy harvested during the night is negligible. Moreover, as discussed in [30], high variability of the harvested energy may occur during the day and this also holds for the summer months. This means that, although the energy inflow pattern can be known to a certain extent, intelligent and adaptive algorithms that make their decisions based on current and past inflow patterns, as well as predictions of future energy arrivals, have to be designed.
For the demand profile, it is commonly accepted and confirmed by measurements that the energy use of base stations is time-correlated and daily periodic. The UEs have been classified as in [31] into heavy and ordinary users according to their amount of requested traffic. Moreover, in this article we use the traffic load profile obtained in [32] as the average amount generated by the users. In addition, based on the average traffic generated by the users, traffic variability is added following a normal distribution using the standard deviation from measurements of real mobile traffic traces [33]. The traffic demand of each UE in a cycle is dimensioned based on traffic profiles presented in [32], which are derived from time, location and frequency information of thousands of cellular towers. The analysis in [32] demonstrates that the urban mobile traffic usage can be described by mainly five basic time domain patterns that correspond to different functional regions, i.e., residential, office, transportation, entertainment and comprehensive. In this article, we are considering residential and office profiles, which are the most common use cases for urban deployment scenarios. An example of a normalized energy harvesting trace,

A. Background
RL is a learning paradigm that relies on learning by interacting with the environment without an exemplary supervision [12]. It is a well known framework for solving a problem described as an MDP. Formally, the RL framework is defined in terms of states, actions and rewards.
Through the RL process, according to the current state, the agent executes a certain action and receives an immediate reward and as a result of the action, its environment will evolve to a new state. It is important to note that in RL, the rewards can be delayed. Hence, it is a sequential decision making process with the goal of maximizing cumulative reward. For our network model, the objective of the RL based controller is to learn energy management policies through the interaction with the environment. The controller decides the operative mode of vSCs in a cluster at each time slot based on the traffic load, energy arrival and energy storage information.
Let $X^t$ be the state of the system at time $t$; the controller chooses an action $A^t$ from the action set $\mathcal{A}$, which translates to the operative modes of the vSC. As a result of this action, the environment returns an immediate reward $r^t$. Based on this $r^t$, the corresponding Q-value, $Q(X^t, A^t)$, which represents the level of goodness when taking a specific action in a given state, will be updated.
The process of learning needs a balance between exploration, i.e. taking random actions to discover new knowledge, and exploitation, i.e. taking the actions that have already been discovered as good, namely the actions with the maximum Q-value. This process of selecting a specific action and updating the Q-value continues sequentially for each time slot. The controller selects the action at the beginning of each time slot $t$ based on the specific RL algorithm it applies. The goal of such algorithms is to determine iteratively the Q-values for each state-action pair in order to achieve an optimal policy in the long term.
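One common way to realize this exploration/exploitation balance is an ε-greedy rule, sketched below in Python. This is an illustrative assumption: the text does not name the selection rule used by the controllers, and the function name `epsilon_greedy` and the Q-values in the example are placeholders.

```python
import random
from typing import Sequence


def epsilon_greedy(q_row: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Pick an action index for one state.

    With probability epsilon a random action is taken (exploration);
    otherwise an action with the maximum Q-value is taken (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties among equally good actions at random.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


if __name__ == "__main__":
    rng = random.Random(0)
    q_row = [0.1, 0.9, 0.3]   # placeholder Q-values for 3 operative modes
    picks = [epsilon_greedy(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
    print({a: picks.count(a) for a in range(3)})
```

A larger ε speeds up the discovery of good actions early in training, at the price of more random (and possibly costly) decisions once the Q-values are already reliable.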
We propose a multi-agent RL solution based on a distributed control architecture. In fact, our reference network scenario consists of multiple vSCs in the coverage of an MBS. Therefore, single agent RL methods may have high complexity and slow convergence due to the high number of state/action pairs that represent the system. Instead, multi-agent RL methods offer higher scalability since they distribute the algorithms among the different vSCs. However, while in single-agent RL the state of the environment changes solely as a result of the actions of the agent, in multi-agent RL scenarios the state of the environment is subject to actions from multiple agents. As a result, multi-agent RL is prone to conflicting interests among the learning agents and requires coordination techniques among the agents to learn an optimal system-wide strategy.
We solve the coordination issue by broadcasting system wide information to every vSC, which harmonizes the selected policies and achieves system wide gains.
On the other hand, a Fuzzy Inference System (FIS) maps a set of input control signals to a set of output actions through fuzzy rules [34]. FISs are mainly applicable to systems that cannot be represented by explicit mathematical models, since they approximate system knowledge in a way similar to human perception and reasoning. The design of FIS involves the following steps [34]:

B. QL based controller
Q-Learning is an off-policy RL algorithm that can learn the optimal Q-values for each state-action pair [12]. For the single agent case, as long as all state-action pairs are visited and continue to be updated, Q-learning guarantees an optimal behavior regardless of the specific policy being followed throughout the learning phase. On the other hand, there is no formal proof of convergence to the optimal solution for the multi-agent case, due to the problem of conflict among the agents.
The equation for updating the Q-values is given by:
$$Q(X^t, A^t) \leftarrow Q(X^t, A^t) + \alpha \left[ r^t + \gamma \max_{A} Q(X^{t+1}, A) - Q(X^t, A^t) \right] \qquad (9)$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $A^t$ is the current action, $r^t$ is the immediate reward, $X^t$ and $X^{t+1}$ are the current and the next state respectively. The procedure of the Q-learning algorithm is shown in Algorithm 1. In what follows, we describe the definition of states, reward and actions for the QL based controller for solving our MDP.
3) Reward:
The reward function determines the immediate reward the controller acquires as a result of taking a specific action. The optimization goal is to minimize the power drained from the grid while avoiding system outage as given by (4). Hence, the reward function (10) is formulated in terms of $E_m(A^t, t)$ and $D(A^t, t)$, which are respectively the normalized grid energy consumption and the traffic drop rate in the cluster, given the operative modes of the vSCs and the time step $t$.

Algorithm 1 Q-Learning Algorithm
Initialize $Q(X, A)$ $\forall X \in \mathcal{X}, A \in \mathcal{A}$ arbitrarily
for each episode do:
    for each step, $t$, of episode do:
        Choose $A^t$ from $X^t$ using policy derived from $Q$
        Take action $A^t$, get reward $r^t$ and next state $X^{t+1}$
        Update Q-value using (9)
        $X^t = X^{t+1}$
    end for
end for
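The loop of Algorithm 1 can be sketched as a small tabular implementation in Python. This is a minimal illustration under stated assumptions, not the controller evaluated in the paper: the ε-greedy policy, the zero initialization, the hyperparameter values and the two-state toy environment `toy_step` are all hypothetical choices made only so the example is self-contained.

```python
import random
from typing import Callable, List, Tuple

QTable = List[List[float]]


def q_update(q: QTable, x: int, a: int, r: float, x_next: int,
             alpha: float, gamma: float) -> None:
    """One temporal-difference update of Q(x, a), in place."""
    td_target = r + gamma * max(q[x_next])
    q[x][a] += alpha * (td_target - q[x][a])


def q_learning(step: Callable[[int, int], Tuple[float, int]],
               n_states: int, n_actions: int,
               episodes: int = 200, steps_per_episode: int = 100,
               alpha: float = 0.1, gamma: float = 0.9,
               epsilon: float = 0.2, seed: int = 0) -> QTable:
    """Tabular Q-learning with an epsilon-greedy behavior policy (assumed)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # zero init (assumption)
    for _ in range(episodes):
        x = 0  # every episode starts in state 0 (assumption)
        for _ in range(steps_per_episode):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda k: q[x][k])
            r, x_next = step(x, a)
            q_update(q, x, a, r, x_next, alpha, gamma)
            x = x_next
    return q


def toy_step(x: int, a: int) -> Tuple[float, int]:
    """Hypothetical environment: action 1 pays reward 1, states alternate."""
    return (1.0 if a == 1 else 0.0), (x + 1) % 2


if __name__ == "__main__":
    q = q_learning(toy_step, n_states=2, n_actions=2)
    for x, row in enumerate(q):
        print(x, [round(v, 2) for v in row])
```

Because the update bootstraps from the maximum Q-value of the next state rather than from the action actually taken next, the learned values do not depend on the exploration rule, which is what makes the method off-policy.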

C. FQL based controller
The main advantage of RL algorithms is that they do not need a model of the environment, which makes them suitable for our study. However, QL can be inefficient for large state-action spaces and cannot be directly applied in problems involving continuous state-action spaces. In such cases, fine grained discretization of the state-action space helps, but at the cost of an exponential increase in the state space, which makes the learning process slow. In order to overcome these limitations, fuzzy function approximation can be used with QL to achieve a smoother action transition in response to a smooth change in states, without the need for fine grained discretization. This motivates us to design a fuzzy approximation based controller for our MDP.
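FQL integrates the benefits of FIS into Q-learning: it provides good approximations of the Q-function and enables the use of Q-learning in continuous state spaces [35]. In FQL, each fuzzy rule corresponds to a state. A state $X_i$ is defined as: ($x_1$ is $X_{i,1}$ and $x_2$ is $X_{i,2}$ and ...). At each step $t$, the FQL procedure is as follows:
Determine the global action $A(X^t)$ for the state $X^t$ using (11)
Estimate the corresponding Q value $Q(X^t, A)$ using (12)
Take action $A(X^t)$ and observe the new state $X^{t+1}$
Get the reward $r^t$
Estimate the value of the new state $v(X^{t+1})$ using (13)
Calculate the error signal $\Delta Q$ using (14)
Update q-values using (15)
The global action is given by:
$$A(X^t) = \frac{\sum_{i} w_i(X^t)\, A_i}{\sum_{i} w_i(X^t)} \qquad (11)$$
where $w_i(X^t)$ is the firing strength of rule $i$, which is determined by the membership functions of the crisp input $X^t$ using the fuzzy AND operation, and $A_i$ is the corresponding action/consequent of rule $i$ selected by the exploration/exploitation policy. The corresponding Q value is:
$$Q(X^t, A) = \frac{\sum_{i} w_i(X^t)\, q(i, A_i)}{\sum_{i} w_i(X^t)} \qquad (12)$$
where $q(i, A_i)$ is the q-value associated with rule $i$ and its selected action. The value of the new state is:
$$v(X^{t+1}) = \frac{\sum_{i} w_i(X^{t+1})\, q(i, A_{max})}{\sum_{i} w_i(X^{t+1})} \qquad (13)$$
where $w_i(X^{t+1})$ is the firing strength of rule $i$ evaluated from the new state $X^{t+1}$ and $q(i, A_{max})$ is the maximum q-value for rule $i$. The error signal is:
$$\Delta Q = r^t + \gamma\, v(X^{t+1}) - Q(X^t, A) \qquad (14)$$
where $\gamma$ is the discount factor. Finally, the q-values are updated as:
$$q(i, A_i) \leftarrow q(i, A_i) + \alpha\, \Delta Q\, \frac{w_i(X^t)}{\sum_{j} w_j(X^t)} \qquad (15)$$
where $\alpha$ is the learning rate.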
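The FQL steps referenced as (11) to (15) can be sketched in Python as follows. The sketch assumes the usual normalized weighted-average form of fuzzy Q-learning (firing strengths divided by their sum), which is an assumption where the text does not spell the formulas out; all function names and the numbers in the example are hypothetical.

```python
from typing import List, Sequence


def normalize(w: Sequence[float]) -> List[float]:
    """Scale firing strengths so that they sum to 1."""
    total = sum(w)
    if total <= 0:
        raise ValueError("at least one rule must fire")
    return [x / total for x in w]


def global_action(w: Sequence[float], chosen: Sequence[int]) -> float:
    """Step (11): firing-strength weighted average of the per-rule actions."""
    return sum(wi * ai for wi, ai in zip(normalize(w), chosen))


def state_value(w_next: Sequence[float], q_table: List[List[float]]) -> float:
    """Step (13): weighted average of the best q-value of each rule."""
    return sum(wi * max(row) for wi, row in zip(normalize(w_next), q_table))


def fql_update(q_table: List[List[float]], chosen: Sequence[int],
               w: Sequence[float], w_next: Sequence[float],
               reward: float, alpha: float, gamma: float) -> float:
    """Steps (12), (14), (15): update q-values in place, return the error signal."""
    wn = normalize(w)
    q_now = sum(wi * q_table[i][chosen[i]] for i, wi in enumerate(wn))   # (12)
    delta_q = reward + gamma * state_value(w_next, q_table) - q_now      # (14)
    for i, wi in enumerate(wn):
        q_table[i][chosen[i]] += alpha * delta_q * wi                    # (15)
    return delta_q


if __name__ == "__main__":
    q_table = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # 2 rules x 3 operative modes
    w, chosen = [0.2, 0.8], [0, 2]                 # placeholder firing strengths
    mode = round(global_action(w, chosen))         # crisp output -> nearest mode
    delta = fql_update(q_table, chosen, w, [1.0, 0.0], reward=1.0,
                       alpha=0.5, gamma=0.9)
    print(mode, delta, q_table)
```

Each rule's q-value moves in proportion to how strongly that rule fired, so a single observation updates several neighboring rules at once instead of a single table cell.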
In our scenario, we define the membership functions for the traffic load, energy arrival and battery, as well as the actions and reward functions, as follows. Fuzzy sets are defined with the linguistic terms "Very Low", "Low", "Medium", "High" and "Very High", as shown in Fig. 5. Hence, the fuzzification step involves mapping the crisp inputs to the corresponding firing strengths, and the crisp output is obtained by defuzzification using (11). This defuzzification method is commonly applied in zero-order Sugeno fuzzy systems [36] and is known to be computationally efficient. In our controllers, since the set of actions is limited (3 operative modes for each vSC), the crisp output obtained by the defuzzification method using (11) is rounded to the nearest integer, which corresponds to an operative mode of a vSC.

3) Reward:
The reward function is the same as defined for QL control in (10).

D. Control without coordination
This section presents the control methods where each vSC takes actions independently, without any system-wide information. In this case, as opposed to the QL and FQL methods described above, the vSCs do not have the load level of the MBS, and it is not considered in their decision making process. As a result, the states/rules of the un-coordinated methods are defined over the remaining three inputs only (traffic load, energy arrival and battery level). Therefore, the un-coordinated methods have 5 × 5 × 5 = 125 rules/states for the FQL and QL methods, respectively. We call these methods Un-Coordinated FQL (U-FQL) and Un-Coordinated QL (U-QL). The action and reward definitions of the U-FQL and U-QL methods are the same as those defined above for both FQL and QL controls.
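The size of the un-coordinated rule base can be checked by enumerating it. The sketch below assumes three local inputs with five linguistic terms each and three operative modes per vSC; the variable names and the dictionary layout are illustrative choices, not the paper's data structures.

```python
from itertools import product

TERMS = ["Very Low", "Low", "Medium", "High", "Very High"]
LOCAL_INPUTS = ["traffic_load", "energy_arrival", "battery_level"]  # assumed names
N_MODES = 3  # operative modes of a vSC

# One rule/state per combination of linguistic terms over the local inputs.
rules = list(product(TERMS, repeat=len(LOCAL_INPUTS)))

# q-values of every rule-action pair, initialized to 0 before training.
q_table = {rule: [0.0] * N_MODES for rule in rules}

print(len(rules), len(q_table) * N_MODES)
```

Each key of `q_table` is a tuple such as `("Low", "High", "Medium")`, so a coordinated variant would simply add one more input to `LOCAL_INPUTS` and grow the rule base accordingly.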

A. Simulation Scenario
According to the traffic model defined in Section IV-C, user activities are categorized based on [37] as heavy users with an activity of 900 MB/hr and ordinary users with an activity of 112.5 MB/hr. The solar energy traces are generated using the SolarStat tool [30] for the city of Los Angeles. For the PV modules, we have considered the commercial Panasonic N235B. These

B. Off-line Training
In this section we analyze the behavior of the system when the training is performed off-line.
In particular, we consider one year as an episode with a time granularity of one hour, since this allows a correct dimensioning of the solar power system for cellular base stations, as shown in [38]. Hence, every hour the agents choose actions corresponding to one of the possible operative modes of the vSCs, with the goal of minimizing grid energy consumption while avoiding system outage.

1) Training Analysis:
The training phase requires appropriately calibrating the parameters of the algorithm that have the strongest impact. These parameters are the learning rate (α), the exploration parameter (ε) and the discount factor (γ). Moreover, we also adopt a discount process on these parameters in order to guide the exploration toward stability. In particular, we apply an exploration discount factor of 0.5 at the beginning of each epoch until the agent reaches the minimum level of exploration, which is equal to 3%.
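The exploration discount process can be sketched as a simple schedule. The factor 0.5 and the 3% floor come from the text; the function name is hypothetical, and the sketch assumes that the first epoch runs with the initial exploration rate and that the discount is applied from the second epoch onward.

```python
def exploration_schedule(eps_initial, n_epochs, discount=0.5, eps_min=0.03):
    """Exploration rate used in each epoch.

    The rate is multiplied by `discount` before every new epoch and is
    never allowed to fall below `eps_min`.
    """
    schedule = []
    eps = eps_initial
    for _ in range(n_epochs):
        schedule.append(eps)
        eps = max(eps_min, eps * discount)
    return schedule


rates = exploration_schedule(0.5, 6)
print(rates)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03]
```

Starting from an initial exploration rate of 0.5, the floor is reached in the sixth epoch, after which the agent keeps exploring at a constant 3%.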
The cumulative reward of the FQL and QL methods for a system composed of 3 vSCs and an MBS with a residential area traffic profile is shown in Fig. 6. The cumulative reward is normalized with respect to a cumulative reward bound, which is determined off-line using DP [9]. As can be seen from Fig. 6a, FQL based control can achieve a cumulative reward very close to the optimal bound (more than 97%). In addition, the choice of training parameters affects the convergence of the FQL control. The cumulative reward is shown to be sensitive to the exploration and learning rate parameter choices, as it only reaches the 85% level in the case of α = 0.01 and ε = 0.5. The cumulative reward of QL based control in a residential area is shown in Fig. 6b. In the best case, the cumulative reward obtained by QL (94%) is close to the optimal bound but lower than that of FQL.
The normalized cumulative reward for a system of 3 vSCs deployed in an office area is shown in Fig. 7. These results show that, in the best case, FQL and QL controls in an office area are able to gain a cumulative reward of about 99% and 97% of the optimal bound, respectively. The results in Fig. 6 and Fig. 7 also show that FQL based control is able to accumulate rewards faster than QL. In the residential scenario, the FQL method is able to get around 95% of the reward in less than 5 epochs, whereas QL requires about 15 epochs to reach the same level of cumulative reward. For the office scenario, FQL achieves a cumulative reward of about 97% in less than 5 epochs, whereas QL requires about 20 epochs to reach the 96% level.
Moreover, a higher initial exploration rate is important for both FQL and QL, since it allows the agent to try more actions at random during the initial phase of the training. Thus, the agent in the vSC discovers a higher number of rule-action / state-action pairs for FQL and QL, respectively, which, in turn, helps to avoid getting trapped in local optima.
The same analysis has also been applied to scenarios with 5, 7, 10, 12 and 15 vSCs. The maximum cumulative reward obtained by both QL and FQL based controllers in residential and office areas is shown in Fig. 8. The results show that FQL is able to accumulate a higher reward than QL, 35% more with 15 vSCs. Note that the maximum cumulative reward decreases as the number of vSCs increases. This is due to the higher load injected into the system by the vSCs, which generates a higher system drop rate and, in turn, reduces the immediate reward. Moreover, as the number of vSCs increases, conflicts among the actions of the agents can emerge, which can reduce the obtainable immediate reward. In addition, the cumulative reward is higher in an office area. In fact, the peak of traffic in the residential profile occurs during the early night (11 pm), as shown in Fig. 3, when the energy income is low, thus forcing the agents to switch off or to choose actions with more computational offloading to the MBS.
Finally, the maximum cumulative reward gap between FQL and QL increases with the number of vSCs, as can be seen in Fig. 8a and Fig. 8b. This highlights the better suitability of FQL control, especially in a network with a higher number of vSCs.
The functional split selection behaviors for an average winter day are shown in Fig. 9 and Fig. 10 for residential and office area, respectively. Moreover, the functional split selection behaviors for an average summer day are shown in Fig. 11 and Fig. 12 for off-line, FQL and QL controls in residential and office area, respectively. These results show that both the FQL and QL policies usually switch off most of the vSCs during very low traffic periods, as done by the off-line policy. However, the policies differ substantially in their functional split selection when switched on. In this regard, QL adopts a more conservative approach by selecting the PHY-RF split option more often than the other solutions. In fact, FQL behaves more similarly to the off-line solution thanks to its higher flexibility in policy selection for both MAC-PHY and PHY-RF splits. In the residential area on an average winter day, the MAC-PHY selection rates are 51%, 46% and 23% for off-line, FQL and QL controls, respectively. On an average summer day, the residential area MAC-PHY selection rate rises to 77%, 68% and 34% for the off-line, FQL and QL solutions, respectively. On the other hand, on an average winter day in the office area, the MAC-PHY selection rates are 62%, 50% and 34% for off-line, FQL and QL controls, respectively. On an average summer day, the office area MAC-PHY selection rate rises to 81%, 70% and 44% for the off-line, FQL and QL solutions, respectively. Hence, the FQL policy adapts better to the seasonal energy income. It is also interesting to note that, for an office area, the off-line solution predominantly alternates between switch-off and the MAC-PHY split. This can be clearly seen in Fig. 12, where the off-line solution has a 0% PHY-RF selection rate on an average summer day.
The corresponding results are reported in Table II. These results show that FQL is able to achieve very low grid energy consumption, which is only 4% to 5% higher than the value obtained by the off-line bound for both office and residential profiles. On the other hand, the QL policy consumes more grid energy, about 8% to 10% more than the off-line policy, for both office and residential area traffic. This can also be deduced from the policy behaviors in Fig. 9, Fig. 10, Fig. 11 and Fig. 12, which show that FQL has a higher MAC-PHY selection rate than QL. In particular, the FQL policy adapts to the higher energy income in summer months by increasing the selection rate of the MAC-PHY split. This behavior is also observed in the off-line solution.
With a higher MAC-PHY selection rate, more energy saving can be achieved, since most of the BB processing functions are performed locally at the vSCs. The results in Table II also show the advantage of FQL over QL. In residential area traffic, the FQL controller achieves an energy saving of up to 12% and an average drop rate of up to 10% less than the QL control. Moreover, the FQL controller is able to achieve an energy saving of up to 17% and an average drop rate of up to 8% less than the QL control in an office area traffic profile. The better performance of FQL is aligned with the higher cumulative rewards obtained by the FQL controller, as shown in Fig. 8.

C. Policy Validation
In this section we evaluate the behavior of the system in a real deployment scenario, with training performed off-line through simulation. In detail, we validate the proposed FQL and QL based controllers using a new environment, characterized by energy arrival and traffic demand profiles that differ from those of the environment used for the simulated training. In this case, we use the algorithms with pre-trained Q-values and an exploration rate of 5%. The validation of the policies, along with the policy evaluation in the training environment, for 3, 5, 7, 10, 12 and 15 vSCs over a year of operation is shown in Table III. The validation results in Table III show that both FQL and QL are able to adapt their behaviors to the new validation environment. This is confirmed by both the grid energy and average drop rate performances, which are very close to the corresponding policy evaluation results. These results suggest that starting from Q-values / rule-action consequents (for QL / FQL, respectively) trained in simulation, while continuously exploring new actions in the new environment and updating the corresponding Q-tables, is a viable approach in real deployment scenarios.
Here, we evaluate the proposed FQL and QL based controls in a run-time deployment scenario without pre-training, i.e., the vSCs learn on the job while they are in operation. In this case, all the Q-values / rule-action consequents are initialized to 0 for QL and FQL, respectively. An exploration/exploitation strategy is used for the learning of the vSCs. The resulting cumulative rewards are shown in Fig. 15. The results show that FQL is able to gain higher cumulative rewards than QL starting from the first year of operation. As a result, it is more suitable than QL for run-time training of the vSCs. Moreover, lower values of the learning rate are better for FQL, whereas QL requires a relatively higher learning rate. The exploration rate parameters shown in Fig. 15 are initial exploration rates, which are continuously discounted as the training progresses until reaching the minimum level of exploration, which is set at 5%.
The grid energy consumption performances of both FQL and QL controls for run-time training and operation are shown in Fig. 16. Moreover, the average drop rate performances of the run-time controls are shown in Fig. 17. These results show that the FQL policy is more suitable for run-time application, both in terms of grid energy consumption and average drop rate. In the first year of deployment, run-time FQL achieves an energy saving ranging from 10% to 17%, with an average drop rate 2.5% to 13% lower than that of run-time QL.
However, compared to the policy validation results based on pre-trained agents, shown in Table III, the run-time training and operation results shown in Fig. 16 and Fig. 17 display lower performances. For the FQL controller, the validation results of pre-trained agents shown in Table III achieve an energy saving ranging from 5% to 9.5%, with a drop rate 0.4% to 1% lower than the run-time results in Fig. 16 and Fig. 17. Moreover, for QL, the validation results in Table III show an energy saving ranging from 7% to 19%, with a drop rate 3% to 12% lower than the run-time results. Hence, to get closer to the optimization goals, it is better to initialize the vSCs' agents with some knowledge prior to their deployment. This can be in the form of simulated training of the vSCs, as shown in VI-B. As a result, training the vSCs in a simulated environment prior to their deployment, and allowing them to explore new knowledge while in operation, is a more appropriate approach in real deployments.

VII. CONCLUSIONS
In this paper we have proposed a network scenario where the computational processes of the vSCs, powered solely by energy harvesters and batteries, may be shared with a grid-connected central MEC server at the MBS for part of their BB processing. We have stated the energy minimization problem and proposed multi-agent RL to solve it. Distributed Fuzzy Q-Learning and Q-Learning on-line algorithms are tailored for our purposes. Coordination among the multiple agents is achieved by broadcasting system level information (i.e. the traffic load at the MBS) to the independent learners. Finally, we have evaluated the network performance (in terms of energy consumption and traffic drop rate) of our multi-agent RL solutions with different levels of coordination and compared them against the off-line performance bound and static solutions.
Our results confirm that coordination via broadcasting may achieve higher system level gains than un-coordinated solutions and cumulative rewards closer to the off-line bounds.
Moreover, our analysis allows us to evaluate the benefits of continuous state/action representation for the learning algorithms in terms of faster convergence, higher cumulative reward and better adaptation to changing environments.