Self-organized femtocells: a Fuzzy Q-Learning approach

We introduce in this paper the innovative concept of self-organized femtocells for future generation broadband cellular networks. Since the home is the basic unit at which femtocells will be located, their deployment will be massive and their number and position unknown to the operator. This requires femtocells to be autonomous and self-organized, and able to work without human intervention. We propose self-organization to be implemented through Reinforcement Learning (RL) and femtocells to make transmission decisions as a multiagent system, with the objective of maximizing the system capacity and not generating additional interference to the traditional macrocell network. In particular, we manage the femto-to-macro aggregated interference, in realistic wireless settings, by means of Q-Learning (QL) techniques, which allow the femtocells to learn online and distributively the most appropriate resource allocation policy by continuous interactions with the environment. However, QL is based on discrete representation of state and action spaces, which makes the proposed approach not independent of the environment and designer criterion, since it requires a significant human intervention in the definition of the state and action spaces. As a result, we propose to optimize the self-organization capabilities of the proposed scheme by combining QL with the Fuzzy Inference System theory. We then propose a Fuzzy Q-Learning approach which allows avoiding the subjectivity of the QL design with continuous state and action representation, besides improving performance and convergence capabilities. We evaluate simulation results in a 3rd Generation Partnership Project (3GPP) compliant scenario and we compare them to heuristic approaches. Results will show the unique ability of these RL approaches to self-adapt to the dynamics of realistic wireless scenarios. 
Finally, we discuss the implementability of the proposed schemes in 3GPP systems, and in terms of memory and computational requirements.


Introduction
In recent years, global mobile data traffic has increased by more than 130 % per year, and it is predicted to continue growing at a compound annual growth rate of 78 %. This growth comes as a result of the proliferation of data-oriented devices, i.e. smartphones, tablets, laptops with mobile broadband, etc., and with them, the emergence and availability of services and applications. There is therefore an urgent need for further wireless capacity. In this sense, it has been recently observed [1] that since 1957 there has been a million-fold increase in wireless capacity. If broken down into constituent components, a 25-times improvement is due to a wider spectrum, a 5-times gain to chopping the spectrum into smaller slices, a 5-times enhancement to advances in modulation and coding schemes, and a 1,600-times increase to reduced cell sizes and transmit distances that allow efficient spatial reuse of spectrum. Currently, wireless cellular systems with one Long Term Evolution-Advanced (LTE-A) Base Station (BS), whose standards are defined in 3rd Generation Partnership Project (3GPP) release 10 [2], can achieve a performance close to the optimal information theoretic capacity limit. As a result, further gains strongly depend on the development of an advanced network topology that reduces the distance between the transmitter and the receiver. This is where the innovative concept of femtocells [1] naturally fits in.
Femtocells are short range, low power, low cost cellular BSs designed to serve small areas such as a home or an office environment, providing radio coverage of a certain cellular network standard. They are connected to the service provider via a broadband connection, e.g., Digital Subscriber Line (DSL) or optical fiber. Due to these characteristics, they can be deployed far more densely than macrocells, and femtocell spectrum can be reused more efficiently. Femtocells enable reduced distance between the transmitter and the receiver, and reduced transmit power while maintaining good indoor coverage, since penetration losses and outdoor propagation attenuation partially insulate the femtocell from inter and intracell interference. Besides, as femtocells serve around 1-5 users, they can devote a larger portion of resources to the connected users compared to macrocells. Indoor subscribers are served through the user-installed femtocell, providing high data rates and reliable traffic, while the operator reduces traffic on the macrocell, thus focusing only on outdoor and mobile users.
In this novel cellular ecosystem, characterized by the coexistence of heterogeneous types of nodes, a notable number of network parameters with complex interdependencies has to be considered and optimized without relying on human intervention. This is proposed to be achieved through the concept of a self-organized macro-femto network [3,4], able to autonomously self-configure and self-optimize the complex Radio Access Network (RAN) without planning actions, reacting instead to the dynamics of the network. This will increase network performance and service quality while avoiding the expensive human interventions of traditional cellular networks.
From the operators' perspective, since a large amount of traffic can be offloaded from macrocells, the number of macrocell sites can be reduced, which would result in important Capital Expenditures (CAPEX) savings in the RAN and in the backhaul. This will also lead to associated savings on the Operational Expenditure (OPEX). Preliminary studies have shown that the 60,000 USD/year maintenance cost of a macrocell reduces to 10,000 USD/year for an equivalent capacity femto network [1], and that further savings can be achieved on the OPEX through the implementation of a self-organized femtocell network.
Despite the economic and technical benefits promised by the introduction of this new technology, the deployment of femtocells will also cause some problems to operators.
To achieve the expected spectral efficiency, macro and femto layers should operate in the same frequency band, and therefore the interference becomes more random and harder to control. In addition, since femtocells will be placed by end consumers, their number and position are unknown to the network operator. As a consequence, the interference cannot be handled by means of centralized frequency planning, which generates a distributed interference management problem.
In this paper we investigate autonomous interference coordination in a heterogeneous macro-femto network, from the point of view of the aggregated interference that a femtocell network may generate at macrocell users. The interpretation we give to these autonomous decisions relies on the theory of self-organization, which we propose to implement following a Reinforcement Learning (RL) formulation [5], since it takes decisions based on sensed environmental information, allowing online learning. Each femtocell is considered as an autonomous agent able to adjust to the interference of the surrounding environment. Femtocells are therefore modeled as a multiagent system [6], lacking a central authority and working in a coordinated fashion to find a stable and reliable behavior function.
Specifically, among different options, the authors have already proposed to solve the interference management problem from femto to macro system through Q-Learning (QL) [7,8], since it is the most representative RL algorithm and it provides the most meaningful results for our application, as demonstrated in [9]. In addition, in the framework of 3GPP releases 8, 9 or 10, we have also proposed QL supported by a Partially Observable Markov Decision Process (POMDP), to be executed without relying on information exchanged between macro and femto layers through the X2 interface [10]. QL works by quantifying, through a Q-function, the quality of a given action in a certain state of the environment. The Q-function is usually represented by means of a Q-table, where the so-called Q-values are stored and updated online. Consequently, the states characterizing the environmental situation and the available actions have to be represented by discrete values, which makes the use of thresholds mandatory. Selecting these thresholds for the state representation, and setting the number and the values of the available actions, entails an important intervention of the learning system designer, which is against the intrinsic philosophy of self-organization. When designing the learning system, the selection of the number of states representing the environment and of the actions available in each state plays an important role in the agent behavior. The size of those sets directly affects the system adaptability and, therefore, its performance. Besides, it is directly related to the feasibility of the knowledge representation, i.e. when the number of state-action pairs is large or the input variables are continuous, the memory required to store the Q-table may become impracticable, as may the required learning time. Also, using fixed thresholds for defining discrete states may lead to sudden state transitions or abrupt action changes.
A solution to these weak points is to use a form of continuous state and action representation that does not need near-infinite Q-tables. This would allow building a system capable of working independently of the scenario and designer criterion, which is coherent with the self-organization requirements of future networks. To this end, we propose to improve the QL algorithm by introducing a Fuzzy Inference System (FIS) in order to represent state and action spaces continuously. This approach is called Fuzzy Q-Learning (FQL). It was introduced by Berenji in [11] and then extended by Glorennec and Jouffe in [12] and [13]. In the field of wireless communications, it was applied in [14,15] for control decisions in cognitive and Universal Mobile Telecommunications System (UMTS) networks, respectively, and in [16] for the implementation of self-organized antenna configuration. In addition to the benefits already mentioned, FQL offers other interesting advantages such as: (1) a more compact and effective expertness representation mechanism and (2) the possibility of speeding up the learning process by incorporating offline expert knowledge in the inference rules. Some preliminary results on system capacity for a low-complexity scenario for the proposed approach have been presented by the authors in [17]. Finally, we also discuss in this paper the practical implementability of the proposed approaches in 3GPP systems and in state of the art processors.
To sum up, the paper's main contributions are: (1) We define a QL algorithm to solve the femto-to-macro interference management problem. (2) We improve the self-organization capabilities of femtocells by introducing fuzzy logic. We propose FQL to optimize the self-organization capabilities, enhance the system performance, increase the learning speed and eliminate subjectivity in the design of state and action spaces. (3) We discuss the implications of implementing the proposed schemes in 3GPP systems, together with the memory and computational requirements, evaluating whether they are compliant with state of the art processors.
This paper is structured as follows. In Sect. 2, we present the system model. In Sect. 3, we describe the QL basics and then we present the proposed QL algorithm. In Sect. 4, we present the proposed FQL algorithm. In Sect. 5, we describe relevant simulation results. In Sect. 6, we discuss the practical implementation requirements, in terms of 3GPP standard, memory and computational cost. Finally, Sect. 7 summarizes the main conclusions.

System model
We consider a heterogeneous wireless network composed of a set $\mathcal{M}$ of macrocells that coexist with a set $\mathcal{F}$ of femtocells. The $M = |\mathcal{M}|$ macrocells form a regular hexagonal network layout with inter-site distance $D$, and provide coverage over the entire network, consisting of both indoor and outdoor users. The $F = |\mathcal{F}|$ femtocells are placed indoors within the macro-cellular coverage area following the 3GPP dual stripe deployment model [18]. Both macrocells and femtocells operate in the same frequency band, which increases the spectral efficiency per area through spatial frequency reuse.
An Orthogonal Frequency Division Multiple Access (OFDMA) downlink is considered, where the system bandwidth $B$ is divided into $R$ Resource Blocks (RBs), with $B = R \cdot B_{RB}$. An RB represents one basic time-frequency unit that occupies the bandwidth $B_{RB}$ over time $T$. Associated with each macro and femto BS are $U^M$ macro and $U^F$ femto users, respectively. The multiuser resource assignment that distributes the $R$ RBs among the $U^M$ macro and $U^F$ femto users is carried out by a proportional fair scheduler.
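We denote by $\mathbf{p}^f = (p_1^f, \ldots, p_R^f)$ the transmission power vector of femto BS $f$, with $p_r^f$ denoting the downlink transmission power in RB $r$; $p_r^m$ is defined analogously for macro BS $m$. The Signal to Interference plus Noise Ratio (SINR) of macrouser $u_m \in \mathcal{U}^M$, who is allocated RB $r$ by macrocell $m$, is:

$$\mathrm{SINR}_r^{u_m} = \frac{p_r^m h_{m u_m}^r}{\sum_{n \in \mathcal{M}, n \neq m} p_r^n h_{n u_m}^r + \sum_{f \in \mathcal{F}} p_r^f h_{f u_m}^r + \sigma^2}$$

where $h_{m u_m}^r$ accounts for the link gain between the transmitting macro BS $m$ and its macrouser $u_m$, while $h_{n u_m}^r$ and $h_{f u_m}^r$ represent the link gain of the interference that BSs $n \in \mathcal{M}$ and $f$ impose on macrouser $u_m$, respectively. Finally, $\sigma^2$ denotes the thermal noise power.
Likewise, the SINR of femtouser $u_f \in \mathcal{U}^F$, who is allocated RB $r$ by femtocell $f$, is:

$$\mathrm{SINR}_r^{u_f} = \frac{p_r^f h_{f u_f}^r}{\sum_{n \in \mathcal{F}, n \neq f} p_r^n h_{n u_f}^r + \sum_{m \in \mathcal{M}} p_r^m h_{m u_f}^r + \sigma^2}$$

where $h_{f u_f}^r$ is the link gain between femto BS $f$ and its femtouser $u_f$, and $h_{n u_f}^r$ and $h_{m u_f}^r$ indicate the link gain between BS $n \in \mathcal{F}$ and $m$ and femtouser $u_f$, respectively.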
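As a minimal sketch of the SINR model described above (useful received power over aggregated co-channel interference plus thermal noise), the following Python snippet evaluates the SINR on one RB and the corresponding Shannon bound. The function names and all numeric values are hypothetical and chosen only for illustration; quantities are in linear units.

```python
import math

def sinr(p_serving, h_serving, interferers, noise_power):
    """SINR on one RB (linear scale).

    interferers: iterable of (tx_power, link_gain) pairs, one per
    co-channel BS (macro or femto) other than the serving one.
    """
    interference = sum(p * h for p, h in interferers)
    return (p_serving * h_serving) / (interference + noise_power)

def shannon_bound(bandwidth_hz, sinr_linear):
    """Upper bound on the capacity (bit/s) of one RB of the given bandwidth."""
    return bandwidth_hz * math.log2(1.0 + sinr_linear)

# Hypothetical example: one macrouser, one interfering macro BS, one femto BS.
example_sinr = sinr(
    p_serving=1.0, h_serving=1e-9,
    interferers=[(1.0, 1e-11), (0.01, 1e-10)],
    noise_power=1e-13,
)
example_capacity = shannon_bound(180e3, example_sinr)  # 180 kHz is an assumed RB bandwidth
```

Lowering the femto transmit power in the `interferers` list raises the macrouser SINR, which is the lever the learning agents act on.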
The capacities achieved by macrocell $m$ and femtocell $f$, represented by $C^m$ and $C^f$, are the sums over the individual capacities of all RBs, each of which is upper bounded by Shannon's capacity. For $i \in \{m, f\}$:

$$C^i = \sum_{r=1}^{R} C_r^i, \qquad C_r^i \leq \frac{B}{R} \log_2\left(1 + \mathrm{SINR}_r^i\right)$$

where $C_r^f$ and $C_r^m$ are the capacities of femto and macrocell BSs in RB $r$, respectively, and $\mathrm{SINR}_r^i$ is the SINR of the user served by BS $i$ in RB $r$. Then, the total capacity of the network amounts to:

$$C = \sum_{m \in \mathcal{M}} C^m + \sum_{f \in \mathcal{F}} C^f$$

Learning with discrete state and action spaces: distributed Q-Learning
In Machine Learning (ML) literature, the ability to learn new behaviors online and automatically adapt to the temporal dynamics of the system is commonly associated with RL [19]. At each learning iteration, the agent perceives the state of the environment and takes an action to transit to a new state. A scalar cost is received, which evaluates the quality of this transition. In a wireless setting, the single agent approach can be applied in scenarios characterized by only one decision maker. On the other hand, the multiagent approach is applied in situations where the intelligence has to be distributed across multiple nodes. In femtocells with co-channel operation, the interference cannot be handled by the operator by means of centralized network planning, because the number and location of femtocells are unknown to the operator, and the absence of coordination with the macro network generates a distributed interference management problem. We map the problem onto a multiagent system. The learning process of each femto node is characterized by $R$ parallel tasks, one per RB. The theoretical framework can be found in stochastic games [20], which are non-cooperative games defined by the quintuple $\{\mathcal{F}, \mathcal{S}, \mathcal{A}, P, C\}$, where:
• $\mathcal{F} = \{1, 2, \ldots, F\}$ is the set of agents (femto BSs).
• $\mathcal{S} = \{s_1, \ldots, s_k\}$ is the set of possible states and $k$ is the number of states of the environment the agent can perceive.
• $\mathcal{A} = \{a_1, \ldots, a_l\}$ is the set of actions and $l$ is the number of available actions per state.
• $C : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the cost function, which is fed back to each agent after the execution of a certain action.
• $P$ is a probabilistic transition function, defining the probability of migrating from one state to another, given the execution of a certain joint action.
For each independent agent and learning process, the state transition function probabilistically specifies the next state of the environment as a function of its current state and the joint action. The cost function specifies the expected instantaneous cost as a function of the current state and action. The objective is to find a policy that minimizes the cost of each state $x$, i.e. an optimal policy for the infinite-horizon discounted model [19].
We assume that the environment is a finite-state, discrete-time stochastic dynamical system. The interactions between the multiagent system and the environment at each time instant $t$ and for RB $r$ consist of the following sequence, where the notation referring to $t$ and $r$ is omitted for the sake of simplicity.
• The agent $f$ senses the state $x \in \mathcal{S}$.
• Based on $x$, agent $f$ selects an action $a \in \mathcal{A}$.
• As a result, the environment makes a transition to the new state $y \in \mathcal{S}$.
• The transition to the state $y$ generates a cost $c = c(x, a)$ for agent $f$.
• The cost $c$ is fed back to the agent and the process is repeated.
The objective of each agent is to find an optimal policy $\pi^*(x) \in \mathcal{A}$ for each $x$, to minimize some cumulative measure of the cost $c(x, a)$ received over time. For each agent $f$ and learning process $r$, we define an evaluation function, denoted by $Q(x, a)$, as the expected total discounted cost over an infinite time horizon:

$$Q(x, a) = E\left\{ \sum_{t=0}^{\infty} \gamma^t c(x(t), a(t)) \,\middle|\, x(0) = x,\ a(0) = a \right\} \qquad (5)$$

where $0 \leq \gamma < 1$ is a discount factor, and by $x(t)$ we indicate the state of the environment at time $t$. If the selected action $a$ at time $t$, following the policy $\pi(x)$, corresponds to the optimal policy $\pi^*(x)$, the Q-function is minimized with respect to the current state. Let $P_{x,y}(a)$ be the transition probability from state $x$ to state $y$ when action $a$ is executed. Then, (5) can be expressed as:

$$Q(x, a) = C(x, a) + \gamma \sum_{y \in \mathcal{S}} P_{x,y}(a)\, Q(y, a') \qquad (6)$$

where $C(x, a) = E\{c(x, a)\}$ denotes the expected value of $c(x, a)$, $a(t)$ represents the action taken at time $t$, and $a' = a(t + 1)$ is the action taken in state $y$. Equation (6) indicates that the Q-function of the current state-action pair, for each agent $f$ and learning process $r$, can be represented in terms of the expected immediate cost of the current state-action pair and the Q-function of the next state-action pairs. The task of QL is to determine an optimal stationary policy $\pi^*$ without knowing $C(x, a)$ and $P_{x,y}(a)$, which makes it well suited for learning a power allocation policy in a femtocell system. The principle of Bellman's optimality [7] assures that there is at least one policy $\pi^*$ for which the Q-function is minimized in every state. Applying Bellman's criterion, we first have to find an intermediate minimum of $Q(x, a)$, denoted by $Q^*(x, a)$, where the intermediate evaluation function for every possible next state-action pair $(y, a')$ is minimized, and the optimal action is performed with respect to each next state $y$. $Q^*(x, a)$ is:

$$Q^*(x, a) = C(x, a) + \gamma \sum_{y \in \mathcal{S}} P_{x,y}(a) \min_{a' \in \mathcal{A}} Q^*(y, a')$$

Then, we can determine the optimal action $a^*$ with respect to the current state $x$. In other words, we can determine $\pi^*$. Therefore, $Q^*(x, a^*)$ is minimum, and can be expressed as:

$$Q^*(x, a^*) = \min_{a \in \mathcal{A}} Q^*(x, a)$$

As a result, the Q-value $Q(x, a)$ represents the expected discounted cost of executing action $a$ at state $x$ and then following policy $\pi$ thereafter.
The QL process tries to find $Q^*(x, a)$ in a recursive manner using the available information $(x, a, y, c)$, where, again, $x$ and $y$ are the states at time $t$ and $t + 1$, respectively, and $a$ and $c$ are the action taken at time $t$ and the immediate cost due to $a$ at $x$, respectively. The QL rule to update the Q-values relative to agent $f$ and learning process $r$ is:

$$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ c + \gamma \min_{a' \in \mathcal{A}} Q(y, a') - Q(x, a) \right] \qquad (9)$$

where $\alpha$ is the learning rate. For more details about RL and QL the reader is referred to [5,7].
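To make the recursive update concrete, here is a minimal sketch of one Q-value update from an observed tuple (x, a, y, c). It uses the standard one-step Q-learning rule written for cost minimization, so the bootstrap term is the minimum Q-value of the next state. The function name, the list-of-lists table layout and the default parameter values are assumptions for illustration.

```python
def q_update(q_table, x, a, cost, y, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in cost-minimization form.

    q_table[state][action] holds the current Q-value estimates.
    alpha is the learning rate, gamma the discount factor.
    """
    best_next = min(q_table[y])  # lowest expected cost reachable from y
    td_error = cost + gamma * best_next - q_table[x][a]
    q_table[x][a] += alpha * td_error
    return q_table[x][a]

# Hypothetical 2-state, 2-action table initialised to the same value.
table = [[10.0, 10.0], [10.0, 10.0]]
new_value = q_update(table, x=0, a=1, cost=2.0, y=1)
```

Only the visited state-action pair changes, so the update needs neither the expected cost nor the transition probabilities.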

Q-Learning design
In this section we present the set of available actions $\mathcal{A}$ for the femto BSs, the state space $\mathcal{S}$ and the selected cost function $C$, which define our QL method.
(1) Action The set of possible actions $\mathcal{A} = \{a_1, \ldots, a_l\}$ consists of the $l$ transmission power levels that the femto BS can assign to RB $r$.
(2) State The state of the environment for RB $r$ and femtocell $f$ is defined by the vector $x = \{\overline{Pow}^f, \bar{C}_r^m, \bar{C}_r^f\}$, whose components are:
• Femtocell total transmission power indicator $\overline{Pow}^f$: One of the requirements of femtocells is to transmit at low power levels. In order to guarantee that the femtocell total transmission power over all RBs is below the threshold $P_{max}^F$, we include the $\overline{Pow}^f$ indicator in the state definition. $\overline{Pow}^f$ is a binary indicator that signals whether the total transmission power of femtocell $f$ is below or above $P_{max}^F$.
• Macrocell capacity indicator $\bar{C}_r^m$: We consider a macrocell capacity indicator since another main requirement for femtocells is not to jeopardize the macrocell performance. We define $\bar{C}_r^m$ as a binary indicator that determines whether the capacity of the macrouser $u_m$ in RB $r$ of macrocell $m$ that is most affected by the activity of femtocell $f$ is above or below the macrocell capacity threshold $C_{Th}^M$, which is the minimum capacity per RB that the macrocell has to fulfill.
• Femtocell capacity indicator $\bar{C}_r^f$: In our design, we aim to maintain the above mentioned constraints, but we also want to maximize the capacity reached by the femtocells, so as to control the femto-to-femto interference. This is why the third component of the state vector is a femtocell capacity indicator $\bar{C}_r^f$. We normalize $C_r^f$ with respect to $C_{min}^F$, where $C_{min}^F$ is the minimum capacity that guarantees an acceptable service to the femtousers, and we divide the possible values into four intervals. Notice that $\bar{C}_r^f$ has been defined based on simulation results and on the authors' expert knowledge. In order to eliminate the inherent subjectivity in the state definition, it should be continuously represented, which is what we propose in Sect. 4.3 with the introduction of FQL.
(3) Cost The cost assesses the immediate return incurred due to the assignment of a certain action in a given state. The considered cost for RB $r$ and femtocell $f$ is given by Eq. (14), where $K$ is a constant initialized to a high value. The rationale behind this cost function is that if the macrocell capacity at RB $r$ is below the threshold $C_{Th}^M$ and the femtocell total transmission power is above the maximum allowed $P_{max}^F$, the agent will receive a high cost value $K$. On the other hand, if both constraints are fulfilled, the femtocell will focus on maximizing its own capacity and the macrocell capacity at RB $r$.
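The discrete state described in item (2) can be sketched as follows: two binary indicators and one four-level indicator give 2 x 2 x 4 = 16 states per RB. The encoding of the binary indicators (0 when the constraint is met, 1 otherwise) and the interval edges `CF_EDGES` for the normalized femtocell capacity are hypothetical choices made for this illustration; they are not the values used by the authors.

```python
from bisect import bisect_right

# Hypothetical edges splitting the normalized femto capacity into four intervals.
CF_EDGES = (0.5, 1.0, 1.5)

def encode_state(total_power, p_max, macro_capacity, c_th, femto_capacity, c_min):
    """Map measurements to the discrete state (Pow, C_m, C_f) for one RB."""
    pow_ind = 0 if total_power <= p_max else 1    # assumed: 0 = power constraint met
    cm_ind = 0 if macro_capacity >= c_th else 1   # assumed: 0 = macro capacity above threshold
    cf_ind = bisect_right(CF_EDGES, femto_capacity / c_min)  # interval index 0..3
    return pow_ind, cm_ind, cf_ind

def state_index(state):
    """Flatten the state tuple into a Q-table row index in [0, 15]."""
    pow_ind, cm_ind, cf_ind = state
    return (pow_ind * 2 + cm_ind) * 4 + cf_ind

example_state = encode_state(0.05, 0.1, 2.0, 1.2, 1.2, 1.0)
example_row = state_index(example_state)
```

Every threshold in this encoding is a design decision, which is the subjectivity that the fuzzy representation is meant to remove.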

Q-Learning procedure
Q-Learning (QL) works by letting each learning task of the femto BS $f$ estimate a Q-function, whose values, the Q-values $Q(x, a)$ for each $x \in \mathcal{S}$ and $a \in \mathcal{A}$, represent the appropriateness of selecting action $a$ in state $x$. The Q-function represents the expert knowledge of the agent, and has to be stored in a representation mechanism. The most intuitive and common representation mechanism is a lookup table. In this way, QL represents its Q-function in a Q-table, one for each RB, with a dimension of $k \times l$.
The QL approach is based on an iterative algorithm which consists in estimating the Q-values on the run. The estimation for agent $f$ and RB $r$ is performed as follows: (1) all Q-values are initialized to an arbitrary high number; (2) the agent measures the current state of the environment for each RB; (3) then, it selects the action $a$ corresponding to the lowest Q-value; (4) the agent executes the selected action, which causes a transition to a new state; (5) the environment sends a feedback $c$ to the agent; (6) the agent updates $Q(x, a)$ based on Eq. (9); (7) the cycle is repeated from step (3). In any learning algorithm, exploration is needed in the action selection process in order to guarantee an optimal solution. We include exploration through an $\epsilon$-greedy action selection policy, i.e. with probability $1 - \epsilon$ we choose the action $a$ associated with the minimum Q-value, and with probability $\epsilon$ the action is selected randomly. For the sake of clarity, in Algorithm 1 we describe the pseudo-code.
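The procedure above can be sketched for a single learning task as follows. This is not the authors' Algorithm 1; it is a minimal illustration in which `env_step` is a hypothetical callback returning the next state and the cost, and the toy environment, parameter values and seed are assumptions.

```python
import random

def select_action(q_row, epsilon, rng=random):
    """Epsilon-greedy selection for cost minimization."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore: random action
    return min(range(len(q_row)), key=q_row.__getitem__)  # exploit: lowest Q-value

def run_q_learning(env_step, initial_state, n_states, n_actions, iterations,
                   alpha=0.5, gamma=0.9, epsilon=0.1, q_init=1e3, seed=0):
    """Run tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[q_init] * n_actions for _ in range(n_states)]  # step (1): high initial values
    x = initial_state                                    # step (2): sense the state
    for _ in range(iterations):
        a = select_action(q[x], epsilon, rng)            # step (3)
        y, c = env_step(x, a)                            # steps (4) and (5)
        q[x][a] += alpha * (c + gamma * min(q[y]) - q[x][a])  # step (6)
        x = y                                            # step (7)
    return q

def toy_env(x, a):
    """Hypothetical one-state environment with a fixed cost per action."""
    costs = (5.0, 1.0, 3.0)
    return 0, costs[a]

learned = run_q_learning(toy_env, initial_state=0, n_states=1, n_actions=3,
                         iterations=5000)
best_action = min(range(3), key=learned[0].__getitem__)
```

In the toy environment the cheapest action is index 1. Without the random exploration step, the greedy rule would keep selecting the first action it tried, because every untried action still holds the high initial value.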

Distributed Q-Learning with continuous state and action spaces
Our experience in working with QL in dynamic scenarios is that the performance of interference management and the convergence capabilities strongly depend on the granularity of the state and action space definitions. We show, as an example, how the system performance in terms of macrocell capacity decreases when reducing the dimension of the action space. In particular, we observe that by reducing the set of available actions by 17 %, the macrocell system performance may experience a reduction of up to 10 %, as presented in more detail in Sect. 5.
The use of discrete state and action sets may result in an inexact state representation and/or the selection of an insufficiently accurate action in a given situation. This is why we propose to improve the QL approach by introducing FIS in order to represent state and action spaces continuously. Additional advantages are: (1) a more compact and effective expertness representation mechanism; (2) the avoidance of subjectivity and human intervention in the algorithm design when selecting discrete sets for $\mathcal{S}$ and $\mathcal{A}$; and (3) the possibility of speeding up the learning process by incorporating offline expert knowledge in the inference rules. In what follows we first give a brief overview of the FIS concept that we will use and then we present the proposed FQL algorithm.

Fuzzy inference systems
Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic. The mapping provides a basis from which decisions can be made, or patterns discerned. The parallel nature of the rules is one of the more important aspects of fuzzy logic systems. Instead of sharp switching between modes based on breakpoints, logic flows smoothly between regions of behavior depending on the dominant rule.
The purpose of fuzzy systems is to perform as control systems, considering that real problems often cannot be efficiently expressed through mathematical models. Fuzzy set theory thus models the vagueness that exists in real world problems. According to this theory, when $X$ is a fuzzy set and $x$ is a relevant object, the proposition "$x$ is a member of $X$" is not necessarily true or false, but it may be true or false only to some degree, the degree to which $x$ is actually a member of $X$ [21,22].
In fuzzy logic, each object can be labeled by a linguistic term, where a linguistic term is a word such as "small", "medium", "large", etc. As a result, $x$ is defined as a linguistic variable. Each linguistic variable is associated with a term set $T(x)$, which is the set of names of linguistic values of $x$. Each element in $T(x)$ is a fuzzy set. In our work, we refer to the Takagi-Sugeno FIS, which is given by generic rules of the form:

$$\text{If } x_1 \text{ is } X_1^i \text{ and } \ldots \text{ and } x_z \text{ is } X_z^i, \text{ then } O = o_i(\tilde{x})$$

where $\tilde{x} = \{x_1, \ldots, x_z\}$ is an input and $X_h^i$ is a fuzzy set in the domain of $x_h$, for $h = 1, \ldots, z$ and $i = 1, \ldots, n$. We have denoted by $z$ the dimension of the input space and by $n$ the number of rules. With $O$ we refer to the system's output, where the output function $o_i(\tilde{x})$ is a polynomial function of $\tilde{x}$. In our work we use the 0-Takagi-Sugeno FIS, which means that the output polynomial is a constant.
A fuzzy inference process consists of three parts: fuzzification of the input variables, computation of truth values and defuzzification. The first step takes the inputs $\tilde{x}$ and determines the degree to which they belong to each of the appropriate fuzzy sets via membership functions. A fuzzy set $X_h^i$ is characterized by a membership function $\mu_{X_h^i}(x_h)$ that associates each point in $X_h^i$ with a real number in the interval [0, 1]. This number represents the grade of membership of $x_h$ to $X_h^i$ [23]. After the inputs are fuzzified, for each rule we know the degree to which each part of the inputs is satisfied. If the inputs of a given rule have more than one part, the fuzzy operator is applied to obtain one number that represents the result of the inputs for that rule. This number is known as the truth value of the considered rule. The inputs to the fuzzy operator consist of two or more membership values from fuzzified input variables. The output is a single truth value given by:

$$w_i = \prod_{h=1}^{z} \mu_{X_h^i}(x_h)$$

In general, each rule is also described by a membership function which gives the degree of reliability of the rule. We consider the reliability to be a constant, equal for all the rules. Since decisions are based on the testing of all the rules in a FIS, the rules must be combined in some manner in order to make a decision. The input for the defuzzification process is a fuzzy set and the output is a single number, obtained by combining the rule outputs $o_i$ weighted by their truth values $w_i$. In the following section we present the FQL, which is used to approximate the action value function in RL problems through a Takagi-Sugeno FIS.
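The three inference steps can be sketched for a zero-order Takagi-Sugeno system as follows. The Gaussian-like membership shape with parameters `mean` and `spread`, the product used as fuzzy AND, and the normalization of the weighted sum by the total truth value are the usual conventions and are assumptions here, as are all names and numbers.

```python
import math

def membership(x, mean, spread):
    """Bell-shaped membership degree in (0, 1]; the exact shape is an assumption."""
    return math.exp(-((x - mean) ** 2) / (spread ** 2))

def truth_values(inputs, rules):
    """Truth value of each rule: product of the membership degrees of its inputs.

    rules: one entry per rule, each a list of (mean, spread) pairs,
    one pair per input variable.
    """
    return [
        math.prod(membership(x, m, s) for x, (m, s) in zip(inputs, rule))
        for rule in rules
    ]

def defuzzify(weights, outputs):
    """Weighted average of the constant rule outputs (zero-order Takagi-Sugeno)."""
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)

# Hypothetical single-input system with two rules ("low" and "high").
rules = [[(0.0, 1.0)], [(2.0, 1.0)]]
weights = truth_values([1.0], rules)
crisp_output = defuzzify(weights, [0.0, 10.0])
```

An input halfway between the two rule centres activates both rules equally, so the crisp output lands halfway between the two rule outputs instead of jumping at a threshold.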

Fuzzy Q-Learning
Let us consider an input state vector $\tilde{x}$, represented by $L$ fuzzy linguistic variables. For each RB $r$ we denote by $\bar{S} = \{\bar{s}_1, \ldots, \bar{s}_n\}$ the set of fuzzy state vectors of $L$ linguistic variables. For state $\bar{s}_i$, we denote by $\mathcal{A} = \{a_1, \ldots, a_l\}$ the set of possible actions. The rule representation of FQL for state $\bar{s}_i$ is:

$$\text{If } \tilde{x} \text{ is } \bar{s}_i, \text{ then } a_1 \text{ with } q(\bar{s}_i, a_1) \ldots \text{ or } a_j \text{ with } q(\bar{s}_i, a_j) \ldots \text{ or } a_l \text{ with } q(\bar{s}_i, a_l)$$

where $a_j$ is the $j$th action candidate that can be chosen for state $\bar{s}_i$, and $q(\bar{s}_i, a_j)$ is the fuzzy Q-value for each state-action pair $(\bar{s}_i, a_j)$. The number of state-action pairs for each state $\bar{s}_i$ equals the number of elements in the action set, i.e. each antecedent has $l$ possible consequences. As one associates actions with every state in QL, one associates several competing solutions with every rule in FQL. As a result, every fuzzy rule needs to choose an action $a_j$ from the action candidate set $\mathcal{A}$ by an action selection policy. A fuzzy Q-value, which is incrementally updated, is associated with each conclusion. The result of FQL is the output of the defuzzification process. The first output is the inferred action after defuzzifying the $n$ rules, obtained by combining the actions selected by the rules weighted by their truth values, where $w_i$ represents the truth value (i.e. the fuzzy-AND operator) of the rule representation of FQL for $\bar{s}_i$, and $\hat{a}$ is the action selected for state $\bar{s}_i$ after applying the $\epsilon$-greedy selection policy. The second output represents the Q-value for the state-action pair $(\tilde{x}, a)$, and is given by:

$$Q(\tilde{x}, a) = \sum_{i=1}^{n} w_i \times q(\bar{s}_i, \hat{a})$$

Q-values have to be updated after the action selection process. Since there is a fuzzy Q-value for each state-action pair, in each iteration $n$ fuzzy Q-values have to be updated based on:

$$q(\bar{s}_i, \hat{a}) \leftarrow q(\bar{s}_i, \hat{a}) + \alpha \Delta q(\bar{s}_i, \hat{a}) \qquad (19)$$

where

$$\Delta q(\bar{s}_i, \hat{a}) = \left[ c + \gamma \left( Q(\tilde{y}, a') - Q(\tilde{x}, a) \right) \right] \times \frac{w_i}{\sum_{i=1}^{n} w_i},$$

$c$ represents the cost obtained by applying action $a$ in state vector $\tilde{x}$, and $Q(\tilde{y}, a')$ is the next-state optimal Q-value, computed from the fuzzy Q-values of the optimal action for each next fuzzy state $\bar{v}_i$ reached after the execution of action $\hat{a}$ in the fuzzy state $\bar{s}_i$.
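A minimal sketch of one FQL iteration is given below, under several labeled assumptions: truth values are normalized by their sum, the next-state optimal Q-value uses the lowest fuzzy Q-value of each rule (cost minimization), and the temporal-difference error takes the standard form c + gamma * Q(y, a') - Q(x, a), which may differ from the grouping of the discount factor in the printed expression. All names and numbers are hypothetical.

```python
def infer_action(weights, chosen, actions):
    """Defuzzified action: truth-value-weighted average of the per-rule choices."""
    total = sum(weights)
    return sum(w * actions[chosen[i]] for i, w in enumerate(weights)) / total

def fql_update(q, weights, chosen, cost, next_weights, alpha=0.5, gamma=0.9):
    """Update the fuzzy Q-values q[rule][action] in place.

    weights / next_weights: rule truth values for the current / next input.
    chosen[i]: index of the action selected by rule i (e.g. epsilon-greedy).
    Returns (Q of the current state-action pair, temporal-difference error).
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    q_now = sum(w * q[i][chosen[i]] for i, w in enumerate(norm))
    next_total = sum(next_weights)
    q_next = sum((w / next_total) * min(q[i]) for i, w in enumerate(next_weights))
    td_error = cost + gamma * q_next - q_now  # assumed standard TD form
    for i, w in enumerate(norm):
        q[i][chosen[i]] += alpha * td_error * w  # each rule learns in proportion to its activation
    return q_now, td_error

# Hypothetical two-rule, two-action example.
fuzzy_q = [[4.0, 2.0], [6.0, 8.0]]
action = infer_action([1.0, 1.0], chosen=[1, 0], actions=[0.0, 10.0])
q_now, td = fql_update(fuzzy_q, [1.0, 1.0], [1, 0], cost=1.0, next_weights=[1.0, 3.0])
```

Because every active rule contributes to the inferred action, the output power can take values between the discrete levels, which is how a finite rule base yields a continuous action.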

FQL-based interference management
In this section we present the proposed FQL algorithm. The cost in Eq. (14), the available actions and the environment description are the same as for QL. The state representation for FQL is given by the state vector $\tilde{x} = \{Pow^f, C_r^m, C_r^f\}$. Differently from the QL case, the components of the state vector are the actual values of the input variables, which eliminates the subjectivity of the state vector definition of QL.
Figure 1 shows the FQL structure, which consists of a four layer FIS. The functionalities of each layer are the following:
• Layer 1 This layer has as input three linguistic variables defined by the term sets: $T(Pow^f)$ = {Very Low (VL), Low (L), Medium (M), High (H), Very High (VH)}, $T(C_r^m)$ = {L, Medium Low (ML), Medium High (MH), H}, $T(C_r^f)$ = {L, ML, MH, H}. The number of nodes in layer 1 therefore follows from the number of fuzzy sets in these term sets. Every node is defined by a membership function with a bell shape, as represented in Fig. 2. Consequently, the output $O_{1,h}$ for a generic component $x$ of $\tilde{x}$ is a bell-shaped function of $x$, where $e_h$ and $\rho_h$ are the mean and the variance of the bell shape function associated with node $h$, respectively. We pick the bell shape over other possible options (e.g., triangular, trapezoidal, etc.), as it is commonly selected when a fuzzy system is combined with learning approaches [24]. The membership function definitions for the proposed FQL are based on expert knowledge. Different mean and variance values, included in Table 1, and different numbers of fuzzy sets for the term set descriptions have been tested through simulations. In particular, we choose five fuzzy sets for the $Pow^f$ term set because we are defining a power control approach to manage interference; consequently, the power is a driving principle that affects the system performance. In addition, keeping the total femto transmission power below a maximum value is an important system requirement. For the $C_r^m$ and $C_r^f$ fuzzy terms we consider four fuzzy sets each, because both indicators are very important from the system performance point of view but, compared to the power, they are only used as indicators and consequently they can be defined with lower granularity than the power parameter. Notice that subjectivity in the term set definitions is absorbed by the learning process in layer 3, due to the inherent adaptive capability of FQL algorithms. The impact of the particular membership function shapes (i.e. mean and variance) is not significant, since the learning process allows self-adaptation of the FIS to the environment [24].
• Layer 2 This is the rule nodes layer. It is composed of rule nodes, each of which gives as output the truth value of the ith fuzzy rule. Each node in layer 2 takes one membership value per component of the input state vector, one for each linguistic variable, so the ith rule node is represented by the fuzzy state vector $\bar{s}_i$. The layer 2 output $O_{2,i}$ is the product of the membership values corresponding to its inputs.

• Layer 3 In this layer each node is an action-select node. The set of possible actions for each layer 3 node is the l power levels. The number of nodes in this layer is n, and each node selects its action $\hat{a}$ based on the ε-greedy action selection policy explained in Sect. 3.2. The $q(\bar{s}_i, \hat{a})$ values are initialized based on expert knowledge. Node i generates two normalized outputs, computed according to Eqs. (23) and (24).

• Layer 4 This layer has two output nodes, the action node $O_4^A$ and the Q-value node $O_4^Q$, which implement the defuzzification method and provide the final outputs.

Fig. 1 Graphic representation of FQL

Simulation results

For performance analysis, system level simulations are conducted. The simulation tool is a Release 8 3GPP system simulator developed in C++ by CTTC, which focuses on the RAN of Long Term Evolution (LTE). The macrocellular network is composed of a tessellated two-tier hexagonal cell layout with M = 19 cells, each having three sectors, and an inter-site distance of D = 500 m. The macro BSs are placed at the junction of three hexagonal sectors, as illustrated in Fig. 3. Macrousers are located outdoors and are uniformly distributed. Statistics are taken over the macrousers served by sector F1 of the central macrocell, as shown in Fig. 3.
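The four FIS layers described above can be sketched in a few lines of code. This is only an illustrative sketch, not the authors' implementation: the Gaussian form exp(-(x - e_h)^2 / q_h^2) for the bell-shaped membership function, the normalization of the layer 3 outputs by the sum of the truth values, and all numeric values (input range [0, 1], evenly spaced centers, widths) are assumptions, since Eqs. (21), (23) and (24) are not reproduced in this text. The function and variable names are hypothetical.

```python
import itertools
import math

def membership(x, e, q):
    """Layer 1: bell-shaped membership value (assumed Gaussian form)."""
    return math.exp(-((x - e) ** 2) / (q ** 2))

def fis_forward(x, centers, widths, q_table, chosen, actions):
    """One forward pass through the four FIS layers for input state x."""
    # Layer 1: one membership value per fuzzy set of each input component.
    o1 = [[membership(xc, e, q) for e, q in zip(es, qs)]
          for xc, es, qs in zip(x, centers, widths)]
    # Layer 2: truth value of each rule, i.e. the product of one
    # membership value per input component.
    o2 = [math.prod(combo) for combo in itertools.product(*o1)]
    # Layer 3: each rule node weights its selected action and the matching
    # q-value by its normalized truth value (assumed normalization).
    total = sum(o2)
    weights = [t / total for t in o2]
    o3_action = [w * actions[a] for w, a in zip(weights, chosen)]
    o3_q = [w * q_table[i][a] for i, (w, a) in enumerate(zip(weights, chosen))]
    # Layer 4: defuzzification by summing the layer 3 outputs.
    return sum(o3_action), sum(o3_q)

def evenly_spaced(count, low=0.0, high=1.0):
    """Helper producing `count` evenly spaced values in [low, high]."""
    return [low + i * (high - low) / (count - 1) for i in range(count)]

# Hypothetical setup: 5, 4 and 4 fuzzy sets for the three input variables.
centers = [evenly_spaced(5), evenly_spaced(4), evenly_spaced(4)]
widths = [[0.2] * 5, [0.25] * 4, [0.25] * 4]
actions = evenly_spaced(60, -80.0, 10.0)   # 60 power levels in dBm
n_rules = 5 * 4 * 4                        # one rule per term combination
q_table = [[0.0] * len(actions) for _ in range(n_rules)]
chosen = [0] * n_rules                     # every rule picks the first action
power_dbm, q_value = fis_forward([0.3, 0.7, 0.5], centers, widths,
                                 q_table, chosen, actions)
```

Because the layer 3 weights sum to one, the output action is a convex combination of the actions selected by the individual rules, which is what makes the resulting action continuous even though each rule picks from a discrete set.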
Relevant simulation parameters for macro and femtocells, as well as for the QL and FQL algorithms, are summarized in Table 1. Details on how the learning parameters of QL and FQL have been set are given in [9]. We consider the 3GPP implementation of the frequency-selective fading model specified in [25] (urban macro settings) for macro BS to user propagation, and a spectral block fading model with a coherence bandwidth of 750 kHz for indoor propagation. The considered Path Loss (PL) models are based on [18] for the case of urban scenarios. We introduce an occupation ratio, $p_{oc}$, which is the probability that an apartment contains a femtocell. Each femtocell has a random activation parameter that determines the moment at which the femtocell is switched on and starts its learning process. Users' activity periods are determined based on a Poisson scheme. The set of possible actions $A$ available to the fth femto BS, for both QL and FQL, is composed of l = 60 power levels, ranging from -80 to 10 dBm Effective Radiated Power (ERP). In what follows we present results in terms of system performance and convergence capabilities.
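As a small illustration of the action set just described, the sketch below builds the 60 power levels and applies an ε-greedy choice over one row of Q-values. It assumes the levels are uniformly spaced between -80 and 10 dBm, which the text does not state, and it takes the greedy action to be the one with the lowest Q-value, since the Q-values in this work represent a cost. The function name and the ε value are hypothetical.

```python
import random

N_LEVELS = 60
P_MIN_DBM, P_MAX_DBM = -80.0, 10.0

# Assumption: uniform spacing of the l = 60 power levels (about 1.5 dB apart).
power_levels = [P_MIN_DBM + i * (P_MAX_DBM - P_MIN_DBM) / (N_LEVELS - 1)
                for i in range(N_LEVELS)]

def epsilon_greedy(q_row, epsilon, rng=random):
    """Return an action index: explore with probability epsilon,
    otherwise exploit the lowest-cost (lowest Q-value) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return min(range(len(q_row)), key=q_row.__getitem__)

# Usage: with epsilon = 0 the choice is purely greedy.
best = epsilon_greedy([3.0, 1.0, 2.0], epsilon=0.0)
```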

System performance
We compare system performance results obtained by QL, FQL and a benchmark algorithm known as Smart Power Control (SPC), which is based on interference measurements and was proposed by 3GPP in [26]. In this algorithm the femtocell BS adjusts the transmission power of its RBs based on the total received interference at the femtocell BS. In the corresponding power control mapping, g is a linear scalar that alters the slope of the mapping curve, b is a parameter expressed in dB (both are femtocell configuration parameters) and R is the number of available RBs. Furthermore, $N_{sc}$ is the number of sub-carriers, $P_{min}^F$ is the minimum femtocell transmit power, and $E_c$ is the reference signal received power per resource element present at the femto node.
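To make the role of these parameters concrete, the sketch below implements one plausible form of the SPC mapping. The formula itself is an assumption: it follows the interference-measurement-based power setting as it is usually quoted from 3GPP TR 36.921, in which the measured per-resource-element power is scaled to the full bandwidth, mapped linearly with slope g and parameter b, and clipped between the minimum and maximum transmit power. The exact equation used in the paper is not reproduced in this text, how the result is split across RBs is not shown, and all numeric values are hypothetical.

```python
import math

def spc_tx_power_dbm(e_c_dbm, n_rb, n_sc, g, b_db, p_min_dbm, p_max_dbm):
    """Assumed SPC mapping: clip(g * (E_c + 10*log10(R * N_sc)) + b).

    e_c_dbm   : reference signal received power per resource element (dBm)
    n_rb      : number of available RBs (R)
    n_sc      : number of sub-carriers per RB (N_sc)
    g, b_db   : slope (linear scalar) and dB parameter of the mapping
    p_min_dbm : minimum femtocell transmit power (dBm)
    p_max_dbm : maximum femtocell transmit power (dBm)
    """
    wideband_dbm = e_c_dbm + 10.0 * math.log10(n_rb * n_sc)
    return max(min(g * wideband_dbm + b_db, p_max_dbm), p_min_dbm)

# Hypothetical values: 100 RBs of 12 sub-carriers, unit slope, 60 dB parameter.
p_mid = spc_tx_power_dbm(-100.0, 100, 12, g=1.0, b_db=60.0,
                         p_min_dbm=-10.0, p_max_dbm=10.0)
p_low = spc_tx_power_dbm(-150.0, 100, 12, g=1.0, b_db=60.0,
                         p_min_dbm=-10.0, p_max_dbm=10.0)
```

A stronger measured macro signal raises the femtocell power up to the maximum, while a weak one drives it down to the minimum; the mapping is static, which is the property the learning schemes are compared against.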
As already explained in the theoretical sections, QL works by defining a finite set of potential actions. This implies a great amount of human intervention in the definition of the algorithm, since the action space should be complete enough to include the values that are appropriate for any situation. At the same time, it should not be too large, as this would increase the search time and the learning time of the algorithm. Figure 4 represents the system performance in terms of average macrocell capacity when different action spaces are considered. As can be observed, both the system performance and the learning time of the algorithm are considerably affected by the selection of the action space.
In particular, QL with 40 available power levels underperforms the same algorithm implemented with 60 or 50 power levels. The reason is that by reducing the number of available actions we also reduce the degrees of freedom the algorithm has to find its optimal solution. This evaluation justifies the need to further investigate the QL approach and to improve it by means of fuzzy logic, which allows the design of an algorithm that is independent of the cardinality of the state and action spaces.

Figure 5 depicts the macrocell capacity as a function of the femtocell occupation ratio, $p_{oc}$. FQL outperforms QL because it defines the state of the environment more precisely and consequently finds more accurate actions. This improvement comes from the continuous state and action representation allowed by the FQL algorithm. FQL and QL both outperform the benchmark algorithm, since SPC is not able to adapt as the density of femtocells increases in order to keep the interference below a threshold. In particular, SPC cannot react online to the dynamics of the wireless environment, where users move around, femtos and users operate based on random activity factors, and channel propagation effects such as fading and shadowing affect the signals. In addition, it keeps no memory of previous experience with which to react to quick changes in the scenario.
Figure 6 shows the average femtocell capacity as a function of $p_{oc}$. It can be observed that the average femtocell capacity decreases with the femtocell density, due to the increase in femto-to-femto interference. Both learning methods maximize the femtocell capacity, as required by the defined cost function. However, FQL provides better performance than QL, thanks to its optimized self-organization capabilities. Finally, and similarly to the results presented in Fig. 5, both FQL and QL again outperform SPC for the reasons explained above.

Convergence capabilities
We now discuss the convergence capabilities of the proposed approaches. We compare QL and FQL, and we also study the impact of initializing the inference rules of the FIS in order to incorporate offline expert knowledge in the learning scheme. To do this, we implement Init-FQL, which consists of an expert initialization of some of the Q-values in the layer 3 nodes. Specifically, since the action selection model chooses the action corresponding to the lowest Q-value, we propose to initialize the Q-values corresponding to critical states and appropriate actions at a lower value than those of the rest of the actions. In this way, the agent is expected to find more adequate solutions from the beginning of the learning process, which results in a shorter learning period and lower interference at macrousers. For instance, Q-values corresponding to states with macrocell capacity "L" and high power levels can be initialized at lower values than those of other states. The rationale is that if the capacity of the macrocell system is low, femtocell actions that correspond to transmitting at high power may jeopardize macrouser performance, and it is therefore desirable to avoid these actions.
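The expert initialization can be illustrated with a short sketch. It reflects one possible reading of the description above, not the authors' actual initialization: the critical rules are taken to be those with macrocell capacity "L" and femtocell power "H" or "VH", the appropriate actions are taken to be the lowest power levels, and one rule is assumed per combination of the three term sets. The initial values, the number of favoured actions and all names are hypothetical.

```python
import itertools

POW_TERMS = ("VL", "L", "M", "H", "VH")   # femtocell power term set
CM_TERMS = ("L", "ML", "MH", "H")         # macrocell capacity term set
CF_TERMS = ("L", "ML", "MH", "H")         # femtocell capacity term set
N_ACTIONS = 60                            # power levels, lowest power first

# Assumption: one fuzzy rule per combination of terms.
RULES = list(itertools.product(POW_TERMS, CM_TERMS, CF_TERMS))

def is_critical(rule):
    """Assumed notion of a critical state: low macro capacity, high power."""
    pow_term, cm_term, _cf_term = rule
    return cm_term == "L" and pow_term in ("H", "VH")

def init_q_table(default_q=1.0, favoured_q=0.0, n_favoured=20):
    """Build the layer 3 Q-values; in critical rules the low-power actions
    start at a lower Q-value, so lowest-Q selection prefers them."""
    table = [[default_q] * N_ACTIONS for _ in RULES]
    for row, rule in zip(table, RULES):
        if is_critical(rule):
            row[:n_favoured] = [favoured_q] * n_favoured
    return table

q_table = init_q_table()
critical_rows = [i for i, rule in enumerate(RULES) if is_critical(rule)]
```

With this kind of initialization a greedy choice in a critical rule starts from a low-power action instead of an arbitrary one, which is the mechanism by which the learning period is expected to shorten.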
Figure 7 represents the probability that the total transmission power of a femto BS is above the given $P_{max}^F$ threshold, for a femtocell occupation ratio of 45 %. In particular, we compute the average number of iterations that each of the three learning systems requires to bring this probability below a target of 2 %. Results show that FQL needs 57 % fewer iterations than QL to reach the target, and Init-FQL needs 95 % fewer iterations than FQL.
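The two reductions quoted above compound, which the following arithmetic sketch makes explicit. It only combines the two percentages reported in the text and assumes both refer to the same scenario and target; no additional data is involved.

```python
# Fractions of iterations remaining, taken from the reported reductions.
fql_over_ql = 1.0 - 0.57        # FQL needs 57 % fewer iterations than QL
init_over_fql = 1.0 - 0.95      # Init-FQL needs 95 % fewer than FQL

# Init-FQL relative to QL is the product of the two remaining fractions.
init_over_ql = fql_over_ql * init_over_fql
reduction_vs_ql = 1.0 - init_over_ql

print(f"Init-FQL needs about {init_over_ql:.2%} of the QL iterations")
```

Under these figures, Init-FQL would need roughly 2 % of the iterations required by QL to reach the same target.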
Finally, Fig. 8 represents the probability that the transmission capacity of a macrouser $u_m$ allocated in RB r is below the $C_{Th}^M$ threshold, for a scenario with a femtocell occupation ratio of 60 %. As can be observed, in terms of interference, the FQL algorithm adapts its actions better from the first iterations, which results in lower interference at the macrocell system.

Practical implementation
In this section we analyze the practical implementation of the proposed solutions. We focus on the implementation of QL and FQL in a 3GPP system. We also evaluate the feasibility of the learning algorithms in terms of memory and computational requirements in state-of-the-art processors proposed for LTE femtocells.

Practical implementation in 3GPP
For the implementation of the QL and FQL algorithms in femto BSs, femtocells need feedback from the macrocell system about the aggregated interference received by macrousers trapped in the femtocells' coverage area. With this information, agents can compute the capacity $C_r^m$ reached by macrouser $u_m$. The 3GPP LTE network architecture connects neighboring BSs via the X2 interface [27,28], which conveys control information related to handover and interference coordination. We propose to convey the information about the SINR at macrousers through the X2 interface, using the following procedure:
(1) Macrouser $u_m$ determines the cell-ID of surrounding femto BSs by reading the corresponding Broadcast Channel (BCH).
(2) The cell-IDs of the surrounding femto BSs are reported to the serving macro BS.
(3) The macro BS sends the information corresponding to the SINR at macrousers via the X2 interface to its surrounding femto BSs.
For the introduction of this information, the existing X2 protocol has to be extended to include the corresponding per-RB SINR information. Since the X2 interface induces significant delays of up to $D_{max} = 20$ ms, with an average of 10 ms [29], it is assumed that both learning algorithms perform a learning cycle every 10 ms.

Memory requirements

In learning algorithms, the expert knowledge can be represented in different ways. The selected representation mechanism is directly related to the memory requirements of the learning method. In this section we analyze the memory requirements of both proposed learning methods. We assume femtocells implementing the LTE standard with a 20 MHz bandwidth channel, which corresponds to 100 RBs.
• QL Since in QL there is a Q-value Q(x, a) for each state-action pair, the memory requirements of this kind of system are given by the size of the state and action spaces. Therefore, the total memory requirement for a femtocell implementing QL is $(k \times l) \cdot 100 = 96$ kB.
• FQL In FQL the knowledge is stored in layer 3. Each node in this layer has a Q-value for each action, which occupies $n \times l$ B. So, for the implementation of FQL the total memory requirement is $(n \times l) \cdot 100 = 480$ kB.
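The two totals can be checked with a short calculation. The sketch below assumes one byte per Q-value (as the n × l B figure for FQL suggests, and assumed here for QL as well), 1 kB = 1000 B, and l = 60 actions as in the simulation setup. The values of k and n are not stated numerically in this section, so they are inferred from the totals rather than taken from the text.

```python
N_ACTIONS = 60        # l, number of power levels
N_RB = 100            # RBs in a 20 MHz LTE channel
BYTES_PER_Q = 1       # assumption: one byte per stored Q-value

def table_bytes(n_entries):
    """Total memory for n_entries states (or rules), all actions, all RBs."""
    return n_entries * N_ACTIONS * BYTES_PER_Q * N_RB

def implied_entries(total_bytes):
    """Number of states (or rules) implied by a reported total."""
    return total_bytes // (N_ACTIONS * BYTES_PER_Q * N_RB)

k = implied_entries(96_000)     # QL states implied by 96 kB
n = implied_entries(480_000)    # FQL rules implied by 480 kB
```

Under these assumptions the implied number of FQL rules equals 5 × 4 × 4, the number of combinations of the three term sets of layer 1, which is consistent with one rule per combination of fuzzy sets.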

Computational requirements
We assume that femto BS functionalities are implemented in a Texas Instruments C64x Digital Signal Processor (DSP). In particular, we assume the use of a TMS320C6416 fixed-point DSP, commonly used for communication applications and proposed by Texas Instruments for femtocell BSs, since it supports the implementation of LTE protocols. The computational requirements are evaluated in Million Instructions per Second (MIPS), where each sum, multiplication, memory access and storage requires one operation. FQL requires more complex operations, i.e., exponentials and divisions. We assume that exponential functions are computed through a piecewise linear approximation [30], which results in 11 operations per exponential computation. Divisions in a TMS320C6x DSP need between 18 and 42 operations [31]; we assume the worst case of 42 operations. In what follows, we evaluate the computational requirements of each QL and FQL iteration.
• QL The computational cost for QL is given by the Q-value estimation through Eq. (9), which is summarized in Table 2. In particular, the total number of operations required per RB is 246. Since a learning iteration is performed every 10 ms, the average latency of the X2 interface, the total load over the 100 RBs is 2.46 MIPS.
• FQL For FQL, the computational requirements are given by the processes performed in each node of the four layers of the FIS. Here, some exponential operations are required in layer 1 to compute the membership values, Eq. (21), and some divisions are required to compute the outputs of layer 3 based on Eqs. (23) and (24). The computational operations of each layer are summarized in Table 3. The number of operations per RB is 39643. Therefore, the total load over the 100 RBs is 396.43 MIPS.
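The MIPS figures follow directly from the per-RB operation counts, the 100 RBs of the 20 MHz channel and the 10 ms learning period. The sketch below reproduces that arithmetic and relates it to the 8000 MIPS capacity quoted for the processor; the function and constant names are hypothetical.

```python
N_RB = 100                    # RBs in a 20 MHz LTE channel
PERIOD_S = 0.010              # one learning iteration every 10 ms
DSP_CAPACITY_MIPS = 8000.0    # maximum capacity quoted for the TMS320C6416

def required_mips(ops_per_rb, n_rb=N_RB, period_s=PERIOD_S):
    """Operations per second, in millions, for one iteration per period
    on every RB."""
    return ops_per_rb * n_rb / period_s / 1e6

ql_mips = required_mips(246)        # QL: 246 operations per RB
fql_mips = required_mips(39643)     # FQL: 39643 operations per RB
fql_share = fql_mips / DSP_CAPACITY_MIPS

print(f"QL: {ql_mips:.2f} MIPS, FQL: {fql_mips:.2f} MIPS")
```

Even the heavier FQL load stays below 5 % of the quoted processor capacity under these figures.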
The TMS320C6416 processor has a maximum capacity of 8000 MIPS [32]. Therefore, in terms of computational requirements, both the QL and FQL algorithms can be implemented in the processor. On the other hand, in terms of memory requirements, the processor only has cache memory; therefore, the learning algorithms would use the 1,280 MB of total addressable external Synchronous Dynamic Random Access Memory (SDRAM). As a result, both the QL and FQL algorithms can be easily allocated.

Conclusions
In this paper we have introduced the novel concept of self-organized femtocells by proposing two QL algorithms for interference management in a macro-femto network, to foster the coexistence of both systems in the same band.
We have shown that, compared to QL, FQL optimizes the self-organization capabilities of the proposed approach through continuous representations of the state and action spaces and the inclusion of previous expert knowledge in the fuzzy rules, which eliminates subjectivity in the state and action representation and reduces the learning period. We have demonstrated that FQL outperforms both QL and a heuristic approach proposed in 3GPP TR 36.921. Finally, we have presented an analysis of the implementation requirements of QL and FQL in 3GPP systems, and we have discussed memory and computational requirements, showing that the proposed solutions can be implemented in state-of-the-art processors.

Fig. 2 Membership functions of the input linguistic variables

Fig. 7 Probability that the femtocell total transmission power is above the $P_{max}^F$ threshold

Fig. 8 Probability that the macrocell capacity is below the $C_{Th}^M$ threshold

Table 2 Computational requirement for QL

Table 3 Computational requirement for FQL

Wireless Netw (2014) 20:441-455