Learning in the Sky: Towards Efficient 3D Placement of UAVs

Deployment of unmanned aerial vehicles (UAVs) as aerial base stations to support cellular networks can deliver a fast and flexible solution for serving high and varying traffic demand. In order to adequately leverage the benefits of UAV deployment, efficient placement is of utmost importance and requires intelligent adaptation to changes in the environment. In this paper, we propose novel learning-based mechanisms for the three-dimensional deployment of UAVs assisting terrestrial networks in the downlink under overload conditions. The problem is modeled as a game among UAVs. To solve the game, we utilize tools from reinforcement learning and develop low-complexity algorithms based on multi-armed bandit and satisfaction methods to learn the UAVs' locations. Simulation results reveal that the proposed satisfaction based UAV placement algorithm can yield significant performance gains of up to about 50% and 41% in terms of throughput and the number of outage users, respectively, compared to a learning based benchmark algorithm.


I. INTRODUCTION
Due to novel types of services and ongoing advances in unmanned aerial vehicle (UAV) technologies, there is a growing consensus on integrating UAVs into cellular networks. It is expected that UAVs will play a prominent role in traffic offloading, capacity enhancement, and disaster recovery to assist terrestrial networks [1]-[3]. From an economic perspective, deploying small cell base stations (BSs) and/or advanced fifth generation (5G) components, such as massive multiple-input and multiple-output (MIMO), may not be cost-effective for large temporary events. In this regard, the deployment of UAVs can be considered an alternative or complementary solution to compensate for outages and alleviate overload conditions in cellular networks. UAVs are able to establish line-of-sight (LoS) communication links with high probability to ground users, resulting in increased coverage, enhanced reliability, and agility [4].
The mobility of UAVs and their flexibility in adjusting their locations significantly impact the LoS probability and the network performance. Most research efforts have addressed this issue from a non-learning perspective [4]-[7]. In [4], the optimum altitude of a single UAV was obtained in order to minimize the outage probability and maximize the coverage region. In [5], the location of a UAV in one-dimensional and two-dimensional (2D) networks was investigated to maximize the average throughput. In [6], the authors developed an algorithm for the three-dimensional (3D) placement of a UAV to maximize the number of covered users in the network. For the sake of simplicity, the problem was decoupled into the placement of the UAV in the horizontal and vertical dimensions. In [7], the 3D placement and trajectory designs for a UAV to improve the achievable rate were investigated.
However, growing attention has recently been devoted to the use of learning algorithms for the UAV deployment problem [8]-[11]. In [8] and [9], learning based approaches were proposed to find the 2D trajectories of UAVs flying at fixed altitudes. In [10], a learning based approach for the 3D placement of a single UAV was developed to maximize network throughput. However, these works do not address the 3D deployment of multiple UAVs integrated into already existing terrestrial networks for the purpose of network overload relief. In [11], the 3D placement problem was treated as two separate optimization problems in the horizontal 2D plane and the altitude of the UAVs, where prior knowledge of the users' locations is required. However, providing this information to UAVs in real time can be challenging. Moreover, the proposed algorithm is not able to adapt to changes in the environment. Finally, to the best of our knowledge, the channel models in previous learning based reports do not capture the dependency of the path loss exponents on the height of the UAVs, which might have a significant influence on the performance of UAVs [4], [12].
In this paper, we address the 3D placement of multiple UAVs aiming at assisting existing terrestrial cellular networks to improve the throughput in overload and outage scenarios. We formulate an optimization problem using frameworks from game theory, and leverage reinforcement learning algorithms to develop two novel distributed approaches which allow UAVs to autonomously adjust and optimize their locations, adapting to dynamic environments. In the first approach, the problem is transformed into a multi-armed bandit (MAB) problem. To solve the MAB problem, we use the upper confidence bound (UCB) policy based on the mean observed rewards. The second approach is based on a novel framework in game theory. In this approach, the problem is modeled as a game among UAVs in satisfaction form, which comprises a multi-step process. Furthermore, as opposed to related works, we take into account the network load effect in the problem, representing the BSs' capabilities in serving users. Finally, in order to examine our proposed algorithms, we employ a third generation partnership project (3GPP)-based height dependent channel model. Our findings from the simulation results show that the proposed learning based approaches can significantly improve the performance of the network compared to the benchmark algorithms. The proposed satisfaction based learning approach yields better performance, though at a slower convergence time compared to the proposed MAB based learning algorithm. Therefore, there is an important tradeoff between convergence speed and performance when choosing among our proposed learning algorithms.
The rest of this paper is organized as follows. In Section II, we describe the system model. Section III presents the proposed algorithms for the 3D deployment of UAVs in the network. In Section IV, we evaluate the performance of the proposed approaches. Finally, Section V concludes the paper.
Notations: The regular and boldface symbols refer to scalars and matrices, respectively. For any finite set A, the cardinality of A is denoted by |A|. The function 1_φ denotes the indicator function, which equals 1 if event φ is true and 0 otherwise.

II. SYSTEM MODEL
In this section, we describe the system model, including the network topology, channel model, and user association method.

Figure 1: An illustration of the system model.

A. Network Topology
We consider the downlink of a cellular network consisting of a set of terrestrial BSs M and a set of UAVs U serving as aerial BSs, in a particular area R ⊂ R^2. The UAVs are deployed to support the cellular network in overload and outage situations. The set of all users and the set of users associated with BS b ∈ B at time t are denoted by K and K_b(t), respectively, where B = M ∪ U is the set of all BSs in the system. We assume that all BSs transmit over the same channel (i.e., co-channel deployment), and that the time horizon is discretized into N equally spaced time instants with duration T_s. Let (x_b(t), y_b(t)) and h_b(t) denote the horizontal location of BS b and its altitude at time t, respectively. Moreover, we assume that all UAVs fly at a constant speed V. An illustration of the system model is given in Fig. 1.

B. Radio Propagation and Signal Quality
We consider a channel model in which the link between each user k ∈ K and each BS b ∈ B comprises LoS and non-LoS propagation conditions. Let pr^LoS_{b,k}(t) denote the probability of having a LoS link between user k and BS b, which is determined as follows [12]:

pr^LoS_{b,k}(t) = 1 / (1 + ξ exp(−β [θ_{b,k}(t) − α])),   (1)

where α, β, and ξ represent statistical environment-dependent parameters, and θ_{b,k}(t) = (180/π) arctan((h_b(t) − h_k(t)) / r_{b,k}(t)) is the elevation angle of BS b seen from user k in degrees. Here, h_b(t) and h_k(t) are the altitudes of BS b and user k at time t, respectively. The horizontal location of user k and its horizontal distance to BS b at time t are denoted by (x_k(t), y_k(t)) and r_{b,k}(t), respectively. Consequently, the non-LoS probability is pr^NLoS_{b,k}(t) = 1 − pr^LoS_{b,k}(t). Let d_{b,k}(t) = sqrt(r_{b,k}(t)^2 + (h_b(t) − h_k(t))^2) denote the 3D distance between BS b and user k at time t. The path loss between BS b and user k can be expressed as [12]:

PL^z_{b,k}(t) = A^z_b [d_{b,k}(t)]^{δ^z_{b,k}},   (2)

where the superscript z ∈ {LoS, NLoS} denotes the LoS and non-LoS components of the link. Here, A^z_b and δ^z_{b,k} are the reference path loss and the path loss exponent, respectively. For small-scale fading, we use Nakagami-m fading, which is a generalized model characterizing various fading environments. Therefore, the channel fading gain Z^z under Nakagami fading is Gamma distributed, i.e., Z^z ∼ Gamma(m_z, 1/m_z), where z ∈ {LoS, NLoS}. Here, m_LoS and m_NLoS denote the fading parameters for the LoS and non-LoS links, respectively. Therefore, the probability density function (PDF) of the Gamma-distributed channel fading gain Z^z is determined as follows:

f_{Z^z}(w) = (m_z^{m_z} w^{m_z − 1} / Γ(m_z)) exp(−m_z w),   (3)

where Γ(·) is the Gamma function. We consider m_NLoS = 1, which yields a Rayleigh distribution, and m_LoS > 1, which can approximate the Ricean fading distribution. Moreover, the case of no fading can be modeled by setting m_LoS → ∞.
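The LoS probability and fading model above can be simulated directly. The following is a minimal Python sketch, assuming the sigmoid form of (1) as reconstructed and unit-mean Nakagami-m power gains; the function names and parameter values are illustrative, not from the paper.

```python
import math
import random

def los_probability(h_b, h_k, r, alpha, beta, xi):
    # Elevation angle (in degrees) of BS b as seen from user k.
    theta = math.degrees(math.atan2(h_b - h_k, r))
    # Sigmoid LoS model in the spirit of Eq. (1); alpha, beta, xi are
    # environment-dependent parameters (symbol placement is an assumption).
    return 1.0 / (1.0 + xi * math.exp(-beta * (theta - alpha)))

def fading_gain(m):
    # Nakagami-m power fading gain Z ~ Gamma(m, 1/m), so E[Z] = 1.
    return random.gammavariate(m, 1.0 / m)
```

Note that raising the UAV altitude increases the elevation angle and hence the LoS probability, matching the qualitative behavior the model is meant to capture.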
The signal-to-interference-plus-noise ratio (SINR) experienced by user k associated with BS b can be formulated as:

γ_{b,k}(t) = p_b g_{b,k}(t) / ( Σ_{b′ ∈ B\{b}} p_{b′} g_{b′,k}(t) + σ^2 ),   (4)

where p_b is the transmit power of BS b, and g_{b,k}(t) is the channel gain between BS b and user k at time t, which is determined as follows [12], [13]:

g_{b,k}(t) = x_{b,k} Z^LoS / PL^LoS_{b,k}(t) + (1 − x_{b,k}) Z^NLoS / PL^NLoS_{b,k}(t),   (5)

where x_{b,k} is a Bernoulli random variable with parameter pr^LoS_{b,k}(t) which characterizes whether LoS or non-LoS propagation occurs. The parameter σ^2 represents the additive white Gaussian noise (AWGN) power. Therefore, the achievable data rate provided by BS b to user k, using Shannon's capacity formula, is given by:

C_{b,k}(t) = ω log_2(1 + γ_{b,k}(t)),   (6)

where ω is the total bandwidth.
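As a concrete illustration of (4)-(6), the fragment below computes the SINR and Shannon rate for one user under co-channel interference; the dictionary-based layout and function names are illustrative assumptions.

```python
import math

def sinr(serving, powers, gains, noise):
    # SINR of a user served by BS `serving`; gains[b] is g_{b,k}(t), and
    # every other BS in `powers` acts as a co-channel interferer, Eq. (4).
    signal = powers[serving] * gains[serving]
    interference = sum(powers[b] * gains[b] for b in powers if b != serving)
    return signal / (interference + noise)

def shannon_rate(bandwidth, sinr_value):
    # Achievable rate C_{b,k}(t) = w * log2(1 + SINR), Eq. (6).
    return bandwidth * math.log2(1.0 + sinr_value)
```

For example, with two equal-power BSs, gains {0: 1.0, 1: 0.5}, and unit noise, the served user's SINR is 2/(1+1) = 1, so its rate equals the bandwidth.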

C. Load and User Association Policy
Let υ_k denote the traffic influx rate of user k, where each user can have a different quality-of-service (QoS) requirement (i.e., heterogeneous users). The fraction of time BS b requires to serve the traffic υ_k at the location of user k is defined as υ_k / C_{b,k}(t). Therefore, the load of BS b at time t is given by [14]:

ρ_b(t) = Σ_{k ∈ K_b(t)} υ_k / C_{b,k}(t).   (7)

Since only limited resources are available in the network, the load of a BS cannot exceed one. Thus, if the load of a BS exceeds one, some of its users experience drops in their rates [15].
To associate users with BSs, we consider a user association rule based on the best signal strength. Accordingly, each user k connects to the BS b_k(t) which offers the highest received signal strength, as follows:

b_k(t) = argmax_{b ∈ B} p_b g_{b,k}(t).   (8)
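A minimal sketch of the association rule (8) and the resulting load (7); the dictionary-based data layout is an assumption for illustration.

```python
def associate(users, bss, power, gain):
    # Best-signal-strength rule (8): user k picks argmax_b p_b * g_{b,k}.
    return {k: max(bss, key=lambda b: power[b] * gain[(b, k)]) for k in users}

def bs_load(traffic, rates):
    # rho_b = sum over associated users of upsilon_k / C_{b,k}(t), Eq. (7).
    # A value above 1 means the BS is overloaded and some users see rate drops.
    return sum(traffic[k] / rates[k] for k in traffic)
```

For instance, a BS serving two users with traffic rates 1 and 2 Mb/s at achievable rates of 4 Mb/s each carries a load of 0.25 + 0.5 = 0.75.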

III. LEARNING BASED PLACEMENT ALGORITHMS
In this section, we aim at optimizing the 3D locations of the UAVs to maximize the throughput in a distributed manner. Let A(t) = (a_1(t), ..., a_|U|(t)) denote the locations of the UAVs at time t. Therefore, the optimization problem is formulated as follows:

maximize_{A(t)}  Σ_{b ∈ B} Σ_{k ∈ K_b(t)} C_{b,k}(t)   (9a)
subject to  (x_u(t), y_u(t)) ∈ R,  ∀u ∈ U,   (9b)
            h_min ≤ h_u(t) ≤ h_max,  ∀u ∈ U,   (9c)
            d_{u,u′}(t) ≥ d_min,  ∀u, u′ ∈ U, u ≠ u′,   (9d)
            ρ_b(t) ≤ 1,  ∀b ∈ B,   (9e)

where the parameters h_min and h_max denote the minimum and maximum altitudes of the UAVs, respectively. Here, d_{u,u′}(t) and d_min represent the 3D distance between UAV u and UAV u′ at time t and a certain minimum separation distance, respectively. For the optimization problem (9), the constraints (9b)-(9c) define the feasible 3D space for the locations of the UAVs. The constraint (9d) ensures the safety of the UAVs and collision avoidance. The constraint (9e) corresponds to the definition of load, avoiding outages and ensuring service for the users [15]. The problem of finding the optimum 3D locations of the UAVs is complex, mainly due to the mobility of the UAVs and temporal traffic statistics. Therefore, we leverage machine learning tools to solve the problem. Hereby, each UAV determines its position based on the proposed learning approaches, which are presented in Sections III-A and III-B.
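The feasibility constraints (9b)-(9d) are simple geometric checks that each UAV can evaluate locally. The following is a hedged sketch, where the serving region R is passed in as a predicate and positions are (x, y, h) triples; both choices are illustrative assumptions.

```python
import math

def feasible(positions, in_region, h_min, h_max, d_min):
    # positions: list of 3D UAV locations (x, y, h).
    for (x, y, h) in positions:
        # Constraint (9b): horizontal location inside the region R.
        # Constraint (9c): altitude within [h_min, h_max].
        if not in_region(x, y) or not (h_min <= h <= h_max):
            return False
    # Constraint (9d): pairwise 3D separation of at least d_min.
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) < d_min:
                return False
    return True
```

The load constraint (9e) is not checked here, since it depends on the channel realizations and user association rather than on geometry alone.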

A. Multi-Armed Bandit Based UAV Placement Approach
In a MAB problem, each bandit (player) chooses an arm from a set of arms (set of actions). After choosing an arm, the bandit observes the reward associated with that arm [16]. The bandit has no initial knowledge about the rewards of the arms. Therefore, it is allowed to explore the different arms to improve its reward estimates, while exploiting the arm that maximizes the immediate reward. Accordingly, the MAB approach requires effectively balancing the tradeoff between exploration and exploitation.
We model our problem as a MAB problem. The set of players, the set of actions, and the reward function are defined as follows:
• Players: the set of UAVs U is considered as the set of players.
• Actions: the set of actions of UAV u is denoted by S_u, defined as movements in different directions, i.e., S_u = {up, down, left, right, forward, backward, no change}.
• Reward: the reward of UAV u can be expressed as follows [16], [17]:

R_u(t) = Σ_{k ∈ K_u(t)} C_{u,k}(t).   (10)

One of the most widely used solution methods for MAB problems is the upper confidence bound (UCB) algorithm. In this method, each player first selects each arm once. Then, for each t > |S_u|, it selects the action s_u^UCB(t) as follows [16]:

s_u^UCB(t) = argmax_{s_{u,i} ∈ S_u} [ R̄_{u,i}(t) + c sqrt( ln t / n_{u,i}(t) ) ],   (11)

where R̄_{u,i}(t) denotes the mean reward of UAV u for playing action s_{u,i} ∈ S_u up to time t, and n_{u,i}(t) is the number of times action s_{u,i} has been selected by UAV u until time t. The parameter c > 0 balances the tradeoff between exploration and exploitation, in which high values of c lead to a high level of exploration. The pseudocode for the UCB based approach is presented in Algorithm 1.
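The action-selection and bookkeeping steps of the UCB policy (11) can be sketched as follows; the list-based bookkeeping is an illustrative choice, not the paper's implementation.

```python
import math

def ucb_action(mean_reward, counts, t, c):
    # Play any never-tried action first (forced initial exploration).
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Otherwise maximize mean reward plus the UCB exploration bonus, Eq. (11).
    def score(i):
        return mean_reward[i] + c * math.sqrt(math.log(t) / counts[i])
    return max(range(len(counts)), key=score)

def update_stats(mean_reward, counts, action, reward):
    # Incremental update of the empirical mean reward of the played action.
    counts[action] += 1
    mean_reward[action] += (reward - mean_reward[action]) / counts[action]
```

A larger c inflates the exploration bonus of rarely tried actions, so the UAV keeps probing directions whose rewards are still uncertain.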

B. Satisfaction Based UAV Placement Approach
The 3D deployment problem of the UAVs can be formulated as a noncooperative game in satisfaction form, which comprises a multi-step process. In a satisfaction-form game, each UAV is interested only in the satisfaction of its constraints. In the first phase, the UAVs try to be satisfied with a high satisfaction threshold. After a predefined time interval, unsatisfied players can decide to reduce their thresholds and play the game with the redefined thresholds.

Algorithm 1: MAB based learning algorithm for the 3D deployment of UAVs
1: Initialization: R̄_{u,i}(t) = 0, n_{u,i}(t) = 0 for t = 0, ∀u ∈ U, ∀s_{u,i} ∈ S_u and i ∈ {1, ..., |S_u|}
2: while t < N do
3:   t ← t + 1
4:   for each u ∈ U do
5:     if ∃ s_{u,i} ∈ S_u s.t. n_{u,i}(t) = 0 then
6:       Select s_u(t) = s_{u,i}
7:     else
8:       Select s_u(t) = s_u^UCB(t) according to (11)
9:     end if
10:     Update the location a_u(t) based on s_u(t)
11:     Calculate R_u(t) according to (10)
12:     for ∀s_{u,i} ∈ S_u do
13:       Update n_{u,i}(t)
14:       Update R̄_{u,i}(t)
15:     end for
16:   end for
17: end while

A game in satisfaction form can be described by the following triplet [14], [18]:

G = ( U, {S_u}_{u ∈ U}, {c_u}_{u ∈ U} ),   (12)

where U and S_u are, respectively, the set of players and the set of actions, as defined in Section III-A. The correspondence c_u(s_{−u}) ⊆ S_u denotes the set of actions that can satisfy the constraints of UAV u given the actions played by all other UAVs, where s_{−u} denotes the actions of all UAVs except UAV u. Therefore, the correspondence c_u(s_{−u}) can be defined as follows [18]:

c_u(s_{−u}) = { s_u ∈ S_u : R_u(s_u, s_{−u}) ≥ κ_u },   (13)

where κ_u is the satisfaction threshold of UAV u. According to the observed utility, each UAV u ∈ U updates a satisfaction indicator ϑ_u(t) at time t as follows:

ϑ_u(t) = 1_{ R_u(s_u(t), s_{−u}(t)) ≥ κ_u },   (14)

where s_u(t) is the action played by UAV u at time t. For a game in satisfaction form, an important outcome is the satisfaction equilibrium, where all players are satisfied and c_u(s_{−u}) is nonempty for each player u.
The notion of satisfaction equilibrium can be formulated as a fixed point as follows [18]:

Definition 1 (Satisfaction Equilibrium): An action profile s* = (s*_u, s*_{−u}) is a satisfaction equilibrium if ∀u ∈ U, s*_u ∈ c_u(s*_{−u}).

However, for a given satisfaction threshold, some UAVs may face situations in which they cannot be satisfied. Therefore, a satisfaction equilibrium may not always exist. In this context, we reduce the satisfaction threshold κ_u of unsatisfied UAVs after a certain time interval τ, i.e., we use an adaptive threshold approach. In this approach, after each time interval τ, an unsatisfied UAV u decreases its satisfaction threshold κ_u to κ_u(1 − δ), where 0 < δ < 1 denotes a constant coefficient used to decrease the satisfaction threshold.
To solve the game in a distributed manner, we use a learning algorithm and assume that each UAV u can observe its obtained utility. If the UAV is satisfied with its current utility (i.e., ϑ_u(t) = 1), it has no incentive to change its action. Otherwise, it may change its action according to the probability distribution π_u(t) = (π_{u,1}(t), ..., π_{u,|S_u|}(t)), where π_{u,i}(t) is the probability assigned to action s_{u,i} ∈ S_u with i ∈ {1, ..., |S_u|}. Therefore, each UAV u updates the probability assigned to each action s_{u,i} ∈ S_u as follows [14]:

π_{u,i}(t + 1) = { 1_{s_u(t) = s_{u,i}}   if ϑ_u(t) = 1,
                { L_u(π_{u,i}(t))        otherwise.   (15)

Here, L_u(π_{u,i}(t)) = π_{u,i}(t) + μ_u(t) q_u(t) ( 1_{s_u(t) = s_{u,i}} − π_{u,i}(t) ), where μ_u(t) = 1/(100t + 1) is the learning rate of UAV u. The parameter q_u(t) is computed as q_u(t) = 1 − R_u(t)/R_max, where R_max is the maximum utility that UAV u can achieve.
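One step of the update rule (15) can be sketched as below; the forms of μ_u(t) and q_u(t) follow the text, while the list-based representation of π_u(t) and the function name are assumptions for illustration.

```python
def update_policy(probs, played, satisfied, t, reward, r_max):
    # A satisfied UAV keeps its current action: all mass on `played`.
    if satisfied:
        return [1.0 if i == played else 0.0 for i in range(len(probs))]
    mu = 1.0 / (100.0 * t + 1.0)   # learning rate mu_u(t)
    q = 1.0 - reward / r_max       # dissatisfaction level q_u(t) (assumed form)
    # Linear rule L_u: move each probability toward the played-action
    # indicator by a step proportional to mu_u(t) * q_u(t).
    return [p + mu * q * ((1.0 if i == played else 0.0) - p)
            for i, p in enumerate(probs)]
```

Because the update is a convex combination of the old distribution and an indicator vector, the probabilities remain a valid distribution after every step.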
The pseudocode for the proposed satisfaction approach is presented in Algorithm 2.
Proposition 1: The procedure described in Algorithm 2 converges to an equilibrium of the game G in finite time.
Proof. Since the sets of players and actions are finite, and since π_{u,i}(t) > 0 for all u ∈ U and all s_{u,i} ∈ S_u, every action profile is played with nonzero probability, so each action profile will be played at least once in finite time. Once an equilibrium action profile is played, all UAVs are satisfied, no UAV changes its action, and convergence is observed.

IV. SIMULATION RESULTS
To evaluate the performance of our proposed approaches, we consider a hexagonal layout with a radius of 250 m and one terrestrial BS located in the center of the area. The users are uniformly distributed in the area, and their locations are fixed during each simulation run.
Algorithm 2: Satisfaction based learning algorithm for the 3D deployment of UAVs
1: Initialization: t = 0, π_{u,i}(t) = 1/|S_u|, ϑ_u(0) = 0, ∀u ∈ U, ∀s_{u,i} ∈ S_u and i ∈ {1, ..., |S_u|}
2: while t < N do
3:   t ← t + 1
4:   for each u ∈ U do
5:     if ϑ_u(t − 1) = 1 then
6:       Select s_u(t) = s_u(t − 1)
7:     else
8:       Select s_u(t) randomly according to π_u(t)
9:     end if
10:     if mod(t, τ) = 0 and ϑ_u(t − 1) = 0 then
11:       Update κ_u ← κ_u(1 − δ)
12:     end if
13:     Update the location a_u(t) based on s_u(t)
14:     Calculate utility R_u(t) according to (10)
15:     Calculate ϑ_u(t) according to (14)
16:     Update π_u(t) according to (15)
17:   end for
18: end while

In our simulations, we assume that the number of UAVs is 4, unless otherwise stated. We conduct multiple simulations for various configurations and average over the independent runs. The simulation parameters are summarized in Table I. Furthermore, we demonstrate the performance gain of our proposed learning based schemes over the following benchmark references:
• Blind placement approach: the 3D locations of the UAVs are chosen randomly.
• Q-learning based placement approach: the UAVs update their locations based on the Q-learning approach proposed in [8]. The altitude of the UAVs is assumed to be 100 m.
In the proposed approaches and the Q-learning approach, the initial locations of the UAVs are chosen according to a heuristic approach, in which the horizontal location of each new UAV is selected from a set of predefined horizontal locations so as to achieve the furthest distance from the BSs already in the system, similar to [19].

Fig. 2 depicts the convergence behavior of the MAB and satisfaction based approaches in terms of average throughput for a network with 500 users. As can be observed, the satisfaction based learning approach needs more iterations to converge than the MAB based learning approach: the average number of iterations for convergence reaches up to about 450 and 800 iterations for the MAB based and the satisfaction based approaches, respectively. Furthermore, the proposed satisfaction approach yields better performance than the MAB based algorithm. As a result, there is a tradeoff between convergence speed and performance.

Fig. 3 illustrates the average rate per user. We can observe that as the number of users increases, the average rate per user decreases due to the increasing load and the limited resources available in the network. Since the satisfaction based approach optimizes the 3D locations of the UAVs in terms of satisfying each UAV's throughput, it improves the average rate of users compared to the benchmark algorithms. For instance, for a network with 400 users, the satisfaction based approach yields 26.25%, 49.28%, and 97.29% improvements in the average user rate compared to the MAB based, Q-learning, and blind approaches, respectively.
In Fig. 4, we show the average number of outage users (i.e., users whose rates are less than υ_k) per BS. The figure shows that the average number of outage users increases with the number of users. This is mainly due to the limited resources available in the network: as the number of users grows, some of them may experience rate reductions due to overloaded BSs. Since the proposed satisfaction approach improves the average rate compared to the other approaches, it also yields better performance in terms of the average number of outage users. The performance gain of the satisfaction based approach over the MAB, Q-learning, and blind approaches is up to 33.1%, 41.2%, and 48.9%, respectively, for a network with 250 users. Accordingly, the satisfaction approach is more resilient to higher traffic demand.

Fig. 5 illustrates the impact of the number of UAVs on the performance of the system for the proposed and benchmark approaches. As can be seen, the satisfaction based approach outperforms the others, though its advantage over the MAB based approach decreases as the number of UAVs increases. Moreover, for a very low number of UAVs, Q-learning performs slightly better than MAB. This figure, in general, reveals that the number of UAVs required for a target performance decreases when learning methods are used, making UAV networking even more cost-effective. For instance, to reduce the number of outage users to 100, the learning approach needs 2 UAVs compared to 3 UAVs for the blind approach. This benefit of learning grows as the number of tolerated outage users decreases.
V. CONCLUSION
In this paper, we have proposed two novel algorithms for the placement of UAVs integrated into terrestrial cellular networks to compensate for outage and overload. The proposed approaches, which are based on the MAB and satisfaction methods, leverage tools from game theory and machine learning to learn the 3D locations of the UAVs. Our results have shown that the proposed approaches can significantly outperform the benchmark algorithms in terms of both throughput and the number of outage users. Furthermore, our proposed approaches do not require full information about the network (e.g., the locations of the users), and they do not impose a significant amount of information exchange on the network. Therefore, they can be executed in a distributed manner.