Mobility and Blockage-aware Communications in Millimeter-Wave Vehicular Networks

Mobility may degrade the performance of next-generation vehicular networks operating in the millimeter-wave spectrum: frequent loss of alignment and blockages require repeated beam training and handover, thus incurring significant overhead. In this paper, an adaptive and joint design of beam training, data transmission and handover is proposed that exploits the mobility process of mobile users (MUs) and the dynamics of blockages to optimally trade off throughput and power consumption. At each time slot, the serving base station decides to perform either beam training, data communication, or handover when blockage is detected. The decision problem is cast as a partially observable Markov decision process, and the goal is to maximize the throughput delivered to the MU, subject to an average power constraint. To address the high dimensionality of the problem, an approximate dynamic programming algorithm based on a variant of PERSEUS [2] is developed, where both the primal and dual functions are simultaneously optimized to meet the power constraint. Numerical results show that the PERSEUS-based policy has near-optimal performance, and achieves a 55% gain in spectral efficiency compared to a baseline scheme with periodic beam training. Motivated by the structure of the PERSEUS-based policy, two heuristic policies with lower computational cost are proposed. These are shown to achieve a performance comparable to that of PERSEUS, with just a 10% loss in spectral efficiency.


I. INTRODUCTION
Current sub-6 GHz vehicular communication systems cannot support the demand of future applications such as autonomous driving, augmented reality and infotainment, due to limited bandwidth availability [3]. To support this demand, new solutions are being explored that leverage the large amount of bandwidth in the 30-300 GHz band, the so-called millimeter-wave (mm-wave) spectrum. While communication at these frequencies is ideal to support high capacity demands, it relies on highly directional transmissions and is susceptible to blockages and misalignment, which are exacerbated in highly mobile environments. In this paper, we show that knowledge of the mobility process of mobile users and of blockage dynamics is extremely important in the design of communication strategies: the faster the environment we operate in, the more frequent the loss of alignment and blockages, and the more resources need to be allocated to maintain beam alignment and perform handover to compensate for blockage. However, two key questions arise: How do we leverage the system dynamics to optimize the communication performance? How much do we gain by doing so? To address these questions, in this paper we envision the use of adaptive communication strategies and their formulation via partially observable (PO) Markov decision processes (MDPs).
In the proposed scenario, two base stations (BSs) on both sides of a road link serve a mobile user (MU) moving along it. At any time, the MU is associated with one of the two BSs (the serving BS). To estimate the position of the MU within the road link and enable directional data transmission (DT), the serving BS performs beam training (BT); to compensate for blockage, it performs handover (HO) of the data traffic to the other BS on the opposite side of the road link. The goal is to design the BT/DT/HO strategy so as to maximize the throughput delivered to the MU, subject to an average power constraint. We formulate the optimization problem as a POMDP, and develop an approximate dynamic programming algorithm based on PERSEUS [2]. To meet the average power constraint, we design a variant of PERSEUS which simultaneously optimizes the primal and dual functions, and demonstrate its convergence numerically. Our numerical evaluations based on a data-driven mobility model demonstrate that the PERSEUS-based policy performs very close to a genie-aided upper bound in which the position of the MU and the blockage states are known, and outperforms a baseline scheme with periodic beam training by up to 55% in spectral efficiency. Motivated by its structure, we design two heuristic policies with lower computational cost (belief-based and finite-state-machine-based heuristics) and show numerically that they incur a small 10% degradation in spectral efficiency compared to the PERSEUS-based policy. Part of this work has been submitted to IEEE ICC 2020 [1].
Related Work: Beam training design for mm-wave systems has been an area of extensive research for the past decade, with various approaches proposed, such as beam sweeping [4], estimation of angles of arrival (AoA) and of departure (AoD) [5], and data-assisted schemes [6]. Despite their simplicity, the overhead of these algorithms may ultimately offset the benefits of beamforming in highly mobile environments [3].
While wider beams require less beam training, they result in a lower beamforming gain, hence smaller achievable capacity [7]. While contextual information, such as GPS readings of vehicles [6], may alleviate this overhead, it does not eliminate the need for beam training due to noise and inaccuracies in GPS acquisition. Thus, the design of schemes that alleviate the beam training overhead is of great importance.
In most of the aforementioned works, a priori information on the vehicle's mobility as well as blockage dynamics is not leveraged in the design of BT/DT protocols. In contrast, we contend that leveraging such information via adaptive communications can greatly improve the performance of automotive networks [8]. In our previous work [9], we bridge this gap by designing adaptive strategies for BT/DT that leverage a priori mobility information via POMDPs, but with no consideration of blockage (hence no handover). Compared to [4], which is based on a "worst-case" mobility pattern, in our previous work [9] we assume a statistical mobility model, where the position of the MU evolves following Markovian dynamics, and exploit these dynamics via POMDPs.
A distinctive feature of the mm-wave channel is its highly dynamic behavior, due to the occurrence of blockages on a very short time scale [10]. In this respect, cell selection represents a fundamental functionality to preserve communication in the event of link obstruction: to this end, the quality of the mm-wave link needs to be tightly tracked, and the MU should rapidly switch to another BS in response to the fast-varying link state. Several MDP-based handoff strategies have been investigated over the past decade to solve this problem in the sub-5 GHz range [11], [12], as MDPs naturally capture dynamics in the link state. However, these techniques cannot be readily applied to mm-wave frequencies, which exhibit peculiar features such as fast-varying blockage dynamics. In this paper, we develop techniques to quickly detect blockages and restore beam alignment via handover.
Related work that applies machine learning to mm-wave networks includes [13]-[17], revealing a growing interest in the design of schemes that exploit side information to enhance the overall network performance. For example, a coordinated beamforming technique using a combination of deep learning and ray-tracing is proposed in [13], demonstrating its ability to efficiently adapt to changing environments. More recent solutions are based on multi-armed bandits, leveraging contextual information to reduce the training overhead as in [14], or the beam alignment feedback to improve the beam search in the next rounds as in [15]-[17]. However, no handover strategies are considered in these works, resulting in limited ability to combat blockage. In addition, these works neglect the impact of realistic mobility and blockage processes on the performance. Compared to this line of work, in this paper we design adaptive communication strategies that leverage statistical information on the mobility and blockage processes in the selection of BT/DT/HO actions, with the goal of optimizing the average long-term communication performance of the system. This approach is in contrast to strategies that either use non-adaptive algorithms [13], lack a mechanism to perform handover [14]-[17], or assume an unrealistic mobility pattern in their design.
Our Contributions:
1) We define a POMDP framework to optimize the BT/DT/HO strategy in a mm-wave vehicular network, subject to mobility of the MU and time-varying blockage; based on this POMDP formulation, we formulate an optimization problem with the goal to maximize throughput subject to an average power constraint.
2) We propose a novel feedback mechanism for BT, which reports the ID of the strongest beam if the received power is above a threshold (a design parameter), otherwise it reports ∅ to indicate misalignment or blockage. We analyze in closed form the feedback distribution and the probability of incorrect detection.
3) To address the complexity of POMDPs, we use PERSEUS [2], an approximate point-based value iteration (PBVI) algorithm which optimizes the value function on a subset of belief points representative of the belief space. However, differently from PERSEUS (which is unconstrained), we incorporate the average power constraint via a Lagrangian formulation, and add a dual optimization step to solve the constrained problem. We demonstrate its convergence numerically.
4) Inspired by the PERSEUS-based policy, we propose two heuristics with lower computational cost and near-optimal performance, namely belief-based (B-HEU) and finite-state-machine-based (FSM-HEU) heuristic policies, and analyze the performance of FSM-HEU in closed form.
5) We present numerical results for both the idealized sectored antenna model with abstracted mobility model, and a more realistic scenario with analog beamforming and Gauss-Markov mobility. The proposed PERSEUS-based, B-HEU and FSM-HEU policies are shown to outperform a baseline scheme that performs periodic beam alignment by a factor of 2 in spectral efficiency; additionally, B-HEU and FSM-HEU are shown to achieve near-optimal performance (up to 10% degradation with respect to the PERSEUS-based policy), at a fraction of the computational complexity of PERSEUS. Finally, our results show a good match between the numerical results based on the analysis and the ones based on analog beamforming with Gauss-Markov mobility, thus validating the accuracy of the analysis presented in the paper.
The rest of the paper is organized as follows. In Section II, we introduce the system model, followed by the POMDP formulation in Section III and its optimization in Section IV. In Section V, we present two heuristic policies and the analysis of FSM-HEU. We present numerical results in Section VI, followed by concluding remarks in Section VII.
II. SYSTEM MODEL
We consider the scenario of Fig. 1, where multiple base stations (BSs) serve a mobile user (MU) moving along a road. At any time, the MU is associated with one BS (the serving BS), which performs data transmission (DT) using beamforming to create a directional link, along with beam training (BT) to maintain alignment. The communication links are subject to time-varying blockages, which cause the signal quality to drop abruptly and DT to fail. As soon as the serving BS detects blockage, it may decide to perform handover (HO) to the BS on the other side of the road, which then continues the process of BT/DT/HO, until either another blockage event is detected, or the MU exits the coverage area of the two BSs.
In this work, we focus on a specific segment of the road link covered by a pair of BSs, as depicted in the framed area of Fig. 1. Within this segment, the BT/DT/HO process continues until the MU exits the coverage area of the two BSs. In this context, we investigate the design of the BT/DT/HO strategy during a transmission episode, defined as the time interval between the two instants when the MU enters and exits the coverage area of the two BSs. The goal is to maximize the average throughput delivered to the MU subject to an average power constraint. Note that, when the episode terminates, the MU enters the coverage area of another pair of BSs, and the same analysis may be applied to each segment traversed.
Both BSs are at a distance $D$ from the road segment, symmetrically with respect to the road, and use a discrete set of narrow beams to communicate with the MU. To this end, the road segment, of length $L \triangleq 2D\tan(\Theta/2)$ and angular range $\Theta$, is partitioned into $N_S$ sectors of equal length $\Delta_s = L/N_S$ (see footnote 1), indexed by $s \in \mathcal{S} \equiv \{1, \dots, N_S\}$. Each sector is then associated with one transmission beamformer $c^{(s)}$, with angular support
$$\Phi_s = \left[\arctan\!\left(\frac{(s-1)\Delta_s - L/2}{D}\right),\ \arctan\!\left(\frac{s\Delta_s - L/2}{D}\right)\right]$$
and beamwidth $\theta_s = |\Phi_s|$. Note that $\cup_{s\in\mathcal{S}} \Phi_s = [-\Theta/2, \Theta/2]$ and $\sum_{s\in\mathcal{S}} \theta_s = \Theta$, so that the ensemble of all beams spans the entire angular region covered by the two BSs. Time is discretized into time-slots of duration $\Delta_t$, corresponding to the transmission of a beacon signal during BT or of a data fragment during DT. Next, we describe the MU mobility, signal and channel models used throughout the paper.
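The geometry above can be made concrete with a short sketch. It assumes that a position $x$ on the segment $[0, L]$ is seen from the BS at angle $\arctan((x - L/2)/D)$ from the perpendicular to the road; the function name and the numerical values (10 m, 120 degrees, 8 sectors) are hypothetical and chosen only for illustration.

```python
import math

def sector_supports(D, theta, n_sectors):
    """Return the road length L, the sector length, and each sector's angular support.

    Assumption: a position x in [0, L] is seen at angle atan((x - L/2) / D),
    measured from the perpendicular to the road.
    """
    L = 2.0 * D * math.tan(theta / 2.0)
    delta_s = L / n_sectors
    supports = []
    for s in range(1, n_sectors + 1):
        lo = math.atan(((s - 1) * delta_s - L / 2.0) / D)
        hi = math.atan((s * delta_s - L / 2.0) / D)
        supports.append((lo, hi))
    return L, delta_s, supports

# Hypothetical values: D = 10 m, 120-degree coverage, 8 equal-length sectors.
L, delta_s, supports = sector_supports(10.0, math.radians(120.0), 8)
beamwidths = [hi - lo for lo, hi in supports]
```

With equal-length sectors, the beamwidths are not equal: sectors near the center of the segment subtend a larger angle than those at the edges, while the beamwidths still sum to the angular range.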

A. MU Mobility Model
Let $x_k \in [0, L]$ be the position of the MU at time-slot $k$ and
$$S_k = \zeta(x_k) \tag{1}$$
be the associated sector index, where $\zeta(\cdot)$ maps the MU position $x_k \in [0, L]$ to the sector index $S_k \in \mathcal{S}$. The position $x_k$ evolves over time according to a random process. Synthetic mobility models proposed in the literature [18], [19] may not adequately represent the specific context operated by the two BSs, thus necessitating the use of a data-driven model to characterize the system performance. Since the goal of the two BSs is to perform data communication using one out of $N_S$ directional beams, it is sufficient to determine the dynamics of $\{S_k, k \geq 0\}$, via the one-step transition probability $P_{ss'} = \mathbb{P}[S_{k+1} = s' | S_k = s]$, $\forall s, s' \in \mathcal{S}$.
To do so: first, $N$ trajectories $\{x_k^{(n)}, k \geq 0\}$, $n = 1, \dots, N$, are obtained; then, $P_{ss'}$ is estimated as the empirical transition frequency
$$\hat{P}_{ss'} = \frac{\sum_{n=1}^{N}\sum_{k} \chi\big(\zeta(x_k^{(n)}) = s\big)\,\chi\big(\zeta(x_{k+1}^{(n)}) = s'\big)}{\sum_{n=1}^{N}\sum_{k} \chi\big(\zeta(x_k^{(n)}) = s\big)} \tag{2}$$
where $\chi(\cdot)$ is the indicator function. In the following analysis, we assume that $P$ is known, and we define the absorbing state $\bar{s}$ with $P_{\bar{s}\bar{s}} = 1$, to model the end of the transmission episode. In Sec. VI, we will present numerical simulations based on the Gauss-Markov mobility model [19], with $P$ estimated as in (2).
Footnote 1: The equal-length assumption is made for the sake of notational convenience. However, the analysis is valid for sectors of non-uniform length.
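The estimation step can be sketched as a simple counting procedure. This is a minimal illustration, not the authors' implementation: the label `END` for the absorbing state, the treatment of any position outside $[0, L)$ as out of coverage, and the toy trajectories are all assumptions.

```python
from collections import defaultdict

END = 0  # hypothetical label for the absorbing "episode ended" state

def zeta(x, L, n_sectors):
    """Map a position to a 1-based sector index, or END when outside [0, L)."""
    if x < 0.0 or x >= L:
        return END
    return min(int(x / (L / n_sectors)), n_sectors - 1) + 1

def estimate_transitions(trajectories, L, n_sectors):
    """Empirical one-step transition frequencies between sector indices."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        sectors = [zeta(x, L, n_sectors) for x in traj]
        for s, s_next in zip(sectors, sectors[1:]):
            if s == END:
                break  # the episode is over once the MU leaves coverage
            counts[s][s_next] += 1
    P = {END: {END: 1.0}}  # absorbing state
    for s, row in counts.items():
        total = sum(row.values())
        P[s] = {s_next: c / total for s_next, c in row.items()}
    return P

# Two toy trajectories (positions in meters) on a 10 m segment with 4 sectors.
trajectories = [[1.0, 3.0, 6.0, 11.0], [1.0, 1.5, 3.0, 12.0]]
P = estimate_transitions(trajectories, L=10.0, n_sectors=4)
```

Rows are stored sparsely, so sectors that are never visited in the data simply have no entry; a practical estimator would need enough trajectories to cover every sector.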

B. Signal and Channel Models
Within the $k$th time-slot, $L$ symbols are transmitted by the serving BS, denoted by the index $I_k \in \{1, 2\}$. Let $x_k \in \mathbb{C}^L$ be the transmitted signal with $\mathbb{E}[\|x_k\|_2^2] = L$. Assuming an isotropic MU, its received signal is expressed as
$$y_k = \sqrt{P_k}\,\mathbf{h}_k c_k x_k + w_k \tag{3}$$
where $P_k$ is the average transmit power of the serving BS; $\mathbf{h}_k \in \mathbb{C}^{1\times M_{tx}}$ is the channel vector; $M_{tx}$ is the number of antenna elements at each BS; $c_k \in \mathbb{C}^{M_{tx}\times 1}$ with $\|c_k\|_2^2 = 1$ is the analog beamforming vector; $w_k \sim \mathcal{CN}(0, \sigma_w^2 I)$ with $\sigma_w^2 = (1 + F) N_0 W_{tot}$ is additive white Gaussian noise (AWGN), $N_0$ is the noise power spectral density, $W_{tot}$ is the signal bandwidth and $F$ is the noise figure of the receiver.
In this paper, we model $\mathbf{h}_k$ as a single line-of-sight (LOS) path with binary blockage [20] and diffuse multipath components [21],
$$\mathbf{h}_k = \sqrt{M_{tx}}\,B_k^{(I_k)} h_k\, d_{tx}(\psi_k)^H + \mathbf{h}_k^{DIF},$$
where $B_k^{(i)} \in \{0, 1\}$ denotes the binary blockage variable of BS $i$, equal to 1 if the LOS path is unobstructed, equal to zero otherwise; $d_{tx}(\psi_k) \in \mathbb{C}^{M_{tx}}$ is the BS array response vector, and $\psi_k \triangleq \sin(\phi_k) = (x_k - L/2)/d(\phi_k)$ is the spatial angle corresponding to the angle of departure (AoD, computed with respect to the perpendicular to the array) $\phi_k \in [-\Theta/2, \Theta/2]$; $h_k \sim \mathcal{CN}(0, \sigma_h^2)$ is the complex channel gain of the LOS component, i.i.d. over slots, with $\sigma_h^2 = 1/\ell(\phi_k)$; $\ell(\phi_k) = (4\pi d(\phi_k)/\lambda_c)^2$ denotes the path loss as a function of the BS-MU distance $d(\phi_k) = D\sqrt{1 + \tan(\phi_k)^2}$ (see Fig. 1); $\lambda_c = c/f_c$ is the wavelength. Finally, $\mathbf{h}_k^{DIF}$ denotes the channel corresponding to diffuse multipath components with coefficients $h_{k,l}^{DIF}$ and AoD $\phi_{k,l}^{DIF}$; we model $\mathbf{h}_k^{DIF}$ as zero-mean complex Gaussian, with i.i.d. entries (over time and over antennas), each with variance $\sigma_{DIF}^2$. These components have been shown to be much weaker than the LOS path (up to 100× weaker at a BS-MU distance of only 10 meters [20]), so that $\sigma_{DIF}^2 \ll \sigma_h^2$. Then, letting $G_{tx}(c, \phi) = M_{tx}|d_{tx}(\sin\phi)^H c|^2$ be the beamforming gain of the serving BS and $\Theta_{tx} = \angle\, d_{tx}(\psi)^H c$ be its phase, the signal received at the MU can be expressed as
$$y_k = \sqrt{P_k}\left[h_k B_k^{(I_k)} \sqrt{G_{tx}(c_k, \phi_k)}\, e^{j\Theta_{tx}} + \Omega_k\right] x_k + w_k \tag{6}$$
where $\Omega_k \triangleq \mathbf{h}_k^{DIF} c_k \sim \mathcal{CN}(0, \sigma_{DIF}^2)$ is the contribution due to the diffuse multipath channel components. The SNR averaged over the fading coefficients is given as
$$\mathrm{SNR}_k = \frac{P_k}{\sigma_w^2}\left[B_k^{(I_k)} \sigma_h^2\, G_{tx}(c_k, \phi_k) + \sigma_{DIF}^2\right]. \tag{7}$$
The blockage state $B_k^{(i)}$ evolves over time as a binary Markov chain with transition probabilities
$$\mathbb{P}\big(B_{k+1}^{(i)} = b' \,\big|\, B_k^{(i)} = b\big) = p_{bb'}^{(i)}, \quad b, b' \in \{0, 1\}. \tag{8}$$
Since the two BSs are on opposite sides of the road, they experience different types of obstructions to the MU. We thus model the processes $\{B_k^{(i)}, k \geq 0\}$, $i \in \{1, 2\}$, as independent of each other, with Markov dynamics (8).
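As a rough numerical illustration of the quantities defined above, the sketch below evaluates the path loss, the noise power and an average SNR. The SNR expression (LOS term gated by the blockage state plus a diffuse term) is an assumption inferred from the definitions in this section, and every numerical value (carrier frequency, bandwidth, noise density, noise figure, gain, transmit power) is hypothetical.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s

def path_loss(phi, D, f_c):
    """Path loss (4*pi*d/lambda)^2 at BS-MU distance d = D*sqrt(1 + tan(phi)^2)."""
    d = D * math.sqrt(1.0 + math.tan(phi) ** 2)
    wavelength = C_LIGHT / f_c
    return (4.0 * math.pi * d / wavelength) ** 2

def noise_power(N0, W_tot, F):
    """Noise power (1 + F) * N0 * W_tot, with F in linear scale."""
    return (1.0 + F) * N0 * W_tot

def average_snr(P_tx, gain, phi, D, f_c, unblocked, sigma2_dif, sigma2_w):
    """Assumed form: P * (B * sigma_h^2 * G + sigma_DIF^2) / sigma_w^2."""
    sigma2_h = 1.0 / path_loss(phi, D, f_c)
    return P_tx * (unblocked * sigma2_h * gain + sigma2_dif) / sigma2_w

# Hypothetical link: 30 GHz carrier, 100 MHz bandwidth, D = 10 m, gain 64.
sigma2_w = noise_power(N0=4e-21, W_tot=100e6, F=10.0)
sigma2_h = 1.0 / path_loss(0.0, 10.0, 30e9)
snr_los = average_snr(0.1, 64.0, 0.0, 10.0, 30e9, 1, sigma2_h / 100.0, sigma2_w)
snr_blocked = average_snr(0.1, 64.0, 0.0, 10.0, 30e9, 0, sigma2_h / 100.0, sigma2_w)
```

Under these assumed values, blockage removes the LOS term entirely, so the blocked SNR is determined only by the weak diffuse components.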

C. Sectored antenna model
In this paper, we use the sectored antenna model to approximate the BS beamforming gain $G_{tx}(c^{(s)}, \phi)$, as also used in [8], [17]. As we will show in Section VI, when coupled with appropriate design of the beamforming vector $c^{(s)}$ [22], the sectored model provides an accurate and analytically tractable approximation of the actual beamforming gain.
Consider the $s$-th beam spanning the $s$-th sector, with angular support $\Phi_s$ and beamforming vector $c^{(s)}$. Under the sectored model, the beamforming gain is such that the gain-to-path-loss ratio is approximately constant within the mainlobe, $\sigma_h^2 G_{tx}(c^{(s)}, \phi) \approx \Upsilon_s$, $\forall \phi \in \Phi_s$, and much smaller in the sidelobes, $\phi \notin \Phi_s$. Based on this model, we now derive expressions for the transmission power to achieve a target SNR at the receiver. We denote the case in which the MU is in the mainlobe with no blockage ($\phi \in \Phi_s$ and $B_k^{(I_k)} = 1$) as "active beam", and the complementary case ($\phi \notin \Phi_s$ or $B_k^{(I_k)} = 0$) as "no active beam." In the case of active beam, from (7) we have
$$P_k = \frac{\sigma_w^2}{\Upsilon_s}\,\mathrm{SNR} \tag{9}$$
where we note that $\sigma_{DIF}^2 \ll \sigma_h^2 G_{tx}(c^{(s)}, \phi) \approx \Upsilon_s$, $\forall \phi \in \Phi_s$, i.e., the signal strength of the LOS component is much larger than that of the diffuse multipath components within the mainlobe. Otherwise, in case of no active beam we find that
$$\mathrm{SNR}_k = \frac{b\,\sigma_h^2 G_{tx}(c^{(s)}, \phi) + \sigma_{DIF}^2}{\Upsilon_s}\,\mathrm{SNR} \leq \rho\,\mathrm{SNR}.$$
Herein, we use a worst-case approximation of the SNR for detection performance in the case of no active beam,
$$\rho \triangleq \max_{b, s, \phi}\ \frac{b\,\sigma_h^2 G_{tx}(c^{(s)}, \phi) + \sigma_{DIF}^2}{\Upsilon_s}, \tag{11}$$
where the maximization is over the blockage state $b \in \{0, 1\}$, sector $s$ and sidelobe angle $\phi$. In other words, to achieve a target SNR within the mainlobe of sector $s$, the BS should transmit with power given by (9); however, if the signal is blocked or the MU is on the sidelobe (or both), the associated worst-case SNR is $\rho\,\mathrm{SNR}$. Note that, in this case, data transmission is in outage since $\rho \ll 1$ (numerically, we found $\rho = -22$ dB). In addition, the larger $\rho$, the more difficult it is for the BS to detect whether the signal is blocked or not, or whether the MU is in the mainlobe or sidelobe; for this reason, $\rho$ is defined by maximizing the SNR over all possible sidelobe and blockage states as in (11).

D. Beam Training (BT) and Data Transmission (DT)
We now introduce the BT and DT operations performed by the serving BS. We let $(S_k, I_k, B_k^{(1)}, B_k^{(2)})$ be the state of the system in time-slot $k$, where $S_k \in \mathcal{S}$ denotes the index of the sector occupied by the MU, $I_k$ is the index of the serving BS, and $B_k^{(i)} \in \{0, 1\}$ is the blockage state of the $i$th BS.
BT phase: At the start of a BT phase, the BS selects a set of sectors $\hat{\mathcal{S}}_{BT}$ over which it will send the beacons $x_k$ for BT, and a target SNR $\mathrm{SNR}_{BT}$. The beacon transmission is done in sequence, using one slot for each sector in the set $\hat{\mathcal{S}}_{BT}$. Therefore, the duration of the BT phase is $T_{BT} \triangleq |\hat{\mathcal{S}}_{BT}| + 1$, including the last slot for feedback signaling from the MU to the BS. Let $i \in \{0, \dots, T_{BT} - 2\}$ be the $i$th time-slot of the BT phase, and $\hat{s}_i \in \hat{\mathcal{S}}_{BT}$ be the sector scanned by the BS. The MU processes the received signal $y_{k+i}$ with a matched filter,
$$z_{\hat{s}_i} \triangleq \frac{|x_{k+i}^H y_{k+i}|^2}{\sigma_w^2 \|x_{k+i}\|_2^2}.$$
Upon collecting the sequence $\{z_{\hat{s}}, \forall \hat{s} \in \hat{\mathcal{S}}_{BT}\}$, the MU generates the feedback signal
$$Y = \begin{cases} \hat{s}^* \triangleq \arg\max_{\hat{s} \in \hat{\mathcal{S}}_{BT}} z_{\hat{s}}, & \text{if } \max_{\hat{s} \in \hat{\mathcal{S}}_{BT}} z_{\hat{s}} > \eta_{BT}, \\ \emptyset, & \text{otherwise.} \end{cases}$$
In other words, if all the matched filter outputs are weaker than $\eta_{BT}$, $Y = \emptyset$ indicates that no beam is deemed sufficient for data transmission, either due to blockage ($B_k^{(I)} = 0$), or the MU being located outside of the BT beams ($S_k \notin \hat{\mathcal{S}}_{BT}$). Otherwise, $Y = \hat{s}^*$ indicates the ID of the strongest beam detected.
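The feedback rule described above amounts to a thresholded arg-max. A minimal sketch, in which `None` stands for the ∅ feedback and the matched-filter outputs are made-up numbers:

```python
def bt_feedback(z, eta_bt):
    """Return the ID of the strongest scanned sector if its matched-filter
    output exceeds the threshold eta_bt, else None (standing for the
    "no active beam" feedback).

    z maps each scanned sector ID to its matched-filter output.
    """
    best = max(z, key=z.get)
    return best if z[best] > eta_bt else None

# Hypothetical matched-filter outputs for three scanned sectors.
z = {3: 0.5, 4: 7.2, 5: 1.1}
strong = bt_feedback(z, eta_bt=2.0)   # sector 4 exceeds the threshold
empty = bt_feedback(z, eta_bt=10.0)   # no output exceeds the threshold
```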
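We now perform a probabilistic analysis of the feedback. To this end, we assume that the state variables do not change during the transmission of the beacon sequences, i.e., $(S_{k+i}, I_{k+i}, B_{k+i}^{(1)}, B_{k+i}^{(2)}) = (S_k, I_k, B_k^{(1)}, B_k^{(2)})$ for all slots $i$ of the BT phase. This is a reasonable assumption, since the duration of the BT phase (on the order of 0.1 ms) is typically much shorter than the time required by the MU to change sector (on the order of 100 ms) or the timescales of blockage (on the order of 100 ms). With this assumption, given the system state $(s, I, b_1, b_2)$ during BT, the signal sequence $\{z_{\hat{s}}, \forall \hat{s} \in \hat{\mathcal{S}}_{BT}\}$ is independent across $\hat{s}$, due to the i.i.d. nature of $h_{k+i}$ and $w_{k+i}$. In addition, in case of active beam ($s = \hat{s}$ and $b_I = 1$), by using (6) and (9), $z_{\hat{s}}$ has exponential distribution with mean $1 + \mathrm{SNR}_{BT} L$, $z_{\hat{s}} \sim \mathcal{E}(1 + \mathrm{SNR}_{BT} L)$; otherwise (no active beam, $s \neq \hat{s}$ or $b_I = 0$) $z_{\hat{s}} \sim \mathcal{E}(1 + \rho\,\mathrm{SNR}_{BT} L)$. Now, let us consider separately the two events $\{s \notin \hat{\mathcal{S}}_{BT}\} \cup \{b_I = 0\}$ and $\{s \in \hat{\mathcal{S}}_{BT}\} \cap \{b_I = 1\}$, denoted respectively as "no active beam in $\hat{\mathcal{S}}_{BT}$" and "active beam in $s \in \hat{\mathcal{S}}_{BT}$".
If there is no active beam in $\hat{\mathcal{S}}_{BT}$, then the probability of generating the feedback signal $Y = \emptyset$ (i.e., of correctly detecting no active beams within the sectors scanned in the BT phase) is
$$\mathbb{P}(Y = \emptyset \mid \text{no active beam in } \hat{\mathcal{S}}_{BT}) = \left(1 - e^{-\frac{\eta_{BT}}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}|}, \tag{14}$$
since $Y = \emptyset$ is equivalent to $z_{\hat{s}} \leq \eta_{BT}$, $\forall \hat{s} \in \hat{\mathcal{S}}_{BT}$, and $z_{\hat{s}}$ are independent across $\hat{s}$, conditional on the system state. Similarly, if there is an active beam in $s \in \hat{\mathcal{S}}_{BT}$, the probability of incorrectly detecting no active beam is
$$\mathbb{P}(Y = \emptyset \mid \text{active beam in } s) = \left(1 - e^{-\frac{\eta_{BT}}{1 + \mathrm{SNR}_{BT} L}}\right)\left(1 - e^{-\frac{\eta_{BT}}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}| - 1}. \tag{15}$$
If there is no active beam in $\hat{\mathcal{S}}_{BT}$, the probability of generating the feedback signal $\hat{s}^*$ (i.e., of detecting incorrectly that a strong beam is available) is
$$\mathbb{P}(Y = \hat{s}^* \mid \text{no active beam in } \hat{\mathcal{S}}_{BT}) = \frac{1}{|\hat{\mathcal{S}}_{BT}|}\left[1 - \left(1 - e^{-\frac{\eta_{BT}}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}|}\right]; \tag{16}$$
in fact, $z_{\hat{s}}$ are i.i.d. across beams, conditional on no active beam, so that incorrect detections are uniform across the feedback outcomes $\hat{s}^* \in \hat{\mathcal{S}}_{BT}$. Instead, when there is an active beam in $\hat{\mathcal{S}}_{BT}$, we need to further distinguish between the two cases $s = \hat{s}^*$ (the strongest beam is detected correctly) and $s \neq \hat{s}^*$ (incorrect detection).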
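The probability of correctly detecting the strongest beam is
$$\mathbb{P}(Y = s \mid \text{active beam in } s) = \int_{\eta_{BT}}^{\infty} \frac{e^{-\frac{z}{1 + \mathrm{SNR}_{BT} L}}}{1 + \mathrm{SNR}_{BT} L}\left(1 - e^{-\frac{z}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}| - 1} dz = \sum_{n=0}^{|\hat{\mathcal{S}}_{BT}| - 1} \binom{|\hat{\mathcal{S}}_{BT}| - 1}{n} (-1)^n\, \frac{\exp\!\left\{-\eta_{BT}\left[\frac{1}{1 + \mathrm{SNR}_{BT} L} + \frac{n}{1 + \rho\,\mathrm{SNR}_{BT} L}\right]\right\}}{1 + n\,\frac{1 + \mathrm{SNR}_{BT} L}{1 + \rho\,\mathrm{SNR}_{BT} L}}, \tag{17}$$
where in the first step we used the definition of $Y = s$, i.e., $z_s$ must be greater than the threshold $\eta_{BT}$, and all other $z_{\hat{s}}$ must be smaller than $z_s$; in the last step, we used Newton's binomial theorem to solve the integral. Finally, the probability of incorrectly detecting the strongest beam $\hat{s}^* \neq s$ is
$$\mathbb{P}(Y = \hat{s}^* \mid \text{active beam in } s \neq \hat{s}^*) = \frac{1 - \mathbb{P}(Y = \emptyset \mid \text{active beam in } s) - \mathbb{P}(Y = s \mid \text{active beam in } s)}{|\hat{\mathcal{S}}_{BT}| - 1}, \tag{18}$$
since, similarly to (16), erroneous detections are uniform across the remaining $|\hat{\mathcal{S}}_{BT}| - 1$ sectors. Since $Y = \emptyset$ represents no active beam detected, we choose $\eta_{BT}$ so that the misdetection and false alarm probabilities are both equal to $\delta_{BT}$, yielding from (15)-(16) (over all $\hat{s} \in \hat{\mathcal{S}}_{BT}$),
$$1 - \left(1 - e^{-\frac{\eta_{BT}}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}|} = \delta_{BT} = \left(1 - e^{-\frac{\eta_{BT}}{1 + \mathrm{SNR}_{BT} L}}\right)\left(1 - e^{-\frac{\eta_{BT}}{1 + \rho\,\mathrm{SNR}_{BT} L}}\right)^{|\hat{\mathcal{S}}_{BT}| - 1}. \tag{19}$$
The value of $\eta_{BT}$ and the corresponding $\delta_{BT}$ for a given $\mathrm{SNR}_{BT}$ and $|\hat{\mathcal{S}}_{BT}|$ can be found numerically using the bisection method, since the left- and right-hand sides of (19) are decreasing and increasing functions of $\eta_{BT}$, respectively.
DT phase: At the start of the DT phase, the BS chooses a sector $\hat{s} \in \mathcal{S}$ used for data transmission, along with the duration $T_{DT}$ of the DT frame, the target average SNR at the receiver $\mathrm{SNR}_{DT}$, and a target transmission rate $\bar{R}_{DT}$; the last slot is used for the feedback signal from the MU to the BS, as described below. We assume that a fixed fraction $\kappa \in (0, 1)$ out of $L$ symbols in each slot is used for channel estimation. Then, if an active beam is present in $\hat{s}$, and assuming that channel estimation errors are negligible compared to the noise level (achieved with a sufficiently long pilot sequence $\kappa L$), from the signal model (6), we find that outage occurs if
$$W_{tot} \log_2\!\left(1 + \frac{P_k}{\sigma_w^2}\left|h_k \sqrt{G_{tx}(c_k, \phi_k)}\, e^{j\Theta_{tx}} + \Omega_k\right|^2\right) < \bar{R}_{DT}.$$
Since the instantaneous SNR inside the logarithm is exponentially distributed with mean $\mathrm{SNR}_{DT}$, the outage probability is
$$P_{OUT}(\bar{R}_{DT}, \mathrm{SNR}_{DT}) = 1 - \exp\!\left(-\frac{2^{\bar{R}_{DT}/W_{tot}} - 1}{\mathrm{SNR}_{DT}}\right). \tag{21}$$
In this paper, we design $\bar{R}_{DT}$ based on the notion of $\epsilon$-outage capacity, i.e., $\bar{R}_{DT}$ is the largest rate such that $P_{OUT}(\bar{R}_{DT}, \mathrm{SNR}_{DT}) \leq \epsilon$, for a target outage probability $\epsilon < 1$. Forcing (21) equal to $\epsilon$, this can be expressed as
$$\bar{R}_{DT} = W_{tot} \log_2\!\big(1 - \mathrm{SNR}_{DT} \ln(1 - \epsilon)\big). \tag{22}$$
In other words, the transmission is successful with probability $1 - \epsilon$, and the average rate (throughput) is
$$T(\epsilon, \mathrm{SNR}_{DT}) = (1 - \kappa)(1 - \epsilon)\, W_{tot} \log_2\!\big(1 - \mathrm{SNR}_{DT} \ln(1 - \epsilon)\big),$$
where $(1 - \kappa)$ accounts for the channel estimation overhead.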
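The bisection step mentioned for the BT threshold design can be sketched as follows. The sketch assumes the exponential model for the matched-filter outputs stated in the analysis (mean $1 + \mathrm{SNR}\,L$ for an active beam, $1 + \rho\,\mathrm{SNR}\,L$ otherwise); function names and the example operating point (0 dB SNR, 100 symbols, 4 beams) are hypothetical, while $\rho = -22$ dB is the value quoted in the text.

```python
import math

def p_null_no_active(eta, snr, rho, n_sym, n_beams):
    """P(Y = empty | no active beam): every output stays below the threshold."""
    return (1.0 - math.exp(-eta / (1.0 + rho * snr * n_sym))) ** n_beams

def p_null_active(eta, snr, rho, n_sym, n_beams):
    """P(Y = empty | one active beam): the misdetection probability."""
    strong = 1.0 - math.exp(-eta / (1.0 + snr * n_sym))
    weak = 1.0 - math.exp(-eta / (1.0 + rho * snr * n_sym))
    return strong * weak ** (n_beams - 1)

def solve_threshold(snr, rho, n_sym, n_beams, iters=200):
    """Bisection on eta so that false alarm equals misdetection."""
    def gap(eta):
        false_alarm = 1.0 - p_null_no_active(eta, snr, rho, n_sym, n_beams)
        return false_alarm - p_null_active(eta, snr, rho, n_sym, n_beams)

    lo, hi = 0.0, 1.0
    while gap(hi) > 0.0:      # false alarm decreases, misdetection increases
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    eta = 0.5 * (lo + hi)
    return eta, p_null_active(eta, snr, rho, n_sym, n_beams)

rho = 10 ** (-22 / 10)  # -22 dB, as quoted in the text
eta_bt, delta_bt = solve_threshold(snr=1.0, rho=rho, n_sym=100, n_beams=4)
```

Calling the same routine with `n_beams=1` and `n_sym` set to the pilot length $\kappa L$ gives the corresponding threshold for the single-beam DT feedback.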
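In the sequel, we select $\epsilon$ to maximize the throughput, yielding the optimal $\epsilon^*(\mathrm{SNR}_{DT})$ at a given SNR $\mathrm{SNR}_{DT}$ as the unique solution of $dT(\epsilon, \mathrm{SNR}_{DT})/d\epsilon = 0$, or equivalently,
$$\big(1 - \mathrm{SNR}_{DT} \ln(1 - \epsilon^*)\big) \ln\!\big(1 - \mathrm{SNR}_{DT} \ln(1 - \epsilon^*)\big) = \mathrm{SNR}_{DT}.$$
We denote the resulting throughput maximized over $\epsilon$ as $T^*(\mathrm{SNR}_{DT}) \triangleq T(\epsilon^*(\mathrm{SNR}_{DT}), \mathrm{SNR}_{DT})$.
In this paper, we envision a mechanism in which the pilot signal transmitted in the second-last slot of the DT phase is used to generate the binary feedback signal transmitted by the MU to the BS in the last slot of the DT phase. As in (12) for the BT feedback, $Y = \hat{s}$ denotes active beam detected, whereas $Y = \emptyset$ denotes no active beam, due to either loss of alignment or blockage. Similarly to (12), the feedback is generated by comparing a matched-filter output with a threshold $\eta_{DT}$, and is based on the pilot signal $x^{(p)}_{k+T_{DT}-2}$ (of duration $\kappa L$) and on the corresponding received signal. Its distribution, given the state $(S_t, I_t, B_t^{(1)}, B_t^{(2)}) = (s, I, b_1, b_2)$ at time $t = k + T_{DT} - 2$ (second-last slot), can be computed as a special case of (15) and (16) with $|\hat{\mathcal{S}}_{BT}| = 1$ (since in the DT phase only one sector $\hat{s}$ is used for data transmission) and $\kappa L$ in place of $L$ (since only a fraction $\kappa$ out of $L$ symbols is used for the pilot signal), yielding the probability of incorrectly detecting an active beam as
$$\mathbb{P}(Y = \hat{s} \mid \text{no active beam}) = e^{-\frac{\eta_{DT}}{1 + \rho\,\mathrm{SNR}_{DT}\,\kappa L}}$$
and the probability of incorrectly detecting no active beam as
$$\mathbb{P}(Y = \emptyset \mid \text{active beam}) = 1 - e^{-\frac{\eta_{DT}}{1 + \mathrm{SNR}_{DT}\,\kappa L}}.$$
As in the BT phase, we choose $\eta_{DT}$ so that the probabilities of misdetection and false alarm are both equal to $\delta_{DT}$, yielding
$$e^{-\frac{\eta_{DT}}{1 + \rho\,\mathrm{SNR}_{DT}\,\kappa L}} = \delta_{DT} = 1 - e^{-\frac{\eta_{DT}}{1 + \mathrm{SNR}_{DT}\,\kappa L}}. \tag{27}$$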
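The optimization of the outage probability can be sketched numerically. The sketch assumes a throughput of the form $(1-\kappa)(1-\epsilon) W_{tot} \log_2(1 - \mathrm{SNR} \ln(1-\epsilon))$, which follows when the instantaneous SNR is exponentially distributed with mean $\mathrm{SNR}$; under that assumption, setting the derivative to zero gives $g \ln g = \mathrm{SNR}$ with $g = 1 - \mathrm{SNR} \ln(1 - \epsilon)$, which is solved below by bisection. Function names and the 10 dB example are hypothetical.

```python
import math

def throughput(eps, snr, w_tot=1.0, kappa=0.0):
    """Assumed throughput: (1 - kappa) * (1 - eps) * eps-outage rate."""
    rate = w_tot * math.log2(1.0 - snr * math.log(1.0 - eps))
    return (1.0 - kappa) * (1.0 - eps) * rate

def optimal_outage(snr, iters=200):
    """Solve g * ln(g) = snr for g >= 1, then map g back to eps."""
    lo, hi = 1.0, 2.0
    while hi * math.log(hi) < snr:   # g * ln(g) is increasing for g >= 1
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.log(mid) < snr:
            lo = mid
        else:
            hi = mid
    g = 0.5 * (lo + hi)
    return 1.0 - math.exp(-(g - 1.0) / snr)

eps_star = optimal_outage(10.0)           # target SNR of 10 (linear scale)
best_throughput = throughput(eps_star, 10.0)
```

The bandwidth and the pilot fraction only scale the throughput, so they do not affect the maximizing outage probability.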

III. POMDP FORMULATION
We now formulate the problem of optimizing the BT, DT and HO strategy as a constrained POMDP. In the following, we define the elements of this POMDP.
States: the state is denoted as $u_k \triangleq (S_k, I_k, B_k^{(1)}, B_k^{(2)}) \in \mathcal{U} \triangleq \mathcal{S} \times \{1, 2\} \times \{0, 1\}^2$, where $S_k \in \mathcal{S}$ is the sector occupied by the MU, $I_k \in \{1, 2\}$ is the index of the serving BS, and $B_k^{(i)} \in \{0, 1\}$ is the blockage state of BS $i \in \{1, 2\}$. With the absorbing state $\bar{s}$ to denote the episode termination, the overall state space is then $\bar{\mathcal{U}} = \mathcal{U} \cup \{\bar{s}\}$.
Actions: the serving BS can perform three actions: beam training (BT), data transmission (DT), or handover (HO). However, differently from standard POMDPs in which each action takes one slot, in this paper we generalize the model to actions taking multiple slots, as explained next.
If action HO is chosen, the data plane is transferred to the other BS, which becomes the serving one for the successive time-slots, until HO is chosen again. This action requires T HO time-slots to be completed, modeling the delay to coordinate the transfer of the data traffic between the two BSs.
If action BT is chosen, the serving BS chooses the set $\hat{\mathcal{S}}$ of sectors to scan and the target SNR $\mathrm{SNR}_{BT}$. The transmission power is then found via (9), and the feedback error probability $\delta_{BT}$ is found by solving (19). The duration of the BT action is $T_{BT} = |\hat{\mathcal{S}}| + 1$: $|\hat{\mathcal{S}}|$ slots for scanning the set of sectors $\hat{\mathcal{S}}$, and one slot for the feedback from the MU to the serving BS.
If action DT is chosen, then the serving BS selects the sector $\hat{s}$ covered during data communication, along with the duration $T_{DT} \geq 2$ of the data communication session, and the target SNR $\mathrm{SNR}_{DT}$. The transmission power is then determined via (9), and the transmission rate is given by (22) to achieve $\epsilon$-outage capacity, so that the resulting expected throughput (in case of LOS and correct alignment) is $T^*(\mathrm{SNR}_{DT})$. The duration of the data communication session $T_{DT}$ includes the second-last slot to generate the feedback signal, which is transmitted from the MU to the BS in the last slot. The feedback error probability $\delta_{DT}$ is the unique solution of (27).
We represent these actions compactly as $(c, \Pi_c) \in \mathcal{A}$, with action space $\mathcal{A}$, where $c \in \{\mathrm{BT}, \mathrm{DT}, \mathrm{HO}\}$ refers to the action class and $\Pi_c = (\hat{\mathcal{S}}, \mathrm{SNR}, T)$ specifies the corresponding parameters: $\hat{\mathcal{S}} \subseteq \mathcal{S}$ is the subset of sectors used during the action, $\mathrm{SNR}$ is the target SNR, so that the transmission power of the corresponding action is given by (9), and $T$ is the action duration. Specifically, $T = |\hat{\mathcal{S}}| + 1$ for a BT action, $\hat{\mathcal{S}} = \{\hat{s}\}$ for a DT action, and $\Pi_c = (\emptyset, 0, T_{HO})$ for an HO action.
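Observations: upon selecting action $A_k \in \mathcal{A}$ of duration $T$ in slot $k$ and executing it in state $u_k \in \mathcal{U}$, the BS observes $Y_k$ from the set $\mathcal{Y} = \mathcal{S} \cup \{\emptyset\} \cup \{\bar{s}\}$. $Y_k = \bar{s}$ denotes that the MU exited the coverage area of the two BSs, hence the episode terminates; otherwise, $Y_k$ denotes the feedback signal after the action is completed, as described earlier for the BT and DT actions in (13) and (24) (we set $Y_k = \emptyset$ under the HO action).
Transition and Observation probabilities: Let $\mathbb{P}(U_{k+T} = u', Y_k = y \mid U_k = u, A_k = a)$ be the probability of moving from state $u \in \mathcal{U}$ to state $u' \in \bar{\mathcal{U}}$ and observing $y \in \mathcal{Y}$ under action $a \in \mathcal{A}$ of duration $T$. To derive it, let $u = (s, I, b_1, b_2)$ be the current state, $a = (c, \Pi_c)$ be the selected action and $y \in \mathcal{Y}$ be the observation. It is useful to define the $T$-step transition probability of the MU from $s$ to $s'$ as $S_{ss'}(T) \triangleq \mathbb{P}(S_{k+T} = s' \mid S_k = s) = [P^T]_{ss'}$, $\forall s' \in \mathcal{S} \cup \{\bar{s}\}$, and the $T$-step transition probability of the blockage state of BS $i$ from $b$ to $b'$ as $B^{(i)}_{bb'}(T) \triangleq \mathbb{P}(B^{(i)}_{k+T} = b' \mid B^{(i)}_k = b)$. With $p^{(i)}_{bb'}$ denoting the one-step transition probabilities in (8), this is given in closed form as
$$B^{(i)}_{bb'}(T) = \pi^{(i)}_{b'} + \big(\chi(b = b') - \pi^{(i)}_{b'}\big)\big(1 - p^{(i)}_{01} - p^{(i)}_{10}\big)^T,$$
where we have defined the steady-state probabilities of $B^{(i)}_k$ as $\pi^{(i)}_1 = p^{(i)}_{01}/(p^{(i)}_{01} + p^{(i)}_{10})$ and $\pi^{(i)}_0 = 1 - \pi^{(i)}_1$. If $u' = \bar{s}$ (episode termination), then the observation signal is deterministically $Y_k = \bar{s}$, so that we obtain
$$\mathbb{P}(U_{k+T} = \bar{s}, Y_k = \bar{s} \mid U_k = u, A_k = a) = S_{s\bar{s}}(T),$$
i.e., it is equivalent to the probability of exiting the coverage area of the two BSs in $T$ steps. We now focus on the case $u' \neq \bar{s}$, i.e., the MU is still within the coverage area, and let $u' = (s', I', b_1', b_2')$ be the next state. Under the HO action $a = (\mathrm{HO}, \emptyset, 0, T_{HO})$, of duration $T = T_{HO}$, then necessarily $I' \neq I$ as a result of the handover operation, and the observation signal is deterministically $Y_k = \emptyset$, yielding
$$\mathbb{P}(U_{k+T} = u', Y_k = \emptyset \mid U_k = u, A_k = a) = S_{ss'}(T)\, B^{(1)}_{b_1 b_1'}(T)\, B^{(2)}_{b_2 b_2'}(T),$$
i.e., the MU moves from $s$ to $s'$ in $T$ steps, and the blockage state of BS $i$ moves from $b_i$ to $b_i'$ in $T$ slots. The expression given above is due to the facts that the mobility of the MU is independent of blockage events, and blockages of the two BSs are independent of each other.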
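The closed-form multi-step transition probability of a two-state Markov chain can be checked against repeated matrix multiplication. In the sketch below, `p01` and `p10` are hypothetical names for the one-step probabilities of moving from state 0 to 1 and from 1 to 0, and the numerical values are made up.

```python
def blockage_t_step(p01, p10, b, b_next, T):
    """Closed-form T-step transition probability of a two-state Markov chain.

    p01 = P(0 -> 1) and p10 = P(1 -> 0) in one step (hypothetical notation).
    """
    pi_1 = p01 / (p01 + p10)                 # steady-state probability of state 1
    pi = pi_1 if b_next == 1 else 1.0 - pi_1
    same = 1.0 if b == b_next else 0.0
    return pi + (same - pi) * (1.0 - p01 - p10) ** T

def t_step_by_multiplication(p01, p10, T):
    """Reference: multiply the 2x2 one-step matrix T times."""
    P = [[1.0 - p01, p01], [p10, 1.0 - p10]]
    M = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(T):
        M = [[sum(M[i][k] * P[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return M

M5 = t_step_by_multiplication(0.2, 0.05, 5)
closed = [[blockage_t_step(0.2, 0.05, b, bn, 5) for bn in range(2)]
          for b in range(2)]
```

The closed form avoids a matrix power for every action duration, which matters because these probabilities are evaluated for many values of $T$ when building the POMDP model.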
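Under the BT action $a = (\mathrm{BT}, \hat{\mathcal{S}}, \mathrm{SNR}, T)$, of duration $T = |\hat{\mathcal{S}}| + 1$, then necessarily $I' = I$ (no handover), and the observation signal is $Y_k = y \in \hat{\mathcal{S}} \cup \{\emptyset\}$ (see the signaling mechanism defined in the BT phase in Sec. II). Therefore,
$$\mathbb{P}(U_{k+T} = u', Y_k = y \mid U_k = u, A_k = a) = S_{ss'}(T)\, B^{(1)}_{b_1 b_1'}(T)\, B^{(2)}_{b_2 b_2'}(T)\, \mathbb{P}(Y = y \mid \hat{\mathcal{S}}, S_k = s, B_k^{(I)} = b_I),$$
where $\mathbb{P}(Y = y \mid \hat{\mathcal{S}}, S_k = s, B_k^{(I)} = b_I)$ has been defined in (14)-(18) for the cases of active beam $s \in \hat{\mathcal{S}}$ and no active beam.
Finally, under the DT action $a = (\mathrm{DT}, \{\hat{s}\}, \mathrm{SNR}, T)$, then necessarily $I' = I$ (no handover), and the observation signal is $Y_k = y \in \{\hat{s}, \emptyset\}$ (see the signaling mechanism defined in the DT phase in Sec. II). However, in this case the feedback signal is generated based on the second-last slot, i.e., it depends on the state $U_{k+T-2}$ at time $k + T - 2$. By computing the marginal with respect to $S_{k+T-2} = s''$ and $B^{(i)}_{k+T-2} = b_i''$, $i \in \{1, 2\}$, we obtain
$$\mathbb{P}(U_{k+T} = u', Y_k = y \mid U_k = u, A_k = a) = \sum_{s'', b_1'', b_2''} S_{ss''}(T-2)\, B^{(1)}_{b_1 b_1''}(T-2)\, B^{(2)}_{b_2 b_2''}(T-2)\, \mathbb{P}(Y = y \mid \hat{s}, S_{k+T-2} = s'', B^{(I)}_{k+T-2} = b_I'')\, S_{s''s'}(2)\, B^{(1)}_{b_1'' b_1'}(2)\, B^{(2)}_{b_2'' b_2'}(2).$$
The energy cost of an HO action is $e(u, a) = 0$; that of a DT or BT action $a = (c, \hat{\mathcal{S}}, \mathrm{SNR}, T)$ is expressed from (9) as (note that $T = |\hat{\mathcal{S}}| + 1$ for a BT action and $|\hat{\mathcal{S}}| = 1$ for a DT action)
$$e(u, a) = \Delta_t\,(T - 1)\, \frac{\mathrm{SNR}\,\sigma_w^2}{|\hat{\mathcal{S}}|} \sum_{\hat{s} \in \hat{\mathcal{S}}} \frac{1}{\Upsilon_{\hat{s}}}.$$
Note that the last slot of the DT or BT phases is reserved to the feedback transmission, with no energy cost for the BS.
Policy and Belief updates: Since the agent cannot directly observe the system state $u_k$, we define the belief $\beta \in \mathcal{B}$, i.e., the probability distribution over system states, given the information collected so far at the BS. Given $\beta$, the serving BS selects an action $a$ according to a policy $a = \pi(\beta)$, part of our design in Sec. IV; then, upon executing the action $a$ and receiving the feedback signal $y$, the BS updates the belief for the next decision interval according to Bayes' rule as
$$\beta'(u') = \frac{\sum_{u \in \mathcal{U}} \beta(u)\, \mathbb{P}(u', y \mid u, a)}{\sum_{u'' \in \mathcal{U}} \sum_{u \in \mathcal{U}} \beta(u)\, \mathbb{P}(u'', y \mid u, a)},$$
with $\mathbb{P}(u', y \mid u, a) \triangleq \mathbb{P}(U_{k+T} = u', Y_k = y \mid U_k = u, A_k = a)$ given by the transition and observation probabilities derived above. We denote the belief update as $\beta' = B(y, a, \beta)$.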
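The Bayesian belief update can be sketched on a toy model. The two-state model below (states "aligned" and "blocked", a single "DT" action, observations "ack" and "null") and all of its probabilities are hypothetical and much smaller than the state space of the paper; only the update rule itself follows the text.

```python
def belief_update(belief, model, action, obs):
    """Bayes' rule: the new belief is proportional to
    sum_u belief(u) * P(u_next, obs | u, action).

    model[(u, action)] maps (u_next, y) pairs to joint probabilities.
    """
    unnormalized = {}
    for u, b in belief.items():
        for (u_next, y), p in model[(u, action)].items():
            if y == obs:
                unnormalized[u_next] = unnormalized.get(u_next, 0.0) + b * p
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {u: v / total for u, v in unnormalized.items()}

# Hypothetical joint transition/observation probabilities.
model = {
    ("aligned", "DT"): {("aligned", "ack"): 0.81, ("aligned", "null"): 0.09,
                        ("blocked", "ack"): 0.01, ("blocked", "null"): 0.09},
    ("blocked", "DT"): {("aligned", "ack"): 0.02, ("aligned", "null"): 0.18,
                        ("blocked", "ack"): 0.08, ("blocked", "null"): 0.72},
}
prior = {"aligned": 0.5, "blocked": 0.5}
posterior = belief_update(prior, model, "DT", "null")
```

After a "null" feedback, the belief shifts toward the "blocked" state, which is the kind of evidence that would lead a policy to consider BT or HO instead of further DT.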

IV. OPTIMIZATION PROBLEM
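Our goal is to determine a policy $\pi$ (i.e., a map from beliefs to actions) that maximizes the expected throughput, under an average power constraint $\bar{P}_{avg}$. Using Little's Theorem [23], the average rate and power consumption are, respectively, expressed as
$$\bar{T}^\pi = \frac{\bar{R}^\pi_{tot}}{\bar{D}_{tot}}, \qquad \bar{P}^\pi = \frac{\bar{E}^\pi_{tot}}{\bar{D}_{tot}},$$
where $\bar{R}^\pi_{tot}$, $\bar{E}^\pi_{tot}$ are the total expected number of bits transmitted and energy cost during an episode; $\bar{D}_{tot}$ is the expected episode duration, which only depends on the mobility process but is independent of the policy $\pi$. Therefore, we aim to solve
$$\max_\pi\ \bar{R}^\pi_{tot}(\beta_0^*) \quad \text{s.t.} \quad \bar{E}^\pi_{tot}(\beta_0^*) \leq \bar{P}_{avg}\,\bar{D}_{tot},$$
where $\beta_0^*$ is the initial belief. We opt for a Lagrangian relaxation to handle the cost constraint, and define $L_\lambda(u, a) = r(u, a) - \lambda e(u, a)$ for $\lambda \geq 0$, where $r(u, a)$ and $e(u, a)$ denote the expected reward (bits delivered) and energy cost of action $a$ in state $u$. For a generic policy $\pi$, we define its value function $V^\pi_\lambda(\beta) \triangleq \bar{R}^\pi_{tot}(\beta) - \lambda \bar{E}^\pi_{tot}(\beta)$ as the expected total of $L_\lambda$ accumulated over the episode, starting from belief $\beta$. The goal is to determine the optimal policy $\pi^*$ which maximizes the value function, i.e.,
$$\pi^* = \arg\max_\pi V^\pi_\lambda(\beta_0^*).$$
The optimal dual variable is then found via the dual problem
$$\lambda^* = \arg\min_{\lambda \geq 0}\ V^*_\lambda(\beta_0^*) + \lambda \bar{P}_{avg} \bar{D}_{tot}. \tag{41}$$
It is well known that the optimal value function for a given $\lambda$ uniquely satisfies Bellman's optimality equation [2]
$$V^*_\lambda(\beta) = \max_{a \in \mathcal{A}}\ \mathbb{E}_{U, Y \mid a, \beta}\big[r(U, a) - \lambda e(U, a) + V^*_\lambda(B(Y, a, \beta))\big] \triangleq H_\lambda[V^*_\lambda](\beta), \quad \forall \beta \in \mathcal{B}.$$
The optimal value function $V^*_\lambda$ can be arbitrarily well approximated via the value iteration algorithm $V_{n+1} = H_\lambda[V_n]$, where $V_0(\beta) = 0$, $\forall \beta$. Moreover, $V_n$ is a piecewise linear and convex function [2] (being the maximum of linear functions of the belief), so that, at any stage of the value iteration algorithm, it can be expressed by a finite set of hyperplanes $Q_n$ as
$$V_n(\beta) = \max_{(\alpha^{(r)}, \alpha^{(e)}) \in Q_n} \beta \cdot \big(\alpha^{(r)} - \lambda \alpha^{(e)}\big), \tag{43}$$
where $\beta \cdot \alpha = \sum_u \beta(u)\alpha(u)$ denotes the inner product. Each hyperplane $(\alpha^{(r)}, \alpha^{(e)}) \in Q_n$ is associated with an action $a_\alpha \in \mathcal{A}$, so that the maximizing hyperplane $\alpha^*$ in (43) defines the policy $\pi_n(\beta) = a_{\alpha^*}$. Note that a distinguishing feature of our approach compared to [2] is that we define distinct hyperplanes $\alpha^{(r)}$ for the reward and $\alpha^{(e)}$ for the cost; as we will see later, this approach allows us to more efficiently track changes in the dual variable $\lambda$, as part of the dual problem (41), and to compute the expected total reward and cost as $\beta \cdot \alpha^{(r)*}$ and $\beta \cdot \alpha^{(e)*}$, respectively, where
$$(\alpha^{(r)*}, \alpha^{(e)*}) = \arg\max_{\alpha \in Q_n} \beta \cdot \big(\alpha^{(r)} - \lambda \alpha^{(e)}\big). \tag{44}$$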
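The benefit of keeping separate reward and cost hyperplanes can be illustrated with a small sketch: the same set of hyperplanes can be re-evaluated for a different dual variable without recomputing anything, and the reward and cost components are available separately. The two hyperplanes, the two-state belief and the values of the dual variable below are made up for illustration.

```python
def dot(x, y):
    """Inner product between a belief and a hyperplane."""
    return sum(a * b for a, b in zip(x, y))

def evaluate(belief, hyperplanes, lam):
    """Pick the hyperplane maximizing belief . (alpha_r - lam * alpha_e).

    Each entry of hyperplanes is (alpha_r, alpha_e, action).
    Returns (value, expected reward, expected cost, action).
    """
    best = max(hyperplanes,
               key=lambda h: dot(belief, h[0]) - lam * dot(belief, h[1]))
    reward, cost = dot(belief, best[0]), dot(belief, best[1])
    return reward - lam * cost, reward, cost, best[2]

# Hypothetical hyperplanes over a two-state belief space.
hyperplanes = [
    ([10.0, 10.0], [8.0, 8.0], "DT"),   # high reward, high energy cost
    ([2.0, 2.0], [1.0, 1.0], "BT"),     # low reward, low energy cost
]
belief = [0.5, 0.5]
unconstrained = evaluate(belief, hyperplanes, lam=0.0)
penalized = evaluate(belief, hyperplanes, lam=2.0)
```

With a zero dual variable the high-reward hyperplane is selected, while a larger dual variable penalizes its energy cost and switches the maximizer to the cheaper action; a dual update can therefore read the expected cost directly from the selected cost hyperplane.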
(44) It can be shown (see for instance [24]) that the set of hyperplanes is updated recursively as Q n+1 ≡ (r(·, a), e(·, a))+ û,y P(û, y|·, a)(α (r) y (û), α (e) y (û)): so that the cardinality grows as A n+1 = |Q n+1 |= A |Y| n |A|= |A| |Y| n -doubly exponentially with the number of iterations. For this reason, computing optimal planning solutions for POMDPs is an intractable problem for any reasonably sized task. This calls for approximate solution techniques, e.g., PERSEUS [2], which we introduce next.
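To make the growth of the hyperplane set concrete, the following minimal Python sketch iterates the cardinality recursion $A_{n+1}=A_n^{|\mathcal Y|}|\mathcal A|$. It assumes that $V_0=0$ is represented by a single all-zero hyperplane ($A_0=1$), and the action and observation counts are hypothetical.

```python
def hyperplane_counts(num_actions: int, num_obs: int, stages: int) -> list:
    """Number of hyperplanes |Q_n| produced by exact value iteration."""
    counts = [1]  # assumption: one all-zero hyperplane represents V_0 = 0
    for _ in range(stages):
        # One hyperplane per action and per choice of a previous hyperplane
        # for each observation: |Q_{n+1}| = |Q_n|^{|Y|} * |A|
        counts.append(counts[-1] ** num_obs * num_actions)
    return counts


counts = hyperplane_counts(num_actions=3, num_obs=2, stages=4)
# counts == [1, 3, 27, 2187, 14348907]
```

Even with three actions and binary observations, four exact backups already yield more than $10^7$ hyperplanes, which motivates the point-based approximation.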
The approximate backup operation of PERSEUS is given by Algorithm 1, which takes as input a set of hyperplanes $Q_n$ and the corresponding actions, and outputs a new set $Q_{n+1}$ along with the corresponding actions. To do so: in line 4, a belief point is chosen randomly from $\tilde{\mathcal B}_{\mathrm{temp}}$; in lines 5-7, the hyperplane associated with each action $a\in\mathcal A$ is computed; in particular, line 6 computes the hyperplane associated with the future value function $V_n(B(y,a,\beta))$, for each possible observation $y$ resulting in the belief update $B(y,a,\beta)$; line 7 instead performs the backup operation to determine the new hyperplane of $V_{n+1}(\beta)$ associated with action $a$; line 8 determines the optimal action that maximizes the value function, so that lines 5-8 yield overall the value iteration update $V_{n+1}(\beta)=\max_a\mathbb E_{U,Y|a,\beta}[r(U,a)-\lambda e(U,a)+V_n(B(Y,a,\beta))]$; in lines 9-12, the new hyperplane and the associated action are added to the set $Q_{n+1}$, but only if they yield an improvement in the value function, $V_{n+1}(\beta)>\tilde V_n(\beta)$; otherwise, the previous hyperplane is used; finally, lines 13-14 update the set of unimproved beliefs based on the newly added hyperplane; only the belief points that have not been improved are part of the next iterations of the algorithm, until all beliefs have been improved (empty $\tilde{\mathcal B}_{\mathrm{temp}}$). Overall, the algorithm guarantees monotonic improvements of the value function in $\tilde{\mathcal B}$.
Algorithm 1: function PERSEUS
input: $\lambda$, $\tilde{\mathcal B}$, $Q_n$, $\{a^n_\alpha,\alpha\in Q_n\}$
1 Init: $\tilde V_{n+1}(\beta)=-\infty$, $\forall\beta\in\tilde{\mathcal B}$; $\tilde{\mathcal B}_{\mathrm{temp}}\equiv\tilde{\mathcal B}$, $Q_{n+1}=\emptyset$
2 $\tilde V_n(\beta)\leftarrow\max_{\alpha\in Q_n}\beta\cdot(\alpha^{(r)}-\lambda\alpha^{(e)})$, and its maximizer
Key to the performance of PERSEUS is the design of $\tilde{\mathcal B}$, which should be representative of the beliefs encountered in the system dynamics. In the PBVI literature [24], most of the strategies to design $\tilde{\mathcal B}$ focus on selecting reachable belief points, rather than covering the entire belief simplex uniformly. We choose the belief points in the following two steps.
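The backup stage described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: states, actions and observations are integer indices, the hypothetical `model` dictionary holds `r[a][u]`, `e[a][u]`, `num_obs` and `P[a][u]` as a list of `(u_next, y, prob)` triples, and variable action durations are ignored. Reward and cost hyperplanes are kept separate, as in the text.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vec = List[float]
Plane = Tuple[Vec, Vec, int]  # (alpha_r, alpha_e, action index)


def dot(b: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(b, v))


def score(b: Vec, plane: Plane, lam: float) -> float:
    """Lagrangian value b . (alpha_r - lam * alpha_e) of one hyperplane."""
    return dot(b, plane[0]) - lam * dot(b, plane[1])


def best_plane(b: Vec, planes: List[Plane], lam: float) -> Plane:
    return max(planes, key=lambda pl: score(b, pl, lam))


def project(alpha: Vec, a: int, y: int, model: Dict) -> Vec:
    """g(u) = sum_{u'} P(u', y | u, a) * alpha(u')."""
    return [sum(p * alpha[u2] for (u2, obs, p) in row if obs == y)
            for row in model["P"][a]]


def backup(b: Vec, Q: List[Plane], lam: float, model: Dict) -> Plane:
    """Point-based Bellman backup at belief b (reward and cost kept apart)."""
    candidates = []
    for a in range(len(model["r"])):
        alpha_r, alpha_e = list(model["r"][a]), list(model["e"][a])
        for y in range(model["num_obs"]):
            projected = [(project(q[0], a, y, model),
                          project(q[1], a, y, model), a) for q in Q]
            g_r, g_e, _ = best_plane(b, projected, lam)  # best future plane
            alpha_r = [x + g for x, g in zip(alpha_r, g_r)]
            alpha_e = [x + g for x, g in zip(alpha_e, g_e)]
        candidates.append((alpha_r, alpha_e, a))
    return best_plane(b, candidates, lam)  # maximizing action at b


def perseus_stage(lam: float, beliefs: List[Vec], Q: List[Plane],
                  model: Dict, rng: random.Random) -> List[Plane]:
    """One PERSEUS stage: returns Q_{n+1} with V_{n+1} >= V_n on `beliefs`."""
    old = [score(b, best_plane(b, Q, lam), lam) for b in beliefs]
    Q_new: List[Plane] = []
    todo = list(range(len(beliefs)))  # indices of unimproved belief points
    while todo:
        i = rng.choice(todo)
        cand = backup(beliefs[i], Q, lam, model)
        if score(beliefs[i], cand, lam) > old[i]:
            Q_new.append(cand)  # the backup improved the value at this point
        else:
            Q_new.append(best_plane(beliefs[i], Q, lam))  # keep the old plane
        todo = [j for j in todo
                if score(beliefs[j], best_plane(beliefs[j], Q_new, lam), lam)
                < old[j] - 1e-12]
    return Q_new
```

Because each hyperplane stores $\alpha^{(r)}$ and $\alpha^{(e)}$ separately, the same set can be re-scored under a different $\lambda$ without recomputing the backups, which is the property the text exploits when updating the dual variable.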
First, an initial set of belief points $\mathcal B_0$ is selected deterministically to cover the belief space uniformly. Second, $\mathcal B_0$ is expanded using the stochastic simulation and exploratory action (SSEA) algorithm [24] to yield the expanded set of belief points $\tilde{\mathcal B}$. After initializing $\mathcal B_0$, given $\mathcal B_n$ at iteration $n$, for each $\beta\in\mathcal B_n$, SSEA performs a one-step forward simulation with each action in the action set, thus producing new beliefs $\{\beta_a,\forall a\in\mathcal A\}$. At this point, it computes the L1 distance between each new $\beta_a$ and its closest neighbor in $\mathcal B_n$, and adds the point $\beta_{a^*}$ farthest away from $\mathcal B_n$, so as to cover the belief space more widely. This expansion is performed multiple times to obtain $\tilde{\mathcal B}$.
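One SSEA expansion round can be sketched in Python as below. The callable `simulate_step(belief, action, rng)` is a hypothetical placeholder for the one-step forward simulation (sample a state and an observation, then apply the belief update). As a small deviation from the description, distances are measured against the set built so far, which avoids inserting duplicate points.

```python
import random
from typing import Callable, List, Sequence

Belief = List[float]


def l1_distance(b1: Sequence[float], b2: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(b1, b2))


def ssea_expand(beliefs: List[Belief], num_actions: int,
                simulate_step: Callable[[Belief, int, random.Random], Belief],
                rng: random.Random) -> List[Belief]:
    """Return the belief set after one expansion round."""
    expanded = [list(b) for b in beliefs]
    for b in beliefs:
        # One-step forward simulation with every action
        candidates = [simulate_step(b, a, rng) for a in range(num_actions)]
        # Distance of each candidate to its closest neighbor in the set
        dists = [min(l1_distance(c, other) for other in expanded)
                 for c in candidates]
        k = max(range(num_actions), key=lambda idx: dists[idx])
        if dists[k] > 0.0:  # add only the farthest, genuinely new point
            expanded.append(list(candidates[k]))
    return expanded


def toy_step(b: Belief, a: int, rng: random.Random) -> Belief:
    """Toy simulator (assumption): action 0 keeps b, action 1 spreads it."""
    return list(b) if a == 0 else [0.5, 0.5]


expanded_set = ssea_expand([[1.0, 0.0], [0.0, 1.0]], 2, toy_step,
                           random.Random(0))
```

Calling `ssea_expand` repeatedly on its own output mirrors the repeated expansion that produces the final belief set.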
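After returning the set of hyperplanes $Q_{n+1}$, the associated actions $\{a^{n+1}_\alpha,\forall\alpha\in Q_{n+1}\}$, and the dual variable $\lambda_n$, the (approximately) optimal action to be selected when operating under the belief $\beta$ can be computed as
$$a^*=a^{n+1}_{\alpha^*},\qquad\alpha^*=\arg\max_{\alpha\in Q_{n+1}}\beta\cdot(\alpha^{(r)}-\lambda_n\alpha^{(e)}),$$
along with the approximate expected reward and cost via (44).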
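A minimal Python sketch of this action selection is given below, assuming each hyperplane is stored as a tuple `(alpha_r, alpha_e, action)`; the two hyperplanes and their numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Plane = Tuple[List[float], List[float], str]  # (alpha_r, alpha_e, action)


def dot(b: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(b, v))


def select_action(beta: Sequence[float], planes: List[Plane],
                  lam: float) -> Tuple[str, float, float]:
    """Return (action, expected reward, expected cost) at belief beta."""
    best = max(planes,
               key=lambda pl: dot(beta, pl[0]) - lam * dot(beta, pl[1]))
    # Reward and cost are read off the maximizing hyperplane, as in (44)
    return best[2], dot(beta, best[0]), dot(beta, best[1])


# Hypothetical hyperplanes over a two-state belief space
toy_planes = [([10.0, 0.0], [2.0, 2.0], "DT"), ([4.0, 4.0], [1.0, 1.0], "BT")]
choice_free_power = select_action([0.5, 0.5], toy_planes, lam=0.0)
choice_costly_power = select_action([0.5, 0.5], toy_planes, lam=2.0)
```

With $\lambda=0$ the higher-reward hyperplane wins at this belief, whereas with $\lambda=2$ the cheaper one does; the same stored hyperplanes serve both cases.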
In Fig. 2, we plot a time-series of the following variables for a portion of an episode executed under the PERSEUS-based policy (Algorithms 1 and 2): sector index $S_k$, index of the serving BS $I_k$, its blockage state $B^{(I_k)}_k$, the action class $c\in\{\mathrm{DT},\mathrm{BT},\mathrm{HO}\}$, and the BT and DT feedbacks $Y_{BT}$ and $Y_{DT}$ as defined in (13) and (24). The simulation parameters are listed in Table 1. Initially, the MU is known to be in sector $S_0=1$, with LOS conditions for both BSs (B otherwise, blockage is detected and the HO action is executed. It should be noted that, although Algorithm 2 returns an approximately optimal design, it incurs a huge computational cost in POMDPs with large state and action spaces (hence a large number of representative belief points). To remedy this, in the next section we propose simple heuristic policies, inspired by the behavior of the PERSEUS-based policy described earlier and depicted in Fig. 2. These policies will be shown numerically to achieve near-optimal performance.

V. HEURISTIC POLICIES
In this section, we present two heuristic policies, namely a belief-based heuristic (B-HEU) and a finite-state-machine (FSM)-based heuristic (FSM-HEU), and derive closed-form expressions for the performance of FSM-HEU. Like the PERSEUS-based policy, B-HEU needs to track the belief $\beta$, whereas FSM-HEU is based solely on the current observation signal, which defines transitions in an FSM. For this reason, FSM-HEU has lower complexity than B-HEU, while incurring only a small degradation in performance (see Sec. VI).

A. FSM-based Heuristic policy (FSM-HEU)
The key idea of FSM-HEU is that it selects actions based solely on an FSM, whose states define the action to be selected, and whose transitions are defined by the observation signal, as depicted in Fig. 3 and described next. In FSM-HEU, we consider the following actions:
• the HO action $A_k=(\mathrm{HO},\emptyset,0,T_{HO})$ of duration $T_{HO}$;
• the BT action $A_k=(\mathrm{BT},\mathcal S,\mathrm{SNR}_{BT},T_{BT})$ of duration $T_{BT}=|\mathcal S|+1$; in other words, the serving BS performs an exhaustive search, with a fixed SNR $\mathrm{SNR}_{BT}$ (determined offline), followed by feedback;
• the $|\mathcal S|$ DT actions $(\mathrm{DT},\hat s,\mathrm{SNR}_{DT},T_{DT})$, where $\hat s\in\mathcal S$; in other words, the serving BS performs DT with fixed SNR $\mathrm{SNR}_{DT}$ and duration $T_{DT}$ (both determined offline).
For notational convenience, we compactly refer to these actions as HO, BT and $(\mathrm{DT},\hat s)$, $\hat s\in\mathcal S$, respectively. Let $A_k\in\{\mathrm{BT},\mathrm{HO}\}\cup\{(\mathrm{DT},\hat s):\hat s\in\mathcal S\}$ be the selected action (the state of the FSM at time $k$), of duration $T$, and $Y_k$ be the observation signal generated by such action, as described in Sec. III; then, the FSM moves to state $A_{k+T}=\mathbb A(A_k,Y_k)$, which defines the action to be selected in the next decision round. Note that $\mathbb A$ defines transitions in the FSM, and the process continues until the episode terminates.
Let us consider the transitions in the FSM, defined by the function $\mathbb A$ and depicted in Fig. 3. If $A_k=\mathrm{BT}$ and the observation signal is $Y_k=(I,\hat s)$, $\hat s\in\mathcal S$, then the BS detects the strongest beam $\hat s$; hence FSM-HEU switches to DT and uses the DT action $A_{k+T}=(\mathrm{DT},\hat s)=\mathbb A(\mathrm{BT},\hat s)$ in the next decision round, of duration $T_{DT}$. On the other hand, if the observation signal is $Y_k=\emptyset$, the BS detects blockage and performs HO to the non-serving BS, so that the new action is $A_{k+T}=\mathrm{HO}=\mathbb A(\mathrm{BT},\emptyset)$.
If $A_k=(\mathrm{DT},\hat s)$, i.e., the DT action is executed on sector $\hat s$, with duration $T_{DT}$, and the signal $Y_k=\hat s$ is observed, then the BS infers that the signal is still sufficiently strong to continue DT on the same sector, and the same action $A_{k+T}=(\mathrm{DT},\hat s)=\mathbb A((\mathrm{DT},\hat s),\hat s)$ is selected again. Otherwise ($Y_k=\emptyset$), the BS detects a loss of alignment, hence the BT action $A_{k+T}=\mathrm{BT}=\mathbb A((\mathrm{DT},\hat s),\emptyset)$ is executed next.
Finally, if $A_k=\mathrm{HO}$ (the HO action is chosen, with observation signal $Y_k=\emptyset$), then the new serving BS executes the BT action $A_{k+T}=\mathrm{BT}=\mathbb A(\mathrm{HO},\emptyset)$ next. This procedure continues until the episode terminates.
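The transition function of the FSM can be written compactly in Python. In this minimal sketch, the encoding is an assumption: `None` stands for the empty observation $\emptyset$, BT feedback is reduced to the detected sector index, and actions are `"BT"`, `"HO"` or `("DT", sector)`.

```python
from typing import Optional, Tuple, Union

Action = Union[str, Tuple[str, int]]  # "BT", "HO", or ("DT", sector)


def next_action(action: Action, feedback: Optional[int]) -> Action:
    """FSM-HEU transition: next action given current action and feedback."""
    if action == "BT":
        # Strongest beam detected -> DT on that sector; no beam -> handover
        return ("DT", feedback) if feedback is not None else "HO"
    if action == "HO":
        # The new serving BS always starts with beam training
        return "BT"
    _, sector = action
    # DT acknowledged on the same sector -> keep transmitting; else retrain
    return ("DT", sector) if feedback == sector else "BT"


# Example trace: BT finds sector 3, DT succeeds once, then alignment is lost,
# BT detects blockage, and the handover is followed by BT at the new BS.
trace = ["BT"]
for fb in [3, 3, None, None, None]:
    trace.append(next_action(trace[-1], fb))
```

Since the next action depends only on the current action and the latest feedback, no belief tracking is needed, which is the source of the lower complexity of FSM-HEU.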
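The performance of FSM-HEU can be computed in closed form by noticing that $G_k=(U_k,A_k)$, i.e., the underlying system state $U_k$ and action $A_k$, form a Markov chain, taking values in the product of the system state space and the FSM action set $\{\mathrm{BT},\mathrm{HO}\}\cup\{(\mathrm{DT},\hat s):\hat s\in\mathcal S\}$. To see this, note that the observation $Y_k$ and next state $U_{k+T}$ (where $T$ is the duration of the selected action $A_k$) have joint distribution given by (32)-(28), which depends solely on $G_k$; then, in view of the FSM of Fig. 3, $A_{k+T}=\mathbb A(A_k,Y_k)$ is a deterministic function of $A_k$ and $Y_k$. The state transition probability is then obtained by computing the marginal with respect to the observation signal $Y_k$, yielding
$$\mathbb P(G_{k+T}=(u',a')\,|\,G_k=(u,a))=\sum_{y}\mathbb P(U_{k+T}=u',Y_k=y\,|\,U_k=u,A_k=a)\,\chi(a'=\mathbb A(a,y)). \tag{46}$$
We recall that the right-hand side of (46) is given by (32)-(28). Let $\bar R^{\mathrm{FSM}}_{\mathrm{tot}}(g)$ and $\bar E^{\mathrm{FSM}}_{\mathrm{tot}}(g)$ be the total expected number of bits delivered and energy cost under FSM-HEU, starting from state $g$. Then, with $\mathbb P(g'|g)$ defined in (46) and $g=(u,a)$,
$$\bar R^{\mathrm{FSM}}_{\mathrm{tot}}(g)=r(u,a)+\sum_{g'}\mathbb P(g'|g)\,\bar R^{\mathrm{FSM}}_{\mathrm{tot}}(g'),\qquad\bar E^{\mathrm{FSM}}_{\mathrm{tot}}(g)=e(u,a)+\sum_{g'}\mathbb P(g'|g)\,\bar E^{\mathrm{FSM}}_{\mathrm{tot}}(g'),$$
where $r(\cdot)$ and $e(\cdot)$ are given by (34)-(35). These linear equations can be solved in closed form.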
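As an illustration of how such a linear system can be solved, the Python sketch below computes $x=(\mathbf I-\mathbf P)^{-1}\mathbf r$ by Gaussian elimination. It assumes that $\mathbf P$ collects the transition probabilities among non-terminal states only, so that each row may sum to less than one (the missing mass being the probability that the episode terminates); the two-state chain and its numbers are hypothetical.

```python
from typing import List


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: the chain may never terminate")
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                f = M[i][col] / M[col][col]
                M[i] = [x - f * y for x, y in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def expected_totals(P: List[List[float]], r: List[float], e: List[float]):
    """Total expected reward and energy from each non-terminal state g."""
    n = len(P)
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, r), solve_linear(A, e)


# Hypothetical two-state chain; the missing row mass models termination.
P_toy = [[0.5, 0.25], [0.0, 0.5]]
R_tot, E_tot = expected_totals(P_toy, r=[1.0, 2.0], e=[1.0, 1.0])
# R_tot is approximately [4.0, 4.0]; E_tot is approximately [3.0, 2.0]
```

The same routine applies to the baseline policy, since only the transition matrix changes.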

B. Belief-based Heuristic policy (B-HEU)
Unlike FSM-HEU, this policy exploits the belief $\beta_k$ in the decision-making process. However, B-HEU selects actions in a heuristic fashion as described next, as opposed to the PERSEUS-based policy (Algorithm 1), which selects actions (approximately) optimally. To describe this policy, let $\beta$ be the current belief and $I$ be the index of the serving BS. Let $\Xi(s)\triangleq\sum_{(b_1,b_2):b_I=1}\beta(s,I,b_1,b_2)$ be the marginal probability of the MU occupying sector $s$ with no blockage under the serving BS. Then, $\Lambda\triangleq\sum_s\Xi(s)$ can be interpreted as the probability of no blockage under the serving BS. Given these quantities, B-HEU operates as follows, with thresholds $\eta_1$, $\eta_2$ and $\eta_3$ determined offline: if $\Lambda<\eta_1$, then blockage is detected, hence the HO action is selected; otherwise ($\Lambda\ge\eta_1$), let $\hat s^*=\arg\max_s\Xi(s)$ be the sector most likely occupied by the MU: if $\Xi(\hat s^*)\ge\eta_2$, i.e., the BS is confident that the MU is in sector $\hat s^*$ and there is no blockage, then the BS performs DT over sector $\hat s^*$, with SNR $\mathrm{SNR}_{DT}$ and duration $T_{DT}$ determined offline. Otherwise ($\Lambda\ge\eta_1$ and $\Xi(\hat s^*)<\eta_2$), the BS is uncertain about the location of the MU, hence it performs BT over the set $\hat{\mathcal S}^*$ of most likely sectors, thus neglecting the least likely sectors whose aggregate probability is less than $1-\eta_3$. After selecting the appropriate action based on the belief, the serving BS collects the observation $Y_k$ and updates its belief as in (36). Note that, unlike FSM-HEU, which performs an exhaustive search during the BT phase, B-HEU exploits the current belief $\beta$ to perform BT only on the most likely sectors, and therefore incurs less BT overhead.
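The decision rule of B-HEU can be sketched in Python as follows. The belief is stored as a dictionary keyed by `(s, i, b1, b2)` with blockage flag 1 meaning LOS, and BS indices are 1 and 2. The construction of the BT sector set is an assumption made for illustration: it takes the smallest set of most likely sectors whose mass reaches a fraction $\eta_3$ of $\Lambda$, which may differ from the paper's exact definition of $\hat{\mathcal S}^*$.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int, int, int]  # (sector s, serving BS i, b1, b2)


def b_heu_action(beta: Dict[State, float], serving: int,
                 eta1: float, eta2: float, eta3: float) -> Tuple[str, List[int]]:
    """Return (action class, list of sectors) chosen by the heuristic."""
    xi: Dict[int, float] = {}
    for (s, i, b1, b2), p in beta.items():
        # Xi(s): MU in sector s with LOS towards the serving BS (b_I = 1)
        if i == serving and (b1, b2)[serving - 1] == 1:
            xi[s] = xi.get(s, 0.0) + p
    no_blockage = sum(xi.values())  # Lambda
    if no_blockage < eta1 or not xi:
        return ("HO", [])  # blockage detected
    ranked = sorted(xi, key=lambda s: xi[s], reverse=True)
    if xi[ranked[0]] >= eta2:
        return ("DT", [ranked[0]])  # confident about the MU's sector
    # Assumed rule: keep the most likely sectors until their mass reaches
    # eta3 * Lambda, dropping the least likely ones.
    chosen, mass = [], 0.0
    for s in ranked:
        chosen.append(s)
        mass += xi[s]
        if mass >= eta3 * no_blockage:
            break
    return ("BT", chosen)


toy_belief = {(1, 1, 1, 1): 0.5, (2, 1, 1, 1): 0.3,
              (3, 1, 1, 0): 0.1, (1, 1, 0, 1): 0.1}
decision = b_heu_action(toy_belief, serving=1, eta1=0.5, eta2=0.6, eta3=0.8)
```

In this toy belief the most likely sector has probability 0.5, below $\eta_2=0.6$, so BT is performed over sectors 1 and 2 only, and sector 3 is skipped.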

VI. NUMERICAL RESULTS
In this section, we perform a numerical evaluation of the proposed policies. We also compare their performance with a baseline policy. The baseline policy is the same as FSM-HEU except for one key difference: after executing the DT action, it executes the BT action irrespective of the binary feedback. In other words, A((DT,ŝ), Y )=BT, ∀Y . Note that, if no blockage is detected, baseline mimics the periodic exhaustive search. Its performance can be analyzed in closed form in a similar fashion as for FSM-HEU (see its FSM representation in Fig. 3). The simulation parameters are listed in Table 1.
Using the throughput and power metrics defined in (37), the average spectral efficiency (bps/Hz) and power (dBm) under policy $\pi$ are expressed as $\bar T^{\pi}/W_{\mathrm{tot}}$ and $\bar P^{\pi}$, respectively. We choose the initial belief $\beta^*_0(u)=\chi(u=u_0)$, where $u_0=(1,1,1,1)$, implying that the MU starts in sector $s=1$ with LOS conditions for both BSs. We use the Gauss-Markov mobility model, with speed $v_k$ and position $x_k$ of the MU given by (49), where, unless otherwise stated, $\mu_v=30$ m/s is the average speed; $\sigma_v=10$ m/s is the standard deviation of the speed; $\gamma=0.2$ is the memory parameter; and $\tilde v_{k-1}\sim\mathcal N(0,1)$, i.i.d. over slots.
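For illustration, a one-dimensional Gauss-Markov trajectory can be generated with the Python sketch below. The recursion used is the commonly cited form of the Gauss-Markov model and is an assumption here, since it may differ in detail from the paper's (49); the slot duration `slot_s` and the function name are likewise hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def gauss_markov(num_slots: int, mu_v: float = 30.0, sigma_v: float = 10.0,
                 gamma: float = 0.2, slot_s: float = 1e-3,
                 rng: Optional[random.Random] = None
                 ) -> Tuple[List[float], List[float]]:
    """Simulate speeds v_k [m/s] and positions x_k [m] over num_slots slots."""
    rng = rng or random.Random()
    v, x = mu_v, 0.0  # assumption: start at the mean speed, at position 0
    speeds, positions = [v], [x]
    for _ in range(num_slots):
        x = x + slot_s * v  # move with the speed held during the slot
        # Memory gamma pulls v towards its past value; the noise scaling
        # keeps the stationary standard deviation of the speed at sigma_v.
        v = (gamma * v + (1.0 - gamma) * mu_v
             + sigma_v * math.sqrt(1.0 - gamma ** 2) * rng.gauss(0.0, 1.0))
        speeds.append(v)
        positions.append(x)
    return speeds, positions


speeds, positions = gauss_markov(1000, rng=random.Random(0))
```

Trajectories generated in this way could be quantized into sectors to estimate sector transition probabilities empirically, in the spirit of the 10000-trajectory estimate mentioned in the text.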
In the simulations, we show the results corresponding to the analytical model presented in the paper (with $\mathbb P$ estimated from simulations of 10000 trajectories under the Gauss-Markov model (49), as described in Sec. II-A) as well as the results obtained through Monte-Carlo simulation using the beamformer design [22] and the Gauss-Markov mobility model. In Fig. 4, we depict the convergence of the PBVI Algorithm 2, which optimizes both the policy $\pi$ and the dual variable $\lambda$ to meet the power constraint $\bar P^{\pi}\le\bar P_{\mathrm{avg}}$. It can be observed that the expected spectral efficiency $\bar R_n/\bar D_{\mathrm{tot}}/W_{\mathrm{tot}}$, the average power $\bar E_n/\bar D_{\mathrm{tot}}$ and the Lagrangian function $V_n(\beta_0)+\lambda_n E_{\max}$ converge, and that $\bar E_n/\bar D_{\mathrm{tot}}$ converges to the desired average power constraint $\bar P_{\mathrm{avg}}$.
In Fig. 5, we depict the average spectral efficiency against the average power consumption. For the heuristic policies, we set $T_{DT}=10$, and $P_{BT}=P_{DT}$ is varied from 0 dBm to 40 dBm. The upper bound shown in the figure is obtained by a genie-aided policy that always executes DT with perfect knowledge of the state $(s,I,b_1,b_2)$ and hence its throughput performance can be upper bounded by (1−π B is the steady-state probability that both BSs are under blockage, resulting in outage. It should be noted that this upper bound is loose, since it assumes perfect knowledge of the state and ignores the performance degradation due to transitions in $s$ during a DT action and the time required to perform handover and transmit feedback. The PERSEUS-based policy $\pi^*$ yields the best performance, with a negligible performance gap with respect to the upper bound. It shows a performance gain of up to 10%, 11% and 55% compared to B-HEU, FSM-HEU and baseline, respectively. It is also observed that B-HEU and FSM-HEU yield similar performance. On the other hand, the baseline scheme yields up to 50% degraded performance compared to the proposed adaptive schemes: in fact, it neglects the DT feedback and instead performs periodic BT, thus incurring significant overhead. We observe that the curves corresponding to the analysis closely match the markers, which represent simulation points obtained with the analog beam design and Gauss-Markov mobility, thereby demonstrating that the model introduced in the paper accurately captures more realistic settings.
In Fig. 6, the spectral efficiency is depicted against the DT time duration $T_{DT}$ used in the B-HEU, FSM-HEU and baseline schemes. As observed previously, the PERSEUS-based policy outperforms B-HEU and FSM-HEU, and all of them outperform the baseline scheme. B-HEU and FSM-HEU show similar performance, and achieve near-optimal performance with an optimized value of $T_{DT}\simeq 60$ slots. Most remarkably, this near-optimal performance is achieved at a fraction of the complexity of the PERSEUS-based policy. The spectral efficiency initially improves as $T_{DT}$ increases, due to the reduced overhead of BT and feedback time. However, after reaching its maximum at $T_{DT}\simeq 60$ slots, the spectral efficiency decreases as $T_{DT}$ is further increased. This is because, during very long data transmission periods, loss of alignment and blockages are more likely to occur before the serving BS is able to react to these events. It is also observed that the baseline scheme achieves peak performance at a much higher value, $T_{DT}\simeq 180$ slots. In fact, since the baseline performs periodic BT, there is a stronger incentive to reduce the overhead by extending the duration of DT, as opposed to B-HEU and FSM-HEU, which adapt the duration of DT based on the DT feedback signal.
In Fig. 7, we demonstrate the impact of the mobility process on the performance, by plotting the total expected number of bits delivered successfully under FSM-HEU and baseline versus the average speed $\mu_v$, for different values of $\sigma_v$. It can be observed that, as the average speed increases, the performance degrades due to more frequent mis-alignments as well as a shorter episode duration. On the other hand, the performance loss due to variations in the speed is only significant at very high values of $\sigma_v$, showing that the proposed heuristic policies are robust to variations in speed. As previously observed, FSM-HEU outperforms the baseline over the entire range of values considered.

VII. CONCLUSIONS
In this paper, we investigated the design of beam-training/data-transmission/handover strategies for mm-wave vehicular networks. The mobility and blockage dynamics have been leveraged to obtain an approximately optimal policy via a POMDP formulation, solved with a point-based value iteration (PBVI) algorithm based on a variation of PERSEUS [2]. Our numerical results demonstrate superior performance of the PERSEUS-based policy compared to a baseline scheme with periodic beam training (up to 2× improvement in spectral efficiency). Inspired by the behavior of the PERSEUS-based policy, we proposed two heuristic policies, which provide low-complexity alternatives to PBVI and exhibit near-optimal performance (within ∼10%).