Learning to Broadcast with Layered Division Multiplexing

Abstract—A broadcast/multicast communication system is studied in which layered division multiplexing (LDM) is applied to support differential quality-of-service (QoS) levels. Focusing on a practical scenario in which the transmitter does not know the fading distribution, layer allocation is optimized based on a dataset sampled during deployment. The optimality gap caused by the availability of limited data is bounded via a generalization analysis, and is shown to be monotonically decreasing as the dataset grows larger. Numerical experiments demonstrate that LDM improves spectral efficiency even for small datasets, and that, for sufficiently large datasets, the proposed mirror-descent-based layer optimization scheme achieves an expected rate close to that achieved when the transmitter knows the fading distribution.


I. INTRODUCTION
Layered division multiplexing (LDM) has been introduced in several standards as an effective means to support differential quality-of-service (QoS) in broadcast and multicast services. With LDM, multiple independent sub-messages, or layers, are superimposed, enabling the decoding of a different number of messages depending on the channel conditions, thus supporting communication at a variable rate [1]-[4]. The most common use of LDM is for multimedia broadcast, as adopted by the Advanced Television Systems Committee (ATSC 3.0) [4], [5], in which LDM supports a robust configuration for mobile receivers and a high-capacity connection for fixed receivers. Other applications include Machine-Type Communication (MTC) and Industry 4.0, in which LDM is considered as a tool to deliver critical control services and best-effort monitoring services [6]-[8]. Maximizing the expected achievable rate, or the average rate across all receivers, requires adjusting the layers' rates and power levels as a function of the channel distribution [9]. However, in practice, this distribution is unknown. Accordingly, in this paper, we assume that the transmitter has access to a dataset sampled during deployment, from which the rate and power allocation for each layer are optimized. We explore theoretical and algorithmic aspects of this design problem.
Related Work: LDM, also known as the broadcast approach, has been extensively studied as a means to improve spectral efficiency in various scenarios. A comprehensive survey of the state of the art is available in [1], and we mention here some representative examples. The broadcast approach for slowly fading single-user channels was investigated in [9], where it was shown that transmitting multiple layers can increase the expected achievable rate, and the optimal power allocation density was derived for an infinite number of layers. The gain of the broadcast approach was also demonstrated in [10] for a finite number of layers. Specifically, for the quasi-static Rayleigh fading channel, two layers were shown to achieve most of the throughput gain. Importantly, unlike our work, both references [9] and [10] assume that the transmitter knows the fading distribution. In [2], for broadcasting fixed and mobile services, LDM with two layers was shown to outperform time division multiplexing (TDM) and frequency division multiplexing (FDM) in terms of the mobile service's capacity-coverage trade-off. Multicast beamforming was studied in [11] with the goal of minimizing the outage probability for an unknown fading distribution, and several gradient-based algorithms were proposed to optimize beamforming based on a dataset of channel samples. Similarly, an alternating gradient descent algorithm was recently proposed in [12] for the joint optimization of the precoding weights and the reconfigurable intelligent surface (RIS) reflection pattern in a RIS-aided communication system.

This work has been supported by the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (Grant Agreement Nos. 694630 and 725731).
Main Contributions: In this paper, we study the LDM-based broadcasting/multicasting system illustrated in Fig. 1, in which a single-antenna base station (BS) serves single-antenna clients. The channel coefficients and the fading distribution are assumed to be unknown to the BS. In order to maximize the expected achievable rate, the BS optimizes the layer allocation based on a dataset sampled during deployment. At a theoretical level, we bound the optimality gap caused by the availability of limited data via a generalization analysis [13], and characterize the number of samples required to maintain a desired optimality gap. At an algorithmic level, we introduce a mirror-descent-based scheme [14] to maximize an empirical estimate of the expected rate. Numerical results demonstrate that broadcasting multiple layers improves spectral efficiency even for small datasets, and that, for sufficiently large datasets, the expected rate is close to that achieved when the BS knows the fading distribution, confirming the sample complexity analysis.
Notation: Random variables and vectors are denoted by lowercase and boldface lowercase Roman-font letters, respectively. Realizations of random variables and vectors are denoted by lowercase and boldface lowercase italic-font letters, respectively. For example, x is a realization of random variable x, and x is a realization of random vector x. For any positive integer K, we define the set [K] ≜ {1, 2, . . . , K}. The cardinality and convex hull of a set L are denoted by |L| and conv(L), respectively. The ℓ1-norm and ℓ2-norm of a vector s are denoted by ‖s‖₁ and ‖s‖₂, respectively. For two scalars a and b, the indicator of the event a ≥ b is denoted by 𝟙{a ≥ b}; that is, 𝟙{a ≥ b} equals one if a ≥ b and zero otherwise. The set of non-negative real numbers is denoted by ℝ₊. Finally, diag(u) represents a diagonal matrix with diagonal given by the vector u.

II. SYSTEM MODEL AND PROBLEM DEFINITION
We consider the system depicted in Fig. 1 in which a single-antenna BS broadcasts a common message to single-antenna clients over a fading broadcast channel. The fading coefficient for each client is drawn from a common fading distribution p_h(h), and is assumed to remain constant for the duration of a coding block consisting of n symbols. The common fading distribution p_h(h) may take the form of a mixture model, as in [15], in order to account for heterogeneous long-term effects such as path loss and shadowing.
The signal received by a client at time t ∈ [n], denoted by y(t), can be expressed as

y(t) = √P h x(t) + z(t),  (1)

where P > 0 denotes the BS transmission power; x(t) ∈ ℂ denotes the signal transmitted at time t, which is subject to the average power constraint

(1/n) Σ_{t=1}^{n} |x(t)|² ≤ 1;  (2)

h ∈ ℂ denotes the quasi-static fading coefficient; and z(t) ∼ CN(0, 1) denotes the additive white Gaussian noise (AWGN). We assume that the BS knows neither the fading realizations nor the common fading distribution p_h(h), while each client knows its own channel h. Due to the lack of channel state information (CSI), the BS applies layered division multiplexing (LDM) [9] with M layers, or sub-messages, in order to enable differential quality of service at the clients. The transmitted signal x(t) in (1) is accordingly given as

x(t) = Σ_{m=1}^{M} √λ_m x_m(t),  (3)

where x_m(t) denotes a symbol from a Gaussian random codebook with average power λ_m that is used to encode sub-message w_m ∈ [2^{nρ_m}] of rate ρ_m ≥ 0. To satisfy the normalized power constraint in (2), the power-allocation vector λ ≜ (λ_1, . . . , λ_M) must thus lie in the simplex

Δ^M ≜ {λ ∈ ℝ₊^M : Σ_{m=1}^{M} λ_m = 1}.  (4)

We refer to message w_m and the corresponding encoded signal x_m(t) as the mth layer. Each client decodes sub-messages by applying successive cancellation decoding (SCD) with the order w_1, . . . , w_M. When decoding layer m ∈ [M], all subsequent layers are treated as AWGN. Each client can hence decode only a subset of layers depending on its channel gain g ≜ |h|². We denote by I_m ≜ Σ_{i=m+1}^{M} λ_i the normalized power level of the inter-layer interference affecting the decoding of layer m, and by p_g(g) the distribution of the channel gain g.
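As a concrete illustration of the layered superposition at the transmitter and the channel (1), the following minimal Python sketch generates one coding block; the block length n, power P, and allocation λ are hypothetical example values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, P = 1000, 3, 10.0            # block length, number of layers, power (example values)
lam = np.array([0.6, 0.3, 0.1])    # power-allocation vector on the simplex

# One Gaussian codeword per layer, layer m carrying average power lam[m].
x_layers = [np.sqrt(lam[m] / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
            for m in range(M)]
x = sum(x_layers)                  # superposed signal with unit average power

h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # quasi-static fading
z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)  # AWGN
y = np.sqrt(P) * h * x + z         # received block, cf. (1)
```

Since the per-layer powers sum to one, the superposed signal satisfies the normalized average power constraint in expectation.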
We parametrize the rate ρ_m of layer m as [9]

ρ_m(s^m, λ) ≜ log₂(1 + ‖s^m‖₁ λ_m P / (1 + ‖s^m‖₁ I_m P)),  (5)

where s ≜ (s_1, . . . , s_M) ∈ ℝ₊^M is a non-negative vector set by the BS, and vector s^m ≜ (s_1, . . . , s_m) ∈ ℝ₊^m consists of the first m elements of s. Assuming that all previous layers are correctly decoded, the rate achievable for layer m by a client with channel gain g is log₂(1 + gλ_m P/(1 + gI_m P)). Therefore, the client can decode all layers up to layer m if and only if its channel gain satisfies the inequality g ≥ ‖s^m‖₁. Accordingly, given the power and rate allocation vectors λ and s, the total rate that can be decoded by a client with channel gain g is given as

R(s, λ, g) ≜ Σ_{m=1}^{M} ρ_m(s^m, λ) 𝟙{g ≥ ‖s^m‖₁}.  (6)

We study the maximization of the expected achievable rate

R̄(s, λ) ≜ 𝔼[R(s, λ, g)],  (7)

where the expectation is over the fading distribution p_g(g), with respect to the power and rate allocation vectors λ and s. That is, we consider the optimization problem

(s*, λ*) ≜ argmax_{s ∈ ℝ₊^M, λ ∈ Δ^M} R̄(s, λ).  (8)
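The decodable rate (6) can be computed directly from these definitions. The sketch below follows our reading of the parametrization (5), namely the per-layer rate evaluated at the threshold gain ‖s^m‖₁; the variable names are of our choosing.

```python
import numpy as np

def ldm_rate(s, lam, g, P):
    """Total rate (6) decoded by a client with channel gain g.

    s, lam: length-M rate/power allocation vectors. Layer m is decoded
    iff g >= s[0] + ... + s[m-1], i.e., the l1-norm of the first m entries.
    """
    s, lam = np.asarray(s, float), np.asarray(lam, float)
    cum_s = np.cumsum(s)                                     # ||s^m||_1 for m = 1..M
    I = np.concatenate([np.cumsum(lam[::-1])[::-1][1:], [0.0]])  # I_m = sum_{i>m} lam_i
    rho = np.log2(1.0 + cum_s * lam * P / (1.0 + cum_s * I * P))  # per-layer rates (5)
    return float(np.sum(rho * (g >= cum_s)))                 # indicator-weighted sum (6)
```

For a single layer (M = 1) with s = (1) and P = 3, a client with g ≥ 1 decodes log₂(1 + 3) = 2 bits, while a client with g < 1 decodes nothing.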

III. EMPIRICAL AVERAGE RATE MAXIMIZATION
In this paper, we assume that the BS does not know the fading distribution p_g(g), and hence it cannot directly optimize the expected achievable rate R̄(s, λ). Instead, we assume that the BS has access to a dataset

G ≜ {g_1, . . . , g_N}  (9)

consisting of N fading realizations sampled in an independent and identically distributed (i.i.d.) manner from the distribution p_g(g). Based on dataset G, which is collected offline, e.g., during deployment, the BS approximates the expected achievable rate with the empirical average

R̄_G(s, λ) ≜ (1/N) Σ_{i=1}^{N} R(s, λ, g_i).  (10)

The maximization of the average rate (10) over the power and rate allocation vectors λ and s can be expressed as the optimization problem

(s_G, λ_G) ≜ argmax_{s ∈ ℝ₊^M, λ ∈ Δ^M} R̄_G(s, λ).  (11)

A solution to problem (11) can be practically obtained via an iterative optimization scheme, as detailed in Section IV. We emphasize that optimizing the average rate R̄_G(s, λ) via problem (11) is useful not only when the fading distribution p_g(g) is unknown, but also when the direct optimization in (8) based on knowledge of the distribution p_g(g) is not tractable. In the latter case, one can potentially generate the dataset G with an arbitrary number of fading realizations N.
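The empirical objective (10) is simply a sample mean over the dataset G. The following self-contained sketch, with hypothetical allocation values, illustrates how it approximates the expected rate (7) under Rayleigh fading, for which the gain g = |h|² is exponentially distributed.

```python
import numpy as np

def ldm_rate(s, lam, g, P):
    # Total decodable rate (6) for channel gain g (same definitions as in Sec. II).
    cum_s = np.cumsum(s)
    I = np.append(lam[::-1].cumsum()[::-1][1:], 0.0)   # I_m = sum_{i>m} lam_i
    rho = np.log2(1.0 + cum_s * lam * P / (1.0 + cum_s * I * P))
    return np.sum(rho * (g >= cum_s))

def empirical_rate(s, lam, gains, P):
    # Empirical average rate (10): sample mean over the dataset G.
    return np.mean([ldm_rate(s, lam, g, P) for g in gains])

rng = np.random.default_rng(1)
s, lam, P = np.array([0.2, 0.8]), np.array([0.4, 0.6]), 10.0
gains = rng.exponential(1.0, size=5000)     # i.i.d. Rayleigh-fading gains g = |h|^2
print(empirical_rate(s, lam, gains, P))     # approximates the expected rate (7)
```

As the dataset grows, the printed sample mean concentrates around the true expectation, which is the behaviour quantified by the analysis in Section III-A.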

A. Optimality Gap and Sample Complexity
An important theoretical question is whether the expected achievable rate obtained under the power and rate allocation vectors in (11) approaches the ground-truth maximum expected achievable rate obtained with the vectors in (8) as the size of the dataset increases. If so, it would also be interesting to quantify how many samples N are required to achieve a desired level of approximation. This is the subject of this subsection.
To proceed, we define the optimality gap as

e_G ≜ R̄(s*, λ*) − R̄(s_G, λ_G),  (12)

i.e., the difference between the expected rate achieved with the optimal power and rate allocation vectors in (8) and the expected rate achieved by the empirical rate maximization (11). The optimality gap is random due to the stochastic nature of the dataset G.
To bound the optimality gap, we assume that the ℓ1-norms of the optimal vectors s* and s_G in (8) and (11), respectively, can be bounded as max{‖s*‖₁, ‖s_G‖₁} ≤ S for some known constant S > 0. Note that this assumption is not restrictive since, in practice, S represents the largest fading gain g that a client is expected to experience. The following proposition bounds the optimality gap under this assumption.
Proposition 1: Let G = {g_1, . . . , g_N} be a dataset of N fading realizations drawn independently from the fading distribution p_g(g), and let δ ∈ (0, 1]. With probability at least 1 − δ, the optimality gap (12) is bounded, for rate allocation vectors with bounded norms max{‖s*‖₁, ‖s_G‖₁} ≤ S, as in (13).

Proof: See Appendix A.

This result shows that the optimality gap scales with the number of data points N as O(√(ln(N)/N)), implying that any level of accuracy can be attained as the dataset grows larger, i.e., as N → ∞. Furthermore, for a given desired optimality gap e_G ≤ ε, the required number of data points N, i.e., the sample complexity, satisfies the approximate inequality N/ln(N) ≳ 1/ε² for large N, up to a multiplicative factor that grows with SP. Intuitively, the sample complexity increases with the signal-to-noise ratio (SNR) metric SP since, as the achievable rate increases, a better approximation is required to achieve the same subtractive optimality gap.

IV. MIRROR GRADIENT DESCENT
In this section, we introduce a gradient-based iterative optimization procedure to tackle the empirical average rate maximization problem (11). The approach is based on the introduction of a smooth surrogate objective and on mirror descent, as described in the rest of this section and summarized in Algorithm 1.

A. Smooth Surrogate Objective
A first challenge in developing iterative solutions to problem (11) is that the partial derivative of the indicator in the achievable rate expression (6) with respect to the vector s equals zero almost everywhere. Therefore, in order to facilitate the application of a gradient-based optimization procedure, we replace the rate R(s, λ, g) in (6) with the smooth surrogate objective

R̃(s, λ, g) ≜ Σ_{m=1}^{M} ρ_m(s^m, λ) σ(c(g − ‖s^m‖₁)),  (15)

where σ(x) ≜ 1/(1 + exp(−x)) is the sigmoid function, and the parameter c > 0 determines the trade-off between smoothness and accuracy of the surrogate approximation. As c → ∞, the surrogate (15) tends to the original rate (6), while smaller values of c yield non-zero partial derivatives with respect to s. Using the approximation (15), we define the surrogate empirical average rate maximization problem as

max_{s ∈ ℝ₊^M, λ ∈ Δ^M} R̃_G(s, λ),  (16)

where R̃_G(s, λ) denotes the surrogate average rate

R̃_G(s, λ) ≜ (1/N) Σ_{i=1}^{N} R̃(s, λ, g_i).  (17)
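A sketch of the surrogate (15), with the indicator in (6) replaced by a sigmoid; the exact argument of the sigmoid, σ(c(g − ‖s^m‖₁)), is our reading of the construction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def surrogate_rate(s, lam, g, P, c=10.0):
    # Smooth surrogate (15): the indicator 1{g >= ||s^m||_1} in (6) is
    # replaced by the differentiable weight sigmoid(c * (g - ||s^m||_1)).
    cum_s = np.cumsum(s)
    I = np.append(lam[::-1].cumsum()[::-1][1:], 0.0)   # I_m = sum_{i>m} lam_i
    rho = np.log2(1.0 + cum_s * lam * P / (1.0 + cum_s * I * P))
    return float(np.sum(rho * sigmoid(c * (g - cum_s))))
```

As c grows, the surrogate approaches the exact rate (6) away from the decoding thresholds, at the cost of increasingly flat gradients near them.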

B. Mirror Descent
Although the objective in (16) is smooth, plain-vanilla gradient descent cannot be applied to address the optimization (16) due to the domain constraints on the optimization variables (s, λ) ∈ ℝ₊^M × Δ^M. To tackle the constraint s ∈ ℝ₊^M, we parametrize the rate-allocation vector s with an unconstrained vector u ∈ ℝ^M as in (18).

Furthermore, to satisfy the constraint λ ∈ Δ^M, we consider a mirror-descent-based scheme which adapts the updates to the geometry of the simplex Δ^M via the exponentiated gradient [16]. Overall, this leads to a gradient ascent update (19) on the vector u and an exponentiated-gradient update (20) on the power allocation λ. The resulting procedure to optimize the empirical average rate is summarized in Algorithm 1.
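Our reading of the resulting procedure can be sketched as follows, with numerical central-difference gradients standing in for the analytic gradients in (19)-(20), and with the positivity-preserving parametrization s = exp(u) as an assumption (the paper's exact parametrization (18) is not reproduced here).

```python
import numpy as np

def surrogate_empirical_rate(s, lam, gains, P, c=10.0):
    # Surrogate empirical average rate: dataset mean of the smoothed rate (15),
    # i.e., the objective of problem (16).
    cum_s = np.cumsum(s)                                  # ||s^m||_1, m = 1..M
    I = np.append(lam[::-1].cumsum()[::-1][1:], 0.0)      # I_m = sum_{i>m} lam_i
    rho = np.log2(1.0 + cum_s * lam * P / (1.0 + cum_s * I * P))
    w = 1.0 / (1.0 + np.exp(-c * (np.asarray(gains)[:, None] - cum_s[None, :])))
    return float(np.mean(w @ rho))                        # one sigmoid weight per (sample, layer)

def num_grad(f, x, eps=1e-5):
    # Central-difference gradient, standing in for the analytic gradients.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def mirror_descent(gains, M, P, iters=200, eta=0.05, gamma=0.05, c=10.0, seed=0):
    rng = np.random.default_rng(seed)
    u = 0.1 * rng.standard_normal(M)   # s = exp(u) > 0 enforces the constraint on s
    lam = np.full(M, 1.0 / M)          # start at the centre of the simplex
    for _ in range(iters):
        u = u + eta * num_grad(lambda uu: surrogate_empirical_rate(np.exp(uu), lam, gains, P, c), u)
        gl = num_grad(lambda ll: surrogate_empirical_rate(np.exp(u), ll, gains, P, c), lam)
        w = lam * np.exp(gamma * gl)   # exponentiated-gradient (mirror) ascent step
        lam = w / w.sum()              # renormalization keeps lam on the simplex
    return np.exp(u), lam
```

The multiplicative update followed by renormalization is the exponentiated-gradient step: it keeps λ strictly positive and on the simplex at every iteration, without any projection.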

V. NUMERICAL RESULTS
In this section, we evaluate the expected rate R̄(s_G, λ_G) for the parameters (s_G, λ_G) obtained via Algorithm 1 with learning rates η = γ = 0.01 and sigmoid smoothness parameter c = 10.
The expected rate is averaged over 1000 independent realizations of the dataset G, yielding the average 𝔼_G[R̄(s_G, λ_G)].
In Fig. 2, we plot the expected achievable rate as a function of the number of layers M with power P = 20 dB, a Rayleigh fading distribution, and datasets of size N = 10, 100, and 1000. For this special case, the ideal optimal solution, obtained by using an infinite number of layers and assuming that the fading distribution is known, was derived in [9], and is used as an upper bound. Furthermore, we plot for reference the expected rate achieved with a finite number of layers when the BS knows the fading distribution, which is obtained by replacing the surrogate empirical average rate R̃_G(s, λ) with the expected rate R̄(s, λ) in the gradient-based updates (19)-(20). First, confirming the sample complexity analysis in Section III-A, for sufficiently large datasets, the expected rate is close to that achieved when the BS knows the fading distribution. Furthermore, using multiple layers provides a notable gain over a single layer, even for small datasets. Finally, the expected rate achieved with M = 6 layers and a sufficiently large dataset is seen to be close to the upper bound.
In Fig. 3, we plot the ratio of the expected rate achieved via LDM with M layers to the expected rate achieved with a single layer, as a function of the power P, for a Rayleigh fading distribution and a dataset of size N = 1000. It is observed that the gain of LDM increases with the power P. Intuitively, this is because, for sufficiently high power, splitting the last layer, while keeping the same norm ‖s‖₁, has a negligible impact on the rate ρ_M(s, λ) but adds another layer that is much more likely to be decoded (see eqs. (5)-(7)).
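This intuition can be checked numerically. Under Rayleigh fading, the expectation in (7) has the closed form 𝔼[R] = Σ_m ρ_m e^{−‖s^m‖₁}, since Pr[g ≥ t] = e^{−t} for g = |h|². The following sketch, with hypothetical allocations of equal total norm ‖s‖₁ = 1, shows that splitting a single layer into two raises the expected rate at high power.

```python
import numpy as np

def expected_rate_rayleigh(s, lam, P):
    # Closed-form expected rate (7) under Rayleigh fading (g ~ Exp(1)):
    # E[R] = sum_m rho_m * Pr[g >= ||s^m||_1] = sum_m rho_m * exp(-||s^m||_1).
    cum_s = np.cumsum(s)
    I = np.append(lam[::-1].cumsum()[::-1][1:], 0.0)   # I_m = sum_{i>m} lam_i
    rho = np.log2(1.0 + cum_s * lam * P / (1.0 + cum_s * I * P))
    return float(np.sum(rho * np.exp(-cum_s)))

P = 1000.0  # roughly 30 dB
one = expected_rate_rayleigh(np.array([1.0]), np.array([1.0]), P)
two = expected_rate_rayleigh(np.array([0.2, 0.8]), np.array([0.5, 0.5]), P)
print(one, two)   # splitting the layer (same ||s||_1) increases the expected rate
```

With these example values, the second layer keeps almost the full single-layer rate, while the first layer adds rate that is decoded with the much higher probability e^{−0.2}.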

VI. CONCLUSION
In this work, we have studied LDM as an enabler of differential QoS for broadcast/multicast communication systems. We have focused on a practical model in which the fading distribution is unknown, and the transmitter optimizes the rate and power allocation for each layer based on a dataset sampled during deployment. The optimality gap caused by the availability of limited data was bounded via a generalization analysis and was shown to decrease monotonically as the dataset grows larger. To optimize the rate and power allocation parameters, a mirror-descent-based scheme was introduced, which, for sufficiently large datasets, was demonstrated via numerical experiments to achieve an expected rate close to that achieved when the BS knows the fading distribution. Among related problems left open by this study, we mention the extension to multiple transmit antennas [11] and to channels with multiple uncoordinated transmitters [17], [18]. An extended version of this work, which introduces the conditional value-at-risk (CVaR) rate performance measure for ultra-reliable communication, and considers meta-learning as a means to reduce sample complexity by leveraging data from previous deployments, is available in [19].
Proposition 2: Let G = {g_1, . . . , g_N} be a dataset of N fading realizations drawn independently from the fading distribution p_g(g), and let δ ∈ (0, 1]. Then, with probability at least 1 − δ, the following bound holds uniformly over all s ∈ ℝ₊.

Proof: See Appendix B.

Proposition 2 implies that, with probability at least 1 − δ, we can bound the difference |R̄(s, λ) − R̄_G(s, λ)|, uniformly over all λ ∈ Δ^M and s ∈ ℝ₊^M with ‖s‖₁ ≤ S, via a chain of inequalities in which (a) follows from (22) and (24); (b) follows from the triangle inequality and from the fact that the rate of each layer is non-negative; (c) follows from Proposition 2; and (d) holds since S ≥ ‖s‖₁. Finally, based on inequalities (21) and (27), we can upper bound the optimality gap as in (13).
The true and empirical complementary cumulative distribution functions (CCDFs) of the channel gain can hence be expressed in terms of the indicator functions 𝟙{g ≥ s}. Next, we bound the expected Rademacher complexity 𝔼[Rad(L(g_1, . . . , g_N))] in (33). We assume, without loss of generality (w.l.o.g.), that the channel realizations g_1, . . . , g_N ∈ G are ordered such that g_i ≥ g_j for all j ∈ [i]. Note that, if 𝟙{g_j ≥ s} = 1 for some s ∈ ℝ₊, then 𝟙{g_i ≥ s} = 1 for all j ≤ i ≤ N. Therefore, we have |L(g_1, . . . , g_N)| = N + 1.
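The counting argument can be checked numerically: sweeping the threshold s over sorted distinct gains generates exactly N + 1 distinct indicator patterns (a small sketch with hypothetical values).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20
gains = np.sort(rng.exponential(1.0, size=N))   # ordered so that g_i >= g_j for j <= i

# Sweep thresholds: below all gains, between consecutive gains, and above all gains.
thresholds = np.concatenate([[0.0], (gains[:-1] + gains[1:]) / 2, [gains[-1] + 1.0]])
patterns = {tuple(int(v) for v in (gains >= s)) for s in thresholds}
print(len(patterns))   # N + 1 distinct indicator patterns
```

Once the pattern for some threshold contains a one at position j, it contains ones at all positions i ≥ j, so the patterns form a chain of length N + 1, matching the cardinality used in the Rademacher bound.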
2022 IEEE International Symposium on Information Theory (ISIT)