On Linearly Precoded Rate Splitting for Gaussian MIMO Broadcast Channels

In this paper, we consider a general $K$-user Gaussian multiple-input multiple-output (MIMO) broadcast channel (BC). We assume that the channel state is deterministic and known to all the nodes. While the private-message capacity region is well known to be achievable with dirty paper coding (DPC), we are interested in the simpler linearly precoded transmission schemes. In particular, we focus on linear precoding schemes combined with rate-splitting (RS). First, we derive an achievable rate region with minimum mean square error (MMSE) precoding at the transmitter and joint decoding of the sub-messages at the receivers. Then, we study the achievable sum rate of this scheme and obtain two findings: 1) an analytically tractable upper bound on the sum rate that is shown numerically to be a close approximation, and 2) how to reduce the number of active streams, which is crucial to the overall complexity, while preserving the sum rate to within a constant loss. The latter results in two practical algorithms: a stream elimination algorithm and a stream ordering algorithm. Finally, we investigate the constant-gap optimality of linearly precoded RS with respect to the capacity. Our result reveals that, while the achievable rate of linear precoding alone can be arbitrarily far from the capacity, the introduction of RS can help achieve the capacity region to within a constant gap in the two-user case. Nevertheless, we prove that the RS scheme's constant-gap optimality does not extend to the three-user case. Specifically, we show, through a pathological example, that the gap between the sum rate and the sum capacity can be unbounded.


I. INTRODUCTION
THE capacity region of a multi-antenna (MIMO) broadcast channel (BC) with additive Gaussian noise has been characterized for more than a decade [1], [2]. The capacity-achieving scheme is essentially dirty paper coding (DPC) [3] combined with minimum mean square error (MMSE) precoding. This BC capacity region can be conveniently represented with the capacity region of the dual multiple-access channel (MAC) via the so-called MAC-BC duality (also known as the uplink-downlink duality) [4], [5].
The main role of the DPC can be regarded as interference mitigation at the transmitter side, i.e., part of the interference is pre-cancelled for a given receiver at the transmitter side. The implementation of DPC is however not trivial, due to its non-linear nature and the fact that it is sensitive to the channel state information at the transmitter side (CSIT) [6]. As such, linear precoding is used in most practical systems instead. Apart from the low implementation complexity, it can be shown that linear precoding schemes such as zero-forcing (ZF) achieve the maximum degrees of freedom (DoF) of the system [7], [8]. Intuitively, ZF is sufficient for the transmitter to exploit all the available dimensions of the signal space in a BC, leading consequently to the DoF optimality. Despite its simplicity, the dimension-counting DoF metric is coarse since it only characterizes the pre-log factor of the achievable rate when the channel gains are bounded while the signal-to-noise ratio (SNR) goes to infinity. As a result, it fails to capture the disparity of the channel strengths among users, and thus in some cases provides little information on the system behavior for different channel realizations. To see how ZF can be useless and the DoF metric can be meaningless for some channel realizations, let us consider the following toy example with two users. Let the channel vectors from the transmitter to the receivers be $[\sqrt{1-\epsilon^2} \ \ \epsilon]$ and $[\sqrt{1-\epsilon^2} \ \ {-\epsilon}]$, respectively, which are linearly independent for any $\epsilon \in (0, 1)$. With ZF, the beam directions for the receivers would be $[\epsilon \ \ \sqrt{1-\epsilon^2}]$ and $[-\epsilon \ \ \sqrt{1-\epsilon^2}]$, respectively, each perpendicular to the other receiver's channel to avoid interference. Provided that each stream has power $P/2$ and the noise power at each receiver is 1, the achievable rate for each user is $\log(1 + 2\epsilon^2(1-\epsilon^2)P)$.
Note that the DoF analysis would completely erase the impact of any non-zero $\epsilon$ and give 1 DoF for each user, while the actual rate can be arbitrarily close to 0 when $\epsilon$ is close to 0 or to 1. In fact, serving only one user would provide a rate $\log(1 + P)$, much larger than the sum rate of ZF in those extreme cases. Indeed, the two receivers' signal spaces can have a non-negligible overlap so that nullifying interference at the transmitter (e.g., ZF) could be highly suboptimal.
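The toy example can be checked numerically. The following is a minimal sketch in plain Python; the function name `toy_zf_rates` and the chosen values of $\epsilon$ and $P$ are illustrative assumptions, and rates are computed in bits (base-2 logarithm).

```python
import math

def toy_zf_rates(eps, P):
    """Per-user ZF rates and the single-user rate (bits/s/Hz) for the two-user toy channel."""
    s = math.sqrt(1.0 - eps ** 2)
    h1, h2 = (s, eps), (s, -eps)      # unit-norm channel vectors
    v1, v2 = (eps, s), (-eps, s)      # unit-norm ZF beams: v1 is orthogonal to h2, v2 to h1
    dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
    assert abs(dot(h2, v1)) < 1e-12 and abs(dot(h1, v2)) < 1e-12
    r1 = math.log2(1.0 + 0.5 * P * dot(h1, v1) ** 2)   # power P/2 per stream, unit noise
    r2 = math.log2(1.0 + 0.5 * P * dot(h2, v2) ** 2)
    r_single = math.log2(1.0 + P * dot(h1, h1))        # all power to one user, matched beam
    return r1, r2, r_single

if __name__ == "__main__":
    for eps in (0.01, 0.5, 0.99):
        r1, r2, rs = toy_zf_rates(eps, P=100.0)
        print(f"eps={eps}: ZF sum rate={r1 + r2:.3f}, single-user rate={rs:.3f}")
```

For $\epsilon$ near 0 or 1 the ZF sum rate collapses while the single-user rate stays at $\log(1+P)$, although both users still have one DoF.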
To account for the relative strength of the channel coefficients, one can let the channel gains of different links grow with the SNR polynomially with different exponents, and the resulting pre-log of the achievable rate is called the generalized DoF (GDoF). An even finer characterization is the constant-gap rate, a rate that is within a constant gap to the exact achievable rate for any channel realization. Indeed, since the constant-gap rate and the exact rate have the same pre-log when the SNR goes to infinity, the GDoF and DoF can also be derived from the constant-gap rate. Therefore, we have the following progressive improvements on the rate approximation [9]: DoF → GDoF → constant-gap.
0018-9448 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
To circumvent the limitation of interference, we can introduce rate splitting (RS) so that interference is decodable. The idea of using RS to partially mitigate interference was first proposed for the two-user interference channels [10], [11], in which independent messages are sent by independent transmitters to their respective receivers. Essentially, each individual message is split into one private part and one common part, where the common part is decodable by both receivers, although it is intended for only one of them. Each receiver decodes and thus can remove the common message from the interfering transmitter. It turned out that such a scheme achieves the capacity region of the two-user interference channel to within 1 bit/s/Hz [12]. The same idea can also be applied to the BCs. In [13], the authors showed that RS can provide a strict sum DoF gain of a BC when only imperfect CSIT is available. Extensions to different settings have been made in later works [14]-[16]. Besides, RS has also been considered for robust transmissions under bounded CSIT errors in [17]. In contrast to the DPC that pre-cancels interference at the transmitter side, RS enables interference mitigation at the receivers' side by making the interference decodable by the receivers.
In this work, we are interested in the constant-gap rate of linearly precoded RS schemes for Gaussian MIMO BC. In particular, we consider a general RS scheme with MMSE precoding and joint decoding of the sub-messages at the receivers, and characterize the corresponding achievable rate region. The main contributions of our work are summarized as follows.
• A major challenge in investigating the general $K$-user case is the large number of rate constraints, up to $K 2^{2^{K-1}}$ in general. Characterizing the maximum sum rate, even up to a constant gap, is hard even when $K$ is only moderately large. For instance, there are about $3 \times 10^5$ and $2 \times 10^{10}$ constraints when $K = 5$ and $K = 6$, respectively. Our contribution here is to analyze the achievable constant-gap sum rate and obtain two meaningful results. First, while the set of rate constraints is large, we can carefully choose a subset that leads to $K$ closed-form upper bounds. Remarkably, the proposed upper bound, as the minimum value of the aforementioned $K$ upper bounds, turns out to be a numerically close approximation. Second, we show how to reduce the number of active streams, which is crucial to the overall complexity, while preserving the constant-gap sum rate. Specifically, we propose two practical algorithms: 1) a stream elimination algorithm that removes streams without causing a rate loss of more than a given target value, and 2) a stream ordering algorithm that orders the entire set of $2^K - 1$ streams according to their impact on the sum rate.
• A central theoretical question on any communication scheme is whether it achieves the channel capacity or, if not, how far it is from capacity-achieving. Our contribution here is to investigate whether linearly precoded RS is constant-gap optimal, that is, achieves the capacity to within a constant gap (constant-gap capacity in short). After showing that any linear precoding scheme alone cannot be constant-gap optimal, we go on and prove that with rate-splitting the entire achievable constant-gap rate region coincides with the capacity region in the two-user case. Nevertheless, we show that such optimality does not extend beyond two users. Specifically, we use the derived sum rate upper bound and a simple pathological channel realization to demonstrate an unbounded gap to the sum capacity in the three-user case. In fact, the RS scheme is not even GDoF optimal in this example. We argue that without a proper codebook design for interference alignment or interference pre-cancellation (e.g., DPC), the independent interference streams become overwhelming for each individual receiver to decode. Our study thus reveals a fundamental gap between the receiver-side interference mitigation and the transmitter-side interference mitigation.
In the literature, there have been quite a few works on RS for BC in recent years. While some of the works consider perfect CSIT, most works apply RS to mitigate interference caused by CSIT imperfection. In particular, one can consider the GDoF while letting the CSIT error scale as $\mathrm{SNR}^{-\beta}$, where $\beta \ge 0$ is used to measure the CSIT accuracy [17]. In such a setting, the presence of the common message to all users is necessary to make full use of the transmit power. It is shown in [9] that using only one common message together with the private messages is optimal in a symmetric $K$-user setting. Many other works focus on the optimization problems related to precoder designs for different channel models, with perfect or imperfect CSIT assumptions. Some optimize the sum rate [15], while others focus on the pre-log factor at high SNR [16]-[18]. The MMSE precoder has also been considered in previous works, but only for private messages [19], [20].
In this work, we are interested in the unanswered, yet fundamental, question of whether the RS scheme can be close to optimal in a stronger sense than the DoF even with perfect CSIT (which can be regarded as an extreme case of the imperfect CSIT). Our work's technical and new contribution beyond the state of the art is the constant-gap analyses of linearly precoded RS schemes. Furthermore, while most of the existing works on the RS are limited to one or two layers of common messages [15], [20], our work is the first to characterize an analytical upper bound of the general K-user RS scheme, to the best of our knowledge. This is also the first attempt to propose a constructive way with an analytical criterion to reduce the number of streams according to the actual channel realization, in contrast to previous works that apply numerical simulations for such purposes.
The remainder of the paper is organized as follows.
We present the channel model in Section II. In Section III, we describe the RS scheme in its most general form with MMSE precoding, and derive the achievable constant-gap rate region. The constant-gap sum rate is analyzed in Section IV with the general K-user sum rate upper bound derived in Section IV-A and the stream elimination and ordering algorithms described in Section IV-B. The constant-gap optimality is then investigated in Section V, before we conclude the paper in Section VI.

II. SYSTEM MODEL AND PRELIMINARIES
Notation: In this paper, we use the following notational conventions. For random quantities, we use upper case non-italic letters, e.g., $\mathrm{X}$, for scalars, and upper case non-italic bold letters, e.g., $\mathbf{V}$, for vectors. Deterministic quantities are denoted in a rather conventional way with italic letters, e.g., a scalar $x$. For a positive integer $n$, $[n]$ denotes the set $\{1, \ldots, n\}$. Subsets are denoted with calligraphic capitalized letters, e.g., $\mathcal{K}$ and $\mathcal{S}$. $|\mathcal{S}|$ represents the cardinality of the set $\mathcal{S}$. We use $2^{\mathcal{K}}$ to denote the power set of $\mathcal{K}$, i.e., the collection of all subsets of $\mathcal{K}$. We use $\bar{\mathcal{K}}$ to denote the complement set of $\mathcal{K}$, i.e., $\bar{\mathcal{K}} := [K] \setminus \mathcal{K}$ when $[K]$ is the whole set. To avoid confusion, we use bold calligraphic letters to specify sets of sets, e.g., $\boldsymbol{\mathcal{S}}$, which are referred to as collections. For convenience, we use $\{A_k\}_k$ to denote the set $\{A_k : k = 1, \ldots, K\}$ where $A$ can be any object. The colon equal ":=" denotes equality by definition or assignment. "Conv" stands for the convex hull operation. $(x)^+ := \max\{x, 0\}$. Throughout the paper, we use "≈" for constant-gap approximation.

A. Channel Model
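We consider a $K$-user time-invariant and frequency-flat Gaussian MIMO BC where the transmitter has $n_t$ antennas and receiver $k$ has $n_{r,k}$ antennas. The channel output at receiver $k$ at time $i$ is
$$\mathbf{y}_k[i] = \mathbf{H}_k \mathbf{x}[i] + \mathbf{z}_k[i], \quad k \in [K],$$
or, in a compact form,
$$\mathbf{y}[i] = \mathbf{H} \mathbf{x}[i] + \mathbf{z}[i],$$
where $\mathbf{z}_k[i]$ is the additive white Gaussian noise with unit variance per entry, and $\mathbf{H} := [\mathbf{H}_1^T \ \cdots \ \mathbf{H}_K^T]^T$ is the global channel matrix, assumed to be deterministic and known globally. The input sequence is subject to the power constraint $\frac{1}{n} \sum_{i=1}^{n} \|\mathbf{x}[i]\|^2 \le P$.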

B. Capacity Region
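Let us assume that the transmitter sends an independent message to each receiver $k$, $k \in [K]$, at a rate $R_k$ bits/s/Hz. The capacity region, denoted by $\mathcal{C}_{\mathrm{BC}}(\{\mathbf{H}_k\}_k, P)$, is the set of rate-tuples $(R_1, \ldots, R_K)$ such that the probability of decoding error can be arbitrarily small when $n \to \infty$. This capacity region can be conveniently characterized with the so-called MAC-BC duality. Namely, the capacity region of a MIMO BC with power constraint $P$ is the union of the capacity regions of the dual MAC over all individual power constraints that sum to $P$, i.e.,
$$\mathcal{C}_{\mathrm{BC}}(\{\mathbf{H}_k\}_k, P) = \bigcup_{\{\mathbf{Q}_k \succeq 0\}_k:\ \sum_k \mathrm{tr}(\mathbf{Q}_k) \le P} \mathcal{C}_{\mathrm{MAC}}(\{\mathbf{H}_k\}_k, \{\mathbf{Q}_k\}_k),$$
where the region $\mathcal{C}_{\mathrm{MAC}}(\{\mathbf{H}_k\}_k, \{\mathbf{Q}_k\}_k)$ denotes the capacity region of the dual MAC under the individual covariance constraint $\mathbf{Q}_k$, $k \in [K]$. In fact, it is well known (see, e.g., [21], [22]) that this region is the set of non-negative rate tuples satisfying
$$\sum_{k \in \mathcal{S}} R_k \le \log\det\Big(\mathbf{I} + \sum_{k \in \mathcal{S}} \mathbf{H}_k^H \mathbf{Q}_k \mathbf{H}_k\Big), \quad \forall\, \mathcal{S} \subseteq [K].$$
Since the log-det function is increasing with the partial ordering of positive semi-definite matrices, it follows from $\mathbf{Q}_k \preceq P\mathbf{I}$, $k \in [K]$, that
$$\log\det\Big(\mathbf{I} + \sum_{k \in \mathcal{S}} \mathbf{H}_k^H \mathbf{Q}_k \mathbf{H}_k\Big) \le \log\det\Big(\mathbf{I} + P \sum_{k \in \mathcal{S}} \mathbf{H}_k^H \mathbf{H}_k\Big) \le \log\det\Big(\mathbf{I} + \frac{P}{n_r} \sum_{k \in \mathcal{S}} \mathbf{H}_k^H \mathbf{H}_k\Big) + n_t \log n_r,$$
where $n_r := \sum_{k=1}^{K} n_{r,k}$. Hence, we have the following lemma.
Lemma 1: The BC capacity region $\mathcal{C}_{\mathrm{BC}}(\{\mathbf{H}_k\}_k, P)$ is within $\gamma := n_t \log n_r$ bits/s/Hz to the MAC capacity region $\mathcal{C}_{\mathrm{MAC}}(\{\mathbf{H}_k\}_k, \{\tfrac{P}{n_r}\mathbf{I}\}_k)$.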
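The constant $\gamma = n_t \log n_r$ in Lemma 1 is consistent with the elementary matrix inequality $\mathbf{I} + P\mathbf{G} \preceq n_r (\mathbf{I} + \frac{P}{n_r}\mathbf{G})$ for any positive semi-definite $n_t \times n_t$ matrix $\mathbf{G}$. The sketch below checks this numerically on a random channel. It is only an illustration, not the proof; the comparison point with covariances $\frac{P}{n_r}\mathbf{I}$ and the helper names are assumptions.

```python
import numpy as np

def logdet2(A):
    """log2 det of a Hermitian positive definite matrix."""
    _, val = np.linalg.slogdet(A)
    return val / np.log(2.0)

def lemma1_gap(H, P):
    """Return (gap, bound): gap between log det(I + P H^H H) and
    log det(I + (P/n_r) H^H H) in bits, and the bound n_t * log2(n_r)."""
    n_r, n_t = H.shape
    G = H.conj().T @ H                      # n_t x n_t Gram matrix
    upper = logdet2(np.eye(n_t) + P * G)
    lower = logdet2(np.eye(n_t) + (P / n_r) * G)
    return upper - lower, n_t * np.log2(n_r)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
    for P in (1.0, 1e2, 1e4):
        gap, bound = lemma1_gap(H, P)
        print(f"P={P:g}: gap={gap:.3f} bits, bound={bound:.3f} bits")
```

The gap grows with $P$ but never exceeds $n_t \log_2 n_r$, whatever the channel.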
An optimal scheme that achieves the exact capacity region consists in combining the DPC with the MMSE precoding [22]. Specifically, for a given encoding order, each message is first encoded using Costa's DPC that pre-cancels the previously encoded signals, and is then precoded with the MMSE matrix. In this way, each receiver only sees the interference from the messages that are encoded afterwards. Here we can clearly see the duality between the successive encoding of the BC and the successive interference cancellation (SIC) decoding of the MAC. Since part of the interference is pre-cancelled at the transmitter side and the receivers treat the residual interference as additive noise, such a scheme can be regarded as transmitter-side interference mitigation. While the MMSE precoding is linear, the DPC is non-linear and can be implemented with nested lattices [23].

C. Constant-Gap Rate Region
In the following, we give a formal definition of the constant-gap rate region.
The scheme is GDoF optimal if, when each entry of the channel matrix scales as $H_{ij} = \bar{H}_{ij} P^{\alpha_{ij}}$, the gap to the capacity region grows as $o(\log P)$ when $P \to \infty$. Note that any above optimality still holds when we scale the power $P$ by a constant. Therefore, throughout the paper, we scale the power whenever it is convenient.

D. Linear Precoding With Point-to-Point Codes
By removing the DPC from the transmitter, we have a much simpler but strictly suboptimal scheme, namely, the linear precoding scheme. Specifically, by linear precoding, here we refer to a particular class of schemes such that 1) independent point-to-point Gaussian codebooks are used to encode the $K$ streams; 2) the transmitted signal is a linear combination of the $K$ codewords; and 3) interference is treated as noise at each receiver. Under these assumptions, a single-letter rate region can be obtained in terms of the input random variable
$$\mathbf{X} = \sum_{k=1}^{K} \mathbf{X}_k, \quad \mathbf{X}_k \sim \mathcal{CN}(0, \mathbf{Q}_k), \quad \sum_{k=1}^{K} \mathrm{tr}(\mathbf{Q}_k) \le P.$$
Then, the rate region achieved by such a linear precoding is
$$\mathcal{C}^{\mathrm{LP}}_{\mathrm{BC}}(\{\mathbf{H}_k\}_k, P) := \bigcup_{\{\mathbf{Q}_k\}_k} \Big\{ (R_1, \ldots, R_K):\ 0 \le R_k \le \log\det\Big(\mathbf{I} + \mathbf{H}_k \mathbf{Q}_k \mathbf{H}_k^H \Big(\mathbf{I} + \sum_{j \ne k} \mathbf{H}_k \mathbf{Q}_j \mathbf{H}_k^H\Big)^{-1}\Big),\ k \in [K] \Big\}.$$
Note that the region $\mathcal{C}^{\mathrm{LP}}_{\mathrm{BC}}(\{\mathbf{H}_k\}_k, P)$ is not convex. With a simple time-sharing strategy, we can achieve the convex hull of the region. The time-sharing strategy can also be generalized to the resource-sharing strategy. Specifically, one can divide the whole resource (e.g., time and frequency) into orthogonal portions, say, $\lambda_1, \ldots, \lambda_N$, such that $\lambda_1 + \cdots + \lambda_N = 1$ and $\lambda_i > 0$, $\forall\, i$. In each portion $i$ of the resource, we can perform the linear precoding with covariance matrices $\{\mathbf{Q}_k^{(i)}\}_k$ specific to that portion. Although the resource-sharing strategy can improve the achievable rate region, we can show that the improvement is bounded.
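For a fixed set of stream covariances, the rates of such a linear scheme follow from the standard treating-interference-as-noise expression. The sketch below evaluates it; the function name `tin_rates` and the example channel are hypothetical, and rates are in bits.

```python
import numpy as np

def tin_rates(H_list, Q_list):
    """Rates (bits/s/Hz) when each receiver treats the other streams as noise.

    H_list[k]: (n_rk, n_t) channel of user k.
    Q_list[k]: (n_t, n_t) transmit covariance of the stream of user k.
    """
    Q_total = sum(Q_list)
    rates = []
    for H, Q in zip(H_list, Q_list):
        n_r = H.shape[0]
        # Covariance of signal + interference + noise, and of interference + noise
        total = np.eye(n_r) + H @ Q_total @ H.conj().T
        interf = np.eye(n_r) + H @ (Q_total - Q) @ H.conj().T
        _, ld_total = np.linalg.slogdet(total)
        _, ld_interf = np.linalg.slogdet(interf)
        rates.append((ld_total - ld_interf) / np.log(2.0))
    return rates

if __name__ == "__main__":
    # Two single-antenna users that both see both transmit antennas
    H_list = [np.array([[1.0, 0.5]]), np.array([[0.5, 1.0]])]
    Q_list = [np.diag([5.0, 0.0]), np.diag([0.0, 5.0])]
    print(tin_rates(H_list, Q_list))
```

The difference of the two log-determinants equals $\log\det(\mathbf{I} + \mathbf{H}_k\mathbf{Q}_k\mathbf{H}_k^H(\mathbf{I} + \sum_{j\ne k}\mathbf{H}_k\mathbf{Q}_j\mathbf{H}_k^H)^{-1})$, so no explicit matrix inverse is needed.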
Lemma 2: With linear precoding schemes, the achievable rate region with the resource-sharing strategy described above is within $n_r$ bits/s/Hz to the region with only time-sharing.
Proof: See Appendix A.
Therefore, it is without loss of constant-gap optimality to focus on the simple time-sharing strategy.

E. Single-Antenna (SISO) BC
In the single-antenna case, i.e., when the transmitter and all the receivers each have only one antenna, the analysis becomes easier. We can prove that the rate region of the linear scheme can be arbitrarily far from the capacity region. Let us consider the two-user case in which $|h_1| \gg |h_2|$. From the MAC-BC duality, letting $P_1 = P_2 = \frac{1}{2}P$, the following rates are achievable:
$$R_1 = \log\Big(1 + \frac{P_1 |h_1|^2}{1 + P_2 |h_2|^2}\Big) \approx \log\Big(1 + \frac{P |h_1|^2}{1 + P |h_2|^2}\Big), \quad R_2 = \log(1 + P_2 |h_2|^2) \approx \log(1 + P |h_2|^2),$$
where we recall that "≈" stands for constant-gap approximation. In contrast, with the linear scheme, the achievable rate of user 2 is $R_2 = \log\big(1 + \frac{\bar{P}_2 |h_2|^2}{1 + \bar{P}_1 |h_2|^2}\big)$, where $\bar{P}_1, \bar{P}_2$ are the powers for user 1 and user 2, respectively, with $\bar{P}_1 + \bar{P}_2 \le P$. In order for user 2 to achieve $\log(1 + P|h_2|^2)$ bits/s/Hz with the linear scheme, the interference term $\bar{P}_1 |h_2|^2$ in $R_2$ must remain bounded while the power $\bar{P}_2$ should be within a constant factor to $P$. Consequently, user 1's rate must be bounded by a constant, since
$$R_1 = \log\Big(1 + \frac{\bar{P}_1 |h_1|^2}{1 + \bar{P}_2 |h_1|^2}\Big) \le \log\Big(1 + \frac{\bar{P}_1}{\bar{P}_2}\Big) = O(1),$$
where the last equality is from the fact that $\bar{P}_2$ is within a constant factor to $P$. Nevertheless, we know that single-user transmission achieves the sum capacity to within a constant gap. Indeed, if we only serve the user with the strongest channel gain, say, user 1, we achieve
$$R_1 = \log(1 + P|h_1|^2), \quad (16)$$
whereas the sum capacity is, from the MAC-BC duality, $\max_{P_1 + P_2 \le P} \log(1 + P_1 |h_1|^2 + P_2 |h_2|^2) \le \log(1 + P|h_1|^2)$.
Remark 1: Note that the capacity region of a SISO BC can be achieved by linear superposition coding, since the channel is stochastically degraded [22]. So a linear scheme does achieve capacity in this case. However, the receivers need to decode a subset of the interfering signals in order to achieve the capacity. Specifically, each receiver needs to first decode all the messages for the receivers with weaker channel gains, then remove the interference before decoding the intended message. In fact, this is a simple form of RS in which the message for the weakest user is indeed a common message that needs to be decoded by all the users, although it is intended for only one of them. The performance of linearly precoded RS in a general MIMO setting is the main subject of this paper, and will be treated in Section III.
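The two-user SISO argument can be illustrated numerically. In the sketch below, the power split (user 1's power kept near the noise level at user 2) and the function name `two_user_siso` are illustrative assumptions; the comparison is between user 1 treating user 2's signal as noise and user 1 first decoding and removing it, which is possible because $|h_1| \ge |h_2|$.

```python
import math

def two_user_siso(h1, h2, P):
    """Return (R2, R1_tin, R1_sic) in bits/s/Hz for |h1| >= |h2| and unit noise power."""
    g1, g2 = abs(h1) ** 2, abs(h2) ** 2
    P1 = min(P / 2.0, 1.0 / g2)      # user 1's signal arrives near the noise level at user 2
    P2 = P - P1
    R2 = math.log2(1.0 + P2 * g2 / (1.0 + P1 * g2))       # user 2 treats user 1's signal as noise
    R1_tin = math.log2(1.0 + P1 * g1 / (1.0 + P2 * g1))   # user 1 treats user 2's signal as noise
    R1_sic = math.log2(1.0 + P1 * g1)                     # user 1 decodes and removes user 2's signal
    return R2, R1_tin, R1_sic

if __name__ == "__main__":
    R2, R1_tin, R1_sic = two_user_siso(h1=100.0, h2=1.0, P=1e4)
    print(f"R2={R2:.2f}, R1 (interference as noise)={R1_tin:.4f}, R1 (with decoding)={R1_sic:.2f}")
```

With these numbers user 2 stays within about one bit of $\log_2(1 + P|h_2|^2)$ in both cases, while user 1's rate is almost zero unless it decodes the interference.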

III. RATE-SPLITTING WITH MMSE PRECODING
In this section, we introduce an RS scheme at the transmitter side, and describe this scheme in the general MIMO case with $K$ users. We shall derive the corresponding achievable rate region in its general form.

A. K-User BC With Common Messages
The considered RS scheme builds on a general K-user scheme with common messages. It is worth mentioning that the capacity region of the two-user MIMO BC with common message has been completely characterized in [24]. In that work, the authors showed that Marton's inner bound based on binning is indeed tight with Gaussian signaling. Here, we shall investigate the general K-user case but only on the achievable rate region with independent point-to-point codebooks.
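First, let $\{M_{\mathcal{K}} : \mathcal{K} \subseteq [K], \mathcal{K} \ne \emptyset\}$ be a set of $2^K - 1$ independent messages, each one with rate $R_{\mathcal{K}}$ bits/s/Hz. These messages are encoded with independent Gaussian codebooks, each generated identically and independently according to a distribution
$$\mathbf{X}_{\mathcal{K}} \sim \mathcal{CN}(0, \mathbf{Q}_{\mathcal{K}}), \quad \mathbf{Q}_{\mathcal{K}} := P_{\mathcal{K}} \big(\mathbf{I} + P\, \mathbf{H}_{\bar{\mathcal{K}}}^H \mathbf{H}_{\bar{\mathcal{K}}}\big)^{-1},$$
where $\bar{\mathcal{K}} := [K] \setminus \mathcal{K}$ and $\mathbf{H}_{\bar{\mathcal{K}}}$ is a matrix formed by the vertical concatenation of the channel matrices of the users in $\bar{\mathcal{K}}$, with the convention $\mathbf{H}_{\emptyset} = 0$; the coefficients $\{P_{\mathcal{K}}\}$ are chosen to satisfy the power constraint $\sum_{\mathcal{K}} \mathrm{tr}(\mathbf{Q}_{\mathcal{K}}) \le P$. Such a precoding scheme is known as the MMSE precoding. The idea behind the MMSE precoding is to limit the interference power at the unintended receivers. Indeed, the covariance matrix of $\mathbf{X}_{\mathcal{K}}$ at the set $\bar{\mathcal{K}}$ of users is
$$\mathbf{H}_{\bar{\mathcal{K}}} \mathbf{Q}_{\mathcal{K}} \mathbf{H}_{\bar{\mathcal{K}}}^H = P_{\mathcal{K}}\, \mathbf{H}_{\bar{\mathcal{K}}} \big(\mathbf{I} + P\, \mathbf{H}_{\bar{\mathcal{K}}}^H \mathbf{H}_{\bar{\mathcal{K}}}\big)^{-1} \mathbf{H}_{\bar{\mathcal{K}}}^H \preceq \frac{P_{\mathcal{K}}}{P} \mathbf{I},$$
that is, below the AWGN level whenever $P_{\mathcal{K}} \le P$. Unlike the ZF precoding that completely nullifies interference, the MMSE precoding is known to achieve a better trade-off between interference and signal power. Further, the application of the ZF precoding is possible only when a non-empty interference null space exists, whereas the MMSE precoding is feasible in general. The transmitted signal is a superposition of all the streams, $\mathbf{X} = \sum_{\mathcal{K}} \mathbf{X}_{\mathcal{K}}$. Next, each receiver $k$ jointly decodes the set of messages $\{M_{\mathcal{K}} : \mathcal{K} \ni k\}$ by treating the interferences $\{\mathbf{X}_{\mathcal{K}} : \mathcal{K} \not\ni k\}$ as noise. Thus, for each receiver $k$, it is equivalent to a virtual MAC whose achievable rate region is the set of non-negative rate tuples satisfying, for every collection $\boldsymbol{\mathcal{S}}_k \subseteq \{\mathcal{K} : \mathcal{K} \ni k\}$,
$$\sum_{\mathcal{K} \in \boldsymbol{\mathcal{S}}_k} R_{\mathcal{K}} \le \log\det\Big(\mathbf{I} + \mathbf{H}_k \Big(\sum_{\mathcal{K} \in \boldsymbol{\mathcal{S}}_k} \mathbf{Q}_{\mathcal{K}}\Big) \mathbf{H}_k^H \Big(\mathbf{I} + \mathbf{H}_k \Big(\sum_{\mathcal{K} \not\ni k} \mathbf{Q}_{\mathcal{K}}\Big) \mathbf{H}_k^H\Big)^{-1}\Big).$$
The above rate constraints provide the exact characterization of the achievable rate region for any linear precoding scheme (not necessarily the MMSE precoding). Note that the region is quite involved with a large number of parameters. For our purpose, however, it is enough to have an approximate region, i.e., to within a constant gap. This allows us to simplify the region and obtain the following result.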
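The interference-limiting property of this kind of precoding can be verified numerically. The sketch below assumes that the stream covariance has the regularized-inverse form $\mathbf{Q} = P_{\mathcal{K}} (\mathbf{I} + P\, \mathbf{H}^H \mathbf{H})^{-1}$, with $\mathbf{H}$ the stacked channel of the unintended users; that form and the function names are assumptions made for illustration.

```python
import numpy as np

def mmse_covariance(H_unintended, P_stream, P_total):
    """Q = P_stream * (I + P_total * H^H H)^{-1}, H = stacked channels of the unintended users."""
    n_t = H_unintended.shape[1]
    G = H_unintended.conj().T @ H_unintended
    return P_stream * np.linalg.inv(np.eye(n_t) + P_total * G)

def interference_covariance(H_unintended, Q):
    """Covariance of the stream as received by the unintended users."""
    return H_unintended @ Q @ H_unintended.conj().T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H = rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
    Q = mmse_covariance(H, P_stream=10.0, P_total=100.0)
    C = interference_covariance(H, Q)
    # The largest eigenvalue is at most P_stream / P_total, hence below the unit noise level
    print(np.linalg.eigvalsh(C).max())
```

Each eigenvalue of the received interference covariance has the form $P_{\mathcal{K}} \sigma^2 / (1 + P \sigma^2) \le P_{\mathcal{K}}/P$, where $\sigma$ is a singular value of $\mathbf{H}$, which is what the printed value reflects.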
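Let $\mathcal{R}^{\mathrm{CM}}_{\mathrm{BC}}$ be the set of achievable rate tuples $(R_{\mathcal{K}} : \mathcal{K} \subseteq [K], \mathcal{K} \ne \emptyset)$ by the proposed scheme with MMSE precoding satisfying the power constraint $P$ and joint decoding at the receivers. Then, the set of non-negative rate tuples satisfying
$$\sum_{\mathcal{K} \in \boldsymbol{\mathcal{S}}_k} R_{\mathcal{K}} \le \log\det\big(\mathbf{I} + \mathbf{H}_k \mathbf{Q}_{\boldsymbol{\mathcal{S}}_k} \mathbf{H}_k^H\big) \quad (22)$$
for all $k \in [K]$ and collections $\boldsymbol{\mathcal{S}}_k \subseteq \{\mathcal{K} : \mathcal{K} \ni k\}$ forms an achievable constant-gap rate region with respect to $\mathcal{R}^{\mathrm{CM}}_{\mathrm{BC}}$, where we define for convenience
$$\mathbf{Q}_{\boldsymbol{\mathcal{S}}} := \sum_{\mathcal{K} \in \boldsymbol{\mathcal{S}}} P \big(\mathbf{I} + P\, \mathbf{H}_{\bar{\mathcal{K}}}^H \mathbf{H}_{\bar{\mathcal{K}}}\big)^{-1}. \quad (23)$$
Note that in the above simplification, we have omitted the interference term and replaced $P_{\mathcal{K}}$ by $P$ in $\mathbf{Q}_{\mathcal{K}}$, both of which only incur a bounded power loss in terms of $K$, but can simplify further analyses. The number of constraints in the above region corresponds to the number of non-empty collections $\boldsymbol{\mathcal{S}}_k$ for all $k \in [K]$.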
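The number of constraints can be counted directly: receiver $k$ belongs to $2^{K-1}$ user groups, so it has $2^{2^{K-1}} - 1$ non-empty collections of such groups. A short sketch follows; the function names are illustrative, and the brute-force enumeration is only practical for small $K$.

```python
from itertools import combinations

def num_rate_constraints(K):
    """Non-empty collections of user groups containing k, summed over the K receivers."""
    return K * (2 ** (2 ** (K - 1)) - 1)

def collections_for_user(k, K):
    """Brute-force enumeration of the non-empty collections for receiver k (small K only)."""
    users = range(1, K + 1)
    groups = [frozenset(c) for r in range(1, K + 1)
              for c in combinations(users, r) if k in c]
    return [frozenset(col) for r in range(1, len(groups) + 1)
            for col in combinations(groups, r)]

if __name__ == "__main__":
    for K in range(2, 7):
        print(K, num_rate_constraints(K))
    # Cross-check the closed form against the enumeration for K = 3
    assert 3 * len(collections_for_user(1, 3)) == num_rate_constraints(3)
```

The count is about $3 \times 10^5$ for $K = 5$ and exceeds $10^{10}$ for $K = 6$, which is why the exact region quickly becomes impractical to evaluate.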
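We say that the collection $\boldsymbol{\mathcal{S}}_k$ is minimal if no element is a proper subset of another element. For example, $\{\{1\}, \{1, 2\}, \{1, 3\}\}$ is not minimal since $\{1\}$ is a proper subset of $\{1, 2\}$. One can always obtain a minimal collection by removing the "smaller" elements, e.g., removing $\{1\}$ in the previous example, we obtain $\{\{1, 2\}, \{1, 3\}\}$, which is minimal. We say that $\boldsymbol{\mathcal{S}}_k$ can be reduced to a minimal collection denoted by $\underline{\boldsymbol{\mathcal{S}}}_k$. It is readily shown that
$$\mathbf{Q}_{\underline{\boldsymbol{\mathcal{S}}}_k} \preceq \mathbf{Q}_{\boldsymbol{\mathcal{S}}_k} \preceq |\boldsymbol{\mathcal{S}}_k|\, \mathbf{Q}_{\underline{\boldsymbol{\mathcal{S}}}_k}. \quad (24)$$
Therefore, we can replace $\mathbf{Q}_{\boldsymbol{\mathcal{S}}_k}$ by $\mathbf{Q}_{\underline{\boldsymbol{\mathcal{S}}}_k}$ and only lose up to a constant number of bits per channel use. Further, we notice that if both collections $\boldsymbol{\mathcal{S}}_k$ and $\boldsymbol{\mathcal{S}}'_k$ can be reduced to $\underline{\boldsymbol{\mathcal{S}}}_k$, then $\boldsymbol{\mathcal{S}}_k \cup \boldsymbol{\mathcal{S}}'_k$ can also be reduced to $\underline{\boldsymbol{\mathcal{S}}}_k$. Hence, in the equivalence class of collections sharing the same minimal $\underline{\boldsymbol{\mathcal{S}}}_k$, there is always a maximal collection $\overline{\boldsymbol{\mathcal{S}}}_k$ that is the union of all the collections that can be reduced to $\underline{\boldsymbol{\mathcal{S}}}_k$. It follows that for every collection $\boldsymbol{\mathcal{S}}_k$, there is a minimal $\underline{\boldsymbol{\mathcal{S}}}_k$ and a maximal $\overline{\boldsymbol{\mathcal{S}}}_k$ such that $\underline{\boldsymbol{\mathcal{S}}}_k \subseteq \boldsymbol{\mathcal{S}}_k \subseteq \overline{\boldsymbol{\mathcal{S}}}_k$. In the expression (22), we see that among all the constraints with $\boldsymbol{\mathcal{S}}_k$ having the same $\underline{\boldsymbol{\mathcal{S}}}_k$, thus having the same right hand side in (22) up to a constant gap due to (24), the constraint corresponding to $\overline{\boldsymbol{\mathcal{S}}}_k$ is obviously dominant since it involves all the possible terms on the left hand side. Therefore, we can further simplify the approximate rate region.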
is an achievable constant-gap rate region by the proposed scheme; here we define Example 1 (The Two-User Case): When K = 2, the rate region from Proposition 1 becomes 12 +R 12 +R where l

B. K-User BC Without Common Messages: Rate-Splitting
Now, let us get back to the original setting without common messages, i.e., with messages {M k : k = 1, . . . , K}, each one intended exclusively to one user. We can build a scheme without common messages from any scheme with common messages through rate-splitting.
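First, we split each message $M_k$ of rate $R_k$ bits/s/Hz into sub-messages $\{M^{(k)}_{\mathcal{K}} : \mathcal{K} \ni k\}$, where the sub-message $M^{(k)}_{\mathcal{K}}$, of rate $R^{(k)}_{\mathcal{K}}$, should be decoded by all users in $\mathcal{K}$, although the sub-message is intended only for user $k$.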
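Then, the sub-messages are re-assembled into $M_{\mathcal{K}} := \{M^{(k)}_{\mathcal{K}} : k \in \mathcal{K}\}$, for every non-empty $\mathcal{K} \subseteq [K]$, each one of which should be decodable by the users in $\mathcal{K}$ by construction. These $2^K - 1$ re-assembled sub-messages are transmitted with the scheme described in the previous subsection. At the receivers' side, each user $k$ decodes the set of re-assembled sub-messages $\{M_{\mathcal{K}} : \mathcal{K} \ni k\}$, but only keeps the sub-messages $\{M^{(k)}_{\mathcal{K}} : \mathcal{K} \ni k\}$ in order to reconstruct the desired message $M_k$. At this point, the following result becomes straightforward: a rate tuple $(R_1, \ldots, R_K)$ is achievable if there exist non-negative rates $\{R^{(k)}_{\mathcal{K}}\}$ such that $R_k = \sum_{\mathcal{K} \ni k} R^{(k)}_{\mathcal{K}}$, $k \in [K]$, and the tuple $(R_{\mathcal{K}} := \sum_{k \in \mathcal{K}} R^{(k)}_{\mathcal{K}})_{\mathcal{K}}$ belongs to $\mathcal{R}^{\mathrm{CM}}_{\mathrm{BC}}$. The set of such rate tuples is denoted by $\mathcal{R}^{\mathrm{RS}}_{\mathrm{BC}}$. It is worth emphasizing the three choices that we have made for the above RS scheme: 1) independent codebooks for different sets of sub-messages, 2) linear spatial MMSE precoding at the transmitter, and 3) decoding common interfering streams by treating other streams as noise at the receivers. Since the proposed scheme allows the receivers to partially decode the interference, it can be regarded as a receiver-side interference mitigation scheme.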
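The bookkeeping behind the splitting and re-assembling steps can be sketched as follows. The data layout (a dictionary keyed by the pair of user and user group) and the function names are hypothetical choices for illustration.

```python
from itertools import combinations

def user_groups(K):
    """All non-empty subsets of the users {1, ..., K}."""
    users = range(1, K + 1)
    return [frozenset(c) for r in range(1, K + 1) for c in combinations(users, r)]

def reassemble(split_rates, K):
    """split_rates[(k, group)]: rate of the sub-message of user k decoded by all users in `group`.

    Returns (stream, user): the rate of each re-assembled stream and the total rate of each user.
    """
    stream = {g: 0.0 for g in user_groups(K)}
    user = {k: 0.0 for k in range(1, K + 1)}
    for (k, group), rate in split_rates.items():
        if k not in group:
            raise ValueError("a sub-message of user k must be decoded by user k")
        stream[group] += rate   # sub-messages decoded by the same group share one stream
        user[k] += rate         # user k keeps only its own sub-messages
    return stream, user

if __name__ == "__main__":
    split = {
        (1, frozenset({1})): 1.0, (1, frozenset({1, 2})): 0.5,
        (2, frozenset({2})): 2.0, (2, frozenset({1, 2})): 0.25,
    }
    stream, user = reassemble(split, K=2)
    print(stream[frozenset({1, 2})], user)
```

The total rate is preserved by construction: the sum of the stream rates equals the sum of the user rates.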
Finally, it is possible to ignore a subset T ⊆ [K] of users and only apply the proposed RS scheme to the remaining users in [K] \ T . Together with time sharing, the achievable rate region is described as follows.
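Corollary 1: The following convex hull of rate-tuples is achievable with the proposed RS scheme and time sharing:
$$\mathrm{Conv}\Big(\bigcup_{\mathcal{T} \subset [K]} \mathcal{R}^{\mathrm{RS}}_{\mathrm{BC}}\big(\{\mathbf{H}_k\}_{k \in [K] \setminus \mathcal{T}}, P\big)\Big),$$
where the ignored users in $\mathcal{T}$ are assigned zero rate. Similarly, replacing $\mathcal{R}^{\mathrm{RS}}_{\mathrm{BC}}$ by $\bar{\mathcal{R}}^{\mathrm{RS}}_{\mathrm{BC}}$, we obtain an achievable constant-gap region.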
In the following, we shall only focus on constant-gap rates for our purpose. For brevity, we drop the term "constant-gap" whenever confusion is not likely.

IV. ACHIEVABLE SUM RATE
In general, the achievable rate region is analytically intractable. Even the numerical evaluation becomes hard for a moderately large number of users due to the exponential growth of the number of sub-messages and doubly exponential growth of the number of constraints. In this section, we shall focus on the achievable sum rate instead of the entire rate region to obtain meaningful insights. First, we shall establish an upper bound on the achievable sum rate of the proposed region. We shall then show how to preserve the achievable sum rate up to a constant loss while reducing the total number of active streams.

A. Sum Rate Upper Bound
Before analyzing the sum rate, we take a closer look at the defining term l S For convenience, we also define where we let C ∅ := 0. Finally, the following relationship between the C's and the l's will be useful.
where we recall that $\mathbf{Q}_{\boldsymbol{\mathcal{S}}}$ is defined as in (23). In particular, the upper bound is achievable to within a constant gap when $K \le 3$.
Proof: The first upper bound $C_{[K]}$ is trivial since it is the sum capacity of the channel. Alternatively, we can also recover it from the region (26). Indeed, let us consider the sequence of maximal collections $\overline{\boldsymbol{\mathcal{S}}}_k = \{\mathcal{S} : k \in \mathcal{S},\ i \notin \mathcal{S},\ \forall i > k\}$, $k \in [K]$, and their corresponding minimal collections $\underline{\boldsymbol{\mathcal{S}}}_k = \{[k]\}$, $k \in [K]$. Then, the sum rate can be decomposed as where the inequality is from (26) and the first equality is from Lemma 4. We shall now focus on the second term in (39).
For given $i, k \in [K]$, let us consider the collection $\boldsymbol{\mathcal{S}}^{(k)}_i$ as defined in (41). It is clearly a minimal collection according to the definition in Section III-A, since no element is a proper subset of another element. Now, let us define the following collection We notice that $\boldsymbol{\mathcal{S}}$ is a maximal collection according to the definition in Section III-A. From (26), we have Letting $k = K$ in (43), we obtain while letting $k = K - 1$ in (43), we have Next, we sum up $a_i$ and $b_i$ as follows.
where in the second equality we exchange the sums over $i$ and $k'$; in the third one we apply the symmetry and rearrange the $K \times \binom{K-1}{k'-1}$ summands into $k'$ summations over $\binom{K}{k'}$ terms; in the last one we define $R^{(k)} := \sum_{\mathcal{K}: |\mathcal{K}| = k} R_{\mathcal{K}}$. Summing up the right-hand sides of (44) and (45) in the same way, we obtain Similarly, for any $k \le K - 2$, we can apply (43) and obtain Summing over $i$, we have Consider the following weighted sum over $c_k$ and the same weighted sum over the right-hand side of (53), we obtain Summing up (50) and (57), and dividing both sides by $K - 1$, we have where $R^{(K)}$ Note that in (46), we can also start with Repeating the same steps, we can obtain Since the above upper bound holds for every rate tuple, it is also an upper bound on the maximum sum rate. This concludes the proof of the upper bound (39). We defer the proof of the achievability for $K \le 3$ to Appendix C.
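The rearrangement used in the third equality rests on a double-counting identity: summing the rates of all size-$k$ groups that contain user $i$, over all $i$, counts each group exactly $k$ times, so that $K \binom{K-1}{k-1} = k \binom{K}{k}$. The sketch below checks this on random rates; the function name and the random rate values are illustrative.

```python
from itertools import combinations
from math import comb
import random

def double_counting(K, k, seed=0):
    """Return (lhs, rhs, n_summands) for the identity
    sum_i sum_{groups of size k containing i} R_group = k * sum_{groups of size k} R_group."""
    rng = random.Random(seed)
    users = range(1, K + 1)
    rate = {frozenset(g): rng.random() for g in combinations(users, k)}
    lhs = sum(r for i in users for g, r in rate.items() if i in g)
    rhs = k * sum(rate.values())
    n_summands = sum(1 for i in users for g in rate if i in g)
    return lhs, rhs, n_summands

if __name__ == "__main__":
    K, k = 5, 3
    lhs, rhs, n = double_counting(K, k)
    print(lhs, rhs, n, K * comb(K - 1, k - 1), k * comb(K, k))
```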
In Table I, we provide the numerical evaluation of the upper bounds and the sum rate of the region (26) for $K = 4$ and 5 users. Two channel models are considered: Rayleigh fading and the one-ring model [25]. Average rates are obtained with 1000 channel realizations for each distribution, while the maximum gap is taken over all 2000 realizations. We observe that the upper bound is indeed very close to the exact rate on average, and the maximum gap is small compared to the average rate.
Remark 2: Note that the upper bound in (39) holds when all $K$ users are active. Nevertheless, one can ignore $K - K'$ users and apply the RS scheme to the $K'$ active users for any $K' \le K$. In this case, the above bound is still valid by replacing $K$ with $K'$ and replacing $\mathbf{H}$ with the corresponding submatrix.

B. Stream Elimination and Stream Ordering Algorithms
The general RS scheme transforms $K$ messages into $2^K - 1$ different sub-messages, and then the BS creates one stream for each sub-message aiming at the corresponding user group. In practice, we would like to reduce the number of streams for lower signaling and decoding complexity. In this section, we first propose an algorithm that eliminates some of the streams without reducing the sum rate by more than a given number of bits per channel use. Then, based on the same idea, we propose a second algorithm that orders all the $2^K - 1$ streams, and validate the algorithm through numerical simulation. For simplicity of demonstration, we focus on the MISO case with $M \ge K$ in the following. Nevertheless, the results can be extended to multi-antenna receivers straightforwardly.

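1) Sufficient Conditions to Maintain the Sum Rate:
Proposition 4: If there exist a non-empty proper subset $\mathcal{I} \subset [K]$, a constant $c$, and a matrix $\mathbf{W} \in \mathbb{C}^{M \times M}$ such that
$$\mathbf{H}_{\mathcal{I}} \mathbf{W} = \mathbf{H}_{\mathcal{I}}, \quad \mathbf{H}_{\bar{\mathcal{I}}} \mathbf{W} = \mathbf{0}, \quad \|\mathbf{W}\|^2 \le c, \quad (60)$$
then we can eliminate any stream $\mathcal{K}$ such that $\mathcal{K} \cap \mathcal{I} \ne \emptyset$ and $\mathcal{K} \cap \bar{\mathcal{I}} \ne \emptyset$ without losing more than $\log(c)$ bits.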
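Proof: Let $i = |\mathcal{I}| \in [1 : K - 1]$ and let us assume that $\mathcal{I} = \{1, \ldots, i\}$ without loss of generality. The case with an arbitrary $\mathcal{I}$ follows in the same way up to a row permutation of $\mathbf{H}$. Let $\mathbf{x}_{\mathcal{K}}$ be the signal corresponding to the stream intended for user group $\mathcal{K}$, and the corresponding sub-message and rate are denoted by $M_{\mathcal{K}}$ and $R_{\mathcal{K}}$, respectively. Further, let us assume that $\mathcal{K} \cap \mathcal{I} \ne \emptyset$ and $\mathcal{K} \cap \bar{\mathcal{I}} \ne \emptyset$, i.e., $\mathcal{K}$ contains at least one user inside $\mathcal{I}$ and at least one outside of it. Without loss of generality, let $\mathcal{K} = \mathcal{K}' \cup \mathcal{K}''$ with $\mathcal{K}' \subseteq \mathcal{I}$ and $\mathcal{K}'' \subseteq \bar{\mathcal{I}}$.
Let $\mathbf{W} \in \mathbb{C}^{M \times M}$ be such that (60) holds. Then, we define a new signal corresponding to the sub-message $M_{\mathcal{K}}$ as $\mathbf{x}'_{\mathcal{K}} = \mathbf{W} \mathbf{x}_{\mathcal{K}}$. Due to the condition (60), the received signal corresponding to the sub-message $M_{\mathcal{K}}$ at users in $\bar{\mathcal{I}}$ is $\mathbf{H}_{\bar{\mathcal{I}}} \mathbf{x}'_{\mathcal{K}} = \mathbf{H}_{\bar{\mathcal{I}}} \mathbf{W} \mathbf{x}_{\mathcal{K}} = \mathbf{0}$, while the received signal at users in $\mathcal{I}$ is $\mathbf{H}_{\mathcal{I}} \mathbf{x}'_{\mathcal{K}} = \mathbf{H}_{\mathcal{I}} \mathbf{W} \mathbf{x}_{\mathcal{K}} = \mathbf{H}_{\mathcal{I}} \mathbf{x}_{\mathcal{K}}$, i.e., remains the same as with $\mathbf{x}_{\mathcal{K}}$. Hence, with the new signaling scheme, users in $\mathcal{I}$ see no changes, and users in $\bar{\mathcal{I}}$ do not receive any signal related to the sub-message $M_{\mathcal{K}}$. In other words, the sub-message $M_{\mathcal{K}}$ can be downgraded to a sub-message to users in $\mathcal{K}' = \mathcal{K} \cap \mathcal{I}$ without degrading the decoding performance of other users.
Next, we evaluate the power loss. We have $\|\mathbf{x}'_{\mathcal{K}}\|^2 = \|\mathbf{W} \mathbf{x}_{\mathcal{K}}\|^2 \le \sigma^2_{\max}(\mathbf{W}) \|\mathbf{x}_{\mathcal{K}}\|^2$, where $\sigma_{\max}$ denotes the maximum singular value of a matrix, with $\sigma_{\max}(\mathbf{W}) \le \|\mathbf{W}\|$. Next we scale down the power of $\mathbf{x}'_{\mathcal{K}}$ to meet the power constraint, namely, we let $\mathbf{x}''_{\mathcal{K}} := \frac{1}{\|\mathbf{W}\|} \mathbf{x}'_{\mathcal{K}}$. Note that by scaling down the power by a factor $\|\mathbf{W}\|^2$, we have a rate loss on $R_{\mathcal{K}}$ of at most $\log(\|\mathbf{W}\|^2)$ bits/s/Hz. Since decreasing the power of one stream cannot hurt the other streams, the sum rate loss is at most $\log(\|\mathbf{W}\|^2) \le \log(c)$ bits/s/Hz. The proof is complete.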
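One concrete matrix $\mathbf{W}$ that satisfies the two equalities used in the proof can be built from the pseudo-inverse of the channel. The sketch below assumes a MISO channel matrix with full row rank ($M \ge K$) and the Frobenius norm; `elimination_matrix` is a hypothetical helper. Under these assumptions, $\mathbf{W} = \mathbf{H}^{+} \mathbf{E}_{\mathcal{I}} \mathbf{H}$ is the minimum Frobenius-norm solution, where $\mathbf{E}_{\mathcal{I}}$ keeps the rows of the users in $\mathcal{I}$ and zeroes the others.

```python
import numpy as np

def elimination_matrix(H, I):
    """Minimum Frobenius-norm W with H_I W = H_I and H_Ibar W = 0.

    H: (K, M) MISO channel matrix, assumed to have full row rank (M >= K).
    I: list of 0-based user indices forming the subset I.
    """
    K = H.shape[0]
    E = np.zeros((K, K))
    E[I, I] = 1.0                        # diagonal selector of the users in I
    return np.linalg.pinv(H) @ E @ H     # H W = (H H^+) E H = E H

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    H = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
    W = elimination_matrix(H, [0, 1])
    c = np.linalg.norm(W, "fro") ** 2
    print("rate loss bound (bits):", np.log2(c))
```

The value $\log_2 \|\mathbf{W}\|^2$ printed here plays the role of the loss bound $\log(c)$ for this particular choice of $\mathbf{W}$ and norm.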
2) Stream Elimination and Ordering: where, with a slight abuse of notation, we use $\boldsymbol{\infty}$ to denote a matrix with infinite norm. Proposition 4 has the following equivalent form.
The following property is straightforward from the algorithm.

Claim 1: The output collection from Algorithm 1, denoted as $\boldsymbol{\mathcal{S}}(c)$, is decreasing with the threshold $c$ such that $\boldsymbol{\mathcal{S}}(0) = 2^{[K]} \setminus \emptyset$ and $\boldsymbol{\mathcal{S}}(\infty) = [K]$.
In practice, in order to reduce the precoding and decoding complexity, one may want to order the streams and use only the "best" ones. From the above discussion, we observe that one way to order the streams is to use the minimum threshold for a stream to be eliminated from Algorithm 1. Specifically, such a threshold, denoted by $c_{\mathcal{K}}$, is defined in (63) for each $\mathcal{K} \subseteq [K]$ with $|\mathcal{K}| \ge 2$. It is probable that more than one common stream shares the same threshold value. In that case, one can introduce a simple randomization to resolve the tie. Algorithm 2 summarizes this procedure. Note that streams with larger threshold values are considered "better". The following claim is straightforward from the definition of the minimum threshold in (63).
Claim 2: The minimum threshold $c_{\mathcal{K}}$, for $\mathcal{K} \subseteq [K]$ and $|\mathcal{K}| \ge 2$, as defined in (63), is decreasing with the partial ordering of $\mathcal{K}$.
Note that this is intuitive since demanding more users to decode the same sub-message should become more costly, and therefore higher-order sub-messages have lower priority. With Algorithm 2, one can choose the $N$ "best" streams for an arbitrary number $N \le 2^K - 1$. Although it is a heuristic way to identify a given number of best streams, the complexity is much lower compared to the exact solution. Note that to find the exact solution, one would need to consider all ; the complexity for finding the minimum in (64) is $O(2^K)$; the sorting has complexity $O(2^K \log(2^K)) = O(K 2^K)$. Therefore, the overall complexity of Algorithm 2 is $O(KM^2 2^K + 2^K 2^K + K 2^K) = O(4^K)$, assuming reasonably that $K^2 M \le 2^K$ when $K$ and $M$ become large.
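The ordering idea can be sketched in a few lines. This is an illustrative reading of the procedure and not Algorithm 2 itself: it assumes that the minimum threshold of a common stream $\mathcal{K}$ is the smallest squared norm of a nulling matrix over all user sets $\mathcal{I}$ that split $\mathcal{K}$, reuses the hypothetical candidate $\boldsymbol{W} = \boldsymbol{H}^{\dagger}\boldsymbol{D}\boldsymbol{H}$ for each $\mathcal{I}$, always ranks the private streams first, and breaks ties deterministically rather than by randomization. The function names are made up for the example.

```python
from itertools import combinations
import numpy as np

def nulling_norm_sq(H, keep):
    """Squared spectral norm of the candidate W = pinv(H) @ D @ H (assumed form)."""
    K = H.shape[0]
    D = np.zeros((K, K))
    D[list(keep), list(keep)] = 1.0
    return np.linalg.norm(np.linalg.pinv(H) @ D @ H, 2) ** 2

def order_streams(H):
    """Rank all 2^K - 1 streams: private ones first, then common ones by threshold."""
    K = H.shape[0]
    users = range(K)
    # Cost of confining a stream to each non-empty proper subset I of users.
    cost = {frozenset(I): nulling_norm_sq(H, I)
            for r in range(1, K) for I in combinations(users, r)}
    thresholds = {}
    for r in range(2, K + 1):
        for S in map(frozenset, combinations(users, r)):
            # Minimise over the sets I that split S (S meets both I and its complement).
            thresholds[S] = min(c for I, c in cost.items() if S & I and S - I)
    common = sorted(thresholds, key=thresholds.get, reverse=True)  # larger is "better"
    private = [frozenset([k]) for k in users]
    return private + common, thresholds

rng = np.random.default_rng(0)
K, M = 4, 4
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
order, thresholds = order_streams(H)
for S in order[K:]:
    print(sorted(S), round(float(thresholds[S]), 3))
```

Under this reading, any set that splits a stream also splits every larger stream containing it, so the computed thresholds cannot increase when users are added, in line with Claim 2.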
To show that Algorithm 2 can be practically effective, we run a numerical simulation for $K = 4$ users. We use the one-ring scattering model [25] to introduce spatial correlation, a scenario in which RS is particularly useful. In the simulation, we consider two groups with low inter-group correlation and high intra-group correlation. Each of the four users can be associated randomly with one of the groups. We then apply Algorithm 2 to order the streams. In Fig. 1, we show the achievable rate when the $N$ "best" streams out of the total $2^K - 1$ streams are activated. We also plot the achievable rate of the 1-layer RS scheme in which all private streams and one common stream to all users (the stream $[K]$) are activated. We observe that when $N = K + 1 = 5$, the algorithm chooses a common stream to combine with the $K = 4$ private streams, which improves the sum rate performance. It outperforms the 1-layer scheme, which does not depend on the channel realization. This example shows that our algorithm can provide an effective and efficient way to select a given number of streams adapted to the channel condition.
V. CONSTANT-GAP OPTIMALITY AND NON-OPTIMALITY
In the previous sections, we have investigated the achievable rate of linearly precoded RS schemes. In this section, we are interested in the optimality of such schemes as compared to the capacity region in the constant-gap sense.

A. Linear Precoding Alone Is Not Constant-Gap Optimal
We have shown in Section II-E that even the single-user transmission, as an extreme case of the linear schemes, can achieve the sum capacity to within a constant gap. We shall now show that the same optimality does not hold with multiple antennas with linear precoding alone. For this purpose, we consider a two-user MISO BC with two transmit antennas. Note that, in this case, the channel matrix $\boldsymbol{H}_k \in \mathbb{C}^{1 \times 2}$ is a row vector for each user $k$, $k = 1, 2$; therefore, it is instead denoted by $\boldsymbol{h}_k$ following our notational convention. As discussed in Section II-B, it is without loss of optimality to consider the following quantity as the sum capacity, as we are only interested in the capacity to within a constant gap:
$\max\bigl\{\log(1 + P\|\boldsymbol{h}_1\|^2),\ \log(1 + P\|\boldsymbol{h}_2\|^2),\ \log(1 + P^2 \det(\boldsymbol{H}\boldsymbol{H}^{\mathsf{H}}))\bigr\}$. (66)
Note that the first two terms in (66) can be achieved with single-user transmission, by serving the stronger user. Therefore, the only non-trivial case is when $\log(1 + P^2 \det(\boldsymbol{H}\boldsymbol{H}^{\mathsf{H}}))$ is the dominating term in (66).
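The constant-gap claim behind (66) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $\log\det(\boldsymbol{I} + P\boldsymbol{H}\boldsymbol{H}^{\mathsf{H}})$ as a stand-in for the two-user sum capacity up to a constant, and it reads (66) as the maximum of the two single-user terms $\log(1 + P\|\boldsymbol{h}_k\|^2)$ and the joint term $\log(1 + P^2\det(\boldsymbol{H}\boldsymbol{H}^{\mathsf{H}}))$. The function name and the random channel are hypothetical.

```python
import numpy as np

def sum_rate_terms(H, P):
    """Return (log-det value, max-of-three-terms surrogate), both in bits."""
    G = H @ H.conj().T                                    # 2x2 Gram matrix H H^H
    logdet = np.log2(np.linalg.det(np.eye(2) + P * G).real)
    single = [np.log2(1 + P * G[k, k].real) for k in range(2)]   # serve one user
    joint = np.log2(1 + P**2 * np.linalg.det(G).real)            # both users active
    return logdet, max(single + [joint])

rng = np.random.default_rng(1)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)

gaps = []
for P in (1.0, 1e2, 1e4, 1e6):
    logdet, surrogate = sum_rate_terms(H, P)
    gaps.append(logdet - surrogate)
    print(f"P={P:g}: logdet={logdet:.3f}, surrogate={surrogate:.3f}, gap={gaps[-1]:.3f}")
```

Since $\det(\boldsymbol{I} + P\boldsymbol{G}) = 1 + P\,\mathrm{tr}(\boldsymbol{G}) + P^2\det(\boldsymbol{G})$ for a $2 \times 2$ matrix $\boldsymbol{G}$, the gap printed above lies between $0$ and $\log_2 3$ bits regardless of $P$.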
To prove our statement, let us assume that the channel matrix has the following triangular form
Now let us restrict ourselves to linear precoding schemes at the transmitter and treating interference as noise at the receivers. In particular, we let $\boldsymbol{X} = \boldsymbol{X}_1 + \boldsymbol{X}_2$ such that $\mathbb{E}[\boldsymbol{X}_1\boldsymbol{X}_1^{\mathsf{H}}] = \boldsymbol{Q}_1$ and $\mathbb{E}[\boldsymbol{X}_2\boldsymbol{X}_2^{\mathsf{H}}] = \boldsymbol{Q}_2$ with the following eigenvalue decompositions
where $|\tilde{u}_1|^2 = |u_1|^2 = 1 - |\tilde{v}_1|^2 = 1 - |v_1|^2$ and $\lambda_1 \ge \mu_1 \ge 0$ without loss of generality; the same convention is applied for $\boldsymbol{Q}_2$. Due to the Gaussian signaling, we have
Note that we are only interested in the case with
for otherwise it is equivalent to the single-user case to within a constant gap. In this case, the achievable sum rate with linear precoding can be written as
We can now maximize over $\boldsymbol{Q}_1$ and over $\boldsymbol{Q}_2$ separately. In fact, one can show the following lemma.
Lemma 6: For any $\boldsymbol{Q}_1$ and $\boldsymbol{Q}_2$ in (69) and (70), we have
Proof: See Appendix B.
To show that linear precoding is not constant-gap optimal, we consider high SNR $P$ and let the channel coefficients scale with $P$ as $f = P^{\alpha_f}$ and $g = P^{\alpha_g}$ for some $\alpha_f, \alpha_g \in \mathbb{R}$.
It follows that the achievable sum rate also scales with $P$ as $d_{\mathrm{LP}}(\alpha_f, \alpha_g) \log P + O(1)$, while the sum capacity scales as $d_{\mathrm{DPC}}(\alpha_f, \alpha_g) \log P + O(1)$. Here, the pre-log factor is the GDoF as explained in Section II-C. We shall show that there exist some $(\alpha_f, \alpha_g)$ such that $d_{\mathrm{LP}}(\alpha_f, \alpha_g) < d_{\mathrm{DPC}}(\alpha_f, \alpha_g)$.
Remark 3: It is important to emphasize that the above results are based on the assumption of Gaussian signaling. In fact, Gaussian input has been proved to be strictly suboptimal in some multi-user settings. For instance, in [27], the authors have investigated the two-user Gaussian interference channel with point-to-point codes, and showed that a mixed input is needed to achieve the optimal GDoF. There, the mixed input is the sum of a discrete random variable and a Gaussian variable. With the mixed input, the optimal decoding, e.g., maximum likelihood decoding, exploits the structure of the interference and achieves a better performance than in the case with Gaussian interference. Essentially, as the authors of [27] pointed out, the discrete part carries somehow a sort of "common information" that both receivers can exploit. That explains why RS is not needed with such inputs to achieve the optimal GDoF. Note, however, that the optimal decoding in this case may be much more involved than the one for Gaussian interference. The latter only needs a simple nearest neighbour decoding.
In the following, we consider linear precoding schemes with rate-splitting.

B. Linear Precoded RS Is Constant-Gap Optimal With Two Users
The rate region of the two-user BC with RS is given in Example 1 from (28) to (31). Defining
and applying the Fourier-Motzkin elimination [22], we obtain the following achievable region
which corresponds to the capacity region $\mathcal{C}_{\mathrm{MAC}}(\{\boldsymbol{H}_k^{\mathsf{H}}\}_k, P\boldsymbol{I})$ of the dual MAC. We thereby establish the constant-gap optimality of the proposed RS scheme with MMSE precoding in the two-user case.

C. Linear Precoded RS Is Constant-Gap Sub-Optimal With Three Users
We shall show that the constant-gap optimality does not extend beyond two users. To that end, we first present the constant-gap sum rate of the three-user case.
Proposition 5: The optimal sum rate $R_{\mathrm{sum}}$ of the proposed RS scheme with MMSE precoding in the three-user case is within a constant gap to $R^*_{\mathrm{sum}}$ in (82), which is the maximum of $\max\{C_{12}, C_{13}, C_{23}\}$ and a second term involving $l^{(2)}_k$, $k = 1, 2, 3$, with $l^{(m)}_k$ defined by (40).
Proof: This is a direct consequence of Proposition 3. Indeed, if only two users out of the three, say, users 1 and 2, are activated, then the sum rate $C_{12}$ is achievable to within a constant gap according to Proposition 3. Similarly, $C_{13}$ and $C_{23}$ can be achieved if we activate another subset instead. If all three users are activated, then the achievable constant-gap sum rate in Proposition 3 becomes the second term inside the $\max\{\cdot\}$ in (82). Note that activating only one user achieves $\max\{C_1, C_2, C_3\}$, which is strictly smaller than $\max\{C_{12}, C_{13}, C_{23}\}$.
From the above result, we can prove the constant-gap sub-optimality of the proposed RS scheme.
Corollary 3: The proposed scheme is not GDoF optimal (and therefore not constant-gap optimal) in the three-user case.
Proof: In order to prove the suboptimality, it is enough to find a class of channel matrices $\boldsymbol{H}$ such that $C_{123} - R^*_{\mathrm{sum}}$ can be arbitrarily large, where $C_{123}$ is the sum capacity of the channel. Since $R^*_{\mathrm{sum}}$ in (82) is still quite involved due to the presence of $l^{(2)}_k$, $k = 1, 2, 3$, we further upper bound $R^*_{\mathrm{sum}}$ using the following inequality.
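Hence, we have $R^*_{\mathrm{sum}} \le \bar{R}^*_{\mathrm{sum}}$, where $\bar{R}^*_{\mathrm{sum}}$ is the maximum of $C_{12}$, $C_{13}$, $C_{23}$, and a further term. In the following, we shall show that $\bar{R}^*_{\mathrm{sum}}$ can be arbitrarily smaller than $C_{123}$. To that end, we shall focus on the high SNR regime and look at the pre-log of the rate expressions.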
Define the pre-log $d_{\mathcal{K}} := \lim_{P \to \infty} \frac{C_{\mathcal{K}}}{\log P}$, and we have
From (85), we have the following upper bound for the pre-log of $R^*_{\mathrm{sum}}$,
which is strictly smaller than the optimal sum GDoF $d_{123} = 3$ for any $0 < \alpha < 1$. This implies the constant-gap sub-optimality of the proposed RS scheme.

D. Deficiency of Receiver-Side Interference Mitigation
One may wonder why the proposed RS scheme is constant-gap optimal in the two-user case but not in the three-user case. In particular, is it possible to improve the current RS scheme with a better precoding (other than the MMSE precoding) or with a more sophisticated decoding scheme? To have a better understanding of why the RS scheme fails in the three-user case, let us have a closer look at the above pathological example. From the dual MAC, we know that a GDoF triple (1, 1, 1) is achievable, e.g., with joint decoding or successive interference cancellation in the uplink receiver. Specifically, the receiver can first decode user 3's message using only the third antenna, obtaining GDoF 1, and remove it before decoding user 1 and user 2's messages from the first and the second antennas, respectively. In the downlink, with DPC, the exact reverse procedure can be applied and the same GDoF triple can be obtained. This is the advantage of transmitter-side interference cancellation where the transmitter manipulates optimally all the signals so that the interference at the receivers' side is minimized.
With the RS scheme, however, the receivers are interference-limited. To see this, let us impose that user 1 and user 2 both have GDoF 1. Thus, full power $P$ must be used for antennas 1 and 2 to send the users' signals, which generates an interference power $P^{1+\alpha}$ at user 3. Note that user 3's signal, in order not to interfere with user 1 and 2's signals, must be essentially sent from antenna 3, arriving at user 3 with power $P$. Unless the interference could be fully cancelled or decoded and removed, full GDoF 1 would not be achievable. As shown in Figure 2, we can split signal 1 into common and private parts $\sqrt{P} X_{c,1} + \sqrt{P^{1-\alpha}} X_{p,1}$ with DoF $\alpha$ and $1 - \alpha$, respectively. Similarly, for signal 2, we use $\sqrt{P} X_{c,2} + \sqrt{P^{1-\alpha}} X_{p,2}$. Signal 3 carries the private information for user 3 and cancels the private parts in signals 1 and 2, namely, $\sqrt{P} X_{p,3} - \sqrt{P} X_{p,1} - \sqrt{P} X_{p,2}$, so that user 3 receives $\sqrt{P} X_{p,3} + \sqrt{P^{1+\alpha}} (X_{c,1} + X_{c,2}) + Z_3$. Note that $X_{p,3}$, $X_{c,1}$, $X_{c,2}$, with a total DoF $1 + 2\alpha$, must be decoded by user 3 in order to recover the private DoF of 1. This is impossible since the maximum GDoF for receiver 3 is $1 + \alpha$. Instead, user 3 can only achieve a GDoF of $1 + \alpha - 2\alpha = 1 - \alpha$. In other words, the RS scheme achieves the $(1, 1, 1 - \alpha)$ GDoF triple instead of $(1, 1, 1)$. Note that the above discussion is independent of the precoding scheme and the decoding scheme, which implies that the sub-optimality of the RS scheme cannot be resolved in these directions.
In fact, the fundamental issue of the RS scheme in the above example is that independent codebooks are used for different streams. Intuitively, the interference signal space becomes too large for any individual receiver. If one could align different interferers into a reduced subspace, however, then the achievable rate could be improved. In particular, in the above case, if the information in $X_{c,1} + X_{c,2}$ only occupies a DoF of $\alpha$ instead of $2\alpha$, then user 3 could decode the sum of the interferences instead of the individual interferences, and achieve the GDoF $1 + \alpha - \alpha = 1$. This is precisely the idea of interference alignment [28], [29]. Instead of using independent codebooks, one could use the same lattice codebook for $X_{c,1}$ and $X_{c,2}$ in such a way that the sum is still within the same codebook and thus has a reduced rate. Therefore, by combining RS and interference alignment, it is possible to reduce the GDoF gap, and it may be possible to attain constant-gap optimality. One may also improve the performance by using non-linear precoding for interference cancellation. For example, a recent work [30] proposes an RS scheme with Tomlinson-Harashima precoding, which has lower implementation complexity than DPC and is shown to outperform the linear precoding schemes. Nevertheless, such improvements come at the price of a higher complexity at the transmitter side, which limits their practical and theoretical interest.
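The GDoF bookkeeping of the three-user example can be written out as a short calculation. The snippet below is a sketch of the counting argument only, under the assumptions stated in the text: user 3 sees a total received pre-log of $1 + \alpha$, and the common parts it must decode occupy a pre-log of $2\alpha$ with independent codebooks or $\alpha$ when aligned. The function names are illustrative.

```python
def user3_gdof(alpha, aligned=False):
    """GDoF left for user 3's private signal in the three-user example."""
    received = 1 + alpha                          # pre-log of total received power P^(1+alpha)
    interference = alpha if aligned else 2 * alpha  # pre-log taken by X_c1, X_c2
    return received - interference

def gdof_triple(alpha, aligned=False):
    """GDoF triple when users 1 and 2 are both required to achieve GDoF 1."""
    return (1.0, 1.0, user3_gdof(alpha, aligned))

alpha = 0.25
rs_triple = gdof_triple(alpha)                    # independent codebooks
ia_triple = gdof_triple(alpha, aligned=True)      # aligned common parts
print(rs_triple, sum(rs_triple))
print(ia_triple, sum(ia_triple))
```

With independent codebooks the sum GDoF is $3 - \alpha$, which falls short of the optimal value 3 for any $0 < \alpha < 1$; the aligned case recovers the triple $(1, 1, 1)$ in this count.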
Finally, it is also worth mentioning that Gaussian signaling is known to achieve the capacity region of a two-user BC with common message [24], while the optimal signaling for more users with common messages is still unknown. The above rate analysis on the three-user RS scheme, and thus the conclusion, may not hold with a different signaling.

VI. CONCLUSION
We have investigated the achievable rate region of linearly precoded rate-splitting schemes in the K-user MIMO broadcast channel to within a constant gap. In particular, we have derived the achievable constant-gap sum rate for K ≤ 3, and obtained closed-form upper bounds for K > 3. The constant-gap results, though asymptotic, provide useful insights that guided us to propose a practical stream elimination algorithm. Our analyses also revealed the constant-gap optimality of linearly precoded RS with respect to the fundamental capacity region in the two-user case. While such optimality does not extend beyond two users, we have provided explanations on the deficiency and potential remedies. The results presented in the initial version of this work have been followed up in [31] with additional precoder optimization and numerical simulations. Therein, the stream elimination algorithm has also been applied and shown effective in practical scenarios.
Note that for $K$-user broadcast channels with general message sets, even in the Gaussian MIMO case with degraded message sets, the capacity region is still unknown. In those cases, rate-splitting goes beyond a method to simplify transmission, as in our case with linear precoding, and becomes an essential tool to improve the achievable rate region when combined with binning [32].
APPENDIX A PROOF OF LEMMA 2
Denote $P_i := \sum_{k=1}^{K} \operatorname{tr}(\boldsymbol{Q}^{(i)}_k)$, $i = 1, \ldots, N$, then the power constraint can be expressed as
Define the following variable $\mu_i$ for each resource portion $i$
From the definition of $\{\mu_i\}_i$ in (91), we can verify that, for any $i$,
where (98) is from (91); (99) is from (90). Hence, we have,
Finally, from (93), the proof is complete.
APPENDIX C PROOF OF THE ACHIEVABILITY OF (39) FOR $K = 2$ AND $K = 3$
First we present the following submodularity property that will be useful later.
Proof: Indeed, this inequality can be proved directly using matrix properties or, more conveniently, with mutual information. Let $\mathcal{A} = \mathcal{A}' \cup \mathcal{A}''$ with $\mathcal{A}' \cap \mathcal{A}'' = \emptyset$. Then, identifying $C_{\mathcal{A} \cup \mathcal{B}}$, $C_{\mathcal{A}}$, $C_{\mathcal{A}' \cup \mathcal{B}}$, and $C_{\mathcal{A}'}$ with I(X A∪B ; Y), , and X A ), respectively, the above inequality becomes
after applying the chain rule of mutual information. This holds since
A. The Two-User Case ($K = 2$)
When $K = 2$, the rate region is given by (28)-(31). Then, we can verify that any rate quadruple such thatR 1 = l   (23), while for each $m \in [K]$, $l^{(m)}_k$ is defined as in (40); slightly abusing the notation, $C_{ijk}$ denotes $C_{\{i,j,k\}}$ as defined in (37); the above relationships between the $l$'s and $C$'s can be verified by their definitions.
Similarly, we can obtain the following constraints on the rates of the re-assembled messages which should be decoded by receivers 2 and 3,R
In the following, we shall show that there exists a rate tuple satisfying all the constraints (115)-(129) that achieves the sum rate upper bound (39). Note that when $K = 3$, the second term in the bound (39) becomes 2 , l
Without loss of generality we assume that the receivers are ordered such that 1 + C 2 + C 3 + 3C 123 − C 12 − C 23 − C 13 2 .
In the following, we shall show that all the remaining constraints are satisfied. First, (116) can be rewritten as
which is equivalent to the assumption (141). Similarly, we can verify that (117), (121), and (126) are all equivalent to the assumption (141). Then, we shall verify that the constraints (119), (122), (123), (127), (128), which all involve $\tilde{R}_{23}$ and $\tilde{R}_{123}$, are satisfied. Note that $\tilde{R}_{23}$ and $\tilde{R}_{123}$ remain undetermined except for their sum given by (147). Due to the symmetry of the constraints and assumptions on receivers 2 and 3, we only need to consider (119), (122), and (123). The three constraints, combined with (142)-(146), can be rewritten as
We need to show that there exist $\tilde{R}_{23} \ge 0$ and $\tilde{R}_{123} \ge 0$ such that all three constraints above and (147) can be satisfied simultaneously. It is enough to show the following.
where the first inequality is from l 2 ≥ C 23 − C 3 due to Lemma 4; the second one is from the assumption (131); the last one is from Lemma 7.
• The sum of the right-hand sides of (149) and (150) is larger than the right-hand side of (147). Indeed, we have 1 ) + (C 12 + C 13 − C 1 − C 123 ) where the inequality is from the assumption (141).
• The sum of the right-hand sides of (149) and (151) is larger than the right-hand side of (147). Indeed, we have
where the inequality follows from the assumption (131). The proof is thus complete.