Rate Splitting for Multi-Antenna Downlink: Precoder Design and Practical Implementation

Rate splitting (RS) is a potentially powerful and flexible technique for multi-antenna downlink transmission. In this paper, we address several technical challenges towards its practical implementation for beyond 5G systems. To this end, we focus on a single-cell system with a multi-antenna base station (BS) and K single-antenna receivers. We consider RS in its most general form, and joint decoding to fully exploit the potential of RS. First, we investigate the achievable rates under joint decoding and formulate the precoder design problems to maximize a general utility function, or to minimize the transmit power under pre-defined rate targets. Building upon the concave-convex procedure (CCCP), we propose precoder design algorithms for an arbitrary number of users. Our proposed algorithms approximate the intractable non-convex problems with a number of successively refined convex problems, and provably converge to stationary points of the original problems. Then, to reduce the decoding complexity, we consider the optimization of the precoder and the decoding order under successive decoding. Further, we propose a stream selection algorithm to reduce the number of precoded signals. With a reduced number of streams and successive decoding at the receivers, our proposed algorithm can even be implemented when the number of users is relatively large, whereas the complexity was previously considered as prohibitively high in the same setting. Finally, we propose a simple adaptation of our algorithms to account for the imperfection of the channel state information at the transmitter. Numerical results demonstrate that the general RS scheme provides a substantial performance gain as compared to state-of-the-art linear precoding schemes, especially with a moderately large number of users.


I. INTRODUCTION
While the first version of the fifth generation (5G) has been recently deployed, many communication requirements for future applications, e.g., exceptionally high bit rates and high energy efficiency, remain unaddressed. A plethora of new multi-antenna (MIMO) transmission techniques, such as cell-free massive MIMO, hybrid beamforming, lens antenna arrays, and large intelligent surfaces (LIS), have been recently proposed for that purpose. Nevertheless, even with these new techniques, interference is still the fundamental barrier towards better performance in a wireless MIMO network, especially in the downlink [1], [2].
Implementing MIMO downlink is challenging for several reasons. First, to mitigate interference at the receivers' side, precoding that relies on precise channel state information at the transmitter's side (CSIT) is needed. Such information is hard to obtain, especially at high mobility. Second, even with perfect CSIT, precoder design is non-trivial. The optimal precoder that achieves the capacity region is known as dirty paper coding (DPC) [3], [4], which is nonlinear. Implementing DPC requires vector quantization that is NP-hard. Furthermore, it is also well known that DPC is quite sensitive to CSIT accuracy [5]. As such, in current systems, linear precoders such as zero forcing (ZF) are used instead. It is well established that ZF achieves the optimum degree of freedom (DoF) at high signal-to-noise ratio (SNR) [6], [7]. However, the authors in [8] show that, despite its DoF optimality, any linear precoding scheme can be far from optimal, since the gap between the achievable sum rate of the best linear scheme and the sum capacity can be unbounded. Indeed, with such linear schemes, we deal with interference in essentially two ways: 1) the transmitter applies interference avoidance by steering the signal of any user into other users' null space, and 2) the receivers treat interference as noise. In other words, linear precoders are designed such that interference power is minimized as compared to the signal power at the receivers' side. The cost to suppress interference can be high when the channels for some of the users are spatially aligned.
To circumvent this limitation, the idea of rate splitting (RS) is basically to introduce a new option to the receivers: interference decoding. Specifically, each individual message is split into private and common parts, which are respectively encoded and carried by different signals. Each common part is decodable by (though not necessarily intended for) multiple receivers. Each receiver can decode and then remove the common part before decoding the private part. In this way, part of the interference is removed since it is decodable, which improves the overall performance. RS was originally proposed to partially mitigate interference in the two-user interference channel [9], [10], in which independent messages are sent by independent transmitters to their respective receivers. It turns out that such a scheme achieves the capacity region of the two-user interference channel to within one bit per channel use (PCU) [11]. In [12], RS is applied to the multi-antenna broadcast channel (BC) and shown to provide a strict sum DoF gain when only imperfect CSIT is available. Recently, in [8], the authors establish the optimality of linearly precoded RS in the constant gap sense in the two-user MIMO BC case.
February 19, 2020 DRAFT
In its most general form, the RS scheme can split each message into as many as $2^{K-1}$ sub-messages in a $K$-user channel. Then, the total $K 2^{K-1}$ sub-messages are re-assembled into $2^K - 1$ new messages. The BS creates one directional signal for each re-assembled message, and we also refer to such signals as streams. Precoder design together with power allocation can be done across all the sub-messages. Such great flexibility also comes at the cost of high complexity, for both the precoder design and the decoder implementation. The goal of this work is therefore to investigate the true potential of RS, and to solve some technical challenges towards its practical implementation. Our main contributions can be summarized into the following two items.
Precoder design: We consider RS in its most general form, i.e., with an arbitrary number K of users and an arbitrary subset of active streams. To explore the full potential of RS, we consider joint decoding of all common messages at each receiver. We formulate the precoder design problems to optimize the commonly used performance metrics, such as the weighted sum rate, the worst-user rate, as well as the transmit power (for given target rates). These problems are non-convex and therefore hard to solve in general. Then, building on the concave-convex procedure (CCCP) [13], we propose algorithms to solve approximately the original problems.
Our algorithms can be proved to converge to stationary points of the precoder design problems.
To the best of our knowledge, this is the first work to provide precoder design for the general RS scheme, as well as the first work to combine RS and joint decoding. By contrast, previous works only consider RS in reduced forms or with successive decoding.
Practical implementation: In addition to the general precoder design algorithms, we also propose further adaptations towards practical implementation of the RS scheme.
• To reduce the complexity on the precoder design and the decoding, we propose a new stream elimination algorithm which is then combined with the precoder design algorithm.
The remaining streams are such that the search space of the decoding order is essentially reduced. With such an adaptation, the general RS scheme can be applied even for a large number of users. Comparison among different algorithms reveals the substantial complexity reduction from the proposed stream selection algorithm.
• We propose a slight modification of the precoder design algorithms to account for the CSIT imperfection. Specifically, instead of entirely reformulating the problem, we introduce a regularization term in the precoder design formulation according to the CSIT accuracy.
Numerical results show that the proposed regularization is quite effective and can significantly improve the sum rate with imperfect CSIT.
In order to validate the proposed algorithms, we have run numerical simulations and compared the performance to existing schemes. We show that the general RS scheme substantially outperforms state-of-the-art linear precoding schemes, especially with a moderately large number of users (e.g., 8), both in terms of achievable rates and of total transmit power.
Related works: In [14], [15], [16], [17], the authors explore a structured and simplified version of RS, i.e., the 1-layer RS, where each message is split into one private part and only one common part. The common parts of all the messages are encoded into one common stream that should be decoded by all the users, whereas the private parts are unicast to the corresponding receivers.
While the optimization problem in the 1-layer RS is simpler than the general case, it does not take full advantage of the flexibility of the general RS. Thus, the potential of the general RS remains unknown. Although the authors in [16] do mention the general RS scheme, they only tackle the sum rate maximization problem for K = 2, 3 users. In fact, their method does not seem to scale with K, while our formulation applies to an arbitrary number of users and an arbitrary subset of streams. In [18], the authors propose a hierarchical RS, which transmits one outer common message and multiple inner common messages. The outer common message can be decoded by all users while each inner common message is decodable by a subset of users.
However, the authors mainly focus on the asymptotic sum rate analyses in massive MIMO systems and the optimization of precoders of the common messages. Power allocation in [18] is also simplified to equal power allocation among private messages, whereas in our work we optimize the power allocation among all messages. Further, it is worth mentioning that these works consider only successive decoding while we consider both joint decoding and successive decoding in our formulation to explore the full potential of the general RS. In [8], the authors consider the general K-user RS scheme with joint decoding, and show that MMSE precoder achieves constant-gap capacity in the two-user MIMO BC case. They also propose a stream elimination algorithm based on constant-gap argument. However, the constant-gap argument is essentially for the high SNR regime, and the important questions of how to design precoders at finite SNR regime and how to implement successive decoding have not been addressed.
In [18] and [19], RS is considered to maximize the sum rate with imperfect CSIT. Specifically, [18] considers a hierarchical RS and [19] studies the 1-layer RS. To the best of our knowledge, no current reference addresses the precoder design of the general RS with imperfect CSIT. Furthermore, [18] directly assumes regularized-ZF based on the estimated CSI as the precoder of the private messages, and considers only the optimization of the precoders of the common messages.
In [19], the authors mainly focus on the DoF derivation where the power goes to infinity. In terms of rate optimization, [19] proposes an algorithm only based on several samples of the channel estimation error. Such state-of-the-art results can be far from the performance of the general RS with imperfect CSIT. By contrast, in this paper, we propose a simple and effective regularization to account for CSIT imperfection, without changing the main precoder design.
The remainder of the paper is organized as follows. In Section II, we present the channel model, and describe the general RS strategy and its corresponding achievable rate region under joint decoding. Optimization under the general RS for joint decoding is considered in Section III.
Optimization under successive decoding, stream selection and adaptation of our algorithms to imperfect CSIT scenario are presented in Section IV. Simulation results are illustrated and discussed in Section V. Finally, the paper is concluded in Section VI.
Notation: For random quantities, we use upper case non-italic letters, e.g., $\mathrm{X}$, for scalars, upper case non-italic bold letters, e.g., $\mathbf{V}$, for vectors, and upper case letter with bold and sans [...]

[...] The received signal at user $k$ at time $t$ is
$$Y_k[t] = \mathbf{h}_k^H \mathbf{x}[t] + Z_k[t],$$
where $\mathbf{h}_k \in \mathbb{C}^{M \times 1}$ is the channel vector from the BS to user $k$; $Z_k[t] \sim \mathcal{CN}(0, 1)$ is the additive white Gaussian noise (AWGN) with normalized variance and is independent over time. Note that we assume that the channel remains constant during the whole transmission.
The goal of the BS is to transmit $K$ independent messages, $M_1, \ldots, M_K$, to the $K$ users, respectively, in $T$ channel uses. Let $M_k \in \mathcal{M}_k$; then the transmission rate for user $k$ is defined as $R_k := \frac{\log |\mathcal{M}_k|}{T}$ [...] when $T \to \infty$. Unless specified otherwise, we assume that the channel realizations $\{\mathbf{h}_k\}_k$ are perfectly known at the BS and at the receivers.

A. Linear Precoding for Unicast
In the commonly used unicast scheme, $K$ messages are encoded separately, and a linear superposition of the $K$ encoded signals is sent. Specifically, the BS transmits
$$\mathbf{X} = \sum_{k \in [K]} \mathbf{X}_k,$$
where $\mathbf{X}_k \in \mathbb{C}^{M \times 1}$ is the encoded signal for message $k$. Here, we omit the time index and adopt the commonly used single-letter expression where the signals $\{\mathbf{x}[t]\}$ are replaced by the random vector $\mathbf{X}$ for further analysis. We define the covariance matrix $\mathbf{Q}_k := E\{\mathbf{X}_k \mathbf{X}_k^H\} \succeq 0$ to specify the precoder for signal $k$. Accordingly, the transmit power becomes $\sum_{k \in [K]} \mathrm{tr}(\mathbf{Q}_k)$. We call such a scheme linear precoding for unicast. The received signal at user $k$ is
$$Y_k = \mathbf{h}_k^H \mathbf{X}_k + \sum_{l \neq k} \mathbf{h}_k^H \mathbf{X}_l + Z_k.$$
Assuming Gaussian signaling, we can derive the achievable rate of user $k$ as
$$R_k = \log\left(1 + \frac{\mathbf{h}_k^H \mathbf{Q}_k \mathbf{h}_k}{1 + \sum_{l \neq k} \mathbf{h}_k^H \mathbf{Q}_l \mathbf{h}_k}\right),$$
where the fraction in the logarithm is referred to as the signal-to-interference-and-noise-ratio (SINR) of user $k$. Essentially, interference is treated as noise in this scheme. One can then optimize over the precoder, via $\{\mathbf{Q}_k\}_k$, for different performance metrics and requirements [20], [21].
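As an illustration of the unicast rate expression, the following minimal sketch evaluates the SINR-based rate of each user for given channels and covariance matrices. The function names (`quad_form`, `unicast_rates`, `rank_one`), the toy channels, and the use of base-2 logarithms are assumptions made for this example only.

```python
import math

def quad_form(h, Q):
    """Return the real value h^H Q h for a complex vector h and Hermitian Q."""
    total = 0j
    for a in range(len(h)):
        for b in range(len(h)):
            total += h[a].conjugate() * Q[a][b] * h[b]
    return total.real

def unicast_rates(H, Qs):
    """Rate of each user in bits PCU when interference is treated as noise."""
    rates = []
    for k, h in enumerate(H):
        signal = quad_form(h, Qs[k])
        interference = sum(quad_form(h, Q) for l, Q in enumerate(Qs) if l != k)
        rates.append(math.log2(1.0 + signal / (1.0 + interference)))
    return rates

def rank_one(v, power):
    """Covariance power * v v^H / ||v||^2 of a signal beamformed along v."""
    norm2 = sum(abs(x) ** 2 for x in v)
    return [[power * a * b.conjugate() / norm2 for b in v] for a in v]

# Toy setting (assumed): two users, two antennas, orthogonal channels,
# matched beams, so that no user sees interference.
H = [[1 + 0j, 0j], [0j, 1 + 0j]]
Qs = [rank_one(H[0], 3.0), rank_one(H[1], 1.0)]
print(unicast_rates(H, Qs))
```

With orthogonal channels the SINR reduces to the received signal power. If the two channels are instead aligned, each signal appears in the other user's interference term, which is the situation where suppressing interference becomes costly.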

B. Linearly Precoded Rate-splitting
For each $k \in [K]$, let us define the following collection of subsets of $[K]$
$$\mathcal{K}^{(k)} := \{\mathcal{K} \subseteq [K] : k \in \mathcal{K}\}.$$
Namely, $\mathcal{K}^{(k)}$ collects all $2^{K-1}$ subsets of $[K]$ that contain $k$. The linearly precoded rate-splitting scheme in the most general form is described as follows.
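First, we split each message set $\mathcal{M}_k$ into sub-message sets, such that
$$\mathcal{M}_k = \prod_{\mathcal{K} \in \mathcal{K}^{(k)}} \mathcal{M}^{(k)}_{\mathcal{K}},$$
where the right-hand side is the Cartesian product of $2^{K-1}$ sets. Thus, any message $M_k \in \mathcal{M}_k$ can be equivalently represented by a sub-message tuple $(M^{(k)}_{\mathcal{K}} : \mathcal{K} \in \mathcal{K}^{(k)})$, whose rates $R^{(k)}_{\mathcal{K}}$ satisfy $R_k = \sum_{\mathcal{K} \in \mathcal{K}^{(k)}} R^{(k)}_{\mathcal{K}}$. Then, for each non-empty subset $\mathcal{K} \subseteq [K]$, we re-assemble the sub-messages into $\mathbf{M}_{\mathcal{K}} := (M^{(k)}_{\mathcal{K}} : k \in \mathcal{K})$ with rate
$$R_{\mathcal{K}} := \sum_{k \in \mathcal{K}} R^{(k)}_{\mathcal{K}}.$$
That is, we re-arrange the $K 2^{K-1}$ sub-messages $\{M^{(k)}_{\mathcal{K}}\}$ into $2^K - 1$ message vectors. Each of the $2^K - 1$ message vectors $\mathbf{M}_{\mathcal{K}}$ is encoded separately and carried by some signal $\mathbf{X}_{\mathcal{K}} \in \mathbb{C}^{M \times 1}$ for which the precoder is specified by the covariance matrix
$$\mathbf{Q}_{\mathcal{K}} := E\{\mathbf{X}_{\mathcal{K}} \mathbf{X}_{\mathcal{K}}^H\} \succeq 0. \quad (11)$$
The transmitted signal is a linear superposition of all these signals
$$\mathbf{X} = \sum_{\mathcal{K} \subseteq [K],\, \mathcal{K} \neq \emptyset} \mathbf{X}_{\mathcal{K}}.$$
At the receivers' side, each user $k$ observes
$$Y_k = \sum_{\mathcal{K} \in \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{X}_{\mathcal{K}} + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{X}_{\mathcal{K}'} + Z_k$$
and decodes all message vectors $\{\mathbf{M}_{\mathcal{K}} : \mathcal{K} \in \mathcal{K}^{(k)}\}$, by treating the interference $\sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{X}_{\mathcal{K}'}$ as noise. Since the desired sub-messages $\{M^{(k)}_{\mathcal{K}} : \mathcal{K} \in \mathcal{K}^{(k)}\}$ are contained in the decoded message vectors, user $k$ can recover its message $M_k$. Hence, the main idea of RS is to make the interfering messages partially decodable in order to reduce the interference level.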
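The bookkeeping of the general RS scheme can be sketched as follows. The snippet enumerates the streams (non-empty subsets of users), the collection of streams decoded by a given user, and the sub-messages carried by each stream. The function names and the choice $K = 3$ are assumptions for illustration.

```python
from itertools import combinations

def all_streams(K):
    """All non-empty subsets of users {1, ..., K}; one stream per subset."""
    users = range(1, K + 1)
    return [frozenset(c) for size in users for c in combinations(users, size)]

def streams_for_user(k, K):
    """The collection of subsets of [K] that contain user k."""
    return [s for s in all_streams(K) if k in s]

def reassemble(K):
    """Map each stream to the (user, stream) sub-messages it carries."""
    return {s: [(k, s) for k in sorted(s)] for s in all_streams(K)}

K = 3
print(len(all_streams(K)))                           # number of streams
print(len(streams_for_user(1, K)))                   # streams decoded by user 1
print(sum(len(v) for v in reassemble(K).values()))   # total number of sub-messages
```

For $K = 3$ the three printed counts are $2^K - 1 = 7$, $2^{K-1} = 4$, and $K 2^{K-1} = 12$, in line with the counts stated in the text.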
To analyze the achievable rate, we notice that with RS each user is equivalent to a receiver in a multiple access channel (MAC) in which a number of independent messages must be decoded from the received signal. As in a MAC, the achievable rate of RS depends on how the message vectors are decoded. To exploit the full potential of the general RS, we first consider joint decoding. In such case, each receiver k jointly decodes the set of messages {M M M K : K ∈ K (k) }.
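Then the achievable rate region of the message vectors is described by the following constraints [8]
$$\sum_{\mathcal{K} \in \mathcal{S}} R_{\mathcal{K}} \le \log\left(1 + \frac{\sum_{\mathcal{K} \in \mathcal{S}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}} \mathbf{h}_k}{1 + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k}\right), \quad \forall\, \mathcal{S} \subseteq \mathcal{K}^{(k)},\ \mathcal{S} \neq \emptyset,\ k \in [K]. \quad (14)$$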
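The MAC-like structure of the joint-decoding region can be checked numerically. The sketch below tests every subset constraint at one user, taking as input the received powers $\mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}} \mathbf{h}_k$ of all streams. The function name `jd_feasible`, the numerical values, and the base-2 logarithm are assumptions for illustration.

```python
import math
from itertools import combinations

def jd_feasible(k, gains, rates, tol=1e-12):
    """Check the joint-decoding constraints of user k.

    gains[s]: received power h_k^H Q_s h_k of stream s at user k.
    rates[s]: rate of the re-assembled message carried by stream s.
    """
    decoded = [s for s in gains if k in s]
    # Streams that do not contain k are treated as noise.
    noise_plus_interf = 1.0 + sum(g for s, g in gains.items() if k not in s)
    for size in range(1, len(decoded) + 1):
        for subset in combinations(decoded, size):
            power = sum(gains[s] for s in subset)
            bound = math.log2(1.0 + power / noise_plus_interf)
            if sum(rates[s] for s in subset) > bound + tol:
                return False
    return True

# Two users: private streams {1}, {2} and the common stream {1, 2}.
s1, s2, s12 = frozenset({1}), frozenset({2}), frozenset({1, 2})
gains_user1 = {s1: 3.0, s2: 1.0, s12: 4.0}
print(jd_feasible(1, gains_user1, {s1: 1.0, s12: 1.0}))
print(jd_feasible(1, gains_user1, {s1: 1.3, s12: 1.0}))
```

In the second call each rate satisfies its individual bound, but the pair violates the sum constraint, so the rate pair lies outside the region. This is why all non-empty subsets must be checked.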

C. Performance Metrics
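In this work, we are interested in performance metrics that are related to the achievable rate tuple $(R_1, \ldots, R_K)$. For simplicity, we consider the following utility functions of the rate tuple
$$f_{\mathrm{SR}} := \sum_{k \in [K]} R_k, \qquad f_{\mathrm{WSR}} := \sum_{k \in [K]} w_k R_k, \qquad f_{\mathrm{WUR}} := \min_{k \in [K]} R_k, \quad (15)$$
where the coefficient $w_k \ge 0$ denotes the weight for user $k$. Here, $f_{\mathrm{SR}}$, $f_{\mathrm{WSR}}$, and $f_{\mathrm{WUR}}$ represent the sum rate, the weighted sum rate, and the worst-user rate, respectively.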
We mainly focus on the precoder design problem such that one of the above rate functions is maximized. Specifically, we shall maximize these functions respectively over the $2^K - 1$ covariance matrices subject to the transmit power constraint in Section III-A. Another approach to precoder design is to minimize the transmit power for a given target rate tuple, as shown in Section III-B. We shall show that these problems can be solved using the same optimization method by applying the same transformation on the constraint functions.
It is worth noting that although the sum rate optimization problem is a special case of the weighted sum rate problem, the optimization technique and complexity can be very different.
That is why we separate the sum rate problem from the general weighted sum rate problem.

III. OPTIMAL PRECODER DESIGN
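In this section, we investigate optimization problems for precoder design for the general RS scheme under joint decoding. We first consider the rate maximization problems and then the power minimization problem. For convenience, we introduce the following notations on the rates of the sub-messages $\mathbf{R} := (R^{(k)}_{\mathcal{K}})_{k \in [K],\, \mathcal{K} \in \mathcal{K}^{(k)}}$, the rates of the re-assembled messages $(R_{\mathcal{K}})_{\mathcal{K} \subseteq [K]}$, and the covariance matrices $\mathbf{Q} := (\mathbf{Q}_{\mathcal{K}})_{\mathcal{K} \subseteq [K]}$.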

A. Rate Maximization
The rate maximization problems have been widely considered in wireless communications.
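For example, such problems have been studied for BC in a variety of scenarios, e.g., downlink unicast [22], [23], downlink multicast [24], and multi-group multicast [25]. Consider the following transmit power constraint
$$\sum_{\mathcal{K} \subseteq [K]} \mathrm{tr}(\mathbf{Q}_{\mathcal{K}}) \le P, \quad (16)$$
where $P$ is the power budget. We would like to maximize the utility functions of the rate tuple, $f_m(R_1, \ldots, R_K)$, $m$ = SR, WSR, WUR, subject to the rate constraints in (14) and the constraints on the covariance matrices in (11) and (16). Specifically, we formulate the following general rate maximization problem
$$\mathcal{P}^{\mathrm{JD}}_m: \quad \max\ f_m(R_1, \ldots, R_K) \quad \text{s.t. } (11), (14), (16),$$
where $f_m(\cdot)$ is given by (15) [...]. Note that $f_m(R_1, \ldots, R_K)$ is concave, the constraints in (11) are convex, and the constraint in (16) is linear. In addition, (14) can be rewritten as
$$\sum_{\mathcal{K} \in \mathcal{S}} R_{\mathcal{K}} - \log\left(1 + \sum_{\mathcal{K} \in \mathcal{S}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}} \mathbf{h}_k + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k\right) + \log\left(1 + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k\right) \le 0. \quad (17)$$
Note that each constraint function in (17) can be regarded as a difference of two convex functions.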
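Therefore, $\mathcal{P}^{\mathrm{JD}}_m$ is a difference of convex functions (DC) program. A stationary point of $\mathcal{P}^{\mathrm{JD}}_m$ can be obtained by CCCP [13]. The main idea is to solve a sequence of successively refined approximate convex problems, each of which is obtained by linearizing the concave part in (17) and preserving the remaining convexity of $\mathcal{P}^{\mathrm{JD}}_m$. Specifically, at the $i$-th iteration, the derivative of the concave term $\log\left(1 + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k\right)$ with respect to $\mathbf{Q}_{\mathcal{K}'}$, $\mathcal{K}' \notin \mathcal{K}^{(k)}$, evaluated at $\mathbf{Q}(i-1)$, is
$$\frac{(\log e)\, \mathbf{h}_k \mathbf{h}_k^H}{1 + \sum_{\mathcal{K}'' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}''}(i-1) \mathbf{h}_k},$$
where $\mathbf{Q}(i-1)$ denotes the optimal solution of the approximate convex problem at the $(i-1)$-th iteration. Therefore, we can linearize the concave term in (17) at $\mathbf{Q}(i-1)$ as follows
$$\log\left(1 + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'}(i-1) \mathbf{h}_k\right) + \frac{(\log e) \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \left(\mathbf{Q}_{\mathcal{K}'} - \mathbf{Q}_{\mathcal{K}'}(i-1)\right) \mathbf{h}_k}{1 + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'}(i-1) \mathbf{h}_k}.$$
In the following, we shall provide the details of the CCCP for obtaining a stationary point of $\mathcal{P}^{\mathrm{JD}}_m$ for $m$ = SR, WSR, WUR, respectively.

First, consider $m$ = SR. Since $\sum_{k \in [K]} R_k = \sum_{\mathcal{K} \subseteq [K]} R_{\mathcal{K}}$, $\mathcal{P}^{\mathrm{JD}}_{\mathrm{SR}}$ can be simplified to
$$\max\ \sum_{\mathcal{K} \subseteq [K]} R_{\mathcal{K}} \quad \text{s.t. } (11), (14), (16).$$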
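Note that the rates of the sub-messages $\mathbf{R}$ do not appear in the simplified form of $\mathcal{P}^{\mathrm{JD}}_{\mathrm{SR}}$, and the number of variables is reduced from $M^2(2^K - 1) + K 2^{K-1}$ to $(M^2 + 1)(2^K - 1)$. The approximate convex problem of $\mathcal{P}^{\mathrm{JD}}_{\mathrm{SR}}$ at the $i$-th iteration is given by $\tilde{\mathcal{P}}^{\mathrm{JD}}_{\mathrm{SR}}(i)$: [...] s.t. (11), (16), [...]

Next, consider $m$ = WSR. $\mathcal{P}^{\mathrm{JD}}_{\mathrm{WSR}}$ can be expressed as [...] (16), [...] The approximate convex problem of $\mathcal{P}^{\mathrm{JD}}_{\mathrm{WSR}}$ at the $i$-th iteration is given by $\tilde{\mathcal{P}}^{\mathrm{JD}}_{\mathrm{WSR}}(i)$: [...]

Then, consider $m$ = WUR. We introduce an extra slack variable $y$ which serves as a lower bound of $\min_{k \in [K]} R_k$. Thus, $\mathcal{P}^{\mathrm{JD}}_{\mathrm{WUR}}$ can be equivalently transformed to the following problem [...] (16), (21), (23).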
Finally, the details of the CCCP for obtaining a stationary point of $\mathcal{P}^{\mathrm{JD}}_m$, for $m$ = SR, WSR, WUR, respectively, are summarized in Algorithm 1, in which $J_m(i)$ is given by [...]

Algorithm 1 Obtaining a Stationary Point of $\mathcal{P}^{\mathrm{JD}}_m$
1: Choose any feasible covariance matrices $\mathbf{Q}(0)$ of $\mathcal{P}^{\mathrm{JD}}_m$, and set $i = 1$.
[...]
3: Obtain an optimal solution of $\tilde{\mathcal{P}}^{\mathrm{JD}}_m(i)$, [...] $y(i))\}$ for $m$ = SR, WSR, WUR, respectively, with an interior point method.
4: [...]

[...] $y(i))\}$ obtained by Algorithm 1 converge to a stationary point of $\mathcal{P}^{\mathrm{JD}}_m$ for $m$ = SR, WSR, WUR, respectively.
Proof. We have shown that $\mathcal{P}^{\mathrm{JD}}_m$ is a DC program and we propose to solve it with CCCP. It has been shown in [13] that solving a DC program through CCCP always returns a stationary point.
Note that Algorithm 1, based on CCCP, usually converges faster than conventional gradient methods, as it exploits the partial concavity of $\mathcal{P}^{\mathrm{JD}}_m$. By [26], we know that the number of iterations of Algorithm 1 does not scale with the problem size. Thus, the computational complexity order for Algorithm 1 is the same as that for solving $\tilde{\mathcal{P}}^{\mathrm{JD}}_m(i)$ in Step 3. When an interior point method is applied, the computational complexity for solving $\tilde{\mathcal{P}}^{\mathrm{JD}}_m(i)$ [...]

The initial value for $\mathbf{Q}(0)$ can be chosen randomly (provided that feasibility is ensured) or through a heuristic method. One of the possible initial values for the covariance matrices can be the ZF precoder or the MMSE precoder used in [8]. In practice, we can run Algorithm 1 multiple times with different feasible initial points $\mathbf{Q}(0)$ to obtain multiple stationary points, and choose the stationary point with the best objective value as a suboptimal solution.
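The linearization step at the heart of CCCP can be illustrated on a scalar version of the concave term. In the minimal sketch below, `x` stands for the total interference power and `x0` for its value at the previous iterate. Both function names and the base-2 logarithm are assumptions for this example. It is not the full algorithm, since the convex subproblem itself would be handed to an interior point solver.

```python
import math

def log_term(x):
    """Concave term log2(1 + x), with x the total interference power."""
    return math.log2(1.0 + x)

def linearized(x, x0):
    """Tangent of log2(1 + x) taken at the previous iterate x0."""
    slope = 1.0 / ((1.0 + x0) * math.log(2.0))
    return log_term(x0) + slope * (x - x0)

x0 = 2.0
for x in [0.0, 1.0, 2.0, 4.0, 8.0]:
    gap = linearized(x, x0) - log_term(x)
    print(f"x = {x:4.1f}   tangent - log2(1 + x) = {gap:.4f}")
```

Since the tangent of a concave function never lies below it, replacing the concave term by its linearization tightens the constraint. Any point that is feasible for the approximate problem is therefore feasible for the original one, and the two agree at the previous iterate.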

B. Power Minimization
Another relevant problem in wireless communications is power efficiency optimization, i.e., to minimize the transmit power for a given target rate tuple $(r_1, \ldots, r_K)$. Such a power minimization problem has been studied extensively for BC in a variety of communication scenarios, e.g., downlink unicast [28], [29], downlink multicast [30] and multi-group multicast [31]. Furthermore, precoder design that minimizes the total power consumption while ensuring target user rates is also studied in emerging scenarios such as large-scale multi-cell multi-user MIMO systems [32], or in the presence of eavesdroppers [33]. Consider the following rate constraints
$$R_k \ge r_k, \quad k \in [K], \quad (25)$$
where $r_k$, $k \in [K]$, are the target rates. We would like to minimize the transmit power subject to the rate constraints in (14) and (25) and the constraints on the covariance matrices in (11).
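Specifically, we formulate the power minimization problem for the general RS scheme as follows
$$\mathcal{P}^{\mathrm{JD}}_{\mathrm{PM}}: \quad \min\ \sum_{\mathcal{K} \subseteq [K]} \mathrm{tr}(\mathbf{Q}_{\mathcal{K}}) \quad \text{s.t. } (11), (14), (25).$$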
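As in the rate maximization problems presented previously, the above power minimization problem can also be regarded as a DC program. Hence, a stationary point of $\mathcal{P}^{\mathrm{JD}}_{\mathrm{PM}}$ can be obtained using CCCP. Specifically, at the $i$-th iteration, the approximate convex problem of $\mathcal{P}^{\mathrm{JD}}_{\mathrm{PM}}$ is given by
$$\tilde{\mathcal{P}}^{\mathrm{JD}}_{\mathrm{PM}}(i): \quad \min\ \sum_{\mathcal{K} \subseteq [K]} \mathrm{tr}(\mathbf{Q}_{\mathcal{K}}) \quad \text{s.t. } (11), (22), (25).$$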
The complete algorithm and its convergence proof are similar to those of P JD m . Thus, we omit the details due to space limitation.

IV. PRACTICAL CONSIDERATIONS FOR FUTURE IMPLEMENTATION
In the previous section, we have formulated the precoder design problems for the general RS scheme with joint decoding, and proposed iterative algorithms for their solutions. Nevertheless, there are a number of challenges for its implementation for possible applications in future wireless networks.
First, the number of streams $2^K - 1$ can be large when $K$ is large, which significantly increases the precoding and decoding complexity as compared to the case with only $K$ private streams.
Second, the number of rate constraints for joint decoding can be as large as $K\left(2^{2^{K-1}} - 1\right)$, as one can verify from (14). The computational complexity of the proposed precoder design algorithms can be formidable with a large $K$. Third, perfect CSIT may be hard to obtain in practice. We should account for the CSIT inaccuracy in our design.
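To give a feeling for how quickly these quantities grow, the short sketch below tabulates the number of streams and the number of joint-decoding rate constraints for small $K$. The function names are chosen for this example only.

```python
def num_streams(K):
    """Number of streams, one per non-empty subset of the K users."""
    return 2 ** K - 1

def num_jd_rate_constraints(K):
    """K users, each with one constraint per non-empty subset of its 2^(K-1) streams."""
    return K * (2 ** (2 ** (K - 1)) - 1)

for K in range(2, 6):
    print(K, num_streams(K), num_jd_rate_constraints(K))
```

Already for $K = 5$ the number of rate constraints exceeds $3 \times 10^5$, which motivates the complexity reductions discussed next.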
To address the above challenges, we propose the following solutions:
• Use successive decoding to reduce the decoding complexity at the receivers' side.
• Apply a stream selection algorithm to reduce the computational complexity at the BS and to further reduce the decoding complexity at the receivers' side.
• Adjust the current precoder design algorithms so that the CSIT error is taken into account.

A. Successive Decoding
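In this subsection, we consider the precoder design problems with successive decoding. For $k \in [K]$, let us assume that the $2^{K-1}$ elements in $\mathcal{K}^{(k)}$ are ordered in some way such that the $n$-th element is denoted by $\mathcal{K}^{(k)}_n$. We define $\boldsymbol{\pi}^{(k)} := (\pi_{k,1}, \ldots, \pi_{k,2^{K-1}})$ as the permutation vector of length $2^{K-1}$, which is used to specify the decoding order at user $k$. Specifically, at round $n \in [2^{K-1}]$, user $k$ decodes stream $\mathcal{K}^{(k)}_{\pi_{k,n}}$ by treating streams $\{\mathcal{K}^{(k)}_{\pi_{k,n'}} : n' > n\}$ as noise. For a given decoding order $\boldsymbol{\pi}^{(k)}$, the achievable rate region is thus defined by the following constraints
$$R_{\mathcal{K}^{(k)}_{\pi_{k,n}}} \le \log\left(1 + \frac{\mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}^{(k)}_{\pi_{k,n}}} \mathbf{h}_k}{1 + \sum_{n' > n} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}^{(k)}_{\pi_{k,n'}}} \mathbf{h}_k + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k}\right), \quad n \in [2^{K-1}],\ k \in [K]. \quad (26)$$
Let us introduce the notation on the decoding orders $\boldsymbol{\pi} := (\boldsymbol{\pi}^{(k)})_{k \in [K]}$. The rate maximization problem can be formulated as follows
$$\mathcal{P}^{\mathrm{SD}}_m: \quad \max_{\mathbf{Q}, \mathbf{R}, \boldsymbol{\pi}}\ f_m(R_1, \ldots, R_K) \quad \text{s.t. } (11), (16), (26).$$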
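The successive-decoding bounds at one user can be evaluated with a few lines of code. The sketch below takes the received stream powers at user $k$ and a decoding order, and returns the rate bound of each decoded stream. The function name `sd_rate_bounds`, the numerical values, and the base-2 logarithm are assumptions for illustration.

```python
import math

def sd_rate_bounds(k, gains, order):
    """Upper bounds on the stream rates at user k under successive decoding.

    gains[s]: received power h_k^H Q_s h_k of stream s at user k.
    order: the streams containing k, listed in decoding order.
    """
    # Noise plus the streams that user k never decodes.
    residual = 1.0 + sum(g for s, g in gains.items() if k not in s)
    # Before the first round, all streams to be decoded are still present.
    residual += sum(gains[s] for s in order)
    bounds = {}
    for s in order:
        residual -= gains[s]  # only not-yet-decoded streams remain as noise
        bounds[s] = math.log2(1.0 + gains[s] / residual)
    return bounds

s1, s2, s12 = frozenset({1}), frozenset({2}), frozenset({1, 2})
gains_user1 = {s1: 3.0, s2: 1.0, s12: 4.0}
common_first = sd_rate_bounds(1, gains_user1, [s12, s1])
private_first = sd_rate_bounds(1, gains_user1, [s1, s12])
print(common_first[s12], common_first[s1])
print(private_first[s1], private_first[s12])
```

Both orders give the same sum of bounds, namely the joint-decoding sum bound at this user, but they split it differently between the two streams. This is why the decoding order becomes an optimization variable once several users must decode the same common stream.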
$\mathcal{P}^{\mathrm{SD}}_m$ is a challenging mixed discrete-continuous optimization problem with $(M^2 + 1)(2^K - 1)$ continuous variables $\mathbf{Q}$ and $\mathbf{R}$, $(2^{K-1}!)^K$ possible values for the discrete variable $\boldsymbol{\pi}$, and $K 2^{K-1} + 2^K$ constraints. One straightforward way to solve $\mathcal{P}^{\mathrm{SD}}_m$ is to first solve the optimization with respect to $\mathbf{Q}$ and $\mathbf{R}$ for a given $\boldsymbol{\pi}$, denoted by $\mathcal{P}^{\mathrm{SD}}_m(\boldsymbol{\pi})$, and then solve the optimization with respect to $\boldsymbol{\pi}$ using exhaustive search over all $(2^{K-1}!)^K$ possible values for $\boldsymbol{\pi}$.
First, we solve $\mathcal{P}^{\mathrm{SD}}_m(\boldsymbol{\pi})$ for a given $\boldsymbol{\pi}$ using CCCP. To avoid redundancy, we present in detail only the sum rate maximization under successive decoding; the optimization under the other two rate criteria follows analogously. The transmit power minimization can also be handled similarly. For a given decoding order $\boldsymbol{\pi}$, the sum rate optimization under successive decoding is formulated as
$$\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}): \quad \max\ \sum_{\mathcal{K} \subseteq [K]} R_{\mathcal{K}} \quad \text{s.t. } (11), (16), (26).$$
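$\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$ is a nonconvex problem with $(M^2 + 1)(2^K - 1)$ variables and $K 2^{K-1} + 2^K$ constraints. Note that the objective function and the constraint in (16) are linear, and the constraints in (11) are convex. In addition, with the shorthand $I_{k,n}(\mathbf{Q}) := \sum_{n' > n} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}^{(k)}_{\pi_{k,n'}}} \mathbf{h}_k + \sum_{\mathcal{K}' \notin \mathcal{K}^{(k)}} \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}'} \mathbf{h}_k$, (26) can be rewritten as
$$R_{\mathcal{K}^{(k)}_{\pi_{k,n}}} - \log\left(1 + \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}^{(k)}_{\pi_{k,n}}} \mathbf{h}_k + I_{k,n}(\mathbf{Q})\right) + \log\left(1 + I_{k,n}(\mathbf{Q})\right) \le 0, \quad (27)$$
and be regarded as a difference of two convex functions. Similarly to $\mathcal{P}^{\mathrm{JD}}_{\mathrm{SR}}$ in Section III-A, $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$ is a DC program and a stationary point of $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$ can be obtained using CCCP. The approximate convex problem of $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$ at the $i$-th iteration is given by
$$\tilde{\mathcal{P}}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}, i): \quad \max\ \sum_{\mathcal{K} \subseteq [K]} R_{\mathcal{K}} \quad \text{s.t. } R_{\mathcal{K}^{(k)}_{\pi_{k,n}}} - \log\left(1 + \mathbf{h}_k^H \mathbf{Q}_{\mathcal{K}^{(k)}_{\pi_{k,n}}} \mathbf{h}_k + I_{k,n}(\mathbf{Q})\right) + L^{\mathrm{SD}}_{k,n}(\mathbf{Q}; \mathbf{Q}(i-1)) \le 0,\ n \in [2^{K-1}],\ k \in [K],\ (11), (16),$$
where $L^{\mathrm{SD}}_{k,n}(\mathbf{Q}; \mathbf{Q}(i-1))$ corresponds to the linearization of the concave term in (27) at $\mathbf{Q}(i-1)$, and is given by
$$L^{\mathrm{SD}}_{k,n}(\mathbf{Q}; \mathbf{Q}(i-1)) := \log\left(1 + I_{k,n}(\mathbf{Q}(i-1))\right) + \frac{(\log e)\left(I_{k,n}(\mathbf{Q}) - I_{k,n}(\mathbf{Q}(i-1))\right)}{1 + I_{k,n}(\mathbf{Q}(i-1))},$$
with $\{\mathbf{Q}(i-1)\}$ being the optimal solution of $\tilde{\mathcal{P}}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}, i-1)$ at the $(i-1)$-th iteration. Then, we can perform exhaustive search over all $(2^{K-1}!)^K$ possible decoding orders. The details are summarized in Algorithm 2.

Algorithm 2 Solving $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}$ with Exhaustive Search
1: Set $R^{\dagger}_{\mathrm{SR}} = 0$.
2: for all possible values for $\boldsymbol{\pi}$ do
3: Choose any feasible covariance matrices $\mathbf{Q}(0)$ of $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$, and set $i = 1$.
[...]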
Similarly, the computational complexity for solving $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi})$ with a given $\boldsymbol{\pi}$ using CCCP is $O\left(M^6 K^{0.5} 2^{3.5K}\right)$ [27]. As the number of all possible choices for $\boldsymbol{\pi}$ is $(2^{K-1}!)^K$, the computational complexity for Algorithm 2 is $O\left(M^6 K^{0.5} 2^{3.5K} (2^{K-1}!)^K\right)$. Although performing Algorithm 2 brings higher computational complexity than performing Algorithm 1 at the BS, successive decoding has much lower implementation cost at the receivers' side. Since the BS has much higher computing capability than the receivers, trading extra complexity at the BS for reduced decoding complexity at the receivers seems to be the right choice.
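The size of the exhaustive search can be made concrete with a short computation of the number of joint decoding orders. The function name is chosen for this example only.

```python
import math

def num_decoding_orders(K):
    """Each of the K users orders its 2^(K-1) streams independently."""
    return math.factorial(2 ** (K - 1)) ** K

for K in range(2, 5):
    print(K, num_decoding_orders(K))
```

The count grows from 4 orders for $K = 2$ to 13824 for $K = 3$ and to more than $10^{18}$ for $K = 4$, so exhaustive search is only viable for very small systems.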

B. Stream Selection
The prohibitively high computational complexity of Algorithm 2 at large $K$ comes from the excessive number of possible streams and the exhaustive search over the decoding order. In fact, we found by numerical experiments that, most of the time, not all the streams are needed to achieve the best rate performance. In Fig. 1, we plot a histogram of the number of active streams for the precoder design obtained by Algorithm 1. We see that it is rare that we need to activate all the streams. Indeed, as illustrated in Fig. 2 [...] Further, in [8] the authors have demonstrated that even when the optimal solution activates all the streams, removing some streams only incurs a marginal rate loss. The complexity reduction brought by such an operation is significant and thus appealing for practical implementation.
In this subsection, our goal is to find an efficient way to identify a "small" number of "good" streams. We propose a stream selection algorithm, which consists of two steps. The first step is to reduce the number of streams from 2 K − 1 to a relatively small number. In the second step, from the remaining streams returned by the first step, we further select maximum non-overlapping collections of streams. Maximum non-overlapping collection will be defined in detail later. This stream selection algorithm considerably facilitates the implementation of successive decoding.
In the first step, we apply the stream elimination algorithm (SEA) in [8] to reduce the number of streams to at most $N_{\mathrm{SEA}}$ in total. SEA has been proposed to eliminate some of the streams without losing more than a constant number of bits PCU. Alternatively, SEA provides a way to gradually eliminate the streams so as to identify a given number of remaining streams. This is practically relevant since the number of remaining streams represents the level of precoding/decoding complexity. Let the collection $\mathcal{S}$ be the set of the remaining streams returned by SEA. [...] The above claim implies that if SEA has some performance guarantee in terms of rate gap, then a similar rate guarantee can be obtained in terms of the rate functions of our interest. Since the rate gap is an upper bound, it may appear large for practical uses. Nevertheless, our numerical results show that such a guarantee provides meaningful improvement even in practical scenarios.
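In the second step, we further select maximum non-overlapping collections of $\mathcal{S}$. Without loss of generality, $\mathcal{S}$ can be partitioned according to the cardinality of the sets inside it, namely,
$$\mathcal{S} = \bigcup_{k=1}^{K} \mathcal{S}_k,$$
where $\mathcal{S}_k$ only contains sets of cardinality $k$. We also refer to $\mathcal{S}_k$ as layer $k$ streams in the following. Let $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ be a sub-collection, and let us also partition $\tilde{\mathcal{S}}$ according to the cardinality of its sets, i.e., $\tilde{\mathcal{S}} = \bigcup_{k=1}^{K} \tilde{\mathcal{S}}_k$. We are interested in $\tilde{\mathcal{S}}$ such that $\tilde{\mathcal{S}}_k$ is a maximum non-overlapping collection of the sets of $\mathcal{S}_k$, for all $k \in [K]$.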

Definition 1. For all $k \in [K]$, $\tilde{\mathcal{S}}_k$ is a maximum non-overlapping collection of the sets of $\mathcal{S}_k$, if the following constraints are satisfied
• $\mathcal{K}_1 \cap \mathcal{K}_2 = \emptyset$, for any distinct sets $\mathcal{K}_1$ and $\mathcal{K}_2$ in $\tilde{\mathcal{S}}_k$;
• [...]
It is possible that there is more than one maximum non-overlapping collection for each layer.
Let $D_k$ denote the number of maximum non-overlapping collections for $\mathcal{S}_k$, for all $k \in [K]$. Then there are $\prod_{k \in [K]} D_k$ different $\tilde{\mathcal{S}}$ in total, which are assembled in $\mathcal{U}$. Each $\tilde{\mathcal{S}}$ suggests a possible set of active streams for transmission. Algorithm 3 summarizes the above two-step stream selection procedure. Now let us focus on solving $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}$ for a given $\tilde{\mathcal{S}}$. For each $\tilde{\mathcal{S}} \in \mathcal{U}$, the following claim can be easily shown.

Algorithm 3 Stream Selection Algorithm
1: Set a target number $N_{\mathrm{SEA}}$ and use SEA in [8] to reduce the $2^K - 1$ streams to at most $N_{\mathrm{SEA}}$ streams, denoted by the collection $\mathcal{S}$.
[...]
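Note that each user $k \in [K]$ only needs to decode the streams in $\tilde{\mathcal{K}}^{(k)}$. According to the above claim, there is at most one stream per layer, which implies that each user can decode all the streams successively with the following rule. [...] Let us introduce the notation $\tilde{\boldsymbol{\pi}}^{(k)}_{\tilde{\mathcal{S}}}$ to specify the above unique decoding order at user $k$ for a given $\tilde{\mathcal{S}}$. We further define the decoding order $\tilde{\boldsymbol{\pi}}_{\tilde{\mathcal{S}}} := (\tilde{\boldsymbol{\pi}}^{(k)}_{\tilde{\mathcal{S}}})_{k \in [K]}$, which is particular for the given $\tilde{\mathcal{S}}$ since each element $\tilde{\boldsymbol{\pi}}^{(k)}_{\tilde{\mathcal{S}}}$ is unique. Therefore, for each possible $\tilde{\mathcal{S}}$, we can solve $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}(\tilde{\boldsymbol{\pi}}_{\tilde{\mathcal{S}}})$ with $\tilde{\mathcal{K}}^{(k)} := \mathcal{K}^{(k)} \cap \tilde{\mathcal{S}}$ replacing $\mathcal{K}^{(k)}$ using CCCP, and return a stationary point. By now, we can summarize the details for solving $\mathcal{P}^{\mathrm{SD}}_{\mathrm{SR}}$ with the proposed stream selection in Algorithm 4. The following example can help clarify the above stream selection and successive decoding rule.

Algorithm 4 [...]
[...]
5: Obtain an optimal solution of $\tilde{\mathcal{P}}^{\mathrm{SD}}_{\mathrm{SR}}(\tilde{\boldsymbol{\pi}}_{\tilde{\mathcal{S}}}, i)$ with $\tilde{\mathcal{K}}^{(k)} := \mathcal{K}^{(k)} \cap \tilde{\mathcal{S}}$ replacing $\mathcal{K}^{(k)}$, denoted by $\{(\mathbf{Q}(i), \mathbf{R}(i))\}$, with an interior point method.
6: Set $i = i + 1$.
[...]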
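Hence, $\tilde{\mathcal{U}}_1 = \tilde{\mathcal{S}}_1 = \mathcal{S}_1$ and $D_1 = 1$. Next, we put the $D_2 = 3$ possible maximum non-overlapping collections $\tilde{\mathcal{S}}_2$ in $\tilde{\mathcal{U}}_2$. Similarly, there are $D_3 = 2$ possible maximum non-overlapping collections $\tilde{\mathcal{S}}_3$ in $\tilde{\mathcal{U}}_3$. As a result, there are $D_1 D_2 D_3 = 6$ possible maximum non-overlapping collections of $\mathcal{S}$ in $\tilde{\mathcal{U}}$. Note that for any of the six collections $\tilde{\mathcal{S}} \in \tilde{\mathcal{U}}$, one can apply the same procedure, and finally return the best collection $\tilde{\mathcal{S}}^{\diamond}$ with the highest rate $R^{\diamond}_{\mathrm{SR}}$.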
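The per-layer counting in this example can be reproduced with a small brute-force sketch. It assumes that "maximum" means largest cardinality among the pairwise-disjoint sub-collections of a layer, and it uses a hypothetical 4-user collection in which all singletons and pairs, but only two triples, are present. The function name `max_non_overlapping` and the chosen triples are illustrative and not taken from the paper.

```python
from itertools import combinations
from math import prod


def max_non_overlapping(layer):
    """Return every largest sub-collection of `layer` whose sets are pairwise disjoint."""
    sets = [frozenset(s) for s in layer]
    # Try the largest candidate size first and stop at the first size that works.
    for size in range(len(sets), 0, -1):
        found = [
            combo
            for combo in combinations(sets, size)
            if all(a.isdisjoint(b) for a, b in combinations(combo, 2))
        ]
        if found:
            return found
    return []


users = [1, 2, 3, 4]
layer1 = [{u} for u in users]                       # all private streams
layer2 = [set(p) for p in combinations(users, 2)]   # all pairs
layer3 = [{1, 2, 3}, {1, 2, 4}]                     # hypothetical: two triples remain

# D[k-1] plays the role of D_k: the number of maximum non-overlapping collections.
D = [len(max_non_overlapping(layer)) for layer in (layer1, layer2, layer3)]
num_collections = prod(D)
print(D, num_collections)
```

The search is exponential in the layer size, which is acceptable only for small layers; for larger settings a lookup table computed offline is the more practical choice.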
Remark IV.1. If the users decode the streams in order of increasing cardinality instead, each user first decodes its own private message, followed by the common messages. In such a case, to mitigate the interference while decoding the private message, the power allocated to the common messages is likely to be suppressed below the noise level, and the benefit of the general RS is therefore not fully exploited.
The proposed stream selection algorithm essentially reduces the number of active streams and removes the excessive search over the decoding order. However, stream selection itself adds extra complexity. Thus, we investigate the overall complexity of Algorithm 4, i.e., of solving $P^{\mathrm{SD}}_{\mathrm{SR}}$ with the proposed stream selection algorithm. Algorithm 4 consists of three parts: 1) perform SEA, 2) select the different $\tilde{\mathcal{S}}$ and assemble them in $\tilde{\mathcal{U}}$, and 3) solve $P^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}_{\tilde{\mathcal{S}}})$ for each $\tilde{\mathcal{S}}$.

The complexity of applying SEA to reduce the $2^K - 1$ streams to at most $N_{\mathrm{SEA}}$ streams is $O(K M^2 2^K)$ [8]. Then, to find all the possible $\tilde{\mathcal{S}}$, we can first establish a lookup table offline for a sufficiently large $K$. As long as $N_{\mathrm{SEA}} \le 2^K - 1$, we can construct $\tilde{\mathcal{U}}$ by searching in the table, regardless of the channel realization. Therefore, the complexity of finding the different $\tilde{\mathcal{S}}$ can be neglected.

The third step involves solving $P^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}_{\tilde{\mathcal{S}}})$ for all $\tilde{\mathcal{S}}$. However, it is hard to characterize the exact value of $D_k$ for an arbitrary $N_{\mathrm{SEA}}$ in general. The difficulty comes from the unknown overlap among the remaining streams after SEA. Therefore, we present an upper bound on the complexity of this part. For a given $\tilde{\mathcal{S}}$, $P^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}_{\tilde{\mathcal{S}}})$ is a convex problem with at most $(M^2 + 1) N_{\mathrm{SEA}}$ variables and $1 + (K + 1) N_{\mathrm{SEA}}$ constraints. Therefore, the worst-case complexity of solving $P^{\mathrm{SD}}_{\mathrm{SR}}(\boldsymbol{\pi}_{\tilde{\mathcal{S}}})$ for a given $\tilde{\mathcal{S}}$ using CCCP is $O(M^6 N_{\mathrm{SEA}}^{3.5} K^{0.5})$. Next, to characterize $D_k$, we consider all the streams of layer $k$, i.e., $\binom{K}{k}$ in total, if there exists at least one layer-$k$ stream after SEA. Note that such a calculation may include several streams already eliminated by SEA, and hence leads to an upper bound.
Let $k^*(N_{\mathrm{SEA}})$ denote the smallest $k$ such that $\sum_{k'=1}^{k} \binom{K}{k'} \ge N_{\mathrm{SEA}}$. Then the number of possible $\tilde{\mathcal{S}}$ can be upper bounded in terms of $k^*(N_{\mathrm{SEA}})$, which in turn yields an upper bound on the worst-case complexity of solving $P^{\mathrm{SD}}_{\mathrm{SR}}$ with the proposed stream selection. The detailed calculation is omitted for brevity.
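As a small illustration of the quantity $k^*(N_{\mathrm{SEA}})$ defined above, the following sketch accumulates the layer sizes $\binom{K}{k'}$ until the target is reached. The function name `k_star` is an assumption made for this example only.

```python
from math import comb


def k_star(K, n_sea):
    """Smallest k such that sum_{k'=1}^{k} C(K, k') >= n_sea."""
    total = 0
    for k in range(1, K + 1):
        total += comb(K, k)  # number of layer-k streams before any elimination
        if total >= n_sea:
            return k
    raise ValueError("n_sea exceeds the total number of streams 2^K - 1")


# Settings that appear in the text: (K, N_SEA) = (4, 6), (4, 15), (8, 38).
print(k_star(4, 6), k_star(4, 15), k_star(8, 38))
```

For instance, with $K = 4$ and $N_{\mathrm{SEA}} = 6$, the first layer has 4 streams and the first two layers together have 10, so only layers up to $k = 2$ need to be counted in the bound.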
Let us consider $M = K = 4$. Then the complexity of solving $P^{\mathrm{SD}}_{\mathrm{SR}}$ without stream selection is about $10^{26}$, while the upper bound on the worst-case complexity of solving $P^{\mathrm{SD}}_{\mathrm{SR}}$ with our stream selection is about $10^{8}$ if $N_{\mathrm{SEA}} = 6$. The above comparison suggests that the complexity of precoder design after our stream selection algorithm is considerably reduced.
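To give a feel for the per-collection term in this comparison, the sketch below evaluates the problem size and the order expression $M^6 N_{\mathrm{SEA}}^{3.5} K^{0.5}$ for $M = K = 4$ and $N_{\mathrm{SEA}} = 6$. It ignores the constants hidden by the $O(\cdot)$ notation and covers a single collection $\tilde{\mathcal{S}}$ only; the overall figure quoted above additionally accounts for the number of candidate collections, which is not reproduced here. The function name `per_collection_cost` is hypothetical.

```python
def per_collection_cost(M, K, n_sea):
    """Problem size and order-of-magnitude cost for one fixed stream collection."""
    n_vars = (M**2 + 1) * n_sea        # at most (M^2 + 1) N_SEA variables
    n_cons = 1 + (K + 1) * n_sea       # at most 1 + (K + 1) N_SEA constraints
    cost = M**6 * n_sea**3.5 * K**0.5  # O(.) expression with constants dropped
    return n_vars, n_cons, cost


n_vars, n_cons, cost = per_collection_cost(M=4, K=4, n_sea=6)
print(n_vars, n_cons, f"{cost:.2e}")
```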

C. Imperfect CSIT
In practice, the CSIT is obtained by direct estimation or limited feedback from the receivers. Therefore, the information may be inaccurate or outdated. A common model for imperfect CSIT is to assume that the channel of user $k$ is the sum of an estimate $\hat{\mathbf{h}}_k$ available at the transmitter and an unknown estimation error. A proper way to characterize the performance with imperfect CSIT is the outage formulation, e.g., to find the achievable rate tuples for a given outage probability. However, it is hard to derive closed-form expressions that can be exploited for precoder optimization. Therefore, instead of entirely reformulating the problem with imperfect CSIT, we seek an adaptation of the one derived for perfect CSIT. To that end, we adopt an ergodic formulation which replaces (14) with the constraint (33), where $\mathbf{H}_k := \hat{\mathbf{h}}_k + \tilde{\mathbf{H}}_k$ is the random channel vector and the expectation is over the CSI error.
Note that in the above expression, we implicitly assume that the random channel vector is known at the receivers, whereas only the expectation on the right-hand side of (33) is known at the transmitter. Such a formulation is also widely used in the literature on robust optimization [27], [34]. There are two issues with the above formulation, though. First, we need to know the exact distribution of the CSI error to compute the expectation, which may not be possible in practice. Second, even with the exact distribution, computing the expectation can be costly. In order to capture the CSI error in a simple way, we work with a lower bound on the expectation on the right-hand side of (33).
Proof. The expectation can be rewritten as the difference of two expectations, as in (35), where the second term can be upper bounded with Jensen's inequality. Next, we look at the first term on the right-hand side of (35). We consider a simple form of it, $f(\sigma)$, in which the random vector $\sigma \mathbf{G}$ has the same distribution as $\tilde{\mathbf{H}}_k$ and $\mathbf{A}$ is some positive semi-definite matrix. To prove the claim, it is enough to show that $f(\sigma)$ is non-decreasing in $\sigma$. To that end, let us take the first derivative of $f(\sigma)$.
$\cdots \ge \log e \cdot \mathbb{E}_{\mathbf{G}}\big[\, 2\,\mathrm{Re}\{\mathbf{G}^{H} \mathbf{A} \hat{\mathbf{h}}\} \cdots \big]$, where (38) is obtained by putting the derivative inside the expectation; (39) comes from a bounding step used to obtain the lower bound (40) and to symmetrize the denominator with respect to $\mathbf{G}$; the last equality is from the fact that $\mathbf{G}$ is symmetrically distributed and that the function inside the expectation is odd with respect to $\mathbf{G}$. Indeed, for any odd function $g(\cdot)$, $\mathbb{E}[g(\mathbf{G})] = \mathbb{E}[g(-\mathbf{G})] = -\mathbb{E}[g(\mathbf{G})]$, and hence $\mathbb{E}[g(\mathbf{G})] = 0$.

Note that the symmetry of the CSI error is a mild assumption that can be satisfied in practical situations. Therefore, to take into account the CSI inaccuracy for joint decoding, we propose to replace (14) with the constraint (42). Similarly, for successive decoding, we can replace (26) with the constraint (43). With the above constraints, we can still apply CCCP to solve the precoder design problems under imperfect CSIT, as presented in Section III and in Section IV-A, since the presence of the CSI error terms in (42) and (43) does not change the convexity of the denominators. Essentially, here we optimize lower bounds on the rate functions over a set of covariance matrices. Then we can apply the precoders returned by the algorithms, and achieve rates at least as good as the lower bounds predict.
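The symmetry argument used in the proof (an odd function of a symmetrically distributed error has zero mean) can be checked numerically with a short sketch. The specific function `g` below, with numerator $2\,\mathrm{Re}\{\mathbf{G}^H \mathbf{A} \hat{\mathbf{h}}\}$ and a denominator that is even in $\mathbf{G}$, is a hypothetical stand-in chosen for illustration; it is not the exact expression from the proof. The matrix `A`, the vector `h_hat`, and the Gaussian error samples are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4

# Hypothetical positive semi-definite matrix A and channel estimate h_hat.
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
A = B @ B.conj().T
h_hat = rng.standard_normal(M) + 1j * rng.standard_normal(M)


def g(G):
    """Odd in G: the numerator flips sign under G -> -G, the denominator does not."""
    num = 2 * np.real(np.vdot(G, A @ h_hat))  # 2 Re{G^H A h_hat}
    den = 1 + np.real(np.vdot(G, A @ G)) + np.real(np.vdot(h_hat, A @ h_hat))
    return num / den


samples = rng.standard_normal((5000, M)) + 1j * rng.standard_normal((5000, M))

# Pairing every sample with its negation makes the empirical distribution
# exactly symmetric, so the empirical mean of the odd function cancels.
paired = np.concatenate([samples, -samples])
mean_paired = np.mean([g(G) for G in paired])
print(abs(mean_paired))
```

With unpaired samples the mean is only approximately zero and shrinks as the sample size grows, which is the Monte Carlo counterpart of the expectation vanishing.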

V. NUMERICAL RESULTS
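In this section, we provide some numerical results to illustrate the performance of the proposed algorithms. The channel vector of user $k$, with covariance matrix $\mathbf{R}_k$, is modeled as $\mathbf{h}_k = \mathbf{U}_k \Lambda_k^{1/2} \mathbf{w}_k$, where $\mathbf{w}_k \in \mathbb{C}^{r_k \times 1} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$, $\Lambda_k$ is an $r_k \times r_k$ diagonal matrix whose elements are the nonzero eigenvalues of $\mathbf{R}_k$, and $\mathbf{U}_k \in \mathbb{C}^{M \times r_k}$ is the tall unitary matrix formed by the corresponding eigenvectors.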
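A minimal sketch of how a correlated channel vector can be drawn from a given covariance matrix through its eigendecomposition is shown below. It assumes the Karhunen-Loève form $\mathbf{h} = \mathbf{U} \Lambda^{1/2} \mathbf{w}$ with $\mathbf{w} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$; the rank-2 covariance used here is a hypothetical example rather than a covariance from the paper's simulations, and the function name `sample_channel` and the tolerance `tol` are illustrative choices.

```python
import numpy as np


def sample_channel(R, rng, tol=1e-10):
    """Draw h = U sqrt(Lambda) w using only the nonzero eigenvalues of R."""
    eigvals, eigvecs = np.linalg.eigh(R)   # R is Hermitian positive semi-definite
    keep = eigvals > tol * eigvals.max()   # discard numerically zero eigenvalues
    U = eigvecs[:, keep]                   # tall matrix with orthonormal columns
    lam = eigvals[keep]
    r = lam.size
    # Circularly symmetric complex Gaussian vector with identity covariance.
    w = (rng.standard_normal(r) + 1j * rng.standard_normal(r)) / np.sqrt(2)
    return U @ (np.sqrt(lam) * w), U, lam


rng = np.random.default_rng(0)
M, rank = 4, 2
B = rng.standard_normal((M, rank)) + 1j * rng.standard_normal((M, rank))
R = B @ B.conj().T                         # hypothetical rank-2 covariance
h, U, lam = sample_channel(R, rng)
print(h.shape, U.shape, lam)
```

Keeping only the nonzero eigenvalues means the sampled vector always lies in the column span of `U`, which is the eigen-subspace that the group-based discussion relies on.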
We further consider the one-ring scattering model [35]. Then the correlation between the channel coefficients of antennas $1 \le m, p \le M$ follows the one-ring expression in [35]. Suppose that $K$ users are selected to form $G$ groups based on the similarity of their channel covariance matrices. We let $K_g$ denote the number of users in group $g$, such that $K = \sum_{g=1}^{G} K_g$. We make the same assumption as in [35] that users in the same group $g$ share the same covariance matrix $\mathbf{R}_g = \mathbf{U}_g \Lambda_g \mathbf{U}_g^{H}$. More precisely, the channel vector of user $k$ in group $g$ is generated from $\mathbf{U}_g$ and $\Lambda_g$ as described above.

The baseline schemes are the following:
• The sum capacity $C_{\mathrm{sum}}$, which can be derived with the MAC-BC duality as the solution of a convex problem that can be solved efficiently.
• The unicast scheme with the covariance matrices obtained by solving an optimization problem with the constraint $\mathrm{tr}(\mathbf{Q}_k) \le P$.
• The 1-layer RS (the only common message sent by the BS is intended for all the users) with successive decoding [14], [15], [17].

We first verify the performance of the general RS scheme with joint decoding for a small number of users, i.e., $K = 3$. To that end, we assume $G = 2$ groups, and each user can be inside any of the groups with equal probability. We set the azimuth angle of the $g$-th group as $\theta_g = -\frac{\pi}{3} + \frac{15\pi}{180}(g - 1)$ and assume the same angular spread (AS) for all groups, $\Delta = \frac{5\pi}{180}$. Algorithm 1 is applied to obtain a stationary point of the sum rate of the general RS with joint decoding. Fig. 3 shows that both the 1-layer RS and the general RS considerably increase the sum rate compared to unicast, especially at high SNR. In addition, with only 3 more streams in the general RS, we can improve the sum rate by more than 1 bit compared to the 1-layer RS at medium-to-high SNR. As the users in the same group have correlated channels, it is reasonable to decode the interference inside the group instead of treating it as noise. Since the users are spatially correlated in the one-ring scattering model, the performance of the ZF scheme degrades severely, as demonstrated in Fig. 3. Therefore, in the following, we will not consider ZF.

Now let us adopt Algorithm 4 to solve $P^{\mathrm{SD}}_{\mathrm{SR}}$ with the proposed stream selection for the 4-user case. We set $N_{\mathrm{SEA}} = 15$. Thus the first step in Algorithm 3 actually preserves all the $2^K - 1$ streams. Since each $\tilde{\mathcal{S}}$ generated by Algorithm 3 contains 8 streams instead of the original 15 streams, the decoding complexity for each user is substantially reduced. Recall that for a given collection $\tilde{\mathcal{S}}$, the decoding order is fixed after the stream selection, as explained in Section IV-B. We consider two scenarios with disjoint and overlapping eigen-subspaces, respectively, in Fig. 4. We set $\theta_g = -\frac{\pi}{3} + \Delta + \frac{15\pi}{180}(g - 1)$ and $\Delta = \frac{5\pi}{180}$ for the former scenario, and $\theta_g = -\frac{\pi}{3} + \frac{10\pi}{180}(g - 1)$ and $\Delta = \frac{15\pi}{180}$ for the latter.
In the scenario with disjoint eigen-subspaces, inter-group interference is small and common messages across different groups are not needed, so the 1-layer RS boils down to the unicast scheme. It is worth mentioning that in the simulation, the 1-layer RS slightly outperforms the unicast scheme since we randomly put four users in two groups. Therefore, it is possible that all the users are in the same group, so that they all benefit from the common message. The general RS after stream selection, which has three additional streams, further improves the sum rate. However, in such a disjoint case, the extra streams only bring a slight intra-group rate increase. In contrast, in the overlapping case, the common message to all the users increases the sum rate by 2 bits compared to unicast. Apart from this, a large gain is further enabled by three additional inter-group common messages in the general RS.
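The distinction between the two scenarios can be made concrete by comparing the angular supports of the two groups. The sketch below assumes that each group's scatterers occupy the interval $[\theta_g - \Delta, \theta_g + \Delta]$ and uses the overlap of these intervals as a rough proxy for eigen-subspace overlap; this proxy and the helper names `supports` and `overlap` are assumptions of the example, not part of the paper's method.

```python
import math

DEG = math.pi / 180  # one degree in radians


def supports(theta_of_g, delta, num_groups=2):
    """Angular interval [theta_g - delta, theta_g + delta] for each group g = 1..G."""
    return [
        (theta_of_g(g) - delta, theta_of_g(g) + delta)
        for g in range(1, num_groups + 1)
    ]


def overlap(a, b):
    """Length of the intersection of two intervals (0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Scenario 1: theta_g = -pi/3 + Delta + 15 deg * (g - 1), Delta = 5 deg.
d1 = 5 * DEG
disjoint = supports(lambda g: -math.pi / 3 + d1 + 15 * DEG * (g - 1), d1)

# Scenario 2: theta_g = -pi/3 + 10 deg * (g - 1), Delta = 15 deg.
d2 = 15 * DEG
overlapping = supports(lambda g: -math.pi / 3 + 10 * DEG * (g - 1), d2)

print(overlap(*disjoint) / DEG, overlap(*overlapping) / DEG)
```

In the first setting the two intervals are separated by a 5-degree gap, while in the second they share a 20-degree sector, which is consistent with inter-group common messages mattering only in the second case.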
In Fig. 5, we consider a larger number of users, i.e., $K = 8$. In this simulation, we assume that the 8 users can be inside any of the two groups with equal probability. We set $\Delta = \frac{20\pi}{180}$ and $\theta_g = -\frac{\pi}{3} + \frac{\pi}{8}(g - 1)$, which corresponds to overlapping eigen-subspaces between the two groups. In this case, the general RS scheme involves 255 streams, which is infeasible due to the high complexity. Thanks to the stream selection algorithm, it is possible to investigate the performance of the general RS with such a relatively large number of users. We set $N_{\mathrm{SEA}} = 38$, so that each $\tilde{\mathcal{S}}$ generated by Algorithm 3 contains 30 streams. Furthermore, the number of constraints in successive decoding is reduced to 16, which is implementable in practice. Fig. 5 suggests that the 1-layer RS slightly outperforms the unicast scheme, while the general RS further provides a large gain thanks to the use of multiple inter-group common messages. This observation confirms the effectiveness of our stream selection in identifying "good" streams.

For the imperfect CSIT case, the CSI error and the channel estimate are i.i.d. circularly symmetric Gaussian with variances $\sigma^2$ and $1 - \sigma^2$, respectively. In Fig. 6, the sum rate with the proposed regularization is calculated as follows. We first optimize the sum rate based on the lower bound, i.e., we replace the rate constraints in (26) by those in (43) in Algorithm 2, and obtain a set of feasible $\mathbf{Q}$. Then, we substitute the returned $\mathbf{Q}$ into the actual rate constraints in (26) to get a set of achievable rates, and thus the sum rate. To obtain the sum rate of the general RS without regularization, we solve $P^{\mathrm{SIC}}_{\mathrm{SR}}$ with $\hat{\mathbf{H}}$ directly replacing $\mathbf{H}$, and then substitute the returned $\mathbf{Q}$ into (26) with the actual channel matrix $\mathbf{H}$ to get the achievable rates.

The results show that the proposed regularization brings a remarkable rate improvement, especially when the CSIT error variance $\sigma^2$ is large, e.g., a gain of almost 200% when $\sigma^2 = 0.9$. In Fig. 7, we compare the performance of the unicast scheme and the general RS after stream selection. The sum rate of the unicast scheme with regularization is obtained by steps similar to those for the general RS with regularization. We can observe that the general RS always outperforms the unicast scheme, even with a large estimation error.

VI. CONCLUSION
In this work, we have investigated the general RS scheme applied to multi-antenna downlink communications. We have proposed a full range of novel solutions including precoder design, stream selection, and imperfect CSIT regularization. Our numerical simulations show that the proposed RS solutions, even under practical constraints, can provide substantial performance gains over existing schemes. It is worth noting that, since RS is linear, the implementation cost of the proposed algorithms is comparable to that of the schemes applied in practical systems. In summary, our study has demonstrated that the general RS is a viable way to mitigate interference in future multi-antenna networks.