Optimum Performance of Short Block Length Codes Under Multivariate Stationary Rayleigh Fading

The performance of short block length codes in the presence of multi-variate stationary Rayleigh fading coherent channels is studied. The channel model is inspired by multi-symbol OFDM transmissions for ultra-reliable low latency communication (URLLC) services, according to which information is sent in a small number of OFDM symbols to reduce latency. More specifically, the dispersion of the coherent fading channel is generalized from the scalar case, which is well known in the literature, to the multivariate one. The obtained expressions are then particularized to the Rayleigh fading statistics and expressed in closed form as a series expansion. Finally, a high SNR approximation of the ergodic capacity and channel dispersion is derived for this particular fading choice. Results show that the code performance depends on the time-frequency fading correlation through very simple functions of the channel correlation coefficients. In particular, it is shown that the channel dispersion converges to a constant plus the sum of the dilogarithms of the correlation coefficients of the channel process. These formulas provide a very useful tool to design physical layer system parameters from channel correlation information measured at the receiver.


5G communication systems introduce two new use cases, namely massive machine type communications (MMTC) and ultra-reliable low latency communications (URLLC), whose data traffic is served by short packets, given the small amount of information contained in each sporadic message (MMTC) or the extreme requirements in terms of latency (URLLC). Unfortunately, short-packet communications are not well represented by classical information theory results, which are known to be meaningful only for asymptotically large code block lengths. The interest in this new type of communication strategy has motivated the introduction of coding theorems for not-so-large codewords, which can be seen as second-order refinements of conventional channel coding theorems.
In order to introduce this theory, consider an arbitrary noisy communication channel and denote by $N^*_{\mathrm{cw}}(N,\epsilon)$ the maximum cardinality (number of codewords) of a codebook with block length $N$ that can be decoded with block error probability no greater than $\epsilon$. Denote by $R^*(N,\epsilon)$ the associated coding rate, defined as

$$R^*(N,\epsilon) = \frac{\log N^*_{\mathrm{cw}}(N,\epsilon)}{N}$$

in nats per channel access. The evaluation of $R^*(N,\epsilon)$ in closed form is extremely difficult in channels of interest, so most insights are gained in the regime where the block length grows large, namely $N \to \infty$. Recent information theory results [1] state that, for a large number of channels, this quantity accepts the following asymptotic expansion:

$$R^*(N,\epsilon) = C - \sqrt{\frac{V}{N}}\,Q^{-1}(\epsilon) + O\!\left(\frac{\log N}{N}\right) \tag{1}$$

where $C$ denotes the channel capacity, $Q(\cdot)$ is the Gaussian Q-function and $V$ is a certain channel dependent quantity, usually referred to as channel dispersion. The above formula can be seen as a refinement of the conventional Shannon channel coding theorem, which states that the maximum achievable coding rate that guarantees an arbitrarily low $\epsilon$ for a sufficiently large block length $N$ is given by the channel capacity $C$. According to this formula, the channel dispersion $V$ establishes a penalty term of order $1/\sqrt{N}$ that separates the maximum achievable rate from the actual capacity. This type of expansion had been known for discrete memoryless channels since the 1960s [2], [3], but it was only recently that equivalent expressions were proven for more interesting channel models [4].
One of the most relevant channels for which an expansion like the one in (1) has been established is the AWGN channel. If $\mathbf{x} \in \mathbb{C}^N$ denotes the transmitted codeword, the observation $\mathbf{y} \in \mathbb{C}^N$ can be expressed as $\mathbf{y} = \mathbf{x} + \mathbf{w}$, where $\mathbf{w} \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I}_N)$ is a noise vector (assumed circularly symmetric white Gaussian distributed) and where we impose the average power constraint $\|\mathbf{x}\|^2 \le NP$. The capacity and dispersion for this channel are respectively given by [4]

$$C_{\mathrm{AWGN}}(\rho) = c(\rho), \qquad V_{\mathrm{AWGN}}(\rho) = 1 - d^2(\rho)$$

where $\rho = P/\sigma^2$ is the signal to noise ratio and where

$$c(\rho) = \log(1+\rho) \quad \text{and} \quad d(\rho) = \frac{\mathrm{d}c(\rho)}{\mathrm{d}\rho} = (1+\rho)^{-1}. \tag{2}$$

The availability of this new type of short block length results facilitates the design of communication strategies that have traditionally been optimized for capacity maximization. As illustrated in [5], the approximation in (1) allows one to optimize system parameters (such as the duplexing time in TDD schemes, the size of the slot in random access procedures or the throughput of ARQ schemes) so as to guarantee a certain end-to-end code-agnostic system performance. In [6] the performance of different codes is compared in the URLLC regime, and it is concluded that modern codes such as LDPC and turbo codes show a certain performance gap with respect to the bound in (1). This gap can be considerably reduced by employing more classical code designs such as BCH at the expense of decoding complexity.

More interesting for wireless communication problems are channel models subject to variability, that is, fading. The small block-length achievable rates under different fading channel models have also been the subject of intense research during the past decade. In this general setting the observation $\mathbf{y} \in \mathbb{C}^N$ takes the more general form

$$\mathbf{y} = \mathbf{h} \odot \mathbf{x} + \mathbf{w} \tag{3}$$

where $\mathbf{x}$ and $\mathbf{w}$ are as before, $\odot$ denotes element-wise multiplication and $\mathbf{h} \in \mathbb{C}^N$ is the channel response, assumed here to be random and known to the receiver (coherent detection).
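As a rough illustration of how the leading terms of (1) are used with the AWGN quantities in (2), the following sketch evaluates $C - \sqrt{V/N}\,Q^{-1}(\epsilon)$ for a few block lengths. The function names and the chosen SNR and error probability are illustrative assumptions, and the remainder term of (1) is simply dropped.

```python
import math
from statistics import NormalDist


def awgn_capacity(rho: float) -> float:
    """c(rho) = log(1 + rho), in nats per channel access."""
    return math.log1p(rho)


def awgn_dispersion(rho: float) -> float:
    """V_AWGN(rho) = 1 - d(rho)^2 with d(rho) = 1 / (1 + rho)."""
    return 1.0 - 1.0 / (1.0 + rho) ** 2


def q_inv(eps: float) -> float:
    """Inverse Gaussian Q-function: Q(x) = eps  <=>  x = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)


def normal_approx_rate(cap: float, disp: float, n: int, eps: float) -> float:
    """Leading terms of (1): C - sqrt(V / N) * Q^{-1}(eps), remainder dropped."""
    return cap - math.sqrt(disp / n) * q_inv(eps)


if __name__ == "__main__":
    rho = 10.0   # linear SNR, i.e. 10 dB (illustrative value)
    eps = 1e-5   # target block error probability (illustrative value)
    cap, disp = awgn_capacity(rho), awgn_dispersion(rho)
    for n in (128, 512, 2048):
        rate = normal_approx_rate(cap, disp, n, eps)
        print(f"N={n:5d}  rate ~ {rate:.4f} nats  (capacity {cap:.4f})")
```

The gap to capacity shrinks as $1/\sqrt{N}$, which is the penalty attributed to the dispersion term.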
Depending on how the randomness of $\mathbf{h}$ is modeled, different channel fading models are obtained. For example, the quasi-static fading channel model assumes that $\mathbf{h} = h\mathbf{1}_N$, where $h$ is a random coefficient that is independently chosen for each codeword and kept constant for the whole transmission. It is well known that the Shannon capacity of this channel model is zero [7], but it was shown in [8] that one can still establish an asymptotic expansion of the maximum achievable rate like the one in (1) with $C$ equal to the outage capacity $C_\epsilon$ and $V = 0$. In other words, the channel dispersion of the quasi-static fading channel model is identically zero, so that $R^*(N,\epsilon) = C_\epsilon + O(N^{-1}\log N)$. This shows that the maximum achievable rate converges to the outage capacity at a much faster rate, $O(N^{-1}\log N)$, than in the AWGN or other fading channel models, where the dispersion term is of order $O(1/\sqrt{N})$. The same fading channel model was analyzed in [9] in a MIMO setting assuming that the number of antennas at both sides of the communication link grows to infinity. On the other hand, [10] considers the problem of power control for this channel model under the assumption of an average power constraint on the whole codebook.

A different channel model was studied in [11], where the channel $\mathbf{h}$ was assumed to be an $N$-length realization of a scalar stationary fading process. It was shown in [11] that, under several technical conditions on the fading process, the maximum achievable rate for this channel accepts an expansion as in (1), with $C$ and $V$ replaced with

$$C_{\mathrm{ERG}}(\rho) = \mathbb{E}\big[c(|h|^2\rho)\big], \qquad V_{\mathrm{ERG}}(\rho) = \mathcal{L}\big((c(|h_n|^2\rho))_{n\in\mathbb{N}}\big) + 1 - \mathbb{E}^2\big[d(|h|^2\rho)\big]$$

respectively, where $\mathbb{E}$ denotes expectation and $h$ denotes a general entry of $\mathbf{h}$. In the above expression, we have introduced the quantity

$$\mathcal{L}(x) = \lim_{N\to\infty} \frac{1}{N}\,\mathbb{V}\Big[\sum_{n=1}^{N} x_n\Big]$$

which is usually referred to as the long term variance of the process $x = (x_n)_{n\in\mathbb{N}}$, with $\mathbb{V}$ denoting variance.
Thus, according to this fading channel model, the maximum achievable rate converges to the ergodic capacity $C_{\mathrm{ERG}}(\rho)$ as $N \to \infty$, which only depends on the marginal distribution of the entries of $\mathbf{h}$ and is therefore independent of the fading dynamics. The variability of the fading process plays instead a fundamental role in the channel dispersion, which depends on the cross-correlation coefficients of the channel fading process through the long term variance $\mathcal{L}\big((c(|h_n|^2\rho))_{n\in\mathbb{N}}\big)$.
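To make the notion of long term variance more concrete, the sketch below uses the standard identity that, for a stationary sequence with absolutely summable autocovariance $\gamma(k)$, the limit of $\frac{1}{N}\mathbb{V}[\sum_n x_n]$ equals $\sum_{k\in\mathbb{Z}}\gamma(k)$. The geometric autocovariance $\gamma(k) = a^{|k|}$ is a hypothetical example chosen because its limit $(1+a)/(1-a)$ is known in closed form; it is not the autocovariance of the capacity process.

```python
def long_term_variance(autocov, max_lag: int) -> float:
    """Truncated two-sided sum of autocov(|k|) for |k| <= max_lag.

    For a stationary sequence with absolutely summable autocovariance,
    this converges to lim_N (1/N) Var[x_1 + ... + x_N] as max_lag grows.
    """
    return autocov(0) + 2.0 * sum(autocov(k) for k in range(1, max_lag + 1))


def finite_n_variance(autocov, n: int) -> float:
    """Exact (1/n) Var[x_1 + ... + x_n] = sum_k (1 - |k|/n) autocov(|k|)."""
    return autocov(0) + 2.0 * sum((1.0 - k / n) * autocov(k) for k in range(1, n))


if __name__ == "__main__":
    a = 0.9  # hypothetical lag-one correlation

    def gamma(k: int) -> float:
        return a ** k

    print("closed-form limit:", (1 + a) / (1 - a))
    print("truncated sum    :", long_term_variance(gamma, 500))
    for n in (32, 128, 1024):
        print(f"normalized variance at N={n}:", finite_n_variance(gamma, n))
```

The finite-$N$ values approach the limit from below in this example, which shows how strongly correlated samples inflate the variance of the accumulated sum relative to the independent case.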
A similar fading model, although simpler to analyze, is the coherent block-fading channel model. According to this model, the codeword is divided into $L$ sub-blocks of length $M$, that is $N = ML$. The channel is assumed to be constant for the duration of each sub-block and independent and identically distributed across sub-blocks, so that in practice we can use (3) with $\mathbf{h} = \tilde{\mathbf{h}} \otimes \mathbf{1}_M$, where $\otimes$ represents the Kronecker product and $\tilde{\mathbf{h}} \in \mathbb{C}^L$ denotes a random vector of independent and identically distributed entries. In fact, assuming $L \to \infty$, the result in [11] can be reformulated for the block fading model with the same capacity and dispersion, where the long term variance takes the simplified form [12]

$$\mathcal{L}\big((c(|h_n|^2\rho))_{n\in\mathbb{N}}\big) = M\,\mathbb{V}\big[c(|\tilde{h}|^2\rho)\big]$$

with $\tilde{h}$ denoting a generic entry of $\tilde{\mathbf{h}}$. This was later generalized to different MIMO settings in [13]-[15].

This paper is concerned with OFDM transmissions supporting URLLC communications, where the codeword spans a sufficiently high bandwidth and therefore the channel in (3) may experience strong variations. In this sense, it appears that the channel model in [11] is the most relevant for this type of transmission. In fact, we consider here a slightly more general version of this channel, the main difference with respect to [11] being that the codeword is allowed to span several OFDM symbols, which may experience similar or completely uncorrelated channel frequency responses. Some of these results were previously presented in a conference version of this paper [16].
Before going into the technical details, it should be pointed out that there exist other powerful techniques to analyze the coding performance in the short block length regime that are not considered herein. In particular, saddlepoint techniques have recently been shown to provide more accurate asymptotic expansions than those established in (1), see e.g. [17]- [19]. Thus, a future research direction could target the refinement of the results presented here by using more sophisticated techniques such as those based on saddlepoint approximations.
The rest of the paper is organized as follows. Section II introduces the multivariate coherent stationary channel and justifies its relevance in current OFDM-based wireless standards. It also provides a coding theorem for this type of channel and particularizes this result to the case where the channel is Rayleigh distributed. Section III then analyzes the performance under Rayleigh fading at large values of the SNR, showing that in this regime the finite block length performance depends on some specific functions of the channel correlation coefficients. Finally, Section IV provides some numerical studies and Section V concludes the paper.

II. SYSTEM MODEL AND MAXIMUM ACHIEVABLE RATE
Let us consider the transmission of a codeword over a set of $L$ OFDM symbols subject to time and frequency channel selectivity. We assume that channel state information is not available at the transmitter but perfectly known at the receiver (coherent reception). Assuming that $M$ denotes the number of occupied subcarriers, the block length will be given by $N = ML$, and therefore we may represent the codeword as an $M \times L$ complex matrix $\mathbf{X}$. The encoder output is split into $L$ different parts of equal dimension $M$ (see further Figure 1), which correspond to the columns of $\mathbf{X}$, denoted by $\mathbf{x}_l$, $l = 1, \ldots, L$. Each column vector $\mathbf{x}_l$ is OFDM-modulated onto $M$ subcarriers and transmitted through a frequency selective channel with frequency response given by the $M$-dimensional column vector $\mathbf{h}_l$. The received signal in the frequency domain for the $l$th OFDM symbol can be written as

$$\mathbf{y}_l = \mathbf{h}_l \odot \mathbf{x}_l + \mathbf{w}_l$$

where $\mathbf{w}_l$ denotes a noise vector. As before, we model $\mathbf{w}_l$ as independent realizations of a random vector $\mathbf{w}_l \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I}_M)$. Furthermore, we will assume that the power of the transmitted signal is limited for each OFDM symbol, that is $\|\mathbf{x}_l\|^2 \le MP$ for $l = 1, \ldots, L$. Collecting the $L$ OFDM symbols as columns, the model reads $\mathbf{Y} = \mathbf{H} \odot \mathbf{X} + \mathbf{W}$. Observe that this is fundamentally different from a conventional MIMO signal model, where the Hadamard product is replaced with a conventional matrix product (i.e. $\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{W}$).
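The element-wise structure of this signal model can be sketched in a few lines. The example below draws a codeword matrix whose columns meet the per-symbol power constraint with equality and applies the Hadamard channel entry by entry. All function names are hypothetical, and the channel used in the demonstration has independent entries purely for brevity, whereas the paper considers correlated fading.

```python
import math
import random


def complex_gaussian(rng: random.Random, var: float = 1.0) -> complex:
    """Draw one circularly symmetric CN(0, var) sample."""
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def random_codeword(rng: random.Random, m: int, n_sym: int, power: float):
    """M x L matrix (list of rows) with every column scaled to energy M * power."""
    x = [[0j] * n_sym for _ in range(m)]
    for l in range(n_sym):
        g = [complex_gaussian(rng) for _ in range(m)]
        scale = math.sqrt(m * power) / math.sqrt(sum(abs(v) ** 2 for v in g))
        for i in range(m):
            x[i][l] = g[i] * scale
    return x


def channel_output(x, h, rng: random.Random, sigma2: float):
    """Y = H (Hadamard product) X + W, computed entry by entry."""
    return [
        [h_ml * x_ml + complex_gaussian(rng, sigma2) for h_ml, x_ml in zip(h_row, x_row)]
        for h_row, x_row in zip(h, x)
    ]


def column_energy(x, l: int) -> float:
    """Squared Euclidean norm of the l-th column (one OFDM symbol)."""
    return sum(abs(row[l]) ** 2 for row in x)


if __name__ == "__main__":
    rng = random.Random(0)
    M, L, P, sigma2 = 128, 2, 1.0, 0.1  # illustrative values
    X = random_codeword(rng, M, L, P)
    H = [[complex_gaussian(rng) for _ in range(L)] for _ in range(M)]  # i.i.d. placeholder
    Y = channel_output(X, H, rng, sigma2)
    print([round(column_energy(X, l), 6) for l in range(L)])
```

Each received entry depends on a single transmitted entry, which is the key difference with respect to a MIMO model where every output mixes all inputs.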
In this paper, we will model the channel matrix $\mathbf{H}$ as a bidimensional random process that has identically distributed entries but presents correlation in both dimensions. These correlations will model the practical dynamics of the fading across both frequency and time domains. Furthermore, in order to model practical URLLC transmissions, we will generally assume that the matrix $\mathbf{H}$ is tall, in the sense that $M \gg L$. More specifically, we assume that $\mathbf{H}$ is a realization of an $L$-variate stationary process that evolves along the frequency domain. Indeed, let $\tilde{\mathbf{h}}^{(m)} \in \mathbb{C}^{1\times L}$, $m = 1, \ldots, M$, denote the $m$th row of the channel matrix $\mathbf{H}$, so that $\mathbf{H}$ is obtained by stacking the rows $\tilde{\mathbf{h}}^{(1)}, \ldots, \tilde{\mathbf{h}}^{(M)}$. In this paper, we will model the sequence $(\tilde{\mathbf{h}}^{(m)})_m$ as a (discrete-index) stationary multivariate process with zero mean and $L \times L$ covariance matrices $\mathbf{R}_m$, defined in (6), between the rows with indices $m_0 + m$ and $m_0$. Note that the covariance does not depend on $m_0$ because of the stationarity of the process. We also assume that the channel distribution is absolutely continuous and that the fading coefficients are normalized so that the diagonal entries of $\mathbf{R}_0$ are all equal to one, whereas the off-diagonal ones are strictly lower than one in magnitude.

Let $h_{m,l}$ denote the $(m,l)$th entry of the channel matrix $\mathbf{H}$. We denote by $\rho^h_{l,l'}(m)$ the $\rho$-mixing coefficients of the associated bidimensional process $\{h_{m,l}\}_{m,l}$, defined as follows [20]. Letting $\mathcal{F}_l^{m_1,m_2}$ denote the sigma-algebra generated by the random variables $\{h_{m_1,l}, \ldots, h_{m_2,l}\}$, the coefficient $\rho^h_{l,l'}(m)$ is the supremum of the correlation $|\mathbb{E}[fg]|$ between a function $f$ of the $l$th column and a function $g$ of the $l'$th column whose frequency indices are separated by a lag of at least $m$, where the supremum is over all square-integrable functions $f, g$ defined on these sigma-algebras, normalized to zero mean and unit variance. We assume that these $\rho$-mixing coefficients decay sufficiently fast when they are considered as functions of their lag $m$, namely

$$\rho^h_{l,l'}(m) \le K_1 \exp(-K_2 m) \tag{7}$$

for some positive $K_1$, $K_2$. The above condition is verified in practice by a wide range of channel models. In fact, it can be shown that the $\rho$-mixing coefficient decays exponentially when the channel has a rational spectral density function [21]. Therefore, any channel obtained as a finite-order auto-regressive moving average (ARMA) process satisfies this condition.
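As a minimal example of a finite-order ARMA channel, the sketch below generates a scalar first-order auto-regressive (AR(1)) complex Gaussian process along the frequency index. For this process the correlation between entries at lag $k$ is $a^{|k|}$, so it decays exponentially. The coefficient `a`, the function names and the single-column setting are assumptions made for illustration, and the empirical correlation is only a sanity check of the second-order statistics, not of the mixing coefficients themselves.

```python
import math
import random


def complex_gaussian(rng: random.Random, var: float = 1.0) -> complex:
    """Draw one circularly symmetric CN(0, var) sample."""
    s = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def ar1_fading(rng: random.Random, m: int, a: float):
    """Unit-variance AR(1) process: h_k = a * h_{k-1} + sqrt(1 - a^2) * v_k."""
    innovation_gain = math.sqrt(1.0 - a * a)
    h = [complex_gaussian(rng)]  # stationary start: CN(0, 1)
    for _ in range(m - 1):
        h.append(a * h[-1] + innovation_gain * complex_gaussian(rng))
    return h


def empirical_correlation(h, lag: int) -> complex:
    """Sample estimate of E[h_{k+lag} conj(h_k)] / E[|h_k|^2]."""
    num = sum(h[k + lag] * h[k].conjugate() for k in range(len(h) - lag))
    den = sum(abs(v) ** 2 for v in h)
    return num / den


if __name__ == "__main__":
    rng = random.Random(1)
    a = 0.9  # hypothetical frequency-domain correlation between adjacent subcarriers
    h = ar1_fading(rng, 50_000, a)
    for lag in (1, 2, 5, 10):
        print(lag, round(abs(empirical_correlation(h, lag)), 3), round(a ** lag, 3))
```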
We have now all the ingredients to characterize the maximum achievable rate of this channel. The following result is a generalization of the one in [11] to the case L ≥ 1.
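Theorem 1: Under the above assumptions, there exists a code with codeword length $ML$ and maximum probability of error $\epsilon$ whose rate is at least

$$C(\rho) - \sqrt{\frac{V_L(\rho)}{ML}}\,Q^{-1}(\epsilon) + o\!\left(\frac{1}{\sqrt{ML}}\right) \tag{8}$$

where $C(\rho) = \mathbb{E}[c(\rho|h|^2)]$ is the ergodic capacity,

$$V_L(\rho) = \mathcal{L}_L(\rho) + 1 - \mathbb{E}^2\big[d(\rho|h|^2)\big] \tag{9}$$

is the associated channel dispersion and where $\mathcal{L}_L(\rho)$ denotes the long term variance of the bidimensional capacity process, that is

$$\mathcal{L}_L(\rho) = \lim_{M\to\infty} \frac{1}{ML}\,\mathbb{V}\Big[\sum_{m=1}^{M}\sum_{l=1}^{L} c\big(\rho|h_{m,l}|^2\big)\Big]. \tag{10}$$

Conversely, any code of block length $ML$ and maximum error probability $\epsilon$ such that $\max_{1\le i\le M} |(\mathbf{x}_l)_i| \le \sqrt{P}M^{\theta}$ for some $0 < \theta < 1/12$ has rate bounded by (8).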
Proof: See Appendix A.

1) Particularization to the Rayleigh Case:
We can particularize the above expressions for the specific case where the fading is Rayleigh distributed, that is when the channel coefficients $h_{m,l}$ follow a normalized circularly symmetric Gaussian random law. Let us denote by $E_1(x)$ the exponential integral, defined as

$$E_1(x) = \int_x^{\infty} \frac{e^{-t}}{t}\,\mathrm{d}t.$$

We will also let $\gamma$ denote the Euler-Mascheroni constant, $\Gamma(r, x)$ the upper incomplete gamma function and ${}_nF_k$ the generalized hypergeometric function [22, 9.14]. For any sequence of $L \times L$ matrices $(\mathbf{A}_n)_{n\in\mathbb{Z}}$, we define a scalar-valued operator that simply returns the normalized sum of all the entries of the matrix sequence, except for the diagonal of the matrix with index $n = 0$.

Corollary 1: In addition to the assumptions in Theorem 1, assume that the channel entries $h_{m,l}$ are zero-mean circularly symmetric Gaussian distributed (Rayleigh fading). Define the absolute correlation coefficient as $\varphi_{l,l'}(m) = |(\mathbf{R}_m)_{l,l'}|$. We can express the channel capacity as

$$C(\rho) = e^{1/\rho}E_1(1/\rho) \tag{11}$$

whereas the channel dispersion takes the form (12), which collects the variance $\mathbb{V}[c(\rho|h|^2)]$, given in closed form in (13), the above operator applied to the sequence $(\mathbf{V}_c(m))_{m\in\mathbb{Z}}$, and the term $1 - \mathbb{E}^2[d(\rho|h|^2)]$. Here we have defined the $L \times L$ matrix $\mathbf{V}_c(m)$ with $(l,l')$th entry equal to the series in (14), which depends on $\varphi_{l,l'}(m)$.

Proof: Indeed, the capacity expression in (11) follows directly from partial integration. On the other hand, we can also trivially see that

$$\mathbb{E}\big[d(\rho|h|^2)\big] = \rho^{-1}e^{1/\rho}E_1(1/\rho)$$

where we have used the fact that $E_1'(x) = -x^{-1}\exp(-x)$. Regarding the long term variance, we can use stationarity to write $\mathcal{L}_L(\rho)$ as a sum of covariances between entries of the capacity process that depend only on the frequency lag. Observe that, under condition (7) and given the definition of the matrix $\mathbf{V}_c(m)$ above, this sum converges as $M \to \infty$, and this shows (12). The final result follows from the evaluation of the second moments of the capacity process $c(\rho|h_{m,l}|^2)$, namely the evaluation of $\mathbb{E}[c^2(\rho|h|^2)]$ when $|h|^2$ is exponentially distributed with mean 1, and the evaluation of $\mathbb{E}[c(\rho|h_1|^2)\,c(\rho|h_2|^2)]$ when $(h_1, h_2)$ is a circularly symmetric complex Gaussian vector with zero mean and unit variances $\mathbb{E}|h_1|^2 = \mathbb{E}|h_2|^2 = 1$. Regarding the first, we can trivially use [23, eq. (15)] and directly obtain (13).
For the cross-covariance computation, we can follow the approach in [24] and use a series representation of the modified Bessel function of the first kind $I_0$ in the integral formula for this cross-moment. Inserting the Taylor expansion into that integral, we obtain the series in (16), expressed in terms of integrals $I_k$. Using integration by parts and [22, 3.381.3] in order to solve $I_k$ and inserting the result into (16), we obtain (14).

The above expression can be evaluated numerically by taking a finite number of terms in the sum of (14). In practice, one can increase the number of terms of the series until the contribution of additional terms falls below a certain error. The number of terms that need to be computed increases as the correlation coefficient $\varphi$ becomes close to one. In any case, it appears difficult to draw any insight from the complicated form of the resulting variance/covariance expressions. Furthermore, in order to evaluate the long term variance, one needs the collection of all the correlation coefficients of the time-frequency grid, namely $\varphi_{l,l'}(m)$ for $1 \le l, l' \le L$ and $m = 1, \ldots, M$, which should be measured at the receiver and fed back to the transmitter in order to make system design decisions. We will see in the next section that a much simpler expression (which depends on some functions of the coefficients $\varphi_{l,l'}(m)$, rather than on the whole collection) can be obtained by investigating the asymptotic behavior of these quantities for large signal to noise ratios (SNR).
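The closed forms for the Rayleigh capacity and for $\mathbb{E}[d(\rho|h|^2)]$ can be checked numerically. The sketch below evaluates $E_1$ through its standard power series, truncated once the next term falls below a tolerance (the same stopping rule described above for the series in (14)), and compares the results with a direct numerical integration against the exponential density. The series implementation is only intended for moderate arguments (roughly $\rho \ge 1$), and all names, the tolerance and the integration grid are assumptions of this example.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x: float, tol: float = 1e-15, max_terms: int = 200) -> float:
    """E_1(x) = -gamma - log(x) - sum_{k>=1} (-x)^k / (k * k!), for small x > 0."""
    if x <= 0.0:
        raise ValueError("E_1 is evaluated here only for x > 0")
    total = -EULER_GAMMA - math.log(x)
    term = 1.0  # running value of (-x)^k / k!
    for k in range(1, max_terms + 1):
        term *= -x / k
        contribution = term / k
        total -= contribution
        if abs(contribution) < tol:
            break
    return total


def rayleigh_capacity(rho: float) -> float:
    """Ergodic capacity exp(1/rho) * E_1(1/rho), in nats per channel access."""
    return math.exp(1.0 / rho) * exp_integral_e1(1.0 / rho)


def rayleigh_mean_d(rho: float) -> float:
    """E[1 / (1 + rho |h|^2)] = exp(1/rho) * E_1(1/rho) / rho."""
    return rayleigh_capacity(rho) / rho


def exp_expectation(f, upper: float = 50.0, steps: int = 200_000) -> float:
    """Midpoint-rule estimate of E[f(X)], X ~ Exp(1), truncated at `upper`."""
    dx = upper / steps
    return dx * sum(f((i + 0.5) * dx) * math.exp(-(i + 0.5) * dx) for i in range(steps))


if __name__ == "__main__":
    rho = 10.0  # illustrative SNR (10 dB)
    print(rayleigh_capacity(rho), exp_expectation(lambda x: math.log1p(rho * x)))
    print(rayleigh_mean_d(rho), exp_expectation(lambda x: 1.0 / (1.0 + rho * x)))
```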

III. HIGH SNR ANALYSIS
In this section, we analyze the behavior of the dispersion in (9) as the signal to noise ratio (SNR) converges to infinity, namely $\rho \to \infty$. The asymptotic behavior of the capacity $C(\rho)$ can be obtained by noticing that the exponential integral can be expanded as

$$E_1(x) = -\gamma - \log x - \sum_{k=1}^{\infty} \frac{(-x)^k}{k\,k!} \tag{17}$$

and therefore

$$C(\rho) = \log\rho - \gamma + O\big(\rho^{-1}\log\rho\big). \tag{18}$$

Regarding the channel dispersion $V_L(\rho)$, we can expand it as established in the following result.
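As a worked check of the capacity expansion, one can multiply the Taylor series of $e^{1/\rho}$ by the series of the exponential integral evaluated at $x = 1/\rho$. The derivation below assumes the Rayleigh closed form $C(\rho) = e^{1/\rho}E_1(1/\rho)$ and keeps one more term than the leading behavior, which is an addition made here for illustration.

```latex
\begin{align*}
e^{1/\rho} &= 1 + \rho^{-1} + O(\rho^{-2}), \\
E_1(1/\rho) &= \log\rho - \gamma + \rho^{-1} + O(\rho^{-2}), \\
C(\rho) = e^{1/\rho}E_1(1/\rho)
  &= \log\rho - \gamma + \frac{\log\rho - \gamma + 1}{\rho} + O\big(\rho^{-2}\log\rho\big).
\end{align*}
```

For instance, at $\rho = 100$ (20 dB) the first-order correction $(\log\rho - \gamma + 1)/\rho$ is about $0.05$ nats, so the leading term $\log\rho - \gamma$ is already close to the exact capacity in the SNR range of interest.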
Proposition 1: As $\rho \to \infty$, the channel dispersion $V_L(\rho)$ accepts an expansion in terms of three coefficients $\mu_1$, $\mu_2$ and $\mu_3$, where $|\mathbf{R}_m|^2$ is a matrix with the modulus squared entries of $\mathbf{R}_m$ defined in (6), $\mathbf{1}_L$ is an $L$-dimensional column vector of ones, $\mathrm{Li}_2(x) = \sum_{k=1}^{\infty} x^k/k^2$ is the dilogarithm function, and where matrix functions are applied entry-wise.
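Proof: In order to analyze the asymptotic behavior of the channel dispersion, we recall the expression in (12). The last term can be directly analyzed by using the expansion in (17), so we only need to evaluate the asymptotic behavior of $\mathbb{V}[c(\rho|h|^2)]$ and the matrix $\mathbf{V}_c(m)$ when $\rho \to \infty$.

Regarding the variance term $\mathbb{V}[c(\rho|h|^2)]$, its expansion follows from the asymptotic behavior of the closed-form expression in (13) as $\rho \to \infty$. For the covariance matrix $\mathbf{V}_c(m)$ the expansion follows directly from the following result, which is proven in Appendix B.

Lemma 1: Assume that $(h_1, h_2)$ are jointly complex circularly symmetric Gaussian distributed with standardized marginals and correlation coefficient of magnitude $\varphi$. Then, as $\rho \to \infty$, the covariance between $c(\rho|h_1|^2)$ and $c(\rho|h_2|^2)$ admits an expansion whose leading term is $\mathrm{Li}_2(\varphi^2)$, where $\mathrm{Li}_2(x)$ is the dilogarithm function defined above.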
Inserting all these expansions into the dispersion expression of (12), we obtain the result.
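The role of the dilogarithm can be illustrated with a small Monte Carlo experiment. It relies on a classical property of correlated Rayleigh fading, stated here as background for the sketch rather than quoted from the lemma: for jointly circularly symmetric Gaussian $(h_1, h_2)$ with unit variances and correlation magnitude $\varphi < 1$, the covariance between $\log|h_1|^2$ and $\log|h_2|^2$ equals $\mathrm{Li}_2(\varphi^2)$, which is the limit of the covariance of the capacity process as $\rho \to \infty$. The sample size, seed and correlation value are arbitrary choices.

```python
import math
import random


def dilog(x: float, terms: int = 2000) -> float:
    """Truncated series Li_2(x) = sum_{k>=1} x^k / k^2 for 0 <= x < 1."""
    return sum(x ** k / k ** 2 for k in range(1, terms + 1))


def complex_gaussian(rng: random.Random) -> complex:
    """Draw one circularly symmetric CN(0, 1) sample."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def log_power_covariance(rng: random.Random, phi: float, n: int) -> float:
    """Sample covariance of (log|h1|^2, log|h2|^2) with E[h2 conj(h1)] = phi."""
    xs, ys = [], []
    for _ in range(n):
        h1 = complex_gaussian(rng)
        h2 = phi * h1 + math.sqrt(1.0 - phi * phi) * complex_gaussian(rng)
        xs.append(math.log(abs(h1) ** 2))
        ys.append(math.log(abs(h2) ** 2))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


if __name__ == "__main__":
    rng = random.Random(7)
    phi = 0.8  # hypothetical correlation magnitude
    print("Monte Carlo :", log_power_covariance(rng, phi, 200_000))
    print("Li_2(phi^2) :", dilog(phi ** 2))
```

Because only $\mathrm{Li}_2(\varphi^2)$ is needed for each pair of grid points, the receiver can summarize the fading dynamics through averages of this function instead of the full set of correlation coefficients.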
Interestingly enough, the above asymptotic approximation depends on the fading dynamics only through three specific parameters $\mu_1$, $\mu_2$ and $\mu_3$ that can be computed from the average dilogarithm and complementary logarithm of the fading correlation coefficients in the time-frequency grid. These quantities can be measured at the receiver in the long run and fed back to the transmitter for system optimization. Since all these parameters depend on the channel correlation, they do not need to be updated very often.

Now, if we denote by $R(\rho)$ the actual code rate in a practical system, we may asymptotically approximate the error probability of the optimum decoder by disregarding the error terms in (8) as

$$\epsilon \approx Q\!\left(\frac{C(\rho) - R(\rho)}{\sqrt{V_L(\rho)/(ML)}}\right). \tag{20}$$

We can use the large SNR analysis above to study the asymptotic behavior of the argument of the Q-function. Indeed, by using the expansion in Proposition 1 for $V_L(\rho)$ and the corresponding approximation of $C(\rho)$ in (18) we obtain the high SNR approximation in (21), which contains all the error terms up to the order $O(\rho^{-1})$. It will be shown in the next section that the approximation in (21) provides very accurate evaluations of the original quantity for a relevant range of block error rates in typical URLLC requirements. This is because URLLC services typically require very high reliability (in the order of $1 - 10^{-5}$), which in the absence of additional diversity (e.g. produced by multi-antenna transceivers) can only be achieved at significantly high SNR.
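The error probability approximation obtained by dropping the remainder of the rate expansion can be evaluated directly once the capacity and dispersion are available. In the sketch below the capacity and dispersion values passed to the function are placeholders chosen for illustration; in practice they would come from the exact expressions or from their high SNR approximations.

```python
import math


def gaussian_q(x: float) -> float:
    """Gaussian Q-function, Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bler_normal_approx(cap: float, disp: float, rate: float, n: int) -> float:
    """Approximate block error rate Q((C - R) * sqrt(N / V)) for block length N."""
    return gaussian_q((cap - rate) * math.sqrt(n / disp))


if __name__ == "__main__":
    cap, disp = 4.0, 2.5  # placeholder capacity (nats) and dispersion values
    rate = 0.9 * cap      # rate fixed to 90% of the capacity
    for n in (256, 512, 1536):
        print(n, bler_normal_approx(cap, disp, rate, n))
```

For a fixed back-off from capacity, the predicted error probability decays with the block length $ML$ and grows with the dispersion, which is where the fading correlation enters.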

IV. NUMERICAL VALIDATION
The objective is to analyze the latency-reliability tradeoff of the transmission scheme depicted in Figure 1. In particular, we study the possibility of reducing the BLER by transmitting parts of the codeword in OFDM symbols that are located far apart in the time domain, which therefore go through more uncorrelated channels. This leads to increased diversity and hence better performance, achieved at the expense of end-to-end latency because of the longer transmission time. More specifically, we consider an $ML$-dimensional codeword transmitted in $L$ OFDM symbols, equispaced in the time domain with one transmission every $T$ OFDM symbols (see further Figure 1). Therefore, the codeword is transmitted over a total span of $(L-1)T + 1$ OFDM symbols. The channel fading correlation is parametrized by $\rho_f$ and $\rho_t$, which represent the correlation coefficients in the frequency and time domains, respectively. In this numerical study, we fixed the correlation coefficient in the frequency domain to $\rho_f = 0.9$, whereas the correlation coefficient in the time domain was fixed to $\rho_t = (0.9)^T$, where again $T$ is the number of OFDM symbols between two consecutive transmissions as in Figure 1.
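The latency and correlation bookkeeping of this setup is simple enough to tabulate. The short sketch below computes the total span $(L-1)T + 1$ in OFDM symbols and the time-domain correlation $(0.9)^T$ for a few separations; the function names are hypothetical.

```python
def total_ofdm_symbols(n_sym: int, separation: int) -> int:
    """Span of the codeword in OFDM symbols: (L - 1) * T + 1."""
    return (n_sym - 1) * separation + 1


def time_correlation(separation: int, per_symbol: float = 0.9) -> float:
    """Correlation between consecutive transmissions, per_symbol ** T."""
    return per_symbol ** separation


if __name__ == "__main__":
    L = 3
    for T in (1, 2, 4, 8):
        print(f"T={T}: span={total_ofdm_symbols(L, T)} symbols, "
              f"rho_t={time_correlation(T):.3f}")
```

Larger separations reduce the correlation between the parts of the codeword, at a latency cost that grows linearly with $T$.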
In order to investigate the performance prediction capabilities of the derived formulas, we considered first a simple scenario where the transmitted codeword occupied $M = 128$ subcarriers and $L = 2$ consecutive OFDM symbols (so that $T = 1$ in this particular example). Figure 2 represents the block error rate (BLER) performance of the LDPC code proposed for the 5G New Radio standard (base graph 1) [25]. Both simulated BLER (solid lines) and performance bounds (semi-dotted lines) are shown in the figure, and results are given for AWGN and Rayleigh fading channels. Three different modulation and coding rates are considered, namely $R = 1$ bps/Hz with 4-QAM (blue lines), $R = 2$ bps/Hz with 16-QAM (red lines) and $R = 3$ bps/Hz with 64-QAM (green lines). These modulations were selected from the set of square QAM constellations as those that provided the lowest BLER in the SNR region of interest. It can be seen that in all cases the performance of the real codes is around 1 to 2 dB worse than the best performance as predicted by the Gaussian approximation, both for AWGN and fading channels. It can be concluded from this figure that the performance gap for the Rayleigh fading case is similar to the performance gap observed in other well studied channels, such as AWGN.
In order to investigate the tradeoff between transmission latency and coding performance, we considered next a situation where the total number of used subcarriers was fixed to $M = 512$ and the number of OFDM symbols was fixed to $L = 3$ (corresponding to a 1536-dimensional codeword). The transmission rate (in nats per channel access) was fixed according to the signal to noise ratio as 90% of the ergodic capacity, that is $R(\rho) = 0.9\,C(\rho)$. Performance was measured in terms of the approximate BLER bound as given by (20), together with its corresponding high SNR approximation in (21). Figure 3 represents the BLER as a function of the signal to noise ratio (SNR) for different values of the number $T$ of OFDM symbols between consecutive transmissions. Apart from the performance for the exact $C(\rho)$ and $V_L(\rho)$, we also represent the asymptotic approximation as given in (21) in dotted line. Observe that increasing the separation between consecutive OFDM symbols always improves performance, basically because the $L$ different parts of the codeword go through a more uncorrelated channel and are therefore received with a higher degree of diversity. It should also be pointed out that, for conventional block error rates around $10^{-5}$, the high SNR approximations are very close to the actual predicted BLER. This is important, since the high SNR approximation can be very easily evaluated, whereas the original one is much more complex and usually requires evaluating a high number of terms in the series of (14), especially when the correlation factor is high.
In Figure 4 we represent the BLER as a function of the number $T$ of OFDM separation symbols for different values of the SNR. Given a value of the SNR, this figure provides us with the minimum latency (in number $T$ of OFDM separation symbols) that is required in order to achieve a certain reliability (BLER). It therefore illustrates the fundamental tradeoff between latency and reliability in a practical stationary coherent channel. For example, if the SNR is fixed to 25 dB, the minimum number of OFDM separation symbols to achieve a reliability of $10^{-5}$ is $T = 2$ (which corresponds to a total latency of 5 OFDM symbols), whereas if the SNR drops to 20 dB, we can only guarantee this performance by increasing the symbol separation to $T = 8$ (corresponding to 17 OFDM symbols). The high SNR approximation provides a very accurate prediction of the original bound, especially when the SNR is above 20 dB.

V. CONCLUSION
The performance of short block-length codes in the presence of stationary multi-variate Rayleigh fading has been studied in terms of its capacity and asymptotic channel dispersion. The channel model has been inspired by multi-symbol OFDM transmissions for URLLC, where transmissions are typically concentrated in the frequency domain, occupying a large number of subcarriers in order to minimize latency, and span different OFDM symbols that can be separated in the time domain in order to provide diversity. A novel high SNR approximation of these quantities has been derived, and results provide a very simple design rule that depends on different functions of the fading correlation in the time-frequency grid. The usefulness of these results has been illustrated in the context of time-frequency transmit strategies for this type of channel, where it has been shown that the optimum transmission configuration can be chosen through a very simple feedback based on long-term fading statistics.

As a final remark, we should point out that, even if the paper has focused on single antenna channels, the results presented here could be generalized to include other relevant transceiver architectures for URLLC services, such as multi-antenna (MIMO) techniques. Furthermore, the assumption that channel state information is available at the receiver is quite restrictive in practical terms, since acquiring this information typically requires the insertion of pilot symbols that penalize the transmission rate. There are several recent contributions in the literature trying to characterize this penalty for block fading channels [12], [26] and it would be quite interesting to generalize the analysis in this paper to the case of unknown CSI. All these aspects clearly deserve specific attention and are left for future research.

APPENDIX A PROOF OF THEOREM 1
The proof is an extension of the corresponding ones in [11] and [15]; therefore, only the main differences are highlighted here. We divide the exposition into two subsections, corresponding to the achievability and converse parts of the theorem respectively.
The development strongly relies on concepts from binary hypothesis tests that are summarized here. Given two probability distributions $P$, $Q$, consider the set of binary hypothesis tests designed in order to distinguish these two distributions. We will denote by $\beta_\alpha(P, Q)$ the minimum achievable probability of false alarm under the null hypothesis ($Q$) when the probability of correct detection under the alternative one ($P$) is higher than or equal to $\alpha$. Here, $P$ will always be fixed to be the joint distribution of output and channel, given the input $\mathbf{X}$, which will be denoted as $P = P_{\mathbf{Y}\mathbf{H}|\mathbf{X}=\mathbf{X}}$. Therefore, with some abuse of notation we can write $\beta_\alpha(\mathbf{X}, Q) = \beta_\alpha(P_{\mathbf{Y}\mathbf{H}|\mathbf{X}=\mathbf{X}}, Q)$. On some occasions, we may want to consider hypothesis tests between $Q$ and a collection of alternatives $P_{\mathbf{X}}$ indexed by $\mathbf{X} \in \mathcal{F}$, with $\mathcal{F}$ a given channel input subset. In this case, we define $\kappa_\tau(\mathcal{F}, Q)$ as the minimum achievable probability of false alarm under the null hypothesis ($Q$) when the infimum (over all $\mathbf{X} \in \mathcal{F}$) of the probability of correct detection under the alternative $P_{\mathbf{X}}$ is higher than or equal to $\tau$.
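A toy example may help fix the meaning of $\beta_\alpha(P, Q)$. Take the hypothetical scalar pair $P = \mathcal{N}(\mu, 1)$ and $Q = \mathcal{N}(0, 1)$ with $\mu \ge 0$, which is unrelated to the channel distributions used in the proof. The likelihood ratio is increasing in the observation, so by the Neyman-Pearson lemma the optimal test is a threshold test and $\beta_\alpha$ has a closed form.

```python
from statistics import NormalDist


def beta_alpha_gaussian_shift(alpha: float, mu: float) -> float:
    """beta_alpha(P, Q) for P = N(mu, 1), Q = N(0, 1), mu >= 0, 0 < alpha < 1.

    The optimal test decides P when the observation exceeds a threshold t
    chosen so that the detection probability under P equals alpha.
    """
    std_normal = NormalDist()
    threshold = mu + std_normal.inv_cdf(1.0 - alpha)  # P-probability of {x > t} is alpha
    return 1.0 - std_normal.cdf(threshold)            # Q-probability of {x > t}


if __name__ == "__main__":
    alpha = 0.9  # required detection probability (illustrative)
    for mu in (0.0, 1.0, 2.0, 4.0):
        print(f"mu={mu}: beta_alpha={beta_alpha_gaussian_shift(alpha, mu):.3e}")
```

When the two distributions coincide ($\mu = 0$) no test can do better than $\beta_\alpha = \alpha$, whereas $\beta_\alpha$ decays quickly as the distributions separate. In the $\kappa\beta$ bound this decay is what translates into a large number of distinguishable codewords.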

A. Achievability
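The achievability is established according to the $\kappa\beta$ bound [1, Theorem 25], which shows that, given any output probability distribution $Q$ and any subset of the input space $\mathcal{F}$, for each $0 < \tau < \epsilon < 1$, there exists an $(N_{\mathrm{cw}}, \epsilon)$ code with codewords chosen from $\mathcal{F}$ that satisfies

$$N_{\mathrm{cw}} \ge \frac{\kappa_\tau(\mathcal{F}, Q)}{\sup_{\mathbf{X}\in\mathcal{F}} \beta_{1-\epsilon+\tau}(\mathbf{X}, Q)}.$$

As mentioned above, the alternative distribution is always fixed to $P = P_{\mathbf{Y}\mathbf{H}|\mathbf{X}=\mathbf{X}}$, whereas $Q$ is fixed in our case to the joint distribution of channel and observation when the inputs are standardized complex Gaussian distributed, denoted here by $Q = Q_{\mathbf{Y}\mathbf{H}}$. In order to prove the result, we will first find an upper bound on $\beta_{1-\epsilon+\tau}(\mathbf{X}, Q_{\mathbf{Y}\mathbf{H}})$ and then a lower bound on $\kappa_\tau(\mathcal{F}, Q_{\mathbf{Y}\mathbf{H}})$. In this process, we will have to work with the logarithm of the Radon-Nikodym derivative of the distribution $P_{\mathbf{Y}\mathbf{H}|\mathbf{X}=\mathbf{X}}$ with respect to $Q_{\mathbf{Y}\mathbf{H}}$, which is usually referred to as the information density and can be expressed as the sum $i_N(\mathbf{X}) = \sum_{m=1}^{M} i_m(\mathbf{x}^{(m)})$ in (22), with the per-row terms $i_m(\mathbf{x}^{(m)})$ given in (23), and where we have denoted $y_{m,l}$ (respectively $h_{m,l}$, $x_{m,l}$ and $w_{m,l}$) as the $(m,l)$th entry of the matrix $\mathbf{Y}$ (respectively $\mathbf{H}$, $\mathbf{X}$, $\mathbf{W}$). In particular, we begin by investigating the first three absolute moments of $i_N(\mathbf{X})$ conditioned on the input $\mathbf{X}$, which will be necessary in order to bound $\beta_{1-\epsilon+\tau}(\mathbf{X}, Q_{\mathbf{Y}\mathbf{H}})$.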

1) Moments of the Information Density:
Observe that the stationarity of the channel implies that, for any measurable function $F$, the expectation $\mathbb{E}F(|h_{m,l}|^2)$ is independent of the frequency index $m$. Using this property, we can establish that the normalized expectation of the information density, $D_N(\mathbf{X})$, takes a form that involves the fading only through a generic entry $h_l$ of $\mathbf{h}_l$. We observe that, imposing the power constraint $\|\mathbf{x}_l\|^2 = MP$ on the input, we obtain $D_N(\mathbf{X}) = C(\rho)$ independently of the codeword $\mathbf{X}$, where $C(\rho)$ is the ergodic capacity as defined in (11).
Regarding the variance of $i_N(\mathbf{X})$, using again the stationarity of $\mathbf{h}_l$, we obtain the expression $V_N(\mathbf{X})$ in (24) (recall the definitions of $c(\cdot)$ and $d(\cdot)$ in (2)), where $|\mathbf{x}|^2 = \mathrm{vec}(\mathbf{X} \odot \mathbf{X}^*)$ and where $\boldsymbol{\Phi}_1$ is a block matrix with $(l,l')$th block denoted as $\boldsymbol{\Phi}_1^{l,l'}$. Observe that, contrary to what happens with the expectation of $i_N(\mathbf{X})$, we are not able to eliminate the dependence of the variance on the codeword $\mathbf{X}$ even if we impose the constraint $\|\mathbf{x}_l\|^2 = MP$. Indeed, forcing $\|\mathbf{x}_l\|^2 = MP$ only nulls out the third term of the above expression, but not necessarily the fourth and fifth ones. This is a clear difference with respect to what happens in the AWGN channel, for which these last two terms are identically zero.
Finally, at some point of the development we will need to control the behavior of the Berry-Esseen ratio $B_N(\mathbf{X})$, defined in (25) in terms of the third absolute moments of the terms $i_m(\mathbf{x}^{(m)})$ in (23), where we recall that $\mathbf{x}^{(m)}$ denotes the $m$th row of $\mathbf{X}$. These moments can be bounded, by Jensen's inequality and the fact that $|\mathrm{Re}\,x| \le |x|$, in terms of three quantities that are trivially upper bounded by quantities independent of $m$. Therefore, we can generally write the resulting bound with three positive constants $K_1$, $K_2$, $K_3$ independent of $m$, $l$.
2) Upper Bound on $\beta_{1-\epsilon+\tau}(X, Q_{YH})$:
The upper bounds in [1, Theorem 25] can be adapted to our setting once we note that the conventional Berry-Esseen inequality is not applicable here, because the different terms in the sum in (22) are not independent. However, we can still apply variants of the Berry-Esseen theorem that remain valid under the strong mixing conditions imposed here; see, e.g., [27], [28]. In particular, we can use [29, Theorem 10], which is valid under exponentially decaying ρ-mixing coefficients and provides a relationship with the Berry-Esseen ratio $B_N(X)$ defined in (25). More specifically, a trivial modification of [1, Lemma 58] leads to where we assume here that the argument of $Q^{-1}(\cdot)$ is in the interval $[0, 1]$. Now, we choose $F$ as where $V_N(X)$ and $V_L(\rho)$ are defined in (24) and (9) respectively, and where $\delta_1$, $\delta_2$, $\theta$, $C$ are positive constants with $\theta < 1/12$. We claim that Indeed, observe that if $\|x_l\|^2 = MP$ we can lower bound Therefore, from (26) we can see that

3) Lower Bound on $\kappa_\tau(F, Q_{YH})$:
We can use [11, Lemma 4] to state that, for any input distribution $P_X$, we can write In particular, we can select $P_X$ to be the $L$-fold product of the uniform distribution on the sphere $S^{M-1}$. Using the procedure in [11, eq. (57)-(58)] we find that

Lemma 2: Let $F$ be defined as above and assume that $P_X$ is the $L$-fold uniform distribution on the sphere $S^{M-1}$. When $M \to \infty$, we have $P_X(F) \to 1$.
Proof: We only need to see that as $M \to \infty$. Regarding the convergence of $V_N(X)$, observe that (7) provides a sufficient condition for the spectral norms of $\Phi_1$, $\Phi_2$ to be uniformly bounded in $M$. Indeed, observe that $\Phi_1$ is a block-Toeplitz matrix, and the $(l, l')$th block has spectral norm bounded by where (a) follows from [30, Lemma 4.1], (b) from the triangle inequality and (d) from the ρ-mixing assumption in (7). The same argument shows that $\|\Phi_2\| < +\infty$. Now, using the fact that for $\delta > 0$ we see from the definition of $V_N(X)$ in (24) that it is enough to show that, for any $\delta > 0$, we have as well as as $M \to \infty$. Now, let $G$ denote an $M \times L$ matrix with i.i.d. standardized complex Gaussian entries, with columns denoted as $g_1, \ldots, g_L$ and entries $g_{ml}$. Clearly, $x_l$ has the same distribution as $\frac{g_l}{\|g_l\|} \sqrt{PM}$. To see (30), we define $\xi_l = \|g_l\|^2 / M$, $r_l = g_l \odot g_l^*$ and use the fact that this probability can be upper bounded, using (29), as shown in (32), where $\Phi_1^{l,l'}$ is the $M \times M$ matrix corresponding to the $(l, l')$th block of $\Phi_1$.
Let us first deal with the terms in the above sum such that $l = l'$. Using the Markov inequality together with the fact that $\mathbb{E}(r_l / \xi_l) = \mathbf{1}_M$ and the identity (see [31] for details), we obtain, when $l = l'$, which is clearly of order $O(M^{-1})$. Regarding the terms for Thus, the Markov inequality shows that the first term in (33) converges to zero in probability, concluding the proof of (30).
The proof of (31) follows the same reasoning and is therefore omitted. Together, (30) and (31) imply (27). In order to prove (28), observe that for any $\delta > 0$ we have [15, Proof of Lemma 18] for any $k \in \mathbb{N}$. Noting that $\mathbb{E}|g_{m,l}|^{2k} = k!$ and choosing $k > 1/(2\theta) > 6$ we see that the right hand side converges to 1 as $M \to \infty$.

4) Finalizing the Proof of Achievability:
Combining the two bounds above, we have just proven that, for each $0 < \tau < \epsilon < 1$ and $\delta_1 > 0$, there exists an $(N_{cw}, \epsilon)$ code such that Therefore, by letting $M \to \infty$ we see that and since $\tau$, $\delta_1$ can be made arbitrarily small, the proof is complete.
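The sampling model used in the proof of Lemma 2 can be sketched numerically as follows. This is only an illustration under stated assumptions: a column $x_l$ uniform on the sphere of squared radius $MP$ is drawn by normalizing a standardized complex Gaussian vector $g_l$, and the moment identity $\mathbb{E}|g_{m,l}|^{2k} = k!$ is checked by Monte Carlo. The helper names, the seed and the sample sizes are hypothetical choices, not taken from the paper.

```python
import math
import random


def complex_gaussian(rng):
    """Standardized complex Gaussian: independent N(0, 1/2) real and imaginary parts."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def sphere_codeword_column(M, P, rng):
    """Draw x_l = g_l / ||g_l|| * sqrt(P*M), so that ||x_l||^2 = M*P."""
    g = [complex_gaussian(rng) for _ in range(M)]
    norm = math.sqrt(sum(abs(z) ** 2 for z in g))
    scale = math.sqrt(P * M) / norm
    return [z * scale for z in g]


def empirical_moment(k, n, rng):
    """Monte Carlo estimate of E|g|^(2k); the exact value is k!."""
    return sum(abs(complex_gaussian(rng)) ** (2 * k) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility only
    x = sphere_codeword_column(M=64, P=2.0, rng=rng)
    print("squared norm:", sum(abs(z) ** 2 for z in x))  # close to M*P = 128
    for k in (1, 2, 3):
        print(k, empirical_moment(k, 200_000, rng), math.factorial(k))
```

Note that the peak-amplitude condition $\max_i |(x_l)_i| \leq \sqrt{P} M^\theta$ with $\theta < 1/12$ only becomes likely for very large $M$, so a small-scale simulation of that event is not expected to show the limit.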

B. Converse
The converse can be established along the lines of [15, Theorem 19]. Consider a specific $(N_{cw}, \epsilon)$ code and let $\mathcal{C}$ denote the corresponding codebook. We can assume that the codewords are such that $\|x_l\|^2 = MP$ (otherwise, we can simply add one row to the codeword matrix $X$ so that this constraint is satisfied without affecting the asymptotics of the maximum achievable rate, up to the order that we are considering here [1, Lemma 39]). The maximum number of codewords is thereby bounded by using the meta-converse theorem [1, Theorem 31], which establishes that such an $(N_{cw}, \epsilon)$ code must satisfy for any distribution $Q$ on the channel output space. As in the achievability part, we can choose $Q$ as the joint distribution of channel and observation when the inputs are standardized complex Gaussian distributed, namely $Q = Q_{YH}$ according to the notation above. Using the inequality equivalent to [4, (2.87)] after replacing the Berry-Esseen theorem by [29, Theorem 10], we obtain (for any $\Delta > 0$) where we assume that the argument of $Q^{-1}$ belongs to $(0, 1)$. It has been shown above that if we have $\max_{1 \leq i \leq M} |(x_l)_i| \leq \sqrt{P} M^{\theta}$ for some $0 < \theta < 1/12$ we can guarantee that
$$\frac{\log^2 (LM)}{\sqrt{LM}} B_N(X) \leq C \frac{\log^2 M}{M^{1/2 - 6\theta}}$$
for some $C > 0$ and $M$ sufficiently large. Hence, where we have chosen $\Delta = (LM)^{1/4}$. Now, the problem with the above bound is that we have no control on $\inf_{X \in \mathcal{C}} V_N(X)$, and there might be some codewords in $\mathcal{C}$ for which the channel dispersion is abnormally low. This is technically solved in [15, Theorem 19] by dividing the codebook into two disjoint sets, namely $\mathcal{C} = \mathcal{C}_u \cup \mathcal{C}_l$, where $\mathcal{C}_u = \{X \in \mathcal{C} : V_N(X) > V_L(\rho) - \delta\}$ for some $\delta > 0$, and where $\mathcal{C}_l$ contains the rest of the codewords. The above bound is directly useful to bound $\log |\mathcal{C}_u|$, whereas for $\mathcal{C}_l$ one can follow the strategy in [15, Theorem 19].

APPENDIX B PROOF OF LEMMA 1
The main idea behind the proof is to deal with the different terms of the series in (14) when $\rho \to \infty$. In order to simplify the expressions below, define $\alpha = (1 - \phi^2)^{-1}$ and observe that we can express (from Corollary 1) The key observation lies in the asymptotic behavior of the incomplete gamma function as $\rho \to \infty$. Indeed, the terms with $\Gamma(0, \alpha/\rho) = E_1(\alpha/\rho)$ will scale up as $\log \rho$ (see (17)), whereas for $k \geq 1$, $\Gamma(k, \alpha/\rho)$ will converge to a constant $\Gamma(k)$. By separating $\Gamma(0, \alpha/\rho)$ from the rest of $\Gamma(k, \alpha/\rho)$ with $k > 0$, we can write the cross-covariance as the sum of three where $C(\rho)$ is defined in (11) and where we have introduced the quantities We can analyze the asymptotic behavior of these three terms separately, up to terms $O(\rho^{-1})$.

1) Analysis of $\vartheta_1$:
Observe that we can write where we have defined the two error terms Next, observe that we can bound these two error terms as Therefore, using the asymptotic approximation of $E_1(1/\rho)$ in (17) we can conclude that ϑ 1 (ρ) = 1 + 2 ρ log 2 α − 2 α log α ρ

2) Analysis of $\vartheta_2$:
We first introduce a result that will be useful in order to analyze this term.
Lemma 3: With the above definitions, one can write is the lower incomplete gamma function. Clearly, On the other hand, using the definition in (34) and the integral in [22, 2.484 and the result follows.
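The asymptotic behavior of the incomplete gamma functions that drives this appendix can be checked numerically. The sketch below assumes that (17) is the standard small-argument expansion $E_1(x) \approx -\gamma - \log x$, uses the finite-sum identity for $\Gamma(k, x)$ with integer $k \geq 1$, and picks a hypothetical value $\phi^2 = 0.5$ for the correlation; none of these choices or function names come from the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def exp_integral_e1(x, terms=60):
    """E_1(x) = Gamma(0, x) for x > 0 from its power series (small/moderate x)."""
    total = -EULER_GAMMA - math.log(x)
    term = 1.0
    for n in range(1, terms + 1):
        term *= -x / n          # term now equals (-x)^n / n!
        total -= term / n
    return total


def upper_incomplete_gamma(k, x):
    """Gamma(k, x) for integer k >= 1: (k-1)! * exp(-x) * sum_{j<k} x^j / j!."""
    partial = sum(x ** j / math.factorial(j) for j in range(k))
    return math.factorial(k - 1) * math.exp(-x) * partial


def lower_incomplete_gamma(k, x):
    """Lower incomplete gamma function, gamma(k, x) = Gamma(k) - Gamma(k, x)."""
    return math.gamma(k) - upper_incomplete_gamma(k, x)


if __name__ == "__main__":
    phi_sq = 0.5                      # hypothetical squared correlation coefficient
    alpha = 1.0 / (1.0 - phi_sq)      # alpha = (1 - phi^2)^(-1)
    for rho in (1e1, 1e2, 1e3, 1e4):
        x = alpha / rho
        log_term = math.log(rho) - math.log(alpha) - EULER_GAMMA
        # E_1(alpha/rho) grows like log(rho); Gamma(k, alpha/rho) tends to (k-1)!
        print(rho, exp_integral_e1(x), log_term,
              upper_incomplete_gamma(2, x), upper_incomplete_gamma(3, x))
```

As $\rho$ grows, the second and third printed columns approach each other, while the last two columns approach $\Gamma(2) = 1$ and $\Gamma(3) = 2$, which is the separation between the $k = 0$ term and the $k \geq 1$ terms exploited in the proof.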
In this case, we are able to write where we have defined the error terms Obviously, $\xi_{21}(\rho) = O(\rho^{-2})$. Regarding $\xi_{22}(\rho)$, we use the loose bound Thus, using Lemma 3 we find that On the other hand, following a similar approach and therefore both are of order $O(\rho^{-2})$. A direct use of Lemma 3 and the expansion in (17) shows that Finally, inserting this into the expression of $\vartheta_2(\rho)$ we can conclude that

3) Analysis of $\vartheta_3$:
We note first that we can express where we have the two error terms