On the minimum mean p-th error in Gaussian noise channels and its applications

The problem of estimating an arbitrary random variable from its observation corrupted by additive white Gaussian noise, where the cost function is taken to be the minimum mean p-th error (MMPE), is considered. The classical minimum mean square error (MMSE) is a special case of the MMPE. Several bounds and properties of the MMPE are derived and discussed. As applications of the new MMPE bounds, this paper presents: (a) a new upper bound for the MMSE that complements the `single-crossing point property' for all SNR values below a certain value at which the MMSE is known, (b) an improved characterization of the phase-transition phenomenon which manifests, in the limit as the length of the capacity achieving code goes to infinity, as a discontinuity of the MMSE, and (c) new bounds on the second derivative of mutual information, or the first derivative of MMSE, that tighten previously known bounds.

Using the MMPE, a unifying proof of the SCPP (i.e., one valid for any p) is shown. A complementary bound to the SCPP is then shown, which bounds the MMPE for all SNR values below a certain value at which the MMPE is known.
As a first application of the MMPE, a bound on the conditional differential entropy in terms of the MMPE is provided, which then yields a generalization of the Ozarow-Wyner lower bound on the mutual information achieved by a discrete input on a Gaussian noise channel.
As a second application, the MMPE is shown to improve on previous characterizations of the phase transition phenomenon that manifests, in the limit as the length of the capacity achieving code goes to infinity, as a discontinuity of the MMSE as a function of SNR.
As a final application, the MMPE is used to show new bounds on the second derivative of mutual information, or the first derivative of the MMSE, that tighten previously known bounds important in characterizing the bandwidth-power trade-off in the wideband regime.

I. INTRODUCTION
In the Bayesian setting the Minimum Mean Square Error (MMSE) of estimating a random variable X from an observation Y is understood as a cost function (also commonly termed a risk function) with a quadratic loss function (i.e., the squared L2 norm). Another commonly used cost function is based on the L1 norm, with the loss function given by the absolute value of the error (i.e., the difference between the variable of interest and its estimate). In general, cost functions with non-quadratic loss functions are not well understood and have been considered only in special cases, such as under the assumption of Gaussian statistics.
The interplay between estimation theoretic and information theoretic measures has been very fruitful; for example, the so-called I-MMSE relationship [1], which relates the derivative of the mutual information with respect to the Signal-to-Noise Ratio (SNR) to the MMSE, has found numerous applications throughout information theory [2]. The goal of this work is to show that the study of estimation problems with non-quadratic loss functions can also offer new insights into classical information theoretic problems. The program of this paper is thus to develop the necessary theory for a class of loss functions, and then apply the developed tools to information theoretic problems.

A. Past Work
The popularity of the MMSE stems from its analytical tractability, which is rooted in the fact that the MMSE is defined through the L2 norm in (1b). The L2 norm, in turn, allows applications of the well understood Hilbert space theory [3]. In information theoretic applications the L2 norm is used, for example, to define an average input power constraint. The connection between the power constraint and the L2 norm leads to a continuous analog of Fano's inequality that relates the conditional differential entropy and the MMSE [4, Theorem 8.6.6].
Recently, in view of the I-MMSE relationship [1], the MMSE (in an Additive White Gaussian Noise (AWGN) channel) has received considerable attention. For example, in [5] the I-MMSE relationship was used to give a simple alternative proof of the Entropy Power Inequality (EPI) [6]. Moreover, the so-called 'Single-Crossing-Point Property' (SCPP) [7], [8], which bounds the MMSE for all SNR values above a certain value at which the MMSE is known, together with the I-MMSE relationship, offers an alternative, unifying framework for deriving information theoretic converses. It was used: in [7] to provide an alternative proof of the converse for the Gaussian broadcast channel (BC) and to show a special case of the EPI; in [9] to provide a simple proof for the information combining problem and a converse for the BC with confidential messages; in [8], by using various extensions of the SCPP, to prove a special case of the vector EPI, a converse for the capacity region of the parallel degraded BC under per-antenna power constraints and under an input covariance constraint, and a converse for the compound parallel degraded BC under an input covariance constraint; and in [10] to provide a converse for communication under an MMSE disturbance constraint.
In [11] we demonstrated a complementary bound to the SCPP, which bounds the MMSE for all SNR values below a certain value at which the MMSE is known, and allows for a finer characterization of the phase transition phenomenon that manifests, as the length of the codeword goes to infinity, as a discontinuity of the MMSE as a function of SNR. This plays an important role in characterizing achievable rates of capacity achieving codes [12], [13].
One of the applications of the tools presented in this work is an improvement on the bound in [11, Theorem 1].

July 7, 2016 DRAFT

Many other properties of the MMSE in relation to the I-MMSE have been studied in [7], [14], [15], [16]. For a comprehensive survey of results, applications and extensions of the I-MMSE relationship we refer the reader to [9].
While the MMSE has received considerable attention and is well understood, non-quadratic cost functions are only understood in special cases, such as under the assumption of Gaussian statistics. For example, in [17] it was shown that under scalar Gaussian statistics, for a large class of symmetric loss functions the optimal linear MMSE (LMMSE) estimator is also optimal. The result of [17] was extended in [18] to a large class of cost functions that also include asymmetric loss functions. Other early work in this direction includes also [19].
In [20], the authors studied the expected L ∞ norm of the error, when the input is assumed to be a Gaussian mixture. The authors showed that, as the dimension of the signal goes to infinity, the optimal LMMSE estimator minimizes the expected maximum error.
In [21] and [22] the authors studied classes of loss functions that are even and nondecreasing, or even and convex, and gave a sufficient condition on the conditional distribution of the input X given the output Y under which the conditional expectation E[X|Y] is the optimal estimator.
In [23], the authors studied a scalar additive noise channel with an L_p cost function and derived a necessary and sufficient condition on the noise and input distributions that guarantees that the optimal estimator is linear. Moreover, if the source and noise variances are equal, then the optimal estimator is linear if and only if the input and noise distributions are identical.
In [24] and [25] the authors considered the problem of transmitting a modulated signal over a discrete memoryless channel where the performance criterion was taken to be the L p cost function. To that end, the authors showed tight exponential bounds for very small and very large values of p.
In [26] the authors focused on designing an appropriate cost function such that the output of the trained model approximates the desired summary statistics, such as the conditional expectation, the geometric mean or the variance.
In non-Bayesian estimation, L_p cost functions have been considered in [27] and [28] in the context of minimax estimation, where the authors gave lower and upper bounds on the exponential behavior of the cost function. For a non-Bayesian treatment of non-quadratic cost functions we refer the reader to [29].
Looking into non-quadratic cost functions is further motivated by the fact that often the quadratic cost function may not be the correct measure of signal fidelity for certain applications.
This is especially true in image processing where error metrics, more sensitive to structural changes of the input signal, better capture human perceptions of quality. We refer the reader to [30] for a survey of recent results in this direction.

B. Paper Outline and Main Contributions
In this work we are interested in studying a cost function, termed the Minimum Mean p-th Error (MMPE), the scalar version of which is given by mmpe(X|Y; p) := inf_f E[|X − f(Y)|^p], where the infimum is over all estimators f(Y).
Our contributions are as follows: 1) In Section II we formally define the vector version of the MMPE in (2) and introduce related definitions.
2) In Section III we study properties of the optimal MMPE estimator and show: • In Section III-A, Proposition 1 shows that the MMPE optimal estimator indeed exists; • In Section III-B, Proposition 2 derives an orthogonality-like principle that serves as a necessary and sufficient condition for an estimator to be MMPE optimal; • Section III-C gives examples of optimal MMPE estimators. In particular, in Proposition 3 we find the MMPE for Gaussian random vectors, and in Proposition 4 for discrete binary random variables; and • In Section III-D, Proposition 5 shows some basic properties of the optimal MMPE estimator in terms of the input distribution, such as linearity, stability, degradedness, etc.
Moreover, via an example it is shown that in general the MMPE optimal estimator is biased on average (i.e., the first moment of the error (bias) is not zero). However, it is shown that the p-th order estimator is unbiased on average in the sense that the (p − 1)-th moment of the error is zero. (The abbreviation MMPE has been used before in [9, Chapter 8] for the Minimum Mean Poisson Error.)
3) In Section IV we study properties of the MMPE as a function of the order p, the SNR and the input distribution that will be useful in a number of applications: • In Section IV-A, Proposition 6 shows that the MMPE is invariant under translations of the input random vector and derives basic scaling properties; • In Section IV-B, Proposition 7 shows that, as far as estimation error over the channel Y = √snr X + Z is concerned, the estimation of the input X is equivalent to the estimation of the noise Z; and • In Section IV-C, Proposition 8 gives a 'change of measure' result that allows one to take the expectation in the definition of the MMPE with respect to an output at a different SNR.

4) In Section V we discuss basic bounds on the MMPE and show: • In Section V-A, Proposition 10 develops basic ordering bounds between MMPEs of different orders, as well as bounds analogous to the LMMSE bound; • In Section V-B, Proposition 11 shows that, under an appropriate moment constraint on the input distribution, the Gaussian input is asymptotically the 'hardest' to estimate; • In Section V-C, Proposition 12 derives interpolation bounds for the MMPE; • In Section VIII-C, Theorem 4 improves the previous characterization of the width of the phase transition region for finite-length codes of length n from O(1/n) in [11] to O(1/√n), which in turn also improves the converse result for the communication under disturbance constraint problem studied in [11]; and • In Section VIII-D, Proposition 21 shows how the MMPE can be used to provide new lower and upper bounds on the derivative of the MMSE.

C. Notation
Throughout the paper we adopt the following notational conventions:
• Deterministic scalar and vector quantities are denoted by lower case and bold lower case letters, respectively. Matrices are denoted by bold upper case letters;
• Random variables and vectors are denoted by upper case and bold upper case letters, respectively, where r.v. is short for either random variable or random vector, which should be clear from the context;
• The symbol | · | may denote different things: |A| is the determinant of the matrix A, |A| is the cardinality of the set A, |X| is the cardinality of supp(X), and |x| is the absolute value of the real-valued x;
• E[·] denotes the expectation operator;
• We denote the covariance of the r.v. X by K_X;
• X ∼ N(m, K_X) denotes a real-valued Gaussian r.v. X with mean vector m and covariance matrix K_X;
• The identity matrix is denoted by I;
• Reflection of the matrix A along its main diagonal, or the transpose operation, is denoted by A^T;
• The trace operation on the matrix A is denoted by Tr(A);
• The order notation A ⪰ B means that A − B is a positive semidefinite matrix;
• log(·) denotes logarithms in base 2;
• [n_1 : n_2] is the set of integers from n_1 to n_2 ≥ n_1;
• For x ∈ R we let ⌊x⌋ denote the largest integer not greater than x;
• For x ∈ R we let [x]^+ := max(x, 0) and log^+(x) := [log(x)]^+;
• Let f(x), g(x) be two real-valued functions. We use the Landau notation f(x) = O(g(x)) to mean that for some c > 0 there exists an x_0 such that f(x) ≤ c g(x) for all x ≥ x_0, and f(x) = o(g(x)) to mean that for every c > 0 there exists an x_0 such that f(x) < c g(x) for all x ≥ x_0;
• We denote the conditional r.v.
X|Y = y ∼ p_X|Y(·|y) as X_y;
• We denote the upper incomplete gamma function and the gamma function by Γ(a; x) := ∫_x^∞ t^(a−1) e^(−t) dt and Γ(a) := Γ(a; 0), respectively. The generalized Q-function is denoted by Q̄(a; x) := Γ(a; x)/Γ(a). In particular, the generalized Q-function can be related to the standard Q-function by using the relationships Q(√(2x)) = (1/(2√π)) Γ(1/2; x) and Γ(1/2) = √π, as Q̄(1/2; a²) = 2Q(√2 a); and
• We define the volume of the region S embedded in R^n as Vol(S) := ∫_S dx. In particular, the volume of the n-dimensional ball B(r) of radius r centered at the origin is given by Vol(B(r)) = π^(n/2) r^n / Γ(n/2 + 1).
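The last two notational identities are easy to sanity-check numerically. The sketch below (our own, not from the paper; the integration routine and tolerances are ours) approximates Γ(1/2; x) by quadrature to verify Q̄(1/2; a²) = 2Q(√2 a), and checks the ball-volume formula against the familiar n = 2 and n = 3 cases.

```python
import math

def upper_inc_gamma(a, x, steps=200_000, tmax=60.0):
    # Midpoint-rule approximation of Gamma(a; x) = integral_x^inf t^(a-1) e^(-t) dt,
    # truncated at t = tmax (the tail beyond tmax is negligible here).
    h = (tmax - x) / steps
    return sum((x + (i + 0.5) * h) ** (a - 1.0) * math.exp(-(x + (i + 0.5) * h))
               for i in range(steps)) * h

def Q(z):
    # Standard Gaussian tail probability.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def ball_volume(n, r):
    # Vol(B(r)) = pi^(n/2) r^n / Gamma(n/2 + 1)
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

a = 1.3
lhs = upper_inc_gamma(0.5, a ** 2) / math.gamma(0.5)  # generalized Q-bar(1/2; a^2)
rhs = 2.0 * Q(math.sqrt(2.0) * a)                     # 2 Q(sqrt(2) a)
```

Both sides reduce to erfc(a), since Γ(1/2; x) = √π · erfc(√x).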

II. COST FUNCTION DEFINITION
Motivated by the study of cost functions with non-quadratic error we define the following norm.
Definition 1. For the r.v. U ∈ R^n and p > 0, let ‖U‖_p := (E[‖U‖^p])^(1/p), (5) where ‖·‖ denotes the Euclidean norm. For p ≥ 1 the function in (5) defines a norm and obeys the triangle inequality, as shown in Appendix A. Therefore, throughout the paper we define the L_p space, for p ≥ 1, as the space of r.v.'s on a fixed probability space (Ω, σ(Ω), P) for which the norm in (5) is finite.
However, many of our results will hold for 0 ≤ p < 1, for which (5) is not a norm.
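The failure of the triangle inequality for p < 1 can be seen on a minimal example of our own (not from the paper): two indicator r.v.'s of complementary events, for which the scalar specialization of (5) can be computed exactly.

```python
# Scalar specialization of the norm in (5): ||U||_p = (E|U|^p)^(1/p),
# computed exactly for a finitely supported r.v. given as (value, prob) pairs.
def norm_p(pmf, p):
    return sum(pr * abs(v) ** p for v, pr in pmf) ** (1.0 / p)

# U = indicator of heads, V = indicator of tails (fair coin), so U + V = 1 a.s.
U = [(1.0, 0.5), (0.0, 0.5)]
V = [(0.0, 0.5), (1.0, 0.5)]
S = [(1.0, 1.0)]  # distribution of U + V

tri = {p: (norm_p(S, p), norm_p(U, p) + norm_p(V, p)) for p in (0.5, 1.0, 2.0)}
# For p = 0.5: ||U+V||_p = 1 while ||U||_p + ||V||_p = 2^(1-1/p) = 0.5,
# so the triangle inequality fails; for p >= 1 it holds.
```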
In particular, for Z ∼ N(0, I) the norm in (5) is given by ‖Z‖_p^p = 2^(p/2) Γ((n+p)/2) / Γ(n/2), (7) and for V uniform over the n-dimensional ball of radius r the norm in (5) is given by ‖V‖_p^p = n r^p / (n + p). Note that for n = 1 we have that ‖U‖_p^p = E[|U|^p] and therefore from now on we will refer to ‖U‖_p^p as the p-th moment of U. Naturally, for n > 1, there are many other ways of defining the moments; see for example [32]. However, in view of the information theoretic problems we are interested in, such as, for example, those from previous work [11], the definition in (5) arises naturally.

Definition 2. We define the minimum mean p-th error (MMPE) of estimating X from Y as mmpe(X|Y; p) := inf_f E[‖X − f(Y)‖^p], (9) where the minimization is over all possible Borel measurable functions f(Y). Whenever the optimal MMPE estimator exists, we shall denote it by f_p(X|Y). We shall write mmpe(X|Y; p) = mmpe(X, snr, p) if Y and X are related as Y = √snr X + Z, where Z, X, Y ∈ R^n, Z ∼ N(0, I) is independent of X, and snr ≥ 0 is the SNR. When it is necessary to emphasize the SNR at the output Y, we will denote it by Y_snr. Since the distribution of the noise is fixed, mmpe(X|Y; p) is completely determined by the distribution of X and by snr, and there is no ambiguity in using the notation mmpe(X, snr, p). Applications to the Gaussian noise channel will be the main focus of this paper.
For p = 2, the MMPE reduces to the MMSE, that is, mmpe(X|Y; 2) = mmse(X|Y) and mmpe(X, snr, 2) = mmse(X, snr). Note that there are other ways of defining the loss function in (9); our definition in (9) is motivated by the following: • For X ∈ R^1 the error in (9) reduces to a natural expression with loss function given by |x − f(y)|^p; • The definition in (9) naturally appears in applications of Hölder's or Jensen's inequalities to mmse(X|Y); and • The norm in (5) used in the definition of (9) can be related to information theoretic quantities, such as differential entropy and Rényi entropy, via the vector moment entropy inequality from [34].
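The p = 2 reduction is easy to verify by Monte Carlo simulation (a sketch of our own; the seed and sample size are arbitrary choices): for scalar X ∼ N(0, 1), the conditional mean is E[X|Y] = (√snr/(1+snr)) Y, and mmpe(X, snr, 2) = mmse(X, snr) = 1/(1+snr).

```python
import math
import random

random.seed(0)
snr, N = 3.0, 200_000
a = math.sqrt(snr) / (1.0 + snr)   # coefficient of the conditional mean E[X|Y]
err2 = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = math.sqrt(snr) * x + random.gauss(0.0, 1.0)   # Y = sqrt(snr) X + Z
    err2 += (x - a * y) ** 2
mmse_mc = err2 / N          # Monte Carlo estimate of mmpe(X, snr, 2)
mmse_cf = 1.0 / (1.0 + snr) # closed form: 1/(1+snr) = 0.25 at snr = 3
```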
We shall also look at the p-th error achieved by the suboptimal (unless p = 2) estimator E[X|Y], that is, ‖X − E[X|Y]‖_p^p, which represents higher order moments of the MMSE loss function and serves (see below) as an upper bound on (9).

A. Existence of Optimal Estimator
It is important to point out that ‖X − E[X|Y]‖_p^p in general is not equal to the MMPE, as E[X|Y] might not be the optimal estimator under the p-th norm. The first result of this section shows that for the AWGN channel the optimal estimator f_p(X|Y = y) indeed exists.
Proposition 1. For mmpe(X, snr, p) with p > 0 and snr > 0, the optimal estimator is given by the point-wise relationship f_p(X|Y = y) = arg min_(v ∈ R^n) E[‖X − v‖^p | Y = y]. (13) The proof follows from a chain of inequalities showing that no estimator can improve on the point-wise minimizer in (13). This concludes the proof.

B. Orthogonality-like Property
The MMPE for p ≠ 2 differs from the MMSE in a number of aspects. The main difference is that the norm defined in (5) is not a Hilbert space norm in general (unless p = 2); as a result, there is no notion of inner product or orthogonality, and f_p(X|Y), unlike E[X|Y], can no longer be thought of as an orthogonal projection. Therefore, the orthogonality principle, an important tool in the analysis of the MMSE, is no longer available when studying the MMPE for p ≠ 2.
However, an orthogonality-like property can indeed be shown for the MMPE.
Proposition 2. For p ≥ 1, an estimator f_p(X|Y) is MMPE optimal if and only if E[‖W‖^(p−2) W^T g(Y)] = 0 (16a) for any deterministic function g : R^n → R^n, where W = X − f_p(X|Y). Moreover, for 0 ≤ p < 1 the condition in (16a) is necessary for optimality.
Proof: See Appendix C.
Note that Proposition 2 for n = 1 and p ∈ R_+ reduces to E[|W|^(p−1) sign(W) g(Y)] = 0, which for p/2 ∈ N further reduces to E[W^(p−1) g(Y)] = 0. Moreover, for p = 2 Proposition 2 reduces to the familiar orthogonality principle E[W g(Y)] = 0; assuming two distinct optimal estimators and applying this principle to their difference leads to a contradiction, which implies that E[X|Y] is the unique estimator up to a set of measure zero.
In [23, Lemma 1], by replicating the above argument and by assuming that p/2 ∈ N and n = 1, it was shown that the optimal MMPE estimator is unique. However, since the proof relies heavily on the assumption that p/2 ∈ N and n = 1, this argument cannot be extended in a straightforward way to p ∈ R_+ or n > 1.
However, uniqueness of the MMPE optimal estimator can be shown for p > 1 (i.e., strictly convex loss functions) by using Proposition 1 in conjunction with [29, Corollary 4.1.4].

C. Examples of Optimal MMPE Estimators
In general we do not have a closed form solution for the MMPE optimal estimator in (13).
Interestingly, the optimal estimator for Gaussian inputs can be found and is linear for all p ≥ 1.
Note that similar results have been demonstrated in [17] and [23] for scalar Gaussian inputs.
Next we extend this result to vector inputs and give two alternative proofs of the linearity of the optimal MMPE estimator for Gaussian inputs, via Proposition 1 and via Proposition 2.
Proposition 3. For X_G ∼ N(0, I) we have mmpe(X_G, snr, p) = ‖Z‖_p^p / (1 + snr)^(p/2), with optimal estimator given by f_p(X_G|Y) = (√snr/(1 + snr)) Y. Proof: The proof follows by observing that Ẑ := X_G − (√snr/(1 + snr)) Y has a Gaussian distribution and is independent of Y. So, for any two functions f(·) and g(·) we have E[f(Ẑ) g(Y)] = E[f(Ẑ)] E[g(Y)]. (22) Therefore, by using (22) for the estimator f_p(X_G|Y = y) = (√snr/(1 + snr)) y, the necessary and sufficient conditions in Proposition 2 hold, and thus the linear estimator must be an optimal one. Finally, observe that the resulting error is ‖Ẑ‖_p^p = ‖Z‖_p^p / (1 + snr)^(p/2), where we have used Ẑ = X_G − (√snr/(1 + snr)) Y ∼ N(0, (1/(1 + snr)) I). For a proof that uses only Proposition 1 see Appendix D.
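Proposition 3 can be checked numerically for n = 1 and p = 4 (a sketch of our own; parameters are arbitrary): the error of the linear estimator is Ẑ ∼ N(0, 1/(1+snr)), so mmpe(X_G, snr, 4) = E|Ẑ|⁴ = 3/(1+snr)².

```python
import math
import random

random.seed(1)
snr, N = 3.0, 400_000
a = math.sqrt(snr) / (1.0 + snr)   # linear MMPE-optimal coefficient (Prop. 3)
acc = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = math.sqrt(snr) * x + random.gauss(0.0, 1.0)
    acc += (x - a * y) ** 4        # fourth-power error of the linear estimator
mmpe4_mc = acc / N
# E|N(0, s^2)|^4 = 3 s^4 with s^2 = 1/(1+snr); here 3/16 = 0.1875
mmpe4_cf = 3.0 / (1.0 + snr) ** 2
```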
The optimal MMPE estimator is in general a function of p as shown next.
Proposition 4. For a discrete binary input X ∈ {x_1, x_2} with P[X = x_1] = q = 1 − P[X = x_2], the optimal estimator f_p(X|Y = y) is given by the expression in (24a); in particular, for p = 1, it reduces to a hard-decision rule. Proof: See Appendix E.

Proposition 4 will be useful in demonstrating several examples and counter examples in the
following sections. Note that for the practically relevant case of BPSK modulation, i.e., x_1 = −x_2 = 1 and q = 1/2, the optimal estimator in (24a) simplifies considerably, and for p = 1 it is the hard decision decoder f_1(X|Y = y) = sign(y). By Proposition 4 we can show that the orthogonality principle only holds for p = 2 (when the MMPE corresponds to the MMSE), as shown in Fig. 1a, where we plot h(p) vs. p for a BPSK input and observe that it is zero only for p = 2.

Proposition 5. For any p > 0 the optimal MMPE estimator satisfies basic properties (linearity, stability, degradedness, etc.) with respect to the input distribution, stated for any deterministic function g(·). Proof: See Appendix F.

Note that the optimal MMPE estimator is in general biased. This comes as no surprise as it is very common in Bayesian estimation that the optimal estimator is biased [35].
However, the optimal MMPE estimator is unbiased in the sense that the (p − 1)-th moment of the bias is zero. This can be seen from the orthogonality-like property in Proposition 2 by taking g(Y) to be the vector of all ones, which yields E[‖W‖^(p−2) W] = 0.
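For the BPSK example above, the p = 1 optimality of the hard decision rule can also be seen by simulation (a sketch of our own; sample sizes are arbitrary): the conditional mean tanh(√snr · y) is L2-optimal, yet incurs a strictly larger mean absolute error than sign(y).

```python
import math
import random

random.seed(2)
snr, N = 1.0, 200_000
l1_hard = l1_mean = 0.0
for _ in range(N):
    x = random.choice((-1.0, 1.0))                     # equiprobable BPSK input
    y = math.sqrt(snr) * x + random.gauss(0.0, 1.0)
    l1_hard += abs(x - math.copysign(1.0, y))          # f_1(X|Y=y) = sign(y)
    l1_mean += abs(x - math.tanh(math.sqrt(snr) * y))  # E[X|Y=y] = tanh(sqrt(snr) y)
l1_hard /= N   # equals 2 Q(sqrt(snr)) up to Monte Carlo noise
l1_mean /= N   # strictly larger: the L2-optimal estimator is suboptimal for p = 1
```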

IV. PROPERTIES OF THE MMPE
In this section we explore properties of the MMPE as a function of SNR and of the input distribution.

A. Basic Properties
The next two properties of the MMPE directly follow from the properties of f p (X|Y) in Proposition 5.
Proposition 6 implies that the MMPE, like the MMSE, is invariant under translations, and that scaling the input results in scaling the SNR and the error.

B. Estimation of the Input is Equivalent to Estimation of the Noise
The following lemma is commonly applied in the analysis of the MMSE.
Lemma 1 states that estimating the noise is equivalent to estimating the input signal when one uses the conditional expectation as an estimator, i.e., √snr E[X|Y] + E[Z|Y] = Y.
Next we show that an equivalent statement holds for the MMPE.

Proposition 7. For any X, p > 0 and snr > 0, we have mmpe(Z|Y; p) = snr^(p/2) mmpe(X|Y; p), (30a) with f_p(Z|Y) = Y − √snr f_p(X|Y). (30b) Proof: From the definition of the MMPE in (9), and using the substitution g(Y) = (Y − f(Y))/√snr, which gives Z − f(Y) = √snr (g(Y) − X), we have inf_f E[‖Z − f(Y)‖^p] = snr^(p/2) inf_g E[‖X − g(Y)‖^p]. This shows the equality in (30a). Moreover, since f_p(X|Y) exists and the infimum in (31) is attainable by Proposition 1, so is the infimum in (32). Therefore, from (32) we have that f_p(Z|Y) exists and is given by f_p(Z|Y) = Y − √snr f_p(X|Y), which leads to (30b). This concludes the proof.
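The identity in (30) ties the two errors deterministically, sample by sample: with f_p(Z|Y) = Y − √snr f_p(X|Y), each noise-estimation error equals −√snr times the input-estimation error. The sketch below (our own, for a Gaussian input where the optimal estimator is known from Proposition 3) checks that the p-th errors agree up to floating-point error.

```python
import math
import random

random.seed(3)
snr, p, N = 2.0, 4.0, 50_000
ex = ez = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    z = random.gauss(0.0, 1.0)
    y = math.sqrt(snr) * x + z
    fx = math.sqrt(snr) / (1.0 + snr) * y   # optimal estimate of X (Prop. 3)
    fz = y - math.sqrt(snr) * fx            # induced estimate of Z, as in (30b)
    ex += abs(x - fx) ** p
    ez += abs(z - fz) ** p
ex /= N
ez /= N   # equals snr**(p/2) * ex exactly, since z - fz = -sqrt(snr)(x - fx)
```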

C. Change of Measure
The next result enables us to change the expectation from Y snr to Y snr 0 in (9) whenever snr ≤ snr 0 . This is particularly useful when we know the MMPE, or the structure of the optimal MMPE estimator, at one SNR value but not at another smaller SNR value.
One must be careful when evaluating Proposition 8. For example, since the exponential weighting factor in (35) tends to zero pointwise as snr → 0, at first glance it appears that the expectation on the right-hand side of (35) is zero while mmpe(X, 0, p) is not, thus violating the equality. However, a more careful examination shows that when snr → 0 the limit and expectation in (35) cannot be exchanged; indeed, the expectation stays bounded away from zero, which can be seen by using the moment generating function of the r.v. Z². As an example, Proposition 8 for X ∼ N(0, 1) with the optimal linear estimator from Proposition 3, i.e., f(y) = a·y for some a, evaluates to the expected closed form, where the equalities follow from: a) linearity of expectation and the fact that Z and X are independent; and b) evaluating the moment generating function of Z².

V. BOUNDS ON THE MMPE
In this section we develop bounds on the MMPE, many of which generalize well known MMSE bounds. However, we also show bounds that are unique to the MMPE and emphasize the usefulness of the MMPE.

A. Extension of Basic MMSE Bounds
An important upper bound on the MMSE often used in practice is the LMMSE.
Proposition 10. For snr ≥ 0, 0 < q ≤ p, and any input X, we have mmpe^(1/q)(X, snr, q) ≤ mmpe^(1/p)(X, snr, p) and mmpe(X, snr, p) ≤ ‖Z‖_p^p / snr^(p/2), where ‖Z‖_p^p is given in (7).
It is interesting to point out that in the derivation of the bounds in Proposition 10 no assumption is placed on the distribution of Z, and thus the bounds hold in great generality. If Z is composed of independent identically distributed (i.i.d.) Gaussian elements, then the moment ‖Z‖_p^p in Proposition 10 can be tightly approximated in terms of factorials, with equality for even n and integer p/2. It is not difficult to check that for p = 2 Proposition 10 reduces to Proposition 9. The reason that the bounds on ‖X − E[X|Y]‖_p are only available for p ≥ 2, while the bounds on mmpe(X, snr, p) are available for p ≥ 0, is that the proof of the bound in (37b) uses Jensen's inequality, which requires p ≥ 2, while the proof of the bound in (37d) does not.
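The Gaussian moment ‖Z‖_p^p = 2^(p/2) Γ((n+p)/2)/Γ(n/2) from (7) is easy to verify by simulation; the sketch below (our own; n, p and the sample size are arbitrary choices) compares the closed form against a Monte Carlo estimate of E‖Z‖^p.

```python
import math
import random

random.seed(4)
n, p, N = 3, 3.0, 200_000

def gauss_moment(n, p):
    # E ||Z||^p for Z ~ N(0, I_n): 2^(p/2) Gamma((n+p)/2) / Gamma(n/2)
    return 2.0 ** (p / 2.0) * math.gamma((n + p) / 2.0) / math.gamma(n / 2.0)

acc = 0.0
for _ in range(N):
    r2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))  # ||Z||^2
    acc += r2 ** (p / 2.0)
mc = acc / N
cf = gauss_moment(n, p)   # about 6.383 for n = 3, p = 3
```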

B. Gaussian Inputs are the Hardest to Estimate
Note that the bounds in Proposition 10 are similar to the bound in (36a) and blow up at snr = 0 + . Therefore, it is desirable to have bounds as in (36b). The next result demonstrates such a bound and shows that Gaussian inputs are asymptotically the hardest to estimate.
Proof: See Appendix I.

C. Interpolation Bounds and Continuity
One of the key advantages of using the MMPE is that the MMPE of order q can be tightly predicted from knowledge of the MMPE at a lower order p and a higher order r. At the heart of this analysis is the interpolation result for L_p spaces [36]: given 0 < p ≤ q ≤ r and α ∈ (0, 1) such that 1/q = α/p + (1 − α)/r, the q-th norm can be bounded as ‖U‖_q ≤ ‖U‖_p^α · ‖U‖_r^(1−α), which implies that the norm is log-convex and thus a continuous function of p [37, Theorem 5.1.1]. Next, we present several interpolation results for the MMPE.
Finally, the bounds in (41e) and (41f) follow by choosing f (Y) in (41d) equal to f r (X|Y) and f p (X|Y) respectively. This concludes the proof.
From log-convexity we can deduce continuity.
where the last inequality is due to the continuity of the norm.
An interesting question is whether the interpolation inequality in (43) holds instead of (41e) and (41f). A counterexample to the interpolation inequality in (43) is shown in Fig. 2, where we take a binary input X ∈ {±1} equally likely, p = 2, r = 8, and snr = 1. This shows that (43) is not true in general.
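While (43) fails for the MMPE, the underlying norm interpolation inequality always holds. The sketch below (our own two-point example) checks it exactly: with 1/q = α/p + (1−α)/r, ‖U‖_q ≤ ‖U‖_p^α · ‖U‖_r^(1−α).

```python
def norm_p(pmf, p):
    # ||U||_p = (E|U|^p)^(1/p) for a finitely supported scalar r.v.
    return sum(pr * abs(v) ** p for v, pr in pmf) ** (1.0 / p)

U = [(1.0, 0.9), (10.0, 0.1)]   # heavy-ish tailed two-point distribution
p, r = 1.0, 4.0
checks = []
for alpha in (0.25, 0.5, 0.75):
    q = 1.0 / (alpha / p + (1.0 - alpha) / r)   # interpolated order, p <= q <= r
    lhs = norm_p(U, q)
    rhs = norm_p(U, p) ** alpha * norm_p(U, r) ** (1.0 - alpha)
    checks.append(lhs <= rhs + 1e-12)
```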

D. Bounds on Discrete Inputs
So far, by using Proposition 10, we have shown that the MMPE, as a function of snr, decreases as O(snr^(−p/2)). Next we show that the MMPE can decrease exponentially in snr; such behavior has already been observed for the MMSE in [38] and [15]. Proof: See Appendix J.
The exponential behavior of the MMPE of discrete inputs can be clearly seen for the case n = 1 as follows. By using Q̄(1/2; a²) = 2Q(√2 a), the bound reduces to a standard Gaussian tail term, and the last inequality follows from the Chernoff bound Q(x) ≤ e^(−x²/2).
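Both ingredients of this argument can be illustrated numerically (a sketch of our own; the 4·exp(−snr/2) bound below combines the hard-decision error 4Q(√snr) with the Chernoff bound, and is not the paper's exact expression):

```python
import math
import random

random.seed(5)

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Chernoff bound: Q(x) <= exp(-x^2 / 2)
chernoff_ok = all(Q(x) <= math.exp(-x * x / 2.0) for x in (0.0, 0.5, 1.0, 2.0, 4.0))

# Exponential decay in snr for a BPSK input: the MMSE is upper bounded by the
# hard-decision error 4 Q(sqrt(snr)) <= 4 exp(-snr/2).
N = 100_000
decay_ok = True
for snr in (1.0, 4.0, 9.0):
    err = 0.0
    for _ in range(N):
        x = random.choice((-1.0, 1.0))
        y = math.sqrt(snr) * x + random.gauss(0.0, 1.0)
        err += (x - math.tanh(math.sqrt(snr) * y)) ** 2   # E[X|Y] for BPSK
    decay_ok = decay_ok and (err / N <= 4.0 * math.exp(-snr / 2.0))
```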
Having developed bounds on the MMPE of discrete inputs, we are now in the position to demonstrate a phase transition phenomenon, that is, we show that as n → ∞ the MMPE becomes a discontinuous function of the SNR.
Proof: For X_D ∈ {±1}, combining the expression for the MMPE with the following well-known limits [39], and in light of the limit in (48), we have that the bound in (46) holds. This concludes the proof.

VI. CONDITIONAL MMPE
We define the conditional MMPE as follows.
Definition 3. For any X and U, the conditional MMPE of X given U is defined as mmpe(X|Y, U; p) := inf_f E[‖X − f(Y, U)‖^p]. (50) The conditional MMPE in (50) reflects the fact that the optimal estimator is given additional information in the form of U. Note that when Z is independent of (X, U) we can write the conditional MMPE for X_u ∼ P_X|U(·|u) as mmpe(X, snr, p|U) = ∫ mmpe(X_u, snr, p) dP_U(u).
Since giving extra information does not increase the estimation error, we have the following result.
Proposition 17. Let U = Y_∆ = √∆ X + Z_∆, where (X, Z, Z_∆) are mutually independent. Then mmpe(X, snr_0, p|U) = mmpe(X, snr_0 + ∆, p). Proof: Since Z_∆ and Z are independent, by using maximal ratio combining we have that Y_(snr_0+∆) := (√snr_0 Y_(snr_0) + √∆ Y_∆)/√(snr_0 + ∆) = √(snr_0 + ∆) X + W, where W ∼ N(0, I). Next, by using the same argument as in [9, Proposition 3.4], we have that the conditional probabilities satisfy p(x | y_(snr_0), y_∆) = p(x | y_snr) for y_snr = (√∆ y_∆ + √snr_0 y_(snr_0))/√(snr_0 + ∆). The equivalence of the posterior probabilities implies that the estimation of X from Y_snr is as good as the estimation of X from (Y_(snr_0), Y_∆). This concludes the proof.
Proposition 17 and Proposition 16 imply that, for fixed X and p, mmpe(X, snr_0, p) ≥ mmpe(X, snr_0, p | Y_∆) = mmpe(X, snr_0 + ∆, p), and we have the following: Corollary 2. mmpe(X, snr, p) is a non-increasing function of snr.
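The maximal-ratio-combining step used in the proof of Proposition 17 can be verified by simulation (a sketch of our own; parameters are arbitrary): the combined observation (√snr_0 Y_(snr_0) + √∆ Y_∆)/√(snr_0 + ∆) equals √(snr_0 + ∆) X plus a residual noise W of zero mean and unit variance.

```python
import math
import random

random.seed(6)
snr0, delta, N = 1.0, 2.0, 100_000
ws = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y0 = math.sqrt(snr0) * x + random.gauss(0.0, 1.0)   # Y_snr0
    yd = math.sqrt(delta) * x + random.gauss(0.0, 1.0)  # Y_delta
    # maximal ratio combining of the two observations
    yc = (math.sqrt(snr0) * y0 + math.sqrt(delta) * yd) / math.sqrt(snr0 + delta)
    ws.append(yc - math.sqrt(snr0 + delta) * x)         # residual noise W
mean_w = sum(ws) / N
var_w = sum(w * w for w in ws) / N - mean_w ** 2        # should be close to 1
```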

VII. ADVANCED BOUNDS: SCPP BOUND AND ITS COMPLEMENT
The SCPP is a powerful tool that can be used to show the advantage of Gaussian inputs over arbitrary inputs in certain channels with Gaussian noise. In conjunction with the I-MMSE relationship, the SCPP provides simple and insightful converse proofs for the capacity of multiuser AWGN channels. The original proof of the SCPP in [7] and [8] relied on bounding the derivative of the MMSE. Next we give a simpler proof of the SCPP that does not require knowledge of the derivative of the MMSE and can easily be extended to the MMPE of any order p.
First observe that, in light of the bound in (37d), for any snr > 0 we can always find a β ≥ 0 that matches the known value of the MMPE at that snr. Next we generalize the SCPP bound to the MMPE in Proposition 18, where the constant c_p is given in (57b). Proof: Let snr = snr_0 + ∆ for ∆ ≥ 0, and let Y_∆ = √∆ X + Z_∆. Then, by maximal ratio combining, Y_snr = √(snr_0 + ∆) X + W, where W ∼ N(0, I). Next, define a suboptimal estimator given (Y_∆, Y_(snr_0)) as in (59), for some γ ∈ R to be determined later. Then we obtain (60), where the (in)equalities follow from: a) Proposition 17; b) using the suboptimal estimator in (59); and c) choosing γ = ‖Z‖_p² / (‖Z‖_p² + ∆·m) for m defined in (58). Next, by applying the triangle inequality to (60) we get (61), where in the last step we used a + b ≤ √2 · √(a² + b²).
Note that for the case p = 2, instead of using the triangle inequality in (61), the term in (60) can be expanded into a quadratic expression, for which it is not hard to see that the choice γ = ‖Z‖_2² / (‖Z‖_2² + ∆·m) is optimal and leads to the bound in (62). The proof is concluded by noting that β = m / (‖Z‖_p² − snr_0·m). Remark 3. We conjecture that the multiplicative constant c_p can be sharpened to 1 for all p ≥ 1.
However, in order to make such a claim one must solve the optimization problem in (64), where W and Z are independent and Z ∼ N(0, I). Because it is not clear how to solve (64) for p ≠ 2, we leave this for future work.

Remark 4. Note that the proof of Proposition 18 does not require the assumption that Z is Gaussian and only requires the assumptions of Proposition 17. That is, we only require a channel in which the estimation of X from two observations is equivalent to estimating X from a single observation at a higher SNR.

A. Complementary SCPP bound
In this section we give a bound that complements the SCPP bound: while the SCPP bounds the MMPE for all snr ≥ snr_0, the new bound controls the MMPE for all snr ≤ snr_0, where it is assumed that the MMPE is known at snr_0.
The next result enables us to bound the MMPE at snr with values of the MMPE at snr 0 while varying the order.
Proposition 19. For 0 < snr ≤ snr_0, any X, and p ≥ 0, we have mmpe(X, snr, p) ≤ κ_(n,t) · mmpe^(1/m)(X, snr_0, p·m), with m = (1+t)/(1−t), t = (snr_0 − snr)/snr_0, and a constant κ_(n,t) depending only on n and t. Proof: From Proposition 8, mmpe(X, snr, p) equals an infimum of an expectation under Y_(snr_0) weighted by an exponential factor, which we bound as follows: a) by Hölder's inequality with conjugate exponents 1 ≤ m, r such that 1/m + 1/r = 1; and b) by recognizing that the expectation of the exponential is the moment generating function of a chi-square distribution with n degrees of freedom, which exists only if r(snr_0 − snr)/(2 snr_0) < 1/2. Next, we let t = (snr_0 − snr)/snr_0 and r = (t+1)/(2t) in (65), so that m = (1+t)/(1−t). Observe that now the bound in (65) holds for all values of snr ≤ snr_0, since r(snr_0 − snr)/(2 snr_0) = (t+1)/4 < 1/2 for t < 1. With the choice of m = (1+t)/(1−t) the bound in (65) becomes the claimed bound. This concludes the proof.

The bound in Proposition 19 is key to showing new bounds on the phase transition region for the MMSE, presented in the next section.

As an application of Proposition 19 we show that the MMPE is a continuous function of SNR.
Proposition 20. For fixed X and p, mmpe(X, snr, p) is a continuous function of snr > 0.
Proof: Assume without loss of generality that snr 0 ≥ snr

A. Bounds on the Differential Entropy
For any random vector U such that |K_U| < ∞ and h(U) < ∞, and any random vector V, the following inequality is considered to be a continuous analog of Fano's inequality [4]: h(U|V) ≤ (n/2) log( (2πe/n) · E[‖U − E[U|V]‖²] ), (68) where the inequality in (68) is a consequence of the arithmetic-mean geometric-mean inequality; that is, for any 0 ⪯ A we have |A|^(1/n) ≤ Tr(A)/n, since |A| = ∏ λ_i and Tr(A) = ∑ λ_i, where the λ_i's are the eigenvalues of A.
By applying (68) to the AWGN setting, for any X such that |K_X| < ∞ and h(X) < ∞, by using Proposition 10 with q = 1 we can arrive at the trivial bound in (69), valid for any p ≥ 2. Next, we show that the inequality in (68) can be generalized in terms of the norm in (5), and that the trivial bound in (69) can be improved.
Theorem 1. For any U ∈ R^n such that h(U) < ∞ and ||U||_p < ∞ for some p ∈ (0, ∞), and for any V ∈ R^n, we have

h(U|V) ≤ (n/2) log( k²_{n,p} · n^{2/p} · mmpe^{2/p}(U|V; p) ).

Note that the result in Theorem 1 holds in great generality; in particular, the AWGN assumption is not necessary. As an application of Theorem 1 to the AWGN setting, we have the following stronger version of the inequality in (69).
Proof: The proof follows by setting U = X and V = Y in the statement of Theorem 1.
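In the jointly Gaussian scalar case both sides of the continuous Fano analog in (68) are available in closed form, which gives a quick sanity check; the helper names below are ours, and the bound is evaluated for linear estimators g(V) = aV:

```python
import numpy as np

def gaussian_cond_entropy(var_u, var_v, cov_uv):
    # h(U|V) in nats for jointly Gaussian (U, V): 0.5*log(2*pi*e*Var(U|V)).
    var_cond = var_u - cov_uv ** 2 / var_v
    return 0.5 * np.log(2 * np.pi * np.e * var_cond)

def fano_rhs(var_u, var_v, cov_uv, a):
    # RHS of (68) with the estimator g(V) = a*V: 0.5*log(2*pi*e*E[(U - aV)^2]).
    mse = var_u - 2 * a * cov_uv + a ** 2 * var_v
    return 0.5 * np.log(2 * np.pi * np.e * mse)

h = gaussian_cond_entropy(1.0, 1.0, 0.6)
bounds = [fano_rhs(1.0, 1.0, 0.6, a) for a in (0.0, 0.3, 0.6, 0.9)]
```

The bound holds for every a and is tight at a = cov/var_v = 0.6, i.e., at the MMSE estimator, as expected for Gaussian statistics.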

B. Generalized Ozarow-Wyner Bound
In [31] the following "Ozarow-Wyner lower bound" on the mutual information achieved by a discrete input X_D transmitted over an AWGN channel was shown:

I(X_D; Y) ≥ H(X_D) − gap, (72)

where the gap term is expressed through lmmse(X|Y), the LMMSE of estimating X from Y. The advantage of the bound in (72) compared to existing bounds is its computational simplicity. The bound on the gap in (72) has been sharpened in [40] by replacing the LMMSE with the MMSE, which can only tighten the bound since lmmse(X, snr) ≥ mmse(X, snr).
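The inequality lmmse(X, snr) ≥ mmse(X, snr) used above is easy to illustrate numerically. The sketch below (helper names are ours) computes both quantities for an equiprobable BPSK input by numerical integration:

```python
import numpy as np

def mmse_discrete(points, snr, grid=np.linspace(-25.0, 25.0, 40001)):
    # mmse(X, snr) for an equiprobable discrete input over the AWGN channel
    # Y = sqrt(snr) X + Z, computed by integrating Var(X | Y = y) p_Y(y) dy.
    pts = np.asarray(points, dtype=float)
    s, dy = np.sqrt(snr), grid[1] - grid[0]
    w = np.exp(-(grid[None, :] - s * pts[:, None]) ** 2 / 2)  # unnormalized posteriors
    post = w / w.sum(axis=0)
    cond_mean = (pts[:, None] * post).sum(axis=0)
    cond_var = ((pts[:, None] - cond_mean) ** 2 * post).sum(axis=0)
    pdf_y = w.sum(axis=0) / (len(pts) * np.sqrt(2.0 * np.pi))
    return float(np.sum(pdf_y * cond_var) * dy)

def lmmse(snr, var=1.0):
    # Error of the best *linear* estimator for an input of variance var.
    return var / (1.0 + snr * var)

snr = 1.0
m, l = mmse_discrete([-1.0, 1.0], snr), lmmse(snr)
```

Replacing the LMMSE with the strictly smaller MMSE therefore tightens the gap term in (72).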
Next, we generalize the bound in (72) to discrete vector inputs and give the sharpest known bound on the gap term.
Theorem 2. (Generalized Ozarow-Wyner Bound) Let X_D be a discrete random vector with finite entropy, with p_i = P[X_D = x_i] for x_i ∈ supp(X_D), and let K_p be a set of continuous random vectors, independent of X_D, such that for every U ∈ K_p we have h(U) < ∞ and ||U||_p < ∞, and such that the condition in (74a) holds. Then, for any p > 0, the bound in (74b)-(74e) holds. Proof: See Appendix L.
It is interesting to note that the lower bound in (74b) resembles the bound for lattice codes in [41, Theorem 1], where U can be thought of as dither, G_{2,p} corresponds to the log of the normalized p-th moment of a compact region in R^n, G_{1,p} corresponds to the log of the normalized MMSE term, and H(X_D) corresponds to the capacity C.
In order to show the advantage of Theorem 2 over the original Ozarow-Wyner bound (the case of n = 1, with the LMMSE instead of the MMPE), we consider X_D uniformly distributed with the number of points equal to N ≈ √(1 + snr); that is, we choose the number of points such that H(X_D) ≈ (1/2) log(1 + snr). Fig. 3 compares the resulting bounds; the solid cyan line is the "shaping loss".

Theorem 3. Let U be uniform over a ball of radius d_min(X_D)/2. Then, for any p > 0, lim_{n→∞} G_{2,p}(U) = 0, and therefore the gap in Theorem 2 is asymptotically determined by the G_{1,p} term. Proof: See Appendix M.
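The comparison described above can be reproduced in outline. Assuming the choice snr = 15, so that N = 4 points give H(X_D) = log 4 nats = ½ log(1 + snr), the mutual information achieved by unit-power 4-PAM can be computed as h(Y) − h(Z) by numerical integration (helper names are ours):

```python
import numpy as np

def mi_discrete_awgn(points, probs, snr, grid=np.linspace(-30.0, 30.0, 20001)):
    # I(X_D; Y) in nats for Y = sqrt(snr)*X_D + Z, Z ~ N(0, 1), computed as
    # h(Y) - h(Z), with h(Y) evaluated by numerical integration of the
    # Gaussian-mixture output density.
    s, dy = np.sqrt(snr), grid[1] - grid[0]
    pdf = sum(p * np.exp(-(grid - s * x) ** 2 / 2) / np.sqrt(2 * np.pi)
              for x, p in zip(points, probs))
    h_y = -np.sum(pdf * np.log(np.maximum(pdf, 1e-300))) * dy
    return float(h_y - 0.5 * np.log(2 * np.pi * np.e))

pts = [x / np.sqrt(5.0) for x in (-3.0, -1.0, 1.0, 3.0)]  # unit-power 4-PAM
mi = mi_discrete_awgn(pts, [0.25] * 4, 15.0)
```

The computed I(X_D; Y) falls below H(X_D) = log 4 nats by exactly the kind of gap that Theorem 2 bounds.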

C. New bounds on the MMSE and Phase Transitions
The SCPP is instrumental in showing the behavior of the MMSE of capacity-achieving codes.
Based on the above, an interesting question is how the MMSE in (76) behaves for codes of finite length. In [11], in order to study the phase transition phenomenon for inputs of finite length, the following optimization problem was proposed:

M_n(snr, snr_0, β) := max_X mmse(X, snr) subject to mmse(X, snr_0) ≤ β/(1 + β snr_0). (77a)

Investigation in [11] revealed that M_n(snr, snr_0, β) in (77a) must be of the following form:

M_n(snr, snr_0, β) = 1/(1 + snr) for snr ≤ snr_L; T_n(snr, snr_0, β) for snr_L ≤ snr ≤ snr_0; and β/(1 + β snr) for snr_0 ≤ snr,

for some snr_L and some function T_n(snr, snr_0, β), where the region snr_L ≤ snr ≤ snr_0 is referred to as the phase transition region and its width is defined as W(n) := snr_0 − snr_L. In [11], the bound in (78) was established for T_n(snr, snr_0, β), and the width of the phase transition region was shown to scale as W(n) = O(n^{−1}).
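A minimal sketch of the limiting (n → ∞) behavior: assuming unit-power inputs, the unconstrained maximum MMSE is 1/(1 + snr) (achieved by a Gaussian input), while for snr ≥ snr_0 the SCPP caps the MMSE at β/(1 + β snr); as W(n) → 0 the transition region collapses, leaving a jump at snr_0. The branch values below are assumptions of this sketch:

```python
def limiting_mmse_envelope(snr, snr0, beta):
    # n -> infinity shape of M_n(snr, snr0, beta): the transition region
    # [snr_L, snr0] collapses as W(n) -> 0, leaving a jump at snr0 between
    # the unconstrained Gaussian maximum and the SCPP bound.
    if snr < snr0:
        return 1.0 / (1.0 + snr)          # assumed unit-power Gaussian maximum
    return beta / (1.0 + beta * snr)      # SCPP bound for snr >= snr0

snr0, beta = 2.0, 0.4
jump = (limiting_mmse_envelope(snr0 - 1e-9, snr0, beta)
        - limiting_mmse_envelope(snr0, snr0, beta))
```

The jump 1/(1 + snr_0) − β/(1 + β snr_0) is strictly positive whenever β < 1, which is the discontinuity of the MMSE referred to in the text.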
The main result of this subsection is shown next. It uses Proposition 19 and Proposition 12.
where γ := snr/(2 snr_0 − snr) ∈ (0, 1], κ(r, γ, n) is the constant appearing in (79a), and the minimizing r in (79a) can be approximated as in (79d). Moreover, the width of the phase transition region vanishes, as n → ∞, faster than the O(n^{−1}) width in (78).

Proof: From the SCPP complementary bound in Proposition 19 with p = 1 we obtain an upper bound on mmse(X, snr) in terms of the MMPE at snr_0. From the interpolation result in Proposition 12, letting q = 2(1 + t)/(1 − t) and p = 2, we have that for some r such that 2 ≤ 2(1 + t)/(1 − t) = q < r the MMPE term can be bounded accordingly. By putting all of the bounds together, letting γ = (1 − t)/(1 + t), and simplifying, we obtain the bound in (79a). Finally, the proof of the approximately optimal r in (79d) is given in Appendix N.

The bounds in Theorem 4 and in (78) are shown in Fig. 4. The bound in Theorem 4 is asymptotically tighter than the one in (78): the phase transition region shrinks faster for Theorem 4 than the O(1/n) rate of the bound in (78). It is not possible, in general, to assert that Theorem 4 is tighter than (78); in fact, for small values of n, the bound in (78) can offer advantages, as seen for the case n = 1 shown in Fig. 4b. Another advantage of the bound in (78) is its analytical simplicity.

D. Bounds on the derivative of the MMSE
The MMPE can be used to study the second derivative of mutual information (or the first derivative of the MMSE), as initiated for n = 1 in [7] and for n ≥ 1 in [8]. The second derivative of mutual information is important in characterizing the bandwidth-power trade-off in the wideband regime [42], [43], and it has also been used in the proofs of the SCPP in [7] and [8]. Moreover, in [7] it was shown that, for n = 1, the derivative of the MMSE and the quantity in (12) are related by the bound in (83). The main result of this subsection is the next bound.
Proof: See Appendix O.
It can be observed that, for the case n = 1, by using the bound in (37b) from Proposition 10 we arrive at the bound in (85), which significantly reduces the constant in (83) from 3·2^4 to 3. For a similar, but slightly different, bound on E[ Cov²(X|Y) ] than that in (85), see [11].
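These derivative relationships build on the I-MMSE identity dI/dsnr = mmse(X, snr)/2 (in nats) of Guo, Shamai, and Verdú; the quantity bounded in this subsection is its derivative. The sketch below (helper names are ours) verifies the first-order identity for BPSK by quadrature:

```python
import numpy as np

GRID = np.linspace(-20.0, 20.0, 40001)
DY = GRID[1] - GRID[0]

def mi_bpsk(snr):
    # I(X; Y) in nats for BPSK over AWGN, via h(Y) - h(Z) by quadrature.
    s = np.sqrt(snr)
    pdf = 0.5 * (np.exp(-(GRID - s) ** 2 / 2) +
                 np.exp(-(GRID + s) ** 2 / 2)) / np.sqrt(2 * np.pi)
    h_y = -np.sum(pdf * np.log(np.maximum(pdf, 1e-300))) * DY
    return h_y - 0.5 * np.log(2 * np.pi * np.e)

def mmse_bpsk(snr):
    # mmse(X, snr) = E[(X - tanh(sqrt(snr) Y))^2], by the same quadrature.
    s = np.sqrt(snr)
    out = 0.0
    for x in (-1.0, 1.0):
        pdf = 0.5 * np.exp(-(GRID - s * x) ** 2 / 2) / np.sqrt(2 * np.pi)
        out += np.sum(pdf * (x - np.tanh(s * GRID)) ** 2) * DY
    return out

snr, h = 1.0, 1e-3
lhs = (mi_bpsk(snr + h) - mi_bpsk(snr - h)) / (2 * h)  # centered difference for dI/dsnr
rhs = 0.5 * mmse_bpsk(snr)
```

The centered difference of the mutual information matches mmse/2 to numerical precision, so differentiating once more yields the quantity studied here.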

IX. CONCLUDING REMARKS
This paper has considered the problem of estimating a random variable from a noisy observation under a general cost function, termed the MMPE. We have shown that many properties of the MMSE and the conditional expectation (i.e., the optimal MMSE estimator) are identical, or have a natural generalization, to the MMPE and the optimal MMPE estimator.
We have also provided a new simpler proof of the SCPP for the MMSE and generalized it to the MMPE. We have shown that the new framework of the MMPE also permits the development of bounds that are complementary to the SCPP which in turn allows for new tighter characterizations of the phase transition phenomena that manifest, in the limit as the length of the capacity achieving code goes to infinity, as a discontinuity of the MMSE as a function of SNR.
We have also shown connections between the MMPE and the conditional differential entropy by generalizing a well-known continuous analog of Fano's inequality. The MMPE was further used to refine bounds on the conditional entropy and to improve the gap term in the Ozarow-Wyner bound.
Currently, we are investigating the connections between bounds on the MMPE provided in this work and the rate distortion problem with the MMPE distortion measure. Possible future applications of the sharpened version of the Ozarow-Wyner bound include sharpening the bounds on discrete inputs in [44] and [45]. Another interesting future direction is to consider a modified 'information bottleneck problem' [46] where the constraint on the mutual information is replaced by a constraint on the MMPE.

APPENDIX A PROOF OF THE TRIANGLE INEQUALITY IN (6)
It is well known that the trace operator defines an inner product on the space of matrices, and an inner product induces a norm. Therefore, we have the claimed chain of (in)-equalities, which follow from: a) the triangle inequality for the norm induced by the inner product, which holds for p ≥ 1; and b) Minkowski's inequality for the expectation, which also holds for p ≥ 1. This concludes the proof.

APPENDIX B PROOF OF PROPOSITION 1
For simplicity, we look at the case n = 1; the case n > 1 follows similarly. We first assume that snr > 0. The first direction follows trivially. For the other direction, we focus on the inner expectation and show that the infimum is achieved by f(y) = f_p(X|Y = y) given in (13). Since y is now given, we are simply looking for an optimal solution to the more general problem

inf_v E[ |X_y − v|^p ], (88)

where X_y ∼ p_{X|Y}(·|y). The goal is to show that the infimum in (88) is achievable. Clearly, the infimum exists, as shown in (90), where the last inequality follows from [7, Proposition 6], which asserts that for any p < ∞, X_y is a sub-Gaussian random variable and hence all conditional moments are finite.
Next, we show that g(v) = E[ |X_y − v|^p ] is a continuous function of v. For arbitrary |v| < ∞, take a sequence v_n such that v_n → v; we want to show that g(v_n) → g(v). This can be done with the help of the dominated convergence theorem: we must find an integrable random variable θ such that |X_y − v_n|^p ≤ θ for all n. Such a θ is found as θ = 2^p( |X_y|^p + K ), where the inequalities follow from: a) |X_y − v_n|^p ≤ (2 max(|X_y|, |v_n|))^p ≤ 2^p( |X_y|^p + |v_n|^p ), which holds for any p ≥ 0; and b) recalling that every convergent sequence is bounded, so that, since v_n converges to v, we have |v_n|^p ≤ K for some finite K and every n. The integrability of θ = 2^p( |X_y|^p + K ) follows again by the sub-Gaussian argument from [7, Proposition 6].
Therefore, we conclude that the function g(v) is continuous.
Next, we show that the infimum is attained by some |v_0| < ∞. By the definition of the infimum, there exists a sequence v_n (not necessarily convergent) such that g(v_n) → inf_v g(v). Towards a contradiction, assume that v_n → ∞. Then, by Fatou's lemma, lim inf_n g(v_n) = ∞. However, this contradicts the result in (90), and therefore the sequence v_n must be bounded. This, together with the fact that g(v) is continuous, implies that the infimum is attained. Therefore, for each y ∈ R there exists |v| < K that minimizes min_{v∈R} E[ |X_y − v|^p ], and the optimal estimator is defined point-wise by this minimizer. For the case of snr = 0+, the problem reduces to inf_v E[ |X − v|^p ], which is bounded if and only if ||X||_p < ∞. This concludes the proof.

APPENDIX C PROOF OF PROPOSITION 2
We take a classical approach, used in estimation theory to find an optimal estimator, based on tools from the calculus of variations [47, Ch. 7, Thm. 1]. A necessary condition for f to be a minimizer in (13) is expressed through a functional derivative, which must vanish for all admissible directions g(Y).
Therefore, we focus on the limit in (100). We seek to apply the dominated convergence theorem to (100) in order to interchange the order of the limit and the expectation. To that end, we let v = x − f(y). Next, for the integrand we observe that all the terms in (104) are of order no more than p, and since all of the terms are in L_p (i.e., p-integrable), the quantity in (104) is integrable. Therefore, the dominated convergence theorem applies and we can interchange the order of limit and expectation in (100).
Next, observe that we can re-write the limit as a derivative. By using the chain rule of matrix differential calculus, we arrive at the expression for the functional derivative.
Finally, for f_p(X|Y) to be optimal, the functional derivative must vanish for any admissible g(Y). This verifies the necessary condition for optimality for p > 0.
To verify that this is a sufficient condition for optimality, we take the second variational derivative of E[ Err_2^p(X, f(Y)) ] and demonstrate that it is always positive for p ≥ 1. This follows since Err_2^p(x, f(y) + εg(y)) is a convex function of ε for p ≥ 1.
This verifies the sufficient condition for p ≥ 1 and concludes the proof.

APPENDIX D PROOF OF PROPOSITION 3
In Proposition 1 we let X_y ∼ p_{X|Y}(·|y) and therefore have to solve, for all y, the problem in (110). We know that X_y is Gaussian with X_y ∼ N( √snr·y/(1 + snr), 1/(1 + snr) ). The optimization problem in (110) can be transformed into the problem in (111), where Z ∼ N(0, 1). Next, we take the derivative with respect to a in (111), where the interchange of the order of differentiation and expectation in (113) is justified by the Leibniz integral rule [48], which requires verifying that |g(a, z)| ≤ θ(z) for some integrable θ(z). This is indeed the case, as shown in (115). Clearly, θ(z) is integrable, so the change of the order of differentiation and expectation in (111) is justified.
Next, observe the structure of the function g(a, z) in (115); all of this implies that a = 0 is a critical point and a minimum. Therefore, the optimal â = 0 for the optimization problem in (111), and the optimal v̂ for the original optimization problem is found through (112) to be v̂ = √snr·y/(1 + snr), i.e., the conditional mean. Finally, we compute mmpe(X, snr, p) for X ∼ N(0, 1), where the equalities follow from: a) the fact that X and Z are independent Gaussian r.v.'s, so the estimation error has an equivalent distribution given by Ẑ/√(1 + snr), where Ẑ ∼ N(0, 1); and b) from (7) by setting n = 1. This concludes the proof.
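The conclusion that the optimal estimator does not depend on p for Gaussian inputs can be probed by brute force: for a Gaussian posterior, the minimizer of E|X_y − v|^p is the posterior mean for every p, by symmetry and unimodality. A Monte Carlo sketch (helper names are ours):

```python
import numpy as np

def best_v(mu, sigma, p, seed=0):
    # Grid-search minimizer of E|X_y - v|^p for X_y ~ N(mu, sigma^2),
    # estimated with common random numbers across all candidate v.
    rng = np.random.default_rng(seed)
    x = mu + sigma * rng.standard_normal(100_000)
    vs = np.linspace(mu - 2.0, mu + 2.0, 401)
    costs = [np.mean(np.abs(x - v) ** p) for v in vs]
    return vs[int(np.argmin(costs))]

# For a Gaussian posterior, the optimal p-th-moment estimator is the mean for every p.
ests = [best_v(0.5, 0.8, p) for p in (1.0, 2.0, 4.0)]
```

All three orders recover (up to grid and Monte Carlo resolution) the same minimizer, the posterior mean.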

APPENDIX E PROOF OF PROPOSITION 4
From Proposition 1 we have to minimize E[ |X_y − v|^p ], where X_y ∼ p_{X|Y}(·|y). Without loss of generality, we assume that x_1 ≤ x_2. By using Bayes' formula, we obtain the posterior of X given Y = y from the joint probability density function of (X, Y). The minimization of (124) with respect to v is equivalent to minimizing a weighted cost g(v), which can be written in piecewise form, with the derivative of g(v) given in (128). From (128) we see that in the regime x_2 ≤ v the derivative is positive, and therefore the minimum occurs at v = x_2; in the regime v ≤ x_1 the derivative is always negative, so the minimum occurs at v = x_1; and in the regime x_1 < v < x_2 the optimal v solves the first-order condition. Next, by comparing the three candidates for the minimizing v, we arrive at (134). Therefore, the optimal estimator is given by the RHS of (134).
Note that for the case of p = 1 the function g(v) reduces to a piecewise-linear function, and the minimum occurs at the point x_i with the larger posterior probability (i.e., the median of the posterior). This implies that for p = 1 the optimal estimator selects the more likely of the two points. This concludes the proof.
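The case analysis above can be cross-checked by a direct grid search on g(v) (a sketch with hypothetical helper names): for p = 2 the minimizer is the posterior mean, while for p = 1 it snaps to the more likely point, i.e., the posterior median:

```python
import numpy as np

def best_v_two_point(x1, x2, q, p):
    # Minimize g(v) = q|x1 - v|^p + (1 - q)|x2 - v|^p over a fine grid;
    # the minimizer always lies in [x1, x2].
    vs = np.linspace(x1, x2, 100_001)
    g = q * np.abs(x1 - vs) ** p + (1 - q) * np.abs(x2 - vs) ** p
    return float(vs[int(np.argmin(g))])

v_med = best_v_two_point(-1.0, 1.0, 0.3, 1.0)   # p = 1: mass 0.7 on x2 = +1
v_mean = best_v_two_point(-1.0, 1.0, 0.3, 2.0)  # p = 2: posterior mean
```

With posterior masses (0.3, 0.7), the p = 1 optimizer is the more likely point +1, while the p = 2 optimizer is the mean 0.3·(−1) + 0.7·(+1) = 0.4.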

APPENDIX F PROOF OF PROPOSITION 5
The key to deriving all of the claimed properties is the expression for the optimal estimator in Proposition 1. We prove the properties in order. 1) Suppose that X ≥ 0 and, towards a contradiction, that v_y := f_p(X|Y = y) < 0. Then the chain of (in)-equalities leading to (139) holds, where the (in)-equalities follow from: a) the assumption that X ≥ 0 and v_y < 0, so that X − v_y > 0 and the absolute value is redundant; and b) the assumption that X ≥ 0 and v_y < 0, so that X − v_y ≥ X. The expression in (139) leads to a contradiction, since it implies that v_y = 0 while by assumption v_y < 0. Therefore, v_y = f_p(X|Y = y) ≥ 0. This concludes the proof of property 1).
2) Next we show that f p (aX + b|Y) = af p (X|Y) + b. Let then where the equalities follow from: a) since scaling the objective function does not change the optimizer; and b) since the minimum is attained at v−b a = v y . This concludes the proof of property 2).
3) Next, we show that f_p(g(Y)|Y = y) = g(y). This holds since, conditioned on Y = y, the random variable g(Y) equals the constant g(y), and E[ |g(y) − v|^p ] is minimized by v = g(y). This concludes the proof of property 3). 4) Property 4) follows from property 3) by taking g(Y) = f_p(X|Y).
By using Proposition 1 we have that f_p(X|Y_{snr_0} = y_{snr_0}, Y_snr = y_snr) = arg min_{v∈R^n} E[ Err^p(X, v) | Y_{snr_0} = y_{snr_0}, Y_snr = y_snr ]. We define Ŷ_snr, where Ẑ ∼ N(0, σ²I) with σ² = (snr_0 − snr)/snr is independent of Y_{snr_0}, X and Z. Observe that Ŷ_snr and Y_snr have the same SNRs, and therefore mmpe(X|Y_snr; p) = mmpe(X|Ŷ_snr; p).
By performing a change of measure we have where L(x, y) is given by and thus The lower bound in (37a) for p ≥ q follows by mmpe(X, snr, q) = inf where the inequality in a) follows from Jensen's inequality and the concavity of (·) q p .
B. Proof of the bounds in (37b) and (37c) We now proceed to the proof of the upper bounds in (37b) and (37c). We have that the (in)-equalities follow from: a) Lemma 1; and b) the triangle inequality, which holds for p ≥ 1.
Next, the term ||E[Z|Y]||_p can be further bounded, where the inequality in a) follows from Jensen's inequality. Depending on whether p/2 ≤ 1 or p/2 ≥ 1, we bound (154) as follows: the (in)-equalities follow from: a) the scaling property of the MMPE in Proposition 6; and b) the bound in (160).
Observe that the bound in (160) is achieved asymptotically by using X_G ∼ N(0, σ²I), since by Proposition 3 and the scaling property in Proposition 6 we have that mmpe(X_G, snr, p) = σ^p ||Z||_p^p / (1 + σ² snr)^{p/2}. This concludes the proof.
APPENDIX J

PROOF OF PROPOSITION 14
We use the approach of [38]. Suppose we use the following sub-optimal decoder: decode to x_i if Y ∈ B_{x_i}(r), where B_{x_i}(r) is the n-dimensional ball of radius r centered at x_i. Since E[ Err_2^p(X, g(Y)) | X = x_i, Y ∈ B_{x_i}(r) ] = 0, we have that n·mmpe(X, snr, p) is bounded by the contribution of the event {Y ∉ B_{x_i}(r)}, whose probability can in turn be bounded. This concludes the proof.

APPENDIX K PROOF OF THEOREM 1
We have that (173) holds, where h_e(·) is the differential entropy measured in nats. Moreover, observe that h_e(W_v) = h_e(U_v − g(v)) = h_e(U_v), due to the translation invariance of the differential entropy. Therefore, by rearranging (173), we get

(1/n) h_e(U_v) log(e) ≤ log( k_{n,p} · n^{1/p} · ||U_v − g(v)||_p ),

where the norm is the one defined in (5), and where the inequality in a) follows from Jensen's inequality. Finally, since this bound holds for any deterministic function g(·), to tighten this bound, and due to the monotonicity of the log function, we may pick g(·) to be the optimal p-th estimator of U. This concludes the proof.
APPENDIX L PROOF OF THEOREM 2 Let (U, X_D, Z) be mutually independent. By the data processing inequality and the assumption in (74a), we have the bound in (175). Next, by using Theorem 1, the last term of (175) can be bounded as

n^{−1} h(X_D + U|Y) ≤ log( k_{n,p} · n^{1/p} · ||X_D + U − g(Y)||_p ). (176)
Next, by combining (175) and (176), we have that n^{−1}·gap_p ≤ inf_{U∈K_p} ( G_{1,p}(U, X_D) + G_{2,p}(U) ), where the inequality in (178) follows by the triangle inequality, which holds for p ≥ 1.
Finally, the proof concludes by taking g(Y) = f p (X|Y).

APPENDIX M PROOF OF THEOREM 3
To show that lim_{n→∞} G_{2,p}(U) = 0, we show that

lim_{n→∞} k_{n,p} · n^{1/p} · ||U||_p / e^{h_e(U)/n} = 1.

First, observe that by using (173) in Appendix K,

1 ≤ k_{n,p} · n^{1/p} · ||U||_p / e^{h_e(U)/n}. (181)

Next, we show an upper bound. Note that if U is uniform over a ball B_0(r) of radius r = d_min(X_D)/2, then h(U) = log( Vol(B_0(r)) ), where Vol(B_0(r)) = π^{n/2} r^n / Γ(n/2 + 1).
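The volume formula can be checked against the familiar low-dimensional cases, and for uniform U one then gets h(U) = log Vol(B_0(r)) directly. A minimal sketch:

```python
import math

def ball_volume(n, r):
    # Vol(B_0(r)) = pi^(n/2) * r^n / Gamma(n/2 + 1) for a ball of radius r in R^n.
    return math.pi ** (n / 2.0) * r ** n / math.gamma(n / 2.0 + 1.0)

def entropy_uniform_ball(n, r):
    # Differential entropy (nats) of U uniform over B_0(r): h(U) = log Vol(B_0(r)).
    return math.log(ball_volume(n, r))
```

For n = 2 this recovers πr², for n = 3 it recovers (4/3)πr³, and for n = 1 with r = 1/2 (an interval of length 1) the entropy is 0 nats, as expected.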
By taking the derivative of (200) with respect to r we obtain the desired limit. We will need the following bounds on the trace of A ⪰ 0, where A ∈ R^{n×n}:

(1/n) Tr²(A) ≤ Tr(A²) ≤ Tr²(A). (206)
For the upper bound we have that Tr( E[ Cov²(X|Y) ] ) = E[ Tr( Cov²(X|Y) ) ], where the (in)-equalities follow from: a) Cov(X|Y) ⪰ 0 together with the inequality in (206); and b) Jensen's inequality.
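The trace bounds in (206) hold for every positive semi-definite A: with eigenvalues λ_i ≥ 0, Cauchy-Schwarz gives Tr²(A) ≤ n·Tr(A²), and ∑λ_i² ≤ (∑λ_i)² gives the upper bound. A quick numerical check (helper name is ours):

```python
import numpy as np

def trace_bounds_hold(n=6, seed=0):
    # For PSD A: Tr(A)^2 / n <= Tr(A^2) <= Tr(A)^2.
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((n, n))
    a = b @ b.T                      # random positive semi-definite matrix
    t, t2 = float(np.trace(a)), float(np.trace(a @ a))
    return t ** 2 / n <= t2 + 1e-9 and t2 <= t ** 2 + 1e-9

checks = [trace_bounds_hold(n, seed) for n in (2, 5, 9) for seed in (0, 1, 2)]
```

Both inequalities hold for every random PSD draw, matching the eigenvalue argument.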