On the Minimum Mean $p$th Error in Gaussian Noise Channels and Its Applications

The problem of estimating an arbitrary random vector from its observation corrupted by additive white Gaussian noise, where the cost function is taken to be the minimum mean $p$th error (MMPE), is considered. The classical minimum mean square error (MMSE) is a special case of the MMPE. Several bounds, properties, and applications of the MMPE are derived and discussed. The optimal MMPE estimator is found for Gaussian and binary input distributions. Properties of the MMPE as a function of the input distribution, signal-to-noise ratio (SNR) and order $p$ are derived. The "single-crossing-point property" (SCPP), which provides an upper bound on the MMSE and which, together with the mutual information-MMSE relationship, is a powerful tool in deriving converse proofs in multi-user information theory, is extended to the MMPE. Moreover, a complementary bound to the SCPP is derived. As a first application of the MMPE, a bound on the conditional differential entropy in terms of the MMPE is provided, which then yields a generalization of the Ozarow–Wyner lower bound on the mutual information achieved by a discrete input on a Gaussian noise channel. As a second application, the MMPE is shown to improve on previous characterizations of the phase transition phenomenon that manifests, in the limit as the length of the capacity achieving code goes to infinity, as a discontinuity of the MMSE as a function of SNR. As a final application, the MMPE is used to show new bounds on the second derivative of mutual information, or the first derivative of the MMSE.


I. INTRODUCTION
IN the Bayesian setting, the minimum mean square error (MMSE) of estimating a random variable X from an observation Y is understood as a cost function (also termed a risk function) with a quadratic loss, i.e., the L 2 norm: mmse(X|Y) = inf_f E[(X − f(Y))^2], where the infimum is over all measurable estimators f(Y). Another commonly used cost function is the L 1 norm, with loss given by the absolute value of the error (i.e., the difference between the variable of interest and its estimate).
In general, cost functions with non-quadratic loss functions are not well understood and have been considered only for special cases, such as under the assumption of Gaussian statistics. The interplay between estimation theoretic and information theoretic measures has been very fruitful; for example, the so-called mutual information-MMSE (I-MMSE) relationship [3], which relates the derivative of the mutual information with respect to the signal-to-noise ratio (SNR) to the MMSE, has found numerous applications throughout information theory [4]. The goal of this work is to show that the study of estimation problems with non-quadratic loss functions can also offer new insights into classical information theoretic problems. The program of this paper is thus to develop the necessary theory for a class of loss functions, and then apply the developed tools to information theoretic problems.
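Since the I-MMSE relationship is central to what follows, it is worth recording the one case where both sides are elementary: a scalar Gaussian input, for which I(snr) = (1/2) log(1 + snr) (in nats) and mmse(snr) = 1/(1 + snr). The following sketch (function names are ours, not from the paper) checks the identity dI/dsnr = (1/2) mmse(snr) by a central difference:

```python
import math

def mutual_info(snr):
    """I(X; Y) in nats for X ~ N(0,1) over Y = sqrt(snr)*X + Z."""
    return 0.5 * math.log(1.0 + snr)

def mmse(snr):
    """MMSE of the same Gaussian input: 1/(1 + snr)."""
    return 1.0 / (1.0 + snr)

snr, h = 2.0, 1e-6
deriv = (mutual_info(snr + h) - mutual_info(snr - h)) / (2 * h)  # central difference
assert abs(deriv - 0.5 * mmse(snr)) < 1e-6   # the I-MMSE identity
```

For non-Gaussian inputs neither side has such a simple closed form, which is precisely what makes relationships of this kind useful.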

A. Past Work
The popularity of the MMSE stems from its analytical tractability, which is rooted in the fact that the MMSE is defined through the L 2 norm in (1). The L 2 norm, in turn, allows applications of the well understood Hilbert space theory [5]. In information theoretic applications the L 2 norm is used, for example, to define an average input power constraint. The connection between the power constraint and the L 2 norm leads to a continuous analog of Fano's inequality that relates the conditional differential entropy and the MMSE [6,Th. 8.6.6].
Recently, in view of the I-MMSE relationship [3], the MMSE (in an additive white Gaussian noise (AWGN) channel) has received considerable attention. For example, in [7] the I-MMSE relationship was used to give a simple alternative proof of the Entropy Power Inequality (EPI) [8]. Moreover, the
so-called Single-Crossing-Point Property (SCPP) [9], [10], which bounds the MMSE for all SNR values above a certain value at which the MMSE is known, offers, together with the I-MMSE relationship, an alternative, unifying framework for deriving information theoretic converses. It was used: in [9] to provide an alternative proof of the converse for the Gaussian broadcast channel (BC) and to show a special case of the EPI; in [11] to provide a simple proof for the information combining problem and a converse for the BC with confidential messages; in [10], by using various extensions of the SCPP, to prove a special case of the vector EPI, a converse for the capacity region of the parallel degraded BC under per-antenna power constraints and under an input covariance constraint, and a converse for the compound parallel degraded BC under an input covariance constraint; and in [12] to provide a converse for communication under an MMSE disturbance constraint.
In [13] we demonstrated a bound that complements the SCPP: it bounds the MMSE for all SNR values below a certain value at which the MMSE is known, and it allows for a finer characterization of the phase transition phenomenon that manifests as a discontinuity of the MMSE as a function of SNR as the length of the codeword goes to infinity. This plays an important role in characterizing achievable rates of capacity achieving codes [14], [15]. One of the applications of the tools presented in this work is an improvement on the bound in [13, Th. 1].
Many other properties of the MMSE in relation to the I-MMSE have been studied in [9] and [16]–[18]. For a comprehensive survey of results, applications, and extensions of the I-MMSE relationship we refer the reader to [11] and [19].
While the MMSE has received considerable attention and is well understood, non-quadratic cost functions are understood only in special cases, such as under the assumption of Gaussian statistics. For example, in [20] it was shown that, under scalar Gaussian statistics, the linear MMSE (LMMSE) estimator is optimal for a large class of symmetric loss functions. The result of [20] was extended in [21] to a large class of cost functions that also includes asymmetric loss functions. Other early work in this direction includes [22].
Tan et al. [23] studied the expected L ∞ norm of the error, when the input is assumed to be a Gaussian mixture. The authors showed that, as the dimension of the signal goes to infinity, the optimal LMMSE estimator minimizes the expected maximum error.
Hall and Wise [24], [25] studied a class of even and nondecreasing and even and convex, respectively, loss functions and gave a sufficient condition on the conditional distribution of the input X given the output Y , so that the conditional expectation E[X|Y ] is the optimal estimator.
Akyol et al. [26] studied a scalar additive noise channel with an L p cost function and derived a necessary and sufficient condition on the noise and input distributions under which the optimal estimator is linear. Moreover, under these conditions, if the source and noise variances are equal, then the optimal estimator is linear if and only if the input and noise distributions are identical.
Weinberger and Merhav [27] and Merhav [28] considered the problem of transmitting a modulated signal over a discrete memoryless channel where the performance criterion was taken to be the L p cost function. To that end, the authors showed tight exponential bounds for very small and very large values of p.
Saerens [29] focused on designing an appropriate cost function such that the output of the trained model approximates the desired summary statistics, such as the conditional expectation, the geometric mean or the variance.
Livadiotis [30] focuses on characterizing expectation and variance based on L p norms and emphasizes that the parameter p provides a new degree of freedom in the analysis of new phenomena in statistical physics. The interested reader is also referred to [31], where the interplay between L p means and means generalized in terms of convex functions is considered.
In non-Bayesian estimation, L p cost functions have been considered in [32] and [33] in the context of minimax estimation, where the authors gave lower and upper bounds on the exponential behavior of the cost function. For a non-Bayesian treatment of non-quadratic cost functions we refer the reader to [34] and [35].
Looking into non-quadratic cost functions is further motivated by the fact that the quadratic cost function is often not the correct measure of signal fidelity for a given application. This is especially true in image processing, where error metrics that are more sensitive to structural changes of the input signal better capture human perception of quality. We refer the reader to [36] for a survey of recent results in this direction.

B. Paper Outline and Main Contributions
In this work we are interested in studying a cost function termed the minimum mean p-th error (MMPE), the scalar version of which is given by

mmpe(X|Y; p) = inf_f E[|X − f(Y)|^p],    (2)

where the infimum is over all measurable estimators f(Y). Our contributions are as follows:
1) In Section II we formally define the vector version of the MMPE in (2) and introduce related definitions.
2) In Section III we study properties of the optimal MMPE estimator and show:
• In Section III-A, Proposition 1 shows that the MMPE optimal estimator indeed exists;
• In Section III-B, Proposition 2 derives an orthogonality-like principle that serves as a necessary and sufficient condition for an estimator to be MMPE optimal;
• Section III-C gives examples of optimal MMPE estimators. In particular, in Proposition 3 we find the MMPE for Gaussian random vectors, and in Proposition 4 for discrete binary random variables; and
• In Section III-D, Proposition 5 shows some basic properties of the optimal MMPE estimator in terms of the input distribution, such as linearity, stability, and degradedness. Moreover, via an example it is shown that in general the MMPE optimal estimator is biased on average (i.e., the first moment of the error (bias) is not zero). However, it is shown that the p-th order estimator is unbiased on average in the sense that the (p − 1)-th moment of the error is zero.
3) In Section IV we study properties of the MMPE as a function of the order p, the SNR, and the input distribution that will be useful in a number of applications:
• In Section IV-A, Proposition 6 shows that the MMPE is invariant under translations of the input random vector and derives basic scaling properties;
• In Section IV-B, Proposition 7 shows that, as far as estimation error over the channel Y = √snr X + Z is concerned, estimation of the input X is equivalent to estimation of the Gaussian noise Z; and
• In Section IV-C, Proposition 8 gives a 'change of measure' result that allows one to take the expectation in the definition of the MMPE with respect to an output at a different SNR.
4) In Section V we discuss basic bounds on the MMPE and show:
• In Section V-A, Proposition 10 develops basic ordering bounds between MMPEs of different orders and bounds equivalent to the LMMSE bound;
• In Section V-B, Proposition 11 shows that, under an appropriate moment constraint on the input distribution, the Gaussian input is asymptotically the 'hardest' to estimate;
• In Section V-C, Proposition 12 derives interpolation bounds for the MMPE. One consequence of such bounds is Proposition 13, which shows that the MMPE is a continuous function of the order p; and
• In Section V-D, Propositions 14 and 15 bound the MMPE of discrete inputs in terms of the probability of detection error.

Notation:
• We denote the covariance of the r.v. X by K X;
• X ∼ N(m X, K X) denotes the density of a real-valued Gaussian r.v. X with mean vector m X and covariance matrix K X;
• The identity matrix is denoted by I;
• Reflection of the matrix A along its main diagonal, or the transpose operation, is denoted by A^T;
• The trace operation on the matrix A is denoted by Tr(A);
• The order notation A ⪰ B means that A − B is a positive semidefinite matrix;
• log(·) denotes logarithms in base 2;
• [n 1 : n 2] is the set of integers from n 1 to n 2 ≥ n 1;
• For x ∈ R we let ⌊x⌋ denote the largest integer not greater than x;
• For x ∈ R we let [x]^+ := max(x, 0) and log^+(x) := [log(x)]^+;
• For two real-valued functions f(x), g(x), the Landau notation f(x) = O(g(x)) means that |f(x)| ≤ c·g(x) for some c > 0 and all sufficiently large x;
• We denote the r.v. X conditioned on Y = y as X y;
• We denote the upper incomplete gamma function by Γ(·, ·) and the gamma function by Γ(·). The generalized Q-function is denoted by Q̄(·); in particular, it can be related to the standard Q-function Q(·); and
• We define the volume of a region S embedded in R^n as Vol(S) := ∫_S dx. In particular, the volume of the n-dimensional ball B(r) of radius r centered at the origin is given by Vol(B(r)) = π^{n/2} r^n / Γ(n/2 + 1).

II. COST FUNCTION DEFINITION
Motivated by the study of cost functions with non-quadratic loss, we define the following norm.
Definition 1: For the r.v. U ∈ R^n and p > 0,

‖U‖_p := (E[‖U‖^p])^{1/p},    (5)

where ‖·‖ denotes the Euclidean norm. For p ≥ 1 the function in (5) defines a norm and obeys the triangle inequality ‖U + V‖_p ≤ ‖U‖_p + ‖V‖_p. Therefore, throughout the paper we define the L p space, for p ≥ 1, as the space of r.v.'s on a fixed probability space (Ω, σ(Ω), P) such that the norm in (5) is finite. However, many of our results will hold for 0 < p < 1, for which (5) is not a norm. In particular, for Z ∼ N(0, I) the norm in (5) is given by

‖Z‖_p^p = 2^{p/2} Γ((n + p)/2) / Γ(n/2),    (7)

and for V uniform over the n-dimensional ball of radius r the norm in (5) is given by ‖V‖_p^p = n r^p / (n + p). Note that for n = 1 we have ‖U‖_p^p = E[|U|^p], and therefore from now on we will refer to ‖U‖_p^p as the p-th moment of U. Naturally, for n > 1, there are many other ways of defining the moments; see for example [38]. However, in view of the information theoretic problems we are interested in, such as, for example, those from previous work [13], the definition in (5) arises naturally.
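As a sanity check on the n = 1 case of (7), the standard Gaussian identity E|Z|^p = 2^{p/2} Γ((p+1)/2)/√π can be compared against simulation (the snippet is illustrative; the helper name is ours):

```python
import math, random

def gaussian_abs_moment(p):
    """Closed form for E|Z|^p with Z ~ N(0,1): 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

random.seed(0)
N = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
for p in (1.0, 2.0, 3.0):
    mc = sum(abs(z) ** p for z in samples) / N   # Monte Carlo estimate of E|Z|^p
    assert abs(mc - gaussian_abs_moment(p)) < 0.05
```

For p = 2 the formula gives 1 (the variance) and for p = 4 it gives 3 (the Gaussian fourth moment), as expected.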
Definition 2: For any p > 0, we define the minimum mean p-th error (MMPE) of estimating X from Y as

mmpe(X|Y; p) := inf_f ‖X − f(Y)‖_p^p,    (9)

where the minimization is over all possible Borel measurable functions f(Y). Whenever the optimal MMPE estimator exists and is unique (up to a set of measure zero) we shall denote it by f p(X|Y). The optimal estimator in (9) might not be unique (i.e., there could be two or more estimators that do not agree on a set of positive measure), in which case we define the set F p(X|Y) := {f : f is a minimizer in (9)}. (10)
Remark 1: The notation f p(X|Y) for the optimal estimator in (9) is inspired by the conditional expectation E[X|Y], and f p(X|Y) should be thought of as an operator on X and a function of Y. Indeed, for p = 2 the MMPE reduces to the MMSE; that is, mmpe(X|Y; 2) = mmse(X|Y) and f 2(X|Y) = E[X|Y]. The properties of f p(X|Y) as an operator on X will be investigated in Proposition 5. Finally, similarly to the conditional expectation, the notation f p(X|Y = y) should be understood as an evaluation at a realization of the random variable Y, while f p(X|Y) should be understood as a function of the random variable Y, which is itself a random variable.
We shall denote mmpe(X|Y; p) by mmpe(X, snr, p) if Y and X are related as Y = √snr X + Z, where Z, X, Y ∈ R^n, Z ∼ N(0, I) is independent of X, and snr ≥ 0 is the SNR. When it is necessary to emphasize the SNR at the output Y, we will denote it by Y snr. Since the distribution of the noise is fixed, mmpe(X|Y; p) is completely determined by the distribution of X and by snr, and there is no ambiguity in using the notation mmpe(X, snr, p).
Applications to the Gaussian noise channel will be the main focus of this paper.
Note that there are other ways of defining the loss function in (9); our definition in (9) is motivated by the following: • for X ∈ R^1 the error in (9) reduces to a natural expression with loss function |x − f(y)|^p; • (9) naturally appears in applications of Hölder's or Jensen's inequalities to mmse(X|Y); and • the norm in (5) used in the definition of (9) can be related to information theoretic quantities, such as differential entropy and Rényi entropy, via the vector moment entropy inequality from [40]. We shall also look at the p-th error achieved by the suboptimal (unless p = 2) estimator f(Y) = E[X|Y].

B. Orthogonality-Like Property
The MMPE for p ≠ 2 differs from the MMSE in a number of aspects. The main difference is that the norm defined in (5) is not a Hilbert space norm in general (unless p = 2); as a result, there is no notion of inner product or orthogonality, and f p(X|Y), unlike E[X|Y], can no longer be thought of as an orthogonal projection. Therefore, the orthogonality principle, an important tool in the analysis of the MMSE, is no longer available when studying the MMPE for p ≠ 2. However, an orthogonality-like property can indeed be shown for the MMPE.
Proposition 2 (Necessary and Sufficient Condition for the Optimality of f p(X|Y)): For any X, any snr > 0, and p ≥ 1, f p(X|Y) is an optimal estimator if and only if, for any deterministic function g : R^n → R^n,

E[‖W‖^{p−2} W^T g(Y)] = 0,    (18a)

where W = X − f p(X|Y). Moreover, for 0 < p < 1 the condition in (18a) is necessary for optimality. Proof: See Appendix B. Note that for n = 1 and p ∈ R^+ the condition in Proposition 2 reduces to E[|W|^{p−1} sign(W) g(Y)] = 0. Moreover, for p = 2 Proposition 2 reduces to the familiar orthogonality principle. In [26, Lemma 1], by replicating the above argument and by assuming that p/2 ∈ N and n = 1, it was shown that the optimal MMPE estimator is unique. However, since the proof relies heavily on the assumption that p/2 ∈ N and n = 1, this argument cannot be extended in a straightforward way to p ∈ R^+ or n > 1.
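For p = 2 the orthogonality-like condition reduces to the classical orthogonality principle E[W g(Y)] = 0 for the conditional-mean error W. This can be checked by simulation for a BPSK input, for which E[X|Y = y] = tanh(√snr y) is the well-known posterior mean; the test functions g below are arbitrary choices of ours:

```python
import math, random

random.seed(1)
snr, N = 1.0, 400_000
s = math.sqrt(snr)
sum_lin = sum_tanh = 0.0
for _ in range(N):
    x = random.choice((-1.0, 1.0))                 # equiprobable BPSK input
    y = s * x + random.gauss(0.0, 1.0)             # AWGN channel output
    w = x - math.tanh(s * y)                       # error of the conditional mean
    sum_lin += w * y                               # test function g(y) = y
    sum_tanh += w * math.tanh(y)                   # test function g(y) = tanh(y)
corr_lin, corr_tanh = sum_lin / N, sum_tanh / N
# E[W g(Y)] = 0 for every g when p = 2 (orthogonality principle)
assert abs(corr_lin) < 0.03 and abs(corr_tanh) < 0.02
```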

C. Examples of Optimal MMPE Estimators
In general we do not have a closed-form solution for the MMPE optimal estimator in (14). Interestingly, the optimal estimator for Gaussian inputs can be found and is linear for all p ≥ 1. Similar results have been demonstrated in [20] and [26] for scalar Gaussian inputs. Next we extend this result to vector inputs and give two alternative proofs of the linearity of the optimal MMPE estimator for Gaussian inputs, via Proposition 1 and via Proposition 2.
Proposition 3: For input X G ∼ N(0, I) and p ≥ 1,

mmpe(X G, snr, p) = ‖Z‖_p^p / (1 + snr)^{p/2},

with optimal estimator given by f p(X G|Y) = (√snr/(1 + snr)) Y. Proof: The key observation is that W = X G − (√snr/(1 + snr)) Y has a Gaussian distribution and is independent of Y. So, for any two functions f(·) and g(·) the relevant expectation factors into a product of expectations (this is (24)). Therefore, by using (24) for the estimator f p(X G|Y = y) = (√snr/(1 + snr)) y, the necessary and sufficient conditions in Proposition 2 hold, and thus the linear estimator must be an optimal one. The optimal MMPE estimator is, in general, a function of p, as shown next.
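Proposition 3 can be checked numerically without any sampling in the scalar case: the error of a linear estimator a·Y is Gaussian with variance (1 − a√snr)^2 + a^2, and for p = 4 the fourth absolute moment of a centered Gaussian is 3 times the variance squared. The sketch below (the closed-form helper is ours) confirms that a = √snr/(1 + snr) minimizes the fourth moment over a grid of linear coefficients:

```python
import math

def fourth_moment_error(a, snr):
    """E|X - a*Y|^4 for X ~ N(0,1), Y = sqrt(snr)*X + Z: the error is Gaussian
    with variance (1 - a*sqrt(snr))^2 + a^2, and E|N(0,s^2)|^4 = 3*s^4."""
    var = (1 - a * math.sqrt(snr)) ** 2 + a ** 2
    return 3 * var ** 2

snr = 3.0
a_star = math.sqrt(snr) / (1 + snr)      # the coefficient claimed by Proposition 3
best = fourth_moment_error(a_star, snr)  # equals 3/(1+snr)^2
for k in range(-50, 51):                 # grid of alternative linear coefficients
    a = a_star + 0.01 * k
    assert fourth_moment_error(a, snr) >= best - 1e-12
```

Of course this only verifies optimality among linear estimators; the content of Proposition 3 is that no nonlinear estimator does better.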

Proposition 4: Let X D be a discrete binary r.v. with P[X D = x 1] = q and P[X D = x 2] = 1 − q. Then the optimal MMPE estimator f p(X D|Y = y) is given by (26a). In particular, for p = 1 the optimal estimator selects the support point with the larger posterior probability. Proof: See Appendix D. Proposition 4 will be useful in demonstrating several examples and counterexamples in the following sections. Note that for the practically relevant case of BPSK modulation, i.e., x 1 = −x 2 = 1 and q = 1/2, the optimal estimator in (26a) reduces to a symmetric expression, which for p = 1 is the hard-decision decoder x̂(y) = sign(y). By Proposition 4 we can show that the orthogonality principle holds only for p = 2 (when the MMPE corresponds to the MMSE): Fig. 1 plots the orthogonality expression vs. p for a BPSK input, and it is zero only at p = 2.
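The p = 1 claim for BPSK can be illustrated by simulation: the hard-decision decoder sign(y) achieves a smaller mean absolute error than the conditional mean tanh(√snr y), which is MMSE-optimal (p = 2) but not MMPE-optimal for p = 1. A sketch under the channel model Y = √snr X + Z:

```python
import math, random

random.seed(2)
snr, N = 1.0, 200_000
s = math.sqrt(snr)
mae_hard = mae_mean = 0.0
for _ in range(N):
    x = random.choice((-1.0, 1.0))
    y = s * x + random.gauss(0.0, 1.0)
    mae_hard += abs(x - math.copysign(1.0, y))       # hard-decision decoder sign(y)
    mae_mean += abs(x - math.tanh(s * y))            # conditional-mean estimator
mae_hard /= N
mae_mean /= N
assert mae_hard < mae_mean   # for p = 1 the hard decision is the better estimator
```

For reference, the hard-decision mean absolute error equals 2 Q(√snr), since the error magnitude is 2 exactly when the sign decision fails.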

D. Basic Properties of the Optimal MMPE Estimator
Interestingly, many of the known properties of the conditional expectation E[X|Y] carry over to the optimal MMPE estimator. Proposition 5: For any p > 0 the optimal MMPE estimator satisfies basic properties in terms of the input distribution, such as linearity under deterministic shifts and scalings, stability, and degradedness. Proof: See Appendix E. It is important to point out that, in general, the linearity property does not hold for the sum of random variables. That is, the property f p(X 1 + X 2|Y) = f p(X 1|Y) + f p(X 2|Y) is in general not true. This comes as no surprise, as it is very common in Bayesian estimation that the optimal estimator is biased [41].

IV. PROPERTIES OF THE MMPE
In this section we explore properties of the MMPE as a function of SNR and of the input distribution.

A. Basic Properties
The next two properties of the MMPE directly follow from the properties of f p (X|Y) in Proposition 5.

B. Estimation of the Input is Equivalent to Estimation of the Noise
The following lemma is commonly applied in the analysis of the MMSE.
Lemma 1: For Y = √snr X + Z as above, E[Z|Y] = Y − √snr E[X|Y], (31a) and consequently mmse(Z|Y) = snr · mmse(X|Y). (31b)
Lemma 1 states that estimating the noise is equivalent to estimating the input signal if one uses the conditional expectation as an estimator.
Next we show that an equivalent statement holds for the MMPE.
Proposition 7: For X, Z, Y as above,

mmpe(Z|Y; p) = snr^{p/2} · mmpe(X, snr, p),    (32a)

with optimal estimator f p(Z|Y) = Y − √snr f p(X|Y). (32b) Proof: From the definition of the MMPE in (9), for any estimator of the form g(Y) = Y − √snr f(Y) we have Z − g(Y) = √snr (f(Y) − X), so the p-th errors of estimating Z and X differ exactly by the factor snr^{p/2}. This shows the equality in (32a). Moreover, since f p(X|Y) exists and the corresponding infimum is attainable by Proposition 1, so is the infimum defining f p(Z|Y), which is therefore given by (32b). This concludes the proof.
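The equivalence in Proposition 7 holds per sample: if f(Y) estimates X, then g(Y) = Y − √snr f(Y) estimates Z with error exactly √snr times the X-error. The following sketch illustrates the p = 2 case for a Gaussian input with its known linear optimal estimator:

```python
import math, random

random.seed(3)
snr, N = 2.0, 300_000
se_x = se_z = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    z = random.gauss(0.0, 1.0)
    y = math.sqrt(snr) * x + z
    se_x += (x - math.sqrt(snr) / (1 + snr) * y) ** 2   # optimal estimate of X
    se_z += (z - 1.0 / (1 + snr) * y) ** 2              # induced estimate of Z
se_x /= N
se_z /= N
# mmse(Z|Y) = snr * mmse(X|Y); here the identity even holds sample by sample
assert abs(se_z - snr * se_x) < 1e-6
```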

C. Change of Measure
The next result enables us to change the expectation from Y snr to Y snr 0 in the definition of the MMPE in (9) whenever snr ≤ snr 0 . This is particularly useful when we know the MMPE, or the structure of the optimal MMPE estimator, at one SNR value but not at another smaller SNR value.
One must be careful when evaluating Proposition 8. For example, since lim_{snr→0+} √(snr/snr 0) = 0, at first glance it appears that the expectation on the right-hand side of (36) is zero while mmpe(X, 0, p) is not, thus violating the equality. However, a more careful examination shows that, as snr → 0, the limit and the expectation in (36) cannot be interchanged; in the last equality we used the moment generating function of the chi-squared r.v. ‖Z‖². As an example, Proposition 8 for X ∼ N(0, 1) with the optimal linear estimator from Proposition 3, i.e., f(y) = ay for some a, evaluates to an explicit expression, where the equalities follow from: a) linearity of expectation and the fact that Z and X are independent; and b) direct computation of the Gaussian moments.

V. BOUNDS ON THE MMPE
In this section we develop bounds on the MMPE, many of which generalize well known MMSE bounds. However, we also show bounds that are unique to the MMPE and emphasize the usefulness of the MMPE.

A. Extension of Basic MMSE Bounds
An important upper bound on the MMSE, often used in practice, is the LMMSE bound, stated next.
(Footnote 4: this optimal a is evident from the specific change of measure that we have used; instead of the estimator coefficient √snr/(1 + snr) of Proposition 3, it appears with an extra normalization.) The next bound generalizes Proposition 9 to higher-order errors.
Proposition 10: For snr ≥ 0, 0 < q ≤ p, and any input X, the MMPE satisfies the ordering and LMMSE-type bounds (38a)–(38d), where ‖Z‖_p^p is given in (7). Proof: See Appendix G. It is interesting to point out that in the derivation of the bounds in Proposition 10 no assumption is placed on the distribution of Z, and thus the bounds hold in great generality. If Z is composed of independent identically distributed (i.i.d.) Gaussian elements, then the moment ‖Z‖_p^p in Proposition 10 can be tightly approximated in terms of factorials, with equality for even n and integer p/2. It is not difficult to check that for p = 2 Proposition 10 reduces to Proposition 9. The reason that the bounds on ‖X − E[X|Y]‖_p are only available for p ≥ 2, while the bounds on mmpe(X, snr, p) are available for all p > 0, is that the proof of the bound in (38b) uses Jensen's inequality, which requires p ≥ 2, while the proof of the bound in (38d) does not.
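The Gaussian moment ‖Z‖_p^p appearing in Proposition 10 has the closed form 2^{p/2} Γ((n+p)/2)/Γ(n/2) for Z ∼ N(0, I_n) (the p-th moment of a chi-distributed radius). A quick check of the formula against simulation (the helper name is ours):

```python
import math, random

def gauss_norm_moment(n, p):
    """E[||Z||^p] for Z ~ N(0, I_n): 2^(p/2) * Gamma((n+p)/2) / Gamma(n/2)."""
    return 2 ** (p / 2) * math.gamma((n + p) / 2) / math.gamma(n / 2)

random.seed(4)
n, p, N = 3, 4.0, 100_000
mc = 0.0
for _ in range(N):
    norm2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))  # ||Z||^2 ~ chi^2_n
    mc += norm2 ** (p / 2)
mc /= N
assert abs(mc / gauss_norm_moment(n, p) - 1.0) < 0.05
```

For n = 3 and p = 4 the formula gives E(χ²_3)² = 15, consistent with the chi-square mean-and-variance computation 3² + 6.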

B. Gaussian Inputs Are the Hardest to Estimate
Note that the bounds in Proposition 10 are similar to the bound in (37a) and blow up as snr → 0^+. Therefore, it is desirable to have bounds of the type in (37b). The next result demonstrates such a bound and shows that Gaussian inputs are asymptotically the hardest to estimate.
Proof: See Appendix H.

C. Interpolation Bounds and Continuity
One of the key advantages of using the MMPE is that the MMPE of order q can be tightly predicted based on knowledge of the MMPE at a lower order p and a higher order r. At the heart of this analysis is the interpolation result for L p spaces [42]: given 0 < p ≤ q ≤ r and α ∈ (0, 1) such that 1/q = α/p + (1 − α)/r, the q-th norm can be bounded as

‖U‖_q ≤ ‖U‖_p^α ‖U‖_r^{1−α},    (41)

which implies that the norm is log-convex and thus a continuous function of p [43, Th. 5.1.1]. Next, we present several interpolation results for the MMPE. Proposition 12 (Log-Convexity and Interpolation): For any 0 < p ≤ q ≤ r ≤ ∞ and α ∈ (0, 1) such that 1/q = α/p + (1 − α)/r, the interpolation bounds (42) hold. Proof: The bound in (42b) follows by applying (41) to the error of the optimal order-q estimator. The bound in (42d) follows by applying (41) to the error of an arbitrary estimator f(Y), where the last inequality follows from (41). Finally, the bounds in (42e) and (42f) follow by choosing f(Y) in (42d) equal to f r(X|Y) and f p(X|Y), respectively. This concludes the proof.
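The interpolation inequality (41) holds for any probability measure, including the empirical distribution of a sample, so it can be checked deterministically on simulated data. The sketch below verifies ‖U‖_q ≤ ‖U‖_p^α ‖U‖_r^{1−α} for an exponential sample (the choice of distribution is arbitrary):

```python
import random

def pnorm(samples, p):
    """Empirical p-norm (E|U|^p)^(1/p) over a sample."""
    return (sum(abs(u) ** p for u in samples) / len(samples)) ** (1.0 / p)

random.seed(5)
samples = [random.expovariate(1.0) for _ in range(50_000)]
p, r = 1.0, 8.0
for q in (2.0, 3.0, 5.0):
    # alpha chosen so that 1/q = alpha/p + (1 - alpha)/r
    alpha = (1.0 / q - 1.0 / r) / (1.0 / p - 1.0 / r)
    lhs = pnorm(samples, q)
    rhs = pnorm(samples, p) ** alpha * pnorm(samples, r) ** (1 - alpha)
    assert lhs <= rhs + 1e-12
```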
From log-convexity we can deduce continuity: Proposition 13 states that mmpe(X, snr, p) is a continuous function of the order p, where the last step of the proof is due to the continuity of the norm.
An interesting question is whether an interpolation inequality of the form (44), interpolating the MMPE itself, holds instead of (42e) and (42f). A counterexample to the interpolation inequality in (44) is shown in Fig. 2, where we take a binary input X ∈ {±1} equally likely, p = 2, r = 8, and snr = 1, and plot the left-hand side (green dashed line) and the right-hand side (red dotted line) of the conjectured inequality in (44). This shows that (44) is not true in general.

D. Bounds on Discrete Inputs
Next, we investigate properties of the MMPE under the assumption that the input is a discrete r.v. Discrete inputs are commonly encountered in practice and, therefore, it is worthwhile to investigate their performance. Proposition 14 bounds the MMPE in terms of the probability of detection error. Proof: The proof follows by upper-bounding the MMPE with the probability of detection error: choose as a suboptimal estimator X̂ D(Y) the maximum a posteriori (MAP) decoder, so that the p-th error is controlled by the probability of a MAP decoding error. Such behavior has already been observed for the MMSE in [17] and [44].
Proposition 15: Let X D be a discrete r.v. with |supp(X D)| = N and P[X D = x i] = p i for x i ∈ supp(X D); then the MMPE of X D can be upper bounded explicitly in terms of N, the p i's, and the support points. A slightly weaker, yet computationally simpler, bound than that in Proposition 15 can be derived by choosing X̂ D(Y) to be a threshold (or sphere) decoder. This weaker bound will be used later to bound the mutual information in Section VIII-B.
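The mechanism behind Propositions 14 and 15, upper-bounding the MMPE by a decision-error event, can be illustrated for BPSK: when the MAP (here, sign) decision is wrong the error magnitude is exactly 2, so mmpe(X D, snr, p) ≤ 2^p · P e. The p = 2 instance is checked below by simulation, comparing the conditional-mean squared error against 4 P e:

```python
import math, random

random.seed(6)
snr, N = 1.5, 200_000
s = math.sqrt(snr)
mse = err = 0.0
for _ in range(N):
    x = random.choice((-1.0, 1.0))
    y = s * x + random.gauss(0.0, 1.0)
    mse += (x - math.tanh(s * y)) ** 2          # conditional-mean (MMSE) error
    err += (math.copysign(1.0, y) != x)         # MAP (sign) decision error
mse /= N
err /= N
# mmse <= d^2 * Pe with worst-case error magnitude d = |1 - (-1)| = 2
assert mse <= (2.0 ** 2) * err
```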

VI. CONDITIONAL MMPE
We define the conditional MMPE as follows.
Definition 3: For any p > 0,

mmpe(X, snr, p|U) := inf_f E[‖X − f(Y, U)‖^p],    (48)

where the minimization is over all Borel measurable functions f(Y, U). The conditional MMPE in (48) reflects the fact that the optimal estimator is given additional information in the form of U. Note that when Z is independent of (X, U) we can write the conditional MMPE for X u ∼ P X|U(·|u) as

mmpe(X, snr, p|U) = ∫ mmpe(X u, snr, p) dP U(u).    (49)

Since giving extra information does not increase the estimation error, we have the following result.
Proposition 16: For any p > 0, mmpe(X, snr, p|U) ≤ mmpe(X, snr, p). (50) Finally, the following proposition generalizes [11, Proposition 3.4] and states that MMPE estimation of X from two observations is equivalent to estimating X from a single observation with a higher SNR.
Proposition 17: For every X and p ≥ 0, let U = √Δ X + Z′, where Z′ ∼ N(0, I) and (X, Z, Z′) are mutually independent. Then

mmpe(X, snr 0, p|U) = mmpe(X, snr 0 + Δ, p).    (51)

Proof: For two independent observations Y snr 0 = √snr 0 X + Z and Y Δ = √Δ X + Z′, where Z and Z′ are independent, by using maximal ratio combining we have that

(√snr 0 Y snr 0 + √Δ Y Δ)/√(snr 0 + Δ) = √(snr 0 + Δ) X + W,

where W ∼ N(0, I). Next, by using the same argument as in [11, Proposition 3.4], the conditional distribution of X given (Y snr 0, Y Δ) agrees with that of X given the combined observation at SNR snr 0 + Δ. The equivalence of the posterior probabilities implies that the estimation of X from the combined observation is as good as the estimation of X from (Y snr 0, Y Δ). This concludes the proof.
Propositions 16 and 17 imply that, for fixed X and p, mmpe(X, snr 0, p) ≥ mmpe(X, snr 0 + Δ, p) for any Δ ≥ 0, and we have the following: Corollary 3: mmpe(X, snr, p) is a non-increasing function of snr.
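The maximal-ratio-combining step in the proof of Proposition 17 is easy to verify directly: combining independent observations at SNRs snr 0 and Δ yields a single observation at SNR snr 0 + Δ. For a Gaussian input the resulting MMSE must therefore be 1/(1 + snr 0 + Δ), as the sketch below confirms (Δ is written `delta`):

```python
import math, random

random.seed(7)
snr0, delta, N = 1.0, 2.0, 300_000
se = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y0 = math.sqrt(snr0) * x + random.gauss(0.0, 1.0)   # observation at snr0
    y1 = math.sqrt(delta) * x + random.gauss(0.0, 1.0)  # observation at delta
    # maximal ratio combining: a single observation at SNR snr0 + delta
    y = (math.sqrt(snr0) * y0 + math.sqrt(delta) * y1) / math.sqrt(snr0 + delta)
    a = math.sqrt(snr0 + delta) / (1 + snr0 + delta)    # optimal linear coefficient
    se += (x - a * y) ** 2
se /= N
assert abs(se - 1.0 / (1 + snr0 + delta)) < 0.01        # = mmse at snr0 + delta
```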

VII. SCPP BOUND AND ITS COMPLEMENT
The SCPP is a powerful tool that can be used to show the advantage of Gaussian inputs over arbitrary inputs in certain channels with Gaussian noise. In conjunction with the I-MMSE relationship, the SCPP provides simple and insightful converse proofs to the capacity of multi-user AWGN channels. The original proof of the SCPP in [9] and [10] relied on bounding the MMSE. Next we give a simpler proof of the SCPP that does not require knowledge of the derivative of the MMSE and can easily be extended to the MMPE of any order p.
First observe that, in light of the bound in (38d), for any snr > 0 we can always find a β ≥ 0 that matches the MMPE at that SNR. Next we generalize the SCPP bound to the MMPE (Proposition 18). Proof: Let snr = snr 0 + Δ for some Δ ≥ 0, and let Y Δ = √Δ X + Z′; combining the two observations yields an effective noise W ∼ N(0, I). Next, define a suboptimal estimator given (Y Δ, Y snr 0) as in (56), for some γ ∈ R to be determined later. Then the (in)equalities follow from: a) Proposition 17; b) using the suboptimal estimator in (56); and c) choosing γ = ‖Z‖_p² / (‖Z‖_p² + Δ·m) for m defined in (55). Next, by applying the triangle inequality to (57) we obtain a bound on mmpe^{1/p}(X, snr, p), where in the last step we used a + b ≤ √2 · √(a² + b²). Note that for the case p = 2, instead of using the triangle inequality in (58), the term in (57) can be evaluated exactly, where W and Z′ are independent and Z′ ∼ N(0, I). It is not clear how to solve (59) for p ≠ 2, and thus we leave it for future work. Remark 5: Note that the proof of Proposition 18 does not require the assumption that Z is Gaussian; it only requires the assumptions of Proposition 17, that is, that the channel is such that the estimation of X from two observations is equivalent to estimating X from a single observation with a higher SNR.

A. Complementary SCPP Bound
In this section we give a bound that complements the SCPP bound: while the SCPP bounds the MMPE for all snr ≥ snr 0, we give a bound on the MMPE for all snr ≤ snr 0, where it is assumed that the MMPE is known at snr 0.
The next result enables us to bound the MMPE at snr with values of the MMPE at snr 0 while varying the order.
Proposition 19: For 0 < snr ≤ snr 0, any X, and p ≥ 0, we have

mmpe(X, snr, p) ≤ κ n,t · mmpe^{1/m}(X, snr 0, p·m),  with m = (1 + t)/(1 − t) and t = (snr 0 − snr)/snr 0,

where κ n,t is an explicit constant (arising from the chi-square moment generating function) that depends only on n and t, with κ n,t → 1 as t → 0. Proof: From Proposition 8 we have that mmpe(X, snr, p) can be bounded, where the (in)equalities follow from: a) Hölder's inequality with conjugate exponents 1 ≤ m, r such that 1/m + 1/r = 1; and b) recognizing that the expectation of the exponential is the moment generating function of a chi-square distribution with n degrees of freedom, which exists only if r(snr 0 − snr)/(2 snr 0) < 1/2. Next, we let t = (snr 0 − snr)/snr 0 and r = (t + 1)/(2t), so that m = (1 + t)/(1 − t). Observe that the bound in (60) now holds for all values of 0 < snr ≤ snr 0, since r(snr 0 − snr)/(2 snr 0) = (1 + t)/4 < 1/2 whenever t < 1. With the choice m = (1 + t)/(1 − t), the bound in (60) becomes the claimed one. This concludes the proof. The bound in Proposition 19 is key in showing new bounds on the phase transition region for the MMSE, presented in the next section.
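Step b) of the proof uses the chi-square moment generating function E[e^{tχ²_n}] = (1 − 2t)^{−n/2}, finite only for t < 1/2, which is exactly the source of the constraint on the Hölder exponent r. A quick numerical check of the MGF (the helper name is ours):

```python
import math, random

def chi2_mgf(n, t):
    """MGF of a chi-square r.v. with n degrees of freedom: (1 - 2t)^(-n/2).
    Valid only for t < 1/2; the expectation diverges otherwise."""
    assert t < 0.5
    return (1.0 - 2.0 * t) ** (-n / 2.0)

random.seed(8)
n, t, N = 2, 0.2, 400_000
mc = 0.0
for _ in range(N):
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    mc += math.exp(t * chi2)
mc /= N
assert abs(mc / chi2_mgf(n, t) - 1.0) < 0.05
```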
As an application of Proposition 19 we show that the MMPE is a continuous function of SNR.
Proposition 20: For fixed X and p, mmpe(X, snr, p) is a continuous function of snr > 0.
Proof: Assume without loss of generality that snr 0 ≥ snr. Then the (in)equalities follow from: a) the MMPE being a decreasing function of SNR together with snr 0 ≥ snr; b) Proposition 19; and c) the definition of t in Proposition 19, which gives lim_{snr→snr 0} t = 0 and lim_{snr→snr 0} κ n,t = 1, together with continuity of the MMPE in p from Proposition 13. This concludes the proof.

VIII. APPLICATIONS
We next show how the MMPE can be used to derive tighter versions of some well known bounds. It is important to point out that even though the focus of this paper is on the AWGN setting, the results that follow (Theorem 1, Theorem 2 and Theorem 3) apply to any additive channel model in which the noise is an absolutely continuous random variable, without the need for the i.i.d. assumption.

A. Bounds on the Differential Entropy
For any random vector U such that |K U| < ∞ and h(U) < ∞, and any random vector V, the following inequality is considered a continuous analog of Fano's inequality [6]:

h(U|V) ≤ (n/2) log( (2πe/n) · E[‖U − V‖²] ),    (62)

where the inequality in (62) is a consequence of the arithmetic mean-geometric mean inequality; that is, for any 0 ⪯ A we have used |A|^{1/n} = (∏_i λ_i)^{1/n} ≤ (1/n) ∑_i λ_i = Tr(A)/n, where the λ_i's are the eigenvalues of A.
By applying (62) to the AWGN setting, for any X such that |K X| < ∞ and h(X) < ∞, and by using Proposition 10 with q = 1, we can arrive at the trivial bound (63), valid for any p ≥ 2. Next, we show that the inequality in (62) can be generalized in terms of the norm in (5), and the trivial bound in (63) can be improved. Theorem 1: For any U ∈ R^n such that h(U) < ∞ and ‖U‖_p < ∞ for some p ∈ (0, ∞), and for any V ∈ R^n, the conditional differential entropy h(U|V) is bounded in terms of ‖U − V‖_p. Proof: See Appendix J. Note that the result in Theorem 1 holds in great generality, i.e., the AWGN assumption is not necessary. As an application of Theorem 1 to the AWGN setting, we obtain the following stronger version of the inequality in (63), stated in terms of mmpe(X, snr, p).
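The p = 2 scalar instance of the machinery behind (62) is the maximum-entropy bound h(U) ≤ (1/2) log(2πe E[U²]), with equality iff U is Gaussian; Theorem 1 replaces the second moment by a general p-th moment. A sketch of the p = 2 case (the Exp(1) example is ours; h(Exp(1)) = 1 nat and E[U²] = 2 for a rate-1 exponential):

```python
import math

def entropy_bound_p2(second_moment):
    """Max-entropy bound (nats): h(U) <= 0.5 * log(2*pi*e * E[U^2]) for any real U."""
    return 0.5 * math.log(2 * math.pi * math.e * second_moment)

# Exp(1): differential entropy 1 nat, second moment 2 -> bound is not tight
h_exp, m2_exp = 1.0, 2.0
assert h_exp <= entropy_bound_p2(m2_exp)

# N(0,1): h = 0.5*log(2*pi*e), second moment 1 -> bound met with equality
h_gauss = 0.5 * math.log(2 * math.pi * math.e)
assert abs(h_gauss - entropy_bound_p2(1.0)) < 1e-12
```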

B. Generalized Ozarow-Wyner Bound
In [37] the following “Ozarow-Wyner lower bound” on the mutual information achieved by a discrete input X_D transmitted over an AWGN channel was shown in (66), where lmmse(X, snr) is the LMMSE. The advantage of the bound in (66) compared to existing bounds is its computational simplicity; the bound has been shown to be useful for problems such as two-user Gaussian interference channels [45], [46], communication with a disturbance constraint [13], energy harvesting problems [47], [48], and information-theoretic security [49].
The bound on the gap in (66) was sharpened in [45, Remark 2] by replacing lmmse(X, snr) with mmse(X, snr), which is valid since lmmse(X, snr) ≥ mmse(X, snr). Next, we generalize the bound in (66) to discrete vector inputs and give the sharpest known bound on the gap term.
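The step lmmse(X, snr) ≥ mmse(X, snr) invoked above can be illustrated numerically. The sketch below (the binary input and the closed-form lmmse = 1/(1+snr) for a unit-variance input are our assumptions for this example) compares a Monte Carlo estimate of the MMSE with the LMMSE.

```python
import numpy as np

rng = np.random.default_rng(2)
n, snr = 200_000, 2.0

# X uniform on {-1, +1} (unit variance), observed as Y = sqrt(snr) X + Z
x = rng.choice([-1.0, 1.0], size=n)
y = np.sqrt(snr) * x + rng.standard_normal(n)

# MMSE estimator for this input: E[X | Y] = tanh(sqrt(snr) Y)
mmse = float(np.mean((x - np.tanh(np.sqrt(snr) * y)) ** 2))
# best *linear* estimator of a unit-variance input gives lmmse = 1/(1+snr)
lmmse = 1.0 / (1.0 + snr)

assert mmse <= lmmse + 1e-3   # lmmse(X, snr) >= mmse(X, snr)
```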
Theorem 2 (Generalized Ozarow-Wyner Bound): Let X_D be a discrete random vector with finite entropy, with p_i = P[X_D = x_i] and x_i ∈ supp(X_D). Moreover, for any p > 0 let K_p be a set of continuous random vectors, independent of X_D, such that for every U ∈ K_p we have h(U) < ∞ and ‖U‖_p < ∞. Then the bounds in (67) hold for any p > 0.
Proof: See Appendix K.
It is interesting to note that the lower bound in (67b) resembles the bound for lattice codes in [50, Th. 1], where U can be thought of as dither, G_{2,p} corresponds to the log of the normalized p-th moment of a compact region in R^n, G_{1,p} corresponds to the log of the normalized MMSE term, and H(X_D) corresponds to the capacity C.
In order to show the advantage of Theorem 2 over the original Ozarow-Wyner bound (the case n = 1 with the LMMSE in place of the MMPE), we consider X_D uniformly distributed with the number of points equal to N = √(1 + snr); that is, we choose the number of points such that H(X_D) ≈ (1/2) log(1 + snr). Fig. 3 shows the comparison of the bounds; the solid cyan line is the “shaping loss”. Moreover, for any p > 0 we have lim_{n→∞} G_{2,p}(U) = 0 (Theorem 3, proved in Appendix L), so the corresponding gap term vanishes as n → ∞. For recent applications of the bound in Theorem 2 to non-Gaussian and MIMO channels the reader is referred to [51]-[53].

C. New Bounds on the MMSE and Phase Transitions
The SCPP is instrumental in showing the behavior of the MMSE of capacity-achieving codes. For example, as the length of any capacity-achieving code goes to infinity, the MMSE behaves as in (69), as shown: in [14], for the Gaussian point-to-point channel with output Y_{snr_0}, with β = γ = 0; in [15], for the Gaussian BC with outputs Y_{snr_1} and Y_{snr_0}, where snr_0 ≤ snr_1 and the rate pair is (R_1, R_2) = ((1/2) log(1 + β snr_1), (1/2) log((1 + snr_0)/(1 + β snr_0))) for some β ∈ [0, 1], with γ = 0; in [15], for the Gaussian wiretap channel with outputs Y_{snr_0} (primary) and Y_{snr_1} (eavesdropper) with maximum equivocation D_max and rate R ≥ D_max, with β = γ = 0; and in [12], for the Gaussian point-to-point channel with output Y_{snr_1} and an MMSE disturbance constraint at Y_{snr_0} measured by mmse(X, snr_0) ≤ β/(1 + β snr_0) for some β ∈ [0, 1], with γ = β. The jump discontinuities in (69) at snr = snr_0 and snr = snr_1 are referred to as phase transitions.
Based on the above, an interesting question is how the MMSE in (69) behaves for codes of finite length. In [13], in order to study the phase transition phenomenon for inputs of finite length, the optimization problem in (70a) was proposed in Definition 4, for some β ∈ [0, 1]. The investigation in [13] revealed that M_n(snr, snr_0, β) in (70a) must be of the form in (71), for some snr_L and some function T_n(snr, snr_0, β), where the region snr_L ≤ snr ≤ snr_0 is referred to as the phase transition region and its width is defined as W(n) := snr_0 - snr_L. In [13] bounds on T_n(snr, snr_0, β) and W(n) were established, where the minimizing r in (72a) can be approximated by the expression in (72d).
Moreover, the width of the phase transition region scales as O(1/√n), as made precise in Theorem 4.
Proof: From the SCPP complementary bound in Proposition 19 with p = 1 we obtain an upper bound on mmse(X, snr) in terms of κ_{n,t} and an MMPE term. From the interpolation result in Proposition 12, letting q = 2(1 + t)/(1 - t) and p = 2, we have that for some r with 2 ≤ 2(1 + t)/(1 - t) = q < r, the MMPE term can be bounded as follows:
By putting all of the bounds together and letting γ = (1 - t)/(1 + t), we get the bound in (72a). Finally, the proof of the approximately optimal r in (72d) is given in Appendix M. The bounds in Theorem 4 and in (71) are shown in Fig. 4. The bound in Theorem 4 is asymptotically tighter than the one in (71); this follows since the phase transition region shrinks as O(1/√n) for Theorem 4, and as O(1/n) for the bound in (71). It is not possible in general to assert that Theorem 4 is tighter than (71); in fact, for small values of n, the bound in (71) can offer advantages, as seen for the case n = 1 shown in Fig. 4b. Another advantage of the bound in (71) is its analytical simplicity.

D. Bounds on the Derivative of the MMSE
The MMPE can be used to study the second derivative of mutual information (or the first derivative of the MMSE), as initiated for n = 1 in [9] and for n ≥ 1 in [10]. The second derivative of mutual information is important in characterizing the bandwidth-power trade-off in the wideband regime [54], [55], and has also been used in the proof of the SCPP in [9] and [10]. Moreover, in [9] it was shown that, for n = 1, the derivative of the MMSE and the quantity in (13) are related by an explicit bound. The main result of this subsection is the next bound.
Proposition 21: For any input X, mmse^2(X, snr) = mmpe^2(X, snr, 2), and the bound in (76) holds.
It can be observed that, for the case n = 1, using the bound in (38b) from Proposition 10 yields the inequality in (78), which significantly reduces the constant in (76) from 3 · 2^4 to 3. For a similar but slightly different bound on E[Cov^2(X|Y)] than that in (78), see [13].
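For intuition about the derivative of the MMSE, consider the scalar Gaussian input, for which mmse(X, snr) = 1/(1+snr) in closed form; in this special case the conditional variance is deterministic and the derivative equals -mmse^2(X, snr). The finite-difference check below is our own illustration of this Gaussian identity, not a computation from the paper.

```python
def mmse_gauss(snr: float) -> float:
    """mmse(X, snr) = 1 / (1 + snr) for scalar X ~ N(0, 1)."""
    return 1.0 / (1.0 + snr)

snr, h = 1.5, 1e-6
# central finite difference approximating d/dsnr mmse(X, snr)
deriv = (mmse_gauss(snr + h) - mmse_gauss(snr - h)) / (2.0 * h)

# for the Gaussian input the derivative equals -mmse^2(X, snr)
assert abs(deriv + mmse_gauss(snr) ** 2) < 1e-6
```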

IX. CONCLUDING REMARKS
This paper has considered the problem of estimating a random variable from a noisy observation under a general cost function, termed the MMPE. We have shown that many properties of the MMSE and the conditional expectation (i.e., optimal MMSE estimator) are identical or have a natural generalization to the MMPE and the MMPE optimal estimator.
We have also provided a new simpler proof of the SCPP for the MMSE and generalized it to the MMPE. We have shown that the new framework of the MMPE also permits the development of bounds that are complementary to the SCPP which in turn allows for new tighter characterizations of the phase transition phenomena that manifest, in the limit as the length of the capacity achieving code goes to infinity, as a discontinuity of the MMSE as a function of SNR.
We have also shown connections between the MMPE and the conditional differential entropy by generalizing a well-known continuous analog of Fano's inequality. The MMPE was further used to refine bounds on the conditional entropy and to improve the gap term in the Ozarow-Wyner bound.
Currently, we are investigating the connections between the bounds on the MMPE provided in this work and the rate-distortion problem with the MMPE distortion measure. Possible future applications of the sharpened version of the Ozarow-Wyner bound include sharpening the bounds on discrete inputs in [56] and [57]. Another interesting future direction is to consider a modified 'information bottleneck problem' [58] where the constraint on the mutual information is replaced by a constraint on the MMPE.

APPENDIX A PROOF OF PROPOSITION 1
For simplicity, we look at the case n = 1; the case n > 1 follows similarly. We first assume that snr > 0. The first direction follows trivially. For the other direction, we show that the infimum is achieved by f(y) = f_p(X|Y = y) given in (14). Since y is now given, we are simply looking for an optimal solution to the more general problem in (79), where X_y ∼ p_{X|Y}(·|y). The goal is to show that the infimum in (79) is achievable. Clearly, the infimum exists, since by [9, Proposition 6] X_y is a sub-Gaussian random variable and hence, for any p < ∞, all conditional moments are finite. Next, we show that the objective g(v) = E[|X_y - v|^p] is continuous. For arbitrary |v| < ∞ take a sequence v_n such that v_n → v; we want to show that g(v_n) → g(v). This can be done with the help of the dominated convergence theorem: we must find an integrable random variable θ such that |X_y - v_n|^p ≤ θ for all n. Such a θ is found as θ = 2^p(|X_y|^p + K^p), where the inequalities follow from: a) |X_y - v_n|^p ≤ (2 max(|X_y|, |v_n|))^p ≤ 2^p(|X_y|^p + |v_n|^p), which holds for any p ≥ 0; and b) every convergent sequence is bounded, and since the sequence v_n converges to v it is bounded by some finite K for every n. The integrability of θ follows again from the sub-Gaussian argument in [9, Proposition 6]. Therefore, we conclude that the function g(v) is continuous.
Next, we show that the infimum is attained at some |v_0| < ∞. By the definition of the infimum there exists a sequence v_n (not necessarily convergent) such that g(v_n) converges to the infimum. Towards a contradiction, assume that v_n → ∞. Then by Fatou's lemma lim inf_n g(v_n) = ∞. However, this contradicts the result in (80), and therefore the sequence v_n must be bounded. This, together with the fact that g(v) is continuous, implies that the infimum is attained. Therefore, for each y ∈ R there exists |v| < K that minimizes min_{v∈R} E[|X_y - v|^p]. Note that the optimizing v might not be unique. According to Definition 2, the MMPE is then obtained from these per-y minimizers as in (82). Note that we have shown that for every y the optimal value v is bounded, and therefore in (82) we can take the max instead of the sup. Moreover, as will be shown in Proposition 2, due to the strict convexity of |·|^p for p > 1 the optimizer is indeed unique. For the case of snr = 0+ the problem reduces to min_v E[|X - v|^p], which is bounded if and only if ‖X‖_p < ∞. This concludes the proof.
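To make the per-y optimization min_v E|X_y - v|^p concrete, the sketch below grid-searches the minimizer for a small discrete surrogate law (the atoms, probabilities, and grid are hypothetical choices of ours); consistent with the discussion above, p = 2 recovers the mean and p = 1 recovers a median.

```python
import numpy as np

# hypothetical discrete surrogate for the conditional law of X given Y = y
atoms = np.array([0.0, 1.0, 4.0])
probs = np.array([0.6, 0.2, 0.2])
grid = np.linspace(-2.0, 6.0, 8_001)   # step 1e-3

def argmin_g(p: float) -> float:
    """Grid-search argmin_v g(v) with g(v) = E|X_y - v|^p."""
    vals = [float(np.sum(probs * np.abs(atoms - v) ** p)) for v in grid]
    return float(grid[int(np.argmin(vals))])

mean = float(np.sum(probs * atoms))        # = 1.0
assert abs(argmin_g(2.0) - mean) < 1e-2    # p = 2: the mean
assert abs(argmin_g(1.0) - 0.0) < 1e-2     # p = 1: the (unique) median, 0
```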

APPENDIX B PROOF OF PROPOSITION 2
We take a classical approach used in estimation theory to find an optimal estimator by using tools from the calculus of variations [59, Ch. 7, Th. 1]. A necessary condition for f to be a minimizer in (14) is that its functional derivative vanishes for all admissible g(Y). Therefore, we focus on the limit in (84). We seek to apply the dominated convergence theorem to (84) in order to interchange the order of the limit and the expectation. To that end we let v = x - f(y) and expand the difference quotient as in (85). Next, for the integrand we observe that all the terms in (85) are of order no more than p, and since all of the terms are in L^p (i.e., p-integrable), the quantity in (85) is integrable for any ε. Therefore, the dominated convergence theorem applies and we can interchange the order of limit and expectation in (84). Next, observe that we can re-write the limit as a derivative. By using the chain rule of matrix calculus we arrive at the expression for the functional derivative.
Finally, for f_p(X|Y) to be optimal it must make the functional derivative vanish for any admissible g(Y). This verifies the necessary condition for optimality for p > 0.
To verify that this is also a sufficient condition for optimality, we take the second variational derivative of E[‖X - f(Y)‖^p] and demonstrate that it is always positive for p ≥ 1. This follows since ‖x - (f(y) + εg(y))‖^p is a convex function of ε for p ≥ 1. This verifies the sufficient condition for p ≥ 1 and concludes the proof.

APPENDIX C PROOF OF PROPOSITION 3
In Proposition 1 we let X_y ∼ p_{X|Y}(·|y) and therefore have to solve (88) for all y. We know that X_y is Gaussian with X_y ∼ N(√snr y/(1 + snr), 1/(1 + snr)). The optimization problem in (88) can be transformed into (89), where Z ∼ N(0, 1). Next, we take the derivative with respect to a in (89); the interchange of the order of differentiation and expectation in (91) is possible by the Leibniz integral rule [60], which requires verifying that |g(a, z)| ≤ θ(z) for some integrable θ(z). This is indeed the case, and θ(z) is clearly integrable, so the change of the order of differentiation and expectation in (89) is justified. Next, observe that for a fixed a the function g(z, a) in (92) is a decreasing function of z for any p ≥ 1, and in addition g(z, a) is an odd function around z = a. Since f(a) is an average value of g(a, z), the sign of f(a) is the same as the sign of a. All this implies that a = 0 is a critical point and a minimum. Therefore, the optimal â = 0 for the optimization problem in (89), and the optimal v̂ for the original optimization problem is found through (90) to be v̂ = √snr y/(1 + snr).
Finally, we compute mmpe(X, snr, p) for X ∼ N(0, 1), where the equalities follow from: a) X and Z are independent Gaussian r.v.'s, so the estimation error has an equivalent distribution given by Ẑ/√(1 + snr), where Ẑ ∼ N(0, 1); and b) (7) with n = 1. This concludes the proof.
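A numerical check of this appendix's conclusion, namely that for a Gaussian conditional law the MMPE-optimal estimate does not depend on p: the sketch below (the mean, variance, sample size, and grid are our own choices) grid-searches argmin_v E|X_y - v|^p for several p and finds the conditional mean each time.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, var = 0.7, 0.25          # hypothetical conditional mean and variance
xs = mu + np.sqrt(var) * rng.standard_normal(200_000)
grid = np.linspace(mu - 1.0, mu + 1.0, 201)

def empirical_minimizer(p: float) -> float:
    """Grid-search argmin_v of the empirical E|X_y - v|^p."""
    vals = [float(np.mean(np.abs(xs - v) ** p)) for v in grid]
    return float(grid[int(np.argmin(vals))])

for p in (1.0, 2.0, 4.0):
    # by symmetry of the Gaussian the minimizer is the mean for every p >= 1
    assert abs(empirical_minimizer(p) - mu) < 0.05
```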

APPENDIX D PROOF OF PROPOSITION 4
From Proposition 1 we have to minimize E[|X_y - v|^p], where X_y ∼ p_{X|Y}(·|y). The joint probability density function of (X, Y) is given in (93). Without loss of generality we assume that x_1 ≤ x_2. By using Bayes' formula, the minimization in (93) with respect to v is equivalent to minimizing the function g(v), which we write in piecewise form together with its derivative in (94). From (94) we see that for the regime x_2 ≤ v the derivative is positive, and therefore the minimum occurs at v = x_2. For the regime v ≤ x_1 the derivative is always negative, so the minimum occurs at v = x_1. For the regime x_1 < v < x_2 the optimal v solves the corresponding first-order condition. Next, by comparing the three candidates for the minimizing v, we obtain the minimum. Therefore, the optimal estimator is given by the RHS of (95).
Note that for the case p = 1 the function g(v) is piecewise linear, and its minimum occurs at the mass point with the larger posterior probability (i.e., a median of X_y). This gives the optimal estimator for p = 1. This concludes the proof.
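The three-regime analysis above can be checked numerically: the sketch below (the symmetric atoms and the grid are our assumptions) minimizes g(v) = p1 |x1 - v|^p + p2 |x2 - v|^p and confirms that for p = 1 the optimum sits at the more likely atom, while for p = 2 it is the posterior mean.

```python
import numpy as np

x1, x2 = -1.0, 1.0
grid = np.linspace(-2.0, 2.0, 40_001)

def minimizer(p1: float, p: float) -> float:
    """Grid-search argmin_v of g(v) = p1 |x1-v|^p + (1-p1) |x2-v|^p."""
    vals = p1 * np.abs(x1 - grid) ** p + (1.0 - p1) * np.abs(x2 - grid) ** p
    return float(grid[int(np.argmin(vals))])

# p = 1: the optimum is the more likely atom (a median of the posterior)
assert abs(minimizer(0.7, 1.0) - x1) < 1e-3
assert abs(minimizer(0.3, 1.0) - x2) < 1e-3
# p = 2: the optimum is the posterior mean p1*x1 + (1-p1)*x2
assert abs(minimizer(0.7, 2.0) - (0.7 * x1 + 0.3 * x2)) < 1e-3
```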

APPENDIX E PROOF OF PROPOSITION 5
The key to deriving all of the claimed properties is the expression for the optimal estimator in Proposition 1. We prove the properties in turn. 1) For 0 ≤ X ∈ R, suppose towards a contradiction that v_y = f_p(X|Y = y) < 0. The (in)equalities in (96) follow from: a) the assumption that X ≥ 0 and v_y < 0, so X - v_y > 0 and the absolute value is redundant; and b) the assumption that X ≥ 0 and v_y < 0, so X - v_y ≥ X.
The expression in (96) leads to a contradiction, since it implies that v_y = 0 while by assumption v_y < 0. Therefore, v_y = f_p(X|Y = y) ≥ 0. This concludes the proof of property 1). 2) Next we show that f_p(aX + b|Y) = a f_p(X|Y) + b, where the equalities follow from: a) scaling the objective function does not change the optimizer; and b) the minimum is attained at (v - b)/a = v_y. This concludes the proof of property 2).
3) Next, we show that f_p(g(Y)|Y = y) = g(y). Since, given Y = y, the quantity g(Y) is deterministic, the objective E[|g(Y) - v|^p | Y = y] is minimized, with value zero, at v = g(y). This concludes the proof of property 3). 4) Follows from property 3) by taking g(Y) = f_p(X|Y). 5) Observe that for the Markov chain X → Y_{snr_0} → Y_{snr} we have p_{X|Y_{snr_0}, Y_{snr}}(x|y_{snr_0}, y_{snr}) = p_{X|Y_{snr_0}}(x|y_{snr_0}). (97) By using Proposition 1 we obtain the claim, where the equality in a) follows from (97). 6) See Fig. 1a for the counterexample. This concludes the proof.

APPENDIX F PROOF OF THE BOUND IN PROPOSITION 8
We define Ŷ_snr from Y_{snr_0} and an auxiliary noise Z̃ ∼ N(0, σ^2 I) with σ^2 = (snr_0 - snr)/snr, where Z̃ is independent of Y_{snr_0}, X and Z. Observe that Ŷ_snr and Y_snr have the same SNRs, and therefore mmpe(X|Y_snr; p) = mmpe(X|Ŷ_snr; p).
By performing a change of measure, with L(x, y) denoting the corresponding likelihood ratio, we obtain the claimed bound. This concludes the proof.

A. Proof of the Bound in (38a)
The upper bound in (38a) follows from the fact that E[X|Y] is, for a given p, in general a suboptimal estimator. The lower bound in (38a) for p ≥ q follows from the chain of (in)equalities starting from mmpe(X, snr, q), where the inequality in a) follows from Jensen's inequality and the concavity of (·)^{q/p}.

B. Proof of the Bounds in (38b) and (38c)
We now proceed to the proof of the upper bounds in (38b) and (38c). We have the chain of (in)equalities in (99), where: a) follows by Lemma 1; and b) by the triangle inequality, which holds for p ≥ 1. Next, the term ‖E[Z|Y]‖_p can be further bounded as in (100), where the inequality in a) follows from Jensen's inequality. Depending on whether p/2 ≤ 1 or p/2 ≥ 1, we bound (100) as in (101), where the inequalities follow from: a) Jensen's inequality applied to the convex function x^r for r ≥ 1; and b) Jensen's inequality applied to the concave function x^r for r ≤ 1. By putting (99), (100) and (101) together we get (102). The second term in the minimum of (38b) and (38c) is shown by assuming that ‖X‖_p is finite and by mimicking the steps leading to the bound in (102); this gives (103). Taking the minimum between (102) and (103) concludes the proof.
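The Jensen steps above rest on the monotonicity of p ↦ (E|X|^p)^{1/p}. A quick empirical check (standard normal samples are an assumption of this sketch; the inequality itself holds for any distribution, including the empirical one):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(100_000)

def pnorm(p: float) -> float:
    """(E|X|^p)^(1/p), estimated from the samples."""
    return float(np.mean(np.abs(x) ** p) ** (1.0 / p))

# Lyapunov / Jensen: p -> ||X||_p is nondecreasing, which is what allows
# MMPE bounds of different orders to be compared and interpolated
ps = [0.5, 1.0, 2.0, 3.0, 4.0]
norms = [pnorm(p) for p in ps]
assert all(a <= b + 1e-9 for a, b in zip(norms, norms[1:]))
```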

APPENDIX I PROOF OF PROPOSITION 15
We seek to give an upper bound on P_e^{(n)}(snr) in (45). To that end, let us denote the relevant events. We then have the chain of inequalities, where the inequality in (116) follows by the triangle inequality, which holds for p ≥ 1. Finally, the proof concludes by taking g(Y) = f_p(X|Y).

APPENDIX L PROOF OF THEOREM 3
To show that lim_{n→∞} G_{2,p}(U) = 0 we study the limit of k_{n,p} · n, where Vol(B_0(r)) = (π^{n/2} / Γ(n/2 + 1)) r^n.
Moreover, the norm ‖U‖ can be upper bounded by integrating (r^2)^{p/2} over B_0(r), as in (119). Solving f(r) = 0 in (124), we get the approximate solution.
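The ball-volume formula used in this appendix can be sanity-checked against the familiar low-dimensional cases; the helper below is a hypothetical utility of ours, not from the paper.

```python
import math

def vol_ball(n: int, r: float) -> float:
    """Vol(B_0(r)) = pi^(n/2) / Gamma(n/2 + 1) * r^n."""
    return math.pi ** (n / 2.0) / math.gamma(n / 2.0 + 1.0) * r ** n

# n = 2: area pi r^2; n = 3: volume (4/3) pi r^3
assert abs(vol_ball(2, 2.0) - math.pi * 4.0) < 1e-9
assert abs(vol_ball(3, 2.0) - (4.0 / 3.0) * math.pi * 8.0) < 1e-9
```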