On the Structure of the Least Favorable Prior Distributions

This paper studies optimization of the minimum mean square error (MMSE) in order to characterize the structure of the least favorable prior distributions. In the first part, the paper characterizes the local behavior of the MMSE in terms of the input distribution and finds the directional derivative of the MMSE at the distribution $P_{\mathbf{X}}$ in the direction of the distribution $Q_{\mathbf{X}}$. In the second part of the paper, the directional derivative together with the theory of convex optimization is used to characterize the structure of least favorable distributions. In particular, under mild regularity conditions, it is shown that the support of the least favorable distributions must necessarily be very small and is contained in a nowhere dense set of Lebesgue measure zero. The results of this paper produce both sufficient and necessary conditions for optimality, do not rely on Gaussian statistics assumptions, and are not sensitive to the dimensionality of random vectors. The results are evaluated for the univariate and multivariate Gaussian cases and for the Poisson case. Finally, as one of the applications, it is shown how the results can be used to characterize the capacity of Gaussian MIMO channels with an amplitude constraint.


I. INTRODUCTION
The minimum mean square error (MMSE) in estimating an input random vector X ∈ R^n from a noisy observation/output Y ∈ R^k is defined as
$$\mathrm{mmse}(\mathbf{X}|\mathbf{Y}) = \mathbb{E}\left[\|\mathbf{X} - \mathbb{E}[\mathbf{X}|\mathbf{Y}]\|^2\right]. \quad (1)$$
In this paper we study the problem of maximizing the MMSE in (1) over the set of input distributions on X for a fixed transition distribution P_Y|X. Specifically, we will work with the following two sets: 1) the set of distributions with a compact support; and 2) the set of distributions with finite generalized moments (e.g., second moment, third absolute moment, logarithmic moments, etc.). The distributions that achieve the worst-case MMSE (i.e., maximize the MMSE) are called least favorable prior distributions. The problem of finding least favorable prior distributions is interesting from both estimation theoretic and information theoretic points of view. Firstly, in estimation theory, maximization of the MMSE over a set of distributions with compact support is directly related to the problem of characterizing a minimax estimator [1]. Specifically, a conditional expectation (optimal Bayes estimator) evaluated with a least favorable prior distribution is also a minimax estimator.
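As a concrete numerical illustration (ours, not part of the analysis that follows), the MMSE in (1) can be evaluated for a simple scalar example: a symmetric two-point input in standard Gaussian noise, for which the conditional expectation is E[X|Y = y] = A tanh(Ay). All parameter choices in the sketch below are hypothetical.

```python
import numpy as np

def mmse_two_point(A, n_quad=80):
    """MMSE for X uniform on {-A, +A} observed through Y = X + Z, Z ~ N(0, 1).

    The conditional-mean estimator is E[X | Y = y] = A * tanh(A * y); the MMSE
    E[(X - E[X|Y])^2] is computed with Gauss-Hermite quadrature.
    """
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    z = np.sqrt(2.0) * t          # nodes mapped to a standard normal variable
    w = w / np.sqrt(np.pi)        # normalized weights summing to 1
    # By symmetry it suffices to condition on X = +A:
    err = (A - A * np.tanh(A * (A + z))) ** 2
    return float(np.sum(w * err))

print(mmse_two_point(1.0))  # strictly between 0 and the input power A^2 = 1
```

Maximizing this quantity over the input law, rather than evaluating it for a fixed one, is precisely the least-favorable-prior problem studied in this paper.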
Secondly, in information theory, in view of the I-MMSE relationship [2] that connects the MMSE and the mutual information for the case of additive Gaussian noise, the least favorable distributions are often also capacity achieving distributions (i.e., maximize mutual information). For example, in [3] such an approach was used to characterize the capacity achieving distribution of a Gaussian noise channel with a small (but nonvanishing) input amplitude constraint.
Unlike previous works, the approach taken in this work is based on the theory of convex optimization and allows us to produce systematic and general results. For instance, our approach produces both sufficient and necessary conditions for optimality, does not rely on the assumption of Gaussian statistics, and is not sensitive to the dimensionality of random vectors X and Y. Our approach also parallels the variational approach, used in information theory [4], [5], for finding capacity achieving distributions.

A. Past Work
The theory of finding least favorable prior distributions has received considerable attention under the assumption of univariate and/or Gaussian statistics. For the univariate case, under some mild conditions, Ghosh in [6] has shown that, with a support constraint on the input, the least favorable priors are discrete with finitely many points. However, as was pointed out in [6], it is not clear how to generalize the argument to the multivariate case. In contrast, the approach taken in this paper is insensitive to the dimensionality.
In [7] for the Gaussian case, capitalizing on the result of Ghosh, the authors demonstrated necessary and sufficient conditions for the optimality of a two-point prior distribution. In addition, the authors in [7] also provided a sufficient condition for the optimality of a three-point prior. In contrast, the methodology used in this paper produces both sufficient and necessary conditions that can be tested against any N-point prior.
For the multivariate Gaussian case, with a sufficiently small ball constraint, in [1] it has been shown that the least favorable prior is distributed on the boundary of the ball. For a comprehensive overview of the minimax estimation of a bounded mean the interested reader is referred to [8] and references therein.

B. Outline and Paper Contributions
Our contributions are as follows. In Section II we review important properties of the MMSE needed in our analysis. In Section III we characterize the local behavior of the MMSE in terms of the input distribution and find the directional derivative of the MMSE functional at the distribution P X in the direction of the distribution Q X .
In Section IV we apply the theory of convex optimization to maximize the MMSE. In Section IV-A and Section IV-B we present the required mathematical tools, such as theorems from convex optimization and theorems on analytic functions. In Section IV-C we look at the case of the compact support constraint: Theorem 6 shows that a least favorable input distribution exists for an arbitrary P Y|X and derives necessary and sufficient conditions for its optimality; Proposition 2, under some mild conditions, characterizes the structure of the support of least favorable prior distributions and shows that the support must be a nowhere dense set of Lebesgue measure zero; Proposition 3 and Proposition 4 look at univariate and multivariate Gaussian noise cases and recover and expand on some known results; Proposition 5 shows how our results can be applied to characterize the capacity of MIMO channels. Surprisingly, Proposition 5 also characterizes the capacity of the MIMO amplitude channel in a regime where the number of antennas approaches infinity; and Proposition 6 considers the Poisson noise case. Section IV-F looks at least favorable priors under the generalized moment constraints. Section V concludes the paper.
Due to space limitations, some of the proofs are omitted and can be found in an extended version of this paper [9].

C. Notation
Throughout the paper we adopt the following notational conventions: deterministic scalar quantities are denoted by lowercase letters and deterministic vector quantities are denoted by lowercase bold letters; matrices are denoted by bold uppercase letters; random variables are denoted by uppercase letters and random vectors are denoted by bold uppercase letters; and we denote an n-dimensional ball of radius R centered at 0 as B_0(R) = {x ∈ R^n : ‖x‖ ≤ R}. For a random vector X with distribution P_X we define the expected value as E[X] = ∫ x dP_X(x); when we need to emphasize that X is distributed according to P_X we use the notation E_{P_X}[X]. We say that a random vector Y ∈ L^p if E[‖Y‖^p] < ∞; we denote the set of all probability distributions on S ⊂ R^n as F_∞(S); and a point x ∈ R^n is said to be a point of increase of a distribution P_X if, for any open subset O ⊂ R^n containing x, P_X(O) > 0. We denote the set of points of increase of P_X as E(P_X) ⊆ R^n. Observe that P_X(E(P_X)) = 1. In fact, E(P_X) is the minimal closed subset of R^n whose probability is 1.

II. THE MMSE
In this section we review some important properties of the MMSE.

A. Fundamental Theorems of MMSE Estimation
Theorem 1. (Conditional Expectation is the Optimal Estimator.) For any Borel function f it holds that
$$\mathbb{E}\left[\|\mathbf{X} - f(\mathbf{Y})\|^2\right] \ge \mathbb{E}\left[\|\mathbf{X} - \mathbb{E}[\mathbf{X}|\mathbf{Y}]\|^2\right],$$
with equality if and only if f(Y) = E[X|Y] almost surely; that is, the conditional expectation attains the MMSE in (1).

B. The MMSE as a Functional
Throughout the paper we will treat the MMSE as an operator (or a functional) on the space of joint distributions P_XY. To emphasize that the MMSE is a function of the pair (P_X, P_Y|X) we use the following notation:
$$\mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}) = \mathbb{E}\left[\|\mathbf{X} - \mathbb{E}[\mathbf{X}|\mathbf{Y}]\|^2\right], \quad (\mathbf{X}, \mathbf{Y}) \sim P_{\mathbf{X}} P_{\mathbf{Y}|\mathbf{X}}.$$
Continuity of the MMSE will play a key role in our analysis and, therefore, we need the following definitions.
Definition 1. (Semicontinuity.) A function f is upper (resp. lower) semicontinuous at x_0 if lim sup_{x → x_0} f(x) ≤ f(x_0) (resp. lim inf_{x → x_0} f(x) ≥ f(x_0)). A function f is continuous at x_0 if it is both upper and lower semicontinuous at x_0.
Next, we summarize operator properties of the MMSE.
Theorem 2. (Continuity of the MMSE.) The map P_X ↦ mmse(P_X, P_Y|X) is upper semicontinuous with respect to weak convergence. Moreover, suppose that Y = X + N, where N is independent of X with a continuous and bounded density and E[‖N‖²] < ∞; then P_X ↦ mmse(P_X, P_Y|X) is continuous.

III. LOCAL BEHAVIOR OF THE MMSE IN TERMS OF THE INPUT DISTRIBUTION
Let P_X be the distribution of X. In this section, we study the local behavior of the MMSE as a function of P_X.
Definition 2. (The Gâteaux Derivative.) Let F be a convex topological space. For any two distributions P ∈ F and Q ∈ F we define the Gâteaux derivative of a function g : F → R at P in the direction of Q as
$$g_{Q}(P) = \lim_{\lambda \to 0^+} \frac{g\big((1-\lambda)P + \lambda Q\big) - g(P)}{\lambda}.$$
The Gâteaux derivative is a generalization of the concept of a directional derivative and is an important optimization tool. The following theorem finds the Gâteaux derivative of the MMSE with respect to the input distribution.
Theorem 3. (The Gâteaux Derivative of the MMSE.) For any P_X, Q_X and P_Y|X we have that
$$\mathrm{mmse}_{Q_{\mathbf{X}}}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}) = \int \mathbb{E}\left[\|\mathbf{X} - \mathbb{E}_{P_{\mathbf{X}}}[\mathbf{X}|\mathbf{Y}]\|^2 \,\middle|\, \mathbf{X} = \mathbf{x}\right] \mathrm{d}Q_{\mathbf{X}}(\mathbf{x}) - \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}),$$
where the conditional expectation inside the integral is computed with the estimator E_{P_X}[X|Y] induced by P_X.
Proof sketch. Writing P_λ = (1 − λ)P_X + λQ_X, the MMSE under P_λ is expanded using: a) the Pythagorean identity (i.e., the orthogonality principle); and b) the property that the expected value is a linear operator on a set of distributions. Dividing the resulting expression by λ and taking λ → 0⁺ yields the claimed derivative; the proof that the residual limit is zero is relegated to [9].
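The derivative in Theorem 3 can be cross-checked numerically: for discrete priors in scalar Gaussian noise, the quantity E_{Q_X}[g(X)] − mmse(P_X, P_Y|X), with g(x) = E[‖X − E_{P_X}[X|Y]‖² | X = x], should match a finite difference of λ ↦ mmse((1 − λ)P_X + λQ_X, P_Y|X). The sketch below is our own check with hypothetical two- and three-point priors.

```python
import numpy as np

t, w = np.polynomial.hermite.hermgauss(80)
Z = np.sqrt(2.0) * t               # standard-normal quadrature nodes
W = w / np.sqrt(np.pi)             # matching normalized weights

def post_mean(xs, ps, y):
    """E_P[X | Y = y] for a discrete prior (xs, ps) under N(0,1) noise."""
    lik = ps[:, None] * np.exp(-0.5 * (y[None, :] - xs[:, None]) ** 2)
    return np.sum(lik * xs[:, None], axis=0) / np.sum(lik, axis=0)

def g(xs, ps, x):
    """g(x) = E[(x - E_P[X|Y])^2 | X = x]."""
    return float(np.sum(W * (x - post_mean(xs, ps, x + Z)) ** 2))

def mmse(xs, ps):
    return sum(p * g(xs, ps, x) for x, p in zip(xs, ps))

# Hypothetical priors: P on {-1, +1}, Q on {-2, 0, +2}
xsP, psP = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
xsQ, psQ = np.array([-2.0, 0.0, 2.0]), np.array([0.25, 0.5, 0.25])

# Directional derivative predicted by Theorem 3
deriv = sum(q * g(xsP, psP, x) for x, q in zip(xsQ, psQ)) - mmse(xsP, psP)

# One-sided finite difference along the mixture (1 - lam) P + lam Q
lam = 1e-4
xsM = np.concatenate([xsP, xsQ])
psM = np.concatenate([(1 - lam) * psP, lam * psQ])
fd = (mmse(xsM, psM) - mmse(xsP, psP)) / lam

print(deriv, fd)  # the two values agree up to O(lam) curvature error
```

The agreement reflects the fact that the change in the estimator itself contributes nothing to first order, by optimality of the conditional mean.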
IV. OPTIMIZATION OF THE MMSE
In this section we use the derivative found in Theorem 3 to characterize distributions that maximize the MMSE. Unlike previous approaches, the approach laid out in this paper is systematic and produces both sufficient and necessary conditions for optimality. Moreover, the approach is fairly general and works for a large class of channels P Y|X. We begin by introducing necessary mathematical tools.

A. Optimization Theorems
We will need the following optimization theorems.
Theorem 4. 1) (Existence of a Maximizer.) Let F be a sequentially compact topological space and let f : F → R be an upper semicontinuous function on F. Then, there exists an F₀ ∈ F such that f(F₀) = sup_{F ∈ F} f(F). Moreover, the solution is unique if F is convex and f is strictly concave.
2) (Necessary Condition for Optimality.) Let F be a convex topological space and let f : F → R be a function whose Gâteaux derivative f_Q(F) exists at every F ∈ F in every direction Q ∈ F. If F₀ maximizes f over F, then
$$f_{Q}(F_0) \le 0, \quad \forall\, Q \in \mathcal{F}. \quad (9)$$
3) (Necessary and Sufficient Condition for Optimality.) The condition in (9) is sufficient if in addition f is concave.
4) (KKT Conditions.) Let F be a convex topological space, and let f : F → R be a concave function on F and g : F → R a convex function on F. Assume there exists a point F′ ∈ F such that g(F′) < 0. Let
$$\mu = \sup_{F \in \mathcal{F}:\, g(F) \le 0} f(F). \quad (10)$$
Then, there is a constant λ ≥ 0 such that
$$\mu = \sup_{F \in \mathcal{F}} \big( f(F) - \lambda g(F) \big). \quad (11)$$
Furthermore, if the supremum in (10) is achieved by F₀, then it is also achieved by F₀ in (11) and λ g(F₀) = 0.
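As a toy numerical instance of the KKT statement in part 4) (our own example, with F = R, f(x) = −(x − 2)², g(x) = x − 1, and the multiplier λ = 2 chosen by hand), the constrained optimum µ is recovered as an unconstrained supremum of the Lagrangian, and complementary slackness λg(F₀) = 0 holds at the maximizer:

```python
import numpy as np

f = lambda x: -(x - 2.0) ** 2      # concave objective
g = lambda x: x - 1.0              # convex constraint; Slater holds: g(0) < 0

xs = np.linspace(-5.0, 5.0, 100001)

# mu = sup { f(x) : g(x) <= 0 } = f(1) = -1
mu = f(xs[g(xs) <= 0]).max()

# With lam = 2, sup_x ( f(x) - lam * g(x) ) is attained at x0 = 1 and equals mu
lam = 2.0
lagr = f(xs) - lam * g(xs)
x0 = xs[np.argmax(lagr)]

print(mu, lagr.max(), x0, lam * g(x0))
```

Here the Lagrangian simplifies to −(x − 1)² − 1, so its unconstrained maximum coincides with the constrained value µ = −1, exactly as (10)-(11) predict.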

B. Analytic Functions and the Size of the Uniqueness Set
Part of our analysis will require identifying the sizes of sets on which two analytic functions can agree without being identical everywhere (i.e., uniqueness sets), for which the following theorem will be used.
Theorem 5. (Identity Theorems [12].) Let X ⊂ R^n and let f, g : X → R be two real-analytic functions on X that agree on some set E ⊂ X. Then, f and g agree on X if one of the following conditions is satisfied: 1) E is an open set; 2) E is a set of positive Lebesgue measure; or 3) n = 1 and E has a limit point in X.

C. Bounded Input: General Case
In this section we seek to find $\sup_{P_{\mathbf{X}} \in \mathcal{F}_\infty(S)} \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}})$.
Theorem 6. For any P_Y|X and any compact S ⊂ R^n,
$$\sup_{P_{\mathbf{X}} \in \mathcal{F}_\infty(S)} \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}) = \max_{P_{\mathbf{X}} \in \mathcal{F}_\infty(S)} \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}). \quad (12a)$$
Moreover, P_X is an optimal input distribution in (12a) if and only if, for all Q_X ∈ F_∞(S),
$$\mathrm{mmse}_{Q_{\mathbf{X}}}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}) \le \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}). \quad (12b)$$
Proof. The statement in (12a) follows from the fact that P_X ↦ mmse(P_X, P_Y|X) is an upper semicontinuous function, as shown in Theorem 2, the fact that F_∞(S) is a sequentially compact set, as shown in [9, Lemma 1], and property 1) of Theorem 4. The statement in (12b) follows from property 2) and property 3) of Theorem 4, together with the derivative expression for the MMSE in Theorem 3.
In this work we seek to make statements about the size of the support of an optimal input distribution. Therefore, it is convenient to re-write the condition in (12b) in an equivalent form that involves statements about the support of an optimal input distribution.
Proposition 1. P_X is an optimal input distribution in (12a) if and only if the following two conditions hold:
1) g(x) ≤ mmse(P_X, P_Y|X) for all x ∈ S; (13a)
2) g(x) = mmse(P_X, P_Y|X) for all x ∈ E(P_X), (13b)
where
$$g(\mathbf{x}) = \mathbb{E}\left[\|\mathbf{X} - \mathbb{E}_{P_{\mathbf{X}}}[\mathbf{X}|\mathbf{Y}]\|^2 \,\middle|\, \mathbf{X} = \mathbf{x}\right]. \quad (14)$$
Definition 3. (Dense and Nowhere Dense Sets.)
• A set A ⊂ X is said to be dense in X if every element x ∈ X either belongs to A or is a limit point of A.
• A set A ⊂ X is said to be nowhere dense if, for every nonempty open set U ⊂ X, the intersection U ∩ A is not dense in U.
Proposition 2. Suppose that the function g(x) in (14)

is non-constant and real-analytic on S. Then, an optimal input distribution P_X in (12a) satisfies the following:
• for S ⊂ R^n where n ≥ 1, E(P_X) is a nowhere dense set of Lebesgue measure zero; and
• for S ⊂ R, E(P_X) has finite cardinality (i.e., an optimal input distribution is discrete with finitely many points).
Proof. If P_X is a maximizer in (12a), then by (13b)
g(x) = mmse(P_X, P_Y|X), ∀ x ∈ E(P_X). (15)
In other words, g(x) is constant on E(P_X). We first focus on the general case of n ≥ 1. Towards a contradiction, suppose that E(P_X) ⊆ S is not a nowhere dense subset of S. Then, there exists some open set O such that O ∩ E(P_X) is dense in O. Moreover, by (15), g(x) is constant on O ∩ E(P_X) and, by continuity, on all of O; hence, by Theorem 5, g(x) is constant on all of S. However, this contradicts our assumption that g(x) is non-constant on S and, therefore, E(P_X) is a nowhere dense set.
The conclusion that E(P_X) has Lebesgue measure zero follows by assuming, towards a contradiction, that E(P_X) is a set of positive Lebesgue measure. By (15), g(x) is constant on E(P_X) ⊂ S and, using Theorem 5, we conclude that g(x) must be constant on S, which again contradicts the assumption.
The proof in the case of n = 1 is relegated to [9].
The result of Proposition 2 for n > 1 shows that the support of an optimal input distribution is small in two ways. First, the support is small in measure-theoretic terms and has zero Lebesgue measure. Second, the support is small topologically and is nowhere dense, which, loosely speaking, implies that the elements of the support are not tightly clustered. An interesting question, which we will address shortly, is whether the size of the support is also small when measured in terms of cardinality. For example, for n = 1 we already know that this is the case and the support has finite cardinality. It turns out that, in general, for n > 1 the support of an optimal distribution need not be finite or even countable.
Next, we show that the conditions on g(x) in Proposition 2 are not very restrictive and hold in a variety of settings (e.g., Gaussian noise). Lemma 1. Let P_Y|X be such that Y = X + Z, where X and Z are independent, and suppose that the pdf f_Z(z) of Z extends to a complex-analytic function on an open subset of C^n that contains R^n. Moreover, assume that f_Z(z) > 0 for all z ∈ R^n. Then, g(x) defined in (14) is a real-analytic function on R^n.

D. Bounded Input: Gaussian Noise Case
This section looks at the case when P Y|X is Gaussian.
Proposition 3. (Univariate Gaussian Noise.) Let P_Y|X = N(x, 1) and S = [−A, A]. Then, for the optimization problem
$$\sup_{P_X \in \mathcal{F}_\infty([-A,A])} \mathrm{mmse}(P_X, P_{Y|X}) \quad (16)$$
we have the following:
• an optimal input distribution in (16) is discrete with finitely many points. Moreover, the optimizing input distribution is unique and symmetric;
• {±A} ⊆ E(P_X) for every A ≥ 0; and
• {±A} = E(P_X) if and only if A ≤ Ā ≈ 1.05647.
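The threshold in Proposition 3 can be probed numerically: by the optimality conditions above, the symmetric two-point prior on {±A} is least favorable only if g(x) = E[(x − E_P[X|Y])² | X = x] ≤ mmse(P_X, P_Y|X) on all of [−A, A], with equality at ±A. The sketch below (our own check, with a hypothetical grid) verifies this for A = 0.8, which lies below the threshold.

```python
import numpy as np

t, w = np.polynomial.hermite.hermgauss(80)
Z = np.sqrt(2.0) * t
W = w / np.sqrt(np.pi)

def g_two_point(A, x):
    """g(x) for the prior uniform on {-A, +A} under N(0,1) noise.

    The conditional mean induced by this prior is E_P[X | Y = y] = A*tanh(A*y).
    """
    return float(np.sum(W * (x - A * np.tanh(A * (x + Z))) ** 2))

A = 0.8                                # below the threshold of roughly 1.05647
mmse_val = g_two_point(A, A)           # by symmetry g(A) = g(-A) = mmse
grid = np.linspace(-A, A, 401)
g_vals = np.array([g_two_point(A, x) for x in grid])

print(g_vals.max() <= mmse_val + 1e-9)  # prints True: g <= mmse on [-A, A]
```

Repeating the check for A above the threshold would show the condition failing, consistent with the two-point prior ceasing to be least favorable there.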
For the multivariate case we have the following generalization of Proposition 3.
Proposition 4. (Multivariate Gaussian Noise.) Let P_Y|X = N(x, I) and S = B_0(R). Then, for the optimization problem in (12a) we have the following:
• the optimal input distribution P_X is unique and spherically symmetric; and
• there exists an R̄ > 0 such that, for all R ≤ R̄, E(P_X) = C(R) = {x ∈ R^n : ‖x‖ = R}.
Note that the result of Proposition 4 shows that an optimal input distribution can be supported on the set C(R), which is a nowhere dense set of Lebesgue measure zero. However, note that the set C(R) has uncountably infinite cardinality. Hence, for n > 1 the conclusion of Proposition 2 is not superfluous and in general cannot be strengthened, and discrete inputs are in general not optimal for n > 1. However, do note that the number of possible spheres that make up E(P_X) is finite. In other words, the magnitude ‖X‖ is a discrete random variable with finitely many points.
In Proposition 4, the constant R̄ can be difficult to evaluate, but it can be shown that it is sufficient to take R ≤ √n. Proposition 4 can be used to find the capacity of a MIMO channel with an amplitude constraint.
Proposition 5. (Capacity of the Amplitude-Constrained MIMO Channel.) For the Gaussian MIMO channel Y = X + Z with Z ∼ N(0, I) and input amplitude constraint X ∈ B_0(R), the capacity achieving input distribution is uniform on the set C(R) = {x ∈ R^n : ‖x‖ = R} (i.e., the boundary of the ball) if R ≤ √n.
Note that Proposition 5 establishes capacity in the small amplitude regime (i.e., R ≤ √ n) in the massive MIMO case (i.e., the number of antennas approaches infinity) [13]. A similar argument can also be applied to the MIMO wiretap channel.

E. Bounded Input: Poisson Noise Case
The Poisson random transformation is governed by the following conditional distribution:
$$P_{Y|X}(y|x) = \frac{e^{-x} x^{y}}{y!}, \quad y \in \{0, 1, 2, \ldots\}, \; x \ge 0, \quad (19)$$
where we use the convention that 0^0 = 1. It is well known that the conditional expectation is given by
$$\mathbb{E}[X \,|\, Y = y] = \frac{(y+1)\, p_Y(y+1; P_X)}{p_Y(y; P_X)}, \quad (20)$$
where p_Y(y; P_X) is the marginal probability mass function (pmf) of Y induced by the input distribution P_X.
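The Poisson conditional-mean identity above, E[X | Y = y] = (y + 1) p_Y(y + 1; P_X) / p_Y(y; P_X), can be checked directly against Bayes' rule; the sketch below does so for a hypothetical two-point prior.

```python
import math

def poisson_pmf(y, x):
    # P(Y = y | X = x) = e^{-x} x^y / y!  (Python's 0 ** 0 == 1 matches the
    # convention 0^0 = 1)
    return math.exp(-x) * x ** y / math.factorial(y)

# Hypothetical two-point input prior on {0.5, 3.0}
support, probs = [0.5, 3.0], [0.4, 0.6]

def p_Y(y):
    """Marginal pmf of Y induced by the prior."""
    return sum(p * poisson_pmf(y, x) for x, p in zip(support, probs))

def cond_mean_bayes(y):
    """E[X | Y = y] computed directly from Bayes' rule."""
    return sum(x * p * poisson_pmf(y, x) for x, p in zip(support, probs)) / p_Y(y)

def cond_mean_ratio(y):
    """E[X | Y = y] via the ratio identity (y+1) p_Y(y+1) / p_Y(y)."""
    return (y + 1) * p_Y(y + 1) / p_Y(y)

for y in range(6):
    print(y, cond_mean_bayes(y), cond_mean_ratio(y))  # the two columns agree
```

The identity follows from x · e^{−x} x^y / y! = (y + 1) · e^{−x} x^{y+1} / (y + 1)!, summed against the prior.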
The following theorem characterizes the structure of a least favorable prior for the Poisson case. Proposition 6. (Poisson Noise Case.) Let P_Y|X be as in (19). Then, for the optimization problem
$$\sup_{P_X \in \mathcal{F}_\infty([0,A])} \mathrm{mmse}(P_X, P_{Y|X}) \quad (21)$$
we have the following:
• an optimal input distribution in (21) is discrete with finitely many points; and
• E(P_X) = {0, A} if and only if A ≤ Ā ≈ 0.9129, where Ā is the solution of the equation 2e^{x/2}(x − 1) + x e^{x} − 2 = 0 for x > 0. Moreover, the optimal probability assignment is given by P_X[X = 0] = 1/(1 + e^{A/2}).

F. Generalized Input Moment Constraints
In this section we seek to find
$$\sup_{P_{\mathbf{X}} \in \mathcal{F}(f;\alpha)} \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}), \quad (22a)$$
$$\text{where } \mathcal{F}(f;\alpha) = \left\{ P_{\mathbf{X}} : \mathbb{E}_{P_{\mathbf{X}}}[f(\mathbf{X})] \le \alpha \right\}, \quad (22b)$$
for some given f : R^n → R independent of P_X. Observe that the set F(f; α) is convex. In addition, we assume that f(x) is a non-negative and monotonically increasing function of ‖x‖, which by the Markov inequality and the Prokhorov theorem implies that F(f; α) is a sequentially compact set.
An example of an f(·) that satisfies such a condition is f(x) = ‖x‖^r for any r > 0. Theorem 7. Suppose that, for the optimization problem in (22), the map P_X ↦ mmse(P_X, P_Y|X) is upper semicontinuous. Then, the supremum in (22a) is attained by some input distribution P_X. Moreover, P_X is optimal if and only if there exists a λ ≥ 0 such that the following two conditions hold:
1) for all x ∈ R^n,
$$\mathbb{E}\left[\|\mathbf{X} - \mathbb{E}_{P_{\mathbf{X}}}[\mathbf{X}|\mathbf{Y}]\|^2 \,\middle|\, \mathbf{X} = \mathbf{x}\right] - \lambda\left(f(\mathbf{x}) - \alpha\right) \le \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}); \text{ and}$$
2) for all x ∈ E(P_X) ⊆ R^n,
$$\mathbb{E}\left[\|\mathbf{X} - \mathbb{E}_{P_{\mathbf{X}}}[\mathbf{X}|\mathbf{Y}]\|^2 \,\middle|\, \mathbf{X} = \mathbf{x}\right] - \lambda\left(f(\mathbf{x}) - \alpha\right) = \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}).$$
Next, we look at the special case of multivariate Gaussian.
Proposition 7. Let P_Y|X = N(x, I). Then, for the optimization problem in (22) we have the following:
• the optimal input distribution is unique and symmetric;
• if f(x) = ω(‖x‖²), then the support of the optimal input distribution is bounded (i.e., E(P_X) ⊆ B_0(R) for some R > 0);
• if f(x) = ‖x‖², then the optimal input distribution is given by X ∼ N(0, αI); and
• if f(x) = o(‖x‖²), then the support of the optimal input distribution is unbounded (i.e., there is no R ≥ 0 such that E(P_X) ⊆ B_0(R)).
It is important to point out that the proof of the case f (x) = o( x 2 ) in Proposition 7 does not require the assumption that P Y|X is Gaussian, and holds under the general assumptions of Theorem 7.
Observe that, according to Proposition 7, in the case of f(x) = ω(‖x‖²) the optimal distribution has bounded support and, therefore, from Proposition 2 we have that the support is a nowhere dense set of Lebesgue measure zero.
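The f(x) = ‖x‖² bullet of Proposition 7 can be sanity-checked in the scalar case: with N(0, 1) noise and power α, the Gaussian prior N(0, α) attains mmse = α/(1 + α), and any competing prior of the same power should yield a smaller MMSE. The sketch below (our own check, with a hypothetical two-point competitor at ±√α) illustrates this.

```python
import numpy as np

t, w = np.polynomial.hermite.hermgauss(80)
Z = np.sqrt(2.0) * t
W = w / np.sqrt(np.pi)

alpha = 1.0

# Gaussian prior N(0, alpha): the optimal estimator is linear and
# mmse = alpha / (1 + alpha)
mmse_gauss = alpha / (1.0 + alpha)

# Two-point prior on {-sqrt(alpha), +sqrt(alpha)} with the same power alpha;
# its conditional mean is sqrt(alpha) * tanh(sqrt(alpha) * y)
A = np.sqrt(alpha)
mmse_two = float(np.sum(W * (A - A * np.tanh(A * (A + Z))) ** 2))

print(mmse_gauss, mmse_two)  # the Gaussian prior yields the larger MMSE
```

This is the scalar instance of the classical fact that, under a power constraint and Gaussian noise, the Gaussian input is the hardest to estimate.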

V. CONCLUSION
In this work we have examined the structure of the support of least favorable prior distributions. We have shown that, under some mild conditions, the support of a least favorable distribution must be a nowhere dense set of Lebesgue measure zero. Our results also produce necessary and sufficient conditions for optimality and, in most cases, can be easily evaluated, as has been demonstrated by the Gaussian and Poisson examples. An interesting future direction is to consider the problem where, for λ ≥ 0, we seek to evaluate
$$\max_{P_{\mathbf{X}}} \left( \mathrm{mmse}(P_{\mathbf{X}}, P_{\mathbf{Y}|\mathbf{X}}) - \lambda\, \mathrm{mmse}(P_{\mathbf{X}}, Q_{\mathbf{Y}|\mathbf{X}}) \right).$$
For example, taking P_Y|X = N(Hx, I) and Q_Y|X = N(H_0 x, I) might potentially generalize the single crossing point property, shown in [14] and discussed in great detail in [15] and [16].