When are discrete channel inputs optimal? — Optimization techniques and some new results

Discrete channel inputs have been shown to maximize mutual information under various settings. This paper offers a brief review of corresponding optimization methods. Specifically, two techniques are considered; the first is rooted in the theory of linear optimization whereas the second is based on ideas from convex optimization.


I. INTRODUCTION
In this work, we are interested in the optimization problem

$$\max_{X : X \sim P_X \in \mathcal{M}} I(X; Y), \tag{1}$$

where I(X; Y) is the mutual information between random vectors X and Y distributed according to some joint distribution P_XY. Their marginal distributions are denoted by P_X (input distribution) and P_Y (output distribution), respectively, whereas M is some compact and convex set of input distributions.¹ We are specifically interested in an exact characterization of the optimizing distribution P⋆_X (not necessarily unique). In cases where an exact optimizer cannot be found, we determine properties of an optimal input distribution in order to reduce the dimensionality of the problem. For example, in many cases it turns out that an optimizing distribution is discrete with finitely many mass points, which reduces the infinite-dimensional problem (1) to a finite-dimensional one to which numerical methods for computing P⋆_X can be applied. Towards this end, we survey two popular approaches to characterizing the maximum in (1). The first approach assumes that the output space is of finite cardinality whereas the input space can be of arbitrary size. The technique is based on the theory of linear optimization. We also introduce several new techniques to this approach. For example, the new tools make it possible to work not only with Euclidean input spaces but also with inputs supported on separable metric spaces.
The second approach makes no assumption on the cardinality of the channel output space and is based on convex optimization techniques. This approach has usually been applied to channels with scalar inputs. We introduce several new tools in order to extend the results to the vector case. Moreover, in cases in which the input distribution is discrete with finitely many mass points, we show how to obtain an estimate on the number of points.
¹ Note that M is very general in the sense that it accounts for any type of constraints on the channel inputs (e.g., amplitude constraints, average power constraints).
This work was supported in part by the U. S. National Science Foundation under Grants CCF-1420575 and CNS-1456793, by the German Research Foundation under Grant GO 2669/1-1, and by the European Union's Horizon 2020 Research And Innovation Programme, grant agreement no. 694630.
Due to space limitations, we do not focus on the algorithmic aspects of finding an optimal input distribution. The literature on this subject is vast and the interested reader is referred to [1]-[3] and references therein. We also do not focus on the popular subject of approximating the capacity of a channel with discrete inputs (see [4]-[6] and references therein).
The paper is organized as follows. Section II introduces a preliminary set of mathematical tools needed in our analysis such as tools from probability theory, convex analysis, and mathematical optimization. Section III then presents the first optimization technique, which focuses on channel output spaces of finite cardinality. Under very general assumptions it is shown that the optimizing input distribution must be discrete with a finite number of mass points. Moreover, it is shown that the number of mass points grows linearly with the cardinality of the output space and linearly with the number of constraints on the input. The second optimization technique is presented in Section IV, where it is shown under very mild conditions that the support of the optimal input distribution is generally a nowhere dense set of Lebesgue measure zero. Moreover, in cases in which the optimal input distribution is discrete with finitely many mass points it is shown how the number of points can be determined. Finally, Section V concludes the paper.
Notation: deterministic quantities (i.e., scalars and vectors) are denoted by lower case letters and random objects by capital letters; R denotes the affinely extended real number system; the closed ball in R^n of radius R centered at x is denoted as B_x(R) := {y ∈ R^n : ∥y − x∥ ≤ R}, where ∥ · ∥ is the Euclidean norm; the Dirac measure centered on a fixed point x is denoted as δ_x; the mutual information between X and Y distributed according to P_XY is also denoted as I(P_X, P_{Y|X}).

II. SOME ELEMENTARY DEFINITIONS AND RESULTS
In this section, we recap some elementary definitions and results from probability theory and mathematical optimization that will be very useful for our purposes.
For a random vector X ∈ R^n and every measurable set A ⊂ R^n we denote the probability measure of X as

$$P_X(A) := \mathbb{P}[X \in A].$$

If it is clear from the context we sometimes write P instead of P_X. The space of all probability measures defined on a sample space Ω ⊆ R^n is denoted as P(Ω).

A. Weak Convergence and Weak Continuity
It is well known that there exist several different notions of the convergence of a sequence of probability measures. One of them is weak convergence, which provides a given space of probability measures with a topology.

Definition 1.
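A sequence of probability measures {P_n}_{n∈N} is said to converge weakly to the probability measure P if

$$\lim_{n \to \infty} \int \varphi \, dP_n = \int \varphi \, dP$$

for all bounded and continuous functions φ.
Another main ingredient of our considerations is the notion of a linear functional. The following theorem gives a necessary and sufficient condition for a linear functional to be weakly continuous.

Theorem 1.
A linear functional L : P(Ω) → R is weakly continuous if and only if it can be represented as

$$L(P) = \int \varphi \, dP \tag{2}$$

for some bounded and continuous function φ.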

Definition 2.
A given space of probability measures P(Ω) is said to be compact if every infinite sequence, {P_n}_{n∈N}, in P(Ω) has a weakly convergent subsequence.
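As a small numerical illustration of weak convergence (a sketch built on our own choice of example, not taken from the references), consider the Dirac measures P_n = δ_{1/n} on R, which converge weakly to δ_0. The helper names and the two test functions below are hypothetical. For a bounded continuous φ the integrals converge to the integral under the limit, whereas for a bounded but discontinuous test function they may not, which is why the definition of weak convergence restricts attention to continuous φ.

```python
def integrate_dirac(phi, x):
    """Integral of phi with respect to the Dirac measure at x, which is phi(x)."""
    return phi(x)


def phi_continuous(x):
    # Bounded and continuous test function (hypothetical choice).
    return 1.0 / (1.0 + x * x)


def phi_indicator(x):
    # Bounded test function with a jump at 0 (hypothetical choice).
    return 1.0 if x <= 0 else 0.0


ns = [1, 10, 100, 1000, 10000]

# Integrals under P_n = delta_{1/n} for growing n.
cont_values = [integrate_dirac(phi_continuous, 1.0 / n) for n in ns]
ind_values = [integrate_dirac(phi_indicator, 1.0 / n) for n in ns]

# Integrals under the weak limit P = delta_0.
limit_cont = integrate_dirac(phi_continuous, 0.0)
limit_ind = integrate_dirac(phi_indicator, 0.0)

# Gap to the limit: shrinks for the continuous function, stays at 1 for the indicator.
gap_cont = abs(cont_values[-1] - limit_cont)
gap_ind = abs(ind_values[-1] - limit_ind)
```

The discontinuous case does not contradict weak convergence; it only shows that the convergence of integrals is not guaranteed outside the class of bounded continuous functions.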

B. Elementary Optimization Theorems
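The extreme value theorem for real-valued continuous functions over compact intervals is one of the most celebrated results of calculus. The following theorem is a generalization to compact topological spaces [9, Sec. 2.13].

Theorem 2. (Extreme Value Theorem)
Let f be a continuous real-valued functional on a compact topological space S. Then, f attains its maximum on S, that is, there exists an x⋆ ∈ S such that f(x⋆) = sup_{x∈S} f(x). Moreover, if f is strictly concave the maximizer is unique.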

Definition 3.
An extreme point of any convex set S is a point x ∈ S that cannot be represented as x = (1 − t)y + tz with y, z ∈ S and some t ∈ (0, 1). We denote the set of all extreme points of S as exS.
As in this paper we are interested in the extreme points of certain sets of probability measures, the following theorem will be of particular importance [10, Th. 2.1].

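Theorem 3. (Extreme Points of Moment Sets)
Let (Ω, σ(Ω)) be a measurable space and let P_reg(Ω) be the set of all regular probability measures over the sample space Ω. Fix measurable functions f_1, . . . , f_k as well as real numbers c_1, . . . , c_k and consider the set

$$H_k := \{P \in \mathcal{P}_{reg}(\Omega) : E_P[f_i(X)] \le c_i,\ i = 1, \dots, k\}, \tag{3}$$

that is, the set of regular probability measures with k ∈ N bounded moments (called moment set). Then,
1) H_k is convex and the extreme points of H_k are given by

$$\mathrm{ex} H_k \subseteq \Big\{P \in \mathcal{P}_{reg}(\Omega) : P = \sum_{i=1}^{m} \alpha_i \delta_{x_i},\ \alpha_i > 0,\ \sum_{i=1}^{m} \alpha_i = 1,\ x_i \in \Omega,\ 1 \le m \le k + 1\Big\}, \tag{4}$$

where the vectors [f_1(x_i), . . . , f_k(x_i), 1], i = 1, . . . , m, are linearly independent;
2) if the moment conditions in (3) are fulfilled with equality, then (4) holds with equality.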
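The structural condition on extreme points of a moment set (at most k + 1 mass points whose moment vectors, each extended by a trailing 1, are linearly independent) can be checked numerically for a given set of mass points. The following is a minimal sketch under our own assumptions: the function names are hypothetical, the inputs are scalar, and the single moment function f(x) = x² is chosen only for illustration. It tests the form of the candidate, not whether the moment constraints themselves are met.

```python
import numpy as np


def moment_vectors(points, moment_functions):
    """Stack the vectors [f_1(x), ..., f_k(x), 1] for every mass point x."""
    rows = [[f(x) for f in moment_functions] + [1.0] for x in points]
    return np.array(rows, dtype=float)


def satisfies_extreme_point_form(points, moment_functions):
    """Check the necessary form of an extreme point of a moment set:
    between 1 and k + 1 mass points with linearly independent moment vectors."""
    k = len(moment_functions)
    m = len(points)
    if m < 1 or m > k + 1:
        return False
    vectors = moment_vectors(points, moment_functions)
    return bool(np.linalg.matrix_rank(vectors) == m)


def second_moment(x):
    # Hypothetical moment function, e.g. for an average power constraint.
    return x * x


# k = 1: at most two mass points are allowed.
ok_pair = satisfies_extreme_point_form([0.0, 2.0], [second_moment])
dependent_pair = satisfies_extreme_point_form([-1.0, 1.0], [second_moment])
too_many = satisfies_extreme_point_form([0.0, 1.0, 2.0], [second_moment])
```

In this example the pair {−1, 1} fails because both points share the same value of x², so their extended moment vectors coincide.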
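The following result, taken from [10, Th. 3.2], states that when maximizing a linear functional over a moment set it is sufficient to focus on its extreme points.

Theorem 4.
Let L be a linear functional on the moment set H_k. Then,

$$\sup_{P \in H_k} L(P) = \sup_{P \in \mathrm{ex} H_k} L(P).$$

Note that Theorem 4 only requires L to be linear and not necessarily continuous.

Definition 4.
A convex set K in a vector space V is said to be linearly closed (linearly bounded) if its intersection with every line in V is a closed (bounded) subset of that line.
With this definition in hand, we close the section with a theorem proven by Dubins [11].

Theorem 5.
Let V be a vector space, f : V → R a linear functional, and let {v ∈ V : f(v) = c}, for some constant c ∈ R, be a hyperplane formed by f. Moreover, let I be the intersection of a linearly closed and linearly bounded convex set K ⊂ V with n hyperplanes. Then, every extreme point of I is a convex combination of at most n + 1 extreme points of K.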
A remarkable property of Theorem 5 is that it also holds for the infinite dimensional case.

III. CHANNEL OUTPUTS OF FINITE CARDINALITY
In this section, we consider the optimization problem (1) for channels whose outputs are of bounded cardinality. Towards this end, we are going to generalize an elegant approach proposed by Witsenhausen in [12]. The approach relies on techniques from linear optimization and convex analysis.
As the main result of this section, the following theorem characterizes under very mild conditions the structure of the optimal input distribution.

Theorem 6. (Optimal Input Distribution Under Finite Output Cardinality)
Let N ∈ N be finite and assume the following:
1) the channel output Y takes values in a finite alphabet {y_1, . . . , y_N};
2) for every i ∈ {1, . . . , N}, the map x → P_{Y|X}(y_i|x) is continuous on Ω;
3) the moment set H_k is compact.
Then, sup_{P_X ∈ H_k} I(X; Y) is attained by some P⋆_X ∈ H_k. Moreover, P⋆_X is discrete with at most N(k + 1) mass points (possibly containing points at ±∞) and we have the following:
1) if f_1, . . . , f_k are bounded and continuous on Ω, then P⋆_X has at most N + k finite mass points;
2) if f_1, . . . , f_k are bounded and continuous on Ω = R^n and are such that for every P_X with a finite number of mass points E_{P_X}[f_i(X)] < ∞ implies P_X(+∞) = P_X(−∞) = 0, then P⋆_X has at most N + k finite mass points.
Proof: First of all, since the moment set H_k is assumed to be compact and P_X → I(X; Y) is concave, by Theorem 2 we have that the supremum is attained by some P⋆_X ∈ H_k. Next, we show that P⋆_X is discrete and provide a bound on the number of probability masses. The assumption that Y is discrete allows us to write

$$I(P_X, P_{Y|X}) = H(Y) - H(Y|X). \tag{5}$$

Recall that the entropy of Y is given by

$$H(Y) = -\sum_{i=1}^{N} p_i \log p_i,$$

where the elements of the vector of output probabilities p_Y := [p_1, . . . , p_N] are of the form

$$p_i = \int_{\Omega} P_{Y|X}(y_i|x) \, dP_X(x), \quad i = 1, \dots, N,$$

so we can write p_Y as a linear transformation of P_X. Let p⋆_Y := [p⋆_1, . . . , p⋆_N] denote the output vector induced by an optimal input distribution P⋆_X and note that the conditional entropy is given by

$$H(Y|X) = \int_{\Omega} H(Y|X = x) \, dP_X(x).$$

As P_X → H(Y) is concave and P_X → H(Y|X) linear, it follows that I(P_X, P_{Y|X}) is the difference between a concave and a linear functional.
Let the set of distributions that induce the optimal output distribution be defined as

$$\mathcal{P}^\star := \{P_X \in H_k : p_Y = p_Y^\star\}.$$

Observe that P⋆ is an intersection of H_k with the N − 1 hyperplanes

$$L_i := \Big\{P_X : \int_{\Omega} P_{Y|X}(y_i|x) \, dP_X(x) = p_i^\star\Big\}, \quad i = 1, \dots, N - 1.$$

Note that we only consider N − 1 hyperplanes instead of N. This is because the output probabilities sum to one, so that L_N is redundant. Note also that each L_i is a closed set, which follows from Theorem 1 because P_{Y|X}(y_i|x) is continuous and bounded in x.
As the intersection of a compact with a closed set is compact, we have that P⋆ is compact. This implies that

$$\sup_{P_X \in \mathcal{P}^\star} I(P_X, P_{Y|X}) = \max_{P_X \in \mathcal{P}^\star} I(P_X, P_{Y|X}).$$

Moreover, it follows from (5) that

$$\max_{P_X \in \mathcal{P}^\star} I(P_X, P_{Y|X}) = \max_{P_X \in \mathcal{P}^\star} \{H(Y) - H(Y|X)\} = H(Y)\big|_{p_Y = p_Y^\star} - \min_{P_X \in \mathcal{P}^\star} H(Y|X), \tag{7}$$

where the last step follows from the fact that all distributions in P⋆ induce the same H(Y). Now, as P_X → H(Y|X) is linear, by means of Theorem 4 we conclude that

$$\min_{P_X \in \mathcal{P}^\star} H(Y|X) = \min_{P_X \in \mathrm{ex} \mathcal{P}^\star} H(Y|X).$$

In the following, let P⋆_X ∈ exP⋆ be a distribution that maximizes (7). Due to Theorem 5, we have that any P_X ∈ exP⋆, and in particular P⋆_X, can be represented as a convex combination of at most (N − 1) + 1 = N extreme points of H_k. Due to Theorem 3, however, the extreme points of H_k are given by discrete distributions with at most k + 1 points, from which we conclude that P⋆_X consists of at most N(k + 1) mass points. Now that we know the maximizing input distribution is discrete with at most N(k + 1) mass points, we are able to slightly refine the number under various assumptions. Due to the lack of space, the corresponding proofs are deferred to the extended version of this paper.

Remark 1.
The result in [12] was concerned with the scalar case only, where Ω = [−a, a] for some a > 0. All subsequent extensions of this result relied on reducing the optimization problems to the case of bounded support. The key novelty in the proof of Theorem 6 is the application of Theorem 3, which does not require showing or assuming that Ω is bounded. In fact, Theorem 3 does not make use of the spaces underlying X and Y and holds if Ω is a well-behaved subset of some vector space.
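The decomposition of the mutual information into an output entropy, which depends on the input distribution only through the induced output probabilities, and a conditional entropy, which is linear in the input distribution, can be illustrated for a discrete input with finitely many mass points. The sketch below rests on our own assumptions: the function names, the toy channel matrix `W` (one row of conditional output probabilities per mass point) and the use of logarithms to base 2 are illustrative choices that do not come from the paper.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability vector, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))


def output_distribution(masses, W):
    """Output probabilities as a linear transformation of the input masses."""
    return np.asarray(masses, dtype=float) @ np.asarray(W, dtype=float)


def mutual_information(masses, W):
    """I(X;Y) = H(Y) - H(Y|X) for a discrete input and a finite output."""
    masses = np.asarray(masses, dtype=float)
    W = np.asarray(W, dtype=float)
    h_y = entropy(output_distribution(masses, W))
    h_y_given_x = float(sum(m * entropy(row) for m, row in zip(masses, W)))
    return h_y - h_y_given_x


# Toy channel: three candidate mass points, binary output (N = 2).
W = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

# Two inputs that induce the same output distribution [0.5, 0.5].
two_point_input = np.array([0.5, 0.0, 0.5])
one_point_input = np.array([0.0, 1.0, 0.0])

i_two = mutual_information(two_point_input, W)
i_one = mutual_information(one_point_input, W)
```

Both inputs give the same H(Y), so they differ only in the linear term H(Y|X); the input with the smaller conditional entropy achieves the larger mutual information, which is the mechanism exploited in the proof.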

Remark 2.
If the functions f_1, . . . , f_k in the definition of H_k prevent the occurrence of mass points at ±∞, then the bound on the number of points can be reduced to N + k. An example of such a function is f(x) = |x|^r, r > 0, which naturally forces probability measures with a finite number of mass points to have mass points at ±∞ with zero probability.

Corollary 1.
Let U be arbitrary but independent of X. Then, Theorem 6 is valid for the optimization problem sup_{P_X ∈ H_k} I(X; Y|U).
We close this section with a historical note. In the context of discrete memoryless channels, Gallager has shown in [13] that the cardinality of the input should not exceed the cardinality of the output. However, Gallager's result does not apply to the setting of Theorem 6 in which the input space Ω may not be discrete a priori. It also does not hold for general input constraints. The approach in this section has also been applied to point-to-point channels with output quantization [14], [15].

IV. CONVEX OPTIMIZATION APPROACH
In this section, we follow the convex optimization method taken by Smith in [16]. Unlike Section III, however, we do not make any assumption on the channel output alphabets. In order to determine some properties of the optimal input distributions, we introduce tools with which we are able to obtain results for the multivariate case, which is in contrast to the majority of related works considering the scalar case only.
Whereas in the previous section we were able to obtain relatively tight bounds on the number of mass points an optimal distribution must have, to the best of our knowledge the approach of Smith has never been used to obtain similar bounds. In Section IV-B, we therefore introduce a method that can be used to bound the cardinality of the support of the optimal input distribution provided it is discrete.

A. Necessary and Sufficient Conditions for Optimality
The key tool of this section will be the notion of a weak or directional derivative over the space of probability distributions.

Definition 5.
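Let P be a convex topological space. For any two distributions P ∈ P and Q ∈ P we define the Gâteaux derivative of f : P → R at P in the direction of Q as

$$\Delta_Q f(P) := \lim_{\varepsilon \to 0^+} \frac{f((1 - \varepsilon) P + \varepsilon Q) - f(P)}{\varepsilon}.$$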
We will use the Gâteaux derivative together with the following optimization theorems.

Theorem 7.
Let P be a convex topological space and let f : P → R have a Gâteaux derivative ∆_Q f(P) for every P, Q ∈ P. Suppose that P⋆ ∈ P is a maximizer of f. Then,

$$\Delta_Q f(P^\star) \le 0 \quad \forall Q \in \mathcal{P}. \tag{8}$$

If in addition f is concave, (8) is also sufficient.

Theorem 8. (Karush-Kuhn-Tucker Conditions)
Let P be a convex topological space, f : P → R a concave functional, and g : P → R a convex functional. Assume there exists a point P ∈ P such that g(P) < 0. Furthermore, let

$$\mu := \sup_{P \in \mathcal{P},\, g(P) \le 0} f(P). \tag{9}$$

Then, there exists a constant λ ≥ 0 such that

$$\mu = \sup_{P \in \mathcal{P}} \{f(P) - \lambda g(P)\}. \tag{10}$$

If the supremum in (9) is attained by some P_0, then P_0 also attains the supremum in (10) with λg(P_0) = 0.
Unlike its counterparts in finite dimensions, the Gâteaux derivative may exist without the functional being continuous. For example, the Gâteaux derivative of a linear functional P → E_P[f(X)] is of the form

$$\Delta_Q E_P[f(X)] = E_Q[f(X)] - E_P[f(X)],$$

which exists as long as E_P[f(X)] and E_Q[f(X)] are finite. However, P → E_P[f(X)] is continuous if and only if f is bounded and continuous (see Theorem 1).
Another assumption that we make in this section is that the Gâteaux derivative of the mutual information exists.
Assumption 1. Suppose the Gâteaux derivative of the mutual information P_X → I(P_X, P_{Y|X}) exists for all P_X, Q_X ∈ M and that it is given by

$$\Delta_{Q_X} I(P_X, P_{Y|X}) = I_{Q_X}(P_X, P_{Y|X}) - I(P_X, P_{Y|X}), \tag{11}$$

with

$$I_{Q_X}(P_X, P_{Y|X}) := \int \int \log \frac{dP_{Y|X}(y|x)}{dP_Y(y; P_X)} \, dP_{Y|X}(y|x) \, dQ_X(x)$$

and P_Y(Y; P_X) denoting the channel output distribution induced by the input distribution P_X.
We do not formally prove that the Gâteaux derivative is of the form (11). However, for all the known cases of interest it is given by this expression.
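For finite input and output alphabets, the stated form of the derivative, namely the average over Q_X of the relative entropy between P_{Y|X}(·|x) and the output distribution induced by P_X, minus the mutual information at P_X, can be compared with a finite-difference quotient. The sketch below is only a sanity check under our own assumptions: a strictly positive toy channel matrix `W`, natural logarithms, and hypothetical function names.

```python
import numpy as np


def kl_rows(W, p_y):
    """Relative entropy D(W(.|x) || p_y) in nats for every input x.
    Assumes all entries of W and p_y are strictly positive."""
    return np.sum(W * np.log(W / p_y), axis=1)


def mutual_information(p, W):
    """Mutual information in nats for input masses p and channel matrix W."""
    return float(p @ kl_rows(W, p @ W))


def gateaux_closed_form(p, q, W):
    """Average of the relative entropies under q minus the mutual information at p."""
    return float(q @ kl_rows(W, p @ W)) - mutual_information(p, W)


def gateaux_finite_difference(p, q, W, eps=1e-6):
    """One-sided difference quotient along the mixture (1 - eps) p + eps q."""
    mixture = (1.0 - eps) * p + eps * q
    return (mutual_information(mixture, W) - mutual_information(p, W)) / eps


# Toy channel and two input distributions (illustrative values).
W = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.6, 0.1, 0.3])

closed = gateaux_closed_form(p, q, W)
numeric = gateaux_finite_difference(p, q, W)
```

Agreement of the two numbers up to the discretization error supports the expression in this finite setting only; it says nothing about the general case, which remains an assumption.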
Next, we provide a necessary and sufficient condition for the optimality of an input distribution P X .

Theorem 9. (Necessary and Sufficient Optimality Condition)
Let M be a convex and compact set of channel input distributions. Then, P ⋆ X ∈ M maximizes (1) if and only if ∀Q X ∈ M : I QX (P ⋆ X , P Y |X ) ≤ I(P ⋆ X , P Y |X ).
Proof: The proof is based on Theorem 7, the Gâteaux derivative (11) and the concavity of mutual information.
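When the input alphabet is finite and M is the set of all distributions on it, the optimality condition can be checked point by point: since the left-hand side is linear in Q_X, it suffices to test Dirac measures, so the relative entropy between P_{Y|X}(·|x) and the optimal output distribution must not exceed the maximal mutual information for any x. The sketch below computes a candidate optimizer with the standard Blahut-Arimoto iteration, which is our own substitution for illustration and is not a method discussed in this paper; the toy channel, the natural logarithms and all names are assumptions.

```python
import numpy as np


def kl_rows(W, p_y):
    """Relative entropy D(W(.|x) || p_y) in nats for every input x.
    Assumes all entries of W and p_y are strictly positive."""
    return np.sum(W * np.log(W / p_y), axis=1)


def blahut_arimoto(W, iterations=500):
    """Blahut-Arimoto iteration for an unconstrained discrete memoryless channel."""
    n_inputs = W.shape[0]
    p = np.full(n_inputs, 1.0 / n_inputs)
    for _ in range(iterations):
        d = kl_rows(W, p @ W)
        p = p * np.exp(d)   # reweight each input by exp of its relative entropy
        p = p / p.sum()     # renormalize to a probability vector
    return p


# Toy channel: three inputs, binary output. The middle input is a noisy
# mixture of the other two and should receive no mass.
W = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])

p_star = blahut_arimoto(W)
d_star = kl_rows(W, p_star @ W)      # relative entropy for each input x
capacity = float(p_star @ d_star)    # mutual information at p_star

# Optimality check: no input exceeds the capacity, equality on the support.
max_violation = float(np.max(d_star) - capacity)
```

In this example the inequality is tight at the two outer inputs, which carry all the mass, and strict at the middle input, which carries none.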

B. Structure of the Support
In many cases of interest, the set M is equal to H k with functions f 1 , . . . , f k chosen such that H k is compact. In other words, M is a moment set, a set of distributions that are of compact support, or a combination of both. For ease of presentation, we focus on the case M = H 1 only. The results, however, are extendible to the cases H k , k ∈ N.
In order to study the support of the optimal input distribution we will need the following definition.

Definition 6.
A point x ∈ R n is said to be a point of increase of a distribution P X , if for any open subset O ⊂ R n containing x, P X (O) > 0. We denote the set of points of increase of P X as E(P X ) ⊆ R n .
Observe that P_X(E(P_X)) = 1. In fact, E(P_X) is the smallest closed subset of R^n whose probability is 1.

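Theorem 10. (Sufficient and Necessary Condition)
Let i(x; P_X, P_{Y|X}) : R^n → R be defined as

$$i(x; P_X, P_{Y|X}) := \int \log \frac{dP_{Y|X}(y|x)}{dP_Y(y; P_X)} \, dP_{Y|X}(y|x).$$

Then, P⋆_X is an optimizer if and only if there exists a λ ≥ 0 such that the following three conditions are satisfied:

$$\lambda \big(E_{P_X^\star}[f(X)] - c\big) = 0,$$

$$i(x; P_X^\star, P_{Y|X}) \le I(P_X^\star, P_{Y|X}) + \lambda (f(x) - c) \quad \forall x \in \Omega,$$

$$i(x; P_X^\star, P_{Y|X}) = I(P_X^\star, P_{Y|X}) + \lambda (f(x) - c) \quad \forall x \in E(P_X^\star), \tag{14}$$

where f and c denote the moment function and the constant that define H_1.
We will also need the following definition.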

Definition 7.
A set A ⊂ X is said to be dense in the set X if every element x ∈ X either belongs to A or is an accumulation point of A. A set A ⊂ X is said to be nowhere dense if for every nonempty open set U ⊂ X, the intersection U ∩ A is not dense in U.

Theorem 11.
Let g be a real analytic function on a connected open set D ⊆ R^n. Then,
1) if g is constant on a nonempty open subset of D, then g is constant on D;
2) if g is constant on a subset of D of positive Lebesgue measure, then g is constant on D.

Theorem 12. (Properties of the Optimal Support)
Suppose that Ω ⊂ R^n contains an open subset and let i(x; P⋆_X, P_{Y|X}) and f be non-constant, real analytic functions on Ω. Then, E(P⋆_X) ⊂ Ω ⊂ R^n is a nowhere dense subset of Ω and is of Lebesgue measure zero. In addition, if n = 1, then for every finite interval J, the set E(P⋆_X) ∩ J is of finite cardinality.
Proof: We first show that E(P⋆_X) is a nowhere dense set. Towards a contradiction, assume that E(P⋆_X) is not nowhere dense in Ω. Therefore, by Definition 7, there exists a nonempty open set U ⊂ Ω such that U ∩ E(P⋆_X) is dense in U. By using (14), we have that

$$i(x; P_X^\star, P_{Y|X}) - \lambda f(x) = I(P_X^\star, P_{Y|X}) - \lambda c, \quad x \in U \cap E(P_X^\star),$$

with c denoting the constant in the moment constraint that defines H_1, and therefore

$$g(x) := i(x; P_X^\star, P_{Y|X}) - \lambda f(x)$$

is constant on U ∩ E(P⋆_X). Since U ∩ E(P⋆_X) is dense in U, by the properties of continuous functions (analytic functions are continuous), g is also constant on U. Moreover, since U is an open set and g is analytic and constant on U, by property 1) of Theorem 11, g must be constant on Ω. However, this leads to a contradiction as we assumed that g is non-constant on Ω. Therefore, E(P⋆_X) is nowhere dense in Ω.
The conclusion that E(P ⋆ X ) has Lebesgue measure zero follows along similar lines by assuming that E(P ⋆ X ) is a set of positive measure and using property 2) of Theorem 11 to conclude that g must be constant on all of Ω. This again leads to a contradiction, which implies that E(P ⋆ X ) must have Lebesgue measure zero.

Remark 3.
Note that if f and i(x; P⋆_X, P_{Y|X}) are orthogonally equivariant (i.e., they only depend on ∥x∥), then it is not difficult to see that E(P⋆_X) is a union of concentric spheres. That is,

$$E(P_X^\star) = \bigcup_{j} C(r_j), \tag{15}$$

where C(r_j) := {x ∈ R^n : ∥x∥ = r_j}, for some r_j. For example, this is the case if P_{Y|X} = N(x, I_n) with I_n the n × n identity matrix. The example in (15) shows that the cardinality of E(P⋆_X) is uncountable and that discrete inputs are in general not optimal. Theorem 12 can therefore generally not be improved in the sense that we cannot make statements about the cardinality of E(P⋆_X) if Ω = R^n for n > 1. Note, however, that the magnitude of X ∼ P⋆_X is discrete.

Remark 4.
To the best of our knowledge, the approach taken in this section has never been used to obtain bounds on the number of mass points, not even for a Gaussian channel with amplitude-constrained inputs (i.e., X ∈ [−A, A] for some A > 0). An attempt to determine the position and the number of mass points was made in [18], where it was conjectured that the number of points increases by at most one. By using tools from complex analysis, one can show that if x → i(x; P⋆_X, P_{Y|X}) has a complex analytic extension to an open subset of C containing [−A, A], then the number of mass points can be expressed as a contour integral that involves i and its derivative i′ with respect to z and is taken along a regular closed curve γ that contains [−A, A]. For details we refer to the extended version of this paper.

C. Some Remarks on Related Works
As mentioned at the beginning, the approach followed in this section was first presented in [16] in the context of scalar Gaussian channels with an amplitude- or power-constrained input. In [19], the result was extended to the complex Gaussian case. The authors of [20] considered the Gaussian noise channel subject to Rayleigh fading, where channel state information is available at neither the transmitter nor the receiver and where the channel input is subject to a power constraint. In particular, it was shown that the optimal input distribution is discrete albeit the number of mass points is countably infinite. In [21], similar results were obtained for power-constrained complex-valued channels with a rapidly varying phase.
In [22], the approach was applied to a large class of vector channels that are conditionally Gaussian (i.e., P_{Y|X} is Gaussian) and where the input is constrained to a Euclidean ball and/or has finite power. There are also many works focusing on non-Gaussian additive noise channels. In [23], scalar additive noise channels with amplitude-constrained inputs are considered. The author provides sufficient conditions on the input density to guarantee the optimal input distribution is discrete with finitely many points. For additive noise channels with arbitrary input constraints, the most general set of conditions under which the optimal input distribution is bounded or discrete can be found in [24]. For other results on additive channels with various input constraints, the interested reader is referred to [25]-[31]. Finally, it has to be emphasized that the approach of this section has also been applied to non-additive noise channels [32], [33] as well as to multiuser channels [34].

V. CONCLUSION
In this work, we have focused on two optimization methods that follow ideas from the theories of linear and convex optimization. Of course, there exist other approaches for finding capacity-achieving input distributions. For example, a promising approach is to connect, via I-MMSE type relationships [35]-[37], the theory of least-favorable prior distributions for estimation measures with the search for optimal input distributions. Such connections have been made in the context of Gaussian noise channels [38]-[40]. Another challenging future direction might be to evaluate how the approaches can be extended to multiuser settings such as the interference channel.