Properties of the Support of the Capacity-Achieving Distribution of the Amplitude-Constrained Poisson Noise Channel

This work considers a Poisson noise channel with an amplitude constraint. It is well-known that the capacity-achieving input distribution for this channel is discrete with finitely many points. We sharpen this result by introducing upper and lower bounds on the number of mass points. Concretely, an upper bound of order $\mathsf{A} \log^2(\mathsf{A})$ and a lower bound of order $\sqrt{\mathsf{A}}$ are established, where $\mathsf{A}$ is the constraint on the input amplitude. In addition, along the way, we show several other properties of the capacity and capacity-achieving distribution. For example, it is shown that the capacity is equal to $ - \log P_{Y^\star}(0)$ where $P_{Y^\star}$ is the optimal output distribution. Moreover, an upper bound on the values of the probability masses of the capacity-achieving distribution and a lower bound on the probability of the largest mass point are established. Furthermore, on a per-symbol basis, a nonvanishing lower bound on the probability of error for detecting the capacity-achieving distribution is established under the maximum a posteriori rule.


Introduction
We consider a discrete-time memoryless Poisson channel. The output $Y$ of this channel takes values on the set of nonnegative integers $\mathbb{N}_0$ and the input $X$ takes values on the set of nonnegative real numbers. The conditional probability mass function (pmf) of the output random variable $Y$ given the input $X$ that specifies the channel is given by
$$P_{Y|X}(y|x) = \frac{x^y \mathrm{e}^{-x}}{y!}, \quad x \ge 0, \; y \in \mathbb{N}_0. \tag{1}$$
In (1), we use the standard convention that $0^0 = 1$ and $0! = 1$. The capacity of this channel, where the input $X$ is subject to the amplitude constraint $0 \le X \le \mathsf{A}$, is given by
$$C(\mathsf{A}) = \max_{X:\, 0 \le X \le \mathsf{A}} I(X; Y).$$
Finding the capacity of this channel remains an elusive task. The goal of this work is to make progress on this problem by studying the properties of the capacity-achieving distribution, denoted by $P_{X^\star}$.
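As a quick numerical companion (ours, not from the paper), the mutual information achieved by any fixed discrete input can be evaluated directly from the channel law in (1). The two-point input and its weights below are illustrative guesses, not the optimizer.

```python
# Sketch (assumed example): evaluate I(X; Y) in nats for a discrete input
# over the Poisson channel P_{Y|X}(y|x) = x^y e^{-x} / y!.
import numpy as np
from scipy.stats import poisson

def mutual_information(points, probs, y_max=200):
    """I(X;Y) for a discrete input with mass `points` and weights `probs`."""
    y = np.arange(y_max)
    pyx = np.array([poisson.pmf(y, x) for x in points])   # channel rows P(y|x_i)
    py = probs @ pyx                                      # output pmf P_Y
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(pyx > 0, np.log(pyx / py), 0.0)
    return float(np.sum(probs[:, None] * pyx * ratio))

A = 2.0
# For A below Abar (~3.37), a two-point input {0, A} is known to be optimal;
# the weight 0.55 here is an arbitrary illustrative choice.
I = mutual_information(np.array([0.0, A]), np.array([0.55, 0.45]))
```

The truncation `y_max` is safe here because the Poisson tail beyond 200 is negligible for means in `[0, A]`.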

Contributions and Paper Outline
The outline and the contributions of the paper are as follows. The remaining part of Section 1 surveys the relevant literature, presents our notation, and goes over the key tools needed in our analysis, such as the oscillation theorem and the strong data-processing inequality. Section 2 presents our main results, which include the following: a new compact representation of the capacity; an upper bound on the values of the probability masses of the optimal input distribution $P_{X^\star}$; a lower bound on the probability of the largest mass point (i.e., $P_{X^\star}(\mathsf{A})$); a lower bound on the size of the support of $P_{X^\star}$; an upper bound on the size of the support of $P_{X^\star}$; and a nonvanishing lower bound, on a per-symbol basis, on the probability of error for detecting the capacity-achieving distribution. Section 3 is dedicated to the proofs. Section 4 concludes the paper with some final remarks and possible future directions.

Prior Work
The now classical approach developed by Witsenhausen in [1] says that if the output alphabet has cardinality n, then the support of the optimal input distribution cannot contain more than n points, irrespective of the size of the input alphabet. However, since the output alphabet of the Poisson noise channel is countably infinite, the Witsenhausen approach does not apply. Instead, the approach that has been applied to the Poisson noise channel largely follows the analyticity idea introduced by Smith in [2] in the context of the amplitude-constrained Gaussian noise channel. The interested reader is referred to [3] for a summary of these techniques. In this work, we also follow the latter technique; however, we considerably generalize and improve this approach. In what follows, we summarize the known results on the Poisson noise channel and highlight the elements of the new technique. The discrete-time Poisson noise channel is suited to model low-intensity, direct-detection optical communication channels [4]; the interested reader is also referred to a survey on free-space optical communications in [5]. The Poisson channel can be seen as a limiting case of the Binomial channel [6], which can be used to model the number of particles released by a sender unit in molecular communications [7]. A key difference in the mathematical formulation between the Poisson and the Binomial channel models is the infinite/finite nature of the output alphabet. In this work, we are concerned with the discrete-time channel; however, there also exists a large literature on continuous-time channels, and the interested reader is referred to the survey in [8].
The first major study of the capacity-achieving distribution for the Poisson channel was undertaken in [9], where the authors considered the capacity with and without an additional average-power constraint $\mathsf{P}$ on the input. The authors of [9] derived the Karush-Kuhn-Tucker (KKT) conditions that are necessary to study the structure of the capacity-achieving distribution. These KKT conditions were then used to show that the support of an optimal input distribution for any $\mathsf{A}$ can contain at most one mass point in the open interval (0, 1). Moreover, for any $\mathsf{A} \le 1$, it was shown that the optimal input distribution consists of two mass points at 0 and $\mathsf{A}$ and is given by the distribution in (4). Subsequent works considered the more general case that includes the possibility of a nonzero dark current² parameter. Moreover, using the analyticity idea of [2], in [11] it was shown that, for $\mathsf{A} < \infty$ and any $\mathsf{P} > 0$, the optimizing input distribution is unique and discrete with finitely many mass points. Moreover, for the case of $\mathsf{P} \ge \mathsf{A}$ (i.e., the average-power constraint is not active) and zero dark current, it was shown that the distribution in (4) continues to be capacity-achieving if and only if $\mathsf{A} \le \bar{\mathsf{A}}$, where $\bar{\mathsf{A}} \approx 3.3679$. Further studies of the conditions under which the capacity-achieving distribution is binary have been undertaken by the authors of [13] and [14]. For example, in [13], it was shown that with both the amplitude and the average-power constraint, the optimal input distribution always contains a mass point at 0. Moreover, in the case of only an amplitude constraint, the optimal input distribution contains mass points at both 0 and $\mathsf{A}$. In [14], it was shown that if $\mathsf{P} < \mathsf{A}/2$ and the dark current is large enough, a suitably chosen binary distribution is optimal. The capacity-achieving distribution with only an average-power constraint was considered in [15] and was shown to be discrete with infinitely many mass points.
The low-average-power and the low-amplitude asymptotics of the capacity have been studied in [16][17][18][19][20]. A number of papers have also focused on upper and lower bounds on the capacity. The first upper and lower bounds on the capacity were derived in [9] for two situations: the case of the average-power constraint only, and the case of both the average-power and the amplitude constraint with $\mathsf{A} \le 1$. The authors of [12] derived upper and lower bounds, in the case of the average-power constraint only, by focusing on the regime where both $\mathsf{P}$ and the dark current tend to infinity with a fixed ratio. Firm upper and lower bounds on the capacity in the case of only the average-power constraint and no dark current have been derived in [16] and [17]. Bounds in [16] and [17] have been further improved in [15] and [21]. The most general bounds on the capacity that consider both the amplitude and the average-power constraints on the input and hold for an arbitrary value of the dark current have been derived in [22]. The bounds in [22] have been shown to be tight in the regime where both the average-power and the amplitude constraint approach infinity with a fixed ratio $\mathsf{P}/\mathsf{A}$. Finally, the authors of [23] sharpened the results of [22] for small values of $\mathsf{P}$ and $\mathsf{A}$.

Notation
Throughout the paper, deterministic scalar quantities are denoted by lower-case letters and random variables are denoted by upper-case letters. The sets of real numbers, nonnegative integers, and positive integers are denoted by $\mathbb{R}$, $\mathbb{N}_0$, and $\mathbb{N}$, respectively. The function $1_{\mathcal{A}}(x)$ denotes the indicator function of the set $\mathcal{A}$, where $1_{\mathcal{A}}(x) = 1$ if $x \in \mathcal{A}$ and $1_{\mathcal{A}}(x) = 0$ otherwise. We denote the distribution of a random variable $X$ by $P_X$. The support set of $P_X$ is denoted and defined as $\mathrm{supp}(P_X) = \{x : \text{for every open set } \mathcal{D} \ni x \text{ we have that } P_X(\mathcal{D}) > 0\}$.
The relative entropy between distributions $P$ and $Q$ will be denoted by $D(P \| Q)$. A Poisson pmf with expected value $x$ is denoted by $\mathcal{P}(x)$. The number of zeros of a function $f : \mathbb{R} \to \mathbb{R}$ on the interval $I$ is denoted by $N(I, f)$. Let $g_1$ and $g_2$ be nonnegative functions; then $g_1(x) = \Omega(g_2(x))$ means that $g_2(x) = O(g_1(x))$, and $g_1(x) = o(g_2(x))$ means that $\lim_{x \to \infty} \frac{g_1(x)}{g_2(x)} = 0$.
² A more general Poisson noise model also incorporates a nonnegative parameter $\lambda$ known as the dark current parameter, and the channel is given by $P_{Y|X}(y|x) = \frac{(x+\lambda)^y \mathrm{e}^{-(x+\lambda)}}{y!}$, $x \ge 0$, $y \in \mathbb{N}_0$. The dark current parameter $\lambda$ represents the intensity of an additional source of noise or interference, which produces on average $\lambda$ extra photons at a particle counter [4,[10][11][12]].
Finally, the Lambert W-function is denoted by $W_k$, where $k$ indicates the branch [24]. Since we are dealing with real numbers, we only use the principal branch $W_0$ and the lower branch $W_{-1}$. Recall that, for real numbers $x$ and $y$, the Lambert W-function provides a solution to the equation $y \mathrm{e}^y = x$, which can be solved for $y$ only if $x \ge -\mathrm{e}^{-1}$. Moreover, in the regime $x \ge 0$, the solution is unique and is given by $y = W_0(x)$, and in the regime $-\mathrm{e}^{-1} \le x < 0$, there are two solutions, given by $y = W_0(x)$ and $y = W_{-1}(x)$.
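These facts can be checked numerically; the snippet below (ours) uses scipy.special.lambertw, whose branch index k matches the notation $W_k$ above.

```python
# Numerical check of the Lambert-W facts quoted above: y = W_k(x) solves
# y e^y = x, with two real solutions in the regime -1/e <= x < 0.
import numpy as np
from scipy.special import lambertw

x = -0.2                       # in the two-solution regime -1/e <= x < 0
y0 = lambertw(x, k=0).real     # principal branch W_0
ym1 = lambertw(x, k=-1).real   # lower branch W_{-1}
assert abs(y0 * np.exp(y0) - x) < 1e-12
assert abs(ym1 * np.exp(ym1) - x) < 1e-12
assert ym1 < y0 < 0            # the two real solutions are distinct
```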

Overview of the Key Tools
In this section, we overview the main tools needed in our analysis.
Strong Data-Processing Inequality
In order to find an upper bound on the values of the probability masses and a lower bound on the number of mass points, we will rely on the strong data-processing inequality for the relative entropy. The study of strong data-processing inequalities has recently received some attention, and the interested reader is referred to [25][26][27] and references therein.
Fix some channel $Q_{Y|X} : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets, respectively. Let $Q_Y$ denote the distribution on $\mathcal{Y}$ induced by the push-forward of the distribution $Q_X$ on $\mathcal{X}$ through $Q_{Y|X}$. We denote this operation by $Q_X \to Q_{Y|X} \to Q_Y$. The classical data-processing inequality for the relative entropy states that, for any two input distributions $P_X$ and $Q_X$,
$$D(P_Y \| Q_Y) \le D(P_X \| Q_X). \tag{8}$$
The strong data-processing inequality is the sharpening of (8) where one seeks to find a coefficient $0 < \eta < 1$ such that
$$D(P_Y \| Q_Y) \le \eta \, D(P_X \| Q_X). \tag{9}$$
It is not difficult to see that the smallest such coefficient is given by
$$\eta_{\mathsf{KL}}(\mathcal{X}; Q_{Y|X}) = \sup_{(P_X, Q_X):\, 0 < D(P_X \| Q_X) < \infty} \frac{D(P_Y \| Q_Y)}{D(P_X \| Q_X)}. \tag{10}$$
The quantity $0 < \eta_{\mathsf{KL}}(\mathcal{X}; Q_{Y|X}) \le 1$ is known as the contraction coefficient. In Section 3.2, for the case of the Poisson channel, we will provide a nontrivial upper bound on $\eta_{\mathsf{KL}}(\mathcal{X}, Q_{Y|X})$.
Oscillation Theorem
To find an upper bound on the number of points in the support of $P_{X^\star}$, we will follow the proof technique developed in [28] for the Gaussian noise channel. The key step that we borrow from [28] is the use of the variation-diminishing property, which is captured by the oscillation theorem of Karlin [29]. To state the oscillation theorem, we need the following definition.
Definition 1. Sign Changes of a Function. The number of sign changes of a function $\xi : \mathcal{X} \to \mathbb{R}$ is given by
$$\mathscr{S}(\xi) = \sup_{m \in \mathbb{N}} \; \sup_{y_1 < \cdots < y_m} \mathscr{N}\{\xi(y_i)\}_{i=1}^m,$$
where $\mathscr{N}\{\xi(y_i)\}_{i=1}^m$ is the number of sign changes of the sequence $\{\xi(y_i)\}_{i=1}^m$.
The following theorem, shown in [29, Thm. 3] (see also [30, Thm. 3.1, p. 21]), will be a key step in the proof of the upper bound on the number of mass points.
Theorem 1. Oscillation Theorem. Given domains $I_1$ and $I_2$, let $p : I_1 \times I_2 \to \mathbb{R}$ be a strictly totally positive kernel.³ For an arbitrary $y$, suppose $p(\cdot, y) : I_1 \to \mathbb{R}$ is an $n$-times differentiable function. Assume that $\mu$ is a regular sigma-finite measure on $I_2$, and let $\xi : I_2 \to \mathbb{R}$ be a function with $\mathscr{S}(\xi) = n$. For $x \in I_1$, define
$$\Xi(x) = \int_{I_2} \xi(y)\, p(x, y)\, \mathrm{d}\mu(y).$$
If $\Xi : I_1 \to \mathbb{R}$ is an $n$-times differentiable function, then either $N(I_1, \Xi) \le n$, or $\Xi \equiv 0$.
³ A function $f : I_1 \times I_2 \to \mathbb{R}$ is said to be a strictly totally positive kernel of order $n$ if $\det [f(x_i, y_j)]_{i,j=1}^m > 0$ for all $1 \le m \le n$, for all $x_1 < \cdots < x_m \in I_1$, and for all $y_1 < \cdots < y_m \in I_2$. If $f$ is a strictly totally positive kernel of order $n$ for all $n \in \mathbb{N}$, then $f$ is a strictly totally positive kernel.
The above theorem says that the number of zeros of a function Ξ(x), which is the output of integral transformation, is less than the number of sign changes of the function ξ(y), which is the input to the integral transformation. In the rest of this paper, we will take the kernel p(x, y) to be the Poisson transition probability P Y |X in (1), µ to be the counting measure, and the domains will be set to I 1 = {x : x ≥ 0} and I 2 = N 0 . The fact that the Poisson transition probability is a strictly totally positive kernel was shown in [29]. The interested reader is also referred to [30, p. 19], where it is shown that members of the exponential family (e.g., Poisson, Gaussian, chi-squared) are positive definite kernels. Figure 1 shows an example of Theorem 1 for the Poisson kernel.
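As an illustrative sanity check (ours, with a few hand-picked points, not an exhaustive verification), the strict positivity of the minors of the Poisson kernel can be confirmed numerically:

```python
# Check that the Poisson kernel p(x, y) = x^y e^{-x} / y! has positive
# minors det[p(x_i, y_j)] on increasing x- and y-grids, as required by the
# strictly-totally-positive-kernel definition.
import itertools
import numpy as np
from scipy.stats import poisson

xs = [0.5, 1.5, 3.0, 6.0]      # x_1 < ... < x_m in I_1
ys = [0, 1, 3, 7]              # y_1 < ... < y_m in I_2 = N_0
K = np.array([[poisson.pmf(y, x) for y in ys] for x in xs])
for m in range(1, 5):
    for rows in itertools.combinations(range(4), m):
        for cols in itertools.combinations(range(4), m):
            # every minor with increasing row/column indices must be positive
            assert np.linalg.det(K[np.ix_(rows, cols)]) > 0
```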
The oscillation theorem allows us to upper-bound the number of points in the support of $P_{X^\star}$ by the number of sign changes of a function that is related to the output distribution $P_{Y^\star}$. In the case of the Gaussian noise channel, in order to count the number of sign changes, one needs to resort to complex-analytic techniques. In contrast, in the Poisson case, due to the discrete nature of the channel, we no longer need to rely on complex-analytic techniques, which simplifies this part of the analysis.
However, the analysis for the Poisson channel is not necessarily simpler than that in the Gaussian case. One crucial step in both proofs relies on finding a lower bound on the output pdf in the Gaussian case and on the output pmf in the Poisson case. In the Gaussian case, this can be done by using Jensen's inequality, and the resulting bound is universal and independent of $P_{X^\star}$. In the Poisson noise case, however, a distribution-independent bound cannot be obtained, and the lower bound on the tail depends on $P_{X^\star}$. Specifically, the lower bound depends on the value of $P_{X^\star}(\mathsf{A})$. Therefore, to complete the proof, we also need to find a lower bound on $P_{X^\star}(\mathsf{A})$.

Connection Between Estimation and Information Measures
Some of the key intermediate steps in our proofs will rely on identities that connect information measures and estimation measures. In particular, we will rely on the expression for the conditional expectation derived in [31]. For further connections between estimation- and information-theoretic measures, the interested reader is referred to [32] and [33] and references therein.

Main Results
The main results of this paper are summarized in the following theorem.
Theorem 2. The capacity and the capacity-achieving distribution $P_{X^\star}$ of an amplitude-constrained Poisson noise channel satisfy the following properties:
• A New Capacity Expression: For every $\mathsf{A} \ge 0$, the capacity is given by
$$C(\mathsf{A}) = -\log P_{Y^\star}(0), \tag{13}$$
where $P_{Y^\star}$ is the capacity-achieving output distribution induced by $P_{X^\star}$.
• An Upper Bound on the Probabilities: For every $\mathsf{A} \ge 0$,
$$P_{X^\star}(x) \le \mathrm{e}^{-\frac{C(\mathsf{A})}{1-\mathrm{e}^{-\mathsf{A}}}}, \quad x \in \mathrm{supp}(P_{X^\star}). \tag{14}$$
A Location-Dependent Bound: In addition, every mass point satisfies the location-dependent bound in (15); if $|\mathrm{supp}(P_{X^\star})| = 2$, then the bound in (15) becomes an equality.
• A Lower Bound on the Probability of the Largest Point: For all $\mathsf{A}$ large enough, the lower bound on $P_{X^\star}(\mathsf{A})$ in (16).
• On the Location of Support Points: The bounds in (17) on the locations of the support points other than $0$ and $\mathsf{A}$.
• A Lower Bound on the Size of the Support: For every $\mathsf{A} \ge 0$,
$$|\mathrm{supp}(P_{X^\star})| \ge \mathrm{e}^{\frac{C(\mathsf{A})}{1-\mathrm{e}^{-\mathsf{A}}}}. \tag{18}$$
• An Upper Bound on the Size of the Support: For every $\mathsf{A} \ge 0$, the implicit bound in (19); in addition, for all $\mathsf{A}$ large enough, the explicit bound in (20) of order $\mathsf{A} \log^2(\mathsf{A})$.
The proof of Theorem 2 is given in Section 3. A few comments are now in order.

Numerical Simulations
In order to aid our discussion, we have also numerically computed the optimal input distributions for values of $\mathsf{A}$ up to 15. Fig. 2 depicts the output of this simulation. We note that there are several numerical recipes for generating an optimal input distribution [34][35][36]. However, most of these approaches ultimately optimize over the space of distributions, which is an infinite-dimensional space. As was already alluded to in [2] and [28], a firm upper bound on the number of mass points, such as the one in Theorem 2, allows us to move the optimization from the space of probability distributions to the space $\mathbb{R}^{2n}$, where $n$ is the number of points. Working in $\mathbb{R}^{2n}$ allows us to employ methods such as the projected gradient ascent [37], which was used to generate the plots in Fig. 2. A quick sketch of how to implement the projected gradient ascent is given in Appendix A.
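For illustration, the following sketch (ours; the step size, iteration count, number of points n, and truncation y_max are ad hoc choices, and the routine is a simplified variant of the approach described in Appendix A) implements projected gradient ascent over the 2n parameters:

```python
# Sketch: maximize I(X;Y) over n mass points in [0, A] and their weights.
import numpy as np
from scipy.stats import poisson

def mi_and_grads(x, p, y_max=200):
    """Mutual information (nats) and gradients w.r.t. points x and weights p."""
    y = np.arange(y_max)
    pyx = np.array([poisson.pmf(y, xi) for xi in x])      # channel rows P(y|x_i)
    py = p @ pyx                                          # output pmf
    with np.errstate(divide="ignore", invalid="ignore"):
        logratio = np.where((pyx > 0) & (py > 0), np.log(pyx / py), 0.0)
    mi = float(np.sum(p[:, None] * pyx * logratio))
    # dI/dp_i = D(P_{Y|x_i} || P_Y) - 1; the constant -1 is absorbed by projection
    gp = np.sum(pyx * logratio, axis=1)
    # d/dx pmf(y, x) = pmf(y-1, x) - pmf(y, x) for the Poisson pmf
    dpyx = np.array([np.concatenate(([0.0], poisson.pmf(y[:-1], xi)))
                     - poisson.pmf(y, xi) for xi in x])
    gx = p * np.sum(dpyx * logratio, axis=1)
    return mi, gx, gp

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

A, n = 3.0, 3
rng = np.random.default_rng(0)
x, p = np.sort(rng.uniform(0.0, A, n)), np.full(n, 1.0 / n)
for _ in range(300):
    mi, gx, gp = mi_and_grads(x, p)
    x = np.clip(x + 0.05 * gx, 0.0, A)     # project the points back into [0, A]
    p = project_simplex(p + 0.05 * gp)     # project the weights onto the simplex
```

After convergence, weights that the simplex projection drives to zero indicate that fewer than n points suffice for the given A.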

On the Order of the Bounds
First, note that almost all of our bounds depend on the value of $C(\mathsf{A})$, which is currently unknown for $\mathsf{A} \ge \bar{\mathsf{A}} \approx 3.4$. However, this is not a limitation of our results, as we do have access to upper and lower bounds on $C(\mathsf{A})$ that are tight for large $\mathsf{A}$, such as those in [22], which suggest that
$$C(\mathsf{A}) = \frac{1}{2}\log(\mathsf{A}) + o(\log \mathsf{A}). \tag{21}$$
Moreover, some of the bounds in Theorem 2, such as those in (16) and (20), while firm, are meant to be used for large values of $\mathsf{A}$. Therefore, combining the bounds in Theorem 2 with the bound in (21), we arrive at the following: the order of the lower bound on the number of points is $\sqrt{\mathsf{A}}$, and the order of the upper bound on the number of points is $\mathsf{A} \log^2(\mathsf{A})$. It is interesting to speculate as to the reason why the bounds do not match and have different orders.
First, note that Theorem 2 presents two upper bounds on the number of points. The first, implicit bound in (19) depends on the value of $P_{X^\star}(\mathsf{A})$. The second bound in (20) is an explicit bound in terms of $\mathsf{A}$ and is derived by plugging the lower bound on $P_{X^\star}(\mathsf{A})$ in (16) into the first bound in (19). We suspect that one of the reasons why the bounds do not match is the lower bound on $P_{X^\star}(\mathsf{A})$ in (16), which we believe is not tight. Hence, one interesting future direction is to improve the lower bound on the value of $P_{X^\star}(\mathsf{A})$, which, in view of (19), would lead to a better upper bound on $|\mathrm{supp}(P_{X^\star})|$. However, to the best of our knowledge, there are no other methods for finding lower or upper bounds on the probabilities of the optimal input distribution. Indeed, one of the contributions of this work is the introduction of two such methods: one for finding an upper bound on the values of the probabilities and one for finding a lower bound.
Finally, we would like to point out that numerical simulations are not useful for predicting the order of the number of points. For large values of $\mathsf{A}$, the simulations become numerically unstable, and it is difficult to calculate the optimal input distribution and predict the order of the number of points.

On the Upper Bounds on the Probabilities in (14) and (15)
It is also interesting to ask how tight the upper bounds on the probabilities in (14) and (15) are.
Note that the bound in (14) is universal and does not depend on the positions of the probability masses, while the bound in (15) depends on the positions of the points. The advantage of the bound in (15) is that it can be tighter than the universal bound in (14). For example, in the regime $\mathsf{A} \le \bar{\mathsf{A}}$, where we only have two points in the support, the bound is achieved with equality. The clear disadvantage of the bound in (15) is that we do not know the location of the points (except 0 and $\mathsf{A}$). However, such a bound might become useful once better estimates for the locations of the mass points are found. Some preliminary estimates of the locations are provided in (17). Fig. 2c plots the upper bounds in (14) and (15) and compares them to the values of $P_{X^\star}$. To create the plot in the regime $\mathsf{A} \le \bar{\mathsf{A}}$, we have used the exact expressions for $P_{X^\star}(\mathsf{A})$ and $P_{X^\star}(0)$ in (4). To create the plot in the regime $\mathsf{A} > \bar{\mathsf{A}}$, first note that the upper bound in (14) can be loosened to (25), where we have used the fact that $C(\mathsf{A}) \ge I(\tilde{X}; \tilde{Y})$ for any random variable $\tilde{X} \in [0, \mathsf{A}]$, where $\tilde{Y}$ is induced by $\tilde{X}$. Therefore, since we can choose any $\tilde{X} \in [0, \mathsf{A}]$, we selected it to be the one that is the output of the numerical simulation. The bound in (25) is only computed for $x = \mathsf{A}$, as we do not know the locations of the other points and only have estimates for these. From the simulations in Fig. 2c, the bounds in (14) and (15) appear to be relatively tight. The bound in (14) relies on the strong data-processing inequality. Specifically, the factor $\frac{1}{1-\mathrm{e}^{-\mathsf{A}}}$ in the exponent comes from using the strong data-processing inequality. The dotted black curve in Fig. 2c plots the loosened version of the bound in (14) that ignores the contribution of the strong data-processing inequality. From the comparison in Fig. 2c, we see that the contribution of the strong data-processing inequality is nontrivial, especially for small and medium values of $\mathsf{A}$.
On the Bound in (17)
In addition to finding bounds on the number of points and the values of the probabilities, we have also provided additional information about the location of the points. Specifically, (17) provides information about the location of support points other than 0 and $\mathsf{A}$. From the bound in (17), we see that the second-largest point can never be too close to $\mathsf{A}$. Specifically, according to the bound in (17), the gap between $\mathsf{A}$ (i.e., the largest point) and the second-largest point is at least one. In fact, the numerical simulations shown in Fig. 2a suggest that this gap is much larger. In particular, the simulations suggest that the gap is not constant but is an increasing function of $\mathsf{A}$. Therefore, one interesting future direction would be to verify this behavior and produce a better bound in (17).

Similarly, from the lower bound in (17), we see that the second-smallest point cannot be too close to the zero point. However, as $\mathsf{A}$ increases, the distance is allowed to get smaller. Note that our limited simulation results suggest a better lower bound, namely $x^\star \ge 1$. Therefore, one interesting future direction would be to either demonstrate the existence of a mass point in the interval (0, 1) or show that there is no such mass point. Note that the work of [9] already showed that there is at most one point in the interval (0, 1).
Beyond theoretical interest, estimates of the locations of the mass points might also be of interest from a practical point of view, as such estimates can inform the design of practical constellations for the Poisson noise channel.

On the Equivocation, Symbol Error Probability, and Entropy
It is well-known that the capacity-achieving distribution should be 'difficult' to detect or estimate on a per-symbol basis. To make this statement explicit, we consider the equivocation $H(X^\star|Y^\star)$ and the probability of error under the maximum a posteriori (MAP) rule (i.e., $P_e = \mathbb{P}[X^\star \ne \hat{X}(Y^\star)]$, where $\hat{X}(Y^\star)$ is the MAP estimate). The plots in Fig. 3a and Fig. 3b show that the equivocation and $P_e$ for the capacity-achieving input have relatively high values. With Theorem 2 at our disposal, we can now show the following result regarding the asymptotic behavior of the error probability.
Proof. The probability of error for the MAP rule can be written as in (29), where the expression for the probability of error comes from [38]; (31) follows by using the bound in (14); (34) follows by using that $\max_{x \in [0, \mathsf{A}]} P_{Y|X}(y|x) \le P_{Y|X}(y|y)$ for $y < \mathsf{A}$ and $\max_{x \in [0, \mathsf{A}]} P_{Y|X}(y|x) \le P_{Y|X}(y|\mathsf{A})$ for $y \ge \mathsf{A}$; (36) follows by using Stirling's bound $\frac{y^y \mathrm{e}^{-y}}{y!} \le \frac{1}{\sqrt{2\pi y}}$, $y \ge 1$; and (37) follows by using the bound $\sum_{1 \le y < \mathsf{A}} \frac{1}{\sqrt{y}} \le 2\sqrt{\mathsf{A}} + c$, where $c > 0$ is some universal constant [39]. To conclude the proof of (27), we use the bound in (37) together with the asymptotic expression for $C(\mathsf{A})$ in (21). The proof of (28) follows by using the bound in [40]. Remark 1. The bound in (28) can be improved, albeit at the expense of a more complicated expression. In particular, the bound relies on the inequality $H(X|Y) \ge 2P_e$. There are several stronger versions of this inequality; the interested reader is referred to [38] for a summary of these inequalities.
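To illustrate the per-symbol MAP quantities used above, the following numerical example (ours) evaluates $P_e = 1 - \sum_y \max_x P_X(x) P_{Y|X}(y|x)$ for an arbitrary equiprobable two-point input rather than the capacity-achieving one:

```python
# Sketch: per-symbol MAP error probability for a discrete input on the
# Poisson channel. The input below is an illustrative choice, not P_X*.
import numpy as np
from scipy.stats import poisson

def map_error(points, probs, y_max=500):
    y = np.arange(y_max)
    joint = np.array([p * poisson.pmf(y, x) for x, p in zip(points, probs)])
    return 1.0 - joint.max(axis=0).sum()    # MAP picks the largest joint mass

A = 3.0
pe = map_error([0.0, A], [0.5, 0.5])
# With equal weights on {0, A}, the only MAP error occurs when X = A but
# Y = 0 (probability e^{-A}/2), so pe = 0.5 * e^{-A} here.
assert abs(pe - 0.5 * np.exp(-A)) < 1e-9
```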
The entropy of the optimal input distribution vs. $\mathsf{A}$ is plotted in Fig. 3c. We observe a particular behavior of the entropy in the simulated range of $\mathsf{A}$: the rate of increase has finite jumps approximately at the levels $\log(k)$ of entropy, where the cardinality of the optimal input distribution increases from $k$ to $k + 1$ points. These levels correspond to an approximately uniform input distribution on the $k$ amplitude levels; this behavior is also confirmed by the mass probabilities plotted in Fig. 2c. When the rate of increase of the entropy is not compensated by a sufficiently large rate of decrease of the equivocation (see Fig. 3a), the rate of increase of capacity must be sustained by boosting the input entropy; this is done by increasing the cardinality of the input distribution. It is interesting to understand how the rate of increase of capacity is split between entropy and equivocation: if one could prove that the equivocation is upper-bounded by a constant, then this would show that the equivocation does not provide degrees of freedom to channel capacity for large $\mathsf{A}$, and thus the whole rate would have to be sustained by the input entropy by increasing the cardinality of $\mathrm{supp}(P_{X^\star})$. This hypothesis would imply $|\mathrm{supp}(P_{X^\star})| \approx \sqrt{\mathsf{A}}$ for large $\mathsf{A}$.

Proof of the Main Result
The starting point for most of our proofs are the following KKT conditions shown in [41]; see also the derivation in [11] done in the context of a Poisson noise channel.
Lemma 1. The capacity-achieving input distribution $P_{X^\star}$ and the induced capacity-achieving output distribution $P_{Y^\star}$ satisfy
$$i(x; P_{X^\star}) \le C(\mathsf{A}), \quad x \in [0, \mathsf{A}], \tag{39a}$$
$$i(x; P_{X^\star}) = C(\mathsf{A}), \quad x \in \mathrm{supp}(P_{X^\star}), \tag{39b}$$
where $i(x; P_X) = D(P_{Y|X}(\cdot|x) \,\|\, P_Y)$.
Lemma 1 asserts that the points of the support of $P_{X^\star}$ are zeros of the function $x \mapsto i(x; P_{X^\star}) - C(\mathsf{A})$. Therefore, we have the following inclusion:
$$\mathrm{supp}(P_{X^\star}) \subseteq \{x \ge 0 : i(x; P_{X^\star}) = C(\mathsf{A})\}.$$
This important observation will be a key step in the proof of the upper bound on the cardinality of $\mathrm{supp}(P_{X^\star})$.

Proof of the New Capacity Expression in (13)
As was shown in [13], $\{0\} \in \mathrm{supp}(P_{X^\star})$. Therefore, by using (39b), we have that
$$C(\mathsf{A}) = i(0; P_{X^\star}) = D(P_{Y|X}(\cdot|0) \,\|\, P_{Y^\star}) = -\log P_{Y^\star}(0),$$
where the last step uses that $P_{Y|X}(\cdot|0)$ places all of its mass at $y = 0$. This concludes the proof. Despite the simplicity of the above proof, to the best of our knowledge, the above expression has not been observed in the past.

Proof of the Upper Bound in (14) and the Lower Bound in (18)
In this section, we develop upper bounds on the probability masses and a lower bound on the cardinality of the support. Specifically, we will show that the probabilities are bounded by a term of the form $\mathrm{e}^{-C(\mathsf{A})}$, and the cardinality of the support is lower bounded by a term of the form $\mathrm{e}^{C(\mathsf{A})}$. We show two methods for finding bounds on the probabilities. The first method relies on the strong data-processing inequality and the second method relies on the exact expression for the values of the probability distribution. An interesting feature of both methods is that they work for all channels for which a capacity-achieving distribution is discrete. In this section, with some abuse of notation, $P_{X^\star}$ and $P_{Y^\star}$ will denote the capacity-achieving input and output distributions, respectively, not only for the Poisson noise channel but for a generic channel $P_{Y|X}$. Theorem 3. Fix some channel $P_{Y|X}$ and consider the optimization problem

Moreover, suppose that a maximizing distribution $P_{X^\star}$ is discrete. Then, for every $x \in \mathrm{supp}(P_{X^\star})$,
$$P_{X^\star}(x) \le \mathrm{e}^{-\frac{C}{\eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X})}}, \tag{45}$$
where $0 < \eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X}) \le 1$ is the contraction coefficient defined in (10).
Proof. Let $Y_x$ be the output of the channel $P_{Y|X}$ when the input is $P_X = \delta_x$, where $\delta_x$ is the Dirac delta function centered at $x$. Next, suppose that $x \in \mathrm{supp}(P_{X^\star})$. Then, by using (39), we have that
$$C = D(P_{Y_x} \| P_{Y^\star}) \le \eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X}) \, D(\delta_x \| P_{X^\star}) = -\eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X}) \log P_{X^\star}(x), \tag{48}$$
where in (48) we have noted that $\delta_x \to P_{Y|X} \to P_{Y_x}$ and $P_{X^\star} \to P_{Y|X} \to P_{Y^\star}$ and used the strong data-processing inequality in (9). Rearranging yields (45). This concludes the proof.
The next result provides an upper bound on the contraction coefficient for the Poisson channel.
Lemma 2. Let $P_{Y|X}$ be a Poisson channel as in (1). Then, for all $\mathsf{A} \ge 0$,
$$\eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X}) \le 1 - \mathrm{e}^{-\mathsf{A}}. \tag{50}$$
Proof. As was shown in [42, Proposition II.4.10], the contraction coefficient is upper bounded by
$$\eta_{\mathsf{KL}}(\mathcal{X}; P_{Y|X}) \le \max_{x, x' \in [0, \mathsf{A}]} \mathrm{TV}(P_{Y|X=x}, P_{Y|X=x'}), \tag{51}$$
where in (51) $\mathrm{TV}(P_{Y|X=x}, P_{Y|X=x'})$ is the total variation distance; and (52) follows from the bound in [43, Corollary 3.1].
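A quick numerical illustration (ours) of the total-variation quantity behind Lemma 2: on a grid over $[0, \mathsf{A}]$, the maximum is (numerically) attained at the extreme pair $(0, \mathsf{A})$, where the total variation distance equals $1 - \mathrm{e}^{-\mathsf{A}}$.

```python
# Check that TV(P(x), P(x')) for Poisson kernels with means in [0, A] is
# maximized at x = 0, x' = A, where it equals 1 - e^{-A}.
import numpy as np
from scipy.stats import poisson

def tv_poisson(x1, x2, y_max=500):
    y = np.arange(y_max)
    return 0.5 * np.abs(poisson.pmf(y, x1) - poisson.pmf(y, x2)).sum()

A = 2.0
grid = np.linspace(0.0, A, 21)
tvs = [tv_poisson(a, b) for a in grid for b in grid]
assert abs(tv_poisson(0.0, A) - (1.0 - np.exp(-A))) < 1e-9
assert max(tvs) <= 1.0 - np.exp(-A) + 1e-9
```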
Combining (45) and (50) leads to the bound in (14). Next, we show the proof of the second bound in (15). The following result holds for all channels that have a discrete capacity-achieving distribution.
Moreover, suppose that a maximizing distribution $P_{X^\star}$ is discrete. Then, for $x \in \mathrm{supp}(P_{X^\star})$,
$$P_{X^\star}(x) = \mathrm{e}^{-C} \exp\!\big( \mathbb{E}\big[ \log P_{X^\star|Y^\star}(x|Y) \,\big|\, X = x \big] \big). \tag{55}$$
Proof. By using (39), for $x \in \mathrm{supp}(P_{X^\star})$ we have that
$$C = \mathbb{E}\!\left[ \log \frac{P_{Y|X}(Y|x)}{P_{Y^\star}(Y)} \,\middle|\, X = x \right] = \mathbb{E}\big[ \log P_{X^\star|Y^\star}(x|Y) \,\big|\, X = x \big] - \log P_{X^\star}(x),$$
where the second equality follows from Bayes' rule. Rearranging yields (55). This concludes the proof.
Remark 2. Upon using the bound $P_{X^\star|Y^\star}(x|y) \le 1$ in (55), we arrive at
$$P_{X^\star}(x) \le \mathrm{e}^{-C}.$$
This is clearly a weaker version of the bound in (45). However, as shown next, we can improve this by using a better bound on the term $\mathbb{E}\big[ \log P_{X^\star|Y^\star}(x|Y) \,\big|\, X = x \big]$.

Lemma 3. Let $P_{Y|X}$ be a Poisson channel as in (1). Then, for all $\mathsf{A} \ge 0$ and $x \in \mathrm{supp}(P_{X^\star}) \setminus \{0\}$, the bound in (60) holds. In addition, the bound in (60) becomes an equality if $|\mathrm{supp}(P_{X^\star})| = 2$.
Proof. An upper bound on the expectation in (55) is obtained in (65) for $x \in \mathrm{supp}(P_{X^\star})$, where in (63) we have used the expression for the capacity in (13), and in (64) we have used the bound $P_{X^\star|Y^\star}(x|k) \le 1$. Next, for $x \in \mathrm{supp}(P_{X^\star}) \setminus \{0\}$, combining (55) and (65), we arrive at (66). Now, solving (66) for $P_{X^\star}(x)$ yields (60). Finally, if $P_{X^\star}$ is binary, then by using Bayes' rule and the fact that $P_{Y|X}(0|0) = 1$, it is not difficult to check that the bound in (64) holds with equality for $k \ge 1$ and, therefore, the bound in (64) is tight. This concludes the proof.
This concludes the proof of the upper bounds on the values of the probabilities. The lower bound on the number of points in (18) is now a consequence of the upper bound on the values of $P_{X^\star}$:
$$1 = \sum_{x \in \mathrm{supp}(P_{X^\star})} P_{X^\star}(x) \le |\mathrm{supp}(P_{X^\star})| \, \mathrm{e}^{-\frac{C(\mathsf{A})}{1-\mathrm{e}^{-\mathsf{A}}}},$$
which simplifies to $\mathrm{e}^{\frac{C(\mathsf{A})}{1-\mathrm{e}^{-\mathsf{A}}}} \le |\mathrm{supp}(P_{X^\star})|$. Remark 3. An alternative way of finding a lower bound on $|\mathrm{supp}(P_{X^\star})|$ is by using a sequence of elementary inequalities. However, this lower bound is weaker than that in (69).

Proof of the Lower Bound on $P_{X^\star}(\mathsf{A})$ in (16) and the Bounds in (17)
In this section, we establish bounds on $P_{X^\star}(\mathsf{A})$ and bounds on the location of the support points.
The starting place for both bounds is the fact that if $x^\star \in \mathrm{supp}(P_{X^\star})$ is a point equal to neither 0 nor $\mathsf{A}$, then by using the KKT conditions in (39), we have that $i'(x^\star; P_{X^\star}) = 0$. Then, by letting $Y \sim \mathcal{P}(x^\star)$, we arrive at the inequality in (79), where (78) follows by using the identity for the second derivative of $i(x^\star; P_{X^\star})$ in Lemma 13 in Appendix B, and (79) follows by using that $i'(x^\star; P_{X^\star}) = 0$. The inequality in (79) will be the key to both proofs. We start by showing the bound in (17). We also use a result of [9] to establish a lower bound on the location of the second-largest point.
In addition, let $x_0 = \max\{\mathrm{supp}(P_{X^\star}) \setminus \{\mathsf{A}\}\}$ (i.e., the second-largest point in the support). Then, for $|\mathrm{supp}(P_{X^\star})| \ge 4$, we have $x_0 \ge 1$. Proof. By using (79), we arrive at (83), where the inequality follows by using the bound $\mathbb{E}[X^\star | Y^\star] \le \mathsf{A}$. We now show that (83) implies (80). The function $f(x) = \log\!\big(\frac{x}{\mathsf{A}}\big) + \frac{1}{x}$ is decreasing for $x < 1$ and increasing for $x > 1$, and it has two zeros for $\mathsf{A} > \mathrm{e}$. Note that under the assumption $|\mathrm{supp}(P_{X^\star})| \ge 3$, we have that $\mathsf{A} > \bar{\mathsf{A}} \approx 3.4$. Therefore, we operate in the regime where $f$ has two zeros.
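As a numerical companion (ours): rearranging $f(x) = \log(x/\mathsf{A}) + 1/x = 0$ gives $x = -1/W_k(-1/\mathsf{A})$ for the two real Lambert-W branches, which can be checked with scipy.

```python
# Check the two zeros of f(x) = log(x/A) + 1/x via the Lambert W-function:
# substituting u = 1/x yields u e^{-u} = 1/A, i.e., u = -W_k(-1/A).
import numpy as np
from scipy.special import lambertw

def f(x, A):
    return np.log(x / A) + 1.0 / x

A = 10.0                                  # any A > e has two zeros
x_large = -1.0 / lambertw(-1.0 / A, k=0).real    # principal branch
x_small = -1.0 / lambertw(-1.0 / A, k=-1).real   # lower branch
assert abs(f(x_small, A)) < 1e-10 and abs(f(x_large, A)) < 1e-10
assert x_small < 1.0 < x_large            # zeros straddle the minimum at x = 1
```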
The exact solutions to (83) are given in terms of the two real branches of the Lambert W-function. To make the above bounds more useful, we further loosen them to involve simpler functions.
To find the upper bound on the largest zero, we use the bound $\log(t) \ge 1 - \frac{1}{t}$. We now find a lower bound on the smallest zero. By substituting $x = \mathrm{e}^t$, we can equivalently study the smallest zero of $f(t) = t - \log(\mathsf{A}) + \mathrm{e}^{-t}$, which is attained at some $t_{\min} < 0$. Since $f(t)$ is decreasing for $t < 0$, a lower bound on $f(t)$ will provide a lower bound on $t_{\min}$. By using $\mathrm{e}^{-t} \ge 1 - t + \frac{t^2}{2}$ for $t < 0$, we are left with studying the smallest zero of the resulting quadratic. Finally, under the additional assumption that $|\mathrm{supp}(P_{X^\star})| \ge 4$ and using the fact that there is at most one point in the open interval (0, 1) [9], it follows that the second-largest point must satisfy $x_0 \ge 1$. This concludes the proof. We now show the lower bound on $P_{X^\star}(\mathsf{A})$. For ease of presentation and to emphasize key steps, the proof is split among three lemmas.
we have the following bound. Proof. Let $p_{\mathsf{A}} = P_{X^\star}(\mathsf{A})$. Then, starting with (79), we arrive at (94), where in (94) we have used the bound $\mathbb{E}\big[\frac{X^\star}{x_0} \,\big|\, Y^\star\big] \le \frac{\mathsf{A}}{x_0}$; (95) follows by bounding the term $\mathbb{E}\big[\frac{X^\star}{x_0} 1_{\{X^\star \le x_0\}} \,\big|\, Y^\star\big]$; and (96) follows by using Jensen's inequality. Solving (96) for $p_{\mathsf{A}}$, we obtain (102). Now, the assumption $\mathbb{P}[Y > c] \le \frac{1}{\mathsf{A}}$ guarantees that the argument of the exponential in (102) is positive, which leads to the desired bound; here, in (104) we have used the inequality $\exp(x) \ge x + 1$, and in (105) we have used that $\mathbb{P}[Y \ge c] \le \frac{1}{\mathsf{A}}$. This concludes the proof.
To find an explicit lower bound on $P_{X^\star}(\mathsf{A})$, we need an upper bound on the expectation term appearing in Lemma 5. The next lemma provides such an upper bound. Lemma 6. Let $x_0 = \max\{\mathrm{supp}(P_{X^\star}) \setminus \{\mathsf{A}\}\}$, and suppose that the following conditions hold. Then, the stated bound holds, where $Y \sim \mathcal{P}(x_0)$ is a Poisson random variable with mean $x_0$.

Proof. See Appendix C.
The final result that we require is the following bound (Lemma 7) on the tail of a Poisson random variable.
Proof. See Appendix D.
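Lemma 7's precise statement is deferred to the appendix. As a representative example of such tail estimates (our illustrative choice, not necessarily the exact bound proved in Appendix D), the following sketch verifies the standard Chernoff bound $\mathbb{P}[Y \ge y] \le e^{-\lambda} \left(\frac{e\lambda}{y}\right)^{y}$ for $Y \sim \mathcal{P}(\lambda)$ and $y > \lambda$:

```python
import math

def poisson_tail(lam, y):
    # exact P[Y >= y] for Y ~ Poisson(lam), via the complementary cdf
    pk = math.exp(-lam)  # P[Y = 0]
    cdf = 0.0
    for k in range(y):
        cdf += pk
        pk *= lam / (k + 1)
    return 1.0 - cdf

def chernoff_bound(lam, y):
    # standard Chernoff bound on the upper tail, valid for y > lam
    return math.exp(-lam) * (math.e * lam / y) ** y

lam, y = 3.0, 10
assert poisson_tail(lam, y) <= chernoff_bound(lam, y)
```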
With Lemma 5, Lemma 6, and Lemma 7 at our disposal, we are now ready to provide an explicit bound on $P_{X^\star}(\mathsf{A})$. Next, we make a choice of $c$ and verify that it satisfies the conditions of the aforementioned lemmas.
To that end, choose $c = x_0 e \log(\mathsf{A})$. Moreover, assume that $|\mathrm{supp}(P_{X^\star})| \ge 4$, which by Lemma 4 guarantees that $x_0 \ge 1$. This choice of $c$ satisfies the condition of Lemma 5 since $|\mathrm{supp}(P_{X^\star})| \ge 4$ implies that $\mathsf{A} > \bar{\mathsf{A}} \approx 3.4$. Furthermore, we assume that the remaining condition of Lemma 6 (involving the constant $1 - \frac{3}{e}$) holds. Having verified the conditions of Lemma 5 and Lemma 6, the bound on $P_{X^\star}(\mathsf{A})$ now proceeds through (112)-(116), where (112) is due to the bound in Lemma 5; (113) follows from the upper bound $\frac{\mathsf{A}}{x_0} - 1 \le \mathsf{A}$, which holds because $x_0 \ge 1$, together with the fact that the resulting expression is minimized at $x = \mathsf{A}$; (114) follows by using the bound in (109); (115) follows from the choice $c = x_0 e \log(\mathsf{A})$ and the bound $\frac{1}{\sqrt{2\pi c}\,(x_0 \mathsf{A} - c + 1)} \le 1$; and (116) follows by using $1 \le x_0 \le \mathsf{A}$. This concludes the proof of the lower bound on $P_{X^\star}(\mathsf{A})$. We note that a different choice of $c$ is also possible and would result in another bound; asymptotically, however, that bound is weaker than the one in (116). In this section, we establish the upper bound in (20) on the number of support points; specifically, we establish a bound of order $\mathsf{A} \log^2(\mathsf{A})$. To show this bound, we first need to present a number of ancillary results. The first result allows us to bound the number of zeros of a function $f$ by the number of zeros of its derivative $f'$.

Proof of the Upper Bound in (20)
where $f'$ denotes the derivative of $f$.
Proof. Let $x_1 < \cdots < x_n$ denote the zeros of $f$. By Rolle's theorem, each of the intervals $(x_i, x_{i+1})$ for $i = 1, \ldots, n-1$ contains at least one zero of $f'$. Hence, $f'$ has at least $n - 1$ zeros, which proves the claim.
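The counting argument above can be illustrated concretely. In the toy example below (our illustration, not from the paper), $f(x) = x^3 - 3x$ has three real zeros while $f'(x) = 3x^2 - 3$ has two, consistent with the bound $N(f) \le N(f') + 1$:

```python
def count_zeros(g, lo=-3.1, hi=3.0, steps=10000):
    # count sign crossings of g on a fine grid (detects simple zeros)
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    vals = [g(x) for x in xs]
    return sum(1 for a, b in zip(vals, vals[1:]) if a * b < 0)

n_f = count_zeros(lambda x: x**3 - 3.0 * x)      # zeros at 0 and +/- sqrt(3)
n_fp = count_zeros(lambda x: 3.0 * x**2 - 3.0)   # zeros at +/- 1
assert n_f <= n_fp + 1
```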
The final ancillary result that we need is the following lemma. Proof. To prove the bound, note that for all $k \ge k^\star$ the function $f(k)$ can no longer change sign. Therefore, all sign changes occur at indices below $k^\star$, and there can be at most $k^\star$ of them.
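The sign-change count in this lemma can be sketched as follows (the sequence below is an arbitrary illustration of ours, assuming $f(k)$ keeps a fixed sign for all $k \ge k^\star = 4$):

```python
def sign_changes(seq):
    # number of sign changes along a finite sequence, ignoring zeros
    signs = [1 if v > 0 else -1 for v in seq if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

k_star = 4
# fixed (positive) sign for all k >= k_star, so at most k_star sign changes
f = [1.0, -2.0, 3.0, -1.0] + [2.0] * 20
assert sign_changes(f) <= k_star
```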
The next result provides an upper bound on the number of support points in terms of the number of sign changes of a function related to $P_{Y^\star}$.

Lemma 10. For every
where the quantities therein are defined in the display. Proof. First, using Lemma 13 in Appendix B, we have the stated decomposition, where we let $g(k) = \log\frac{1}{k!\,P_{Y^\star}(k)}$. Now, the upper bound can be established through the chain (127)-(136), where in (136) we use the convention $k\,g(k-1)\big|_{k=0} = 0$; (134) follows by defining $\psi(\cdot\,; P_{X^\star})$; and (135) follows from the oscillation theorem. This concludes the proof.
Remark 6. Note that the goal of steps (127) through (134) is to transform the function in (127) into the form in (134). The latter formulation is significant as it enables the application of the oscillation theorem. The reader might wonder why such a representation was not possible right after (127). To see why this cannot be done, note the identity shown, where in the last step we have used the expression for $G(x)$ in (126) and the fact that $\mathbb{E}[Y|X=x] = x$. Therefore, it remains to answer whether we can write $\mathbb{E}[h(Y)|X=x] = x \log(x)$ for some function $h$. In other words, can we show that $x \log(x)$ is a statistic of a Poisson distribution with mean $x$? To answer this question, assume towards a contradiction that $x \log(x)$ is a statistic of a Poisson distribution, i.e., that there exists a function $h : \mathbb{N}_0 \to \mathbb{R}$ satisfying the displayed identity. Multiplying the identity by $e^x$, we arrive at an expression which asserts that the function $e^x x \log(x)$ has a power series representation at $x = 0$. This is, of course, not possible and leads to a contradiction. Therefore, we cannot write $\mathbb{E}[h(Y)|X=x] = x \log(x)$. The purpose of steps (127) through (134) was to bypass this issue and eliminate the term $x \log(x)$ while keeping the number of zeros relatively unchanged. Finally, we note that the analyticity argument leading to the contradiction in (144) has been used before in [15, Thm. 15] to show discreteness of the capacity-achieving distribution for the Poisson noise channel with both peak-power and average-power constraints.
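Remark 6 turns on which functions of $x$ arise as Poisson statistics $\mathbb{E}[h(Y) \mid X = x]$. Such statistics are easy to evaluate numerically, and identities of the Chen-Stein type, $\mathbb{E}[Y h(Y-1) \mid X = x] = x\,\mathbb{E}[h(Y) \mid X = x]$, in the flavor of those in Lemma 12 (the exact displays of Lemma 12 are not reproduced in this excerpt, so this pairing is our assumption), can be checked directly:

```python
import math

def poisson_pmf(k, x):
    # P(Y = k | X = x) for x > 0
    return math.exp(k * math.log(x) - x - math.lgamma(k + 1))

def poisson_mean(h, x, kmax=200):
    # E[h(Y) | X = x] for Y ~ Poisson(x), truncated at kmax
    return sum(h(k) * poisson_pmf(k, x) for k in range(kmax + 1))

x = 2.5
h = lambda k: math.log(1.0 + k)

# Chen-Stein identity: E[Y h(Y-1)] = x E[h(Y)]
lhs = poisson_mean(lambda k: k * h(k - 1) if k >= 1 else 0.0, x)
rhs = x * poisson_mean(h, x)
assert abs(lhs - rhs) < 1e-9
```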
The next result provides an upper bound on the number of sign changes $S(\psi(\cdot\,; P_{X^\star}))$.

Lemma 11. For
Proof. We begin by lower bounding $\psi(\cdot\,; P_{X^\star})$, where (148) follows by using the bound $\mathbb{E}[X^\star \mid Y^\star] \le \mathsf{A}$, and (149) follows by using the fact that there is a mass point at $\mathsf{A}$, which leads to the stated bound. Therefore, there exists $k^\star = \lceil \mathsf{A} - \log(P_{X^\star}(\mathsf{A})) - C(\mathsf{A}) \rceil$ such that $\psi(k; P_{X^\star})$ has a fixed sign for all $k \ge k^\star$. It remains to show that $k^\star$ is larger than zero, which follows by using the bound in (45). The proof is concluded by using Lemma 9.
Remark 7. Note that the value of $P_{X^\star}(\mathsf{A})$ was introduced in step (149). Since a significant part of this work was dedicated to finding a lower bound on $P_{X^\star}(\mathsf{A})$, it is interesting to discuss how tight the bound in (149) actually is. Observe the displayed limit; therefore, asymptotically, the bound used to produce the inequality in (149) is tight.
Now, combining the bounds in (122), (146), and (16), we arrive at the desired upper bound on the number of support points, where the first inequality is due to the bound $\lceil x \rceil \le x + 1$. This concludes the proof.

Conclusion
This work has focused on studying properties of the capacity-achieving distribution for the Poisson noise channel with an amplitude constraint. It was previously known that the capacity-achieving distribution for this channel is discrete with finitely many mass points. In this work, we sharpened this result in several ways. First, by using a strong data-processing inequality, an upper bound on the values of the mass points was shown; this upper bound on the probability values was then shown to lead to a lower bound on the number of support points of the optimal input distribution. Specifically, a lower bound of order $\sqrt{\mathsf{A}}$ was established on the number of support points, where $\mathsf{A}$ is the constraint on the amplitude. Second, by using the variation-diminishing property of the Poisson kernel, the work has also established an upper bound on the number of support points of the optimal input distribution; specifically, a bound of order $\mathsf{A} \log^2(\mathsf{A})$.
Finally, along the way, several other results have been shown. For example, a new compact expression for the capacity has been derived. In addition, a lower bound on the probability of the largest mass point of the optimal input distribution has been established. Furthermore, an estimate of the locations of the support points other than $0$ and $\mathsf{A}$ has been given.
There are a few interesting future directions and open questions: • Some of the ideas in this work, in particular the oscillation theorem, were borrowed from [28], where a Gaussian noise channel was considered. It is interesting to note that in [28] it was shown that the oscillation theorem leads to an order-tight bound on the cardinality of the support of the capacity-achieving distribution. Currently, we do not have such a claim for the Poisson noise channel. Therefore, an interesting future direction is to assess whether the bound due to the oscillation theorem is also order tight for the Poisson noise channel.
• The location of the second-largest point plays an important role in our analysis. In particular, it would be interesting to investigate the gap between the largest and second-largest points. Currently, this gap is lower-bounded by a constant; in other words, the second-largest point never approaches $\mathsf{A}$. In fact, numerical simulations suggest that this gap increases as a function of $\mathsf{A}$. An interesting future direction is to determine whether the gap indeed grows with $\mathsf{A}$.
• This work has developed a method for finding lower bounds on the value of $P_{X^\star}(\mathsf{A})$. This lower bound was instrumental in finding an explicit upper bound on the cardinality of the support. However, the current bound on $P_{X^\star}(\mathsf{A})$ appears to be loose, and an important future direction is to find ways of improving it.
• The general upper bounds on the values of the probabilities can be applied to other channels, and it would be interesting to see how tight these bounds are. Another interesting future direction is to extend the results of this paper to a more general Poisson model that includes the dark current parameter.

Appendices
A. Sketch of the Implementation of the Gradient Ascent

To apply the projected gradient ascent [37], we first parametrize the input distribution as a vector $x = [x_1, x_2] \in \mathbb{R}^{2n}$, where $x_1 \in \mathbb{R}^n$ collects the locations of the mass points and $x_2 \in \mathbb{R}^n$ their probabilities.
From Theorem 2, for a given $\mathsf{A}$, we know that $n$ can be at most equal to the bound in (20); therefore, we set $n$ equal to that bound. Second, we explicitly write the mutual information as a function of $x$ and denote it by $g(x)$. In view of Theorem 2, we have that $\max_{P_X : 0 \le X \le \mathsf{A}} I(X;Y)$ equals the maximum of $g$ over $x_1 \in [0, \mathsf{A}]^n$ and $x_2 \in \mathcal{P}$, where $\mathcal{P}$ is the probability simplex. The update equation for the projected gradient ascent is now given by $x^{(t+1)} = \mathrm{proj}\big(x^{(t)} + \lambda \nabla g(x^{(t)})\big)$, where $\lambda > 0$ is some step size (to generate Fig. 2, we have used $\lambda = 0.01$), $\nabla g(x)$ is the gradient of $g$, and the projection operation $\mathrm{proj}$ maps a vector $x = [x_1, x_2]$ to the set $\{x : x_1 \in [0, \mathsf{A}]^n, \, x_2 \in \mathcal{P}\}$. The projection of the first component $x_1$ onto the cube $[0, \mathsf{A}]^n$ can be done coordinatewise. An efficient implementation of the projection of $x_2$ onto the probability simplex can be found in [44]. Finally, the initial condition $x^{(1)}$ for a given $\mathsf{A}$ is chosen to be the final output of the gradient algorithm for the previous epoch, in which we considered $\mathsf{A}' < \mathsf{A}$. For the best performance, the difference between $\mathsf{A}'$ and $\mathsf{A}$ should be made as small as possible.
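The procedure above might be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a finite-difference gradient in place of the closed-form gradient, a small fixed $n$ rather than the bound in (20), a truncated output alphabet, and the sorting-based simplex projection; all helper names are ours.

```python
import math

def poisson_pmf(k, x):
    # P(Y = k | X = x), with the convention 0^0 = 1 at x = 0
    if x <= 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(x) - x - math.lgamma(k + 1))

def mutual_information(locs, probs, y_max):
    # I(X; Y) for a discrete input supported on `locs` with weights `probs`
    p_y = [sum(p * poisson_pmf(k, x) for x, p in zip(locs, probs))
           for k in range(y_max + 1)]
    mi = 0.0
    for x, p in zip(locs, probs):
        if p <= 0.0:
            continue
        for k in range(y_max + 1):
            c = poisson_pmf(k, x)
            if c > 0.0 and p_y[k] > 0.0:
                mi += p * c * math.log(c / p_y[k])
    return mi

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sorting method, cf. [44])
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0.0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def capacity_ascent(A, n, iters=60, lam=0.01, eps=1e-6):
    y_max = int(A + 10.0 * math.sqrt(A) + 10.0)  # output truncation
    locs = [A * i / (n - 1) for i in range(n)]   # x_1: mass-point locations
    probs = [1.0 / n] * n                        # x_2: mass-point probabilities
    for _ in range(iters):
        base = mutual_information(locs, probs, y_max)
        grad_l, grad_p = [], []
        for i in range(n):                       # finite-difference gradient
            locs[i] += eps
            grad_l.append((mutual_information(locs, probs, y_max) - base) / eps)
            locs[i] -= eps
        for i in range(n):
            probs[i] += eps
            grad_p.append((mutual_information(locs, probs, y_max) - base) / eps)
            probs[i] -= eps
        # gradient step followed by projection onto [0, A]^n x simplex
        locs = [min(max(x + lam * g, 0.0), A) for x, g in zip(locs, grad_l)]
        probs = project_simplex([p + lam * g for p, g in zip(probs, grad_p)])
    return locs, probs, mutual_information(locs, probs, y_max)

locs, probs, mi = capacity_ascent(A=2.0, n=3)
```

The finite-difference gradient keeps the sketch short; in practice one would use the closed-form gradient of the mutual information and a line search or diminishing step size.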

B. Derivatives of i(·; P X )
To find the derivatives of the term $i(\cdot\,; P_X)$, we will need the following auxiliary results, the proofs of which can be found in [31].
Lemma 12. Let $f : \mathbb{N}_0 \to \mathbb{R}$ and assume $P_{Y|X}$ is given in (1). Then, for any $P_X$, the stated identities hold, where we use the convention $k f(k-1)\big|_{k=0} = 0$.

The derivatives of $i(\cdot\,; P_X)$ are given next (Lemma 13): the decomposition of $i(x; P_X)$ in (168), where $G(x)$ is as in (126); the first and second derivatives of $G(x)$; and the second derivative of $i(x; P_X)$.

Proof. We first show the decomposition in (168), where (176) follows by using that $\log P_{Y|X}(k|x) = \log\big(\frac{x^k}{k!} e^{-x}\big) = \log\frac{1}{k!} + k \log(x) - x$, and (177) follows by using that $\mathbb{E}[Y | X = x] = x$. Next, to find the first derivative of $G(x)$, we use the properties in (164) and (166). Similarly, by using (164) and (167), we arrive at the second derivative of $G(x)$. The expression for the second derivative of $i(x; P_X)$ then follows by combining the above.

C. Proof of Lemma 6

We start from (187).
We next find a lower bound on $P_{Y^\star}(k+1)$, where we distinguish between the two regimes $c \le x_0$ and $c > x_0$. First, for $1 \le k+1 \le \min\{x_0, c\}$, we have the lower bound on $P_{Y^\star}(k+1)$ given in (194), where: in (190) we used $x^{k+1} e^{-x} \ge e^{-1}$, because $x \mapsto x^{k+1} e^{-x}$ is increasing for $x < k+1$, and in the second summation we used that $x \mapsto x^{k+1} e^{-x}$ is decreasing for $k+1 < x \le x_0 < \mathsf{A}$; in (191) we used $\mathsf{A}^{k+1} \ge \mathsf{A}$, valid for $\mathsf{A} \ge 1$ and $k \ge 0$; in (192) we used $\mathsf{A} e^{-\mathsf{A}} \le e^{-1}$, valid for $\mathsf{A} \ge 1$; and in (194) we used that there is at most one support point in the open interval $(0,1)$, together with the upper bound on the probability mass in (14). If $c > x_0$, we can follow analogous bounding steps. Now, if $c \le x_0$, plugging (194) into (187) yields the first claimed bound; if $c > x_0$, we obtain the second, where the last inequality follows from the assumption that $\mathsf{A} \ge e$ and $x_0 \ge 1$. The proof is concluded by observing that $\mathsf{A} \ge e$ implies $\frac{1}{\log(\mathsf{A})} \le 1 \le e \log(\mathsf{A})$.