The Compound Information Bottleneck Program

Abstract: Motivated by the emerging technology of oblivious processing in remote radio heads with universal decoders, we formulate and analyze in this paper a compound version of the information bottleneck problem. In this problem, a Markov chain X → Y → Z is assumed, and the marginals P_X and P_Y are fixed. The mutual information between X and Z is maximized over the choice of the conditional probability of Z given Y from a given class, under the worst-case choice of the joint probability of the pair (X, Y) from a different class. We provide values, bounds, and various characterizations for specific instances of this problem: the binary symmetric case, the scalar Gaussian case, the vector Gaussian case, the symmetric modulo-additive case, and the total variation constraint case. Finally, for the general case, we propose a Blahut-Arimoto-type alternating-iterations algorithm to find a consistent solution to this problem.


I. INTRODUCTION AND PROBLEM FORMULATION
The information bottleneck (IB) methodology [1] provides a universal distortion measure for data compression when the desired distortion measure is either unavailable or cannot be defined. Nonetheless, in most practical cases, the distribution of the source involved in the IB problem is also not known with perfect accuracy (e.g., when it is estimated from a finite sample). This aspect motivates us to introduce a compound version of the IB problem, in which the source distribution is only known to belong to a given class, and the representation is chosen by the IB method to be the best possible under the worst-case choice within the class.
We consider the compound remote source coding system [2]-[4]. Let P_X be a source of information generating the sequence X^n. The encoder observes Y^n, a noisy version of X^n, and produces a compressed representation M, which is later mapped by the decoder to the reconstructed sequence Z^n. The distortion is evaluated between X^n and Z^n, while the rate is the relative number of bits required to represent M. The encoder's goal is to find a compression strategy that extracts from Y^n the relevant information regarding X^n, when the distribution of the channel P_Y|X is not known in advance and cannot be accurately learned. This compound setting generalizes the classical remote source coding model studied by Dobrushin and Wolf [5], [6].
This model motivates one to formulate a compound version of the information bottleneck (IB) optimization problem [1]. Specifically, let (X, Y) be a pair of random variables and fix their marginals to P_X and P_Y, respectively. Consider all random variables Z satisfying the Markov chain X → Y → Z. Unlike the standard IB problem, in which the joint distribution P_XY is fixed, here we consider an uncertainty set for this joint distribution, and aim to solve the following problem:

  max_{P_Z|Y ∈ D_Z|Y}  min_{P_XY ∈ P_XY}  I(X; Z).   (1)

In this paper, we take the set D_Z|Y as the set of possible representations, and the set P_XY as the uncertainty set of the joint distribution. The class D_Z|Y will be the usual IB class, i.e., D_Z|Y = {P_Z|Y : I(Y; Z) ≤ C_2}, or a restricted subset of this class with additional structure. The class P_XY will take one of several variants (e.g., a PF class {P_XY : I(X; Y) ≥ C_1}, a minimum-correlation class, or a total variation ball around a nominal channel), where all are constrained to the given marginals, i.e., Σ_x P_XY(x, y) = P_Y(y) and Σ_y P_XY(x, y) = P_X(x). For all these sets of optimization, (1) is a max-min problem.
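As a concrete illustration of the objective in (1), the following sketch (our own illustration, not taken from the paper; all function names are ours) evaluates I(X; Z) for a given joint pmf P_XY and a representation channel P_Z|Y under the Markov chain X → Y → Z:

```python
import numpy as np

def mutual_information(p_joint):
    """I(A; B) in nats for a joint pmf given as a 2-D array."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])))

def objective(p_xy, p_z_given_y):
    """I(X; Z) under the Markov chain X -> Y -> Z.

    p_xy:        |X| x |Y| joint pmf of (X, Y)
    p_z_given_y: |Y| x |Z| channel, rows summing to one
    """
    # P_XZ(x, z) = sum_y P_XY(x, y) P_Z|Y(z | y)
    p_xz = p_xy @ p_z_given_y
    return mutual_information(p_xz)

# Example: (X, Y) a DSBS(0.1) pair, and Z obtained from Y through a BSC(0.2).
p_xy = 0.5 * np.array([[0.9, 0.1], [0.1, 0.9]])
p_zy = np.array([[0.8, 0.2], [0.2, 0.8]])
print(objective(p_xy, p_zy))
```

The inner minimization in (1) then searches over joint pmfs `p_xy` in the uncertainty class, while the outer maximization searches over channels `p_z_given_y` in D_Z|Y.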
As noted, choosing the class P_XY to be a singleton recovers the standard IB problem [1], which for discrete alphabets was initially studied in [7] as a method to characterize common information [8]. The IB method is essentially a remote source coding problem [5], [6] with the distortion measure chosen as the logarithmic loss, and thus remote source coding is recovered by taking D_Z|Y as a maximal-distortion set.
In addition, the privacy funnel (PF), a dual problem to the IB framework [9], [10], can also be recovered from (1) by setting P_XY as the PF family (removing the marginalization constraint on P_X) and taking D_Z|Y to contain a single element. Therefore, the problem introduced in (1) is a composition of the IB and PF problems. This observation makes the problem in (1) rather delicate; e.g., if (X, Y) are jointly Gaussian, even the standard PF rate is zero, since one can use the channel from Y to Z to describe the less significant bits of Y [11]. We also mention that the PF is directly connected to information combining (IC) [12], [13]. For example, if the channel from Y to X is binary memoryless symmetric (BMS) [14, Ch. 4], then by [12], the optimal P_Z|Y is a binary erasure channel (BEC). Furthermore, the additive-noise helper problem, studied in [15], is directly linked to the PF; by reformulating the former as an IC problem, the solution follows directly, as was demonstrated in [11].
The IB problem can be approached via several strategies. When (X, Y) is a doubly symmetric binary source (DSBS) with transition probability p [16], it can be shown via Mrs. Gerber's lemma [17] that binary symmetric channels are optimal (see also the examples in [7] and [12]). When (X, Y) is jointly multivariate Gaussian, it was shown in [18] that the optimal distribution of (X, Y, Z) is also jointly Gaussian. The optimality of the Gaussian test channel can also be proved using the entropy power inequality (EPI), or by utilizing the I-MMSE relation and the single-crossing property [19], [20]. In the different and more general case in which (X, Y, Z) are discrete random variables, a locally optimal P_Z|Y can be found by iteratively solving a set of self-consistent equations; a generalized Blahut-Arimoto algorithm was proposed to solve those equations [1], [21]-[23]. Finally, the particular case of deterministic mappings from X to Y was considered in [24].
In this work, we address the compound setting of the IB problem with the goal of providing similar results. First, we address the DSBS and the Gaussian (scalar and vector) settings. Second, we consider general modulo-additive channels with modulo-additive representations, and derive various bounds on the compound IB function, first with a PF-based compound set and then with a TV-based compound set. Finally, we return to the general discrete-alphabet case with a PF-based compound set and propose an alternating algorithm, which essentially iterates between the maximization over P_Z|Y (an IB problem) and the minimization over P_XY (a PF problem). Omitted proofs and further details appear in the full version of this paper [25].

II. BINARY SYMMETRIC AND GAUSSIAN CHANNELS
A simple way to obtain precise analytical solutions to (1) is by establishing a saddle-point property [39, Sec. 5.4.2].
In the rest of this section, we provide basic examples for which a full characterization of the problem in (1) is known.
A. Binary Y

Consider X and Y both being Ber(0.5) random variables with a PF-type class P_XY, and C_1, C_2 ∈ [0, log 2]. Let R_bin(C_1, C_2) denote the compound IB value with a PF constraint for this setting. In this case, (X, Y) is restricted to be distributed as a DSBS with parameter α = h_b^{-1}(log 2 − C_1), with h_b(·) being the binary entropy function and h_b^{-1}(·) its inverse. Furthermore, the optimal P_Z|Y in this case is a BSC with parameter β = h_b^{-1}(log 2 − C_2), so that R_bin(C_1, C_2) = log 2 − h_b(α * β), where * is the binary convolution operator.
Next, assume that Y is Ber(0.5), but there are no constraints on X or Z. In this case, the optimal P_Z|Y is a BSC whose crossover probability is chosen so that I(Y; Z) = C_2, while the optimal P_X|Y is a BEC with erasure parameter ϵ = 1 − C_1. The optimal rate in this case is R_bin(C_1, C_2) = C_1 · C_2. This result can be established by combining [7, Sec. IV.C] with [12, Thm. 1] and Lemma 1.
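The DSBS expression above is easy to evaluate numerically. The sketch below (our own illustration, in natural logarithms, based on our reconstruction of the Mrs. Gerber's lemma expression log 2 − h_b(α * β)) implements the binary entropy, its inverse via bisection, and the binary convolution α * β = α(1 − β) + β(1 − α):

```python
import math

def h_b(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def h_b_inv(h):
    """Inverse binary entropy on [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if h_b(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def binary_conv(a, b):
    """Binary convolution a * b."""
    return a * (1 - b) + b * (1 - a)

def r_bin_dsbs(c1, c2):
    """Compound IB value for the DSBS setting (our reconstruction of the
    Section II-A expression): log 2 - h_b(alpha * beta) with
    alpha = h_b^{-1}(log 2 - c1) and beta = h_b^{-1}(log 2 - c2)."""
    alpha = h_b_inv(math.log(2) - c1)
    beta = h_b_inv(math.log(2) - c2)
    return math.log(2) - h_b(binary_conv(alpha, beta))
```

At the extremes, c1 = 0 forces α = 1/2 and yields a zero rate, while c1 = c2 = log 2 yields the full log 2 nats, consistent with the max-min form of (1).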

B. Scalar Gaussian Y
We proceed to consider another fundamental scenario, in which the marginal distributions of X and Y are both Gaussian. Note that, in contrast to the symmetric Ber(0.5) setting, which restricts the channel from X to Y to be a BSC, here Gaussianity of the marginals does not imply that the joint distribution of (X, Y) is Gaussian [40, Ch. 4.7]. Thus, the result of the following theorem is not trivial. Let R_sc-G(ρ, C) denote the value of (1) with P_XY being the minimum-correlation class with parameter ρ > 0 and Q_Z|Y being the IB bottleneck class with parameter C ∈ R.
Theorem 1: R_sc-G(ρ, C) = (1/2) log ( 1 / (1 − ρ²(1 − e^{−2C})) ), and the optimal triplet (X, Y, Z) is jointly Gaussian.

Now, suppose that X and Y are jointly Gaussian random vectors of dimension n. Let R_vec-G(C_1, C_2) denote the value of (1) with P_XY being the PF-constraint class with capacity C_1 ∈ R and Q_Z|Y being the IB bottleneck class with capacity C_2 ∈ R.
Theorem 2: The optimal triplet (X, Y, Z) is jointly Gaussian with independent components. In particular, this result establishes that the worst-case channel P_Y|X is an additive white Gaussian noise channel, and that its optimal representation P_Z|Y is also white.
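To illustrate the scalar Gaussian case, the following sketch (our own illustration, not from the paper) evaluates I(X; Z) for a unit-variance jointly Gaussian pair with correlation ρ and a Gaussian test channel Z = Y + N, with the noise variance chosen so that I(Y; Z) = C; under the assumption that the jointly Gaussian solution is optimal at the worst-case correlation ρ, this evaluates the compound IB value:

```python
import math

def scalar_gaussian_ib(rho, c):
    """I(X; Z) for unit-variance jointly Gaussian (X, Y) with correlation
    rho, and Z = Y + N with N ~ N(0, s2) chosen so that I(Y; Z) = c (nats)."""
    # I(Y; Z) = 0.5 * ln(1 + 1/s2) = c  =>  s2 = 1 / (e^{2c} - 1)
    s2 = 1.0 / (math.exp(2 * c) - 1.0)
    # (X, Z) are jointly Gaussian with cov(X, Z) = rho and var(Z) = 1 + s2
    return -0.5 * math.log(1.0 - rho**2 / (1.0 + s2))
```

A direct simplification gives I(X; Z) = −(1/2) log(1 − ρ²(1 − e^{−2C})), the familiar scalar Gaussian IB curve; as C → ∞ the value saturates at I(X; Y) = −(1/2) log(1 − ρ²).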

III. MODULO ADDITIVE CHANNELS WITH PF CONSTRAINT
In this section, we return to the general discrete-alphabet case, yet restrict our attention to a symmetric setting with the following assumptions: Y = X ⊕ W and Z = Y ⊕ V, where ⊕ denotes addition modulo n, P_X = u_n, W ⊥ X, and V ⊥ (X, W), where u_n is the probability vector of the uniform distribution on [n], and ⊥ stands for statistical independence. This setting implies |Y| = |Z| = n. Moreover, it also holds that Z = X ⊕ W ⊕ V.
Using H(W) ≡ H(P_W) and H(V) ≡ H(P_V), we observe that I(X; Z) = log n − H(P_W * P_V), where * is the n-ary cyclic convolution operator, and the constraint parameters satisfy η_1, η_2 ∈ [0, log n]. Thus, the solution to (1) is equivalent to the solution of

  max_{P_V : H(P_V) ≥ η_2}  min_{P_W : H(P_W) ≤ η_1}  [log n − H(P_W * P_V)].   (4)

In (4) we have confined the channel P_Z|Y to be modulo-additive, which may be too restrictive in general. Nonetheless, when the IB function is strictly convex, the modulo-additive channel assumption for Q_Z|Y can be relaxed. Indeed:

Proposition 1: Fix a joint pmf P_XY ∈ P_XY, where P_XY is defined in (4), and denote by T the transition probability matrix from Y to X. Assume that the function R_T^CEB(η) is strictly convex in η. Then (4) is equivalent to the corresponding problem (6) in which the channel from Y to Z is an arbitrary element of the IB class rather than a modulo-additive one, where ∆_n is the n-dimensional simplex over which the pmfs P_W and P_V range, and the optimal channel from Y to Z is also modulo-additive.

Thus, if the strict convexity holds, then modulo-additive channels form a saddle point in (6), and are thus optimal via Lemma 1 (assuming that P_XY is modulo-additive). Remark 1: Proposition 1 establishes an equivalence between the problems addressed in [41] and [7]. However, as was shown in [41], the function g_T(η) is not convex in general; therefore, Proposition 1 cannot be utilized universally, but only in regions of η where convexity holds.
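The quantity I(X; Z) = log n − H(P_W * P_V) is straightforward to evaluate. The following sketch (our own illustration) implements the n-ary cyclic convolution and the resulting mutual information:

```python
import math

def cyclic_conv(p, q):
    """n-ary cyclic convolution of two pmfs on Z_n: the pmf of W + V mod n."""
    n = len(p)
    return [sum(p[i] * q[(k - i) % n] for i in range(n)) for k in range(n)]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def i_xz(p_w, p_v):
    """I(X; Z) = log n - H(P_W * P_V) for the modulo-additive setting."""
    n = len(p_w)
    return math.log(n) - entropy(cyclic_conv(p_w, p_v))
```

Note that if either pmf is uniform the convolution is uniform and I(X; Z) = 0, while two point masses yield the full log n nats; the constraints H(P_W) ≤ η_1 and H(P_V) ≥ η_2 interpolate between these extremes.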
We next provide bounds on (6) which complement the result of Theorem 3.
Theorem 4: Let α be the positive root of (11), β be the parameter of the negative Hamming pmf (10) with entropy η_1, and ζ be the positive root of (12). If η_1 ∈ (0, log(n − 1)), then the bounds stated in (13) hold; if n = 3 they take the form (14), and if n > 3 the form (15).

Finally, we consider the high-SNR regime, namely the scenario in which η_1 is small. In that case, we have the following characterization of the optimal distributions and rate.
IV. MODULO ADDITIVE CHANNELS WITH TV CONSTRAINT
Let δ ∈ (0, 2) be given, along with a nominal modulo-additive channel represented by P_W^(0). In this section, the constraint H(W) ≤ η_1 in P_XY from the previous section is replaced with the constraint d_TV(P_W, P_W^(0)) ≤ δ (the set Q_Z|Y remains the same). We denote the resulting compound IB value by R_TV(δ, η_2).
A natural approach is to relate R_TV(δ, η_2) to the standard bottleneck problem R(0, η_2) ≡ R_T^CEB(η_2) via the continuity of entropy in the total variation metric. This idea was used, e.g., in [42], to establish generalization bounds for the bottleneck problem, that is, in the regime of vanishing δ. Here, we present a slightly tighter result, valid for any δ ∈ (0, 1). To this end, recall that the entropy difference of two pmfs in ∆_n at total variation distance δ is bounded by [43], [44]

  ω(δ, n) ≜ (δ/2) log(n − 1) + h_b(δ/2).

Proposition 2: For any δ ∈ (0, 1), the value R_TV(δ, η_2) is within ω(δ, n) of the standard bottleneck value R_T^CEB(η_2) computed at P_W^(0).

Proposition 2 relates the compound IB to the standard IB problem; the latter, however, is in general difficult to compute (and requires, for example, an alternating minimization algorithm as in Section V). In what follows, we state computable upper and lower bounds on R_TV(δ, η_2). To this end, let T be a channel transition matrix, and let θ(T) ∈ [0, 1] be the Dobrushin contraction coefficient of T [45], given by the two-point characterization θ(T) = (1/2) max_{i≠j} d_TV(T_i, T_j), where T_i is the ith row of T. Thus, in the worst case, θ(T) is computable by merely n² − n total variation distance calculations. Furthermore, if T ∈ [0, 1]^{n×n} is obtained by n permutations of a pmf, then only n − 1 total variation distance calculations are required. Second, let Γ(δ) ≜ min_{q ∈ ∆_n : d_TV(q, u_n) ≤ δ} H(q) be the minimal entropy over a total variation ball centered at u_n. This problem has a closed-form solution [46, Thm. 3], as follows: If 1 − 1/n ≤ δ/2, then the optimal solution is q* = (1, 0, . . . , 0) and Γ(δ) = 0. Otherwise, let n_0(δ) ≜ ⌊n + 1 − nδ/2⌋; then the optimal solution is q* = (1/n + δ/2, 1/n, . . . , 1/n, (n − n_0(δ) + 1)/n − δ/2, 0, . . . , 0), where there are n_0 − 2 entries equal to 1/n, so that the support size of this solution is n_0.
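Both θ(T) and Γ(δ) above are directly computable. The sketch below (our own illustration; we assume the paper's d_TV convention with range [0, 2], consistent with δ ∈ (0, 2)) computes the contraction coefficient from pairwise distances between rows, and the minimal entropy Γ(δ) via the closed form quoted from [46, Thm. 3]:

```python
import math

def tv(p, q):
    """Total variation distance in the paper's convention, with range [0, 2]."""
    return sum(abs(a - b) for a, b in zip(p, q))

def theta(T):
    """Dobrushin contraction coefficient in [0, 1]: with the d_TV convention
    above, theta(T) = (1/2) * max over distinct row pairs of d_TV."""
    n = len(T)
    return max(0.5 * tv(T[i], T[j]) for i in range(n) for j in range(n) if i != j)

def gamma_min_entropy(delta, n):
    """Gamma(delta): minimal entropy (nats) over the TV ball of radius delta
    centered at the uniform pmf u_n, via the closed form in the text."""
    if 1 - 1 / n <= delta / 2:
        return 0.0  # a point mass fits in the ball
    n0 = math.floor(n + 1 - n * delta / 2)
    q = [1 / n + delta / 2] + [1 / n] * (n0 - 2) + [(n - n0 + 1) / n - delta / 2]
    return -sum(x * math.log(x) for x in q if x > 0)
```

As a sanity check, Γ(0) = log n and Γ(2 − 2/n) = 0, matching the extreme values stated in the text, and θ = 1 exactly when two rows are orthogonal.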
Therefore, the function Γ(δ) is strictly decreasing on [0, 2 − 2/n], strictly positive for δ ∈ [0, 2 − 2/n), with extreme values Γ(0) = log n and Γ(2 − 2/n) = 0. Hence there exists an inverse function, which we denote by D(η) : [0, log n] → [0, 2 − 2/n]. Third, for a given p^(0) ∈ ∆_n, let Φ(δ; p^(0)) ≜ max_{q ∈ ∆_n : d_TV(q, p^(0)) ≤ δ} H(q) be the maximal entropy over a total variation ball centered at p^(0). This problem also has a closed-form solution [46, Thm. 2], of a clipping type: there are thresholds µ and ν such that if ν ≥ µ, then Φ(δ; p^(0)) = log n and the maximizing distribution is the uniform q* = u_n; otherwise, the maximizer is given by q*_i = min{max(p^(0)_i, µ), ν}, and Φ(δ; p^(0)) is its entropy.

Theorem 6: Let T(P_W) be the channel transition matrix which corresponds to the n cyclic permutations of P_W. Then,

  R_TV(δ, η_2) ≥ max_{P_W : d_TV(P_W, P_W^(0)) ≤ δ} Γ (θ(T(P_W)) · D(η_2)),   (16)

and R_TV(δ, η_2) is upper bounded by an analogous minimization over P_V, involving θ(T(P_V)) and Φ evaluated at the transformed nominal pmf T(P_V) p^(0).

Since Γ(δ), its inverse D(η), as well as Φ(δ; p^(0)) are all computable, the expressions in the lower bound can be computed for any given T(P_W). In general, the optimization over P_W in the lower bound is computationally difficult; however, any choice of P_W which satisfies the constraint leads to a valid lower bound. Analogous statements hold for P_V in the upper bound. It should be noted that the optimization of the lower bound requires finding the minimal θ(T(P_W)), whereas P_V in the upper bound affects both the contraction coefficient θ(T(P_V)) and the transformed nominal pmf T(P_V) p^(0).
Note that g_T(η) ≥ η always holds [41, Lemma 5 (c)], and so the lower bound of Thm. 6 requires optimizing over P_W for which θ(T(P_W)) < 1. In general, θ(T) < 1 only if no two rows of T are orthogonal. Here, since the rows of T(P_W) are circular permutations of P_W, it holds that θ(T) < 1 if and only if the support size of P_W is strictly larger than n/2.
Remark 3: The proof of Thm. 6 provides a lower bound on Witsenhausen's function g_T(η) from [41], which may be of independent interest.
V. ALTERNATING OPTIMIZATION ALGORITHM
We return in this section to the general (C_1, C_2) PF compound set. Applying a two-phase Lagrangian methodology, we obtain a set of self-consistent equations for P_XY and P_Z|Y. We then propose a Blahut-Arimoto-type iterative algorithm that solves those equations.

A. The Inner Lagrangian
Fix P_Z|Y satisfying I(Y; Z) ≤ C_2 and consider the inner minimization problem from (1), given by

  f(P_Z|Y, C_1) = min_{P_XY ∈ P_XY} I(X; Z).   (18)

For λ_1 ≥ 0, the respective Lagrangian of (18), denoted L_min(P_XY, λ_1, µ, ν), relaxes the PF constraint with multiplier λ_1 and the two marginal constraints with multipliers µ and ν (19).

Proposition 3: Any stationary point P*_XY of (19) satisfies the stationarity condition (20), where β_1 ≜ 1/λ_1 and Z(x, y, β_1) is the normalization constant. Furthermore, the optimal P_Z|X(z|x) is given by (21). The system of equations (20) and (21) characterizing the stationary points must hold simultaneously for consistency, and an alternating-iterations algorithm is a common approach to solving such equations.
These independent conditions correspond precisely to alternating iterations of (20) and (21). Denoting by t the iteration step, we obtain Algorithm 1.

B. The Outer Lagrangian
Note that the maximization of I(X; Z) for a fixed P_XY satisfying I(X; Y) ≥ C_1 is just the standard information bottleneck problem, so the technique proposed here is identical to the one suggested in [1]. The respective algorithm from [1, Theorem 5] is summarized in Algorithm 2.
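For reference, the standard IB updates of [1] (the building block of Algorithm 2) can be sketched as follows. This is our own minimal implementation of the well-known self-consistent iterations P(z|y) ∝ P(z) exp(−β D_KL(P(x|y) ‖ P(x|z))), not the paper's exact pseudocode:

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_z, beta, n_iter=200, seed=0):
    """Standard information bottleneck iterations of Tishby et al. [1].

    p_xy: |X| x |Y| joint pmf; n_z: cardinality of Z; beta: IB tradeoff.
    Returns the |Y| x |Z| encoder P(z|y).
    """
    rng = np.random.default_rng(seed)
    p_y = p_xy.sum(axis=0)                          # P(y)
    p_x_given_y = p_xy / p_y                        # columns are P(x|y)
    q = rng.random((len(p_y), n_z))
    q /= q.sum(axis=1, keepdims=True)               # random initial P(z|y)
    for _ in range(n_iter):
        p_z = p_y @ q                               # P(z)
        p_xz = p_xy @ q                             # joint pmf of (X, Z)
        p_x_given_z = p_xz / np.maximum(p_z, 1e-300)
        # KL divergence D(P(x|y) || P(x|z)) for every (y, z) pair
        log_ratio = np.log(np.maximum(p_x_given_y, 1e-300))[:, :, None] \
                  - np.log(np.maximum(p_x_given_z, 1e-300))[:, None, :]
        kl = np.sum(p_x_given_y[:, :, None] * log_ratio, axis=0)
        q = p_z[None, :] * np.exp(-beta * kl)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

The compound scheme of this section interleaves such IB updates (the outer maximization) with the PF-type updates of Algorithm 1 (the inner minimization).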

C. The Compound Algorithm
So far, two algorithms have been proposed that aim to solve (1) in an isolated manner. In this section, we propose a method that interleaves them, with the objective of finding the solution simultaneously. There are two natural approaches to this problem. The first is to alternate between single steps of each algorithm until convergence. The second is to run the first algorithm until convergence, then the other, and so on. We have found the second type of algorithm to be more effective, and it is summarized in Algorithm 3. We have no global convergence guarantees here, as even the standard IB lacks such guarantees [1].

Algorithm 3: COMIB Programming
Input: P_X, P_Y, C_1, and C_2
Initialize: P_Z|Y
while the variation in I(X; Z) is greater than ϵ do
  for β_1 ∈ R_+ do
    P*_XY(β_1) = pf_iterator(P_X, P_Y, P_Z|Y)
  end
end
Output: P*_XY, P*_Z|Y

VI. NUMERICAL SIMULATIONS
We evaluate both the analytical bounds derived in Thm. 4 and the algorithm developed in Section V by comparing their results on a common example. A representative example with n = 5 and various rate constraints is shown in Figure 1. As expected, the algorithm's output lies between the upper and lower bounds. It is also somewhat closer to the lower bound, which hints that the upper bounds might be improved.
We also evaluate the bounds derived for the TV-class setting of Section IV, for an example with n = 15 and δ = 0.3.

VII. CONCLUDING REMARKS
We have defined the COMIB programming problem. We obtained various characterizations for the binary setting and the Gaussian settings, and derived upper and lower bounds for modulo-additive channels with PF constraints and with TV constraints. Under some qualifying conditions, Gaussian distributions and Hamming channels were shown to be extremal. Finally, we have proposed an alternating-iterations algorithm that finds a locally optimal solution. Future research calls for further tightening these bounds, and for establishing additional settings in which the optimal channels and representations can be analytically characterized.

ACKNOWLEDGMENT
This work has been supported by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement no. 694630, by the ISF under Grant 1791/17, and by the WIN consortium via the Israel Ministry of Economy and Science.