Bayesian Active Meta-Learning for Reliable and Efficient AI-Based Demodulation

Two of the main principles underlying the life cycle of an artificial intelligence (AI) module in communication networks are adaptation and monitoring. Adaptation refers to the need to adjust the operation of an AI module depending on the current conditions, while monitoring requires measures of the reliability of an AI module's decisions. Classical frequentist learning methods for the design of AI modules fall short on both counts of adaptation and monitoring, catering to one-off training and providing overconfident decisions. This paper proposes a solution to address both challenges by integrating meta-learning with Bayesian learning. As a specific use case, the problems of demodulation and equalization over a fading channel based on the availability of few pilots are studied. Meta-learning processes pilot information from multiple frames in order to extract useful shared properties of effective demodulators across frames. The resulting trained demodulators are demonstrated, via experiments, to offer better calibrated soft decisions, at the computational cost of running an ensemble of networks at run time. The capacity to quantify uncertainty in the model parameter space is further leveraged by extending Bayesian meta-learning to an active setting. In it, the designer can select in a sequential fashion channel conditions under which to generate data for meta-learning from a channel simulator. Bayesian active meta-learning is seen in experiments to significantly reduce the number of frames required to obtain an efficient adaptation procedure for new frames.


I. INTRODUCTION
A. Motivation
[...] as native components of a modular architecture that can be fine-tuned to meet the requirements of specific deployments [3]. Two of the main principles underlying the life cycle of an AI module in communication networks are adaptation and monitoring [4]. Adaptation refers to the need to adjust the operation of an AI module depending on the current conditions, particularly for real-time applications at the frame level. At run time, an AI model should ideally enable monitoring of the quality of its outputs by providing measures of the reliability of its decisions. The availability of such reliability measures is instrumental in supporting several important functionalities, from the combination of multiple models to decisions about retraining [5].
Classical frequentist learning methods for the design of AI modules fall short on both counts of adaptation and monitoring (see, e.g., [6], [7]). First, conventional frequentist learning is well known to provide inaccurate measures of reliability, typically producing overconfident decisions [7]. Second, the standard learning approach prescribes the one-off optimization of an AI model, hence failing to capture the need for adaptation.
This paper investigates the integration of meta-learning and Bayesian learning as a means to address both challenges. As we detail in the next section, Bayesian learning can provide well-calibrated, and hence reliable, measures of uncertainty of a model's decision, while meta-learning can reduce the amount of data required for adaptation to a new task, thus improving efficiency. As a specific use case, we focus on the problems of demodulation and equalization over a fading channel based on the availability of few pilots (see Fig. 1). The goal is to develop AI solutions that are capable of adapting the demodulator/equalizer to changing conditions based on few training symbols, while also being able to quantify the uncertainty of the AI model's output.

B. Background
As illustrated in Fig. 2, frequentist learning assigns a single value to each model parameter as a result of training. This neglects the (epistemic) uncertainty that exists at the level of model parameters due to the limited availability of data. In contrast, Bayesian learning can express uncertainty about the true value of the model parameter vector by optimizing over a distribution, rather than over a single point value [8]. By averaging predictions over the distribution of the model parameters, Bayesian learning is known to be capable of providing decisions that are well calibrated [9]-[11]. Calibration refers to the capacity of a model to produce confidence levels that reproduce well the actual accuracy of the decisions.
Fig. 1. Illustration of the meta-learning problem studied in this work for the example of 16-ary quadrature amplitude modulation. A receiver has available data corresponding to frames previously received from multiple devices, each possibly experiencing different channel conditions. Given meta-training data sets $\{\mathcal{D}_\tau\}_{\tau=1}^{t}$ of pilots from previous frames, partitioned into training data and test data, the demodulator optimizes a hyperparameter vector $\xi$. For a newly received frame, the receiver uses the few pilots therein to adapt the demodulator/equalizer parameter vector $\varphi_*$. In the Bayesian meta-learning framework, instead of a single parameter vector $\varphi_*$, the receiver optimizes over an ensemble of parameter vectors through the hyperparameter vector $\xi$ of a posterior distribution $p(\varphi_* | \mathcal{D}^{\text{tr}}_*, \xi)$.
Meta-learning, also known as learning to learn, optimizes training strategies that can fine-tune a model based on few samples for a new task by transferring knowledge across different learning tasks [12]-[18]. Meta-learning is a natural tool to produce AI solutions that are optimized for adaptation. Prior work on meta-learning for communication systems, including [19]-[29], is limited to standard frequentist learning. Therefore, existing art is unable to produce models that provide well-calibrated estimates of reliability. Most related to our work is [19], which proposes to leverage pilot information from previous frames in order to optimize training procedures to be applied to the pilots of new frames (see Fig. 1).
Bayesian meta-learning aims at optimizing the procedure that produces the posterior distribution for new learning tasks. Accordingly, the goal of Bayesian meta-learning is to enhance the efficiency of Bayesian learning by reducing the number of training points needed to obtain accurate and well-calibrated Bayesian models. The optimization of the Bayesian learning process is carried out by transferring knowledge from previously encountered tasks for which data are assumed to be available [30]-[32]. To the best of our knowledge, with the exception of the conference version of this paper [1], this is the first work to consider the application of Bayesian meta-learning to communication systems.
Besides meta-learning, another approach to reduce the number of required training data points is active learning [33]-[37]. Active learning amounts to the process of choosing which samples should be annotated next and incrementally added to the training set [38]. Through this process, active learning can select relevant samples at which the model is currently most uncertain in order to speed up the training process.
A much less studied area is active meta-learning, which aims at reducing the number of tasks a meta-learner must collect data from before it can adapt efficiently to new tasks [37], [39]. Reference [37] proposes a method based on Bayesian meta-learning via empirical Bayes, while the paper [39] takes a hierarchical Bayesian approach, generalizing the Bayesian active learning by disagreements (BALD) criterion introduced in [33] to meta-learning. While [37] assumes labeled training sets, reference [39] considers unlabeled data during active meta-learning. As such, the setting in [39] is not applicable to the problem under study here, in which data consists of supervised pairs of pilots and received signals (see Fig. 1). A summary of the relevant approaches built upon in this work is given in Table I.

C. Contributions
This paper introduces the use of Bayesian meta-learning to enable both adaptation and monitoring for the tasks of demodulation and equalization. Unlike prior works that considered either frequentist meta-learning [6], [19]-[26] or Bayesian learning [40]-[43], the proposed Bayesian meta-learning methodology enables both resource-efficient adaptation and a reliable quantification of uncertainty. To further improve the efficiency of Bayesian meta-learning, we propose the use of active meta-learning, which reduces the number of required meta-training data from previously received frames. Specific contributions are as follows.
• We introduce Bayesian meta-learning for the problems of demodulation and equalization from few pilots. The proposed implementation is derived based on parametric VI.
• We introduce Bayesian active meta-learning as a solution to reduce the number of frames required by meta-learning. Active meta-learning selects in a sequential fashion channel conditions under which to generate data for meta-learning from a channel simulator.
• Extensive experimental results demonstrate that Bayesian meta-learning produces demodulators and equalizers that offer better calibrated soft decisions. Furthermore, they show that for a target meta-testing loss, active meta-learning can reduce the number of simulated meta-training frames required.
Part of this paper was presented in [1], which introduced the idea of Bayesian meta-learning with some preliminary experiments. This journal version presents full technical details and new results, and it introduces for the first time Bayesian active meta-learning for communication systems.
The rest of the paper is organized as follows. Section II introduces the channel model, along with background material on standard frequentist learning and frequentist meta-learning. Section III expands on Bayesian meta-learning. Then, we present Bayesian active meta-learning in Section IV. Numerical results are presented in Section V, and Section VI concludes the paper.

D. Related Work
For scalability, Bayesian learning can be implemented via approximate methods based on variational inference (VI) or Monte Carlo (MC). VI methods approximate the exact Bayesian posterior distribution with a tractable variational density [9], [44]-[47], while Monte Carlo techniques obtain approximate samples from the Bayesian posterior distribution [48]-[50]. Each class of methods comes with its own set of technical challenges and engineering choices. For instance, VI requires the selection of a variational distribution family, such as mean-field Gaussian models, and the specification of a stochastic optimization algorithm. There are also non-parametric VI methods such as Stein variational gradient descent (SVGD) [51], which optimize over deterministic and interacting particles. For MC techniques, solutions range from first-order Langevin dynamics techniques [49] to more complex methods such as Hamiltonian Monte Carlo (HMC) [48]. Implementing any of these schemes for a specific engineering application is a non-trivial task.
Bayesian learning has been applied in reference [52] to the problem of predicting the number of active users in an LTE system; papers [53], [54] applied MC-based Bayesian learning for MIMO detection; the works [55]-[57] addressed channel prediction/estimation for massive MIMO systems; reference [58] studied the identification of IoT transmitters; and the authors of [59] proposed the use of robust Bayesian learning for modulation classification, localization, and channel modeling.
As for active learning, applications to communication systems include paper [60], which proposed a sample-efficient retransmission protocol; reference [61], which tackled initial beam alignment for massive MIMO systems; work [62], which aimed at mitigating the problem of scarce training data for wireless cyber-security attacks; and reference [63], which addressed resource allocation problems in vehicular communication systems.
Like Bayesian learning, meta-learning also provides a general design principle, which can be implemented by following different approaches. Optimization-based methods design the hyperparameters used by training algorithms; model-based techniques optimize an additional neural network model to guide adaptation of the main AI model; and metric-based schemes identify metric spaces for non-parametric inference (see, e.g., [64] and references therein).
The integration of meta-learning and Bayesian learning is highly non-trivial, and is an active topic of research in the machine learning literature. References [65]-[69] addressed Bayesian meta-learning via empirical Bayes using parametric VI [65], [66], particle-based VI [67], deep kernels [68], and expectation-maximization [69], while the papers [70]-[73] studied full Bayesian meta-learning that treats also the hyperparameters as random variables. Lastly, the work [74] proposed the use of quantum machine learning models as parameterized variational distributions.

A. Channel Model and Soft Demodulation or Equalization
In this paper, we consider frame-based transmission over a memoryless block fading channel model with constellation $\mathcal{X}$ and channel output alphabet $\mathcal{Y}$. The channel is characterized by a conditional distribution $p(y|x, c)$ of received symbol $y \in \mathcal{Y}$ given transmitted symbol $x \in \mathcal{X}$ and channel state $c$. In the case of demodulation, we treat the set $\mathcal{X}$ as discrete, while for equalization we view it as the space of vectors of a certain size. In both cases, we will refer to the channel input $x$ as a symbol. The channel state $c$ is constant within each frame, and it is independent and identically distributed (i.i.d.) across frames according to an unknown distribution $p(c)$. At frame $\tau$, the transmitter sends a packet consisting of $N_\tau$ symbols $x_\tau = \{x_\tau[i]\}_{i=1}^{N_\tau}$. Given the channel state $c_\tau$ and the transmitted symbols, collected in a vector $x_\tau$, the received samples $y_\tau = \{y_\tau[i]\}_{i=1}^{N_\tau}$ are conditionally independent, and the $i$-th received sample is distributed as
$$y_\tau[i] \sim p(y_\tau[i] \,|\, x_\tau[i], c_\tau).$$
A soft demodulator/equalizer is a conditional distribution $p(x|y, \varphi)$ that maps channel outputs $y \in \mathcal{Y}$ to estimated probabilities for channel input symbol $x \in \mathcal{X}$. The demodulator/equalizer depends on a vector of parameters $\varphi$, and is applied separately to each received sample $y[i]$ in a memoryless fashion as $p(x|y_\tau[i], \varphi)$. The ideal frame-specific parameter vector $\varphi_\tau$ for the frame $\tau$ is the one that best approximates, within its model class, the channel conditional distribution $p(x_\tau|y_\tau, c_\tau)$, obtained from the Bayes rule as
$$p(x_\tau | y_\tau, c_\tau) = \frac{p(x_\tau)\, p(y_\tau | x_\tau, c_\tau)}{\mathbb{E}_{x' \sim p(x_\tau)}\big[p(y_\tau | x', c_\tau)\big]},$$
where $p(x_\tau)$ is the distribution of the input symbol vector $x_\tau$. In practice, as we detail below, the demodulator/equalizer is optimized based on pilot symbols. To simplify the terminology, we will also refer to demodulation/equalization as prediction henceforth.
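As a rough illustration of the Bayes rule above, the following sketch computes the ideal posterior over a discrete constellation when the channel state is known. The real-valued 4-PAM constellation, the scalar model $y = c\,x + n$ with Gaussian noise, and the function name `ideal_soft_demodulator` are assumptions made for this example only; they are not specified in this section.

```python
import math

def ideal_soft_demodulator(y, c, constellation, noise_var, prior=None):
    """Posterior p(x | y, c) over a discrete constellation via Bayes' rule.

    Assumption (not from the paper): y = c * x + n, with n ~ N(0, noise_var).
    """
    if prior is None:
        prior = [1.0 / len(constellation)] * len(constellation)
    # log p(x) + log p(y | x, c), dropping the x-independent Gaussian constant.
    log_joint = [math.log(p) - (y - c * x) ** 2 / (2.0 * noise_var)
                 for x, p in zip(constellation, prior)]
    # Normalize with the log-sum-exp trick for numerical stability.
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

pam4 = [-3.0, -1.0, 1.0, 3.0]  # hypothetical constellation
posterior = ideal_soft_demodulator(y=0.9, c=1.0, constellation=pam4, noise_var=0.5)
```

In practice the channel state is unknown, which is why the text turns to pilot-based learning of the parameter vector instead of evaluating this posterior directly.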

B. Conventional Data-Driven Demodulators/Equalizers
Pilot-aided schemes utilize available pilot symbols to adapt the predictor $p(x|y, \varphi)$ to the unknown channel state $c$ in each frame $\tau$. A typical choice for a predictor is a multi-layer neural network [75]. With $L$ layers, given received sample $y$, this class of models produces a vector
$$a(y|\varphi) = f_{W_L, b_L} \circ f_{W_{L-1}, b_{L-1}} \circ \cdots \circ f_{W_1, b_1}(y),$$
where $\circ$ is the composition operator; the weights $\{W_l\}_{l=1}^{L}$ and biases $\{b_l\}_{l=1}^{L}$ define the model parameter vector $\varphi := \{W_l, b_l\}_{l=1}^{L}$ for a total of $D$ parameters; and the function $f_{W_l, b_l}$ for the $l$-th layer is a linear mapping followed by an entry-wise activation function $h(\cdot)$, i.e., $y_l = f_{W_l, b_l}(y_{l-1}) = h(W_l \cdot y_{l-1} + b_l)$ with $y_0 = y$. In the last, $L$-th layer, a soft demodulator applies the softmax function to vector $a(y|\varphi)$, producing the probability distribution
$$p(x|y, \varphi) = \frac{\exp([a(y|\varphi)]_x)}{\sum_{x' \in \mathcal{X}} \exp([a(y|\varphi)]_{x'})}, \tag{3}$$
with $[\cdot]_x$ as the $x$-th element of the vector. In contrast, a soft equalizer typically defines the conditional distribution
$$p(x|y, \varphi) = \mathcal{N}(x \,|\, a(y|\varphi), \beta^{-1} I), \tag{4}$$
where the precision $\beta$ is fixed. Throughout this paper, we use $\mathcal{N}(x|\mu, \Sigma)$ to indicate the probability density function of a Gaussian vector with mean $\mu$ and covariance matrix $\Sigma$.
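The forward pass of such a soft demodulator can be sketched as follows. This is a minimal illustration rather than the architecture used in the paper: the ReLU activation, the choice of a purely linear last layer feeding the softmax, the layer sizes, and all weight values are assumptions.

```python
import math

def dense(weights, biases, inputs, activation=None):
    """One layer: linear mapping W*v + b, optionally followed by an activation."""
    out = []
    for row, b in zip(weights, biases):
        z = sum(w * v for w, v in zip(row, inputs)) + b
        out.append(activation(z) if activation else z)
    return out

def relu(z):
    return max(0.0, z)

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_demodulate(y, layers):
    """Return p(x | y, phi) for every constellation point x, as in the softmax (3)."""
    h = list(y)
    for W, b in layers[:-1]:
        h = dense(W, b, h, relu)
    W, b = layers[-1]          # assumed linear output layer producing a(y | phi)
    return softmax(dense(W, b, h))

# Hypothetical parameters: 2 inputs (I/Q), 3 hidden units, 4 constellation points.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, 0.0]],
     [0.0, 0.0, 0.0, 0.0]),
]
probs = soft_demodulate([0.5, -0.2], layers)
```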
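In each frame $\tau$, conventional learning optimizes the model parameters $\varphi_\tau$ using $N^{\text{tr}}_\tau$ i.i.d. pilots $\mathcal{D}^{\text{tr}}_\tau = \{(y^{\text{tr}}_\tau[i], x^{\text{tr}}_\tau[i])\}_{i=1}^{N^{\text{tr}}_\tau}$ as training data. Optimization of the predictor aims at minimizing the training log-loss
$$\mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi_\tau) = -\sum_{i=1}^{N^{\text{tr}}_\tau} \log p(x^{\text{tr}}_\tau[i] \,|\, y^{\text{tr}}_\tau[i], \varphi_\tau), \tag{5}$$
which amounts to the cross entropy for demodulation (3) and the quadratic prediction loss for equalization (4). Minimization of (5) can be done via gradient descent (GD), or stochastic GD (SGD), a variant thereof [76]. GD updates model parameter vector $\varphi_\tau$ for $I$ iterations with learning rate $\eta > 0$ starting from an initialization vector $\xi$. Accordingly, the updated parameters $\varphi_\tau := \varphi^{\text{GD}}(\mathcal{D}^{\text{tr}}_\tau|\xi)$ are obtained via the iterations
$$\varphi^{(0)}_\tau = \xi, \qquad \varphi^{(i)}_\tau = \varphi^{(i-1)}_\tau - \eta \nabla_{\varphi} \mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi^{(i-1)}_\tau), \quad i = 1, \ldots, I, \tag{6}$$
with $\varphi^{\text{GD}}(\mathcal{D}^{\text{tr}}_\tau|\xi) = \varphi^{(I)}_\tau$. The resulting prediction for a test input-output pair $(y^{\text{te}}$ [...]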
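The per-frame GD adaptation from an initialization $\xi$ can be sketched on a deliberately small model. The two-parameter logistic demodulator for binary symbols, the pilot values, and the step size below are assumptions chosen so that the gradient can be written by hand; the paper's predictor is a multi-layer network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(phi, pilots):
    """Sum of -log p(x | y, phi), with p(x=1 | y, phi) = sigmoid(phi[0]*y + phi[1])."""
    total = 0.0
    for y, x in pilots:
        p1 = sigmoid(phi[0] * y + phi[1])
        total -= math.log(p1 if x == 1 else 1.0 - p1)
    return total

def log_loss_grad(phi, pilots):
    """Analytic gradient of the log-loss for the logistic model."""
    g = [0.0, 0.0]
    for y, x in pilots:
        err = sigmoid(phi[0] * y + phi[1]) - x
        g[0] += err * y
        g[1] += err
    return g

def gd_adapt(xi, pilots, lr=0.1, num_iters=5):
    """Run I = num_iters GD steps starting from the initialization xi."""
    phi = list(xi)
    for _ in range(num_iters):
        g = log_loss_grad(phi, pilots)
        phi = [p - lr * gi for p, gi in zip(phi, g)]
    return phi

# Hypothetical pilots (received sample y, transmitted bit x).
pilots = [(0.8, 1), (-1.1, 0), (1.3, 1), (-0.6, 0)]
xi = [0.0, 0.0]
phi_adapted = gd_adapt(xi, pilots)
```

With only four pilots, the adapted vector depends strongly on the starting point `xi`, which is the dependence that meta-learning exploits.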

C. Frequentist Meta-Learning
The most prominent shortcoming of conventional learning is its potentially high sample complexity, which translates into the need for a large number of pilots, $N^{\text{tr}}_\tau$, per frame. Meta-learning addresses this issue by transferring knowledge acquired over previous frames. Specifically, frequentist meta-learning, as proposed in [19], treats the initialization vector $\xi$ in (6) as a hyperparameter vector to be optimized based on the availability of pilots from $t$ previous transmission frames.
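As a preliminary step, we decompose the available pilots from each frame $\tau$ into a disjoint training set $\mathcal{D}^{\text{tr}}_\tau$ and test set $\mathcal{D}^{\text{te}}_\tau$ as $\mathcal{D}_\tau = \{\mathcal{D}^{\text{tr}}_\tau, \mathcal{D}^{\text{te}}_\tau\}$. Furthermore, the data sets for all previous $t$ frames are stacked as $\mathcal{D}^{\text{tr}}_{1:t} = \{\mathcal{D}^{\text{tr}}_\tau\}_{\tau=1}^{t}$, and similarly for the test samples $\mathcal{D}^{\text{te}}_{1:t}$. Meta-learning has two phases: meta-training and meta-testing. These are defined next by following the frequentist meta-learning strategy of [19].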

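Meta-training tackles the bi-level optimization problem
$$\min_{\xi} \; \sum_{\tau=1}^{t} \mathcal{L}_{\mathcal{D}^{\text{te}}_\tau}\big(\varphi_\tau(\mathcal{D}^{\text{tr}}_\tau|\xi)\big) \tag{7a}$$
$$\text{s.t.} \quad \varphi_\tau(\mathcal{D}^{\text{tr}}_\tau|\xi) = \arg\min_{\varphi_\tau} \mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi_\tau), \quad \tau = 1, \ldots, t. \tag{7b}$$
The notation $\varphi(\xi)$ in (7b) indicates the dependence of the optimizer on the initialization vector $\xi$. By (7), the goal of frequentist meta-training is to find a hyperparameter vector $\xi$ such that, for any frame $\tau$, the optimized model parameter vector $\varphi_\tau(\mathcal{D}^{\text{tr}}_\tau|\xi)$ fits well the test data set $\mathcal{D}^{\text{te}}_\tau$. Problem (7) is addressed via a nested loop optimization involving SGD-based inner updates and SGD-based outer updates, which are also referred to as meta-iterations. The inner loop tackles the inner optimization (7b) in a per-frame manner via (6) for a randomly selected subset $\mathcal{T} \subset \{1, \ldots, t\}$ of frames, which are redrawn independently at each meta-iteration. The outer loop addresses the outer optimization (7a) via an SGD step of the meta-loss with learning rate $\kappa > 0$, i.e.,
$$\xi \leftarrow \xi - \frac{\kappa}{N^{\text{te}}_{\mathcal{T}}} \sum_{\tau \in \mathcal{T}} \nabla_{\xi} \mathcal{L}_{\mathcal{D}^{\text{te}}_\tau}\big(\varphi^{\text{GD}}(\mathcal{D}^{\text{tr}}_\tau|\xi)\big), \tag{8}$$
based on data from the batch $\mathcal{T}$ of selected frames, and using the notation $N^{\text{te}}_{\mathcal{T}} = \sum_{\tau \in \mathcal{T}} N^{\text{te}}_\tau$ for the total number of samples within the batch of selected frames. Meta-training updates the initialization vector $\xi$ across multiple meta-iterations. When meeting some stopping criterion, here determined by a predefined number of meta-iterations $I_{\text{meta}}$, meta-training stops, and the hyperparameter vector $\xi$ is stored to be used for future learning tasks.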
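The nested inner/outer loop can be made concrete on a toy problem in which the meta-gradient is available in closed form. The sketch below assumes a scalar linear equalizer $\hat{x} = \varphi y$ with quadratic loss, a single inner GD step ($I = 1$), noiseless BPSK pilots, and three hypothetical channel gains; for this model the derivative of the adapted parameter with respect to $\xi$ is $1 - \eta$ times the (constant) second derivative of the training loss.

```python
import random

def loss(phi, data):
    """Quadratic loss sum_i (x_i - phi * y_i)^2 of a scalar linear equalizer."""
    return sum((x - phi * y) ** 2 for y, x in data)

def loss_grad(phi, data):
    return sum(-2.0 * y * (x - phi * y) for y, x in data)

def loss_hess(data):
    return sum(2.0 * y * y for y, _ in data)

def inner_update(xi, train, eta):
    """One GD step from the initialization xi (inner loop with I = 1)."""
    return xi - eta * loss_grad(xi, train)

def meta_train(frames, xi=0.0, eta=0.05, kappa=0.5, meta_iters=50,
               batch_size=None, seed=0):
    rng = random.Random(seed)
    batch_size = batch_size or len(frames)
    for _ in range(meta_iters):
        batch = rng.sample(frames, batch_size)   # subset of frames, redrawn each time
        n_te = sum(len(te) for _, te in batch)
        meta_grad = 0.0
        for tr, te in batch:
            phi = inner_update(xi, tr, eta)
            # Chain rule: d phi / d xi = 1 - eta * (second derivative of train loss).
            meta_grad += loss_grad(phi, te) * (1.0 - eta * loss_hess(tr))
        xi -= kappa * meta_grad / n_te           # outer (meta) update
    return xi

def meta_loss(xi, frames, eta=0.05):
    return sum(loss(inner_update(xi, tr, eta), te) for tr, te in frames)

def make_frame(c):
    """Hypothetical noiseless frame: (train pilots, test pilots) with y = c * x."""
    return ([(c, 1.0), (-c, -1.0)], [(-c, -1.0), (c, 1.0)])

frames = [make_frame(c) for c in (0.5, 1.0, 2.0)]
xi_meta = meta_train(frames)
```

For neural-network predictors the same chain rule involves Hessian-vector products, which is the source of the complexity terms discussed in Section III-C.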
Upon deployment, i.e., during meta-testing, the meta-test frames also include pilots and data as the meta-training frames.Accordingly, each meta-test device loads the hyperparameter vector ξ for initialization, and produces the adapted model parameter vector φ * = φ GD (D tr * |ξ) as in (6)

A. Bayesian Learning
Bayesian learning treats the model parameter vector $\varphi_\tau$ for some frame $\tau$ as a random vector, rather than as a deterministic optimization variable as in the frequentist learning framework. As illustrated in Fig. 2, instead of producing a single demodulator parameter vector $\varphi_\tau = \varphi^{\text{GD}}(\mathcal{D}^{\text{tr}}_\tau|\xi)$ as in (6), Bayesian learning produces a distribution $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$ over the space of the demodulator parameters $\varphi_\tau$. This distribution is computed based on the training data $\mathcal{D}^{\text{tr}}_\tau$ and on a predetermined prior distribution $p(\varphi_\tau|\xi)$, which depends in turn on the hyperparameter vector $\xi$, also fixed a priori.
The frequentist prediction (9), reviewed in the previous section, can be viewed as a special case in which one is limited to the choice $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi) = \delta(\varphi_\tau - \varphi^{\text{GD}}(\mathcal{D}^{\text{tr}}_\tau|\xi))$, with $\delta(\cdot)$ indicating the Dirac delta function. With this choice, the distribution $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$ is concentrated at one point, namely the GD solution (6). The frequentist approach is therefore inherently limited in its capacity to express uncertainty on the model parameters due to limited data.
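Ideally, the distribution $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$ should be obtained as the posterior distribution
$$p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi) \propto p(\varphi_\tau|\xi)\, p(\mathcal{D}^{\text{tr}}_\tau|\varphi_\tau), \tag{11}$$
where $p(\mathcal{D}^{\text{tr}}_\tau|\varphi_\tau) = \prod_{i=1}^{N^{\text{tr}}_\tau} p(x^{\text{tr}}_\tau[i] \,|\, y^{\text{tr}}_\tau[i], \varphi_\tau)$ is the likelihood function for the training data. However, computing the posterior $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$ in (11) is generally intractable for a high-dimensional vector $\varphi_\tau$.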
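To address this challenge, we follow VI and introduce a variational distribution approximation
$$q(\varphi_\tau|\phi_\tau) \approx p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi),$$
which depends on a variational parameter vector $\phi_\tau$. A typical choice is given by the Gaussian mean-field approximation [77], which can be expressed as
$$q(\varphi_\tau|\phi_\tau) = \mathcal{N}\big(\varphi_\tau \,|\, \nu_\tau, \mathrm{Diag}(\exp(2\varrho_\tau))\big), \tag{13}$$
with variational parameter vector $\phi_\tau = [\nu_\tau^\top, \varrho_\tau^\top]^\top$, where the exponential function is applied element-wise. The variational parameter vector includes the mean vector $\nu_\tau \in \mathbb{R}^D$ and the vector of the logarithms of the standard deviations $\varrho_\tau \in \mathbb{R}^D$ for the Gaussian random vector $\varphi_\tau$. Note that vector $\varrho_\tau$ models uncertainty in the model parameter space.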
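To describe VI, we will use the Kullback-Leibler (KL) divergence $\mathrm{KL}(p(z)\|q(z))$ [78], which is a measure of the distance between two distributions $p(z)$ and $q(z)$. It is defined as the average of the log-likelihood ratio $\log(p(z)/q(z))$ as
$$\mathrm{KL}(p(z)\|q(z)) = \mathbb{E}_{p(z)}\left[\log\frac{p(z)}{q(z)}\right].$$
VI-based Bayesian learning prescribes that the variational parameter vector $\phi_\tau$ be obtained via the minimization of the KL divergence $\mathrm{KL}(q(\varphi_\tau|\phi_\tau)\|p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi))$ between the variational distribution $q(\varphi_\tau|\phi_\tau)$ and the posterior distribution $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$. This problem can be equivalently formulated as the minimization [76], [77]
$$\min_{\phi_\tau} \; F_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau|\xi), \tag{15}$$
where the variational free energy [79] is defined as
$$F_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau|\xi) = \bar{\mathcal{L}}_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau) + \mathrm{KL}\big(q(\varphi_\tau|\phi_\tau)\,\|\,p(\varphi_\tau|\xi)\big). \tag{16}$$
In (16), we have defined as $\bar{\mathcal{L}}_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau)$ the expectation of the loss function $\mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi_\tau)$ in (5) over the variational distribution $q(\varphi_\tau|\phi_\tau)$, i.e.,
$$\bar{\mathcal{L}}_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau) = \mathbb{E}_{q(\varphi_\tau|\phi_\tau)}\big[\mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi_\tau)\big]. \tag{17}$$
In (16), the second summand is a regularizer that restricts the variational distribution to be close to the prior distribution. Note that, if the variational family is expressive enough to represent the posterior distribution in (11), the minimizer of problem (15) is the Bayesian posterior $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$, since the KL divergence $\mathrm{KL}(q(\varphi_\tau|\phi_\tau)\|p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi))$ is minimized (and it equals zero) when the two distributions are the same.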
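A typical choice for the prior distribution $p(\varphi_\tau|\xi)$ is the Gaussian distribution. In this case, we have
$$p(\varphi_\tau|\xi) = \mathcal{N}\big(\varphi_\tau \,|\, \nu, \mathrm{Diag}(\exp(2\varrho))\big), \tag{18}$$
which is defined by the hyperparameter vector $\xi = [\nu^\top, \varrho^\top]^\top$, where $\nu \in \mathbb{R}^D$ and $\varrho \in \mathbb{R}^D$ stand for the mean vector and the vector of logarithms of the standard deviations of the Gaussian random vector $\varphi_\tau$.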
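Assuming the Gaussian variational distribution in (13) and the Gaussian prior (18), the regularizer term in (16) can be computed in closed form as
$$\mathrm{KL}\big(q(\varphi_\tau|\phi_\tau)\,\|\,p(\varphi_\tau|\xi)\big) = \frac{1}{2}\sum_{d=1}^{D}\left(\frac{\exp(2\varrho_{\tau,d}) + (\nu_{\tau,d} - \nu_d)^2}{\exp(2\varrho_d)} - 1 + 2(\varrho_d - \varrho_{\tau,d})\right),$$
which is a differentiable function of $\phi_\tau$. With these choices of variational posterior and prior, problem (15) can be addressed via gradient-descent methods by using the reparametrization trick [80]. This is done by writing the random model parameter vector $\varphi_\tau \sim q(\varphi_\tau|\phi_\tau)$ as $\varphi_\tau = \nu_\tau + \exp(\varrho_\tau) \odot e$, with random vector $e \sim \mathcal{N}(0, I_D)$ and $\odot$ being the element-wise multiplication. The gradient of the objective (17) is then estimated with the aid of $R$ independently drawn samples of the standard Gaussian random vector $e$, by differentiating the resulting empirical estimate of (17),
$$\bar{\mathcal{L}}_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau) \approx \frac{1}{R}\sum_{r=1}^{R} \mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}\big(\nu_\tau + \exp(\varrho_\tau) \odot e_{\tau,r}\big),$$
obtained by drawing samples $e_{\tau,r} \sim \mathcal{N}(0, I_D)$ for $r = 1, 2, \ldots, R$. This yields the estimated free energy
$$\hat{F}_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau|\xi) = \frac{1}{R}\sum_{r=1}^{R} \mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}\big(\nu_\tau + \exp(\varrho_\tau) \odot e_{\tau,r}\big) + \mathrm{KL}\big(q(\varphi_\tau|\phi_\tau)\,\|\,p(\varphi_\tau|\xi)\big). \tag{20}$$
This is a special case of Algorithm 1 with input $G(\varphi_\tau) = \mathcal{L}_{\mathcal{D}^{\text{tr}}_\tau}(\varphi_\tau)$. The function (20) can be directly differentiated and used in SGD updates.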
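The two ingredients of the estimated free energy, the closed-form KL term between diagonal Gaussians and the reparametrized Monte Carlo average of the loss, can be sketched as below. The function names and the constant toy loss are illustrative assumptions; the sketch only evaluates the estimate, whereas training would differentiate it with an automatic-differentiation framework.

```python
import math
import random

def kl_diag_gaussians(nu_q, rho_q, nu_p, rho_p):
    """KL(N(nu_q, diag(exp(2*rho_q))) || N(nu_p, diag(exp(2*rho_p)))).

    rho_* are log-standard deviations, matching the parametrization in the text.
    """
    total = 0.0
    for mq, rq, mp, rp in zip(nu_q, rho_q, nu_p, rho_p):
        var_q, var_p = math.exp(2.0 * rq), math.exp(2.0 * rp)
        total += 0.5 * ((var_q + (mq - mp) ** 2) / var_p - 1.0) + (rp - rq)
    return total

def sample_model(nu, rho, rng):
    """Reparametrization trick: phi = nu + exp(rho) * e, with e ~ N(0, I)."""
    return [m + math.exp(r) * rng.gauss(0.0, 1.0) for m, r in zip(nu, rho)]

def estimated_free_energy(nu, rho, prior_nu, prior_rho, loss_fn, num_samples, rng):
    """Monte Carlo average of the loss over R samples plus the closed-form KL term."""
    avg_loss = sum(loss_fn(sample_model(nu, rho, rng))
                   for _ in range(num_samples)) / num_samples
    return avg_loss + kl_diag_gaussians(nu, rho, prior_nu, prior_rho)

# Toy check with a constant loss, so that the estimate is deterministic.
free_energy = estimated_free_energy(
    nu=[1.0], rho=[0.0], prior_nu=[0.0], prior_rho=[0.0],
    loss_fn=lambda phi: 2.0, num_samples=8, rng=random.Random(0))
```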
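Once the variational parameter $\phi_\tau$ is inferred using Bayesian training, ensemble prediction for a payload data symbol $(y^{\text{te}}_\tau[i], x^{\text{te}}_\tau[i])$ can be obtained via (10) by replacing $p(\varphi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$ with $q(\varphi_\tau|\phi_\tau)$ to yield the ensemble predictor
$$p(x^{\text{te}}_\tau[i] \,|\, y^{\text{te}}_\tau[i], \phi_\tau) = \mathbb{E}_{q(\varphi_\tau|\phi_\tau)}\big[p(x^{\text{te}}_\tau[i] \,|\, y^{\text{te}}_\tau[i], \varphi_\tau)\big]. \tag{21}$$
Practically, it uses Monte Carlo sampling with $R$ model vectors, producing the approximated soft predictor $p(x^{\text{te}}$ [...]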
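A Monte Carlo version of the ensemble predictor can be sketched as follows, averaging the soft outputs of several models sampled from the variational distribution. The logistic two-class predictor and all parameter values are assumptions made to keep the example short.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(y, phi):
    """Single-model soft prediction [p(x=0 | y, phi), p(x=1 | y, phi)] (toy logistic model)."""
    p1 = sigmoid(phi[0] * y + phi[1])
    return [1.0 - p1, p1]

def ensemble_predict(y, nu, rho, num_models, rng):
    """Average the soft predictions of models drawn from N(nu, diag(exp(2*rho)))."""
    probs = [0.0, 0.0]
    for _ in range(num_models):
        phi = [m + math.exp(r) * rng.gauss(0.0, 1.0) for m, r in zip(nu, rho)]
        for k, p in enumerate(predict(y, phi)):
            probs[k] += p / num_models
    return probs

# Hypothetical variational parameters (mean and log-standard deviation).
probs = ensemble_predict(0.7, nu=[2.0, 0.0], rho=[-1.0, -1.0],
                         num_models=50, rng=random.Random(0))
```

When the log-standard deviations are very negative, the variational distribution collapses to a point and the ensemble reduces to the single-model (frequentist) prediction.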

B. Bayesian Meta-Learning
While conventional Bayesian learning assumes that the random model parameter vector φ τ has a fixed prior distribution p(φ τ |ξ) parametrized by a predefined hyperparameter vector ξ, Bayesian meta-learning leverages the stronger assumption that there is a shared prior distribution p(φ τ |ξ) across all frames that can be optimized through a hyperparameter vector ξ.
In this section, we formulate Bayesian meta-learning by following empirical Bayes [81], with the aim of selecting a distribution $p(\varphi_\tau|\xi)$ that provides a useful prior for the design of the predictor on new frames. Mathematically, Bayesian meta-training optimizes over the hyperparameter vector $\xi$ by addressing the bi-level problem
$$\min_{\xi} \; \sum_{\tau=1}^{t} \bar{\mathcal{L}}_{\mathcal{D}^{\text{te}}_\tau}\big(\phi_\tau(\xi)\big) \tag{22a}$$
$$\text{s.t.} \quad \phi_\tau(\xi) = \arg\min_{\phi_\tau} F_{\mathcal{D}^{\text{tr}}_\tau}(\phi_\tau|\xi), \quad \tau = 1, \ldots, t. \tag{22b}$$
Problem (22) chooses the hyperparameter vector $\xi$ that minimizes the average test loss on the meta-training frames $\tau \in \{1, \ldots, t\}$ that is obtained with the variational posterior via (15). The subproblems in (22b) correspond to Bayesian learning applied separately to each frame, as explained in Section III-A.
An illustration of all the quantities involved in problem (22) can be found in Fig. 3 by using the formalism of Bayesian networks [82].
To address problem (22) in a tractable manner, we apply the reparametrization trick for both the outer (22a) and inner (22b) optimizations by following the same steps described in Section III-A. Details on the optimization can be found in Algorithm 2. In short, the inner loop updates the frame-specific variational parameters $\phi_\tau$ by minimizing the approximated free energy (20) separately for each frame $\tau$ within a mini-batch $\mathcal{T}$ via GD (dashed blue line in Fig. 3b). Following [30], [66], the prior's parameter vector $\xi$ plays two roles in the inner loop, namely (i) as the initialization for the inner GD update in Algorithm 2, line 9; and (ii) as the regularizer for the same update via the prior $p(\varphi_\tau|\xi)$. The outer optimization (22a) is addressed via SGD to minimize the average log-loss for the test set using Algorithm 1 with $G(\varphi_\tau) = \mathcal{L}_{\mathcal{D}^{\text{te}}_\tau}(\varphi_\tau)$, shown as a dashed green line in Fig. 3b.
After obtaining the meta-trained hyperparameter vector $\xi$, meta-testing takes place, starting with the adaptation of the variational [...] $x^{\text{tr}}_\tau[i]$ as done in (21). Bayesian meta-learning is compared with frequentist meta-learning in Fig. 4.

C. Computational Complexity
We now briefly elaborate on the complexity of meta-learning by analyzing the complexity of meta-training and of meta-testing. To this end, let us define as $C$ the complexity of obtaining the probability $p(x|y, \varphi)$ for a data sample $(y, x)$. This baseline complexity depends on the model dimensionality, and it accounts for the amount of time needed to carry out the forward pass on the neural network implementing the model $p(x|y, \varphi)$. Accordingly, as seen in Table II, the per-data-point complexity of meta-testing equals $C$ for frequentist learning, and $C R^{\text{te}}$ for Bayesian learning, where $R^{\text{te}}$ is the size of the ensemble used for inference.
The complexity of computing the first-order gradient via backpropagation per sample is given by $G_1 C$, with $G_1$ being a constant in the range between 2 and 5 [83], [84]. Furthermore, computing the Hessian-vector product (HVP) has a complexity of the order $G_2 G_1 C$, where the constant $G_2$ is also between 2 and 5 [19, Appendix A], [85, Appendix C]. Assume that all tasks have data sets of equal size, i.e., $N^{\text{tr}}_\tau = N^{\text{tr}}$ and $N^{\text{te}}_\tau = N^{\text{te}}$ for any task $\tau$. Then, for each meta-training iteration, for a batch of $B$ tasks with $I$ local updates, the complexity of the frequentist meta-update (8) is of the order
$$\mathcal{O}\Big(B\big(\underbrace{I N^{\text{tr}} G_2 G_1 C}_{\text{HVPs in meta-update}} + \underbrace{N^{\text{te}} G_1 C}_{\text{gradient in meta-update}}\big)\Big). \tag{24}$$
For Bayesian meta-learning, the complexity increases linearly with the training ensemble size that is used for estimating the loss functions in (22a) and (22b). Note that the impact of the size $R^{\text{tr}}$ of the training ensemble used for meta-training is different from that of the size $R^{\text{te}}$ used for inference, as the former determines the variance of the stochastic loss functions, while the latter determines the quality of Bayesian prediction (see, e.g., [59] and references therein). Ignoring the constant costs of differentiating the KL term in the free energy and of sampling from the Gaussian distribution, the complexity analysis is summarized in Table II.

IV. BAYESIAN ACTIVE META-LEARNING
In the previous sections, we have considered a passive meta-learning setting in which the meta-learner is given a number of meta-training data sets, each corresponding to a different channel state $c$. In this section, we study the situation in which the meta-learner has access to a simulator that can be used to generate random data sets for any channel state $c$ via the channel $p(y|x, c)$. The problem of interest is to minimize the use of the simulator by actively selecting the channels $\{c_\tau\}$ for which meta-training data is generated. To this end, we devise a sequential approach, whereby the meta-learner optimizes the next channel state $c_{t+1}$, given all $t$ meta-training data sets of frames $\tau = 1, \ldots, t$.
At the core of the proposed active meta-learning strategy are mechanisms used by the meta-learner to discover model parameter vectors $\varphi$ that have been underexplored so far, and to relate a model parameter vector $\varphi$ to a channel state.

A. Active Selection of Channel States
After having collected $t$ meta-training data sets $\mathcal{D}_{1:t} = \{\mathcal{D}_\tau\}_{\tau=1}^{t}$, the proposed active meta-learning scheme selects the next channel state, $c_{t+1}$, to use for the generation of the $(t+1)$-th meta-training data set $\mathcal{D}_{t+1}$. We adopt the general principle of maximizing the amount of "knowledge" that can be extracted from the data set associated with the selected channel $c_{t+1}$, when added to the $t$ available data sets $\mathcal{D}_{1:t}$. This is done via the following three steps: (i) searching in the space of model parameter vectors for a vector $\varphi_{t+1}$ that is most "surprising" given the available meta-training data $\mathcal{D}_{1:t}$; (ii) translating the selected model parameter vector $\varphi_{t+1}$ into a channel $c_{t+1}$; and (iii) generating data set $\mathcal{D}_{t+1}$ by using the simulator with input $c_{t+1}$.
As illustrated in Fig. 5, in step (i), we adopt the scoring function introduced in [37], i.e.,
$$s_t(\varphi) = -\log\left(\frac{1}{t}\sum_{\tau=1}^{t} q(\varphi|\phi_\tau)\right), \tag{25}$$
in order to select the next model parameter vector as
$$\varphi_{t+1} = \arg\max_{\varphi} \; s_t(\varphi). \tag{26}$$
The criterion (25) measures how incompatible model parameter vector $\varphi$ is with the available data $\mathcal{D}_{1:t}$. In fact, by the derivations in the previous section, the mixture of variational distributions $\frac{1}{t}\sum_{\tau=1}^{t} q(\varphi|\phi_\tau)$ quantifies how likely a vector $\varphi$ is on the basis of the data $\mathcal{D}_{1:t}$ (Fig. 5b), and the negative logarithm in (25) evaluates the information-theoretic "surprise" associated with that mixture. Problem (26) can be addressed either by grid search for a low-dimensional model parameter space, or by using gradient ascent thanks to the differentiability of the scoring function (25), as illustrated in Fig. 5c.
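For a one-dimensional model parameter, step (i) can be sketched with a grid search over the surprise score of a Gaussian mixture. The two variational distributions, the grid limits, and the function names are assumptions for illustration; note that the score grows without bound away from the mixture components, so a bounded search region is assumed here.

```python
import math

def gaussian_log_pdf(x, mean, log_std):
    z = (x - mean) / math.exp(log_std)
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)

def surprise_score(phi, variational_params):
    """Negative log of the equal-weight mixture of variational densities at phi."""
    logs = [gaussian_log_pdf(phi, nu, rho) for nu, rho in variational_params]
    m = max(logs)  # log-sum-exp for numerical stability
    log_mixture = m + math.log(sum(math.exp(v - m) for v in logs)) \
        - math.log(len(logs))
    return -log_mixture

# Hypothetical variational parameters (mean, log-std) from t = 2 previous frames.
variational_params = [(-1.0, math.log(0.3)), (0.5, math.log(0.3))]
grid = [i * 0.1 for i in range(-20, 21)]   # bounded search region [-2, 2]
phi_next = max(grid, key=lambda phi: surprise_score(phi, variational_params))
```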
In step (ii), we need to convert the selected model parameter vector $\varphi_{t+1}$, i.e., the outcome of (26), into a channel state $c_{t+1}$. We choose the channel state $c_{t+1}$ that minimizes the cross-entropy loss when evaluated at $\varphi_{t+1}$, i.e.,
$$c_{t+1} \in \arg\min_{c} \; \mathcal{L}_p(\varphi_{t+1}|c), \quad \text{with } \mathcal{L}_p(\varphi_{t+1}|c) = \mathbb{E}_{p(x,y|c)}\big[-\log p(x|y, \varphi_{t+1})\big], \tag{27}$$
where we set $p(x, y|c) = p(x)p(y|x, c)$, with $p(x)$ being some fixed distribution and $p(y|x, c)$ being the distribution of the output of the simulator. In (27), we have emphasized that there may be more than one solution to the problem. The rationale behind problem (27) is that data generated from the distribution $p(x, y|c_{t+1})$ can be interpreted as being the most compatible with the demodulator $p(x|y, \varphi_{t+1})$, where compatibility is measured by the average of the cross entropy $\mathbb{E}_{p(y|c_{t+1})}\big[H\big(p(x|y, c_{t+1}), p(x|y, \varphi_{t+1})\big)\big]$.
We emphasize that the proposed approach is different from the methodology introduced by [37], which uses another variational distribution in problem (22). In our experiments, we found the method in [37] to be ineffective and complex for the problem under study here. The main issue appears to be overfitting for the additional variational distribution, which is overcome by leveraging the availability of the channel simulator implementing the model $p(y|x, c)$.
In some models, problem (27) can be solved analytically. For more complex models, SGD-based approaches can be used, either by differentiating an estimate of the loss in a manner similar to the discussion in Sec. III, i.e., Algorithm 1 with $G(\varphi_{t+1}) = \mathcal{L}_p(\varphi_{t+1}|c)$, or by directly estimating its gradient [86].
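When the set of candidate channel states is small, step (ii) can also be approximated by exhaustive search over Monte Carlo estimates of the cross entropy. The sketch below assumes a hypothetical simulator for BPSK over $y = c\,x + n$, a logistic demodulator, and a finite candidate list; none of these are taken from the paper, which considers analytical or SGD-based solutions.

```python
import math
import random

def simulate(c, num_samples, noise_std, rng):
    """Hypothetical simulator: BPSK symbols x in {-1, +1} through y = c*x + noise."""
    data = []
    for _ in range(num_samples):
        x = rng.choice((-1.0, 1.0))
        data.append((c * x + rng.gauss(0.0, noise_std), x))
    return data

def cross_entropy(phi, data):
    """Monte Carlo estimate of E[-log p(x | y, phi)] for a logistic demodulator
    with p(x | y, phi) = sigmoid(x * (phi[0] * y + phi[1]))."""
    total = 0.0
    for y, x in data:
        z = x * (phi[0] * y + phi[1])
        total += math.log1p(math.exp(-z))   # equals -log(sigmoid(z))
    return total / len(data)

def select_channel(phi, candidates, num_samples=200, noise_std=0.3, seed=0):
    """Pick the candidate channel whose simulated data best matches demodulator phi."""
    def score(c):
        # Reusing the seed gives every candidate the same symbols and noise draws.
        data = simulate(c, num_samples, noise_std, random.Random(seed))
        return cross_entropy(phi, data)
    return min(candidates, key=score)

c_next = select_channel(phi=[-3.0, 0.0], candidates=[-1.0, -0.5, 0.5, 1.0])
```

Here the demodulator has a negative slope, so the sign-flipping channel is selected as the most compatible one.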
Finally, in step (iii), meta-training data set is generated using the simulator in an i.i.d.fashion following the distribution As a final note, we adopt the proposal in [37] of implementing active selection only after t init > 1 channel states that are generated at random, as a means to avoid being overconfident at early stages.The overall proposed Bayesian active metalearning scheme is summarized in Algorithm 3.
V. EXPERIMENTS
In this section, we present experimental results to evaluate the performance of Bayesian meta-learning for demodulation/equalization.

A. Performance Metrics
Apart from the standard measures of symbol error rate (SER) and mean squared error (MSE), we will also evaluate metrics quantifying the performance in terms of the reliability of the confidence measures provided by the predictor. While such measures can be defined for both classification and regression problems, we will focus here on uncertainty quantification for demodulation via calibration metrics (see [87] for a discussion on regression).
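As discussed in the previous sections, for a new frame, we need to make a prediction for the payload symbols $\{y^{\text{te}}_*[i]\}_{i=1}^{N^{\text{te}}_*}$. For the $i$-th payload symbol, let $\hat{x}[i]$ denote the hard decision of the demodulator $p(x|y,\theta)$, i.e., the constellation point with the largest predictive probability, and let $\hat{p}[i]$ denote the corresponding confidence, i.e., that largest probability. The payload indices are partitioned into $M$ bins $B_1, \ldots, B_M$ according to the subinterval of $[0,1]$ in which the confidence $\hat{p}[i]$ falls. The within-bin empirical average accuracy of the predictor for the $m$-th bin is defined as
$$ \text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} 1\big(\hat{x}[i] = x^{\text{te}}_*[i]\big), \tag{32} $$
with $1(\cdot)$ being the indicator function and $|B_m|$ denoting the total number of samples in $B_m$. The within-bin empirical average confidence of the predictor for the $m$-th bin is
$$ \text{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}[i]. \tag{33} $$
A perfectly calibrated demodulator $p(x|y,\theta)$ would have $\text{acc}(B_m) = \text{conf}(B_m)$ for all $m \in \{1, \ldots, M\}$ in the limit of a sufficiently large payload data set, i.e., $N^{\text{te}}_* \to \infty$. Reliability diagrams plot the accuracy $\text{acc}(B_m)$ and the confidence $\text{conf}(B_m)$ over the binned probability interval $[0,1]$. Ideal calibration would yield $\text{acc}(B_m) = \text{conf}(B_m)$ in a reliability plot. If, in the $m$-th bin, the empirical accuracy and empirical confidence are different, the predictor is considered to be over-confident when $\text{conf}(B_m) > \text{acc}(B_m)$, and under-confident when $\text{conf}(B_m) < \text{acc}(B_m)$.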
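The ECE quantifies the overall amount of miscalibration by computing the weighted average of the differences between within-bin accuracy and within-bin confidence levels across all $M$ bins, i.e.,
$$ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N^{\text{te}}_*}\,\big|\text{acc}(B_m) - \text{conf}(B_m)\big|. \tag{34} $$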
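The calibration metrics above can be sketched in a few lines of pure Python. The snippet below is an illustrative implementation under the standard binning convention, in which the confidence of a prediction is its largest predictive probability and bins partition $[0,1]$ into $M$ equal-width intervals; the function name and the toy inputs are hypothetical.

```python
import math

def calibration_metrics(probs, labels, num_bins=10):
    """Return per-bin accuracy, per-bin confidence, and the ECE.

    probs  : list of probability vectors, one per payload symbol
    labels : list of true constellation indices
    """
    bins = [[] for _ in range(num_bins)]
    for p, x in zip(probs, labels):
        confidence = max(p)              # confidence of the hard decision
        decision = p.index(confidence)   # hard decision (arg max)
        # Bin m (0-based) covers (m/M, (m+1)/M]; values exactly on a bin
        # edge are subject to floating-point rounding.
        m = min(num_bins - 1, max(0, math.ceil(confidence * num_bins) - 1))
        bins[m].append((confidence, decision == x))

    n_total = len(labels)
    acc, conf, ece = [], [], 0.0
    for b in bins:
        if not b:                        # empty bins contribute nothing
            acc.append(None)
            conf.append(None)
            continue
        a = sum(hit for _, hit in b) / len(b)
        c = sum(cf for cf, _ in b) / len(b)
        acc.append(a)
        conf.append(c)
        ece += len(b) / n_total * abs(a - c)
    return acc, conf, ece

# Toy example: two over-confident and two under-confident predictions.
probs = [[0.95, 0.05], [0.95, 0.05], [0.65, 0.35], [0.65, 0.35]]
labels = [0, 1, 0, 0]
acc, conf, ece = calibration_metrics(probs, labels)
```

In this toy example, the top bin has confidence 0.95 but accuracy 0.5 (over-confident), while the bin containing 0.65 has accuracy 1 (under-confident), giving an ECE of 0.4.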

B. Frequentist and Bayesian Meta-Learning for Demodulation
For the first set of experiments, we focus on a demodulation problem at the symbol level in the presence of transmitter I/Q imbalance [89], [90], as considered also in [19]. The main reason for this choice is that channel decoding typically requires a hard decision on the transmitted codeword, whose accuracy can be validated via a cyclic redundancy check. In contrast, demodulation is usually a preliminary step at the receiver side, and downstream blocks, such as channel decoding, expect soft inputs that are well calibrated. For each frame $\tau$, the transmitted symbols $x_\tau[i]$ are drawn uniformly at random from the 16-QAM constellation $\mathcal{X} = \frac{1}{\sqrt{10}}\big(\{\pm 1, \pm 3\} + j\{\pm 1, \pm 3\}\big)$. The received symbol $y_\tau[i] \in \mathcal{Y} = \mathbb{C}$ is given as
$$ y_\tau[i] = h_\tau f_{\text{IQ},\tau}(x_\tau[i]) + z_\tau[i] \tag{35} $$
for a unit energy fading channel coefficient $h_\tau$, where the additive noise is $z_\tau[i] \sim \mathcal{CN}(0, \text{SNR}^{-1})$ for some signal-to-noise ratio (SNR) level $\text{SNR}$, and the I/Q imbalance function [91] $f_{\text{IQ},\tau}: \mathcal{X} \to \bar{\mathcal{X}}_\tau$ in (36) depends on the imbalance parameters $\epsilon_\tau$ and $\delta_\tau$. Note that the constellation $\bar{\mathcal{X}}_\tau$ of the transmitted symbols $\bar{x}_\tau[i]$ is also composed of 16 points via (36).
We set our base learner to be a multi-layer fully-connected neural network (3) with $L = 5$ layers. The real and imaginary parts of the input $y[i] \in \mathbb{C}$ are treated as a vector in $\mathbb{R}^2$, which is fed to layers with 10, 30, and 30 neurons, all with ReLU activations, while the last linear layer implements a softmax function that produces probabilities for the 16-QAM constellation points.
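To make the signal model concrete, the following sketch generates one frame of 16-QAM symbols passed through a fading channel with additive noise. It is an illustration under stated assumptions rather than the simulator used in the experiments: the fading coefficient is assumed to be drawn as $\mathcal{CN}(0,1)$, and the I/Q imbalance is left as a user-supplied callable `f_iq` (the identity by default), since its exact form is not restated here. All function names are hypothetical.

```python
import math
import random

def qam16():
    """16-QAM constellation (1/sqrt(10)) * ({+-1, +-3} + j{+-1, +-3})."""
    levels = (-3, -1, 1, 3)
    return [complex(a, b) / math.sqrt(10) for a in levels for b in levels]

def complex_gaussian(rng, variance):
    """Draw one sample from a circularly symmetric CN(0, variance)."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))

def no_imbalance(x):
    """Placeholder I/Q imbalance function: identity (assumption)."""
    return x

def simulate_frame(num_symbols, snr_db, f_iq=no_imbalance, rng=None):
    """Return (h, symbol indices, received samples) for one frame."""
    rng = rng or random.Random()
    constellation = qam16()
    h = complex_gaussian(rng, 1.0)        # assumed Rayleigh fading, E|h|^2 = 1
    noise_var = 10.0 ** (-snr_db / 10.0)  # noise variance SNR^{-1}
    indices = [rng.randrange(16) for _ in range(num_symbols)]
    received = [h * f_iq(constellation[k]) + complex_gaussian(rng, noise_var)
                for k in indices]
    return h, indices, received

constellation = qam16()
avg_energy = sum(abs(x) ** 2 for x in constellation) / len(constellation)
h, pilots_idx, pilots_rx = simulate_frame(8, snr_db=18, rng=random.Random(0))
```

The normalization by $\sqrt{10}$ gives the constellation unit average energy, so that the SNR is set entirely by the noise variance.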
To assess the ability of meta-learning to adapt the demodulator using only a few pilots, we set the number of pilots to $N^{\text{tr}}_\tau = 4$ during meta-training and $N^{\text{tr}}_* = 8$ for meta-testing [19]. Fig. 6 shows the SER as a function of the total number of meta-training frames $t$. Since only half of the constellation points are available as pilots during meta-testing ($N^{\text{tr}}_* = 8$ different symbols out of 16), conventional learning cannot obtain an SER lower than 0.5. In fact, conventional learning performs worse than a standard model-based receiver applying linear minimum mean square error (LMMSE), followed by maximum likelihood (ML) demodulation, while disregarding the presence of the I/Q imbalance function $f_{\text{IQ}}$. Both meta-learning schemes are clearly superior to conventional learning and to the mentioned model-based solution, showing that useful knowledge has been transferred from previous frames to a new frame. Furthermore, Bayesian meta-learning obtains a slightly lower SER as compared to frequentist meta-learning. This advantage stems from the capacity of ensemble predictors to implement more complex decision boundaries [59].
To gain insights into the reliability of the uncertainty quantification provided by the demodulator, we use the metrics defined in Sec. V-A, setting the total number of bins to $M = 10$. We plot the ECE as a function of the total number of meta-training frames $t$ in Fig. 7.
Bayesian meta-learning is seen to achieve a lower ECE than frequentist meta-learning, indicating that Bayesian meta-learning provides more reliable estimates of uncertainty. Furthermore, the increase in ECE as the number $t$ of available meta-training frames increases may be interpreted as a consequence of meta-overfitting [92]. This suggests that meta-learning may be considered as complete after a number of frames that depends on the complexity of the propagation environment. In practice, this can be assessed by evaluating the performance of the demodulator on pilots (see the online strategy in [19] for further discussion on this point).
To further elaborate on the quality of uncertainty quantification, Fig. 8 depicts reliability diagrams for frequentist and Bayesian meta-learning. The within-bin accuracy levels $\text{acc}(B_m)$ in (32) and the within-bin empirical confidence levels $\text{conf}(B_m)$ in (33) are depicted as dark (blue) and light (red) bars, respectively. Frequentist meta-learning is observed to produce generally over-confident predictions, while Bayesian meta-learning provides better calibrated predictions with well-matching confidence and accuracy levels.

C. Bayesian Active Meta-Learning for Equalization
In this subsection, we illustrate the operation of active meta-learning by investigating a single-input multiple-output (SIMO) Rayleigh block fading real channel model. At frame $\tau$, the modulator uses 4-PAM to produce symbols $x_\tau[i]$, $i = 1, 2, \ldots, N_\tau$, taken uniformly from the 4-PAM constellation set, and the receiver applies a linear equalizer with weight vector $\phi$. To obtain a soft equalization, we account for a precision level $\beta$ via the conditional distribution $p(x|y,\phi) = \mathcal{N}(x|\phi^\top y, \beta^{-1})$. The next model parameter $\phi_{t+1}$ is chosen to maximize the scoring function as in (26) by restricting the optimization to the domain $\|\phi\| \leq 1$. This restricted optimization domain is selected in order to match the circular symmetry of the problem. Furthermore, the corresponding next channel state $c_{t+1}$ is selected by tackling problem (27), which amounts to the minimization in (41b). In the set of solutions of problem (41b), we select the minimum-norm solution $c_{t+1} = \phi_{t+1}/\|\phi_{t+1}\|^2$. This way, the selected channel focuses on the more challenging low-SNR regime. Details of this experiment are provided in Appendix A.
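The conversion from the selected equalizer $\phi_{t+1}$ to the channel state $c_{t+1}$ can be illustrated with a short sketch. It assumes, as the minimum-norm expression suggests, that the set of solutions is the hyperplane $\{c : \phi_{t+1}^\top c = 1\}$, i.e., the channels that the linear equalizer inverts exactly; the numerical value of $\phi_{t+1}$ and the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two real vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean norm of a real vector."""
    return math.sqrt(dot(a, a))

def min_norm_channel(phi):
    """Minimum-norm c satisfying phi^T c = 1, i.e., c = phi / ||phi||^2."""
    sq_norm = dot(phi, phi)
    if sq_norm == 0.0:
        raise ValueError("phi must be nonzero")
    return [p / sq_norm for p in phi]

phi_next = [0.3, 0.4]                  # hypothetical outcome of (26), ||phi|| <= 1
c_next = min_norm_channel(phi_next)

# Any other solution differs by a component orthogonal to phi and has
# a strictly larger norm, i.e., a higher SNR.
c_other = [c_next[0] + 0.4, c_next[1] - 0.3]
```

Since $\|c_{t+1}\| = 1/\|\phi_{t+1}\|$, the minimum-norm choice yields the weakest channel among all channels compatible with $\phi_{t+1}$, which is consistent with the focus on the low-SNR regime.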
Fig. 9 illustrates the scoring function (25) used to select the next model parameter $\phi_{t+1}$ as a heat map in the space of the model parameter $\phi$. Specifically, the figure shows the scoring functions after observing $t = 4$ and $t = 5$ meta-training frames. The optimized next model parameter vector $\phi_{t+1}$ from (26) is shown as a star, while the previously selected model parameter vectors $\phi_{1:t}$ are shown as squares. Fig. 9 illustrates how active meta-learning efficiently explores the model parameter space. It does so by avoiding the inclusion of channel states that are similar to those already considered (i.e., the squares in the figure). This way, the model parameter space can be covered with fewer meta-training frames $t$, leading to a larger frame efficiency of active meta-learning.
Finally, to numerically validate the advantage of active meta-learning, we plot the meta-test MSE loss in Fig. 10 for both passive and active Bayesian meta-learning versus the number of frames $t$. For passive meta-learning, we have generated random channel realizations by drawing from the distribution $p(c) = \mathcal{N}(c|0, I_2)$. We have repeated the experiment 100 times, and show the confidence interval of one standard deviation for the meta-test loss. The results in the figure confirm that active meta-learning requires far fewer meta-training frames. Furthermore, the increased randomness of passive meta-learning is due to the random selection of channel states at each iteration.

VI. CONCLUSIONS
In this paper, we have introduced tools for reliable and efficient AI in communication systems via Bayesian meta-learning. Bayesian learning has the advantage of producing well-calibrated decisions whose confidence levels are a close match for the corresponding test accuracy. This property facilitates monitoring of the quality of the outputs of an AI module. Meta-learning optimizes models that can quickly adapt based on few pilots, producing sample-efficient AI solutions. This paper has focused on the application of Bayesian meta-learning to the basic problems of demodulation/equalization from few pilots. We have demonstrated via experiments that the demodulator/equalizer obtained via Bayesian meta-learning not only achieves a higher accuracy, but also enjoys better calibration performance than its standard frequentist counterpart. Furthermore, thanks to meta-learning, such performance levels can be obtained based on a limited number of pilots per frame.
To reduce the number of past frames required by meta-learning, we have also introduced Bayesian active meta-learning, which leverages the uncertainty estimates produced by Bayesian learning to actively explore the space of channel conditions. We have shown via numerical results that active meta-learning can indeed significantly speed up meta-training in terms of number of frames.
Future work may consider a fully Bayesian meta-learning implementation that also accounts for uncertainty at the level of hyperparameters (see, e.g., [73] and references therein). This may be particularly useful in the regime of a low number of frames. Another direction for research would be to investigate different scoring functions for active meta-learning (see, e.g., [39]). A study on the impact of well-calibrated decisions obtained via Bayesian learning on downstream blocks at the receiver, such as channel decoding, is also of interest. Finally, the proposed tools may find applications to other problems in communications, such as power control [25] and channel coding [28], [93].

APPENDIX A EXPERIMENTS DETAILS
Table III summarizes the parameters used for the numerical experiments in Sec. V for demodulation and equalization. Throughout the simulations, we used PyTorch [94], adopting autograd's option create_graph=True to allow the computational graph to calculate second-order derivatives.
For the demodulation problem in Sec. V-B (Figs. 6-8), the complex input space $\mathcal{Y} = \mathbb{C}$ is treated as a two-dimensional real vector space $\mathbb{R}^2$ when it is fed into the neural network demodulator. The KL term in (20) is scaled by a multiplicative coefficient of 0.1, as a means to emphasize the average log-likelihood term over the prior. This is an approach known as generalized Bayesian inference [79], [95]. To handle the discrepancy in the number of pilots for adaptation during meta-training and meta-testing, i.e., $N^{\text{tr}}_* > N^{\text{tr}}_\tau$, we consider the following strategy, akin to a burn-in phase [49], during meta-testing, as done in [19]: (i) start with $I$ updates using learning rate $\eta$, utilizing $N^{\text{tr}}_\tau$ pilots among the available $N^{\text{tr}}_*$ pilots; (ii) then, perform $I_* - I$ additional updates with a reduced learning rate (5% of the original learning rate) using all available $N^{\text{tr}}_*$ pilots. This strategy becomes particularly useful in practical scalable systems in which the number of pilots may change depending on the deployment environment. As for the equalization setting in Sec. V-C (Figs. 9-10), we observe that reinitializing the hyperparameter $\xi$ to a random value at each data acquisition iteration benefits meta-training in practice. While using the previous iteration's optimized hyperparameter vector $\xi$ as the starting point for the current iteration is useful in reducing the computational complexity [19], [96], we found it beneficial not to do so in our equalization problem to avoid meta-overfitting, especially in the few-frames (e.g., 10 frames) regime of interest.
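The two-phase adaptation strategy described above for the demodulation experiment can be written as a simple schedule that lists, for each update, the learning rate and the number of pilots to use. The sketch below only builds this schedule; the function name and the numerical values in the usage example are hypothetical, and the actual gradient updates are left to the learner.

```python
def adaptation_schedule(num_updates_train, num_updates_test, lr,
                        pilots_train, pilots_test, lr_reduction=0.05):
    """Per-update (learning rate, number of pilots) pairs at meta-test time.

    Phase (i):  the first `num_updates_train` updates (I) use the original
                learning rate and only `pilots_train` of the pilots.
    Phase (ii): the remaining updates (I_* - I) use a reduced learning rate
                (5% of the original by default) and all `pilots_test` pilots.
    """
    if num_updates_test < num_updates_train:
        raise ValueError("meta-test updates must be at least meta-training updates")
    schedule = []
    for i in range(num_updates_test):
        if i < num_updates_train:
            schedule.append((lr, pilots_train))
        else:
            schedule.append((lr * lr_reduction, pilots_test))
    return schedule

# Illustrative values only: I = 2, I_* = 5, eta = 0.1, 4 and 8 pilots.
schedule = adaptation_schedule(2, 5, 0.1, 4, 8)
```

Keeping the first phase identical to the meta-training configuration preserves the conditions under which the hyperparameters were optimized, while the second phase exploits the additional pilots with smaller steps.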

Fig. 2. Network weights in frequentist and Bayesian learning: (a) in frequentist learning, each weight is described by a scalar value; (b) the scalar value can be viewed as a random variable having a degenerate probability distribution concentrated at a simple prior; (c) in Bayesian learning, the weights are assigned a probability distribution, which, unlike the frequentist point estimate (dashed vertical line), provides information about the uncertainty on the weight; (d) in variational inference (VI), the posterior is approximated with a parametric distribution.

Fig. 5. Illustration of how model parameter vectors are scored to enable active meta-learning provided $t = 3$ meta-training sets. (a) Frequentist meta-learning relies on point estimates, and is hence unable to score as-of-yet unexplored model parameters; (b) Bayesian meta-learning can associate a score to each model parameter vector $\phi$ based on the variational distributions $q(\phi|\varphi_\tau)$ evaluated in the previously observed frames $\tau = 1, \ldots, t$; (c) the scoring function can be maximized to obtain the next model parameter vector $\phi_{t+1}$ as the most "surprising" one.

Fig. 6. Symbol error rate (SER) as a function of the number $t$ of meta-training frames with 16-QAM, Rayleigh fading, and I/Q imbalance for $N^{\text{tr}}_\tau = 4$, $N^{\text{tr}}_* = 8$. The symbol error rate is averaged over $N^{\text{te}}_* = 4000$ data symbols and 50 meta-test frames with an ensemble of size 100.

Fig. 7. Expected calibration error (ECE) over meta-test data $\mathcal{D}^{\text{te}}_*$ as a function of the number $t$ of meta-training frames, for the same setting as in Fig. 6.

Fig. 8. Reliability diagrams (top) for frequentist meta-learning (left) and Bayesian meta-learning (right) with SNR = 18 dB, using $t = 16$ meta-training frames and predictions averaged over 50 meta-test frames. Frequentist meta-learning tends to be over-confident, whereas the Bayesian soft predictions are better matched to the true accuracy. The bottom figure shows the histogram of the fraction $|B_m|/N$ of predictions over $M = 10$ bins. Full details are provided in Appendix A.

Fig. 9. Scoring function (25) used by Bayesian active meta-learning to select the next model parameter vector $\phi_{t+1}$ at the fourth and fifth iterations. The scoring function is shown as a heat map over the two-dimensional space of the model parameter vector $\phi$ for the example detailed in Sec. V-C.

Fig. 10. Meta-test mean squared error (MSE) loss as a function of the number of frames $t$. Bayesian active meta-training is able to achieve lower meta-test loss levels by using fewer meta-training tasks $t$. Solid lines are the mean test loss over 100 channel states. The confidence levels account for one standard deviation.

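TABLE I
A SUMMARY OF THE RELEVANT TECHNIQUES CONSIDERED IN THIS WORK

Having obtained the distribution $p(\phi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$, the ensemble prediction of a test point $(y^{\text{te}}_\tau[i], x^{\text{te}}_\tau[i])$ is given by the ensemble average of the predictions $p(x^{\text{te}}_\tau[i]|y^{\text{te}}_\tau[i], \phi_\tau)$ with random vector $\phi_\tau$ having distribution $p(\phi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)$, i.e., $p(x^{\text{te}}_\tau[i]|y^{\text{te}}_\tau[i], \mathcal{D}^{\text{tr}}_\tau, \xi) = \mathbb{E}_{p(\phi_\tau|\mathcal{D}^{\text{tr}}_\tau, \xi)}\big[p(x^{\text{te}}_\tau[i]|y^{\text{te}}_\tau[i], \phi_\tau)\big]$.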

TABLE III
PARAMETERS FOR THE DEMODULATION AND EQUALIZATION META-LEARNING.