Lyapunov-Based Optimization of Edge Resources for Energy-Efficient Adaptive Federated Learning

The aim of this paper is to propose a novel dynamic resource allocation strategy for energy-efficient adaptive federated learning at the wireless network edge, with latency and learning performance guarantees. We consider a set of devices collecting local data and uploading processed information to an edge server, which runs stochastic gradient-based algorithms to perform continuous learning and adaptation. Hinging on Lyapunov stochastic optimization tools, we dynamically optimize radio parameters (e.g., set of transmitting devices, transmit powers, bits, and rates) and computation resources (e.g., CPU cycles at devices and at server) in order to strike the best trade-off between power, latency, and performance of the federated learning task. The framework admits both a model-based implementation, where the learning performance metrics are available in closed-form, and a data-driven approach, which works with online estimates of the learning performance of interest. The method is then customized to the case of federated least mean squares (LMS) estimation, and federated training of deep convolutional neural networks. Numerical results illustrate the effectiveness of our strategy to perform energy-efficient, low-latency, adaptive federated learning at the wireless network edge.


I. INTRODUCTION
In the last few years, with the advent of 5G (and beyond) systems, communication networks have been evolving from a pure communication framework to service enablers in several different sectors (including verticals), such as Industry 4.0, Internet of Things (IoT), autonomous driving, and remote surgery [1]-[5]. As key enablers of this vision, machine learning (ML) and artificial intelligence will be largely exploited in future wireless communication networks, in order to build an effective complex system able to learn and dynamically adapt to the evolving network landscape [6]. Indeed, the emergence of a new breed of intelligent devices and high-stake applications foreseen in beyond 5G has sparked a huge interest in distributed, low-latency and reliable ML, calling for a novel system design coined edge machine learning, in which: (i) training data is unevenly distributed over a large number of edge devices (including phones, cameras, vehicles, and drones); (ii) every edge device has access to a tiny fraction of the data and training is carried out collectively and distributively; (iii) the inference process is performed on the edge devices, requiring not only high learning accuracy and reliability, but also a very short response time necessary for autonomous decision making in highly dynamic environments. However, differently from cloud-based ML that has virtually infinite computing resources, edge ML is a nascent research field whose system design is entangled with communication and on-device resource constraints (e.g., energy and computing power). Moreover, the process of decentralized training involves a large number of devices that are interconnected over wireless links, hindering learning and adaptation due to communications under poor channel conditions. As such, enabling edge ML introduces novel research problems in terms of jointly optimizing inference, training, communication, computation, and control under end-to-end latency, reliability, and learning performance requirements [7]-[10].
Related works: Training ML models at the edge mainly relies on federated learning (FL) [11]-[19].
These learning architectures perform (variants of) parallel stochastic gradient descent (SGD) across multiple edge devices, whose intermediate results are aggregated by an Edge Server (ES). FL offers data privacy benefits, and is empowered by a large number of participating devices with modern powerful processors and low-delay mobile-edge networks. The work in [11] provides a comprehensive survey on FL algorithms and introduces various challenges, problems, and solutions for enhancing FL effectiveness. In [12], the authors develop two update methods to reduce the uplink communication costs for FL. The work in [15] presents a practical update method for a deep FL algorithm and conducts an extensive empirical evaluation for different FL models. The authors in [18] study FL and the problem of joint power and resource allocation for ultra-reliable low-latency communication in vehicular networks. The work in [19] develops a new approach to minimize the computing and transmission delay for FL algorithms. Other works on FL explicitly focus on the optimization of radio resource allocation [20]-[33]. In [20], the authors propose a control algorithm that determines the best trade-off between local update and global parameter aggregation to minimize the loss function under a given resource budget. In [21], the authors propose energy-efficient strategies for bandwidth allocation and scheduling to enable latency-constrained FL. The work in [22] proposes a joint learning and wireless resource allocation framework to optimize the FL performance. In [23], the authors characterize how the computation and communication latencies of edge devices affect the trade-offs between energy consumption, learning time, and accuracy of the FL task. The work in [25] studies the relationship between batch size and convergence rate to alleviate the negative impact of the synchronization barrier through an adaptive batch size during model training in the FL paradigm.
In [26], the authors model the interaction between a global server and the participating devices for federated learning via a Stackelberg game to motivate the participation of the devices in the federated learning process. In [27], the authors provide an optimization problem whose goal is to minimize the total energy consumption of the system under a latency constraint; to solve the problem, an iterative algorithm is proposed where, at every step, closed-form solutions for time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy are derived. The work in [28] proposes a federated deep-reinforcement-learning-based cooperative edge caching framework which enables base stations (BSs) to cooperatively learn a shared predictive model by considering the first-round training parameters of the BSs as the initial input of the local training, and then uploads near-optimal local parameters to the BSs to participate in the next round of global training. In [29], the authors propose adapting federated averaging to use a distributed form of Adam optimization along with a compression technique. The work in [30] proposes an FL approach with adaptive and distributed parameter pruning, which adapts the model size during FL to reduce both communication and computation overhead and minimize the overall training time, while maintaining a similar accuracy as the original model. In [31], the authors consider two transmission protocols for edge devices to upload model parameters to the edge server, based on non-orthogonal multiple access and time division multiple access, respectively. Under both protocols, they minimize the total energy consumption at all edge devices over a particular finite training duration subject to a given training accuracy, by jointly optimizing the transmission power and rates at the edge devices for uploading model parameters and their central processing unit frequencies for local update.
In [32], the authors propose a joint device scheduling and resource allocation policy to maximize the model accuracy within a given total training time budget for latency-constrained wireless FL. The work in [34] analyzes how to design dynamic FL in mobile edge networks that optimally chooses the number of selected clients and the number of local iterations in each training round to minimize the total cost while ensuring convergence. Reference [33] proposes a dynamic user selection scheme to minimize the FL convergence time. In [35], the authors propose two bandwidth allocation schemes to maximize the number of active clients under latency and bandwidth constraints. The work in [36] focuses on the design and analysis of physical layer quantization and transmission methods for wireless FL. In [37], the authors propose a strategy that allocates different aggregation weights to different clients based on the heterogeneous quantization errors of all clients. The work in [38] proposes a strategy for joint allocation of wireless resources and quantization bits across the clients to minimize the quantization errors while making the clients have the same transmission outage probability. Finally, some works exploited Lyapunov optimization for federated learning [39]-[41]. In [39], the authors propose to statically optimize agents' schedule and power allocation to minimize the global learning loss under an energy consumption constraint. In [40], the authors propose to optimize admitted data proportions, load balancing, training scheduling and numerical accuracy to minimize a unified cost function under a data queue stability constraint. Finally, the work in [41] proposes to optimize agents' schedule to minimize long-term average model exchange time under a fairness constraint.
Contributions: The goal of this paper is to introduce a novel dynamic optimization framework for adaptive federated learning, which jointly encompasses communication, computation, and learning aspects of the problem. Differently from previous works that mainly focused on a static learning task (where FL is carried out up to convergence and then the learning process stops), here we consider adaptive FL strategies, with the aim of endowing wireless networks with continuous learning, adaptation, and tracking capabilities [42]. Hinging on Lyapunov stochastic optimization [43], we develop a dynamic resource allocation strategy that works at the same time-scale of the gradient-based algorithm, while optimizing on the fly radio parameters (e.g., set of transmitting devices, transmit powers, bits and rates) and computation resources (e.g., CPU cycles at devices and at edge server) in order to strike the best trade-off between energy, latency, and performance (e.g., convergence rate, accuracy or mean-squared error) of the adaptive FL task. The joint treatment of communication, computation, and learning aspects of FL is the main difference of the proposed method with respect to the previous approaches hinging on Lyapunov optimization [39]-[41]. In this paper, particular emphasis is devoted to the definition and the online control of proper performance metrics in both a model-based scenario, where the metrics are known in closed form, and a data-driven case, where performance must be inferred online from data. In both cases, the method is able to adaptively minimize the average power needed for the FL task, while ensuring guaranteed latency and learning performance. Finally, the proposed strategy is customized to adaptive federated Least Mean Squares (LMS) estimation and deep convolutional neural network training.
Part of this work was presented in the preliminary conference paper [44], which is here largely extended in terms of theory, models, and numerical results. Due to the upcoming convergence of communication and learning in beyond 5G networks, it is fundamental to merge these aspects with a mathematically formal analysis of both communication and computation power consumption and delays, as well as learning performance metrics in terms of accuracy of the learning task, rate of convergence and adaptation. This mathematical analysis, also exploited in the proposed online solution, represents the main contribution of our work with respect to the current state of the art previously presented.
Notation: Scalar, column vector, and matrix variables are respectively indicated by plain letters a (A), bold lowercase letters a, and bold uppercase letters A. I(·) denotes the indicator function; a ij is the (i, j)-th element of A, I is the identity matrix, and 1 N (0 N ) is the N × 1 vector of all ones (zeros). diag{a} denotes a diagonal matrix having vector a on its main diagonal. E{·} denotes the expectation operator, Tr{·} denotes the matrix trace operator, and λ min {·} represents the minimum eigenvalue. Other specific notation is defined along the paper.

II. SYSTEM MODEL
In this section, we present the mathematical model used to describe the FL algorithm and its performance, the overall latency, and the system power consumption. Let us consider a scenario with N edge devices and an AP equipped with an edge server, as illustrated in Fig. 1. The devices are cooperatively performing a training task aimed at learning a weight vector w ∈ R m . To this aim, at each time t, the devices collect labelled data (i.e., input/output pairs) given by (x i,t , y i,t ) ∈ R d × R, for all i = 1, . . . , N , and t ≥ 0 . Then, assuming that device i has a local loss function J i (w; x i,t , y i,t ), whose structure depends on the specific learning task, the goal of FL can be mathematically cast as: where the expectation is carried out over the data distribution. Now, at each time t, letting w t be the instantaneous guess for the weight vector w, we proceed by optimizing problem (1) using an adaptive stochastic gradient descent procedure [42].
In particular, let us denote for all i, t, to shorten the notation. Then, the adopted SGD recursion reads as: where t ≥ 0, μ > 0 is a (sufficiently small) step-size parameter, and S t is the set of nodes that participate in the optimization at time t. In the considered FL scenario, the algorithm in (2) can be implemented in two ways. The straightforward one requires that, at each time t, the edge devices belonging to S t (to be determined for all t) compute in parallel the gradients of the local cost functions (i.e., g i,t (w t )) and upload them to the AP. Then, the edge server aggregates the local information to compute the new estimate w t+1 , which is finally fed back to the devices. An example of the data exchange required for this implementation is illustrated in Fig. 1. The second implementation of algorithm (2) requires instead that, at each time t, multiple edge devices compute one step of gradient-based algorithms on the current model using local data, and then the server takes a weighted average of the resulting models. In particular, device i evaluates a local estimate ψ i,t given by: Then, the edge server aggregates the local information in (3) to compute w t+1 as: It is straightforward to see that the combination of (3) and (4) is equivalent to the direct SGD implementation in (2). The latter implementation is known as federated averaging (FedAvg), and is typically preferred for privacy reasons [15], since transmitting the gradients might reveal information about the data. However, the first implementation is less sensitive to errors introduced by quantization effects, which can be controlled through the choice of the step-size μ. For this reason, following the idea of [10], we will act on the source encoder of each transmitting device, dynamically adapting the quantization level in order to strike the best trade-off between power, latency, and learning performance.
Furthermore, adding quantization noise to gradients naturally induces a differential privacy behavior [45], [46]. Thus, in the sequel, we will consider only the direct SGD implementation in (2), thus modeling its energy consumption, latency, and learning performance. Similar expressions hold for the FedAvg implementation in (3)-(4).
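The equivalence between the two implementations can be checked numerically. The following Python sketch assumes that the server averages uniformly over the active set S t and that each local model is one gradient step from w t ; the 1/|S t | normalization and the function names are assumptions made for illustration, not the paper's stated equations.

```python
import random

def direct_sgd_step(w, grads, mu):
    """Server update from uploaded gradients: w - (mu / |S|) * sum_i g_i."""
    n = len(grads)
    return [wk - mu * sum(g[k] for g in grads) / n for k, wk in enumerate(w)]

def fedavg_step(w, grads, mu):
    """Each device takes one local gradient step; the server averages the models."""
    local_models = [[wk - mu * g[k] for k, wk in enumerate(w)] for g in grads]
    n = len(local_models)
    return [sum(psi[k] for psi in local_models) / n for k in range(len(w))]

rng = random.Random(0)
w_t = [rng.gauss(0, 1) for _ in range(4)]                          # current model
grads_t = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 active devices

w_direct = direct_sgd_step(w_t, grads_t, mu=0.1)
w_fedavg = fedavg_step(w_t, grads_t, mu=0.1)
max_gap = max(abs(a - b) for a, b in zip(w_direct, w_fedavg))
print(max_gap)
```

Under these assumptions the two updates coincide up to floating-point rounding, since averaging the locally updated models is the same as applying the averaged gradient once.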

A. Dithered Quantization of Uplink Data
We assume that each device i is endowed with a dynamic uniform quantizer that uses b i,t bits to transmit data to the AP at time t. The quantizer used by device i at time t is defined by the following vector mapping q (·; b i,t ) as: where the entries of z, the dynamic quantization step Δ i,t > 0, and the error e i,t satisfy ∀i , t, with A denoting the size of input data variation. Conditioned on the input, the quantization error e i,t (z ) in (5) is deterministic. This induces a correlation among the quantization errors at different times, which may affect the convergence properties of the iterative algorithm (2). To avoid undesired error correlations, we introduce dithering [47], [48]. In particular, the dither added to randomize the quantization effects satisfies a special condition, namely the Schuchman condition, as in subtractively dithered systems [49]. Then, adding to each which is uniformly distributed over [−Δ i,t /2, Δ i,t /2) and independent of z n . To implement (2), the devices transmit dithered quantized versions of g i,t (w t ) at time t, which from (7) read as: where
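As an illustration of subtractive dithering, the Python sketch below quantizes a vector with a uniform quantizer after adding a dither that is uniform over one quantization cell, and then subtracts the same dither at the receiver. The step rule Δ = 2A / 2^b is an assumption chosen for the example (the text only states that Δ i,t depends on b i,t and A), and saturation outside [−A, A] is ignored.

```python
import random

def quantization_step(amplitude, bits):
    # Assumed rule: 2**bits uniform cells covering [-amplitude, amplitude].
    return 2.0 * amplitude / (2 ** bits)

def dithered_quantize(z, step, rng):
    """Subtractively dithered uniform quantization of a list of floats."""
    out = []
    for zn in z:
        d = rng.uniform(-step / 2.0, step / 2.0)  # dither spanning one cell
        q = step * round((zn + d) / step)         # uniform (mid-tread) quantizer
        out.append(q - d)                         # receiver removes the same dither
    return out

rng = random.Random(1)
A, b = 1.0, 4
step = quantization_step(A, b)
z = [rng.uniform(-A, A) for _ in range(10000)]
z_hat = dithered_quantize(z, step, rng)

errors = [zh - zn for zh, zn in zip(z_hat, z)]
max_abs_error = max(abs(e) for e in errors)
mean_error = sum(errors) / len(errors)
print(step, max_abs_error, mean_error)
```

In this sketch the reconstruction error never exceeds Δ/2 in magnitude and its empirical mean is close to zero, which is the behavior that makes the quantization noise easy to model in the mean-square analysis.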

B. Performance of Adaptive Federated Learning
In this section, we derive expressions for the learning performance of the proposed FL strategy. Using the dithered quantized gradients (8) in (2), the SGD recursion is given by: is a gradient noise process defined as: for all i ∈ S t and t ≥ 0, which depends on both data and dithered quantization statistics. Then, from (9), the aim of the analysis is to find expressions for the algorithm's performance in terms of learning accuracy and convergence rate. Let us denote by G t the performance metric for the learning problem at time slot t. Depending on the specific task, G t might represent different metrics as, e.g., prediction error or classification accuracy. Also, let us denote by α t the convergence rate at time slot t. Now, performing a mean-square analysis for the recursion in (10) can be a formidable task for general cost functions. Thus, in the sequel, we consider two different assumptions on the global loss function in (1).

1) Strongly Convex Objective:
In this case, we assume that (1) has a favorable structure that helps making the problem mathematically tractable. In particular, letting we consider the following assumption [42].
Assumption 1: The aggregate cost J(w) is twice differentiable, ν-strongly convex, and its gradient is δ-Lipschitz.
This is the case of many important learning paradigms, spanning from least-mean squares adaptation, support vector machines, logistic regression, and so on [42]. As we will see in the sequel, Assumption 1 is useful to give closed form expressions for the steady-state performance and convergence rate of our federated learning strategy (cf. (13) and (15)), which are in turn important to control the resource allocation of our system in a very efficient manner using Lyapunov optimization (cf. Section III). Furthermore, following similar arguments as in [42,Ch. 5], we consider the following assumption on the gradient noise process affecting (9).
Assumption 2: The gradient noise process in (10) is zero-mean and satisfies for all i = 1, . . . , N and t ≥ 0, and some β 2 and σ 2 s , where F t−1 is the filtration of the random process w t up to time t − 1, and w o is the global minimizer of J(w) (which exists under Assumption 1). Assumption 2 ensures boundedness of the second-order moments of the gradient noise in (10), and is instrumental to derive the mean-square performance of the algorithm in (9). From (9) and (10), it is clear that the performance of the SGD algorithm depends on the set S t (i.e., which devices are transmitting) and {b i,t } i∈St (i.e., how many bits encode the information transmitted by each device). Of course, the two variables are related to each other as: i.e., a device belongs to the set of transmitting nodes if the number of uploaded bits is different from zero, and vice versa. Thus, the set S t is fully known given the quantization bits {b i,t } N i=1 , which represent the variables to be selected by the resource allocation algorithm. In particular, for a fixed transmission scheme (i.e., S t = S ⇐⇒ b i,t = b i for all t), under Assumptions 1 and 2, for any 0 < μ < 2ν/(δ 2 + β 2 ), the Mean-square Deviation (MSD) and the Excess Risk (ER) can be expressed as [42,Ch. 5]: where is the covariance matrix of the local gradient noise in (10), both evaluated at w o , for all i = 1, . . . , N . Also, we can approximate the convergence rate (i.e., the rate at which the error variance E{‖w t − w o ‖ 2 } approaches its steady-state region) of the algorithm in (9) by [42,Ch. 5]: The smaller the value of α in (15), the faster the convergence. Thus, hinging on (13) and (15), a possible way to define approximate learning performance metrics of (9) at time t is: Alternatively, the learning accuracy G t can be defined with respect to the ER in (14).
Of course, the expressions (16) and (17) represent only (instantaneous) approximations of the learning performance achieved by (9), considering the values of {b i,t } i∈St at a given time-slot t. However, as we will show in the numerical results, an accurate performance prediction can still be achieved, thanks to the fast adaptation capabilities of algorithm (9) in both training and steady-state phases.
2) Non-Convex Objective: The non-convex scenario is typical of many FL tasks involving, e.g., (deep) neural network training. In such a case, we almost never have theoretical performance metrics, which enable reliable prediction of the algorithm's accuracy, following a model-based approach. Thus, we follow an alternative data-driven approach, which involves an online estimation of learning performance, i.e., accuracy and convergence rate, in order to drive the dynamic resource allocation.
In Section III, we will first derive the resource allocation framework assuming to have access to some closed form expression of the performance metrics G t and α t , as in (13) and (15). Then, in Section IV, we will extend the proposed framework to handle the case where closed form expressions for G t and α t are not generally available and must be inferred online from the data, which is typical of non-convex scenarios. Specific details of the proposed data-driven approach will be provided in Section IV.

C. Latency of SGD Iterations
The latency necessary to perform one SGD iteration at time t, together with the convergence rate in (17), quantify the time the FL algorithm in (9) needs to learn and adapt. In particular, the overall latency of one iteration is composed of four main sources of delay, which vary over time due to availability of resources (radio and computation) and wireless channel states.
(i) The local processing time to compute the gradient $g_{i,t}(w_t)$ at the i-th device reads as: $L^{l}_{i,t} = N^{l}_{i} / f^{l}_{i,t}$ (18), where $N^{l}_{i}$ is the number of CPU cycles necessary to perform this task, and $f^{l}_{i,t}$ is the CPU frequency of device i at time t.
(ii) The uplink communication time, necessary to upload the local gradients to the edge server. Since the i-th device adopts a dithered quantization scheme that encodes local gradients into $m \cdot b_{i,t}$ bits at time t, this latency term reads as: $L^{u}_{i,t} = m\, b_{i,t} / R^{u}_{i,t}$ (19), where $R^{u}_{i,t}$ is the uplink data rate. In principle, if device i belongs to $S_t$, it computes and transmits at time t, incurring the delays (18) and (19).
(iii) The remote processing time at the edge server, necessary to produce the global estimate in (9), is given by $L^{r}_{t} = O\,|S_t| / f^{r}_{t}$ (20), where O is the number of CPU cycles necessary to perform one summation (between m-dimensional vectors), the cardinality $|S_t|$ denotes the number of transmitting nodes, and $f^{r}_{t}$ is the CPU frequency of the edge server.
(iv) A downlink communication time, say L d i,t , i = 1, . . . , N , necessary to send the global estimate w t+1 back to the devices. Here, we assume that the number of bits used to encode downlink data is a fixed value, which is chosen sufficiently large to have a negligible impact on the performance of the algorithm in (9). Also, since our interest is mainly focused on uplink communications, the downlink communication time L d i,t is assumed to be given and ensured by the AP at any slot; thus, it will not be optimized over time.
Finally, to evaluate the overall latency of each SGD iteration, we need to consider the maximum among communication delays of all transmitting (and receiving) devices in the overall latency, plus the computation time at the edge server, which at a given time t reads as: As we will show in the sequel, our aim is to keep the average value of L t in (21), i.e., the average time of SGD iterations, below a given threshold.
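The four delay terms can be combined as in the following Python sketch. It assumes that each active device contributes its local processing, uplink, and downlink delays, that the slowest device determines the per-iteration delay, and that the server time O|S t |/f r t is added on top; this composition and all names are assumptions used for illustration.

```python
def iteration_latency(devices, m, cycles_per_sum, f_server):
    """Latency of one SGD iteration for the active devices (assumed composition).

    devices: list of dicts with keys
      cycles   - CPU cycles for the local gradient
      f_local  - device CPU frequency [Hz]
      bits     - quantization bits per gradient entry
      rate     - uplink rate [bit/s]
      downlink - downlink delay [s], treated as given
    """
    per_device = []
    for d in devices:
        local = d["cycles"] / d["f_local"]   # local processing time
        uplink = m * d["bits"] / d["rate"]   # m * b bits over the uplink
        per_device.append(local + uplink + d["downlink"])
    remote = cycles_per_sum * len(devices) / f_server  # aggregation at the server
    return max(per_device) + remote

active = [
    {"cycles": 1e6, "f_local": 1e9, "bits": 8, "rate": 1e6, "downlink": 1e-4},
    {"cycles": 2e6, "f_local": 1e9, "bits": 4, "rate": 1e6, "downlink": 1e-4},
]
L_t = iteration_latency(active, m=100, cycles_per_sum=1e3, f_server=1e9)
print(L_t)
```

In this toy setting the second device is the bottleneck (2.5 ms), and the server adds 2 microseconds, so the iteration takes about 2.502 ms.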

D. Power Consumption
In this paragraph, we evaluate the power consumption of the proposed federated learning strategy. We consider two sources of power consumption for each device: local computation and transmission; then, we take into account the power spent for computation at the edge server. At time t, the power spent by device i for local computation is: $p^{l}_{i,t} = \kappa^{l} (f^{l}_{i,t})^{3}$ (22), where $\kappa^{l}$ is the effective switched capacitance of the processor [50]. Moreover, given the uplink data rate $R^{u}_{i,t}$, the power spent for uplink transmission is computed by inverting the Shannon formula, thus obtaining: $p^{u}_{i,t} = \frac{B^{u}_{i,t} N_0}{h^{u}_{i,t}} \left( 2^{R^{u}_{i,t}/B^{u}_{i,t}} - 1 \right)$ (23), where $B^{u}_{i,t}$ is the bandwidth assigned to device i at time t, $h^{u}_{i,t}$ is the uplink channel power gain, and $N_0$ is the noise power spectral density. At each time t, we assume that the AP allocates orthogonal frequency channels with pre-allocated bandwidth $B^{u}_{i,t}$ to all transmitting users (i.e., those belonging to $S_t$). On the other side, at time t, the power spent by the edge server for computation of (9) is given by: $p^{r}_{t} = \kappa^{r} (f^{r}_{t})^{3}$ (24), where $f^{r}_{t}$ and $\kappa^{r}$ are the CPU frequency and the effective switched capacitance of the ES processor, respectively.
In this paper, our goal is to minimize the long-term average of the system power consumption, given by the sum of the devices and ES powers, which reads as: In the following section, we will formulate the proposed dynamic strategy for wireless network edge optimization, aimed at performing energy-efficient FL with guaranteed latency and learning performance requirements.
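A minimal Python sketch of the per-slot power model follows. It assumes the common cubic CPU power law κ f³ and an uplink power obtained by inverting R = B log2(1 + h p / (N0 B)); the function names and the numerical values are illustrative assumptions.

```python
import math

def cpu_power(kappa, frequency):
    """Dynamic CPU power under the assumed cubic law kappa * f**3."""
    return kappa * frequency ** 3

def uplink_power(rate, bandwidth, gain, noise_psd):
    """Transmit power needed to sustain `rate` (inverse of the Shannon formula)."""
    return bandwidth * noise_psd / gain * (2.0 ** (rate / bandwidth) - 1.0)

def shannon_rate(power, bandwidth, gain, noise_psd):
    """Achievable rate for a given transmit power (used here only as a check)."""
    return bandwidth * math.log2(1.0 + gain * power / (noise_psd * bandwidth))

def total_power(devices, kappa_server, f_server):
    """Sum of device (compute + transmit) powers and edge-server compute power."""
    p = cpu_power(kappa_server, f_server)
    for d in devices:
        p += cpu_power(d["kappa"], d["f_local"])
        p += uplink_power(d["rate"], d["bandwidth"], d["gain"], d["noise_psd"])
    return p

# Round trip: the power that yields a rate is recovered from that rate.
rate = shannon_rate(0.1, 1e6, 1e-10, 1e-20)
p_back = uplink_power(rate, 1e6, 1e-10, 1e-20)

device = {"kappa": 1e-27, "f_local": 1e9, "rate": rate,
          "bandwidth": 1e6, "gain": 1e-10, "noise_psd": 1e-20}
p_tot = total_power([device], kappa_server=1e-27, f_server=2e9)
print(p_back, p_tot)
```

With these numbers the device spends about 1 W on computation and 0.1 W on transmission, and the server about 8 W, which shows how quickly the cubic law penalizes high CPU frequencies.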

III. DYNAMIC OPTIMIZATION OF WIRELESS EDGE RESOURCES
We can now formulate the problem of dynamic resource allocation for FL. The aim is to find the optimal joint dynamic resource allocation of radio (i.e., the set S t of transmitting devices, uplink data rates {R u i,t } i∈St , quantization bits {b i,t } i∈St ) and computation (i.e., CPU cycles at devices {f l i,t } i∈St and at the server f r t ) resources to minimize the long-term average system power consumption in (25), with constraints on the average learning performance in (16)-(17), and the average latency in (21). Then, the dynamic resource allocation problem can be cast as: , and the expectations are taken with respect to the random channel states, whose statistics are supposed to be unknown. The constraints of (26) have the following meaning: (a) the average latency of SGD iterations does not exceed a predefined value L (although more sophisticated probabilistic or instantaneous constraints can be used [51], the average latency constraint (a) relaxes the resource allocation policy, avoiding excessive power consumption or unfeasible solutions in the case of bad channel conditions); (b) the average performance metric G t does not exceed a predefined value G; if G t represents an accuracy metric, the sign of the constraint should be reversed, i.e., the average accuracy must be greater than a certain target value; (c) the average convergence rate is constrained to be equal to α; finally, the constraints in X t impose that {b i,t } i∈St can take values only from a finite set B i of discrete quantization bits, and impose instantaneous bounds (e.g., budget constraints, minimum rates and CPU frequencies) on the resource variables {R u i,t } i∈St , {f l i,t } i∈St , f r t . In the sequel, we introduce a dynamic algorithmic framework to solve the long-term optimization problem (26).

A. Algorithmic Solution via Stochastic Optimization
We now introduce a method to transform (26) into a stability problem, building on the tools of stochastic Lyapunov optimization [43]. In particular, to deal with the long-term constraints (a)-(c), we introduce three virtual queues. The first one, used to impose (a), evolves as: where z is a positive step-size used to control the convergence speed of the algorithm. The second virtual queue, used to impose constraint (b), reads as: with q > 0. Finally, the virtual queue associated with constraint (c) is given by: with y > 0. Note that the virtual queue Y t reads slightly differently from the others, due to the fact that it is used to impose an equality constraint [43]. Interestingly, ensuring the mean-rate stability of the virtual queues in (27)-(29) is equivalent to satisfying the three corresponding constraints [43]. To this aim, we first define the Lyapunov function as: The Lyapunov function is a measure of the congestion state of the virtual queues, and is fundamental to define the drift-plus-penalty function [43]: The drift-plus-penalty function in (31) is the conditional expected change of U t over successive slots, plus a penalty term that weights the objective function of (26) through a parameter V. Now, if Δ p t is lower than a finite constant for all t, the virtual queues are stable and the optimal solution of (26) is asymptotically reached as V increases [43,Th. 4.8]. In practical scenarios with finite V values, the higher V is, the more importance is given to the objective function, rather than the virtual queue backlogs, thus pushing the solution toward optimality while still guaranteeing stability of the system. Thus, we proceed by minimizing an upper-bound of the drift-plus-penalty in (31), which reads as follows: where G t , α t and p tot t are defined in (16), (17) and (25), respectively; whereas, ζ is a positive finite constant. The derivations leading to (32) and the value of ζ can be found in the Appendix.
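The virtual queue mechanism can be sketched in a few lines of Python. The update rules below follow the standard Lyapunov optimization pattern (a projected accumulation of constraint violations for inequality constraints, and an unprojected one for the equality constraint) and a quadratic Lyapunov function; the exact forms, the step-size names `eps_z`, `eps_q`, `eps_y`, and the numerical values are assumptions.

```python
def update_queues(Z, Q, Y, latency, metric, conv_rate,
                  L_max, G_max, alpha_target,
                  eps_z=1.0, eps_q=1.0, eps_y=1.0):
    """One-slot update of the three virtual queues (assumed standard forms)."""
    Z_next = max(0.0, Z + eps_z * (latency - L_max))    # average latency <= L_max
    Q_next = max(0.0, Q + eps_q * (metric - G_max))     # average metric  <= G_max
    Y_next = Y + eps_y * (conv_rate - alpha_target)     # average rate == target
    return Z_next, Q_next, Y_next

def lyapunov(Z, Q, Y):
    """Quadratic measure of the congestion state of the virtual queues."""
    return 0.5 * (Z * Z + Q * Q + Y * Y)

# A slot that violates the latency bound, satisfies the accuracy bound,
# and converges faster (smaller alpha) than the target.
Z1, Q1, Y1 = update_queues(0.0, 0.0, 0.0,
                           latency=0.3, metric=0.1, conv_rate=0.90,
                           L_max=0.2, G_max=0.2, alpha_target=0.95)
print(Z1, Q1, Y1, lyapunov(3.0, 4.0, 0.0))
```

A queue grows only while its constraint is violated, so keeping the queues mean-rate stable forces the corresponding time averages to satisfy the constraints; the queue associated with the equality constraint can take both signs.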
Finally, using stochastic approximation arguments [43], we optimize (32) removing the expectation, thus obtaining the following deterministic problem at each time-slot t (omitting all the constant terms): where X t is the instantaneous feasible set of (26). Now, following [43], for a fixed V, solving (33) in each time slot guarantees that all virtual queues are mean-rate stable, so that constraints (a), (b), (c) of (26) are met. Furthermore, hinging on the concept of a C-additive approximation [43], if the per-slot solution comes within a finite constant C from the optimum, we have: where p tot,opt is the infimum time average power achievable by any policy that meets the required constraints, and ζ is the constant whose expression can be found in the Appendix, see (57). Of course, the higher C is, the higher the value of V needed to approach this asymptotic optimality, which in this case translates into higher queue backlogs. Based on the above concept, to further decrease complexity, we now solve (33) by replacing L t with an upper bound L̄ t given by obtained by applying the straightforward upper bound $\max_i x_i \le \sum_i x_i$, valid for nonnegative x i . Now, because of the structure of X t , (33) is a mixed-integer nonlinear optimization problem, which might be very complicated to solve. However, for any given {b i,t } N i=1 at time-slot t, it is easy to see that (33) is separable into three sub-problems that admit closed form solutions for the optimal uplink data rates, the optimal CPU frequency of devices, and the optimal CPU frequency of the edge server, respectively. In the sequel, we present the formulation and the solution of the three sub-problems.

B. Uplink Radio Resource Allocation
The uplink radio resource allocation sub-problem aims at optimizing the transmission rates {R u i,t } i∈St of each transmitting device at time-slot t, once the quantization bits {b i,t } i∈St have been fixed. From (25), (33) and (35), we obtain: where R max i,t is the maximum rate achievable using the maximum transmitted power, and R min i represents the minimum rate that a user should use in the case of transmission. Of course, in (36), there is an intrinsic admission control condition embedded in terms of feasibility. In particular, denoting the set of nodes for which problem (36) is feasible by: it clearly holds that the set S t of transmitting nodes must be selected as a subset of A t (S t ⊆ A t ). If problem (36) is feasible, it is also strictly convex with respect to the rates {R u i,t } i∈St , and admits a unique closed-form solution. In particular, the Lagrangian function of (36) reads as: where δ i and ξ i are the Lagrange multipliers associated with the constraints of (36) over the variable R u i,t , for i ∈ S t . Then, the Karush-Kuhn-Tucker (KKT) conditions of the strictly convex problem (36) are given by: for all i ∈ S t , where we used (23) for p u i,t . Now, exploiting the principal branch of the Lambert function W (·) [52], the KKT conditions (39)-(41) can be solved in closed form as: for all i ∈ S t , if problem (36) is feasible.
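To show how a Lambert-function solution arises, the Python sketch below solves a simplified per-device rate problem. It assumes the per-device objective Z · (payload / R) + V · p(R), with p(R) obtained by inverting the Shannon formula, and box constraints on R; this objective, the helper `lambert_w0`, and all parameter values are assumptions for illustration and may differ from (36) in scaling factors.

```python
import math

def lambert_w0(x, tol=1e-12):
    """Principal branch W(x) for x >= 0, via Newton's method on w * exp(w) = x."""
    w = math.log1p(x)  # starts at or above the root, so Newton descends monotonically
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol * (1.0 + abs(w)):
            break
    return w

def rate_cost(R, Z, V, payload_bits, bandwidth, gain, noise_psd):
    """Assumed per-device objective: weighted uplink latency plus weighted power."""
    power = bandwidth * noise_psd / gain * (2.0 ** (R / bandwidth) - 1.0)
    return Z * payload_bits / R + V * power

def optimal_rate(Z, V, payload_bits, bandwidth, gain, noise_psd, r_min, r_max):
    """Stationary point of rate_cost in closed form, projected onto [r_min, r_max]."""
    a = math.log(2.0) / bandwidth
    c = Z * payload_bits * gain / (V * noise_psd * math.log(2.0))
    r = (2.0 / a) * lambert_w0(0.5 * a * math.sqrt(c))
    return min(max(r, r_min), r_max)

params = dict(Z=5.0, V=1.0, payload_bits=800, bandwidth=1e6,
              gain=1e-10, noise_psd=1e-20)
r_star = optimal_rate(r_min=1e5, r_max=1e7, **params)
print(r_star, rate_cost(r_star, **params))
```

Setting the derivative of the assumed cost to zero gives R² exp(R ln2 / B) = c, which becomes (aR/2) exp(aR/2) = (a/2)√c with a = ln2 / B, and the Lambert function inverts exactly this form. Because the cost is convex in R, clipping the stationary point to the rate interval yields the constrained minimizer.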

C. Local Computing Resources Allocation
The local computing resource allocation problem aims at optimizing the CPU frequencies {f l i,t } i∈St of the transmitting (and computing) devices. From (33), (35) and (25), for a given time t, it is easy to see that the local computing resource allocation problem decouples over the computing devices and over the iterations within a slot. Thus, we obtain the following sub-problem at each device: for all i ∈ S t . Problem (43) is strictly convex and enjoys a simple closed-form solution. Indeed, solving the KKT conditions, it is immediate to obtain: Note that (44) implicitly contains the admission control condition of the radio resource problem (36), i.e., a local computation is needed only if data can be subsequently uploaded to the AP. Thus, if i ∉ S t , then R u i,t = 0 and also f l i,t = 0.
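As an illustration of how such a closed form arises, the sketch below solves a simplified per-device problem. The cubic power model κ·f³, the cost V·κ·f³ + Z·c/f, and all names are assumptions made for this example; the exact expression (44) may differ.

```python
def optimal_cpu_frequency(queue, cycles, kappa, v, f_min, f_max):
    """Minimize v*kappa*f**3 + queue*cycles/f over [f_min, f_max] (assumed cost).

    Setting the derivative to zero gives 3*v*kappa*f**2 = queue*cycles/f**2,
    i.e. f = (queue*cycles / (3*v*kappa)) ** 0.25, which is then clipped to
    the feasible interval (valid because the cost is convex for f > 0).
    """
    f_unconstrained = (queue * cycles / (3.0 * v * kappa)) ** 0.25
    return min(max(f_unconstrained, f_min), f_max)

def cpu_cost(f, queue, cycles, kappa, v):
    """Assumed per-slot cost used above: weighted power plus weighted delay."""
    return v * kappa * f ** 3 + queue * cycles / f
```

A larger latency queue raises the frequency, while a larger V (more weight on power) lowers it, which matches the usual drift-plus-penalty behavior.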

D. Remote CPU Frequencies Allocation
The remote computing resource allocation problem aims at optimizing the CPU frequency f r t of the ES. From (33), (35) and (25), for a given S t at time t, the remote computing resource allocation problem writes as: Problem (45) is a strictly convex optimization problem, which enjoys a simple closed form solution. Then, solving the KKT conditions, it is straightforward to see that the optimal remote CPU cycle frequency is given by

E. Online Selection of Transmitting Users, Quantization Bits, and Wireless Edge Resources
In Sections III-B-III-D, we have derived the optimal allocation of edge resources as a function of the quantization bits {b i,t } N i=1 . Thus, exploiting (42), (44), and (46), the objective of (33), say Δ p t , can now be expressed as a function of only {b i,t } N i=1 . In principle, to find the optimal solution of (33) at each time-slot t, one should compute the optimal allocation of edge resources for all possible combinations of {b i,t } N i=1 , evaluate the corresponding objective function Δ p t in (33), and then select the one that yields the lowest value. This approach faces a main challenge: even if the resource allocation for a single choice of {b i,t } N i=1 is efficient (cf. (42), (44), and (46)), the overall search procedure still has a complexity that grows exponentially with the number N of devices and the maximum cardinality of the set of quantization bits (i.e., max i |B i |). For this reason, in the sequel we introduce some simplifications to reduce the complexity of solving (33), while still achieving good performance. In particular, instead of performing an exhaustive search over all possible combinations, we use an iterative greedy approach that, starting from the empty set of transmitting nodes, iteratively adds the most convenient devices, selecting jointly the best number of quantization bits and the associated edge resources in (42), (44), and (46). The method keeps adding devices from the admissible set A t in (37) as long as the resulting value of the objective Δ p t in (33) decreases, and stops when there is no more incentive (in terms of reduction of the objective function) in letting other nodes transmit any bit of information. Of course, if A t is empty, the t-th iteration of the FL algorithm in (9) does not take place. Such a greedy method drastically reduces the complexity of the procedure, which becomes polynomial in N and max i |B i |.
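The greedy search described above can be sketched as follows. This is only an illustration: the `objective` callable stands in for Δ p t evaluated with the closed-form resources, the toy cost at the bottom is invented, and forcing the first device to be added (by starting from an infinite incumbent value) is an assumption reflecting the requirement that at least one device transmits.

```python
import math

def greedy_selection(admissible, bit_choices, objective):
    """Greedily build {device: bits}; `objective` maps such a dict to a cost.

    Each round tries every remaining device with every bit choice and keeps
    the single best addition; the search stops when no addition lowers the cost.
    """
    selected = {}
    best_value = math.inf  # assumption: the first device is always admitted
    remaining = set(admissible)
    while remaining:
        candidate = None
        for dev in sorted(remaining):
            for bits in bit_choices[dev]:
                trial = dict(selected)
                trial[dev] = bits
                value = objective(trial)
                if value < best_value:
                    best_value, candidate = value, (dev, bits)
        if candidate is None:
            break
        selected[candidate[0]] = candidate[1]
        remaining.discard(candidate[0])
    return selected, best_value

# Toy usage with a hypothetical cost: per-bit power cost minus a learning gain.
costs = {0: 1.0, 1: 2.0, 2: 50.0}

def toy_objective(sel):
    total_bits = sum(sel.values())
    return sum(costs[d] * b for d, b in sel.items()) - 10.0 * math.log(1 + total_bits)

selected, value = greedy_selection([0, 1, 2], {d: [2, 4, 8] for d in costs}, toy_objective)
```

Each round costs at most N · max i |B i | objective evaluations and there are at most N rounds, which is where the polynomial complexity comes from.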
Then, once the resources have been selected, the federated learning algorithm is updated as in (9), and, finally, the virtual queues Z t , Q t and Y t are updated as in (27), (28) and (29), respectively. The overall dynamic optimization procedure for adaptive federated learning is illustrated in Algorithm 1.
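For readers unfamiliar with virtual queues, the sketch below shows the generic update commonly used in Lyapunov optimization for a long-term average constraint. The precise forms of (27)-(29) are not reproduced here; the function name, the `step` parameter, and the example numbers are assumptions.

```python
def update_virtual_queue(queue, value, target, step=1.0):
    """Generic virtual-queue update for an average constraint value <= target.

    The queue grows when the instantaneous value exceeds the target and
    drains (never below zero) otherwise.
    """
    return max(0.0, queue + step * (value - target))

# Hypothetical latency trace (ms) against a 20 ms average target.
queue = 0.0
trace = []
for latency in [30.0, 25.0, 10.0, 5.0, 20.0]:
    queue = update_virtual_queue(queue, latency, 20.0)
    trace.append(queue)
```

A queue that stays bounded over time indicates that the corresponding average constraint is being met, which is the mean-rate stability argument invoked in Section III-A.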
Remark 1: Of course, there are no guarantees that the proposed greedy procedure in Algorithm 1 finds the optimal solution of (33), inevitably representing only an approximation of it, so that our approach can be, also in this case, interpreted as a C-additive approximation [43, p. 59], which entails inexact solutions (with bounded error) of the drift-plus-penalty method in (33) at each iteration t, as anticipated in Section III-A. In our case, since the objective and the feasible set of (33) are both bounded for all t, the proposed greedy approach clearly leads to a valid C-approximation. In Section V, we will numerically assess the performance of the proposed dynamic resource allocation strategy for adaptive federated learning at the wireless network edge.

IV. DATA-DRIVEN RESOURCE ALLOCATION FOR ADAPTIVE FEDERATED LEARNING
Algorithm 1
[...] (37).
if A t = ∅ then continue
else
S2. Find S t , the quantization bits {b i,t } i∈St , and the edge resources through the greedy procedure: [...] (42); f l j * ,t as in (44); f r t as in (46); else Flag = 0 end end
S4. Update the federated learning algorithm as in (9);
S5. Update the virtual queues Z t , Q t and Y t as in (27), (28) and (29), respectively;
end end

In the previous section, we have proposed a model-based algorithm for dynamic resource allocation to enable FL at the wireless edge, which exploits closed-form expressions for the learning performance and convergence rate metrics (e.g., (13) and (15)). Indeed, some convex learning tasks admit closed-form expressions for different learning metrics, allowing us to use Algorithm 1. However, in several other cases (e.g., non-convex learning tasks such as deep neural network training), such expressions are not available. Therefore, in this section, we extend the previous strategy by incorporating an online mechanism that estimates the learning performance and the convergence rate in a fully data-driven fashion, in order to drive the dynamic resource allocation based on Lyapunov optimization. An appealing feature of the proposed data-driven approach is that it does not necessarily rely on SGD recursions as in (9), but also works with more sophisticated gradient-based algorithms such as, e.g., Adam, Adagrad, SGD with momentum, etc. Now, let us assume that the agents collect and process batches of data of size B t at time-slot t. For simplicity, we assume that the batch size B t is the same for all devices, and that it can be selected from a set C of discrete values at each time slot t. Then, assuming that N l i in (18) is the number of CPU cycles needed to compute the local gradient from one data unit, if we have batches of B t data, the local processing time in (18) will be simply multiplied by a factor B t .
Furthermore, considering a general gradient-based algorithm (e.g., Adam), the remote processing time is obtained as in (20), simply generalizing the meaning of the constant O that, starting from the received local gradient, represents the number of CPU cycles necessary to perform a single step of the gradient-based algorithm for each device. Of course, the overall remote processing time is still proportional to the number of transmitting devices. To estimate the learning performance online, we assume that either the ES is provided with a validation set T or, in the absence of a validation set, the agents can sense an additional batch T of data at each time-slot, compute their local learning performance and send it (one scalar) to the server for the computation of the overall learning performance. Then, two task-dependent functions are introduced to estimate online the learning performance G t and the convergence rate α t , respectively. As an example, let us consider a classification task, for which the validation (or batch) accuracy and its moving average with length 2K can be used to estimate learning performance and convergence rate as: where y t is the prediction for the corresponding data unit at time-slot t. Clearly, different metrics can be used based on the task and its complexity, e.g., the ratio of gradient norms at successive time-slots can be exploited for tasks whose learning accuracy/error could be difficult or inefficient to estimate [23]. Then, we propose to exploit the performance estimates in (47)-(48) to drive the Lyapunov-based resource allocation. In particular, we introduce two new virtual queue updates, which are specific to this data-driven approach. The first virtual queue reads as: and has the goal of driving the estimated learning performance G t above the target G. The second queue aims at controlling the convergence rate of the FL algorithm, and is updated as: where ε y,t is an adaptive (i.e., time-dependent) step-size.
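The online estimates described above can be sketched as follows. The accuracy estimate is the plain fraction of correct predictions; the convergence-rate estimate compares the mean accuracy over the latest K slots with the mean over the preceding K slots, which is only one possible reading of a moving average of length 2K. The exact expressions (47)-(48) may differ, and all names are hypothetical.

```python
from collections import deque

def batch_accuracy(predictions, labels):
    """Fraction of predictions matching the labels of the validation batch."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

class ConvergenceRateEstimator:
    """Assumed estimator: recent-K mean accuracy minus the previous-K mean."""

    def __init__(self, k):
        self.k = k
        self.history = deque(maxlen=2 * k)  # keeps the last 2K accuracies

    def update(self, accuracy):
        self.history.append(accuracy)
        if len(self.history) < 2 * self.k:
            return 0.0  # not enough history yet (a design choice of this sketch)
        values = list(self.history)
        older = sum(values[: self.k]) / self.k
        recent = sum(values[self.k:]) / self.k
        return recent - older
```

Returning zero until the window is full is a simplification; an implementation could instead use a shorter window during the first slots.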
The queue evolution defined in (50) is motivated by the fact that, if the distribution of the data is stationary, there is no need to overshoot the convergence rate after the target level of learning performance is reached. Thus, when G t ≥ G (i.e., the estimated learning performance is greater than the target), the queue Y t is set to zero and no longer has an impact on the convergence rate and the resource allocation. Moreover, the dynamic step-size ε y,t is chosen to adapt the update speed of the queue Y t depending on the distance of G t from the target G. A possible choice is ε y,t = ε y |G − G t |. The rationale is to prevent the queue from unnecessarily impacting the resource allocation when the learning performance is approaching G, because the target convergence rate α t is no longer achievable at that point. At the same time, non-stationary behaviors can be detected by observing a sharp deterioration in G t , which reactivates the virtual queue Y t , thus boosting the learning process again with the desired convergence rate. Moreover, an adaptive step-size ε z,t can also be exploited for the latency queue in (27) to speed up the adaptability of the method.
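The behavior of the convergence-rate queue described in this paragraph can be sketched as below. The reset to zero once the target performance is reached and the step-size proportional to the distance from the target follow the text; the max-with-zero drift form and all names are assumptions, since (50) is not reproduced here.

```python
def update_convergence_queue(queue, g_est, g_target, alpha_est, alpha_target, eps_y):
    """Assumed form of the data-driven queue that controls the convergence rate.

    - If the estimated performance already meets the target, the queue is reset.
    - Otherwise the queue grows when the estimated rate is below its target,
      with a step-size proportional to the distance |g_target - g_est|.
    """
    if g_est >= g_target:
        return 0.0
    step = eps_y * abs(g_target - g_est)
    return max(0.0, queue + step * (alpha_target - alpha_est))
```

Because the step-size shrinks as the estimate approaches the target, the queue stops pushing for faster convergence near steady state, and it is reactivated automatically if the estimate later drops, e.g., after a change in the data distribution.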
The online estimates of G t and α t are useful for the virtual queue updates in (49) and (50), but they are not explicitly (i.e., mathematically) related to the number of quantization bits and to the batch size, which must be optimized and adapted to drive the learning performance and the convergence rate. Thus, the control action might still not be easily implementable, due to the lack of closed-form expressions for the performance metrics. One possible solution to this issue builds on the following assumptions [10], which are consistently verified both from a theoretical and a numerical point of view (practical examples follow in Section V).
Assumption 3: G t is a monotone non-decreasing function of the quantization bits {b i,t } i∈St .
Assumption 4: α t is a monotone non-decreasing function of the quantization bits {b i,t } i∈St and of the batch size B t .
Assumption 3 hinges on the fact that a finer representation of the data generally leads to better learning performance. At the same time, Assumption 4 exploits the fact that, by increasing the batch size and the number of quantization bits, the (stochastic) gradient estimates in (2) get better, thus improving the overall convergence rate. Then, under Assumption 3 and Assumption 4, we propose to exploit two surrogate functions, one for G t and one for α t , which respectively approximate their non-decreasing behavior with respect to the quantization bits {b i,t } i∈St and the batch size B t . Of course, there are several possible surrogates that we can exploit, but the best choice depends on the specific performance metric that we need to approximate (e.g., classification accuracy, mean-squared error, etc.). Examples will be given in Section V-B, for the case of deep neural network training. The rationale underlying the choice of the surrogates comes again from the concept of C-additive approximation [43] of the drift-plus-penalty method in (33), which makes it possible to use inexact updates of the algorithm at each iteration, provided that the approximation error can be bounded within a finite error C. Then, at a given time slot t, exploiting (49), (50), and the surrogate functions in (33), we solve the following deterministic problem: where Z t = X t ∪ {B t ∈ C}, with C denoting the discrete set of possible choices for the batch size B t . We solve problem (51) as in the previous case, slightly modifying Algorithm 1, as illustrated in Algorithm 2. Essentially, Algorithm 2 adds a further selection step for the batch size B t to the greedy procedure of Algorithm 1. This is done with a small additive complexity, since the number of selectable batch sizes is assumed to be small (e.g., 3 or 4 possibilities). The main steps of the proposed data-driven approach are the same as in Algorithm 1, with the difference that the virtual queues Q t and Y t in (28)-(29) are replaced by their data-driven counterparts in (49)-(50). This data-driven strategy will be numerically assessed in Section V.

Algorithm 2
[...] (37).
if A t = ∅ then continue
else
S2. Find S t , the quantization bits {b i,t } i∈St , the batch size B t , and the edge resources through the greedy procedure: [...] (42); f l j * ,t as in (44); f r t as in (46);
[...] Update the federated learning algorithm as in (9), or according to a tailored gradient-based optimizer;
S4. Update the virtual queues Z t , Q t and Y t as in (27), (49) and (50), respectively;
end end
Remark 2: Interestingly, Algorithms 1 and 2 implement a two-step straggler mitigation at each time-slot, which prevents worst-case devices from hindering the performance of the proposed strategy. First, there is a radio admission control step defined in (37), which selects the set A t of agents that can transmit with a minimum rate R min , thus discarding the agents experiencing bad wireless channel conditions (and, consequently, high communication latency). Then, starting from the set A t of potential transmitters, Algorithms 1 and 2 choose the set S t ⊆ A t of transmitting agents in order to minimize the objectives of the per-slot optimization problems in (33) and (51), respectively. Since the objectives of (33) and (51) jointly encompass power, latency, and learning performance of the FL task, this second straggler mitigation step selects the subset S t in order to strike the best trade-off between these three fundamental aspects of the problem.
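The first, radio-level mitigation step can be sketched as a simple feasibility filter. The sketch assumes that a device is admissible when the Shannon rate achievable at maximum transmit power is at least the minimum required rate; the exact definition of A t in (37) may include further conditions, and all names and numbers here are illustrative.

```python
import math

def max_rate(gain, p_max, bandwidth, noise_psd):
    """Shannon rate (bit/s) achievable at maximum transmit power."""
    return bandwidth * math.log2(1.0 + gain * p_max / (noise_psd * bandwidth))

def admissible_set(gains, p_max, bandwidth, noise_psd, r_min):
    """Indices of devices whose maximum rate reaches the minimum rate r_min."""
    return [
        i for i, h in enumerate(gains)
        if max_rate(h, p_max, bandwidth, noise_psd) >= r_min
    ]

# Hypothetical example: one good channel and one deeply faded channel.
admitted = admissible_set([1e-9, 1e-16], p_max=0.1, bandwidth=1e5,
                          noise_psd=4e-21, r_min=1e5)
```

The second step, i.e., choosing S t inside the admissible set, is then left to the greedy search over devices and quantization bits.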

V. NUMERICAL RESULTS
In this section, we assess the performance of the proposed method, considering both model-based and data-driven scenarios. In particular, we will exploit the model-based approach for a least-mean squares regression task (in Section V-A), and the data-driven approach for a classification task aimed at training a deep convolutional neural network (in Section V-B). We consider a scenario with N = 9 devices and one AP equipped with an edge server, as illustrated in Fig. 1. We set the radio and computation parameters as follows: N 0 = −174 dBm/Hz, i,u is assigned equally splitting the overall bandwidth among the devices transmitting at time t. Moreover, for all i, and t. The channels are generated using the ABG model [53], with a carrier frequency of 23 GHz and adding a Rayleigh fading with unit variance. The LMS results are obtained with MATLAB, using a PC with an Intel Core i7-7700HQ CPU at a frequency of 2.80 GHz. The data-driven results are obtained using Python and the JAX framework, exploiting an NVIDIA Tesla K80 GPU.

A. Federated Least-Mean Squares
For this learning task, the input data x i,t ∈ R 20 is related to the corresponding output via a linear model y i,t = x T i,t w o + v i,t , at each time instant t. In this context, the SGD algorithm in (2) boils down to a federated LMS adaptive algorithm aimed at learning (and tracking over time) the vector w o [42]. The device locations are chosen at random such that the distance of each device from the AP is sampled from a uniform distribution in the interval [70,130], for all i = 1, . . . , N . We assume that input data x i,t are zero-mean random vectors with covariance matrix σ 2 x ,i I 20 with σ 2 x ,i = 1 for all i. The observation noise v i,t is Gaussian, zero-mean with variance σ 2 v ,i uniformly selected in the interval [0, 2 × 10 −3 ], for all i = 1, . . . , N , independent from the data and among devices. Also, the overall bandwidth is 100 kHz, O = 2· 10 3 , N l i = 5 · 10 6 . The step-size μ is set to 0.015. The learning performance is measured in terms of MSD as in (13), and the convergence rate is given by (15). As a first result, in Fig. 2, we illustrate the learning curve of the FL algorithm in (2), obtained for different values of learning performance constraints G = MSD, while fixing the convergence rate to α = 0.99 and the latency constraint L = 20 ms. The results are averaged over 50 independent simulations, setting empirically the step-sizes ε z , ε q , ε y of the virtual queues to obtain the fastest convergence. From Fig. 2, we can notice how the proposed optimization method is able to guarantee the prescribed performance in terms of convergence rate α and steady-state accuracy MSD. Then, in Fig. 3, we show the histogram of quantization bits usage for different values of MSD, fixing α = 0.99 and L = 20 ms. From Fig. 3, we notice how the method requires on average more quantization bits to meet a stricter requirement on learning performance, due to the finer required representation of transmitted data. Finally, in Fig. 4 (a) we illustrate the trade-off curve between average latency and TX power consumption (i.e., the sum of powers transmitted by all users) achieved by the proposed method, considering different values of MSD and fixed α = 0.99. Each point in the curves of Fig. 4 (a) represents a different value of V, whose magnitude grows from right to left. From Fig. 4 (a), we see that, as V increases, the method reduces the transmission power down to a limit value (i.e., the optimum) that still guarantees the target latency constraint. As expected, the trade-off gets worse when imposing a stricter requirement on learning performance, due to the larger power (and number of bits) necessary to obtain the target performance. Also, from a computation point of view, in Fig. 4 (b) we illustrate the average remote and local processing power consumption vs V, fixing MSD = −40 dB, α = 0.99 and L = 20 ms. As expected, the proposed method is able to reduce all the single contributions of the overall power consumption as V increases.
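To give a feel for the learning task itself, the following self-contained sketch simulates a plain federated LMS recursion with the data model of this section (dimension 20, unit-variance inputs, N = 9, μ = 0.015). It deliberately omits quantization, scheduling, and the wireless model, assumes that all devices transmit at every slot and that the server averages their gradients, and uses a fixed noise variance; it is therefore not the experiment reported in the figures.

```python
import random

def federated_lms(num_devices=9, dim=20, mu=0.015, num_slots=1000,
                  noise_var=1e-3, seed=0):
    """Simulate averaged-gradient federated LMS and return the MSD per slot."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # unknown vector w_o
    w = [0.0] * dim                                     # server estimate
    msd = []
    for _ in range(num_slots):
        grad = [0.0] * dim
        for _ in range(num_devices):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            noise = rng.gauss(0.0, noise_var ** 0.5)
            y = sum(xi * wi for xi, wi in zip(x, w_true)) + noise
            err = y - sum(xi * wi for xi, wi in zip(x, w))
            for k in range(dim):
                grad[k] += err * x[k] / num_devices     # average over devices
        w = [wk + mu * gk for wk, gk in zip(w, grad)]
        msd.append(sum((a - b) ** 2 for a, b in zip(w, w_true)))
    return msd

msd_curve = federated_lms()
```

Plotting `msd_curve` on a logarithmic scale shows the geometric transient followed by a noise-limited steady state, which is the kind of learning curve that the MSD and convergence-rate constraints act upon.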

B. Federated Deep Neural Network Training
In this section, we consider a learning task aimed at training a classifier based on deep convolutional neural networks (CNN). We exploit a CNN made of four convolutional layers with 32, 32, 10 and 10 filters, respectively, with final flatten and dense layers; SAME padding, ReLU non-linearities and Batch Normalization are applied after each convolutional layer, and a final Softmax non-linearity is applied after the flatten and dense layers. The filter dimensions are 5 × 5, 5 × 5, 3 × 3, and 3 × 3, respectively. To train the CNN, we use the MNIST dataset [54], which is made of 28 × 28 grayscale images of handwritten digits divided into 10 classes. The training data is composed of 6 × 10 4 examples, while the test set is made of 10 4 elements. The loss is the well-known cross-entropy, and the model is trained using a federated ADAM optimizer [55]. The device locations are random and the distance of each device from the AP is sampled from a uniform distribution in the interval [20,80], for all i = 1, . . . , N . Also, the overall bandwidth is 100 MHz, O = 10 6 , and N l i = 10 7 . The ADAM step-size is set to 0.008, with forgetting factors β 1 = 0.9, and β 2 = 0.99. For this experiment, we use the performance estimate G t in (47), and α t in (48) with K = 10. Also, as a surrogate function for the accuracy metric, we exploit: where σ(·) is the logistic sigmoid function, and Median{·} represents the median value. Clearly, (52) satisfies Assumption 3. Regarding the convergence rate, we use instead the surrogate α t = B t Σ i∈St b i,t , which of course satisfies Assumption 4. Moreover, C = [1, 3, 7, 14] for Fig. 5 and C = [1, 3, 7] for the others. As a first result, in Fig. 5, we illustrate the temporal behavior of the estimated accuracy of the FL algorithm, obtained for different values of the convergence rate α, while fixing the accuracy to G = 0.8 and the latency constraint L = 50 ms. As we can notice from Fig. 5, the proposed data-driven method is able to reach the desired learning performance, while controlling the convergence rate. Similarly, in Fig. 6 (a), we illustrate the temporal behavior of the estimated accuracy, obtained for different values of the performance constraint G, while fixing the convergence rate to α = 0.2 and the latency constraint L = 50 ms. Then, in Figs. 6 (b) and 6 (c) we show the temporal evolution of the overall latency and the overall uplink transmission power consumption, respectively, corresponding to Fig. 6 (a) for G = 0.8 and G = 0.9. As we can notice from Fig. 6, the proposed method keeps the latency around the requirement L, while driving the accuracy toward the target value G during the steady-state phase. Interestingly, from Fig. 6 (c), we notice how a significant power saving can be achieved at steady-state if the accuracy requirement is not very strict (i.e., for G = 0.8), thanks to the impact of the adaptive step-size in (50). Furthermore, the results empirically support the choice of the adopted surrogate functions, and the consistency of Assumptions 3 and 4. Comparisons: Even though there are several works on resource allocation for FL, our problem formulation, jointly encompassing communication, computation and learning aspects of FL in a dynamic and adaptive fashion, is novel, and does not come from a straightforward modification of existing results. Thus, it is extremely difficult to provide fair comparisons with other techniques available in the literature. However, we decided to assess the advantages of our joint strategy by comparing it with (sub-)procedures involving the optimization of only single aspects.
We consider the following strategies for comparison: i) Equal Rate Policy (referred to as Equal Rate): All the agents always transmit with a fixed number of quantization bits (to match a certain learning accuracy), the remote and local frequencies are fixed, and the uplink rate is equally adapted for all the agents to match the latency constraint; ii) Fixed Scheduling & Quantization Bits Policy (referred to as Fixed S&B): All the agents always transmit with a fixed number of quantization bits (to match a certain learning accuracy), whereas remote frequency, local frequencies and uplink rates are optimized via Lyapunov optimization; iii) Joint Optimization with Random Scheduling and Quantization Bits (referred to as Random Joint): It is our joint procedure, but the scheduling and quantization bits are not assigned via the proposed greedy method, but rather with a random search of comparable complexity, meaning that multiple random realizations of the variables are checked to select the best one. In Fig. 7, we illustrate the trade-off curve between average total power consumption and average latency L for the aforementioned strategies, referring to our procedure as Proposed. The values are obtained by fixing the accuracy threshold Ḡ = 0.8, the frequencies for the Equal Rate strategy to the average frequencies of our procedure, and the quantization bits to six for the Equal Rate and Fixed S&B strategies (to tightly match the accuracy constraint). From Fig. 7, the proposed method results in a better trade-off with respect to the other strategies, i.e., in a significant power saving for any given delay. Moreover, we empirically observed that our strategy is the only one able to effectively control also the convergence rate.
Although the proposed comparisons do not (and cannot) refer to specific other works, they follow the optimization approach (in terms of optimization variables) of other works, e.g., [39] (they optimize only scheduling and power allocation), [36] (they optimize quantization schemes), [22] (they optimize scheduling, power allocation and RB-OFDMA allocation).
Adaptation in non-stationary conditions: Finally, in Fig. 8 we illustrate the temporal behavior of the estimated accuracy in a non-stationary scenario, in order to highlight the adaptation capabilities of the proposed method. In particular, the MNIST dataset is divided into two sub-datasets of 5 classes each; then, the architecture is trained for the first 180 time-slots with one of the two sub-datasets and for the remaining time-slots with the other one (the last dense layer is obviously reduced to a 5-dimensional output). This introduces a non-stationary behavior of the data distribution. Then, at time slot 401, we change the accuracy requirement from G = 0.8 to G = 0.9, introducing a further level of non-stationarity. The results are averaged over 10 independent simulations. From Fig. 8, we can notice that our dynamic strategy is able to react promptly to changes in both the data distribution and the accuracy requirement, exhibiting powerful learning and adaptation capabilities in a fully data-driven fashion.

VI. CONCLUSION
In this paper, we have proposed a dynamic resource allocation strategy enabling adaptive federated learning at the wireless network edge. The strategy dynamically minimizes the power expenditure of the system, while guaranteeing target learning performance and latency constraints. The proposed method builds on stochastic Lyapunov optimization, which leads to low-complexity procedures for the resource allocation at each time slot, without requiring a-priori knowledge of wireless channel statistics. The approach is valid both for a model-based setting, where performance metrics can be evaluated in closed form, and for a data-driven setting, where performance is estimated online from streaming data. Several numerical results assess the performance of the proposed strategy over both synthetic and real data. Future research directions include model-based approaches for the non-convex learning scenario, where theoretical expressions for the convergence rates to (local) optimality can be used to control the performance of FL.

APPENDIX
Let us present the derivations leading to the upper bound in (32). In particular, considering Z t defined in (27) and , we have [43]: From (26), since we assume a minimum uplink data rate R min i and a minimum local CPU frequency f min i for every device i ∈ S t , and a minimum server CPU clock frequency f r ,min , the term L t in (53) is bounded for all t by a finite constant L SGD,max , i.e., Applying the same arguments to Q t defined in (28), and exploiting the upper-bound G t ≤ G max (which holds for any suitable performance metric), we obtain: This last condition holds due to (16) and the fact that we impose N i=1 a i,t ≥ 1 (cf. (26)). Finally, let us consider the virtual queue Y t defined in (29). Although Y t presents a different evolution from the other virtual queues, we can still apply the same arguments and, exploiting the upper-bound α t ≤ α max (which holds for any metric of convergence rate), we obtain: Finally, plugging (54), (55) and (56) into (31), we derive the upper bound in (32), i.e., where ζ is a finite positive constant that reads as