Lazy Lagrangians for Optimistic Learning With Budget Constraints

We consider the general problem of online convex optimization with time-varying budget constraints in the presence of predictions for the next cost and constraint functions, which arises in a plethora of network resource management problems. A novel saddle-point algorithm is designed by combining a Follow-The-Regularized-Leader iteration with prediction-adaptive dynamic steps. The algorithm achieves $\mathcal O(T^{(3-\beta)/4})$ regret and $\mathcal O(T^{(1+\beta)/2})$ constraint violation bounds that are tunable via parameter $\beta \in [1/2, 1)$ and have constant factors that shrink with the predictions' quality, eventually achieving $\mathcal O(1)$ regret for perfect predictions. Our work extends the seminal FTRL framework to this new OCO setting and outperforms the respective state-of-the-art greedy-based solutions, which naturally cannot benefit from predictions, without imposing conditions on the (unknown) quality of the predictions, the cost functions, or the geometry of the constraints, beyond convexity.


I. INTRODUCTION
The online convex optimization (OCO) framework introduced in [1] is employed to solve various learning problems in networks, ranging from spam filtering, to data caching [2], network routing [3] and flow control [4], among others. At each round $t$ an algorithm selects an action $x_t$ from a convex set $\mathcal X \subset \mathbb R^N$ and incurs cost $f_t(x_t)$, where the convex function $f_t : \mathcal X \to \mathbb R$ is revealed after $x_t$ is decided. The algorithm's performance is measured using the metric of regret,
$$R_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^\star), \tag{1}$$
which quantifies the difference of the total cost from that of the best action selected with hindsight, $x^\star \in \arg\min_{x\in\mathcal X} \sum_{t=1}^{T} f_t(x)$. The goal is to select actions $\{x_t\}$ that ensure sublinear regret, i.e., $R_T = o(T)$.
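To make the regret metric concrete, the following sketch (our own illustration, not part of the paper) runs projected online gradient descent on simple quadratic costs over a Euclidean ball and evaluates $R_T$ against the best fixed action in hindsight; the function names and the $1/\sqrt t$ step size are assumptions.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ogd_regret(costs, dim, radius=1.0):
    """Run projected online gradient descent on quadratic costs
    f_t(x) = ||x - c_t||^2 and return the regret R_T."""
    x = np.zeros(dim)
    actions = []
    for t, c in enumerate(costs, start=1):
        actions.append(x.copy())
        grad = 2.0 * (x - c)                             # gradient of f_t at x_t
        x = project_ball(x - grad / np.sqrt(t), radius)  # diminishing 1/sqrt(t) step
    # best fixed action in hindsight for these quadratics: projected mean of c_t
    x_star = project_ball(np.mean(costs, axis=0), radius)
    f = lambda x, c: float(np.sum((x - c) ** 2))
    return sum(f(a, c) for a, c in zip(actions, costs)) - sum(f(x_star, c) for c in costs)
```

Since $x^\star$ minimizes the comparator sum, the returned regret is always non-negative, and with the diminishing step it grows sublinearly in $T$.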
A practical extension of this setting is the constrained OCO framework, where the actions must satisfy long-term constraints of time-varying functions $g_t(x) = \big(g_t^{(1)}(x), g_t^{(2)}(x), \ldots, g_t^{(d)}(x)\big) \preceq 0$, which are unknown when $x_t$ is decided. In this case we are additionally interested in achieving sublinear total constraint violation, $V_T = o(T)$, where
$$V_T = \Big\| \Big[ \sum_{t=1}^{T} g_t(x_t) \Big]_+ \Big\|. \tag{2}$$
Constrained OCO algorithms have applications in the control of capacitated communication systems; various network queuing problems [5]; and network management with multiple constraints and performance criteria [6]. Nevertheless, these problems are notoriously hard to tackle. In particular, [7] showed that no algorithm can achieve sublinear regret and constraint violation relative to the ideal benchmark set
$$\mathcal X_T^{\max} = \Big\{ x \in \mathcal X : \sum_{t=1}^{T} g_t(x) \preceq 0 \Big\}.$$
An aspect that has received less attention, however, is whether constrained OCO algorithms can be assisted by predictions for the next-round functions $f_t$ and $g_t$. Such information can be provided by a pre-trained model that uses incomplete data and hence cannot be fully trusted, yet can still assist the online algorithm. Leveraging predictions to improve learning algorithms is attracting increasing interest and has many practical applications, e.g., in data caching [17]; online rent-or-buy problems [18]; and in scheduling algorithms [19], among other areas. In this context, a key challenge is that the predictions might exhibit time-varying and unknown accuracy, which, furthermore, may vary across the cost and constraint functions. This confounds their incorporation in online learning algorithms and raises the question: how much can predictions improve the performance of constrained OCO algorithms, and how can we accrue these benefits in the presence of inaccurate, potentially even adversarial, predictions? Our goal is to tackle this question by developing a framework that brings together online learning [20] and data-driven (prediction-based) network management solutions, Fig. 1.

A. Background and Related Work
Early works studying the impact of predictions include [21] and [22], which considered linear costs $c_t = \nabla f_t(x_t)$ and predictions $\tilde c_t$ with guaranteed correlation $\tilde c_t^\top c_t \ge \alpha \|c_t\|^2$. These predictions improve the regret from $\mathcal O(\sqrt T)$ to $\mathcal O(\log T)$. Reference [23] considered the case when at most $B$ of the predictions fail the correlation condition and provided an $\mathcal O\big(((1+\sqrt B)/\alpha)\log(1+T-B)\big)$-regret algorithm, which was further extended to combine multiple predictors [24]. However, these prior models assume $\mathcal X$ is time-invariant. A different line of works uses adaptive regularizers and defines prediction errors $\varepsilon_t = c_t - \tilde c_t$ to obtain $\mathcal O\big(\sqrt{\sum_t \|\varepsilon_t\|^2}\big)$ regret bounds [25], [26]. We adapt these methods to time-varying constraints ($x_t \in \mathcal X_T$), where we incorporate predictions for the cost and constraint vectors.
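The optimistic update behind these $\mathcal O\big(\sqrt{\sum_t \|\varepsilon_t\|^2}\big)$-type bounds can be sketched as follows; this is a generic optimistic gradient scheme in the spirit of [25], [26], with hypothetical names and a fixed step $\eta$, not the exact method of those papers.

```python
import numpy as np

def optimistic_ogd(grads, hints, dim, eta=0.1, radius=1.0):
    """Optimistic online gradient descent: each round plays a point shifted
    by the hint (predicted gradient) before the true gradient is revealed.
    The regret then scales with the prediction errors sum_t ||c_t - c_hat_t||^2
    rather than directly with T."""
    def proj(x):
        n = np.linalg.norm(x)
        return x if n <= radius else x * (radius / n)

    y = np.zeros(dim)
    actions = []
    for c, c_hat in zip(grads, hints):
        x = proj(y - eta * c_hat)   # optimistic step using the prediction
        actions.append(x)
        y = y - eta * c             # lazy base update with the true gradient
    return actions
```

With perfect hints ($\tilde c_t = c_t$) the played points coincide with the one-step-lookahead iterates, which is the mechanism that shrinks the regret.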
From a network management perspective, the availability of predictions is common in wireless networks and (mobile) computing systems, and several works use deep learning or other function approximators [27] to predict $f_\tau, g_\tau$ for $\tau \ge t$ and select a better $x_t$ [28], [29]. Predictions have also been included in stochastic optimization algorithms [30], [31], which however assume that the requests and system perturbations are stationary. Similarly, prior works using predictions in online learning [32], [33], [34] do not adapt to the predictions' quality nor consider budget or other time-varying constraints, i.e., they operate w.r.t. a fixed set $\mathcal X$. If a learner has access to perfect predictions, then it might be possible to use them directly in order to select a better action.¹ Nonetheless, the problem becomes fundamentally different when the predictions have unknown and/or time-varying accuracy. This more general setting requires the learner to adapt to the predictions. To that end, we focus here on the richer constrained OCO problem where $\{x_t\}$ are also subject to time-varying budget constraints ($x \in \mathcal X_T$) for an unknown horizon $T$. We place no assumptions on the predictions' accuracy, the geometry of the set $\mathcal X_T$, or the functions $\{f_t, g_t\}$, beyond convexity.
Technically, our approach benefits from a novel "lazy" update that aggregates all previous cost and constraint vectors and uses data-driven steps that adapt to the prediction errors. In particular, we build on FTRL, cf. [35], which we extend with time-varying accumulated constraints, a result of independent interest. Previous greedy-based algorithms for time-varying budget constraints and benchmarks in $\mathcal X_T$ include: [8], [10], [40], [41], and [39], which achieve $\mathcal O(T^{3/4})$ assuming a fixed and known horizon² $T$; [16], which offers $R_T, V_T = \mathcal O(\sqrt T)$ but confines the constraints to be linearly-perturbed; [37], which studies a similar setup with specific switching-actions constraints; [11], [38], with $R_T, V_T = \mathcal O(\sqrt T)$, which restrict the constraints to be i.i.d. stochastic or non-positive over a common subset space; and [6], [39], [40], [41], which achieve sublinear bounds w.r.t. a per-slot benchmark $\{x_t^\star\}_t$ (dynamic regret) under additional assumptions on the variability of the functions, and by knowing several problem parameters. We note that dynamic regret bounds, although more refined, cannot attain sublinear rates in the general case. We summarize how our work compares to prior works in Table I.

¹E.g., one could use the simple iteration $x_t = \arg\min_{\tilde g_t(x) \preceq 0} \tilde f_t(x)$, which however will yield arbitrarily bad performance for imperfect predictions.
It is important to stress that none of those approaches can benefit from predictions, and hence their performance does not improve even if the costs and constraint functions are predictable.On the contrary, our approach ensures improved learning performance whenever good predictions are available.

B. Contributions
We study the general constrained OCO problem where in round $t$ our algorithm, which we name LLP (Lazy Lagrangians with Predictions), has access to all prior cost gradients $\{\nabla f_i(x_i)\}_{i=1}^{t-1}$ and constraints $\{g_i(x)\}_{i=1}^{t-1}$, and receives predictions $\tilde g_t(x_t)$, $\nabla \tilde f_t(x_t)$ and $\tilde g_t(\cdot)$. After selecting $x_t$, LLP incurs cost $f_t(x_t)$ and violation $g_t(x_t)$, and the process repeats in the next round. Our first result, Theorem 1, presents the regret and constraint violation bounds and demonstrates how they benefit from predictions. Theorem 2 characterizes the (tunable) growth rates of the bounds and exhibits their dependency on the accumulated prediction errors. Theorem 3 and Lemma 3 present the respective bounds when LLP employs fully-linearized cost and constraint functions and non-proximal regularizers, cf. [35], in order to reduce its computation and memory requirements. For this linearized version, it suffices to have gradient predictions $\nabla\tilde g_t(x_t)$ instead of $\tilde g_t(\cdot)$. Indeed, in some practical problems it might be easier to obtain such predictions (single vectors) compared to predictions for the entire constraint function; but we note that this is not always the case³; LLP can handle both scenarios. Finally, Lemma 4 presents LLP's performance for linearly-perturbed constraints, a special but important case that was studied in [16].
The performance of LLP is summarized in Table I.
LLP achieves $R_T = \mathcal O(T^{(3-\beta)/4})$ and $V_T = \mathcal O(T^{(1+\beta)/2})$ for worst-case (or no) predictions, which are tunable through parameter $\beta \in [1/2, 1)$. For instance, with $\beta = 1/2$, we obtain $R_T = \mathcal O(T^{5/8})$ and $V_T = \mathcal O(T^{3/4})$, which are further reduced to $R_T, V_T = \mathcal O(\sqrt T)$ when $\{x_t\}$ does not outperform $x^\star$ by more than that. With perfect predictions, LLP achieves $R_T = \mathcal O(1)$ and $V_T = \mathcal O(T^{(1+\beta)/2})$, which are tunable via $\beta \in [0, 1)$; while for linearly-perturbed constraints (as in [16]) it achieves $R_T = \mathcal O(\sqrt T)$ and $V_T = \mathcal O(T^{5/8})$. These results improve previously-known bounds for the general OCO problem with time-varying budget constraints, i.e., without imposing assumptions such as strong convexity of the functions and domains, or knowing the horizon $T$. And they include as special cases the benchmarks with static or stationary constraints of [11], [12], [13], [14], and [36]. Importantly, unlike all prior constrained-OCO algorithms, the constant factors of $R_T$ and $V_T$ shrink proportionally to the predictions' accuracy. This is, probably, the most important advantage of LLP compared to all prior works, which, even though in some special cases they attain better theoretical bounds, cannot benefit from predictions.
We believe these results pave the road for extending the seminal network utility maximization (NUM) framework [42], [43] with robust learning techniques which seamlessly encompass any available predictions of unknown quality.

C. Assumptions and Notation
We write $\{x_t\}$ for a sequence of vectors and use subscripts to index them; $\|\cdot\|$ denotes the Euclidean ($\ell_2$) norm, and $[x]_{\mathcal X}$, $[x]_+$ denote the $\ell_2$-projections of $x$ onto the set $\mathcal X$ and onto $\mathbb R^N_+$, respectively. We use the indicator function $I_{\mathcal X}(x) = 0$ if $x \in \mathcal X$ and $I_{\mathcal X}(x) = \infty$ otherwise. Vector $c_t$ denotes the gradient $\nabla f_t(x_t)$ of $f_t$, or an element of its subdifferential $\partial f_t(x_t)$ if it is non-differentiable; and $\nabla g_t(x)$ denotes the Jacobian of the vector-valued constraint. We use the shorthand notation $c_{1:t}$ for $\sum_{i=1}^{t} c_i$, and $\tilde a_t$ for the prediction of some vector (or function) $a_t$.
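The two projections used throughout, $[x]_{\mathcal X}$ (here for the concrete case where $\mathcal X$ is a Euclidean ball of radius $D$, matching assumption A1 below) and $[x]_+$, can be sketched as:

```python
import numpy as np

def proj_ball(x, D):
    """[x]_X : Euclidean projection onto the ball X = {x : ||x|| <= D}."""
    n = np.linalg.norm(x)
    return x if n <= D else x * (D / n)

def proj_nonneg(x):
    """[x]_+ : Euclidean projection onto the nonnegative orthant."""
    return np.maximum(x, 0.0)
```

For a general convex $\mathcal X$ the projection has no closed form and requires solving a small convex program; the ball case suffices for the sketches in this document.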
The analysis requires the following basic assumptions.
A1. The set $\mathcal X \subset \mathbb R^N$ is convex and compact, and it holds $\|x\| \le D$, $\forall x \in \mathcal X$.
A2. Functions $f_t, g_t^{(j)} : \mathcal X \to \mathbb R$, $\forall t, j \le d$, are convex and Lipschitz with constants

D. Paper Organization
Section II introduces the LLP algorithm and the regret and constraint violation bounds. Section III presents the adaptive multi-step and characterizes the convergence rate of LLP, with special focus on the cases of perfect predictions and worst-case (or no) predictions. Section IV modifies LLP for linearized constraints and non-proximal updates, and Sec. V derives the performance bounds for the special case of linearly-perturbed constraints. We conclude in Sec. VI. The paper is accompanied by an appendix that includes the remaining proofs, explanatory figures, and numerical examples.

II. THE LLP ALGORITHM
Our approach is inspired by saddle-point methods that perform min-max operations on a convex-concave Lagrangian. Starting from the $t$-round problem
$$\min_{x \in \mathcal X} f_t(x) \quad \text{s.t.} \quad g_t(x) \preceq 0,$$
we introduce the dual variables $\lambda \in \mathbb R^d_+$ by relaxing $g_t(x) \preceq 0$, and define the regularized Lagrangian
$$L_t(x, \lambda) = c_t^\top x + \lambda^\top g_t(x) + r_t(x) - q_t(\lambda),$$
where we have linearized $f_t(x)$. Function $r_t : \mathcal X \to \mathbb R$ is a proximal⁴ primal regularizer and $q_t : \mathbb R^d_+ \to \mathbb R$ a non-proximal dual regularizer. We also set $L_0(x, \lambda) = r_0(x) - q_0(\lambda)$.
We coin the term Lazy Lagrangians with Predictions (LLP) for our algorithm, which proceeds as follows, Fig. 2. In each round $t$, LLP uses the observations $\{c_i\}_{i=1}^{t-1}$, $\{g_i\}_{i=1}^{t-1}$, the dual variables $\{\lambda_i\}_{i=1}^{t}$, and the predictions $\tilde c_t$, $\tilde g_t(\cdot)$ to perform an optimistic FTRL update (4), which induces cost $f_t(x_t)$ and constraint violation $g_t(x_t)$.
After the $t$-round information $c_t$ and $g_t(\cdot)$ is revealed, LLP calculates the prescient action via (5), and uses the prediction $\tilde g_{t+1}(x_{t+1})$ to update the duals via (6), where we note the use of $\{z_t\}$ instead of $\{x_t\}$. The process then repeats in the next round.

Algorithm 1 Lazy Lagrangians With Predictions (LLP)
Input: $x_0 \in \mathcal X$, $\lambda_1 = 0$, $r_t(x)$, $q_t(\lambda)$ satisfying (7), (8).
for $t = 1, 2, \ldots$ do
  • Calculate $r_{0:t-1}$ and decide $x_t$ using (4)
  • Pay cost $f_t(x_t)$ and violation $g_t(x_t)$
  • Calculate $r_{0:t}$ and decide $z_t$ using (5)
  • Receive predictions $\tilde c_{t+1}$, $\tilde g_{t+1}(\cdot)$, $\tilde g_{t+1}(x_{t+1})$
  • Calculate $q_{0:t}$ and decide $\lambda_{t+1}$ using (6)
end for

LLP has key differences from previous constrained OCO algorithms. These stem from the usage of lazy, as opposed to greedy, updates in the primal and dual iterations: instead of using $x_{t-1}$ and $\lambda_t$ to decide $x_t$ and $\lambda_{t+1}$ respectively, we aggregate in a projection-free fashion all prior cost gradients and constraints. This approach can be traced back to the lazy algorithms discussed in [1]; to fictitious (as opposed to best response) strategies in game theory [44]; and to FTRL algorithms [35] for problems with fixed constraints. However, to the best of our knowledge, this is the first time such lazy updates are used with time-varying budget constraints.
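A minimal numerical sketch of the two closed-form pieces of one LLP round, under simplifying assumptions of ours ($\mathcal X$ is the Euclidean ball of radius $D$, the constraint terms are already linearized and weighted by the duals, and the aggregated regularizer is a single quadratic), is:

```python
import numpy as np

def llp_round(C_sum, c_hat, Gl_sum, sigma_sum, D):
    """Simplified lazy primal step: C_sum aggregates past cost gradients
    (lazy FTRL), c_hat is the optimistic prediction of the next gradient,
    Gl_sum aggregates the lambda-weighted constraint gradients, and
    r_{0:t-1}(x) = sigma_sum * ||x||^2 / 2 over the ball of radius D."""
    v = C_sum + c_hat + Gl_sum        # aggregated linear term
    x = -v / sigma_sum                # unconstrained FTRL minimizer
    n = np.linalg.norm(x)
    return x if n <= D else x * (D / n)

def dual_update(gsum_z, g_hat_next, phi_sum):
    """Simplified closed-form lazy dual update: project onto R^d_+ the
    aggregated constraint values at the prescient points z_i plus the
    optimistic prediction for round t+1, scaled by the dual
    regularization phi_sum."""
    return np.maximum((gsum_z + g_hat_next) / phi_sum, 0.0)
```

Note how neither update references $x_{t-1}$ or $\lambda_t$ directly; both operate on running aggregates, which is the "lazy" structure described above.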
Performance. The regret and constraint violation are quantified using (1) and (2), respectively, with benchmark $x^\star \in \arg\min_{x \in \mathcal X_T} \sum_{t=1}^{T} f_t(x)$. The performance of LLP is shaped by the regularizers, which adapt to predictions. In particular, we use the primal regularizers of (7), where $r_0(x)$ ensures that $x \in \mathcal X$; $x_t$ is given by (4); and the regularization parameter $\sigma_t$ accounts for the cost and constraint prediction errors, where the latter are modulated by the dual variables. The intuition for (7) is that we add regularization commensurate to the prediction errors; the rationale for selecting this particular $\sigma_t$ will be made clear below. On the other hand, we use the general dual regularizer of (8), where again $q_0(\lambda)$ ensures $\lambda \in \mathbb R^d_+$, and $\{a_t\}$ is the dual learning rate, which, for the first theorem, suffices to be non-increasing. Our first main result is the following.
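One plausible reading of the prediction-error-driven rule for $\sigma_t$ can be sketched as follows; the exact formula of (7) is not reproduced in this excerpt, so the combination and constants below are our assumptions.

```python
import numpy as np

def sigma_t(err_cost, err_constr, lam, sigma0=1.0):
    """Illustrative regularization increment: grows with the observed
    cost-gradient prediction error and with the constraint prediction
    error modulated by the current dual variables lam (our simplified
    reading of the paper's rule; sigma0 is a placeholder constant)."""
    return sigma0 * np.sqrt(err_cost ** 2 + np.dot(lam, err_constr) ** 2)
```

With perfect predictions both error terms vanish and no extra regularization is added, which is what lets the bounds collapse toward $\mathcal O(1)$ regret.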
Theorem 1: Under Assumptions (A1)-(A4) and with $\{r_t\}$ and $\{q_t\}$ satisfying (7) and (8), LLP ensures $\forall x^\star \in \mathcal X_T$:
Discussion. We observe from Theorem 1 the effect of predictions on the bounds of $R_T$ and $V_T$, which diminish proportionally to their accuracy. The bound of $R_T$ is further reduced if we set $\sigma = L_f/D$ (when these parameters are known), and settles to zero for perfect predictions, i.e., when $x_t = z_t$, $\forall t$; while the same is not true for $V_T$. Moreover, this theorem reveals the tension between $R_T$ and $V_T$. Indeed, observe that $-R_T$ appears in the bound of $V_T$, which means that when $\{x_t\}$ outperforms $x^\star$, we might incur higher constraint violation.
Observing the steps of the algorithm, we can see that LLP requires predictions for the next gradient $\tilde c_t = \nabla \tilde f_t(x_t)$, the next constraint function $\tilde g_t(\cdot)$, and the next constraint value $\tilde g_t(x_t)$. It is important to note the timing of these predictions. Updating $x_t$ requires $\tilde c_t$ and $\tilde g_t(\cdot)$, and access to the regularizers $r_{0:t-1}$, which are calculated using the prediction errors up to slot $t-1$. Knowledge of $\tilde c_t$ is the standard prediction that all prior works employ, e.g., see [22], [24] and references therein. On the other hand, since we have not linearized the constraint function, the respective predictions involve the function $\tilde g_t(\cdot)$ and its next-round value $\tilde g_t(x_t)$. In Section IV we present a version of LLP where we linearize the constraints and hence use only gradient predictions for the constraints, similarly to the cost functions.
The complexity of LLP is comparable to its greedy-based counterparts, sans the additional prescient update (5); i.e., it requires the solution of strongly convex problems and a closed-form iteration for the dual update. Finally, it is worth emphasizing that the impossibility result of [7] holds even if $\{f_t, g_t\}$ are revealed before $\{x_t\}$ is selected, as stated in the next lemma, which is proved in the Appendix.
Lemma 1: No online algorithm can achieve concurrently sublinear regret and constraint violation w.r.t. the benchmark set $\mathcal X_T^{\max}$, even if the algorithm selects $\{x_t\}$ with knowledge of $\{f_t, g_t\}$.
This result exhibits the challenges in tackling constrained OCO problems with time-varying budget constraints when using the ideal set of benchmarks $\mathcal X_T^{\max}$; and reveals that using predictions, even if they are known to be perfect, does not suffice to escape this limitation.

A. Regret Bound
Our strategy is to derive a regret bound w.r.t. the prescient actions $\{z_t\}$, and then use the distance of $\{z_t\}$ from $\{x_t\}$ to prove Theorem 1(a). We will use the following lemma, which is proved in the Appendix.
to both sides of (9), and dropping the first non-negative sum in the LHS, setting $\lambda = 0$ to get $q_{0:T-1}(0) = 0$, and adding/subtracting $\phi_t \|\lambda_t\|^2/2$ in the RHS so as to build $L_t(z_t, \lambda_t)$, $\forall t \le T$, we arrive at (10), where (a) stems from the Be-the-Leader (BTL) lemma [45, Lemma 3.1] applied with $x^\star \in \mathcal X_T \subseteq \mathcal X$ to (5); and (b) from expanding $L_t(x^\star, \lambda_t)$, using $g_t(x^\star) \preceq 0$ and $\|x^\star - x_t\|^2 \le 4D^2$, $\forall t$. Adding $\sum_{t=1}^{T} c_t^\top x_t$ to both sides and rearranging yields (11). The last term can be upper-bounded using the Cauchy-Schwarz inequality and Lemma 2, where in (a) we used [46, Lemma 3.5]; this was made possible due to the specific formula of the regularization parameter $\sigma_t$. Replacing in (11) and using $R_T \le \sum_{t=1}^{T} c_t^\top(x_t - x^\star)$, we eventually obtain the stated bound, which concludes the proof of Theorem 1(a).

B. Constraint Violation Bound
To prove Theorem 1(b), we start from (10), where we drop again the non-negative $\sum_{t=1}^{T} r_t(z_t)$ in the LHS, add and subtract the term $\phi_t \|\lambda_t\|^2/2$, $\forall t$, and rearrange, where again we applied BTL to $L_t(z_t, \lambda_t)$. Next, expanding $L_t(x^\star, \lambda_t)$, using $q_{0:T-1}(\lambda) = \phi_{0:T-1}\|\lambda\|^2/2$, $g_t(x^\star) \preceq 0$, $\forall t$, and $r_{0:T}(x^\star) \le 4D^2 \sigma_{1:T}$, we bound the RHS. For the LHS of (12) we use the bound of (13); and if we denote by $V_T^z$ the LHS norm and replace it in (12), we obtain the intermediate bound. Lastly, we define $w_t = \nabla g_t(x_t)^\top (z_t - x_t)$ and write:
where (a) uses the identity; (b) the convexity of $g_t$; and (c) the Cauchy-Schwarz inequality, $\|\nabla g_t(x_t)\| \le G$, and Lemma 2. This concludes the proof. The next section characterizes the convergence rates of the bounds, focusing on two special cases: when the predictions are perfect and when we have worst-case (or no) predictions.

III. CONVERGENCE RATES
We start by specifying the dual learning rate $\{a_t\}$. The rationale for selecting the primal regularizer was made clear in the proof of Theorem 1; here, we refine (8) in a way that ensures the desirable sublinear regret and constraint violation growth rates. In detail, we will be using:
This multi-step combines the typical time-adaptive step appearing in online gradient-descent algorithms [1] with a data-adaptive step that accounts for the prediction errors. This ensures that $a_t$ will induce enough regularization when the predictions' quality is not satisfactory, but will keep diminishing even in the case of perfect predictions, a condition that is necessary in order to tame the growth rate of the dual vector. Finally, note that the term $4G^2 \ge \xi_t^2$, $\forall t$, corrects the off-by-one regularizer of the non-proximal dual update.
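A hedged sketch of such a doubly-adaptive dual step follows, assuming the simple combination below; we omit the paper's exact $4G^2$ correction term so that perfect predictions reduce the step to $a/t^\beta$, as described above.

```python
import numpy as np

def dual_step(t, xi_hist, a=1.0, beta=0.5):
    """Illustrative doubly-adaptive dual step: a time-adaptive t^beta term
    combined with a data-adaptive term driven by the accumulated squared
    prediction errors xi_i.  With perfect predictions (all xi_i = 0) it
    reduces to a / t^beta; the paper additionally pads the data term with
    a 4*G^2 constant, which this sketch omits for simplicity."""
    return a / (t ** beta + np.sqrt(np.sum(np.square(xi_hist))))
```

The step is non-increasing in $t$ whenever the error history grows, which matches the requirement on $\{a_t\}$ stated for Theorem 1.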
Before we analyze the convergence for the cases of perfect and worst-case predictions, it is important to emphasize that in each round $t$, LLP has at its disposal all the necessary information to calculate $a_t$. In particular, $a_t$ is used to update the dual vector $\lambda_{t+1}$ after the cost and constraint functions, $f_t$ and $g_t$, have been revealed and the prescient vector $z_t$ has been calculated. Hence, we know $\xi_t$ before performing update (6). Furthermore, we stress that the analysis below does not make any assumptions on the apparatus that creates the predictions; in fact, our framework is orthogonal to how the predictions are created and oblivious to their quality: if the predictions are accurate, LLP learns to trust them, and if they are inaccurate, LLP gradually discards them.

A. Perfect Predictions
The next Corollary to Theorem 1 describes the regret and constraint violation bounds for perfect predictions.
Corollary 1: Indeed, for perfect predictions it holds $h_{1:T} = 0$, independently of the value of $\{\lambda_t\}$, while the second term in the bound of $R_T$ can be written (detailed derivation in Sec. III-C) in a form that diminishes to zero when $\xi_t = 0$, $\forall t$. This manifests the advantage of this doubly-adaptive dual step, which creates a bound similar to those in OCO problems without budget constraints, see [26]. Furthermore, the step is simplified to $a_t = a/t^\beta$, which remains bounded. Hence $R_T = B_T = \mathcal O(1)$, and if we substitute $B_T$ in $V_T$, we get a bound where we can set $\beta = 0$ to obtain $V_T = \mathcal O(\sqrt T)$, and reduce the constant factor of $V_T$ further by increasing⁶ $a$.
Furthermore, note that when the regret is non-negative, i.e., when the sequence $\{x_t\}_t$ does not outperform $x^\star$, then we get $V_T \le 0$ for any value of $\beta$. And, more generally, if there is a non-trivial bound for the negative regret, i.e., $-R_T = \mathcal O(T^b)$ with $b < 1$, then the bound of the constraint violation improves commensurately. Finally, it is worth stressing that, even when the predictions are perfect, the algorithm still needs to learn to trust them gradually, as their quality is a priori unknown.

B. Worst-Case Predictions
On the other hand, when we do not have any predictions at our disposal, or when these are as far as possible from the actual data, then the dual multi-step induces more regularization using the observed prediction errors.The performance of LLP in this scenario is captured by the following theorem.
What prevents the LLP bounds from improving further is the term $-R_T$ that appears in $V_T$. While we have used the worst case $-R_T = \mathcal O(T)$ in the analysis, it is important to note that when $-R_T = \mathcal O(\sqrt T)$, i.e., when LLP does not outperform the benchmark by more than $\sqrt T$, then for perfect predictions we achieve $R_T = \mathcal O(1)$, $V_T = \mathcal O(T^{1/4})$ (setting $\beta = 0$), and for worst-case predictions $R_T, V_T = \mathcal O(\sqrt T)$.
On the other hand, if LLP does outperform the benchmark consistently ($\forall T$) by at least $\Omega(\sqrt T)$, we achieve negative regret and $V_T = \mathcal O(T^{2/3})$. The general case, for which the above two corollaries hold, is when the sample path is such that LLP bounces above and below the performance of $x^\star$. A schematic description of these cases is included in the Appendix, and the different rates achieved by LLP are summarized in Table I.
Concluding, it is worth discussing the inherent difficulties of the problem at hand. Namely, one might argue that we could directly apply the optimistic FTRL result of [26] (or our equivalent Lemma 5) both to the primal and to the dual update, and combine the results to bound $R_T$ and $V_T$. The interested reader, however, can verify that this straightforward strategy leads to much worse bounds. Our approach, instead, is to apply the optimistic FTRL result only to the dual update and carefully reconstruct tighter bounds for the Lagrangian, while using a fixed-point iteration to find the exact (minimum) growth rate of $\|\lambda_T\|$. Moreover, unlike prior works such as [16] or [13], we update $\{x_t\}$ using $\{\lambda_t\}$ (instead of $\lambda_{t-1}$), and then update $\{\lambda_{t+1}\}$ using the newly calculated $\{x_t\}$. This strategy facilitates the inclusion of predictions, as we only need to predict $x_t$ and not $\lambda_t$. Besides, it is exactly this circular relation between the primal and dual variables that renders the inclusion of predictions in OCO with budget constraints fundamentally different from the respective OCO problem without time-varying constraints.

C. Proof of Theorem 2
Our strategy is to bound the growth rate of the dual vector norm and use it to bound $R_T$ and $V_T$. First, we define its minimum growth rate $k = \min\{\varphi : \|\lambda_t\| = \mathcal O(t^{\varphi})\}$, and introduce $\Lambda_t = \max\{\|\lambda_i\| : i \le t\}$, where $\Lambda_t \le \Lambda_T = \mathcal O(T^k)$, $\forall t \le T$. Using the closed-form solution⁷ of (6), we can write (17), where we used $\|g_t(z)\|, \|\tilde g_t(z)\| \le G$, $\forall t$, and the triangle inequality. We will use this bound and the definition of $\lambda_t$, which ties it to $V_t^z$, to find a smaller growth rate than the one we would get by directly bounding $V_t^z$ in (17). Indeed, starting from (17) and using (13), we obtain (18). Next, note that, based on the definition of $a_t$ and the fact that we consider worst-case predictions (whose errors increase linearly with $T$), the corresponding inequalities hold. Similarly, using [46, Lemma 3.5], the identity $\sum_{t=1}^{T} t^{-\beta} \le T^{1-\beta}/(1-\beta)$ (Lemma 6 in the Appendix), and $\xi_t \le 2G$, $\forall t$, we obtain (19). Finally, replacing $B_T$ in (19) with its definition from Theorem 1, we arrive at a bound where we used the worst-case $-R_T = \mathcal O(T)$ and $a_T G = \mathcal O(T^{\theta})$. To find the dominant term, note that, since $\beta \in [0, 1)$, it is $n \le 1$, and hence $(1+\theta)/2 \ge (n+\theta)/2$, thus we omit the second term. Also, $\theta \le 0$ and hence we can omit the last term; and finally, from (17) we observe that $k \le 1 + \theta \le 1$, thus the third term is larger than the first, and we conclude with $k = (1+\theta)/2$. Having found the growth rate of $\|\lambda_T\|$ to be $k = (1+\theta)/2$, we use (18) and (21) to refine the bounds, and these conclude the proof of Theorem 2(a). For the constraint violation $V_T$, observe first that it holds $\nu = -\theta$. Using this bound along with (23) and $-R_T = \mathcal O(T)$, we get the stated bound from Theorem 1(b). We conclude by noticing that, conditioning on the value of $\beta$, we get the bounds in Theorem 2.

⁷Eq. (6) simplifies to:

IV. LESS COMPUTATIONS AND PREDICTIONS
We discuss next how to reduce the computation and memory requirements of LLP by using non-proximal primal regularizers; and the impact of linearizing the constraint functions on the required predictions.

A. LLP With Non-Proximal Regularizers
This new algorithm, LLP2, uses the same dual regularizer (8) and update (6), but the general primal regularizer $r_0(x) = I_{\mathcal X}(x)$ and $r_t(x) = \sigma_t \|x\|^2/2$, $\forall t \ge 1$, with a parameter where $\mu_t \triangleq E_m + a_{t-1} G t \Delta_m$. The new updates are (25) and (26), where $L_t(x, \lambda)$ is defined using $r_t(x)$ from (24); note the off-by-one regularizer of (26) compared to (5). These non-proximal regularizers facilitate solving for $\{x_t\}$ and $\{z_t\}$, since $r_{1:t}(x) = \sigma_{1:t}\|x\|^2/2$ involves only one quadratic term and can be represented in constant memory space, unlike $r_{1:t}(x) = \sum_{i=1}^{t} \sigma_i \|x - x_i\|^2/2$, which expands with time. On the other hand, non-proximal updates yield looser bounds, cf. [35], and require a new saddle-point analysis. Interestingly, they do not worsen the growth rates of LLP2, but they do prevent it from achieving $R_T = \mathcal O(1)$ for perfect predictions.
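The memory argument can be illustrated by representing the two aggregated regularizers directly; these are illustrative classes of ours, not the paper's implementation.

```python
import numpy as np

class NonProximalReg:
    """r_{1:t}(x) = (sum_i sigma_i) * ||x||^2 / 2 : the aggregate is a
    single scalar, so memory stays O(1) regardless of t."""
    def __init__(self):
        self.sigma_sum = 0.0
    def add(self, sigma_t):
        self.sigma_sum += sigma_t
    def value(self, x):
        return 0.5 * self.sigma_sum * np.dot(x, x)

class ProximalReg:
    """r_{1:t}(x) = sum_i sigma_i * ||x - x_i||^2 / 2 : in this naive
    representation every past action x_i is kept, so memory grows
    linearly with t."""
    def __init__(self):
        self.terms = []                     # (sigma_i, x_i) pairs
    def add(self, sigma_t, x_t):
        self.terms.append((sigma_t, np.asarray(x_t, dtype=float)))
    def value(self, x):
        return sum(0.5 * s * np.dot(x - xi, x - xi) for s, xi in self.terms)
```

When all centers $x_i$ coincide, the two regularizers agree; the difference is purely in how the accumulated quadratic must be stored and re-evaluated at each FTRL solve.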
Theorem 3: Under (A1)-(A4) and with $\{r_t\}$ and $\{q_t\}$ satisfying (24) and (8), LLP2 ensures for every $x^\star \in \mathcal X_T$:
The proof of Theorem 3 can be found in the Appendix.
Corollary 2: LLP2 achieves the same bounds as in Theorem 2 in the general case, and under perfect predictions it ensures:
For example, LLP2 with perfect predictions and $\beta = 0$ achieves $R_T = \mathcal O(\sqrt T)$ and $V_T = \mathcal O(\sqrt T)$; and with $\beta = 1/3$ it yields $R_T = \mathcal O(T^{1/3})$ and $V_T = \mathcal O(T^{2/3})$. Hence, it outperforms the state-of-the-art constrained-OCO algorithms with no predictions, but does not perform as well as LLP for perfect predictions.

B. LLP With Linearized Constraints
Another way to reduce the computation load of LLP is to linearize the constraint function. This, however, is not trivial, since we cannot recover $V_T$ by simply using the convexity of $\{g_t\}$, as we do with $R_T$ and the linearization of $\{f_t\}$. Hence, we use linear proxies for the constraints and their predictions, cf. (28). LLP with linearized constraints, which we call LLP3, runs similarly to Algorithm 1, but uses the predictions $\tilde c_t$, $\nabla \tilde g_t(x_t)$ and $\tilde g_{t+1}(x_{t+1})$, i.e., it does not need to predict the entire constraint function, nor $\tilde x_t$, despite the latter appearing in (28). And this does not affect its performance.
Lemma 3: LLP3 achieves the $R_T$, $V_T$ bounds and convergence rates of Theorems 1 and 2, respectively.
The proof of the lemma and the details of the linearized LLP can be found in the Appendix. Now, whether it is more difficult to predict the next constraint gradient or the entire next constraint function is a question pertaining to the problem at hand, and practitioners can select the version of LLP that suits their needs. For instance, as explained in Section I, for single-parameter functions it is easier to predict the entire next-slot function by guessing the parameter, as opposed to predicting the next gradient and next value, which requires a guess for the system's operating point as well. In other cases, predicting the structure of an entire function (e.g., if $g_{t+1}$ is totally different than $g_t$) might be more challenging.
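A minimal sketch of a linear constraint proxy follows; this is our own helper, and the paper's (28) may differ in details such as the evaluation point.

```python
import numpy as np

def linearized_constraint(g_xt, grad_g_xt, x_t):
    """Return the linear proxy g_hat(x) = g_t(x_t) + grad_g_t(x_t) @ (x - x_t).
    For a convex g_t this first-order approximation under-estimates the true
    constraint everywhere, so any x feasible for g_t is feasible for the proxy."""
    x_t = np.asarray(x_t, dtype=float)
    def g_hat(x):
        return g_xt + grad_g_xt @ (np.asarray(x, dtype=float) - x_t)
    return g_hat
```

The under-estimation property is exactly why recovering $V_T$ for the true constraints from the proxies requires extra care in the analysis.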

V. LINEARLY-PERTURBED CONSTRAINTS
In this section we consider the special type of constraints that are linearly-perturbed, which was first studied in [16]. In detail, the constraints and their predictions are given by (29), where $b_t, \tilde b_t \in \mathbb R^d$ are the unknown per-slot perturbations that are added to the fixed (and known) function component $g(x)$.
For example, consider a network routing problem where the network graph and capacities are fixed, $g_t = g$, $\forall t$, but the incoming flow varies in an arbitrary fashion according to the vectors $\{b_t\}_t$ [42]. This simplification has important ramifications for the analysis and, eventually, improves the bounds of LLP as follows.
Lemma 4: Under the conditions of Theorem 1, with constraints and predictions given by (29), LLP ensures:
This result improves the bounds for the case of general constraints of the previous section, and yields only a $1/8$-worse constraint-violation exponent than the bounds in [16] (and the same regret bounds). However, [16] cannot leverage predictions. That is, even if we have perfect predictions at our disposal, [16] will still offer the same regret and constraint violation bounds.
On the contrary, we see that LLP, due to its prediction-adaptive steps, yields no regret for perfect predictions, and the constant factors of the bounds shrink commensurately with the predictions' accuracy. This becomes clear if we express the bounds as in (30) and (31), where we have defined the parameters accordingly. Note that in this case the quantity $h_{1:T}$ does not depend on the dual vectors. This is due to the fact that the perturbations do not affect the primal step (see Sec. VI-H of the Appendix for details), which disentangles, to some extent, the primal and dual iterations. The proof of the lemma and the details for deriving (30) and (31) can be found in the Appendix.

VI. CONCLUSION
LLP differs from related algorithms since the primal and dual updates are lazy. This allows an FTRL-based design, which is widely used in fixed-constraints OCO algorithms ($x \in \mathcal X$) but is new in the context of time-varying constraints ($x \in \mathcal X_T$). The LLP order bounds, even with worst-case or no predictions, are competitive with existing algorithms while dropping several impractical assumptions that these algorithms use. Indeed, prior algorithms with time-varying constraints require strongly convex cost and constraint functions or linearly-perturbed fixed constraints, and rely on the Slater condition. Other proposals assume time-invariant constraints, still rely on a fixed and known time horizon $T$ (which cannot be remedied by the doubling trick), and need access to all Lipschitz constants and constraint bounds. Hence, LLP can be applied in a wider range of network management problems.
The most important advantage of LLP is that it encompasses predictions of unknown quality. This is the first work that proposes and tackles this problem in the context of OCO with time-varying budget constraints. Clearly, despite the good performance of prior works, none of them can benefit from the availability of (potentially inaccurate) predictions. LLP, instead, gains directly from them, as the constant factors of the regret and constraint violation shrink in proportion to the predictions' quality; this is oftentimes even more beneficial than having a slightly tighter learning rate w.r.t. T, especially in problems of large dimension (D comparable to T). And when the predictions are perfect, LLP achieves R T = O(1) and V T = O( √ T ). Last but not least, this framework is unified: it can run without predictions (by setting them to zero), since we impose no assumptions on their quality, and it can be applied to problems with time-invariant constraints. Hence, it opens the road for employing network datasets and measurements to make predictions (e.g., using Deep Learning to predict capacity or demand) without concern about their accuracy, which indeed might vary widely across cases.
In closing, there are several interesting directions for future work, such as extending this framework to dynamic regret benchmarks by imposing additional restrictions on the variability of the cost and constraint functions, or exploring whether the achieved bounds can be further reduced.

APPENDIX
The Appendix includes the missing proofs from the main document; the supporting lemmas and their proofs; and additional discussion for the main results.

A. Performance Cases of LLP
We start by discussing the different cases regarding the performance of LLP, in order to facilitate the reader's understanding of the consequences of Theorems 1 and 2. Figure 3 summarizes the three possible scenarios. Case (i) is realized when −R T = O( √ T ), i.e., when LLP does not outperform x by more than this growth rate; here LLP achieves competitive rates, R T , V T = O( √ T ), and zero regret for perfect predictions. Case (ii) arises when the condition −R T = O( √ T ) is consistently violated, which in fact yields even better performance in terms of regret, while maintaining the general V T = O(T 2/3 ). Finally, Case (iii) arises when the above condition might be violated during some time intervals and sample paths, but not consistently; for this scenario the general bounds R T , V T = O(T 2/3 ) hold. In all cases, the constant factors of the regret diminish as the predictions' quality improves.

B. Proof of Lemma 1
Lemma 1: No online algorithm can concurrently achieve sublinear regret and sublinear constraint violation, R T = o(T ) and V T = o(T ), even if the algorithm selects x t with knowledge of f t , g t .
Proof: We provide an opponent strategy that ensures there is an increasing sequence t(1), t(2), . . . of rounds with either R t(i) ≥ t(i)/8 or V t(i) ≥ t(i)/8. Our opponent selects f t+1 , g t+1 based only on x 1 , . . . , x t . Hence the impossibility result holds even if the player knows f t+1 , g t+1 on round t + 1.
Consider the domain X = [0, 1]. The cost functions are linear and the pair (f t , g t ) is always one of p = (−x, −1) or q = (−2x, 2x − 1). Before giving the opponent strategy we make some general observations. To derive the feasible set G T = {x ∈ X : T t=1 g t (x) ≤ 0}, suppose the opponent plays p exactly n times and q exactly T − n times. Then we have: In particular, for n ≥ T /2 the second term is at least 1/2 and G T = [0, 1]. Since the f t are negative linear, the regret is with respect to x = 1.
For each m and each t = min J m we have x̄ t−1 < 3/4. There are two cases to consider. First assume (a) there are infinitely many I m , J m . Since each |I m | = |J m |, we see that on the final turn t(m) of each J m the opponent has played p exactly n = t(m)/2 times. Hence the above says G t(m) = [0, 1] and x = 1, and the regret is R t(m) ≥ t(m) i=1 (1 − x i ). On turn s(m) = max I m the regret is at most: where the first inequality uses (4) and the second uses s(m) = t(m) − 1 ≥ t(m)/2. Since there are infinitely many t(1), t(2), . . . , there are infinitely many turns t with R t ≥ t/8. Hence the regret is Ω(T ). Now assume (b) there are only finitely many I m , J m , and let s denote the final turn of the last such interval. By (1) the opponent plays q = (−2x, 2x − 1) on turns s + 1, s + 2, . . . . Thus for T ≥ 4s the constraint violation is: where the last inequality uses (3). Since T ≥ 4s, the RHS is at least T /2 − s ≥ T /2 − T /4 = T /4 ≥ T /8. Hence we can take t(i) = 4s + i.
To complete the proof we give an opponent strategy that satisfies these four conditions. Reference [7] suggests the following approach: (i) Play q on turns min I n , min I n + 1, . . . , m, where m is the first turn with x̄ m < 3/4. (ii) End I n and begin J n . (iii) Play p over the next |I n | turns. (iv) End J n and begin I n+1 . The decision on turn m + 1 to end I n depends only on the average x̄ m of x 1 , . . . , x m . Hence the opponent does not need to see the player's current move to implement the strategy; equivalently, the player is allowed to see the opponent's next move. This concludes the proof.
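The tension behind Lemma 1 can be illustrated numerically. The short sketch below is not the full adversary of the proof: it only plays the pair q = (−2x, 2x − 1) against a cost-greedy player who fixes x t = 1 (so the running average stays above 3/4 and the opponent keeps playing q), showing that zero (in fact negative) regret is bought at the price of linear constraint violation:

```python
# Simplified illustration of Lemma 1 on X = [0, 1]:
# cost/constraint pair q = (f, g) with f(x) = -2x, g(x) = 2x - 1.
# A player who always picks the cost-greedy action x_t = 1 faces q in
# every round, so g_t(x_t) = 1 each round.

T = 64
x_t = 1.0
cost, violation = 0.0, 0.0
for t in range(T):
    cost += -2.0 * x_t            # f_t(x_t)
    violation += 2.0 * x_t - 1.0  # g_t(x_t)

# Best fixed feasible point: sum_t g_t(x) = T(2x - 1) <= 0  =>  x = 1/2.
best_cost = T * (-2.0 * 0.5)
regret = cost - best_cost

print(regret, violation)  # regret = -T (negative), violation = T (linear)
```

The proof's averaging-based opponent extends this idea to punish every player, not just this greedy one.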

C. Proof of Lemma 2
Using the property of the proximal regularizers, x t = arg min x r t (x), ∀t, we can expand (4) and write: where adding the t-round regularizer r t (x) does not change the minimizer of the RHS argument; it is easy to see this via a contradiction argument. Now, recall that the prescient action is: Applying Lemma 7 from [35], with: and recalling that the regularizer r 0:t (x) is 1-strongly convex w.r.t. the norm ‖x‖ (t) = √ σ 1:t ‖x‖, which has dual norm ‖x‖ (t),∗ = ‖x‖/ √ σ 1:t , we can write:
Lemma 5 (Optimistic FTRL): In the proof of Theorem 1 we relied on [26, Theorem 2]. However, that work considers a learning problem over a compact convex set, while the dual update to which we apply this result has an unbounded decision space λ ∈ R d + . This indeed does not pose a problem for our analysis. Firstly, one can see that for the standard FTRL analysis it suffices to have closed sets; see for example [35]. Secondly, the boundedness of the set is useful when we need to upper-bound the term q 0:T −1 (λ) that appears in the RHS of (9). In our analysis this is not necessary, as we cancel this term by setting λ = 0.
To complete this discussion, we provide here an alternative proof of [26, Theorem 2] that makes clear it is valid even if the decision set is not bounded. We note that this is presented in terms of the primal variables and functions {f t (x t )} to streamline the presentation, but the application to the dual variables and dual updates is straightforward.
Replacing m t (x) and n t (x), dropping the non-negative term T −1 t=1 r t (x t ) from the LHS, adding T t=1 c t x t to both sides, and rearranging, we eventually get: where in the last step we used the Cauchy–Schwarz inequality.
For the term ‖x t − z t ‖ (t−1) we can apply [35, Lemma 7] with: Replacing in (33), we conclude the proof, noting that we did not use boundedness of X in any step of the proof.
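For intuition on the lazy, optimism-augmented iteration analyzed above, here is a minimal Python sketch of an optimistic FTRL update for linear costs on a box set, with assumed quadratic regularizers r t (x) = (σ/2)‖x‖²; the constants and the box geometry are illustrative, not the paper's:

```python
import numpy as np

def optimistic_ftrl(costs, predictions, sigma=1.0, lo=-1.0, hi=1.0):
    """Lazy optimistic FTRL for linear costs c_t on the box [lo, hi]^N:
       x_t = argmin_x <c_{1:t-1} + c~_t, x> + (sigma_{1:t}/2) ||x||^2,
    which, for a quadratic regularizer on a box, has the closed form of a
    clipped scaled negative cumulative gradient."""
    T, N = costs.shape
    xs = []
    c_sum = np.zeros(N)
    for t in range(T):
        z = -(c_sum + predictions[t]) / (sigma * (t + 1))  # sigma_{1:t} grows with t
        xs.append(np.clip(z, lo, hi))
        c_sum += costs[t]  # true cost revealed after the action
    return np.array(xs)

rng = np.random.default_rng(1)
costs = rng.standard_normal((50, 3))

x_perfect = optimistic_ftrl(costs, costs)                 # perfect predictions
x_none = optimistic_ftrl(costs, np.zeros_like(costs))     # no predictions
print(x_perfect.shape, x_none.shape)
```

With perfect predictions the iterate is built from the full cumulative cost c 1:t, i.e., it matches the prescient action; with zero predictions the update reduces to plain lazy FTRL, consistent with the claim that LLP degrades gracefully as prediction quality drops.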

F. Proof of Theorem 3
First, note that we can directly obtain the bound:
where the RHS term appears in μ T . Hence, the strong convexity of LLP2 can be lower-bounded as: Next, we replace Lemma 2 with an updated bound. Lemma 7: For the actions x t and z t obtained by (25) and (26), respectively, it holds: It is easy to see that, since the dual update has not changed, equation (9) holds as is, and we can readily obtain (11) and eventually write: Therefore, similarly to (19), we can write: And finally, observe that since the dual regularizer q t (λ) and learning rate a t are given by (8), (15), the bounds (20), (21) hold for LLP2 as well. Putting these together, we arrive at: That is, the growth rate of λ T remains O(T (1+θ)/2 ) and is not affected by the non-proximal primal regularizers. Similarly, the growth rate of B̃ T is the same as that of B T , see (23), since h 1:T = O(T (3+θ)/2 ) dominates μ T +1 = O(T 1+θ ). Therefore, the growth rates of R T and V T are exactly those of LLP.
On the other hand, the R T bound of LLP2 is not zeroed for perfect predictions. Indeed, when ε t = δ t = ξ t = 0, ∀t, and h 1:t = 0, we get: which shows that we cannot achieve zero regret and, even more so, that as we reduce the bound of R T by increasing β, we deteriorate the bound of V T by a commensurate amount.
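This tradeoff can be read off the exponents of the stated bounds directly; a tiny sketch tabulating them:

```python
# Exponents of the LLP bounds R_T = O(T^((3-beta)/4)), V_T = O(T^((1+beta)/2))
# as the tuning parameter beta sweeps [1/2, 1).
def exponents(beta):
    return (3 - beta) / 4, (1 + beta) / 2

for beta in (0.5, 0.7, 0.9):
    r_exp, v_exp = exponents(beta)
    print(f"beta={beta}: regret exponent {r_exp:.3f}, violation exponent {v_exp:.3f}")

# Increasing beta tightens the regret bound but loosens the violation bound.
```

For instance, β = 1/2 gives the (0.625, 0.75) pair of exponents, while pushing β toward 1 trades regret for violation.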

G. Proof of Lemma 3
If we define L t (x, λ) as in (3), but replace g t (x) with its linearization ĝ t (x) and also use the linearized prediction g̃ t (x), with: then the updates that we use in LLP2 are: where the primal and dual regularizers are given again by (7), (8), using the modified error parameters, ∀t: and ε t = c t − c̃ t as before.
Note that in the case of perfect predictions, i.e., when: then ε t = 0, δ t = 0, and: where the last step follows since, for perfect predictions, it clearly holds that z t = x t . Next, it is easy to see that Lemma 2 holds and yields the same bound ‖x t − z t ‖ ≤ (ε t + ‖λ t ‖ δ t )/σ 1:t with the redefined {δ t }. Applying [26, Theorem 2] to (37), we get: Then, we can repeat the analysis in Sec. II-A, noting: due to the convexity of g t (·) and the property of x , to arrive at the same bound B T for the regret R T , albeit with the redefined {δ t } and {ξ t } parameters. Similarly, repeating the analysis of Sec. II-B we get: Minimizing the LHS, similarly to Sec. II-B, we obtain: Finally, note that using the definition of g̃ t (x), we can write: Hence, we obtain the bound: where we used the identity: Therefore we arrive at the same result as with the non-linearized constraints, via a slightly different proof. Finally, it follows directly from the proof of Theorem 2 that the convergence rates are not affected by this linearization of the constraints. We conclude by stressing that in this version of LLP we only require the predictions c̃ t , ∇g t (x t ) and g̃ t+1 (x t+1 ), while with non-linearized constraints LLP was using the predictions c̃ t , g̃ t (·) and g̃ t+1 (x t+1 ).
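The convexity step used above, namely that the first-order linearization of a convex constraint never overestimates it (so replacing g t by its linearization can only relax the constraint), is easy to verify numerically; the constraint g(x) = x² − 0.5 below is a hypothetical example, not one from the paper:

```python
import numpy as np

def g(x):
    """Hypothetical convex constraint function."""
    return x**2 - 0.5

def g_hat(x, x_t):
    """First-order linearization of g around the current action x_t:
       g_hat(x) = g(x_t) + g'(x_t) (x - x_t)."""
    return g(x_t) + 2.0 * x_t * (x - x_t)

x_t = 0.3
xs = np.linspace(-1.0, 1.0, 201)
# For convex g the tangent lies below the function everywhere, so the
# linearized constraint is never tighter than the original one.
gap = g(xs) - g_hat(xs, x_t)
print(float(gap.min()))  # ~0 at x = x_t, nonnegative elsewhere
```

Here the gap equals (x − x_t)², which is zero exactly at the expansion point and positive everywhere else, matching the inequality invoked in the proof.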

H. Proof of Lemma 4
For this specific type of constraints, the primal and prescient updates are as follows: From this result we can see directly that when ε t = ξ t = 0, ∀t, we get R T ≤ 0, since B T = 0 and a T −1 is positive. For the constraints, it holds: For worst-case predictions, we can drop the positive term a T −1 (V z T ) 2 and write: then replace −R T ≤ F T = O(T ) in (40) and rearrange to obtain the bound for V z T , and finally use the relation of V T to V z T (see (40)) to bound the former. It is interesting to observe that the derivation of the bounds did not require explicitly bounding the dual variables; this stems from the fact that x t is independent of the constraint perturbations.

I. Numerical Tests
Finally, we conclude by providing some simple, yet illuminating, numerical results comparing LLP with three competitor algorithms: the MOSP algorithm by Chen et al. [6]; our previous work, Valls et al. [16]; and Sun et al. [10]. Figure 4 presents the first set of results. The algorithms run on the following cost and constraint functions: For the first LLP run we do not use any predictions, so as to demonstrate the efficacy of the algorithm even when no predictions are available. The second LLP plot runs the linearized version of the algorithm and uses perfect gradient predictions for the cost and constraint functions, but no predictions for the next constraint value. The three competitors have been optimized for R T , using the steps and tuning parameters suggested in their respective references. We observe that LLP achieves lower regret than all competitors, and reaches that point faster.
In the second experiment, we run the algorithms on the time-invariant cost function f t (x) = −2x, ∀t, and the constraint: where, again, x ∈ X = [−1, 1]. Note that in this example we plot the total, not the average, constraint violation, so as to shed light on the actual operation of each algorithm. We observe that LLP continuously satisfies the constraints at each t, while the competitors oscillate or fail to converge, even though the cost function is constant. The above results demonstrate that LLP performs quite well in practice: even in these simple examples (one-dimensional space, time-invariant cost functions, etc.) it has clear advantages over the state-of-the-art competitors. For example, it reaches lower regret points fast (first experiment), and it is able to handle the probabilistic constraints of the second example, which is not surprising given that it uses a lazy dual update scheme that turns out to be robust to such variations.
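The exact functions and tuning of these experiments are not reproduced here, but their flavor can be sketched with a generic greedy primal-dual (virtual-queue style) baseline of the kind LLP is compared against, on f(x) = −2x over X = [−1, 1]; the constraint g t (x) = x − b t and the step sizes are assumptions for illustration only, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
lam, x = 0.0, 0.0
xs, viol = [], 0.0
for t in range(1, T + 1):
    b_t = 0.5 + 0.2 * rng.standard_normal()   # arbitrary time-varying budget
    eta = 1.0 / np.sqrt(t)                    # diminishing step size
    # Primal: one projected gradient step on the Lagrangian -2x + lam*(x - b_t).
    x = float(np.clip(x - eta * (-2.0 + lam), -1.0, 1.0))
    g_val = x - b_t
    viol += max(g_val, 0.0)                   # total (not average) violation
    # Dual (virtual-queue style): lam stays nonnegative.
    lam = max(0.0, lam + eta * g_val)
    xs.append(x)

print(len(xs), all(-1.0 <= v <= 1.0 for v in xs))
```

Unlike LLP's lazy FTRL updates, this greedy iteration reacts only to the last observed constraint, which is exactly the behavior the experiments show oscillating under varying budgets.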

Lazy Lagrangians for Optimistic Learning With Budget Constraints. Daron Anderson, George Iosifidis, and Douglas J. Leith, Senior Member, IEEE. Index Terms: Network control, network management, resource allocation, online convex optimization (OCO), online learning.

Fig. 1. Block diagram of dynamic and adaptive network control solutions with prediction-assisted online learning.

Fig. 3. Performance region and cases for the regret of LLP, under worst-case (or no available) predictions.

TABLE I: x BELONGS TO X AND SATISFIES g t (x ) ≤ 0, ∀t. FOR THE ALGORITHMS WITH TUNABLE BOUNDS WE PRESENT THE BEST ACHIEVABLE W.R.T. R T , WHILE [6] USES DYNAMIC REGRET.
A3. Predictions c̃ t , g̃ t (•) and g̃ t (x t ) are known at t. A4. The prediction errors ε t = c t − c̃ t and δ t = ∇g t (x t ) − ∇g̃ t (x t ) are bounded. Similarly, we can redefine B T and re-derive (12). With this modification, Lemma 2 yields the bound ‖x t − z t ‖ ≤ ε t /σ 1:t , with ε t = c t − c̃ t and ξ t = g t (z t ) − g̃ t (x t ). Selecting the λ that maximizes the second term in the LHS as before, we arrive at the bound on R T .