Explainable and Transferable Loss Meta-Learning for Zero-Touch Anticipatory Network Management

Alan Collet, Student Member, IEEE, Antonio Bazco-Nogueras, Member, IEEE, Albert Banchs, Senior Member, IEEE, and Marco Fiore, Senior Member, IEEE

Abstract—Zero-touch network management is one of the most ambitious yet strongly required paradigms for beyond 5G and 6G mobile communication systems. Achieving full automation requires a closed loop that combines (i) network status data collection and processing, (ii) predictive capabilities based on such data to anticipate upcoming needs, and (iii) effective decision making that best addresses such future needs through proper network control and orchestration. Recent seminal works have proposed approaches to jointly implement the last two phases above via a single deep learning model trained on past network status to directly optimize future decisions. This is achieved by designing custom loss functions that directly embed the management task objective. Experiments with real-world measurement data have demonstrated that this strategy leads to substantial performance gains across diverse network management tasks. In this paper, we go one step beyond the loss tailoring schemes above, and introduce a loss meta-learning paradigm that (i) reduces the need for human intervention at model design stage, (ii) eases explainability and transferability of trained deep learning models for network management, and (iii) outperforms custom losses across a range of controlled experiments and practical use cases.

Index Terms—Loss meta-learning, zero-touch networking, anticipatory network management.

I. INTRODUCTION
The complete automation of mobile network management and orchestration (MANO) is one of the holy grails of future-generation communication systems. Fostered by the ever growing complexity of mobile infrastructures, and enabled by the unprecedented flexibility granted by the progressive virtualization and then cloudification of network functions, zero-touch networking and service management (ZSM) [1] is expected to complement or even supplant current human-based operations in 6G systems [2].
A paramount role in making ZSM a reality will be played by data-driven models, which are expected to build on much improved monitoring and data collection capabilities to take effective network management decisions at a fast pace that is not achievable by human-in-the-loop strategies. To this end, standard-defining organizations (SDO) are working towards integrating legacy machine learning operations (MLOps) within MANO frameworks [3], [4], [5]. These emerging standards are in particular intended to support deep learning models that thrive on large amounts of measurement data and are widely regarded as a most promising tool for network environments [6].
In many cases, ZSM solutions involve anticipatory operations, which allow managing the network based on forecasts of the evolution of its status, and are ostensibly more effective than reactive policing [7]. And, as mentioned above, deep learning models empower the vast majority of the prediction-based frameworks proposed in recent years.
The classical process followed by deep-learning-based anticipatory networking schemes consists of a closed loop where data about the recent history of a certain variable or Key Performance Indicator (KPI) is collected and fed to a deep neural network (DNN) in charge of forecasting its future value. Then, the predicted value is input to a decision-making algorithm, which can be based on statistical modelling, machine learning, or optimization approaches, in order to produce the final network management decision. The resulting pipeline is outlined in Figure 1(a).
However, recent seminal and award-winning studies have started questioning this straightforward approach to anticipatory networking, and have proposed to instead let the deep learning predictor directly forecast the management decision [8], [9], [10], [11]. The integration of the two phases is achieved via the customization of the loss function used during training. By tuning the shape of the loss so that it reflects the management objective, the predictor is optimized to produce forecasts that align with the actual management goal. The concept is illustrated in Figure 1(b), and it is in contrast with the legacy approach in Figure 1(a), where the deep learning data predictor is trained to minimize a vanilla loss such as the Mean Squared Error (MSE) or Mean Absolute Error (MAE). The early studies above found that making predictors aware of the management goal is very beneficial to network management performance, across a variety of tasks that range from resource allocation to traffic engineering.

Fig. 1. Different approaches to forecasting for network management. (a) Traditional data predictor that minimizes, e.g., MAE in forecasting future samples. (b) Forecasting based on a custom loss that is manually designed to mimic the decision-making problem, so that the output of the model is directly the management decision, as introduced by recent seminal works [8], [9]. (c) Our proposed forecasting strategy with a meta-learned loss optimized during training to represent the decision-making problem.
In this paper, we specifically target management for performance optimization, and not all of the fault, configuration, accounting, performance and security (FCAPS) tasks at large. In this context, we argue that, while the previous works represent a right step in the direction of network-management-native machine learning, models relying on custom losses suffer from multiple drawbacks in practical cases, as follows.
• The design of appropriate (and mathematically viable) loss functions for each and every management task requires substantial expert knowledge and engineering time, which does not scale to the growing diversity of problems and associated objectives. The problem is best illustrated by intent-based networking (IBN), an emerging paradigm mandating that an anticipatory decision model is automatically generated on-demand, with operation and output aligned to any specific management intent expressed by the human controller [12], [13]. This results in potentially infinite anticipatory network management problems, and building (and maintaining) a comprehensive catalog of fine-tuned models that tackle each and all of them is not possible in practical contexts.
• A manual design of the loss is necessarily limited by the expert's understanding of the system, which can easily miss inherent but hidden properties, resulting in sub-optimal performance. We will indeed observe this phenomenon in some of our tests in Section VI-C, where accuracy levels of the prediction that depend on the specific input pattern are impossible to capture at design stage, preventing the loss from capturing them, or in situations where the sheer dimensionality of the loss makes it very hard to model, as showcased in Section VII.
• In many practical cases, the relationship between predictions and final performance is not known a priori, which immediately prevents a manual design of the loss function that should capture such a relation. Examples abound: for instance, forecasts of computing resources allocated to specific network slices affect the Quality of Experience (QoE) of the end users in ways that are hard to estimate in advance [14]; anticipatory associations of Distributed Units (DUs) to Central Units (CUs) in virtualized Radio Access Networks (vRANs) are linked to performance and energy consumption in tangled ways that are best observed only after deployment in real-world systems [15]; and all IBN use cases inherently require translating high-level intents into objectives whose relation to the actual management decisions is unknown a priori [16].

To alleviate the limitations above, we propose AutoManager, an original strategy that, instead of imposing a predefined expression of the loss function used to train the deep learning predictor, lets the forecasting model free to learn the loss function that best suits the network management objective at hand. In other words, the loss itself becomes a trainable element of the model, which shall capture the relationship between the predictor's forecast and the target management goal.
The concept underpinning AutoManager is illustrated in Figure 1(c), and ultimately lets the data-driven forecasting model learn at once (i) how its predictions affect the performance goal, and (ii) how to optimize such predictions so that the goal is met. In machine learning terminology, this corresponds to the problem of meta-learning the loss function that shall drive the training of the model. As detailed in Section II, loss meta-learning is a very recent and active field of study, and no ultimate solution exists for regression tasks such as forecasting.
Our experiments show that our proposed approach (i) improves the performance of manually tailored, expert-knowledge-based solutions for the loss-metric mismatch by more than 25%, and does so without any prior knowledge of the loss function (see Section VI-D); (ii) even outperforms a model directly trained with the closed-form expression of the loss (see Section VI-C); (iii) finely characterizes complex non-differentiable discrete losses (see Section VII), allowing it to improve the operator's profit by up to 36%; and (iv) maintains this strong performance when applied to transfer learning (see Section VI-D).
Overall, our study unveils the advantages of meta-learning of loss functions for forecasting, and paves the way for the adoption of this paradigm in practical application domains.

II. BACKGROUND AND NOVELTY
Our work lies at the interface of computer networking and machine learning, and contributes to advancing the state of the art in both domains. In this section we discuss how this is the case, by positioning our proposed model with respect to (i) current frameworks for automated anticipatory management in mobile networks, in Section II-A, and (ii) techniques for meta-learning in DNN architectures, in Section II-B.
We also remark that early versions of this study have appeared in subsequent editions of IEEE INFOCOM [17], [18]: the present manuscript represents a comprehensive treatise of the proposed loss meta-learning strategy for network management, and includes a range of original evaluations of the capabilities of the AutoManager model.

A. Anticipatory Network Management for B5G and 6G Systems
As mentioned in Section I, two main strategies for anticipatory networking can be identified in the literature, one being more traditional and vastly more popular, and the other representing a relatively new proposal. We discuss them next.
Mobile network prediction. Traditionally, forecasting methods for mobile networking have built upon statistical models. The predominant method has been autoregression [19], [20], [21], [22], [23], although there are also proposals based on tools from Markovian [24] or information [25] theory. However, the design of data-driven autonomous solutions for network management with deep learning-based approaches has gathered momentum in recent times [6], and this trend includes forecasting and anticipating future network states, where recent works have shown important improvements in performance on account of employing diverse DNN architectures [26], [27], [28], [29], [30].
All these predictors aim at producing a forecast that deviates as little as possible from the future sample, by minimizing legacy error metrics such as MAE or MSE. In DNN models, as exemplified in Figure 1(a), this is achieved by using MAE or MSE as the loss function, i.e., the expression that the neural network (NN) learns to minimize during training. The output provided by these predictors is completely agnostic to the network management objective. Therefore, the prediction does not offer a solution to the network management task; rather, it is a mere input to the actual decision-making process.
The inherent problem of this approach is that predictions are inevitably imperfect, and yield errors whose impact on the downstream decision-making process is not trivial. As a simple yet representative example, a typical (unbiased) traffic forecasting model incurs roughly equal probabilities of committing positive and negative errors in the estimation of future demands. Yet, while positive errors result in unnecessary but not too harmful over-dimensioning by the decision-maker, negative errors can cause critical under-provisioning and service disruption. To avoid the latter, the decision-making module must somehow compensate for the prediction inaccuracy, and yet it has no information about whether, when, and in what way (e.g., whether the error is positive or negative) the forecast is inaccurate. Ultimately, this creates a cumbersome operation where the decision-making solution must be designed to fix prediction errors, usually in very simplistic, sub-optimal ways (e.g., by introducing a large static over-dimensioning to mitigate underestimation in the example above).
Loss customization. While the classical predictors above factually decouple the forecasting and decision-making problems, recent works on deep learning for network management have proved that jointly solving the two problems is a much more effective approach. Abiding by this strategy, the predictor does not just anticipate future KPI samples, but directly forecasts the network management decision (e.g., the amount of resources needed to serve the incoming traffic). As illustrated in Figure 1(b), this is achieved by training a DNN model with a loss function that encodes a specific management objective. In this way, the forecasting model directly outputs the decision needed to meet the management goal.
This approach has been recently tested to address varied problems, including network function resource reservation [8], [10], anticipatory bandwidth allocation at individual base stations [11], or predictive traffic engineering for wide area networks [9]. In all cases, large performance improvements (typically in the 20%-60% range) have been recorded over the legacy strategy of disjoint traffic prediction and downstream decision-making. The reason is that this design creates a single model that is optimized to translate the previous KPI history into a network management decision; thus, during the training phase such a model learns management resolutions that are aware of the potential prediction inaccuracies.
However, state-of-the-art models for capacity forecasting employ loss functions that are designed manually, based on expert knowledge [10]. This strategy has several limitations that we already presented in Section I, i.e., (i) it requires human intervention, (ii) it assumes that an effective differentiable loss function can be devised by hand, and (iii) it cannot cope with situations where the relationship between the actionable network parameters and the management objective is not known a priori, or is just tangled beyond expert knowledge. The novel AutoManager design we propose in this paper aims at removing all the limitations above.

B. Meta-Learning for Deep Neural Networks
Meta-learning, also referred to as learning-to-learn, overcomes the limitations of fixed learning-based models and allows automatically tuning different aspects of the learning algorithm to the target task [31]. Meta-learning has been successfully applied to, e.g., distillation [32], augmentation [33] or batching [34] of training data, initialization [35] or optimization [36] of the model parameters, tuning [37] of its hyper-parameters, and discovery [38] of the actual architecture, possibly as a composition of modules [39].
Our focus in this work is on meta-learning of loss functions, which aims at learning the parameters, components, or shape of the loss to be used to train the actual model. The problem can be seen as an instance of hierarchical optimization, where a meta-model is optimized under a constraint represented by the main model optimization [40]. We stress that this is semantically different from meta-learning optimization schedules in iterative and alternate optimization processes [41].
Three main approaches to loss meta-learning have been explored to date in the literature, and we detail them next.
Learning to parametrize a predefined loss. The majority of works on loss meta-learning propose dedicated models to infer the most suitable configuration of a predefined, parametrizable loss function. Here, a number of studies have investigated the use of decision networks to select among a set of predefined (families of) loss functions [42], [43], or to learn the parameters of known and differentiable meta-losses [44]. Other relevant investigations have focused on multi-part loss functions, where the goal is setting [45], [46] and possibly dynamically updating [47] the function weights based on (live) performance metrics. Also related to the same concept are strategies such as training a network to correct the optimization trajectory produced by a fixed loss [48], or introducing general loss functions that contain hyper-parameters to be learned during training along with the neural network parameters [49].
In all these cases, the loss, independently of whether it is expressed as a tunable function or a set of primitives, must be designed or selected manually, which is not possible when the performance metric of interest is not known a priori. Instead, we seek a solution that can learn a clean-slate loss.
Learning surrogate loss functions. Surrogate losses are often used as a proxy for discontinuous or otherwise challenging prediction-metric relationships that cannot be directly used to drive the learning process. Surrogates are typically task-specific and handcrafted, which involves significant manual effort for each new task, and of course requires prior knowledge about the relationship of interest. To remove the need for time-consuming human expert intervention, recent proposals have explored the possibility of meta-learning surrogates. Such solutions have proposed to express the performance metric as a function of a set of simple surrogates [50], or to compose a loss function from primitive mathematical operators [51]. A more elaborate strategy is that of learning the loss as a convex-by-construction function within a given parametric family (e.g., MSE or other quadratic operations of the loss input) and identifying the exact parameters providing the best performance [52].
All the proposals above still require a-priori knowledge of the original relationship to identify a relevant set or family of surrogates, hence they do not answer the need of learning the loss at model runtime upon deployment in the target system, which is our main target. A closer design to the one we adopt for AutoManager is a clean-slate surrogate learning based on a neural network modeling of the loss [53]. Yet, the approach is intended for classification tasks only and is not adapted to regression: e.g., it explicitly makes the result invariant to the ordering of the minibatch samples, which is at odds with, e.g., a time series forecasting goal; also, it adopts a bilevel programming optimization of the model parameters that we later show not to perform well in our target regression tasks.
Learning to teach. The concept of representing the loss function via a neural network has in fact been explored beyond the context of surrogate losses, as the so-called learning-to-teach paradigm. In an early work, the use of a teacher network was proposed to dynamically train parameters of a loss function that adapts to the learning stage of the main model [54]; yet, this approach still relies on a generic known loss function to be parametrized by the teacher. The seminal idea of a trainable task-parametrized and clean-slate loss generator was introduced for reinforcement and supervised learning by the meta-critic model, where an action-value function neural network learns to criticise the actions in a specified task [55], [56]. However, the meta-critic model is only applied to supervised learning problems as a tool for pre-training in the few-shot learning of new tasks (e.g., by generalizing to unseen value ranges in the same domain), and is explicitly indicated by its authors as inappropriate for single tasks like the ones we consider.
It is also worth noting that meta-critic and its extensions [50], [57], [58], [59] are dedicated to discrete-space classification tasks or ranking problems. The same is also true of recent proposals to employ genetic programming tools to learn clean-slate loss functions [60], [61].
Novelty in the machine learning domain. Overall, prior studies on loss meta-learning have focused on (i) parametrizable loss functions or (ii) clean-slate losses for classification tasks. Little attention has been paid to the meta-learning of clean-slate losses for regression. Part of the reason comes from the fact that loss meta-learning has been considered to create an indirection that makes single-task regression less efficient [55], under the assumption that losses such as MAE or MSE can already optimally drive training in that case. Our study challenges this assumption and shows that there exist practical use cases, e.g., in system engineering, where loss meta-learning benefits single-task predictions. By investigating meta-learning solutions that target loss functions for forecasting tasks, our work sheds new light on the advantages that this emerging paradigm can bring to a class of machine learning problems where loss meta-learning has been overlooked. Also, all previous solutions in the literature are designed for scenarios where perfect knowledge of the prediction-metric relationship is available at model design time. To the best of our knowledge, ours is the very first study to propose a solution for practical use cases where a-priori information about such a relationship is not available.

III. LOSS META-LEARNING FOR NETWORK MANAGEMENT
We first present the fundamental concept underpinning our design of a loss meta-learning solution for network management.We also explain how this approach compares to reinforcement learning (RL), and formalize the meta-learning problem via the notation used in the rest of the document.

A. High-Level Concept
We propose a novel machine-learning-based approach for forecasting under imperfect knowledge of the loss-metric relationship, i.e., when we cannot fully characterize a priori the mapping between the decision triggered by the model and its impact on the objective performance. Our design, named AutoManager, is built on an elemental idea: we do not pre-select or assume a specific shape for the loss function; instead, we let the model find through learning the function that best characterizes the corresponding network management objective, providing full freedom to explore the possible relationships. This freedom is achieved thanks to the fact that feedforward neural networks are universal approximators [62].
In practical terms, we compose the predictor with another ML block that takes the role of a loss function meta-learner, as illustrated in Figure 1(c). In other words, this piece of the system aims at apprehending the relationship existing between the target network management objective and the prediction (i.e., anticipatory decision) made by the system. In this manner, once the loss meta-learner has finished its training and has learnt a certain function, it becomes an automatically tailored loss function that can assess the quality of the decision taken by the predictor given a certain system state and the considered network management objective.
As we will detail later, the implementation chosen in the subsequent analyses for the previously described predictor and loss meta-learner is DNN-based, although such a choice is not a limitation of the model and other choices are possible.
It is worth noting that, as shown in Figure 1(c), the loss meta-learner is trained so as to minimize a legacy standard loss function (e.g., MAE or MSE) of its output with respect to the network management objective. This choice of a standard loss function is based on a twofold rationale. First, it is aligned with the idea that the meta-learner shall be trained to simply minimize the difference between the estimated and actual performance of the management decision: in particular, the meta-learner does so by directly using performance measurements collected in the target system, removing all need to formalize its operation as a mathematical function, as happens instead with recent custom loss strategies [8], [9]. Second, it makes the approach general and applicable to many different tasks, as it does not involve any (application-specific) expert knowledge: we will show later in the manuscript that the model can recover the performance of expert-knowledge-based solutions without explicitly utilizing such knowledge.
As a result, AutoManager overcomes the fundamental limitations of previous approaches: (i) it can learn the function mapping the forecast decisions to the management objective directly from measurements, with no human intervention; and (ii) it does not require prior system knowledge to correctly characterize the entangled, non-linear and multivariate objectives that characterize network management tasks.

B. Relation to Reinforcement Learning
While it lies within the category of supervised learning methods, the approach proposed above has certain conceptual similarities to RL, from the viewpoint that in both cases the learning process is based on observations of the outcome of the taken decisions. Yet, there are also important differences and some advantages for AutoManager, which we describe next.
First, RL is known to be best suited for discrete-space decisions; in contrast, AutoManager is a method for regression problems (and especially forecasting tasks) that is naturally designed for the continuous input and output spaces often encountered in network management tasks, while also being compatible with discrete spaces.
A second crucial aspect is that our model inherently separates the logical components of (i) anticipatory decision-making, implemented by the predictor element, and (ii) the relationship between the decision and the resulting performance, embedded in the loss meta-learner that steers the learning process. This logical detachment allows us to isolate the learned loss function at the end of training, which is not possible with existing RL techniques. Isolating the loss has two main advantages.
• Explainability: we can explore (e.g., by injecting controlled input) the trained neural network that implements the meta-learner in order to discover how different decisions impact the network performance, thus obtaining precious insights on the system operation that are not known at design time. More generally, this allows revealing how the decision process occurs and makes the "reasoning" of the deep learning model much easier to explain, which is a much demanded feature absent in the vast majority of complex black-box data-driven models proposed for zero-touch network management.
• Portability via transfer or few-shot learning: once the loss is learned, the logical independence of the loss meta-learner allows reusing the learned loss in different settings where the decision-metric relationship is expected to be the same, but the statistics or correlations between inputs may vary: for instance, in scenarios where the same network management task needs to be performed in the presence of diverse traffic demands, such as in different cities, urban versus rural areas, or across countries. Furthermore, this advantage can be exploited in cases where the loss is expected to be similar but not exactly the same as the learned one: as an example, if an identical energy-optimization management task is to be run in the presence of hardware that entails different power consumption profiles. In such situations, a pre-trained loss meta-learner can be further fine-tuned in the new setting via few-shot learning approaches [63].

These two advantages will be demonstrated through our experiments: e.g., Figures 7(d), 10(b) and 11(b) visualize meta-learned losses, and Table VI summarizes transfer learning results; a minimal sketch of how the learned loss can be probed is given below.
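To make the explainability claim concrete, the following sketch (in Python/PyTorch, with all module and variable names hypothetical, not taken from our implementation) shows how a trained cost-estimator network could be probed by sweeping synthetic decisions over a grid while the system state is frozen, which is how loss surfaces like those in the figures referenced above can be traced.

```python
# Hedged sketch: probe a trained loss meta-learner by injecting
# controlled decisions; `pce` is a hypothetical module mapping
# (decision, system state) to an estimated management cost.
import torch

@torch.no_grad()
def probe_learned_loss(pce, v_fixed, y_min=-1.0, y_max=1.0, n=101):
    """Return (decision grid, estimated cost) for a frozen system state."""
    ys = torch.linspace(y_min, y_max, n).unsqueeze(-1)  # candidate decisions
    vs = v_fixed.unsqueeze(0).expand(n, -1)             # repeated system state
    costs = pce(ys, vs)                                 # learned loss values
    return ys.squeeze(-1), costs.squeeze(-1)
```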

C. Formal Problem Formulation
We now formally describe the mathematical formulation of the problem that we consider, and we introduce the notation that will be used hereinafter.
The input and output spaces of the predictor are respectively denoted as $\mathcal{X}$ and $\mathcal{Y}$, and the parameters of the predictor are denoted as $W_p$. Hence, the predictor is modeled as $f_{W_p}: \mathcal{X} \to \mathcal{Y}$, and the multi-dimensional decision of the predictor at time $t$ for time $t+1$ can be written as $\hat{y}_{t+1} = f_{W_p}(X_t)$. Here, $X_t = \{x_{t-T}, \ldots, x_t\} \in \mathcal{X}$ includes observations of the input space for the past $T$ time steps. We further define the number of input and output variables as $n_{in} = |x_t|$ and $n_{out} = |\hat{y}_{t+1}|$, respectively. The vector $\hat{y}_{t+1}$ can be expressed as $\hat{y}_{t+1} = [\hat{y}_{t+1}^{(1)}, \ldots, \hat{y}_{t+1}^{(n_{out})}]$. We note that, generally, $\hat{y}_{t+1} \neq x_{t+1}$, i.e., we do not consider that the output is a direct prediction of the input as in plain time series forecasting: instead, each decision $\hat{y}_{t+1}$ is a compound function dependent on the whole set of input variables $X_t$.
The decision taken at time $t$ triggers a network management performance cost denoted by $M_{t+1} = f_M(\hat{y}_{t+1}, v_{t+1})$, which depends on the prediction $\hat{y}_{t+1}$ itself, but also on system variables $v_{t+1}$ that might affect the performance at time $t+1$. Hereinafter, the term cost refers to the performance degradation that the anticipatory action produces with respect to the target performance metric; that is, it is equivalent to the notion of loss in the machine learning field. It is important to recall that $f_M(\cdot)$ is unknown a priori, either because it is too complex to characterize or because it depends on information not available at model design time. Yet, we can measure the performance a posteriori and thus collect samples of $f_M(\cdot)$ by monitoring the system.

Since $f_M(\cdot)$ is initially uncharted, we first need to discover it in order to enact decisions that align with the network management objective; i.e., we need to identify an estimator $\hat{M}_{t+1} = f_W(\hat{y}_{t+1}, v_{t+1})$ with parameters $W$ that minimize the $L_2$ (i.e., MSE) loss with respect to the true cost $M_{t+1}$. Formally, we need to solve the optimization problem

$$\min_{W} \; \mathbb{E}\big[ L_2\big( f_W(\hat{y}_{t+1}, v_{t+1}),\, M_{t+1} \big) \big]. \quad (1)$$

The cost estimator from (1) can then be used to steer predictions towards the performance objective, by embedding it into a second optimization problem

$$\min_{W_p} \; \mathbb{E}\big[ f_W\big( f_{W_p}(X_t),\, v_{t+1} \big) \big]. \quad (2)$$

The solution to (2) is the predictor $f_{W_p}$ that we seek, which produces a forecast minimizing the expected network management cost $f_W$ at time $t+1$ from the input $X_t$ at time $t$.

IV. THE AUTOMANAGER MODEL
The proposed approach, named AutoManager, is a general-purpose loss-function-agnostic regressor that realizes a twofold learning, as previously mentioned: on the one hand, (i) it learns to predict multiple actions so as to jointly optimize a specific global objective; yet, since the model is unaware of such an optimization goal, (ii) it must also learn the appropriate loss function that correctly reproduces that goal from a-posteriori system measurements. We first provide a model overview, and later describe the detailed implementation.

A. AutoManager: Model Overview
AutoManager adopts a data-driven deep-learning approach that follows the general design for loss meta-learning outlined in Figure 1(c) to jointly solve the optimization problems (1)-(2).
Specifically, our solution relies on two NN-based blocks. AutoManager implements the loss learning with a dedicated neural network; this neural network is trained to solve problem (1) and approximate the network performance based on a-posteriori observations of the aftereffect of the predictive decisions. Then, a second neural network realizes the actual anticipatory network management decision making, and is trained to optimize (2), i.e., to minimize the cost estimated by the first network. A crucial aspect of AutoManager, which we will extensively illustrate afterwards, is the specific structure of the generic two-stage design above, as portrayed in Figure 2 and detailed next.
The predictor implements $\hat{y}_{t+1} = f_{W_p}(X_t)$. Internally, the predictor has a defined structure: it is composed of multiple Individual Processing Units (IPU), which are not connected to each other; the task of an IPU is to learn the unique temporal correlations that characterize a single one of the $n_{in}$ variables. Upon introducing the notation $X_t^{(i)} = \{x_{t-T}^{(i)}, \ldots, x_t^{(i)}\}$ for the history of the $i$-th variable, each IPU receives the past values $X_t^{(i)}$ of one of the $n_{in}$ variables. The output of all IPUs is then gathered by a single Aggregator block, which is an elemental part of the predictor: the Aggregator learns the interaction between the input variables, i.e., how they are intertwined among them and with the performance metric, exploiting this across-dimension knowledge to provide predictions $\hat{y}_{t+1}$ that optimize the network management objective.
Finally, the performance cost estimator (PCE) implements $\hat{M}_{t+1} = f_W(\hat{y}_{t+1}, v_{t+1})$, i.e., it realizes the loss meta-learner described in the conceptual model in Figure 1(c). The purpose of the PCE is to meta-learn the loss used to train the predictor, and therefore it is not active during inference. This performance cost estimator generates the estimated cost that the decision $\hat{y}_{t+1}$ produces in the system. Thus, its purpose is to learn this performance cost as accurately as possible, and hence its own training loss must simply capture the sheer distance between the estimated and measured costs.
As already noted, the internal structure of the predictor is an essential novelty that allows for forecasting from intertwined variables and provides key advantages:
1) The challenging task of optimizing $f_{W_p}$ is sliced into simpler learning sub-tasks thanks to the logical structure based on parallel IPUs all feeding the aggregator.
2) The AutoManager design allows for co-training the predictor and loss estimator during the same gradient descent iteration. This makes the learned loss $f_W(\cdot)$ adapted to the inherent forecasting limits of the predictor.
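A minimal PyTorch rendition of this structure is sketched below; the layer widths and depths are placeholder assumptions rather than our exact architecture, and the LSTM/MLP choices anticipate the internal design discussed in Section IV-B5.

```python
# Illustrative sketch of the AutoManager blocks: parallel IPUs, an
# Aggregator, and the PCE. Sizes are assumptions, not the paper's.
import torch
import torch.nn as nn

class IPU(nn.Module):
    """Learns the temporal correlations of a single input variable."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)

    def forward(self, x):                 # x: (batch, T, 1)
        _, (h, _) = self.lstm(x)
        return h[-1]                      # (batch, hidden)

class Predictor(nn.Module):
    """n_in parallel, unconnected IPUs feeding a single Aggregator."""
    def __init__(self, n_in, n_out, hidden=32):
        super().__init__()
        self.ipus = nn.ModuleList(IPU(hidden) for _ in range(n_in))
        self.aggregator = nn.Sequential(
            nn.Linear(n_in * hidden, 64), nn.ReLU(), nn.Linear(64, n_out))

    def forward(self, X):                 # X: (batch, T, n_in)
        feats = [ipu(X[..., i:i + 1]) for i, ipu in enumerate(self.ipus)]
        return self.aggregator(torch.cat(feats, dim=-1))

class PCE(nn.Module):
    """MLP estimating the management cost of a decision y_hat."""
    def __init__(self, n_out, n_sys, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_out + n_sys, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, y_hat, v):          # (batch, n_out), (batch, n_sys)
        return self.mlp(torch.cat([y_hat, v], dim=-1))
```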

B. Detailed Implementation
We explain the detailed implementation of the proposed model and how we integrate Automated Machine Learning (AutoML) approaches [64] in sub-elements of the model.

1) Pre-Training of Individual Processing Units:
AutoManager's design is steered towards reducing the learning time, because the particular structure of the predictor block makes it possible to individually pre-train each IPU: we can optimize each IPU to minimize $L_2(\hat{x}_{t+1}^{(i)}, x_{t+1}^{(i)})$, i.e., to train each IPU as a standard time-series predictor. Note that, even if such pre-training is carried out with a legacy $L_2(\cdot)$ function, it turns out that it accelerates the learning process in realistic settings.
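For concreteness, a sketch of this pre-training step is given below, reusing the hypothetical IPU module from the previous listing and adding a linear head to produce the one-step-ahead estimate; the window length, epoch count and learning rate are illustrative hyper-parameters.

```python
# Sketch of IPU pre-training as a plain next-step predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_ipu(ipu, series, hidden=32, T=24, epochs=5, lr=1e-3):
    """Fit one IPU to forecast x_{t+1} from {x_{t-T}, ..., x_t}."""
    head = nn.Linear(hidden, 1)
    opt = torch.optim.Adam(
        list(ipu.parameters()) + list(head.parameters()), lr=lr)
    # Slice the 1-D series into (window, next value) training pairs.
    windows = series.unfold(0, T, 1)[:-1].unsqueeze(-1)   # (n, T, 1)
    targets = series[T:].unsqueeze(-1)                    # (n, 1)
    for _ in range(epochs):
        loss = F.mse_loss(head(ipu(windows)), targets)    # legacy L2 loss
        opt.zero_grad(); loss.backward(); opt.step()
    return head
```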
2) Joint Training: AutoManager is implemented as a series of cascaded DNNs, as illustrated in Figure 2, where the PCE block is fed by the current observations and by the forecast output by the predictor. This allows for simultaneously optimizing all the blocks (IPUs, aggregator, PCE) through the same backpropagation process. Actually, AutoManager contains both cascaded and parallel structures, due to the independent design of the IPUs. Specifically, the weights of the DNNs are optimized during training as follows.
First, during the forward pass, the predictor is fed with a set of past observations of the system state from the $T$ previous time instants (as well as other possibly relevant inputs), and it outputs at time $t$ a prediction for $t+1$, $\hat{y}_{t+1} = f_{W_p}(X_t, v')$, where $v'$ is the exploration noise term discussed in Section IV-B3. At time $t+1$, the current observations are measured and passed to the performance cost estimator, which computes the estimated performance $\hat{M}_{t+1} = f_W(\hat{y}_{t+1} + v', v_{t+1})$. At the same time, the actual performance $M_{t+1}$ of the decision is measured in the system. Then, the mismatch between estimated and true performance is evaluated via a legacy standard loss function, and backpropagated first to the PCE DNN. Here, the PCE DNN updates its weights $W$ to better capture the relation between $M_{t+1}$ and the combined values of the prediction $\hat{y}_{t+1}$ and the system state $v_{t+1}$. Within the same iteration, the updated loss is sequentially backpropagated to the predictor DNN, which allows improving the alignment of the forecast with the optimal decision that minimizes $M_{t+1}$.
This design increases the efficiency of the training phase with respect to the case where each block is optimized independently, e.g., by feeding the metric estimator block with random predictions and, once the loss has been learned, using it to train the predictor. Indeed, co-training allows learning a loss $f_W(\cdot)$ that is adapted to the intrinsically limited accuracy of the predictor; as an example, co-training may lead to learning diverse shapes of the loss depending on the magnitude of the target variable, if the quality of the prediction is found to be affected by the absolute value of the target variable.
It is worth noting that such co-training represents a major novelty of our model with respect to previous loss-learning proposals [53], [54], [55]. Indeed, end-to-end backpropagation training was not possible in prior models, where the two elements (i.e., the learning-to-act block and the learning-to-correct block) were trained either iteratively or in a nested manner only.
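The joint iteration can be sketched as follows. This is a simplified rendition under our own naming: we detach the prediction in the PCE update so that the fitting error does not flow into the predictor, and for brevity we omit feeding the exploration noise back to the predictor, which the full design does (cf. Section IV-B3).

```python
# Hedged sketch of one AutoManager co-training iteration.
import torch
import torch.nn.functional as F

def cotrain_step(predictor, pce, opt_pred, opt_pce,
                 X_t, v_next, m_true, noise_std):
    y_hat = predictor(X_t)                         # anticipatory decision
    eps = noise_std * torch.randn_like(y_hat)      # exploration noise v'
    # (i) Fit the PCE to the measured cost M_{t+1}.
    loss_pce = F.mse_loss(pce(y_hat.detach() + eps, v_next), m_true)
    opt_pce.zero_grad(); loss_pce.backward(); opt_pce.step()
    # (ii) Descend the freshly updated learned loss with the predictor;
    # opt_pred only holds the predictor's parameters.
    loss_pred = pce(y_hat + eps, v_next).mean()
    opt_pred.zero_grad(); opt_pce.zero_grad()
    loss_pred.backward()
    opt_pred.step()
    return loss_pce.item(), loss_pred.item()
```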

3) Noisy Exploration During Training:
A key element of the architecture of AutoManager in Figure 2 is that we incorporate random noise into the decision received by the loss meta-learner, i.e., the input of the performance cost estimator (PCE) at time $t$ is $\hat{y}_{t+1} + v'$ instead of the $\hat{y}_{t+1}$ provided by the predictor, where $v'$ is a zero-mean random variable.
This idea, inspired by the standard habit in RL of exploring undiscovered states by taking random decisions, allows the PCE to improve the characterization of the continuous input-output relationship, better exploring the domain of the loss.
Mathematically, when we introduce the distortion $v'$, the gradient-descent updates of the NN weights become

$$W_p \leftarrow W_p - \alpha_t^p \,\nabla_{W_p} f_W\big(\hat{y}_{t+1} + v',\, v_{t+1}\big),$$
$$W \leftarrow W - \alpha_t \,\nabla_{W} L_2\big(f_W(\hat{y}_{t+1} + v',\, v_{t+1}),\, M_{t+1}\big),$$

where $\alpha_t^p$ and $\alpha_t$ are respectively the learning rates of the predictor and of the performance cost estimator DNNs.
The noise $v'$ is only used in training, and it is set to $v' = 0$ during testing, once the expression of the loss $f_W(\hat{y}_{t+1}, v_{t+1})$ is assumed to be learnt. In this regard, a critical design feature of AutoManager is that $v'$ is also input to the regressor DNN: during training, this lets the prediction block learn the correlation between such input and the added disturbance to its output, since otherwise the predictor would try to compensate for the distortion added by $v'$. Then, during inference, setting $v'$ to 0 allows producing forecasts $\hat{y}_{t+1}$ that are not biased by the loss exploration used in training.
The goal of the random variable is to allow for further exploration of the input values, supplying the metric estimator with a broader observation of the input domain beyond that provided by the training samples, and improving the reliability of the characterization of the loss function over the continuous domain. There is in fact an intuitive analogy between this additive noise and the typical approach applied in reinforcement learning of taking a non-optimal (and typically random) decision with some probability, creating an exploration-exploitation trade-off. Similar to what occurs in the reinforcement learning case, we observe that the noise is most beneficial when it exponentially decays as the training advances. More details are presented in Section VI for practical use cases.
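A possible decay schedule, with illustrative constants, is sketched below; the noise is set to exactly 0 at inference time.

```python
import math

def exploration_std(epoch, sigma0=0.1, decay=0.05):
    """Exponentially decaying std for the exploration noise v';
    sigma0 and decay are illustrative hyper-parameters."""
    return sigma0 * math.exp(-decay * epoch)
```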
4) Cyclic Learning Rate: We incorporate the Cyclic Learning Rate (CLR) method [65] into the design. CLR consists in dynamically adapting the learning rate during training to better explore the properties of the gradient of $f_W(\hat{y}_{t+1}, v_{t+1})$ during the process of optimizing the weights $W_p$ of the predictor. This method is especially beneficial because, in realistic use cases, the representation of $f_W(\hat{y}_{t+1}, v_{t+1})$ learned by the PCE (or loss meta-learner) is a multi-variate high-degree polynomial that does not have saddle points.
CLR makes the learning rate vary within a certain range, where the extreme values of such a range are updated at each training batch. The changes of the learning rate are defined by the designer, e.g., by specifying that it follows a triangular function between two values, or an exponential decay function. Eventually, CLR enables faster automatic convergence for any shape of the network management objective function $f_M$, and it also gracefully aligns with the AutoML principles embraced by AutoManager because it automates part of the ML hyper-parameter configuration: it reduces the model sensitivity to the initial value of the learning rate by autonomously varying it, preventing the learning from being stuck due to a wrong default learning rate choice.
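As a hedged example of how CLR could be wired in, PyTorch ships a CyclicLR scheduler that updates the learning rate every batch; the bounds, cycle length, stand-in model and placeholder objective below are all illustrative.

```python
import torch

predictor = torch.nn.Linear(4, 1)   # stand-in for the AutoManager predictor
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-4, max_lr=1e-2, step_size_up=200,
    mode='triangular2',              # triangular range, halved every cycle
    cycle_momentum=False)            # required when using Adam

for step in range(1000):
    x = torch.randn(8, 4)
    loss = predictor(x).pow(2).mean()          # placeholder objective
    opt.zero_grad(); loss.backward()
    opt.step()
    scheduler.step()                 # vary the learning rate per batch
```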

5) Internal Design of Components:
AutoManager is a conceptual model that accommodates any type of backpropagation-trained models. We remark that our focus is not on improving the design of the predictor over state-of-the-art forecasting algorithms, such as N-BEATS [66] or DeepAR [67], but on addressing the loss-metric mismatch. In fact, the predictor block (IPUs and Aggregator) in Figure 2 can implement any forecasting model, including those mentioned above. Yet, even a perfect predictor would not minimize a practical performance metric if trained with standard error-reduction losses.
The dimensionality of the input ($n_{in}$) and output ($n_{out}$) depends on the use case and, hence, the exact architecture of the neural networks, e.g., their layering and activation, also depends on the considered task and the complexity of the involved data. Nevertheless, we propose a generic and non-limiting implementation that sets the type of NN used for each component: each IPU is constructed as an LSTM neural network, while the aggregator and the PCE are implemented as fully connected Multi-Layer Perceptrons (MLP). The rationale for the choice of LSTM-based IPUs is that the IPUs are designed to learn the marginal temporal correlations of each variable, while for the PCE the MLP is the most general model one can adopt to learn completely unknown correlations like those linking $\hat{y}_{t+1}$ and $M_{t+1}$. We would like to clarify that the PCE is not used as a predictor, since the forecasting role is taken by the IPUs. Instead, the PCE acts as a 'shape-matcher' whose purpose is learning to mimic the behavior of the performance metric. In fact, MLPs are broadly used for shape representation, where their capability of capturing very complex polynomial functions is well suited: advanced models such as Implicit Neural Representations (INRs) adopt MLPs with sinusoidal activation functions (SIREN). Nevertheless, as mentioned above, the internal structure of the blocks is a design choice, and AutoManager accepts any other inner block architecture.
The PCE is then trained against the MSE loss, and the stochastic gradient descent is computed through the well-known Adam optimizer.
This modest architecture can in fact model rather complex non-linear loss functions, and thus satisfy the performance requirements expected in many network management tasks. We remark that this is an example structure and the method does not rely on the mentioned specific NN types. Moreover, one could integrate AutoML solutions to automatically select the adequate size or type of the neural networks, increasing the autonomy of AutoManager to a further extent.

V. MEASUREMENT DATASETS
Before presenting the different analyses and use cases that we consider, we describe, for the sake of clarity, the datasets used in the remainder of the document, as we make use of them in several sections. We have four different datasets, each containing different metrics and considering different use cases, and each used unevenly across the document. The first one, containing real-world measurements of the traffic demand of services in a commercial country-wide cellular network, is the main dataset, and it is used for all the different analyses. The other three datasets (of power grid demand, per-household energy consumption, and generic time series) are also considered here to generalize the results and offer a broader view of the implications and potential of the proposed approach. These datasets are described as follows.
a) Dataset trafficApp: This dataset contains real-world mobile data traffic demand generated by twelve popular service providers in a large metropolitan area during several consecutive months [68]. The data was collected and aggregated by the network operator using passive measurement probes, resulting in traffic levels (in bytes) every five minutes, for more than 22,000 samples for each of the services. Individual IP sessions were mapped to specific services using Deep Packet Inspection (DPI) and commercial traffic classifiers deployed by the operator. The data was geo-referenced, and it was processed in secure premises by the operator, in compliance with international regulations, and under the supervision of the relevant data protection officers. For our study, we had access to de-personalized aggregates of traffic time series at core network and Edge datacenters. trafficApp is the main dataset in the subsequent experiments because of the quality, quantity, and relevance of the data that it contains. In general, we employ 9 weeks of data for training, 1 week for validation, and 1 week for testing.
b) Dataset powerGrid: This dataset describes the hourly energy consumption in a part of the Eastern Interconnection grid in the United States of America (U.S.A.) between 2002 and 2018. The data comes from a regional transmission organization (RTO) [69], and it incorporates information from different power providers, each of them managing a different geographical area. Overall, the time series consists of more than 145,000 samples.

c) Dataset homeEnergy: This dataset describes the per-household energy consumption, using outdoor and indoor characteristics such as temperature, humidity or wind speed as inputs. These data are provided every 10 minutes for more than 4 months, and include over 20,000 data points [70].
d) Dataset M4data: The M4 dataset is a data collection created for the 4th Makridakis forecasting Competition [71]. It is composed of 100,000 time series with different seasonalities: yearly, quarterly, monthly, weekly, daily and hourly series. M4data was collected by randomly sampling 100,000 time series from the ForeDeCk database, and the time series were anonymized, including time stamps, to guarantee objectivity.

VI. CONTROL EXPERIMENTS AND ABLATION STUDIES
We provide a comprehensive analysis of the performance of the proposed approach in several controlled experiments, with the aim of describing in detail the contributions of each of its components and its overall advantage over previous loss meta-learning proposals for network management applications.
We first focus on proving that the proposed structure is indeed required and crucial for the model. To that end, we compare the performance of the architecture presented in Section IV against three alternative architectures in a simple toy example. Second, we provide an ablation study and a comparison with state-of-the-art approaches for several loss functions with different levels of complexity, so as to prove the advantage of loss meta-learning; we do so for single-variable problems for the sake of comprehension. After that, we provide results on generic time-series forecasting problems and, finally, a detailed analysis of realistic yet simple use cases.

A. Ablation Study of AutoManager Logical Architecture
First, we compare the proposed architecture with three alternative architectures, illustrated in Figure 3 and defined as follows.
• The bm-monolithic approach, presented in Figure 3(a), implements the predictor with a single block composed of fully connected LSTM and MLP layers, in such a way that it receives the whole input $X_t$ and outputs all forecasts $\hat{y}_{t+1}$. Hence, this model is the most naive generalization of a single-input single-output predictor.
• The bm-split model parallelizes the multi-dimensional forecasting task along single input-output tuples, decomposing the problem into $n_{in}$ single-variable problems, i.e., independent prediction instances, as in Figure 3(b).
• The bm-merged model is an architecture compromising between the previous solutions: it is composed of multiple parallel predictors that feed a single performance cost estimator during training, as illustrated in Figure 3(c).

The aforementioned benchmarks represent simplified versions of AutoManager: while bm-monolithic merges the internal IPU-based architecture of the predictor in Figure 2 into gathered LSTM layers, bm-merged withdraws the other key element, the aggregator. Thus, these approaches allow us to deliver an ablation study for the proposed model. We also remark that bm-split and bm-merged implicitly assume that $n_{in} = n_{out}$ and that $\hat{y}_{t+1}$ can be inferred from $x_t$ only; consequently, the generality of both is notably reduced with respect to AutoManager, which removes such assumptions.
We also compare the whole architecture against versions of it that miss some of its key components, namely:
• The fixedLR variant, which replaces the cyclic learning rate of Section IV-B4 with a constant learning rate.
• The noNoise design, which consists in setting $v' = 0$ in AutoManager's training.
• Finally, Grabocka is an approach inspired by [53], using a training approach similar to the bilevel programming optimization of the model parameters adopted in [53], where the loss model and predictor are alternately trained for some epochs, instead of being trained simultaneously in the same gradient descent iteration as in AutoManager.

To test the models, we make use of a simple yet illustrative toy example where the unknown objective is producing the average of all inputs in the next time slot. Formally, applying the notation introduced in Section IV, the predictor shall output a scalar $\hat{y}_{t+1} = \frac{1}{n_{in}} \sum_{i=1}^{n_{in}} x_{t+1}^{(i)}$. This is an uninvolved problem with intertwined predictions, as the output depends on predictions for all inputs. This use case allows for a straightforward way to experiment with the different models and is not intended to model any practical network management task (which will be analyzed in Section VII).
We summarize the results in Table I. bm-split and bm-merged both suffer from the decoupling of each output $\hat{y}_{t+1}^{(i)}$ from all inputs $X_t$, and AutoManager reduces the error generated by these approaches by between 75% and 90%. This proves, in particular, that the inner structure of the predictor, with the separated IPUs feeding the aggregator, is crucial to combine the contribution of each variable, making sure that each IPU learns its (potentially different) role towards achieving the objective and ensuring that the task is learnt. Additionally, the poor performance of the bm-monolithic model, whose MAE more than doubles the one attained by AutoManager, showcases the importance of parallel individual IPUs to independently manage input time series that are not naively correlated, since bm-monolithic treats all the input time series as a single vector, reducing the performance of the LSTM layers. Finally, the simultaneous training of metric estimator and predictor through the same backpropagation iteration improves the performance by up to 25%, as observed by comparing AutoManager against Grabocka. Overall, AutoManager yields largely superior performance over the state of the art and competitor architectures, even in a simple (for intertwined variables) toy example.

B. Comparative Evaluation of AutoManager Loss Learning
Next, we perform a comprehensive comparative evaluation and ablation study to demonstrate the advantages of AutoManager over state-of-the-art approaches based on tailored loss functions. For the sake of clarity, we now consider a single-variable use case, such that there exists only one IPU; thus, the aggregator can be removed (or implicitly incorporated into the IPU), and hence we have a single-block predictor.
1) Considered Use Cases in Controlled Environments: We evaluate the performance in a controlled environment that favors the interpretability of the results. For that, we consider a scenario where the management objective (i.e., $f_M$, using the notation of Section IV) is known and simple enough to be expressed in a closed form. This allows for a manually designed loss function that is tailored to the objective, and thus we can provide a baseline benchmark that has full knowledge of the loss function. We consider four different use cases with diverse forms of $f_M$ for the sake of generalization, as follows.
a) Traffic forecasting under absolute error. The objective is traditional traffic forecasting, i.e., predicting a traffic $d_t$ that matches the future data traffic volume $s_{t+1}$, with an error cost linearly proportional to the absolute discrepancy. Formally, $M_{t+1} = |d_t - s_{t+1}|$, which is optimized by an MAE loss.
b) Traffic forecasting under squared error. Similar to the previous case, but the cost for the operator grows quadratically with the magnitude of the error. Formally, $M_{t+1} = (d_t - s_{t+1})^2$, which is minimized by a legacy MSE loss function.
c) Resource allocation with probabilistic guarantees. The management objective is providing a probabilistic guarantee on the anticipatory allocation of resources, so as to accommodate the future traffic demand $s_{t+1}$ a fraction $\tau$ of the time. Formally, $M_{t+1} = \tau\, R(s_{t+1} - d_t) + (1 - \tau)\, R(d_t - s_{t+1})$, where $R(x) = x \cdot \mathbb{1}_{x \geq 0}$, and $\mathbb{1}_C$ is an indicator function that takes value 1 if condition $C$ is verified and 0 otherwise. This is a quantile forecast that can be optimized via a pinball loss [72].
d) Capacity forecasting. The goal of the operator is anticipating the capacity to (i) avoid an expensive monetary fee $\alpha$ incurred for non-serviced future traffic $s_{t+1}$, and (ii) limit unnecessary overdimensioning beyond $s_{t+1}$. Formally, $M_{t+1} = \alpha \cdot \mathbb{1}_{d_t < s_{t+1}} + R(d_t - s_{t+1})$. The capacity forecasting problem has been addressed in the literature by developing the expert-designed DeepCog loss function [8].
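For reference, the four costs above can be written compactly as below (a PyTorch sketch with our own function names; the capacity-forecasting expression follows the reconstructed cost above and, being discontinuous in $d_t$, would need a smooth surrogate such as the DeepCog loss [8] for actual gradient-based training).

```python
import torch

def mae_cost(d, s):                                    # use case (a)
    return (d - s).abs().mean()

def mse_cost(d, s):                                    # use case (b)
    return (d - s).pow(2).mean()

def pinball_cost(d, s, tau=0.95):                      # use case (c)
    e = s - d
    return torch.max(tau * e, (tau - 1.0) * e).mean()

def capacity_cost(d, s, alpha=1.0):                    # use case (d)
    violation = (d < s).float()                        # SLA miss fee
    overprov = torch.clamp(d - s, min=0.0)             # overdimensioning
    return (alpha * violation + overprov).mean()
```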
All experiments use real-world traffic generated by Facebook Live from dataset trafficApp. We employ 9 weeks of data for training, 1 week for validation, and 1 week for testing.
2) Benchmarks: We compare AutoManager with a wide range of benchmarks that include baselines, state-of-the-art models for loss learning, and variants of our proposed scheme.

a) Baseline approaches: Two different solutions are used as a basis for our comparative performance evaluation.
• Manual: The same predictor used in AutoManager is trained with a loss function designed manually to fulfill the specific target objective. As anticipated, the MAE, MSE, pinball and DeepCog losses are used in the four use cases.
• Disjoint: The prediction and loss-learning functionalities are logically separated: first, the PCE is trained in isolation, receiving as input uniform random noise and measuring the resulting performance, so as to learn the correct loss of the objective; then, the predictor is trained using the loss previously learned by the PCE.

b) Loss-learning models: The Adaptive Loss Alignment (ALA) method is a state-of-the-art approach for loss learning in classification tasks [45]. While ALA was originally proposed for discrete classification tasks, we adapt it to regression by: (i) changing the type of characteristics used for validation, replacing the (logarithmic) probabilities that the input pertains to each class with the first- and second-order statistics of the regression values; and (ii) swapping the set of classification-oriented loss functions originally used by ALA with expressions that are suitable for regression. We test two ALA models, which differ in the expression used.
• ALA-manual applies ALA to a linear combination of all the manually designed loss functions that are used separately in the Manual approach above. The rationale is having ALA automatically select the correct loss function for each use case by tuning the linear combination weights.
• ALA-moldable applies ALA to a single loss function with a highly parametrizable shape. The function, illustrated in Figure 4, can potentially mimic any of the losses for the controlled environment use cases, and the experiment tests whether ALA can learn the correct values of the parameters $x_0$, $x_1$, $x_2$, $y_1$ and $y_2$ (one plausible parametrization is sketched after the list of variants below).
c) Variants of AutoManager: To complement the ablation study presented in Section VI-A, we also test in these experiments the following variants of our proposed model.
• fixedLR and noNoise are as defined in Section VI-A.
• Iterative adopts an alternating training strategy instead of AutoManager co-training. Specifically, the predictor DNN is trained in isolation during odd iterations, using the loss currently implemented by the loss meta-learner; then, the loss-learning DNN is trained during even iterations, by adding noise to the output of the predictor.
• Half is a complementary technique that can be adopted in combination with the noNoise and Iterative approaches above. It first trains the predictor and loss meta-learning DNNs jointly, as mandated by either model; then, it freezes the loss meta-learner and keeps training only the predictor for better convergence.
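For reference, the following sketch shows one plausible parametrization of the ALA-moldable loss. The exact shape of Figure 4 is not reproduced in the text, so the piecewise-linear form, the role assigned to each parameter, and the example values below are our assumptions for illustration.

```python
import numpy as np

def moldable_loss(err, x0, x1, x2, y1, y2):
    # Hypothetical moldable shape: a V-like curve with its minimum at
    # err = x0, reaching cost y1 at x1 < x0 and y2 at x2 > x0, and
    # extended linearly beyond the two breakpoints.
    left_slope = y1 / (x0 - x1)    # slope on the under-estimation side
    right_slope = y2 / (x2 - x0)   # slope on the over-estimation side
    return np.where(err < x0, left_slope * (x0 - err), right_slope * (err - x0))

# With x0=0, x1=-1, x2=1, y1=y2=1 this reduces to an MAE-like |err|;
# with y1=tau, y2=1-tau it mimics an asymmetric pinball-like shape.
```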

3) Results:
We first illustrate AutoManager's capability to learn a suitable loss function in the considered use cases. Figure 5 portrays the four loss functions $L_{w_t}$ learned by our model, which map the error $d_t - s_{t+1}$ into the target system performance. The corresponding objective $f_M$ is superposed on the learned loss to facilitate the interpretability of the result. The two shapes are well aligned in all cases. The only significant difference, which emerges in the case of capacity forecasting, is due to the fact that the original objective is not differentiable, or even continuous, and hence cannot be directly used as a loss function: as mentioned in Section IV, and as a desirable by-product of co-training, AutoManager learns a differentiable version of $f_M$, so that the latter can be used to train the predictor DNN.
Complete results for the controlled environment use cases are summarized in Table II. Across all settings, AutoManager stands out as the model with the best performance, or a close second; more precisely, when not yielding the best result, our solution is typically within the variance of the method ranking first. A closer inspection of the exact figures reveals several important observations, as follows.
• The models that perform close to AutoManager are those that involve human intervention, which is needed to define a tailored (and possibly parametrizable) loss function for the specific goal, such as Manual or ALA-moldable; instead, the training of AutoManager is fully automated.
• The models performing best in some use cases tend to have highly fluctuating performance under other objectives, where they generate poor predictions; instead, AutoManager performs consistently well across all target use cases, which demonstrates its flexibility and generality.
• AutoManager produces results with a sensibly lower standard deviation, which elicits a more consistent quality of anticipatory decisions.
• Juxtaposing AutoManager with its variants proves that all design elements in Section IV contribute to the performance of the model: removing co-training (Iterative), noisy exploration (noNoise), or learning rate adaptiveness (fixedLR) deteriorates results.
Overall, the results obtained in the controlled environments clearly showcase the gain of AutoManager over other methods, in terms of sheer performance and flexibility. Importantly, this is attained while also reducing the need for human intervention.

C. Advantages of AutoManager for Plain Traffic Forecasting
We now evaluate the performance of the proposed solution for simple time series forecasting with standard loss functions such as MAE or MSE. This analysis sheds light on one of the key aspects of AutoManager: it can train the predictor to optimize a given loss function in such a way that it obtains better results than if the predictor were directly trained with the true static loss function. As we show in the following, this is due to the fact that AutoManager provides more flexibility in the construction of the loss function, being able to adapt the shape of the loss function to different values of the input data.

1) Comparison Against a State-of-the-Art Algorithm:
We challenge our model with the winner of the well-known M4 competition [74], i.e., the ES-RNN model [73], which combines exponential smoothing and a Recurrent Neural Network (RNN). The results are computed without the ensemble-learning part (due to computational limits) and compared to our model by evaluating both solutions on the same M4 dataset M4data. The M4 competition determined the winner according to three metrics: the mean absolute scaled error (MASE), the symmetric mean absolute percentage error (sMAPE), and the Overall Weighted Average (OWA), formally expressed as follows:
$$\text{sMAPE} = \frac{100}{H} \sum_{t=1}^{H} \frac{2\,|y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}, \qquad \text{MASE} = \frac{\frac{1}{H}\sum_{t=1}^{H} |y_t - \hat{y}_t|}{\frac{1}{n-m}\sum_{t=m+1}^{n} |y_t - y_{t-m}|},$$
$$\text{OWA} = \frac{1}{2}\left(\frac{\text{sMAPE}}{\text{sMAPE}_{\text{Naive2}}} + \frac{\text{MASE}}{\text{MASE}_{\text{Naive2}}}\right),$$
where $H$ represents the number of samples in the subset of the dataset corresponding to a certain seasonality, $n$ the number of in-sample observations, $m$ the period of the season (e.g., 24 for hourly, 7 for daily, etc.), and Naive2 the seasonally adjusted naive benchmark of the competition.
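These are the standard M4 definitions; the following sketch computes them for a single series, assuming the Naive2 reference scores are given (they were published with the competition results).

```python
import numpy as np

def smape(y, y_hat):
    # Symmetric MAPE over the H test samples, in percent.
    return 100.0 * np.mean(2.0 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mase(y, y_hat, y_train, m):
    # Forecast MAE scaled by the in-sample MAE of the seasonal-naive
    # method with period m (24 for hourly data, 7 for daily, ...).
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - y_hat)) / scale

def owa(y, y_hat, y_train, m, smape_naive2, mase_naive2):
    # Overall Weighted Average: mean of the two scores, each taken
    # relative to the Naive2 benchmark of the competition.
    return 0.5 * (smape(y, y_hat) / smape_naive2
                  + mase(y, y_hat, y_train, m) / mase_naive2)
```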
The results are summarized in Table III and show that AutoManager obtains better performance for all the considered seasonalities except for quarterly, where both values are comparable and lie within each other's confidence interval. This result proves that AutoManager achieves state-of-the-art performance even though its design is not focused on plain forecasting (but rather on joint loss learning and forecasting).
2) Simple Forecasting of Time Series of Demands: The results obtained in the M4 competition motivate us to further investigate the reasons why AutoManager can overcome state-of-the-art solutions without the need for complex and very deep NN architectures. Next, we compare the one-step forecast returned by the same predictor DNN when it is trained under the AutoManager model (i.e., via co-training with a loss meta-learning DNN) and when it is trained with a standard legacy MAE loss. The predictor DNN and its hyperparameters are identical in the two cases. The results are presented on MinMax-scaled datasets and averaged over 5 different runs for each experiment.
To implement the loss meta-learner employed by AutoManager, we use for all considered datasets a simple Multi-Layer Perceptron (MLP). The size of the MLP depends on the number of inputs, i.e., on the size of $\hat{y}$ and $v$. Importantly, we assess the quality of the prediction in terms of the MAE itself. Therefore, we use as a benchmark a DNN trained with a regular MAE loss, which in fact represents the apparently optimal (and very commonly adopted) choice to drive the optimization. Experiments are run using different real-world datasets, all described in Section V. The first corresponds to the mobile data traffic demand generated by four video streaming services (Facebook Live, Netflix, Twitch, and YouTube) from dataset trafficApp; the second is powerGrid, which contains the hourly energy consumption of part of the U.S. power grid; and the third corresponds to the per-household energy consumption in homeEnergy.
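As an illustration, the sketch below renders our reading of the co-training loop in PyTorch: a toy predictor and an MLP loss meta-learner (the PCE) are updated through a single backward pass. All architectures, dimensions, and the exact gradient routing between the two terms are assumptions for the sake of the example, not our exact implementation.

```python
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
pce = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # inputs: (y_hat, v)
opt = torch.optim.Adam(list(predictor.parameters()) + list(pce.parameters()), lr=1e-3)

def co_train_step(x, v, metric_fn, sigma):
    # x: (B, 32) input features; v: (B, 1) context; metric_fn: measured
    # system performance for a given decision; sigma: exploration std.
    y_hat = predictor(x)
    noisy = y_hat + sigma * torch.randn_like(y_hat)        # loss-exploration noise
    # (1) PCE term: regress the measured metric on the (detached) noisy output.
    est = pce(torch.cat([noisy.detach(), v], dim=1))
    fit_loss = ((est - metric_fn(noisy.detach(), v)) ** 2).mean()
    # (2) Predictor term: descend the learned, differentiable loss surface.
    pred_loss = pce(torch.cat([y_hat, v], dim=1)).mean()
    opt.zero_grad()
    (fit_loss + pred_loss).backward()                      # single backpropagation
    opt.step()
```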
The results for the application of AutoManager to standard loss functions are shown in Table IV and Figure 6. Table IV shows the MAE performance (mean and std. deviation) obtained from (i) training the predictor block with the standard MAE loss function (leftmost column), and (ii) using the AutoManager approach presented above, where the loss function is also learned (central column), along with the percentage gain of (ii) over (i). AutoManager reduces MAE costs by up to 5% in the best observed cases. This represents a significant gain, considering that the predictor architecture is the same in both cases, especially if we take into account that AutoManager is trained to learn the MAE loss function. Thus, AutoManager succeeds in learning more than the teacher knows.
This excellent performance mostly comes from the fact that AutoManager adapts the loss shape to the input data and, therefore, optimizes the training phase almost independently for different input values. We can clearly see this phenomenon in plots (b) and (d) of Figure 6, which separate the meta-learned loss and the error distributions for samples to be predicted that belong to the top (in red) and bottom (in green) deciles of the overall value distribution, i.e., to especially high and low demands, respectively. The plots show how the loss learned by AutoManager has different shapes depending not only on the forecast error (i.e., the x-axis value in the plots), but also on the absolute value of the demand to be predicted (i.e., the red and green curves). When comparing the error distributions with those obtained under a fixed MAE loss in plots (a) and (c) of Figure 6, such flexibility of AutoManager leads to errors with lower means for both high and low demands. In other words, loss meta-learning proves capable of correcting biases in the accuracy of the prediction that emerge for different magnitudes of the demand, in use cases across different domains.

D. Hyper-Parametrization and Transfer Learning
Next, we consider two controlled yet realistic use cases to finalize the detailed characterization of the proposed approach.
Capacity forecasting for resource allocation. We first consider a capacity forecasting problem, along the lines of that introduced in Section VI-B. The operator needs to predict the required capacity resources for the next 5 minutes, which lies within the standard operation time scale of modern network function orchestrators [75]. In this case, the performance metric matches the asymmetric expression of Section VI-B, use case (d). As mentioned there, recent studies in the computer networking literature have proposed an expert-designed loss function, DeepCog, which can be considered a handcrafted, differentiable surrogate of the true performance metric [8]. We thus employ DeepCog as a state-of-the-art benchmark for AutoManager in this use case.
Power grid management. The complex field of power grid management and the study of smart grids are ruled by a significant number of diverse KPIs [76], [77], [78]. One of the fundamental dimensions that define the performance in such scenarios is the reliability of the network, i.e., how often the network fails to provide the required power. Interestingly, reliability in power management is not only measured by the frequency of power cuts due to under-provisioning, but also by the duration of these cuts [76]. The smart grid manager is especially interested in preventing under-estimations, as in the previous use case of network resource allocation; however, the relevant performance metric also incorporates memory about past outages. Specifically, in the presence of an underestimation, if the previous forecast was also underestimated, the current cost is added to the preceding cost in a recursive manner (a toy rendering of this recursive cost is sketched below). Thus, the metric depends on the previous state of the system, and involves discrete (thus discontinuous) jumps when consecutive outages occur. Manually designing a surrogate version of the multi-dimensional loss function matching the metric above is not trivial, and we are not aware of any such attempt in the literature. AutoManager removes this barrier by learning the correct loss in an automated way.
1) Performance Evaluation for Both Use Cases: For both use cases, we train the corresponding predictor with AutoManager, and we compare its performance against that obtained when the predictor is trained with standard MSE and MSLE losses, and with the manually designed surrogate when applicable. The metrics used to compute the quality of the forecast are those returned by the emulation environments, as they reflect the actual goal of the prediction for the network or grid manager.
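The memory effect of the reliability metric can be illustrated with the toy sketch below; the per-slot penalty and the exact accumulation rule are illustrative assumptions, intended only to convey how consecutive outages compound the cost.

```python
def powergrid_cost(demand, supplied, alpha=1.0):
    # Toy memory-bearing reliability cost: the under-provisioning penalty
    # accumulates recursively while outages persist in consecutive slots,
    # and the memory resets as soon as demand is met again.
    total, carried = 0.0, 0.0
    for d, c in zip(demand, supplied):
        if c < d:                                # under-provisioning: outage slot
            carried = carried + alpha * (d - c)  # add to the preceding cost
            total += carried
        else:
            carried = 0.0                        # outage ended: memory resets
    return total
```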
Results are summarized in Table V for the two use cases (which use the datasets trafficApp and powerGrid, respectively). Specifically, the capacity forecasting is performed for the Netflix traffic demand in trafficApp. AutoManager clearly outperforms traditional training based on standard loss functions, because the latter creates a mismatch with the actual performance metric. Moreover, for the first use case, we also provide the performance obtained with the expert-defined DeepCog loss [8]. AutoManager yields cost reductions of around 25% over the expert-designed solution, which corresponds to significant operating expense cuts in mobile network infrastructure management.
Similarly to what was discussed in Section VI-C, the gains of AutoManager derive from the fact that it can learn a loss that is better tailored to the performance metric than generic losses or human-designed functions developed with expert knowledge of the system. While the advantage over a generic loss is rooted in the inherent mismatch between the target metric and the loss used to train the predictor, the improvement with respect to DeepCog needs further discussion. The first two plots in Figure 7 show the loss and error distributions for the 10% of samples with the lowest and highest traffic demands, under (a) the surrogate DeepCog loss and (b) AutoManager. While the loss learned by AutoManager always captures an invariant general behavior, we can observe slight shifts in the error $\hat{y}_{t+1} - y_{t+1}$ that minimizes the cost, depending on the absolute value of $y_{t+1}$. In other words, AutoManager learns a loss that naturally compensates for the different accuracy of the predictor in anticipating traffic values of diverse magnitudes. Such an adaptation results in a significantly better prediction of low-volume demands, as shown by the major shift towards the origin of the green error distribution.
This type of adjustment in the loss function is impossible to ascertain by just looking at the performance metric, i.e., the so-called teacher, as it inherently depends on the prediction quality. Yet, co-training the predictor and the metric estimator block as done by our proposed model allows apprehending more than the teacher knows, since it also learns the inherently variable aptitude of the student to understand the task in different regions of the feature space.
In the second use case, we are particularly interested in showing how AutoManager can adapt to a recursive cost relationship. Plots (c) and (d) of Figure 7 show the learned loss function: the dependency on the extra dimension representing the number of consecutive outages is correctly captured, as shown by the color scale in plot (d). The result shows the potential of the proposed approach to characterize unknown and non-trivial loss functions for regression problems. In this particular case, AutoManager learns a differentiable approximation of a step-wise discrete function.
2) Sensitivity to Hyper-Parameters: The model hyper-parameters comprise the learning rate and the loss exploration noise, and our implementation makes the model robust to both. Experiments are run for the first use case, on capacity forecasting for resource allocation. a) Learning rate: We employ CLR, which is known to reduce the sensitivity of models to the choice of learning rate, and we recall that the ablation studies of Table I and Table II showed that CLR is beneficial in most settings.
b) Loss exploration noise: We make the exploration noise decay exponentially over time. This (i) avoids the need to fine-tune a fixed noise hyper-parameter, and (ii) improves the overall performance with respect to any fixed noise value. To demonstrate that this is the case, we consider zero-mean Gaussian noise and analyze the performance for different values of the noise variance. First, we consider the case with no noise (i.e., zero variance), which serves as a benchmark and shows the performance of a baseline model without this parameter. Then, we consider ten different variances, such that the standard deviation follows a geometric progression from 0.001 to 0.1; these extreme values were selected as a meaningful range after a detailed evaluation. Finally, we consider the exponential decay case, in which the variance decreases as the number of epochs increases, borrowing the idea from the exploration-exploitation trade-off of reinforcement learning [79]. In particular, we consider that the standard deviation at epoch $t$ is equal to $0.2\,(0.1t)^{\varphi}$, where $\varphi$ is selected to ensure that the value at the first epoch is 0.1 and the value at the last epoch is 0.001, for a fair comparison with the fixed-variance cases.
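For concreteness, a geometric decay matching the stated endpoints (standard deviation 0.1 at the first epoch, 0.001 at the last) can be written as below; this is one schedule consistent with the description, not necessarily the exact closed form above.

```python
def noise_std(epoch, total_epochs, s0=0.1, sT=0.001):
    # Geometric decay of the exploration-noise standard deviation:
    # s0 at epoch 1, sT at epoch total_epochs, constant ratio in between.
    ratio = (sT / s0) ** (1.0 / max(total_epochs - 1, 1))
    return s0 * ratio ** (epoch - 1)
```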
Results are shown in Figure 8. In Figure 8b, we report the final metric evaluated for the different noise variances, averaged over several realizations. We observe that the results are quite sensitive to the variance of the noise, and that adding noise can even be detrimental, whereas the exponential decay case obtains the best performance. This adaptive approach is beneficial not only because it achieves the best performance, but also because it avoids the hyper-parameter tuning otherwise required in view of the sensitivity of the results to the variance. In Figure 8a, we show the learning curve (averaged over training realizations), where we only highlight the no-noise and exponential decay curves for ease of visualization. The exponential decay approach provides much faster and more stable convergence. We have also observed that it reduces the probability of obtaining a diverging solution.
3) Transfer Learning: We demonstrate the potential of our design for transfer learning by considering, for the first use case, different network datacenters and service providers with diverse user populations, which thus experience diverse demands.
The measurement dataset trafficApp includes data from different network datacenters, each serving the traffic transiting at a different set of radio base stations. As the traffic serviced by each base station can differ substantially, each datacenter experiences quite diverse temporal dynamics of the demand [68]. Yet, the performance metric is the same across datacenters, and the resource allocation predictors dedicated to each datacenter shall all be trained to optimize it. This is an ideal setting to test the capability of AutoManager for transfer learning.
We evaluate how a predictor trained with the AutoManager approach (i.e., by jointly training the predictor and the metric estimator block) performs with respect to a predictor trained with a fixed metric estimator block that has been previously trained on data from another datacenter. Each datacenter serves a geographically separated group of antennas, and each combination of service and datacenter entails dissimilar user demand patterns [68].
Results are shown in Table VI. Each column indicates the datacenter on which the PCE has been trained, whereas the rows indicate the datacenter for which that same metric estimator block, once trained on the column datacenter data, is used to train the predictor from scratch. Quite naturally, the values on the diagonal reflect the best performance. However, looking at the other values, where a loss function trained on the traffic of one datacenter is used to train a predictor for a different datacenter, performance remains very good. Specifically, by comparing the values for Netflix in Table VI with those in Table V, transfer learning via AutoManager yields performances in new datacenters that are still better than or comparable with those obtained with the expert-designed DeepCog function. Also, the performance remains much better than that of the legacy mismatched MSE or MSLE. For completeness, we complement the results of Table V for anticipatory network resource allocation from Section VI-D in Table VII, summarizing the performance across all the other services, i.e., Facebook, Twitch and YouTube, and for each one of the heterogeneous datacenters. Results are aligned with those presented in Table V. In all cases, AutoManager outperforms the expert-designed surrogate loss DeepCog, by 10% in the worst scenario, even without prior knowledge of the relationship between prediction and performance metric.
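Operationally, the transfer protocol amounts to freezing the PCE trained on the source datacenter and using it as a static loss for a fresh predictor, as in the sketch below (same hypothetical architectures as in the earlier co-training sketch).

```python
import torch
import torch.nn as nn

# Stand-in for a PCE already co-trained on the source datacenter.
pce_frozen = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
for p in pce_frozen.parameters():
    p.requires_grad_(False)            # the learned loss is now fixed

new_predictor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(new_predictor.parameters(), lr=1e-3)

def transfer_step(x, v):
    # Train the target-datacenter predictor from scratch against the
    # transferred, frozen loss meta-learner.
    y_hat = new_predictor(x)
    loss = pce_frozen(torch.cat([y_hat, v], dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```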

VII. EMULATED REALISTIC USE CASES
After analyzing and dissecting the proposed solution in detail, in this section we evaluate the performance of AutoManager in realistic scenarios that are closer to real-world network management tasks, and which are considerably more demanding and complex. Specifically, we consider two practical case studies, detailed next. In both analyses, we employ the real-world traffic measurements from dataset trafficApp to ensure the credibility of the results.
For both use cases, we consider that each Service Provider (SP) owns a network slice through which the network operator transports all the data generated by the SP's users. The slice is active for the whole duration of the experiment. We recall that the dataset trafficApp contains real mobile network traffic traces reporting the SP's traffic volume every five minutes; the actual number of users is not known, nor is it required, since the slice must accommodate the data volume irrespective of the user count.

A. Minimizing Video Streaming OPEX at the Edge
We consider a realistic mobile edge environment composed of several computational sites located close to the radio access, where each of these facilities serves between 70 and 130 base stations. In this scenario, four video streaming services from different SPs each get a dedicated Network Slice Subnet Instance (NSSI) at the edge facilities. The intent from the management perspective is to minimize the operating expenses (OPEX) incurred by the network operator in running the video streaming slices at the edge. This maps to a network management objective of periodically and preemptively re-scaling the compute resources assigned to the NSSIs in a facility.
To emulate the ground-truth OPEX, we develop a sensible model that relates OPEX to the system variables. In this use case, instead of defining a closed-form expression for $f_M$, we make use of a whole pipeline that mimics the described scenario and considerably increases the complexity, illustrated in Figure 9. At the beginning of the pipeline, we introduce the two main input variables for each one of the users: the computing capacity $d_t$ allocated to the corresponding slice and the future traffic $v_{t+1}$. From these variables, the available computing power (A), derived from $d_t$, and the maximum demanded computing power (M), derived from $v_{t+1}$, per user are obtained and fed into an empirical non-linear model mapping these metrics to a user-level video streaming QoE [80]. After the non-linear model, the obtained QoE is discretized to align the metric with common QoE metrics, and transformed into a Mean Opinion Score (MOS). Finally, the service provider uses these MOS values to inform the network operator whether the Service Level Agreement (SLA) of the slice has been violated, which incurs an economic fee $\beta$ per SLA violation. We further add the price of the reservation and operation of the $d_t$ resources ($\alpha$ per capacity unit). In Figure 9, this cost is represented at the bottom.
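A toy rendering of this pipeline is sketched below; the QoE model of [80], the MOS discretization, and all constants are empirical and not reproduced in the text, so every shape and value here is an illustrative assumption.

```python
import numpy as np

def opex(d_t, v_next, n_users, alpha=1.0, beta=50.0, mos_threshold=3.0):
    # Toy version of the Figure 9 pipeline (illustrative shapes/constants).
    A = d_t / n_users                  # available computing power per user
    M = v_next / n_users               # maximum demanded computing power per user
    qoe = 5.0 / (1.0 + np.exp(-4.0 * (A / M - 1.0)))   # stand-in non-linear QoE model
    mos = np.clip(np.round(qoe), 1, 5)                 # discretized Mean Opinion Score
    sla_fee = beta if mos < mos_threshold else 0.0     # fee per SLA violation
    return alpha * d_t + sla_fee                       # resource cost plus penalty
```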
Importantly, the expression of $f_M$ is not observable by the operator, and it is a quite involved expression. Indeed, the VNF Manager (VNFM) in charge of taking the anticipatory decisions can only access information related to the currently allocated resources and the past service traffic demand from the VIM. At most, it could also obtain some QoE information from the Network Data Analytics Function (NWDAF) [81] that interfaces with Application Functions (AF). Yet, the MOS contribution remains unknown. Ultimately, this use case sets forth a scenario where a network manager would not have the necessary system knowledge to manually design a solution. We evaluate the ability of AutoManager to learn the pipeline described above in the case where resources are reconfigured every 5 minutes, and we compare it against two benchmarks:
• An Oracle predictor that returns the optimal allocation based on perfect knowledge of the future demand and of the whole $f_M$ pipeline.
• DeepCog, a state-of-the-art capacity predictor [8], tuned to obtain its best performance on this problem by setting its loss function parameter to $\beta/\alpha$.
In this case, the input to AutoManager consists of the past time series of the traffic demand of the corresponding service. Thus, AutoManager is fed with a single temporal vector $\{x_{t-T}, \ldots, x_t\}$, which corresponds to the past $T$ temporal samples of the traffic demand of the respective SP; the output $\hat{y}_{t+1}$ corresponds to the resources $d_t$ allocated to this SP.
Since the results are homogeneous across services, we only depict the results for the Facebook Live slice in Figure 10a. We show the results for several $\beta/\alpha$ ratios, with costs normalized to the minimum static capacity allocation that suffices to always accommodate the traffic demand. DeepCog is limited by its inflexible loss function even when we manually provide the values of $\alpha$ and $\beta$, while AutoManager learns the loss more accurately, reducing the resulting cost, and does so autonomously. Figure 10b represents the fairly complex loss function captured by the loss meta-learning DNN of AutoManager: manually designing this shape would be hard even with perfect knowledge of the pipeline in Figure 9.

B. Admission Control and Resource Allocation in Sliced Networks
A proper deployment of network slicing requires an accurate anticipatory allocation of isolated resources to each of the slices, which allows the operator to reduce operating costs while ensuring the QoE requested by and promised to each service. This anticipatory aspect opens a new possibility for operators to increase and optimize their profit, since they can now apply yield-management techniques that are well known in other fields, such as overbooking. Slice overbooking capitalizes on multiplexing different slices over non-isolated resources so as to minimize the reserved resources while satisfying the (a priori unknown) QoEs [29], [82]. Inspired by these new paradigms, we analyze a scenario of anticipatory admission control (AC) and resource reservation (RR) of network slices, where the objective of the joint AC and RR decisions is to maximize the network operator's profit, which strongly depends on guaranteeing the agreed QoE performance.
We remark that QoE values (the final objective) are usually unknown to the network operator, due to the complex QoS/QoE mapping that often depends on app-to-network adaptations. This scenario thus corresponds to the kind of not-fully-informed situations for which AutoManager is designed: we have a performance objective (QoE) whose relationship with the management decision (resource allocation) and KPIs (QoS) is unknown to the operator. Yet, with some QoE samples to use as training data (which can be part of the SP-operator agreement, as both benefit from a better connection), AutoManager is able to learn such an unknown relationship, in a way that is explainable, transferable, and general.
1) Network Management Goal and Formal System: We consider that, for each slice, there exists an SLA that determines the performance objective in terms of the requested end-user QoE and specifies the economic compromises that both parties abide by. Specifically, the slice tenant pays a certain amount for the agreed service, but this amount is decreased in case of below-SLA QoE levels, and the operator is forced to pay a compensation in case it delivers extremely poor performance.
Formally, we denote the traffic demand of slice $s \in S$ at time $t$ as $d_s(t)$, its peak traffic demand as $D_s$, and the resources allocated to slice $s$ as $c_s(t)$. We further denote the normalized demand and allocation as $\bar{d}_s(t) = d_s(t)/D_s$ and $\bar{c}_s(t) = c_s(t)/D_s$, respectively. In case the slice is accepted, the tenant pays for the implementation of the slice an amount $R_s(t) = \gamma d_s(t)$, where $\gamma$ is a constant factor ($/byte), which is paid if and only if a target performance level is achieved; should the users instead suffer from lower QoE, the operator incurs a proportionally reduced payment by the tenant.
For our evaluations, we consider that the SLA requirement is met if the resources allocated to the slice are sufficient to serve all the corresponding traffic, i.e., if $\delta_s(t) \triangleq \bar{d}_s(t) - \bar{c}_s(t) \le 0$. Otherwise, the operator revenues drop according to a sigmoid function of $\delta_s(t)$, following standard models [83] in revenue management and prospect theory [84]. Below a minimum QoE, the operator actually pays back the tenant a proportional fine. The maximum fee is upper bounded by a large value $K_s(t)$ proportional to the expected demand, such that $K_s(t) = \beta R_s(t)$. Thus, the operator's costs for an accepted slice $s$ follow the sigmoid expression in (8), where the constants $\alpha$ and $\beta$ dictate how the sigmoid function scales with the underprovisioning $\delta_s(t)$. In addition to the cost (or revenue, if positive) from (8), two other values impact the profit of the operator: first, the overprovisioning of
resources implies an increase in OPEX, such that the operator incurs an OPEX cost $-\gamma\delta_s(t)$ if $\delta_s(t) < 0$, which grows linearly with the amount of unnecessarily reserved resources. Second, committing to allocate more resources than those available incurs a general performance drop and possible service outages, with a huge cost $M$ in terms of, e.g., customer churn. The final performance cost associated with the slice management is given in (9), where $C$ is the total capacity of the system, and $\mathbb{1}(\cdot)$ is the step function that takes value zero if the argument is less than or equal to zero, and value one otherwise: thus, $\mathbb{1}(c_s(t))$ takes value one only if the slice is admitted and resources are allocated to it.
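The cost model of (8)-(9) can be sketched as follows for a single slice; since the exact sigmoid of (8) is not reproduced here, its shape, the sign conventions, and the handling of the boundary are all assumptions made purely for illustration.

```python
import numpy as np

def slice_cost(d_norm, c_norm, R, alpha, beta, gamma,
               total_alloc=0.0, C_total=1.0, M=1e6):
    # Illustrative per-slot cost for one accepted slice (negative = revenue).
    delta = d_norm - c_norm                  # normalized under-provisioning
    K = beta * R                             # maximum fine, prop. to revenue
    if delta <= 0:
        cost = -R - gamma * delta            # full payment minus over-provisioning OPEX
    else:
        # Revenue decays along an (assumed) sigmoid of delta, down to a fine of K.
        cost = -R + (R + K) / (1.0 + np.exp(-alpha * (delta - beta)))
    if total_alloc > C_total:                # committed beyond system capacity
        cost += M                            # outage / churn penalty
    return cost
```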
2) Solution Based on AutoManager: The crucial circumstance to consider here is that, in practice, the network operator cannot analytically know the relation between the slice resource allocation and the QoE of the end users of the specific tenant. Such a relationship is an extremely complex function, whose characterization depends on application-level data (e.g., user feedback) that the operator cannot access, and on many operational parameters. This implies that the operator cannot know (8) a priori, and hence cannot model the whole performance cost in (9). Moreover, even in the hypothetical case in which such a function were known, AC-RR is an NP-hard problem, with a non-trivial trade-off between over-allocating resources (such that the QoE is ensured at the expense of increased operating costs) and under-allocating them (such that the system can be overbooked).
In such a difficult scenario, we use AutoManager to realize the AC-RR management task and determine (i) which slices are accepted and (ii) how many resources are reserved for the slices at the start of each orchestration interval. Indeed, once deployed in the system, AutoManager can learn the expression of the performance cost in (9), optimizing the prediction of $\bar{c}_s(t)$ to minimize such cost. Note that the value of $\bar{c}_s(t)$ determines both the slice admission control, via $\mathbb{1}(c_s(t))$, and the allocation of resources, which matches its value if positive.
We outline the detailed integration of AutoManager in the network architecture for this use case in Figure 11(a). Our model sits in the Network Function Virtualization Orchestrator (NFVO) of the network management framework, as proposed by current standards [5]. It can then leverage the Management Data Analytics Function (MDAF) [85] to access information exposed via the Application Function (AF) [86]: based on the operator-tenant agreement, such information can include live QoE measurements and revenue data, which allows AutoManager to apprehend the relationship between decisions and costs. AutoManager is then able to learn from experience the overall cost in (9) and forecast $\bar{c}_s(t)$, $\forall s \in S$, so as to minimize it.
The input to AutoManager consists of the past time series of the traffic demand for each one of the services. During training, we also provide the actual future traffic demand as an input to the PCE. Recalling the notation presented in Section IV-A, in this case the $i$-th IPU is fed with the temporal vector $\{x^{(i)}_{t-T}, \ldots, x^{(i)}_t\}$, which corresponds to the past $T$ temporal samples of the traffic demand for SP $i$; the output $\hat{y}^{(j)}_{t+1}$ corresponds to the resources $\bar{c}_j(t+1)$ allocated to SP $j$.
To provide an intuition of the high complexity of this problem, we depict in Figure 11(b) the expression in (9) for the naive case of a single slice. Even though we are considering the simplest possible version of the cost, and the only one that can be represented in three dimensions, the shape of the relationship between the management objective and the prediction is tangled and highly sensitive to the value of the traffic demand. Scaling to realistic multidimensional versions of the cost thus requires very strong meta-learning capabilities.
3) Benchmark: To the best of our knowledge, there does not exist any automated solution for this problem. Hence, we compare AutoManager against a solution that follows the legacy approach presented in Figure 1(a). Specifically, we consider that each slice $s \in S$ is handled separately by one predictor with a tailored loss function, which is responsible for forecasting the resources to be reserved for slice $s$ (denoted by $c_s(t)$) if the slice is accepted. The admission control is instead handled separately, and formulated as an optimization problem solved starting from all the $c_s(t)$ predictions. The optimization program finds the best binary admission control variables $x_s(t)$, which take value one if slice $s$ is accepted. Formally,
$$\max_{x_s(t)} \sum_{s \in S} R_s(t)\,x_s(t) \quad \text{s.t.} \quad \sum_{s \in S} c_s(t)\,x_s(t) \le C. \qquad (10)$$
This is in fact the well-known and NP-hard Knapsack Problem (KP); hence, we name the benchmark Knapsack. The individual predictors of the benchmark are trained in a similar way as in the AutoManager implementation, with the difference that each only receives data for the corresponding slice. Therefore, the learning block of the benchmark only takes care of the resource prediction, but not of the admission control, and it is thus less general and flexible than the solution based on AutoManager.
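The knapsack of (10) can be solved with standard dynamic programming over a discretized capacity grid, as in the sketch below; the discretization resolution is an implementation choice, since the solver is not otherwise specified.

```python
def knapsack_admission(R, c, C, grid=1000):
    # 0/1 knapsack of (10): choose x_s maximizing sum R_s*x_s subject to
    # sum c_s*x_s <= C, with capacities discretized on an integer grid.
    n = len(R)
    cw = [max(1, round(ci / C * grid)) for ci in c]     # integer weights
    best = [0.0] * (grid + 1)
    take = [[False] * (grid + 1) for _ in range(n)]
    for s in range(n):
        for w in range(grid, cw[s] - 1, -1):            # classic reverse sweep
            if best[w - cw[s]] + R[s] > best[w]:
                best[w] = best[w - cw[s]] + R[s]
                take[s][w] = True
    # Backtrack the admitted slices from the full-capacity cell.
    x, w = [False] * n, grid
    for s in range(n - 1, -1, -1):
        if take[s][w]:
            x[s] = True
            w -= cw[s]
    return x
```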

4) Results:
We make use of the traffic data from the 12 different applications in trafficApp, and assume that each application requests a dedicated slice $s$.
We first provide a qualitative illustration of the results in Figure 11(c). AutoManager (top plot) learns to keep the allocated capacity below the maximum capacity of the edge datacenter (gold) even when the actual total demand (dotted red) exceeds such a value. When the maximum capacity is above the total traffic, AutoManager offers a very precise forecast of the resources needed to accommodate the demand. The bottom plot focuses instead on two slices (gray and red), showing their requested capacity (solid) and the resources allocated by our solution (dotted). During periods of low demand, after hour 72, both requests are precisely predicted and fully served. When capacity is scarce, before hour 72, the gray slice is often rejected (with zero reserved resources). Also, the resources allocated to the red slice no longer match the request, but are slightly lower: here, AutoManager learns and exploits the sigmoid function in (8), which entails a negligible penalty if the QoE of the end users is only slightly reduced. Indeed, it is more profitable for the operator to earn a bit less on the red slice and free up space for additional tenants. Table VIII quantitatively confirms these results by reporting the performance of AutoManager and the benchmark for different numbers of slices. We focus on non-trivial periods where the management task plays a role, i.e., when the slice demands exceed the capacity available at each edge datacenter. Overall, AutoManager allows the operator to effectively increase its revenues through overbooking by up to 36% when serving 12 slices simultaneously, and to admit on average 2 more slices than the benchmark. We recall that it does so without any prior knowledge of the system or of the objective, which shows its meta-learning capabilities and the practical advantages it entails in solving the network management task.

C. Time Complexity Analysis
Computational complexity is an important metric to evaluate the efficiency of anticipatory decision-making algorithms for network performance optimization, especially in the context of large-scale infrastructures and services with stringent latency requirements. To assess the overall complexity of AutoManager, we quantify its training time across its two main components, i.e., the predictor and the PCE. Let us define as $N_{\text{Pred}}$ (resp. $N_{\text{PCE}}$) the number of input features to the predictor (resp. the PCE), as $L_{\text{Pred}}$ (resp. $L_{\text{PCE}}$) the number of layers in the neural network, as $M_{\text{Pred}}$ (resp. $M_{\text{PCE}}$) the average number of neurons per layer, and as $T_{\text{Pred}}$ (resp. $T_{\text{PCE}}$) the number of training samples; then, the predictor neural network has a total training time complexity of $C_{\text{Pred}} = O(T_{\text{Pred}} \cdot N_{\text{Pred}} \cdot L_{\text{Pred}} \cdot M_{\text{Pred}})$, whereas the PCE has a complexity $C_{\text{PCE}} = O(T_{\text{PCE}} \cdot N_{\text{PCE}} \cdot L_{\text{PCE}} \cdot M_{\text{PCE}})$. As both parts are trained together using a single backpropagation, the full training time of the model can be described by $C_{\text{AutoManager}} = C_{\text{Pred}} + C_{\text{PCE}} + c$, where $c$ is the residual time caused by some transformations between the output of the predictor and the input of the PCE. In practice, for the first use case, training the full model took an average of 1920s ± 5s, with 1350s (resp. 480s) to train the predictor (resp. the PCE), leaving a residual time $c$ of roughly 90s. We remark that standard methods (i.e., DeepCog) require a training time similar to the predictor's training time alone. In fact, in AutoManager, the only additional complexity comes from the PCE, which increases the complexity in the same manner as adding more layers to a standard predictor would. This is one of the main advantages of the proposed approach: it does not increase the required complexity apart from the extra layers; there are no complex loops, elaborate network structures, or other artifacts that could increase the order of magnitude.
For the use case of Section VII-B, the training times are presented in Table IX, where the three columns of each row correspond to $C_{\text{AutoManager}}$, $C_{\text{Pred}}$, and $C_{\text{PCE}}$, respectively. The difference between the first column and the sum of the other two corresponds to the residual time $c$. These results highlight that the predictor's training takes most of the training time, and that the share of training time it consumes grows with the number of slices. This is fully understandable, as the predictor's architecture varies considerably with the addition of new services (IPUs are added), while only the number of inputs changes for the PCE. For this use case, the comparison is more complex, as our model enables a kind of design (multiple input flows) that is not feasible with traditional approaches.

VIII. CONCLUSION
We introduced an original approach to anticipatory network management that builds upon recent findings showing that custom loss functions can steer predictors towards substantially improved decision-making. Our proposed strategy, named AutoManager, goes one step beyond loss function customization, and meta-learns the loss instead of relying on manual design. Furthermore, AutoManager can seamlessly incorporate available expert knowledge or expert-designed custom architectures, whose characterization of the problem may not be complete by itself, but can be synergistically combined with AutoManager. We prove, through a comprehensive range of experiments both in controlled environments and with real-world measurement data, that AutoManager yields improved network management decisions while removing human intervention from the loss function definition. It also provides interesting and desirable properties, such as enhanced explainability of the decision-making process through meta-learned loss exploration, and transfer learning capabilities via the reuse of a trained loss meta-learner. We also show how the principle at the basis of AutoManager can benefit domains beyond networking, with practical examples in power grid management.

Fig. 5. Loss functions $L_{w_t}$ learned by the loss meta-learner DNN of AutoManager, under diverse objectives $f_M$.

Fig. 6. Loss function and error distributions for the top and bottom 10% of predicted values, for Facebook traffic from trafficApp under (a) MAE and (b) AutoManager, and for the energy demand from powerGrid under (c) MAE and (d) AutoManager.

Fig. 7. Loss function and error distributions for the top and bottom 10% of predicted values, for the network resource allocation to Netflix traffic under (a) expert-designed DeepCog loss and (b) the proposed AutoManager, and for the power grid management with AutoManager, with a (c) two-dimensional and (d) three-dimensional representation of the learned loss.

Fig. 8. Impact of the exploration noise variance on accuracy and learning convergence.

Fig. 9. Emulation pipeline for the network operator OPEX, which depends on the unknown end-user QoE.

Fig. 11. AutoManager for Section VII-B. (a) Implementation of AutoManager in the network architecture. (b) Sample of the final performance cost when a single slice is present in the network ($S = \{s\}$). (c) Illustration of the AutoManager operation.

Concerning the AutoManager concept and training, the following important remarks are in order.
1) The meta-learning model outlined above and detailed next is able to approximate a non-differentiable objective $f_M$ by a differentiable alternative $f_W(\cdot)$, which is implemented by the PCE DNN upon training. This allows optimizing the predictor DNN under metrics that could not be directly used as losses, via a suitable approximation of the same.
2) AutoManager naturally lends itself to support modular transfer learning, and it does so in a twofold manner. (i) For IPUs: it allows for the separate pre-training of each IPU on legacy loss functions to reduce the training time. This modular structure improves the scalability of the model (cf. Section VI-A), and its portability, since each pre-trained IPU can be applied in other tasks with different loss functions but where the input variable is the same. In addition, during pre-training, each IPU could be trained with a different loss function, which is particularly pertinent in scenarios where different IPUs handle inputs that are different in nature. Conversely, when pre-training is not considered and AutoManager is trained as a whole, the IPUs are not directly trained with a specific loss function: the IPUs' output is just the input of the Aggregator, whose output is evaluated by the PCE to compute the loss. (ii) For the PCE: since the PCE learns the unknown loss function that defines the predictor's performance-output relationship irrespective of the regressor used, the trained PCE block can be employed to train any predictor that faces the same complex problem, and it can also be used as an initial state for the PCE in the presence of different unknown functions that are expected to be similar.

TABLE I
RESULTS IN TERMS OF AVERAGE MAE FOR THE SIMPLE TOY EXPERIMENT FOR INTERTWINED PROBLEMS OF SECTION VI-A, FOR OUR MODEL AND ALL BENCHMARKS, FOR DIFFERENT NUMBERS OF INPUT VARIABLES $n_{in}$, WITH DATA EXTRACTED FROM TRAFFICAPP

TABLE II
COMPARATIVE EVALUATION SUMMARY. THE BEST AND SECOND BEST PERFORMING MODELS ARE HIGHLIGHTED FOR EACH OBJECTIVE

TABLE III
COMPARISON OF OUR PROPOSED MODEL AGAINST THE WINNER OF THE M4 COMPETITION, COMPUTED ON THE M4 DATASET

TABLE IV
MAE MEASURED ON DIFFERENT ONE-STEP FORECASTING SERIES, WHEN THE SAME PREDICTOR DNN IS TRAINED WITH A STATIC MAE LOSS FUNCTION AND WITH A LOSS LEARNED VIA AUTOMANAGER. THE RIGHTMOST COLUMN REPORTS THE PERCENT GAIN IN MAE SCORED BY THE AUTOMANAGER APPROACH OVER THE STATIC MAE LOSS. BOTH MAE AND AUTOMANAGER COLUMNS ARE SCALED BY $10^{-2}$

TABLE VI
TRANSFER LEARNING RESULTS FOR CAPACITY ALLOCATION IN FOUR DATACENTERS, D1-D4, EACH SERVING DIVERSE TRAFFIC DEMANDS, FOR FOUR POPULAR SERVICES. VALUES ARE NORMALIZED BY $10^{-1}$

TABLE VII
PERFORMANCE RESULTS FOR THREE DIFFERENT SERVICES (FACEBOOK, TWITCH, AND YOUTUBE), FOUR DIFFERENT DATACENTERS (D1, D2, D3, D4), AND FOUR DIFFERENT LOSS FUNCTIONS

TABLE VIII
RESULTS FOR THE USE CASE IN SECTION VII-B. WE REPORT THE FINAL Gain, THE Percent INCREASE OVER THE BENCHMARK, AND THE AVERAGE NUMBER OF SLICES Admitted. THE NUMBER OF SLICES VARIES FROM 2 TO 12

TABLE IX
TRAINING TIME FOR THE USE CASE OF SECTION VII-B, FOR DIFFERENT NUMBERS OF CONCURRENT SERVICES