AI-driven Orchestration for 6G Networking: the Hexa-X vision

Abstract: Mobile networks are adopting disaggregation and modularisation to support flexibility. However, large modular networks with a wide range of heterogeneous components have many degrees of freedom, which makes their Management and Orchestration (M&O) complex. The use of Machine Learning (ML) techniques is expected to improve the efficiency of the operation of 6G networks by introducing data-driven approaches into their M&O. In this paper, we review the current best practices of ML usage to support M&O, and we present the M&O architecture of the H2020 European project Hexa-X [1]. We then identify the main challenges ahead to fully embrace an ML-driven operation.


I. INTRODUCTION
Mobile networks are becoming increasingly difficult to manage due to their growing heterogeneity and complexity, and due to the variety of stakeholders that 6G should support: different Mobile Network Operators (MNOs), cloud service providers, private networks, and highly localised networks such as in-vehicle networks. The design and maintenance of future networks require new approaches to optimise the allocation of resources (compute, radio, energy, power, etc.). Artificial Intelligence (AI)/Machine Learning (ML) could enhance network operations by exploiting the huge amount of data that becomes available thanks to the Open Interfaces of the different network segments/domains. An ML-driven M&O is expected to improve the efficiency of the operation of Sixth-Generation (6G) mobile networks. The potential gains of using AI/ML have been recently addressed in studies such as [2]. AI/ML is also expected to enhance the life-cycle management of Network Functions (NFs), which could be placed anywhere in the network based on performance requirements, resource availability, or load change predictions, among other factors. Also, through efficient resource management and allocation, AI/ML algorithms could provide additional support to MNOs to dimension network slices, while respecting the requested Quality of Service (QoS).
An M&O process is composed of several operations, such as the life-cycle management of Network Slices, Network Services (NSs) and NFs, resource and service allocation, and scaling. Its automation has been a subject of intensive research for about a decade [3], [4]. In recent years, research has focused on using AI/ML to achieve this goal. In that context, it is worth mentioning numerous European projects [5], [6]. The topic is also addressed by standardisation bodies such as 3GPP and ETSI, and by industry fora such as TM Forum and NGMN. A comprehensive overview of these activities can be found in [1], [5]. In this domain, the main objective of the Hexa-X project is to make the most out of AI/ML technology applied to networks and to develop a methodology and architectural requirements for an AI-native network.
In this paper, we review the current status of AI/ML for M&O of 6G networks, taking into account the considerations of the H2020 European project Hexa-X on this topic [7]. We first survey existing solutions that have been recently proposed or can be adapted to operate in 6G networks. Following this survey, we discuss the main challenges ahead that we have identified and that should constitute part of the future work in 6G.

II. CURRENT STATUS OF AI/ML FOR M&O
In this section, we describe the main categories of ML algorithms, considering how they could be integrated in M&O processes. Following the typical categorisation used in the literature [8], we consider Supervised Learning (SL), Unsupervised Learning (UL), and Reinforcement Learning (RL) algorithms, plus a specific learning model known as Federated Learning (FL).
A. Supervised learning
SL algorithms [9] are able to compute a mapping function y = f(x) from examples, where x and y are n-dimensional vectors. For the algorithm to be functional, a so-called training phase is necessary. This training process consists of iteratively presenting to the algorithm a collection of (x, y) pairs, i.e., the training set. From the training set, the algorithm derives the mathematical function that maps the x values to the y values.
Developing SL-based solutions for M&O processes typically requires large data sets [9] for the model training process. Typically, the MNOs or the Verticals provide a high-level description of the expected NS functionalities and performance, which can then be used for the design and development of the NFs composing the service itself. When SL is considered, real data from the MNO scope will be required to train the ML models and produce ready-for-production M&O algorithms. In this respect, dedicated data pipelines should be implemented connecting the development and the operational scopes (e.g., by means of continuous monitoring pipelines). However, this should be done considering the necessary security aspects and data privacy, e.g., relying on anonymisation techniques.
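As a minimal illustration of learning a mapping y = f(x) from a training set, the sketch below fits a linear model by ordinary least squares. The scenario (predicting the CPU load of an NF from the number of active users), the synthetic data, and all function names are assumptions made for illustration only; a production M&O model would be trained on real MNO data and would usually be far more complex.

```python
import numpy as np

def train_linear_model(x_train, y_train):
    """Fit y ~ w * x + b by ordinary least squares and return (w, b)."""
    # Design matrix with a column of ones for the bias term.
    design = np.column_stack([x_train, np.ones_like(x_train)])
    (w, b), *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return w, b

def predict(w, b, x):
    """Apply the learned mapping to a new input."""
    return w * x + b

# Hypothetical training set: active users (x) and measured CPU load in % (y).
users = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
cpu_load = 0.1 * users + 5.0  # synthetic, noise-free labels

w, b = train_linear_model(users, cpu_load)
predicted = predict(w, b, 600.0)  # load estimate for an unseen input
```

The training phase is the call to `train_linear_model`; once the parameters are learned, `predict` can be used in the operational scope without access to the original training data.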

B. Unsupervised learning
As seen in the previous subsection, supervised algorithms work on data pairs: the ML model developer tells the algorithm which y each x in the training set maps to, i.e., each x value in the training data set is labelled with a corresponding y value. Unsupervised Learning, instead, works without labelling, so no x/y pairs are required for training. UL is typically used for two main kinds of problems:
• To provide a more effective representation of the original data set, targeting a specific application. This includes, for example, reducing data dimensionality or providing invariant or sparse representations. Several algorithms for finding new representations are available in the literature, including simple statistical methods such as the principal component analysis [10], sparse methods [11] or auto-encoding neural networks [12].
• For data clustering [13], where the data is grouped into n different clusters according to their features. Clustering methods can be extended not only to find related data in data structures, but also to create hierarchies.
In the M&O scope, clustering can be used to improve anomaly detection: assuming that anomalies are qualitatively different from the data clusters of regular behaviour, unsupervised algorithms can be trained to detect anomalous behaviours and raise alarms accordingly. UL techniques share the same data access problem as SL ones. This means that the necessary workflows for data anonymisation and security should also be implemented between the MNO environment and the software development environment in the software providers' scope.
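The clustering-based anomaly detection idea can be sketched as follows. This is a simplified example under stated assumptions: plain k-means with a deterministic farthest-point initialisation, synthetic two-dimensional monitoring samples, and an alarm threshold of mean plus three standard deviations of the training distances. None of these choices come from the Hexa-X work; they only illustrate the principle that samples far from every cluster of regular behaviour are flagged.

```python
import numpy as np

def nearest_distance(samples, centroids):
    """Return (distance to nearest centroid, index of that centroid) per sample."""
    d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1), d.argmin(axis=1)

def fit_centroids(data, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [data[0]]
    while len(centroids) < k:
        # Next centroid: the sample farthest from all centroids chosen so far.
        dist = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[dist.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iterations):
        labels = nearest_distance(data, centroids)[1]
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

# Synthetic "regular behaviour": two operating regimes (unlabelled samples).
rng = np.random.default_rng(seed=0)
normal = np.vstack([
    rng.normal(loc=[10.0, 20.0], scale=2.0, size=(50, 2)),
    rng.normal(loc=[80.0, 60.0], scale=2.0, size=(50, 2)),
])

centroids = fit_centroids(normal, k=2)
train_scores, _ = nearest_distance(normal, centroids)
threshold = train_scores.mean() + 3.0 * train_scores.std()  # assumed alarm rule

def is_anomalous(sample):
    """Flag a sample whose distance to every cluster exceeds the threshold."""
    score, _ = nearest_distance(np.atleast_2d(sample), centroids)
    return bool(score[0] > threshold)
```

A sample lying between the two regimes, such as `is_anomalous([45.0, 40.0])`, is flagged, whereas a sample close to one of the learned centroids is not.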

C. Reinforcement learning
Reinforcement Learning is a learning process that aims at reinforcing positive behaviours or inhibiting inappropriate ones [14]. By the repeated application of positive and negative stimuli, the algorithm learns and modifies its behaviours. In ML this concept is typically represented as a closed Control Loop (CL) [15], where the ML model acts as an entity able to: (i) receive state information, (ii) perform specific actions on it, and (iii) receive the corresponding rewards (reinforcement signals) based on those actions. By assigning specific state values and by evaluating the reward signal for each action, RL models can iteratively improve their actions.
RL systems have proven to perform well in application scenarios in which the actions to be performed are well defined, such as robotics [16], [17], or playing computer games, where they can even outperform humans [18]. For this reason, future 6G M&O services could take advantage of these techniques, translating those actions into basic orchestration functions such as NF instantiation or scaling, among others. A practical example might be the application of scaling actions on certain edge NFs based on the measured behavioural patterns of end users. These behavioural patterns would define the state of the system, while reward signals may be derived from QoS metrics, and actions could be the scale-in/out orchestration actions on certain NFs.
Once the software for the RL system has been designed, it can be deployed in its working environment to start the learning process by means of repeated interactions. However, in this RL approach there is no clear separation between the training and production stages, as there is with the SL or UL models. This is probably one of the major challenges of integrating this kind of algorithm in M&O systems: it must be assumed that the initial performance will not be good, since the system still needs to learn, and this learning happens directly in the environment in which the system should work. An approach that can be used to mitigate this problem is to start the learning process in a pre-production environment, and move the system to production once a good performance level is reached. As far as data sharing is concerned, RL models face the same type of challenges that have been mentioned for SL and UL models. However, since they may be directly deployed in the production environment without a previous learning stage, they can avoid the problem of exporting data towards the development environment (although at the price of starting with a clearly sub-optimal model).
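The scaling example above can be sketched with tabular Q-learning. The environment below is a deliberately small toy and entirely an assumption: the state is an observed demand level plus the current number of NF instances, the actions are scale-in, hold, and scale-out, and the reward combines a QoS penalty with a resource cost. The numeric values (penalty of 10, at most 3 instances, learning rate, discount factor) are arbitrary.

```python
import random

ACTIONS = (-1, 0, 1)   # scale-in, hold, scale-out
LOADS = (1, 2, 3)      # observed demand level (derived from user behaviour)
MAX_INSTANCES = 3

def step(load, instances, action):
    """Apply a scaling action and return (new_instances, reward)."""
    new_instances = min(MAX_INSTANCES, max(1, instances + action))
    qos_penalty = 10.0 if new_instances < load else 0.0  # QoS target violated
    resource_cost = float(new_instances)                 # cost of running NFs
    return new_instances, -(qos_penalty + resource_cost)

def train(episodes=2000, horizon=10, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q(state, action) from repeated interaction with the toy environment."""
    rng = random.Random(seed)
    q = {(load, n, a): 0.0
         for load in LOADS for n in range(1, MAX_INSTANCES + 1) for a in ACTIONS}
    for _ in range(episodes):
        load = rng.choice(LOADS)                 # demand is fixed within an episode
        instances = rng.randint(1, MAX_INSTANCES)
        for _ in range(horizon):
            action = rng.choice(ACTIONS)         # off-policy: explore uniformly
            new_instances, reward = step(load, instances, action)
            best_next = max(q[(load, new_instances, a)] for a in ACTIONS)
            key = (load, instances, action)
            q[key] += alpha * (reward + gamma * best_next - q[key])
            instances = new_instances
    return q

def greedy_action(q, load, instances):
    """Pick the action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(load, instances, a)])

q_table = train()
```

After training, the greedy policy scales out when demand exceeds capacity (for instance, `greedy_action(q_table, 3, 1)`) and scales in when capacity is over-provisioned. Note that during the first episodes the learned values are poor, which mirrors the initial sub-optimal behaviour that an RL system exhibits in its working environment.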

D. Federated learning
Federated Learning [19] is an AI/ML technique that uses local data samples to train ML algorithms on multiple distributed servers without sharing the actual data between them. This approach contrasts with the regular approaches based on the ML techniques described before, wherein training data sets are typically available on a centralized server. FL works on multiple local data sets exchanging only model parameters, such as the weights of neural networks, to generate a global model shared by all participant nodes. FL can rely on several ML algorithms (e.g., supervised, unsupervised, or others), so FL cannot be considered a learning paradigm by itself, but a way to implement other ML paradigms in a decentralised way. The main advantage from this is that FL enables multiple distributed stakeholders to build ML models without sharing sensitive data.
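The exchange of model parameters instead of data can be sketched with a basic federated averaging loop. This is an illustrative simplification rather than the scheme of [19] in full: each node trains a linear model on its private data with a few gradient-descent steps, and the central node averages the resulting weights in proportion to the local sample counts. The data, the model, and the hyper-parameters are assumptions.

```python
import numpy as np

def local_update(weights, x, y, lr=0.1, epochs=5):
    """A few gradient-descent steps on one node's private data (linear model)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad
    return w

def federated_averaging(clients, dim, rounds=50):
    """Only weights travel between the nodes and the server, never x or y."""
    global_w = np.zeros(dim)
    total = sum(len(y) for _, y in clients)
    for _ in range(rounds):
        local_ws = [local_update(global_w, x, y) for x, y in clients]
        # Weighted average of the local models by local sample count.
        global_w = sum(len(y) / total * w for (_, y), w in zip(clients, local_ws))
    return global_w

# Synthetic private data sets held by three hypothetical stakeholders.
rng = np.random.default_rng(seed=0)
true_w = np.array([2.0, -1.0])
clients = []
for n_samples in (20, 30, 40):
    x = rng.normal(size=(n_samples, 2))
    clients.append((x, x @ true_w))

global_w = federated_averaging(clients, dim=2)
```

In this noise-free setting the global model converges to `true_w` even though no node ever sees another node's samples.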
Similarly to the case of RL, FL models are designed to be trained directly in the production environment, and may therefore suffer from the same problems of having to execute the learning stage in that scope. As for RL, this can be mitigated by performing a pre-training stage, e.g., preventing the system from triggering actual orchestration actions until the learning process has reached a fair performance level.
The main benefit of FL systems is that they natively avoid the need to share sensitive data among different stakeholders. However, in a telco-grade environment, software vendors would still be required to design and deploy the FL system itself, e.g., to decide on the neural network types and topologies, the federated/central node architecture, and how these elements would obtain and process data and communicate with each other. This dependence on the software suppliers could be reduced by relying on existing general-purpose FL frameworks that could be used directly by the MNO operational teams with little support from the software supplier side.
III. THE HEXA-X M&O ARCHITECTURE SUPPORTING AI-DRIVEN ORCHESTRATION
AI/ML-based techniques have become a promising tool to provide data-driven decision making. As such, they are currently seen as enablers to manage the increasing complexity and scale of M&O operations in 6G networks. The scope of AI/ML techniques will cover several optimisation aspects and life-cycle actions regarding the M&O of services, including resource allocation and slice sharing at provisioning time, service composition, scaling, migration, and re-configuration/re-optimisation of network services and resources.
More specifically, we expect AI/ML-driven M&O functionalities to include:
• Enhancing service M&O operations, such as the provisioning of NFs, including resource allocation and application layer configuration.
• Automating network tasks, supporting data-driven and zero-touch approaches.
• Providing predictive orchestration, considering data from different domains (e.g., correlating data from the infrastructure and the user plane).
• Supporting proactive and dynamic self-optimisation of network slices.
• Supporting the management of collaborative AI components across the network.
• Supporting intent-based management processes by providing intelligence for reasoning regarding service requirements, network capabilities and external non-network factors.
• Interpreting and enforcing sustainability policies, to reduce cost and energy consumption and to improve the service efficiency of the network infrastructure.
In this context, the European H2020 project Hexa-X, in its Intelligent orchestration and service management for future Beyond 5G (B5G)/6G networks work package, has designed an innovative M&O architecture [1] that enables AI-based orchestration and supports the above-mentioned functionalities. Figure 1 depicts a simplified version of the structural view of this architecture. As can be seen, different sets of network functions are considered, grouped into different functional groups at the Network Layer (e.g., Management Functions, Core Network Functions, Monitoring Functions, and AI/ML Functions, among others). Some of these functions are explicitly intended to implement M&O Functions or to support them, the AI/ML Functions being one of those supporting sets of functions.
A relevant innovation in this architecture, which impacts the deployment of AI/ML resources into the MNO infrastructure, is the introduction of the Design Layer. It represents M&O-related operations involving third-party software providers, which in telco-grade environments would provide the AI/ML models and/or the related software components. As can be seen in Figure 1, the communication between the Design Layer and the other layers is performed using an API Management Exposure, a cross-layer communication block which follows the Service-Based Model Architecture (SBMA) principles and the DevOps methodologies to deploy the AI/ML models and resources [20].

A. M&O Architecture AI/ML Functions
This set of functions is considered part of the Complementary M&O Functions within the Network-Layer M&O block, as they might be used by the Management Functions to complement their operations with additional resources. In this regard, AI/ML Functions can support what are considered the Primary M&O Functions, i.e., the service fulfilment, assurance, and artefact management capabilities. The Hexa-X M&O system can also be used to deploy and orchestrate other AI/ML Functions outside the M&O scope itself to optimise the managed objects (network slices, NSs, NFs, etc.) across the different domains, such as the Core, the Radio Access Network (RAN) and the Transport Network.
AI/ML Functions consume data from a data store that could be distributed or centralised. This data store is fed by data collectors through the Monitoring Functions block, using the probe that is adequate for the monitored platform/domain/segment. Based on this data, Management Functions are triggered to execute the necessary orchestration actions (e.g., NF scaling or placement actions). Monitoring Functions are intended here to integrate data from both Network/Infrastructure Key Performance Indicators (KPIs) and Application/Service level data, in order to allow the AI/ML Models to correlate data from different domains and learn patterns that may not be self-evident. This also allows the AI/ML Functions to focus on specific layers, or to adopt a joint, cross-layer scope (e.g., some may be dedicated to a Network Layer segment, like the RAN, taking short-term control decisions, or medium-long-term decisions for optimisation or resource allocation at service/slice provisioning time, while others may operate with a wider scope, covering the End-to-End (E2E) service management).
The usage of AI/ML Functions in future B5G/6G networks is justified by the increase in the amount and variety of data that could be associated with the orchestration processes. This increases the level of complexity, and AI/ML algorithms have proven their effectiveness when it comes to working with large and heterogeneous sets of data. In Hexa-X the following sources of complexity have been identified [1]:
1) Extreme Edge Integration: The integration of this domain is a challenge by itself due to the huge volume and heterogeneity of devices that will co-exist within it. AI/ML Functions will ease the M&O processes related to this domain using Big Data Analytics to address the characteristics of this new domain, and triggering the required orchestration actions [21].
2) Networks Operations Management: Applying AI/ML algorithms to this context is known as AIOps [20]. It has to do with automating and improving activities such as identifying events, analysing incident causes, or collecting and normalising high volumes of operational data, among others.
3) Intent-based networking: AI/ML Functions can provide the translation from high-level intents, e.g., describing the desired network services in natural language, to the corresponding low-level management operations [22].
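The data flow described above (Monitoring Functions feeding a data store, AI/ML Functions consuming it, and Management Functions executing the resulting action) can be sketched as follows. All class and function names are hypothetical and do not correspond to Hexa-X interfaces; in particular, the `aiml_function` below is a simple threshold rule standing in for a trained model.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class DataStore:
    """Hypothetical data store fed by the Monitoring Functions."""
    records: list = field(default_factory=list)

    def write(self, record):
        self.records.append(record)

    def latest(self, count):
        return self.records[-count:]

def monitoring_function(store, cpu_utilisation, latency_ms):
    """Merge an infrastructure KPI and a service-level metric into one record."""
    store.write({"cpu": cpu_utilisation, "latency_ms": latency_ms})

def aiml_function(store, window=3):
    """Placeholder for a trained model: a threshold rule over recent records."""
    recent = store.latest(window)
    if len(recent) < window:
        return "hold"
    cpu = mean(r["cpu"] for r in recent)
    latency = mean(r["latency_ms"] for r in recent)
    if cpu > 0.8 and latency > 50.0:
        return "scale_out"
    if cpu < 0.2 and latency < 20.0:
        return "scale_in"
    return "hold"

def management_function(recommendation, instances):
    """Execute the orchestration action on the managed NF (here: a counter)."""
    if recommendation == "scale_out":
        return instances + 1
    if recommendation == "scale_in":
        return max(1, instances - 1)
    return instances

store = DataStore()
instances = 2
for cpu, latency in [(0.85, 60.0), (0.90, 72.0), (0.95, 80.0)]:
    monitoring_function(store, cpu, latency)
    instances = management_function(aiml_function(store), instances)
```

The point of the sketch is the separation of roles: the AI/ML Function only produces a recommendation from correlated infrastructure and service data, while the Management Function remains responsible for the orchestration action itself.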

IV. CHALLENGES AHEAD
In this section, we summarise the most relevant challenges that industry and academia may need to consider when addressing the application of AI/ML in future B5G/6G networks. We refer the interested reader to [1] for a detailed discussion.

A. Learning stage
As explained in Section II, AI/ML models need a learning stage. This learning stage must be introduced in the regular MNO work procedures, which is a challenge by itself. Some approaches, such as SL and UL, can be identified with the regular software development process, where the training stage replaces or complements the traditional software programming processes. However, this also comes with the downside of providing data to external third parties, the service developers, which in the telco-grade environment are typically outside the MNO scope. This can be addressed in different ways: FL represents an approach that has been specifically designed to address this problem, and encryption and anonymisation techniques can also be used (e.g., homomorphic encryption algorithms [23]).
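To make the idea of homomorphic encryption concrete, the sketch below implements a toy version of the Paillier cryptosystem, an additively homomorphic scheme: two encrypted values can be added by a third party that never sees the plaintexts. This is chosen here purely as an illustration and is not necessarily the scheme discussed in [23]. The primes are tiny and the code is not secure; a real deployment would rely on a vetted cryptographic library and much larger keys.

```python
import math
import secrets

def generate_keys(p, q):
    """Return (public modulus n, private key) for two distinct primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)

def encrypt(n, message):
    """Encrypt an integer 0 <= message < n with generator g = n + 1."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    l_value = (pow(ciphertext, lam, n * n) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts corresponds to adding the plaintexts."""
    return (c1 * c2) % (n * n)

# Toy parameters (insecure): two small primes.
n, private_key = generate_keys(1009, 1013)

# Two stakeholders encrypt their private counters; the aggregator only sees ciphertexts.
c_a = encrypt(n, 120)
c_b = encrypt(n, 345)
encrypted_total = add_encrypted(n, c_a, c_b)
total = decrypt(n, private_key, encrypted_total)
```

Only the holder of the private key can recover `total`, which equals the sum of the two private values, while the party performing the addition learns nothing about the individual inputs.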

B. Overhead
Artificial Intelligence models (especially supervised and unsupervised models) often require long processing times for the training stage. Also, once trained, if the model is complex, it may require special hardware resources if agile real-time responses are required. This problem is usually solved using specific hardware resources (e.g., GPUs or FPGAs) or acceleration capabilities that can be required also at the edge and the extreme-edge domains. However, this can also increase infrastructure costs, mainly when those resources need to be deployed on a large number of nodes to properly absorb the traffic.

C. Data availability
Training AI/ML systems requires data, which needs to be collected, stored, and dispatched where needed, all in a timely fashion. This will probably not be a big problem for Reinforcement Learning or Federated Learning models, since they only work with the data that is already available in the actual environment in which they are deployed. However, supervised and unsupervised approaches could require huge amounts of data from the different network layers, or even from external data sources (e.g., non-public networks or third-party infrastructure nodes).
One possible method to address the lack of data sets is to use Generative Adversarial Networks (GANs) [24]. GANs are already successfully used in fields such as audio or video processing, e.g., to perform image or video synthesis. Based on a small data set (a 6G data set in our case), a GAN could produce a larger data set made of plausible data.

D. Multi-stakeholder environments
As mentioned earlier, 6G networks, like Fifth-Generation (5G) networks, will be very heterogeneous, and multiple parts of these networks may not belong to the same stakeholder, which has an impact on the AI algorithms used for orchestration. Firstly, some of the stakeholders may be malicious and send false data to the orchestration system, either to gain an advantage or simply to disrupt the network. Therefore, algorithms are needed that are resistant to such adversarial behaviour. Secondly, different stakeholders may be reluctant to provide their complete data set to the E2E orchestration system, as some or all of this data may be sensitive. As E2E management still needs to be performed, AI/ML algorithms need to be adapted to provide accurate results without having access to the raw data. Regarding this topic, FL systems and homomorphic encryption are both interesting research directions.
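One simple way to limit the influence of a stakeholder that reports false data is to aggregate reports with a robust statistic. The sketch below compares the plain mean with the coordinate-wise median; the median is used here only as an illustrative robust aggregator and is not a technique prescribed by the Hexa-X work. The reported values (a utilisation ratio and a latency in milliseconds) are invented.

```python
import numpy as np

def mean_aggregate(reports):
    """Plain average: a single extreme report can shift the result arbitrarily."""
    return np.mean(reports, axis=0)

def median_aggregate(reports):
    """Coordinate-wise median: tolerant to a minority of extreme reports."""
    return np.median(reports, axis=0)

# Each row is one stakeholder's report: [utilisation ratio, latency in ms].
honest = np.array([
    [0.50, 20.0],
    [0.52, 21.0],
    [0.48, 19.0],
    [0.51, 20.5],
])
malicious = np.array([[9.99, 500.0]])  # deliberately false report
reports = np.vstack([honest, malicious])

naive = mean_aggregate(reports)     # heavily distorted by the false report
robust = median_aggregate(reports)  # stays close to the honest values
```

With four honest reports and one malicious one, `robust` remains within the range of the honest values, whereas `naive` is pulled far away from them.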

E. Verification and Validation
AI/ML systems help make decisions in diverse use cases, such as self-driving cars, robot control, or factory reconfiguration. To be able to trust AI/ML decisions, it is essential to verify and validate the AI/ML system to ensure that it functions correctly in both normal and abnormal conditions.

F. Coordination of AI-driven management functions
AI-based management typically uses multiple CL-based functions, each of them usually with a different KPI. Uncoordinated behaviour of these individual functions may lead to sub-optimal optimisation results or even chaotic system behaviour. Sometimes, one function's activity may impact another function's goal by re-configuring the environment (network, slice, etc.). In other cases, two or more functions may directly affect the same configuration parameter of the environment. The problem has been intensively researched in Self-Organising Networks (SON) [25], but no satisfactory solution has been found. So far, the problem has been pragmatically solved by separating the timescales of different CL operations or by assigning them different priorities. It is possible, however, to use AI/ML or a game-theoretical approach to address the problem. For example, hierarchical RL [26], [27] can be used for both the implementation of CL functions and their coordination. However, the preferred solution should be agnostic to the coordinated functions, which calls for intent-based coordination. There is still much work needed on this topic at the architecture and algorithmic levels.
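The pragmatic priority-based approach mentioned above can be sketched as a small arbiter that resolves conflicting change requests on the same configuration parameter. The function names, parameter names, and priorities below are hypothetical; the sketch covers only direct conflicts on a shared parameter, not the indirect case in which one function alters another function's goal through the environment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeRequest:
    function: str    # name of the requesting CL function
    parameter: str   # configuration parameter to modify
    value: float     # requested new value
    priority: int    # higher number wins a conflict

def coordinate(requests):
    """Keep, for every parameter, only the request with the highest priority."""
    winners = {}
    for request in requests:
        current = winners.get(request.parameter)
        if current is None or request.priority > current.priority:
            winners[request.parameter] = request
    return winners

requests = [
    ChangeRequest("energy_saving", "cell_7.tx_power_dbm", 30.0, priority=1),
    ChangeRequest("coverage_optimisation", "cell_7.tx_power_dbm", 40.0, priority=2),
    ChangeRequest("load_balancing", "cell_7.handover_offset_db", 3.0, priority=1),
]
decisions = coordinate(requests)
```

Here the two requests on the transmit power collide and the higher-priority one is applied, while the handover offset request passes through unchanged. Such static priorities are easy to implement but cannot trade off the goals of the functions against each other, which is why the learning-based and intent-based alternatives are of interest.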

G. Explainability
The data structures generated by AI/ML models typically come in the form of collections of huge numerical matrices. Unlike traditional software programs, this makes it difficult for humans to understand the inherent logic of these models, which are commonly seen as black boxes whose operating logic is not easily explainable.
Explainability is one of the key features regarding AI identified by the Assessment List on Trustworthy Artificial Intelligence (ALTAI) EU initiative [28]. This is because, in case of an incident, it is always necessary to know why the M&O system took certain decisions. It is important to remark that M&O actions can have serious consequences on the QoS provided to the customer, so improper M&O actions may cause serious economic and reputational costs to the MNO, which need to be properly explained. Beyond this, explainability is also needed as part of the regular network operations activities, to provide the operational teams with the information they need to understand the operational processes.
This challenge can be addressed using so-called eXplainable AI (XAI) techniques, which can be used to find the underlying rules behind already trained AI/ML algorithms. The applicability of XAI to telecommunications has already been addressed by the research community [29]. However, XAI methods are still not in the scope of current M&O systems, mainly because of the lack of AI-driven Operation Support System (OSS)/Business Support System (BSS) solutions. Moreover, the application of AI to M&O is complex, so the use of XAI is not straightforward. An AI-driven OSS/BSS operates as a feedback CL, which means that AI-assisted decisions impact the whole environment, as in a typical feedback-based control system. In addition, due to the rich functionality of the OSS/BSS, many of its operations will be AI-driven, each possibly using a different AI/ML model and set of algorithms, and these operations will not be decoupled from each other. To ease the use of XAI in this context, it is necessary to decompose the whole OSS/BSS system into smaller AI-driven subsystems to which XAI methods can be applied. This will lead to a kind of "local XAI" that contributes to the "global XAI".
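As one concrete example of a model-agnostic XAI technique, the sketch below computes permutation feature importance: each input feature is shuffled in turn, and the resulting increase in prediction error indicates how much the model relies on that feature. The technique is chosen here for illustration and is not claimed to be the one used in [29]; the `black_box` function is a stand-in for a trained AI/ML model, and the data is synthetic.

```python
import numpy as np

def permutation_importance(model, x, y, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((model(x) - y) ** 2)
    importances = []
    for col in range(x.shape[1]):
        shuffled = x.copy()
        shuffled[:, col] = rng.permutation(shuffled[:, col])
        importances.append(np.mean((model(shuffled) - y) ** 2) - baseline)
    return np.array(importances)

def black_box(x):
    """Stand-in for a trained model; it ignores the third feature entirely."""
    return 3.0 * x[:, 0] + 0.5 * x[:, 1]

rng = np.random.default_rng(seed=1)
features = rng.normal(size=(200, 3))  # e.g. three monitored KPIs (hypothetical)
targets = black_box(features)

scores = permutation_importance(black_box, features, targets)
```

The scores rank the first feature as the most influential and assign zero importance to the third one, which the model never uses. Applied per subsystem, such scores are an example of the "local" explanations discussed above.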

H. AI Inference Error Management
Even the most advanced state-of-the-art AI/ML systems are subject to errors. Some well-known examples are in the fields of autonomous driving and image recognition. The difficulty is that such errors seem to be intrinsic to AI systems, so there is no easy solution: patching a specific failure (e.g., an error in recognising a specific pattern) does not guarantee that the problem has been solved in a general way, and could even introduce new complications, because the variety of possible real-life scenarios is simply too wide.
However, practical M&O systems able to deal with complexity and to take automated actions in the real world are needed, and those actions may have a relevant impact on the network performance and the deployed services. An important matter to consider is that the service provider could incur economic and reputational costs if failures lead to a degraded quality of service, which would negatively impact the customers' perception.
There could be different ways to face this problem. First of all, it is important to consider that AI/ML systems are not so different in this sense from regular non-AI software systems. Bug fixing is in fact a regular software maintenance practice, even with systems already in production. Of course, it is of paramount importance to minimise bugs and to have the appropriate workaround procedures always available in order to mitigate errors. The same rationale should also apply to AI/ML-based systems. However, another important recommendation is to always keep a human in the loop. This is the "Human Agency and Oversight" recommendation from the ALTAI EU initiative [28], which states that final decisions should always be supervised by humans, while AI/ML systems should act just as recommenders. This should apply mainly to services where human life or security could be at risk, or when the estimated reputational or economic costs for the operator could be too high. In high-risk cases, a human should always have the last word. This measure can of course be relaxed for those services that do not affect people's safety or put the reputation of the MNO at risk.
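The human-in-the-loop policy described above can be sketched as a simple gate between the AI/ML recommender and the execution of an M&O action. The risk attributes and the cost limit are assumptions chosen for illustration; in practice they would be defined by the operator's own risk assessment.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    action: str                # M&O action proposed by the AI/ML system
    safety_critical: bool      # could human life or security be at risk?
    estimated_cost_eur: float  # estimated economic/reputational exposure

def dispatch(recommendation, cost_limit_eur=10_000.0):
    """Route high-risk recommendations to a human, execute the rest automatically."""
    if recommendation.safety_critical:
        return "await_human_approval"
    if recommendation.estimated_cost_eur > cost_limit_eur:
        return "await_human_approval"
    return "execute_automatically"

low_risk = Recommendation("scale_out_video_cache", False, 200.0)
high_risk = Recommendation("migrate_emergency_call_nf", True, 200.0)
costly = Recommendation("reconfigure_core_slice", False, 50_000.0)

outcomes = [dispatch(r) for r in (low_risk, high_risk, costly)]
```

Only the low-risk recommendation is executed without supervision; the other two are held until a human operator approves them, so that the AI/ML system acts as a recommender in those cases.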

V. SUMMARY
In this paper, we have briefly reviewed the use of AI/ML for network M&O. In addition, we have introduced the Hexa-X M&O architecture, and explained how it envisions introducing the AI/ML functionalities that will enable more efficient operations in future 6G networks. Finally, we have presented and discussed the challenges ahead derived from the requirements of integrating AI/ML in M&O operations.
ACKNOWLEDGMENT
This work has been partly funded by the European Commission through the H2020 project Hexa-X (Grant Agreement no. 101015956).