Monitoring and Data Analytics for Optical Networking: Benefits, Architectures, and Use Cases

Operators' network management systems continuously measure network health by collecting data from deployed network devices; the data is used mainly for performance reporting and for diagnosing network problems after failures, as well as by human capacity planners to predict future traffic growth. These network management tools are generally reactive and require significant human effort and skill to operate effectively. As optical networks evolve to fulfil highly flexible connectivity and dynamicity requirements and to support ultra-low latency services, they must also provide reliable connectivity and increased network resource efficiency. Therefore, reactive human-based network measurement and management will be a limiting factor in the size and scale of these new networks. Future optical networks must support fully automated management that provides dynamic resource re-optimization to rapidly adapt network resources based on predicted conditions and events; identifies service degradation conditions that will eventually impact connectivity and highlights critical devices and links for further inspection; and augments rapid protection schemes if a failure is predicted or detected and facilitates resource optimization after restoration events. Applying automation techniques to network management requires not only the collection of data from a variety of sources at various time frequencies, but also the capability to extract knowledge and derive insight for performance monitoring, troubleshooting, and maintaining network service continuity. Innovative analytics algorithms must be developed to derive meaningful input to the entities that orchestrate and control network resources; these control elements must also be capable of proactively programming the underlying optical infrastructure.
In this article, we review the emerging requirements for optical network management automation, the capabilities of current optical systems, and the development and standardization status of data models and protocols to facilitate automated network monitoring. Finally, we propose an architecture to provide Monitoring and Data Analytics (MDA) capabilities, present illustrative control loops for advanced network monitoring use cases, and report the findings that validate the usefulness of MDA for automated optical network management.

Although EON and SDN technologies can fulfill current capacity and dynamicity requirements, transport networks are expected to support the deployment of upcoming 5G mobile infrastructures in the near future; 5G mobile will extend far beyond previous generations and require an enhanced quality of experience for the final users with new services and improved network performance. To meet the goals of 5G, network infrastructures should provide increased levels of flexibility and automation, together with higher priority given to network optimization, security, energy consumption, and cost efficiency. In fact, disaggregation at the optical layer has been conceived to enrich the offer of available solutions and to enable the deployment of optical nodes that better fit optical network operators' needs.
As future network complexity increases, the main challenge for operators will be to promptly respond to variable network conditions while ensuring full availability and optimization of network resources. At the same time, current optical networks incorporate a complex ecosystem of devices and sensors that produce a large amount of data, which can be exploited to optimize the network in real time. To cope with complex and time-variable 5G service scenarios, Machine Learning (ML)-based algorithms [2] are being proposed to facilitate network operation and predictive maintenance. ML algorithms, fed with real measurements, are able to accurately estimate the Quality of Transmission (QoT) of new lightpaths, to anticipate capacity exhaustion and degradations, or to predict and localize failures, among others (see, e.g., [3]-[8]).
Based on the above, network operators are looking with great interest at the opportunities that Monitoring and Data Analytics (MDA) can offer to their optical transport networks, as it emerges from applied research and standardization bodies. In fact, such solutions can be made available only after monitoring and telemetry protocols, together with data models, are standardized. There are multiple ongoing standardization efforts within several technology areas, where most of the proposals are based on three main principles: i) a data modeling language that provides structured data models for technology- and function-specific data points, ii) a management protocol for encoding and carrying the data model information, and iii) the operational process governing how the protocol interface is used and connections are managed. In addition to research activity on network telemetry (see, e.g., [9]), practical industrial projects exist, including the OpenConfig (see openconfig.net) and OpenROADM (see openroadm.org) efforts.
In this article, we review the operators' vision, as well as the capabilities of current optical systems and present three wide-scope use cases that require MDA-based solutions and whose application will bring clear benefits: i) network planning and provisioning with reduced margins, ii) dynamic network adaptation, and iii) lightpath degradation detection and failure localization. Next, the state-of-the-art of data models and monitoring and telemetry protocols is reviewed as well. With this in mind, several MDA architectures are proposed, and the pros and cons of each of them are highlighted. Finally, illustrative control loops for the considered use cases bring a clear and complete vision of the validity and feasibility of MDA in the context of optical transport networks.

The network operators' vision
The vision of network operators regarding the deployment of MDA in their optical networks concentrates mostly on three wide-scope use cases, as summarized in Table 1.
The first use case focuses on minimizing the system operation margins, e.g., linear optical signal-to-noise-ratio (OSNR), that are widely used in optical systems to ensure worst-case end-of-life QoT of the lightpaths. Before entering operation, all available combinations of modulation formats, fiber types, FEC codes, etc. are considered, and exhaustive numerical simulations and lab experiments are conducted to extract engineering rules to be used. This time-consuming analysis can be simplified by utilizing approximate analytical tools such as the Gaussian Noise (GN) model [10]. Both approaches lead to the estimation of QoT for the existing lightpaths. Nonetheless, these solutions are static by nature and based on conservative design principles, which lead to resource underutilization.
To reduce margins, analytical methods or ML-based algorithms can utilize the knowledge of the current network status, i.e., the configuration of optical devices (e.g., TPs, WSSs, ROADMs, etc.) and the characteristics of the optical fibers, to estimate the QoT of new lightpaths to be established [3]. During operation, the SDN controller is in charge not only of the provisioning process, but also of adapting the network to traffic changes, aiming at minimizing overprovisioning (it is quite common that packet traffic varies from day to night not only in intensity but also in directionality due to, e.g., data center activity). In this second use case, the role of MDA is to derive models to accurately predict the traffic volume for the short term, to detect whether the capacity of the lightpaths will soon be exhausted, etc. [4]. With such knowledge, the SDN controller can re-configure the network by leveraging the configurability of TPs, i.e., adapting the rate and spectrum of already established lightpaths and creating new lightpaths in real time, with significant CAPEX and OPEX savings.
The last use case concentrates on degradations and failures. All components deployed in an optical network age over their lifetime, e.g., amplifiers might lose gain, filters might introduce additional losses, fibers might accumulate splices, etc. This leads to a slow but continuous decrease in the lightpaths' QoT. Early detection of lightpath degradation would allow parameters within the TPs to be tuned, e.g., by increasing the FEC overhead or by switching to a more robust modulation format [5]. When the severity of the degradation increases, localizing its root cause is of paramount importance for maintenance purposes [6], [7]. It is also possible to predict failures and proactively re-route the traffic [8], which allows high resiliency of the optical network at just-enough cost. To this end, dedicated optical protection is replaced with just-in-time optical restoration.

Network planning and provisioning with reduced margins
Attenuation, dispersion and other fiber parameters, the noise figure of amplifiers, WSS passband, the sensitivity of TPs, etc. Those parameters can be used together with an analytical model to estimate the QoT of lightpaths accurately. ML-based methods to predict the probability that the QoT of a candidate lightpath will not exceed a defined threshold.

Dynamic Network adaptation
Leveraging configurable TPs to allocate just enough data rate to any connection at any time, coping with traffic dynamics on a scale of minutes or hours.
Better exploitation of network resources and potential savings by reducing the typical overprovisioning of static allocation.
Use of models to evaluate the expected QoT of a lightpath at any new TP configuration. Use of models for traffic analysis to evaluate traffic trends and periodicity.

Degradation and failure localization
QoT decreases over time due to network and device degradation (e.g., fiber cuts and repairs), ageing, or increasing load.
Degradation anticipation allows system parameters to be tuned appropriately before alarms are triggered. Localizing the element responsible for a failure facilitates network maintenance by allowing human intervention to be planned.
Predictive analysis based on QoT evolution. Localization based on the per-system analysis. Algorithms that find the potential cause of the failure.
Four aspects are particularly important and must be addressed to support the three use cases described above: i) which data may be obtained, derived or provided by the network devices and collected by the operators, ii) which key parameters are to be estimated and with what accuracy, iii) which technologies can be used to process the information, and finally, iv) what the main limitations are in terms of data availability, veracity, and frequency, and what is needed to overcome them.

Data availability
Considering the use cases defined in Table 1, optical devices need to be capable of performing measurements at selected points of the network, named Observation Points (OP). For example, measurements could be obtained from DSP units within the TPs, as well as from specific monitoring devices installed within the network. Specifically, DSPs can provide measurements or estimations of power levels, fiber channel characteristics (e.g., accumulated dispersion, fiber nonlinear coefficient, polarization mode dispersion) and QoT-related parameters (e.g., linear OSNR and BER). Furthermore, monitoring devices, like cost-effective optical spectrum analyzers (OSA) and optical time-domain reflectometers placed at predefined locations of the network, can provide specific measurements of optical signals and fiber segments.
Among all available and derived data, the most relevant is the OSNR measured at the receiver, which is used to define the system margin of every lightpath. While the estimation in the linear regime is straightforward, at the optimal power level or in the nonlinear regime the GN model can provide a worst-case accuracy as low as ±0.75 dB. An accurate enough value of the system OSNR would enable strategies that can lead to optimal usage of the optical spectrum. ML-based algorithms can also contribute to estimating this and other parameters, like laser characteristics or amplifier noise figure.
It is clear that operational data (i.e., the network topology, the route of lightpaths, etc.) is of paramount importance to realize all of the above use cases, as it allows measurements and events to be correlated; such operational data can be collected from the SDN controller. In addition, lightpath provisioning activity can also be collected from the SDN controller and used for traffic modeling. Other parameters may be available as well, like traffic forecasts that can be used to further optimize network operations or to predict failures.
Finally, by deploying low-cost monitoring devices, environmental parameters could also be exploited and eventually correlated for optimal network operation.

Considerations about MDA-based system
Besides data availability, it is also important to consider data accuracy in order to define the amount of data that is sufficient to collect and store, as the accuracy depends on the amount of data that the MDA system considers for decision making. For example, if a system operated in a purely linear regime, the pre-FEC BER could be enough to estimate the actual OSNR and then the relative system margin. However, real networks do not always operate in a fully linear regime, and therefore the pre-FEC BER may be unsuitable to always provide an accurate prediction of the instantaneous OSNR margin. Consequently, enough data needs to be stored to achieve a pre-defined accuracy, especially under low or zero-margin network operation.
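To make the linear-regime case concrete, the following minimal sketch inverts a textbook BER curve to obtain an SNR estimate and a margin with respect to a FEC threshold. It assumes Gray-coded QPSK over an additive white Gaussian noise channel, which is an illustrative choice and not a model taken from this article; the function names and the FEC threshold used in the usage line are hypothetical.

```python
import math

def qpsk_ber(snr_linear: float) -> float:
    """Theoretical BER of Gray-coded QPSK on an AWGN channel.

    snr_linear is the SNR per symbol (Es/N0) in linear units.
    This channel model is an assumption made for illustration.
    """
    return 0.5 * math.erfc(math.sqrt(snr_linear / 2.0))

def snr_db_from_ber(ber: float, lo_db: float = -10.0, hi_db: float = 30.0) -> float:
    """Invert qpsk_ber by bisection (BER decreases monotonically with SNR)."""
    for _ in range(100):
        mid = (lo_db + hi_db) / 2.0
        if qpsk_ber(10 ** (mid / 10.0)) > ber:
            lo_db = mid  # BER at mid is still too high: the SNR must be larger
        else:
            hi_db = mid
    return (lo_db + hi_db) / 2.0

def snr_margin_db(measured_ber: float, fec_threshold_ber: float) -> float:
    """Margin in dB between the estimated SNR and the SNR at the FEC threshold."""
    return snr_db_from_ber(measured_ber) - snr_db_from_ber(fec_threshold_ber)

# Hypothetical values: measured pre-FEC BER of 1e-3, FEC threshold of 2e-2.
margin = snr_margin_db(1e-3, 2e-2)
```

In the nonlinear regime this one-to-one mapping no longer holds, which is why the text argues that additional stored data is needed there.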
Another key factor is the update frequency; continuous, instantaneous collection of monitoring data could produce negative effects, so it is important to determine the right frequency for data collection. For instance, once a lightpath is established, and as long as there are no substantial changes in the network, there is no need to update the fiber channel values. In contrast, parameters such as amplifier power levels require a higher update frequency, although old values could be discarded if the individual amplifier works properly. Overall, all these data will ultimately be used by the MDA system, which might become saturated or reach sub-optimal decisions when faced with overwhelming or contradictory data. Different strategies can be envisioned to solve this issue: i) using thresholds, which are simple but inaccurate; ii) relying on experience and physical knowledge, which could lead to evaluation errors in unforeseen scenarios; iii) designing an intelligent MDA system that can decide, based on physical conditions, what data should be analyzed and that considers possible dependencies.
Finally, it is worth pointing out that the main challenges (and limitations) occur in multi-vendor scenarios. In this context, a proactive MDA system could anticipate issues before they happen and issue the proper recommendations, provided that the MDA system is aware of the configuration of all involved nodes at any time.
In conclusion, the opportunities that MDA opens go far beyond a monitoring data collector and storage platform. The analysis of the collected data can discover knowledge and use it to proactively self-configure and self-tune the network in a cost-effective (near) real-time manner by adapting resources to future conditions. Therefore, thanks to the application of data analytics to monitored data, observe-analyze-act control loops can be enabled, where outcomes of such analysis can be used for event notifications together with recommended actions to the SDN controller (Fig. 1). Last but not least, useful models can be estimated from monitoring data to feed planning tools in order to compute optimal solutions for the expected future conditions.

YANG DATA MODELS AND PROTOCOLS
YANG is a data modeling language standardized by the Internet Engineering Task Force (IETF) and designed to operate with the NETCONF protocol for network configuration and management (see IETF RFC 6020). YANG enables: i) human readability and simplified troubleshooting operations compared to protocols relying on bit encoding; ii) hierarchical structures of data models; and iii) extensibility and modularity through augmentation mechanisms and sub-modules. A YANG data model is represented by a tree structure where nodes are defined by names, data types, data values, or a set of child nodes and lists. In recent years, in the context of optical networks, YANG/NETCONF has emerged as a candidate solution to provide automated control of network elements having common and vendor-neutral standardized models [11]. Several standardization bodies, like the IETF, and working groups, e.g., the aforementioned OpenConfig and OpenROADM, have released vendor-neutral YANG models for devices such as X-ponders, optical amplifiers, and ROADMs. However, the related YANG models are significantly different, with relevant incompatibility issues. Although efforts are ongoing to converge towards commonly adopted models, multiple versions of drivers, software implementations, and SDN controller and monitoring customizations are expected in the near future, potentially delaying adoption in heterogeneous and multi-vendor networking scenarios.
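The tree structure described above can be illustrated with a small sketch. The node and leaf names below are hypothetical and are not taken from the OpenConfig, OpenROADM, or IETF models; the `config` flag mirrors the YANG `config false` statement that marks read-only state data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

@dataclass
class Node:
    name: str
    kind: str                       # "container", "list", or "leaf"
    dtype: Optional[str] = None     # data type, only meaningful for leaves
    config: bool = True             # False marks read-only (state) data
    children: List["Node"] = field(default_factory=list)

def leaf_paths(node: Node, prefix: str = "", state_only: bool = False) -> Iterator[str]:
    """Walk the tree depth-first and yield the path of every leaf."""
    path = f"{prefix}/{node.name}"
    if node.kind == "leaf" and (not state_only or not node.config):
        yield path
    for child in node.children:
        yield from leaf_paths(child, path, state_only)

# Hypothetical transponder model: two configurable leaves, one state leaf.
transponder = Node("transponder", "container", children=[
    Node("channel", "list", children=[
        Node("frequency", "leaf", dtype="uint64"),
        Node("target-output-power", "leaf", dtype="decimal64"),
        Node("pre-fec-ber", "leaf", dtype="decimal64", config=False),
    ]),
])

state_leaves = list(leaf_paths(transponder, state_only=True))
```

A monitoring system would typically subscribe to or poll only the state leaves, while the SDN controller writes to the configurable ones.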
For monitoring purposes, YANG relies on state (read-only) types providing the actual values of the considered system parameters. The SDN controller is able to retrieve YANG-defined parameter values by exploiting NETCONF messages either periodically (e.g., every 15 minutes) or asynchronously (e.g., in case of events) through notification messages. However, NETCONF messages are not particularly efficient for monitoring (especially when the data collection period is short, e.g., one minute) let alone for telemetry (e.g., when a continuous stream of data is provided). Thus, other protocols have been proposed for monitoring and telemetry purposes; the most relevant are: i) IP Flow Information eXport (IPFIX) (see IETF RFC 3917), ii) gRPC (see grpc.io), and iii) Apache Thrift (see thrift.apache.org); see a brief description in Table 2.  Data transfer efficiency is comparable to that of gRPC.
With these protocols available, the selection of the collection period is not limited to 15 minutes anymore, and it can be reduced to, e.g., 1 second [9]. Note that the shorter the collection period, the shorter the event that can be detected, as well as the shorter the time to detect degradations. However, reducing the measurement period increases the amount of data that has to be collected, stored, and analyzed. Thus, one approach to reduce the amount of data is to rely on monitoring with a collection period of minutes and to activate telemetry on demand to gain insight, by analyzing a continuous stream of measurements, when and where needed.
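The trade-off between collection period and data volume can be quantified with simple arithmetic. The sketch below counts samples per observation point per day; the function names and the on-demand scenario (one-minute monitoring plus ten minutes of one-second telemetry) are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(period_s: float) -> int:
    """Number of samples one observation point produces in a day."""
    return int(SECONDS_PER_DAY // period_s)

def daily_samples_with_on_demand_telemetry(
    monitor_period_s: float,
    telemetry_period_s: float,
    telemetry_active_s: float,
) -> int:
    """Samples per day when periodic monitoring runs all day and telemetry
    is activated only for telemetry_active_s seconds."""
    monitoring = samples_per_day(monitor_period_s)
    telemetry = int(telemetry_active_s // telemetry_period_s)
    return monitoring + telemetry

every_15_min = samples_per_day(15 * 60)   # 96 samples
every_second = samples_per_day(1)         # 86400 samples
on_demand = daily_samples_with_on_demand_telemetry(60, 1, 10 * 60)
```

Under these assumptions, continuous one-second collection produces 900 times more samples than 15-minute monitoring, whereas the on-demand scheme stays within a small multiple of the one-minute monitoring volume.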

MDA ARCHITECTURES
In this section, we present and analyze several architectural approaches to bring real MDA capabilities to the network (see Fig. 2). Specifically, three architectures are considered depending on where data analytics capabilities are enabled, namely: i) centralized, ii) distributed, and iii) hierarchical.
The centralized architecture (Fig. 2a) consists in detaching the monitoring repository and the data analytics system, if any, from the NMS to create a separate, dedicated centralized MDA controller that can interface with the SDN controller and other systems within the control plane (see, e.g., Ciena Blue Planet). To keep the MDA architecture simple, let us consider that its only mission is to expose an interface to collect monitoring and telemetry data from the network devices. Measurements are stored in a (big data) repository, and data analytics algorithms can be devised to discover knowledge that is used to predict and/or detect anomalies and degradations before they negatively impact network performance. Such predicted events can be notified to the SDN controller together with a recommended action; the recommended action is a suggestion that the SDN controller can either follow or ignore in favor of its own policies. As an example, in some cases BER degradation in a lightpath can be predicted before any threshold is exceeded by analyzing the evolution of the BER measured at the receiver; this is notified to the SDN controller together with a recommended action chosen after analyzing several alternatives, including changing the modulation format (also via probabilistic shaping), re-routing the lightpath (e.g., to avoid some links), or increasing, if possible, the amount of overhead used by the FEC. The notification to the SDN controller might trigger a reconfiguration, hence closing the loop and adapting the network to the new conditions.
The centralized MDA architecture presents some limitations; for instance, the time to detect an anomaly or degradation is related to the update frequency. Therefore, to reduce the detection times, the amount of data to be conveyed to the MDA controller needs to be increased accordingly. Another issue is related to the control of monitoring; specifically, to activate telemetry on-demand once an event has been detected.
To overcome these problems, the distributed architecture (Fig. 2b) includes MDA agents in charge of collecting measurements from a single node, while keeping the MDA controller centralized [13]-[14]. The MDA agent exposes two unified interfaces toward the MDA controller, one for collecting data and one for monitoring configuration; in addition, specific interfaces for data collection and monitoring control allow the MDA agent to connect to the network device. The data analytics capabilities deployed close to the network nodes enable the implementation of local control loops; measurements can be analyzed locally, and configuration can be tuned and adapted to changing conditions. However, the co-existence of two controllers in charge of configuring network devices, the SDN controller and the MDA controller, might create conflicts, so it would be desirable to clearly separate responsibilities between them.
The distributed architecture includes a dedicated MDA agent for every node in the network, which might present some limitations when disaggregated optical network nodes (e.g., TPs and ROADMs) and monitoring devices are deployed within the same central office (CO) [14]. For this reason, the hierarchical architecture (Fig. 2c) includes a per-CO MDA agent that collects measurements from every network device in the CO and exposes a single set of interfaces toward the MDA controller. In this case, measurements from one device can be analyzed in the CO MDA agent and the configuration of another device within the same CO can be tuned, thus minimizing the intervention of the MDA controller. The per-CO MDA agent may or may not replace the per-node MDA agents; when it does, the number of systems is reduced.
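The CO-level control loop can be sketched as follows: a per-CO agent observes measurements from one device and tunes another device in the same CO, escalating to the MDA controller only when no local rule applies. This is a minimal illustration under stated assumptions; the class names, device identifiers, action string, and threshold are all hypothetical and do not describe the interfaces of any actual MDA implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Rule:
    source: str                          # device whose measurements are observed
    target: str                          # device in the same CO to be re-tuned
    condition: Callable[[float], bool]   # when the rule fires
    action: str                          # tuning action to apply on the target

@dataclass
class COAgent:
    co_name: str
    rules: List[Rule] = field(default_factory=list)
    applied: List[Tuple[str, str]] = field(default_factory=list)      # (target, action)
    escalated: List[Tuple[str, float]] = field(default_factory=list)  # sent to the MDA controller

    def on_measurement(self, device: str, value: float) -> None:
        matching = [r for r in self.rules if r.source == device]
        if not matching:
            # No local knowledge about this device: forward to the MDA controller.
            self.escalated.append((device, value))
            return
        for rule in matching:
            if rule.condition(value):
                # Local control loop: tune another device within the same CO.
                self.applied.append((rule.target, rule.action))

agent = COAgent("co-1", rules=[
    Rule("osa-1", "tp-3", lambda power_dbm: power_dbm < -20.0, "increase-launch-power"),
])
agent.on_measurement("osa-1", -22.0)    # handled locally
agent.on_measurement("roadm-2", 1.0)    # escalated to the MDA controller
```

The design choice illustrated here is that cross-device rules live in the CO agent, which is what a per-node agent in the distributed architecture cannot express.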
The strengths and weaknesses of the analyzed MDA architectures are summarized in Table 3, where the features of each architecture include those of the previous one.

Table 3 Strengths and weaknesses of several monitoring and data analytics architectures.

Architecture: Centralized
Features:
- Includes a centralized MDA system with a data repository for monitoring/telemetry data where data analytics can be applied.
- Monitoring and telemetry activation and deactivation is managed by an external system, e.g., the NMS.
Strengths:
- Data analytics results can be used for network self-adaptation to changing conditions.
- Interfaces with the SDN controller and NFV orchestrator can be easily standardized.
Weaknesses:
- Different monitoring/telemetry protocols need to be available at the MDA controller.
- The amount of data to be collected from the nodes increases exponentially to keep low reaction times against anomalies or degradations.
- Configuration tuning is not supported.
- Network slicing is difficult to support.

Architecture: Distributed
Features:
- Allows data analytics to be applied within the MDA agents, close to the network nodes. Control loops can be implemented locally at the node level.
- Monitoring and telemetry activation/deactivation is managed by the MDA controller.
Strengths:
- Supports configuration tuning [12].
- It reduces the data to be conveyed to the MDA controller, since pattern recognition can be done in the MDA agents.
- MDA agents expose one single monitoring and telemetry interface to the MDA controller.
- Supports network slicing [13].
Weaknesses:
- A configuration interface needs to be defined between the MDA controller and the agents.
- The MDA controller becomes more complex as more features are added, like monitoring and telemetry control, and configuration tuning.
- CO control loops are not supported.

Architecture: Hierarchical
Features:
- It includes a per-CO MDA agent that connects to all the nodes in the CO.
Strengths:
- Control loops can be implemented locally at the node, as well as at the CO level involving more than one node.
- Appropriate for node disaggregation scenarios, where monitoring devices can be installed in one node, but configuration tuning needs to be done in a different node [14].
- It reduces the total number of agents and the number of interfaces toward the MDA controller.
Weaknesses:
- Requires more complex MDA agents to consider complex relations among nodes.

ILLUSTRATIVE CONTROL LOOP IMPLEMENTATION
This section illustrates how the use cases introduced in Section 2.1 can be implemented. To this end, let us assume a disaggregated scenario, where COs are equipped with TP nodes and ROADMs and the hierarchical MDA architecture is selected. Apart from the MDA, the control plane includes an SDN controller in charge of configuring the optical network, a planning tool running optimization algorithms for provisioning and in-operation network planning purposes [15], and an NMS for human operators to manage the network.
Additionally, it is worth highlighting that, since external systems such as planning tools may require access to data stored in the MDA controller upon request, additional interfaces need to be defined. The data that is then available as part of the MDA is not simply the raw measurements streamed from the network devices, but also estimated data and derived knowledge generated by ML algorithms.

Lightpath provisioning with a reduced margin
In this first use case, we focus on provisioning lightpaths while minimizing the system margin, which can be derived from the OSNR and/or the TP's pre-FEC BER threshold according to the transmission scenario. OSNR estimation at optimal launch power and in the nonlinear regime requires data from monitoring the optical channel and the device configuration, which we assume are already available in the MDA controller (labeled 0 in Fig. 3a). Besides this information, parameters related to the network infrastructure (such as fiber types and link lengths) are also required. These might be collected, e.g., from the SDN controller, the NMS, and inventory systems. When a lightpath set-up request arrives at the SDN controller (1), the latter relies on the planning tool to compute the route, spectrum allocation, modulation format, and other parameters that contribute to minimizing the system margin while guaranteeing the QoT (2). In order to compute an optimal solution and meet the objective function criterion, the planning tool needs to access data from the MDA controller (3); once a solution has been found, it is sent back to the SDN controller (4). Here, (at least) three possibilities exist: i) the lightpath can be established, and an optimal configuration has been found; ii) the lightpath can be established provided that the configuration of other lightpaths is first changed; and iii) no solution has been found. In the second case, the planning tool returns the optimal configuration found for the requested lightpath, together with a (reactive) recommended action for the SDN controller to modify the configuration of a subset of already established lightpaths; in this case, the SDN controller might request the human operator to confirm the re-configuration through the NMS (5). Finally, in the case that the lightpath can be established, with or without network re-configuration, the SDN controller configures the network devices accordingly.
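The three possible planning outcomes and the way the SDN controller might react to each can be sketched as below. This is a hedged illustration of the decision logic only; the type names, the action strings, and the `operator_confirms` callback standing in for the NMS confirmation step are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List

class Outcome(Enum):
    FEASIBLE = auto()                  # i) optimal configuration found
    FEASIBLE_AFTER_RECONFIG = auto()   # ii) other lightpaths must change first
    NO_SOLUTION = auto()               # iii) request cannot be served

@dataclass
class PlanningReply:
    outcome: Outcome
    reconfigure: List[str] = field(default_factory=list)  # lightpaths to modify first

def handle_reply(
    reply: PlanningReply,
    operator_confirms: Callable[[List[str]], bool],
) -> List[str]:
    """Return the ordered actions the SDN controller would apply."""
    if reply.outcome is Outcome.NO_SOLUTION:
        return []
    actions: List[str] = []
    if reply.outcome is Outcome.FEASIBLE_AFTER_RECONFIG:
        # Reactive recommendation: ask the operator (through the NMS) first.
        if not operator_confirms(reply.reconfigure):
            return []
        actions += [f"reconfigure:{lp}" for lp in reply.reconfigure]
    actions.append("setup:new-lightpath")
    return actions

reply = PlanningReply(Outcome.FEASIBLE_AFTER_RECONFIG, reconfigure=["lp-7"])
actions = handle_reply(reply, operator_confirms=lambda lightpaths: True)
```

Here the existing lightpaths are reconfigured before the new one is set up, matching the condition in case ii) that their configuration is changed first.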
Further optimization can then be achieved by observing the QoT of each established lightpath, aiming at identifying possible transmission adaptations (e.g., FEC, modulation format) that bring the margins closer to the predefined target values.

Dynamic network adaptation
In the previous control loop, the planning tool issued a recommended action for re-configuration in response to a request from the SDN controller (we call such recommendations reactive). In this and the next use case, the MDA controller issues recommendations to the SDN controller as a result of observing what is happening in the network, aiming at anticipating the most relevant events. In this context, we refer to them as proactive recommendations.
As for the case before, we assume that data are already available within the MDA controller (labeled 0 in Fig. 3b). ML algorithms running in the MDA controller can use the measured packet traffic volume to determine a traffic model for the traffic between every origin and destination CO. Such traffic models can be used to compare the expected traffic against the provisioned capacity and therefore, when the measured or the expected traffic for the near future is close to the allocated capacity, the MDA controller issues a notification to the SDN controller including a recommended action to reconfigure the allocated capacity (1).
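As a minimal sketch of this comparison, the code below builds a very simple traffic model (the mean traffic per hour of day) and issues a notification when the expected traffic in the coming hours approaches the allocated capacity. The model, the 90% threshold, and the notification fields are illustrative assumptions; the ML-based traffic models referred to in the text would be considerably more elaborate.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

def hourly_profile(samples: List[Tuple[int, float]]) -> Dict[int, float]:
    """Build a per-hour-of-day mean profile from (hour_of_day, traffic_gbps) samples."""
    by_hour: Dict[int, List[float]] = {}
    for hour, gbps in samples:
        by_hour.setdefault(hour, []).append(gbps)
    return {hour: mean(values) for hour, values in by_hour.items()}

def capacity_notification(
    profile: Dict[int, float],
    current_hour: int,
    horizon_h: int,
    capacity_gbps: float,
    threshold: float = 0.9,
) -> Optional[dict]:
    """Return a notification if expected traffic within the next horizon_h hours
    (horizon_h >= 1) reaches threshold * capacity; otherwise return None."""
    upcoming = [(current_hour + k) % 24 for k in range(1, horizon_h + 1)]
    expected = max(profile.get(hour, 0.0) for hour in upcoming)
    if expected >= threshold * capacity_gbps:
        return {
            "event": "capacity-exhaustion-expected",
            "expected_gbps": expected,
            "recommended_action": "increase-capacity",
        }
    return None

history = [(20, 80.0), (20, 100.0), (21, 95.0), (3, 10.0)]
profile = hourly_profile(history)
note = capacity_notification(profile, current_hour=19, horizon_h=2, capacity_gbps=100.0)
```

The returned dictionary plays the role of the notification with a recommended action that the MDA controller sends to the SDN controller in step (1).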
In these dynamic cases, the SDN controller might inform the human operator through the NMS (2) and then request the planning tool to compute the optimal capacity configuration for the detected event (3). For such computation, the planning tool needs data from the MDA controller, such as the expected traffic matrix for, e.g., the next hours (4) [4]. Such a traffic matrix can be computed assuming the maximum or the 95th percentile traffic volume expected for every origin-destination pair. With such a traffic matrix, an optimization algorithm running in the planning tool can compute the optical capacity allocation and respond to the SDN controller (5). Finally, the SDN controller implements the reconfiguration in the network (6).
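The construction of the expected traffic matrix can be sketched as follows, assuming that a list of predicted traffic volumes is available for every origin-destination pair. The nearest-rank definition of the percentile is one common choice made here for simplicity; the article does not specify which definition is used, and the CO names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def percentile_nearest_rank(values: List[float], p: float) -> float:
    """p-th percentile using the nearest-rank method (0 < p <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]

def traffic_matrix(
    predictions: Dict[Tuple[str, str], List[float]],
    p: float = 95.0,
    use_max: bool = False,
) -> Dict[Tuple[str, str], float]:
    """Reduce the predicted volumes of each origin-destination pair to one value."""
    return {
        od: (max(volumes) if use_max else percentile_nearest_rank(volumes, p))
        for od, volumes in predictions.items()
    }

# Hypothetical predictions (Gb/s) for one pair of COs over the next hours.
predicted = {("co-A", "co-B"): [float(v) for v in range(1, 21)]}
matrix_p95 = traffic_matrix(predicted)
matrix_max = traffic_matrix(predicted, use_max=True)
```

Using the 95th percentile instead of the maximum makes the dimensioning less sensitive to isolated peaks, at the cost of occasionally under-provisioning.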

Lightpath degradation detection and modulation format adaptation
For this use case, let us consider that the lightpath is established and being monitored, with BER measurements collected by the MDA agents connected to the end TPs (labeled 0 in Fig. 3c). A data analytics algorithm running within the MDA agents can be in charge of detecting BER trends to anticipate QoT degradation [5]. When QoT degradation is detected, a decision can be made locally without the intervention of the MDA controller. For instance, modern TPs are capable of identifying the modulation format of the received signal by means of DSP. Therefore, a change in the modulation format employed for a lightpath can be initiated at one of the transmitters, and the end TPs will automatically detect the change and apply the same one in the opposite direction (such a local control loop is not shown in Fig. 3c). However, in disaggregated multi-vendor scenarios, both ends might need to be reconfigured simultaneously. To that end, the MDA agent sends a notification to the MDA controller (1), which evaluates the capabilities of both TPs and the available options. The degradation detection, together with a recommendation (e.g., change the modulation format to a more robust one), is sent to the SDN controller (3), which implements it in the devices, possibly after checking it with the operator through the NMS (3-4).
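One simple way to detect a BER trend is to fit a straight line to log10(BER) over time and extrapolate to the threshold. The sketch below does this with an ordinary least-squares fit; it is an illustrative stand-in and not the algorithm of [5], and the function names, sample values, and threshold are hypothetical.

```python
import math
from typing import List, Optional, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept.

    Requires at least two samples with distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_to_threshold(
    times_h: List[float], bers: List[float], threshold_ber: float
) -> Optional[float]:
    """Estimate the hours left (from the last sample) until the BER trend
    crosses threshold_ber; return None if the BER is not increasing."""
    slope, intercept = fit_line(times_h, [math.log10(b) for b in bers])
    if slope <= 0:
        return None
    t_cross = (math.log10(threshold_ber) - intercept) / slope
    return max(0.0, t_cross - times_h[-1])

# Hypothetical measurements: BER grows by half a decade per hour.
times = [0.0, 1.0, 2.0, 3.0]
bers = [10 ** (-6 + 0.5 * t) for t in times]
remaining = hours_to_threshold(times, bers, threshold_ber=1e-3)
```

An MDA agent could raise its notification once the estimated remaining time drops below the time needed to reconfigure both TPs.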

SUMMARY
We have provided the network operator vision for automating management of advanced optical network infrastructure, key requirements and current enabling optical technologies.
In this article, the role of MDA in optical networking has been studied through three wide-scope use cases covering the main network operations: i) network planning and provisioning, ii) dynamic network adaptation, and iii) degradation detection and failure localization, where clear benefits have been unveiled. Interestingly, current networking devices are already capable of performing measurements that support those use cases. Additional data can be collected by installing specific monitoring devices at predefined locations.
A review of the ongoing standardization activities revealed that different initiatives are working towards modeling optical components and are adopting different solutions. In addition, several protocols can be used for monitoring and telemetry purposes. From the control plane perspective, it is not clear whether SDN controllers will support MDA functions beyond merely collecting monitoring data. In view of that, a specific MDA system has been proposed and three different architectures, from centralized to distributed, were analyzed, where an MDA controller is defined in the control plane working together with the SDN controller. Finally, illustrative control loops supporting examples of the selected use cases have been shown.
As a final remark, although the technologies supporting MDA in optical networks are ready, a significant amount of discussion is still required within the relevant standardization forums and industrial open-source projects to leverage this work fully.