Minimizing resource protection in IP over WDM networks: Multi-layer shared backup router [invited]

Optical networks and Internet Protocol/Multi-Protocol Label Switching (IP/MPLS) based networks have traditionally been designed and operated by separate departments of network operators. Likewise, both layers have always represented different business areas for system providers, which maintain different product lines for each of them. However, network operators now have an IP/MPLS over WDM architecture. The main driver for this unified transport network is the cost saving achieved by simplifying the network infrastructure. Thanks to the advent of reconfigurable optical equipment and a multilayer control plane, 1+1 node protection at the IP layer in each location is no longer required. This work compares two resilience strategies for multi-layer network dimensioning: dual-plane protection and a multi-layer shared backup router (MLSBR). Based on the results of this paper, an MLSBR provides a significant reduction (up to 24%) in the required IP equipment in comparison with the dual-plane IP protection approach.


I. INTRODUCTION
In recent years, the explosion of landline and mobile broadband has led to an unprecedented growth of traffic in telecommunication networks, with sustained compound annual growth rates of 40%-60%. This trend is expected to continue in the future, due to the widespread deployment of fixed and mobile broadband access and the emergence of new traffic-intensive applications such as Cloud Computing and High Definition multimedia applications [1]. In this highly demanding environment, operators are concerned about the capability of current network architectures to support the scalability required by the Future Internet, especially in terms of cost per bit. The revenues of the operators are no longer growing exponentially, and tough competition lowers the final customer bill. A scenario in which costs increase continuously while revenues remain almost flat could eventually erode the operators' margins. In other words, current network architectures require constant investments, which are not justified by the corresponding profit increase. As a result, operators demand a new network model to improve the Return on Investment (RoI) and to ensure long-term profits (Fig.1).
During the last decade, core networks have experienced a clear trend towards simplification. MPLS technology applied ATM concepts to the emerging packet switching paradigm, and MPLS became the current standard in packet core networks. More recently, WDM/OTN networks with a GMPLS control plane have emerged as the next generation transport network, combining the scalability of WDM technologies with the dynamicity provided by a control plane. Moreover, even though IP/MPLS and WDM/OTN still represent two significantly different domains, a critical mass of experts is working towards further integration as the next natural step in network architecture evolution [2]. The authors of [3,4,9] demonstrate that significant CAPEX savings can be obtained by a rational combination of optical and electronic switching for transit traffic.
Although there is a trend to migrate towards an IP/MPLS over WDM architecture, there is still a separation of the IP and optical management layers, which leads to highly redundant and uncoordinated protection schemes. A common resilience strategy applied in some network operators' deployments (e.g. Telefónica) consists of combining protection and restoration mechanisms in each layer. Moreover, in such deployments, each IP link is designed with a peak load link utilization of around 30-50%, to ensure enough capacity in the IP network for recovery if a failure occurs in the transport or in the IP layer. Each layer carries out its own protection mechanisms without exchanging information with the other. Each connection used to provision an IP link in the transport network is protected using a dedicated 1+1 protection scheme, and each IP router and card is duplicated to protect against single failures. For the rest of this article, this technique is referred to as the Dual Plane approach.
Multi-layer network survivability is a topic of high interest for the research community. The first multi-layer mechanisms were proposed in [5] for an ATM over SDH/WDM architecture. However, as the network architecture has changed, new studies are based on current IP/MPLS over WSON networks. The scope of the research in multilayer networks is wide: new metrics to decide how to recover from failures are defined in [6], while the authors of [7] focus on CAPEX reduction in an IP/MPLS-over-WSON network. The authors of [8] present routing mechanisms suitable for multi-layer restoration. Multi-layer restoration has been extensively researched in the past few years. In [10][11], more cost-efficient architectures are presented as an alternative to current dual-plane protection. However, they have not been sufficiently explored.
Building on advanced optical layer capabilities and multi-layer control, authors in [12] presented the MLSBR concept and compared the availability of MLSBR and the dual-plane protection scheme.
The idea behind MLSBR consists of having extra shared backup routers to restore the traffic in case of a failure of an IP router. This technique is compared with the Dual Plane approach, where two IP planes are created in order to deal with node failures. This paper presents comparative results to quantify the savings that can be obtained by applying this novel approach. The present document extends the techno-economic study of MLSBR carried out for the Spanish Telefónica core network in [13]. The impact of the number of shared backup routers on the overall savings is quantified.
This article is organized as follows: Section II reviews the survivability mechanisms and Section III explains the dual-plane network dimensioning process. Next, Section IV describes the Multi-layer Shared Backup Router use case and Section V analyzes its availability in comparison with 1+1 protection. Section VI presents the CAPEX analysis for a realistic scenario. Section VII discusses the open issues that must be solved to obtain an efficient solution that can be deployed in an operator's network. Finally, Section VIII concludes the article.

II. SURVIVABILITY MECHANISMS
The network survivability mechanisms are directly related to the mean time between failures (MTBF) and mean time to repair (MTTR) parameters. In fact, ideally, a network with a very high MTBF would not require survivability mechanisms because the network would never fail. Since in the real world there are equipment failures as well as other sources of outage, such as maintenance activities and software upgrades, different protection and restoration mechanisms have been proposed.

A. Protection
Protection mechanisms are based on including extra equipment to have backup resources in case of failure. Protection means that backup resources are disjoint from the nominal ones, allowing traffic to be recovered when a failure appears in the nominal resource. Accurate network planning is needed to over-dimension the network correctly.
There are different protection schemes, defined according to how the resources are used. 1+1 schemes consist of splitting the traffic between the resources (50% each). 1:1 mechanisms use all the resources in the nominal path, while the backup path carries no traffic. N:M schemes operate the same way as 1:1 schemes, but with M options to recover N resources. Fig.2 depicts these protection models. Backup resources in protection schemes are predefined by the operator to protect the nominal ones, and they cannot be used by another network resource to recover traffic.

B. Restoration
The restoration concept emerged to reduce the cost of protection schemes, which keep resources unused (partially or completely). Restoration uses all the resources to carry traffic and does not reserve any of them for recovery. Consequently, the restoration path towards the destination must be computed online, when the failure occurs.
As an advantage, restoration allows a more efficient resource usage in the network. However, the network behavior is less predictable, making the network planning process more complex.

C. Multi-layer Restoration
The idea behind multi-layer restoration is to extend the restoration mechanism to a multi-layer network where multiple technologies are involved in the restoration process. Since protection and restoration schemes are defined for scenarios where all the nodes, links and paths are in the same layer, network operators use a combination of protection and restoration mechanisms in each layer independently. Two layers reacting separately with their own resilience mechanisms create resource inefficiencies. In some cases, failures cannot be recovered because of the lack of inter-layer communication and coordination.
The typical example of a failure that cannot be recovered with single-layer survivability mechanisms is a failure in the inter-layer connections. A failure of the fibre or the cards connecting two layers may trigger single-layer actions, but in many cases the system is not able to recover from that failure. A solution to this type of failure is to use other resources in both layers to reach the same endpoints. However, as there is no multi-layer coordination, the network is unable to exploit such a possibility.

III. DUAL-PLANE NETWORK DIMENSIONING
Current core networks consist of a multi-layer topology formed by IP routers and Reconfigurable Optical Add-Drop Multiplexers (ROADMs). IP nodes are connected to the optical nodes to transmit the IP demands over the optical topology. The Dual Plane network dimensioning strategy against node failures consists of splitting the demands between two equivalent planes, which carry 50% of the traffic each. Each plane is dimensioned to carry 100% of the traffic, so in case of a node failure the other plane can absorb all the traffic in the network. Fig.3 illustrates this approach with a sample network scenario with nominal traffic between adjacent nodes. To avoid any single point of failure, each core router is connected to the optical mesh through different ROADMs. Similarly, in case of a failure in a ROADM, the traffic is rerouted using the backup core router, as in the IP node failure case.
To allow traffic recovery in less than 50 ms, all the demands affected by a failure are rerouted through the backup router using FRR, as shown in Fig.4. Capacity is duplicated in current dual-plane network dimensioning, because enough capacity is required to carry all the traffic in case of a failure in one plane. This single-layer dimensioning approach for node failures does not benefit from the existence of an underlying optical layer. Thanks to the optical layer, it is possible to reach any node in the network by creating a new lightpath, and thus to reconnect to any remote router in case of failure. However, path provisioning may take up to 1 min, due to the equalization process.
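The dual-plane rule can be illustrated with a minimal sketch for a single inter-site link. The function name, the port capacity and the utilization ceiling `max_util` are hypothetical inputs chosen for illustration; they are not taken from the dimensioning tool used in this work.

```python
from math import ceil

def dual_plane_link(demand_gbps, port_gbps, max_util=1.0):
    """Size one inter-site link under dual-plane protection (illustrative only).

    Each plane carries 50% of the demand in normal operation but must absorb
    100% of it when the other plane fails, so every plane is sized for the
    full demand. port_gbps and max_util are hypothetical inputs.
    """
    ports_per_plane = ceil(demand_gbps / (port_gbps * max_util))
    capacity_per_plane = ports_per_plane * port_gbps
    return {
        "ports_per_plane": ports_per_plane,
        "total_ports": 2 * ports_per_plane,
        # Normal operation: half of the demand on each plane.
        "nominal_utilization": (demand_gbps / 2) / capacity_per_plane,
        # One plane failed: the surviving plane carries everything.
        "failure_utilization": demand_gbps / capacity_per_plane,
    }

# Hypothetical example: 300 Gb/s demand, 100 Gb/s ports, 80% ceiling under failure.
link = dual_plane_link(demand_gbps=300, port_gbps=100, max_util=0.8)
```

Under these assumptions the nominal utilization of each plane can never exceed half of the failure-case ceiling, which is why links in such designs run at low utilization in normal operation.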

IV. MULTI-LAYER SHARED BACKUP ROUTER USE CASE
The MLSBR use case consists of providing backup routers, which are available in case of a node failure. We assume that there is an optical mesh connecting access, transit and interconnection routers. As previously described, in the dual-plane approach all IP nodes must be duplicated to cope with an IP router failure. Let us assume a hierarchical architecture with three levels, as shown in Fig.5. This structure is typical of many IP networks. We refer to the lowest level of the hierarchy as access routers, to the second level as transit routers and to the highest level as interconnection routers.
In this example, the transit routers are duplicated to recover from a transit router failure. When using MLSBR, a set of Shared Backup Routers (SBRs) is available so that, when a transit router fails, its configuration is copied to an SBR and new connections are created to the access and interconnection nodes. This scheme is presented in Fig.6.
Note that recovery with dual-plane protection is faster than with MLSBR, because MLSBR requires setting up optical connections to the backup router (which takes minutes), as well as configuring the backup router with the configuration of the failed router, which again could take a few minutes. As previously mentioned, it is assumed that there is an optical mesh with spare capacity between access and transit nodes. However, as demonstrated in previous work [12], the network availability with the MLSBR approach is better than with traditional dual-plane protection.

V. MLSBR AVAILABILITY ANALYSIS
This section highlights the availability of this approach based on our previous work in [12]. To measure how "survivable" a network is, the availability concept is defined as how long a user can access the services provided by the network. Equation 1 presents the availability parameter definition.
Network status is defined in terms of the presence of failures in IP/MPLS nodes in the network. As there is an optical mesh, any failure in the optical layer is recovered by the optical mechanisms [12]. The analytical study is done using a Markov chain model where the node reachability status is defined as follows:
• Failure: 1
• No failure: 0
The transitions between states of the Markov model are defined based on the MTBF and MTTR parameters.
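For reference, the textbook steady-state forms commonly used in this kind of analysis are sketched below. They are an assumption: the symbols A, λ (failure rate) and μ (repair rate) are introduced here for illustration and may differ from the notation of [12].

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}, \qquad
\lambda = \frac{1}{\mathrm{MTBF}}, \qquad
\mu = \frac{1}{\mathrm{MTTR}}
```

With these rates, a single node moves from state 0 (no failure) to state 1 (failure) at rate λ and back at rate μ.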

A. One region case
First, the problem is defined for a one-region scenario. A region is defined as a group of access routers that are connected by a pair of transit routers (Fig.5). In this scenario, the MLSBR and 1+1 protection schemes behave the same, because there are no other regions that can be reached via the optical mesh. Since in the one-region scenario there are two IP/MPLS transit nodes, each capable of carrying the whole region traffic (1+1 protection with 50% capacity dimensioning), the states can be defined as follows:
• 2 Active routers: No service affected
• 1 Active router: No service affected
• 0 Active routers: Services affected
The model with the transitions between states for the one-region case is depicted in Fig.7.
Applying the Markov model, the resulting expression for availability is presented in equation 4. The result in this case is the same for the MLSBR approach as for the 1+1 protection scheme, because the available resources are the same in both cases. Two transit routers at 50% capacity can carry the traffic in case of one failure, but in case of a double failure the service is affected.
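A minimal sketch of the three-state chain described above is given below. It is not necessarily the expression referred to as equation 4: it assumes exponentially distributed failure and repair times and independent repair of each failed router, and the function names are hypothetical.

```python
def one_region_availability(mtbf_h, mttr_h):
    """Steady-state availability of a region served by two transit routers.

    Birth-death chain over the number of active routers (2, 1, 0).
    Assumptions (not from the original analysis): exponential failure and
    repair times, and independent repair of each failed router.
    """
    lam = 1.0 / mtbf_h   # failure rate of one router
    mu = 1.0 / mttr_h    # repair rate of one router
    # Unnormalized steady-state weights from the balance equations:
    #   2*lam * p2 = mu * p1   and   lam * p1 = 2*mu * p0
    w2 = 1.0
    w1 = w2 * (2 * lam) / mu
    w0 = w1 * lam / (2 * mu)
    p0 = w0 / (w2 + w1 + w0)   # both routers down: services affected
    return 1.0 - p0

def one_region_closed_form(mtbf_h, mttr_h):
    """Same result written with k = MTTR / MTBF (assumed definition of k)."""
    k = mttr_h / mtbf_h
    return 1.0 - (k / (1.0 + k)) ** 2

# Hypothetical example: MTBF of 3 years and MTTR of 4 hours.
a = one_region_availability(mtbf_h=3 * 8760, mttr_h=4)
```

Under these assumptions the service is affected only in the state with zero active routers, so the availability is one minus the probability of a double failure.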

B. N regions case
The problem is generalized using equation 6, where N is the number of regions with duplicated transit routers and k is the relation between MTTR and MTBF (equation 5). With this expression, multi-layer restoration network availability can be calculated.

C. Availability comparison of the schemes
The MTTR of the protection and MLSBR schemes has been compared assuming an MTBF of 3 years in the IP routers for the scenario presented in Fig.5, with seven locations for transit routers. Based on the results, MLSBR allows the MTTR to be increased for the same availability. This means that OPEX can be reduced using this protection scheme.
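The trade-off between MTTR and availability can be explored with the sketch below. It is an illustration under loudly stated assumptions, not the model of [12] or equation 6: routers fail and are repaired independently, the dedicated scheme is degraded when any region loses both of its routers, the shared scheme is degraded when fewer than N of the 2N routers are up, and the time needed to set up optical connections is ignored. All function names and the unavailability target are hypothetical.

```python
from math import comb, expm1, log1p

HOURS_PER_YEAR = 8760.0

def router_unavailability(mtbf_h, mttr_h):
    """Steady-state probability that one router is down (independent repair)."""
    return mttr_h / (mtbf_h + mttr_h)

def unavail_dedicated(n_regions, u):
    """1+1 per region: degraded if any region loses both of its routers."""
    # 1 - (1 - u^2)^n, computed in a numerically stable way.
    return -expm1(n_regions * log1p(-u * u))

def unavail_pool(n_needed, n_total, u):
    """Shared pool: degraded when fewer than n_needed of n_total routers are up."""
    min_failed = n_total - n_needed + 1
    return sum(comb(n_total, j) * u**j * (1 - u) ** (n_total - j)
               for j in range(min_failed, n_total + 1))

def max_mttr(unavail_fn, target, mtbf_h, lo=1e-3, hi=1e5, iters=200):
    """Bisection for the largest MTTR (hours) whose unavailability stays <= target."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if unavail_fn(router_unavailability(mtbf_h, mid)) <= target:
            lo = mid
        else:
            hi = mid
    return lo

MTBF_H = 3 * HOURS_PER_YEAR   # MTBF of 3 years, as assumed in the text
N = 7                         # seven transit-router locations
TARGET = 1e-6                 # hypothetical unavailability target
mttr_dedicated = max_mttr(lambda u: unavail_dedicated(N, u), TARGET, MTBF_H)
mttr_pool = max_mttr(lambda u: unavail_pool(N, 2 * N, u), TARGET, MTBF_H)
```

For a single region both functions reduce to the probability of a double failure, in line with the one-region case. For several regions the pooled model tolerates a longer MTTR for the same target, which is the qualitative behaviour reported above; the absolute values depend entirely on the assumptions.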

VI. IMPACT ON CAPEX REDUCTION
The MLSBR concept is evaluated on the Telefónica Spanish core network (Fig.8), which shows the national optical mesh with the transit and interconnection routers. The transit routers are co-located with some of the optical transit nodes. The CAPEX study focuses only on the IP equipment: ports and chassis. It is assumed that there is spare capacity in the optical layer to carry out all the optical restorations required to connect access routers with SBRs.
Fig.6. MLSBR scheme in a hierarchical network.
Fig.7. One-region case.
This network has the structure shown in Fig.5. It is composed of 6 interconnection routers (TR15-TR20) and 14 transit routers (TR1-TR14). The layer 3 connectivity between transit routers is a ring and star with the core in Madrid. Three points of connection between the transit and interconnection layers have been modeled, two in Madrid and one in Barcelona. The interconnection and transit routers in Madrid and Barcelona are collapsed into two routers in Madrid and Barcelona.
Note that only the two upper levels have been taken into account for the numerical results of this study. One port is required in the transit and interconnection routers. The savings in the access level depend on the number of connections between the access and the transit routers (which in turn depend on the traffic volume and the capacity of the ports). With this technique, just one extra port in the access routers is required to avoid a single point of failure (instead of one for each connection between access and transit routers in the 1+1 protection scheme). However, these savings are independent of the number of SBRs. This numerical study uses the traffic demands of 2012 and a traffic growth of 35% per year, in order to evaluate the same network five years later (2017). For the dual-plane protection approach, the network is dimensioned using the 20 IP routers shown in Fig.8. For the MLSBR mechanism, the dimensioning process is done for just one plane (i.e. the 7 odd-numbered transit routers and only 3 interconnection routers). In both cases the optical layer infrastructure remains exactly the same. The IP layer is dimensioned with a maximum occupation of 80% in case of any failure in the network. The number of SBRs can vary depending on how many node failures the network is protected against. The number of IP ports required with MLSBR is obtained for each number of SBRs. The IP port savings of the MLSBR approach with respect to the dual-plane protection dimensioning approach are presented in Fig.9.
In light of the results, introducing two SBRs saves almost 24% of the IP ports that need to be deployed. The percentage decreases as the number of SBRs grows, but the savings are maintained in 2017. With 7 SBRs, there would be the same number of IP routers as in the dual-plane protection case, as there are seven transit routers per plane.
Finally, note that not only the ports but also the router chassis are reduced. In this scenario, there are 14 transit nodes and 6 interconnection routers. With 5 SBRs, MLSBR reduces the 14 transit routers to 12 (7 to carry the traffic and 5 for backup purposes). This means a reduction of 14.3% in routers. The maximum savings in this scenario are obtained when 2 SBRs are used, which leads to 35.7% savings: instead of 14 transit routers, 9 routers are used (7 routers for normal operation and 2 backup routers).
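As a worked example of the traffic projection, and assuming that the 35% growth compounds annually over the five years, the 2017 demand relates to the 2012 demand as follows. The symbols D and g are introduced here only for illustration.

```latex
D_{2017} = D_{2012}\,(1+g)^{5} = D_{2012}\cdot 1.35^{5} \approx 4.48\, D_{2012}
```

In other words, under this assumption each 2012 demand is multiplied by roughly 4.5 before the network is dimensioned for 2017.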
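The chassis arithmetic above can be reproduced with a short sketch. The function name is hypothetical; the model only encodes what the text states, namely that dual-plane uses two transit routers per region while MLSBR uses one per region plus the SBRs.

```python
def transit_chassis_savings(n_regions, n_sbr):
    """Percentage of transit chassis saved by MLSBR versus dual-plane.

    Dual-plane duplicates every transit router (2 per region); MLSBR keeps
    one per region plus n_sbr shared backup routers.
    """
    dual_plane = 2 * n_regions
    mlsbr = n_regions + n_sbr
    return 100.0 * (dual_plane - mlsbr) / dual_plane

# Seven regions, as in the scenario studied here.
for n_sbr in (2, 5, 7):
    print(n_sbr, round(transit_chassis_savings(7, n_sbr), 1))
```

With seven regions this gives roughly 35.7% for 2 SBRs, 14.3% for 5 SBRs and no savings for 7 SBRs, where the router count equals that of the dual-plane case.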

VII. OPEN ISSUES
Even though from an analytical point of view this mechanism can offer savings to the network operators, there are some open issues that must be solved to see this solution deployed in real network scenarios.

A. Role of a Multi-layer SDN controller
A multilayer SDN controller is an effective alternative to solve routing and path computation in a multilayer scenario composed of an IP/MPLS network over an optical WDM circuit-based transport network. This element has been validated for IP/MPLS service provisioning in [13] using the IETF ABNO architecture. An SDN controller must program the backup paths so that, in case of failure, any router knows which UNI path to establish. In addition, the SDN controller may need to program the backup router differently depending on which transit router has failed (for example, the routing metric of the links should mimic the metric of the original links). Even though such a solution is close, MLSBR has not yet been demonstrated.

B. Reachability information
Nowadays, the inter-layer TE-Links for multi-layer scenarios are still configured manually (and not autodiscovered). In addition, the existence of these links is not disseminated to remote nodes. The most reasonable solution to have an automated process is to disseminate the reachability information (for example with LMP) among border routers through the optical network. This scenario requires new functionality in the border routers, to announce themselves as reachable through the optical mesh and to discover remote border routers that are also reachable through the optical mesh.

C. Addressing
The assignment of IP addresses to the interfaces newly created for a restoration is a relevant issue, as it should be performed in the network planning phase. One possible solution would be to configure a pool of IP ranges available to the IP interfaces and let the SDN controller assign and free them based on a certain policy. Alternatively, it is possible to use unnumbered interfaces.
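The address pool option can be sketched as follows. This is a minimal illustration, not a controller implementation: the class name, the link identifiers, the documentation address block and the first-in first-out policy are all assumptions.

```python
import ipaddress

class InterfaceAddressPool:
    """Hypothetical pool of point-to-point subnets for restoration interfaces."""

    def __init__(self, block, prefixlen=31):
        network = ipaddress.ip_network(block)
        # Pre-planned subnets that are currently unassigned (FIFO policy).
        self._free = list(network.subnets(new_prefix=prefixlen))
        self._in_use = {}

    def assign(self, link_id):
        """Reserve a subnet for a newly created IP link and return it."""
        if link_id in self._in_use:
            return self._in_use[link_id]
        if not self._free:
            raise RuntimeError("address pool exhausted")
        subnet = self._free.pop(0)
        self._in_use[link_id] = subnet
        return subnet

    def release(self, link_id):
        """Return the subnet to the pool once the restoration link is torn down."""
        subnet = self._in_use.pop(link_id)
        self._free.append(subnet)

# Hypothetical usage: two access routers reconnected to the same SBR.
pool = InterfaceAddressPool("192.0.2.0/29")
a = pool.assign("access1-sbr1")
b = pool.assign("access2-sbr1")
```

Keeping the block pre-planned preserves the address design of the planning phase, while the controller only decides which pre-approved subnet a restoration link receives.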

D. Standard interface to configure the IP/MPLS routers
The advent of a multi-layer control plane may ease the configuration of MPLS and GMPLS equipment. However, the establishment of a path in an IP/MPLS network with a photonic mesh is not only a transport layer process. Once the path is set up with the multi-layer control plane, the IP routers must be configured. Although there are efforts to standardize an interface to configure the IP routers [14], there is no standard solution yet. From the router, the UNI could trigger the LSP creation. Another option is to use PCEP to configure the LSP, and NETCONF/YANG or CLI to configure the router [14].

E. Optical Restoration Time
To reduce the MLSBR recovery time, it is mandatory to improve the optical restoration mechanisms. Currently, channel equalization implies that this time can be in the order of one minute in real networks. This means that traffic may be interrupted during this time when this mechanism is used. The mechanism is therefore useful for best-effort traffic, which can tolerate traffic losses. However, new research in optical restoration is reducing this time.

F. Reverting back to normal
When the failed router is fixed, the network must revert back to using it, but most operators will not allow for another traffic outage during this process. Therefore, one needs to come up with a gradual process in which the backup router and the now recovered nominal router coexist and links are gradually removed from the backup router and transitioned back to the nominal router.

VIII. CONCLUSIONS
This article presents an evaluation of the MLSBR mechanism in a real network operator scenario. The work shows the higher availability of this mechanism in comparison with the current dual-plane protection scheme. Besides, the article develops a use case that quantifies the CAPEX savings for an operator. Based on the findings of this article, MLSBR can reduce the number of IP ports in the network by up to 24% and it can increase the MTTR. This means that network operators can reduce their CAPEX and OPEX using this approach. However, there are some requirements that have to be fulfilled to deploy this solution in the network.