BGP Anycast Tuner: Intuitive Route Management for Anycast Services

IP anycast has become a vital technology for DNS and CDN operators alike. Yet, while big operators have their tools to monitor and configure anycast routing, most of anycast networks are still configured manually. In this paper, we introduce a new approach to anycast management. Our solution is based on active measurements combined with traffic engineering. We propose the concept of a "BGP Cookbook" that allows operators to forecast the effects of routing policy changes over their services. We also introduce a web-based interface, called "BGP Anycast Tuner", that allows operators to gain insight into their service’s performance and provides easy management through automation. We evaluate our approach by implementing a prototype running in a testbed composed of 12 anycast sites covering 5 continents. We demonstrate our tool in two different use cases: discovering and fixing a sub-optimal anycast routing issue, and shifting traffic between continents, which is useful during service disruptions.


I. INTRODUCTION
IP anycast is a technology that allows for the same IP address to be announced from multiple global locations. Each announcement is generated by one anycast site, and whole solutions relies on inter-domain routing to allocate traffic between sites [1] [2]. IP Anycast is used in Content Delivery [3] [4] [5], distributed denial-of-service (DDoS) mitigation [6] [7] [8], and by the Domain Name System (DNS) infrastructure [9] [10] [11]. Nowadays, it is possible to identify hundreds of anycast prefixes being used across the Internet [12].
Anycast routing divides the Internet into "catchments", where each site attracts a share of clients. To do that, IP anycast relies on BGP, the default inter-domain routing protocol on the Internet [13]. However, this division is not perfect, and clients sometimes wind up at faraway sites [14] [15]. To help operators map catchments, VERFPLOETER [16] is the state-ofthe-art tool. Just mapping catchments, though, is not enough. Operators want to control how many clients, or even which clients, go to each site for performance and security reasons. This can be done by changing BGP routing at anycast sites.
While large anycast operators deal with catchment control by adding more sites, most of anycast operators needs to perform routing adjustments to better support their users. One example of small anycast operator are the DNS cc-TLD (country code top-level domain), who frequently owns less than twenty sites by anycast prefix. Small anycast networks adjust catchment by doing routing changes manually. This limits the ability of an operator to explore the best configurations. Moreover, such manual tasks are prone to human errors with potentially serious consequences [17] [18] [19] [20].
To help operators address this problem, we present BGP ANYCAST TUNER, a measurement-based tool that automates the configuration of anycast sites, the administration of catchments, and the analysis and visualization of anycast management. Through an intuitive web-based interface, operators can observe the predicted effects of routing policy changes, select an appropriate configuration, and deploy it automatically.
The main contributions of our work are: (i) we establish a systematic approach to measure the distribution of clients of anycast services under different routing conditions, (ii) we present and release an open source tool, BGP ANYCAST TUNER, to manage anycast services using a routing COOK-BOOK; and (iii) we evaluate our prototype in a real-world anycast testbed.
The remainder of this paper is organized as follows. Related work is discussed in Section II. In Section III, we review anycast routing and we establish the necessary requirements for anycast network management. In Section IV, we present our solution, BGP ANYCAST TUNER, followed by Section V, where we explain details of our implementation and evaluation on a real-world anycast testbed. Finally, we draw conclusions and future work in Section VI.

II. RELATED WORK
Anycast has been extensively studied since it was first proposed back in 1993 [1]. Early studies focused on the first large-scale application of anycast to the DNS [9] [21] [22], and on architectures for creating a global anycast service [3] [4]. More recently, anycast was used as a security tool [23] [6] [24]. Hesselman et al. [25] study how to provide a control plane for DNS top-level domain (TLD) operators to increase security and stability of TLDs, while Rizvi et al. [8] describe a method for building a "response playbook" for the use of BGP to shift anycast traffic when under DDoS attacks.
On anycast routing and performance, Ballani et al. were the first to study anycast performance metrics [26]. Wei et al. [27] show that while anycast routing is generally stable, a small number of routes sees persistent instability. De Vries et al. [28] developed VERFPLOETER to measure anycast catchments and later deployed it in one of the largest global anycast CDNs [29]. McQuistin et al. [30] analyze the impact of routing on anycast CDNs. Recently, Wei et al. [31] proposed a system to detect latency issues through an anycast/unicast probing.
Although there is no specialized management for anycast networks yet, Wiefferink [32] evaluated how current tools can help in the analysis of anycast, while Costella et al. [33] built a solution to visualize anycast catchments.

III. REQUIREMENTS
In this section we discuss the challenges and requirements of managing anycast services. First, however, we present fundamental concepts that help understand the proposed solution.
By using anycast, services are hosted on multiple servers using the same IP address. Figure 1 depicts an anycast service where three sites (S n ) share the address 10.0.0.1. Each client (bottom part) tries to reach the server IP via distinct paths according to AS (clouds) interconnect relationships (straight lines). Under the hood, BGP is responsible for selecting the routing path for each destination according to a set of metrics and policies. The color of each AS shows its preference in reaching the 10.0.0.0/24 prefix. Each site has a so-called catchment, which is the set of clients that reaches it. The catchment of the green site (S 1 ) consists of client 1, the red site (S 2 ) attracts no clients, and the blue site (S 3 ) attracts clients 2 and 3.
An operator observing the anycast network in Figure 1 may conclude that: (i) client 2 is likely better served by the red server (S 2 ) than the blue one (S 3 ), and (ii) in order to make the red server (S 2 ) attract client 2, routing must be reconfigured.
There are many cases where the catchment of an anycast service may be less than ideal for some clients [31]. To modify client catchment -i.e., to make a client end up at a different anycast server -BGP Interdomain Traffic Engineering (BGP-TE) [34] would be required. While Figure 1 is just a toy example, to perform this for a real anycast network with tens, or even hundreds of sites, the use of management tools is essential. Anycast operators need to know, for example, how clients are distributed across the anycast service, what would be the optimal catchment for a particular client, and how traffic engineering affects the catchment distribution.
Our goal is to make operating an anycast service easier by introducing management tooling. In the remainder of this section, we introduce requirements for such management tooling. Table I   R1: Catchment Mapping -To understand the traffic distribution and observe deviations from the baseline, it is essential to map the catchment of each anycast site. Anycast catchments can be measured both passively and actively. Passive approaches rely on analyzing incoming flows to anycast sites. Active approaches, in turn, generate traffic from vantage points (VPs) in typically two approaches: using application clients such as RIPE Atlas [35], or by probing the Internet and observing the answers [28].

R2: Traffic Engineering(TE) for Catchment Control -
Although the anycast operator does not have control of the routing path that each client uses, it is possible to use BGP attributes to influence the routing preferences at the AS-Level. The most common used attributes are AS-Path prepending [36], and community strings [37]. Anycast management tooling must support easy manipulation of routing policies for all anycast sites and measure its impact on the service (R1).
R3: Support for "Pluggable" Metrics -While catchment control is a common goal of operators, the motivation for exerting such control varies. For example, a common go-to metric to measure User-Experience is delay, such as roundtrip time [38]. However, other goals and metrics should also be supported. Consider for example an operator who needs to control in which geographic region clients end up, in order to comply with a country's laws. In this case, client metrics should include IP Geo-location [15] [39]. Another example is when the objective is to handle DDoS attacks; in this situation, metrics should consider site metrics, as interface loss rate and CPU loads [6] [8].
R4: Automated Discovery of TE Options -As previously discussed, anycast operators may use individual site TE (traffic engineering) options to modify the catchment distribution. Such options is tied to how the site is connected (e.g., number of peers and policies). More sites and richer connectivity provide ranges from dozens to thousands policies. By combining all options, the operator has a complex problem to solve: which BGP policy should be used and which are its effects on site catchments? To deal with this complexity, one requires an automated method to measure all the BGP policies available per site, and map their side effects in term of client distribution.
R5: Fast, Atomic Configuration Deployment -Manual routing configuration of each site does not scale for anycast networks. Problems, such as DDoS attacks, make it necessary to quickly act when deploying a chosen routing configuration. Therefore, a key requirement for an anycast management system is that a new routing configuration must be deployed in a fast and atomic way on all sites that together comprise the anycast service.
R6: A Simple and Scalable Management Interface -Most anycast services range in size from a few to tens of different sites. A requirement for a management system therefore is that its interface must be able to accommodate these differences in scale. Furthermore, a management interface should be intuitive for operators, both in visualizing the catchment distribution (helping operators understand how routing affect catchments) and in changing routing policies to achieve certain goals (e.g., shift traffic away from certain sites, improve performance for certain clients, prioritize sites with spare capacity).

IV. APPROACH
In this section, we describe our approach to manage anycast client distribution, how we test routing policies, optimize this process, and provide this information to operators.

A. Overview
We propose a scalable methodology to map the effects of client distribution over anycast sites given a BGP routing policy. Figure 2 shows a schematic overview of our approach. To bootstrap the process, it is fundamental to determine all the available BGP policies per site. This information is stored in a BGP Routing Policy Database, and it is used to deploy the configuration. The process Deploy Routing Policy is responsible for applying a routing configuration on the sites of the anycast service. Next, we perform the Catchment Measurement. Data from this measurement describes the client distribution per site at each deployed BGP policy. We summarize this distribution in a COOKBOOK that describes the effect of a particular policy on the anycast service. Finally, an Operator (top-right) can choose which policies the system evaluates, and select an appropriate one to achieve certain operational goals.

B. The Routing Policy Database (RPDB)
An anycast service typically consists of multiple sites, each with their own external connectivity and different options to influence routing. We map all these possible BGP routing scenarios in a centralized repository named Routing Policy Database. This database reflects the BGP peering agreement established between each site and the respective upstream provider and/or Internet eXchange Point (IXP). Table II illustrates the concept for one site in our example network from Figure 1. This table shows six BGP routing policies that can be applied on site S 3 . Column Peer describes the BGP neighbors of the site, i.e., where the routing policy will be deployed. Prefix describes the routing prefix that is advertised in conjunction with the Routing Policy. For example, in the first row (Baseline), no routing policy is used. From rows 2 up to 3, AS-Path prepending is used, and in the last two rows the BGP community strings noPeer and noExport are used [40].

Policy Id
Site Peer Prefix Routing Policy   Figure 1.
Operators must decide upon which policy will be added to RPDB. More combinations means more TE options and more fine-grained traffic control. Too many combinations, however, can lead to time being spent on low impact anycast-TE options. The operator is responsible for establishing a trade-off between policy combinations and useful results to store in the COOKBOOK.

C. Deploying Routing Policy
With RPDB in place, the next step is to deploy one routing policy (one row from Table II). While deploying a policy, it is important to observe best practices on inter-domain routing, such as BGP convergence time and route flapping [41].

D. Catchment Measurement
At this point, the anycast sites are configured with the selected BGP policy. Next, we need to map the client distribution and other metrics affected by the deployed configuration. Which other metrics to consider depends on the goals of the operator, but that can include, for example, geographic location of clients or traffic load on each site. In our approach, we choose the methodology of De Vries et al. [28] because it is an active measurement, has larger coverage, and delivers a more impartial view than the other options available.

E. The BGP COOKBOOK
After deploying each routing policy and measuring its impact, we store the results in a database named the BGP COOKBOOK. This name derives from the idea that we consider each routing policy as an ingredient used to build different traffic engineering recipes. While in RPDB we store all possible routing policies available, in the COOKBOOK we just store those whose results are relevant in the eyes of the operator (manual), or following parameters established by the operator, such as the percentage gain (in terms of traffic shift) compared to previous stored recipes (automated). Table III shows a possible COOKBOOK for our example from Figure 1. Let's assume that the operator wants to distribute the clients evenly over the sites. It is important to remember that, in the baseline situation shown in Figure 1, S 3 attracts two clients while S 2 does not attract any.

Policy Id
Routing Policy Metric Qt Catchment  Figure 1.
In this COOKBOOK, we find our normal state (Baseline) and three distinct anycast-TE strategies: 1xS 3 , saying one prepend on S 3 P 3 will result in clients C 1 and C 2 shifting from the Blue Site to the Red one; noPeerS 3 , forecasting the metric Qt Catchment will be equalized if we deploy a policy called noPeer (BGP community no-peer) on site S 3 ; and the -3xS 2 , indicating reverse prepend or negative prepend as a composite policy to bring all clients to S 2 through prepending on all sites except S 2 . As can be noticed, noPeerS 3 is the policy aligned with the operator's goal.

F. Process Optimization and Limitations
While the process described so far yields a useful COOK-BOOK of relevant routing configurations, there are two important concerns regarding scalability: (i) the time required to deploy and collect metrics for each anycast-TE recipe; and (ii) the combinatorial explosion resulting from an increase in the number of anycast sites.
In our prototype (Section V), we can check around 60 distinct routing policies per day. If we build the entire COOK-BOOK for our example at  Table III, we would spend around 52 hours (5 3 × 25 min). If we expand our toy scenario to 10 sites and 7 policies per site, a naïve test would take thousands of years.
However, instead of testing all possible combinations, we can choose an explore-and-test approach to significantly reduce the number of tests. The policy selection can be optimized by considering one of the following approaches: • Customer Cone: Based on number of customers of each AS [42], we were able to reduce nearly 99% of individual peer tests on IXPs. We selected just to apply policies on ASes with large cone sizes as provided by AS-Rank [43]. • Sparse Test: For each class of test, we can divide all combinations into smaller chunks and we try the best and worst cases on each chunk first. If results are not good, we stop (e.g., we try 3-prepends and compare with site baseline before trying 1 or 2 prepends). We could reduce in 83% the test on individual routing policies. • Representativeness: We do not try policies for peers that account for less than 1% of a site's catchment. • Human-in-the-Loop Factor: An operator knows best when it is good enough to stop testing more policies.
Although with lower impact, there is also room to optimize the other time-consuming part of our process, measuring the catchment. Our chosen measurement system uses a hit list to probe at least one address in each routable /24 prefix on the Internet. We can reduce time by reducing the list in two ways: • Hitlist shrinking: Probing just a single address per prefix. As most prefixes on full routing table are larger than /24 (e.g., of size /16), we reduced the hit list in 87%. • Hitlist dimming: Probing a single address per AS. While this reduces visibility into the largest ASes, which may not have homogeneous routing policies for all of their prefixes, the view on smaller ASes remains precise, and the number of measurements was vastly reduced by 98%.

G. The BGP Anycast Tuner Interface
The BGP policies and metrics summarized in the COOK-BOOK aim to help an operator chose the best routing policies to achieve a certain goal. However, choosing the best solution can be a complex task when hundreds of policies are available.
A closer examination of our approach shows that the results we stored in the COOKBOOK are mainly about attracting more traffic to a site, or reducing it. Also consider the fact that anycast is a "closed system", i.e., if one site attracts more traffic, other sites will lose such traffic. With this in mind, it is possible to think of each individual site having a limited number of options within a band of possibilities ranging between positive (attract more traffic) and negative (reduce traffic) values, and if one site is changed, all other sites will change as well, compensating the difference. This concept reminds of an old-fashioned "sound tuner", leading us to chose this as a metaphor for our graphical user interface.
BGP ANYCAST TUNER is a prototype graphical interface that brings together all parts of our methodology. It provides a simple and intuitive interface for operators and presents the distribution of clients over the anycast sites for the different pre-determined BGP configurations from the BGP COOK-BOOK. Figure 3 presents the BGP ANYCAST TUNER graphical user interface. The figure shows the client distribution for each site using a histogram (in blue), and sliders underneath the histogram. Using these sliders, operators can increase or decrease the catchment of a site using a set of predetermined settings indicated by "notches" on the slider. These notches correspond to specific BGP policies, the effect on catchment we previously determined using measurements. As the example also shows, these notches are not evenly distributed across the slider, reflecting the fact that the flexibility of a site to shift clients to or from other anycast sites varies. In other words, it reflects the fact that some BGP policies result in a larger shift in catchment than others. In addition, some sites have more degrees of control because of their relationships with their BGP neighbors and traffic agreements. Also notice that the sliders for the sites are linked, i.e., if the slider of one site is moved, all other sliders move as well to reflect the fact that a change in catchment for one site automatically means the catchments of all other sites change as well.

V. PROTOTYPE AND EVALUATION
In this section, we evaluate our approach against the requirements discussed earlier in Table I. For this evaluation, we prototyped our approach and implemented it on an anycast testbed infrastructure. In the subsequent subsections, we describe our testbed, the results we obtained in terms of traffic engineering, and show how the prototype behaves in two scenarios. The prototype code can be found in our repository 1 .

A. Prototype Environment and Implementation
We built our prototype on top of our anycast testbed TANGLED [44]. This testbed consists of 12 anycast sites across the globe. It have a good diversity of peers, being connected to IXPs, cloud providers, commercial transit providers, and academic networks. Such similarly to industry environment supporting a full-range of routing agreements, make it a challenging testing ground for our approach.
We note that, although the prototype was implemented in the TANGLED anycast testbed, the implementation method is not tied to its specific topology, sites, or network resources. Furthermore, our goal for the evaluation is not to quantify or measure results specific to the testbed, but rather to validate that our approach works and meets the requirements. Thus, we restricted the choice of routing policies in our evaluation process, avoiding combinations of policies and sending the same BGP announcements to all peers at each site. 1 https://github.com/SIDN/BGP-Anycast-Tuner Our prototype consists of four parts: (i) a routing policy deployment tool based on ExaBGP that is used to deploy the selected routing policy on all the sites of the anycast service, (ii) an anycast-TE orchestration tool that deploys a chosen recipe from the COOKBOOK on all sites and coordinates the catchment measurement with VERFPLOETER, (iii) a statistical analysis tool for the catchment measurement data that persists resulting metrics to the COOKBOOK, and finally (iv) two dashboards for the operator, one to visualize catchment data and one for load tuning (the latter forms the core of our approach and brings together all of its components).
As part of the deployment process, we wait for BGP to settle for 10 minutes. This amount of time is presented in the literature [45] [46] [47] as a necessary routing and forwarding convergence time.

B. A Cookbook for Tangled
As the first step of our evaluation, we build a COOKBOOK for the TANGLED testbed. To make the evaluation easier to explain, we show just a subset of 6 anycast sites and 13 distinct routing policies per site (the entire cookbook has 78 policies). Every line in Table IV represents a measurement in our testbed and the respective percentage of /24 networks handled by each site for each BGP policy. We highlight the row with our service baseline (regular BGP announcement). We also highlight the site LHR and sort the table in descending order, from maximum load (64.58%) to minimum load (0%).
Observing our baseline, we note that 7.22% of clients went to the site located in Paris (CDG); 15.28% went to Washington D.C. (IAD); 23.38% went to London (LHR); 15.97% to Miami (MIA); 20.98% went to Porto Alegre (POA), and 17.16% went to Sydney (SYD). In the Anycast-TE column, we use mnemonics (e.g., -2xLHR) to describe two negative prepends for site LHR (the negative notation means two prepends on all sites except LHR). Other examples of mnemonics are 1xLHR to describe a single prepend for the London site, and LHRnoPeer and LHRnoExp to express that the BGP community noPeer and noExport were used at London.
The measurement data in Table IV predicts the number of clients at each site for each BGP configuration. Operators can use this table to configure their anycast service depending on the desired load on sites. Ordering entries by LHR load, as we did, reflects an operator strategy to select how much traffic he/she wants to bring to or remove from LHR. In anycast services with many sites, however, this process may be confusing and error-prone. To address this challenge, we introduced BGP ANYCAST TUNER to automate this process and visualize it in an intuitive manner. Effectively, ordering the table illustrates how the slider for LHR in the BGP ANYCAST TUNER interface shown in Figure 3 is populated.
In the next subsections, we present two scenarios in which we applied our approach in the TANGLED testbed.

C. Scenario 1: Catchment Visualization and Troubleshooting
This scenario shows how our prototype meets requirements R1, R2, R3, R4, and R6. The scenario illustrates how an operator can detect and repair a case of sub-optimal routing.  To better visualize catchment information for operators, we developed an interactive web interface (Figure 4). This interface allows operators to investigate site catchment and detect client distribution anomalies, such as identifying which anycast sites serve a specific country or AS.
In this case, Figure 4 shows which anycast sites are preferred by US clients (R1) in the baseline situation ( Figure 3). We select US clients based on geolocation of the source IP addresses hitting each site (R3). The leftmost (green) box shows all TANGLED sites and the US being selected as catchment source. The middle box (blue) shows that 40.28% of US customers prefer the US-MIA site, 33.11% BR-POA, 16.02% US-IAD, 8.60% UK-LHR, and 1.98% are distributed over other sites. At this point, an operator may wonder why so many clients from the US get routed to a site in Brazil. The operator then selects BR-POA in the middle box, which triggers a list of autonomous systems to be shown in the rightmost box (red). This rightmost box shows the top 5 ASes from the US hitting BR-POA, showing that 38.3% of them are from AS7922 (Comcast), 9.49% from AS20115, 5.81% from AS10796, 5.34% from AS20001, and 4.54% from AS11427 (the last four ASes all belong to Charter Communications). An operator may then wish to remediate this issue and bring the US traffic hitting sites in Brazil back to the US. By using BGP ANYCAST TUNER, operators can experiment with different routing policies, something that is a tedious and time-consuming manual process without the intuitive interface of BGP ANYCAST TUNER on Figure 3 (R2, R6).
While further policy refinement may be needed to optimize the last fraction of clients getting misrouted, this scenario shows the how BGP ANYCAST TUNER can make the task of troubleshooting anycast performance easier for operators.

D. Scenario 2: Moving Traffic
In this scenario, we demonstrate how we accomplish requirements R2, R4, R5, and R6, by showing how an operator can move as much traffic as possible to a specific location.
Here, we mapped a set of policies to deploy specific action. This mapping allows operators to move traffic just moving one slider in Figure 3. For example, if the slider for Paris (CDG) is moved all the way up, as much traffic as possible is attracted to this site. The sliders for the other sites also change and the histogram shows that the majority of clients will now move to the catchment of the site in Paris. Doing this, it fulfils requirements about mapping (R1), control (R2), and an easy-to-use interface (R6).
Notice that the prototype interface also provides a dropdown menu on the left-hand side of the interface in Figure 3. This drop-down menu allows operators to pick a goal, e.g., "Bring traffic to Europe". These goals correspond to preset slider positions and thus to a preset combination of BGP policies among the ones presented. This feature alows a fast traffic engineering deployment, thus fulfilling (R5).

VI. CONCLUSIONS
The use of anycast is becoming increasingly common and important; every year more service providers use it to improve the robustness of their services and the user's quality-ofexperience. In this paper, we introduced a new approach to address these problems. In our approach, we register all routing policies available at each site in an anycast service. We then measure and map site catchments for each routing policies. This mapping can predict the most optimal traffic engineering scenarios to be implemented on anycast sites.
We then prototype a set of tools to assist operators to monitor and manage catchment of their anycast network in realtime. The core of our approach is an intuitive web interface that we call the BGP ANYCAST TUNER. This interface eases the burden of managing an anycast service for operators. We tested and showcased the prototype on a global anycast testbed spanning 12 sites on 5 continents. In two case studies, we show how the BGP ANYCAST TUNER can be used to identify suboptimal routing of traffic (to a remote site) and how traffic on a live anycast service can be shifted to a desired location with a single mouse click.
We publish all tools under an open source license in a public repository 1 . As a next step, we are working with the operations team of SIDN, the operator of the .nl ccTLD, to deploy our approach on their production anycast service.