Networks rely increasingly on Internet Exchange Points (IXPs) and carrier-neutral interconnection facilities that enable dense localised peering connectivity to handle the massive traffic exchange between clients and servers.
IXPs provide layer-2 Ethernet switches to interconnect edge routers of IXP members, while co-location facilities offer physical space for networks to deploy equipment and establish direct cross-connects.
Today, there are over 640 IXPs and more than 2,600 facilities in the world. The largest IXPs have over 700 connected members with hundreds of thousands of peering interconnections among them (see a list of IXPs here).
Given the high concentration of interconnections, the uptime of peering infrastructures is crucial for overlay Internet applications. Facilities and IXPs strive to meet Service Level Agreements (SLAs) of ‘five nines’ (about five minutes of downtime per year) or ‘four nines’ (about 53 minutes of downtime per year). However, outages still occur due to power failures, human error, attacks, and natural disasters.
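The downtime budgets behind these SLA levels follow from simple arithmetic. The short sketch below works them out, assuming an average year of 365.25 days; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, availability in [("five nines", 0.99999), ("four nines", 0.9999)]:
    budget = downtime_minutes_per_year(availability)
    print(f"{label}: {budget:.1f} minutes per year")
```

Five nines leaves a little over five minutes per year, and four nines a little under an hour.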
The geographic agglomeration of peering activity leads to tight interdependencies between IXPs and co-location facilities, while practices such as remote peering extend the reach of local infrastructures to a global scale. As a result, failures can have cascading effects that mask the source of an outage and hinder accurate detection. Moreover, operators often lack monitoring capabilities for infrastructures outside their network perimeter and resort to mailing lists, online forums, or social media to understand the causes of interruptions.
Kepler is a new tool that automates the localisation and monitoring of outages at IXPs and interconnection facilities, by using publicly available connectivity and routing data.
Detecting outages by measuring routing paths
To understand the challenges in detecting and localising infrastructure outages in routing data, consider how a facility outage in the example below (Video 1) is reflected on paths:
Video 1: Example of how a facility outage is reflected on paths
When Facility 2 fails, the traffic between AS1 and AS2 switches to Facility 1 but the AS path remains the same.
The backup path shifts away both from the failed facility and the IXP. Having only AS2 as a vantage point (VP) doesn’t suffice to pinpoint the exact source of the outage. But if we also monitor the paths through AS4 we can observe that the IXP is still available.
To detect changes in the traversed infrastructures, we need to compare the routing states before and during the outage to find the affected hops.
Therefore, Point of Presence (PoP)-level outage detection requires measuring routing paths from diverse vantage points, at high frequency, and at the granularity of infrastructure-level hops.
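The value of diverse vantage points can be illustrated with a small sketch. The paths below are a hypothetical rendering of the example above (hop names and the data layout are assumptions, not Kepler's data model): a hop is a candidate outage source if some vantage point stopped traversing it and no vantage point still sees it during the incident.

```python
def candidate_outage_sources(before_paths, during_paths):
    """Return hops dropped by some vantage point and no longer seen by any."""
    dropped = set()
    still_seen = set()
    for vantage_point, before in before_paths.items():
        during = during_paths.get(vantage_point, [])
        dropped |= set(before) - set(during)
        still_seen |= set(during)
    return dropped - still_seen


# Hypothetical infrastructure-level hops per vantage point.
before = {"AS2": ["Facility 2", "IXP"], "AS4": ["IXP"]}
during = {"AS2": ["Facility 1"], "AS4": ["IXP"]}

# With AS2 alone, the facility and the IXP are both suspects.
print(sorted(candidate_outage_sources({"AS2": before["AS2"]}, {"AS2": during["AS2"]})))
# Adding AS4 shows that the IXP is still reachable.
print(sorted(candidate_outage_sources(before, during)))
```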
Note: we use the term ‘outage’ to refer strictly to the status of connectivity over the affected infrastructures. From an end-user perspective, this could also be a degradation of service with little or theoretically no impact.
Passive BGP measurements satisfy the first two requirements, but BGP encodes only AS-level paths. Traceroute measurements, on the other hand, reveal IP-level hops that can be mapped to IXPs and co-location facilities (see Mapping Peering Interconnections to a Facility, On the Geography of X-Connects and Detecting IXPs in Traceroute Paths Using traIXroute). However, their high measurement overhead is virtually prohibitive for continuous probing.
To tackle these challenges, Kepler deciphers infrastructure-level data encoded in the BGP Communities attribute.
The BGP Communities attribute
BGP Communities are 32-bit numerical values that AS operators use to attach arbitrary information to BGP advertisements. Communities offer flexibility in defining complex and dynamic routing policies.
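By convention (RFC 1997), the 32 bits are written as two 16-bit halves, `ASN:value`, where the first half is the AS number that defines the community and the second half is an operator-chosen value. A minimal sketch of that encoding follows; the helper names are illustrative only.

```python
def encode_community(asn: int, value: int) -> int:
    """Pack an ASN:value pair into a single 32-bit community."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value


def decode_community(raw: int) -> tuple:
    """Split a 32-bit community into its (ASN, value) halves."""
    return raw >> 16, raw & 0xFFFF


def format_community(raw: int) -> str:
    """Render a 32-bit community in the usual ASN:value notation."""
    asn, value = decode_community(raw)
    return f"{asn}:{value}"


raw = encode_community(13030, 51702)
print(raw, format_community(raw))
```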
Between 2010 and 2016, the visible number of ASes using BGP Communities more than doubled, and the number of unique community values tripled to more than 50,000.
A popular application of BGP Communities is to tag the interconnection point where a network received a route advertisement. Figure 2 shows how AS13030 uses the Community values 13030:51702 and 13030:4006 to annotate the facility and the IXP where prefix 18.104.22.168/24 is received by AS20940.
Figure 2: Diagram showing how AS13030 uses the BGP Community values 13030:51702 and 13030:4006 to annotate the facility and the IXP where prefix 22.214.171.124/24 is received by AS20940
The community attribute values are not standardised, therefore their interpretation requires documentation sources. Many operators document their community values in Internet Routing Registry (IRR) records, or on their web pages, but typically not in machine-parsable format.
Kepler combines web mining techniques with Stanford’s Natural Language Processing platform to automatically compile a BGP Communities dictionary that includes 5,284 interpreted Communities by 468 ASes and 48 route servers and covers 288 cities in 72 countries, 172 IXPs, and 103 facilities. While 468 ASes is a small fraction of the total number of ASes, it includes all but two Tier-1 ASes and most major peering ASes.
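Once compiled, such a dictionary can be consulted with a plain lookup. The sketch below is an assumption about its shape, not Kepler's actual format: the two entries reuse the community values from Figure 2 with placeholder names, since the tagged facility and IXP are not named here.

```python
from typing import NamedTuple


class PopTag(NamedTuple):
    kind: str  # "facility", "ixp" or "city"
    name: str


# Hypothetical dictionary entries; the names are placeholders.
COMMUNITY_DICTIONARY = {
    (13030, 51702): PopTag("facility", "facility tagged by 13030:51702"),
    (13030, 4006): PopTag("ixp", "IXP tagged by 13030:4006"),
}


def parse_community(text: str) -> tuple:
    """Turn 'ASN:value' into an (ASN, value) pair of integers."""
    asn, value = text.split(":")
    return int(asn), int(value)


def interpret(communities):
    """Return the location tags for the communities found in the dictionary."""
    tags = []
    for text in communities:
        tag = COMMUNITY_DICTIONARY.get(parse_community(text))
        if tag is not None:
            tags.append(tag)
    return tags


# Communities without a dictionary entry (here 65000:1) are simply skipped.
print(interpret(["13030:51702", "13030:4006", "65000:1"]))
```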
Figure 3: Map showing the location of BGP Communities ASes
As shown in the above map (Figure 3), the majority of the BGP Communities (66%) tag a location in Europe, followed by North America (24.5%). Only 2% of the communities cover locations in Africa and South America. Importantly, the interpreted BGP Communities are present in about 50% of all BGP IPv4 updates.
How Kepler works
The system is initialised by obtaining a stream of BGP data through BGPStream and extracting the BGP updates annotated with interpreted BGP Communities. By continuously monitoring the BGP messages, Kepler establishes a baseline of paths that consistently traverse the same PoPs. It then monitors this baseline of stable paths to capture PoP-level changes, either through explicit BGP withdrawals or through changes in the PoP-tagging community values.
Routing updates are binned in time intervals to correlate path changes. For each interval, Kepler calculates the fraction of paths that continue to traverse the baseline PoP and raises an outage signal if, for a certain PoP, this fraction falls below a threshold.
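The binning step can be sketched as follows. This is a simplified illustration rather than Kepler's implementation: the bin width, the threshold, and the tuple layout of the inputs are all assumed values, and a withdrawal is represented as an observed PoP of `None`.

```python
from collections import defaultdict

BIN_SECONDS = 60  # assumed bin width
THRESHOLD = 0.5   # assumed fraction below which a signal is raised


def outage_signals(baseline, changes, bin_seconds=BIN_SECONDS, threshold=THRESHOLD):
    """baseline: path_id -> PoP; changes: (timestamp, path_id, observed PoP or None)."""
    paths_per_pop = defaultdict(set)
    for path_id, pop in baseline.items():
        paths_per_pop[pop].add(path_id)

    deviated = defaultdict(set)  # (time bin, PoP) -> paths that left the PoP
    for timestamp, path_id, observed_pop in changes:
        pop = baseline.get(path_id)
        if pop is None or observed_pop == pop:
            continue  # path is not in the baseline, or it still crosses its PoP
        deviated[(int(timestamp // bin_seconds), pop)].add(path_id)

    signals = []
    for (time_bin, pop), moved in sorted(deviated.items()):
        remaining = 1 - len(moved) / len(paths_per_pop[pop])
        if remaining < threshold:
            signals.append((time_bin, pop, remaining))
    return signals


baseline = {"p1": "pop-A", "p2": "pop-A", "p3": "pop-A", "p4": "pop-A",
            "p5": "pop-B", "p6": "pop-B"}
changes = [(10, "p1", None), (20, "p2", "pop-C"), (30, "p3", "pop-C"),
           (70, "p5", "pop-C")]
print(outage_signals(baseline, changes))
```

In this toy input, three of the four baseline paths leave pop-A within the first bin, so only that PoP raises a signal.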
Outage signals can have different types of triggering events:
- Link-level signals are caused by changes to an AS link that carries a large number of prefixes, for example, de-peering.
- AS-level signals indicate changes in the availability of a densely connected AS at a specific location, for example, disconnection from an IXP.
- Operator-level signals are used when all ASes under the same organisation (sibling ASes) are affected.
- PoP-level signals involve multiple AS links with disjoint near-end and far-end ASes and organisations. Kepler infers a PoP-level incident if at least three operator-level incidents occur in the same time bin at the same PoP.
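The PoP-level rule lends itself to a short sketch. The input layout is hypothetical, and counting distinct organisations per time bin and PoP is an assumption about how the 'at least three operator-level incidents' are tallied.

```python
from collections import defaultdict

MIN_OPERATOR_INCIDENTS = 3  # threshold stated in the text


def pop_level_incidents(operator_incidents, minimum=MIN_OPERATOR_INCIDENTS):
    """operator_incidents: iterable of (time_bin, pop, organisation) tuples."""
    organisations = defaultdict(set)
    for time_bin, pop, organisation in operator_incidents:
        organisations[(time_bin, pop)].add(organisation)
    return sorted(key for key, members in organisations.items()
                  if len(members) >= minimum)


incidents = [
    (7, "pop-A", "org-1"), (7, "pop-A", "org-2"), (7, "pop-A", "org-3"),
    (7, "pop-B", "org-1"), (8, "pop-B", "org-2"),
]
print(pop_level_incidents(incidents))
```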
Kepler validates the occurrence and duration of outages via periodic traceroute measurements. It selects sources and destinations whose paths were found to cross the affected PoP in RIPE Atlas and CAIDA’s Ark data, and checks whether those paths still traverse it. When over half of the paths return to the baseline, the outage is considered restored.
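The restoration rule is a simple majority check, sketched below under the assumption that each traceroute has already been mapped to a list of PoP names (that mapping step is not shown).

```python
def outage_restored(traceroute_paths, pop):
    """Return True when over half of the probed paths cross the PoP again."""
    if not traceroute_paths:
        return False  # no measurements, so nothing can be concluded
    back_on_baseline = sum(1 for hops in traceroute_paths if pop in hops)
    return back_on_baseline > len(traceroute_paths) / 2


# Hypothetical probe results, expressed as PoP-level hops.
probes = [["pop-A", "pop-X"], ["pop-A"], ["pop-B"], ["pop-A", "pop-Y"]]
print(outage_restored(probes, "pop-A"))
```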
Increasing signal resolution and signal disambiguation
The majority of BGP Communities annotate routes at city-level granularity. To achieve infrastructure-level detection, Kepler uses a co-location map, built from PeeringDB and data on AS websites, of:
- ASes to IXPs,
- ASes to facilities, and
- IXPs to facilities.
The co-location map is used to de-correlate the ‘fate’ of ASes during a city-level outage signal, according to their connectivity at facilities in the same city. The co-location map is also used in disambiguating outage signals.
Figure 4: The physical connectivity between two ASes can involve multiple PoPs, while BGP Communities only identify the near-end PoP (highlighted in green). A failure in any of them will trigger a signal at the near-end facility.
Kepler determines the outage source by correlating the affected ASes with their presence at common facilities. If there are concurrent signals for multiple infrastructures in the same city, the signals are collapsed to a single IXP-level or city-level incident.
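One simple way to picture this correlation is to intersect the facility sets of the affected ASes. The sketch below is a deliberately reduced heuristic, not Kepler's algorithm, and the AS names and co-location map are hypothetical.

```python
def common_facilities(colocation, affected_ases):
    """Return the facilities where every affected AS is present."""
    facility_sets = [colocation.get(asn, set()) for asn in affected_ases]
    if not facility_sets:
        return set()
    return set.intersection(*facility_sets)


# Hypothetical AS-to-facility map for a single city.
colocation = {
    "AS-a": {"Facility 1", "Facility 2"},
    "AS-b": {"Facility 2", "Facility 3"},
    "AS-c": {"Facility 2"},
    "AS-d": {"Facility 1"},
}
print(common_facilities(colocation, ["AS-a", "AS-b", "AS-c"]))
```

If the intersection is empty or contains several facilities, the signal cannot be pinned to a single facility with this check alone.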
How it performs
Kepler can detect outages in facilities that have at least six different members that can be located through BGP Communities, three at the near-end of a link, and three at the far-end.
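This coverage condition can be written as a small predicate. The sketch assumes that the near-end and far-end members must be distinct ASes, which is one reading of 'six different members'; the names are illustrative.

```python
MIN_MEMBERS_PER_SIDE = 3  # three near-end and three far-end members


def is_trackable(near_end_members, far_end_members, minimum=MIN_MEMBERS_PER_SIDE):
    """Check whether a facility has enough locatable members on both link ends."""
    near = set(near_end_members)
    far = set(far_end_members) - near  # assumption: count each AS on one side only
    return len(near) >= minimum and len(far) >= minimum


print(is_trackable(["AS1", "AS2", "AS3"], ["AS4", "AS5", "AS6"]))
print(is_trackable(["AS1", "AS2", "AS3"], ["AS3", "AS4", "AS5"]))
```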
Figure 5: About 50% of IPv4 and 30% of IPv6 paths in 2016 were annotated with at least one location-encoding BGP Community and thus were usable by Kepler. Moreover, Kepler’s Communities consistently tag over 35% of the IPv4 and 28% of the IPv6 AS links across every BGP snapshot.
Of the 1,742 facilities in the co-location map, 1,209 have fewer than six members; another 130 have fewer than six trackable members. Therefore, Kepler can track 403 facilities (23%), meaning that the detected outages are a lower bound. However, Kepler covers 180 out of 183 (98%) facilities with at least 20 members, which are the most prominent interconnection hubs.
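The coverage figures can be checked with a few lines of arithmetic; the variable names are illustrative and the numbers are those quoted above.

```python
total_facilities = 1742
fewer_than_six_members = 1209
fewer_than_six_trackable = 130

# Facilities left after removing the two groups that cannot be tracked.
trackable = total_facilities - fewer_than_six_members - fewer_than_six_trackable
trackable_share = trackable / total_facilities

# Coverage among facilities with at least 20 members.
large_covered, large_total = 180, 183
large_share = large_covered / large_total

print(f"trackable facilities: {trackable} ({trackable_share:.0%})")
print(f"large facilities covered: {large_share:.0%}")
```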
Using Kepler to analyse historical BGP data from 2012 to 2016, we found 159 outages among 87 facilities and 41 IXPs. The number of outages remained relatively stable over time, fluctuating between 10 and 15 outages per half-year, with the exception of the second half of 2012 due to the visible impact of Hurricane Sandy (see this RIPE Labs article for more details).