Philipp Richter

When the Internet Goes Down: Tracking Edge Outages at Scale


Uninterrupted availability of the Internet has become increasingly critical, not just for end users but also for service providers who need to meet Service Level Agreements (SLAs). Yet outages affecting end-user connectivity are widespread, whether they are caused by accidental fibre cuts, natural disasters, cyber attacks, or intentional Internet shutdowns by governments for political reasons.


Large-scale outages are relatively easy to detect and are often reported on news sites and social media. More difficult to detect are the local, small-scale, more scattered outages (such as those caused by a power failure affecting a small neighbourhood). However, as our work shows, thousands of these small outages occur every day across various locations.

These outages are typically not visible when studying the Internet’s control plane (the routing information exchanged between networks). Thus, detecting these potentially small events is similar to finding a needle in a haystack: only broad measurements in time and topological scope can find them.

In our new paper, Advancing the Art of Internet Edge Outage Detection (IMC 2018), my colleagues at MIT, University of Maryland, Akamai and I present a new approach to passively detect Internet edge outages: by leveraging access logs of a major Content Distribution Network (CDN) and tracking anomalies in the access patterns of end users.

The never-sleeping Internet: baseline activity

During our study, we analysed logs collected from a major CDN with more than 200,000 servers in 130 countries and 1,700 networks, serving trillions of requests from users around the globe on a daily basis.

Figure 1: The logs collected for the study contained hourly counts of requests to the CDN from individual IPv4 /24 address blocks

Interestingly, for many address blocks, we saw requests to the CDN every single hour over long time periods — 24/7 activity. The figure below shows such an example, where we never saw fewer than 130 active IPv4 addresses in the /24 prefix in any hour during this month.

Figure 2: Graph showing the number of hourly active IPv4 addresses for a sample /24 address block over the course of a month. An active IPv4 address here means that it contacted the CDN at least once in that hour
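
To make the input signal concrete, here is a minimal sketch of how hourly active-address counts per /24 block could be derived from raw request records. The record format (a Unix timestamp plus a client IPv4 address) and the function name are assumptions for illustration; the article does not describe the CDN's actual log schema or pipeline.

```python
from collections import defaultdict
import ipaddress


def hourly_active_addresses(records):
    """Count distinct active IPv4 addresses per (/24 block, hour).

    `records` is an iterable of (unix_timestamp, ipv4_string) pairs.
    This is a hypothetical, simplified stand-in for a CDN access log.
    """
    seen = defaultdict(set)
    for ts, ip in records:
        hour_start = ts - ts % 3600  # truncate to the start of the hour
        # strict=False lets us pass a host address and get its /24 network
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        seen[(str(block), hour_start)].add(ip)
    # An address is "active" in an hour if it appears at least once
    return {key: len(ips) for key, ips in seen.items()}
```

Repeated requests from the same address within one hour count only once, which matches the definition of an active address in the caption above.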

Upon inspecting some of the requests, we found that — besides a huge number of user-triggered requests for content (for example, web and video content) — there was a sizeable number of requests that are not human-triggered.

While surprising at first, this can be explained by the increasing number of always-on devices in our homes: smartphone or Smart TV apps and widgets periodically update information (weather, stock market, calendar) and a variety of software installations issue frequent update requests. Thus, having a number of devices connected to your WiFi will cause the steady request pattern in the CDN logs. We term this observation baseline activity.

Detecting disruptions

Baseline activity is an ideal signal for outage detection: it is (i) largely independent of human-triggered activity, but is (ii) dependent on a functioning network.

We next developed a technique that detects disruptions in baseline activity on an hourly basis, that is, instances where the constant CDN contact from devices in an address block is temporarily absent or significantly reduced. Because the logs contain hourly counts, we can only detect disruptions that last at least one hour.

We used a sliding window to calculate a baseline value for each block and hour, and flagged significant dips below that baseline as disruptions. This technique allows us to track millions of address blocks in more than 12,500 networks around the globe.

Figure 3: A sliding window was used to detect significant disruptions (or dips) in activity across more than 12,000 networks
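
The general idea of sliding-window dip detection can be sketched as follows. This is an illustration under stated assumptions, not the calibrated method from the paper: the window length (168 hours), the minimum baseline (40 addresses), and the dip threshold (half of the baseline) are placeholder values chosen for the example, and taking the window minimum as the baseline is one plausible choice among several.

```python
from collections import deque


def detect_disruptions(counts, window=168, min_baseline=40, alpha=0.5):
    """Return the indices of hours flagged as disrupted for one /24 block.

    counts       -- hourly active-address counts, oldest first
    window       -- number of past hours used for the baseline (assumed)
    min_baseline -- skip blocks with too little steady activity (assumed)
    alpha        -- an hour is a dip if count < alpha * baseline (assumed)
    """
    history = deque(maxlen=window)  # most recent non-disrupted hours
    disrupted = []
    for hour, count in enumerate(counts):
        if len(history) == window:
            baseline = min(history)  # lowest activity seen in the window
            if baseline >= min_baseline and count < alpha * baseline:
                disrupted.append(hour)
                continue  # keep disrupted hours out of the baseline
        history.append(count)
    return disrupted
```

Using the window minimum makes the baseline insensitive to daily peaks in human-triggered traffic, so only drops below the usual overnight floor are flagged. Excluding flagged hours from the history keeps an ongoing disruption from lowering its own baseline; a production system would also need a rule for blocks whose activity level changes permanently.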

We refer to our paper for more details on the detection technique, calibration, cross-validation, and our global coverage.

A global view of disruptions

We ran our disruption detection mechanism over CDN logs that spanned one entire year (Figure 4) from which we were able to make a number of interesting observations.

 

Figure 4: Partial and entirely disrupted /24 address blocks from more than 12,000 networks detected hourly from March 2017 to March 2018

Micro-disruptions

Globally, there are always disruptions and edge outages. About 0.2% of the monitored address space is disrupted in any given hour! Many of these disruptions are small in scale, affecting end users in specific ISPs or geographic regions. They can be caused by a variety of factors ranging from fibre cuts and power outages, to failures in individual ISP networks.

Major external events

Major external events, most notably Hurricane Irma in September 2017, often cause large-scale Internet outages in multiple providers. Such events and their representation in our dataset allow for assessing the reliability and resilience of Internet access in the face of natural disasters. But outages due to natural disasters are only the tip of the iceberg of what we found; we also observed large-scale outages in individual networks, which can be caused by major misconfigurations, Denial of Service attacks, or even be the result of intended Internet shutdowns for political reasons.

Scheduled maintenance

Another intriguing pattern was a weekly recurring ‘jump’ in detected disruptions, excluding the week between Christmas and New Year’s Eve.

We determined the geographic location and local time of disruptions and found that disruptions are more likely to occur on Tuesdays, Wednesdays, and Thursdays shortly after midnight. These times correspond precisely with the scheduled maintenance intervals of major ISPs.

We found that for many ISPs, the majority of all disruptions start and end within their advertised maintenance interval. This is an important observation when it comes to attributing Internet edge outages to their actual causes. A service outage during scheduled maintenance can have different significance with respect to SLAs and regulatory reporting than an outage caused by an unplanned event, such as a natural disaster.
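
As an illustration of this timing check, the sketch below tests whether a disruption starts and ends inside a maintenance window in local time. It assumes the window definition used in the comparison later in this article (Monday to Friday, midnight to 06:00 local time). The function name and the fixed UTC offset parameter are hypothetical simplifications; a real analysis would derive the time zone from the geolocation of each address block.

```python
from datetime import datetime, timedelta, timezone


def in_maintenance_window(start_utc, end_utc, utc_offset_hours):
    """True if a disruption lies entirely within a weekday 00:00-06:00
    local-time window. Both datetimes must be timezone-aware.

    utc_offset_hours is an assumed fixed offset for the block's location.
    """
    local = timezone(timedelta(hours=utc_offset_hours))
    start = start_utc.astimezone(local)
    end = end_utc.astimezone(local)
    # Window closes at 06:00 on the local day the disruption started
    window_end = start.replace(hour=6, minute=0, second=0, microsecond=0)
    is_weekday = start.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and start < window_end and start <= end <= window_end
```

Converting to local time first matters: a disruption at 02:00 UTC is an evening event on the previous day for a block on the US East Coast, not an early-morning one.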

We refer to our paper for a more detailed study of network and timing aspects of the identified disruptions.

Scheduled maintenance more likely cause of disruptions than natural disasters

To illustrate the effect that just a single natural disaster — Hurricane Irma — and scheduled maintenance have on the overall number of detected disruptions, we selected the seven largest ISPs offering broadband Internet in the United States. Of all /24 address blocks belonging to these ISPs, we wanted to know how many were only disrupted during Hurricane Irma or during scheduled maintenance windows (Monday to Friday, midnight to 06:00).

ISP              % /24s only disrupted during    % /24s only disrupted during
                 maintenance window              Hurricane Irma
ISP A (Cable)    67%                             11%
ISP B (Cable)    54%                             1%
ISP C (Cable)    75%                             2%
ISP D (DSL)      29%                             23%
ISP E (DSL)      60%                             1%
ISP F (DSL)      71%                             0%
ISP G (DSL)      62%                             3%

 

Table 1: Share of disrupted /24 address blocks, for each of the seven largest ISPs offering broadband Internet in the United States, whose disruptions were detected only within a maintenance window (Monday to Friday from midnight to 06:00 local time) or only during the week of Hurricane Irma (9-15 September 2017).

For all but one of the ISPs, most disrupted address blocks (up to 75%!) were only affected during the scheduled maintenance window. This observation has important ramifications when it comes to identifying root causes of outages and their eventual impact for SLAs and policymaking.

Further, two of the ISPs were severely affected by Hurricane Irma. Looking at ISP A, we can — just by leveraging timing of disruption events — provide likely explanations for almost 80% of all disrupted address blocks!
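
The "almost 80%" figure follows directly from Table 1: the two categories are mutually exclusive (each counts blocks disrupted only in that period), so their shares can be added. A short sketch of that arithmetic, using the table's values (the variable names are ours):

```python
# Percentages from Table 1: share of disrupted /24 blocks per ISP
maintenance_only = {"ISP A": 67, "ISP B": 54, "ISP C": 75, "ISP D": 29,
                    "ISP E": 60, "ISP F": 71, "ISP G": 62}
irma_only = {"ISP A": 11, "ISP B": 1, "ISP C": 2, "ISP D": 23,
             "ISP E": 1, "ISP F": 0, "ISP G": 3}

# Blocks disrupted only during maintenance and blocks disrupted only
# during Irma are disjoint sets, so the percentages add up.
explained_by_timing = {
    isp: maintenance_only[isp] + irma_only[isp] for isp in maintenance_only
}

for isp, share in explained_by_timing.items():
    print(f"{isp}: {share}% of disrupted blocks explained by timing alone")
```

For ISP A this gives 67% + 11% = 78%.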

Looking ahead

With our detection mechanism in hand, we plan to further explore the root causes of disruptions and Internet outages: whether and to what extent they correlate with external events (power outages, weather, disasters and censorship), how many users they affect, and how long recovery takes. Stay tuned!

 


About the author

Philipp Richter Based in Cambridge, MA, USA

I am a postdoctoral researcher in the Advanced Network Architecture group at MIT and a research collaborator with Akamai Technologies. My current research centers around developing data-driven approaches to measure Internet reliability, resilience, and security. Prior to joining MIT, I earned my PhD (proof picture) from TU Berlin, advised by Anja Feldmann. In the summer of 2015, I was a research intern in the Custom Analytics Group at Akamai in Cambridge, Massachusetts. In the summers of 2013 and 2014, I was a visiting researcher at ICSI in Berkeley, California. I am broadly interested in methods to mine and understand data at scale with an emphasis on measurements assessing structure, performance, and security of the Internet. In my PhD work I explored the phenomenon of IPv4 address space exhaustion and its consequences for the Internet and its stakeholders. My research received a Best Paper Award at ACM IMC 2016 and an IRTF Applied Networking Research Prize in 2017, and was selected as “Best of CCR” in 2015.
