When withdrawing an IP prefix from the Internet, an origin network sends BGP withdraw messages, which are expected to propagate to all BGP routers that hold an entry for that IP prefix in their routing table. Yet network operators occasionally report issues where routers maintain routes to IP prefixes withdrawn by their origin network.
We refer to this problem as BGP zombies; such routes are also known as stuck routes or ghost routes. We've analysed the appearance and behaviour of these zombies using RIPE Routing Information Service (RIS) BGP Beacons. These beacons announce and withdraw a particular set of IP prefixes at predetermined time intervals. RIS beacons announce a prefix every four hours and withdraw it two hours after each announcement. The RIPE NCC maintains a list of current RIS routing beacons.
A BGP zombie is an active routing table entry for a prefix that has been withdrawn by its origin network and is hence no longer reachable. Hereafter we also use the terms zombie ASes and zombie peers for ASes and BGP peers whose routers hold BGP zombies. We refer to all zombies that correspond to the same prefix and appear during the same two-hour time slot as a zombie outbreak; the outbreak size is the number of zombie ASes.
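The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the input layout (one tuple per observed zombie) and the sample values are assumptions made for this example, not part of our measurement code. The prefixes and AS numbers come from ranges reserved for documentation.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_outbreaks(zombies):
    """Group zombie observations into outbreaks.

    `zombies` is an iterable of (prefix, withdrawal_time, asn) tuples, where
    withdrawal_time marks the start of the two-hour slot in which the beacon
    is withdrawn. Returns {(prefix, withdrawal_time): set of zombie ASNs}.
    """
    outbreaks = defaultdict(set)
    for prefix, withdrawal_time, asn in zombies:
        outbreaks[(prefix, withdrawal_time)].add(asn)
    return dict(outbreaks)


def outbreak_sizes(outbreaks):
    """Outbreak size = number of distinct zombie ASes in the outbreak."""
    return {key: len(ases) for key, ases in outbreaks.items()}


# Hypothetical observations (documentation prefixes and AS numbers).
slot_a = datetime(2017, 9, 9, 18, 0, tzinfo=timezone.utc)
slot_b = datetime(2017, 9, 9, 22, 0, tzinfo=timezone.utc)
observations = [
    ("192.0.2.0/24", slot_a, 64496),
    ("192.0.2.0/24", slot_a, 64497),
    ("192.0.2.0/24", slot_a, 64496),  # second peer in the same AS
    ("192.0.2.0/24", slot_b, 64496),
    ("198.51.100.0/24", slot_a, 64498),
]
sizes = outbreak_sizes(group_outbreaks(observations))
```

Using a set per outbreak means that several zombie peers in the same AS count once towards the outbreak size.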
Across the 27 monitored beacon prefixes, we usually observe more than one zombie outbreak per day, but their presence is highly volatile. We also discovered that BGP zombies can propagate to other ASes. For example, zombies in a transit network inevitably affect its customer networks.
For beacon prefixes, the detection of zombies in RIS peers is straightforward. We keep track of the visibility of beacons for all RIS peers and report a zombie for each routing table entry that is still active 1.5 hours after the prefix was withdrawn. The 1.5-hour delay is set empirically so that we do not count late withdrawals caused by BGP convergence, route flap damping, or stale routes.
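As a rough illustration of this rule, the sketch below replays the updates seen from each peer for a single beacon and flags the peers that still hold an active route 1.5 hours after the withdrawal. The data layout ("A" for announcement, "W" for withdrawal), the function name and the peer labels are assumptions made for this example; it is not the code used in our study.

```python
from datetime import datetime, timedelta, timezone

# Delay after the beacon withdrawal at which routes are checked.
ZOMBIE_DELAY = timedelta(hours=1, minutes=30)


def find_zombie_peers(withdrawal_time, peer_updates, delay=ZOMBIE_DELAY):
    """Return the peers that still have an active route after `delay`.

    `peer_updates` maps a peer identifier to a list of (timestamp, kind)
    tuples for one beacon prefix, where kind is "A" (announcement) or
    "W" (withdrawal).
    """
    check_time = withdrawal_time + delay
    zombies = set()
    for peer, updates in peer_updates.items():
        state = None  # last update seen at or before check_time
        for timestamp, kind in sorted(updates):
            if timestamp > check_time:
                break
            state = kind
        if state == "A":
            zombies.add(peer)
    return zombies


def utc(hour, minute=0, second=0):
    """Helper for the hypothetical example below (9 September 2017)."""
    return datetime(2017, 9, 9, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical updates for a beacon announced at 16:00 and withdrawn at 18:00.
updates = {
    "peer_ok": [(utc(16, 0, 5), "A"), (utc(18, 0, 10), "W")],
    "peer_late": [(utc(16, 0, 7), "A"), (utc(18, 19), "W")],
    "peer_stuck": [(utc(16, 0, 6), "A")],
}
zombie_peers = find_zombie_peers(utc(18), updates)
```

In this example only peer_stuck is reported. The peer that withdraws 19 minutes late is not flagged, because its withdrawal arrives before the 1.5-hour check.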
Figure 1: Visibility for beacon 18.104.22.168/24 from all RIS peers on 9 and 10 September 2017
Figure 1 illustrates the visibility for beacon 22.214.171.124/24 from all RIS peers on 9 and 10 September 2017. Peers are listed on the y axis and time is represented on the x axis. From 12:00 to 18:00 UTC, all peers behave as expected. At 12:00, RIS peers announce the availability of the beacon prefix and maintain an active route to the prefix until 14:00. One peer from RIS route collector rrc19 withdraws the prefix a bit late (14:19), but this is not considered a zombie because the prefix is withdrawn reasonably quickly. However, at 18:00 three peers do not withdraw the beacon, although the prefix is no longer reachable at that time. This zombie outbreak ends at 20:00 when the beacon is re-announced. A similar zombie outbreak appears at 22:00 for the same three peers.
During the first zombie outbreak (18:00-20:00), we found other zombies at the same three peers for another beacon (126.96.36.199/24), which is not shown in this graph. The 25 other beacons are withdrawn as expected at that time. For the second outbreak (22:00-00:00), we found no other zombies. These observations give an early glimpse of the relationship between outbreaks for different prefixes: zombie outbreaks for different beacons can be related but are usually independent.
From the zombies observed at RIS peers over six months of data, we compute the zombie emergence rate, that is, the number of times zombies are reported for each peer and each beacon, normalised by the number of times the beacon was withdrawn during our measurement study.
This metric corresponds to the likelihood that a <peer, beacon> pair produces a zombie. Figure 2 below depicts the distribution of the values obtained with our dataset. Only 6.5% of the <peer, beacon> pairs never produce a zombie during our entire measurement period. However, zombies are generally uncommon for RIS peers: 50% of the <peer, beacon> pairs have zombie entries for less than 1.3% of the beacon withdraws (the average value is 1.8% for IPv4 prefixes and 2.7% for IPv6).
Figure 2: Distribution of the zombie emergence rate of <peer, beacon> pairs, for IPv4 and IPv6
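The zombie emergence rate defined above is a simple ratio, which the following sketch computes for a toy dataset. The function name, the input layout and all numbers are hypothetical and only illustrate the normalisation. Pairs that never produce a zombie get a rate of zero.

```python
from collections import Counter
from statistics import median


def emergence_rates(pairs, zombie_events, withdrawal_counts):
    """Compute the zombie emergence rate of each (peer, beacon) pair.

    `pairs` lists every monitored (peer, beacon) pair, `zombie_events` holds
    one (peer, beacon) entry per withdrawal that left a zombie, and
    `withdrawal_counts` maps each beacon to its number of withdrawals.
    """
    zombie_counts = Counter(zombie_events)
    return {
        (peer, beacon): zombie_counts[(peer, beacon)] / withdrawal_counts[beacon]
        for peer, beacon in pairs
    }


# Hypothetical toy data: two peers, one beacon withdrawn 1000 times.
pairs = [("peer1", "beaconA"), ("peer2", "beaconA")]
events = [("peer1", "beaconA")] * 13
rates = emergence_rates(pairs, events, {"beaconA": 1000})
median_rate = median(rates.values())
```

Passing the full list of monitored pairs, rather than deriving it from the zombie events, keeps the pairs without zombies in the distribution.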
To understand the relationship between zombies detected across the various beacons, we compute the number of outbreaks that happened simultaneously but for different beacons (see Figure 3 below).
Figure 3: Number of outbreaks that happened simultaneously for different beacons, for IPv4 and IPv6
For 23% of the instances where we detect IPv4 zombies (35% for IPv6), we found zombies for only a single beacon. For IPv4 we also found multiple instances (25%) where we detect simultaneous zombie outbreaks for all monitored beacons. The rest of the distribution is uniform, meaning that we observe little correlation between outbreaks on different beacons. These observations reveal that outbreaks usually emerge independently across different prefixes, yet in certain cases some peers miss the withdraws for all monitored beacons altogether.
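The count of simultaneous outbreaks can be sketched as follows: for each time slot with at least one outbreak, count the distinct beacons affected, then normalise the resulting histogram. The function name and input layout are assumptions made for this illustration. In practice it would be run separately for IPv4 and IPv6 beacons.

```python
from collections import Counter, defaultdict


def simultaneous_outbreak_distribution(outbreaks):
    """Distribution of the number of beacons hit in the same time slot.

    `outbreaks` is an iterable of (beacon, slot) tuples, one per outbreak.
    Returns {k: fraction of slots in which exactly k beacons had an outbreak}.
    """
    beacons_per_slot = defaultdict(set)
    for beacon, slot in outbreaks:
        beacons_per_slot[slot].add(beacon)
    histogram = Counter(len(beacons) for beacons in beacons_per_slot.values())
    total = sum(histogram.values())
    return {k: count / total for k, count in sorted(histogram.items())}


# Hypothetical outbreaks: slots s1 and s4 affect two beacons, s2 and s3 one.
example = [
    ("beaconA", "s1"), ("beaconB", "s1"),
    ("beaconA", "s2"),
    ("beaconC", "s3"),
    ("beaconA", "s4"), ("beaconB", "s4"),
]
distribution = simultaneous_outbreak_distribution(example)
```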
Using a graph-based machine learning method, we also infer zombie ASes that are not directly peering with RIS (see our research report for more details on this machine learning technique). By manually inspecting the results, we noticed certain patterns in the outbreaks. Our hypothesis is that the number of zombie ASes is usually related to the transit networks affected by zombies.
Figure 4 illustrates two outbreaks where we detected zombies in large transit networks. On the left-hand side we can see an outbreak where the zombie AS with the highest hegemony score is Init7 and all ASes downstream are also affected by the outbreak. The graph on the right-hand side shows another outbreak where we inferred a zombie in a Tier-1 network, Level(3). As Level(3)'s customer cone is larger, the outbreak propagates more widely. This results in about half of the RIS peers having zombie routes through Level(3).
Figure 4: Zombie outbreak in two large transit networks (on the left Init7, on the right Level(3)). The blue dot indicates the origin AS, green dots show normal peers and the red triangles indicate zombie peers.
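To give an intuition of why a zombie in a network with a larger customer cone can reach more ASes, the sketch below collects every AS downstream of a given transit AS in a provider-to-customer graph. This is a deliberate simplification and not the graph-based inference method mentioned above: it assumes that every downstream AS actually selects the route through the zombie AS, which real routing policies do not guarantee. The topology and AS numbers are hypothetical.

```python
from collections import deque


def downstream_ases(customers, start_as):
    """Return all ASes reachable from `start_as` via provider-to-customer links.

    `customers` maps a provider ASN to an iterable of its customer ASNs.
    """
    seen = set()
    queue = deque([start_as])
    while queue:
        asn = queue.popleft()
        for customer in customers.get(asn, ()):
            if customer != start_as and customer not in seen:
                seen.add(customer)
                queue.append(customer)
    return seen


# Hypothetical topology using AS numbers reserved for documentation.
topology = {
    64500: [64501, 64502],  # large transit network
    64501: [64503],
    64510: [64511],         # smaller transit network
}
large_cone = downstream_ases(topology, 64500)
small_cone = downstream_ases(topology, 64510)
```

In this toy topology a zombie in AS64500 could reach three downstream ASes, whereas one in AS64510 could reach only one.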
Zombie root cause
While detecting BGP zombies with RIS beacons is straightforward, we faced significant challenges in pinpointing the root cause of observed zombies. Given the erratic patterns observed in our study and the investigations conducted with network operators, we believe zombies are mainly the result of software bugs in routers, BGP optimisers, and route reflectors. The systematic identification of zombie root causes on the Internet has proven to be very challenging, even for operators, as it requires timely and detailed information from a complex and occasionally misbehaving infrastructure.
If the fraction of zombie routes in the wild is of the same order of magnitude as what we see for RIS beacons, this can have interesting consequences that would merit further research. For instance, in the case of large route leaks, zombie routes could add significantly to the complexity of mitigating these incidents.
Zombies in the wild
Our study focuses only on RIS beacons, as we know their withdraw times a priori. However, these results cannot be easily extrapolated to any routed prefix. We could infer zombies in cases where a prefix is withdrawn within a short period of time by most, but not all, route collector peers, but it remains difficult to distinguish this from a routing configuration change intended to limit the visibility of a prefix. Furthermore, in the case of large zombie outbreaks, which are of prime interest, one may confuse the few observed withdraws with a local routing issue. We plan to address these challenges in future work. A rigorous method for detecting zombies in the wild would allow us to estimate the overall impact of zombies on routing tables and to provide network operators with tools to effectively identify zombies.
For more details, please refer to the full research paper: R. Fontugne, E. Bautista, C. Petrie, Y. Nomura, P. Abry, P. Goncalves, K. Fukuda, E. Aben. "BGP Zombies: an Analysis of Beacons Stuck Routes", Proceedings of PAM'19. Puerto Varas, Chile. March 2019.