RIPE NCC and Duke University BGP Experiment
Background on RIS Experiments
As part of its mission, the RIPE NCC works with other members of the Internet technical community to contribute towards the secure and stable operation of the network. The RIPE NCC Routing Information Service (RIS) has a long tradition of supporting Internet researchers.
Since 2002, the RIS has announced a set of beacon prefixes. These prefixes are announced and withdrawn at predictable times, to assist in propagation and flap dampening research. In 2007, the RIS was the second network in the world to start announcing a prefix from a 4-byte AS Number. This helped operators test their 4-byte AS capabilities and allowed us to measure the effectiveness of the transition mechanisms for 4-byte AS Numbers.
The announcements made by RIS are also a vital part of the De-bogon Project, with RIS measuring the visibility of former bogon prefixes. We have also done measurements on traffic attracted after announcing 1/8, work later extended by APNIC.
A research group at Duke University in the United States approached the RIPE NCC for help with experimental research. This group is working on a secure Border Gateway Protocol (BGP) design, in which optional transitive attributes are used to propagate some of the certification information. In order to estimate the feasibility of such a design, they asked the RIPE NCC to announce a route resembling their design from the RIS network.
The design of BGP allows routes to have an attribute that is not recognised by the BGP implementation. If this attribute is set as transitive, it is passed to other routers, without intermediate routers understanding what it actually means. This aspect of the protocol has been key for the transition to 4-byte AS Numbers.
This property of BGP allows some implementations to support a new feature while others do not yet understand the contents of the attribute. In the design proposed by the team from Duke University, upgraded routers add certification information and verify certificates from other routers, without affecting the rest of the Internet.
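The handling of unrecognised attributes described above can be illustrated with a minimal sketch. The flag bit values and the three outcomes follow the BGP specification (RFC 4271); the function name and the returned action strings are hypothetical, chosen only for this illustration, and are not taken from Quagga or any other implementation.

```python
from typing import Tuple

# Attribute flag bits (high-order bits of the attribute flags octet, RFC 4271).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20


def handle_unrecognised_attribute(flags: int) -> Tuple[str, int]:
    """Decide what a router does with an attribute type it does not recognise.

    Returns (action, flags_to_send). The action names are illustrative only.
    """
    if not flags & OPTIONAL:
        # Well-known attributes must be recognised by every implementation,
        # so an unknown one is a protocol error.
        return "error", flags
    if flags & TRANSITIVE:
        # Optional transitive: keep the attribute and pass it on, setting the
        # Partial bit to record that a router on the path did not understand it.
        return "forward", flags | PARTIAL
    # Optional non-transitive: quietly ignored and not passed on.
    return "discard", flags
```

For example, an attribute with flags 0xC0 (optional and transitive) would be forwarded with flags 0xE0, which is how an attribute that intermediate routers do not understand can still cross the Internet.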
As the researchers did not have their own AS Number or address space, they provided the RIPE NCC with a patch to Quagga, the BGP software used by RIS. This allowed us to run the experiment from our infrastructure. We checked the patch for security or protocol problems.
In addition, all announcements were sent through another Quagga instance, so that any protocol violation would be noticed before the announcement went to the Internet.
Issues Encountered During the Experiment
To run the experiment, we installed a custom Quagga instance announcing the route through the RIS collector connected to the Amsterdam Internet Exchange (AMS-IX) and Groningen Internet Exchange (GN-IX). We started the announcement at 08:41 (UTC) on Friday, 27 August 2010. It was originated from AS12654, using the prefix 18.104.22.168/24.
The attribute used by the RIS had never been announced on the Internet before, although it was in accordance with the BGP specification.
The announcement was withdrawn, as planned, at 09:08 (UTC). Shortly afterwards, we discovered that the experiment had had a negative impact on Internet operations that lasted for approximately 30 minutes.
We immediately started an investigation, using input from the affected operators. The investigation indicated that the attribute had triggered a bug in some Cisco router models, which corrupted the announcement and passed it on to other routers. The peers of those routers recognised the corruption and dropped the peering session.
We provided Cisco with all of the information that we had collected and they released a security advisory the same day. The data collected during the announcement was preserved for processing by the researchers from Duke University.
Impact of the Experiment on the Internet
The following is an analysis of the impact of the experiment, using the data provided by the RIS and other RIPE NCC services.
The graph below shows the rate of updates (changes in routing) seen by RIS around the time of the experiment. We can see up to 20 times as many updates, indicating massive instability in the routing system.
Looking at the data for each Remote Route Collector (RRC), we can see that the effects of the experiment were much stronger in some specific locations. The collector in Vienna registered many times more updates per peer than all other collectors. This may indicate that this region had a higher number of affected routers.
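A per-collector comparison of this kind can be sketched as follows. This is only an illustration, not the analysis code used for the graphs: the input format (one `(collector, peer)` pair per observed update) and the collector and peer names in the example are assumptions, and only peers that sent at least one update are counted.

```python
from collections import defaultdict


def updates_per_peer(updates):
    """Return {collector: average number of updates per peer}.

    `updates` is an iterable of (collector, peer) pairs, one per update.
    Peers that sent no updates are not visible here (an assumption of this sketch).
    """
    counts = defaultdict(int)
    peers = defaultdict(set)
    for collector, peer in updates:
        counts[collector] += 1
        peers[collector].add(peer)
    return {c: counts[c] / len(peers[c]) for c in counts}


# Hypothetical sample data: collector and peer names are placeholders.
sample = [("rrcA", "p1"), ("rrcA", "p1"), ("rrcA", "p2"), ("rrcB", "p3")]
per_peer = updates_per_peer(sample)
```

Normalising by the number of peers keeps a collector with many peers from looking more affected simply because it receives more updates in total.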
Knowing that the experiment had a significant effect on the routing system as a whole, we attempted to determine how much of the Internet was actually affected. A first step is to look at prefixes being withdrawn from the Internet. We measured this around the time of the experiment and used three reference sets for comparison.
The graph shows the percentage of prefixes on the Internet that became invisible for a certain period around the time of the experiment. There is a large variance in the dataset, with the values for very short outages in the reference sets affecting between 0.04% and 0.13% of all prefixes on the Internet. Overall though, and especially looking at outages longer than 30 minutes, the values during the experiment were up to three times higher than usual. We conclude that the experiment caused an additional 0.5% of the prefixes to become completely unreachable, and to be unreachable for a longer period than they would have under normal conditions.
Another way of looking at how much of the Internet was impacted is to look at the number of unstable prefixes. For this measurement, we consider a prefix unstable if we see more than 100 updates in a 5-minute period.
The graph shows that under normal conditions, less than 0.1% of the prefixes on the Internet are unstable. The experiment caused this to hit a peak of 1.4%, which amounts to almost 4500 prefixes, about nine times more than usual. For reasons unknown to us, this spike quickly fell to about 0.8%, and stayed there for the remainder of the experiment. About 20 minutes after the experiment, most prefixes returned to normal.
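The instability criterion used for this measurement (more than 100 updates for a prefix in a 5-minute period) can be sketched as below. This is a minimal illustration under stated assumptions: updates are given as `(unix_timestamp, prefix)` pairs, and the 5-minute periods are fixed, non-overlapping windows aligned to the epoch. The original text does not say whether fixed or sliding windows were used, and the function and variable names are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60   # 5-minute period, as in the text
THRESHOLD = 100           # "more than 100 updates"


def unstable_prefixes(updates, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """Return {window_start: set of prefixes with more than `threshold` updates}.

    `updates` is an iterable of (unix_timestamp, prefix) pairs.
    Windows are fixed and aligned to the epoch (an assumption of this sketch).
    """
    counts = Counter()
    for timestamp, prefix in updates:
        window_start = int(timestamp) // window * window
        counts[(window_start, prefix)] += 1

    result = {}
    for (window_start, prefix), n in counts.items():
        if n > threshold:
            result.setdefault(window_start, set()).add(prefix)
    return result


# Documentation prefixes used as placeholders: one with 101 updates, one with 100.
example = [(t, "192.0.2.0/24") for t in range(101)]
example += [(t, "198.51.100.0/24") for t in range(100)]
unstable = unstable_prefixes(example)
```

Dividing the number of unstable prefixes in a window by the total number of prefixes visible at that time would give a percentage of the kind reported above.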
The effect of the experiment on major DNS servers was very limited. The RIPE NCC DNS Monitoring Service (DNSMON) monitors DNS servers for the root and many Top Level Domains (TLDs) from probes worldwide.
None of the root servers were affected. Minor problems, like a few dropped queries for a few of the probes on just one or two of the DNS servers, were observed in about 15 monitored domains, including the .com domain. We believe that users would not have noticed this. For 63% of the domains monitored by DNSMON, no extra queries were lost.
Noticeable problems were seen for the Slovenian and French TLDs, .si and .fr. In the case of .fr, two DNS servers became almost completely unreachable. However, the other five name servers for the TLD showed no effects, so this will not have caused anything more than some additional delays for users.
The experiment caused a massive increase in routing instability, although the strength of the effect varied by location. About three times as many prefixes as usual had periods of invisibility, and those periods lasted longer. In total, up to 1.4% of the prefixes on the Internet were affected by instability around the time of the experiment.
The DNS servers for vital Internet infrastructure, such as the root and TLDs, were not widely affected.
Disruption to the routing system was limited to a relatively small subset of Internet traffic, and the event drew attention to a software bug for which the vendor has now issued a patch. Through a coordinated effort, the situation was quickly recognised and corrected by network operators and those conducting the experiment.
The disruption caused is regrettable, and future experiments conducted with the cooperation of the RIPE NCC will need to meet far stricter internal guidelines, including comprehensive impact assessments, prior announcements with sufficient lead time for Internet operators, and the responsible handling of detected vulnerabilities.