The weekend after the recent RIPE 65 Meeting in Amsterdam, we experienced a network outage that affected a number of services. Please find below a detailed report and analysis.
On Saturday, 29 September at 12:15, our monitoring system alerted us to network issues. Within minutes, our critical incident procedures were put in place and the 24/7 teams swung into action.
Because we were unable to access our local network, initial announcements about the outages were made on social networking sites. Details were then added to the main RIPE NCC website and emails sent to various mailing lists shortly afterwards when local access was restored. Engineers worked throughout the night at the RIPE NCC's offices, remotely, and at our colocations to solve the issues.
The following services were affected during the incident. All times are UTC+2:
- Email: Intermittent access from 12:15 on 29 September until 12:00 on 30 September.
- www.ripe.net: Some users may have seen intermittent problems from 12:15 on 29 September until 02:00 on 30 September.
- RIPE Labs: Some users may have seen intermittent problems from 12:15 on 29 September until 02:00 on 30 September.
- DNSMON/TTM: First outage seen at 12:30 on 29 September. Running normally as of 10:00 on 30 September.
- RIPEstat: Some users may have seen intermittent problems from 12:15 on 29 September until 02:00 on 30 September. The "visibility" plugin had intermittent problems until 17:00 on 1 October.
- RIPE Database: Unavailable from 12:15 on 29 September until 20:20, when the network stabilised.
- RIPE NCC Access, the LIR Portal and Certification/RPKI: Login to RIPE NCC Access, and therefore to all dependent services including the LIR Portal and Certification/RPKI, was unavailable from 20:10 on 29 September until 11:44 on 30 September.
The following services were not affected:
- DNS (authoritative nameservers)
- RIPE Atlas
After thorough investigation and analysis, we concluded that the outages were caused by a misconfigured load balancer.
We have four load balancers in our network, spread across three colocation sites. One of these colocations hosts two load balancers that run in active/standby mode, in two groups, for resilience. We discovered that a switch port connecting one of the standby load balancers had been misconfigured.
The port had been configured as an edge port instead of a point-to-point port. We run Rapid Spanning Tree, and an edge port is assumed to connect to an end host rather than to another switch, so Spanning Tree will never block that port: it will keep forwarding packets even when doing so creates a loop.
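To illustrate the distinction, a switch configuration of this kind might look as follows. The syntax is Cisco IOS-style and purely illustrative; the interface names are hypothetical, and the exact commands vary by vendor and software version:

```
! Correct for an inter-switch link: the port participates fully in
! Rapid Spanning Tree and can be blocked to break a loop.
interface GigabitEthernet0/1
 spanning-tree link-type point-to-point

! The kind of misconfiguration described above: an edge ("portfast")
! port is assumed to face an end host, so Spanning Tree never blocks
! it - it transitions straight to forwarding.
interface GigabitEthernet0/2
 spanning-tree portfast
```

The trade-off is that edge ports come up immediately (useful for servers), but if one is accidentally cabled into a redundant path, Spanning Tree has no way to break the resulting loop.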
Our analysis showed that a network storm was triggered when the active load balancer failed over to the standby: because the misconfigured port was never blocked by Spanning Tree, the failover put a forwarding loop on the network. This, in turn, flooded the network, which caused the network-attached storage (NAS) to become unreachable for many of our servers. We broke the loop by removing the standby load balancer and forcing traffic onto the other load balancer in the group, after which the NAS became available again.
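The mechanism behind such a storm can be sketched in a few lines. The model below is a simplification (the topology, names and hop cap are ours, not a representation of the RIPE NCC network): switches flood a broadcast frame out of every port except the one it arrived on. In a loop-free (tree) topology the flood dies out at the leaves; with a loop, Ethernet frames carry no TTL, so copies circulate indefinitely and the count is bounded only by the artificial hop cap:

```python
from collections import deque

def broadcast_copies(links, start, max_hops):
    """Count frame copies generated when flooding a broadcast.

    links: adjacency dict of switch -> list of neighbouring switches.
    Frames are flooded out of all ports except the ingress port,
    up to max_hops (an artificial cap; real frames have no TTL).
    """
    frames = deque([(start, None, 0)])  # (switch, came_from, hops)
    copies = 0
    while frames:
        node, came_from, hops = frames.popleft()
        if hops == max_hops:
            continue
        for neighbour in links[node]:
            if neighbour != came_from:  # flood all ports but ingress
                copies += 1
                frames.append((neighbour, node, hops + 1))
    return copies

# Loop-free tree: what Spanning Tree normally guarantees.
tree = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
# Ring: the redundant link stayed forwarding, closing a loop.
loop = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

print(broadcast_copies(tree, "A", 10))  # → 2: flood stops at the leaves
print(broadcast_copies(loop, "A", 10))  # → 20: frames keep circulating
```

Raising `max_hops` leaves the tree's count unchanged but grows the loop's count without bound; with real traffic levels and multiple loops this saturates links within seconds, which is how the NAS became unreachable.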
We could then begin rebooting the servers that had lost their NAS disks. As the RIPE NCC operates around 100 servers, it took some time to get them all running again, which delayed the restoration of many of our services.
The misconfigured port had been present since the load balancer was installed at the colocation two months earlier.
We have, as yet, been unable to determine what triggered the initial failover. Logs have been sent to the vendor for analysis and their investigation is ongoing.
The switch has now been reconfigured and the affected load balancer has been added back into the network.
Furthermore, we are working on the implementation of an out-of-band management network. This will enable us to fix issues remotely without the need for engineers to travel between our three colocations, thereby significantly reducing our response times.
We are also fine-tuning and extending our incident response procedures, to improve both our incident handling and our internal and external communications.