With RPKI deployment growing rapidly and becoming a key part of the operations of the Internet, we have been investing a significant amount of resources in making its infrastructure resilient, secure, and highly available. Recent outages have raised concerns about the maturity level of this relatively new technology and we have taken these incidents very seriously. We are now in the process of making long-term, significant improvements to our technical infrastructure, operational procedures, and software development processes to meet the growing expectations of the technical community worldwide.
RPKI is growing up
When I joined the RIPE NCC back in 2012, RPKI was still in its infancy. The IETF standards were still under development, the number of certificates and configured ROAs was very low, and no large transit providers had deployed Route Origin Validation (ROV). Within the RIPE NCC, RPKI was taken very seriously as something that could become a critical part of the Internet, but at that stage it was still viewed to some extent as an exciting project that a group of talented engineers could sink their teeth into. This meant that scalability, redundancy and resilience were not top priorities, as we were still laying down the system’s foundation. It was also unclear whether RPKI would really take off as a solution to routing security problems.
The situation has changed significantly in the last two years. The number of ROAs and certificates is increasing sharply, big transit providers are deploying ROV, and many different organisations are developing RPKI Validators. And as RPKI adoption has skyrocketed worldwide, concerns have been growing in the technical community about the maturity of this relatively new technology within the five RIRs.
The need for resilience
We have been following these recent developments closely. By the beginning of 2019, it had become obvious that we needed to reassess our technical infrastructure, operational procedures and engineering capacity, as the original design and infrastructure had not taken these emerging requirements into account. Our priority was then set to increase the resilience and security of the RPKI Trust Anchor and Certificate Authority, in order to have a system that can be fully trusted and relied upon by network operators.
All of these activities were encompassed within an RPKI resilience project that we introduced late last year. The goal of this deep review is to assess whether our current architecture and operational procedures are fit for purpose, and to make changes where they are not. The expected outcome of this project is an RPKI Trust Anchor and Certificate Authority that is secure, reliable and highly available, with transparent and trustworthy operational procedures, so that the system follows the industry standards expected of a mission-critical service. The outline of this project was presented by Nathalie Trenaman in the Routing Working Group session during RIPE 79, and the feedback from the community has been incorporated in our project plan.
This work is being done in close collaboration with our Executive Board. A joint APNIC and RIPE NCC board meeting in September 2019 identified the need to work together on this initiative because RPKI is a global service and no single RIR can do it alone. A significant budget allocation for the resilience of RPKI was included in our Activity Plan and Budget 2020. We also aligned our activities in the NRO-ECG (Engineering Coordination Group among the five RIRs) to make sure our planning is done together, and lessons learned are shared among the five RIRs.
Project outline and current progress
An initial assessment of the current situation was performed back in August last year, which identified the key risk areas in our infrastructure and software development procedures, together with mitigation strategies for both. This assessment has been refined over time and mitigation strategies implemented. The current project plan is very broad, including evaluation of the following areas:
- Technical infrastructure (high-availability, scalability, redundancy)
- Security (penetration and vulnerability testing)
- Cryptography (IETF RFC compliance)
- Operational procedures (e.g. key signing)
- Legal framework (CPS and T&C)
The RPKI technical infrastructure has been assessed by our engineering team. We looked into our software architecture, monitoring, scalability, configuration management and redundancy, and also into less tangible but equally important things such as knowledge about RPKI among our engineers and our software development processes. Some risk areas were identified as a result, and improvements have been made (or are in progress).
A complete security assessment of our RPKI Trust Anchor and Certificate Authority was carried out in Q4 2019, together with a third party specialised in cyber-security. Part of this assignment was to increase the security awareness of our engineers, as developing secure software must be part of our DNA. Some vulnerabilities were found and all of them have been either fixed or mitigated. We are currently evaluating the need for routine assessments (yearly, for instance) and also which organisations are best suited to assist us there.
Another very important aspect is trust in the system. RPKI is a way to cryptographically prove the ownership of IP addresses and AS Numbers through a certificate issued by a Certificate Authority, and then use this certificate to sign statements about the origin of these resources. This cryptographic chain is hierarchical, with the Trust Anchors at the top of the tree. This means that the value of such statements is directly related to the trust in the five RIRs’ Trust Anchors. If this trust is broken, the entire system is compromised, in the same way that money issued by a central bank is nothing more than coloured paper if that economy is in deep financial trouble.
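To make the idea of signed origin statements more concrete, here is a minimal sketch of how a relying party might use them for Route Origin Validation, following the valid / invalid / not-found outcomes described in RFC 6811. It assumes the cryptographic validation up to the Trust Anchor has already been done and has produced a list of validated payloads. The `VRP` class, the `validate_origin` function and the sample prefixes and AS Numbers (taken from the documentation ranges) are illustrative names and values only, not part of any RIPE NCC software.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable


@dataclass(frozen=True)
class VRP:
    """Hypothetical container for one validated ROA payload."""
    prefix: str       # prefix the statement covers, e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the origin may announce
    asn: int          # AS Number authorised to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Classify a BGP announcement as 'valid', 'invalid' or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # subnet_of() raises TypeError across IP versions, so check first.
        if route.version != vrp_net.version:
            continue
        if not route.subnet_of(vrp_net):
            continue
        # At least one statement covers this route.
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered but never matched means invalid; not covered at all means not-found.
    return "invalid" if covered else "not-found"


# Sample data using documentation prefixes and AS Numbers (assumed values).
vrps = [VRP("192.0.2.0/24", 24, 64496)]
results = {
    "matching origin": validate_origin("192.0.2.0/24", 64496, vrps),
    "wrong origin": validate_origin("192.0.2.0/24", 64511, vrps),
    "too specific": validate_origin("192.0.2.0/25", 64496, vrps),
    "no statement": validate_origin("198.51.100.0/24", 64496, vrps),
}
```

The outcome of such a check is only as trustworthy as the list of validated payloads it is given, which is why the integrity of the Trust Anchors at the top of the hierarchy matters so much.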
Therefore, having a fully reliable Trust Anchor becomes imperative, as bad actors might attempt to break the system and compromise its integrity. Here we are looking into two main areas: IETF RFC compliance, and transparency and reliability of our operational procedures. We are contacting third-party companies to perform an audit of these areas. Two different companies will be needed, as each area requires a very specific type of knowledge that is unlikely to be found in a single organisation. Both these initiatives are intended to give the technical community confidence that we are doing what we say we are doing, and that our operational procedures are published, transparent, and have appropriate checks and balances.
The last element refers to our legal framework. As RPKI deployment increases and becomes a key part of the global Internet infrastructure, organisations relying on these services want to have a clear understanding of legal liabilities in case things go wrong. This means having a strong legal framework that provides clarity about the boundaries of these liabilities, and that also protects the RIPE NCC from events outside our control.
Earlier this year, a series of incidents affected our RPKI infrastructure. We published post-mortems providing a full account of each one of these incidents, and Nathalie Trenaman has published a RIPE Labs article that goes deeper into the technical details and the lessons we learned from them.
We are taking these incidents very seriously and fully acknowledge the impact they have on the technical community. Importantly, these incidents highlighted some wrong assumptions that we made in our assessments last year and have forced us to take a step back and reconsider where we are. We have since started an internal task force that will assess our technical infrastructure once more, this time involving not just RPKI, but related services such as the Registry, SSO and the RIPE Database. The root cause of each one of the incidents has been, or is in the process of being, fixed, and we are looking deeper at our overall infrastructure to see which immediate actions need to be taken to mitigate the risk of additional outages. The result of this task force will provide input to the RPKI resilience project.
We should never waste a good crisis. While these incidents were painful and exposed a suboptimal setup in our existing infrastructure, they have also provided an opportunity to learn from our mistakes and to take the right steps towards our goal. And we are fully embracing this opportunity. Members from many of our engineering teams are involved in this task force, and we are not only learning from our mistakes but also looking at other successful mission-critical RIPE NCC services like the RIPE Database and K-root, both of which have an excellent track record in terms of availability and resilience, to learn from our experiences there too. Additionally, we are hiring extra consultants with previous experience building our core RPKI infrastructure, and we’re also investigating which improvements can be made in our software development processes (DevOps and QA for instance). A fresh pair of eyes will help us to challenge some of our assumptions and see things from a different perspective.
A solid future ahead
RPKI is growing up. And we want it to continue like this. Together with all the different stakeholders, the RIPE NCC will play its role in helping with its successful worldwide adoption. We will do that by providing our technical services, our expertise and training to serve the community in the best way we can. We believe the key to success is collaboration, and we want to continue doing our work aligned with this value.
As one of the RPKI Trust Anchors, we commit to doing our utmost to provide a secure and reliable service to the technical community. Over the next couple of years, the RIPE NCC will continue to invest the necessary resources into providing a world-class service that can be fully relied upon. And we are going to do this in close collaboration with you and the wider Internet community.