The RIPE NCC provides mission critical services to the Internet community which require a solid technical foundation. This article explains how we plan to use cloud infrastructure as a means to that end. We outline some of the risks that must be accounted for, and describe the target architectures for our RPKI repositories and the RIPE Database to show how this migration will work.
Why Go to the Cloud? What are the Risks?
To provide the availability, security and resiliency that services like our RPKI repositories and the RIPE Database demand, we have two options. Either we scale up our existing infrastructure considerably, or we take advantage of the extensive and mature infrastructures offered by cloud providers. Over the past few years, we have been exploring the cloud option, beginning with some of our smaller services. As our experience with cloud infrastructures and our confidence in their capabilities has grown, we have started to investigate how they can be used to support our mission critical services as well.
After looking at this very carefully, we have decided it’s time to move, and we are planning to deploy operational instances of our RPKI repositories and the RIPE Database to Amazon Web Services (AWS). Our main reasons for this approach:
- Organisations like AWS offer infrastructure at a global scale, along with mature security, privacy and availability features. Delivering all of this ourselves would take a significant chunk of the RIPE NCC’s resources – both in initial investment and in ongoing operational costs.
- With the cloud comes an ability to quickly spin up or scale down services as needed. This is exciting, as it means we can become much more agile and dynamic as an organisation.
- It also means that our engineers can worry less about maintaining infrastructure and instead focus on what they do best – keeping our core business front-and-centre.
But such a move does not come without exposure to new risks, and these need to be properly acknowledged and considered. One concern we have identified is the potential for core functions like RPKI and the RIPE Database to become dependent on the infrastructure of third-party cloud providers. Another concern is around ensuring that our services remain uniformly accessible across the whole of our service region, especially in countries or regions that are engaged in political disputes with the countries our cloud providers are based in (whether now or in the future).
The challenge is then to see if these risks can be mitigated – and we think they can. Our solution is to ensure a fully redundant infrastructure that uses multiple providers or retains an on-premise fallback element. That way, in the unlikely event that our primary cloud provider becomes unavailable, we remain in control and able to continue operations as normal.
Goals and Requirements
The primary goal of our cloud strategy is to ensure availability, security, resiliency, and low latency for our mission critical services. We need to achieve this without relying on any one company for our infrastructure, and without compromising the accessibility of our services.
Concerning availability, our target is 99.999% availability (“five nines”), which is a common target in the Internet services industry. To achieve this, we will need redundancy across data centres all around the world.
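To put an availability target in concrete terms, it can be converted into a yearly downtime budget. The snippet below is only an illustrative sketch: the function name `downtime_budget_minutes` is made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Maximum minutes of downtime per year allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# Compare "four nines" with "five nines".
for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, while 99.99% leaves a little under an hour.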
Taking all of this into consideration, we have identified three main requirements:
- We cannot depend on any one cloud provider. No mission critical service can rely on a single provider. We must provide a fully-functional redundant infrastructure, either on our premises (i.e. Equinix data centres in Amsterdam) or on a secondary cloud provider (preferably European).
- A maximum downtime of one hour in the event of catastrophic failure on the main cloud provider. If the infrastructure of our primary cloud provider should ever become unavailable, the failover to on-premise infrastructure or the secondary cloud provider must happen in less than one hour.
- Services must be accessible by all, including in the event of sanctions or political disputes. In the event that certain members or countries are unable to access our services due to the sanctions compliance obligations of our main cloud provider (or for any other reasons), our redundant infrastructure must be able to serve those requests.
As mentioned above, our plan is to deploy both our RPKI repositories and the RIPE Database to AWS. For the secondary infrastructure, we have chosen a different path for each service: the RPKI repositories will be deployed to a secondary cloud provider, while the RIPE Database will continue to rely on our existing on-premise infrastructure for now.
Our target architecture for the RPKI repositories consists of our core and publication servers deployed to our existing data centres in Amsterdam (Equinix). The rsync and RRDP repositories, which pull from our publication servers, will be deployed to two different cloud providers, each fully independent of the other.
In AWS, we will start with multiple availability zones (multiple data centres) in a single region (Ireland), which should give us 99.99% availability. This step alone will provide a significant improvement to our current situation, as RRDP is currently running on a single node in AWS, and rsync is running on two nodes in Equinix. We aim to complete this phase by Q3 2021.
We will then expand to an additional region, bringing us to our targeted 99.999% availability. This is planned for Q4 2021. The final step will be deploying to a secondary cloud provider, which we are planning to do early next year. In the meantime, our existing infrastructure in Equinix will remain as a backup.
We will then alternate between AWS and the secondary cloud provider using DNS. Both providers will actively serve production traffic at all times. Failing over to a single environment requires removing the other environment from DNS. We are also considering using anycast where possible, which would offer resilience similar to the architecture used by the DNS root servers.
If one of the cloud providers should ever become unavailable in a certain area or country, there might be intermittent errors initially, but as the DNS will alternate between different providers, one of them will respond. If the problem should persist, we will temporarily remove the affected provider until a permanent solution can be found.
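The behaviour described above can be illustrated with a toy simulation. This is a sketch, not our actual DNS setup: the class `AlternatingName`, the provider labels and the assumption that each query simply receives the next provider in turn are all hypothetical simplifications.

```python
from collections import deque


class AlternatingName:
    """Toy model of a DNS name that alternates between providers."""

    def __init__(self, providers):
        self._records = deque(providers)

    def resolve(self):
        """Return the provider for this query, then rotate for the next one."""
        provider = self._records[0]
        self._records.rotate(-1)
        return provider

    def withdraw(self, provider):
        """Operator action: remove an affected provider from DNS."""
        self._records.remove(provider)


def simulate(name, reachable, requests):
    """Return one True/False per request: did the client reach a working provider?"""
    return [name.resolve() in reachable for _ in range(requests)]


name = AlternatingName(["aws", "secondary"])  # hypothetical labels
reachable = {"secondary"}                     # assume "aws" is unreachable from this area

print(simulate(name, reachable, 4))  # intermittent errors while both are in DNS
name.withdraw("aws")
print(simulate(name, reachable, 4))  # every request succeeds after withdrawal
```

In this model, roughly every other request fails while the unreachable provider is still in DNS, and all requests succeed once it has been withdrawn.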
For the RIPE Database, our target architecture is similar to what we are planning for the RPKI repositories. The key difference is that instead of using a second cloud provider, the secondary infrastructure will be hosted in our two data centres in Amsterdam.
Our first step will be to deploy in AWS using multiple availability zones in a single region, which we plan to complete in Q3 2021. Once the concept has been proved, we will expand this to multiple regions.
Our internal environment remains as-is. At a later stage, we will evaluate whether this environment needs to be scaled up. In any case, this must be able to serve all requests, even in the event that the entire AWS infrastructure is unavailable for a prolonged period.
Regular switchover from one live environment to the other is straightforward, as the primary database is hosted in AWS (writes are directed to AWS and replicated back to the RIPE NCC for reads). We aim to complete a switchover within 30 minutes, and we plan to perform them often: they have no impact on users, and each one verifies that the failover process will work when we need it. AWS will be the preferred live environment, as it provides better availability.
Failover (if the current live environment goes down) will happen by updating DNS to point Whois services to the remaining (standby live) environment. We aim to fail over within one hour. If AWS is the environment that is down, the service will be read-only after failover, because the primary database is hosted there. We assume any AWS downtime will be short, and we can move back to AWS when it recovers. If there is extended AWS downtime, or if we decide to leave AWS, then we will nominate an internal server to act as the new primary database. We will reconfigure any other servers to replicate from that primary, and we will then accept updates once this setup has been completed. Whois updates can be handled by either environment, but the backend application always writes to AWS.
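The failover rules for the RIPE Database can be summarised as a small state model. The sketch below is illustrative only: the names `Env`, `WhoisState` and `promote_internal` are invented for this example, and staying on the internal environment after a promotion is an assumption, since we have not described that case.

```python
from dataclasses import dataclass
from enum import Enum


class Env(Enum):
    AWS = "aws"
    INTERNAL = "internal"


@dataclass
class WhoisState:
    primary: Env = Env.AWS  # environment hosting the primary database
    aws_up: bool = True

    def live_environment(self) -> Env:
        """Environment that DNS should point Whois services to."""
        # AWS is preferred while it is up and still hosts the primary.
        # Assumption: once an internal primary is nominated, we stay internal.
        if self.aws_up and self.primary is Env.AWS:
            return Env.AWS
        return Env.INTERNAL

    def accepts_updates(self) -> bool:
        """Updates need a reachable primary database; otherwise we are read-only."""
        return self.primary is Env.INTERNAL or self.aws_up

    def promote_internal(self) -> None:
        """Extended AWS downtime: nominate an internal server as the new primary."""
        self.primary = Env.INTERNAL


state = WhoisState()
state.aws_up = False  # AWS goes down: fail over, read-only
print(state.live_environment(), state.accepts_updates())
state.promote_internal()  # downtime is extended: accept updates internally
print(state.live_environment(), state.accepts_updates())
```

The model makes the read-only window explicit: it lasts from the moment AWS becomes unavailable until either AWS recovers or an internal primary has been nominated.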
We are now in the process of putting these plans into action. We will take a phased approach, both for the RPKI repositories and the RIPE Database, and we will inspect, communicate, and adapt as we go – responding to how these architectures work in practice and to concerns raised by our community.
The next steps will be improving the resiliency of our RPKI core and publication servers, as they will remain on our existing infrastructure in Amsterdam. These improvements are planned to take place in 2022.
Do you have any input or ideas about our plans? We would love to hear from you – you can comment below or we’ll see you on the mailing lists. We will also be presenting on our work at RIPE 82 in the RIPE NCC Services Working Group.