Our draft cloud strategy framework is an attempt to bring everything together in a way that sets out some boundaries, identifies critical elements, and indicates where we need to be strict vs where we can afford to be a little more relaxed. This should hopefully support more clarity regarding how we are approaching the use of cloud providers and provide a solid basis for future discussions when we look at moving specific elements to the cloud.
Background
We recently stepped up our engagement on the use of third-party cloud providers to support key RIPE NCC services. This was triggered by a stronger-than-expected response at RIPE 82 which, in hindsight, shouldn’t have been all that surprising. After the meeting, we recognised that we needed to start this discussion over, beginning with a summary of what we’d heard from you.
That summary formed the basis of a further discussion on the RIPE NCC Services Working Group mailing list, after which we said we would go away and draft a set of principles and a strategy framework to share with you in July.
Last week, we went over this with the WG in an interim session. The response there was positive, and we think we’re now ready to publish the full framework for your feedback – which is what the rest of this article will cover. So, with the backstory covered, let’s get started.
Principles and Requirements
Cloud Principles
One of the first things we did when taking a fresh look at this was to think in terms of the underlying principles. We ended up with the following list. We don’t expect these principles will seem particularly new to members of the RIPE community. This is really more of a restatement or a description of an implicit understanding that has existed for most of the time we’ve been working with you.
1. The RIPE NCC solicits input from the RIPE community for all services that 1) are critical for the operation of the global Internet, or 2) directly affect the operations of our members or the RIPE community.
Requirements for these services are discussed in an open community process with guidance from the appropriate RIPE working group. We publish implementation and deployment plans and seek input from the community from an early stage until successful deployment. We regularly report on the performance of our services and conduct reviews with the appropriate working group.
2. The RIPE NCC has full authority and responsibility for the design, deployment and operation of its services.
This is standard corporate governance. The RIPE NCC Association is a legal entity that assumes full responsibility for its actions and therefore needs authority regarding what it does. In an association like ours, this authority is granted by the membership and comes to us via the board it elects to provide oversight and direction to our staff.
3. The RIPE NCC must remain neutral
We have the responsibility to operate our services on a neutral and impartial basis for the benefit of all members, who are often in competition with one another.
4. Integrity of RIPE NCC services must be maintained
We are trusted by the Internet community to keep our services available in the face of geopolitical, economic and regulatory threats. We are accountable to the community to protect the security and integrity of the data and services we are entrusted to manage.
5. Open standards should be used
We will prefer open standards and open technologies. Where open standards are not viable, we will prefer industry standards over proprietary interfaces.
Requirements
Based on what we heard from you over the course of our recent engagement, we identified a series of requirements that we need to meet in order to provide our services effectively.
1. Ensure resilience, accessibility, availability, and low latency for our services
This is a key requirement. Providing stable and effective services is a core function and we must be able to do this well.
2. Minimise vendor lock-in
The need to keep switching costs to a minimum was one of the most repeated concerns that we heard from you. As much as possible, we need to avoid becoming dependent on vendor-specific features or too deeply entangled in the proprietary environments of various providers. Preferring open standards and technologies can help us to achieve this.
3. Avoid dependence on any single cloud provider
We can’t rely on any single third party to run mission-critical Internet infrastructure. We should favour a distributed architecture that avoids single points of failure and circular dependencies between the cloud infrastructure and RIPE NCC services.
4. Engineers can innovate and improve the quality of our services
While this may come as a surprise to some, the RIPE NCC is not made of magic and, just as with any other company, our resources are not infinite. We only have so many engineers and there are only so many hours in the day – making the best use of both to create value for our members and the community is important.
5. Comply with laws and regulations
There is not a lot of wiggle-room here. We have a strict vetting process to ensure that we comply with all applicable regulations, such as EU sanctions or GDPR. We should publish details around this vetting for the community.
6. Ensure the security of our services
This is another hard requirement. As with our legal compliance above, we should share details about our vetting process to support confidence and trust that we’re getting this right.
7. Prefer providers in our service region
We saw a strong preference from the community that we use local providers. This is something that we support, with the caveat that we need to consider this alongside any trade-offs in terms of the other requirements above.
Acknowledging Tensions and Trade-offs
Before we continue, it is important to recognise that as we seek to maximise the principles and requirements outlined above, there will be costs attached and trade-offs will be necessary. For example, an absolute requirement that we only use providers in our service region might come at a cost to our first requirement of ensuring resilience, accessibility, availability and low latency for our services. Similarly, maximising this first requirement of service quality might cost us in terms of avoiding dependence on a single cloud provider or avoiding vendor lock-in. There are tensions here as well, notably between the first two principles: #1, that we seek the community’s guidance, and #2, that we have full authority over our services. This almost sounds like a contradiction, and this cloud discussion is a good example of how the two can get out of balance.
The point is that we shouldn’t fool ourselves into thinking we can have everything or that things will always run smoothly. Instead, we should keep in mind that these trade-offs and tensions exist and discuss them openly and in good faith.
It’s in this context that a comment from last week’s WG session seems worth referencing here: while it’s good to define principles and requirements, we should take care to avoid painting ourselves into a corner. It is not in anyone’s interest if we end up choosing low-quality solutions simply because they are the best fit for the criteria we’ve agreed with the community. We should apply a measure of sanity and go back to you if something’s not working out or changes are needed.
Strategy Framework
Now that we’ve laid out some principles and requirements, let’s look at our draft strategy framework. It is important to be clear that this is still at an early stage and will be discussed further, both with the RIPE community and our Executive Board.
This framework is an attempt to bring everything together in a way that sets out some boundaries, identifies critical elements, and indicates where we need to be strict in terms of our requirements vs where we can afford to be a little more relaxed. This should hopefully also allow more clarity regarding how we are approaching the use of cloud providers and provide a solid basis for future discussions when we look at moving specific elements to the cloud.
To start, we have defined three levels of strictness for the requirements we identified (Strict, Heightened and Standard). We then identified what each level means for each requirement, which you can see in the table below. Some requirements, such as ‘Comply with laws and regulations’, apply equally across all levels and so lack any differentiation (the final four rows of the table).
Requirement | Strict | Heightened | Standard |
---|---|---|---|
Ensure resiliency, accessibility, availability and low latency of services | Uptime > 99,999% | Uptime > 99,9% | Uptime > 99% |
Minimise vendor lock-in | Only use bare-metal or VMs | Managed services can be used but only with open standards | No restriction on managed services but keep track of switching costs |
Cloud provider independence | Fully distributed architecture; no downtime allowed | Stand-by backup infrastructure required; fail-over within one hour | Ability to spin-off a new instance within 48 hours; maximum outage of 48 hours |
Enable our engineers to improve product quality and innovate | Applies to all levels | | |
Comply with laws and regulations | Applies to all levels; details of legal vetting process should be published | | |
Ensure security of our services | Checks according to level; details of infosec vetting should be published | | |
Prefer providers in our service region | Applies to all levels | | |
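To make the uptime thresholds in the first row more concrete, the short sketch below converts each one into the downtime it would allow per year. It is an illustration only and not part of the framework: the 365-day measurement window and the names `UPTIME_TARGETS` and `max_downtime_minutes` are assumptions, since the table does not say over what period uptime is measured.

```python
# Illustrative only: yearly downtime implied by each uptime threshold.
# Assumption: uptime is measured over a 365-day year (the framework
# does not specify the measurement window).
MINUTES_PER_YEAR = 365 * 24 * 60

# Thresholds taken from the first row of the table (decimal points used here).
UPTIME_TARGETS = {
    "Strict": 99.999,
    "Heightened": 99.9,
    "Standard": 99.0,
}


def max_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime, in minutes, at exactly this uptime level."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for level, target in UPTIME_TARGETS.items():
    minutes = max_downtime_minutes(target)
    print(f"{level}: under {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, the three levels correspond to roughly 5.3 minutes, 8.8 hours and 3.65 days of downtime per year. Because the table states the thresholds as "greater than", these figures are upper bounds.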
So far, we’ve described how we will interpret our requirements on a scale from strict to more-relaxed. The next step is to look at how we map specific services against this framework. Here, we have been thinking of our services in terms of two categories:
- Global Internet Services: required for the Internet to function properly (e.g. RPKI)
- Core RIPE NCC Services: critical for the RIPE NCC, but will not have a noticeable impact on the wider Internet if offline for a short period (e.g. LIR Portal)
Further, services within each of these categories can have differing levels of criticality (meaning, the importance of these services either to the operation of the Internet or the RIPE NCC). Looking at criticality, we have identified three levels:
- High: outages have a direct operational impact
- Medium: outages have an operational impact within a few hours
- Low: we can afford to be more forgiving regarding outages
The table below then indicates how we map the strictness of requirements above to specific services depending on their criticality. We have included examples of services to make this a little more concrete. It's important to note that these examples are merely illustrative at this point – we intend to define the criticality of specific services with you at a later stage as part of this work.
Criticality | High | Medium | Low |
---|---|---|---|
Global Internet Services | Strict (RPKI) | Heightened (RIPE Database) | Standard (RIR statistics) |
Core RIPE NCC Services | Heightened (Registry software) | Standard (LIR Portal) | Standard (Meeting registration software) |
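Read as a data structure, the table above is a lookup from a service category and a criticality level to a strictness level. The sketch below is one hypothetical way to encode it; the names `Strictness`, `STRICTNESS_BY_SERVICE` and `required_strictness` are assumptions made for illustration, and the values are copied from the draft table, which may still change.

```python
from enum import Enum


class Strictness(Enum):
    STRICT = "Strict"
    HEIGHTENED = "Heightened"
    STANDARD = "Standard"


# (service category, criticality) -> strictness level, as in the draft table.
STRICTNESS_BY_SERVICE = {
    ("Global Internet Services", "High"): Strictness.STRICT,
    ("Global Internet Services", "Medium"): Strictness.HEIGHTENED,
    ("Global Internet Services", "Low"): Strictness.STANDARD,
    ("Core RIPE NCC Services", "High"): Strictness.HEIGHTENED,
    ("Core RIPE NCC Services", "Medium"): Strictness.STANDARD,
    ("Core RIPE NCC Services", "Low"): Strictness.STANDARD,
}


def required_strictness(category: str, criticality: str) -> Strictness:
    """Look up the strictness level for a service; raises KeyError if unknown."""
    return STRICTNESS_BY_SERVICE[(category, criticality)]


# Example: RPKI is listed as a high-criticality Global Internet Service.
print(required_strictness("Global Internet Services", "High").value)
```

Keeping the mapping in a single table like this means that reclassifying a service later only changes which row it falls into, not what each strictness level requires.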
Next Steps
Now that we’ve published this draft, the ball is back in your court. We hope what we have presented here will help the discussion to progress. With that in mind, please let us know what you think. Of course, detailed feedback is helpful – but so are brief expressions of support. We would like to hear from as many voices as possible. You can comment below or on the RIPE NCC Services Working Group mailing list (ncc-services-wg@ripe.net).
The chairs of the working group have been kind enough to schedule a second interim WG session for 6 September where we can discuss this framework in more detail. Our senior management and engineers will be at the session to hear what you think and answer any questions. Following this, we will update our draft strategy based on your feedback, before we present it to our Executive Board at its meeting in September.
Until then, we hope to see you on the mailing list!
Comments 5
Harry Cross •
I must take a moment to congratulate the NCC on realising that the first proposal was flawed and needed to be re-worked with more stakeholder engagement. This does sound like a much better document, but I would like to see a requirement for any service on the standard strictness to have a documented failover/second provider and a released plan on the migration process over to that secondary should it be needed.
Serbulov Dmitry •
It is a very good and important step in this work. I agree with this document.
Niall Murphy •
Hi folks -- I am, alas, not really any longer in the RIPE community, so take my observations with a dash of NaCl. But as a person who's been working in Cloud providers for the past ~16/17 years I have some observations.
1) There are many valid reasons for using cloud; there are also many valid reasons not to. I think there would be a strong argument for RIPE, maybe above other institutions, to be cautious about using cloud. There are many reasons for this, not all of them completely about optics. With the best will in the world, US leadership of the major providers will be responsive to US concerns, not necessarily European concerns, and in terms of actual revenue (which, I am sad to reveal, is a major driver of prioritisation) it seems likely to me RIPE won't necessarily get the attention it deserves generally, if not during an actual outage. We should think carefully about trading this independence away, though that independence is certainly a cost we pay for.
2) The reasons I see written down talk about resilience - which I am inclined to think is mostly a red herring, there's plenty of ways to get resilience without cloud - but also a lot of implicit cost arguments. Is the impulse to control costs something that the membership at large would be happy prioritising over other concerns? Has this been discussed openly? (Apologies for my ignorance here.) Furthermore, is everyone going into this aware that cloud providers structure costs to incentivize lock-in? (AWS's famous network transfer costs, which Corey Quinn has spoken about in detail previously, are probably the most famous example here, but there are others.) Even a "we want to be cloud-provider agnostic" framework won't necessarily help you to model the hidden traps here successfully.
3) I don't know what the future of RIPE holds. I'm not sure if it's envisaged that it shrinks scope from its current situation, stays the same, or grows scope. But handing over the work of a significant subcomponent of service operation is a strong argument for shrinking. That's maybe more cost efficient - probably members are very happy if their bill goes down. But it's also an organizational risk, in that the expertise of the organization is hollowed out, and more centralization is facilitated. I suppose, ultimately, it's a question of what the members are happy with authorising.
NRM
bert hubert •
Hello! I've previously written some words on this subject, for example here: https://berthub.eu/articles/posts/how-tech-loses-out/
I applaud the thorough look RIPE is taking at things. But there is something I miss strongly. The goal in life is not to do everything yourself. A goal is resilience, which is discussed well. But a very important goal is also to maintain capabilities. It is entirely ok to get a lot or even "most" stuff somewhere else. But it is not ok if this comes at the loss of capabilities.
Knowing how to run vital infrastructure is key. This ensures that RIPE is a credible cloud negotiator, for example. It also means that when things go wrong, people have hands on experience in fixing things. In addition, because RIPE is so core to the Internet, they do have to maintain a feel with how actual Internet platforms are being run, "down to the metal". It would not be good if RIPE only saw an AWS console and over the years started thinking that that console was the Internet. Actual routers, switches and servers are the Internet. Also at AWS by the way, it is not being run by unicorns.
So in addition to the worthwhile considerations above, I'd suggest writing down that there will always be a certain set of services (beyond K-root) that are being run in house, even if this turns out to be more expensive in the long run. Retaining a core set of key services means that capabilities will not disappear over time, and that RIPE will continue to know, down to its bones, what it means to run an important bit of Internet.
Additionally, it would be great if engineers would not have to fight for this continually. It should be broadly supported policy, and we should not have engineers having to justify their existence and capabilities. If someone wants to outsource a key bit of stuff that was chosen to be operated in house, the response should be: what outsourced service will we then insource again?
As long as RIPE maintains a sufficient capability level, and continues to actually run key parts of the internet in a very hands on fashion, it is fine to use third parties as well where this makes sense.
Bert
Vesna Manojlovic •
Please join today's "open consultation" // feedback session about RIPE NCC's "cloud strategy" at 4PM CEST. Link to zoom is here: https://www.ripe.net/ripe/mail/archives/ncc-services-wg/2021-September/003469.html