RIPE Atlas Anchors - Operating System Refresh

Sean Mottles
Contributors: Johan ter Beest

7 min read


RIPE Atlas anchors give us a reliable set of vantage points from which to observe the Internet, but keeping them operational over time raises interesting challenges. Here, we give an update on a recent operating system upgrade we carried out on the anchors and the lessons we learned along the way.


RIPE Atlas anchors serve as a set of fixed points against which we’ve been measuring the state of the Internet for over a decade now. Hosted by organisations with robust network infrastructure - e.g., IXPs and data centres - anchors act as stable targets for measurements from RIPE Atlas probes while also performing connectivity measurements toward all other anchors at frequent intervals.

The results of these measurements are a valuable resource for anyone looking to investigate outages, power cuts, hijacks, and other events that interfere with the running of the Internet. For example, when submarine cables in the Baltic Sea were cut in late 2024, RIPE NCC researchers were able to look back at the mesh of anchor measurements from around the time of the event to analyse how the local Internet remained resilient.

That said, carrying out the upgrades and patches needed to make sure anchors keep functioning in this useful role is an ongoing task for our operations team, and there are certain features of the anchor setup that can make the process particularly demanding.

System refresh

The end of life of CentOS 7 in July of last year spurred the upgrade of RIPE Atlas anchors to a new operating system version. Operating system upgrades can usually be done in a variety of ways, but the setup for anchors imposed certain limitations on how we were able to proceed - some of which we were able to work around, and some that turned out to be irremediable.

Upgrade constraints

Upgrading the operating system of close to 800 machines is, on its own, no small task. But it's the very particular set of physical constraints that comes with anchors that made the process especially fragile.

The first constraint we ran into is the fact that anchors are externally hosted, with no out-of-band interface we could use for management. This meant we were limited to SSH and had to rely on contacting the host if something went wrong.

This is a good moment to thank all the RIPE Atlas anchor hosts we contacted during the upgrade process for answering our calls for help!

Second, the new operating system version (Oracle Linux 9) was built for the x86_64-v2 microarchitecture, which forced us to deprecate the v2 hardware anchors and reach out to virtual anchor hosts running an unsupported CPU type. 310 anchors in total were decommissioned, which is far from ideal - but it's important to note that, thanks to the cooperation of our hosts, this didn't lead to a drop in anchor numbers.
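
As an aside, whether a CPU meets the x86-64-v2 level can be read straight from its flags. A minimal sketch along these lines (not our exact tooling) is enough to spot a virtual anchor whose exposed CPU falls short:

# Flags x86-64-v2 requires on top of the baseline; all must appear in /proc/cpuinfo
required="cx16 lahf_lm pni popcnt sse4_1 sse4_2 ssse3"
flags=$(grep -m1 '^flags' /proc/cpuinfo)
for f in $required; do
    echo "$flags" | grep -qw "$f" || echo "missing: $f"
done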

Third, the minimum RAM requirements for a network installation of EL9 are effectively double those of CentOS 7. That ruled out our only option for remotely upgrading the vast majority of anchors, which have 3GB of RAM or less. EL9 will run on less than 3GB of RAM, but there's no way to do a pure netinstall with less than that amount.
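
Getting a picture of which anchors fell under that line is a matter of gathering facts across the fleet. Something like the following ad-hoc call is all it takes (the inventory group name here is just an example, not our actual inventory):

# Report installed memory per host; "anchors" is an illustrative inventory group
ansible anchors -m setup -a 'filter=ansible_memtotal_mb'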

Initial upgrade attempts

We used Ansible to collect information about each host and prepare the installer environment. For the hosts that had a supported CPU type but not enough RAM for a network installation, we originally trialled a procedure using kexec (and later grubby to set the CentOS 7 network installer as the first boot option, as kexec proved unreliable). This allowed us to reboot the anchor back into a CentOS 7 installation where we could reconfigure the machine with an additional "installer" partition. That partition, in turn, gave us a local filesystem on which to store the OL9 installer resources (kernel, initrd, stage2 file), reducing the RAM needed for the installation itself and accounting for the fact that Anaconda marks its source partition as "in use" and so won't let it be changed.
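
To give a flavour of the final step, here's a rough sketch. Device names, labels, paths and the kickstart URL are placeholders rather than our actual configuration; the essential trick is to stage the installer tree on the local partition and point Anaconda's inst.stage2 at it, so the stage2 image doesn't have to be pulled over the network into RAM.

# Rough sketch only: device names, labels, paths and URLs are placeholders
mkfs.ext4 -L installer /dev/sda3           # the extra "installer" partition
mount /dev/sda3 /mnt/installer
# ... copy the OL9 netinstall tree (images/install.img, .treeinfo) onto
#     /mnt/installer, and its pxeboot vmlinuz/initrd.img into /boot ...

grubby --add-kernel=/boot/vmlinuz-ol9-installer \
       --initrd=/boot/initrd-ol9-installer.img \
       --title="OL9 installer" \
       --args="inst.stage2=hd:LABEL=installer ip=dhcp inst.ks=https://example.net/anchor.ks" \
       --make-default      # grubby proved more dependable for us than kexec
reboot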

As you might imagine, the resulting procedure was quite a perilous one, with multiple reboots and installations, not to mention the significant amount of time it took for each anchor. As a result, it would have been unrealistic to upgrade the majority of anchors this way, and as we didn't want to lose any more hardware anchors, we looked for an alternative.

There are likely other ways of upgrading under these constraints, but since our team has the most familiarity with the Anaconda installer - and it is open source software after all - we decided to take a look to see what was actually taking up so much RAM.

Lorax and the NCC-optimised installer

Auditing the installer's contents quickly showed where all this RAM was going. Firmware alone consumes over 500 MB, with Wi-Fi and graphics drivers accounting for most of it. The graphical installer dependencies (fonts, icons, supporting packages, etc.), OpenSCAP, the glibc language packs, and SELinux support all make sense in a generic installer environment, but none of them were being used at all in our case. So, without all of this extra software, we could shrink the size of the stage2 file significantly.
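
An audit like this doesn't need anything fancy: loop-mounting the stage2 image and running du over it tells most of the story. A sketch (paths are illustrative, and some builds nest the root filesystem inside a LiveOS/rootfs.img rather than exposing it directly in the squashfs):

# Loop-mount the stage2 squashfs read-only and see where the space goes
mkdir -p /mnt/stage2
mount -o loop,ro install.img /mnt/stage2
du -sh /mnt/stage2/usr/lib/firmware      # firmware blobs (Wi-Fi, GPU, ...)
du -sh /mnt/stage2/usr/lib/modules       # kernel modules
du -sh /mnt/stage2/usr/share/locale      # translations / language packs
umount /mnt/stage2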

We reached out to the Anaconda folks in the Fedora Matrix room and mailing lists and were graciously pointed to a tool called Lorax from the Weldr project. Lorax turned out to be the exact thing we needed to create custom installers. By editing the runtime-install and runtime-cleanup templates to remove anything we didn't need and customising the branding to match our target OS vendor (this is especially important for installations on UEFI), we were able to reduce the size of all the EL9 netinstall resources by nearly 50%!

# Stock OL9 installer
 
102M  initrd.img
885M  install.img
13M   vmlinuz
 
# RIPE NCC OL9 installer
 
70M   initrd.img
493M  install.img
13M   vmlinuz
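
For anyone wanting to try something similar, the workflow looks roughly like the following: copy the stock Lorax templates, trim the runtime-install and runtime-cleanup templates, and point lorax at the copy with --sharedir. The repository URLs, product branding and directory names below are placeholders, and template locations can differ between distribution packages.

# Placeholders throughout; adjust repos, branding and paths to your target OS
cp -a /usr/share/lorax ./lorax-custom
$EDITOR ./lorax-custom/templates.d/99-generic/runtime-install.tmpl   # drop firmware, GUI bits, langpacks, ...
$EDITOR ./lorax-custom/templates.d/99-generic/runtime-cleanup.tmpl

lorax --product "Oracle Linux Server" --version 9 --release 9 \
      --source https://repo.example.org/ol9/baseos/x86_64/ \
      --source https://repo.example.org/ol9/appstream/x86_64/ \
      --sharedir ./lorax-custom \
      ./ol9-netinstall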

This not only got rid of the need for the repartitioning procedure [1], making our remote upgrade more reliable, but also sped up the process altogether, and we can now say that all RIPE Atlas anchors are running OL9.

Outcome

Having set out with the task of upgrading close to 800 anchors, 310 of which had to be decommissioned along the way, we actually saw an overall gain in the number of anchors connected to RIPE Atlas between the start and end of the upgrade procedure. All technical challenges and the methods we employed to overcome them aside, the continuing growth in the number of anchors deployed is a sure sign of success, and a clear indication of the ongoing willingness and support of our hosts.

Once again, we would like to thank all anchor hosts for their help and patience during this process, as well as the Anaconda and Weldr project folks for their work and assistance. If the idea of becoming a host sounds interesting to you, please take a look at our page on what's involved in hosting a RIPE Atlas anchor. Finally, as always, we welcome any feedback from the RIPE Atlas community in the comments below, or you can even start a discussion over on the RIPE NCC forum.


Notes

  1. We did end up keeping the 2G /installer partition on OL9 hosts for future OS upgrades. We would recommend a similar partitioning layout if you have hosts with constraints similar to those of our anchors.


About the author

Sean Mottles
Based in Amsterdam, the Netherlands

Linux Systems Engineer at the RIPE NCC. I'm originally from Huntington Beach, California, where I studied information systems and decision sciences at California State University, Fullerton, and worked at SpaceX for over 5 years before moving to Amsterdam.
