RIPE NCC Measurement Data Retention Principles

Robert Kisteleki
Contributors: Paul de Weerd

RIPE Atlas and RIPE RIS both provide a wealth of Internet measurement data that is invaluable to Internet researchers and network operators alike. But that's not to say that questions about the cost and value of storing this data don't come up from time to time. Here, we open up the discussion on whether change is called for.


RIPE NCC measurement systems, in particular RIPE RIS (Routing Information System) and RIPE Atlas, have been collecting passive and active measurement results (see details below) since 1999 and 2010, respectively. Both systems have stored these results and made them available to the whole community since their inception, as downloadable files, through APIs, or both.

Even though the very idea of ever deleting historical public data is frowned upon by some users (particularly researchers), over time we see more questions about the costs, value and reasons for keeping all this data around. (A quick side note: none of the data we're talking about here is personal data.) It is therefore worth reviewing whether the RIPE NCC should continue this practice as before, or whether we should make changes.

RIPE Atlas

As of the end of 2023, RIPE Atlas is collecting a grand total of about 15,000 measurement results per second, or about 1.3 billion each day. The majority of these are accessed within a couple of weeks or months, but we see older results being accessed as well.
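As a quick sanity check on these figures, the per-second rate can be converted into a daily total. This is only a back-of-the-envelope sketch: it assumes the quoted 15,000 results per second is a steady average over the whole day, and the helper name `per_day` is ours, not part of any RIPE Atlas tooling.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day(per_second: float) -> float:
    """Convert an average per-second rate into a daily total."""
    return per_second * SECONDS_PER_DAY


# Figure quoted in the text: about 15,000 results per second (end of 2023).
atlas_results_per_day = per_day(15_000)

print(f"{atlas_results_per_day:,.0f} results per day")           # 1,296,000,000
print(f"about {atlas_results_per_day / 1e9:.1f} billion per day")  # about 1.3 billion
```

The result, roughly 1.3 billion per day, matches the daily figure given above.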

There are a couple of dimensions to this data. One is one-off vs ongoing measurements:

  • The vast majority of measurements (90%+ as of the end of 2023) are “one-offs”, meaning they are executed once, and the involved probes deliver their results as fast as possible. The general use case is to discover what’s happening “right now”.
  • By contrast, the majority of results delivered belong to ongoing measurements. Here each involved probe measures at a predefined frequency and delivers results accordingly. The general use case is to keep monitoring something, with the ability to look at how the values change over time.

The second dimension is whether a measurement is public or non-public:

  • The majority of the measurements are public (this is the default setting as well). These can be reused by different users, perhaps for different use cases. For example, if someone is already running measurements targeting Google, these can be viewed, used and shared between different users and even different time frames.
  • Some measurements are marked as non-public, because their owners prefer not to expose this information (yet). Reasons for this could be that the target infrastructure is not public (yet), or that it is temporary and used for testing only, etc. The owner of the measurement can change a non-public measurement to be public later on.

The third dimension is measurement type: pings, traceroutes, DNS and NTP queries, TLS checks, etc.

As of the end of 2023, teams in the RIPE NCC are working on changing the storage infrastructure behind RIPE Atlas to make it more efficient and cost-effective, mostly by moving the older, less frequently accessed data to cheaper solutions (perhaps cloud based). We are trying hard to make this change effectively invisible to end users. This is still ongoing work, and while choosing a cheaper solution usually comes with some trade-offs, we’re trying our best to minimise these.

We are also working on renewing the infrastructure used to store recent data to make that more efficient as well, while otherwise keeping essentially the same technology stack as we use now.

RIPE RIS

The RIS project started to collect BGP data in 1999. We have deployed RRCs (RIS Route Collectors) that peer with various players on the Internet to provide us with their view (either full or partial) of the Default Free Zone (DFZ). The data we collect is made available to end users through various means:

  • Dump and update files, available through public FTP, which contain a snapshot of the BGP landscape at the time of the dump and the updates that we saw between those snapshots
  • RIPEstat widgets, which provide various ways of inspecting the data, making insights interactively available to the community
  • RIS Live, which allows users to get a near real-time stream of the BGP updates that we collect, filtered to their liking
  • RISwhois, which gives users the ability to query which ASN originates which prefix

Today, we receive about half a billion BGP update messages per day (close to 6,000 per second on average), containing about one billion route updates. The dataset currently weighs in at roughly 50 TB of compressed dump and update files, with data collected in the last five years accounting for 80% of that. For the RIPEstat use case, we make the data available in a variety of ways, which takes up about 800 TB of storage space.
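The RIS figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch that takes the rounded numbers from the text at face value; the variable names are ours, and the derived values (route updates per message, the split of the 50 TB archive) are simple consequences of those rounded inputs rather than measured statistics.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Rounded figures quoted in the text.
update_messages_per_day = 500_000_000   # "about half a billion" BGP update messages
route_updates_per_day = 1_000_000_000   # "about one billion" route updates
archive_tb = 50                         # compressed dump and update files
recent_share = 0.80                     # share collected in the last five years

# Average message rate, to compare with "close to 6,000 per second".
messages_per_second = update_messages_per_day / SECONDS_PER_DAY

# Average number of route updates carried per BGP update message.
routes_per_message = route_updates_per_day / update_messages_per_day

# Split of the compressed archive between recent and older data.
recent_tb = archive_tb * recent_share
older_tb = archive_tb - recent_tb

print(f"{messages_per_second:,.0f} update messages per second on average")
print(f"{routes_per_message:.1f} route updates per message on average")
print(f"about {recent_tb:.0f} TB from the last five years, {older_tb:.0f} TB older")
```

Under these inputs, everything collected before the last five years adds up to only about a fifth of the compressed archive.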

Data Retention Principles Proposal

From the RIPE NCC’s perspective we’d like to propose the following simple principles:

  1. The ultimate goal is to retain as much historical data as possible in a financially sustainable way.
  2. Collected results for RIPE Atlas and RIS will be available to all users, similarly to what the services offer today. However, the access methods to this data, in particular the specific protocol that is used to gain access to data and the Service Level Objective (SLO) for time-to-access, will be tailored to take into account the age of the data and the frequency of access to it. In particular, older or infrequently used data may be slower to access, or may be moved out of the systems (or their APIs) that originally collected the data, resulting in tiered access.
  3. The different services can have different tiers, different cutoffs and different access methods for these access tiers.
  4. Users should be encouraged (as much as possible) to make authenticated requests when accessing data from the services. Data transfers from unauthenticated users may be pooled together and share common limitations (e.g. an aggregate bandwidth limit applied across all of them).
  5. Results for non-public RIPE Atlas measurements will be stored and retrievable (by the owner of the measurement) for at least one month.
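To make principles 2 and 3 more concrete, here is a minimal sketch of how age-based tiered access could be modelled. Everything specific in it is an assumption made for illustration: the tier names, the age cutoffs, the access methods and the time-to-access values are hypothetical and are not proposed values. For simplicity it only considers the age of the data, not how frequently it is accessed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_age_days: Optional[int]  # None = no upper bound (catches the oldest data)
    access_method: str
    time_to_access: str          # informal stand-in for a time-to-access SLO


# HYPOTHETICAL tiers and cutoffs, chosen only to illustrate the idea.
# Principle 3: each service may define its own tiers, cutoffs and access methods.
TIERS_BY_SERVICE = {
    "atlas": [
        Tier("recent", 90, "service API", "seconds"),
        Tier("older", 730, "service API, slower backend", "minutes"),
        Tier("archive", None, "bulk download", "hours"),
    ],
    "ris": [
        Tier("recent", 365, "service API", "seconds"),
        Tier("archive", None, "bulk download", "hours"),
    ],
}


def tier_for(service: str, collected_on: date, today: date) -> Tier:
    """Return the first tier whose age cutoff covers the data's age."""
    age_days = (today - collected_on).days
    for tier in TIERS_BY_SERVICE[service]:
        if tier.max_age_days is None or age_days <= tier.max_age_days:
            return tier
    raise ValueError(f"no unbounded final tier configured for {service!r}")


today = date(2023, 12, 31)
examples = [
    ("atlas", date(2023, 12, 1)),
    ("atlas", date(2015, 6, 1)),
    ("ris", date(2001, 1, 1)),
]
for service, collected_on in examples:
    tier = tier_for(service, collected_on, today)
    print(f"{service} data from {collected_on}: {tier.name} tier, "
          f"via {tier.access_method}, time to access: {tier.time_to_access}")
```

In this sketch, every result remains reachable (principle 2); only the access method and the expected time to access change as data ages.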

The teams developing and operating RIPE Atlas and RIS are open to supplying reasonably calculable statistics in order to support discussions on the above principles.

Please send us your feedback

Please provide your feedback on the RIPE NCC Forum or on the RIPE Measurements and Tools Working Group mailing list.

About the author

For many years I led the Research and Development team at the RIPE NCC, a dedicated team of thinkers supporting the RIPE community by providing network research, data analysis and prototype tool development and services, including RIPE Atlas and RIPEstat. As of 2023, I'm working as a principal engineer, assisting the CTO and the RIPE NCC's information services.
