Google BigQuery is a data warehousing platform with an SQL query interface on top. This is a powerful tool that allows us to rapidly filter, query, and process RIPE Atlas measurement results, improving our ability to explore this dataset. We are now making RIPE Atlas measurement results publicly available via Google BigQuery.
At RIPE 79 we discussed our experimentation with the Google BigQuery platform for the exploration and analysis of RIPE Atlas measurement data. We're happy to announce that we're moving ahead with public access to this data through Google BigQuery!
One benefit of the RIPE Atlas platform is that it collects a lot of measurement results. One drawback of the RIPE Atlas platform is that it collects a lot of measurement results! Every day, the platform performs over 145 million traceroutes, over 190 million DNS queries, and over 500 million pings, plus various other measurements such as NTP and HTTP. jq, grep, and awk can go a long way, but maybe not that far.
The RIPE Atlas API is keyed on measurement IDs, so it is not trivial to perform arbitrary searches across the collected results without retrieving excess data. Aside from simple filters, we leave it to you to provide the compute power to process the data, and not everybody has access to the storage or compute capacity to manage this efficiently.
Modern tooling can get us further than this. BigQuery is one such tool, with a ridiculous amount of power behind it: given a well-defined schema, we're able to slice the RIPE Atlas data across any arbitrary dimension and combine it with other datasets at will. Compared to writing custom scripts for every question we can dream up, the time to insight with BigQuery can be considerably lower.
From today, we're granting access to public RIPE Atlas measurement results via Google BigQuery. You can find more information on this service by visiting this page in our RIPE Labs tools section, and additional documentation and examples in our GitHub repository.
We're initially offering two datasets: samples and measurements, each with six tables for the six main measurement types. The samples dataset is a 1% sample of one week of data, to get you started and show you what the data looks like. The measurements dataset is all public results, up-to-date, initially starting from 1 January 2020. We intend to backfill some data older than this, too.
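As a sketch of what a first query might look like, here's a simple daily count over ping results in the samples dataset. The project, table, and column names below are illustrative assumptions, not the documented schema; check the GitHub repository for the actual table and field names before running anything:

```sql
-- Illustrative only: project, table, and column names here
-- are assumptions; consult the repository docs for the real schema.
SELECT
  DATE(start_time) AS day,
  COUNT(*) AS results
FROM `ripencc-atlas.samples.ping`
GROUP BY day
ORDER BY day;
```

Developing queries against the samples tables first is a cheap way to check that a query does what you expect before pointing it at the full measurements dataset.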
The data will look familiar to anybody who already knows the structure of RIPE Atlas measurement results. We're documenting the data and how to use it in this GitHub repository.
We're planning on augmenting this data with additional, smaller, datasets; for example, public probe metadata and RIS data. Watch this space for more.
What Will Not Change
By adding Google BigQuery as an access method to public RIPE Atlas measurement data, we're definitely not removing the RIPE Atlas API, nor do we intend to. The API is still the source of truth. The RIPE Atlas daily dump service is also still available for bulk downloads.
Our intent in offering this service is to gain more experience of what people find useful. We want you to go and use it, and we think it's pretty stable. That said, things may break and incoming data may dry up; when that happens, we will endeavour to fix it.
If you report something that is clearly broken, our intent is to respond to the issue no later than the next business day. The severity of the issue will determine how long the fix takes, but our intent is to keep this service going.
Additionally, we may intentionally break things. We're still learning about the best ways to make use of this platform, and we're receptive to what could be improved.
Issues and Feedback
If you have feedback, or you have trouble, or you've spotted a problem, please email us at email@example.com!
Terms and Conditions
Although RIPE Atlas measurement data is as publicly accessible as it's ever been, do consider your usage of it; it's still covered by the RIPE Atlas Terms and Conditions.
Our model for this is that we're paying for the storage of the data, but we cannot pay for an arbitrary volume of queries from users. You cover the costs for the latter. Note that on personal accounts you have a limited budget to play with, and if you're working for an educational institution there are credits you can apply for.
Costs are primarily computed from the amount of data a query reads from storage: whether a query takes one second or one day to complete is not relevant. For more on how to get onto the platform and how to manage costs, consider reading:
Google's list pricing is available here. The ballpark value I keep in mind is that querying 1TB of data costs approximately €5. For reference, one full day of traceroute data -- selecting all columns -- is currently around 265GB, i.e., perhaps €1.30 to query. These costs are approximate, of course, and they can build up quickly. Keep an eye on your budgets, test against our samples tables, and set quotas where it makes sense.
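The ballpark above is easy to turn into a quick back-of-the-envelope helper. This is a sketch only: the €5/TB figure is the approximate rate from this post, not an official price, and real billing depends on your account's pricing and any free quota:

```python
# Back-of-the-envelope BigQuery query cost estimate.
# Assumes the ballpark rate from the text (~EUR 5 per TB scanned);
# this is NOT an official price.

EUR_PER_TB = 5.0  # approximate list price per TB scanned


def estimated_cost_eur(bytes_scanned: int) -> float:
    """Estimate query cost in EUR from the number of bytes scanned."""
    tb = bytes_scanned / 2**40  # bytes -> tebibytes
    return tb * EUR_PER_TB


# One full day of traceroute data, all columns, is roughly 265 GB
# at the time of writing.
day_of_traceroutes = 265 * 2**30
print(f"~EUR {estimated_cost_eur(day_of_traceroutes):.2f}")
```

Running this prints an estimate of about €1.29 for a full day of traceroute data, in line with the ballpark above. In practice you can get the exact bytes-scanned figure for a query from BigQuery's own dry-run estimate before executing it.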
Public access of our data through this platform will be new for many in the community. We're interested in feedback on this, and in particular how we can improve it!
The email address to contact us on is firstname.lastname@example.org.
We're also open to providing curated datasets from our data that fulfil common use cases: for this, we really need to understand how people use this type of service. If you start to use it and find yourself running the same queries over and over, please email us (or write a Labs post about it)!