Another Look at RIPE Atlas Probe Lifetimes
The RIPE Atlas measurement network is a dynamic system. Every day, new probes connect to the RIPE Atlas infrastructure. At the same time, probes that were connected can disconnect and probes that were disconnected may reconnect. The number of probes reported on the RIPE Atlas homepage as connected and available for measurements therefore varies from day to day and even from hour to hour. Over the years, the general trend has been upwards, with more and more connected probes each month. However, in recent months, the number of connected probes stopped growing. Even though probes continue to be distributed, we have not seen more than 9,400 connected at the same time since March 2016.
In the article "A Visual Impression of Probe Lifetimes", we took a first look at the lifetime of probes according to their hardware version. This high-level, visual inspection did not reveal notable differences between different types of the current (version 3) probes. With no signs yet of growth picking up again, we took another, more detailed look at RIPE Atlas probe lifetimes. Specifically, we were interested in the following aspects:
- What is the rate at which probes first connect to the RIPE Atlas infrastructure?
- When did probes disconnect?
- What are the chances of disconnected probes coming back online?
- What is the availability of probes during their active life?
- In all of the above, do we observe different behaviour with different probe hardware versions?
To answer these questions, we looked at information provided by the probes API at the start of August. We extracted the status for each probe, as well as time stamps for their first connection, latest status change, and total time connected to RIPE Atlas.
Rate of probes coming online
In figure 1 below, we show the number of probes that first connected to RIPE Atlas as a function of time, grouped in bins of approximately one month (30 days). The height of each bar represents the total number of probes that connected each month. The different colours indicate the different probe versions. We can see how, over the years, different versions dominated. Version 3 probes equipped with a Verbatim USB stick are currently the main type of probe being distributed.
Figure 1: Distribution of first connection times
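The binning behind a histogram like figure 1 can be sketched as follows. This is a minimal illustration rather than the code used for the article: the input format (pairs of hardware version and Unix timestamp of first connection), the function name `bin_first_connections` and the choice of bin origin are all assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 30 * 24 * 3600  # one bin of approximately one month (30 days)


def bin_first_connections(probes, origin):
    """Count first connections per 30-day bin and hardware version.

    `probes` is an iterable of (hardware_version, first_connected) pairs,
    with first_connected a Unix timestamp or None (hypothetical format).
    `origin` is the Unix timestamp at which bin 0 starts.
    """
    counts = Counter()
    for version, first_connected in probes:
        if first_connected is None or first_connected < origin:
            continue  # never connected, or connected before the chosen origin
        bin_index = (first_connected - origin) // BIN_SECONDS
        counts[(bin_index, version)] += 1
    return counts


# Made-up sample data, for illustration only.
origin = int(datetime(2016, 1, 1, tzinfo=timezone.utc).timestamp())
day = 24 * 3600
sample = [
    ("v3 Verbatim", origin + 2 * day),
    ("v3 Verbatim", origin + 29 * day),
    ("v3 SanDisk", origin + 31 * day),
    ("v1", None),
]
counts = bin_first_connections(sample, origin)
```

Each key of `counts` is a (bin index, hardware version) pair, so stacking the counts per bin index gives one bar with one coloured segment per probe version.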
From 2013 onward, the total rate has been hovering around 300 first connections per month. With 212 and 162 first connections respectively, June 2016 and July 2016 are below average. However, an occasional dip in new probes connecting is not uncommon; it also happened in July 2015. The real message from figure 1 is that, apart from the most recent month, we do not see anything unusual in the rate of new probes joining RIPE Atlas - including no decline starting around March 2016, when we stopped seeing growth in the number of total probes connected. Therefore, the reason that the total number of connected probes is not increasing despite 1,254 new probes connecting in the past five months must lie in a high number of disconnections (probes temporarily or permanently leaving RIPE Atlas).
Conclusion: Until July 2016, new probes continued to connect at a steady rate.
Duration of disconnections
Figure 2 below looks at the duration of the disconnected state. For all probes that were disconnected on 31 July 2016, the histogram shows when they last disconnected. As before, the different colours denote the different probe types. We can see how, starting around July 2015, the number of probes that disconnected each month (and have not reconnected since) steadily increased. It reaches a peak of 1,047 during July 2016. However, this does not mean that all of these 1,047 probes are permanently lost - short-lived disconnections are not uncommon and happen, for example, when a probe needs to renew its DHCP lease, when a probe experiences connectivity issues, or when a RIPE Atlas controller reboots. In the dynamics of the system, with many probes installed in home networks, it is normal to consistently have a small percentage of active probes disconnected from the infrastructure at any given moment. The key question is how many of the currently disconnected probes can be expected to come back and how many may be leaving RIPE Atlas permanently.
Figure 2: Distribution of last disconnect time
Figure 2 also shows that most probes disconnecting in the last eight months are version 3 probes, with either a SanDisk or a Verbatim USB stick. Of these, SanDisk accounts for the largest fraction, which is understandable because the majority of installed version 3 probes are equipped with that type of USB storage. However, it indicates that the stagnation in growth in the number of simultaneously connected probes cannot be caused solely by the switch to another brand of USB stick. The version 3 Verbatim probes may be the dominant hardware type making a first connection in the past nine months, but they are second when it comes to the number of disconnections.
Conclusion: The rate of lasting disconnections has been increasing since mid-2015, after being more or less stable for a year. The number of disconnections in the last two months is very high, but we expect many of these to be temporary.
Probability of reconnection
To quantify the chances of probes reconnecting, we look at all past connection/disconnection events in RIPE Atlas history. For each probe, we convert the series of events into a series of time intervals (with granularity of a day) during which the probe was disconnected. Figure 3a below shows the cumulative distribution of the median disconnect times: for each point, the value on the y-axis represents the fraction of the probes that had a median disconnect time less than the value on the x-axis. We can see that the distributions for different hardware types are close together. For 95% of the probes, the median disconnect lasted less than 50 days; i.e. for 95% of the probes, half of the temporary disconnections did not exceed 50 days.
Figure 3a and 3b: Cumulative distributions of median and maximum probe disconnect intervals
Figure 3b shows the cumulative distribution for the maximum disconnect interval. Version 3 Verbatim probes appear to be doing best here; they have the smallest fraction with long-lasting temporary disconnections. However, that is largely due to the age of these probes; a probe that first connected in the past three months cannot possibly have experienced a disconnection exceeding 100 days. To a lesser extent, that also applies to version 3 probes with SanDisk USB sticks. Therefore, to make an assessment of what to expect from disconnected probes, we look at the older version 1 and version 2 probes. Here we see that about 60% of the probes experienced a maximum disconnection interval of 30 days or longer. About 20% had been disconnected for 100 or more days before coming back.
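The conversion from connection events to disconnect intervals, and from there to the cumulative distributions in figures 3a and 3b, can be sketched as below. This is a sketch under assumptions: the event format (time-ordered pairs of Unix timestamp and "connect"/"disconnect") and the function names are hypothetical, and only disconnections that ended in a reconnect are counted, matching the "temporary disconnections" discussed above.

```python
from statistics import median

DAY = 24 * 3600


def disconnect_intervals(events):
    """Return the durations, in whole days, of completed disconnections.

    `events` is a time-ordered list of (timestamp, kind) pairs, where kind
    is "connect" or "disconnect" (hypothetical format).
    """
    intervals = []
    down_since = None
    for timestamp, kind in events:
        if kind == "disconnect":
            if down_since is None:
                down_since = timestamp
        elif kind == "connect" and down_since is not None:
            intervals.append((timestamp - down_since) // DAY)
            down_since = None
    return intervals


def fraction_below(values, x):
    """Empirical CDF: fraction of values strictly less than x."""
    return sum(1 for v in values if v < x) / len(values)


# Two made-up probes, for illustration only.
probe_a = [(0, "connect"), (10 * DAY, "disconnect"), (12 * DAY, "connect"),
           (20 * DAY, "disconnect"), (80 * DAY, "connect")]
probe_b = [(0, "connect"), (5 * DAY, "disconnect"), (6 * DAY, "connect")]

per_probe = [disconnect_intervals(p) for p in (probe_a, probe_b)]
medians = [median(iv) for iv in per_probe if iv]   # input for a figure 3a style CDF
maxima = [max(iv) for iv in per_probe if iv]       # input for a figure 3b style CDF
```

Evaluating `fraction_below(medians, x)` over a range of `x` gives one point of the median curve per value of `x`; doing the same with `maxima` gives the maximum-interval curve.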
Translating these results back to figure 2, the distribution of last disconnect times, we can expect at least 40% of the 1,047 probes that disconnected during the month of July to come back online in August. Depending on stability, they may again experience disconnections at a later time, but they are not fully lost for RIPE Atlas. The ones that disconnected in May or June are not beyond hope either. But the prospects for probes that disconnected in earlier months and years aren't too good, with the likelihood of one of these reconnecting being quite low. For this reason, we classify probes that have been disconnected for more than 90 days as "abandoned".
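The three-way classification used in the rest of this article can be expressed as a small rule. The sketch below only illustrates the 90-day threshold stated above; the function name, its arguments and the use of Unix timestamps are assumptions, and the way the RIPE Atlas back-end actually assigns the status may differ.

```python
DAY = 24 * 3600
ABANDONED_AFTER_DAYS = 90  # threshold stated in the text


def classify(is_connected, last_disconnect, now):
    """Return "connected", "disconnected" or "abandoned".

    `last_disconnect` and `now` are Unix timestamps (hypothetical inputs);
    `last_disconnect` is ignored for connected probes.
    """
    if is_connected:
        return "connected"
    days_down = (now - last_disconnect) / DAY
    if days_down > ABANDONED_AFTER_DAYS:
        return "abandoned"
    return "disconnected"


now = 200 * DAY  # arbitrary reference time for the examples
examples = [
    classify(True, None, now),
    classify(False, now - 10 * DAY, now),
    classify(False, now - 90 * DAY, now),   # exactly 90 days: not yet abandoned
    classify(False, now - 91 * DAY, now),
]
```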
Conclusion: Probes that disconnected in the past few months have a good chance of coming back online. The statistics do not point to different behaviours by different probe hardware versions for probes that experienced a temporary disconnection.
Probe status by hardware type
Table 1 below shows a breakdown of probe status vs. hardware type as of 31 July 2016. As expected, RIPE Atlas anchors score best, with 94% connected. Version 1 probes are second with 63% connected, followed by version 2 and version 3 probes with SanDisk USB sticks. Version 3 probes with Verbatim USB sticks fall behind a bit with just 51% connected, or 7 percentage points less than version 3 SanDisk probes. At the same time, version 3 Verbatim probes have, next to anchors, the lowest percentage of abandoned probes. At first sight, this may seem to be a contradiction, but keep in mind that the probes with Verbatim USB sticks are now the dominant type being distributed and distribution only started a little more than a year ago. The data in figure 2 shows that most of the disconnects by version 3 Verbatim probes happened in the last three months. Thus, for the majority of these, the verdict is still out. Not enough time has passed to see how many of the disconnected version 3 Verbatim probes have left RIPE Atlas for good and how many have disconnected only temporarily.
|Probe type||Connected||Disconnected||Abandoned|
|v1||63% (822)||5% (63)||32% (415)|
|v2||59% (1,453)||6% (141)||35% (863)|
|Anchor||94% (200)||3% (5)||3% (7)|
|v3 SanDisk||58% (5,603)||11% (1,105)||31% (2,949)|
|v3 Verbatim||51% (1,175)||27% (616)||23% (533)|
|Totals:||58% (9,253)||12% (1,930)||30% (4,767)|
Table 1: Status distribution per probe type
The relatively poor ratio of connected version 3 Verbatim probes also becomes visible when we add information about probe status (connected vs. disconnected/abandoned) to figure 1 (the distribution of first connection times). In figure 4 below, the main colours again show how many probes of each type connected at which time. For each type, the darker bars represent probes that are still connected, while the lighter bars show the number of probes that are disconnected or abandoned. We can see how, for all but the two most recent months, the ratio of not-connected to connected Verbatim probes is rather high, and higher than for SanDisk probes that were activated in the same months.
Figure 4: Distribution of first-connect time with probe status overlaid: light colours for disconnected probes and darker colours for connected probes
The discrepancy is even stronger if we look at the current status of the version 3 probes that first connected between 1 July 2015 and 1 January 2016. Of the 1,010 SanDisk USB probes that first connected in this period, 66% were still connected on 31 July 2016. Of the 840 Verbatim probes that first connected in the same period, only 39% were still connected on 31 July 2016.
|Probe type||Connected||Disconnected||Abandoned|
|v3 SanDisk||66% (670)||12% (120)||22% (220)|
|v3 Verbatim||39% (328)||20% (168)||41% (344)|
Table 2: Status of probes that connected in the second half of 2015
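As a quick check on the arithmetic, the percentages in table 2 follow directly from the probe counts. The helper below is purely illustrative (the name `status_shares` is an assumption); it divides each status count by the row total and rounds to whole percentages.

```python
def status_shares(connected, disconnected, abandoned):
    """Return the rounded percentage of probes in each status."""
    total = connected + disconnected + abandoned
    return tuple(round(100 * n / total)
                 for n in (connected, disconnected, abandoned))


# Counts taken from table 2.
sandisk = status_shares(670, 120, 220)    # 1,010 probes in total
verbatim = status_shares(328, 168, 344)   # 840 probes in total
```

With these inputs, `sandisk` is `(66, 12, 22)` and `verbatim` is `(39, 20, 41)`, matching the table.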
The percentage of probes that came online during these six months and have now been classified as "abandoned" (because of a long-lasting disconnected state) is also higher for version 3 Verbatim probes. This indicates that the issues with these probes, at least, are more serious than incidental, short-lived problems with network connectivity.
In the earlier article, "Further Analysis of RIPE Atlas Version 3 Probes", we found that between 35% and 40% of the version 3 probes that were connected to RIPE Atlas experienced problems with their USB sticks and recovered from these (with or without human intervention). Because that analysis split the probes into two categories (active/inactive) compared to the three statuses (connected/disconnected/abandoned) used in the RIPE Atlas probe API, the comparison did not reveal significant differences in failure rates between the two USB versions; many of the disconnected probes with Verbatim USB sticks were assumed to still be active because they hadn't been disconnected longer than two months. However, in the cumulative distribution of the time passed to first USB re-initialisation (figure 2 in the article), we did observe a steeper curve for Verbatim probes. Although the difference in probe age is a major factor here, it is still possible that Verbatim probes are more prone to filesystem corruption. This might be a reason why, in the present analysis, we see Verbatim probes disconnecting sooner.
Conclusion: Among probes that first connected in the second half of 2015, the percentage of disconnected or abandoned version 3 Verbatim probes is relatively high, and higher than for version 3 SanDisk probes that first connected in the same time period.
Availability of probes during their lifetime
Finally, we take a look at probe availability, meaning the fraction of time a probe has been connected to the RIPE Atlas infrastructure during its lifetime. From the first connection to the very last disconnection, the RIPE Atlas back-end keeps track of the total number of seconds each probe spent connected and was thus available for measurements. The cumulative distribution of total uptime for different probe hardware versions is shown in figure 5a.
Figure 5a and 5b: Cumulative distributions of absolute and relative probe uptimes
Because of the different time periods in which the probes were handed out, the distributions vary widely. Whereas probes with Verbatim USB sticks have been around for only about 400 days, the oldest version 1 probes started to come online more than 2,000 days ago. So for newer probes, a total uptime of 300 days, for example, represents a much higher average availability than for older probes.
To eliminate this age factor, we look at relative uptime - the fraction of time a probe was connected during its active life in RIPE Atlas. For probes still connected, this is the ratio between the measured uptime and the time elapsed since the first connection. For disconnected probes, it is the ratio between the measured uptime and the time between first connection and the last disconnection. The cumulative distribution function (CDF) of this is shown in figure 5b.
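The definition of relative uptime can be written down in a few lines. This is a minimal sketch, assuming all times are available in seconds as Unix timestamps; the function and argument names are hypothetical and not taken from the RIPE Atlas API.

```python
DAY = 24 * 3600


def relative_uptime(total_uptime, first_connected, is_connected,
                    last_disconnect, now):
    """Fraction of a probe's active life during which it was connected.

    Active life ends at `now` for connected probes and at the last
    disconnection for all others. Returns None if the active life is empty.
    """
    end = now if is_connected else last_disconnect
    active_life = end - first_connected
    if active_life <= 0:
        return None
    return total_uptime / active_life


# 300 days of uptime for a probe that first connected about 400 days ago
# versus one that first connected 2,000 days ago, both still connected.
young = relative_uptime(300 * DAY, 0, True, None, 400 * DAY)
old = relative_uptime(300 * DAY, 0, True, None, 2000 * DAY)

# A made-up disconnected probe: 45 days of uptime, last seen on day 50.
gone = relative_uptime(45 * DAY, 0, False, 50 * DAY, 100 * DAY)
```

In this example the young probe has a relative uptime of 0.75 and the old one 0.15, which illustrates why absolute uptime alone is misleading when comparing probe generations.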
As expected, the anchors, which are generally well connected and stable, perform best: 90% of them record an uptime of 98% or more during their active life. The distributions of the other probe types are similar. Between 72% and 79% of the version 1, version 2 and version 3 probes had an availability of at least 80%. The distribution for version 3 Verbatim probes is just slightly off with respect to the others - about 7% more have a relative uptime below 85%. This could be because of more frequent or longer-lasting temporary disconnections, but it could also be a remnant of the age factor; for probes that connected within the last few months, an occasional outage of a few days weighs much more heavily on relative uptime than it does for probes that connected more than a year ago.
Conclusion: During their entire lifetimes, before the final disconnection, the majority of probes maintain good availability. The version 3 Verbatim probes are not significantly different from other probes in this aspect.
From 1 March 2016 to 1 August 2016, about 1,250 probes joined RIPE Atlas by connecting to the infrastructure for the first time. However, this did not lead to expanded network coverage with more active probes overall. Compared to five months ago, the number of probes connected at a specific time of the day actually decreased by 1-2%.
To some extent, this stagnation may be caused by the latest probe hardware (version 3 probes with a Verbatim USB stick). Data suggest these probes disconnect and leave RIPE Atlas at an earlier stage in their lifetimes than other probe types. Combined with an already somewhat increased attrition rate of the other type of version 3 probes (those with a SanDisk USB stick), this leads to zero or even negative growth overall. New probes do not join RIPE Atlas quickly enough to compensate for those permanently disconnecting.
As filesystem corruption is the most common failure in probes that recovered and came back online, we are working to update the filesystem on the version 3 probes' USB sticks, as well as investigating options for making it easier for probe hosts to keep their probes online with an adaptable power supply. In the longer term, we have also started the process of investigating other hardware options for RIPE Atlas probes. We'll keep you updated about these developments. If you have any feedback, please sign in to your RIPE NCC Access account and leave a comment below, or send an email to the RIPE Atlas mailing list.