Six months have passed since we last reported on the status of RIPE Atlas probe connections. Time for an update.
Introduction
In August 2016 we published a detailed analysis of probe status and up-time broken down per hardware type. We reported how growth in the number of simultaneously connected RIPE Atlas probes had stagnated and how version-3 (v3) probes, equipped with a USB stick for storage, tended to be less connected; especially, at that time, the most recent ones with a Verbatim USB stick. A related analysis of USB failure modes found signs that Verbatim sticks could be more prone to file-system corruption. Although no firm conclusion could be made on this, due to the different lengths of time that probes with Verbatim USB had been deployed with respect to probes with SanDisk USB, we did switch to a different brand for the latest batch of v3 probes. Deployed since September 2016, most of these probes have a Philips USB stick; but due to logistics, about 10% have SanDisk USB storage. In addition, we published a troubleshooting guide that explains how users can fix problems with USB sticks themselves.
However, despite these efforts and the fact that in the past six months, between 31 July 2016 and 31 January 2017, 1,686 new probes and 43 new anchors joined RIPE Atlas, the number of simultaneously connected probes did not increase; it even went down some more, from 9,253 half a year ago to 9,138 now. Somehow, probes disconnect and leave RIPE Atlas at a rate that is identical to, and sometimes even faster than, the rate at which new probes come online. In this article, we aim to provide more insight on the situation and the observed connect/disconnect patterns.
Connects, Disconnects and Up-time
Figure 1 groups probes by the month they first connected and shows the present status. The total height of each bar presents the total number of probes that connected to the infrastructure during that month. Within each bar, the primary colours show how many probes of each type connected. The darker coloured segments represent probes that are still connected, while lighter colours represent probes which have disconnected. We can see that, with the exception of some highs and lows, the rate at which probes join has been hovering around 300 per month since Spring 2013. The graph also shows how probes connected in the last two months perform best; these groups have the highest percentage still connected. Looking at different hardware types, the graph suggests v3 probes with Verbatim USB have the lowest ratio connected. We'll return to this point later.
Figure 1: Probe first-connect times and present status: light colours for disconnected probes, darker colours for connected probes
In Figure 2, we group the probes that are no longer connected by the date they last disconnected from the RIPE Atlas infrastructure. As before, the height of each bar shows how many probes in total disconnected in a specific month, while the different colours indicate how many probes of each hardware type disconnected. The most recent month sees the highest number of disconnected probes. To some extent, this is normal and expected. In the previous analysis we found it is quite possible for probes to experience temporary outages of up to 30 days. Based on past behaviour of probes in the RIPE Atlas system, we expect a substantial part of the 1,151 probes that disconnected in January 2017 to reconnect in the coming weeks and months. In August 2016, this same "Date of last Disconnect" graph showed 1,050 probes had disconnected in the preceding month of July. However, six months later, the graph only shows 373 probes disconnected in July 2016. The other 677 did come back (but may have disconnected again at a later point in time).
Figure 2: Distribution of last disconnect time
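The July 2016 numbers quoted above can be turned into a simple return rate. The short Python sketch below only restates the arithmetic from the text; the function and variable names are illustrative and not part of any RIPE Atlas tooling.

```python
# Probes whose *last* disconnect fell in July 2016, as quoted in the text.
SEEN_IN_AUGUST_2016 = 1050   # count shown in the August 2016 graph
SEEN_IN_JANUARY_2017 = 373   # count shown for the same month six months later


def returned_share(before, after):
    """Return (count, fraction) of probes that reconnected at least once."""
    returned = before - after
    return returned, returned / before


returned, share = returned_share(SEEN_IN_AUGUST_2016, SEEN_IN_JANUARY_2017)
print(f"{returned} of {SEEN_IN_AUGUST_2016} probes came back ({share:.1%})")
```

Roughly two thirds of the probes counted as disconnected in July 2016 were seen again later, which is why the bar for the most recent month should be read with caution.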
In Figure 3 we look at the cumulative distribution function of relative up-time of all probes. This quantity is defined as the fraction of time a probe was connected to the infrastructure between the very first and very last time it was seen connected. As before, anchors are doing best, with virtually all of them recording an uptime of at least 80%. The latest batch of v3 probes is doing well too. Although these probes have not been in the field very long, it is encouraging to see that, thus far, they maintain a better than average uptime; temporary disconnects are (still) short. On the other hand, probes with Verbatim USB perform worst, even though they form the second youngest batch of deployed probes. 35% of these have an up-time of 80% or less. It signals that these probes experienced, relative to their entire lifetime, longer down time than the other types of hardware.
Figure 3: Cumulative distribution of relative probe uptime
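To make the definition of relative up-time concrete, here is a minimal Python sketch. It assumes each probe's history is available as a list of (connected, disconnected) timestamp pairs that do not overlap; that input format and the function names are assumptions made for illustration and do not describe how the RIPE NCC stores or processes this data.

```python
from datetime import datetime, timedelta


def relative_uptime(sessions):
    """Fraction of time connected between first connect and last disconnect.

    `sessions` is a list of (connected_at, disconnected_at) datetime pairs,
    assumed to be non-overlapping (hypothetical input format).
    """
    if not sessions:
        return 0.0
    first_seen = min(start for start, _ in sessions)
    last_seen = max(end for _, end in sessions)
    lifetime = last_seen - first_seen
    if lifetime <= timedelta(0):
        return 0.0
    connected = sum((end - start for start, end in sessions), timedelta(0))
    return connected / lifetime


def share_at_or_below(uptimes, threshold=0.8):
    """One point on the empirical CDF: share of probes with uptime <= threshold."""
    return sum(u <= threshold for u in uptimes) / len(uptimes)


# Made-up example: connected for 8 days, offline for 2, connected for 10 more.
example = [
    (datetime(2017, 1, 1), datetime(2017, 1, 9)),
    (datetime(2017, 1, 11), datetime(2017, 1, 21)),
]
uptime = relative_uptime(example)  # 18 connected days out of a 20-day lifetime
```

A statement such as "35% of these have an up-time of 80% or less" corresponds to `share_at_or_below(...)` returning 0.35 for the up-times of that hardware type.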
Overall State
Figure 4 summarises the present state of all probes. In this chart, the width of each bar is proportional to the number of probes of each hardware type that have connected to the RIPE Atlas infrastructure at least once. Inside each bar, the stacked colour segments represent the fraction of probes which are, respectively, connected, disconnected and abandoned (disconnected for more than 90 days). Of the three types of version-3 probes, the latest batch appears to be connected best, with only 11 out of 946 probes considered abandoned. However, statistics here are highly influenced by the relatively young age of these probes. We started distribution in September 2016 and, by definition, only those probes which disconnected before November 2016 will be counted in the 'abandoned' category.
Figure 4: RIPE Atlas probe status by hardware
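The three states used in Figure 4 can be expressed as a small classification rule. The sketch below is an illustration in Python: the 90-day threshold comes from the text, while the function signature, the choice to pass the last disconnect time explicitly, and the use of 31 January 2017 as the snapshot date are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

ABANDONED_AFTER = timedelta(days=90)  # threshold stated in the article


def probe_status(currently_connected: bool,
                 last_disconnect: Optional[datetime],
                 now: datetime) -> str:
    """Classify a probe as 'connected', 'disconnected' or 'abandoned'."""
    if currently_connected:
        return "connected"
    if last_disconnect is not None and now - last_disconnect > ABANDONED_AFTER:
        return "abandoned"
    return "disconnected"


snapshot = datetime(2017, 1, 31)       # assumed snapshot date
cutoff = snapshot - ABANDONED_AFTER    # disconnects before this count as abandoned
print(cutoff.date())
```

With this snapshot date the cutoff falls in early November 2016, which is why probes first distributed in September 2016 have had very little time to end up in the 'abandoned' category.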
Table 1 below shows the numbers behind the graph. Compared to July 2016, the biggest loss is with v3-SanDisk probes; the number of probes with status abandoned increased by 796, with the percentage of connected probes decreasing to 50%. The v3-Verbatim are, relatively speaking, least connected. The number of abandoned probes of this type increased by 644, leaving only 40% connected. Besides v3 probes, we also lost version 1 and 2 probes, which do not have USB storage. The number of abandoned v1 probes increased by 40, bringing the connected ratio down to 58%, while for v2 probes, we have 52 additional probes that are considered abandoned, which makes for a connected ratio of 57%.
Type | Connected | Disconnected | Abandoned |
---|---|---|---|
v1 | 758 | 92 | 455 |
v2 | 1,435 | 143 | 915 |
Anchor | 237 | 8 | 10 |
v3 SanDisk | 4,911 | 1,141 | 3,745 |
v3 Verbatim | 1,152 | 547 | 1,177 |
v3 Latest | 645 | 290 | 11 |
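The connected ratios discussed above follow directly from Table 1. As a quick check, the Python sketch below recomputes them from the table values; the dictionary layout and function name are illustrative choices only.

```python
# (connected, disconnected, abandoned) per hardware type, copied from Table 1.
TABLE_1 = {
    "v1": (758, 92, 455),
    "v2": (1435, 143, 915),
    "Anchor": (237, 8, 10),
    "v3 SanDisk": (4911, 1141, 3745),
    "v3 Verbatim": (1152, 547, 1177),
    "v3 Latest": (645, 290, 11),
}


def connected_ratio(connected, disconnected, abandoned):
    """Share of probes of one type that are currently connected."""
    return connected / (connected + disconnected + abandoned)


for hardware, counts in TABLE_1.items():
    print(f"{hardware:12s} {sum(counts):6d} probes, "
          f"{connected_ratio(*counts):.1%} connected")

total_connected = sum(connected for connected, _, _ in TABLE_1.values())
print("total connected:", total_connected)
```

The connected column adds up to the 9,138 simultaneously connected probes mentioned in the introduction. The percentages quoted in the text match these ratios when truncated to whole numbers (for example, v2 works out at about 57.6%).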
The precise reasons for probes disconnecting, and therefore leaving the RIPE Atlas infrastructure, are unclear. We have seen and received reports of probes failing with USB errors. However, it is also possible that probe hosts lose interest in RIPE Atlas and unplug the device. A third likely possibility would be a combination of the two: when a probe fails because of USB or other issues, a less motivated probe host might give up completely and stop making any effort to fix recoverable errors.
Trends Per Country
To get a feeling for the influence of probe hosts, we look at the status of probes with SanDisk USB split by country. Since most of these first connected to RIPE Atlas more than a year ago, the age factor plays less of a role here. Also, to get statistically meaningful results, we only consider countries which received and connected at least 100 probes. The results are shown in Figure 5. As in Figure 4, the width of the bars is proportional to the number of probes in each country while the different colours represent the fraction of connected, disconnected and abandoned probes in each country.
Among the larger countries, Germany and the United States have more than the average 50% connected; France and the United Kingdom, on the other hand, record below-average connection rates. Quite notable is the relatively poor performance of SanDisk probes in Iran; only 29 out of 122 probes are connected, with 71 abandoned. Japan, on the other hand, records a connected rate of 69%, with 78 out of 113 connected. With over a hundred probes in each country, it is unlikely that hardware reliability alone is causing these differences.
Figure 5: RIPE Atlas probe status by country
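One way to check that the gap between Iran and Japan is larger than sampling noise would explain is a two-proportion z-test. This test is not part of the original analysis; the Python sketch below is an illustration added here, using only the counts quoted in the text and a normal approximation.

```python
import math


def two_proportion_z(success_a, total_a, success_b, total_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return z, p_value


# Connected SanDisk probes: Iran 29 of 122, Japan 78 of 113 (from the text).
z, p_value = two_proportion_z(29, 122, 78, 113)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A very small p-value only says the difference is unlikely to be chance. It does not say why the two countries differ; host motivation, power and network conditions, or shipping and handling could all contribute.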
Other Aspects
It's useful to note that each and every RIPE Atlas probe goes through the same initialisation and verification process within the RIPE NCC. We're therefore confident that all of them "leave the building" in an operationally good condition. Once they get connected, some are much more stable than we originally thought they would be. For instance, as of the time of publishing this article, more than 160 probes have been connected for at least three months, and some have been up and running for more than six months continuously already. This reinforces the suggestion, made in the previous section, that there are more factors at play than simple flash memory wear-out or file system corruption.
Conclusion
In the past six months, the RIPE Atlas network, as measured by the number of simultaneously connected probes, did not grow. Although the new probes are distributed and making new connections at rates similar to those that came before them, this rate is now only enough to replace the number of probes that leave RIPE Atlas, disconnect, and are not seen coming back. The reasons for lasting disconnects can be hardware errors (specifically issues with USB sticks), but the interest and ability of hosts in addressing issues with the probes may also play a role. This is hinted at, for example, by the striking difference between connected probe ratios in Iran and Japan.
To decrease the chances of probes going offline because of USB issues, we will soon launch new firmware for the v3 probes which makes them perform fewer writes to the USB storage. This is expected to increase the "mean time between failures" (MTBF), even though it cannot really help if the cause of the observed flakiness is not related to the use of the USB drive. Looking ahead, we're exploring other hardware options that do not involve USB at all.
Next to addressing possible reasons for probes leaving RIPE Atlas, we also want to investigate the success rate of probes coming online in the first place. When probes are shipped to end users, how long does it take for them to get installed and make a first connection to the RIPE Atlas infrastructure? How many are not activated at all? Are they lost in transit or forgotten by the probe host or, perhaps, "repurposed"? This analysis is not straightforward, because probes that have not been activated can be at various places: they can be in the RIPE NCC's inventory, in the inventory of the agent who helps us ship and distribute probes worldwide, in the inventory of RIPE Atlas ambassadors, in transit, or resting in a future host's bag.
Last but not least we want to thank all probe and anchor hosts for keeping the devices connected as much as possible. These efforts are crucial to keep the RIPE Atlas network strong and useful.
Comments
John Klensin •
Let me suggest one other hypothesis based on experience. There appear to be two types of USB storage failures: ones that can be overcome by the unplug, remove USB stick, plug in, reinsert USB stick procedure, and ones in which the USB stick itself gets trashed so that it can no longer be formatted, etc. The former issue is just an annoyance, but the annoyance is cumulative: if the probe disconnects once, whoever is hosting it is likely to go to the trouble to get it back up. After several times, the motivation to bring it back up and do it quickly may diminish. The latter issue is more serious, at least for those of us who do not keep an inventory of USB sticks in the appropriate size on hand, because the process is to spend a fair amount of time determining that resetting / reloading the USB stick won't work, finding time to go to the store or order a USB stick and wait for it to arrive, and then set the probe up again. Once sure, twice maybe, but, after running through three or four USB sticks, I'd guess that the odds of a probe being reconnected go down rapidly. Suggestion: provide a way for people to record when they have applied the "corrupt file system" fix, when they have replaced a USB stick, and what size and brand they have replaced it with. With respect to the latter, the data in the article probably don't show what USB stick is in a connected probe, only what kind of stick you shipped it with.
Mirjam Kühne •
Hi John, Thanks for your constructive thoughts and your suggestion. We keep track of failure rates for various versions of RIPE Atlas probes to a certain extent. You can see some of this analysis in this earlier article on RIPE Labs: https://labs.ripe.net/Members/philip_homburg/further-analysis-of-ripe-atlas-version-3-probe. Ultimately the best solution might be to look at completely different hardware. This is on our roadmap but will take some time to investigate.
Owen DeLong •
I'm interested in addressing the issues that have come up on my V1 probe, but it's unclear how to go about doing so.
Mirjam Kühne •
Hi Owen, Can you please send this as a ticket to atlas@ripe.net so we can respond in more detail? Please also mention the 12 characters printed under the MAC address of the probe. Thank you.
Vlad Studenichnikov •
Very bad quality of Verbatim sticks and probe firmware! Too many filesystem crashes. And now, on one of my two probes, the stick is read-only... I use my personal flash disk... Too bad for equipment that I gave free shelter to...
Alun Davies •
Hi Vlad, We understand your frustration and we've been working hard to overcome the issues with RIPE Atlas probe USB sticks. As mentioned in the article, we recently launched a new firmware update that seems to have gone some way towards resolving the issue (more on this here: https://labs.ripe.net/Members/kistel/ripe-atlas-countering-hardware-issues-with-better-firmware). That said, we're aware that this is not a complete fix, and we really do appreciate the efforts you're making to keep your probes connected.
Geert Jan de Groot •
Has the RIPE NCC ever tried to contact the manufacturers of the USB sticks we are harping on? There could be reasons why the USB sticks behave this way, and I think it is only fair to ask for the other side of the story. For instance, a USB stick consists of a flash storage array and a controller that maps the storage pages to disk sectors to allow wear levelling. An encrypted filesystem may play havoc with the controller. Has anybody ever tested the endurance of an encrypted filesystem versus a non-encrypted filesystem containing encrypted files?