We continue to look at failure rates for RIPE Atlas version 3 probes and the possible causes.
In recent weeks we've been contacted by a number of RIPE Atlas hosts who've had problems with their RIPE Atlas probes and suspected that something was wrong with the USB sticks in the version 3 (v3) probes. We started investigating the issue and published some initial findings on RIPE Labs last month.
Initially there seemed to be a potential issue with some of the v3 probes' USB sticks. We use the sticks to store both the operating system and the measurement data on the probes. (Note that we are looking into future hardware solutions that don't rely on USB sticks for local storage.)
We've done some further analysis to try to determine where the problem actually lies. In fact, it doesn't seem to be a hardware problem with the USB sticks. Figure 1 below provides a visual impression of probe failure related to USB sticks.
Figure 1: RIPE Atlas probe failure rates related to USB sticks
The horizontal axis shows the probe ID, which ranges from 10,000 upwards. The first and second generation probes and RIPE Atlas anchors are ignored; we're only looking at version three probes for this analysis.
The vertical axis shows the time, in days, between the time the probe first connected to a controller and the time of the first failure. Along the x-axis you can see whether a probe was originally shipped with a SanDisk (red) or a Verbatim (blue) USB stick.
A few features stand out:
- The vast majority of failures are USB re-initialisations. These can be detected when a probe goes from running from the USB stick to running from the built-in flash and then back to running from the USB stick again with the same firmware version. We'll come back to this later.
- On the left side of the graph you can see a lot of white compared to the right side. The reason is that unfortunately some old log files were lost. RIPE Atlas probes with clearly missing data were left out of the analysis.
- The red solid squares show an unusual failure mode of the SanDisk USB sticks: in this case, they seem to have been reset completely, which resulted in losing both the serial number and information about their capacity.
- The blue open squares illustrate probes that have had their USB sticks replaced. This can be detected by comparing the manufacturer, product name, and serial number of the USB stick as reported by the probe. Just below probe ID 16,000, there is a relatively large concentration of these events with negative failure times. After initialising the probes with SanDisk USB sticks, we re-initialised them with Verbatim sticks. All of this happened before the probe was first recorded as connected (at time zero).
- Finally, there are some yellow dots that illustrate probes with USB sticks that became read-only. This is likely a signal from the USB stick that it is worn out to the point where it can no longer be written to, although it can still be read.
We can also see that failures seem to follow a downward slope as the probe ID increases. Probes with higher probe IDs were distributed more recently, so we can only have data about failures that happened relatively soon after they first connected. Probes with lower probe IDs have been in the field longer and could therefore have experienced their first failure after a longer period of time.
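The two detection rules described in the list above (re-initialisation and USB stick replacement) can be sketched in code. This is only an illustration of the logic, not the analysis code we actually used: the `BootRecord` type, its field names, the `"usb"`/`"flash"` labels and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BootRecord:
    """One hypothetical log entry describing how a probe was running."""
    boot_source: str                          # "usb" or "flash" (assumed labels)
    firmware: str                             # firmware version string
    usb_id: Optional[Tuple[str, str, str]]    # (manufacturer, product, serial)


def find_reinitialisations(records: List[BootRecord]) -> List[int]:
    """Indexes where a usb -> flash -> usb sequence completes
    with the same firmware version before and after."""
    hits = []
    last_usb = None      # most recent record that ran from the USB stick
    seen_flash = False   # has the probe run from flash since then?
    for i, rec in enumerate(records):
        if rec.boot_source == "flash":
            if last_usb is not None:
                seen_flash = True
        elif rec.boot_source == "usb":
            if seen_flash and last_usb.firmware == rec.firmware:
                hits.append(i)
            last_usb = rec
            seen_flash = False
    return hits


def find_usb_changes(records: List[BootRecord]) -> List[int]:
    """Indexes where the reported USB stick identity differs
    from the previously reported one."""
    hits = []
    previous = None
    for i, rec in enumerate(records):
        if rec.usb_id is None:
            continue
        if previous is not None and rec.usb_id != previous:
            hits.append(i)
        previous = rec.usb_id
    return hits


# Made-up example log for a single probe
log = [
    BootRecord("usb", "fw-1", ("SanDisk", "model-x", "A1")),
    BootRecord("flash", "fw-1", None),
    BootRecord("usb", "fw-1", ("SanDisk", "model-x", "A1")),
    BootRecord("flash", "fw-1", None),
    BootRecord("usb", "fw-1", ("Verbatim", "model-y", "B7")),
]
print(find_reinitialisations(log))  # [2, 4]
print(find_usb_changes(log))        # [4]
```

In this made-up log the last record counts both as a re-initialisation and as a USB change, since the stick identity differs from the one reported earlier.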
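The slope follows from a simple bound: a first failure can only show up in the data if it happened before the data was collected. The sketch below illustrates this with made-up dates; the article does not give deployment or analysis dates, so all dates here are assumptions.

```python
from datetime import date


def max_observable_days(first_connected: date, analysis_date: date) -> int:
    """Upper bound on the time to first failure that can appear in the data."""
    return (analysis_date - first_connected).days


analysis = date(2016, 6, 1)  # assumed analysis date, for illustration only

# A probe with a low ID (deployed early) versus one with a high ID (deployed late)
older_probe = max_observable_days(date(2014, 1, 1), analysis)
newer_probe = max_observable_days(date(2016, 1, 1), analysis)
print(older_probe, newer_probe)
```

The newer probe cannot have a recorded time to first failure longer than its own time in the field, which is what produces the downward edge in the figure.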
Table 1 shows some statistics on this.
|Probe category||Number of probes|
|Total number of probes||18,595|
|Never connected probes||7,012|
|Inactive probes (not active since 1 May 2016)||3,798|
|Lost initial registration (active)||1,520|
|Lost initial registration (not active)||902|
|Lost initial registration (never connected)||645|
Table 1: Types of failure
Of the 18,595 version 3 probes in total, 7,012 never connected to a controller and 3,798 have not connected since 1 May 2016. A portion of the probes that never connected have yet to be shipped or are in the hands of ambassadors who have yet to hand them out.
As mentioned above, some of the log files have been lost. So for 3,067 probes, it is not clear with what USB stick they were initialised. Of these, 1,520 are still active, 902 are not, and 645 never connected.
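As a quick sanity check, the figures from Table 1 can be added up and expressed as shares of all version 3 probes. The numbers are taken from the table; the variable and function names are just for illustration.

```python
total_v3 = 18_595
never_connected = 7_012
inactive = 3_798

# Probes for which the initial registration (and so the USB stick brand) was lost
lost_active, lost_inactive, lost_never = 1_520, 902, 645
lost_total = lost_active + lost_inactive + lost_never


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print(lost_total)                         # 3067
print(share(never_connected, total_v3))   # share that never connected
print(share(inactive, total_v3))          # share not active since 1 May 2016
print(share(lost_total, total_v3))        # share with unknown USB stick brand
```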
Table 2 gives a breakdown by USB stick brand (SanDisk vs Verbatim) of how probes failed and the probe status for each type of failure.
|Type of failure||SanDisk: number||SanDisk: percentage||SanDisk: percentage (excluding never connected)||Verbatim: number||Verbatim: percentage||Verbatim: percentage (excluding never connected)|
|read-only (active)||17||0.2 %||0.2 %||30||0.6 %||1.5 %|
|read-only (inactive)||33||0.3 %||0.5 %||32||0.7 %||1.6 %|
|read-only (never connected)||0||0.0 %||n/a||1||0.0 %||n/a|
|broken (active)||104||1.0 %||1.5 %||0||0.0 %||0.0 %|
|broken (inactive)||69||0.6 %||1.0 %||0||0.0 %||0.0 %|
|broken (never connected)||5||0.0 %||n/a||0||0.0 %||n/a|
|USB changed (active)||128||1.2 %||1.8 %||103||2.2 %||5.2 %|
|USB changed (inactive)||137||1.3 %||1.9 %||11||0.2 %||0.6 %|
|USB changed (never connected)||97||0.9 %||n/a||6||0.1 %||n/a|
|re-initialised (active)||1,877||17.4 %||26.2 %||406||8.6 %||20.7 %|
|re-initialised (inactive)||697||6.5 %||9.7 %||109||2.3 %||5.5 %|
|re-initialised (never connected)||5||0.0 %||n/a||14||0.3 %||n/a|
|no failure (active)||2,786||25.9 %||38.9 %||793||16.8 %||40.4 %|
|no failure (inactive)||1,320||12.3 %||18.4 %||481||10.2 %||24.5 %|
|no failure (never connected)||3,492||32.4 %||n/a||2,726||57.9 %||n/a|
|total (active)||4,912||45.6 %||68.5 %||1,332||28.3 %||67.8 %|
|total (inactive)||2,256||21.0 %||31.5 %||633||13.4 %||32.2 %|
|total (never connected)||3,599||33.4 %||n/a||2,747||58.3 %||n/a|
Table 2: Types of failure for RIPE Atlas probes with SanDisk and Verbatim USB sticks
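The two percentage columns appear to use different denominators: all probes of that brand, and only the probes of that brand that connected at least once. The sketch below reproduces two rows under that assumption, using the totals rows of Table 2; the function and variable names are illustrative only.

```python
def percentages(count, active, inactive, never_connected):
    """Return (percentage of all probes, percentage excluding never connected)."""
    total = active + inactive + never_connected
    connected = active + inactive
    return (round(100 * count / total, 1), round(100 * count / connected, 1))


# Totals per brand, taken from the last three rows of Table 2
sandisk = dict(active=4912, inactive=2256, never_connected=3599)
verbatim = dict(active=1332, inactive=633, never_connected=2747)

# Row "re-initialised (active)"
print(percentages(1877, **sandisk))   # (17.4, 26.2)
print(percentages(406, **verbatim))   # (8.6, 20.7)
```

Both results match the corresponding cells in the table, which supports the assumed denominators.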
In general, hardware failures occur in only a few cases. These include USB sticks becoming read-only, the special broken mode of the SanDisk sticks, and USB sticks being replaced (presumably because the old one was broken).
The largest group of failures is with USB sticks that were re-initialised. This is typically not a hardware failure, but is caused by a corrupt filesystem, which can in turn be caused by, for example, a power failure. The good news is that when this happens, the probes typically remain active and easily recover with a successful re-initialisation.
If we look at the probes for which no failure is recorded, a large percentage never connected. Another large group is still active and never experienced a failure. Finally, a significant group never reported a failure, but are also no longer active. It is quite possible that those probes would benefit from .
Figure 2 shows the cumulative distribution functions of the time to re-initialisation, broken down by USB stick brand.
Two things stand out:
- The SanDisk curve (purple) is roughly exponential. For reference I included a plot of an exponential curve (light blue).
- The Verbatim curve (green) is much steeper than that of SanDisk.
The exponential nature of the SanDisk curve means that during each time interval, a probe has the same chance that the USB stick will be re-initialised. In other words, there is no significant time-related aspect to the filesystem corruption.
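To make the "same chance in each time interval" property concrete, the sketch below computes, for an exponential distribution, the probability that a stick is re-initialised in a 30-day window given that it has not been re-initialised before the window starts. The rate used is an arbitrary illustrative value, not one fitted to our data.

```python
import math


def exp_cdf(t: float, rate: float) -> float:
    """P(first re-initialisation happens within t days) for a constant rate."""
    return 1.0 - math.exp(-rate * t)


def conditional_failure_prob(start: float, width: float, rate: float) -> float:
    """P(re-initialisation in [start, start + width) | none before start)."""
    survived = 1.0 - exp_cdf(start, rate)
    fails_in_window = exp_cdf(start + width, rate) - exp_cdf(start, rate)
    return fails_in_window / survived


rate = 0.002  # per day; arbitrary value for illustration only
for start in (0, 100, 500):
    print(start, round(conditional_failure_prob(start, 30, rate), 6))
```

All three windows give the same probability, regardless of how long the probe has already been running. That is the sense in which an exponential curve indicates no time-related aspect.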
The fact that the Verbatim curve is much steeper suggests that the Verbatim USB sticks are more prone to failure. However, the Verbatim USB sticks have been deployed for a much shorter period of time than the SanDisk sticks. At this time, it is therefore too early to draw conclusions about the relative performance of the SanDisk and Verbatim USB sticks when it comes to filesystem corruption.
Effect of power supply
Finally, there is a measurable difference in how the SanDisk and Verbatim USB sticks respond to low voltage. A USB port is supposed to deliver five volts, but what if it doesn't?
With a SanDisk USB stick, the probe still works fine at 4.3V and draws 230mA at most to initialise the USB stick. The Verbatim USB sticks can operate on a lower voltage. At 3.5V, the Verbatim still works and initialisation requires at most 280mA. Without a USB stick, the TP-Link device runs at 3.2V and draws at most 230mA.
One of the reasons we switched from SanDisk to Verbatim was that, in some cases, the probe just didn't see the USB stick. It is possible that marginal USB power supplies drop below the 4.3V required by the SanDisk sticks.
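The measurements above can be summarised as a small lookup. Note the assumption: the voltages are the lowest values at which each configuration was reported to still work, and the sketch treats them as hard thresholds, which is a simplification.

```python
# Lowest supply voltage (in volts) at which each configuration was
# reported to still work; treated here as a threshold (an assumption).
MIN_WORKING_VOLTS = {
    "TP-Link without USB stick": 3.2,
    "Verbatim": 3.5,
    "SanDisk": 4.3,
}


def configurations_working_at(volts: float) -> list:
    """Configurations expected to work at the given supply voltage."""
    return [name for name, minimum in MIN_WORKING_VOLTS.items() if volts >= minimum]


print(configurations_working_at(5.0))  # all three configurations
print(configurations_working_at(4.0))  # ['TP-Link without USB stick', 'Verbatim']
```

A marginal supply that sags to 4.0V would, under this simplification, still run a probe with a Verbatim stick but not one with a SanDisk stick.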
Playing with different voltage settings and limits did not cause filesystem corruption.
Conclusion
Actual hardware failures for version 3 RIPE Atlas probes are relatively rare compared to filesystem corruption. Filesystem corruption seems to occur randomly. There doesn't seem to be a big difference in hardware failure rates for SanDisk or Verbatim USB sticks, but the Verbatim USB sticks are less sensitive to the USB power supply voltage.
Going forward, we plan to switch from the Linux ext2 filesystem to ext4 as a first measure. There are other possible solutions available as well should that not work, and we will keep you updated as things progress.
Each generation of probes will of course experience its own quirks and may require some maintenance from time to time, but none that should impact RIPE Atlas as a whole. It's important to note that, in many cases, it's possible to overcome the filesystem corruption issue for version 3 probes by . We're also looking into potential future hardware solutions that will overcome these particular issues, and we'd like to thank all our probe hosts for keeping their probes online and helping to make the RIPE Atlas network as strong as possible.
We'll also continue to look into these issues and publish new findings here on RIPE Labs - stay tuned!