Philip Homburg

Further Analysis of RIPE Atlas Version 3 Probes



We continue to look at failure rates for RIPE Atlas version 3 probes and the possible causes.


Background

In recent weeks we've been contacted by a number of RIPE Atlas hosts who've had problems with their RIPE Atlas probes and suspected that something was wrong with the USB sticks in the version 3 (v3) probes. We started investigating the issue and published some initial findings on RIPE Labs last month.

Initially there seemed to be a potential issue with some of the v3 probes' USB sticks. We use the sticks to store both the operating system and the measurement data on the probes. (Note that we are looking into future hardware solutions that don't rely on USB sticks for local storage.)

Analysis

We've done some further analysis to try to determine where the problem actually lies. In fact, it doesn't seem to be a hardware problem with the USB sticks. Figure 1 below provides a visual impression of probe failure related to USB sticks.

Figure 1: RIPE Atlas probe failure rates related to USB sticks

The horizontal axis shows the probe ID, which ranges from 10,000 upwards. The first and second generation probes and RIPE Atlas anchors are ignored; we're only looking at version three probes for this analysis.

The vertical axis shows the number of days between the probe's first connection to a controller and its first failure. Along the x-axis you can see whether a probe was originally shipped with a SanDisk (red) or a Verbatim (blue) USB stick.

A few features stand out:

  • The vast majority of failures are USB re-initialisations. These can be detected when a probe goes from running from the USB stick to running from the built-in flash and then back to running from the USB stick again with the same firmware version. We'll come back to this later.
  • On the left side of the graph you can see a lot of white compared to the right side. The reason is that unfortunately some old log files were lost. RIPE Atlas probes with clearly missing data were left out of the analysis.
  • The red solid squares mark an unusual failure mode of the SanDisk USB sticks: the sticks seem to have been reset completely, losing both the serial number and the information about their capacity.
  • The blue open squares illustrate probes that have had their USB sticks replaced. This can be detected by comparing the manufacturer, product name, and serial number of the USB stick as reported by the probe. Just below probe ID 16,000, there is a relatively large concentration of these events with negative failure times. After initialising these probes with SanDisk USB sticks, we re-initialised them with Verbatim sticks. All of this happened before the probe was first recorded as connected (at time zero).
  • Finally, there are some yellow dots that illustrate probes with USB sticks that became read-only. This is likely a signal from the USB stick that it is broken to the point where it can no longer be written to, although it can still be read.
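
The detection rules described in the list above can be sketched in code. This is a minimal illustration, not the analysis code used for Figure 1: the `Report` record and its field names (`boot_source`, `firmware`, `usb_id`, `read_only`) are hypothetical stand-ins for whatever the probes actually report, and the SanDisk "reset" mode is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Hypothetical status report; all field names are assumptions for illustration.
@dataclass
class Report:
    boot_source: str                          # "usb" or "flash"
    firmware: int                             # firmware version
    usb_id: Optional[Tuple[str, str, str]]    # (manufacturer, product, serial)
    read_only: bool = False                   # stick reported as read-only


def classify(reports: List[Report]) -> List[str]:
    """Return the failure events found in one probe's chronological reports."""
    events = []
    last_usb = None       # most recent report that was running from USB
    seen_flash = False    # ran from built-in flash since last_usb?
    for r in reports:
        if r.boot_source == "flash":
            seen_flash = True
            continue
        # From here on the probe is running from the USB stick.
        if r.read_only and (last_usb is None or not last_usb.read_only):
            events.append("read-only")
        if last_usb is not None:
            if r.usb_id != last_usb.usb_id:
                # Different manufacturer, product name or serial number.
                events.append("USB changed")
            elif seen_flash and r.firmware == last_usb.firmware:
                # USB -> flash -> USB with the same firmware version.
                events.append("re-initialised")
        last_usb = r
        seen_flash = False
    return events


# Made-up identifiers and firmware numbers, for illustration only.
stick = ("SanDisk", "example-product", "serial-1")
history = [
    Report("usb", 4700, stick),
    Report("flash", 4700, None),
    Report("usb", 4700, stick),
]
print(classify(history))   # ['re-initialised']
```

Comparing the firmware version is what separates a re-initialisation from a normal firmware upgrade in this sketch: a detour via the built-in flash that ends on a different firmware version is not counted as a failure.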

We can also see that failures seem to follow a downward slope across the graph. Probes with higher probe IDs were distributed more recently, so any failure we have recorded for them must have happened relatively soon after their first connection. Probes with lower probe IDs have been in the field longer and could therefore have experienced their first failure after a longer period of time.
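
The slope comes down to date arithmetic: the time to first failure that we can observe for a probe is bounded by how long the probe has been deployed. A minimal sketch, in which the function name and all dates are made up for illustration:

```python
from datetime import date


def max_observable_days(first_connected: date, analysis_date: date) -> int:
    """Upper bound on the time to first failure that the data can show."""
    return (analysis_date - first_connected).days


# All dates below are assumptions, not taken from the probe data.
analysis = date(2016, 9, 1)
low_id_probe = date(2013, 9, 1)     # connected early, long observation window
high_id_probe = date(2016, 3, 1)    # connected recently, short window

print(max_observable_days(low_id_probe, analysis))
print(max_observable_days(high_id_probe, analysis))
```

Any point plotted for the second probe has to lie below its much smaller bound, which is what produces the downward edge towards higher probe IDs.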

Table 1 shows some statistics on this.

Type of failure                                   Number of probes
Total number of probes                                      18,595
Never connected probes                                       7,012
Inactive probes (not active since 1 May 2016)                3,798
Lost initial registration (active)                           1,520
Lost initial registration (not active)                         902
Lost initial registration (never connected)                    645

  Table 1: Types of failure

Of the 18,595 version 3 probes in total, 7,012 never connected to a controller and 3,798 have not connected since 1 May 2016. A portion of the probes that never connected have yet to be shipped or are in the hands of ambassadors who have yet to hand them out.

As mentioned above, some of the log files have been lost. So for 3,067 probes, it is not clear with what USB stick they were initialised. Of these, 1,520 are still active, 902 are not, and 645 never connected.

Table 2 gives a breakdown by USB stick brand (SanDisk vs Verbatim) of how probes failed and the probe status for each type of failure.

                                 SanDisk                              Verbatim
Type of failure                   Number  Percentage  Percentage excl.  Number  Percentage  Percentage excl.
                                                       never connected                       never connected
read-only (active)                    17       0.2 %             0.2 %      30       0.6 %             1.5 %
read-only (inactive)                  33       0.3 %             0.5 %      32       0.7 %             1.6 %
read-only (never connected)            0       0.0 %                         1       0.0 %
broken (active)                      104       1.0 %             1.5 %       0       0.0 %             0.0 %
broken (inactive)                     69       0.6 %             1.0 %       0       0.0 %             0.0 %
broken (never connected)               5       0.0 %                         0       0.0 %
USB changed (active)                 128       1.2 %             1.8 %     103       2.2 %             5.2 %
USB changed (inactive)               137       1.3 %             1.9 %      11       0.2 %             0.6 %
USB changed (never connected)         97       0.9 %                         6       0.1 %
re-initialised (active)            1,877      17.4 %            26.2 %     406       8.6 %            20.7 %
re-initialised (inactive)            697       6.5 %             9.7 %     109       2.3 %             5.5 %
re-initialised (never connected)       5       0.0 %                        14       0.3 %
no failure (active)                2,786      25.9 %            38.9 %     793      16.8 %            40.4 %
no failure (inactive)              1,320      12.3 %            18.4 %     481      10.2 %            24.5 %
no failure (never connected)       3,492      32.4 %                     2,726      57.9 %
total (active)                     4,912      45.6 %            68.5 %   1,332      28.3 %            67.8 %
total (inactive)                   2,256      21.0 %            31.5 %     633      13.4 %            32.2 %
total (never connected)            3,599      33.4 %                     2,747      58.3 %
Total                             10,767                                 4,712

  Table 2: Types of failure for RIPE Atlas probes with SanDisk and Verbatim USB sticks
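
The two percentage columns in Table 2 differ only in their denominator: the first divides by all probes shipped with that brand of stick, the second by the probes of that brand that connected at least once. A small sketch of the arithmetic, using the "re-initialised (active)" rows from the table (the function name is ours):

```python
def percentages(count: int, total: int, never_connected: int):
    """Return (share of all probes, share of probes that ever connected), in percent."""
    of_all = 100.0 * count / total
    of_connected = 100.0 * count / (total - never_connected)
    return round(of_all, 1), round(of_connected, 1)


# SanDisk: 1,877 re-initialised (active) out of 10,767; 3,599 never connected
print(percentages(1877, 10767, 3599))   # (17.4, 26.2)

# Verbatim: 406 re-initialised (active) out of 4,712; 2,747 never connected
print(percentages(406, 4712, 2747))     # (8.6, 20.7)
```

Because a much larger share of the Verbatim probes never connected, the two columns diverge more for Verbatim than for SanDisk.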

In general, hardware failures (USB sticks becoming read-only, the special broken mode of SanDisk sticks, and USB sticks being replaced, presumably because the old one was broken) occur in only a few cases.

The largest group of failures is with USB sticks that were re-initialised. This is typically not a hardware failure, but is caused by a corrupt filesystem, which can result from, for example, a power failure. The good news is that when this happens, the probes typically remain active and easily recover with a successful re-initialisation.

If we look at the probes for which no failure is recorded, a large percentage never connected. Another large group is still active and never experienced a failure. Finally, a significant group never reported a failure, but is also no longer active. It is quite possible that those probes would benefit from re-initialising the USB stick.

Figure 2 shows the cumulative distribution functions, by USB stick brand, of the time until a USB stick is re-initialised.

Figure 2: Cumulative distribution of the time to re-initialisation, by USB stick brand

Two things stand out:

  • The SanDisk curve (purple) is roughly exponential. For reference I included a plot of an exponential curve (light blue).
  • The Verbatim curve (green) is much steeper than that of SanDisk.

The exponential nature of the SanDisk curve means that during each time interval, a probe has the same chance that the USB stick will be re-initialised. In other words, there is no significant time-related aspect to the filesystem corruption.
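
This property can be checked numerically. For an exponential distribution with CDF F(t) = 1 - exp(-rate * t), the probability that a stick that has survived until time t is re-initialised in the next interval does not depend on t. The sketch below uses an arbitrary rate and a 30-day interval; neither is fitted to the data in Figure 2.

```python
import math


def cdf(t: float, rate: float) -> float:
    """Exponential CDF: probability of a re-initialisation by time t."""
    return 1.0 - math.exp(-rate * t)


def conditional_failure(t: float, dt: float, rate: float) -> float:
    """P(re-initialised in (t, t + dt] | no re-initialisation up to t)."""
    return (cdf(t + dt, rate) - cdf(t, rate)) / (1.0 - cdf(t, rate))


rate = 1 / 500   # arbitrary illustrative rate per day, not taken from the data
for t in (0, 100, 400, 1000):
    # The same value is printed for every t (up to rounding).
    print(t, round(conditional_failure(t, 30, rate), 4))
```

A distribution whose conditional failure probability grew with t would instead point to wear-out, and one where it shrank would point to early-life failures.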

The fact that the Verbatim curve is much steeper suggests that the Verbatim USB sticks are more prone to failure. However, the Verbatim USB sticks have been deployed for a much shorter period of time than the SanDisk sticks. At this time, it is therefore too early to draw conclusions about the relative performance of the SanDisk and Verbatim USB sticks when it comes to filesystem corruption.

Effect of power supply

Finally, there is a measurable difference in how the SanDisk and Verbatim USB sticks respond to low voltage. A USB port is supposed to deliver five volts, but what if it doesn't?

With a SanDisk USB stick, the probe still works fine at 4.3V and draws 230mA at most to initialise the USB stick. The Verbatim USB sticks can operate on a lower voltage. At 3.5V, the Verbatim still works and initialisation requires at most 280mA. Without a USB stick, the TP-Link device runs at 3.2V and draws at most 230mA.

One of the reasons we switched from SanDisk to Verbatim was that, in some cases, the probe just didn't see the USB stick. It is possible that marginal USB power supplies drop below the 4.3V required by the SanDisk sticks.

Playing with different voltage settings and limits did not cause filesystem corruption.

Conclusion

Actual hardware failures for version 3 RIPE Atlas probes are relatively rare compared to filesystem corruption. Filesystem corruption seems to occur randomly. There doesn't seem to be a big difference in hardware failure rates between SanDisk and Verbatim USB sticks, but the Verbatim USB sticks are less sensitive to the USB power supply voltage.

Going forward, we plan to switch from the Linux ext2 filesystem to ext4 as a first measure. There are other possible solutions available as well should that not work, and we will keep you updated as things progress.

Moving forward

Each generation of probes will of course experience its own quirks and may require some maintenance from time to time, but none that should impact RIPE Atlas as a whole. It's important to note that, in many cases, it's possible to overcome the filesystem corruption issue for version 3 probes by re-initialising the USB stick. We're also looking into potential future hardware solutions that will overcome these particular issues, and we'd like to thank all our probe hosts for keeping their probes online and helping to make the RIPE Atlas network as strong as possible.

We'll also continue to look into these issues and publish new findings here on RIPE Labs - stay tuned!


About the author

Philip Homburg worked on the measurement code and other firmware behind RIPE Atlas probes and anchors until leaving the RIPE NCC in 2022.
