
RIPE Atlas: Countering Hardware Issues with Better Firmware

Robert Kisteleki — 06 Dec 2016
Contributors: Philip Homburg
As RIPE Atlas is expanding, it is approaching the magical milestone of 10,000 probes. However, as our public graphs also illustrate, the expansion has slowed down recently.

Figure 1 illustrates this development.

 

Figure 1: Number of connected RIPE Atlas probes over time

One factor that contributes to this is the appearance of USB-stick-related issues in the currently used v3 probe generation. For many hosts this problem never materialises, but for others it is a recurring issue.

While we have published multiple analyses on the topic (see Troubleshooting RIPE Atlas Probes: USB Sticks, Further Analysis of RIPE Atlas Version 3 Probes, and Another Look at RIPE Atlas Probe Lifetime), so far we have not identified a single root cause for this behaviour. We understand that having to tend to this can be frustrating, so in the spirit of transparency we'd like to update the RIPE Atlas community about our plan of action.

One of the "ultimate solutions" is to look for more suitable hardware to be used as the v4 probe. It is beyond the scope of this article to describe the full requirements, but please contact us if you have a potential device in mind!

In the meantime we plan to enhance how the current devices behave via firmware updates. The sections below lay out our current plan.

To be clear: the available built-in storage of the v3 probes is 4MB, which is not enough to store the measurement firmware, so we cannot do without the USB extension. But we can optimise how we use it.

Using the USB Storage Less Often

One of the enhancements we're thinking about is to use the USB sticks less, thereby reducing power use and the chances of file system corruption. This can be achieved by storing result data in memory instead of writing it to disk immediately.

The v3 probes have 32MB of RAM to work with, which is a serious constraint. However, our measurement code is efficient enough to run even on the v2 probes which only have 16MB and, with some more constraints, on v1 probes with 8MB memory. So we can put aside a bit of this 32MB to act as a RAM disk for temporary storage. As long as the probe is connected to our network, it submits its results periodically (currently this is done every 90 seconds but we can tune this). This means that in normal operating mode the RAM disk is likely to be enough to store results until submission.

If the probe is unable to send in the results, we'll move the collected data to permanent storage (i.e. the USB stick) which has enough capacity to store this even if the disconnection is prolonged. Once the probe is connected again, it will deliver this backlog together with the newly collected results.

Of course this solution has the drawback of losing the collected results of the last minute or so in case of a power failure. From the engineering standpoint, this seems like a good tradeoff.
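The buffering scheme described above can be sketched as follows. This is a minimal illustration only, not the actual probe firmware: the `ResultBuffer` class, the `submit` callback and the backlog file path are all hypothetical names introduced for this example, and an in-memory list stands in for the RAM disk.

```python
import json
import os


class ResultBuffer:
    """Keep results in memory; touch persistent storage only when submission fails.

    Hypothetical sketch: `spill_path` stands for a file on the USB stick and
    `submit` for whatever delivers a batch of results to the infrastructure.
    """

    def __init__(self, spill_path, submit):
        self.spill_path = spill_path
        self.submit = submit          # callable(list) -> bool, True on success
        self.pending = []             # stands in for the RAM disk
        # Check once at start-up whether an earlier run left a backlog behind,
        # so that normal operation never has to look at the USB stick.
        self.has_backlog = os.path.exists(spill_path)

    def add(self, result):
        self.pending.append(result)

    def _load_backlog(self):
        if not self.has_backlog:
            return []
        with open(self.spill_path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self):
        """Intended to be called periodically, e.g. every 90 seconds."""
        backlog = self._load_backlog()
        batch = backlog + self.pending
        if not batch:
            return True
        if self.submit(batch):
            # Backlog and new results were delivered together.
            if self.has_backlog:
                os.remove(self.spill_path)
                self.has_backlog = False
            self.pending = []
            return True
        # Submission failed: move the in-memory results to persistent storage.
        if self.pending:
            with open(self.spill_path, "a") as f:
                for result in self.pending:
                    f.write(json.dumps(result) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self.has_backlog = True
            self.pending = []
        return False
```

In this sketch the persistent file is written only while the probe is disconnected, so a probe that stays connected never writes results to the USB stick. Anything still in `pending` when power is lost is gone, which corresponds to the "last minute or so" of results mentioned above.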

More Resiliency Against File System Corruption

Probes, just like any hardware without backup power, can lose power, and this can lead to file system corruption. We have seen this in real life: sometimes it's the result data that suffers, other times it's even the measurement code or the metadata (connection status, list of current tasks, etc.). This can confuse the probe - even though there's nothing wrong with the hardware, the unit becomes dysfunctional.

Writing to the disk less often, as explained above, already helps in this case, but it also allows us to rearrange the disk layout: we can now separate the firmware from the data. The firmware part can be explicitly marked as a read-only file system, reducing the likelihood of errors.
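As an illustration of what such a split layout might look like from the software side, the sketch below parses a mount table in the format of Linux's /proc/mounts and checks that the firmware partition is mounted read-only while the data partition remains writable. The device names and the `/firmware` and `/data` mount points are assumptions made up for this example; the article does not specify the actual layout.

```python
def parse_mounts(text):
    """Map each mount point to its set of mount options.

    Expects lines in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        mounts[mount_point] = set(options.split(","))
    return mounts


def is_read_only(mounts, mount_point):
    """Return True if the given mount point carries the 'ro' option."""
    return "ro" in mounts.get(mount_point, set())


# Hypothetical layout: firmware and data on separate partitions of the USB stick.
EXAMPLE_MOUNTS = """\
/dev/sda1 /firmware ext4 ro,noatime 0 0
/dev/sda2 /data ext4 rw,noatime 0 0
"""

mounts = parse_mounts(EXAMPLE_MOUNTS)
firmware_protected = is_read_only(mounts, "/firmware")
data_writable = not is_read_only(mounts, "/data")
```

With a layout like this, a power loss during a write can affect only the data partition, since nothing writes to the firmware partition during normal operation.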

Conclusion

All this is easier said than done, and cannot happen overnight. However, we have already started working on the implementation, and we believe that we can start testing this in real life soon. Please stay tuned!

 

2 Comments

Besmir zanaj says:
06 Dec, 2016 05:17 PM
Very nicely explained article.
Can I suggest that if the measurement data is not readable, the probe initiates a USB drive format to at least let us know whether the USB is in a read-only state. By doing so we get rid of the issues of electric failures to the probes.
Daniel AJ says:
08 Dec, 2016 06:37 PM
What about a virtual probe, running on Omnia Turris devices? Those routers were developed and recently shipped by nic.cz - I am sure they would help Atlas get the necessary software packages as an option onto the Turris devices.