Using RIPE Atlas to Debug Network Connectivity Problems

Stéphane Bortzmeyer


This article explains how I use RIPE Atlas probes, the official API and custom scripts to debug network issues.


NOTE: The tools presented here have been superseded by Blaeu, available at https://framagit.org/bortzmeyer/blaeu and documented at Creating RIPE Atlas One-off Measurements with Blaeu.

Introduction

Typically, a network or system administrator uses tools like ping and traceroute to debug her network problems. Cannot visit fr.wikipedia.org? Let's ping it to see if it is up and reachable. ping times out? Use traceroute to see where it stops. But these tools have a serious limitation on today's Internet: they measure from only one vantage point. In a world of CDNs, BGP issues and network filtering, you need several vantage points, which is one of the biggest strengths of RIPE Atlas.

This is especially important if you use anycast for your service. With anycast, the risk that failure or success depends on the location of the RIPE Atlas probe is much higher.

I explain here how I debug. Not because I think that my method is especially good or because I believe that my tools are the best, just to report on an actual, ordinary sysadmin experience.

Installation

The tools I use were developed locally, using the RIPE Atlas API. There is now an official command-line tool, but I prefer my own tools because I'm used to them: they were developed long before the official tool existed. They use version 1 of the RIPE Atlas API (a v2 has been planned for a long time but is not yet published).

These tools are written in Python, so you need a working Python installation. You also need one additional Python package (not in the standard library), cymruwhois. You can typically install it with:

 sudo easy_install cymruwhois
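
If easy_install is not available on your system, pip should work as well, since cymruwhois is a regular PyPI package:

 % sudo pip install cymruwhois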

You also need a RIPE NCC Access single sign-on account and an API key (you create it from your "My Atlas" page). If you are going to create your own tests, you will have to spend a certain amount of credits. It is not necessary to host your own RIPE Atlas probe to earn credits (although it helps, because it contributes to the whole system); there are other ways to obtain credits.

There are several ways to install the necessary programs. They are written in Python but there is not yet a Python package, because Python packaging tools are a big mess. So, we'll install them in a more manual way, here on a FreeBSD machine (but it should work on any Unix), where we use sudo for everything that requires being root.

First, clone the GitHub repository:

 % git clone https://github.com/RIPE-Atlas-Community/ripe-atlas-community-contrib.git
 
Cloning into 'ripe-atlas-community-contrib'...
remote: Counting objects: 436, done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 436 (delta 6), reused 0 (delta 0), pack-reused 421
Receiving objects: 100% (436/436), 473.45 KiB | 148.00 KiB/s, done.
Resolving deltas: 100% (226/226), done.
Checking connectivity... done.

Then, install the library RIPEAtlas.py in a place where Python will find it. If you have the environment variable PYTHONPATH defined, copy it there (or use a symbolic link, as we do here); otherwise, you can find out where Python searches for packages with:

 % python -c 'import sys; print sys.path;'
 
['', '/usr/local/lib/python27.zip', '/usr/local/lib/python2.7', '/usr/local/lib/python2.7/plat-freebsd10', '/usr/local/lib/python2.7/lib-tk', '/usr/local/lib/python2.7/lib-old', '/usr/local/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/site-packages']

Then we copy or link:

 % sudo ln -s $HOME/ripe-atlas-community-contrib/RIPEAtlas.py /usr/local/lib/python2.7/site-packages
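
Alternatively, if you prefer not to touch site-packages, pointing PYTHONPATH at the clone works just as well (here for a Bourne-style shell):

 % export PYTHONPATH=$HOME/ripe-atlas-community-contrib:$PYTHONPATH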

We now make the programs executable and link (or copy) them from a directory where we put our programs (here, ~/bin):

 % chmod a+x $HOME/ripe-atlas-community-contrib/reachability.py
 
% ln -s $HOME/ripe-atlas-community-contrib/reachability.py ~/bin/atlas-reach
% ln -s $HOME/ripe-atlas-community-contrib/traceroute.py ~/bin/atlas-tracert
% ln -s $HOME/ripe-atlas-community-contrib/resolve-name.py ~/bin/atlas-resolve

One last thing: create an API key from the RIPE Atlas web pages and copy it into $HOME/.atlas/auth.
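The programs read the key from this file, so setting it up can be as simple as (MY-API-KEY is a placeholder, replace it with your real key):

 % mkdir -p ~/.atlas
 % echo 'MY-API-KEY' > ~/.atlas/auth

The programs are now available: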

 % atlas-reach 2001:678:c::1
 
4 probes reported
Test #3316036 done at 2016-01-10T17:25:53Z
Tests: 12 successful tests (100.0 %), 0 errors (0.0 %), 0 timeouts (0.0 %), average RTT: 123 ms

The program returns the number of RIPE Atlas probes that reported results (by default, we request five probes but, here, one of them was too slow to answer), the measurement ID (which we can then look up on the RIPE Atlas web pages), the date, and the actual results. By default, we run three tests per probe, which explains why we have twelve tests. The date is very important because the Internet is ever changing: 50 % failures at a given moment does not mean it will last forever; it can be a temporary problem.

There is no man page for this tool or the others; use the -h option to find out about the possible options (some will be introduced in the rest of this article).

In all the examples later, I will use the long form of options (--tests instead of -t) because it is clearer. But, if you're a challenged typist, don't worry, these tools also have a short form.

Examples

The first example showed a machine where everything worked fine. Be careful: by default, only five Atlas probes are requested. The goal is to save network resources (and RIPE Atlas credits), but it is risky to infer statistics (such as "100 % success") from such a small number. You can ask for more probes with --requested:

 % atlas-reach --requested 100 2001:678:c::1
 
86 probes reported
Test #3316047 done at 2016-01-10T17:49:40Z
Tests: 243 successful tests (96.4 %), 3 errors (1.2 %), 6 timeouts (2.4 %), average RTT: 73 ms

We requested one hundred probes. Of course, you will not always get what you requested, especially when you add restrictions, for instance only probes from a given country. 86 probes reported a result, which should mean (with the default value) 258 tests, but some probes made fewer tests (we got reports for about 252 tests). There were also a few errors, as could be expected on the Internet.
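
To make the earlier warning about small samples concrete, recall the classic "rule of three": when all n tests succeed, a 95 % confidence upper bound on the true failure rate is still roughly 3/n. A quick illustration in Python:

 # "100 % success" over n tests is still compatible with a true failure
 # rate of up to roughly 3/n (the "rule of three", 95 % confidence).
 for n in (15, 258):
     print "%3d tests, all OK: failure rate may still be up to %.1f %%" % (n, 300.0 / n)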

Because many probes believe they have IPv6 connectivity when they actually do not, the tools add by default the RIPE Atlas tag system-ipv6-works (or system-ipv4-works for IPv4 targets). You can add more tags with --include; for instance, here we use only probes tagged as being behind a NAT router (small warning: this replaces all the tags, even system-ipvX-works):

 % atlas-reach --include nat $(dig +short +nodnssec A fr-cdg-as2486.anchors.atlas.ripe.net)
 
5 probes reported
Test #3762644 done at 2016-05-06T14:36:39Z
Tests: 15 successful tests (100.0 %), 0 errors (0.0 %), 0 timeouts (0.0 %), average RTT: 98 ms

The tool does not accept host names (they are ambiguous: what should we ping if there are several IP addresses?), hence the small trick with dig to find the IPv4 address of this anchor.
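
The same trick works for IPv6 (assuming the name also has an AAAA record, which Atlas anchors normally do):

 % atlas-reach $(dig +short +nodnssec AAAA fr-cdg-as2486.anchors.atlas.ripe.net)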

If you want to see the actual JSON data sent to RIPE Atlas, use --verbose:

 % atlas-reach --requested 500 --country FR --verbose 37.49.233.130
 
{'definitions': [{'description': 'Ping 37.49.233.130 from FR', 'af': 4, 'packets': 3, 'type': 'ping', 'is_oneoff': True, 'target': '37.49.233.130'}], 'probes': [{'requested': 500, 'type': 'country', 'value': 'FR', 'tags': {'include': ['system-ipv4-works']}}]}
Measurement #3558191 to 37.49.233.130 uses 499 probes
486 probes reported
Test #3558191 done at 2016-02-14T18:21:17Z
Tests: 982 successful tests (84.6 %), 0 errors (0.0 %), 179 timeouts (15.4 %), average RTT: 28 ms
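
Under the hood, the tools simply send this JSON to the measurement-creation API. For the curious, here is a minimal sketch of the equivalent raw call, assuming the v1 endpoint URL and response format (RIPEAtlas.py also handles errors and polls for the results, which this sketch does not):

 import json
 import os
 import urllib2

 # Read the API key from the same file the tools use.
 key = open(os.path.expanduser("~/.atlas/auth")).read().strip()

 # The same JSON as printed by --verbose above.
 data = {"definitions": [{"description": "Ping 37.49.233.130 from FR",
                          "af": 4, "packets": 3, "type": "ping",
                          "is_oneoff": True, "target": "37.49.233.130"}],
         "probes": [{"requested": 500, "type": "country", "value": "FR",
                     "tags": {"include": ["system-ipv4-works"]}}]}

 request = urllib2.Request("https://atlas.ripe.net/api/v1/measurement/?key=" + key,
                           json.dumps(data),
                           {"Content-Type": "application/json"})
 # Beware: like any measurement creation, this spends credits.
 # The response should contain the ID of the new measurement.
 print json.load(urllib2.urlopen(request))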

Many servers today refuse to reply to ICMP echo packets, so you will often get a sad "No successful test":

 % atlas-reach 193.0.6.139
 
5 probes reported
Test #3762577 done at 2016-05-06T14:34:07Z
No successful test

Often, problems do not last. Either they fix themselves, or someone changes something and the network unbreaks. It is then useful to re-run the test to see if the problem is indeed solved. But if the problem is very local, and the probe selection produces a completely different set of probes, you may be running a different experiment. In order to control the parameters (change only one thing at a time, as in a scientific experiment), there is a very useful RIPE Atlas feature: the ability to select the same probes as in a previous measurement. Use --old_measurement for that.
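
For example, to re-test the target of the first example with exactly the probes of measurement #3316047 (check -h for the exact syntax of the option):

 % atlas-reach --old_measurement 3316047 2001:678:c::1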

As I said in the introduction, when ping fails, we typically run traceroute to investigate the problem further. We can do it with RIPE Atlas, too (--format displays the result; by default, the tool is quiet):

 % atlas-tracert --format --requested 1 193.0.6.139
 

Measurement #3762813 Traceroute 193.0.6.139 uses 1 probes
1 probes reported
Test #3762813 done at 2016-05-06T14:53:04Z
From:  62.221.110.159    1547    IDK-NETWORK JSCC Interdnestrcom, MD
Source address:  192.168.1.202
Probe ID:  4149
1    192.168.1.1    4565    MEGAPATH2-US - MegaPath Networks Inc., US    [2.405, 1.673, 1.627]
2    10.133.0.1    4565    MEGAPATH2-US - MegaPath Networks Inc., US    [2.691, 2.291, 2.38]
3    10.133.1.1    4565    MEGAPATH2-US - MegaPath Networks Inc., US    [2.887, 2.652, 2.515]
4    10.1.0.170    4565    MEGAPATH2-US - MegaPath Networks Inc., US    [42.123, 42.273, 42.229]
5    80.81.192.95    None    None    [53.453, 53.238, 53.186]
6    213.136.1.89    12859    NL-BIT BIT BV, NL    [52.049, 51.679, 51.868]
7    ['*', '*', '*']
8    ['*', '*', '*']
9    ['*', '*', '*']
10    ['*', '*', '*']
11    ['*', '*', '*']
255    ['*', '*', '*']

In that case, the traceroute failed before reaching the target. If the traceroute goes to the end, we see:

 % atlas-tracert --format --probes 25111 2001:e30:1c1e:1::333
 
Measurement #3339262 Traceroute 2001:e30:1c1e:1::333 uses 1 probes
1 probes reported
Test #3339262 done at 2016-01-15T15:43:22Z
From:  2001:4490:dc4c:0:16cc:20ff:fe48:d468    9829    BSNL-NIB National Internet Backbone,IN
Source address:  2001:4490:dc4c:0:16cc:20ff:fe48:d468
Probe ID:  25111
1    2001:4490:dc4c::1    9829    BSNL-NIB National Internet Backbone,IN    [0.562, 0.469, 0.456]
2    2001:4490:fffc:8400::2    9829    BSNL-NIB National Internet Backbone,IN    [3.518, 1.973, 2.079]
3    ['*', '*', '*']
4    2001:41a8:4000:2::11    6762    SEABONE-NET TELECOM ITALIA SPARKLE S.p.A.,IT    [256.066, 252.362, 253.839]
5    2001:41a8:4000::33    6762    SEABONE-NET TELECOM ITALIA SPARKLE S.p.A.,IT    [251.684, 251.61, 252.091]
6    2001:5a0:12:100::35    6453    AS6453 - TATA COMMUNICATIONS (AMERICA) INC,US    [288.719, 288.659, 288.804]
7    2001:5a0:12:100::1a    6453    AS6453 - TATA COMMUNICATIONS (AMERICA) INC,US    [288.976, 289.061, 289.101]
8    2001:5a0:4500:100::6    6453    AS6453 - TATA COMMUNICATIONS (AMERICA) INC,US    [290.159, 289.276, 289.173]
9    2001:5a0:400:700::1    6453    AS6453 - TATA COMMUNICATIONS (AMERICA) INC,US    [289.24, 289.046, 290.527]
10    2001:5a0:400:200::26    6453    AS6453 - TATA COMMUNICATIONS (AMERICA) INC,US    [228.199, 228.212, 228.09]
11    2404:a800::50    9498    BBIL-AP BHARTI Airtel Ltd.,IN    [239.983, 239.713, 239.774]
12    2404:a800:2:1e::3a:2    9498    BBIL-AP BHARTI Airtel Ltd.,IN    [242.442, 243.905, 242.462]

Of course, going through the USA for an India-to-India trip is not ideal, but it works.

When there is a real problem

Now, let's see a real problem. We test a Linode machine during an ongoing DDoS:

 % atlas-reach --requested 500 69.164.200.203
 
495 probes reported
Test #3316048 done at 2016-01-10T17:50:37Z
Tests: 1469 successful tests (99.0 %), 0 errors (0.0 %), 15 timeouts (1.0 %), average RTT: 149 ms

There are some timeouts. But remember there are (by default) three tests on each probe. Is the problem localised on some probes? Let's display the statistics by probe, using --measurement_ID to reuse an existing measurement rather than starting a new one:

 % atlas-reach --requested 500 --by_probe --measurement_ID 3316048 69.164.200.203
 
Test #3316048 done at 2016-01-10T17:50:37Z
Tests: 490 successful probes (99.0 %), 5 failed (1.0 %), average RTT: 149 ms

We see that the problem is always on the same probes. This is common with routing problems, or it can be that the DDoS saturated only some links, the ones used by these probes. By the way, if we want to know which probes have problems, we use --displayprobes:

 
 % atlas-reach --requested 500 --by_probe --measurement_ID 3316048 --displayprobes 69.164.200.203
Test #3316048 done at 2016-01-10T17:50:37Z
Tests: 490 successful probes (99.0 %), 5 failed (1.0 %), average RTT: 149 ms
[21075, 14555, 24706, 12486, 11159]

And we can try to see their characteristics (or run traceroute) to see if there is a pattern.
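
For instance, a quick way to look for such a pattern is to fetch each probe's metadata from the API. A minimal sketch, assuming the v1 probe endpoint and the country_code / asn_v4 field names:

 import json
 import urllib2

 # The probe IDs printed by --displayprobes above.
 for probe_id in [21075, 14555, 24706, 12486, 11159]:
     url = "https://atlas.ripe.net/api/v1/probe/%d/" % probe_id
     info = json.load(urllib2.urlopen(url))
     print probe_id, info.get("country_code"), info.get("asn_v4")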

Looking at another network issue at the same hosting company, the difference between the loss rate per test and the loss rate per probe is clearer:

  % atlas-reach --requested 500 106.186.29.14
 
498 probes reported
Test #3499149 done at 2016-02-09T10:55:49Z
Tests: 834 successful tests (70.2 %), 0 errors (0.0 %), 354 timeouts (29.8 %), average RTT: 276 ms

 % atlas-reach --requested 500 --by_probe --measurement_ID 3499149 106.186.29.14
Test #3499149 done at 2016-02-09T10:55:49Z
Tests: 462 successful probes (92.8 %), 36 failed (7.2 %), average RTT: 276 ms

This difference clearly indicates an issue with something other than routing or filtering.
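
A back-of-the-envelope calculation shows why the two rates diverge: a probe is counted as failed only if all three of its packets time out. If packets were lost independently with probability p, that would happen with probability p^3, while deterministic blackholing would make the per-probe rate equal to the per-test rate. With the numbers above:

 p = 0.298  # per-test timeout rate observed above

 # Fully independent random loss would give about p**3 per-probe failures;
 # deterministic blackholing would give exactly p. The observed 7.2 % sits
 # in between, much closer to random loss than to blackholing.
 print "independent loss: %.1f %%, blackholing: %.1f %%" % (100 * p**3, 100 * p)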

Various examples

With traceroute

Some probes cannot ping this machine. We display their identifiers:

 % atlas-reach --requested 500 --by_probe --measurement_ID 3316851 --displayprobes 2a00:a4c0:200:1::130
 
Test #3316851 done at 2016-01-12T08:05:04Z
Tests: 482 successful probes (99.4 %), 3 failed (0.6 %), average RTT: 69 ms
[16882, 18317, 24023]

We then traceroute to this machine from these probes. Note the second traceroute (probe #18317), a routing loop:

 %  atlas-tracert --format --probes "16882, 18317, 24023" 2a00:a4c0:200:1::130
 

Measurement #3316853 Traceroute 2a00:a4c0:200:1::130 uses 3 probes
3 probes reported
Test done at 2016-01-12T08:12:59Z
From:  2a02:120:402b:1:12fe:edff:fec3:49a    29396    UNET Unet B.V.,NL
Source address:  2a02:120:402b:1:12fe:edff:fec3:49a
Probe ID:  16882
1    2a02:120:402b:1::1    29396    UNET Unet B.V.,NL    [0.61, 0.446, 0.435]
2    2a02:120:402b:1::1    29396    UNET Unet B.V.,NL    [u'*', 249.375, u'*']

From:  2a00:e80:0:105:ffff::6    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR
Source address:  2a00:e80:0:105:ffff::6
Probe ID:  18317
1    2a00:e80:0:105:ffff::5    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.133, 0.627, 0.53]
2    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [0.731, 0.721, 0.72]
3    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [0.982, 1.004, 20.0]
4    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.249, 1.2, 1.088]
5    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.387, 1.249, 11.737]
6    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.498, 1.441, 1.65]
7    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.666, 1.61, 10.989]
8    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [1.92, 1.782, 1.757]
9    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.062, 20.478, u'*']
10    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.273, 2.208, 2.184]
11    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.544, 2.356, 2.359]
12    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.512, 2.565, 2.459]
13    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.705, 2.743, 2.723]
14    2a00:e80:0:2::15    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [2.937, 2.902, 2.928]
15    2a00:e80:0:2::16    20917    KHEOPS-AUTONOMOUS-SYSTEM KHEOPS Organisation,FR    [3.087, 3.039, 3.02]
...

From:  2a02:120:402a:abcd:c66e:1fff:fe3a:de7c    29396    UNET Unet B.V.,NL
Source address:  2a02:120:402a:abcd:c66e:1fff:fe3a:de7c
Probe ID:  24023
1    2a02:120:402a:abcd::1    29396    UNET Unet B.V.,NL    [0.541, 0.625, 0.493]
2    2a02:120:402a:abcd::1    29396    UNET Unet B.V.,NL    [1565.825, u'*', 2468.934]

Here, we see that all traceroutes fail in the initial network, so it is not the fault of the target.
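
As an aside, routing loops like the one at probe #18317 are easy to spot by eye in a short output; for longer traceroutes, a few lines of Python can flag them automatically. A purely illustrative sketch, working on a plain list of hop addresses:

 def find_loops(hops):
     # Return addresses seen again at a non-adjacent position,
     # which usually indicates a routing loop.
     seen = {}
     loops = []
     for position, address in enumerate(hops):
         if address in seen and position > seen[address] + 1:
             loops.append(address)
         seen[address] = position
     return loops

 # Hops 1-5 of probe #18317 above: two routers alternate.
 print find_loops(["2a00:e80:0:105:ffff::5", "2a00:e80:0:2::15",
                   "2a00:e80:0:2::16", "2a00:e80:0:2::15",
                   "2a00:e80:0:2::16"])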

By Autonomous System

Of course, routing problems are often AS-specific. It is therefore very useful to be able to select probes by AS:

 % atlas-reach --requested 500 --tests 10 --asn 12322 37.49.233.130    
 
220 probes reported
Test #3316862 done at 2016-01-12T08:30:05Z
Tests: 983 successful tests (94.2 %), 0 errors (0.0 %), 60 timeouts (5.8 %), average RTT: 27 ms

There is unfortunately no way to exclude one AS, to have a view of "the rest of the Internet".

Random loss

This is on a network where faulty hardware led to a serious loss of packets:

 % atlas-reach --requested 500 37.49.233.130
 
500 probes reported
Test #3338970 done at 2016-01-15T09:38:13Z
Tests: 977 successful tests (82.2 %), 0 errors (0.0 %), 212 timeouts (17.8 %), average RTT: 63 ms

% atlas-reach --requested 500 --measurement_ID 3338970  --by_probe 37.49.233.130
Test #3338970 done at 2016-01-15T09:38:13Z
Tests: 490 successful probes (98.0 %), 10 failed (2.0 %), average RTT: 63 ms

Note that the percentage of failed probes is much lower than the percentage of failed tests. This is typical when there is random packet loss, for instance when the network equipment is overloaded or because, as is the case here, the hardware has a defect. Other tests related to the same issue:

 % atlas-reach --requested 500 37.49.233.36
 
499 probes reported
Test #3346950 done at 2016-01-17T15:50:34Z
Tests: 1124 successful tests (90.5 %), 0 errors (0.0 %), 118 timeouts (9.5 %), average RTT: 61 ms

% atlas-reach --requested 500 --by_probe --measurement_ID 3346950 37.49.233.36
Test #3346950 done at 2016-01-17T15:50:34Z
Tests: 499 successful probes (100.0 %), 0 failed (0.0 %), average RTT: 61 ms

Here, all probes can ping the target at least once.

Conclusion

Do not hesitate to use the RIPE Atlas probes: they are deployed for you. And you can send GitHub pull requests for the tools if you patch them.


About the author

Stéphane Bortzmeyer Based in Paris (France)

I work at AFNIC (the registry of .fr domain names), in the R&D department, on, among other things, DNS, security, statistics.
