Using RIPE Atlas to Investigate Slow Servers

Suzanne Taylor Muzzin — Feb 18, 2014 12:35 PM
Contributors: Nathan Howard
When a senior engineer at FreeAgent, an accounting software provider, got a tweet about their servers running slowly, he turned to RIPE Atlas to help solve the problem.

The following article was written by Nathan Howard, a Senior Operations Engineer at FreeAgent. RIPE Labs would like to thank FreeAgent for allowing us to publish this article, which first appeared on their engineering blog, Grinding Gears. FreeAgent retains full copyright.

Last Monday evening we received this tweet:

FreeAgent tweet

Naturally, we take anything like this seriously, so we started digging. First stop was New Relic, which shows us average application and browser response times as well as a whole host of other useful metrics. Everything looked normal. OK, time to hit the logs in search of any long-running requests.

We store all our logs in ElasticSearch, which we can search using Kibana. This combination has proved invaluable to us and once again it started to point us in the right direction.

While searching for all long-running requests, we noticed a few patterns. Firstly, the long-running requests were only long-running on the load balancers and not on the application servers. Secondly, most of these long-running requests came from BT-owned IP addresses.

Then, as suddenly as this started, it stopped. This isn’t the first time we’ve had issues with BT, so we guessed they’d had some incident which they had resolved.

Along comes Tuesday evening and again:

FreeAgent tweet2

Two disruptions in two nights, each lasting only a couple of hours. This was odd. I decided that if this was going to happen again, I wanted to be ready for it.

Let's rewind a few months. A friend of mine told me about the RIPE Atlas project and kindly gave me some credits to try it out. The RIPE NCC is trying to build the largest internet measurement network by giving probes away to people to host on their networks. In return for hosting a probe, you gain credits which can be used to query other probes.

The great thing about these probes is most of them are hosted on residential internet connections and not in data centres, which usually have lots of bandwidth and redundant links.

I applied for a probe and a few weeks later it arrived.

FreeAgent RIPE Atlas probe

With users reporting problems which we believed to be network related, it was time to test this out.

I set up two measurements:

  1. Find 10 UK probes and ping our app every 5 min.
  2. Find 10 BT probes and ping our app every 5 min.
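The two measurements above can be sketched as request bodies for the RIPE Atlas measurement-creation API. This is only an illustration, not necessarily how the measurements were actually created (the web interface does the same job). The field names follow the v2 REST API as I understand it and should be checked against the current RIPE Atlas documentation. The target `app.example.com` is a placeholder, and selecting BT probes by AS2856 is an assumption about which network "BT probes" refers to.

```python
import json

# Endpoint that a spec like this would be POSTed to, together with an API key.
# Nothing is sent in this sketch; it only builds the request bodies.
ATLAS_MEASUREMENTS_URL = "https://atlas.ripe.net/api/v2/measurements/"


def build_ping_measurement(target, description, probe_type, probe_value,
                           probe_count=10, interval_seconds=300):
    """Return a recurring IPv4 ping measurement spec as a plain dict."""
    return {
        "definitions": [{
            "type": "ping",
            "af": 4,                       # IPv4
            "target": target,
            "description": description,
            "interval": interval_seconds,  # 300 s = every 5 minutes
        }],
        "probes": [{
            "type": probe_type,            # "country" or "asn"
            "value": probe_value,
            "requested": probe_count,
        }],
        "is_oneoff": False,                # keep running until stopped
    }


# 1. Ten probes anywhere in the UK (country code GB).
uk_wide = build_ping_measurement(
    "app.example.com", "UK-wide ping", "country", "GB")

# 2. Ten probes inside BT's network (assumed here to be AS2856).
bt_only = build_ping_measurement(
    "app.example.com", "BT-only ping", "asn", "2856")

print(json.dumps(uk_wide, indent=2))
print(json.dumps(bt_only, indent=2))
```

Keeping the probe selection as the only difference between the two specs makes the results directly comparable: same target, same interval, different vantage points.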

The results were interesting and confirmed exactly what we were seeing. The results from the UK-wide probes showed that only BT was having problems.

FreeAgent RIPE Atlas UK probes

Looking at the BT probes, we see that nearly all of them have packet loss or increased ping times at the same time.

FreeAgent RIPE Atlas BT probes

It’s important to remember here that these are hosted on home connections, so we do expect them to have fluctuations every now and again.
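To look past those individual fluctuations, it can help to total packet loss per probe across the whole window instead of eyeballing each sample. The following is a minimal sketch, assuming ping results shaped like the JSON that RIPE Atlas returns, with `prb_id`, `sent`, `rcvd` and `avg` fields, where `avg` is -1 when no replies came back. The sample records are invented for illustration and are not data from this incident.

```python
def summarise_ping_results(results):
    """Aggregate packet loss and round-trip time per probe.

    Returns {probe_id: {"loss_pct": float, "mean_rtt_ms": float or None}}.
    mean_rtt_ms is the unweighted mean of the per-sample averages, and is
    None when the probe never received a reply.
    """
    totals = {}
    for r in results:
        t = totals.setdefault(
            r["prb_id"], {"sent": 0, "rcvd": 0, "rtt_sum": 0.0, "rtt_n": 0})
        t["sent"] += r["sent"]
        t["rcvd"] += r["rcvd"]
        # Only count samples that actually carry a round-trip time.
        if r["rcvd"] > 0 and r.get("avg", -1) >= 0:
            t["rtt_sum"] += r["avg"]
            t["rtt_n"] += 1

    summary = {}
    for prb_id, t in totals.items():
        loss = 100.0 * (t["sent"] - t["rcvd"]) / t["sent"] if t["sent"] else 0.0
        mean_rtt = t["rtt_sum"] / t["rtt_n"] if t["rtt_n"] else None
        summary[prb_id] = {"loss_pct": loss, "mean_rtt_ms": mean_rtt}
    return summary


# Hypothetical sample records (probe IDs and timings are made up).
sample = [
    {"prb_id": 1001, "sent": 3, "rcvd": 3, "avg": 20.0},
    {"prb_id": 1001, "sent": 3, "rcvd": 1, "avg": 80.0},
    {"prb_id": 1002, "sent": 3, "rcvd": 0, "avg": -1},
]

summary = summarise_ping_results(sample)
for prb_id, stats in sorted(summary.items()):
    print(prb_id, stats)
```

A single home connection dropping packets shows up as one bad probe; many probes on the same network degrading over the same period is what points towards a shared cause.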

As it happens, the problem turned out not to be with BT at all but with a company that our hosting provider has a BGP peering with. BT customers were favouring this route, which was flapping, causing the packet loss and slowness our customers were seeing.

Of course we used other tools, which might be the subject of a future blog post, to find exactly where the problem was, but the RIPE Atlas project provided a lot of useful information.

Thankfully this issue is now resolved and we’ve been carefully monitoring it over the last week or so.

If you’re interested in this technology, you can help @RIPE_Atlas expand their network by hosting a probe for them.
