VisualK – Monitoring K-root in Near Real Time

Claudio Squarcella — Feb 17, 2012 03:10 PM
VisualK is a new tool for visualising how the service provided by the K-root name server evolves over time. It can be used to monitor the status of K-root in near real time.

Introduction

K-root is one of the 13 Internet root name servers. It has been operated by the RIPE NCC since 1997. The service is provided by a set of distributed nodes (called root name server instances) using anycast, which means that a single IPv4 or IPv6 address is announced simultaneously by a set of nodes deployed in different geographical locations. K-root operates 18 anycast nodes. You can see a map of all anycast nodes on the RIPE NCC’s website.

Root name servers are core services of the Internet: as such, they receive a huge number of queries per second and are expected to answer reliably with acceptable delay. K-root is no exception, with an average of between 10k and 15k queries per second. In the image below, you can see the typical distribution of queries per second across all instances of K-root.

Figure 1: Typical distribution of queries per second on instances of K-root

VisualK was conceived as a tool to efficiently monitor the performance of K-root, showing the traffic load supported by each instance and the traffic migration between pairs of instances over time. It aims to distinguish usual patterns from unusual ones, helping operators address specific issues promptly and get a clearer idea of the behavior of the system as a whole.

Try VisualK now!

Overview

The system in action looks as follows. Each instance of K-root is represented by two concentric circles:

  • The first one, filled with colour, has a size proportional to the number of queries per second received on that instance;
  • The other one, surrounded by a dotted line, shows the average load over the previous 30 minutes as a reference value.

In the screenshot below you can see that in most cases, these circles overlap. But in some cases the dotted line is much larger, for example at the root name server instance in Poznan, Poland. This means that something has changed recently: the number of queries has dropped significantly.

Figure 2: VisualK standard interface
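As a rough illustration of the two circles described above, here is a minimal Python sketch. It is not the VisualK implementation. It assumes that "size" means circle area (the text does not say whether area or radius is used), that one load sample arrives per layout update every five minutes (so six samples cover 30 minutes), and it uses hypothetical names (`radius_for`, `InstanceLoad`, `AREA_PER_QPS`).

```python
import math
from collections import deque

# Assumption: one sample per five-minute update, so six samples span 30 minutes.
SAMPLES_IN_WINDOW = 6
# Hypothetical scale factor: square pixels of circle area per query per second.
AREA_PER_QPS = 2.0


def radius_for(qps):
    """Radius of a circle whose area is proportional to the given load."""
    return math.sqrt(AREA_PER_QPS * qps / math.pi)


class InstanceLoad:
    """Keeps the current load and the previous samples for one instance."""

    def __init__(self):
        self.previous = deque(maxlen=SAMPLES_IN_WINDOW)
        self.current = None

    def update(self, qps):
        # Move the old "current" value into the history before replacing it,
        # so the reference average covers only the previous window.
        if self.current is not None:
            self.previous.append(self.current)
        self.current = qps

    def reference(self):
        """Average load over the previous window (dotted circle)."""
        if not self.previous:
            return self.current
        return sum(self.previous) / len(self.previous)


# Example: a steady load of 900 qps that suddenly drops to 300 qps.
load = InstanceLoad()
for sample in [900, 900, 900, 900, 900, 900, 300]:
    load.update(sample)

filled_radius = radius_for(load.current)      # colour-filled circle
dotted_radius = radius_for(load.reference())  # dotted reference circle
```

With these samples the dotted circle comes out larger than the filled one, which is the situation described for the Poznan instance.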

In the image you can also see that pairs of instances are connected by links (or "tentacles") if they are considered topologically adjacent. The corresponding graph (circles as nodes, links as edges) is what we call the migration graph. Links are generally grey and semi-transparent, but become active when a traffic migration is detected between the corresponding instances: color and size show the origin and volume of the traffic flow, together with bubbles pouring into the instance receiving the traffic. In our example you can see that a lot of traffic has moved from the instance at NAP (in Miami, Florida) to the one located at LINX (in London, UK).

In addition, VisualK highlights unusual behavior with visual effects. Flashing arrows show traffic migrations between instances that are not adjacent, while instances blink if their traffic load decreases significantly. In Figure 3 below, you can see that a lot of traffic moves from the instance in Poznan to the one located at LINX even though they are not close to each other. The goal is to help spot unexpected changes while they are happening and to allow root name server operators to investigate what causes them.

Figure 3: VisualK interface showing unexpected events
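The rules for links, flashing arrows and blinking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency set contains just the NAP and LINX pair mentioned in the text, the instance labels are hypothetical, and the 50% threshold for a "significant" decrease is invented, since the article does not give one.

```python
# Assumption: the migration graph is stored as a set of unordered pairs.
# Only the NAP-LINX adjacency comes from the example in the text.
ADJACENT = {frozenset({"nap", "linx"})}

# Hypothetical threshold: blink when the load falls more than 50% below
# the reference average. The real criterion is not stated in the article.
DROP_RATIO = 0.5


def link_style(origin, destination, adjacent=ADJACENT):
    """Pick the visual effect for a migration from origin to destination."""
    pair = frozenset({origin, destination})
    # Adjacent instances use their existing link; others get a flashing arrow.
    return "tentacle" if pair in adjacent else "flashing-arrow"


def should_blink(current_qps, reference_qps, drop_ratio=DROP_RATIO):
    """True when the current load is far below the reference average."""
    if reference_qps <= 0:
        return False
    return current_qps < reference_qps * (1 - drop_ratio)


nap_to_linx = link_style("nap", "linx")        # adjacent pair
poznan_to_linx = link_style("poznan", "linx")  # non-adjacent pair
```

Storing each edge as a `frozenset` makes the adjacency test independent of direction, so a migration from LINX to NAP is treated the same way as one from NAP to LINX.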

Technology

VisualK is built on top of the new Hadoop cluster introduced at the RIPE NCC for storage and large computations. Query logs from all instances are stored on the cluster promptly, while VisualK periodically retrieves aggregate data and computes an updated layout. The client application, entirely based on JavaScript, receives the layout and takes care of the animation on screen.

The data layer inevitably plays a crucial role, given the high volume of queries received by K-root. Query logs are sent from the instances as gzipped PCAP files and stored on the cluster, then directly processed with MapReduce jobs thanks to the Hadoop PCAP library open sourced by the RIPE NCC last year. The default time period between two layout updates is five minutes: to give an idea, that roughly translates to four million individual queries that are processed on the cluster for each iteration. In terms of data volume, five minutes of compressed PCAP files from all instances of K-root generally add up to about 300MB, which becomes 600MB or even 1GB when uncompressed. The response time depends on the load of the cluster, but usually lies in the range of seconds.
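The "four million" figure is consistent with the query rates quoted in the introduction. The short Python calculation below works it out, using the 10k to 15k queries per second from the text and treating 1GB as 1000MB for the compression ratio (an assumption made only for this estimate).

```python
# Five minutes between two layout updates, expressed in seconds.
WINDOW_SECONDS = 5 * 60


def queries_per_window(qps, window_seconds=WINDOW_SECONDS):
    """Number of queries received in one update window at a constant rate."""
    return qps * window_seconds


low = queries_per_window(10_000)   # 3,000,000 queries
high = queries_per_window(15_000)  # 4,500,000 queries

# Rough compression ratios implied by the data volumes in the text.
COMPRESSED_MB = 300
ratio_low = 600 / COMPRESSED_MB    # 2.0
ratio_high = 1000 / COMPRESSED_MB  # about 3.3, assuming 1GB = 1000MB
```

Four million queries per iteration therefore sits between the two bounds, and the gzipped PCAP files are roughly two to three times smaller than the raw data.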

As a final remark, please note that the visualised data is not actually real time, because query logs need to reach the cluster before being processed. Hence a safe delay interval is applied to wait for all instances to send their data. The current delay is set to one hour, with an upper bound imposed by distant nodes (like those in Tokyo or Delhi). Over the coming months this bottleneck will be reduced, allowing for an even more up-to-date visualisation.
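To make the effect of the delay concrete, the following Python sketch computes which five-minute interval would be displayed at a given moment. It assumes that intervals are aligned to five-minute boundaries, which the article does not state, and `window_for` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

DELAY = timedelta(hours=1)     # safe delay from the text
PERIOD = timedelta(minutes=5)  # default time between layout updates
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_for(now):
    """Return (start, end) of the latest complete window older than DELAY.

    `now` must be a timezone-aware datetime.
    """
    cutoff = now - DELAY
    # Round the cutoff down to a whole number of periods since the epoch.
    # Assumption: windows are aligned to five-minute boundaries.
    whole_periods = (cutoff - EPOCH) // PERIOD
    end = EPOCH + whole_periods * PERIOD
    return end - PERIOD, end


now = datetime(2012, 2, 17, 15, 13, tzinfo=timezone.utc)
start, end = window_for(now)  # the window from 14:05 to 14:10 UTC
```

Reducing `DELAY` is all that would be needed in this sketch to bring the display closer to real time, provided the most distant instances can deliver their logs within the shorter interval.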

Credits

VisualK was realised in collaboration with the Compunet Lab at Roma Tre University. It is based on research work published in the proceedings of the 19th International Symposium on Graph Drawing with the title How to Visualize the K-root Name Server.

VisualK is currently a beta prototype. It has been tested on major Web browsers. For questions, suggestions and bug reports you can leave a comment here or use the contact info available on the VisualK page.

2 Comments

Jay Daley says:
Mar 06, 2012 02:55 AM
Excellent work. This is very much at the cutting edge of big data for registries. Can you explain more about the processing on the Hadoop cluster: Are you using Pig or Hive or something else? Have you got map/reduce jobs that extract DNS data from PCAP files that you could share? and more ...
Claudio Squarcella says:
Mar 06, 2012 12:05 PM
Hi Jay,

thank you for the feedback!

We use simple MapReduce jobs to process data. The server is written in Java. Once the time interval of interest is defined we select the PCAP files within that interval with a class extending PathFilter. Then we group the input data by source host (map) and for each host we compute the total number of queries and the instance answering its last query (reduce).

We tried with Hive for a while, but that implies adding more structure (e.g. partitions) to the underlying data to obtain comparable performance (otherwise it simply scans the whole dataset). Also, you can notice how the above MapReduce algorithm is a bit tricky to write in plain SQL.

Hoping that helps,
Claudio
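
Editor's sketch: the job described in the reply above can be imitated in plain Python to show the shape of the computation. This is not Hadoop code and does not use the Hadoop PCAP library. The file naming scheme, the record layout `(timestamp, source_host, instance)` and the function names are all hypothetical, and the addresses come from a documentation range.

```python
from collections import defaultdict


def accept(path, start, end):
    """Stand-in for the PathFilter step: keep files inside [start, end).

    Hypothetical naming scheme: "<instance>-<unix_timestamp>.pcap.gz".
    """
    stamp = int(path.rsplit("-", 1)[1].split(".", 1)[0])
    return start <= stamp < end


def map_phase(records):
    """Group (timestamp, source_host, instance) records by source host."""
    grouped = defaultdict(list)
    for timestamp, source_host, instance in records:
        grouped[source_host].append((timestamp, instance))
    return grouped


def reduce_phase(grouped):
    """Per host: total number of queries and the instance answering the last one."""
    result = {}
    for host, queries in grouped.items():
        _, last_instance = max(queries, key=lambda query: query[0])
        result[host] = (len(queries), last_instance)
    return result


files = ["linx-1329491400.pcap.gz", "linx-1329491700.pcap.gz"]
selected = [f for f in files if accept(f, 1329491400, 1329491700)]

records = [
    (1, "192.0.2.1", "nap"),
    (2, "192.0.2.2", "linx"),
    (3, "192.0.2.1", "linx"),
]
summary = reduce_phase(map_phase(records))
```

The "last instance per host" part is what makes the job awkward to express in plain SQL, as the reply notes: it needs both an aggregate (the count) and the row holding the maximum timestamp within each group.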