Wolfgang Nagele

Large-scale PCAP Data Analysis Using Apache Hadoop


The RIPE NCC operates various data intensive services. As part of our DNS operations we have been operating K-root since 1997. A key to the stable operation of this service is a solid understanding of the traffic it responds to and how it evolves over time. With the success and growth of the Internet, traffic to the DNS root servers has increased and K-root produces terabytes of raw packet capture (PCAP) files every month. We were looking for a scalable and fast approach to analyse this data. In this article I will explain how we use Apache Hadoop and why we open-sourced our PCAP implementation for it.

In 2009, ICANN announced that it would start development for a signed root zone and intends to have DNSSEC in the root zone operational within a year. The RIPE NCC has been supporting the adoption of DNSSEC from the beginning. As the operator of K-root, we had to ensure that it was ready for the signed root zone. We partnered with NLNetLabs to analyse the changes that would be necessary for our infrastructure ( whitepaper ) and made sure we deployed the capacity needed for the rollout of the signed root zone.

We performed a lot of analysis of real-world K-root traffic as part of this research. We noticed that our regular use of libpcap was too slow for us to iterate on the results. At the same time, Apache Hadoop had its early releases and got a lot of positive feedback. Amazon released Elastic MapReduce , which runs on their EC2 cloud, so we could test our idea without a major upfront investment in infrastructure.

The only downside was that there was no support for PCAP in Apache Hadoop. Within a week we hacked up a basic proof-of-concept to test the idea. Our initial results were encouraging. Our native Java PCAP implementation was able to read 100GB of PCAP data spread out over 100 EC2 instances within three minutes. What followed was the introduction of our own Apache Hadoop cluster for both storage and computation at the RIPE NCC. To date, this cluster has a capacity of 200TB and 128 CPU cores for computation. We use this cluster for the storage of our research datasets and do ad-hoc analysis on these.

So, that's how we got to where we are today. We believe that this piece of software is useful to many. I am proud that we have released the current status of our development as version 0.1 on 7 October 2011 and have open-sourced it on GitHub. We choose the LGPL as license to ensure that you can use it within your organisations as you see fit, but keep true to the spirit of our initial idea to open-source it.

The software can be found here: https://github.com/RIPE-NCC/hadoop-pcap

To allow you to quickly test this, we have also compiled a Screencast which shows you step-by-step instructions on how to use this library in Amazon ElasticMapreduce (click on the image to start the video):

Hadoop Screencast

I hope that you find this useful. If you have any feedback, we encourage you to comment under the article. If you have any feature requests, please file them as an issue on our GitHub tracker .


About the author

Comments 17