Large-scale PCAP Data Analysis Using Apache Hadoop
In 2009, ICANN announced that it would start developing a signed root zone, with the intention of having DNSSEC operational in the root zone within a year. The RIPE NCC has supported the adoption of DNSSEC from the beginning. As the operator of K-root, we had to ensure that it was ready for the signed root zone. We partnered with NLNetLabs to analyse the changes that would be necessary for our infrastructure (whitepaper) and made sure we deployed the capacity needed for the rollout of the signed root zone.
We performed a lot of analysis of real-world K-root traffic as part of this research and noticed that our usual libpcap-based tooling was too slow for us to iterate on the results. At the same time, Apache Hadoop was seeing its early releases and receiving a lot of positive feedback, and Amazon had released Elastic MapReduce, which runs on their EC2 cloud, so we could test our idea without a major upfront investment in infrastructure.
The only downside was that there was no support for PCAP in Apache Hadoop. Within a week we hacked up a basic proof of concept to test the idea, and our initial results were encouraging: our native Java PCAP implementation was able to read 100GB of PCAP data, spread out over 100 EC2 instances, within three minutes. What followed was the introduction of our own Apache Hadoop cluster for both storage and computation at the RIPE NCC. To date, this cluster has 200TB of storage capacity and 128 CPU cores for computation. We use it to store our research datasets and to run ad-hoc analysis on them.
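Reading PCAP natively in Java starts with the file's 24-byte global header, whose magic number also tells the reader which byte order the capture was written in. The sketch below is a minimal illustration of that first step; the class and field names are hypothetical and are not the hadoop-pcap API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch (not the hadoop-pcap API): parse the 24-byte
// PCAP global header and detect byte order from the magic number.
public class PcapGlobalHeader {
    public static final int MAGIC = 0xa1b2c3d4;         // file written big-endian
    public static final int MAGIC_SWAPPED = 0xd4c3b2a1; // file written little-endian

    public final ByteOrder order;
    public final int versionMajor, versionMinor, snapLen, linkType;

    public PcapGlobalHeader(byte[] first24) {
        if (first24.length < 24) {
            throw new IllegalArgumentException("Need at least 24 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(first24); // reads big-endian by default
        int magic = buf.getInt();
        if (magic == MAGIC) {
            order = ByteOrder.BIG_ENDIAN;
        } else if (magic == MAGIC_SWAPPED) {
            order = ByteOrder.LITTLE_ENDIAN;
        } else {
            throw new IllegalArgumentException("Not a PCAP file");
        }
        buf.order(order); // remaining fields use the detected byte order
        versionMajor = buf.getShort() & 0xffff;
        versionMinor = buf.getShort() & 0xffff;
        buf.getInt(); // thiszone (GMT offset, usually 0)
        buf.getInt(); // sigfigs (timestamp accuracy, usually 0)
        snapLen = buf.getInt();  // max captured length per packet
        linkType = buf.getInt(); // e.g. 1 = Ethernet
    }
}
```

Once the byte order is known, every per-packet record header in the rest of the file can be decoded consistently.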
So, that's how we got to where we are today. We believe this piece of software will be useful to many. I am proud that we released the current state of our development as version 0.1 on 7 October 2011 and have open-sourced it on GitHub. We chose the LGPL as the license to ensure that you can use it within your organisation as you see fit, while staying true to the spirit of our initial idea to open-source it.
The software can be found here: https://github.com/RIPE-NCC/hadoop-pcap
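Inside a Hadoop map task, a PCAP record reader then walks the 16-byte per-packet record headers within its input split and hands each captured frame to the mapper. The self-contained loop below sketches that idea with purely illustrative names (it is not the actual hadoop-pcap record reader, and it assumes the byte order was already detected from the global header):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of the per-record loop a PCAP record reader runs:
// step over each 16-byte record header, skip the captured frame bytes,
// and count the packet records found. Not the hadoop-pcap API.
public class PcapRecordWalker {
    // Counts packet records in an in-memory PCAP image
    // (24-byte global header followed by record headers and frames).
    public static int countPackets(byte[] pcap, ByteOrder order) {
        if (pcap.length < 24) {
            return 0; // too short to even hold a global header
        }
        ByteBuffer buf = ByteBuffer.wrap(pcap).order(order);
        buf.position(24); // skip the global header
        int packets = 0;
        while (buf.remaining() >= 16) {
            buf.getInt();               // ts_sec  (timestamp, seconds)
            buf.getInt();               // ts_usec (timestamp, microseconds)
            int inclLen = buf.getInt(); // bytes actually captured
            buf.getInt();               // orig_len (length on the wire)
            if (inclLen < 0 || inclLen > buf.remaining()) {
                break; // truncated or corrupt record; stop cleanly
            }
            buf.position(buf.position() + inclLen); // skip the frame payload
            packets++;
        }
        return packets;
    }
}
```

A real mapper would decode each frame instead of skipping it, and the output of many such tasks would be combined in the reduce phase, but the record-walking logic is the same.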
To allow you to quickly test this, we have also recorded a screencast with step-by-step instructions on how to use this library in Amazon Elastic MapReduce.
I hope that you find this useful. If you have any feedback, we encourage you to comment under the article. If you have any feature requests, please file them as an issue on our GitHub tracker.
1. Extremely poor performance (100GB data in 3 minutes is OK with 1 thread, you have 100 threads?!?! That is /1 GB DATA PER 3 MINUTES/!!! You could use Excel to achieve this...)
2. You have to support your own bloated code, whilst libtrace is free to use.
3. Losing face by telling people you can process 1 GB data in 3 minutes when this is something that could be done 10 years ago in a lower level language.
Sorry guys but, wtf are you doing here... ;p
/Anonymous, been using libpcap and libtrace for long enough to realize how utterly silly this article was. ;p