Massimo Candela

Big Data and Network Measurements at TMA 2018

The Network Traffic Measurement and Analysis Conference (TMA) took place in June in Vienna, Austria. A full week of events was scheduled - including a PhD school about Big Data on Monday and Tuesday, the TMA Experts Summit on Tuesday, and the main conference from Wednesday to Friday. Here's my summary of the week!


PhD School

The entire meeting took place at the Tech Gate Vienna tower, and started with a PhD School attended by more than 40 students. This edition of the School focused on Big Data for managing network measurement data, with all the sessions offering hands-on experience for participants.

The first tutorial kicked off with Frank Brockners, engineer at Cisco, showing us the In-situ Operations, Administration, and Maintenance (IOAM) monitoring paradigm. The basic idea is to annotate the individual packets of a traffic flow with additional data used for monitoring and troubleshooting the network. The annotation includes at least the ID of each traversed router, the ingress and egress interfaces, a sequence number, a proof of transit, and an opaque field: a sort of free-format field (specifically requested by Facebook) carrying additional data (e.g. the battery status of an IoT device). The IOAM fields can be carried in various protocols, including IPv6, VXLAN-GPE, and NSH.

The extra information added to the normal flow allows operators to sample the status of the network. Instead of taking a single sample, a more complex monitor can use the entire flow of metadata to monitor the network in real time, keeping the data only if something deviates from preset thresholds. The use cases presented include real-time SLA verification, isolation of malfunctioning devices in data centers, and load balancing based on server performance.
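
To make the "keep the data only when something deviates" idea concrete, here is a minimal Python sketch. It is not the IOAM wire format or any vendor API: the `HopRecord` layout, the `hop_delay_us` metric, and the threshold value are illustrative assumptions that loosely mirror the fields listed above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HopRecord:
    """Hypothetical per-hop annotation; NOT the real IOAM encoding."""
    node_id: int         # ID of the traversed router
    ingress_if: int      # ingress interface
    egress_if: int       # egress interface
    seq_no: int          # packet sequence number within the flow
    hop_delay_us: int    # assumed metric (e.g. derived from hop timestamps)
    opaque: bytes = b""  # free-format field (e.g. IoT battery status)


def deviating(records: Iterable[HopRecord], max_delay_us: int) -> List[HopRecord]:
    """Keep only the records whose delay exceeds the preset threshold."""
    return [r for r in records if r.hop_delay_us > max_delay_us]


def missing_sequence_numbers(records: Iterable[HopRecord]) -> List[int]:
    """Report gaps in the observed sequence numbers (a rough loss hint)."""
    present = {r.seq_no for r in records}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


if __name__ == "__main__":
    flow = [
        HopRecord(node_id=1, ingress_if=2, egress_if=3, seq_no=10, hop_delay_us=120),
        HopRecord(node_id=1, ingress_if=2, egress_if=3, seq_no=11, hop_delay_us=950),
        HopRecord(node_id=1, ingress_if=2, egress_if=3, seq_no=13, hop_delay_us=130),
    ]
    # Only the slow packet is retained; everything within bounds is dropped.
    print([r.seq_no for r in deviating(flow, max_delay_us=500)])
    # Sequence number 12 never showed up at this hop.
    print(missing_sequence_numbers(flow))
```

A real deployment would do this filtering close to the data plane and at line rate; the sketch only shows the logic of discarding in-bounds samples.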

Finally, Leonardo Linguaglossa, postdoctoral fellow at Telecom ParisTech, directed a lab session where we got live experience of IOAM at work.

The second and third tutorials focused on big data tools, with a special focus on Apache Spark. The second tutorial, "Welcome to BigData Zoo!: Open Source Big Data Tools Bestiary with a special visit to Spark", was presented by Joseph Allemandou, data engineer at Wikimedia Foundation. The third tutorial, "From Packets to Knowledge: Applying Data Science Approaches to Large Scale Passive Measurements", was presented by Idilio Drago, assistant professor at Politecnico di Torino. This time the students went back home with a challenge! By solving it, instead of watching the World Cup, students would learn common big data approaches for traffic analysis.

Main Conference: the Posters

Twenty-two papers and more than thirty posters were presented at the conference. The most recurring keywords I saw were certainly "real-time" and "anomaly detection". Here are some of the poster topics (click the link to see the full poster):

And two posters from our R&D staff:

I also met Diego Kiedanski, who enthusiastically pitched me a certain measurement network used in his research "Studying the IPv6 Latin America topology using RIPE Atlas".

Main Conference: the Papers

While all the papers were interesting and of high quality, some papers in particular caught my attention.

  • Passive Observations of a Large DNS Service: 2.5 Years in the Life of Google: this paper, presented by Wouter de Vries from the University of Twente, provides an analysis of 3.7 billion DNS queries spanning 2.5 years of the life of the well-known DNS service: Google's 8.8.8.8.

    Public DNS services have been disruptive for Content Delivery Networks (CDNs). CDNs rely on IP information to geo-locate clients, and this no longer works when queries arrive from a public resolver instead of the client's own network. To mitigate this, the Client Subnet option of EDNS0 allows resolvers to annotate DNS requests with client IP address information, helping CDNs locate the original client.

    The first part of the presentation shows that, despite the fact that Google has PoPs in many countries, traffic is frequently and unnecessarily routed out of the country. This reduces performance, and may expose users' DNS requests to surveillance in other countries. The second part shows that many e-mail providers use Google DNS as the resolver for their servers. This raises privacy concerns, as DNS queries from mail servers reveal information about the hosts they exchange mail with. This information is shared with Google, which, by using EDNS0, shares it with the operator of any EDNS0-enabled authoritative name server involved in the lookup process.

    In addition to the importance of using EDNS0 to improve the web experience of your users, another takeaway for operators is that users tend to switch en masse to public DNS services when the DNS resolvers offered by their ISPs have outages. And, once a public resolver is set in their network configuration... they do not switch back!
  • Studying the Evolution of Content Providers in the Internet Core: it's common to hear about "the core of the Internet", but what is this "core" and how has it evolved over time?

    In this presentation by Esteban Carisimo, graph theory is used to define which ASes compose the core of the Internet and to track the evolution of the core since 1999. The approach used is K-Core decomposition on AS relations inferred from BGP and traceroute data. K-Core decomposition identifies small, densely interlinked core areas of a network: to be included in a core of degree k, an entity must be linked to at least k other entities in the group. The central part of the network is made of the ASes that belong to the TOPcore, the core with the maximum k.

    The big players on the Internet are checked to see whether they really belong to the core and, if so, since when. So who belongs to the TOPcore? Content providers! Many of them got there by moving from traditional CDN services to privately owned CDNs.
  • Towards Provable Network Traffic Measurement and Analysis via Semi-Labeled Trace Datasets: we all know that reproducibility is fundamental in scientific work. In network research there are few publicly available datasets, and few that can be easily shared due to privacy concerns. Without these datasets, we will never be able to reliably repeat, validate, and analyse research results. The discussed approach is based on annotated units of network traffic that can be synthetically generated, or derived from real-world traffic. These units typically contain only a minimum of personal data, so they can be shared and annotated. They can be normalised and combined with traffic data to create a so-called semi-labeled dataset. The semi-labeled dataset is a combination of a private capture of non-annotated real-world network traffic and an annotated baseline that can be publicly shared.
  • Exploring usable Path MTU in the Internet: here we are again, exploring another aspect of the "core" of the Internet. This time the Maximum Transmission Unit (MTU) of wired and mobile networks is analysed. Many devices block ICMP traffic for perceived security benefits, and this includes the error messages needed for proper Path MTU discovery. MSS clamping is a common workaround, where the TCP Maximum Segment Size (MSS) is artificially set to a lower value so that packets fit within the path MTU. This paper is so rich in numbers characterising this phenomenon across multiple datasets that it is really difficult to summarise. MSS clamping turns out to be really common, and the paper compares how often it occurs on IPv4 and IPv6.
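
For readers unfamiliar with the K-Core decomposition mentioned in the content-providers paper above, here is a minimal, self-contained Python sketch of the classic "peeling" procedure. It is not the authors' implementation and does not reproduce their BGP and traceroute pipeline; the toy graph and its node labels are made up for illustration, and `top_core` is a hypothetical helper name.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def core_numbers(edges: Iterable[Edge]) -> Dict[Hashable, int]:
    """Return the core number of every node of an undirected graph."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    degree = {node: len(neigh) for node, neigh in adj.items()}
    remaining = set(adj)
    core: Dict[Hashable, int] = {}
    k = 0
    while remaining:
        # Peel the node with the smallest degree among the remaining ones.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neigh in adj[node]:
            if neigh in remaining:
                degree[neigh] -= 1
    return core


def top_core(edges: Iterable[Edge]) -> Tuple[int, Set[Hashable]]:
    """Return (maximum k, nodes belonging to the core with that k)."""
    core = core_numbers(edges)
    if not core:
        return 0, set()
    k_max = max(core.values())
    return k_max, {node for node, k in core.items() if k == k_max}


if __name__ == "__main__":
    # Toy graph: A-D are fully meshed, E hangs off A and B, F hangs off E.
    toy_edges = [
        ("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D"),
        ("E", "A"), ("E", "B"), ("F", "E"),
    ]
    k_max, members = top_core(toy_edges)
    print(k_max, sorted(members))  # 3 ['A', 'B', 'C', 'D']
```

This version favours readability and rescans the remaining nodes at every step; on graphs the size of the AS-level Internet a bucket-based variant or an existing graph library would be a more sensible choice.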

And I guess that's all from me. Ah...there were beers.

RACI

If you are reading this, chances are you've already heard about the RIPE Academic Cooperation Initiative (RACI). Some TMA attendees have already presented their research at RIPE Meetings with the help of RACI. Please help us attract more brilliant academic presentations! Here is the homepage for the initiative and the application link. The deadline to apply for the upcoming RIPE 77 Meeting in Amsterdam and the RIPE NCC Regional Meeting in Almaty is 12 August!

About the author

Massimo Candela, based in Amsterdam

At the time of writing articles listed here, Massimo Candela was Senior Software Engineer with R&D at the RIPE NCC, working mainly on developing web applications that provide a visual and interactive representation of large amounts of network data. Tools he played a key role in developing are go-to resources for network operators to monitor certain aspects of Internet performance: e.g. RIPE IPmap, RIPE Atlas, RIPEstat, BGPlay, TraceMON, and DNSMON.
