After we observed increased query load on the root name severs recently, we did some investigation and analysis which is described in the article below.
Earlier, we reported about an Increased Query Load on Root Name Servers . At no time was there any operational problem for K-root, the root name server the RIPE NCC is responsible for. In the meantime, traffic load is back to normal again and we would like to present some preliminary results:
We immediately noticed this increase and we kept a close watch on the developments. Because the increased load did not pose problems for K-root, we decided not to block the traffic. However, we took some measurements to spread the load across other instances of K-root. Also, the collaboration with our peers was excellent during the entire time.
Initially a few small drops could be observed in our DNS monitoring tool DNSMON . DNSMON is extremely sensitive and shows the slightest changes immediately. We took immediate action to limit the bandwidth impact of the increased load and the drops never appeared again.The engineering reason for these drops is an internal bandwidth limitation in our global nodes. This was known and upgrades were already scheduled. We have re-scheduled these upgrades to happen earlier where possible.
By working with our peers and their up-streams we established early on that the majority of the traffic was originated from a few Autonomous Systems (ASes) located mostly in China. We contacted the operators of these networks to find out what's going on. This was relatively complicated because of the geographic distance and because we don't have a lot of contact with the operators in question. We will be looking into streamlining communications with distant ISPs. However, this will always remain difficult because of language and time zone differences.
One of the ASes originating the traffic is very big compared to the others, so initially it looked like it only came from one single AS. Consequently, anycast did not spread the query load as effectively as usual: because the bulk of the queries came from one AS and because this AS was very large, most queries were targeted to one particular anycast instance of K-root. This is the first occurrence of this kind of traffic distribution, or lack thereof. We will work with the routing community and seek methods to make anycast more effective at a more granular level than just by AS. Ideas are welcome.
The increased load was caused by queries to a single .com domain, called <domain> in the text below. the www. <domain> .com hostname in this domain was by far the most popular. The webserver for www. <domain> .com was associated with a game site in China, which is not in the Alexa 1M list. A zone transfer of this zone from all 4 authoritative nameservers revealed a typical "one host small zone", and there are no indications this zone was using fast-flux or similar detection evasion techniques.
If we look at queries for *. <domain> .com. on 29 June, we find 99.996% of queries of type A (or 1 in decimal), and interestingly there are no queries of type AAAA. The very small fraction of other query-types is of both assigned and unassigned query-types, the top 3 are listed below. If we look at query class, 99.998% of queries is of class 'IN', the top 3 of other query-types is also listed below. Interestingly the value of the top query type and query class is 2561, for which we don't have an explanation (It is 2561 years ago that Confucius was born though). Queries for www. <domain> .com had the Recursion Desired (RD) flag set. Communication with the operators of .com revealed that they did not receive the same type of query-storm for www. <domain> .com. to their authoritative servers.
Spot checking of a couple of IP addresses that queried for www. <domain> .com revealed that these IP addresses didn't query for anything else but <domain> .com, which suggests that there are no typical DNS resolvers behind these IP addresses. Speculating: this could be caused by either a misconfiguration in a piece of software, or something initiated by a botnet.
If we look at the ASes for sources querying the www. <domain> .com, we find sources from 287 distinct ASes, that map to 48 distinct countries. As already indicated earlier, the main source were ASes in China. If we look at the top 10 ASes, in terms of the number of /24s that traffic came from 8 where in China, the other 2 were in Taiwan and India (based on the delegation files from the RIRs). This indicates that this was not something exclusively originating from China, but there seems to be some correlation to countries with Chinese language populations. Speculating on what could cause this correlation, it could be due to either a bug in a piece of software that is tied to Chinese languages, or a botnet who's mode of infection is Chinese language specific, like language specific malware or spam.
If we look at the total query rate for www. <domain> .com to K-root (see Figure 1), we see 2 distinct changes in query rate, one on 29 June around 6am UTC, and the other one on 30 June around noon UTC.
Figure 1: Query rate distribution for www.<domain>.com on 29 and 30 June
If we look at the query rate per source address, we see that this is quite low. Figure 2 shows boxplots of the distribution of queries per minute per source for 30 minutes intervals.
As you can see, the query rate per source address is at under 100 packets per minute typically. The same 2 distinct changes (on 29 June 6:00 UTC, and on 30 June 12:00 UTC) as in the overall query rate graph (Figure 1) are visible here.
In Figure 3, the number of unique IP address per 30 minute interval querying for www. <domain> .com is plotted for 29 and 30 June. As can be seen from this plot, about 60,000 to 65,000 distinct IP addresses were sending queries to K-root on 29 June and the first part of June 30. About midway through June 30 the number of distinct IP addresses dropped significantly, at the same time as the total query rate decreased (see Figure 1) and the median query-rate per IP address increased (see Figure 2).
Figure 3: Unique IP addresses querying for www.<domain>.com on 29 and 30 June
Even though we did not experience any disruption in service at any time, we decided to increase the over-provisioning of all global nodes of K-root in order to have even more headroom in the future. We will prepare methods to make anycast more effective at a more granular level than just by AS, in case such an incident repeats. We will work to improve communication with ISPs that are distant from our global nodes and our operations center.
As to the cause of this incident we are still in the dark. While it is plausible that this was caused by a botnet, this hardly can be classified as a DDoS attack to K-root, since the packet rates per individual source were too low to have significant impact. One speculation is that this was a test of the capabilities of a specific botnet.
If you have more information or have an idea about the cause of this incident, please let us know in a comment below.