DNS TTL violations are a controversial topic. Simply put, a TTL violation occurs when a resolver overrides the TTL value provided by an authoritative server and serves its clients with a different value. In this post, we analyse whether this is happening in the wild.
The time to live (TTL) value of a DNS record is "primarily used by resolvers when they cache resource records (RRs). The TTL describes how long a RR can be cached before it should be discarded" (see RFC1034). In other words, it is the maximum time (in seconds) a DNS resolver should keep a record in its cache.
When a recursive resolver overrides the TTL value of a DNS record as provided by the authoritative server, we call that a TTL violation. For example, as documented on the dns-oarc mailing list, Amazon EC2 local resolvers override the TTL of .nl from 172800 to 60 seconds. Other research has reported the same for wired and mobile networks, respectively (see here and here).
Whenever a resolver violates the TTL specified by a DNS zone, two things can happen:
- If the TTL value is reduced (from 2 days to 60 seconds, for instance, as in the EC2 case), such records expire from the resolver's cache while they are still valid at the authoritative server, ultimately generating extra queries from the resolver to the authoritative server.
- If the TTL value is increased, however, a resolver may keep RRs in its cache for a longer period of time. This can create inconsistency: a RR that has been updated at the authoritative server may remain outdated in the resolver's local cache for as long as the resolver wants (essentially the new TTL value). This poses a risk in the case of malicious domains: even after they are removed from the authoritative zone, an increased TTL on the resolver's side keeps them valid in the resolver's local cache, putting the resolver's clients at risk.
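The effect of both cases can be illustrated with a toy cache. The sketch below is not any real resolver implementation; `MiniCache`, `override_ttl`, and the record values are our own illustrative names. An overriding cache with `override_ttl=60` expires a 2-day record after a minute, while a larger override would keep a removed record alive past its authoritative TTL:

```python
import time

class MiniCache:
    """Toy resolver cache; override_ttl simulates a TTL violation."""

    def __init__(self, override_ttl=None):
        self.override_ttl = override_ttl  # None = honor the authoritative TTL
        self.store = {}  # name -> (record, expiry timestamp)

    def put(self, name, record, authoritative_ttl):
        # A violating resolver replaces the zone's TTL with its own value.
        ttl = self.override_ttl if self.override_ttl is not None else authoritative_ttl
        self.store[name] = (record, time.time() + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(name)
        if entry is None or now >= entry[1]:
            return None  # miss or expired: the resolver must re-query
        return entry[0]
```

With `override_ttl=60`, a record cached with an authoritative TTL of 172800 seconds is already gone two minutes later, triggering a fresh query upstream; an honoring cache would still serve it.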
There are people who support TTL violations and others who are against them (see here, here and here). The Internet Draft Serving Stale Data to Improve DNS Resiliency presents a method (currently used by many cloud providers) to serve stale DNS data when authoritative servers are unreachable.
In this post, we do not debate whether resolvers should violate the TTL values provided by authoritative servers. Instead, we are interested in a different question: are TTL violations happening in the wild? TTL violations have been reported in other studies (e.g. on wireless networks), but not across a large number of networks and providers.
To analyse this situation, we use (of course) RIPE Atlas probes.
To measure TTL violations in the wild, we have to perform the following steps:
- Register an unused domain name (cachetest.nl)
- Set up two authoritative name servers for cachetest.nl
- Set up the zone files for each NS, using RIPE Atlas probe IDs as subdomains (so we can use macros to send unique queries from each probe and avoid caching, i.e. $p.cachetest.nl, where $p is the probe ID). For example:
23559 333 IN TXT "this is ns1 responding to probe 23559"
23560 333 IN TXT "this is ns1 responding to probe 23560"
23561 333 IN TXT "this is ns1 responding to probe 23561"
23562 333 IN TXT "this is ns1 responding to probe 23562"
- Run RIPE Atlas measurements with 10,000 probes (see measurement details)
- Parse and analyse the results
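The per-probe records in step 3 can be generated with a short script. This is a minimal sketch following the record format shown above; the function name and signature are our own, not part of the original measurement setup:

```python
# Sketch: generate the per-probe TXT records for the cachetest.nl zone,
# one unique subdomain per RIPE Atlas probe ID (TTL 333, as in the post).
def zone_records(probe_ids, ns_label="ns1", ttl=333):
    """Return one TXT record per probe ID, following the format above."""
    return "\n".join(
        f'{pid} {ttl} IN TXT "this is {ns_label} responding to probe {pid}"'
        for pid in probe_ids
    )

print(zone_records([23559, 23560]))
```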
As shown in step 3, we make sure each probe queries a unique domain name, so even if probes share the same resolver, each query is guaranteed to be a cache miss at the resolver. In other words, each query should lead the resolver to query one of our authoritative servers.
After running the measurement for 1 hour, querying every 600s (almost twice the TTL of the records in our zone), we generated the final dataset shown in Table 1. As you can see, 9,119 RIPE Atlas probes were involved in this measurement, querying 6,587 unique resolvers.
Since each probe can contact multiple resolvers, we end up with 15,923 vantage points, i.e. unique probe-resolver combinations.
Our 54,115 queries led to 94,805 answers, which we use in the analysis described in the next section.
| Metric | Total |
| --- | --- |
| Unique probes | 9,119 |
| Unique resolvers | 6,587 |
| Unique probe-resolver pairs | 15,923 |
| # Queries | 54,115 |
| # Answers | 94,805 |

Table 1: Overview of the collected dataset.
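Aggregating the raw measurement results into these counts is straightforward once each answer has been decoded. The sketch below assumes each answer has already been parsed into a `(probe_id, resolver_ip, ttl)` tuple; decoding the raw RIPE Atlas answer buffers is omitted, and the function name is our own:

```python
# Sketch: aggregate decoded answers into Table 1-style counts.
# Each answer is assumed pre-parsed into (probe_id, resolver_ip, ttl).
def summarize(answers):
    return {
        "unique_probes": len({p for p, _, _ in answers}),
        "unique_resolvers": len({r for _, r, _ in answers}),
        "unique_probe_resolver_pairs": len({(p, r) for p, r, _ in answers}),
        "answers": len(answers),
    }
```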
So, given that we set the TTL for every record in our test zone to 333 seconds, the question is: how many resolvers change this TTL value, and what is the typical change, if any?
The expected TTL value in the answers is 333. However, since multiple probes can use the same resolver, we can expect some TTLs to be slightly smaller than 333, as a cached record ages before it is served again. No answer, however, should have a TTL above 333.
We divide the dataset from Table 1 into three parts:
- Normal TTL: for answers with 320 <= TTL <= 333
- Decreased TTL: for answers with TTL < 320
- Increased TTL: for answers with TTL > 333
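This three-way split can be expressed as a small classifier. A minimal sketch using the same thresholds as above (the function name and `slack` parameter are our own):

```python
# Sketch: classify an answer's TTL relative to the zone TTL of 333.
# "Normal" allows up to 13 seconds of cache aging below the original.
def classify_ttl(ttl, original=333, slack=13):
    if ttl > original:
        return "increased"   # TTL violation: larger than the zone TTL
    if ttl >= original - slack:
        return "normal"      # within expected cache aging
    return "decreased"       # TTL violation: capped below the zone TTL
```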
Table 2 shows the results. As you can see, the large majority of probes/queries/resolvers fall into the normal category, meaning their TTL deviates by at most 13 seconds from the original 333 (since multiple probes can use the same resolver). Next, we focus on the resolvers with decreased and increased TTLs, to understand why and by how much these values change.
| Metric | Total | Normal TTL | Decreased TTL | Increased TTL |
| --- | --- | --- | --- | --- |
| Unique probes | 9,119 | 8,894 | 190 (2.08%) | 274 (3.00%) |
| Unique resolvers | 6,587 | 6,480 | 130 (1.97%) | 275 (4.17%) |
| Unique probe-resolver pairs | 15,923 | 15,418 | 257 (1.61%) | 464 (2.91%) |
| # Queries | 54,115 | 52,701 | 540 (1.00%) | 1,464 (2.71%) |
| # Answers | 94,805 | 91,610 | 732 (0.77%) | 2,463 (2.60%) |

Table 2: Results broken down by TTL category.
3.1 Decreased TTL answers
As shown in Table 2, 0.77% of all valid answers in this measurement had their TTL decreased. Figure 1 below illustrates this. Two types of resolvers dominate: those that cap the TTL at around 50 seconds, and those around 250-300 seconds.
Out of the 130 resolvers that reduce the TTL, 71 reduce it to less than 50 seconds. Many of those, however, are local resolvers using private IP address ranges. Of the 71, 24 use public addresses, belonging to networks of mobile operators and research institutes.
Another 16 of the 130 resolvers (among those with non-private addresses) reduced the TTL from 333 to between 250 and 320. We found no particular pattern here: several operators from various countries behaved the same way. We also found cases in which Google quad8 resolvers reduced the TTL, but given the large number of instances in their infrastructure, these are outliers.
Figure 1: Histogram of TTL values for the decreased queries group (see Table 2).
3.2 Increased TTL answers
We have seen in Table 2 that 4.17% of the resolvers in this measurement actually increased the TTL value of our RRs.
Figure 2 below shows the empirical cumulative distribution function (ECDF) of TTL values for answers with a TTL above 333. These resolvers are particularly worrying, since they may return to their clients a RR that has already expired in its respective zone.
Figure 2: The ECDF of TTLs larger than 333
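For reference, an ECDF like the one in Figure 2 can be computed in a few lines; plotting is omitted and the function name is our own:

```python
# Sketch: empirical CDF of a sample, e.g. the TTLs above 333.
# Pairs each sorted value with the fraction of samples <= that value.
def ecdf(values):
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```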
DNS TTL violations are a controversial topic, with passionate arguments on both sides. It is publicly known that some cloud providers and CDNs override, within their networks, the original values provided by authoritative servers.
In this article, we used RIPE Atlas to measure whether this is happening in the wild. Even though only a small number of resolvers do this, it is unclear how many users are affected.
Reducing TTL values is debatable, but ultimately leads the resolver to query an authoritative server more often. Users, at least, should still be provided with the correct RR when TTL values are reduced.
Increasing TTLs, on the other hand, may be dangerous to users, since they may be served records that have already expired. Consider the case of domains that have been removed from a zone due to a phishing or malware attack: by extending the TTL of these domains, resolvers will "keep them alive" for any of their clients.
The parsed datasets from the RIPE Atlas measurements are attached in this zip file.
Appendix: RIPE Atlas probes with increased TTL values
Format: ProbeID-ResolverIP, ...