Around the end of October and beginning of November 2024, twenty-six African TLDs had a technical problem: one of their authoritative name servers served stale data. This is a tale of monitoring, anycast, and debugging.
The problem apparently occurred around 29 October 2024. But it was reported only on 8 November. Some users were experiencing DNS resolution issues with the TLD .mg (Madagascar). Others had no problem. A test with DNSviz showed that the DNSSEC signatures were expired.
Further tests showed that not all authoritative name servers served expired signatures. In fact, only one did: ns-mg.afrinic.net, managed by Afrinic. And even more tests indicated that just one instance of this anycasted name server was outdated.
RIPE Atlas testing
You can see it with RIPE Atlas probes. This test querying ns-mg.afrinic.net shows that, while most probes saw a SOA serial number of 2024110815, some saw only 2024102913. Assuming (which is reasonable in this case) that the serial number represents the date, we can see that one instance was more than a week behind. You cannot see it easily on the RIPE Atlas Web interface, but all the stale data was served by just one instance, having the NSID (Name Server IDentifier, RFC 5001) s01-ns2.pkl.
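To make the "more than a week" estimate concrete, here is a minimal sketch that reads the two serial numbers as dates. It assumes the serials follow a YYYYMMDDHH convention, which is a common practice but not something the DNS guarantees; the helper name `serial_to_datetime` is made up for this illustration.

```python
from datetime import datetime, timedelta

def serial_to_datetime(serial: int) -> datetime:
    """Interpret a SOA serial as YYYYMMDDHH (an assumption, not a DNS rule)."""
    return datetime.strptime(str(serial), "%Y%m%d%H")

# The two serial numbers seen by the RIPE Atlas probes.
fresh = serial_to_datetime(2024110815)
stale = serial_to_datetime(2024102913)

lag = fresh - stale
print(lag)  # 10 days, 2:00:00
```

Under that assumption, the stale instance was a little over ten days behind the others.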
Luckily, RIPE Atlas probes can ask for the NSID when doing DNS queries: include 'set_nsid_bit': True in the JSON sent to the API. Be careful when reviewing the RIPE Atlas measurements: two instances have the same NSID, and only one is broken.
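As an illustration, a request body for such a measurement might look like the sketch below. Only `set_nsid_bit` comes from the text above; the other field names reflect my understanding of the RIPE Atlas measurement API and should be checked against its documentation. The function name and the number of probes are arbitrary choices, and the code only builds the JSON, it does not send it.

```python
import json

def build_dns_measurement(target: str, zone: str, probes: int = 100) -> dict:
    """Build (but do not send) a one-off DNS measurement request body."""
    definition = {
        "type": "dns",
        "af": 4,
        "target": target,             # query this server directly...
        "use_probe_resolver": False,  # ...not the probe's own resolver
        "query_class": "IN",
        "query_type": "SOA",
        "query_argument": zone,
        "set_nsid_bit": True,         # ask the server for its NSID
        "description": f"SOA of {zone} at {target}, with NSID",
    }
    return {
        "definitions": [definition],
        # Worldwide probe selection; the count is an arbitrary example.
        "probes": [{"requested": probes, "type": "area", "value": "WW"}],
        "is_oneoff": True,
    }

body = build_dns_measurement("ns-mg.afrinic.net", "mg")
print(json.dumps(body, indent=2))
```

Sending this body requires an API key and costs credits, so that step is left out here.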
Locating the problem
The problem was not limited to .mg, since this very server also serves 26 TLDs (though under other names), all in Africa. They all experienced the same issue. (Here is a measurement for .ss, South Sudan.) For those using DNSSEC, this meant expired signatures. But more generally, for all of them, it meant that recent changes in the delegation of domain names were not always distributed. A hard problem to pinpoint!
Follow up
All the technical and administrative contacts of these TLDs have been notified by email. You will not be surprised to learn that two email addresses triggered a message from MAILER-DAEMON saying that these addresses do not exist.
After notification, Afrinic took the offending instance offline and notified the DNS operators on the DNS-OARC mailing list. This made the problem disappear. The DNS is very robust against complete failures of servers, but much less so when a server still works but replies with outdated data. As of today, s01-ns2.pkl is not yet back online.
Some lessons
What can we learn from this unfortunate incident? One is that DNSSEC, in addition to its main feature of guaranteeing data integrity, is useful in that it makes problems more immediately perceptible.
The other is that the use of anycast, obviously a very important technique for important DNS servers, comes with some new challenges (.com had a similar problem in 2023). Detecting and debugging problems is more difficult. The use of distributed measurement tools like RIPE Atlas is necessary.
And finally, monitoring that the name server replies is not sufficient. You also have to check the freshness of the received data.
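A freshness check can be sketched as follows, assuming the monitoring system has already collected (NSID, SOA serial) pairs from many vantage points. The function name is invented, "example-instance" is a placeholder rather than a real NSID, and serials are compared as plain integers, ignoring the serial number arithmetic of RFC 1982.

```python
from collections import defaultdict

def find_stale(observations):
    """Return {nsid: set of serials} for every serial older than the freshest."""
    observations = list(observations)
    if not observations:
        return {}
    freshest = max(serial for _, serial in observations)
    stale = defaultdict(set)
    for nsid, serial in observations:
        if serial < freshest:
            stale[nsid].add(serial)
    return dict(stale)

# Serials from the incident; "example-instance" is a made-up NSID.
observations = [
    ("s01-ns2.pkl", 2024102913),        # the broken instance
    ("s01-ns2.pkl", 2024110815),        # a healthy instance with the same NSID
    ("example-instance", 2024110815),
]

stale = find_stale(observations)
print(stale)
```

A real monitor would tolerate a short lag, since a zone update needs some time to reach every instance, and would alert only when the difference persists.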