On Thursday 16 March 2017, at around 20:00 UTC, a bug in a script caused an outage for some reverse DNS delegations registered in the RIPE Database. The effects of the bug were not immediate, but they set off a cascading failure that persisted until 18:00 UTC on Friday 17 March. This article explains what happened.
The Zonelet Exchange Mechanism
The IANA allocates address space to the RIRs in large blocks. For IPv4, this is typically done per /8 of address space. The RIR to which an address block has been allocated is also responsible for operating the corresponding reverse DNS zone. For example, 193.0.0.0/8 has been allocated to the RIPE NCC, and so the RIPE NCC must also operate the 193.in-addr.arpa reverse DNS zone.
LIRs that have address space from 193.0.0.0/8 can request delegation of their corresponding reverse DNS zones in 193.in-addr.arpa by creating domain objects in the RIPE Database. The RIPE NCC's DNS provisioning system transforms these domain objects into delegations in 193.in-addr.arpa and publishes the 193.in-addr.arpa zone to its name servers.
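The mapping from an address block to its reverse DNS zone can be sketched in a few lines of Python. This is only an illustration, not the RIPE NCC's provisioning code: the function name `reverse_zone` is hypothetical, 193.0.0.0/8 is used as the block behind 193.in-addr.arpa, and only IPv4 prefixes that fall on an octet boundary are handled.

```python
import ipaddress


def reverse_zone(prefix: str) -> str:
    """Return the in-addr.arpa zone name for an IPv4 prefix on an octet boundary.

    Hypothetical helper for illustration; classless (non-octet) prefixes
    are rejected rather than handled.
    """
    net = ipaddress.ip_network(prefix, strict=True)
    if net.version != 4 or net.prefixlen == 0 or net.prefixlen % 8 != 0:
        raise ValueError(f"unsupported prefix: {prefix}")
    # Keep only the octets covered by the prefix, then reverse their order.
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


print(reverse_zone("193.0.0.0/8"))
print(reverse_zone("193.122.0.0/16"))
```

The second call yields the zone name that appears in the zonelet example further down, 122.193.in-addr.arpa.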
However, things become complicated when some address space is transferred to another RIR. For example, 193.122.0.0/16 has been transferred to an operator in the ARIN region, and the address space is registered in the ARIN Database. This user needs to request delegation for the reverse DNS zone of their address space via the ARIN Database. When they do this, ARIN publishes a "zonelet" file, called 193-ARIN, containing the delegation information, which looks like this:
122.193.in-addr.arpa. 86400 IN NS NS4.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS1.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS3.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS2.P24.DYNECT.NET.
This zonelet is published by ARIN on an FTP server. The RIPE NCC periodically downloads this zonelet and merges the NS records from it into the 193.in-addr.arpa zone. Similarly, the RIPE NCC also downloads zonelets from AFRINIC and APNIC, and merges them into zones operated by the RIPE NCC. Finally, the RIPE NCC also produces zonelets for the other RIRs to download.
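The download-and-merge step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual provisioning system: the function names `parse_zonelet` and `merge_zonelet` are hypothetical, only lines of the form shown above (owner, TTL, IN, NS, target) are recognised, and anything else, such as the summary data at the end of a real zonelet, is simply skipped.

```python
from collections import defaultdict


def parse_zonelet(text):
    """Parse 'owner ttl IN NS target' lines into {owner: set of name servers}."""
    delegations = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        # Assumption: every delegation is a five-field NS record line.
        if len(fields) != 5:
            continue
        owner, _ttl, rr_class, rr_type, target = fields
        if rr_class.upper() != "IN" or rr_type.upper() != "NS":
            continue
        delegations[owner.lower()].add(target.lower())
    return dict(delegations)


def merge_zonelet(zone, zonelet):
    """Return a copy of `zone` with the zonelet's delegations merged in.

    Assumption: the zonelet is authoritative for the names it lists, so its
    NS set replaces any existing NS set for the same owner name.
    """
    merged = {owner: set(ns) for owner, ns in zone.items()}
    for owner, ns in zonelet.items():
        merged[owner] = set(ns)
    return merged


sample = """\
122.193.in-addr.arpa. 86400 IN NS NS4.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS1.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS3.P24.DYNECT.NET.
122.193.in-addr.arpa. 86400 IN NS NS2.P24.DYNECT.NET.
"""
zone = {"0.193.in-addr.arpa.": {"ns.example.net."}}  # made-up existing delegation
zone = merge_zonelet(zone, parse_zonelet(sample))
print(sorted(zone))
```

Note that with this kind of merge, an empty zonelet contributes no delegations at all, which is why an importer that treats the zonelet as the complete list will remove everything it previously held.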
Cause of Outage
At the RIPE NCC, all the zonelets we produce are written into files named after their parent zone. For example, we write delegations in 153.in-addr.arpa into a file called 153.in-addr.arpa-RIPE. However, when the zonelet system was created, the naming convention chosen for the IPv4 reverse zones was XXX-RIPE, where XXX is the most significant octet of a /8 block of address space. So we create a symlink from 153-RIPE to 153.in-addr.arpa-RIPE. This is all handled by a script. Unfortunately, a change made to this script on Thursday 16 March caused it to symlink files to the wrong sources. As a result, the file 153-RIPE, for example, pointed to the wrong file, which was empty. Similarly, the symlink for 153-RIPE.asc (the PGP signature of the zonelet) pointed to the signature file for the empty zonelet.
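A defensive version of the symlinking step might look like the sketch below. It is not the RIPE NCC's script: the function name `link_zonelet` is hypothetical, the source file is assumed to be named XXX.in-addr.arpa-RIPE as in the example above, and refusing to link to a missing or empty source is a design choice made here for illustration.

```python
import os


def link_zonelet(directory, octet, suffix=""):
    """Point <octet>-RIPE<suffix> at <octet>.in-addr.arpa-RIPE<suffix>.

    Hypothetical helper: refuses to create the link if the source file is
    missing or empty, and swaps the link in atomically.
    """
    source = f"{octet}.in-addr.arpa-RIPE{suffix}"
    link_path = os.path.join(directory, f"{octet}-RIPE{suffix}")
    source_path = os.path.join(directory, source)

    if not os.path.isfile(source_path) or os.path.getsize(source_path) == 0:
        raise ValueError(f"refusing to link to missing or empty file: {source_path}")

    # Create the link under a temporary name, then rename it over the old one
    # so readers never see a missing or half-updated link.
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(source, tmp_path)  # relative target, resolved inside `directory`
    os.replace(tmp_path, link_path)
    return link_path
```

Calling `link_zonelet(directory, 153)` and `link_zonelet(directory, 153, ".asc")` would cover the zonelet and its signature. A check like this would not catch a link to the wrong non-empty file, so it complements review and monitoring rather than replacing them.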
Both APNIC and ARIN downloaded empty zonelets. They both checked the PGP signatures of the files, as well as the summary data at the end of the zonelets. The signatures were valid, and the counts in the summaries were zero. Both ARIN's and APNIC's provisioning systems concluded that these delegations were to be deleted, and deleted them. As soon as we became aware of the issue, we began investigating, and with help from ARIN staff we worked out what had happened. However, it took a while to fix the script, republish the zonelets, and wait for them to be reimported by both APNIC and ARIN. This was all completed by 18:00 UTC on Friday 17 March.
- This outage was caused by human error. A bug was introduced while updating the script, and because the change was minor, it was not reviewed. We now enforce code review for all changes, even seemingly minor ones.
- There is no active monitoring of the published zonelets. We are examining ways of adding more monitoring, to ensure that an error can be caught and fixed much more quickly.
- The zonelet exchange mechanism is slow. Each of the RIRs periodically downloads the zonelets produced by the other RIRs, and then merges the data. This polling means that it takes quite a while before delegation information from another RIR appears in the appropriate parent zone. If there's an error, it can take quite a while before the correct information can be republished. We are discussing ideas with the other RIRs about introducing some kind of "push" into the system, so that delegation information can be published more speedily.
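One possible form of the monitoring mentioned above is a sanity check that compares a newly generated zonelet with the previously published one before it goes out. The sketch below is an assumption about what such a check could look like, not a description of what the RIPE NCC deployed: the function names and the 50% threshold are hypothetical.

```python
def count_ns_records(zonelet_text):
    """Count lines of the form 'owner ttl IN NS target'."""
    count = 0
    for line in zonelet_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[2].upper() == "IN" and fields[3].upper() == "NS":
            count += 1
    return count


def check_zonelet(previous_text, new_text, max_drop=0.5):
    """Return a list of problems; an empty list means the new zonelet looks sane.

    Hypothetical policy: flag a zonelet that has become empty, or one that
    lost more than `max_drop` of its NS records since the last publication.
    """
    before = count_ns_records(previous_text)
    after = count_ns_records(new_text)
    problems = []
    if before > 0 and after == 0:
        problems.append(f"zonelet is empty; previous version had {before} NS records")
    elif before > 0 and (before - after) / before > max_drop:
        problems.append(f"NS records dropped from {before} to {after}")
    return problems
```

A publishing script could refuse to update the public file and alert an operator whenever the returned list is non-empty. The same check could equally run on the importing side before any delegations are deleted.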
See also the document describing the inter-RIR zonelet exchange on the RIPE NCC FTP site.