Last week there were several problems with the RIPE NCC's reverse DNS (rDNS) service. This article is a first report about the events. It is not intended to analyse the causes or make detailed recommendations for action. We briefly describe the systems involved and the timeline of events, and then list a number of immediate actions we have taken to improve our systems and procedures. Over the coming weeks, we will analyse the causes of these events and take the appropriate actions. We will conclude this process with a short report.
rDNS is the part of the DNS that translates IP addresses to domain names, the inverse function of the regular DNS. rDNS can be used for diagnostic, logging and verification purposes. It is not typically used in a way that makes it critical for browsing the web, but it is used frequently in the process of delivering electronic mail.
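As a small illustration of the mapping described above, the following sketch uses Python's standard `ipaddress` module to derive the name under which the rDNS record for an address is looked up. The function name `reverse_name` is our own, and the addresses are documentation examples, not addresses involved in these events.

```python
import ipaddress

def reverse_name(address: str) -> str:
    """Return the rDNS owner name (under in-addr.arpa or ip6.arpa) for an IP address."""
    return ipaddress.ip_address(address).reverse_pointer

# Documentation addresses, used purely as examples.
print(reverse_name("192.0.2.1"))    # 1.2.0.192.in-addr.arpa
print(reverse_name("2001:db8::1"))  # 32 reversed hex digits followed by ip6.arpa
```

For IPv4 the octets are reversed; for IPv6 each hexadecimal digit (nibble) of the fully expanded address becomes its own label.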
rDNS look-ups follow the IP address hierarchy. Therefore, the RIRs have a major role in providing the rDNS service. The RIPE NCC publishes rDNS information from two sources. The majority of the rDNS data comes from the IP address registry stored in the RIPE Database; this covers all address space allocated via the RIPE NCC since around the mid-1990s. Address space users can update this information in the RIPE Database, and the rDNS provisioning system then translates it into rDNS zone files. A small amount of the rDNS data, mostly pertaining to address space distributed before the mid-1990s, comes from "zonelets" exchanged among the RIRs. The rDNS provisioning system combines all that information, passes it through DNSSEC signers and transfers it to the authoritative name servers.
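To show how the zone names in the time-line relate to address blocks, here is a minimal sketch that maps a prefix to its reverse zone name. It assumes the prefix falls on an octet boundary (IPv4) or a nibble boundary (IPv6); other prefix lengths, such as those handled with RFC 2317 CNAME records, are rejected. The function name `reverse_zone` is hypothetical and is not part of the RIPE NCC provisioning system.

```python
import ipaddress

def reverse_zone(prefix: str) -> str:
    """Return the reverse zone name for a prefix on an octet (IPv4) or nibble (IPv6) boundary."""
    net = ipaddress.ip_network(prefix)
    if net.version == 4:
        step, suffix = 8, "in-addr.arpa"
        labels = str(net.network_address).split(".")
    else:
        step, suffix = 4, "ip6.arpa"
        labels = list(net.network_address.exploded.replace(":", ""))
    if net.prefixlen == 0 or net.prefixlen % step != 0:
        raise ValueError(f"{prefix} is not on a {step}-bit boundary")
    # Keep only the labels covered by the prefix, then reverse their order.
    kept = labels[: net.prefixlen // step]
    return ".".join(reversed(kept)) + "." + suffix

print(reverse_zone("185.0.0.0/8"))     # 185.in-addr.arpa
print(reverse_zone("2001:4000::/24"))  # 0.4.1.0.0.2.ip6.arpa
```

Each additional octet or nibble of prefix length adds one label on the left, which is why delegations nest in the same way as the address hierarchy.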
Please see below a high-level overview of the system:
Figure 1: High-level overview of reverse DNS provisioning at the RIPE NCC
The table below shows the series of events. A PDF version of the table is also available.
|Time (UTC)||Events||Impact Assessment|
|Wed, 13 June|
|13:30||We discover that several zone files are missing from the DNS provisioning system. [The cause is still unknown and under investigation. A circumstantial factor is a routine BIND update that morning.]||No impact on DNS reverse operations, but zone updates broken for delegations in parent zones: 0.4.1.0.0.2.ip6.arpa, 185.in-addr.arpa, 220.127.116.11.0.2.ip6.arpa, 18.104.22.168.2.ip6.arpa, 22.214.171.124.0.2.ip6.arpa, 126.96.36.199.0.2.ip6.arpa, 188.8.131.52.0.2.ip6.arpa, 184.108.40.206.0.2.ip6.arpa, 220.127.116.11.0.2.ip6.arpa, a.0.1.0.0.2.ip6.arpa, a.18.104.22.168.2.ip6.arpa, a.22.214.171.124.2.ip6.arpa, b.0.1.0.0.2.ip6.arpa, b.126.96.36.199.2.ip6.arpa, b.188.8.131.52.2.ip6.arpa (a total of 425 delegations; 185/8 is in de-bogonising: no operational impact)|
|13:45||Decision to reload zone files from backup storage|
|14:00||Discovery that backups are not available|
|14:15||Decision to cold start the provisioning system. Because the state of the remaining zone files available was unclear, we decide to rebuild all zone files from scratch.|
|Start DNS provisioning system from scratch (empty zone files). By mistake we do not disable transfers to the authoritative servers. Empty zones for the entire reverse tree start propagating.||Impact on whole of reverse DNS tree, limited initially by caching|
|16:00||Reports of reverse tree breakage start to come in|
|16:00 - 20:00||Investigation of problems and considering possible workarounds for slow provisioning system cold start|
|20:00||Found incidental backup of zone files with state of 13/6/2012 13:30 UTC|
|20:15||Stopped DNS provisioning system. Reloaded DNS provisioning system with data from backup files. ERX-related zones are missing from these backups, as are the above-mentioned ip6.arpa delegations.||Missing: 0.4.1.0.0.2.ip6.arpa, 185.in-addr.arpa, 184.108.40.206.0.2.ip6.arpa, 220.127.116.11.2.ip6.arpa, 18.104.22.168.0.2.ip6.arpa, 22.214.171.124.0.2.ip6.arpa, 126.96.36.199.0.2.ip6.arpa, 188.8.131.52.0.2.ip6.arpa, 184.108.40.206.0.2.ip6.arpa, a.0.1.0.0.2.ip6.arpa, a.220.127.116.11.2.ip6.arpa, a.18.104.22.168.2.ip6.arpa, b.0.1.0.0.2.ip6.arpa, b.22.214.171.124.2.ip6.arpa, b.126.96.36.199.2.ip6.arpa (together containing a total of 425 delegations)|
|Authoritative servers reloading. However, a race condition in the provisioning system causes the zone serial numbers for two zones to be incorrectly updated. As a result, two large zones (212.in-addr.arpa and 213.in-addr.arpa) propagate in an incomplete form. This causes severe breakage for these zones. In total, approx. 6% of the reverse delegations are affected during this period.||Restored zone files start propagating for all but the parent zones mentioned below (state of 13/6/2012 13:30 UTC). Due to negative caching, impact on restored zones may have been prolonged. Details of impacted zones below.|
|Affected 6.1% of total reverse DNS delegations. Parent zones 212.in-addr.arpa and 213.in-addr.arpa are distributed incompletely. Affected: 43% of delegations in 212.in-addr.arpa, 54% of delegations in 213.in-addr.arpa. In total 33,996 delegations affected in these parent zones. ERX zones: ERX import zones: 4,426 delegations across 22 zones absent during this period; ERX exports: updates delayed. The above-mentioned missing zones in ip6.arpa (475 delegations total) are still lacking during this period. RFC 2317 delegations: a total of 31 RFC 2317 delegations are lacking the associated CNAME records at this time.|
|20:30||Restarted DNS provisioning system, starting with state of 13:30 UTC. The DNS provisioning system is still running at an unexpectedly low insertion rate.||At this time we believed the remaining impact to be limited to a small number of ERX imported zones and a limited number of ip6.arpa zones. The problems with 212.in-addr.arpa and 213.in-addr.arpa went unnoticed until the early morning of Thursday 14 June.|
|Thur, 14 June|
|7:00||First reports received about remaining breakage|
|7:00 - 10:00||Investigations of reported remaining issues|
|10:45||We discover that 212.in-addr.arpa and 213.in-addr.arpa are incomplete due to the above-mentioned race condition. After updating serial numbers, zones 212.in-addr.arpa and 213.in-addr.arpa start propagating properly again.||Zones 212.in-addr.arpa and 213.in-addr.arpa, complete and up to date to the then-current state, start propagating again. ERX import zones (4426 delegations), ip6.arpa (475 delegations) and RFC 2317 zones (31 delegations) still not restored|
|16:00||Based on RIPE DB dump of 14/6/2012 00:00, all regular zones are restored, incl. ip6.arpa zones|
|16:00 - 16:30||Processing of updates for period after 00:00 14/6/2012|
|16:30||All updates processed for all zones, with the exception of ERX zones||All regular zones restored and current. ERX import zones (4426 delegations) and RFC 2317 zones (31 delegations) still not restored|
|19:30||Restart processing of ERX delegations (much slower than anticipated)|
|20:00||Majority of ERX zones handled||ERX import zones (2 delegations) and RFC 2317 zones still not restored|
|Fri, 15 June|
|7:30||All regular zones restored, including the last remaining 2 ERX zones||All zones, incl. ERX imports, fully functional, with the exception of 31 RFC 2317 delegations that had not yet been discovered to be missing their CNAME records.|
|Mon, 18 June|
|11:00||Discovered error with 31 delegations lacking RFC 2317 CNAME records|
|13:45||Restored remaining RFC 2317 delegations|
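The serial-number problem in the time-line above is easier to follow with standard zone transfer behaviour in mind: a secondary name server fetches a new copy of a zone only when the primary's SOA serial is newer than its own, compared using serial number arithmetic (RFC 1982). The sketch below illustrates that check only. It is not the RIPE NCC's provisioning code, the function names and serial values are hypothetical, and the exact nature of the race condition is not described in this report.

```python
SERIAL_MODULUS = 1 << 32  # SOA serials are unsigned 32-bit values
HALF_RANGE = 1 << 31

def serial_is_newer(candidate: int, current: int) -> bool:
    """RFC 1982 comparison: True if candidate is 'greater than' current."""
    diff = (candidate - current) % SERIAL_MODULUS
    return 0 < diff < HALF_RANGE

def secondary_should_transfer(primary_serial: int, secondary_serial: int) -> bool:
    """A secondary refreshes its copy only when the primary's serial is newer."""
    return serial_is_newer(primary_serial, secondary_serial)

# Hypothetical serials: the secondary already holds an incomplete zone at 2012061305.
print(secondary_should_transfer(2012061305, 2012061305))  # False: same serial, no transfer
print(secondary_should_transfer(2012061306, 2012061305))  # True: a higher serial triggers a transfer
```

Under this rule, a complete zone published with a serial that secondaries do not consider newer is never picked up, which is consistent with the zones propagating properly again once their serial numbers were updated.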
Immediate Actions Taken
We acknowledge that, in this instance, our operational performance was not up to the standards the RIPE community expects from the RIPE NCC, and we apologise for the considerable inconvenience caused. We have taken the following immediate steps, and we will take further actions once the events have been fully analysed:
- We have started ad-hoc backups of the BIND zone files
- We have re-emphasised the four-eyes principle in case of operational irregularities
- We have clarified the service announcement procedures