Find out how we used RIPE Atlas to test and fix reachability to 1.1.1.1
Recently we announced our fast, privacy-centric DNS resolver 1.1.1.1, supported by our global network. As you can see, 1.1.1.1 is very easy to remember, which is both a blessing and a curse. In the time leading up to the announcement of the resolver service we began testing reachability to 1.1.1.1, primarily using the RIPE Atlas probes. The RIPE Atlas project is an extensive collection of small monitoring devices hosted by the public around the world. Currently there are over 10,000 active probes hosted in over 3,000 networks, giving great vantage points for testing. We found large numbers of probes unable to query 1.1.1.1, but able to query 1.0.0.1 successfully in almost all cases. 1.0.0.1 is the secondary address we have assigned for the resolver, so that clients who are unable to reach 1.1.1.1 can still make DNS queries.
This blog focuses on IPv4. We provide four IPs (two for each IP address family) in order to provide a path toward the DNS resolver independent of IPv4 or IPv6 reachability.
1.0.0.0/8 was assigned to APNIC in 2010; before then it was unassigned space.
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
inetnum: 1.0.0.0 - 1.255.255.255
organisation: APNIC
status: ALLOCATED
whois: whois.apnic.net
changed: 2010-01
source: IANA
https://www.iana.org/whois?q=1.0.0.0%2F8
Unassigned, however, is not the same as reserved for private use!
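The distinction can be checked with Python's standard `ipaddress` module, which carries the list of reserved ranges. This is only a minimal sketch: the helper name `is_private_use` is ours, and `is_private` covers a few other special-purpose ranges in addition to the RFC 1918 blocks.

```python
import ipaddress


def is_private_use(prefix: str) -> bool:
    """Return True if the whole prefix lies in private or special-purpose space."""
    return ipaddress.ip_network(prefix).is_private


# 10.0.0.0/8 is reserved for private use (RFC 1918).
print("10.0.0.0/8 private:", is_private_use("10.0.0.0/8"))

# 1.0.0.0/8, the block containing 1.1.1.1, is ordinary public address space.
print("1.0.0.0/8 private:", is_private_use("1.0.0.0/8"))
print("1.1.1.1 global:", ipaddress.ip_address("1.1.1.1").is_global)
```

A device that treats 1.0.0.0/8 as if it were private is therefore using addresses that belong to someone else.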
What we found
To put it simply, 1.1.1.1 was BROKEN! The good news is that for most users 1.1.1.1 is now reachable. We’ve worked hard to ensure that issues get resolved and continue to contact operators to resolve issues quickly. We’re confident we can get everything cleaned up, but this is a stark reminder that you shouldn’t hijack IP addresses not assigned to you. We found that over 1,000 probes out of just over 10,000 were unable to make DNS queries to 1.1.1.1 successfully. Some of this was due to single large networks having reachability issues; for example, a large operator in Germany has nearly 350 probes connected, all of them failing. The methodology for testing was very simple:
- Run a DNS lookup measurement towards 1.1.1.1
- Find probes where the lookup fails
- Run a traceroute with affected probes
- Analyse the result
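The first step can be approximated from a single host without Atlas. The following is a minimal sketch using only the Python standard library, not the measurement that Atlas itself runs: the function names `build_query` and `resolver_answers`, the query name `example.com`, the query ID and the two-second timeout are all our own assumptions.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for the A record of `name`."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server: str, name: str = "example.com",
                     timeout: float = 2.0) -> bool:
    """Return True if `server` sends back a DNS response to our query."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable errors
            return False
    # A matching transaction ID with the QR (response) bit set counts as a reply.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)
```

Calling `resolver_answers("1.1.1.1")` and `resolver_answers("1.0.0.1")` gives a rough yes/no per address. A reply only shows that something answered on that address; a traceroute is still needed to tell whether it was the real resolver.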
The results were quite mixed, but fell into three main causes:
- Built-in 1.1.1.1: ISP routers using 1.1.1.1 as an internal IP address, preventing queries from reaching the real 1.1.1.1
- Blackholing 1.1.1.1: ISPs statically configuring a route for 1.1.1.1 inside their networks, preventing traffic from leaving their network, either by routing it internally or by sending the packets to null0
- Filtering 1.1.1.1: ISPs dropping packets on ingress to or egress from their network when sourced from or destined to 1.1.1.1
Of these three main causes, the majority of cases were the first, the second, or both! Several ISPs even had route loops internally, where they were advertising 1.1.1.1 inside their network but had no actual path to it, so packets looped around and around.
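To make the analysis step concrete, here is a rough sketch of how a traceroute towards 1.1.1.1 could be sorted into these buckets. It is an illustration under our own assumptions, not the tooling we used: the function name `classify_trace`, the input shape (one responding address per hop, `None` for a timeout) and the labels are all hypothetical, and a traceroute alone cannot tell blackholing from filtering.

```python
from typing import List, Optional

TARGET = "1.1.1.1"


def classify_trace(hops: List[Optional[str]], target: str = TARGET) -> str:
    """Guess why a traceroute towards `target` looks the way it does.

    `hops` holds the responding address of each hop in order, or None
    where the hop timed out. These are heuristics, not proof.
    """
    if hops and hops[0] == target:
        # The "destination" answers at the very first hop, so the address
        # most likely lives on the CPE or first router itself.
        return "built-in"

    seen = set()
    for hop in hops:
        if hop is None:
            continue
        if hop == target:
            return "reachable"
        if hop in seen:
            # The same router shows up twice before the target: a loop.
            return "route-loop"
        seen.add(hop)

    # The trace never reached the target and never looped.
    return "blackholed-or-filtered"


print(classify_trace(["1.1.1.1"]))
print(classify_trace(["192.168.1.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"]))
print(classify_trace(["192.168.1.1", "10.0.0.1", None, None]))
```

A "reachable" verdict only means the target answered after at least one other hop; an internal router holding the address further inside the ISP would look the same, so latency and hop count still deserve a manual look.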
Time to get fixing
Once we had narrowed it down for each group of probes, we began contacting ISPs for clarification on what was happening. Several networks responded very quickly, reporting they had removed an internal route to 1.1.1.1, which in most cases was the beginning and end of the matter. Plenty of networks took their time to respond, but in the end did the same, removing internal routes left there for legacy testing reasons.
All of those fixes were great to make, but most were quite uninteresting. More interesting were the cases of the first cause: CPEs (customer premises equipment), also known as home routers, gateways and wireless access points. With help from the folks at Sonic we were quickly able to identify that the Pace 5268, a common xDSL modem deployed primarily in the United States (including wide usage on AT&T), uses 1.1.1.0/29 for internal communication. We requested comment from AT&T’s NOC, but have not heard anything from them. We did however receive a response from them via social media:
Independent investigation confirms the findings:
The same finding was made with the D-Link DMG-6661, which was reported to us by a user from Brazil connected to Vivo FTTH.
Another user in Argentina connected to Telefónica found the issue on the Mitrastar GPT-2541GNAC.
It appears this CPE has been deployed in many of Telefónica's networks internationally.
We noticed similar behaviour on a large portion of probes connected to Orange France. We contacted them and received a swift response that the CPE team was investigating the issue. After we provided more details, they came back to us with a statement.
We have escalated your alert inside ORANGE France.
Our CPE team has analyzed the issue which is now well understood. This problem only impacts a subset of our CPEs. The commitment of the fix is currently on going and a deployment on our CPEs will follow.
In case of complaints notified by our customers in the meantime we have prepared a communication to inform them that we are currently fixing the issue.
The CPE in question is the Livebox. Although it's not clear which versions are affected, Orange should resolve the issue across all affected devices. Users in Poland reported the same issues as users in France, likely due to Orange deploying the same models across multiple networks.
By far my favourite response was from the friendly folks at Telenor:
I have corrected all routers in our network now that had an awful old solution that is now obsolete. Thank you Cloudflare for the help to get it done!
Obsolescence is inevitable, but the desire to speedily fix such occurrences is great to see.
These are by no means all the devices that have issues, only some of the more widely deployed ones. The current list we have of affected devices is:
- Pace (Arris) 5268
- D-Link DMG-6661
- Technicolor C2100 Series
- Mitrastar GPT-2541GNAC
- Askey RTF3507VW-N1
- Calix GigaCenter
- Nomadix (model(s) unknown)
- Xerox Phaser multi-function printer
- See below :)
If you have a device that is affected, please let us know in the comments. A telltale sign is very low latency with only one hop to 1.1.1.1:
Traceroute to 1.1.1.1 (1.1.1.1), 48 byte packets
 1  1.1.1.1  1dot1dot1dot1.cloudflare-dns.com  AS13335  8.301ms  1.879ms  1.836ms
Who else has been mis-using 1.1.1.1 more than others?
Using the RIPE Atlas probes gives excellent visibility into residential and business internet connections. However, they’re connected via a cable, which rules out another use case: WiFi access points. After very little research we came across Cisco mis-using 1.1.1.1; a quick search for “cisco 1.1.1.1” brought up numerous articles where Cisco are squatting on 1.1.1.1 for their Wireless LAN Controllers (WLC). It’s unclear if Cisco officially regards 1.0.0.0/8 as bogon space, but there are lots of examples on their community websites giving example bogon lists that include the /8. It mostly seems to be used for the captive portal when authenticating to the wireless access point, often found in hotels, cafés and other public WiFi hotspot locations.
Here are some interesting statistics from before we started contacting operators and after we have fixed many issues. It’s staggering to see how fixing some key networks increased availability by almost 20% in Europe & North America!
We began testing the availability of 1.1.1.1 on the 23rd of March; in Europe and North America it was only around 91%.
1.1.1.1 availability from Europe and North America, 23rd of March
By the 3rd of April, our work cleaning up the space had pushed the availability up to 97%.
1.1.1.1 availability from Europe and North America, 3rd of April
For the rest of the world, excluding Europe and North America, availability to 1.1.1.1 was only 73% on the 23rd of March.
1.1.1.1 availability for the World (Europe and North America excluded), 23rd of March
By the 3rd of April we had made a tonne of progress and managed to clean up enough of the bad routing that availability was up to 85%.
1.1.1.1 availability for the World (Europe and North America excluded), 3rd of April
We are continuing to work with ISPs and CPE manufacturers to clean up bad routing globally. Our goal is for 1.1.1.1 to be properly routed and available for 100% of Internet users.
Above images from Catchpoint analytics
The last public analysis was done in 2010 by RIPE and APNIC. At the time, 1.1.1.0/24 was 100 to 200Mb/s of traffic, most of it being audio traffic. In March, when Cloudflare announced 1.1.1.0/24 and 1.0.0.0/24, ~10Gbps of unsolicited background traffic appeared on our interfaces.
The most targeted IPs were 220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199. When searching these IPs, we notice it is usually misconfiguration or hardcoded IPs.
The destination port is usually 80/443, but also other variants of HTTP ports (8000, 8080, et al.) using UDP and TCP, seemingly trying to set up a proxy connection. Some of the traffic was also DHCP/BOOTP, iperf and syslog.
Some IPs are also the target of DDoS attacks (even before we announced the new service publicly). Analysing by source port we saw NTP and memcached, usually reaching 5Gbps for a few minutes. The short duration of the attacks suggests the address could be a hardcoded IP that a botnet uses before it starts sending traffic to a specific target.
We also noticed daily patterns where 4 IPs receive the same amount of traffic (±50Mbps).
All of this bad traffic is unrelated to DNS; it is simply unsolicited background traffic.
It was clear from the start that we’d have our work cut out, especially with CPE vendors, where a firmware update would be required. What was impressive was the willingness of operators to collaborate with us to clean up the legacy misconfiguration. It was clear that 1.1.1.1 needed a lot of cleaning in order to be globally accessible. Six months before the 1st of April release date, we decided to commit the network and support resources to that task. Now that 1.1.1.1 is live, we’re thankful to all the networks and hardware companies who have assisted us in this effort. We’re not done, nor are others.
The RIPE Atlas project was immensely useful in testing reachability from as many networks around the world as possible. If you’d like to help the project, please consider hosting a probe. Some networks are not yet covered by even one probe; you can see if your ISP has a probe here, sorted by country.
Particular thanks to the following operators who were responsive and helped clean up issues quickly.
- LG Telecom
- Liquid Telecom
- Telecom Italia
- Turk Telekom
We still have work to do contacting operators that we see issues with. In the meantime you should be able to use our second IP address, 1.0.0.1, which has far fewer issues. Don't forget both of our IPv6 addresses too: 2606:4700:4700::1001 and 2606:4700:4700::1111.
Do you still have reachability issues to 1.1.1.1? You can find more information at our community forum. We’d also recommend reporting such issues to your ISP. They may already be aware of the issue, or they may need you to report it before they start investigating. Either way, making them aware is especially helpful, as operators are not always receptive to reports from external parties.
This was originally posted on the Cloudflare blog.