Many network operators have completed the rollout of RPKI Route Origin Validation and others are in the process of doing so. In this post I want to shed some light on a common misconception, the problem it causes, and how to remedy it.
RPKI Route Origin Validation (ROV) involves a number of components. In most cases, these include:
- An RPKI validator, which periodically synchronises with the RPKI repository system and produces VRPs (Validated ROA Payloads)
- An RTR server, which picks up those VRPs and distributes them to BGP routers via RTR (the RPKI to Router protocol)
- BGP routers, which use those VRPs in their routing decisions (by dropping RPKI-invalids)
A BGP router decides whether a prefix received via BGP is (RPKI-) Invalid based on the VRPs it received from the RTR server. If the RTR server is unavailable and the BGP router has no VRPs in its cache to compare against, it will consider all BGP prefixes to be "NotFound", just as if no covering ROAs were present in the RPKI repository system. No prefixes are discarded in this case, so we "only" lose the benefit of ROV, which is acceptable. This configuration can therefore be considered fail-safe.
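The decision described above can be sketched in a few lines of Python. This is a minimal illustration of the Valid / Invalid / NotFound classification, not the code of any router implementation; the `VRP` tuple, the `validate` function and the example prefixes and AS numbers are made up for this sketch.

```python
from ipaddress import ip_network
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Hypothetical container for one Validated ROA Payload."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Classify a BGP route as "Valid", "Invalid" or "NotFound"."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # A VRP covers the route if the route lies inside the VRP prefix.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP matches if the length and origin AS also agree.
        if route.prefixlen <= vrp.max_length and origin_asn == vrp.asn:
            return "Valid"
    # Covered but never matched: Invalid. Not covered at all: NotFound.
    return "Invalid" if covered else "NotFound"


vrps = [VRP("192.0.2.0/24", 24, 64500)]
print(validate("192.0.2.0/24", 64500, vrps))  # matching VRP
print(validate("192.0.2.0/24", 64501, vrps))  # covered, wrong origin AS
print(validate("192.0.2.0/24", 64501, []))    # empty cache: nothing covers
```

With an empty VRP set the loop body never runs, so every route ends up as NotFound. That is the fail-safe case. The sketch also shows why stale data is different: an outdated VRP still covers routes and can still turn them Invalid.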
A Common Misconception
Because we still fail safely (falling back to NotFound) even when all RTR servers die simultaneously, a common misconception is that the entire software stack is fail-safe and that no harm can be done when part of it fails. A network operator may then arrive at the erroneous conclusion that neither redundancy nor monitoring is really required (or a priority). Unfortunately, this is not true, and other failure scenarios in the software stack have to be considered.
Every part of the software stack can fail for whatever reason, including crashes, hangs, other bugs, out-of-memory conditions and user error. When an RTR server doesn't get fresh data anymore, it will keep serving obsolete VRPs in most cases. An RTR server could also serve obsolete content due to a bug. If the operator does not become aware of this, production BGP routers will keep making important routing decisions based on stale data. This can go unnoticed for a long time and troubleshooting can be complex.
This is not a far-fetched scenario. Stale VRPs on routers in production networks have already occurred with multiple RTR server implementations (see "GoRTR serves stale VRPs" and "RIPE RPKI-Validator-3 serves stale VRPs").
A number of RTR implementations run in the same process as the RPKI validator itself, which improves some of the failure scenarios. Other RTR implementations are specifically designed to keep serving stale data when no new input is available, for the sake of availability (and at the expense of everything else). I believe this is not a good choice: combined with a lack of monitoring (a need that none of the documentation makes sufficiently clear), it easily leads to the problems explained above. None of the RTR server implementation choices change the fact that every RTR server should be monitored for stale VRPs.
To monitor the health of an RTR server, one proposal is to periodically check the RTR server for serial changes and generate alerts when the serial becomes stale.
A simple implementation of this is rtrcheck, a shell script that compares the serial number from the current rtrdump run with the one from the previous run. If the serial number did not change, the script fails (it prints to stderr and returns a non-zero exit code). It can be used in Nagios (the exit codes are designed for that) or in a simple cronjob. It could also be combined with external services like healthchecks.io.
See the rtrcheck GitHub project for more details.
rtrcheck requires jq and rtrdump, as it compares the outputs of those tools. A better solution would be a full Python implementation (without dependencies), which would not require additional installation steps and could be used on many platforms by simply downloading a single file. But for now, only rtrcheck, which requires rtrdump and jq, is available.
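To illustrate the idea of the serial comparison only, here is a minimal Python sketch. It is not rtrcheck and not the dependency-free implementation wished for above: it does not speak RTR itself and assumes the current serial has already been obtained by some other means (for example, extracted from rtrdump output). The function name `nagios_status` and the state file are hypothetical, and the return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL), which may differ from what rtrcheck actually returns.

```python
from pathlib import Path
from typing import Optional, Tuple


def nagios_status(current_serial: int, state_file: Path) -> Tuple[int, str]:
    """Compare the current RTR serial with the one stored by the previous run.

    Returns a (exit_code, message) pair using Nagios-style exit codes.
    """
    previous: Optional[int] = None
    if state_file.exists():
        text = state_file.read_text().strip()
        if text.isdigit():
            previous = int(text)

    # Remember the current serial for the next run.
    state_file.write_text(f"{current_serial}\n")

    if previous is None:
        return 0, f"OK: first run, recorded serial {current_serial}"
    if previous == current_serial:
        # Assumption: an unchanged serial between two runs means stale VRPs.
        return 2, f"CRITICAL: RTR serial still {current_serial}"
    return 0, f"OK: RTR serial changed from {previous} to {current_serial}"
```

A wrapper script would print the message and exit with the returned code. The interval between runs has to be long enough for the serial to normally change in between; otherwise a healthy server would trigger false alerts.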
Is it Enough?
Monitoring the RTR server for a changing RTR serial is not enough though. Syslog messages, traps and status from RPKI validators (as well as validation runtime), RTR servers and BGP routers should be monitored as well for operationally relevant errors.
Installing an RPKI validator and an RTR server (or a daemon which handles both) - even in a redundant setup - and connecting them to BGP production routers does not suffice for a robust ROV deployment. Every single RTR server needs to be monitored, and alarms need to be raised when an RTR server is serving stale VRPs. The impact of stale VRPs in production networks grows the longer they go unnoticed.