Roland Bless

How To Never Lose Control Over Your Network


The complexity of modern networks is increasing, as is the challenge of network management and control. Solutions that rely on AI only seem to exacerbate the issue, and today even well-managed large networks are seeing the consequences. This article presents KIRA, a zero-touch control plane connectivity solution for mastering increasing complexity.


Networks and networked services are becoming increasingly complex. Alongside this development, network management, control, and operations now depend on many interdependent services. For example, flow monitoring data is stored in databases and made available to management components that then interact with the network. Thus, networks are operated by a composition and chain of several services. If one of these services fails, it may cause others to fail as well [1], complicating the analysis of the root cause of the failure [2]. The situation is exacerbated by the growing number of networked devices that require control and management. Current efforts to integrate AI-based mechanisms and Network Digital Twins add even more complexity. These efforts will result in multiple intertwined control loops at different levels, requiring extensive monitoring data to be processed.

Notable outages of large providers such as Meta [3], Amazon [4], Cloudflare [5], Rogers [6], or KDDI [7] indicate that even well-managed network infrastructures may experience failures that are hard to recover from due to the complexity of controlling the network infrastructure and intricate interdependencies [1].

In particular, the outages at Meta, Cloudflare, and Rogers were caused by (mostly BGP) configuration errors. At Meta and Rogers, the resulting failures prevented network operators from accessing their control network, which kept them from diagnosing the root cause and fixing the failure. Regaining control over the network is then a tedious process, potentially requiring direct on-site access to the devices in order to revert the erroneous changes [3].

These examples illustrate the problem and motivate the need for resilient control and management solutions that avoid manual configuration and are able to adapt to changing conditions and self-heal. Several approaches such as Autonomic Networking and Management [8,9], Zero Touch Management [10], and Self-Driving Management [11] strive for more autonomic (i.e., self-managing [8]) solutions. However, they assume that control plane connectivity is available.

Control plane connectivity

There are basically two ways to provide control plane connectivity: out-of-band and in-band. In large and complex networks, separate out-of-band control plane networks (CPNs), which need their own devices, setup, and configuration, are nearly impractical. Out-of-band CPNs need to be "highly available, easy to manage and maintain, and cost effective" [12], but come with the burden of installing and operating two distinct networks. An in-band CPN uses the same links for control as for transporting the data packets and is cheaper, but comes at the cost of potential circular dependencies on connectivity [13]. Both CPN variants need a connectivity solution, which in larger networks often means a routing solution (smaller networks may simply use a link-layer solution). Orion [12] and EBB [13] use a hybrid CPN approach, i.e., a mixture of in-band and out-of-band CPNs. EBB uses Open/R as a fallback connectivity solution. In-band CPNs nevertheless require prioritising traffic of the routing protocol that provides the connectivity. One additional problem is scalability. For example, many existing routing solutions require the introduction of areas for scalability; e.g., Open/R will probably not work without them beyond 10,000 nodes. However, areas need to be configured before the routing protocol can establish CPN connectivity, thereby introducing a cyclic dependency and excluding zero-touch solutions.

This article proposes introducing a resilient invariant for control and network management that has no configuration dependencies (zero-touch) and provides control plane connectivity unconditionally (the underlying connectivity must be working to some extent though). Once deployed, this solution would make network management and control more robust, because it always offers the possibility to reset devices, their configurations, and services to a well-known state in case of failures. It therefore serves as a connectivity invariant to bootstrap networked resources and services as well as to recover from failures. Today's and future networks are possibly too complex to preclude configuration mistakes and resulting failures. Tools to audit configuration changes are used, but have also been shown to fail in practice [3,5].

It is probably time to start a transition to resilient network management and control that allows failures to happen but is able to recover from them reliably by using the control plane connectivity invariant that is only self-dependent.

KIRA as invariant for control plane connectivity

KIRA [14,15] is a scalable zero-touch routing architecture that provides IPv6 connectivity without any manual configuration across all different kinds of topologies. Zero-touch means not only the absence of manual configuration, but also adaptivity: KIRA adapts automatically to different underlay topologies and to link or node failures in a self-organising manner. It is ID-based, i.e., network resources keep their address even while changing their connectivity in the topology, e.g., by moving across the topology or becoming multi-homed. It provides self-generated addresses (currently using a 16-bit ULA prefix and a randomly generated 112-bit NodeID) and therefore does not need any other address assignment mechanism for building its connectivity.
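As a rough illustration of self-generated addressing, the following sketch composes an IPv6 address from a 16-bit prefix and a random 112-bit NodeID. The concrete prefix value `0xFD00` and the function names are assumptions made for this example only; the Internet-Draft defines the actual prefix and encoding.

```python
import ipaddress
import secrets

NODE_ID_BITS = 112
# Assumption for illustration: a 16-bit prefix inside the ULA range fc00::/7.
ASSUMED_PREFIX = 0xFD00


def generate_node_id() -> int:
    """Return a random 112-bit NodeID (hypothetical helper)."""
    return secrets.randbits(NODE_ID_BITS)


def node_address(node_id: int, prefix: int = ASSUMED_PREFIX) -> ipaddress.IPv6Address:
    """Place the 16-bit prefix in front of the 112-bit NodeID."""
    if not 0 <= node_id < (1 << NODE_ID_BITS):
        raise ValueError("NodeID must fit into 112 bits")
    if not 0 <= prefix < (1 << 16):
        raise ValueError("prefix must fit into 16 bits")
    return ipaddress.IPv6Address((prefix << NODE_ID_BITS) | node_id)


if __name__ == "__main__":
    print(node_address(generate_node_id()))
```

Because the address depends only on the NodeID, it stays the same when the node moves within the topology.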

KIRA builds a control plane fabric on top of an underlying (usually link-layer) topology as illustrated in Figure 1. Control and management entities can exert control over their resources on top of this connectivity, e.g., by creating control connections and sending commands to the resources or gathering monitoring data and so on.

Figure 1: KIRA constructs a Control Plane Fabric in a Zero-touch Manner

Design and features

KIRA consists of a two-tier architecture, shown in Figure 2. The Routing Tier consists of the ID-based routing protocol R2/Kad, which is based on the Kademlia peer-to-peer overlay approach. It uses XOR as the distance metric between NodeIDs and employs path vectors and source routing. Its routing tables grow only with O(log(n)), where n is the number of nodes in the network. It therefore scales very well, but as a trade-off routes may incur some stretch (i.e., they are longer than the shortest path). However, the average stretch is acceptable for general control plane traffic. Moreover, a route to a contact in the routing table of a node converges to a shortest-path route. In short, KIRA prioritises connectivity over route efficiency. KIRA includes a scheme for path rediscovery in case links or nodes fail. More details of how R2/Kad works can be found in the paper [15] and the Internet-Draft [14].
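The XOR metric can be sketched in a few lines. This is a generic Kademlia-style illustration, not the R2/Kad implementation: the function names are hypothetical, and the bucket layout is the textbook Kademlia one, which keeps the number of buckets bounded by the ID length.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance between two NodeIDs."""
    return a ^ b


def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the bucket that would hold other_id.

    It is the position of the highest bit in which the IDs differ,
    so a 112-bit ID space needs at most 112 buckets.
    """
    distance = xor_distance(own_id, other_id)
    if distance == 0:
        raise ValueError("a node does not store itself as a contact")
    return distance.bit_length() - 1


def closest_contact(destination: int, contacts: list[int]) -> int:
    """Pick the known contact that is XOR-closest to the destination."""
    return min(contacts, key=lambda contact: xor_distance(contact, destination))


if __name__ == "__main__":
    contacts = [0b0001, 0b0110, 0b1111]
    print(bin(closest_contact(0b0111, contacts)))
```

A node that has no direct route to a destination can hand the packet towards the contact returned by `closest_contact`, which is how a small routing table can still reach every node, at the price of some stretch.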

Source routing is very robust, but may imply high per-packet overhead. Therefore, KIRA uses a Forwarding Tier for efficiently forwarding data packets of the CPN, i.e., the actual control and management traffic. PathIDs are used instead of source routes and consist of a hash of the NodeIDs along the complete path. PathIDs are precomputed in a 2-hop vicinity; therefore, PathIDs must be installed in intermediate systems only for paths longer than 5 hops, where they are swapped similar to label swapping. KIRA's Forwarding Tier can simply use existing forwarding plane technology that is IPv6 capable and can also use SRv6 for encapsulation. As indicated in Figure 2, control and management applications can simply use the IPv6 connectivity provided by KIRA, i.e., they can use any transport protocol that works with IPv6, so they do not need to be adapted to work with KIRA.
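To make the idea of a PathID concrete, here is a minimal sketch that hashes the NodeIDs along a path into a short, fixed-size identifier. The hash function (SHA-256), the 8-byte truncation, and the 14-byte NodeID encoding are assumptions chosen for illustration; the actual PathID format is defined in the KIRA specification.

```python
import hashlib
from collections.abc import Sequence

NODE_ID_BYTES = 14  # 112-bit NodeIDs


def path_id(node_ids: Sequence[int], digest_bytes: int = 8) -> bytes:
    """Derive a fixed-size identifier from the NodeIDs along a path.

    Assumption: SHA-256 truncated to digest_bytes; KIRA's real hash
    function and PathID length may differ.
    """
    hasher = hashlib.sha256()
    for node_id in node_ids:
        # Fixed-width encoding keeps the node boundaries unambiguous.
        hasher.update(node_id.to_bytes(NODE_ID_BYTES, "big"))
    return hasher.digest()[:digest_bytes]


if __name__ == "__main__":
    print(path_id([0x0A, 0x0B, 0x0C]).hex())
```

The identifier has the same size regardless of path length, which is what removes the per-packet overhead of carrying a full source route.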

KIRA has some interesting properties that make it very well suited as a connectivity solution for control plane fabrics:

  • KIRA is loop-free in the sense that packets will never cycle in an "endless" loop (as with hop-by-hop routing that employs hop limit mechanisms as mitigation). Moreover, it is loop-free even during convergence.
  • It enables a per-node decision on the stretch/memory trade-off by allowing additional entries to be put into a node's routing table, e.g., to get more efficient routes to resources that need to be controlled by a control application running on a KIRA node.
  • It is multi-path capable due to small routing tables and expressiveness of the source routes.
  • It can support fast reroute without loops due to source routes.
  • It supports different routing metrics.
  • It has built-in route flapping prevention in the sense that it will not alternate between two equally good routes.
  • It includes a specific end-system mode for non-routing nodes.
  • It supports Domains, confining routes to and via nodes within the corresponding domain. Domains may be defined by topological or organizational criteria. However, they have to be assigned to nodes, which could be part of an automatic onboarding procedure. The NodeID stays the same in all domains and there always exists a global domain that can be used for further configuration. Domains may also be statically assigned together with the software image or by the geographical location.

Additional services

KIRA also supports additional services that are useful for control and management. They are not part of KIRA itself, but run as tightly coupled modules as shown in Figure 2. First, it supports a Distributed Hash Table (DHT) that can be used to store (key, value) pairs, e.g., to provide a simple name service that maps human-readable names to NodeIDs (that are randomly generated). The DHT can also help to register and find service instances, i.e., it enables service discovery. Second, KIRA provides a very efficient topology discovery mechanism called KeLLy [16] that can be used for controller placement, service orchestration, or creating areas and so on. Topology discovery is an essential part of modern architectures [1,13]. These supporting and tightly integrated add-on services make it easy to bootstrap management and control entities, letting them rendezvous and self-organise for executing truly distributed control.
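The name service use case can be illustrated with a toy, single-process DHT. This is a generic Kademlia-style sketch under stated assumptions, not KIRA's DHT module: the class `ToyDHT`, the SHA-256 key derivation, and the rule that the XOR-closest node stores a key are all hypothetical choices for this example.

```python
import hashlib

ID_BITS = 112


def key_for(name: str) -> int:
    """Map a human-readable name into the 112-bit ID space (assumed scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")


def responsible_node(key: int, node_ids) -> int:
    """The node whose NodeID is XOR-closest to the key stores the entry."""
    return min(node_ids, key=lambda node_id: node_id ^ key)


class ToyDHT:
    """In-memory stand-in for a DHT spread over several nodes."""

    def __init__(self, node_ids):
        self.stores = {node_id: {} for node_id in node_ids}

    def put(self, name: str, value: int) -> None:
        key = key_for(name)
        self.stores[responsible_node(key, self.stores)][key] = value

    def get(self, name: str):
        key = key_for(name)
        return self.stores[responsible_node(key, self.stores)].get(key)


if __name__ == "__main__":
    dht = ToyDHT(node_ids=[0x1111, 0x2222, 0x3333])
    dht.put("controller-1", 0x2222)  # register name -> NodeID
    print(hex(dht.get("controller-1")))
```

Service discovery follows the same pattern: a service instance registers its NodeID under a well-known name, and clients look that name up.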

Figure 2: KIRA's Two Tier Architecture

Conclusions

Using KIRA as an invariant for control plane connectivity can be seen as an enabler for much more robust network management operations. Network operators do not need to know how their current network looks in advance as KIRA is able to provide up-to-date topology information. The provided connectivity can be used as a base for all management and control tasks, and its tightly integrated add-on services provide a perfect basis for autonomic network management solutions. Network operators should never lose control over their network infrastructure and should be able to recover from even the most complex failure scenarios. The next step toward realising this vision would be the open standardisation of KIRA by the IETF so that it is available in all networked devices that need to be managed or controlled. The protocol specification as Internet-Draft and running code are available (please see KIRA's project page for more resources and information), but we need proponents to put this work forward within the IETF. Since the operator community would benefit the most from such a solution, we kindly ask for feedback (e.g., here in the comments, directly via e-mail or by joining our discussion list) and ideally support to put this work forward towards real deployment.


Notes

  1. A. Krentsel et al. A Decentralized SDN Architecture for the WAN. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, pages 938–953, New York, NY, USA, 2024. https://doi.org/10.1145/3651890.3672257
  2. Amazon Web Services. Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region. https://aws.amazon.com/message/12721/, Dec. 2021
  3. Engineering at Meta. More details about the October 4 outage. https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/, Oct 2021.
  4. Washington Post. Amazon Web Services’ third outage in a month exposes a weak point in the Internet’s backbone. https://www.washingtonpost.com/business/2021/12/22/amazon-web-services-experiences-another-big-outage/, Dec. 2021
  5. T. Strickx and J. Hartman. Cloudflare outage on June 21, 2022. The Cloudflare Blog https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/, June 2022.
  6. Wikipedia. 2022 Rogers Communications outage. https://en.wikipedia.org/wiki/2022_Rogers_Communications_outage, Sept. 2022.
  7. KDDI Corporation. The July 2 Communication Failure and Our Response. https://www.kddi.com/english/important-news/20220729_01/, July 2022.
  8. M. Behringer, M. Pritikin, S. Bjarnason, A. Clemm, B. Carpenter, S. Jiang, and L. Ciavaglia. Autonomic Networking: Definitions and Design Goals. RFC 7575 (Informational), June 2015
  9. T. B. Meriem, R. Chaparadza, B. Radier, S. Soulhi, J.-A. Lozano-López, and A. Prakash. GANA – Generic Autonomic Networking Architecture. ETSI Whitepaper No. 16, Oct. 2016. ISBN 979-10-92620-10-8
  10. E. Coronado, R. Behravesh, T. Subramanya, A. Fernàndez-Fernàndez, M. S. Siddiqui, X. Costa-Pérez, and R. Riggio. Zero Touch Management: A Survey of Network Automation Solutions for 5G and 6G Networks. IEEE Communications Surveys & Tutorials, 24(4):2535–2578, 2022
  11. K. Dzeparoska, N. Beigi-Mohammadi, A. Tizghadam, and A. Leon-Garcia. Towards a Self-Driving Management System for the Automated Realization of Intents. IEEE Access, 9:159882–159907, 2021.
  12. A. D. Ferguson et al. Orion: Google’s Software-Defined Networking Control Plane. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 83–98. USENIX Association, Apr. 2021
  13. M. Denis et al. EBB: Reliable and Evolvable Express Backbone Network in Meta. In Proceedings of the ACM SIGCOMM 2023 Conference, page 346–359, New York, NY, USA, 2023.
  14. R. Bless. Kademlia-directed ID-based Routing Architecture (KIRA). Internet-Draft draft-bless-rtgwg-kira-02, March 2025. https://datatracker.ietf.org/doc/draft-bless-rtgwg-kira/
  15. R. Bless, M. Zitterbart, Z. Despotovic, and A. Hecker. KIRA: Distributed Scalable ID-based Routing with Fast Forwarding. In 2022 IFIP Networking Conference (IFIP Networking), pages 1–9, Catania, Italy, June 2022. https://ieeexplore.ieee.org/document/9829816
  16. P. Seehofer, R. Bless, H. Mahrt, and M. Zitterbart. Scalable and Efficient Link Layer Topology Discovery for Autonomic Networks. In 2023 19th International Conference on Network and Service Management (CNSM), pages 1–9, Nov. 2023. https://opendl.ifip-tc6.org/db/conf/cnsm/cnsm2023/1570931423.pdf

About the author

Roland Bless, based in Karlsruhe, Germany

Dr. Roland Bless is Associate Professor and senior researcher at the Institute of Telematics at KIT (Karlsruhe Institute of Technology) in Germany. He studied Computer Science at the University of Karlsruhe until 1996 and received his PhD in the area of Quality-of-Service Management in 2002. In 2009 he finished his Habilitation at the KIT Department of Informatics. He has been active in Internet standardization since 1998. His research interests are Highly Scalable Zero-Touch Routing, Quality-of-Service, Congestion Control, Network Control and Management as well as Quantum Internet. Dr. Bless is a member of Gesellschaft für Informatik, ACM SIGCOMM, IEEE ComSoc, and ISOC.
