This article examines a two-rack deployment where ToR instability quietly collapsed redundancy while SLA metrics remained within bounds, revealing how service-level monitoring can mask structural risk.
In many small and mid-scale deployments, cross-rack redundancy exists logically but not structurally. Replicas are distributed across racks, ToRs are independent, and SLA dashboards remain green. The design appears resilient.
The underlying assumption is that placing replicas in different racks creates separate failure domains. In single-homed designs, that assumption only holds while each rack switch remains stable.
Service-level monitoring measures user impact. It does not model how failure domains evolve under instability. When instability does not trigger clean link-down events, a topology can shift from multi-rack independence to single-rack dependency without breaching any SLA threshold.
The following field case illustrates how this class of design behaves under partial ToR instability.
The topology assumption
The deployment followed a common pattern:
Two racks with:
- One database replica in each rack
- Independent ToRs
- Shared L3 core
- Single-homed hosts
- SLA-based monitoring
On the logical diagram, cross-rack redundancy existed. Writes could continue if one replica failed. From the service perspective, the system was rack-resilient.
At the physical layer, however, each rack was a single failure domain. All hosts in a rack depended on a single ToR. There was no MLAG between racks and no independent L2 path across them.
Logical distribution was therefore assumed to imply structural independence.
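The gap between those two views can be made concrete. The sketch below is hypothetical (names and data are illustrative, not from the deployment's inventory); it models the assumption that a single-homed host inherits its ToR as a failure domain, so the number of independent domains is the number of distinct ToRs still carrying a healthy replica:

```python
# Hypothetical sketch: derive effective rack-level failure domains from
# replica placement. Single-homed hosts inherit their ToR as a dependency.
replicas = {
    "primary":    {"rack": "rack-1", "tor": "tor-1", "healthy": True},
    "replica-01": {"rack": "rack-1", "tor": "tor-1", "healthy": True},
    "replica-02": {"rack": "rack-2", "tor": "tor-2", "healthy": True},
}

def effective_failure_domains(replicas):
    """Distinct ToRs that still carry at least one healthy replica."""
    return {r["tor"] for r in replicas.values() if r["healthy"]}

print(len(effective_failure_domains(replicas)))  # prints: 2
```

While both racks are healthy the model reports two domains; mark replica-02 unhealthy and the same function returns one, which is exactly the contraction the logical diagram cannot show.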
The first structural shift
The first observable symptom appeared at the application layer when replica-02 in rack-2 became unreachable. TCP sessions timed out, service discovery marked it unhealthy, and replication lag on the primary began to increase. However, primary and replica-01 in rack-1 remained reachable. Write traffic continued. From a service perspective, availability appeared intact.
Although the application was still serving traffic, rack-2 could no longer function as an independent unit. From a redundancy perspective, it had already failed.
Structurally, the deployment had already contracted from two rack-level failure domains to one: all remaining replicas were now located in rack-1, and inter-rack failover capability had disappeared even though no SLA threshold was crossed.
This was not yet an outage. It was a resilience shift.
No monitoring rule treated this contraction of failure domains as a severity event.
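Such a rule is not hard to express. The sketch below is illustrative only (the function names and the minimum-domain threshold are assumptions, not part of the deployment's actual monitoring stack); it treats a drop in distinct healthy racks as a severity event in its own right, independent of any SLA metric:

```python
# Hypothetical sketch of the missing rule: alert on failure-domain
# contraction, not on user-visible impact.
def rack_diversity(replica_states):
    """replica_states: iterable of (rack, healthy) tuples."""
    return len({rack for rack, healthy in replica_states if healthy})

def structural_alert(before, after, min_domains=2):
    # Fire when the count of independent rack-level failure domains
    # falls below the design minimum, even with zero SLA impact.
    if after < min_domains <= before:
        return "CRITICAL: failure-domain contraction (%d -> %d)" % (before, after)
    return None
```

Evaluated against this event, `rack_diversity` would have dropped from 2 to 1 the moment replica-02 became unreachable, hours before anything user-visible occurred.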
Coupling of control and data planes
Shortly afterwards, IPMI access to replica-02 disappeared.
Switch state:
Interface: up
MAC present in FDB
ARP incomplete
No ICMP reply
The interface remained up and the MAC entry was still present in the FDB, but ARP resolution failed and ICMP went unanswered.
Replica-02 carried both its database interface and its management VLAN through ToR-2. As forwarding behaviour on that switch degraded, both data-plane and control-plane connectivity disappeared together.
This behaviour was consistent with rack switch forwarding instability rather than a host crash.
The underlying modelling issue was hidden coupling: management and data planes shared the same ToR dependency. As a result, there was no independent management path and no out-of-band separation. When the ToR degraded, visibility and data access degraded simultaneously.
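This kind of hidden coupling can be surfaced by a simple audit. The sketch below uses assumed path data (the element names are hypothetical); any shared element between a host's data path and management path, other than the planned shared L3 core, means both planes fail together:

```python
# Hypothetical sketch: detect control/data-plane coupling by intersecting
# the physical elements each path traverses.
paths = {
    "replica-02": {
        "data": ["nic0", "tor-2", "core"],
        "mgmt": ["ipmi", "tor-2", "core"],  # management VLAN rides the same ToR
    },
}

def shared_dependencies(host_paths):
    return set(host_paths["data"]) & set(host_paths["mgmt"])

# "core" is shared by design; "tor-2" is the hidden single point of coupling.
print(shared_dependencies(paths["replica-02"]))
```

An out-of-band management path would make this intersection contain only elements that are shared by design, which is the property the deployment lacked.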
Why SLA monitoring did not react
From the outside, the system still appeared healthy:
HTTP 5xx ≈ 0.2%
p50 ≈ 40 ms
p95 ≈ 130 ms
p99 widened to 620–900 ms
SLA (1.5 s) remained within bounds
Reads targeting replica-02 failed and were retried against replica-01, increasing retry volume and widening tail latency. However, requests continued to complete within SLA limits.
Threshold-based monitoring therefore behaved exactly as designed: it measured user-visible impact. What it did not measure was structural safety.
Retry amplification masked the contraction of effective failure domains. SLA compliance, in this case, did not imply resilience.
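The masking effect is easy to reproduce with synthetic numbers. The simulation below is illustrative, not the incident's raw data; it assumes roughly one read in ten hit the dead replica and paid a timeout penalty before falling back, which is enough to widen p99 while every request still lands inside the 1.5 s SLA:

```python
# Illustrative simulation: retries keep requests within SLA while
# widening the tail. BASE_MS and RETRY_PENALTY_MS are assumed values.
BASE_MS = 40            # normal read latency
RETRY_PENALTY_MS = 600  # timeout against replica-02 before falling back

def observed_latency(hits_dead_replica):
    return BASE_MS + (RETRY_PENALTY_MS if hits_dead_replica else 0)

# 1 in 10 reads targets the unreachable replica first.
latencies = sorted(observed_latency(i % 10 == 0) for i in range(1000))
p99 = latencies[int(0.99 * len(latencies)) - 1]
within_sla = all(ms <= 1500 for ms in latencies)
print(p99, within_sla)  # prints: 640 True
```

Every threshold-based check on these numbers stays green, yet 10% of traffic is already flowing through a degraded path.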
To understand why, we need to look below the service layer.
Network-layer degradation without link failure
At the network layer, instability became explicit. ToR-2 logs showed repeated link state transitions:
LINK-3-UPDOWN: Interface Gi1/0/18 down
LINK-3-UPDOWN: Interface Gi1/0/18 up
Uplink behaviour showed:
LACP renegotiation detected
Interface counters reported:
CRC errors: 1487 and increasing over a short interval
Error rate rising between successive counter reads
Taken together, CRC growth and repeated LACP renegotiation cycles pointed to uplink instability, even though there were no sustained interface-down events.
In single-homed rack designs, CRC growth without a clean link failure leads to partial forwarding degradation: traffic degrades while interfaces remain administratively up. The result is packet loss and intermittent reachability rather than an immediate outage.
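Catching this pattern means alerting on counter rate, not absolute counts. The sketch below is a hypothetical detector (the 1487 reading comes from the case; the other samples and the threshold are assumed for illustration):

```python
# Hypothetical sketch: flag a rising CRC error *rate* between successive
# counter reads, instead of waiting for a link-down event.
def crc_rate_alert(reads, threshold_per_s=1.0):
    """reads: list of (timestamp_s, crc_error_count) samples, in order."""
    alerts = []
    for (t0, c0), (t1, c1) in zip(reads, reads[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate > threshold_per_s:
            alerts.append((t1, rate))
    return alerts

# 1487 is the observed reading; the neighbouring samples are synthetic.
samples = [(0, 1200), (60, 1487), (120, 1920)]
print(crc_rate_alert(samples))
```

Against these samples, both intervals exceed the threshold, so the detector fires well before any interface goes administratively down.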
Additional hosts in rack-2 soon began experiencing packet loss. Because cross-rack placement at the application layer had not created independent L2/L3 paths, everything in rack-2 still depended on a single ToR.
Instability did not equal link failure, and forwarding degradation did not equal outage. Yet effective independence had already been lost.
Reboot loop as exposure event
The instability eventually progressed into a reboot loop on ToR-2. Logs showed:
control-plane crash detected
watchdog timeout
system restarting
After restart:
port-channel down
re-negotiating LACP
Each restart reinitialised control-plane state and forwarding tables. During these intervals, L2 reachability for rack-2 hosts disappeared, ARP resolution failed, and traffic was effectively blackholed until convergence completed. Rack-1, however, remained stable and traffic continued flowing.
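Reboot loops of this kind are also detectable from the logs alone. The sketch below is a minimal, assumed log-pattern counter built on the restart signatures quoted above; repeated matches within a window suggest a loop rather than a one-off crash:

```python
import re

# Hypothetical sketch: count control-plane restart signatures in a log
# window. The patterns mirror the messages seen on ToR-2.
RESTART_PATTERNS = re.compile(
    r"control-plane crash detected|watchdog timeout|system restarting"
)

def restart_events(log_lines):
    return sum(1 for line in log_lines if RESTART_PATTERNS.search(line))
```

In practice such a counter would feed the same structural-severity path as the CRC and LACP signals, rather than a generic "device rebooted" informational event.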
Importantly, the reboot loop did not create the structural weakness; it exposed it. The failure-domain contraction had already occurred when forwarding instability first isolated rack-2.
Failure-domain contraction
Between the initial replica loss and switch stabilisation, the deployment underwent failure-domain contraction.
As rack-2 lost forwarding stability, database redundancy effectively collapsed into rack-1 and inter-rack failover capability disappeared. Retry volume increased and tail latency widened, yet service thresholds remained within limits.
Logical redundancy remained configured, but physical independence did not.
Because every host was single-homed to its rack's ToR, the loss of ToR-2 removed rack-2 as an effective failure domain. Retry logic delayed visible failure while simultaneously masking the underlying topology collapse.
This was not a service outage. It was a loss of structural resilience.
Observability and modelling implications
In this case, monitoring focused primarily on service-level metrics:
- request success rate
- latency percentiles
- replica health
What it did not model were structural properties of the deployment, such as:
- replica distribution across physical failure domains
- contraction of rack diversity as a first-class signal
- retry amplification as structural drift
- ToR instability as a severity trigger without link-down events
From a topology perspective, however, the deployment had already collapsed into a single effective failure domain.
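The four unmodelled properties listed above can be folded into a single evaluation that runs alongside SLA checks. The sketch is illustrative (signal names and the retry-amplification multiplier are assumptions, not values from the deployment):

```python
# Hypothetical sketch: combine structural signals into one severity check
# that is independent of user-visible SLA metrics.
def structural_severity(rack_diversity, design_min_racks,
                        retry_ratio, baseline_retry_ratio,
                        crc_rising, lacp_flapping):
    reasons = []
    if rack_diversity < design_min_racks:
        reasons.append("rack diversity below design minimum")
    if retry_ratio > 3 * baseline_retry_ratio:  # assumed drift multiplier
        reasons.append("retry amplification (structural drift)")
    if crc_rising and lacp_flapping:
        reasons.append("ToR forwarding instability without link-down")
    return ("CRITICAL", reasons) if reasons else ("OK", [])
```

Fed with this event's signals, all three branches fire, which is the severity event the SLA-driven stack never raised.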
This pattern tends to appear in environments where:
- logical redundancy exists at the application layer
- servers are single-homed
- ToRs are independent but not cross-coupled
- monitoring is SLA-driven
In such designs, logical redundancy does not guarantee structural independence.
Network-level questions
This case raises practical questions: Is rack diversity visible in your telemetry? Would your NOC detect the loss of rack-level independence before users notice? Do CRC growth and LACP renegotiations without link-down events trigger investigation? And at what point does logical redundancy become structural dependence?
From a service perspective, the system remained within SLA throughout the event. From a topology perspective, however, the deployment had already collapsed into a single effective failure domain before any threshold was breached.
This case illustrates how logical redundancy at the application layer can persist while structural independence at the network layer has already been lost. Failure-domain contraction, in such designs, can occur without producing an immediate outage signal.