<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>OVN-Kubernetes on Elastocera</title>
    <link>https://elastocera.com/tags/ovn-kubernetes/</link>
    <description>Recent content in OVN-Kubernetes on Elastocera</description>
    <image>
      <title>Elastocera</title>
      <url>https://elastocera.com/images/forest-og.jpg</url>
      <link>https://elastocera.com/images/forest-og.jpg</link>
    </image>
    <generator>Hugo -- 0.157.0</generator>
    <language>en</language>
    <lastBuildDate>Mon, 06 Apr 2026 10:00:00 -0300</lastBuildDate>
    <atom:link href="https://elastocera.com/tags/ovn-kubernetes/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Hidden Reliability Risks in Multi-Cluster Kubernetes</title>
      <link>https://elastocera.com/posts/hidden-reliability-risks-multi-cluster-kubernetes/</link>
      <pubDate>Mon, 06 Apr 2026 10:00:00 -0300</pubDate>
      <guid>https://elastocera.com/posts/hidden-reliability-risks-multi-cluster-kubernetes/</guid>
      <description>Failure patterns in multi-cluster Kubernetes systems: boundaries that collapse, hidden dependencies, and distributed failure modes.</description>
      <enclosure url="https://elastocera.com/images/mycelium-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<p>Multi-cluster Kubernetes is often introduced as a solution to failure.</p>
<p>In practice, it does something more subtle.</p>
<p><strong>It changes the shape of failure.</strong></p>
<p>Failures do not disappear.<br>
They stop being local, predictable, and contained.<br>
They become distributed, indirect, and delayed.</p>
<p>The most dangerous part is not the failure itself.</p>
<p>These failure modes share a pattern: they rarely appear in architecture diagrams, do not violate best practices, and <strong>only become visible under specific lifecycle events</strong>.</p>
<p>Usually at the worst possible moment.</p>
<p>This is not a tooling problem.<br>
<strong>It is a systems behavior problem.</strong></p>
<p>What follows are recurring patterns observed in real multi-cluster environments.</p>
<h3 id="namespace-collisions-and-cascading-deletions">Namespace Collisions and Cascading Deletions</h3>
<p>Namespaces are designed to be boundaries.</p>
<p>In multi-cluster systems, they often become something else.</p>
<p><strong>They become coupling points.</strong></p>
<p>The shift happens quietly.</p>
<p>When a namespace starts representing identity, such as a cluster inside <span class="tooltip-term" data-tooltip="Red Hat Advanced Cluster Management for Kubernetes. A centralized management platform that provides governance, policy enforcement, and lifecycle management across multiple clusters through a hub-and-spoke architecture.">ACM</span>, it stops being just a container of resources.</p>
<p>It becomes part of the <span class="tooltip-term" data-tooltip="Control plane: the set of components responsible for managing and coordinating the state of a Kubernetes cluster. It decides what runs, where it runs, and how it recovers. When the control plane is unavailable, the system continues operating but loses the ability to change or respond to new conditions.">control plane</span>.</p>
<p>A common pattern emerges:</p>
<ul>
<li>One system uses namespaces to represent clusters.</li>
<li>Another uses namespaces for workload isolation.</li>
<li><strong>Both assume they control the lifecycle of those namespaces.</strong></li>
</ul>
<p>Nothing appears wrong during normal operation.</p>
<p>The conflict only appears when something is removed.</p>
<hr>
<p><strong>Deletion is where the illusion breaks.</strong></p>
<p>Kubernetes behaves correctly.<br>
A namespace is deleted, and everything inside it disappears.</p>
<p>The failure is not in the platform.</p>
<p>It is in the assumption that the namespace had a single meaning.</p>
<p>This is how cascading deletion emerges.</p>
<p>A lifecycle operation in one context <strong>silently destroys resources owned by another</strong> (<a href="https://elastocera.com/field-notes/illusion-of-isolation/" class="fn-ref" title="The Illusion of Isolation">FN-0004</a>).</p>
<p>In environments using <span class="tooltip-term" data-tooltip="HyperShift enables Hosted Control Planes where the control plane runs as pods on a hosting cluster instead of dedicated nodes, reducing cost but concentrating risk.">HyperShift</span>, this pattern becomes more visible.</p>
<p>When cluster identity and control plane resources share the same namespace, a single detach operation can remove both.</p>
<blockquote>
<p>When a boundary carries more than one meaning, it becomes a failure propagation mechanism.</p>
</blockquote>
<p>The mitigation is often described as naming conventions.</p>
<p>That is only the surface.</p>
<p><strong>The real solution is architectural:</strong></p>
<ul>
<li>Separate identity from lifecycle.</li>
<li>Ensure that each boundary maps to a single responsibility.</li>
<li>Treat namespace design as part of the system model.</li>
</ul>
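<p>As a minimal sketch of the "single responsibility" rule above, the following Python flags namespaces that carry more than one role. The claim list, system names, and role labels are hypothetical; in a real environment they would have to come from an inventory of your own.</p>

```python
from collections import defaultdict

# Hypothetical claims: (namespace, owning system, role).
# These names are illustrative and not read from a real cluster.
CLAIMS = [
    ("cluster-a", "acm", "cluster-identity"),
    ("cluster-a", "hypershift", "hosted-control-plane"),
    ("team-payments", "platform", "workload-isolation"),
]


def find_overloaded_namespaces(claims):
    """Return namespaces that carry more than one role, with those roles."""
    roles_by_namespace = defaultdict(set)
    for namespace, _owner, role in claims:
        roles_by_namespace[namespace].add(role)
    return {
        namespace: sorted(roles)
        for namespace, roles in roles_by_namespace.items()
        if len(roles) > 1
    }


print(find_overloaded_namespaces(CLAIMS))
```

<p>A check like this could run before any detach or delete operation, so that a namespace with two meanings is noticed before its lifecycle is touched.</p>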
<h3 id="the-hub-cluster-as-a-concentration-of-risk">The Hub Cluster as a Concentration of Risk</h3>
<p>Multi-cluster management introduces a central point of coordination.</p>
<p>In <span class="tooltip-term" data-tooltip="Multicluster Engine for Kubernetes provides the core capabilities for cluster lifecycle, discovery, and agent-based management used by ACM.">MCE</span> and ACM environments, this is the hub cluster.</p>
<p>It is often described as a control plane.</p>
<p><strong>In practice, it behaves as a concentration point for risk.</strong></p>
<p>Managed clusters continue running even if the hub is unavailable.</p>
<p>This creates a sense of resilience.</p>
<p>But resilience at the workload level <strong>hides fragility at the management level.</strong></p>
<p>When the hub becomes unavailable, the system loses its ability to change:</p>
<ul>
<li>No deployments.</li>
<li>No policy enforcement.</li>
<li>No <span class="tooltip-term" data-tooltip="Reconciliation: the continuous process by which Kubernetes compares the current state of the system with the desired state and makes corrections. Without reconciliation, configuration changes, scaling decisions, and recovery actions stop being applied.">reconciliation</span> toward desired state.</li>
</ul>
<p>This creates a different kind of failure.</p>
<p><strong>Not an outage, but a loss of control</strong> (<a href="https://elastocera.com/field-notes/hidden-spofs-platform-layers/" class="fn-ref" title="Hidden SPOFs in Platform Layers">FN-0002</a>).</p>
<p>Over time, <span class="tooltip-term" data-tooltip="Drift: the gradual divergence between the intended state of a system and its actual state. Drift accumulates silently through missed updates, expired credentials, and unenforced policies, often becoming visible only during incidents or audits.">drift</span> accumulates:</p>
<ul>
<li>Configurations diverge.</li>
<li>Security policies stop being enforced.</li>
<li>Certificates expire without rotation.</li>
</ul>
<p>The system becomes inconsistent with itself.</p>
<hr>
<p>At the center of this risk is <span class="tooltip-term" data-tooltip="A distributed key-value store that holds all Kubernetes cluster state. Loss of quorum renders the control plane unavailable.">etcd</span>.</p>
<p>When it fails, the system does not degrade gracefully.</p>
<p><strong>It stops coordinating.</strong></p>
<p>The hub is not just infrastructure.</p>
<p><strong>It defines whether the system can evolve.</strong></p>
<h3 id="infrastructure-dependencies-that-scale-the-hahahugoshortcode39s9hbhb">Infrastructure Dependencies That Scale the <span class="tooltip-term" data-tooltip="Blast radius: the total scope of impact when a failure occurs. In distributed systems, the blast radius determines how many services, clusters, or users are affected by a single point of failure. Shared infrastructure increases the blast radius because one failure propagates to every system that depends on it.">Blast Radius</span></h3>
<p>Multi-cluster architectures suggest isolation.</p>
<p>Separate clusters. Separate <span class="tooltip-term" data-tooltip="Failure domain: a boundary within which a failure is expected to be contained. In theory, each cluster is its own failure domain. In practice, shared dependencies like DNS, identity, and certificates allow failures to cross those boundaries.">failure domains</span>.</p>
<p><strong>This assumption breaks when clusters share infrastructure.</strong></p>
<p>Services like DNS, identity, and certificate authorities operate below Kubernetes.</p>
<p>An example is <span class="tooltip-term" data-tooltip="Red Hat Identity Management, based on FreeIPA. Provides centralized DNS, authentication, and certificate services across environments.">IdM</span>.</p>
<p>When these systems fail, the impact is not localized.</p>
<p>It spreads across every dependent cluster.</p>
<p>The symptoms are indirect:</p>
<ul>
<li>DNS issues appear as application failures.</li>
<li>Certificate problems appear as authentication errors.</li>
<li>Content delivery issues appear as deployment failures.</li>
</ul>
<p>Organizations using <span class="tooltip-term" data-tooltip="Red Hat Satellite provides local content mirrors for packages and container images, commonly used in disconnected environments.">Satellite</span> experience this clearly.</p>
<p>If the mirror fails, every cluster stops receiving updates.</p>
<p>The pattern is consistent.</p>
<blockquote>
<p>Shared infrastructure synchronizes failure.</p>
</blockquote>
<p>Clusters are no longer independent.</p>
<p><strong>They become coupled through what they depend on.</strong></p>
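<p>The coupling can be made explicit with a small sketch. Assuming a hand-maintained map from each cluster to the shared services it relies on (the cluster and service names below are hypothetical), the blast radius of a service failure is simply every cluster that lists it.</p>

```python
# Hypothetical dependency map: cluster name to the shared services it uses.
DEPENDENCIES = {
    "cluster-east": {"idm-dns", "idm-ca", "satellite-mirror"},
    "cluster-west": {"idm-dns", "idm-ca"},
    "cluster-lab": {"lab-dns"},
}


def blast_radius(dependencies, failed_service):
    """Return the sorted clusters that depend on the failed service."""
    return sorted(
        cluster
        for cluster, services in dependencies.items()
        if failed_service in services
    )


for service in ("idm-dns", "satellite-mirror", "lab-dns"):
    print(service, blast_radius(DEPENDENCIES, service))
```

<p>The value of such a map is less in the code than in being forced to write the dependencies down. Services shared by every cluster are the ones that synchronize failure.</p>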
<h3 id="operator-and-catalog-drift">Operator and Catalog Drift</h3>
<p>Consistency across clusters is assumed.</p>
<p><strong>In practice, it slowly erodes.</strong></p>
<p>Operators evolve through <span class="tooltip-term" data-tooltip="Operator Lifecycle Manager manages installation and updates of Kubernetes operators using catalogs and subscriptions.">OLM</span>.</p>
<p>Clusters update at different times.</p>
<p>Catalogs diverge.</p>
<p>Each cluster remains internally consistent.</p>
<p><strong>The system as a whole does not.</strong></p>
<p>The problem appears when systems interact.</p>
<p>Workloads move.<br>
Policies apply across clusters.</p>
<p>Differences become visible:</p>
<ul>
<li><span class="tooltip-term" data-tooltip="Custom Resource Definitions define schemas for Kubernetes extensions. Changes between versions can introduce incompatibilities.">CRDs</span> no longer match.</li>
<li>Defaults differ.</li>
<li>APIs behave differently.</li>
</ul>
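<p>One way to make this kind of divergence visible is to compare the served CRD versions per cluster. The sketch below assumes an inventory that has already been collected into a dictionary. The cluster names and CRD names are hypothetical, and a CRD missing from a cluster is treated as serving no versions.</p>

```python
# Hypothetical inventory: per cluster, CRD name to its served versions.
INVENTORY = {
    "cluster-east": {
        "widgets.example.com": ["v1", "v1beta1"],
        "gadgets.example.com": ["v1"],
    },
    "cluster-west": {
        "widgets.example.com": ["v1beta1"],
        "gadgets.example.com": ["v1"],
    },
}


def crd_drift(inventory):
    """Return CRDs whose served versions differ between clusters."""
    all_crds = {crd for crds in inventory.values() for crd in crds}
    drift = {}
    for crd in sorted(all_crds):
        # A CRD absent from a cluster counts as an empty version list.
        per_cluster = {
            cluster: sorted(crds.get(crd, []))
            for cluster, crds in inventory.items()
        }
        distinct = {tuple(versions) for versions in per_cluster.values()}
        if len(distinct) > 1:
            drift[crd] = per_cluster
    return drift


print(crd_drift(INVENTORY))
```

<p>Here only <code>widgets.example.com</code> is reported, because <code>gadgets.example.com</code> serves the same versions everywhere.</p>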
<p>The system appears unpredictable.</p>
<p><strong>In reality, it is inconsistent.</strong></p>
<blockquote>
<p>Drift is not a failure event. It is a gradual loss of alignment.</p>
</blockquote>
<p>Without governance, it is inevitable (<a href="https://elastocera.com/field-notes/governance-drift/" class="fn-ref" title="Governance Drift">FN-0007</a>).</p>
<h3 id="network-assumptions-that-break-at-scale">Network Assumptions That Break at Scale</h3>
<p>Networking appears stable.</p>
<p>Until scale exposes hidden interactions.</p>
<p>In <span class="tooltip-term" data-tooltip="OVN-Kubernetes is the default OpenShift networking plugin, using overlay networking based on Open Virtual Network.">OVN-Kubernetes</span>, network traffic is encapsulated using <span class="tooltip-term" data-tooltip="Geneve is a tunneling protocol that encapsulates packets, adding overhead and affecting MTU and performance.">Geneve</span>.</p>
<p>At the same time, NIC optimizations like <span class="tooltip-term" data-tooltip="Large Receive Offload and Generic Receive Offload aggregate packets to improve throughput, but can interfere with encapsulation.">LRO and GRO</span> modify packet handling.</p>
<p>These mechanisms interact in non-obvious ways.</p>
<p>Packets are not consistently dropped.</p>
<p><strong>They are intermittently lost.</strong></p>
<p>The pattern is subtle.</p>
<p>From the application perspective, the system feels unreliable.</p>
<p>From the system perspective, <strong>everything looks healthy.</strong></p>
<p><span class="tooltip-term" data-tooltip="Maximum Transmission Unit defines the largest packet size that can be transmitted without fragmentation.">MTU</span> mismatches amplify the problem.</p>
<p>Encapsulation reduces effective packet size.</p>
<p>Different environments behave differently.</p>
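<p>The arithmetic behind this is simple enough to sketch. The example assumes a 100-byte overlay overhead, which is the figure OpenShift documentation uses for OVN-Kubernetes. The environment names and MTU values are hypothetical.</p>

```python
# Assumed overlay overhead for OVN-Kubernetes (Geneve), in bytes.
GENEVE_OVERHEAD_BYTES = 100

# Hypothetical host-network MTUs per environment.
HOST_MTUS = {"datacenter": 9000, "cloud": 1500, "edge": 1450}


def max_pod_mtu(host_mtu, overhead=GENEVE_OVERHEAD_BYTES):
    """Largest pod-network MTU that fits inside the host MTU."""
    return host_mtu - overhead


def oversized_environments(host_mtus, configured_pod_mtu):
    """Return environments where the configured pod MTU does not fit."""
    return sorted(
        name
        for name, host_mtu in host_mtus.items()
        if configured_pod_mtu > max_pod_mtu(host_mtu)
    )


print(max_pod_mtu(1500))
print(oversized_environments(HOST_MTUS, configured_pod_mtu=1400))
```

<p>A pod MTU of 1400 fits a 1500-byte host network, but not the 1450-byte one. The same configuration is correct in one environment and lossy in another.</p>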
<blockquote>
<p>When abstractions hide lower layers, they also hide their failure modes (<a href="https://elastocera.com/field-notes/abstractions-simplify-usage-not-operation/" class="fn-ref" title="Abstractions Simplify Usage, Not Operation">FN-0006</a>).</p>
</blockquote>
<h3 id="what-to-do-about-it">What To Do About It</h3>
<p>These patterns tend to emerge from the same place.</p>
<p><strong>A gap between how systems are designed and how they actually behave</strong> (<a href="https://elastocera.com/field-notes/operational-knowledge-vs-architectural-knowledge/" class="fn-ref" title="Operational Knowledge vs Architectural Knowledge">FN-0003</a>).</p>
<p>Most architectures describe structure.<br>
But failures follow behavior.</p>
<p>Closing that gap is not a matter of adding more configuration.<br>
It requires a shift in perspective.</p>
<p>From components to interactions.<br>
From definitions to dynamics.</p>
<p>Boundaries, for example, only work when they carry a single meaning.<br>
When they don’t, they become translation layers for failure.</p>
<p>Control planes are another blind spot.<br>
They are often treated as abstractions.</p>
<p>They are not.</p>
<p>They are dependencies.<br>
And when they fail, they fail across everything they touch.</p>
<p>Infrastructure also tends to disappear from view.<br>
Until it doesn’t.</p>
<p>What looks like isolation at the application layer<br>
can still share the same underlying paths.</p>
<p>And those paths define how failure moves.</p>
<p>Consistency, in this context, is never accidental.<br>
It has to be enforced deliberately.</p>
<p>Which leaves one final question.</p>
<p>Not whether the system works.</p>
<p>But how it fails.</p>
<p><strong>Because that is what reveals its true shape</strong> (<a href="https://elastocera.com/field-notes/the-first-incident-test/" class="fn-ref" title="The First Incident Test">FN-0015</a>).</p>
<h3 id="conclusion">Conclusion</h3>
<p>Multi-cluster Kubernetes does not reduce complexity.</p>
<p><strong>It redistributes it.</strong></p>
<p>What appears independent at the architectural level is often connected through shared dependencies, shared state, and shared assumptions. When failure propagates, it moves through those connections, not through the components themselves.</p>
<p>Reliability does not come from architecture alone.</p>
<p><strong>It comes from understanding behavior.</strong></p>
<p>The most dangerous risks are not hidden because they are rare.</p>
<p><strong>They are hidden because they look correct.</strong></p>
<p>The question is not whether these patterns exist.</p>
<p>They do.</p>
<p>The question is when they will surface. And whether that moment is controlled, or accidental.</p>
<hr>
<h3 id="architectural-continuity">Architectural Continuity</h3>
<p>Multi-cluster architectures redistribute failure across boundaries that were designed for isolation but behave as propagation paths.</p>
<p>The patterns described here, namespace collisions, hub concentration, infrastructure coupling, operator drift, and network interactions, <strong>are not edge cases</strong>. They are structural properties of systems that share more than their architecture diagrams reveal.</p>
<blockquote>
<p>Boundaries that carry more than one meaning become failure propagation mechanisms.
Systems that share infrastructure synchronize failure.
Consistency that is not enforced is eventually lost.</p>
</blockquote>
<p>Understanding these patterns is the first step. Translating them into governance and risk language is the next.</p>
<p><strong>Continue with</strong>: <a href="/posts/platform-governance-control-system/">Platform Governance as a Control System in Multi-Cluster Kubernetes</a></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
