<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Kubernetes on Elastocera</title>
    <link>https://elastocera.com/tags/kubernetes/</link>
    <description>Recent content in Kubernetes on Elastocera</description>
    <image>
      <title>Elastocera</title>
      <url>https://elastocera.com/images/forest-og.jpg</url>
      <link>https://elastocera.com/images/forest-og.jpg</link>
    </image>
    <generator>Hugo -- 0.157.0</generator>
    <language>en</language>
    <lastBuildDate>Fri, 22 May 2026 10:00:00 -0300</lastBuildDate>
    <atom:link href="https://elastocera.com/tags/kubernetes/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The DR Number Almost No One Records</title>
      <link>https://elastocera.com/posts/kubernetes-dr-strategies-fail-real-enterprises/</link>
      <pubDate>Fri, 22 May 2026 10:00:00 -0300</pubDate>
      <guid>https://elastocera.com/posts/kubernetes-dr-strategies-fail-real-enterprises/</guid>
      <description>Disaster recovery has three measurable states. Most organizations record only the first. The Validation Gap is the calculable distance between declared and tested capability, and starting in 2025, it is becoming a regulatory exposure.</description>
      <enclosure url="https://elastocera.com/images/tardigrade-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<p>Disaster recovery has three numbers.</p>
<p>Almost no organization records all three.</p>
<p>The first is the number written into the plan. The second is the number measured during exercises, if exercises happen. The third is the number observed during real incidents.</p>
<p>The distance between them is the only metric that matters. It is also the metric that almost no one calculates.</p>
<hr>
<h3 id="the-three-states-of-dr-capability">The Three States of D.R. Capability</h3>
<p><span class="tooltip-term" data-tooltip="Disaster Recovery (D.R.): the set of policies, tools, and procedures designed to recover technology infrastructure and systems after a disruptive event. In Kubernetes environments, D.R. encompasses cluster recovery, data replication, identity and certificate restoration, and the network infrastructure required to reestablish operations."> Disaster recovery </span> capability exists in three forms simultaneously, and the three forms produce three different numbers.</p>
<ol>
<li>
<p><strong>Declared capability</strong>: the <span class="tooltip-term" data-tooltip="Recovery Point Objective (RPO): the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means the organization accepts losing up to 1 hour of data. Recovery Time Objective (RTO): the maximum acceptable duration of downtime before business impact becomes critical."> RPO and RTO </span> values written into the D.R. plan. These are typically inherited from compliance requirements, business expectations, or vendor templates. They are aspirational by construction.</p>
</li>
<li>
<p><strong>Tested capability</strong>: the actual recovery time and data loss observed during the most recent end-to-end exercise, if such an exercise has been performed. This is the measurement that most closely approximates real recovery, but only if the exercise conditions are realistic.</p>
</li>
<li>
<p><strong>Observed capability</strong>: the actual recovery time and data loss measured during a real incident. This is the only number with no theoretical component. It is also the number that the organization discovers it has, rather than the number it had planned for.</p>
</li>
</ol>
<p>The three numbers are rarely the same. The distance between them is the <strong>Validation Gap</strong>, and it is the most actionable measurement in disaster recovery.</p>
<blockquote>
<p>A plan that has not been tested has only one number. A plan that has been tested has two. A plan that has survived an incident has three. Most organizations operate with one and assume it represents the others.</p>
</blockquote>
<hr>
<h3 id="calculating-the-validation-gap">Calculating the Validation Gap</h3>
<p>The Validation Gap is calculable, not estimable. Three quantities produce the number:</p>
<ul>
<li>
<p><strong>Base gap</strong>: the difference, in hours, between Tested RTO and Declared RTO. A plan declaring 4 hours that tested at 9 hours has a base gap of 5 hours.</p>
</li>
<li>
<p><strong>Decay coefficient</strong>: a multiplier reflecting how stale the test is, calculated as months since the last exercise multiplied by the platform&rsquo;s change velocity. A stable platform might use 0.05 per month. A platform under active migration might use 0.15 per month. Twelve months on a stable platform produces a coefficient of 0.6 (12 × 0.05). Twelve months on a fast-changing platform produces 1.8 (12 × 0.15).</p>
</li>
<li>
<p><strong>Adjusted gap</strong>: base gap multiplied by (1 + decay coefficient). The same 5-hour base gap, on a stable platform tested 12 months ago, becomes 8 hours (5 × 1.6). On a fast-changing platform, it becomes 14 hours (5 × 2.8).</p>
</li>
</ul>
<p>A D.R. plan with no recent test has a Validation Gap equal to the entire declared RTO, regardless of how confident the plan reads. The numbers are aspirational, not validated.</p>
<p>The Validation Gap is paid in currency. The product of the adjusted gap and the platform&rsquo;s hourly business value is the <strong>unpriced exposure</strong> the organization is carrying. For a platform supporting US$ 200,000 per hour in transactions, an adjusted gap of 8 hours represents US$ 1.6 million in exposure that has been declared as covered but is not measurably so.</p>
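<p>The arithmetic above fits in a few lines. What follows is a minimal sketch, not a reference implementation: the class and function names are illustrative, and two behaviors are assumptions rather than rules stated here (a test that beats the declared RTO is clamped to a gap of zero, and an untested plan carries no decay because there is no test to go stale).</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrMeasurement:
    """Inputs named in the text; field names are illustrative."""
    declared_rto_hours: float
    tested_rto_hours: Optional[float]   # None when no end-to-end test exists
    months_since_test: float
    change_velocity_per_month: float    # e.g. 0.05 stable, 0.15 under migration


def base_gap(m: DrMeasurement) -> float:
    """Tested RTO minus declared RTO, in hours."""
    if m.tested_rto_hours is None:
        # Untested plan: the whole declared RTO is treated as the gap.
        return m.declared_rto_hours
    # Assumption: a test that beats the declaration yields a gap of zero.
    return max(0.0, m.tested_rto_hours - m.declared_rto_hours)


def decay_coefficient(m: DrMeasurement) -> float:
    """Months since the last exercise times change velocity per month."""
    return m.months_since_test * m.change_velocity_per_month


def adjusted_gap(m: DrMeasurement) -> float:
    """Base gap scaled by (1 + decay coefficient)."""
    if m.tested_rto_hours is None:
        # Assumption: no test exists to go stale, so no decay is applied.
        return base_gap(m)
    return base_gap(m) * (1 + decay_coefficient(m))


def unpriced_exposure(m: DrMeasurement, hourly_business_value: float) -> float:
    """Adjusted gap multiplied by the platform's hourly business value."""
    return adjusted_gap(m) * hourly_business_value


if __name__ == "__main__":
    stable = DrMeasurement(4, 9, 12, 0.05)
    fast = DrMeasurement(4, 9, 12, 0.15)
    for name, m in (("stable", stable), ("fast-changing", fast)):
        print(name, round(adjusted_gap(m), 2), round(unpriced_exposure(m, 200_000)))
```

<p>Fed the figures used in this section (declared 4 hours, tested 9 hours, 12 months since the test), the sketch reproduces the 8-hour and 14-hour adjusted gaps, up to floating-point rounding.</p>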
<p>According to the Cockroach Labs State of Resilience 2025 report, only 20 percent of executives feel their organizations are fully prepared to prevent or respond to outages, and organizations report an average of 86 outages per year. Most of that downtime is paid against a Validation Gap that was never calculated.</p>
<blockquote>
<p>The Validation Gap is paid in full during the first incident. Until then, it accumulates without being charged.</p>
</blockquote>
<p><strong>Executive implication:</strong> Ask the platform team for three numbers: the declared RTO, the most recently tested RTO, and the date of that test. The adjusted Validation Gap, multiplied by the platform&rsquo;s hourly business value, is the line item the organization is carrying without recording it.</p>
<hr>
<h3 id="why-the-number-is-not-recorded">Why the Number Is Not Recorded</h3>
<p>The Validation Gap is rarely calculated, and the reason is structural rather than technical.</p>
<p>D.R. exercises, when they happen, are typically scoped narrowly. A cluster is recovered. A database is restored. A failover is demonstrated. None of these individually measure end-to-end recovery, because the dependencies that determine real recovery (identity infrastructure, certificate authorities, container registries, DNS, network paths) live outside the cluster boundary. The structural failure modes of these layers are documented in <a href="/posts/hidden-reliability-risks-multi-cluster-kubernetes/">The Hidden Reliability Risks in Multi-Cluster Kubernetes</a> and <a href="/posts/spofs-modern-cloud-native-architectures/">The SPOFs You Did Not Design</a>. What matters here is that an exercise that does not include them measures something other than D.R. capability (<a href="https://elastocera.com/field-notes/operational-knowledge-vs-architectural-knowledge/" class="fn-ref" title="Operational Knowledge vs Architectural Knowledge">FN-0003</a>).</p>
<p>When exercises do happen, results are usually narrated rather than measured. &ldquo;The exercise was successful&rdquo; is not a number. The actual elapsed time, the deviations from the runbook, the dependencies that failed to activate, and the coordination overhead consumed before recovery began are all measurable. They are also rarely written down.</p>
<p>The optimism cascade (<a href="https://elastocera.com/field-notes/assumed-readiness/" class="fn-ref" title="Assumed Readiness">FN-0024</a>) compounds this. The platform team reports the cluster is ready. The security team reports identity is ready. The network team reports DNS is ready. Each report is true within its scope. None of them validate the chain. The organization is preparing for an incident in pieces while incidents arrive whole.</p>
<p>The team that wrote the plan is rarely the team executing it eighteen months later. Knowledge transfer artifacts describe intent, not the operational details required to act on it (<a href="https://elastocera.com/field-notes/available-knowledge-not-applied/" class="fn-ref" title="Available Knowledge Is Not Applied Knowledge">FN-0017</a>). A runbook that worked when its author was on call may fail under any other rotation.</p>
<blockquote>
<p>Tested recovery is recovery in ideal conditions. Real recovery is recovery in degraded ones. The Validation Gap is the distance between them.</p>
</blockquote>
<p><strong>Executive implication:</strong> D.R. governance requires authority across team boundaries. Without a designated owner with cross-functional mandate, every exercise will reflect the readiness of the strongest individual team and ignore the dependencies between teams.</p>
<hr>
<h3 id="from-internal-metric-to-regulatory-exposure">From Internal Metric to Regulatory Exposure</h3>
<p>Until recently, the Validation Gap was a useful internal measurement that almost no organization computed. Starting in 2025, it has begun to acquire regulatory weight.</p>
<p>The <span class="tooltip-term" data-tooltip="DORA (Digital Operational Resilience Act): EU regulation 2022/2554, applicable across the European Union from January 17, 2025. Applies to financial entities and their critical ICT third-party service providers. Requires evidence of tested recovery capability, structured incident reporting, and threat-led penetration testing for significant entities."> Digital Operational Resilience Act </span> (DORA) has applied across the European Union since January 17, 2025. Its requirements are explicit:</p>
<ul>
<li><strong>Articles 24-25</strong> require digital operational resilience testing, including scenario-based exercises with documented outcomes that demonstrate the capability of recovery, not just its plan.</li>
<li><strong>Articles 26-27</strong> require <span class="tooltip-term" data-tooltip="Threat-Led Penetration Testing (TLPT): adversary-simulation exercises required by DORA every three years for significant financial entities, conducted by accredited testers using current threat intelligence. The objective is to validate operational resilience under realistic attack conditions, not to confirm controls in isolation."> threat-led penetration testing </span> every three years for significant entities, conducted by accredited testers under conditions that approximate realistic adversary behavior.</li>
<li><strong>Articles 17-23</strong> require ICT-related incident management and reporting, with the supporting technical standards setting a four-hour initial notification window once an incident is classified as major.</li>
<li><strong>Articles 28-30</strong> require ICT third-party risk management, including contractual evidence that critical providers (cloud platforms among them) meet equivalent resilience standards.</li>
</ul>
<p>For Kubernetes environments operating regulated workloads, these requirements translate the Validation Gap from an internal metric into a finding category. A plan that exists in a wiki article without measured exercise results does not satisfy DORA. A test that recovers a single cluster in isolation does not satisfy a scenario-based requirement. Incident detection and reporting must be instrumented to meet the four-hour notification window, which constrains the design of observability and incident response tooling.</p>
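<p>The notification window is easy to instrument once the starting event is recorded. The sketch below is illustrative only: the function names are hypothetical, and which event starts the clock (detection, classification as major, or something else) is left to the caller as an assumption, because the regulatory text and its technical standards govern that, not this code.</p>

```python
from datetime import datetime, timedelta, timezone

# The four-hour window cited in the text; adjust if the obligation differs.
INITIAL_NOTIFICATION_WINDOW = timedelta(hours=4)


def notification_deadline(clock_start: datetime) -> datetime:
    """Latest moment the initial notification can be sent."""
    return clock_start + INITIAL_NOTIFICATION_WINDOW


def time_remaining(clock_start: datetime, now: datetime) -> timedelta:
    """Time left before the deadline; negative once it has passed."""
    return notification_deadline(clock_start) - now


def is_breached(clock_start: datetime, now: datetime) -> bool:
    """True when the window has closed without being met."""
    return now > notification_deadline(clock_start)


if __name__ == "__main__":
    # Hypothetical incident: the clock starts at 10:00 UTC.
    started = datetime(2025, 3, 1, 10, 0, tzinfo=timezone.utc)
    checked = started + timedelta(hours=1)
    print("deadline:", notification_deadline(started).isoformat())
    print("remaining:", time_remaining(started, checked))
```

<p>Timezone-aware timestamps are used deliberately. An incident timeline assembled from several teams and tools is where naive local times usually go wrong.</p>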
<p>DORA is the most explicit example. It is not the only one.</p>
<p>The <span class="tooltip-term" data-tooltip="NIS2 Directive (EU 2022/2555): in force since January 2023, with national transposition due by October 17, 2024. Expands the scope of cybersecurity and operational resilience requirements to essential and important entities across multiple sectors. Mandates risk management measures including business continuity, incident handling, and supply chain security."> NIS2 Directive </span> has a broader scope than DORA and had to be transposed into national law by October 17, 2024. It covers essential and important entities across energy, transport, banking, healthcare, digital infrastructure, and public administration, and it mandates risk management measures explicitly including business continuity and incident handling. In the United States, the SEC&rsquo;s cybersecurity disclosure rule (Item 1.05 of Form 8-K, effective late 2023) requires public companies to disclose material cybersecurity incidents within four business days of determining that the incident is material. Banking sector guidance from the OCC, FRB, and FDIC continues to raise expectations for operational resilience.</p>
<p>The pattern across all of these is structural:</p>
<blockquote>
<p>Regulators no longer ask whether a plan exists. They ask whether the plan has been tested, by whom, under what conditions, and with what measured outcome.</p>
</blockquote>
<p>The Validation Gap is the metric that answers that question. An organization that has not calculated it is now exposed not only to operational risk, but to regulatory finding risk, and increasingly to public disclosure obligations.</p>
<p><strong>Executive implication:</strong> If the organization operates under DORA, NIS2, SEC cybersecurity disclosure, or any sectoral resilience framework, the Validation Gap has stopped being optional. The audit no longer ends when the plan is reviewed. It ends when the test results are reviewed.</p>
<hr>
<h3 id="how-to-start-recording">How to Start Recording</h3>
<p>The transition from declared D.R. to validated D.R. is structural, not procedural. It changes what an exercise is, who runs it, and how its results are recorded.</p>
<p><strong>Exercises must be timed and end to end.</strong> A test that recovers a single cluster in isolation does not validate enterprise D.R. The exercise must include identity restoration, certificate validation, image availability, network reachability, and application-level recovery. The clock starts when the simulated incident is declared and stops when business operations are confirmed.</p>
<p><strong>The team executing must not be the team that wrote the plan.</strong> The on-call rotation, not the original author, should drive the exercise. This surfaces the gap between documented intent and operationally usable instructions (<a href="https://elastocera.com/field-notes/the-first-incident-test/" class="fn-ref" title="The First Incident Test">FN-0015</a>).</p>
<p><strong>Conditions should be realistic, not ideal.</strong> Recovery exercises in pristine environments validate the procedure under conditions that will not exist during a real incident. Introducing controlled degradation (removed access to a documented system, simulated unavailability of a dependency, partial information about the failure mode) reveals failure modes that pristine tests hide (<a href="https://elastocera.com/field-notes/governance-drift/" class="fn-ref" title="Governance Drift">FN-0007</a>).</p>
<p><strong>Results must be measured, not narrated.</strong> The actual RTO, the actual RPO, the failures encountered, the recovery deviations from the runbook, and the time spent in coordination are the measurements that close the Validation Gap. &ldquo;The exercise was successful&rdquo; is not a measurement.</p>
<p><strong>The Validation Gap must be recorded as a number, alongside the declared RTO.</strong> When leadership reviews the D.R. plan, both numbers should be visible. The declared value alone is no longer sufficient evidence of capability.</p>
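<p>A measured exercise can be captured as a small record rather than a narrative. The sketch below is one possible shape, with hypothetical field names. It assumes the clock runs from incident declaration to confirmed business operations, as described above, and that data loss is measured back from the moment of declaration.</p>

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


def _hours(delta: timedelta) -> float:
    """Express a duration in hours, rounded for reporting."""
    return round(delta.total_seconds() / 3600, 2)


@dataclass
class ExerciseRecord:
    declared_rto: timedelta
    incident_declared_at: datetime      # clock starts here
    recovery_started_at: datetime       # coordination ends, recovery begins
    operations_confirmed_at: datetime   # clock stops here
    last_recoverable_data_at: datetime  # newest data present after recovery
    executed_by: str                    # ideally the on-call rotation
    runbook_deviations: List[str] = field(default_factory=list)
    failed_dependencies: List[str] = field(default_factory=list)

    @property
    def measured_rto(self) -> timedelta:
        return self.operations_confirmed_at - self.incident_declared_at

    @property
    def measured_rpo(self) -> timedelta:
        # Assumption: data loss is measured back from incident declaration.
        return self.incident_declared_at - self.last_recoverable_data_at

    @property
    def coordination_time(self) -> timedelta:
        return self.recovery_started_at - self.incident_declared_at

    @property
    def base_gap(self) -> timedelta:
        return self.measured_rto - self.declared_rto

    def as_report(self) -> Dict[str, object]:
        """Numbers to show leadership next to the declared value."""
        return {
            "executed_by": self.executed_by,
            "declared_rto_hours": _hours(self.declared_rto),
            "measured_rto_hours": _hours(self.measured_rto),
            "measured_rpo_hours": _hours(self.measured_rpo),
            "coordination_hours": _hours(self.coordination_time),
            "validation_gap_hours": _hours(self.base_gap),
            "runbook_deviations": len(self.runbook_deviations),
            "failed_dependencies": len(self.failed_dependencies),
        }
```

<p>The report places the declared and measured values side by side, which is the point: neither number is meaningful to a reviewer without the other.</p>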
<p><em>For an executive-focused treatment of these patterns specifically in Red Hat OpenShift environments, see <a href="/posts/openshift-dr-strategies-fail-executive-level/">Why Most OpenShift D.R. Strategies Fail at Executive Level</a>.</em></p>
<hr>
<h3 id="architectural-continuity">Architectural Continuity</h3>
<p>Disaster recovery is not the document that an auditor reviews. It is the number that the organization is willing to record alongside the declared one.</p>
<blockquote>
<p>Declared capability is a hypothesis.
Tested capability is a measurement.
The Validation Gap is the distance the organization is carrying without recording it.</p>
</blockquote>
<p>The tardigrade survives the vacuum of space, radiation a thousand times the human limit, and temperatures from near absolute zero to 150 degrees Celsius. None of those capabilities are inferred. Each was measured under controlled conditions before the organism was claimed to possess them. Resilience that survives measurement is the only resilience that can be relied upon. Resilience that has only been described will be measured during the first incident, at the moment when the cost of measurement is highest and the time to act on it is shortest.</p>
<hr>
<h3 id="references">References</h3>
<ol>
<li>
<p>Cockroach Labs, &ldquo;<a href="https://www.cockroachlabs.com/blog/the-state-of-resilience-2025-reveals-the-true-cost-of-downtime/">The State of Resilience 2025: Confronting Outages, Downtime, and Organizational Readiness</a>&rdquo;, 2024.</p>
</li>
<li>
<p>European Union, <a href="https://eur-lex.europa.eu/eli/reg/2022/2554/oj">Regulation (EU) 2022/2554 (Digital Operational Resilience Act)</a>, applicable from January 17, 2025.</p>
</li>
<li>
<p>European Union, <a href="https://eur-lex.europa.eu/eli/dir/2022/2555/oj">Directive (EU) 2022/2555 (NIS2 Directive)</a>, transposition deadline October 17, 2024.</p>
</li>
<li>
<p>U.S. Securities and Exchange Commission, <a href="https://www.sec.gov/rules/final/2023/33-11216.pdf">Cybersecurity Risk Management, Strategy, Governance, and Incident Disclosure</a>, final rule, July 2023.</p>
</li>
</ol>
]]></content:encoded>
    </item>
    <item>
      <title>Shadow Infrastructure</title>
      <link>https://elastocera.com/field-notes/shadow-infrastructure/</link>
      <pubDate>Fri, 27 Mar 2026 18:00:00 -0300</pubDate>
      <guid>https://elastocera.com/field-notes/shadow-infrastructure/</guid>
      <description>Field observation on internal platform resources that operate outside the visible infrastructure model used by operators.</description>
      <enclosure url="https://elastocera.com/images/forest-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<h2 id="observation">Observation:</h2>
<p>Modern platforms often contain internal infrastructure that is not visible in the primary operational model used by administrators.</p>
<p>These resources include internal networks, control-plane communication paths, service networks, operator-managed components, and reconciliation controllers.</p>
<p>They exist to support platform behavior rather than application workloads, and are frequently created automatically during cluster deployment.</p>
<p>Because they are not part of the infrastructure model operators typically reason about, they remain largely invisible until they interact with external resources or cause unexpected conflicts.</p>
<h2 id="implication">Implication:</h2>
<p>When failures involve these internal mechanisms, troubleshooting becomes difficult because the affected infrastructure exists outside the mental model used to design and operate the environment.</p>
<p>Platform architectures increasingly depend on infrastructure that operators do not directly see.</p>
<hr>
<p><em>Part of the Field Notes series documenting operational patterns observed in real-world platform architectures.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>The Abstraction Tax</title>
      <link>https://elastocera.com/field-notes/the-abstraction-tax/</link>
      <pubDate>Tue, 24 Mar 2026 18:00:00 -0300</pubDate>
      <guid>https://elastocera.com/field-notes/the-abstraction-tax/</guid>
      <description>Field observation on the operational cost introduced by platform abstraction layers.</description>
      <enclosure url="https://elastocera.com/images/forest-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<h2 id="observation">Observation:</h2>
<p>Every abstraction layer hides complexity from the user while introducing additional operational mechanics behind the scenes.</p>
<p>Controllers reconcile desired state.<br>
Operators manage lifecycle logic.<br>
Networking overlays create new routing paths.</p>
<p>These mechanisms remain mostly invisible during normal operation.</p>
<p>They become visible only when something fails.</p>
<h2 id="implication">Implication:</h2>
<p>The operational overhead created by abstraction layers can be understood as an abstraction tax: a cost paid by the platform team in exchange for simplified interfaces offered to users.</p>
<p>This dynamic often contributes to <strong>Operational Gravity (<a href="https://elastocera.com/field-notes/operational-gravity/" class="fn-ref" title="Operational Gravity">FN-0014</a>)</strong>, where complexity gradually accumulates around platform teams.</p>
<hr>
<p><em>Part of the Field Notes series documenting operational patterns observed in real-world platform architectures.</em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Cloud-Native, Same Old Fragility</title>
      <link>https://elastocera.com/posts/cloud-native-same-old-fragility/</link>
      <pubDate>Mon, 23 Mar 2026 10:00:00 -0300</pubDate>
      <guid>https://elastocera.com/posts/cloud-native-same-old-fragility/</guid>
      <description>Why modern distributed systems still fail in simple ways, and what we are no longer seeing.</description>
      <enclosure url="https://elastocera.com/images/spider-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<blockquote>
<p>Modern systems are distributed.<br>
But fragility didn’t disappear.<br>
It just became harder to see.</p>
</blockquote>
<p>They run across <span class="tooltip-term" data-tooltip="Cluster: a group of nodes managed by a container orchestrator like Kubernetes. Region: a geographic deployment zone within a cloud provider. Provider: the cloud platform itself (AWS, Azure, GCP). Distribution across these layers increases availability in theory, but multiplies failure surfaces in practice."> clusters, regions, providers </span>.
They are <span class="tooltip-term" data-tooltip="Observable: instrumented with metrics, logs, and traces. Containerized: packaged in isolated runtime units (containers). Orchestrated: managed by platforms like Kubernetes that automate scheduling, scaling, and recovery. These properties are often mistaken for resilience, but they describe operational convenience, not fault tolerance."> observable, containerized, orchestrated </span>.</p>
<p>They look resilient.</p>
<p>And yet, they still fail in surprisingly simple ways.</p>
<p>Not because distribution failed.<br>
But because <strong>our understanding didn’t evolve with it</strong>.</p>
<h2 id="the-illusion-of-resilience">The Illusion of Resilience</h2>
<p>Cloud-native architectures are often assumed to be resilient by default.</p>
<p>They are not.</p>
<p>What we actually built are systems that:</p>
<ul>
<li>scale well</li>
<li>deploy fast</li>
<li>look observable</li>
</ul>
<p>But resilience is something else entirely.<br>
And we rarely design for it.</p>
<blockquote>
<p>A system is not resilient because it is distributed.
It is resilient because it can survive the loss of what it depends on (<a href="https://elastocera.com/field-notes/the-layer-illusion/" class="fn-ref" title="The Layer Illusion">FN-0013</a>).</p>
</blockquote>
<p>And most systems today cannot.</p>
<h2 id="the-happy-path-trap">The Happy Path Trap</h2>
<p>Most systems are designed around success.</p>
<p>Requests succeed.<br>
Dependencies respond.<br>
Flows complete.</p>
<p>Failure exists.<br>
But as an afterthought.</p>
<ul>
<li>generic retries</li>
<li>vague error handling</li>
<li>logs that assume context</li>
</ul>
<blockquote>
<p>If your system only knows how to succeed, failure becomes undefined behavior (<a href="https://elastocera.com/field-notes/the-first-incident-test/" class="fn-ref" title="The First Incident Test">FN-0015</a>).</p>
</blockquote>
<p>This is where fragility begins.</p>
<p>Not in infrastructure.<br>
In assumptions.</p>
<h2 id="the-illusion-of-testing">The Illusion of Testing</h2>
<p>Modern delivery pipelines create confidence.</p>
<p>But often, it is misplaced.</p>
<p>We test components in isolation.<br>
We mock dependencies.<br>
We simulate behavior, not reality.<br>
We validate expected outputs.</p>
<p>And then we assume the system will behave.</p>
<blockquote>
<p>Mocks don’t fail like real systems do (<a href="https://elastocera.com/field-notes/illusion-of-isolation/" class="fn-ref" title="The Illusion of Isolation">FN-0004</a>).</p>
</blockquote>
<p>Integration is where reality lives.<br>
And it is often the least tested part.</p>
<blockquote>
<p>Passing tests prove consistency, not correctness under stress.</p>
</blockquote>
<h2 id="hidden-spofs-in-plain-sight">Hidden SPOFs in Plain Sight</h2>
<p><span class="tooltip-term" data-tooltip="SPOF (Single Point of Failure): any component whose failure causes the entire system or a critical path to become unavailable. In cloud native architectures, SPOFs are often hidden behind layers of abstraction: shared DNS resolvers, centralized identity providers, or a single observability pipeline."> Single points of failure </span> did not disappear (<a href="https://elastocera.com/field-notes/hidden-spofs-platform-layers/" class="fn-ref" title="Hidden SPOFs in Platform Layers">FN-0002</a>).</p>
<p>They became harder to see.</p>
<h3 id="dns">DNS</h3>
<p>The most fundamental layer of the internet.</p>
<p>Still misconfigured.<br>
Still under-tested.<br>
Still capable of bringing entire systems down.</p>
<blockquote>
<p>The most critical systems are often the least questioned.</p>
</blockquote>
<h3 id="observability">Observability</h3>
<p>Dashboards are everywhere.</p>
<p>But visibility is not understanding.</p>
<p>When the observability stack fails (or lacks context), diagnosis becomes guesswork.</p>
<blockquote>
<p>A system is observable until it fails outside the path it was designed to show.</p>
</blockquote>
<h3 id="external-dependencies">External Dependencies</h3>
<p>Modern systems rely on external services:</p>
<ul>
<li><span class="tooltip-term" data-tooltip="Identity Provider (IdP): a service that authenticates users and issues tokens or credentials used by applications to authorize access. Examples include Active Directory, Okta, Keycloak, and cloud native IAM services. A failure in the IdP can lock users and services out of every system that depends on it."> identity providers </span></li>
<li><span class="tooltip-term" data-tooltip="CI/CD (Continuous Integration / Continuous Delivery): automated pipelines that build, test, and deploy software. When the CI/CD platform itself fails, teams lose the ability to ship fixes, including fixes for the incident that caused the CI/CD failure."> CI/CD platforms </span></li>
<li>third-party APIs</li>
</ul>
<p>Failures in these integrations are not just technical.</p>
<p>They are organizational.</p>
<blockquote>
<p>Failures in integrated systems don’t just break flows, they break ownership.</p>
</blockquote>
<p>No one knows who should fix the problem.<br>
So no one does it fast enough.</p>
<h2 id="cognitive-fragility">Cognitive Fragility</h2>
<p>As systems evolved, so did abstraction.</p>
<p>Platforms simplified complexity.<br>
Interfaces reduced <span class="tooltip-term" data-tooltip="Cognitive load: the mental effort required to understand and operate a system. In software engineering, high cognitive load means engineers must hold too many details in working memory to reason about system behavior. Abstractions reduce cognitive load by hiding complexity, but they also hide failure modes."> cognitive load </span>.</p>
<p>This is necessary.</p>
<p>But it also distances decision-making from reality.</p>
<p>And it comes with a cost.</p>
<blockquote>
<p>Abstractions reduce cognitive load, but they also hide the system (<a href="https://elastocera.com/field-notes/abstractions-simplify-usage-not-operation/" class="fn-ref" title="Abstractions Simplify Usage, Not Operation">FN-0006</a>).</p>
</blockquote>
<p>Over time, this creates <strong>cognitive blind spots</strong> (<a href="https://elastocera.com/field-notes/the-abstraction-tax/" class="fn-ref" title="The Abstraction Tax">FN-0010</a>):</p>
<ul>
<li>dependencies no one maps</li>
<li>behaviors no one understands</li>
<li>failure modes no one anticipates</li>
</ul>
<blockquote>
<p>You cannot reason about what you cannot see.</p>
</blockquote>
<p>And when the system fails:</p>
<blockquote>
<p>The system breaks, and the organization struggles to respond.</p>
</blockquote>
<h2 id="not-another-disaster-recovery-problem">Not Another Disaster Recovery Problem</h2>
<p>This is not primarily a recovery problem.</p>
<p>It is an understanding problem.<br>
And understanding does not scale by default.</p>
<p><span class="tooltip-term" data-tooltip="Disaster Recovery (D.R.): the set of policies, tools, and procedures designed to recover technology infrastructure after a disruptive event. D.R. strategies often assume that failures are well understood and isolated, an assumption that breaks down in distributed systems where causality is diffuse and dependencies are poorly mapped."> Disaster recovery strategies </span> often assume we know what failed.</p>
<p>In reality, we often don’t.</p>
<blockquote>
<p>You can’t recover from failures you don’t understand.</p>
</blockquote>
<p><em>For a deeper look into recovery strategies, see our previous notes on <a href="/posts/openshift-dr-strategies-fail-executive-level/">disaster recovery</a>.</em></p>
<h2 id="closing">Closing</h2>
<p>We built distributed systems.<br>
But not distributed understanding.</p>
<p>And so, fragility remains.</p>
<p>Not where we used to look.<br>
But exactly where we stopped looking.</p>
<h2 id="fragility-map">Fragility Map</h2>

<div id="graph-elastocera-map" style="width:100%; height:600px; min-height:500px;"></div>

<script>
document.addEventListener("DOMContentLoaded", function () {

  const raw = `\n[\n  \u007b \u0022data\u0022: \u007b \u0022id\u0022: \u0022wild\u0022, \u0022label\u0022: \u0022Systems in the Wild\u0022, \u0022type\u0022: \u0022concept\u0022, \u0022url\u0022: \u0022\/series\/systems-in-the-wild\u0022, \u0022clickable\u0022: true \u007d\u007d,\n  \u007b \u0022data\u0022: \u007b \u0022id\u0022: \u0022dc\u0022, \u0022label\u0022: \u0022Distributed Cognition\u0022, \u0022type\u0022: \u0022concept\u0022 \u007d\u007d,\n  \u007b \u0022data\u0022: \u007b \u0022id\u0022: \u0022patterns\u0022, \u0022label\u0022: \u0022Patterns\u0022, \u0022type\u0022: \u0022pattern\u0022 \u007d\u007d,\n  \u007b \u0022data\u0022: \u007b \u0022id\u0022: \u0022notes\u0022, \u0022label\u0022: \u0022Architecture Notes\u0022, \u0022type\u0022: \u0022note\u0022 \u007d\u007d,\n\n  \u007b \u0022data\u0022: \u007b \u0022source\u0022: \u0022wild\u0022, \u0022target\u0022: \u0022dc\u0022 \u007d\u007d,\n  \u007b \u0022data\u0022: \u007b \u0022source\u0022: \u0022dc\u0022, \u0022target\u0022: \u0022patterns\u0022 \u007d\u007d,\n  \u007b \u0022data\u0022: \u007b \u0022source\u0022: \u0022patterns\u0022, \u0022target\u0022: \u0022notes\u0022 \u007d\u007d\n]`;

  let elements;
  try {
    elements = JSON.parse(raw);
  } catch (e) {
    console.error("Error parsing graph:", e);
    return;
  }

  const cy = cytoscape({
    container: document.getElementById("graph-elastocera-map"),

    elements: elements,

    userZoomingEnabled: false,

    style: [
    {
      selector: 'node',
      style: {
        'label': 'data(label)',
        'color': '#fff',
        'text-valign': 'center',
        'text-halign': 'center',
        'font-size': '14px',
        'text-outline-width': 2,
        'text-outline-color': '#111',
        'background-color': '#666'
      }
    },
    {
      selector: 'node[type="reality"]',
      style: { 'background-color': '#1f77b4' }
    },
    {
      selector: 'node[type="cognition"]',
      style: { 'background-color': '#17becf' }
    },
    {
      selector: 'node[type="pattern"]',
      style: { 'background-color': '#2ca02c' }
    },
    {
      selector: 'node[type="note"]',
      style: { 'background-color': '#d62728' }
    },
    {
      selector: 'node[type="concept"]',
      style: { 'background-color': '#9467bd' }
    },
    {
      selector: 'edge',
      style: {
        'line-color': '#888',
        'width': 2,
        'curve-style': 'bezier'
      }
    },
    {
      selector: 'node.hovered',
      style: {
        'border-width': 4,
        'border-color': '#ffffff'
      }
    },
    {
      selector: 'node[?url]',
      style: {
        'border-width': 2,
        'border-color': '#ffffff',
        'border-opacity': 0.4
      }
    },
    {
      selector: 'node.hovered-clickable',
      style: {
        'border-width': 4,
        'border-color': '#ffffff',

        'shadow-blur': 30,
        'shadow-color': '#ffffff',
        'shadow-opacity': 1
        // The pointer cursor is set on the container in the mouseover handler;
        // Cytoscape.js has no 'cursor' style property.
      }
    }
    ],

    layout: {
      name: 'cose',
      animate: false,
      padding: 20,
      fit: true
    }
  });

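  // Pulse the border of nodes that carry a url: fade in, fade out,
  // then re-queue the pulse each time the fade-out completes.
  function startPulse(node) {
    if (!node.data('url')) return;

    node.animate({
      style: { 'border-opacity': 1 }
    }, {
      duration: 800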
    }).animate({
      style: { 'border-opacity': 0.5 }
    }, {
      duration: 800,
      complete: function() { startPulse(node); }
    });
  }

  cy.on('tap', 'node', function(evt) {
    const url = evt.target.data('url');
    if (url && url.startsWith('/')) {
      window.location.href = url;
    }
  });

  cy.on('mouseover', 'node', function(evt) {
    const node = evt.target;
    node.stop();
    node.addClass('hovered');

    if (node.data('url')) {
      node.addClass('hovered-clickable');
      cy.container().style.cursor = 'pointer';
    }

    node.animate({
      position: { x: node.position('x'), y: node.position('y') - 6 }
    }, { duration: 120 });
  });

  cy.on('mouseout', 'node', function(evt) {
    const node = evt.target;
    node.stop();
    node.removeClass('hovered');
    node.removeClass('hovered-clickable');
    cy.container().style.cursor = 'default';

    node.animate({
      position: { x: node.position('x'), y: node.position('y') + 6 }
    }, { 
      duration: 120,
      complete: function() {
        startPulse(node);
      }
    });
  });

  cy.ready(function() {
    cy.nodes('[url]').forEach(n => {
      startPulse(n);
    });
  });
});
</script>

]]></content:encoded>
    </item>
    <item>
      <title>Abstractions Simplify Usage, Not Operation</title>
      <link>https://elastocera.com/field-notes/abstractions-simplify-usage-not-operation/</link>
      <pubDate>Thu, 12 Mar 2026 18:00:00 -0300</pubDate>
      <guid>https://elastocera.com/field-notes/abstractions-simplify-usage-not-operation/</guid>
      <description>Field observation on how platform abstractions reduce user complexity while increasing operational depth.</description>
        <enclosure url="https://elastocera.com/images/forest-og.jpg" length="0" type="image/jpeg"/>
      <content:encoded><![CDATA[<h2 id="observation">Observation:</h2>
<p>Platform abstractions reduce cognitive load for users.</p>
<p>A developer deploying an application rarely needs to understand how scheduling, networking, storage provisioning, or cluster lifecycle actually work.</p>
<p>The interface becomes simple: deploy, expose, scale.</p>
<p>However, the operational side of the platform moves in the opposite direction.</p>
<p>Each abstraction layer introduces additional controllers, reconciliation loops, networking paths, and state dependencies that must be understood when something fails.</p>
<h2 id="implication">Implication:</h2>
<p>Abstractions successfully simplify usage, but they rarely simplify operation.</p>
<p>Instead, operational complexity becomes concentrated in the platform team responsible for maintaining the abstraction itself (<a href="https://elastocera.com/field-notes/the-abstraction-tax/" class="fn-ref" title="The Abstraction Tax">FN-0010</a>).</p>
<hr>
<p><em>Part of the Field Notes series documenting operational patterns observed in real-world platform architectures.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
