Resilience


The DR Number Almost No One Records

Disaster recovery has three numbers. Almost no organization records all three. The first is the number written into the plan. The second is the number measured during exercises, if exercises happen. The third is the number observed during real incidents. The distance between them is the only metric that matters. It is also the metric that almost no one calculates.

The Three States of D.R. Capability

Disaster recovery capability exists in three forms simultaneously, and the three forms produce three different numbers. ...
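As a rough illustration of the distance between the three numbers, the sketch below records a planned, an exercised, and an observed figure and reports the gaps between them. It assumes the numbers are recovery times in minutes. The class name, field names, and sample values are hypothetical and not taken from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryTimes:
    """Three recovery-time figures for one system (hypothetical model, in minutes)."""
    planned_minutes: float                      # the number written into the plan
    exercised_minutes: Optional[float] = None   # measured during exercises, if any
    observed_minutes: Optional[float] = None    # observed during real incidents, if any

    def gaps(self) -> dict[str, Optional[float]]:
        """Return pairwise differences; None means a number was never recorded."""
        def diff(later: Optional[float], earlier: Optional[float]) -> Optional[float]:
            if later is None or earlier is None:
                return None
            return later - earlier

        return {
            "exercise_vs_plan": diff(self.exercised_minutes, self.planned_minutes),
            "incident_vs_plan": diff(self.observed_minutes, self.planned_minutes),
            "incident_vs_exercise": diff(self.observed_minutes, self.exercised_minutes),
        }


# Illustrative values only: a plan that promises 60 minutes, an exercise that
# took 150, and no real incident recorded yet.
example = RecoveryTimes(planned_minutes=60, exercised_minutes=150)
gaps = example.gaps()
```

A `None` entry is itself informative here: it marks a number that was never recorded, which is the situation the excerpt describes.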


The SPOFs You Did Not Design

Single points of failure are one of the oldest concepts in systems engineering. They are also one of the most misunderstood in modern architectures. Cloud-native platforms were designed to eliminate them. Redundancy, replication, distribution across zones and regions. The assumption is that if no single component is irreplaceable, the system has no SPOF. That assumption is structurally incomplete. What changed is not the presence of single points of failure. What changed is where they live, how they manifest, and why they remain invisible until an incident exposes them. ...

Cloud-Native, Same Old Fragility

Modern systems are distributed. But fragility didn’t disappear. It just became harder to see. These systems run across clusters, regions, and providers. They are observable, containerized, and orchestrated. ...

Hidden SPOFs in Platform Layers

Observation: Resilience engineering focuses on application workloads. The platform layers those workloads depend on, like identity providers, container registries, DNS resolvers and certificate authorities, are often treated as stable infrastructure rather than independent failure domains (FN-0004).

Implication: Workload resilience is bounded by the resilience of the platform beneath it. A highly available application running on a shared, unexamined registry is only as resilient as that registry.

Part of the Field Notes series documenting operational patterns observed in real-world platform architectures.
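The bound described in this field note can be sketched with simple availability arithmetic. This is a minimal sketch under two loud assumptions: every platform dependency is strictly required (a serial chain), and failures are independent. All availability figures and names below are hypothetical, not measurements from the article.

```python
from math import prod


def effective_availability(workload: float, platform_deps: dict[str, float]) -> float:
    """Availability of a workload whose platform dependencies are all required.

    Assumes independent failures and a serial dependency chain, so the
    availabilities multiply. Real dependencies are often needed only at
    certain moments (for example, a registry at image pull time), so treat
    this as an illustration of the bound, not a prediction.
    """
    return workload * prod(platform_deps.values())


# Hypothetical availability figures for the platform layers named in the note.
deps = {
    "identity_provider": 0.999,
    "container_registry": 0.995,
    "dns_resolver": 0.9999,
    "certificate_authority": 0.999,
}

result = effective_availability(workload=0.9999, platform_deps=deps)

# The weakest platform layer caps the result, however available the workload is.
weakest = min(deps, key=deps.get)
```

Under these assumptions the result can never exceed the availability of the weakest required layer, which here is the registry.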

FN-0002

Why Most OpenShift DR Strategies Fail at Executive Level

Most enterprise OpenShift disaster recovery strategies are designed to satisfy audits, not to survive real incidents. They describe recovery procedures, declare RPO and RTO targets, and complete compliance checklists. What they rarely do is demonstrate recovery capability under realistic conditions. This distinction matters more than it appears. Having a D.R. plan and having D.R. capability are fundamentally different things. The first is a document. The second is a measurable organizational competence that requires investment, testing, and continuous validation. ...