The DR Number Almost No One Records

Disaster recovery has three numbers. Almost no organization records all three. The first is the number written into the plan. The second is the number measured during exercises, if exercises happen. The third is the number observed during real incidents. The distance between them is the only metric that matters. It is also the metric that almost no one calculates.

The Three States of DR Capability

Disaster recovery capability exists in three forms simultaneously, and the three forms produce three different numbers. ...

May 22, 2026 · 9 min · 1804 words · Andre Rocha

The SPOFs You Did Not Design

Single points of failure are one of the oldest concepts in systems engineering. They are also one of the most misunderstood in modern architectures. Cloud-native platforms were designed to eliminate them. Redundancy, replication, distribution across zones and regions. The assumption is that if no single component is irreplaceable, the system has no SPOF. That assumption is structurally incomplete. What changed is not the presence of single points of failure. What changed is where they live, how they manifest, and why they remain invisible until an incident exposes them. ...

May 4, 2026 · 9 min · 1870 words · Andre Rocha

Cost Optimization vs Risk Concentration in Hosted Control Planes

Hosted control planes are presented as a cost optimization strategy. They are also a risk consolidation strategy. The industry treats these as separate conversations. One belongs to FinOps reports. The other belongs to architecture reviews. ...

May 1, 2026 · 7 min · 1484 words · Andre Rocha

The Hidden Reliability Risks in Multi-Cluster Kubernetes

Multi-cluster Kubernetes is often introduced as a solution to failure. In practice, it does something more subtle. It changes the shape of failure. Failures do not disappear. They stop being local, predictable, and contained. They become distributed, indirect, and delayed. The most dangerous part is not the failure itself. These failure modes share a pattern: they rarely appear in architecture diagrams, do not violate best practices, and only become visible under specific lifecycle events. ...

April 6, 2026 · 6 min · 1170 words · Andre Rocha