<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Site Reliability on Project Wintermute</title><link>https://wintermutecore.com/tags/site-reliability/</link><description>Recent content in Site Reliability on Project Wintermute</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 09 Feb 2026 12:00:00 +0200</lastBuildDate><atom:link href="https://wintermutecore.com/tags/site-reliability/index.xml" rel="self" type="application/rss+xml"/><item><title>Anatomy of a 6-hour Kubernetes ingress outage</title><link>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</link><pubDate>Mon, 09 Feb 2026 12:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; A backend deployment lost all healthy pods. nginx active health checks marked the upstream pool empty. That pool backed the &lt;code&gt;default_server&lt;/code&gt; for ports 80 and 443, so every unmatched hostname returned 502 for 6 hours and 38 minutes. The trigger was Kubernetes-side. The blast radius came from a configuration choice we made years ago. The post-incident fixes were almost all on the nginx side.&lt;/p&gt;
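&lt;p&gt;For orientation, a minimal sketch of the shape that failed is below. It is not our production config: the upstream name, addresses, and certificate paths are placeholders, and the active health-check mechanism is only noted in a comment. The point is that the &lt;code&gt;default_server&lt;/code&gt; listener proxies everything unmatched to a single upstream, so an empty upstream turns into 502s for every hostname that lacks its own &lt;code&gt;server&lt;/code&gt; block.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative sketch only: names, addresses and paths are placeholders.
upstream backend_pool {
    # Assumption: some active health-check mechanism (NGINX Plus health_check
    # or an equivalent module) marks members down. Once every member is down,
    # proxy_pass has no live upstream left and nginx answers 502 on this listener.
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    # default_server catches every Host header that no other server block
    # claims; that is why the blast radius was every unmatched hostname,
    # not just the one application whose pods went away.
    listen 80 default_server;
    listen 443 ssl default_server;
    ssl_certificate     /etc/nginx/tls/placeholder.pem;
    ssl_certificate_key /etc/nginx/tls/placeholder.key;

    location / {
        proxy_pass http://backend_pool;
    }
}
&lt;/code&gt;&lt;/pre&gt;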
&lt;p&gt;We had a P0 outage on a public ingress tier. Two redundant nginx instances, both showing the same symptoms, both serving production traffic to dozens of hostnames. This is the writeup, sanitised and reduced to the parts that generalise.&lt;/p&gt;</description></item></channel></rss>