TL;DR. A backend deployment lost all healthy pods. nginx active health checks marked the upstream pool empty. That pool was the default_server for ports 80 and 443, so every unmatched hostname returned 502 for 6 hours and 38 minutes. The trigger was Kubernetes-side. The blast radius was a configuration choice we made years ago. The post-incident fixes were almost all on the nginx side.
We had a P0 outage on a public ingress tier. Two redundant nginx instances, both showing the same symptoms, both serving production traffic to dozens of hostnames. This is the writeup, sanitised and reduced to the parts that generalise.
Symptom ¶
Connection refused or 502 Bad Gateway on every request, on every hostname routed through the affected ingress pair. nginx error logs grew at multiple gigabytes per hour, dominated by two patterns:
no live upstreams while connecting to upstream
upstream timed out (110: Connection timed out) while connecting to upstream
Roughly 4.2 million no live upstreams errors and 593,000 timeouts on instance 1, with near-identical numbers on instance 2. Seeing identical patterns on two independent ingress pods is a strong signal that the cause is not on the ingress side.
Architecture ¶
Internet
  -> Cloud LB
    -> nginx ingress pods (2x, identical config)
      -> Kubernetes nodes via NodePorts
         :31820 -> default-app (default_server for :80 and :443)
         :32443 -> https-apps
         :30100 -> static-apps
         :31808 -> internal-services
         :8080  -> dns-resolved-app (resolved via cluster DNS)
The critical detail: the upstream behind NodePort :31820 was the default_server for both port 80 and port 443. Anything that did not match a more specific server block landed on it.
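In sketch form, the load-bearing default looked something like this (upstream name, addresses, and cert paths are illustrative, not our real ones):

upstream default_app_upstream {
    server 10.0.0.11:31820 max_fails=5 fail_timeout=10s;
    server 10.0.0.12:31820 max_fails=5 fail_timeout=10s;
}

server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate     /etc/nginx/certs/default.pem;
    ssl_certificate_key /etc/nginx/certs/default.key;
    location / {
        proxy_pass http://default_app_upstream;   # every unmatched hostname lands here
    }
}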
Root cause ¶
The :31820 backend pods went unresponsive at 06:44 UTC. nginx had active health checks configured against them (via an upstream check module; stock open-source nginx only does passive checks):
check interval=5000 rise=2 fall=5 timeout=3000 type=http;
check_http_send "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;
After 5 consecutive failed checks (fall=5 at a 5-second interval, a 25-second window), every server in the upstream pool was marked DOWN. The pool was now empty.
Because that pool was the default_server for both ports, every request that did not match a more specific server block hit an empty upstream and returned 502. That cascaded:
- Pool empty: all unmatched hostnames returned 502.
- Health check failures kept logging at full request rate, blowing past 1.5 GB of error logs per instance.
- Other upstreams started timing out as connection budgets and worker time were saturated by the failing default.
- A second upstream was configured with cluster DNS (server some-app.namespace.svc.cluster.local:8080 resolve;). When CoreDNS came under pressure, that resolution also failed, taking another batch of hostnames down. The coupling is sketched after this list.
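For reference, a DNS-coupled upstream of that shape looks roughly like this (resolver address and names are illustrative). The resolve parameter re-resolves the hostname at runtime, which is exactly the runtime dependency on CoreDNS:

resolver 10.96.0.10 valid=30s;        # cluster DNS ClusterIP (illustrative)

upstream dns_resolved_upstream {
    zone dns_resolved_upstream 64k;   # shared memory zone, required when resolve is used
    server some-app.namespace.svc.cluster.local:8080 resolve;
}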
The trigger was on the Kubernetes side: the deployment behind :31820 lost all healthy pods. The blast radius was on the nginx side: a single backend pool was load-bearing for everything that did not have a specific match, and we had no breaker between “default backend down” and “every hostname returns 502”.
What we changed afterwards ¶
The interesting part. In rough priority order:
1. Tune timeouts so failure does not cascade ¶
Connect timeouts of 2 seconds were too aggressive against a backend already under stress. Each retry burned a worker slot, and pool exhaustion happened faster than the upstream had a chance to recover.
proxy_connect_timeout 5s; # was 2s
proxy_send_timeout 15s; # was 8s
proxy_read_timeout 15s; # was 8s
send_timeout 5s; # was 2s
Applied across 14 location blocks on both instances.
2. Tune max_fails and fail_timeout ¶
Original: max_fails=5 fail_timeout=10s. That marks a server down after 5 failures within a 10-second window, then keeps it down for 10 seconds before retrying (fail_timeout is both the counting window and the down duration). Under sustained backend trouble, this produces constant flapping: down, 10 seconds, retry, fail, down, 10 seconds.
Updated to max_fails=3 fail_timeout=30s. Fewer strikes to mark a server down, and a longer quiet period before it is retried. The log spam during the next degraded period dropped by an order of magnitude.
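Concretely, each server line in the pool changed like this (IP and port illustrative):

# before: 5 strikes, retried again after only 10 seconds
server 10.0.0.11:31820 max_fails=5 fail_timeout=10s;

# after: 3 strikes, 30 seconds of quiet before the next retry
server 10.0.0.11:31820 max_fails=3 fail_timeout=30s;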
3. Replace cluster DNS upstreams with explicit NodePort lists ¶
The DNS-resolved upstream meant ingress availability had a hard dependency on CoreDNS. We replaced it with the same explicit NodePort pattern used for everything else:
upstream some_app_upstream {
    least_conn;
    server 10.0.0.11:32100 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:32100 max_fails=3 fail_timeout=30s;
    server 10.0.0.13:32100 max_fails=3 fail_timeout=30s;
    server 10.0.0.14:32100 max_fails=3 fail_timeout=30s;
    server 10.0.0.15:32100 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
Less elegant than DNS-based discovery, but availability is more important than elegance, and the node list churns rarely.
4. Cap upstream keepalive ¶
keepalive 10240 per worker on the default pool was historical and never benchmarked. With multiple workers it represented an enormous file descriptor budget. Reduced to 256: no measurable latency impact, a much smaller memory footprint, and far less FD pressure under stress.
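In sketch form, per pool (name and addresses illustrative):

upstream default_app_upstream {
    least_conn;
    server 10.0.0.11:31820 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:31820 max_fails=3 fail_timeout=30s;
    keepalive 256;   # was 10240; idle upstream connections cached per worker
}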
5. Alert on “no live upstreams” ¶
We added a log-based alert that pages on a non-zero rate of no live upstreams in nginx error logs. This is the canary that should never appear at a sustained rate, and when it does, the response time should be minutes, not hours. We also added a synthetic check on a few high-value hostnames so we are not relying entirely on log signals.
6. Fix a latent SSL configuration bug ¶
While reading the configs we found one server block listening on 443 ssl with no ssl_certificate directives at all, surviving silently because the hostname was not heavily used. The reload-time check did not catch it because of the way the include order worked. Fixed and added a CI lint that runs nginx -t against the rendered config.
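The fixed block, sketched (hostname, cert paths, and upstream name illustrative):

server {
    listen 443 ssl;
    server_name underused.example.com;
    ssl_certificate     /etc/nginx/certs/underused.pem;   # these two lines were missing
    ssl_certificate_key /etc/nginx/certs/underused.key;
    location / {
        proxy_pass http://internal_services_upstream;
    }
}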
7. Fix the actually-broken backend ¶
Behind all this, the trigger was on the cluster: a CrashLoopBackOff in the deployment behind :31820, plus a node that had been quietly dead for 19 days, which together reduced capacity below the threshold needed to absorb the failures. The deployment was fixed, the dead node restored, and we added a job that flags any node that has been NotReady for more than an hour.
Lessons that generalise ¶
- Catchall traffic should not depend on a single upstream pool. Either run two parallel default backends with weighting, or have nginx return a static “we are sorry” page when the default pool is empty rather than failing every request (sketched after this list).
- Health-check tuning is half the system. Fast detection is good, slow recovery is bad. The two parameters fight each other and the defaults are rarely right.
- DNS-based service discovery on the data path is a coupling. Use it intentionally, not by default. If it is on the path, treat the resolver as a tier-1 service.
- Error logs over a gigabyte per hour are not a log volume problem. They are a “you are not alerting on the actual symptom” problem.
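A minimal sketch of the static-fallback idea from the first bullet, assuming a sorry page shipped at /usr/share/nginx/html/sorry.html (all names illustrative):

server {
    listen 80 default_server;
    server_name _;
    location / {
        proxy_pass http://default_app_upstream;
        # "no live upstreams" surfaces as a locally generated 502, which
        # error_page catches; intercepting also covers 5xx the backend returns itself
        proxy_intercept_errors on;
        error_page 502 503 504 = @sorry;
    }
    location @sorry {
        root /usr/share/nginx/html;
        try_files /sorry.html =503;
    }
}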
Trade-offs in the fixes ¶
- Slower failure detection. max_fails=3 plus longer timeouts means a genuinely dead backend takes a few extra seconds to be marked down. Acceptable in exchange for not flapping.
- Static node lists. Replacing DNS with explicit IPs adds a config update step when the cluster is rescaled. We accept this because the rescaling cadence is low and the availability gain is high. Automate the regeneration if you cannot.
- Lower keepalive. Slightly more connection setup churn at high RPS. Negligible at our traffic levels.
Bottom line ¶
The cluster-side trigger was rare. The blast radius was a configuration choice we made years earlier and forgot we had made. The interesting fix was always the second of those: making the ingress survive a backend failure without taking everything else with it. Treat catchalls as load-bearing infrastructure, alert on the canary, and rehearse the failure path before it happens for real.