<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Site Reliability on Project Wintermute</title><link>https://wintermutecore.com/tags/site-reliability/</link><description>Recent content in Site Reliability on Project Wintermute</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 09 Feb 2026 12:00:00 +0200</lastBuildDate><atom:link href="https://wintermutecore.com/tags/site-reliability/index.xml" rel="self" type="application/rss+xml"/><item><title>Anatomy of a 6-hour Kubernetes ingress outage</title><link>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</link><pubDate>Mon, 09 Feb 2026 12:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; A backend deployment lost all healthy pods. nginx active health checks marked the upstream pool empty. That pool backed the &lt;code&gt;default_server&lt;/code&gt; for ports 80 and 443, so every unmatched hostname returned 502 for 6 hours and 38 minutes. The trigger was Kubernetes-side. The blast radius came from a configuration choice we made years ago. The post-incident fixes were almost all on the nginx side.&lt;/p&gt;
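&lt;p&gt;For orientation, a minimal sketch of the shape that failed is below. It is not our production config: the upstream name, addresses, and certificate paths are placeholders, and the active health-check mechanism is only noted in a comment. The point is that the &lt;code&gt;default_server&lt;/code&gt; listener proxies everything unmatched to a single upstream, so an empty upstream turns into 502s for every hostname that lacks its own &lt;code&gt;server&lt;/code&gt; block.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative sketch only: names, addresses and paths are placeholders.
upstream backend_pool {
    # Assumption: some active health-check mechanism (NGINX Plus health_check
    # or an equivalent module) marks members down. Once every member is down,
    # proxy_pass has no live upstream left and nginx answers 502 on this listener.
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    # default_server catches every Host header that no other server block
    # claims; that is why the blast radius was every unmatched hostname,
    # not just the one application whose pods went away.
    listen 80 default_server;
    listen 443 ssl default_server;
    ssl_certificate     /etc/nginx/tls/placeholder.pem;
    ssl_certificate_key /etc/nginx/tls/placeholder.key;

    location / {
        proxy_pass http://backend_pool;
    }
}
&lt;/code&gt;&lt;/pre&gt;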
&lt;p&gt;We had a P0 outage on a public ingress tier. Two redundant nginx instances, both showing the same symptoms, both serving production traffic to dozens of hostnames. This is the writeup, sanitised and reduced to the parts that generalise.&lt;/p&gt;</description></item></channel></rss>