<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Engineering posts on Project Wintermute</title><link>https://wintermutecore.com/posts/</link><description>Recent content in Engineering posts on Project Wintermute</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 09 Apr 2026 09:00:00 +0200</lastBuildDate><atom:link href="https://wintermutecore.com/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Migrating Go test helpers from Azure SDK v1 to v2 (track 2)</title><link>https://wintermutecore.com/posts/azure-sdk-v2-go-migration/</link><pubDate>Thu, 09 Apr 2026 09:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/azure-sdk-v2-go-migration/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Track 2 of &lt;code&gt;azure-sdk-for-go&lt;/code&gt; is not a search-and-replace from track 1. Plan for a credential rework, a client factory pass, and a nil-safety sweep on every pointer chain. Lock the patterns in CI lints so the next batch ships cleaner than the previous one.&lt;/p&gt;
&lt;p&gt;Microsoft&amp;rsquo;s &lt;code&gt;azure-sdk-for-go&lt;/code&gt; track 1 packages (rooted at &lt;code&gt;github.com/Azure/azure-sdk-for-go/services/...&lt;/code&gt;) have been on the deprecation path for a while. Track 2 (the &lt;code&gt;armXxx&lt;/code&gt; packages under &lt;code&gt;github.com/Azure/azure-sdk-for-go/sdk/...&lt;/code&gt;) is the supported surface and the only one receiving fixes.&lt;/p&gt;</description></item><item><title>Tag-based AWS resource cleanup: patterns that actually scale</title><link>https://wintermutecore.com/posts/aws-tag-based-resource-cleanup/</link><pubDate>Fri, 03 Apr 2026 11:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/aws-tag-based-resource-cleanup/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Name and time filters are not enough for safe AWS bulk cleanup. Use tags as the primary signal, expect &lt;code&gt;ListTagsForResource&lt;/code&gt; to be your bottleneck, enforce tagging at provisioning time, and run an audit job that flags untagged resources so the policy stays honest.&lt;/p&gt;
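&lt;p&gt;A sketch of tag-first discovery, assuming the AWS SDK for Go v2 and an illustrative &lt;code&gt;env=sandbox&lt;/code&gt; tag. One paginated &lt;code&gt;GetResources&lt;/code&gt; call against the Resource Groups Tagging API can stand in for much of the per-service &lt;code&gt;ListTagsForResource&lt;/code&gt; fan-out:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go-v2/service/resourcegroupstaggingapi/types"
)

// listDoomed returns the ARN of every resource carrying the cleanup tag.
func listDoomed(ctx context.Context) ([]string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	client := resourcegroupstaggingapi.NewFromConfig(cfg)
	input := resourcegroupstaggingapi.GetResourcesInput{
		TagFilters: []types.TagFilter{
			{Key: aws.String("env"), Values: []string{"sandbox"}},
		},
	}
	var arns []string
	p := resourcegroupstaggingapi.NewGetResourcesPaginator(client, &amp;amp;input)
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, m := range page.ResourceTagMappingList {
			if m.ResourceARN != nil {
				arns = append(arns, *m.ResourceARN)
			}
		}
	}
	return arns, nil
}&lt;/code&gt;&lt;/pre&gt;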
&lt;p&gt;The &amp;ldquo;delete a lot of AWS resources at once&amp;rdquo; problem shows up in every account: CI sandboxes, expired test estates, forgotten dev environments, ad-hoc reproductions left behind. Bulk cleanup tools that target this exist and work well. Used carelessly, any of them is a footgun. Used carefully with tag filtering, they become one of the most useful pieces of cost discipline you can ship.&lt;/p&gt;</description></item><item><title>Building a daily data pipeline with Dagu, Python, and a JSONL data lake</title><link>https://wintermutecore.com/posts/daily-data-pipeline-dagu-python/</link><pubDate>Wed, 18 Mar 2026 08:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/daily-data-pipeline-dagu-python/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Three stages (ingest, index, alert), one stage per script, JSONL as the index format, a flat dedup state file, Dagu for scheduling. Boring, reliable, and dramatically less work than reaching for a heavyweight orchestrator.&lt;/p&gt;
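&lt;p&gt;The whole schedule fits in one small file. A sketch of the DAG definition, assuming Dagu&amp;rsquo;s YAML step schema; the script names and the 07:00 cron are illustrative:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;schedule: "0 7 * * *"        # one run per day
steps:
  - name: ingest             # fetch the day's records from the API
    command: python ingest.py
  - name: index              # append normalised records to the JSONL lake
    command: python index.py
    depends:
      - ingest
  - name: alert              # diff against the dedup state, notify on new hits
    command: python alert.py
    depends:
      - index&lt;/code&gt;&lt;/pre&gt;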
&lt;p&gt;There is a class of pipeline that does not deserve a Spark cluster, an Airflow deployment, or a multi-tenant orchestrator. It is the &amp;ldquo;fetch a few hundred records from an API every day, index them, alert on the interesting ones&amp;rdquo; job. We have built this kind of thing dozens of times. Here is the shape that has held up.&lt;/p&gt;</description></item><item><title>k3s on Hetzner: notes from running production clusters</title><link>https://wintermutecore.com/posts/k3s-hetzner-production-notes/</link><pubDate>Thu, 05 Mar 2026 14:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/k3s-hetzner-production-notes/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; k3s on Hetzner is a strong cost-control move when you are willing to operate the cluster. Mind the Flannel MTU on Hetzner private networks, separate stateless and stateful workloads at the storage layer, keep observability minimal but real, and treat backups as a tested practice rather than a config setting.&lt;/p&gt;
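&lt;p&gt;Hetzner private networks run at an MTU of 1450, and Flannel derives its VXLAN MTU from the interface it binds to, so the usual fix is to point k3s at the private NIC explicitly. A sketch; the interface name &lt;code&gt;ens10&lt;/code&gt; is an assumption and varies by server type:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Bind Flannel to the private network interface so the VXLAN MTU it
# computes fits inside the 1450-byte MTU of Hetzner private networks.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--flannel-iface=ens10" sh -&lt;/code&gt;&lt;/pre&gt;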
&lt;p&gt;A managed Kubernetes service is the right answer for most teams. When it is not the right answer (cost, control, locality of data), self-hosted k3s on a low-cost provider like Hetzner is one of the better options. We have run several clusters of this shape in production for over a year. This post is the set of decisions that have held up.&lt;/p&gt;</description></item><item><title>Speeding up GitHub Actions lint pipelines for large Go codebases</title><link>https://wintermutecore.com/posts/go-ci-lint-pipeline-optimisation/</link><pubDate>Thu, 12 Feb 2026 10:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/go-ci-lint-pipeline-optimisation/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Lint on a large Go monorepo went from 63 seconds to about 25 seconds on warm cache, with macOS skipped on branches. Five changes: concurrency group, conditional OS matrix, combined cache restore and save, explicit &lt;code&gt;go mod download&lt;/code&gt;, and incremental &lt;code&gt;golangci-lint --new-from-rev&lt;/code&gt;. None require a self-hosted runner.&lt;/p&gt;
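&lt;p&gt;Three of the five changes in workflow form. The base rev and ref names are assumptions, and the golangci-lint install step is elided:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Cancel superseded runs on the same ref instead of queueing behind them.
concurrency:
  group: lint-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0        # --new-from-rev needs history to diff against
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - run: go mod download    # explicit, cacheable module fetch
      # Lint only what changed relative to the base branch.
      - run: golangci-lint run --new-from-rev=origin/main&lt;/code&gt;&lt;/pre&gt;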
&lt;p&gt;A large Go codebase makes the CI lint stage the part developers feel most: every push, on every branch. Lint feedback that takes a minute and a half kills iteration speed and quietly trains people to push less often, which is the opposite of what you want.&lt;/p&gt;</description></item><item><title>Anatomy of a 6-hour Kubernetes ingress outage</title><link>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</link><pubDate>Mon, 09 Feb 2026 12:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/kubernetes-ingress-outage-postmortem/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; A backend deployment lost all healthy pods. nginx active health checks marked the upstream pool empty. The server block fronting that pool was the &lt;code&gt;default_server&lt;/code&gt; for ports 80 and 443, so every unmatched hostname returned 502 for 6 hours and 38 minutes. The trigger was Kubernetes-side. The blast radius was a configuration choice we made years ago. The post-incident fixes were almost all on the nginx side.&lt;/p&gt;
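&lt;p&gt;The fix that generalises best: never let a real application upstream double as the implicit &lt;code&gt;default_server&lt;/code&gt;. Give unmatched hostnames a dedicated catch-all whose behaviour does not depend on any backend being healthy. A sketch; the certificate paths are placeholders:&lt;/p&gt;&lt;pre&gt;&lt;code&gt;# Catch-all for hostnames that no server_name matches.
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate     /etc/nginx/fallback.crt;    # placeholder
    ssl_certificate_key /etc/nginx/fallback.key;    # placeholder
    return 444;    # close the connection without a response
}&lt;/code&gt;&lt;/pre&gt;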
&lt;p&gt;We had a P0 outage on a public ingress tier. Two redundant nginx instances, both showing the same symptoms, both serving production traffic to dozens of hostnames. This is the writeup, sanitised and reduced to the parts that generalise.&lt;/p&gt;</description></item></channel></rss>