<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Slack on Project Wintermute</title><link>https://wintermutecore.com/tags/slack/</link><description>Recent content in Slack on Project Wintermute</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 18 Mar 2026 08:00:00 +0200</lastBuildDate><atom:link href="https://wintermutecore.com/tags/slack/index.xml" rel="self" type="application/rss+xml"/><item><title>Building a daily data pipeline with Dagu, Python, and a JSONL data lake</title><link>https://wintermutecore.com/posts/daily-data-pipeline-dagu-python/</link><pubDate>Wed, 18 Mar 2026 08:00:00 +0200</pubDate><guid>https://wintermutecore.com/posts/daily-data-pipeline-dagu-python/</guid><description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Three stages (ingest, index, alert), one stage per script, JSONL as the index format, a flat dedup state file, Dagu for scheduling. Boring, reliable, and dramatically less work than reaching for a heavyweight orchestrator.&lt;/p&gt;
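&lt;p&gt;A minimal sketch of the index stage, with the caveat that the file names and the &lt;code&gt;id&lt;/code&gt; field are illustrative assumptions rather than the post&amp;rsquo;s exact code: new records are appended to the JSONL index, and their IDs to the flat dedup state file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
from pathlib import Path

# Illustrative paths; adjust to your layout.
STATE = Path("state/seen_ids.txt")   # flat dedup state: one record ID per line
INDEX = Path("lake/index.jsonl")     # append-only JSONL index

def index_records(records):
    """Append unseen records to the JSONL index and remember their IDs."""
    seen = set(STATE.read_text().split()) if STATE.exists() else set()
    fresh = [r for r in records if r["id"] not in seen]  # "id" field assumed
    INDEX.parent.mkdir(parents=True, exist_ok=True)
    STATE.parent.mkdir(parents=True, exist_ok=True)
    with INDEX.open("a") as idx, STATE.open("a") as st:
        for r in fresh:
            idx.write(json.dumps(r) + "\n")
            st.write(r["id"] + "\n")
    return fresh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both files are append-only, which is what lets the dedup state stay a flat file instead of a database.&lt;/p&gt;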
&lt;p&gt;There is a class of pipeline that does not deserve a Spark cluster, an Airflow deployment, or a multi-tenant orchestrator. It is the &amp;ldquo;fetch a few hundred records from an API every day, index them, alert on the interesting ones&amp;rdquo; job. We have built this kind of thing dozens of times. Here is the shape that has held up.&lt;/p&gt;</description></item></channel></rss>