Tag-based AWS resource cleanup: patterns that actually scale


TL;DR. Name and time filters are not enough for safe AWS bulk cleanup. Use tags as the primary signal, expect ListTagsForResource to be your bottleneck, enforce tagging at provisioning time, and run an audit job that flags untagged resources so the policy stays honest.

The “delete a lot of AWS resources at once” problem shows up in every account: CI sandboxes, expired test estates, forgotten dev environments, ad-hoc reproductions left behind. Bulk cleanup tools that target this exist and work well. Used carelessly, any of them is a footgun. Used carefully, with tag filtering, they become one of the most useful pieces of cost discipline you can ship.

This post covers the tag-filter angle: why it matters, the operational gotchas at scale, and a default setup you can run unattended.

Two filters dominate cleanup tools by default. Both have failure modes.

  1. Name filters. They assume you actually named the resource. Plenty of AWS resources get default names from CloudFormation, CDK, or Terraform that are inconsistent across tools and uninteresting to humans.
  2. Time filters. They assume the things you want to delete partition cleanly by age. They do not. Long-lived debug instances, scheduled cluster snapshots, and retained log groups all straddle any age threshold.

Tag filters give you the only reliable signal: did the system that created this resource intend for it to be cleaned up? If Environment=ephemeral is present, delete it. If not, leave it alone. Tags also compose well with other filters: tag plus age, tag plus name, exclude-tag plus everything else.
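
As a concrete composition: the tag rule lives in config, the age guard rides on the command line. A sketch only; the cleanup binary name and --config flag are placeholders, the config shape matches the baseline further down, and --older-than is the same flag used there:

EBS:
  include:
    tags:
      - key: Environment
        value: ephemeral

# invoked as: cleanup --config config.yml --older-than 12h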

This pattern only works if the tag is applied at create time. We come back to this below.

Tag retrieval cost varies wildly across services. Three rough categories:

  • Inline in the describe response. EventBridge buses, some DynamoDB calls. Zero extra API calls.
  • Separate ListTagsForResource call. CloudWatch alarms, log groups, EventBridge rules and schedules, and dozens of others. One extra API call per resource. This is where the cost lives.
  • Resource Groups Tagging API. Some legacy resources only expose tags through that API. Avoid it where possible: in our experience it lags the primary APIs by tens of seconds.

The ListTagsForResource route is the operational pain. With 50,000 alarms, you make 50,000 extra API calls just to filter, before deleting anything. Filter runs are not free.
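
A minimal sketch of that fan-out, using the AWS SDK for Go v2 against CloudWatch alarms. The filterByTag helper name is ours; the point is the one ListTagsForResource call per candidate ARN:

package cleanup

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
)

// filterByTag returns the ARNs whose tag set contains key=value.
// One ListTagsForResource call per candidate: with 50,000 alarms,
// this loop alone is 50,000 extra API calls before anything is deleted.
func filterByTag(ctx context.Context, cw *cloudwatch.Client, arns []string, key, value string) ([]string, error) {
	var matched []string
	for _, arn := range arns {
		out, err := cw.ListTagsForResource(ctx, &cloudwatch.ListTagsForResourceInput{
			ResourceARN: aws.String(arn),
		})
		if err != nil {
			return nil, err
		}
		for _, t := range out.Tags {
			if aws.ToString(t.Key) == key && aws.ToString(t.Value) == value {
				matched = append(matched, arn)
				break
			}
		}
	}
	return matched, nil
}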

ListTagsForResource is rate-limited separately from describe, often more aggressively. You can describe a thousand resources happily, then hit throttles when you fan out into tag fetches. The AWS Go SDK adaptive retry mode helps, but on tight CI time budgets you want to:

  • Reduce worker count for tag-heavy services.
  • Use a coarse include rule to shrink the candidate list before the per-resource tag fetch (where the service API supports server-side filtering).
  • Pre-filter via the Resource Groups Tagging API for very large estates, then drive the cleanup tool with the resulting ARN list.
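
A sketch of the first two knobs in the Go SDK v2. Adaptive retry mode is the SDK's documented throttle-aware retry strategy; the helper names and the worker budget are our assumptions:

package cleanup

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	cwtypes "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// newThrottleAwareClient enables adaptive retries, which back off
// client-side as soon as the SDK starts seeing throttle responses.
func newThrottleAwareClient(ctx context.Context) (*cloudwatch.Client, error) {
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRetryMode(aws.RetryModeAdaptive))
	if err != nil {
		return nil, err
	}
	return cloudwatch.NewFromConfig(cfg), nil
}

// fetchTagsBounded fans out tag fetches under a fixed worker budget.
// Tag-heavy services get fewer workers than the describe phase, not more.
func fetchTagsBounded(ctx context.Context, cw *cloudwatch.Client, arns []string, workers int) (map[string][]cwtypes.Tag, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	tags := make(map[string][]cwtypes.Tag, len(arns))
	sem := make(chan struct{}, workers) // e.g. 4 for alarms
	for _, arn := range arns {
		wg.Add(1)
		sem <- struct{}{}
		go func(arn string) {
			defer wg.Done()
			defer func() { <-sem }()
			out, err := cw.ListTagsForResource(ctx, &cloudwatch.ListTagsForResourceInput{ResourceARN: aws.String(arn)})
			if err != nil {
				once.Do(func() { firstErr = err })
				return
			}
			mu.Lock()
			tags[arn] = out.Tags
			mu.Unlock()
		}(arn)
	}
	wg.Wait()
	return tags, firstErr
}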

A tag added through the AWS Console can take a few seconds to appear via the API. Rarely a problem in production, but if you write a “tag, then immediately nuke” script you will hit it. Add a short delay, or query the same API you will filter on before kicking off cleanup.
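
If you do script it, poll the same API the filter will read rather than sleeping blind. A sketch with an assumed helper name and an arbitrary 2-second poll interval:

package cleanup

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
)

// waitForTag blocks until the tag the cleanup run filters on is visible
// via ListTagsForResource, or the context deadline expires.
func waitForTag(ctx context.Context, cw *cloudwatch.Client, arn, key string) error {
	for {
		out, err := cw.ListTagsForResource(ctx, &cloudwatch.ListTagsForResourceInput{ResourceARN: aws.String(arn)})
		if err != nil {
			return err
		}
		for _, t := range out.Tags {
			if aws.ToString(t.Key) == key {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}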

CloudWatch's DescribeAlarms returns metric alarms and composite alarms as separate lists. The first implementation usually duplicates the tag-fetch loop across both code paths. Extract the loop into a helper at the first opportunity, as in the sketch below: two near-identical loops are how subtle behaviour drift creeps in once a third resource type ships with a copy-paste of the second one.
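
A sketch of the extraction, assuming Go SDK v2 types; collectAlarmARNs is our name. Both alarm flavours flatten into one ARN list, so the tag-fetch loop exists exactly once:

package cleanup

import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
)

// collectAlarmARNs merges metric and composite alarms from one
// DescribeAlarms response into a single list for the shared tag fetch.
func collectAlarmARNs(out *cloudwatch.DescribeAlarmsOutput) []string {
	arns := make([]string, 0, len(out.MetricAlarms)+len(out.CompositeAlarms))
	for _, a := range out.MetricAlarms {
		arns = append(arns, aws.ToString(a.AlarmArn))
	}
	for _, a := range out.CompositeAlarms {
		arns = append(arns, aws.ToString(a.AlarmArn))
	}
	return arns
}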

Tag converter helpers (tagsToMap and friends) need to tolerate nil. AWS SDK calls return nil slices for resources without tags. Filter logic should treat that as “empty tag set”, not “skip resource”. Otherwise an untagged resource silently slips past your tag-based exclude rule.
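
The shape that behaves correctly, sketched against the Go SDK v2 CloudWatch tag type (tagsToMap is the name used above; the body is ours). The returned map is always non-nil, so an untagged resource is evaluated as an empty tag set instead of being skipped:

package cleanup

import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// tagsToMap tolerates the nil slice the SDK returns for untagged
// resources; ranging over nil is a no-op, and the map stays non-nil.
func tagsToMap(tags []types.Tag) map[string]string {
	m := make(map[string]string, len(tags))
	for _, t := range tags {
		m[aws.ToString(t.Key)] = aws.ToString(t.Value)
	}
	return m
}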

Baseline config for ephemeral CI accounts:

EBS:
  include:
    tags:
      - key: Environment
        value: ephemeral
ELBv2:
  include:
    tags:
      - key: ManagedBy
        value: integration-tests
CloudWatchAlarm:
  include:
    tags:
      - key: ci_run_id
        value: "^run-"
DynamoDB:
  exclude:
    tags:
      - key: Retention
        value: keep

Three rules:

  1. Only delete resources carrying the expected provisioning tags (Environment=ephemeral on volumes, ManagedBy=integration-tests on load balancers).
  2. Only delete CI-created alarms by run-id pattern.
  3. Never touch DynamoDB tables marked Retention=keep.

Combined with --older-than 12h, this is a safe daily sweep for any account where the convention is enforced at provisioning time.

Run results from one production CI account, daily for the last 60 days:

  • About 4,200 resources deleted per day on average.
  • Zero false-positive deletions of resources outside the CI run scope.
  • P99 run duration: 14 minutes. Most of that is ListTagsForResource against alarms.

Even with broad coverage in the cleanup tool, gaps remain. Some services have weak tagging APIs, some have effectively none. The realistic answer is a three-step setup:

  1. Tag everything at create time, with a tagging policy enforced via SCPs: no tag, no resource (a minimal SCP sketch follows this list). Add pre-validation in CI so that any new Terraform module sets the required tags.
  2. Run the cleanup tool with tag include rules as the primary filter. Time and name filters become secondary.
  3. Run a separate audit job that flags resources without the required tags. This is the step most teams skip. Without it, untagged resources hide forever and the cleanup policy quietly stops working.
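
The SCP sketch promised in step 1, covering a single action; a real policy enumerates every create action in scope. The deny fires whenever the Environment tag is absent from the creation request:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedInstanceCreation",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/Environment": "true" }
      }
    }
  ]
}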

The audit job is dull, takes a day to write, and is the difference between “we have a tagging policy” and “we have a tagging policy that is enforced”. Run it weekly, post the count to a channel where someone notices, and react when the trend goes the wrong way.
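
A sketch of the dull part against the Go SDK v2 Resource Groups Tagging API (findUntagged is our name; posting the count to a channel is omitted). One caveat: this API only returns resources that have or once had tags, so never-tagged resources still need per-service inventories:

package audit

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/resourcegroupstaggingapi"
)

// findUntagged pages through every resource the tagging API can see and
// flags those missing the required key. Resources that were never tagged
// at all do not appear here; catch those with per-service inventories.
func findUntagged(ctx context.Context, required string) ([]string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	client := resourcegroupstaggingapi.NewFromConfig(cfg)
	p := resourcegroupstaggingapi.NewGetResourcesPaginator(client, &resourcegroupstaggingapi.GetResourcesInput{})
	var violations []string
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return nil, err
		}
		for _, m := range page.ResourceTagMappingList {
			found := false
			for _, t := range m.Tags {
				if aws.ToString(t.Key) == required {
					found = true
					break
				}
			}
			if !found {
				violations = append(violations, aws.ToString(m.ResourceARN))
			}
		}
	}
	return violations, nil
}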

The trade-offs are real but acceptable:

  • More IAM scope. Tag-based filtering needs Read on the tags of every resource type the tool can touch. This expands the read scope of the cleanup IAM role. Acceptable in our experience, because the role already has the destructive Delete permissions; the extra Read does not change the blast radius.
  • Slower runs. A name-only filter takes seconds. A tag-based filter on alarms can take ten minutes. We accept this because the alternative is occasionally deleting the wrong thing, which is far more expensive.
  • Tag drift. Tags get edited, lost on resource recreation, missed by manual operations. The audit step is what keeps this from rotting.

If you operate AWS at any meaningful scale, tag-based cleanup is the difference between a destructive tool you are afraid of and a daily housekeeping job you trust. Start with the tagging policy. Enforce it in IaC. Run the cleanup tool with tag includes. Run the audit. Repeat.

Most of the recent engineering effort in this space has gone into extending tag-filter coverage to more AWS services, so the rule shape stays consistent regardless of which service you target. That consistency is itself the feature.