
Everyone said go serverless. We kept the Kubernetes cluster that mostly sits idle

Authagonal · July 5, 2026
auth · infrastructure · kubernetes · aks · serverless · azure · cost

The advice is everywhere and it sounds obviously right. You run authentication for tenants whose traffic is spiky. You pay for a Kubernetes cluster around the clock. Azure Container Apps will scale you to zero and bill you only for what you use. Stop paying for idle.

We took it seriously. We costed the migration to Azure Container Apps, looked hard at what it would buy us, and stayed on AKS. Here is the reasoning, because it is not the reasoning you would guess from the cluster's utilization graph.

The advice is right, for a workload we don't have

Serverless containers are an excellent fit for stateless, bursty request handling: each request is independent, instances are cattle, and zero traffic should mean zero bill. That describes a lot of web backends.

It does not describe an auth backplane. Ours isn't idle in the way a stateless app is idle. It holds cluster leadership and runs the coordination layer that decides which replica performs singleton work like the retention sweep and the webhook pass. (We elect that leader with a blob lease now, not gossip, which is its own story.) The cluster being quiet doesn't mean it has nothing to do; it means it's sitting ready to validate the next token and to keep being the leader. Quiet and idle are not the same state.

We killed the idle bill without moving

Here's the part that decided it: the cost we were trying to escape had a much cheaper fix than re-platforming.

The thing draining money was nodes running 24/7 while traffic was concentrated in working hours. So we made the dev cluster deallocate when nobody's using it (off overnight and on weekends, back up on demand) and switched the node pool to a cheaper SKU. That took out most of the idle spend serverless was promising to remove, and it cost us a config change rather than a migration. When the savings you want are achievable in place, re-platforming to chase the same number is a lot of risk for a delta you already captured.
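As a rough illustration of "off overnight and on weekends, back up on demand", the decision can be sketched as one small function. This is only a sketch: the 07:00 to 19:00 window, the function names, and the on_demand_until parameter are assumptions for the example, not the actual schedule or tooling.

```python
from datetime import datetime, time
from typing import Optional

# Hypothetical working window; the post does not state the real hours.
WORK_START = time(7, 0)
WORK_END = time(19, 0)


def should_be_running(now: datetime, on_demand_until: Optional[datetime] = None) -> bool:
    """Return True if the dev cluster should be allocated at `now`."""
    # An explicit on-demand request overrides the schedule until it lapses.
    if on_demand_until is not None and now < on_demand_until:
        return True
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and WORK_START <= now.time() < WORK_END


def scheduled_hours_per_week() -> float:
    """Hours per week the schedule alone keeps nodes allocated."""
    start = WORK_START.hour + WORK_START.minute / 60
    end = WORK_END.hour + WORK_END.minute / 60
    return 5 * (end - start)
```

Under these assumed hours the schedule keeps nodes allocated for 60 of the 168 hours in a week. The real figure depends on the window you pick and on how often on-demand starts fire.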

The lesson generalizes: price the fix before you price the re-platform. "Stop paying for idle" is a goal, not an architecture, and the cheapest way to hit it is often a scheduler and a SKU, not a new runtime.

Scale-to-zero fights a backplane that has to hold a lease

Even setting cost aside, scale-to-zero is actively wrong for this workload in two ways.

First, cold starts land on the worst possible path. The request that pays the cold-start tax is somebody trying to log in, and "your sign-in was slow because our auth service was asleep" is not a sentence you want to ship.

Second, leadership and scale-to-zero are in direct conflict. You cannot hold a lease from an instance that has been scaled away. A backplane whose whole job is to always have exactly one live leader does not want a runtime whose whole job is to remove instances when they look quiet. We'd be fighting the platform's core behavior to preserve ours.
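The conflict is easier to see with a toy model of a lease. The sketch below is an in-memory stand-in, not the Azure Blob lease API and not our implementation; the Lease class, should_lead, and the 15-second duration are all hypothetical. It only illustrates the semantics: a holder stays leader by renewing before expiry, and nobody leads once the holder stops renewing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """In-memory stand-in for a lease on a shared blob (hypothetical)."""
    duration_s: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, replica: str, now: float) -> bool:
        # The lease is free if nobody holds it or the holder let it expire.
        if self.holder is None or now >= self.expires_at:
            self.holder = replica
            self.expires_at = now + self.duration_s
            return True
        return False

    def renew(self, replica: str, now: float) -> bool:
        # Only the current holder can renew, and only before expiry.
        if self.holder == replica and now < self.expires_at:
            self.expires_at = now + self.duration_s
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        # An expired lease means there is no leader at all.
        return self.holder if now < self.expires_at else None


def should_lead(lease: Lease, replica: str, now: float) -> bool:
    """Called on every tick by each live replica: renew if we hold it, else try to take it."""
    return lease.renew(replica, now) or lease.try_acquire(replica, now)
```

With a 15-second lease, replica "a" acquiring at t=0 and renewing at t=10 is leader until t=25. If "a" is then scaled away, leader(30) returns None until some other live replica calls should_lead. That gap is exactly what a runtime that removes quiet instances keeps creating.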

The runtime would have taken away three things we use

Migrating isn't free even where it works; you inherit the managed runtime's sandbox, and that sandbox blocks three things we depend on:

  • Twingate. We reach private resources over a Twingate connector. The managed container sandbox won't run it, so we'd be re-inventing private network access we already have.
  • Our build path. Parts of our pipeline lean on Docker-in-Docker and az acr build against a private registry, both of which the sandbox disallows. That's a build system to rebuild, not just a deploy target to change.
  • Cross-cloud control. We run an AWS standby that validates tokens the Azure primary issued, via federated JWKS, with Cloudflare as the failover arbiter. That needs networking and placement control a managed runtime deliberately hides.

Each of these is a workaround we'd have to write just to get back to where we already stand. The migration's true cost isn't the cutover; it's re-earning every capability the sandbox takes away.

Cost-saving has a tail, and we paid it

Staying isn't free of consequences either, and it's only honest to say so. Deallocating nodes at night gave us a failure we'd never have seen on an always-on cluster: a first-image-pull token race on cold nodes (tracked as AKS issue 4052) that occasionally stalled a pod coming up after the cluster powered back on. We fixed it with a longer rollout timeout and, in prod, by not deallocating the way dev does.
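To show why a longer rollout timeout absorbs this kind of stall, here is a minimal polling sketch with an injected clock so it can be run without waiting. The function names and every number in it are illustrative assumptions. It is not how kubectl or any deployment tool implements its timeout.

```python
from typing import Callable


def wait_for_rollout(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float,
    now: Callable[[], float],
    sleep: Callable[[float], None],
) -> bool:
    """Poll `is_ready` until it returns True or `timeout_s` has elapsed."""
    deadline = now() + timeout_s
    while True:
        if is_ready():
            return True
        if now() >= deadline:
            return False
        sleep(poll_interval_s)


def simulate(ready_after_s: float, timeout_s: float, poll_interval_s: float = 5.0) -> bool:
    """Run wait_for_rollout on a fake clock; the pod becomes ready at `ready_after_s`."""
    clock = {"t": 0.0}

    def now() -> float:
        return clock["t"]

    def sleep(seconds: float) -> None:
        clock["t"] += seconds  # advance fake time instead of blocking

    return wait_for_rollout(
        lambda: clock["t"] >= ready_after_s, timeout_s, poll_interval_s, now, sleep
    )
```

With a pod that needs a hypothetical 200 seconds to come up on a cold node, simulate(200, timeout_s=120) reports a failed rollout and simulate(200, timeout_s=600) reports success. The pod's behavior is identical in both runs; only the patience of the caller changes.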

That's the real shape of the trade. Every path has a tail of failure modes; the question is whether you can see and fix them. On AKS the cold-node race was inspectable and fixable. The classes of weird you inherit from a managed runtime's internals are the kind where you file a support ticket and wait.

The rule we took away

Two things, reusable beyond us. Don't migrate to escape a cost you can kill where you stand; cost the in-place fix first, because a scheduler and a cheaper SKU beat a re-platform most of the time. And match the runtime to the workload's shape: serverless rewards stateless and bursty and punishes stateful and always-on, and an auth core that holds your signing keys and your cluster's leadership is firmly the second kind.

We're not against serverless. Plenty of our stateless surface would be fine on it. But the part that holds your keys and decides who runs the destructive jobs is going to keep running on boring, always-on, inspectable infrastructure. Sometimes the idle-looking cluster is the cheap option, once you count everything you'd have to rebuild to leave it.

The always-on, inspectable infrastructure under your logins is a feature, not an oversight. It is exactly what you hand off when you run auth on Authagonal instead of operating the backplane yourself.