Your libc is your performance profile
When you ship a Docker image with nginx and two dozen custom modules, where does the performance come from?
We assumed the answer would involve our code — the 26 native Zig modules, the event loops, the shared-memory counters. We were wrong. The data told a different story: one where the modules are invisible, and every meaningful performance difference between our two images traces back to the C standard library, the OpenSSL build, and the memory allocator.
This post is about what we found, why it surprised us, and what it means if you’re choosing a base image for your own nginx deployment.
Two images, same nginx
We ship two Docker images. Both run the same stock nginx binary with the same 24 nginz modules loaded as .so files. The only difference is the base:
darkanchor/nginx:1.30.1
- Debian trixie-slim
- glibc 2.40
- Debian OpenSSL with AVX2
- ptmalloc2 allocator
- 164 MB compressed

darkanchor/nginx:1.30.1-alpine
- Alpine 3.23
- musl 1.2.5
- Alpine OpenSSL (no AVX2)
- musl malloc (eager unmapping)
- 26 MB compressed
164 megabytes versus 26: a 6× size difference, and an easy enough win that you might reach for Alpine by default and never look back. But we needed to know: does that smaller image come with a performance cost? And if so, how much of it is our fault?
Two workloads, one conclusion
We ran two benchmarks, chosen to stress fundamentally different parts of the system.
The CPU-bound path: JWT verification. Pure computation. A tight loop of HMAC-SHA256 or RSA-2048 signature verification, with minimal I/O. If there’s a CPU overhead buried in the base image, this workload will surface it at full volume.
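To make the HMAC-SHA256 half of that workload concrete, here is a minimal Python sketch of HS256 signing and verification using only the standard library. It is an illustration of the technique, not the code in our modules (which are written in Zig and call OpenSSL); the function names `sign_hs256` and `verify_hs256` are made up for this example, and claim validation such as `exp` is deliberately left out.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    # JWTs use base64url with the trailing "=" padding stripped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    # Restore the padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 token (hypothetical helper for this sketch)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(sig)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Check only the signature; expiry and other claims are out of scope."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(sig_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    except ValueError:
        # Wrong segment count, bad base64, bad JSON, or non-ASCII input.
        return False
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return False
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, signature)
```

Almost all of the work sits in the digest call, which is why this path is sensitive to how the crypto library underneath was built.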
The I/O-bound path: dynamic upstream management. Six modules working together — cookie parsing, shared-memory upstream lookups, health check state reads, cache-tag recording, and a proxy round-trip to a loopback backend. The CPU sits mostly idle while the system waits for the upstream. If CPU overhead exists but gets masked by concurrency, this workload will show it.
Both workloads pointed to the same place.
The CPU-bound finding: RSA-2048 at c=1
RS256 JWT verification is the most computationally expensive path we have: RSA-2048 modular exponentiation via OpenSSL. We measured its single-connection throughput, because at one concurrent connection there’s nowhere to hide.
A 17% gap, with identical nginx binaries, identical module code, and identical configuration. The entire delta is RSA-2048 Montgomery multiplication: Debian’s OpenSSL enables AVX2 SIMD bignum arithmetic, while Alpine’s OpenSSL build uses a scalar implementation. Our modules run the exact same instructions on both images. OpenSSL does something completely different.
We also ran a native baseline: nginx compiled directly on the host with ReleaseSmall Zig, no Docker at all. Trixie matched native within about 0.5% (1,945 vs 1,955 RPS). The container boundary costs nothing at the RPS level. The gap comes from what the base image ships, not from the container.
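Before attributing a gap to AVX2 code paths, it is worth confirming that the benchmark host exposes AVX2 at all. The sketch below is one way to check on x86 Linux by reading the `flags` line of `/proc/cpuinfo`. The helper names are hypothetical, and this only tells you what the CPU offers, not which code paths a given OpenSSL build was compiled with.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from /proc/cpuinfo-style text (first CPU only)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    # No "flags" line, e.g. on non-x86 hosts.
    return set()


def host_has_avx2(path: str = "/proc/cpuinfo") -> bool:
    """True if the host advertises AVX2; False if it does not or the file is unreadable."""
    try:
        with open(path, encoding="utf-8") as fh:
            return "avx2" in cpu_flags(fh.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("AVX2 available:", host_has_avx2())
```

If this prints `False`, both images fall back to non-AVX2 arithmetic and a gap of this kind would need a different explanation.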
Concurrency closes the gap
Single-connection benchmarks expose overhead. Production doesn’t run at c=1. Here’s the full JWT picture at c=8, the practical operating point:
| scenario | trixie | alpine | Δ |
|---|---|---|---|
| valid-hs256 | 13,646 | 12,969 | trixie +5% |
| valid-rs256 | 5,725 | 5,284 | trixie +8% |
| reject-wrong-secret | 13,831 | 15,964 | alpine +15% |
At 8 concurrent connections, Docker overhead disappears entirely — both images match or exceed the native baseline. The RSA gap narrows from 17% to 8%. And on the reject path, alpine actually pulls ahead. Concurrency reshuffles the rankings. The libc still matters, but less than you’d think from a c=1 microbenchmark.
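For readers who want to check the Δ column, the short Python sketch below recomputes it from the RPS figures in the table. It assumes the column reports the faster image’s lead relative to the slower one, rounded to a whole percent; the names `lead` and `delta_column` are ours for this example.

```python
# Requests per second at c=8, copied from the table: (trixie, alpine).
RESULTS = {
    "valid-hs256": (13646, 12969),
    "valid-rs256": (5725, 5284),
    "reject-wrong-secret": (13831, 15964),
}


def lead(trixie: float, alpine: float) -> tuple[str, float]:
    """Return the faster image and its percentage lead over the slower one."""
    if trixie >= alpine:
        return "trixie", (trixie / alpine - 1) * 100
    return "alpine", (alpine / trixie - 1) * 100


def delta_column(results: dict) -> dict:
    """Format each scenario's lead the way the table's last column does."""
    return {
        name: "{} +{:.0f}%".format(*lead(trixie, alpine))
        for name, (trixie, alpine) in results.items()
    }


if __name__ == "__main__":
    for scenario, delta in delta_column(RESULTS).items():
        print(f"{scenario:22s} {delta}")
```

Dividing by the slower image’s figure is a choice; dividing by the faster one gives slightly smaller percentages, so it helps to state the convention next to any published delta.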
The I/O-bound finding: proxy workloads at c=8
The dynamic-upstreams benchmark exercises six modules: dynamic-upstreams, healthcheck, upstream-balancer, cache-tags, cache-purge, and worker-events. Every request makes a loopback proxy round-trip to a Bun backend. The CPU sits at 26–32% utilisation — the bottleneck is I/O, not computation.
At c=1 the gap was dramatic — trixie 1,462 vs alpine 1,719, a 39% spread driven by Docker dispatch overhead compounded by the proxy round-trip. At c=8, concurrent connections overlap the I/O, and the gap collapses to ±10%. The CPU overhead from musl’s scalar string functions — 41% more instructions per request — is still there, but it’s buried under the proxy latency. The system waits for the upstream, not the CPU.
This is the practical operating point. Nobody runs production at c=1.
The integrity signal
One number in our data didn’t look right. The valid-claims JWT scenario showed trixie 58% faster than alpine at c=1 — too large to accept uncritically for a pure allocator difference. We flagged it immediately.
We published the correction. If you’re going to make claims about performance, you have to trust the numbers enough to challenge the ones that don’t add up.
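A check like this can be automated. The sketch below is a hypothetical guard, not our actual tooling: it compares the medians of repeated runs and flags any gap larger than a plausibility threshold you choose up front. The function names and threshold are illustrative assumptions.

```python
from statistics import median


def relative_gap(a_runs: list[float], b_runs: list[float]) -> float:
    """Percent by which the median of a_runs exceeds the median of b_runs."""
    return (median(a_runs) / median(b_runs) - 1) * 100


def needs_rerun(a_runs: list[float], b_runs: list[float], max_plausible_pct: float) -> bool:
    """Flag a scenario whose gap is larger than the mechanism could plausibly explain."""
    return abs(relative_gap(a_runs, b_runs)) > max_plausible_pct


if __name__ == "__main__":
    # Made-up RPS samples for illustration only; not measurements from this post.
    image_a = [1580, 1600, 1590]
    image_b = [1000, 1010, 990]
    print(f"gap: {relative_gap(image_a, image_b):.0f}%")
    print("rerun:", needs_rerun(image_a, image_b, max_plausible_pct=25))
```

Using the median rather than the mean keeps a single noisy run from triggering or hiding a flag.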
What this means
The deployment decision is simpler than the data suggests.
We set out to measure whether our modules slow down nginx. The answer, across two workloads and six scenario types, is that they don’t. The performance profile of an nginz deployment is the performance profile of its libc. The rest is concurrency, I/O, and the network.
That’s a good answer. It means we built clean modules. It means the foundation is solid. And it means we can spend our energy on what comes next — the AI gateway modules that will actually need every cycle we can give them.