Your libc is your performance profile

When you ship a Docker image with nginx and two dozen custom modules, where does the performance come from?

We assumed the answer would involve our code — the 26 native Zig modules, the event loops, the shared-memory counters. We were wrong. The data told a different story: one where the modules are invisible, and every meaningful performance difference between our two images traces back to the C standard library, the OpenSSL build, and the memory allocator.

This post is about what we found, why it surprised us, and what it means if you’re choosing a base image for your own nginx deployment.

Two images, same nginx

We ship two Docker images. Both run the same stock nginx binary with the same 24 nginz modules loaded as .so files. The only difference is the base:

darkanchor/nginx:1.30.1
Debian trixie-slim
glibc 2.40
Debian OpenSSL with AVX2
ptmalloc2 allocator
164 MB compressed


darkanchor/nginx:1.30.1-alpine
Alpine 3.23
musl 1.2.5
Alpine OpenSSL (no AVX2)
musl malloc (eager unmapping)
26 MB compressed

164 MB versus 26 MB: a roughly 6× size difference, large enough that you might reach for Alpine by default and never look back. But we needed to know: does that smaller image come with a performance cost? And if so, how much of it is our fault?

Two workloads, one conclusion

We ran two benchmarks, chosen to stress fundamentally different parts of the system.

The CPU-bound path: JWT verification. Pure computation. A tight loop of HMAC-SHA256 or RSA-2048 signature verification, with minimal I/O. If there’s a CPU overhead buried in the base image, this workload will surface it at full volume.
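To make the CPU-bound path concrete, here is a minimal sketch of HS256 verification using only the Python standard library. It is not the nginz module's implementation; the helper names (`sign_hs256`, `verify_hs256`) and the sample secret are hypothetical, and the sketch checks the signature only, not claims such as `exp`.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 token (illustrative helper, not a full JWT library)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    tag = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(tag)


def verify_hs256(token: str, secret: bytes) -> bool:
    """Return True only if the token is well-formed and its HMAC-SHA256 tag matches."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(sig_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("utf-8")
    except ValueError:
        # Wrong segment count, bad base64, or bad JSON all land here.
        return False
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return False
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(expected, signature)


token = sign_hs256({"sub": "alice"}, b"example-secret")
print(verify_hs256(token, b"example-secret"))  # accepted
print(verify_hs256(token, b"wrong-secret"))    # rejected
```

The accept and reject calls correspond loosely to the valid-hs256 and reject-wrong-secret scenarios: both do the same hashing work, and differ only in the outcome of the final comparison.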

The I/O-bound path: dynamic upstream management. Six modules working together — cookie parsing, shared-memory upstream lookups, health check state reads, cache-tag recording, and a proxy round-trip to a loopback backend. The CPU sits mostly idle while the system waits for the upstream. If CPU overhead exists but gets masked by concurrency, this workload will show it.

Both workloads pointed to the same place.

The CPU-bound finding: RSA-2048 at c=1

Single-connection throughput for RS256 JWT verification — the most computationally expensive path we have. RSA-2048 modular exponentiation via OpenSSL. At one concurrent connection, there’s nowhere to hide:

RS256 verification, c=1 (requests/sec)
Trixie (glibc): 1,945
Alpine (musl): 1,630 (−17%)

Same nginx. Same modules. Same config.


A 17% gap, with identical nginx binaries, identical module code, and identical configuration. The entire delta is RSA-2048 Montgomery multiplication: Debian’s OpenSSL enables AVX2 SIMD bignum arithmetic, while Alpine’s OpenSSL build uses a scalar implementation. (libcrypto belongs to OpenSSL; musl does not bundle it.) Our modules are running the exact same instructions on both images. OpenSSL is doing something completely different.

We also ran a native baseline: nginx compiled directly on the host with ReleaseSmall Zig, no Docker at all. Trixie matched native within 0.5% (1,945 vs 1,955 RPS). The container boundary costs nothing at the RPS level. The gap comes from the base image’s libraries, not from Docker.

Concurrency closes the gap

Single-connection benchmarks expose overhead. Production doesn’t run at c=1. Here’s the full JWT picture at c=8, the practical operating point:

scenario	trixie (RPS)	alpine (RPS)	Δ
valid-hs256	13,646	12,969	trixie +5%
valid-rs256	5,725	5,284	trixie +8%
reject-wrong-secret	13,831	15,964	alpine +15%

At 8 concurrent connections, Docker overhead disappears entirely — both images match or exceed the native baseline. The RSA gap narrows from 17% to 8%. And on the reject path, alpine actually pulls ahead. Concurrency reshuffles the rankings. The libc still matters, but less than you’d think from a c=1 microbenchmark.
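The Δ column can be reproduced with a few lines of arithmetic. The sketch below assumes Δ is the faster image's lead over the slower one, which is not stated in the post but matches all three rows; `relative_lead` and `summarize` are names made up for this example.

```python
def relative_lead(rps_a: float, rps_b: float) -> float:
    """Percent by which the faster throughput leads the slower one."""
    fast, slow = max(rps_a, rps_b), min(rps_a, rps_b)
    return (fast / slow - 1.0) * 100.0


def summarize(rows):
    """Format each (scenario, trixie_rps, alpine_rps) row like the table's Δ column."""
    lines = []
    for scenario, trixie, alpine in rows:
        winner = "trixie" if trixie >= alpine else "alpine"
        lines.append(f"{scenario}: {winner} +{relative_lead(trixie, alpine):.0f}%")
    return lines


# Values copied from the c=8 table above.
C8_RESULTS = [
    ("valid-hs256", 13646, 12969),
    ("valid-rs256", 5725, 5284),
    ("reject-wrong-secret", 13831, 15964),
]

for line in summarize(C8_RESULTS):
    print(line)
```

Dividing by the slower figure is a choice; dividing by the faster one would give slightly smaller percentages, so it is worth stating the convention whenever numbers like these are quoted.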

The I/O-bound finding: proxy workloads at c=8

The dynamic-upstreams benchmark exercises six modules: dynamic-upstreams, healthcheck, upstream-balancer, cache-tags, cache-purge, and worker-events. Every request makes a loopback proxy round-trip to a Bun backend. The CPU sits at 26–32% utilisation — the bottleneck is I/O, not computation.

sticky-read, c=8 (RPS)
Trixie (glibc): 7,279
Alpine (musl): 6,375

Both within ±10% of native at c=8.

At c=1 the gap was dramatic — trixie 1,462 vs alpine 1,719, a 39% spread driven by Docker dispatch overhead compounded by the proxy round-trip. At c=8, concurrent connections overlap the I/O, and the gap collapses to ±10%. The CPU overhead from musl’s scalar string functions — 41% more instructions per request — is still there, but it’s buried under the proxy latency. The system waits for the upstream, not the CPU.
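A rough, Amdahl-style estimate shows why extra CPU work matters less on an I/O-bound path. This is a back-of-the-envelope model, not a measurement: it assumes that 41% more instructions means 41% more CPU time, that the 26–32% CPU utilisation figure approximates the CPU share of each request, and that the I/O wait is unchanged. The function name `throughput_ratio` is hypothetical.

```python
def throughput_ratio(cpu_share: float, cpu_inflation: float) -> float:
    """Estimated throughput relative to baseline when only the CPU part slows down.

    cpu_share:     fraction of per-request time spent on the CPU at baseline (0..1)
    cpu_inflation: extra CPU cost as a fraction, e.g. 0.41 for 41% more
    """
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be between 0 and 1")
    # The I/O part is untouched; only the CPU part is scaled up.
    new_time = (1.0 - cpu_share) + cpu_share * (1.0 + cpu_inflation)
    return 1.0 / new_time


for share in (0.26, 0.32, 1.0):
    loss = (1.0 - throughput_ratio(share, 0.41)) * 100.0
    print(f"CPU share {share:.0%}: about {loss:.1f}% lower throughput")
```

Under these assumptions the same 41% CPU penalty costs roughly 10 to 12 percent of throughput when the CPU is only a quarter to a third of the request, versus about 29 percent on a fully CPU-bound path.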

This is the practical operating point. Nobody runs production at c=1.

The integrity signal

One number in our data didn’t look right. The valid-claims JWT scenario showed trixie 58% faster than alpine at c=1 — too large to accept uncritically for a pure allocator difference. We flagged it immediately.

1,000 samples (baseline): +58% → 5,000 samples (confirmed): +27.7%

The original run was warmup-inflated: glibc's ptmalloc2 pre-sizes arena bins aggressively on first allocation, giving trixie an artificial head start. At 5,000 samples both allocators reach steady state. The gap is real and structural (musl's allocator has higher per-call overhead for cJSON pool operations), but it is about half the original estimate.
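The effect of a warmup burst on a short run can be illustrated with synthetic data. The numbers below are invented for the sketch and are not our measurements; `percent_lead` and its `skip` parameter are hypothetical names. The point is only that a fixed-size head start weighs heavily in 1,000 samples and much less in 5,000, and that discarding the warmup samples removes it altogether.

```python
from statistics import mean


def percent_lead(a_samples, b_samples, skip=0):
    """Percent by which mean(a) leads mean(b), ignoring the first `skip` samples of each."""
    a, b = a_samples[skip:], b_samples[skip:]
    if not a or not b:
        raise ValueError("no samples left after skipping warmup")
    return (mean(a) / mean(b) - 1.0) * 100.0


def synthetic_run(n, warmup, warm_value, steady_value):
    """n per-sample throughput readings: an inflated warmup phase, then steady state."""
    return [warm_value] * min(warmup, n) + [steady_value] * max(n - warmup, 0)


# Synthetic: image A gets a 200-sample head start, image B is flat from the first sample.
for n in (1000, 5000):
    a = synthetic_run(n, warmup=200, warm_value=3000, steady_value=1300)
    b = synthetic_run(n, warmup=0, warm_value=1000, steady_value=1000)
    print(n, round(percent_lead(a, b), 1), round(percent_lead(a, b, skip=200), 1))
```

With this synthetic input the apparent lead shrinks as the run gets longer, while the warmup-trimmed lead stays the same at both sample counts, which is the signature of a warmup artefact rather than a structural difference.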

We published the correction. If you’re going to make claims about performance, you have to trust the numbers enough to challenge the ones that don’t add up.

What this means

The deployment decision is simpler than the data suggests:

🔐 If RSA crypto throughput matters, pick Debian. The AVX2 gap is real, structural, and not a compile flag you can toggle on the Alpine image.

📦 If image size or CVE surface matters more, pick Alpine. The 6× size advantage (26 MB vs 164 MB) is real, and the CPU overhead disappears under concurrency for proxy workloads.

⚡ The modules don't care either way. Every percentage point of difference traces to the base image, not to our code.

We set out to measure whether our modules slow down nginx. The answer, across two workloads and six scenario types, is that they don’t. The performance profile of an nginz deployment is the performance profile of its libc. The rest is concurrency, I/O, and the network.

That’s a good answer. It means we built clean modules. It means the foundation is solid. And it means we can spend our energy on what comes next — the AI gateway modules that will actually need every cycle we can give them.

Coming post: we attach perf stat to the running nginx workers — and discover that our single-worker benchmark was hiding a 55% cache-miss penalty. Two rounds of profiling, a cache-line audit, and a shared-memory restructuring later, the tax is gone.