Stress Test Report

Gemma 4 26B MoE IT · CUDA 13.0

Concurrency sweep from 32 to 64 in steps of 4 on an FP8 MoE node, with prefix caching and CPU KV offloading.

Gemma 4 26B A4B MoE IT · FP8 · 2026-06-11 · CUDA 13.0

Concurrency performance matrix

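All times are in milliseconds; throughput (TPS) is in tokens per second per request. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.

Concurrency	Success	TTFT avg	TTFT P50	TTFT P90	TTFT P99	TPS	ITL avg	ITL P95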
32	100%	191.2	192.3	194.4	197.3	116.7	8.4	10.8
36	100%	133.7	137.1	138.5	138.9	97.5	10.1	13.0
40	100%	120.1	121.3	122.7	123.4	115.4	8.5	11.2
44	100%	152.1	154.9	157.1	158.3	95.2	10.3	13.1
48	100%	120.1	121.9	123.9	124.5	108.2	9.0	11.5
52	100%	240.6	234.5	258.7	261.0	88.3	11.1	13.8
56	100%	139.0	139.3	140.9	142.5	106.6	9.2	12.3
60	100%	145.9	146.9	150.3	152.4	94.6	10.3	17.0
64	100%	158.6	157.9	163.4	166.3	104.8	9.3	11.8
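
As a rough consistency check on the matrix, the reported per-request TPS can be compared with the rate implied by the average inter-token latency, 1000 / ITL avg. The sketch below assumes TPS measures per-request decode throughput; the report does not state how TPS is computed, so treat this as an illustration rather than the tool's actual formula. The `ROWS` values are copied from the table, and the helper names are hypothetical.

```python
# (concurrency, reported per-request TPS, average ITL in ms), copied from the matrix.
ROWS = [
    (32, 116.7, 8.4),
    (36, 97.5, 10.1),
    (40, 115.4, 8.5),
    (44, 95.2, 10.3),
    (48, 108.2, 9.0),
    (52, 88.3, 11.1),
    (56, 106.6, 9.2),
    (60, 94.6, 10.3),
    (64, 104.8, 9.3),
]


def implied_tps(itl_avg_ms: float) -> float:
    """Tokens per second implied by an average gap of itl_avg_ms between tokens."""
    return 1000.0 / itl_avg_ms


def relative_gap(reported_tps: float, itl_avg_ms: float) -> float:
    """Fraction by which the ITL-implied rate exceeds the reported TPS."""
    implied = implied_tps(itl_avg_ms)
    return (implied - reported_tps) / implied


if __name__ == "__main__":
    for concurrency, tps, itl in ROWS:
        print(
            f"{concurrency:>3}  reported={tps:6.1f}  "
            f"implied={implied_tps(itl):6.1f}  gap={relative_gap(tps, itl):.1%}"
        )
```

On these rows the reported TPS sits within 3% of the ITL-implied rate at every concurrency level. Part of that gap may simply come from ITL being rounded to 0.1 ms.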

Key takeaways

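Recommended max concurrency 64 · Across the sweep, P99 TTFT stayed below 265 ms (peak 261.0 ms at concurrency 52) and P95 ITL stayed below 18 ms (peak 17.0 ms at concurrency 60), with per-request throughput averaging ~105 tokens/s (104.8 at concurrency 64).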

Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
Stable token generation. Across all concurrency levels, average inter-token latency stayed between 8.4 and 11.1 ms and per-request throughput between 88.3 and 116.7 tokens/s.
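
The headline figures in these takeaways can be recomputed from the matrix. The following is a minimal sketch, not the report's own tooling: the `Row` type and `summarize` function are hypothetical names, and the values are copied from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    concurrency: int
    ttft_p99_ms: float
    tps: float
    itl_avg_ms: float
    itl_p95_ms: float


# Values copied from the concurrency performance matrix.
ROWS = [
    Row(32, 197.3, 116.7, 8.4, 10.8),
    Row(36, 138.9, 97.5, 10.1, 13.0),
    Row(40, 123.4, 115.4, 8.5, 11.2),
    Row(44, 158.3, 95.2, 10.3, 13.1),
    Row(48, 124.5, 108.2, 9.0, 11.5),
    Row(52, 261.0, 88.3, 11.1, 13.8),
    Row(56, 142.5, 106.6, 9.2, 12.3),
    Row(60, 152.4, 94.6, 10.3, 17.0),
    Row(64, 166.3, 104.8, 9.3, 11.8),
]


def summarize(rows: list[Row]) -> dict[str, tuple]:
    """Return worst-case tail latencies and the min/max bands for TPS and ITL."""
    worst_ttft = max(rows, key=lambda r: r.ttft_p99_ms)
    worst_itl = max(rows, key=lambda r: r.itl_p95_ms)
    return {
        # (concurrency at which the peak occurred, peak value in ms)
        "worst_p99_ttft": (worst_ttft.concurrency, worst_ttft.ttft_p99_ms),
        "worst_p95_itl": (worst_itl.concurrency, worst_itl.itl_p95_ms),
        # (minimum, maximum) across the sweep
        "tps_range": (min(r.tps for r in rows), max(r.tps for r in rows)),
        "itl_avg_range": (
            min(r.itl_avg_ms for r in rows),
            max(r.itl_avg_ms for r in rows),
        ),
    }


if __name__ == "__main__":
    for name, value in summarize(ROWS).items():
        print(f"{name}: {value}")
```

For this table, the P99 TTFT peak falls at concurrency 52 (261.0 ms) and the P95 ITL peak at concurrency 60 (17.0 ms), so neither worst case occurs at the highest concurrency tested.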

Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 13.0 · vLLM V1 (nightly). Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.