Stress Test Report
Gemma 4 26B MoE IT · CUDA 13.0
Concurrency sweep from 32 to 64 in steps of 4 on an FP8 MoE node, with prefix caching and CPU KV offloading.
Gemma 4 26B A4B MoE IT · FP8 · 2026-06-11 · CUDA 13.0
Concurrency performance matrix
All times are in milliseconds; throughput (TPS) is in tokens per second per request. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.
| Concurrency | Success | TTFT avg | TTFT P50 | TTFT P90 | TTFT P99 | TPS (per request) | ITL avg | ITL P95 |
|---|---|---|---|---|---|---|---|---|
| 32 | 100% | 191.2 | 192.3 | 194.4 | 197.3 | 116.7 | 8.4 | 10.8 |
| 36 | 100% | 133.7 | 137.1 | 138.5 | 138.9 | 97.5 | 10.1 | 13.0 |
| 40 | 100% | 120.1 | 121.3 | 122.7 | 123.4 | 115.4 | 8.5 | 11.2 |
| 44 | 100% | 152.1 | 154.9 | 157.1 | 158.3 | 95.2 | 10.3 | 13.1 |
| 48 | 100% | 120.1 | 121.9 | 123.9 | 124.5 | 108.2 | 9.0 | 11.5 |
| 52 | 100% | 240.6 | 234.5 | 258.7 | 261.0 | 88.3 | 11.1 | 13.8 |
| 56 | 100% | 139.0 | 139.3 | 140.9 | 142.5 | 106.6 | 9.2 | 12.3 |
| 60 | 100% | 145.9 | 146.9 | 150.3 | 152.4 | 94.6 | 10.3 | 17.0 |
| 64 | 100% | 158.6 | 157.9 | 163.4 | 166.3 | 104.8 | 9.3 | 11.8 |
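The report does not say how the columns were computed. As a rough sketch only, one row of the matrix could be derived from per-request traces along the following lines. The `RequestRecord` type, its field names, the nearest-rank percentile, and the choice to measure per-request throughput from send time to last token are all assumptions made for illustration, not details taken from the benchmark harness.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """Hypothetical per-request trace; field names are illustrative."""
    ok: bool                  # request completed without error
    sent_at: float            # send time, in seconds
    token_times: list[float]  # arrival time of each output token, in seconds


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate one concurrency step into the columns of the matrix."""
    done = [r for r in records if r.ok and r.token_times]

    # Time to first token, in ms.
    ttft = [(r.token_times[0] - r.sent_at) * 1000 for r in done]

    # Inter-token latency: gaps between consecutive tokens, in ms.
    itl = [(b - a) * 1000
           for r in done
           for a, b in zip(r.token_times, r.token_times[1:])]

    # Assumed definition: output tokens divided by end-to-end request time.
    tps = [len(r.token_times) / (r.token_times[-1] - r.sent_at) for r in done]

    if not ttft or not itl:
        raise ValueError("need at least one successful request with 2+ tokens")

    return {
        "success_pct": 100 * sum(r.ok for r in records) / len(records),
        "ttft_avg": mean(ttft),
        "ttft_p50": percentile(ttft, 50),
        "ttft_p90": percentile(ttft, 90),
        "ttft_p99": percentile(ttft, 99),
        "tps": mean(tps),
        "itl_avg": mean(itl),
        "itl_p95": percentile(itl, 95),
    }
```

Under this assumed definition, per-request throughput comes out slightly below 1000 / (ITL avg), because the TTFT is included in the denominator. The table shows the same pattern, for example 116.7 tokens/s against 1000 / 8.4 ≈ 119 at concurrency 32.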
Key takeaways
Recommended max concurrency 64 · Across the whole sweep, P99 TTFT stayed under ~265 ms and P95 ITL under 18 ms; per-request throughput averaged ~105 tokens/s at concurrency 64.
- Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
- Stable token generation. Average inter-token latency stayed between 8.4 and 11.1 ms and per-request throughput between 88.3 and 116.7 tokens/s as concurrency doubled from 32 to 64.
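The figures quoted in the takeaways can be checked directly against the performance matrix. The sketch below copies the table rows verbatim and recomputes the bounds. The tuple layout and the `check_takeaways` helper are illustrative names, not part of any reporting tool.

```python
# Rows copied from the concurrency performance matrix:
# (concurrency, success %, TTFT avg, TTFT P50, TTFT P90, TTFT P99,
#  per-request TPS, ITL avg, ITL P95)
ROWS = [
    (32, 100, 191.2, 192.3, 194.4, 197.3, 116.7, 8.4, 10.8),
    (36, 100, 133.7, 137.1, 138.5, 138.9, 97.5, 10.1, 13.0),
    (40, 100, 120.1, 121.3, 122.7, 123.4, 115.4, 8.5, 11.2),
    (44, 100, 152.1, 154.9, 157.1, 158.3, 95.2, 10.3, 13.1),
    (48, 100, 120.1, 121.9, 123.9, 124.5, 108.2, 9.0, 11.5),
    (52, 100, 240.6, 234.5, 258.7, 261.0, 88.3, 11.1, 13.8),
    (56, 100, 139.0, 139.3, 140.9, 142.5, 106.6, 9.2, 12.3),
    (60, 100, 145.9, 146.9, 150.3, 152.4, 94.6, 10.3, 17.0),
    (64, 100, 158.6, 157.9, 163.4, 166.3, 104.8, 9.3, 11.8),
]


def check_takeaways(rows):
    """Recompute the headline numbers from the table rows."""
    top = max(rows, key=lambda r: r[0])  # row with the highest concurrency
    return {
        "all_succeeded": all(r[1] == 100 for r in rows),
        "max_p99_ttft_ms": max(r[5] for r in rows),
        "max_p95_itl_ms": max(r[8] for r in rows),
        "itl_avg_range_ms": (min(r[7] for r in rows), max(r[7] for r in rows)),
        "tps_range": (min(r[6] for r in rows), max(r[6] for r in rows)),
        "tps_at_max_concurrency": top[6],
    }


summary = check_takeaways(ROWS)
print(summary)
```

The worst P99 TTFT (261.0 ms) and the worst P95 ITL (17.0 ms) fall at concurrency 52 and 60 respectively, not at 64, so the quoted bounds describe the sweep as a whole.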