Stress Test Report

Gemma 4 26B MoE IT · CUDA 13.0

FP8 MoE node sweep from 32 to 64 concurrent requests in steps of 4, with prefix caching and CPU KV offloading enabled.

Gemma 4 26B A4B MoE IT · FP8 · 2026-06-11 · CUDA 13.0

Concurrency performance matrix

All times are in milliseconds; throughput (TPS) is in tokens per second per request. TTFT is time to first token and ITL is inter-token latency. Success is the percentage of requests that completed without error.

Concurrency  Success  TTFT avg  TTFT P50  TTFT P90  TTFT P99  TPS    ITL avg  ITL P95
32           100%     191.2     192.3     194.4     197.3     116.7  8.4      10.8
36           100%     133.7     137.1     138.5     138.9     97.5   10.1     13.0
40           100%     120.1     121.3     122.7     123.4     115.4  8.5      11.2
44           100%     152.1     154.9     157.1     158.3     95.2   10.3     13.1
48           100%     120.1     121.9     123.9     124.5     108.2  9.0      11.5
52           100%     240.6     234.5     258.7     261.0     88.3   11.1     13.8
56           100%     139.0     139.3     140.9     142.5     106.6  9.2      12.3
60           100%     145.9     146.9     150.3     152.4     94.6   10.3     17.0
64           100%     158.6     157.9     163.4     166.3     104.8  9.3      11.8
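The report does not say how the columns were measured. A common approach derives them from per-token arrival timestamps, and the sketch below illustrates that under stated assumptions: `request_metrics` and `percentile` are hypothetical helper names, the percentile uses the nearest-rank definition, and TPS is taken as decode tokens per second for a single request. The actual benchmark tooling may define these differently.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; one of several common definitions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def request_metrics(sent_at_ms, token_times_ms):
    """Derive TTFT, ITL samples, and per-request TPS from token arrival times (ms)."""
    if len(token_times_ms) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = token_times_ms[0] - sent_at_ms
    # ITL: gaps between consecutive tokens.
    itl_ms = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    # TPS (assumed definition): tokens generated after the first, per second of decode time.
    tps = 1000 * len(itl_ms) / (token_times_ms[-1] - token_times_ms[0])
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "tps": tps}


# Synthetic request: sent at t=0, first token at 150 ms, then one token every 10 ms.
m = request_metrics(0.0, [150.0, 160.0, 170.0, 180.0, 190.0])
itl_avg = mean(m["itl_ms"])
itl_p95 = percentile(m["itl_ms"], 95)
```

Under this definition TPS is close to 1000 divided by the average ITL, which matches the pattern in the matrix (for example, an ITL avg of 8.4 ms alongside a TPS of 116.7).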

Key takeaways

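Recommended max concurrency 64 · Across the sweep, P99 TTFT stayed under ~265 ms (peak 261.0 ms at concurrency 52) and P95 ITL under 18 ms (peak 17.0 ms at concurrency 60); per-request throughput at concurrency 64 averaged ~105 tokens/s.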

  • Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
  • Stable token generation. Across the sweep, average inter-token latency stayed between 8.4 and 11.1 ms and per-request throughput between 88.3 and 116.7 tokens/s, with no steady decline as concurrency scaled.
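The headline bounds can be checked directly against the matrix. The short sketch below copies the relevant columns from the table and recomputes the peaks and ranges; the variable names are illustrative only.

```python
# Columns copied from the concurrency matrix:
# (concurrency, TTFT P99 ms, TPS tokens/s, ITL avg ms, ITL P95 ms)
ROWS = [
    (32, 197.3, 116.7, 8.4, 10.8),
    (36, 138.9, 97.5, 10.1, 13.0),
    (40, 123.4, 115.4, 8.5, 11.2),
    (44, 158.3, 95.2, 10.3, 13.1),
    (48, 124.5, 108.2, 9.0, 11.5),
    (52, 261.0, 88.3, 11.1, 13.8),
    (56, 142.5, 106.6, 9.2, 12.3),
    (60, 152.4, 94.6, 10.3, 17.0),
    (64, 166.3, 104.8, 9.3, 11.8),
]

max_p99_ttft = max(r[1] for r in ROWS)            # worst-case P99 TTFT in the sweep
max_p95_itl = max(r[4] for r in ROWS)             # worst-case P95 ITL in the sweep
tps_values = [r[2] for r in ROWS]
mean_tps = sum(tps_values) / len(tps_values)      # mean per-request TPS over all levels
tps_range = (min(tps_values), max(tps_values))
itl_avg_range = (min(r[3] for r in ROWS), max(r[3] for r in ROWS))

# Consistency check: TPS should roughly equal 1000 / ITL avg for each row.
worst_gap = max(abs(1000 / r[3] - r[2]) / (1000 / r[3]) for r in ROWS)

print(f"max P99 TTFT: {max_p99_ttft} ms, max P95 ITL: {max_p95_itl} ms")
print(f"TPS range: {tps_range}, mean: {mean_tps:.1f}")
print(f"ITL avg range: {itl_avg_range} ms, worst TPS vs 1/ITL gap: {worst_gap:.1%}")
```

The peaks (261.0 ms and 17.0 ms) sit inside the stated ~265 ms and 18 ms bounds. The mean TPS over all nine levels is about 103 tokens/s, while the concurrency-64 row reports 104.8, so the ~105 tokens/s figure appears to describe the recommended level rather than the sweep-wide mean.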
Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 13.0 · vLLM V1 (nightly). Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.
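
For anyone re-running the sweep after a driver or engine change, the following is a minimal sketch of a concurrency sweep over the same levels (32 to 64 in steps of 4). It is not the harness used for this report: `fake_request` is a hypothetical stand-in that sleeps instead of calling an inference server, and `run_level` and `sweep` are illustrative names. A real run would replace `fake_request` with a streaming client call and record token timestamps.

```python
import asyncio


async def fake_request(delay_s: float = 0.001) -> bool:
    """Hypothetical stand-in for one inference request; always succeeds."""
    await asyncio.sleep(delay_s)
    return True


async def run_level(concurrency: int, total_requests: int) -> float:
    """Run total_requests with at most `concurrency` in flight; return success rate in %."""
    gate = asyncio.Semaphore(concurrency)

    async def one() -> bool:
        async with gate:  # caps the number of in-flight requests
            try:
                return await fake_request()
            except Exception:
                return False  # count any error as a failed request

    results = await asyncio.gather(*(one() for _ in range(total_requests)))
    return 100.0 * sum(results) / len(results)


async def sweep(levels, requests_per_level):
    """Run each concurrency level in turn and collect its success rate."""
    return {c: await run_level(c, requests_per_level) for c in levels}


LEVELS = list(range(32, 65, 4))  # 32, 36, ..., 64, matching the table rows
success_by_level = asyncio.run(sweep(LEVELS, requests_per_level=128))
```

Running the levels sequentially, rather than all at once, keeps each measurement isolated from the load of the others.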