Stress Test Report

Gemma 4 12B IT · CUDA 12.8 Baseline

Warm-cache concurrency sweep from 1 to 16 parallel requests.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8

Concurrency performance matrix

All times are in milliseconds: TTFT is time to first token and ITL is inter-token latency. Throughput (TPS) is in tokens per second, per request. Success rate is the percentage of requests that completed without error.

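| Concurrency | Success | TTFT avg | TTFT P50 | TTFT P90 | TTFT P99 | TPS | ITL avg | ITL P95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 100% | 63.8 | 61.6 | 68.5 | 74.2 | 38.6 | 25.4 | 26.5 |
| 2 | 100% | 63.4 | 63.1 | 65.2 | 65.3 | 38.7 | 25.3 | 26.3 |
| 4 | 100% | 70.6 | 66.8 | 79.7 | 91.1 | 38.3 | 25.6 | 26.6 |
| 8 | 100% | 99.2 | 102.5 | 103.4 | 103.5 | 38.3 | 25.6 | 26.5 |
| 12 | 100% | 92.0 | 82.1 | 106.8 | 107.3 | 38.2 | 25.7 | 26.5 |
| 13 | 100% | 101.3 | 110.6 | 111.4 | 111.8 | 38.2 | 25.6 | 26.6 |
| 14 | 100% | 115.2 | 117.0 | 117.8 | 118.1 | 37.9 | 25.9 | 26.7 |
| 15 | 100% | 126.6 | 128.1 | 129.2 | 129.6 | 37.3 | 26.3 | 27.6 |
| 16 | 100% | 132.1 | 133.6 | 134.6 | 135.5 | 37.5 | 26.1 | 27.2 |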
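The report does not say how the average and percentile columns were derived from raw request timings. As a minimal sketch, assuming nearest-rank percentiles over per-request TTFT samples, one step of the matrix could be summarized as follows. The helper names `percentile` and `summarize` and the sample values are hypothetical and are not taken from the test harness or from this report's data.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ttft_ms, errors=0):
    """Summarize one concurrency step.

    ttft_ms: TTFT in ms for each successful request.
    errors:  number of requests that failed.
    """
    total = len(ttft_ms) + errors
    return {
        "success_pct": 100.0 * len(ttft_ms) / total,
        "ttft_avg": mean(ttft_ms),
        "p50": percentile(ttft_ms, 50),
        "p90": percentile(ttft_ms, 90),
        "p99": percentile(ttft_ms, 99),
    }


# Hypothetical samples for illustration only (not data from this report).
sample = [61.0, 62.5, 60.2, 70.1, 64.3, 63.0, 61.8, 66.4, 62.2, 74.0]
stats = summarize(sample)
print(stats)
```

Other percentile definitions (for example linear interpolation) would give slightly different values on small samples, so the method should be stated alongside any published numbers.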

Key takeaways

Max recommended concurrency: 16 (the highest level tested) · P99 TTFT stays under 136 ms and per-request throughput holds at ~37.5 tokens/s with a 100% success rate.
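The report does not state the rule behind this recommendation. One plausible reading, sketched here under that assumption, is "the highest concurrency whose row meets a P99 TTFT budget, a throughput floor, and a success-rate floor". The rows are copied from the matrix above; the function name `max_concurrency` and the threshold values are hypothetical.

```python
# (concurrency, success %, P99 TTFT in ms, TPS), copied from the matrix.
ROWS = [
    (1, 100, 74.2, 38.6),
    (2, 100, 65.3, 38.7),
    (4, 100, 91.1, 38.3),
    (8, 100, 103.5, 38.3),
    (12, 100, 107.3, 38.2),
    (13, 100, 111.8, 38.2),
    (14, 100, 118.1, 37.9),
    (15, 100, 129.6, 37.3),
    (16, 100, 135.5, 37.5),
]


def max_concurrency(rows, p99_budget_ms, min_tps, min_success=100):
    """Return the highest concurrency whose row meets every
    threshold, or None if no row qualifies."""
    ok = [
        c
        for c, success, p99, tps in rows
        if success >= min_success and p99 <= p99_budget_ms and tps >= min_tps
    ]
    return max(ok) if ok else None


# Assumed thresholds: 136 ms P99 budget and a 37 tokens/s floor.
print(max_concurrency(ROWS, p99_budget_ms=136, min_tps=37.0))
# A tighter, equally hypothetical budget selects a lower level.
print(max_concurrency(ROWS, p99_budget_ms=110, min_tps=37.0))
```

Because the sweep stops at 16, a result of 16 only shows that the tested range stayed within the thresholds; it says nothing about higher concurrency.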

  • Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
  • Stable token generation. Average inter-token latency stayed between 25.3 and 26.3 ms and per-request throughput between 37.3 and 38.7 tokens/s across the whole sweep, even as concurrency scaled.
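As a rough consistency check on the stability claim, average ITL and per-request throughput should be close to reciprocals of each other. The sketch below uses the first and last rows of the matrix. It assumes TPS is reported per request, so the aggregate figure it prints is an estimate derived from that assumption, not a measured value. The function names are hypothetical.

```python
def decode_rate_from_itl(itl_avg_ms):
    """Tokens per second implied by an average inter-token latency in ms."""
    return 1000.0 / itl_avg_ms


def aggregate_tps(concurrency, per_request_tps):
    """Estimated combined throughput across all parallel requests,
    assuming the TPS column is a per-request figure."""
    return concurrency * per_request_tps


# (concurrency, reported TPS, ITL avg in ms), copied from the matrix.
ROWS = [(1, 38.6, 25.4), (16, 37.5, 26.1)]

for c, tps, itl in ROWS:
    implied = decode_rate_from_itl(itl)
    total = aggregate_tps(c, tps)
    print(f"c={c}: reported {tps} tok/s, ITL implies {implied:.1f} tok/s, "
          f"estimated aggregate {total:.0f} tok/s")
```

The reported TPS sits slightly below the ITL-implied rate in both rows, which would be consistent with TPS being measured over the full request including time to first token, though the report does not confirm this.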
Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.
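For readers who want to reproduce a sweep of this shape, the following is a minimal sketch of a warm-up followed by one burst of parallel requests per concurrency level. It is not the harness used for this report. `fake_request` is a hypothetical stand-in that only sleeps; a real run would replace it with a streaming call to the serving endpoint and record the time until the first token arrives.

```python
import asyncio
import time


async def fake_request(first_token_delay_s=0.001):
    """Hypothetical stand-in for one streaming request; returns TTFT in ms."""
    start = time.perf_counter()
    await asyncio.sleep(first_token_delay_s)  # pretend to wait for token 1
    return (time.perf_counter() - start) * 1000.0


async def run_step(concurrency, request_fn):
    """Launch `concurrency` requests at once.
    Returns (list of TTFTs in ms, number of failed requests)."""
    results = await asyncio.gather(
        *(request_fn() for _ in range(concurrency)), return_exceptions=True
    )
    ttfts = [r for r in results if not isinstance(r, BaseException)]
    return ttfts, len(results) - len(ttfts)


async def sweep(levels, request_fn, warmup=1):
    """Send warm-up requests, then run one step per concurrency level."""
    for _ in range(warmup):
        await request_fn()  # warm-up result is discarded
    report = {}
    for level in levels:
        ttfts, errors = await run_step(level, request_fn)
        report[level] = {"ok": len(ttfts), "errors": errors}
    return report


# Concurrency steps as listed in the matrix.
LEVELS = [1, 2, 4, 8, 12, 13, 14, 15, 16]
report = asyncio.run(sweep(LEVELS, fake_request))
print(report)
```

A single burst per level gives very few samples, so a real run would repeat each step enough times for the P99 column to be meaningful.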