Stress Test Report
Gemma 4 12B IT · CUDA 12.8 Baseline
Warm-cache concurrency sweep from 1 to 16 parallel requests.
Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8
Concurrency performance matrix
All times are in milliseconds and throughput is in tokens per second. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.
| Concurrency | Success | TTFT avg | TTFT P50 | TTFT P90 | TTFT P99 | TPS | ITL avg | ITL P95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 100% | 63.8 | 61.6 | 68.5 | 74.2 | 38.6 | 25.4 | 26.5 |
| 2 | 100% | 63.4 | 63.1 | 65.2 | 65.3 | 38.7 | 25.3 | 26.3 |
| 4 | 100% | 70.6 | 66.8 | 79.7 | 91.1 | 38.3 | 25.6 | 26.6 |
| 8 | 100% | 99.2 | 102.5 | 103.4 | 103.5 | 38.3 | 25.6 | 26.5 |
| 12 | 100% | 92.0 | 82.1 | 106.8 | 107.3 | 38.2 | 25.7 | 26.5 |
| 13 | 100% | 101.3 | 110.6 | 111.4 | 111.8 | 38.2 | 25.6 | 26.6 |
| 14 | 100% | 115.2 | 117.0 | 117.8 | 118.1 | 37.9 | 25.9 | 26.7 |
| 15 | 100% | 126.6 | 128.1 | 129.2 | 129.6 | 37.3 | 26.3 | 27.6 |
| 16 | 100% | 132.1 | 133.6 | 134.6 | 135.5 | 37.5 | 26.1 | 27.2 |
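The columns above can be derived from per-request timing samples. The following is a minimal sketch of one way to do that, not the harness that produced this report. The record schema (`ok`, `start_ms`, `token_ms`), the function names, and the nearest-rank percentile method are all assumptions made for illustration. TPS is left out because the report does not state how it was measured.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize one concurrency step.

    Each request is a dict with a hypothetical schema:
      ok        -- True if the request completed without error
      start_ms  -- time the request was sent, in milliseconds
      token_ms  -- arrival time of each generated token, in milliseconds

    Assumes at least one successful request with two or more tokens.
    """
    ok = [r for r in requests if r["ok"]]
    # TTFT: delay between sending the request and receiving the first token.
    ttft = [r["token_ms"][0] - r["start_ms"] for r in ok]
    # ITL: gap between consecutive tokens, pooled across successful requests.
    itl = [b - a for r in ok for a, b in zip(r["token_ms"], r["token_ms"][1:])]
    return {
        "success_pct": 100 * len(ok) / len(requests),
        "ttft_avg": statistics.mean(ttft),
        "ttft_p50": percentile(ttft, 50),
        "ttft_p90": percentile(ttft, 90),
        "ttft_p99": percentile(ttft, 99),
        "itl_avg": statistics.mean(itl),
        "itl_p95": percentile(itl, 95),
    }
```

With only a handful of requests per step, the high percentiles collapse onto the slowest sample, which is worth keeping in mind when reading P99 values at low concurrency.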
Key takeaways
Max recommended concurrency: 16 (the highest level tested) · P99 TTFT stays under 136 ms and throughput holds at ~37.5 tokens/s with a 100% success rate.
- Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
- Stable token generation. Average inter-token latency stayed between 25.3 and 26.3 ms and throughput between 37.3 and 38.7 tokens/s across all concurrency levels.
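As a rough cross-check, the takeaways can be reproduced from the table values. The sketch below copies the relevant columns from the matrix and picks the highest concurrency level at which that level and every lower one meet a latency and success limit. The function name and the 136 ms limit are illustrative assumptions, not thresholds stated by the report.

```python
# (concurrency, success %, TTFT P99 ms, TPS, ITL avg ms), copied from the matrix above
ROWS = [
    (1, 100, 74.2, 38.6, 25.4),
    (2, 100, 65.3, 38.7, 25.3),
    (4, 100, 91.1, 38.3, 25.6),
    (8, 100, 103.5, 38.3, 25.6),
    (12, 100, 107.3, 38.2, 25.7),
    (13, 100, 111.8, 38.2, 25.6),
    (14, 100, 118.1, 37.9, 25.9),
    (15, 100, 129.6, 37.3, 26.3),
    (16, 100, 135.5, 37.5, 26.1),
]


def max_concurrency(rows, ttft_p99_limit_ms, min_success_pct=100):
    """Return the highest tested level where it and all lower levels meet both limits.

    Returns None if even the lowest level misses a limit.
    """
    best = None
    for conc, success, p99, _tps, _itl in sorted(rows):
        if success < min_success_pct or p99 > ttft_p99_limit_ms:
            break  # stop at the first level that violates a limit
        best = conc
    return best


def band(rows, column):
    """Return (min, max) of one column index across all rows."""
    values = [row[column] for row in rows]
    return min(values), max(values)


print(max_concurrency(ROWS, ttft_p99_limit_ms=136))  # 16
print(band(ROWS, 3))  # (37.3, 38.7) TPS
print(band(ROWS, 4))  # (25.3, 26.3) ITL avg in ms
```

A tighter limit moves the recommendation down: with a hypothetical 110 ms P99 TTFT limit, the same function returns 12.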