Stress Test Report

Gemma 4 12B IT · CUDA 12.8 Baseline

Warm-cache concurrency sweep from 1 to 16 parallel requests.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8

Concurrency performance matrix

All times are in milliseconds; throughput (TPS) is in tokens per second per request. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.
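The report does not say how its averages and percentiles were calculated. As a minimal sketch, assuming the nearest-rank percentile method and per-request samples collected by the load generator, one concurrency step could be summarized as follows. The function names and the sample values are hypothetical and are not taken from the benchmark.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(ttft_ms, ok_flags):
    """Summarize one concurrency step from per-request TTFT samples and success flags."""
    return {
        "success_pct": 100.0 * sum(ok_flags) / len(ok_flags),
        "ttft_avg": mean(ttft_ms),
        "p50": percentile(ttft_ms, 50),
        "p90": percentile(ttft_ms, 90),
        "p99": percentile(ttft_ms, 99),
    }


# Hypothetical samples for illustration only (not measurements from this report).
example = summarize([60.0, 62.0, 64.0, 70.0], [True, True, True, True])
print(example)
```

Other percentile definitions (for example linear interpolation) give slightly different values on small sample sets, so the method should be stated alongside any published figures.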

Concurrency	Success	TTFT avg (ms)	TTFT P50 (ms)	TTFT P90 (ms)	TTFT P99 (ms)	TPS (tokens/s)	ITL avg (ms)	ITL P95 (ms)
1	100%	63.8	61.6	68.5	74.2	38.6	25.4	26.5
2	100%	63.4	63.1	65.2	65.3	38.7	25.3	26.3
4	100%	70.6	66.8	79.7	91.1	38.3	25.6	26.6
8	100%	99.2	102.5	103.4	103.5	38.3	25.6	26.5
12	100%	92.0	82.1	106.8	107.3	38.2	25.7	26.5
13	100%	101.3	110.6	111.4	111.8	38.2	25.6	26.6
14	100%	115.2	117.0	117.8	118.1	37.9	25.9	26.7
15	100%	126.6	128.1	129.2	129.6	37.3	26.3	27.6
16	100%	132.1	133.6	134.6	135.5	37.5	26.1	27.2
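One way to sanity-check the matrix is to compare the reported TPS with the decode rate implied by the average ITL (1000 / ITL). The sketch below copies the TPS and ITL columns from the table above. The close match suggests, but does not prove, that TPS is measured per request. The aggregate figure is an estimate under that assumption, not a value the report states.

```python
# (concurrency, reported TPS, ITL avg in ms) copied from the matrix above
ROWS = [
    (1, 38.6, 25.4), (2, 38.7, 25.3), (4, 38.3, 25.6),
    (8, 38.3, 25.6), (12, 38.2, 25.7), (13, 38.2, 25.6),
    (14, 37.9, 25.9), (15, 37.3, 26.3), (16, 37.5, 26.1),
]


def implied_tps(itl_ms):
    """Steady-state decode rate implied by the mean gap between tokens."""
    return 1000.0 / itl_ms


def aggregate_tps(concurrency, per_request_tps):
    """Total tokens/s across all streams, ASSUMING the TPS column is per request."""
    return concurrency * per_request_tps


for conc, tps, itl in ROWS:
    print(f"{conc:>2}  reported={tps:5.1f}  implied={implied_tps(itl):5.1f}  "
          f"aggregate={aggregate_tps(conc, tps):6.1f}")
```

In every row the reported TPS sits slightly below the ITL-implied rate, which would be consistent with TPS including the time to first token. That interpretation is an inference from the numbers, not something the report confirms.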

Key takeaways

Max recommended concurrency: 16 (the highest level tested) · P99 TTFT stays under 136 ms and per-request throughput holds at ~37.5 tokens/s with a 100% success rate.

Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
Stable token generation. Average inter-token latency stayed between 25.3 and 26.3 ms and per-request throughput between 37.3 and 38.7 tokens/s across all concurrency levels.
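The report does not state the rule behind the recommended concurrency. A minimal sketch of one plausible rule, assuming a P99 TTFT budget and a minimum success rate chosen by the operator, picks the highest tested level at which that level and every lower level stay within both limits. The function name and the budget values are hypothetical. The rows are copied from the performance matrix.

```python
# (concurrency, success %, P99 TTFT in ms) copied from the performance matrix
ROWS = [
    (1, 100, 74.2), (2, 100, 65.3), (4, 100, 91.1),
    (8, 100, 103.5), (12, 100, 107.3), (13, 100, 111.8),
    (14, 100, 118.1), (15, 100, 129.6), (16, 100, 135.5),
]


def max_concurrency(rows, p99_budget_ms, min_success_pct=100):
    """Return the highest level such that it and all lower tested levels meet both limits.

    Returns None when even the lowest tested level misses a limit.
    """
    best = None
    for conc, success, p99 in sorted(rows):
        if success < min_success_pct or p99 > p99_budget_ms:
            break  # stop at the first level that violates a limit
        best = conc
    return best


print(max_concurrency(ROWS, p99_budget_ms=136))  # 16
print(max_concurrency(ROWS, p99_budget_ms=105))  # 8
```

With a 136 ms budget the rule returns 16, the top of the tested range. A tighter budget lowers the result, and a sweep beyond 16 would be needed to find where the limits are actually exceeded.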

Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.