Stress Test Report

Gemma 4 12B IT · CUDA 12.8 · Up to 32 Concurrency

Extended sweep from 1 to 32 parallel requests to find the pre-saturation ceiling.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8
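
For readers who want to run a sweep of this shape themselves, the following is a minimal sketch and not the harness used for this report. `fake_request` is a hypothetical stand-in that only sleeps; a real run would replace it with a streaming call to the serving endpoint and record the time until the first token arrives. The concurrency levels are the ones listed in the performance matrix.

```python
import asyncio
import time

# Concurrency steps taken from the report's performance matrix.
CONCURRENCY_LEVELS = [1, 2, 4, 8, 12, 16, 18, 20, 22, 24, 26, 28, 30, 32]


async def fake_request(delay_s: float = 0.001) -> float:
    """Hypothetical stand-in for one streaming request; returns a TTFT in ms."""
    start = time.perf_counter()
    await asyncio.sleep(delay_s)  # placeholder for waiting on the first token
    return (time.perf_counter() - start) * 1000.0


async def run_level(n: int) -> dict:
    """Fire n requests in parallel and summarize success rate and average TTFT."""
    results = await asyncio.gather(
        *(fake_request() for _ in range(n)), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    return {
        "concurrency": n,
        "success_pct": 100.0 * len(ok) / n,
        "ttft_avg_ms": sum(ok) / len(ok) if ok else None,
    }


async def sweep(levels=CONCURRENCY_LEVELS) -> list:
    """Run the levels one after another so the steps do not overlap."""
    return [await run_level(n) for n in levels]


if __name__ == "__main__":
    for row in asyncio.run(sweep()):
        print(row)
```

Running the levels sequentially keeps each step's measurements separate, so a latency jump can be attributed to a specific concurrency value.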

Concurrency performance matrix

All latencies are in milliseconds; throughput (TPS) is in tokens per second per request. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.

Concurrency	Success	TTFT avg	TTFT P50	TTFT P90	TTFT P99	TPS	ITL avg	ITL P95
1	100%	64.2	62.1	68.9	74.5	38.2	25.7	26.8
2	100%	65.2	65.2	66.0	66.1	38.3	25.6	26.6
4	100%	81.3	83.4	97.0	97.1	38.1	25.8	26.6
8	100%	101.6	105.0	105.8	106.2	37.7	26.0	26.9
12	100%	105.1	107.3	108.1	108.2	37.5	26.2	27.5
16	100%	109.0	118.2	119.2	119.6	37.5	26.2	27.1
18	100%	135.9	137.4	138.6	138.7	38.6	25.4	26.5
20	100%	130.9	138.2	139.4	140.1	38.0	25.8	26.8
22	100%	150.4	151.5	152.8	153.1	38.1	25.7	26.9
24	100%	158.7	159.5	161.2	161.6	38.0	25.8	26.9
26	100%	158.8	163.6	164.7	165.8	38.4	25.5	26.5
28	100%	179.6	180.8	181.7	182.3	38.0	25.8	27.2
30	100%	161.6	167.4	169.3	169.8	37.4	26.2	27.2
32	100%	166.5	172.7	174.1	175.2	37.5	26.1	27.2
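
The headline numbers can be rechecked directly from the matrix. The sketch below copies the rows above into Python and extracts the worst-case P99 TTFT and the spread of TPS and average ITL; the `Row` type and `summarize` function are illustrative names, not part of any tool used for the report.

```python
from typing import NamedTuple


class Row(NamedTuple):
    concurrency: int
    ttft_avg: float
    ttft_p50: float
    ttft_p90: float
    ttft_p99: float
    tps: float
    itl_avg: float
    itl_p95: float


# Values copied from the performance matrix (latencies in ms, TPS in tokens/s).
ROWS = [
    Row(1, 64.2, 62.1, 68.9, 74.5, 38.2, 25.7, 26.8),
    Row(2, 65.2, 65.2, 66.0, 66.1, 38.3, 25.6, 26.6),
    Row(4, 81.3, 83.4, 97.0, 97.1, 38.1, 25.8, 26.6),
    Row(8, 101.6, 105.0, 105.8, 106.2, 37.7, 26.0, 26.9),
    Row(12, 105.1, 107.3, 108.1, 108.2, 37.5, 26.2, 27.5),
    Row(16, 109.0, 118.2, 119.2, 119.6, 37.5, 26.2, 27.1),
    Row(18, 135.9, 137.4, 138.6, 138.7, 38.6, 25.4, 26.5),
    Row(20, 130.9, 138.2, 139.4, 140.1, 38.0, 25.8, 26.8),
    Row(22, 150.4, 151.5, 152.8, 153.1, 38.1, 25.7, 26.9),
    Row(24, 158.7, 159.5, 161.2, 161.6, 38.0, 25.8, 26.9),
    Row(26, 158.8, 163.6, 164.7, 165.8, 38.4, 25.5, 26.5),
    Row(28, 179.6, 180.8, 181.7, 182.3, 38.0, 25.8, 27.2),
    Row(30, 161.6, 167.4, 169.3, 169.8, 37.4, 26.2, 27.2),
    Row(32, 166.5, 172.7, 174.1, 175.2, 37.5, 26.1, 27.2),
]


def summarize(rows):
    """Return the worst P99 TTFT row and the min/max of TPS and average ITL."""
    worst = max(rows, key=lambda r: r.ttft_p99)
    return {
        "worst_p99_ms": worst.ttft_p99,
        "worst_p99_concurrency": worst.concurrency,
        "tps_range": (min(r.tps for r in rows), max(r.tps for r in rows)),
        "itl_avg_range_ms": (min(r.itl_avg for r in rows), max(r.itl_avg for r in rows)),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```

On these rows the worst P99 TTFT is 182.3 ms at concurrency 28, TPS spans 37.4 to 38.6, and average ITL spans 25.4 to 26.2 ms.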

Key takeaways

Safe OpenRouter concurrency limit of 32 · P99 TTFT stayed under ~185 ms (peak 182.3 ms at concurrency 28), with no saturation and a 100% success rate.

Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
Stable token generation. Average inter-token latency stayed between 25.4 and 26.2 ms and per-request throughput between 37.4 and 38.6 tokens per second across all concurrency levels.
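
Inter-token latency and per-request throughput are two views of the same decode speed, which is why they move together. The sketch below shows the arithmetic under two stated assumptions: that steady-state throughput is roughly the reciprocal of ITL, and that an aggregate figure can be estimated as concurrency times per-request TPS. The report does not publish a measured aggregate throughput, so the second function yields an estimate only. Both function names are illustrative.

```python
def implied_tps_from_itl(itl_ms: float) -> float:
    """Tokens per second implied by an inter-token latency given in milliseconds."""
    return 1000.0 / itl_ms


def estimated_aggregate_tps(concurrency: int, per_request_tps: float) -> float:
    """Rough aggregate estimate; assumes all requests decode at the same time."""
    return concurrency * per_request_tps


if __name__ == "__main__":
    # Concurrency 1 in the matrix: ITL avg 25.7 ms, reported TPS 38.2.
    print(round(implied_tps_from_itl(25.7), 1))  # 38.9
    # Concurrency 32 in the matrix: reported per-request TPS 37.5.
    print(estimated_aggregate_tps(32, 37.5))  # 1200.0
```

The implied value sits slightly above the reported TPS at concurrency 1 (38.9 versus 38.2). One plausible explanation, not confirmed by the report, is that the reported TPS also includes time spent waiting for the first token.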

Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.