Stress Test Report

Gemma 4 12B IT · CUDA 12.8 · up to 64 concurrent requests

Full-range sweep from 1 to 64 parallel requests to measure behavior under saturation.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8
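The report does not describe the load generator behind the sweep. A minimal asyncio sketch of the pattern follows. It assumes that each level fires N requests at once, waits for all of them, and counts any raised exception as a failed request. `fake_request`, `run_level` and `sweep` are hypothetical names, and `fake_request` sleeps instead of calling a real inference server.

```python
import asyncio
import time
from collections.abc import Sequence

# Concurrency levels that appear in the performance matrix.
LEVELS = (1, 2, 4, 8, 12, 16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60, 64)


async def fake_request(latency_s: float, fail: bool = False) -> float:
    """Hypothetical stand-in for one inference call; returns elapsed seconds."""
    start = time.perf_counter()
    await asyncio.sleep(latency_s)
    if fail:
        raise RuntimeError("simulated request error")
    return time.perf_counter() - start


async def run_level(concurrency: int, latency_s: float = 0.01, failures: int = 0) -> dict:
    # Launch `concurrency` requests at once; the first `failures` are forced to error.
    results = await asyncio.gather(
        *(fake_request(latency_s, fail=i < failures) for i in range(concurrency)),
        return_exceptions=True,  # collect errors instead of aborting the level
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    return {
        "concurrency": concurrency,
        "success_rate": 100.0 * len(ok) / concurrency,
        "max_latency_s": max(ok) if ok else float("nan"),
    }


async def sweep(levels: Sequence[int] = LEVELS) -> list[dict]:
    # Run each level to completion before starting the next, so levels never overlap.
    return [await run_level(n) for n in levels]


if __name__ == "__main__":
    for row in asyncio.run(sweep()):
        print(row)
```

Running the levels one after another keeps each measurement tied to a single concurrency setting. A real harness would replace `fake_request` with a streaming HTTP call and record per-token timestamps.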

Concurrency performance matrix

All latencies are in milliseconds. TPS is per-request throughput in tokens per second. TTFT is time to first token and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.

Concurrency	Success	TTFT avg	TTFT P50	TTFT P90	TTFT P99	TPS (per request)	ITL avg	ITL P95
1	100%	64.2	62.1	68.9	74.5	38.2	25.7	26.8
2	100%	65.2	65.2	66.0	66.1	38.3	25.6	26.6
4	100%	81.3	83.4	97.0	97.1	38.1	25.8	26.6
8	100%	101.6	105.0	105.8	106.2	37.7	26.0	26.9
12	100%	105.1	107.3	108.1	108.2	37.5	26.2	27.5
16	100%	109.0	118.2	119.2	119.6	37.5	26.2	27.1
18	100%	135.9	137.4	138.6	138.7	38.6	25.4	26.5
20	100%	130.9	138.2	139.4	140.1	38.0	25.8	26.8
22	100%	150.4	151.5	152.8	153.1	38.1	25.7	26.9
24	100%	158.7	159.5	161.2	161.6	38.0	25.8	26.9
26	100%	158.8	163.6	164.7	165.8	38.4	25.5	26.5
28	100%	179.6	180.8	181.7	182.3	38.0	25.8	27.2
30	100%	161.6	167.4	169.3	169.8	37.4	26.2	27.2
32	100%	177.8	181.0	182.5	185.8	37.6	26.1	27.2
36	100%	198.6	201.5	203.2	203.9	38.0	25.8	27.3
40	100%	220.6	221.5	223.1	224.1	38.0	25.8	27.0
44	100%	233.6	236.8	239.3	241.4	37.7	26.0	27.4
48	100%	250.5	251.2	253.8	255.3	36.7	26.7	28.2
52	100%	266.3	266.7	269.8	271.6	37.1	26.4	27.4
56	100%	270.5	271.6	274.9	276.1	36.9	26.5	27.7
60	100%	302.7	303.3	306.8	307.9	37.1	26.4	27.5
64	100%	287.6	288.2	291.6	293.5	36.8	26.6	27.8
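The report does not say exactly how each column is derived from the raw token stream. The sketch below shows one plausible set of definitions. It assumes TTFT runs from request send to the first streamed token, ITL is the gap between consecutive tokens, per-request TPS divides output tokens by total request time (TTFT included), and percentiles use the nearest-rank method. `RequestTrace` and the helper functions are hypothetical names, not part of the tool that produced these numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, ascending


def ttft_ms(trace: RequestTrace) -> float:
    # Time to first token: request send -> first token arrival.
    return (trace.token_times[0] - trace.sent_at) * 1000.0


def itl_ms(trace: RequestTrace) -> list[float]:
    # Inter-token latency: gap between each pair of consecutive tokens.
    t = trace.token_times
    return [(b - a) * 1000.0 for a, b in zip(t, t[1:])]


def tps(trace: RequestTrace) -> float:
    # Per-request throughput: output tokens / total wall time (TTFT included).
    return len(trace.token_times) / (trace.token_times[-1] - trace.sent_at)


def percentile(values: list[float], p: float) -> float:
    # Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.064, 0.090, 0.116, 0.142])
    print(round(ttft_ms(trace), 1), [round(x, 1) for x in itl_ms(trace)], round(tps(trace), 1))
```

For the synthetic trace in the example, TTFT is about 64 ms and each ITL gap is about 26 ms. Some tools interpolate percentiles instead of using nearest rank, so P90 and P99 values can differ slightly between harnesses given the same samples.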

Key takeaways

The node remains stable up to 64 concurrent requests: P99 TTFT stays below 310 ms and per-request throughput stays above 36 tokens/s.

Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
Stable token generation. Average inter-token latency stayed between 25.4 and 26.7 ms, and per-request throughput between 36.7 and 38.6 tokens/s, at every concurrency level.
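These takeaways can be re-derived from the matrix. The following sketch uses hand-copied TTFT P99, TPS and ITL avg columns and computes the extremes behind each claim. `ROWS` and `summarize` are illustrative names only.

```python
# Rows copied from the concurrency performance matrix:
# (concurrency, TTFT P99 ms, per-request TPS, ITL avg ms)
ROWS = [
    (1, 74.5, 38.2, 25.7), (2, 66.1, 38.3, 25.6), (4, 97.1, 38.1, 25.8),
    (8, 106.2, 37.7, 26.0), (12, 108.2, 37.5, 26.2), (16, 119.6, 37.5, 26.2),
    (18, 138.7, 38.6, 25.4), (20, 140.1, 38.0, 25.8), (22, 153.1, 38.1, 25.7),
    (24, 161.6, 38.0, 25.8), (26, 165.8, 38.4, 25.5), (28, 182.3, 38.0, 25.8),
    (30, 169.8, 37.4, 26.2), (32, 185.8, 37.6, 26.1), (36, 203.9, 38.0, 25.8),
    (40, 224.1, 38.0, 25.8), (44, 241.4, 37.7, 26.0), (48, 255.3, 36.7, 26.7),
    (52, 271.6, 37.1, 26.4), (56, 276.1, 36.9, 26.5), (60, 307.9, 37.1, 26.4),
    (64, 293.5, 36.8, 26.6),
]


def summarize(rows):
    """Return the extremes behind the key takeaways."""
    worst_p99 = max(rows, key=lambda r: r[1])  # row with the highest TTFT P99
    slowest = min(rows, key=lambda r: r[2])    # row with the lowest per-request TPS
    tps = [r[2] for r in rows]
    itl = [r[3] for r in rows]
    return {
        "worst_p99_ttft": (worst_p99[0], worst_p99[1]),
        "lowest_tps": (slowest[0], slowest[2]),
        "tps_band": (min(tps), max(tps)),
        "itl_band": (min(itl), max(itl)),
    }


if __name__ == "__main__":
    s = summarize(ROWS)
    assert s["worst_p99_ttft"][1] < 310.0  # P99 TTFT claim
    assert s["lowest_tps"][1] > 36.0       # per-request throughput claim
    print(s)
```

The highest P99 TTFT occurs at 60 concurrent requests (307.9 ms) rather than at 64, and the lowest per-request TPS occurs at 48 (36.7 tokens/s). The bounds hold across the whole sweep, not only at its endpoints.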

Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.