Stress Test Report

Gemma 4 12B IT · CUDA 12.8 · Up to 64 Concurrency

Full-range sweep from 1 to 64 parallel requests to measure saturated behavior.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8

Concurrency performance matrix

All times in milliseconds, throughput in tokens per second. Success rate is the percentage of requests that completed without error.

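| Concurrency | Success | TTFT avg (ms) | TTFT P50 (ms) | TTFT P90 (ms) | TTFT P99 (ms) | TPS (tokens/s) | ITL avg (ms) | ITL P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| 1 | 100% | 64.2 | 62.1 | 68.9 | 74.5 | 38.2 | 25.7 | 26.8 |
| 2 | 100% | 65.2 | 65.2 | 66.0 | 66.1 | 38.3 | 25.6 | 26.6 |
| 4 | 100% | 81.3 | 83.4 | 97.0 | 97.1 | 38.1 | 25.8 | 26.6 |
| 8 | 100% | 101.6 | 105.0 | 105.8 | 106.2 | 37.7 | 26.0 | 26.9 |
| 12 | 100% | 105.1 | 107.3 | 108.1 | 108.2 | 37.5 | 26.2 | 27.5 |
| 16 | 100% | 109.0 | 118.2 | 119.2 | 119.6 | 37.5 | 26.2 | 27.1 |
| 18 | 100% | 135.9 | 137.4 | 138.6 | 138.7 | 38.6 | 25.4 | 26.5 |
| 20 | 100% | 130.9 | 138.2 | 139.4 | 140.1 | 38.0 | 25.8 | 26.8 |
| 22 | 100% | 150.4 | 151.5 | 152.8 | 153.1 | 38.1 | 25.7 | 26.9 |
| 24 | 100% | 158.7 | 159.5 | 161.2 | 161.6 | 38.0 | 25.8 | 26.9 |
| 26 | 100% | 158.8 | 163.6 | 164.7 | 165.8 | 38.4 | 25.5 | 26.5 |
| 28 | 100% | 179.6 | 180.8 | 181.7 | 182.3 | 38.0 | 25.8 | 27.2 |
| 30 | 100% | 161.6 | 167.4 | 169.3 | 169.8 | 37.4 | 26.2 | 27.2 |
| 32 | 100% | 177.8 | 181.0 | 182.5 | 185.8 | 37.6 | 26.1 | 27.2 |
| 36 | 100% | 198.6 | 201.5 | 203.2 | 203.9 | 38.0 | 25.8 | 27.3 |
| 40 | 100% | 220.6 | 221.5 | 223.1 | 224.1 | 38.0 | 25.8 | 27.0 |
| 44 | 100% | 233.6 | 236.8 | 239.3 | 241.4 | 37.7 | 26.0 | 27.4 |
| 48 | 100% | 250.5 | 251.2 | 253.8 | 255.3 | 36.7 | 26.7 | 28.2 |
| 52 | 100% | 266.3 | 266.7 | 269.8 | 271.6 | 37.1 | 26.4 | 27.4 |
| 56 | 100% | 270.5 | 271.6 | 274.9 | 276.1 | 36.9 | 26.5 | 27.7 |
| 60 | 100% | 302.7 | 303.3 | 306.8 | 307.9 | 37.1 | 26.4 | 27.5 |
| 64 | 100% | 287.6 | 288.2 | 291.6 | 293.5 | 36.8 | 26.6 | 27.8 |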
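
The report does not state how the TTFT avg, P50, P90, and P99 columns or the success rate were derived from raw measurements. As a minimal sketch, assuming a nearest-rank percentile (an assumption; the benchmark tool may interpolate instead) and using made-up samples rather than data from this run, one row of the matrix could be computed like this. The names `percentile` and `summarize` are hypothetical helpers.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ttft_ms, failures=0):
    """Build one table row from the TTFT samples (ms) of successful requests."""
    total = len(ttft_ms) + failures
    return {
        "success_pct": 100.0 * len(ttft_ms) / total,
        "ttft_avg": round(mean(ttft_ms), 1),
        "p50": percentile(ttft_ms, 50),
        "p90": percentile(ttft_ms, 90),
        "p99": percentile(ttft_ms, 99),
    }


# Illustrative TTFT samples in ms; NOT taken from the report.
samples = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0, 76.0, 78.0]
summary = summarize(samples)
print(summary)
```

With only a handful of samples per concurrency level, P90 and P99 collapse onto the largest few values, which may be why those columns sit so close together in several rows.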

Key takeaways

Node remains stable to 64 concurrency · P99 TTFT peaks at 307.9 ms (at concurrency 60) and per-request throughput never drops below 36.7 tokens/s.

  • Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
  • Stable token generation. Average inter-token latency stayed between 25.4 and 26.7 ms and per-request throughput between 36.7 and 38.6 tokens/s across all concurrency levels.
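
The two stability metrics can be cross-checked against each other: an average inter-token latency of 25.7 ms implies roughly 1000 / 25.7 ≈ 38.9 tokens/s, close to the reported 38.2. The sketch below repeats that check for a few rows copied from the matrix. It also estimates aggregate node throughput as concurrency × TPS, which is an assumption: it treats TPS as a per-request figure (as the takeaways suggest) and assumes all requests decode concurrently. The report does not publish an aggregate number.

```python
# (concurrency, reported TPS in tokens/s, ITL avg in ms), copied from the matrix.
ROWS = [
    (1, 38.2, 25.7),
    (16, 37.5, 26.2),
    (32, 37.6, 26.1),
    (64, 36.8, 26.6),
]


def tps_from_itl(itl_ms):
    """Decode rate implied by a mean inter-token latency given in milliseconds."""
    return 1000.0 / itl_ms


checks = []
for concurrency, tps, itl_ms in ROWS:
    implied = tps_from_itl(itl_ms)
    checks.append({
        "concurrency": concurrency,
        "reported_tps": tps,
        "implied_tps": round(implied, 1),
        # Assumption: per-request TPS scales linearly to a node-level total.
        "est_aggregate_tps": round(concurrency * tps, 1),
    })

for row in checks:
    print(row)
```

In every sampled row the implied rate is less than 1 token/s above the reported TPS. A gap of that size would be consistent with the reported figure also counting time before the first token, though the report does not say how TPS was measured.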
Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.
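
For readers who want to reproduce the shape of this sweep, here is a minimal sketch of a concurrency-stepping harness. It is not the tool used for this report. `fake_request` is a hypothetical stand-in that only sleeps; in a real run it would be replaced by a streaming call to the serving endpoint that returns the measured time to first token. The level list mirrors the concurrency column of the matrix.

```python
import asyncio
import time

# Concurrency steps taken from the matrix above.
LEVELS = [1, 2, 4, 8, 12, 16, 18, 20, 22, 24, 26, 28,
          30, 32, 36, 40, 44, 48, 52, 56, 60, 64]


async def fake_request(delay_s=0.001):
    """Hypothetical placeholder for one streaming request; returns TTFT in ms."""
    start = time.perf_counter()
    await asyncio.sleep(delay_s)  # stands in for waiting on the first token
    return (time.perf_counter() - start) * 1000.0


async def run_level(concurrency):
    """Fire `concurrency` requests at once and collect per-request results."""
    results = await asyncio.gather(
        *(fake_request() for _ in range(concurrency)),
        return_exceptions=True,  # a failed request lowers success rate instead of aborting
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    return {
        "concurrency": concurrency,
        "success_pct": 100.0 * len(ok) / concurrency,
        "ttft_ms": ok,
    }


async def sweep(levels):
    """Run the levels one after another so they do not overlap."""
    return [await run_level(c) for c in levels]


report = asyncio.run(sweep(LEVELS))
for row in report:
    print(row["concurrency"], f'{row["success_pct"]:.0f}%', len(row["ttft_ms"]))
```

Running levels sequentially keeps each step's measurements independent. A production harness would typically also add warm-up requests and per-token timestamps to derive ITL, both omitted here.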