Stress Test Report

Gemma 4 12B IT · CUDA 12.8 · Up to 32 Concurrent Requests

Extended sweep from 1 to 32 parallel requests to find the pre-saturation ceiling.

Gemma 4 12B IT · FP8 · 2026-06-11 · CUDA 12.8
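A concurrency sweep of this kind can be sketched as follows. This is a minimal illustration, not the harness used for this report: `fake_stream`, `one_request` and `sweep` are hypothetical names, and `fake_stream` stands in for a streaming endpoint with fixed delays. It shows the structure of a sweep (launch N requests at once, time the first token, count failures) but does not model server contention, so its TTFT will not grow with concurrency the way the real measurements do.

```python
import asyncio
import statistics
import time


async def fake_stream(n_tokens: int, ttft_s: float, itl_s: float):
    """Hypothetical streaming endpoint: first token after ttft_s, then one every itl_s."""
    await asyncio.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(itl_s)
        yield i


async def one_request(n_tokens: int = 8) -> float:
    """Send one request and return its time to first token in milliseconds."""
    start = time.perf_counter()
    ttft_ms = None
    async for _ in fake_stream(n_tokens, ttft_s=0.060, itl_s=0.026):
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms


async def sweep(levels: list[int]) -> dict[int, dict[str, float]]:
    """Run each concurrency level as one batch of simultaneous requests."""
    results = {}
    for c in levels:
        # return_exceptions=True lets a failed request count against the
        # success rate instead of aborting the whole batch.
        outcomes = await asyncio.gather(
            *(one_request() for _ in range(c)), return_exceptions=True
        )
        ok = [o for o in outcomes if not isinstance(o, BaseException)]
        results[c] = {
            "success_pct": 100.0 * len(ok) / c,
            "ttft_avg_ms": statistics.mean(ok) if ok else float("nan"),
        }
    return results


if __name__ == "__main__":
    for c, row in asyncio.run(sweep([1, 2, 4, 8])).items():
        print(c, row)
```

Against a real server, `fake_stream` would be replaced by a streaming HTTP client call, and the level list would run from 1 to 32 as in the table below.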

Concurrency performance matrix

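All times are in milliseconds and throughput is in tokens per second. TTFT is time to first token, and P50, P90 and P99 are TTFT percentiles. TPS is per-request output throughput, and ITL is inter-token latency. Success rate is the percentage of requests that completed without error.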
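The metric definitions above can be derived from per-request timestamps roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, TPS is assumed to be the tokens after the first divided by the decode window, and percentiles use the nearest-rank definition. The report does not say how its harness computes any of these.

```python
import math
import statistics


def request_metrics(send_t: float, token_ts: list[float]) -> dict:
    """Per-request metrics from a send time and token arrival times, in seconds."""
    ttft_ms = (token_ts[0] - send_t) * 1000.0
    # Inter-token latency: gap between each pair of consecutive tokens.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_ts, token_ts[1:])]
    # Assumption: TPS counts tokens after the first over the decode window only.
    decode_s = token_ts[-1] - token_ts[0]
    return {
        "ttft_ms": ttft_ms,
        "itl_avg_ms": statistics.mean(gaps_ms),
        "tps": (len(token_ts) - 1) / decode_s,
        "gaps_ms": gaps_ms,
    }


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (one common definition among several)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For a request sent at t = 0 whose three tokens arrive at 64, 90 and 116 ms, `request_metrics(0.0, [0.064, 0.090, 0.116])` yields a TTFT of about 64 ms, an average ITL of about 26 ms and roughly 38.5 tokens/s. Aggregate columns such as P95 ITL would presumably pool the `gaps_ms` lists from every request at a given concurrency level before calling `percentile`.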

Concurrency  Success  TTFT avg  TTFT P50  TTFT P90  TTFT P99   TPS  ITL avg  ITL P95
          1     100%      64.2      62.1      68.9      74.5  38.2     25.7     26.8
          2     100%      65.2      65.2      66.0      66.1  38.3     25.6     26.6
          4     100%      81.3      83.4      97.0      97.1  38.1     25.8     26.6
          8     100%     101.6     105.0     105.8     106.2  37.7     26.0     26.9
         12     100%     105.1     107.3     108.1     108.2  37.5     26.2     27.5
         16     100%     109.0     118.2     119.2     119.6  37.5     26.2     27.1
         18     100%     135.9     137.4     138.6     138.7  38.6     25.4     26.5
         20     100%     130.9     138.2     139.4     140.1  38.0     25.8     26.8
         22     100%     150.4     151.5     152.8     153.1  38.1     25.7     26.9
         24     100%     158.7     159.5     161.2     161.6  38.0     25.8     26.9
         26     100%     158.8     163.6     164.7     165.8  38.4     25.5     26.5
         28     100%     179.6     180.8     181.7     182.3  38.0     25.8     27.2
         30     100%     161.6     167.4     169.3     169.8  37.4     26.2     27.2
         32     100%     166.5     172.7     174.1     175.2  37.5     26.1     27.2
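As a rough consistency check, if TPS is per-request decode throughput it should sit close to 1000 / ITL avg. The sketch below applies that arithmetic to the TPS and ITL avg columns of the table. The formula is an assumption for illustration, since the report does not state how TPS is computed, and `implied_ratio` is a hypothetical helper name.

```python
# (concurrency, TPS, ITL avg in ms) copied from the table above.
ROWS = [
    (1, 38.2, 25.7), (2, 38.3, 25.6), (4, 38.1, 25.8), (8, 37.7, 26.0),
    (12, 37.5, 26.2), (16, 37.5, 26.2), (18, 38.6, 25.4), (20, 38.0, 25.8),
    (22, 38.1, 25.7), (24, 38.0, 25.8), (26, 38.4, 25.5), (28, 38.0, 25.8),
    (30, 37.4, 26.2), (32, 37.5, 26.1),
]


def implied_ratio(tps: float, itl_ms: float) -> float:
    """Reported TPS divided by the rate implied by the average inter-token gap."""
    return tps / (1000.0 / itl_ms)


if __name__ == "__main__":
    for c, tps, itl in ROWS:
        print(f"{c:>2}  reported {tps:.1f}  implied {1000.0 / itl:.1f}  "
              f"ratio {implied_ratio(tps, itl):.3f}")
```

Every row lands at a ratio between roughly 0.978 and 0.983, so reported TPS tracks 1 / ITL closely at every concurrency level. The small, constant gap presumably comes from how the measurement window is defined.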

Key takeaways

Safe OpenRouter concurrency limit: 32 · P99 TTFT stayed under ~185 ms (peak 182.3 ms at 28 concurrent requests), with no saturation and a 100% success rate.

  • Zero request failures. Every concurrency step completed with a 100% success rate; no OOM or service interruption was observed.
  • Stable token generation. Average inter-token latency stayed between 25.4 and 26.2 ms, and per-request throughput between 37.4 and 38.6 tokens/s, across the full sweep from 1 to 32 concurrent requests.
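
The headline figure and both bullets can be re-derived from the table. The sketch below copies the relevant columns and computes the peak P99 TTFT, the ranges of TPS and average ITL, and the growth in average TTFT from 1 to 32 concurrent requests. It is an illustrative check of the published numbers, not part of the original test tooling; `summarize` is a hypothetical name.

```python
# (concurrency, TTFT avg, TTFT P99, TPS, ITL avg) from the table; times in ms.
ROWS = [
    (1, 64.2, 74.5, 38.2, 25.7), (2, 65.2, 66.1, 38.3, 25.6),
    (4, 81.3, 97.1, 38.1, 25.8), (8, 101.6, 106.2, 37.7, 26.0),
    (12, 105.1, 108.2, 37.5, 26.2), (16, 109.0, 119.6, 37.5, 26.2),
    (18, 135.9, 138.7, 38.6, 25.4), (20, 130.9, 140.1, 38.0, 25.8),
    (22, 150.4, 153.1, 38.1, 25.7), (24, 158.7, 161.6, 38.0, 25.8),
    (26, 158.8, 165.8, 38.4, 25.5), (28, 179.6, 182.3, 38.0, 25.8),
    (30, 161.6, 169.8, 37.4, 26.2), (32, 166.5, 175.2, 37.5, 26.1),
]


def summarize(rows):
    """Peak P99 TTFT, TTFT growth first-to-last row, and TPS / ITL ranges."""
    worst = max(rows, key=lambda r: r[2])
    tps = [r[3] for r in rows]
    itl = [r[4] for r in rows]
    return {
        "peak_p99_ms": worst[2],
        "peak_p99_at": worst[0],
        "ttft_growth": rows[-1][1] / rows[0][1],
        "tps_range": (min(tps), max(tps)),
        "itl_range": (min(itl), max(itl)),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```

On these rows it reports a peak P99 of 182.3 ms at 28 concurrent requests, TPS between 37.4 and 38.6, average ITL between 25.4 and 26.2 ms, and average TTFT rising roughly 2.6 times across the sweep. The extra load appears to show up in time to first token, while per-request decode speed stays flat.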

Test environment · NVIDIA RTX 5090 · 32 GB · CUDA 12.8 · vLLM 0.6.3. Measured 2026-06-11. These figures reflect the exact node configuration tested; results may vary with driver, engine, or workload changes.