The Transparent
Inference Endpoint
We publish live p50, p95, and p99 latency, retain zero prompts or completions, and disclose the infrastructure honestly. No queues to hide behind, no marketing promises to untangle.
An inference endpoint that doesn't hide the hard parts.
Built for developers who want transparency by default, not marketing claims.
Public Latency Metrics
We publish live p50, p95, and p99 TTFT and throughput numbers for every region. No vanity benchmarks — just the actual distribution your requests hit right now.
Zero Data Retention
Prompts and completions never touch disk. Everything stays in GPU memory and is destroyed the instant a request completes. We have no log retention to subpoena and no training pipeline to leak into.
Honest Hybrid-Cloud Failover
If our own capacity hiccups, traffic transparently fails over to verified third-party inference backends. We tell you when it happens, which provider handled it, and how latency shifted — no hidden routing magic.
Performance Integrity
Models run at the advertised precision, context length, and quantization. No silent downgrades under load, no truncation to save tokens. If something changes, the metrics page shows it.
Your data's entire lifecycle. It never touches a disk.
Every request flows through a memory-only pipeline and is destroyed the instant it completes. There is no logging stage to opt out of — persistence simply doesn't exist in the path.
User Request
Prompt + params
Routing Layer
Request ingress
Secure Edge
TLS / HTTPS
Inference Gateway
Direct-route proxy
In-Memory GPU
Paged KV, no disk
Immediate Destruction
0 bytes persisted
A flat line is the whole point.
On queue-based routing, first-token latency rises and falls with how deep the shared queue runs at that moment. A dedicated direct-route sidesteps the queue entirely, holding a flat line request after request.
Inferway model endpoints
Transparent, metered access to open-weight models. Every card shows live performance and the backend actually serving it.
Gemma 4 12B IT
12.4B parameters · FP8
Gemma 4 26B IT
26B parameters · FP8
Next on the Inferway roadmap. Target specs from the current lab evaluation:
- Gemma 4 26B IT · A4B MoE · FP8
- Target TTFT p50 ~159 ms
- Target throughput ~105 tok/s
Open weights, native precision, honest pricing.
Pay only for tokens, reconciled from metadata — never from your request content. Every model routes through the same dedicated endpoint, with more joining the catalog as we scale.
Gemma 4 12B IT
Gemma 4 26B IT
Route to an endpoint that shows its work.
Inferway publishes live latency percentiles, stores zero prompts or completions, and discloses limits upfront. Point your OpenAI-compatible client at Inferway and see the difference on the first token.