Transparent inference endpoint · online (last-known)

The Transparent
Inference Endpoint

We publish live p50, p95, and p99 latency, retain zero prompts or completions, and disclose the infrastructure honestly. No queues to hide behind, no marketing promises to untangle.

TTFT (last-known)
64ms
Throughput (last-known)
62 tok/s
Uptime 30d
Data Retained
0 bytes
Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8
Why Inferway

An inference endpoint that doesn't hide the hard parts.

Built for developers who want transparency by default, not marketing claims.

Public Latency Metrics

We publish live p50, p95, and p99 TTFT and throughput numbers for every region. No vanity benchmarks — just the actual distribution your requests hit right now.

Real-time percentiles

Zero Data Retention

Prompts and completions never touch disk. Everything stays in GPU memory and is destroyed the instant a request completes. We have no log retention to subpoena and no training pipeline to leak into.

Memory-only · no storage

Honest Hybrid-Cloud Failover

If our own capacity hiccups, traffic transparently fails over to verified third-party inference backends. We tell you when it happens, which provider handled it, and how latency shifted — no hidden routing magic.

Transparent fallback

Performance Integrity

Models run at the advertised precision, context length, and quantization. No silent downgrades under load, no truncation to save tokens. If something changes, the metrics page shows it.

No hidden degradation
Zero Data Retention

Your data's entire lifecycle. It never touches a disk.

Every request flows through a memory-only pipeline and is destroyed the instant it completes. There is no logging stage to opt out of — persistence simply doesn't exist in the path.

User Request

Prompt + params

Routing Layer

Request ingress

Secure Edge

TLS / HTTPS

Inference Gateway

Direct-route proxy

In-Memory GPU

Paged KV, no disk

Immediate Destruction

0 bytes persisted

No disk writes Page-level KV isolation No cross-session pollution
TTFT Consistency

A flat line is the whole point.

On queue-based routing, first-token latency rises and falls with how deep the shared queue runs at that moment. A dedicated direct-route sidesteps the queue entirely, holding a flat line request after request.

Queue-based routingInferway Transparent Endpoint
Time to First Token · ms
03006009001200
Direct-route typical
64 ms
Direct-route jitter
±6 ms
Queued peak *
1,240 ms
Consistency gain *
19.4×
* shared pool metrics are illustrative of average queue-based routing degradation under load
Production Models

Inferway model endpoints

Transparent, metered access to open-weight models. Every card shows live performance and the backend actually serving it.

Gemma 4 12B IT

12.4B parameters · FP8

Loading status…
64msbenchmark
TTFT p50
p95 294 ms
62.0tok/sbenchmark
Throughput p50
p95 38.0 tok/s
Public Beta (Free)

Gemma 4 26B IT

26B parameters · FP8

Planned

Next on the Inferway roadmap. Target specs from the current lab evaluation:

  • Gemma 4 26B IT · A4B MoE · FP8
  • Target TTFT p50 ~159 ms
  • Target throughput ~105 tok/s
Models & Pricing

Open weights, native precision, honest pricing.

Pay only for tokens, reconciled from metadata — never from your request content. Every model routes through the same dedicated endpoint, with more joining the catalog as we scale.

Live now

Gemma 4 12B IT

Flagship · FP8
Input / 1M tokens$0.05
Output / 1M tokens$0.10
Native FP8 — served at advertised precision, no downgrade under load
Transparent p50/p95/p99 latency metrics
~64ms typical TTFT
Zero prompts or completions retained
Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8
Route on OpenRouter
Coming Soon / Roadmap

Gemma 4 26B IT

A4B MoE · High-capacity
Input / 1M tokens$0.08
Output / 1M tokens$0.16
Mixture-of-Experts, deeper reasoning
Same direct-route guarantees
Advertised precision under load
Direct-to-architect support
Measured Jun 2026 · single RTX 5090 node · Gemma 4 26B MoE FP8
Route on OpenRouter
Start routing

Route to an endpoint that shows its work.

Inferway publishes live latency percentiles, stores zero prompts or completions, and discloses limits upfront. Point your OpenAI-compatible client at Inferway and see the difference on the first token.