Transparent inference endpoint · online (last-known)

The Transparent
Inference Endpoint

We publish live p50, p95, and p99 latency, retain zero prompts or completions, and disclose the infrastructure honestly. No queues to hide behind, no marketing promises to untangle.

Try on OpenRouter View Performance Dashboard

TTFT (last-known)

64ms

Throughput (last-known)

62 tok/s

Uptime 30d

—

Data Retained

0 bytes

Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8

Why Inferway

An inference endpoint that doesn't hide the hard parts.

Built for developers who want transparency by default, not marketing claims.

Public Latency Metrics

We publish live p50, p95, and p99 TTFT and throughput numbers for every region. No vanity benchmarks — just the actual distribution your requests hit right now.

Real-time percentiles

Zero Data Retention

Prompts and completions never touch disk. Everything stays in GPU memory and is destroyed the instant a request completes. We have no log retention to subpoena and no training pipeline to leak into.

Memory-only · no storage

Honest Hybrid-Cloud Failover

If our own capacity hiccups, traffic transparently fails over to verified third-party inference backends. We tell you when it happens, which provider handled it, and how latency shifted — no hidden routing magic.

Transparent fallback

Performance Integrity

Models run at the advertised precision, context length, and quantization. No silent downgrades under load, no truncation to save tokens. If something changes, the metrics page shows it.

No hidden degradation

Zero Data Retention

Your data's entire lifecycle. It never touches a disk.

Every request flows through a memory-only pipeline and is destroyed the instant it completes. There is no logging stage to opt out of — persistence simply doesn't exist in the path.

User Request

Prompt + params

Routing Layer

Request ingress

Secure Edge

TLS / HTTPS

Inference Gateway

Direct-route proxy

In-Memory GPU

Paged KV, no disk

Immediate Destruction

0 bytes persisted

No disk writes Page-level KV isolation No cross-session pollution

TTFT Consistency

A flat line is the whole point.

On queue-based routing, first-token latency rises and falls with how deep the shared queue runs at that moment. A dedicated direct-route sidesteps the queue entirely, holding a flat line request after request.

Queue-based routingInferway Transparent Endpoint

Time to First Token · ms

Direct-route typical

64 ms

Direct-route jitter

±6 ms

Queued peak *

1,240 ms

Consistency gain *

19.4×

* shared pool metrics are illustrative of average queue-based routing degradation under load

Production Models

Inferway model endpoints

Transparent, metered access to open-weight models. Every card shows live performance and the backend actually serving it.

Gemma 4 12B IT

12.4B parameters · FP8

Loading status…

64msbenchmark

TTFT p50

p95 294 ms

62.0tok/sbenchmark

Throughput p50

p95 38.0 tok/s

Public Beta (Free)

Try Sandbox API Docs

Gemma 4 26B IT

26B parameters · FP8

Planned

Next on the Inferway roadmap. Target specs from the current lab evaluation:

Gemma 4 26B IT · A4B MoE · FP8
Target TTFT p50 ~159 ms
Target throughput ~105 tok/s

Models & Pricing

Open weights, native precision, honest pricing.

Pay only for tokens, reconciled from metadata — never from your request content. Every model routes through the same dedicated endpoint, with more joining the catalog as we scale.

Live now

Gemma 4 12B IT

Flagship · FP8

Input / 1M tokens$0.05

Output / 1M tokens$0.10

Native FP8 — served at advertised precision, no downgrade under load

Transparent p50/p95/p99 latency metrics

~64ms typical TTFT

Zero prompts or completions retained

Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8

Route on OpenRouter

Coming Soon / Roadmap

Gemma 4 26B IT

A4B MoE · High-capacity

Input / 1M tokens$0.08

Output / 1M tokens$0.16

Mixture-of-Experts, deeper reasoning

Same direct-route guarantees

Advertised precision under load

Direct-to-architect support

Measured Jun 2026 · single RTX 5090 node · Gemma 4 26B MoE FP8

Route on OpenRouter

Start routing

Route to an endpoint that shows its work.

Inferway publishes live latency percentiles, stores zero prompts or completions, and discloses limits upfront. Point your OpenAI-compatible client at Inferway and see the difference on the first token.

Try on OpenRouter View live status