Routing & Resilience

Your app calls "smart". pLLM picks the model.

One route slug, many providers. Real-time latency-aware selection, silent failover, and self-healing circuit breakers — so outages, spikes, and cost tiers are a config change, not a deploy.

Live · acme-corp/production · last 30d

12.4M   requests routed
0.82ms  route-decision p95 (gateway overhead)
3.2k    failovers triggered (silent recovery)
99.8%   auto-recovery (no human paged)
29%     cost savings vs single-provider
The decision

What happens when a request hits "smart".

pllm-router · live
strategy: least-latency
01 incoming
POST /v1/chat/completions model: "smart"
02 resolve route · evaluate candidates
3 healthy · 0 degraded · 0 failed
model                           p95    errors   weight
gpt-5 · openai                  42ms   0.01%    60%
claude-4.6-sonnet · anthropic   58ms   0.02%    30%
gemini-2.5-pro · google         71ms   0.08%    10%
03 winner · dispatch
gpt-5 @ openai / us-east-2 / instance #3 · decided in 0.82ms

42ms   winner p95
<1ms   router overhead
0      lines of retry code in your app
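The resolve-and-pick step can be sketched in a few lines. Names like `Candidate` and `least_latency` are illustrative, not pLLM's internals; the real router also weighs error rates and circuit state:

```python
# Minimal sketch of a least-latency pick over healthy candidates.
from dataclasses import dataclass

@dataclass
class Candidate:
    model: str
    provider: str
    p95_ms: float    # rolling p95 latency (e.g. an EMA fed from Redis)
    healthy: bool

def least_latency(candidates: list[Candidate]) -> Candidate:
    """Return the fastest healthy backend; raise if none remain."""
    healthy = [c for c in candidates if c.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends — escalate to fallback chain")
    return min(healthy, key=lambda c: c.p95_ms)

candidates = [
    Candidate("gpt-5", "openai", 42, True),
    Candidate("claude-4.6-sonnet", "anthropic", 58, True),
    Candidate("gemini-2.5-pro", "google", 71, True),
]
winner = least_latency(candidates)  # gpt-5 at 42ms
```

The pick itself is a filter plus a `min` — which is why the decision fits comfortably under a millisecond of gateway overhead.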
Decision guide

Which strategy fits your traffic?

Each route picks a strategy. Strategies run at request time using real-time metrics, not static config.

Strategy        Slug                   How it picks                        State                      Best for
Least Latency   least-latency          Fastest p95 across healthy nodes    Distributed EMA via Redis  Latency-sensitive apps · chat UIs · real-time agents
Weighted RR     weighted-round-robin   Smooth proportional rotation        In-memory counters         Capacity-based distribution · multi-deployment setups
Priority        priority               Highest-priority healthy backend    Static ordering            Cost tiers · preferred provider · failover chains
Random          random                 Uniform random across healthy       Stateless                  All-equal providers · stateless gateway nodes
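As an example of what "smooth proportional rotation" means for weighted-round-robin, here is a sketch of the classic smooth-WRR algorithm (the one nginx popularized). This is an assumption about the technique, not pLLM's actual counter implementation:

```python
# Smooth weighted round-robin: each pick, every backend gains its weight;
# the highest current score wins and pays back the total. Traffic matches
# the weight ratios without bursting to one backend.
class SmoothWeightedRR:
    def __init__(self, weights: dict[str, int]):
        self.weights = weights
        self.current = {name: 0 for name in weights}

    def pick(self) -> str:
        total = sum(self.weights.values())
        for name, w in self.weights.items():
            self.current[name] += w
        winner = max(self.current, key=self.current.get)
        self.current[winner] -= total
        return winner

rr = SmoothWeightedRR({"gpt-5": 60, "claude-4.6": 30, "gemini-2.5-pro": 10})
picks = [rr.pick() for _ in range(10)]
# 10 picks → 6× gpt-5, 3× claude-4.6, 1× gemini-2.5-pro, interleaved
```

The "In-memory counters" state in the table is exactly this `current` dict — no shared storage needed, which is why weighted RR is cheaper to run than least-latency.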
Configuration

Two steps. Admin API + standard SDK.

1. Define a route

admin API · no restart
http
# A route named "smart" — your app just calls model: "smart".
# pLLM picks the best backend automatically.

POST /api/admin/routes
{
  "name": "Smart",
  "slug": "smart",
  "strategy": "least-latency",
  "models": [
    { "model_name": "gpt-5",          "weight": 60, "priority": 100 },
    { "model_name": "claude-4.6",     "weight": 30, "priority":  80 },
    { "model_name": "gemini-2.5-pro", "weight": 10, "priority":  60 }
  ],
  "fallback_models": ["gpt-4o-mini", "claude-haiku"]
}
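The same request can be sent from Python with only the standard library. The base URL, `/api/admin/routes` path, and bearer-token auth here are assumptions carried over from the request above — check your deployment's Admin API docs:

```python
# Sketch: create the "smart" route programmatically. Takes effect
# immediately — no gateway restart.
import json
import urllib.request

route = {
    "name": "Smart",
    "slug": "smart",
    "strategy": "least-latency",
    "models": [
        {"model_name": "gpt-5", "weight": 60, "priority": 100},
        {"model_name": "claude-4.6", "weight": 30, "priority": 80},
        {"model_name": "gemini-2.5-pro", "weight": 10, "priority": 60},
    ],
    "fallback_models": ["gpt-4o-mini", "claude-haiku"],
}

req = urllib.request.Request(
    "https://pllm.company.com/api/admin/routes",
    data=json.dumps(route).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <admin-key>",  # placeholder credential
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually create the route
```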

2. Use it in your app

OpenAI SDK · no change
python
from openai import OpenAI

client = OpenAI(
    base_url="https://pllm.company.com/v1",
    api_key="sk-..."
)

# Call the route slug — not a specific model.
# pLLM picks the best backend in real-time.
response = client.chat.completions.create(
    model="smart",                        # pLLM route
    messages=[{"role": "user", "content": "Analyze this data"}],
    stream=True,
)

# If gpt-5 is slow      → routes to claude-4.6
# If claude is down     → circuit opens, fails over to gemini-2.5
# If all primaries fail → fallback chain (gpt-4o-mini, claude-haiku)
# Your app never knows. Zero code changes.
Resilience

When things break, your app doesn't.

Three escalating layers of failover, plus a self-healing circuit breaker on every provider.

Failover ladder

Three layers, in order.

1
Instance retry

If an instance fails, pLLM tries another instance of the same model with 1.5× increasing timeouts.

2
Model failover

If all instances of a model fail, the route's strategy picks the next model in its list.

3
Fallback chain

If every model in the route is exhausted, pLLM walks the fallback_models chain as a last resort.

Each retry uses 1.5× increasing timeout. Up to 10 failover hops with loop detection.
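The three layers compose into one nested loop. A minimal sketch, assuming providers are plain callables that raise on failure (real pLLM also consults per-instance health and circuit state):

```python
# Failover ladder: instances → models → fallback chain,
# with 1.5× timeout growth per retry and a hard hop cap.
BASE_TIMEOUT_S = 10.0
MAX_HOPS = 10

def dispatch(route: list[list], fallbacks: list, request) -> object:
    """route: list of models, each a list of instance callables.
    fallbacks: single-instance models tried only when the route is exhausted."""
    hops = 0
    timeout = BASE_TIMEOUT_S
    tiers = route + [[f] for f in fallbacks]   # layer 3: fallback chain
    for instances in tiers:                    # layer 2: model failover
        for call in instances:                 # layer 1: instance retry
            if hops >= MAX_HOPS:
                raise RuntimeError("max failover hops exceeded")
            hops += 1
            try:
                return call(request, timeout=timeout)
            except Exception:
                timeout *= 1.5                 # 1.5× increasing timeout
    raise RuntimeError("all models and fallbacks exhausted")
```

A request that burns through two dead instances of the primary model and lands on a fallback uses three hops — well under the cap of ten.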

Circuit breaker

Self-healing, no paging.

CLOSED · HEALTHY (normal)
All traffic flows. Failure counter active.
    ↓ 3 consecutive failures
OPEN · UNHEALTHY (removed)
Traffic blocked. Provider pulled from rotation. 30s cooldown.
    ↓ cooldown elapsed
HALF-OPEN · TESTING (probe)
One probe request. Success → closed. Failure → open.
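The state machine above fits in a few dozen lines. A sketch with the thresholds from this section (3 failures, 30s cooldown); the injectable clock is an illustration device, not pLLM's API:

```python
# Self-healing circuit breaker: closed → open → half-open → closed.
import time

class CircuitBreaker:
    FAILURE_THRESHOLD = 3
    COOLDOWN_S = 30.0

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.COOLDOWN_S:
                self.state = "half-open"   # let exactly one probe through
                return True
            return False                   # provider stays out of rotation
        return True                        # closed or half-open

    def record_success(self) -> None:
        self.state = "closed"              # probe succeeded → full recovery
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._trip()                   # failed probe → reopen immediately
            return
        self.failures += 1
        if self.failures >= self.FAILURE_THRESHOLD:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.failures = 0
        self.opened_at = self.clock()
```

Because recovery is a successful probe rather than an operator action, a transient provider outage resolves itself within one cooldown window — the "<30s automatic recovery, no human paged" numbers above.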

0
Lines of retry code in your app
<30s
Automatic recovery window
10
Max failover hops per request