Router: provider/model registry, circuit breakers, rate limiting, budget tracking, and routing resolution.
Extracted from defaults.clj (provider/model metadata) and llm.clj (routing logic) to provide a single cohesive namespace for all routing concerns.
Router: provider/model registry, circuit breakers, rate limiting, budget tracking, and routing resolution. Extracted from defaults.clj (provider/model metadata) and llm.clj (routing logic) to provide a single cohesive namespace for all routing concerns.
(check-context-limit model messages)(check-context-limit model
messages
{:keys [output-reserve throw? context-limits]
:or {output-reserve DEFAULT_OUTPUT_RESERVE throw? false}})Checks if messages fit within model context limit.
Checks if messages fit within model context limit.
(context-limit model)(context-limit model context-limits)Returns provider-agnostic conservative context window size for a model.
Params:
model - String. Model name.
context-limits - Map, optional. Override map (merged defaults from config).
Returns: Integer. Conservative context tokens.
Returns provider-agnostic conservative context window size for a model. Params: `model` - String. Model name. `context-limits` - Map, optional. Override map (merged defaults from config). Returns: Integer. Conservative context tokens.
(count-and-estimate model messages output-text)(count-and-estimate model
messages
output-text
{:keys [pricing input-tokens api-usage api-style
cache-creation-ttl]
:as opts})Counts tokens and estimates cost in one call. Cost map separates uncached input, cached-read input, cache-write input, output, and total.
Counts tokens and estimates cost in one call. Cost map separates uncached input, cached-read input, cache-write input, output, and total.
(count-messages model messages)Counts tokens for a chat completion message array.
Counts tokens for a chat completion message array.
(count-tokens model text)Counts tokens for a given text string using the specified model's encoding.
Counts tokens for a given text string using the specified model's encoding.
Default idle-stream timeout (ms) for streaming HTTP responses. If no
SSE bytes arrive for this long the underlying InputStream is closed
and the call surfaces :svar.core/stream-idle-timeout. Distinct from
DEFAULT_TIMEOUT_MS (whole-request cap): the idle watchdog tolerates
arbitrarily long total durations as long as the stream keeps emitting
bytes (content deltas, SSE : ping keepalives, or even blank
separators — the watchdog resets on every .readLine, so it's
ping-aware for free).
2 minutes (120000 ms) is the considered sweet spot:
120_000 per-request, #959 ships 90s
default with ping-reset).ping interval (15-30 s) for safety.DEFAULT_TIMEOUT_MS doesn't
reliably fire on JDK 25 + HTTP/2 streaming.:idle-timeout-ms nil to disable.Disable per-call: (svar/ask-code! router {... :idle-timeout-ms nil}).
45 s default. Catches genuinely hung streams (no SSE bytes, no keepalive pings) in well under a minute while still allowing the provider to take up to ~40 s between tokens during extended reasoning. The original 120 s default let timeouts blow the whole per-task budget.
Default idle-stream timeout (ms) for streaming HTTP responses. If no
SSE bytes arrive for this long the underlying `InputStream` is closed
and the call surfaces `:svar.core/stream-idle-timeout`. Distinct from
`DEFAULT_TIMEOUT_MS` (whole-request cap): the idle watchdog tolerates
arbitrarily long total durations as long as the stream keeps emitting
bytes (content deltas, SSE `: ping` keepalives, or even blank
separators — the watchdog resets on every `.readLine`, so it's
ping-aware for free).
2 minutes (120000 ms) is the considered sweet spot:
- Matches Anthropic's own SDK proposal (anthropics/anthropic-sdk-
typescript#867 suggests `120_000` per-request, #959 ships 90s
default with ping-reset).
- ~4× Anthropic's published `ping` interval (15-30 s) for safety.
- Catches real hangs (e.g. z.ai glm streams that simply stop
sending body frames after headers) in 2 minutes instead of
forever — the original 5-minute `DEFAULT_TIMEOUT_MS` doesn't
reliably fire on JDK 25 + HTTP/2 streaming.
- Anthropic's documented worst case for legitimate extended
thinking on Opus 4.5 is ~185 s with zero events (see
anthropics/claude-agent-sdk-typescript#44). Callers running
extended-thinking workloads should bump this to 240-300 s, or
pass `:idle-timeout-ms nil` to disable.
Disable per-call: `(svar/ask-code! router {... :idle-timeout-ms nil})`.
45 s default. Catches genuinely hung streams (no SSE bytes, no
keepalive pings) in well under a minute while still allowing the
provider to take up to ~40 s between tokens during extended
reasoning. The original 120 s default let timeouts blow the whole
per-task budget.Default number of tokens to reserve for model output. 0 means no reservation — let the API handle overflow naturally.
Default number of tokens to reserve for model output. 0 means no reservation — let the API handle overflow naturally.
Default router-owned 429 policy.
:same-provider-delays-ms — sleep schedule for same-provider retries.
:fallback-after-ms — hard cap on wall time the 429 phase can
consume (measured from the FIRST 429
caught). When the configured delay vector
is exhausted OR elapsed ≥ budget, the
router falls back to the next provider.
Each delay clamps to remaining budget so
the loop never overshoots.
:respect-retry-after? — honor server Retry-After header value
in place of the configured delay; the
clamp to remaining budget still applies.
:fallback-provider? — when budget is exhausted, fall back to
the next provider/model. Set false to
surface the rate-limit error to the
caller instead.
The 60 000 ms (60 s) default tolerates Anthropic / OpenAI / z.ai Retry-After values that can land between 30-60 s under quota pressure on reasoning-heavy workloads, while still bounding the wait so a single user request cannot hang for minutes.
Default router-owned 429 policy.
`:same-provider-delays-ms` — sleep schedule for same-provider retries.
`:fallback-after-ms` — hard cap on wall time the 429 phase can
consume (measured from the FIRST 429
caught). When the configured delay vector
is exhausted OR elapsed ≥ budget, the
router falls back to the next provider.
Each delay clamps to remaining budget so
the loop never overshoots.
`:respect-retry-after?` — honor server `Retry-After` header value
in place of the configured delay; the
clamp to remaining budget still applies.
`:fallback-provider?` — when budget is exhausted, fall back to
the next provider/model. Set false to
surface the rate-limit error to the
caller instead.
The 60 000 ms (60 s) default tolerates Anthropic / OpenAI / z.ai
Retry-After values that can land between 30-60 s under quota
pressure on reasoning-heavy workloads, while still bounding the
wait so a single user request cannot hang for minutes.Default retry policy for transient HTTP errors.
Default retry policy for transient HTTP errors.
Default semantic-stream timeout (ms) for streaming HTTP responses.
Nil by default: disabled unless caller opts in. If enabled and bytes
keep arriving (SSE pings/comments) but no model/progress event arrives
for this long, the stream is closed and surfaced as
:svar.core/stream-semantic-timeout.
Distinct from DEFAULT_IDLE_TIMEOUT_MS: idle watches transport
liveness; semantic watches model progress. Enable per-call with e.g.
:semantic-timeout-ms 180000.
Default semantic-stream timeout (ms) for streaming HTTP responses. Nil by default: disabled unless caller opts in. If enabled and bytes keep arriving (SSE pings/comments) but no model/progress event arrives for this long, the stream is closed and surfaced as `:svar.core/stream-semantic-timeout`. Distinct from `DEFAULT_IDLE_TIMEOUT_MS`: idle watches transport liveness; semantic watches model progress. Enable per-call with e.g. `:semantic-timeout-ms 180000`.
Default HTTP request timeout in milliseconds (5 minutes). Reasoning models (e.g. glm-5-turbo) may need extended time for chain-of-thought.
Default HTTP request timeout in milliseconds (5 minutes). Reasoning models (e.g. glm-5-turbo) may need extended time for chain-of-thought.
Default time-to-first-token timeout (ms) for streaming HTTP responses.
Bounds the wait between sending the HTTP request and receiving response
headers. On fire, raises :svar.core/stream-ttft-timeout and the caller
thread's interrupt unparks the underlying CompletableFuture.get.
30 s default. Tight enough to surface stuck provider connections
inside one iteration (the original 90 s default sometimes wasted a
whole autoresearch iter waiting for headers), generous enough for
real reasoning cold starts — z.ai glm-5.1 has been observed sending
first headers between 8 and 22 s. Disable per-call with
:ttft-timeout-ms nil; pass a larger value for slow reasoning
models with long pre-stream queues.
Default time-to-first-token timeout (ms) for streaming HTTP responses. Bounds the wait between sending the HTTP request and receiving response headers. On fire, raises `:svar.core/stream-ttft-timeout` and the caller thread's interrupt unparks the underlying `CompletableFuture.get`. 30 s default. Tight enough to surface stuck provider connections inside one iteration (the original 90 s default sometimes wasted a whole autoresearch iter waiting for headers), generous enough for real reasoning cold starts — z.ai glm-5.1 has been observed sending first headers between 8 and 22 s. Disable per-call with `:ttft-timeout-ms nil`; pass a larger value for slow reasoning models with long pre-stream queues.
(estimate-cost model input-tokens output-tokens)(estimate-cost model input-tokens output-tokens pricing-map)(estimate-cost model input-tokens output-tokens pricing-map opts)Estimates USD cost with separate uncached input, cached input, cache creation, and output components. Rates are USD per 1M tokens.
input-tokens normally means provider prompt/input tokens. For OpenAI,
Z.ai, Gemini, and OpenRouter this includes cached-read tokens, so cached
reads are subtracted before normal input cost. For Anthropic normalized
usage, input_tokens excludes cache read/write tokens; pass
{:cache-tokens-in-input? false} to keep base input intact.
Estimates USD cost with separate uncached input, cached input, cache
creation, and output components. Rates are USD per 1M tokens.
`input-tokens` normally means provider prompt/input tokens. For OpenAI,
Z.ai, Gemini, and OpenRouter this includes cached-read tokens, so cached
reads are subtracted before normal input cost. For Anthropic normalized
usage, `input_tokens` excludes cache read/write tokens; pass
`{:cache-tokens-in-input? false}` to keep base input intact.(format-cost cost)Formats a cost value as a human-readable USD string.
Formats a cost value as a human-readable USD string.
(infer-model-metadata {:keys [name] :as model-map})Returns provider-independent model metadata. Looks up KNOWN_MODEL_METADATA first. Falls back to regex inference for unknown models. Explicit fields in model-map override inferred values.
Returns provider-independent model metadata. Looks up KNOWN_MODEL_METADATA first. Falls back to regex inference for unknown models. Explicit fields in model-map override inferred values.
Per-model static metadata. :reasoning? flags a model whose provider
accepts a reasoning-depth parameter. :reasoning-style (optional) pins the
wire shape to emit — see REASONING_LEVELS keys. When omitted, the style
is inferred from the provider's :api-style (:anthropic → anthropic
thinking, everything else → openai-effort).
Per-model static metadata. `:reasoning?` flags a model whose provider accepts a reasoning-depth parameter. `:reasoning-style` (optional) pins the wire shape to emit — see `REASONING_LEVELS` keys. When omitted, the style is inferred from the provider's `:api-style` (`:anthropic` → anthropic thinking, everything else → openai-effort).
(make-router providers)(make-router providers opts)Creates a router from a vector of provider maps.
Vector order = priority (first provider is highest priority). First model in provider vector = root model. Provider :base-url auto-resolved from KNOWN_PROVIDERS for known IDs. Model metadata auto-inferred from :name and merged with provider-scoped pricing/context. Duplicate provider :id values are a hard error.
opts - Optional map:
:network - {:timeout-ms N :max-retries N ...} router-level network defaults
:tokens - {:check-context? bool :pricing {} :context-limits {}} token defaults
:budget - {:max-tokens N :max-cost N} spend limits (nil = no limit)
:rate-limit - {:same-provider-delays-ms [...] :fallback-after-ms N ...}
:failure-threshold - Int. Failures before circuit opens (default: 5)
:recovery-ms - Int. Ms before open→half-open (default: 60000)
Example: (make-router [{:id :blockether :api-key <key> :models [{:name <model-a>} {:name <model-b>}]} {:id :openai :api-key <key> :models [{:name <model-a>} {:name <model-b>}]}] {:budget {:max-tokens 1000000 :max-cost 5.0}})
Creates a router from a vector of provider maps.
Vector order = priority (first provider is highest priority).
First model in provider vector = root model.
Provider :base-url auto-resolved from KNOWN_PROVIDERS for known IDs.
Model metadata auto-inferred from :name and merged with provider-scoped pricing/context.
Duplicate provider :id values are a hard error.
`opts` - Optional map:
:network - {:timeout-ms N :max-retries N ...} router-level network defaults
:tokens - {:check-context? bool :pricing {} :context-limits {}} token defaults
:budget - {:max-tokens N :max-cost N} spend limits (nil = no limit)
:rate-limit - {:same-provider-delays-ms [...] :fallback-after-ms N ...}
:failure-threshold - Int. Failures before circuit opens (default: 5)
:recovery-ms - Int. Ms before open→half-open (default: 60000)
Example:
(make-router [{:id :blockether :api-key <key>
:models [{:name <model-a>} {:name <model-b>}]}
{:id :openai :api-key <key>
:models [{:name <model-a>} {:name <model-b>}]}]
{:budget {:max-tokens 1000000 :max-cost 5.0}})(max-input-tokens model)(max-input-tokens model {:keys [output-reserve trim-ratio context-limits]})Calculates maximum input tokens for a model, reserving space for output.
Calculates maximum input tokens for a model, reserving space for output.
Best-effort flattened model context limits for legacy token utilities. When a model exists on multiple providers with different contexts, the most conservative context is used. Provider-aware code should use provider-model-context instead.
Best-effort flattened model context limits for legacy token utilities. When a model exists on multiple providers with different contexts, the most conservative context is used. Provider-aware code should use provider-model-context instead.
Best-effort flattened model pricing for legacy token utilities. When a model exists on multiple providers, the lowest total pricing is chosen. Provider-aware code should NOT use this — use provider-model-pricing instead.
Best-effort flattened model pricing for legacy token utilities. When a model exists on multiple providers, the lowest total pricing is chosen. Provider-aware code should NOT use this — use provider-model-pricing instead.
(normalize-model model-map)Normalizes a model entry: {:name "gpt-4o"} -> full provider-independent model metadata.
Normalizes a model entry: {:name "gpt-4o"} -> full provider-independent model metadata.
(normalize-provider idx provider-map)Normalizes a provider entry:
Normalizes a provider entry: - resolves :base-url from KNOWN_PROVIDERS if not provided - derives :priority from vector index - derives :root from first model - merges provider-independent model metadata with provider-scoped pricing/context
(normalize-reasoning-level v)Coerce any accepted spelling to a canonical :quick|:balanced|:deep keyword. Accepts:
:reasoning_effort migrations don't break).
Returns nil for unknown input.Coerce any accepted spelling to a canonical :quick|:balanced|:deep keyword.
Accepts:
- :quick / :balanced / :deep (keywords, case-insensitive)
- "quick" / "balanced" / "deep" (strings, case-insensitive)
- OpenAI-style aliases :low→:quick, :medium→:balanced, :high→:deep
(so `:reasoning_effort` migrations don't break).
Returns nil for unknown input.(provider-excluded-model? provider-id model-name)True when a provider-scoped catalog marks a model unavailable.
Provider config may add :exclude-models as exact model names and/or
:min-gpt-version such as [5 3] to hide older GPT family models.
True when a provider-scoped catalog marks a model unavailable. Provider config may add `:exclude-models` as exact model names and/or `:min-gpt-version` such as [5 3] to hide older GPT family models.
(provider-model-context provider-id model-name)Returns provider-scoped context window for provider/model, falling back to flattened MODEL_CONTEXT_LIMITS.
Returns provider-scoped context window for provider/model, falling back to flattened MODEL_CONTEXT_LIMITS.
(provider-model-entry provider-id model-name)Returns provider-scoped entry for a provider/model, or nil if excluded.
Composition:
:pricing-source if declared, else :id.:api-style, :reasoning-style, :json-object-mode?,
:extra-body, plus any pricing/context overrides).Overlay wins on conflicts. Pricing maps deep-merge so an overlay
can override a single rate without dropping :cache-read /
:cache-write from the catalog.
Returns provider-scoped entry for a provider/model, or nil if excluded.
Composition:
1. Catalog entry from models.dev (pricing, context, modalities,
capability flags, family, knowledge cutoff, release dates) —
looked up under `:pricing-source` if declared, else `:id`.
2. svar overlay from KNOWN_PROVIDER_MODELS (wire/policy keys:
`:api-style`, `:reasoning-style`, `:json-object-mode?`,
`:extra-body`, plus any pricing/context overrides).
Overlay wins on conflicts. Pricing maps deep-merge so an overlay
can override a single rate without dropping `:cache-read` /
`:cache-write` from the catalog.(provider-model-pricing provider-id model-name)Returns provider-scoped pricing for provider/model, falling back to flattened MODEL_PRICING.
Returns provider-scoped pricing for provider/model, falling back to flattened MODEL_PRICING.
(provider-model-visible? provider-id model-name)True when provider-scoped model filters allow model-name.
True when provider-scoped model filters allow `model-name`.
(reasoning-extra-body api-style model-map level)(reasoning-extra-body api-style model-map level {:keys [preserved-thinking?]})Translates an abstract reasoning level into provider-specific extra-body. Returns nil when:
level is nil / unknown:reasoning?Dispatches on the model's :reasoning-style first (explicit pin), falling
back to inference from api-style when the model doesn't declare one.
Callers pass the returned map through merge into their extra-body; silent nil keeps non-reasoning models untouched.
Four-arity form takes an opts map:
:preserved-thinking? — Z.ai-only. Emits clear_thinking: false inside
the :thinking block, asking the server to retain reasoning_content
from previous assistant turns (Preserved Thinking, GLM-5 / GLM-4.7).
Callers using this MUST echo the complete, unmodified reasoning_content
back to the API in subsequent assistant turns, otherwise cache hit
rates and model quality degrade. No-op on non-z.ai reasoning styles
and on the Coding Plan endpoint (which has preserved thinking on
server-side by default, but setting the flag explicitly is harmless).
Translates an abstract reasoning level into provider-specific extra-body.
Returns nil when:
- `level` is nil / unknown
- the selected model is not flagged `:reasoning?`
- the reasoning-style has no mapping in REASONING_LEVELS.
Dispatches on the model's `:reasoning-style` first (explicit pin), falling
back to inference from `api-style` when the model doesn't declare one.
Callers pass the returned map through merge into their extra-body; silent
nil keeps non-reasoning models untouched.
Four-arity form takes an opts map:
`:preserved-thinking?` — Z.ai-only. Emits `clear_thinking: false` inside
the `:thinking` block, asking the server to retain reasoning_content
from previous assistant turns (Preserved Thinking, GLM-5 / GLM-4.7).
Callers using this MUST echo the complete, unmodified reasoning_content
back to the API in subsequent assistant turns, otherwise cache hit
rates and model quality degrade. No-op on non-z.ai reasoning styles
and on the Coding Plan endpoint (which has preserved thinking on
server-side by default, but setting the flag explicitly is harmless).Abstract reasoning levels translated per reasoning-style. Vocabulary is intentionally provider-neutral — callers pass :quick|:balanced|:deep and svar picks the right on-the-wire shape for the selected model.
Sub-key semantics:
:openai-effort → flat top-level :reasoning_effort string.
Used by GPT-5.x, o-series, Gemini 2.5 via OpenAI gateway,
DeepSeek Reasoner, Copilot, and most OpenAI-compatible reasoners.
:anthropic-thinking → Claude thinking controls.
Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 use
adaptive thinking + output_config.effort. Older
Claude 4 models use manual budget_tokens.
:zai-thinking → binary :thinking {:type "enabled"|"disabled"} on
Z.ai / GLM-4.6+. No budget_tokens — thinking is on/off.
:quick disables, :balanced/:deep enable.
See also :preserved-thinking? below for the
clear_thinking: false flag that keeps reasoning
across assistant turns.
Abstract reasoning levels translated per reasoning-style.
Vocabulary is intentionally provider-neutral — callers pass :quick|:balanced|:deep
and svar picks the right on-the-wire shape for the selected model.
Sub-key semantics:
`:openai-effort` → flat top-level `:reasoning_effort` string.
Used by GPT-5.x, o-series, Gemini 2.5 via OpenAI gateway,
DeepSeek Reasoner, Copilot, and most OpenAI-compatible reasoners.
`:anthropic-thinking` → Claude thinking controls.
Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 use
adaptive thinking + output_config.effort. Older
Claude 4 models use manual budget_tokens.
`:zai-thinking` → binary `:thinking {:type "enabled"|"disabled"}` on
Z.ai / GLM-4.6+. No budget_tokens — thinking is on/off.
`:quick` disables, `:balanced`/`:deep` enable.
See also `:preserved-thinking?` below for the
`clear_thinking: false` flag that keeps reasoning
across assistant turns.(reset-budget! router)Resets the router's token/cost budget counters to zero.
Resets the router's token/cost budget counters to zero.
(reset-provider! router provider-id)Manually resets a provider's circuit breaker to :closed.
Manually resets a provider's circuit breaker to :closed.
(resolve-effective-model router)(resolve-effective-model router overrides)Resolves the effective routed model from the router, optionally applying routing overrides.
Returns a model descriptor map (or nil when no provider is available): {:name :reasoning? :provider :api-style :pricing :context :intelligence :speed ...}
overrides (optional map):
:optimize — :cost | :speed | :intelligence (translated to :prefer)
:provider — explicit provider id keyword
:model — explicit model name string
:reasoning — reasoning level keyword (implies reasoning-capable model)
(resolve-effective-model router) ;; root model (resolve-effective-model router {:optimize :cost}) ;; cheapest (resolve-effective-model router {:optimize :intelligence}) ;; frontier (resolve-effective-model router {:provider :openai :model "gpt-5-mini"}) ;; exact
Resolves the effective routed model from the router, optionally applying
routing overrides.
Returns a model descriptor map (or nil when no provider is available):
{:name :reasoning? :provider :api-style :pricing :context :intelligence :speed ...}
`overrides` (optional map):
:optimize — :cost | :speed | :intelligence (translated to :prefer)
:provider — explicit provider id keyword
:model — explicit model name string
:reasoning — reasoning level keyword (implies reasoning-capable model)
(resolve-effective-model router) ;; root model
(resolve-effective-model router {:optimize :cost}) ;; cheapest
(resolve-effective-model router {:optimize :intelligence}) ;; frontier
(resolve-effective-model router {:provider :openai :model "gpt-5-mini"}) ;; exact(resolve-routing router routing-opts)Resolves :routing opts to prefs for with-provider-fallback. Returns {:prefs prefs-map :error-strategy kw}. Throws on invalid provider/model combinations.
:reasoning in the routing opts (abstract level — :quick/:balanced/:deep
or strings/aliases) implies :require-reasoning? true in prefs, which
filters model selection to :reasoning? true models in resolve-model.
This makes {:optimize :cost :reasoning :deep} pick the cheapest
reasoning-capable model rather than silently dropping :deep when the
cost-cheapest model happens to be non-reasoning.
Resolves :routing opts to prefs for with-provider-fallback.
Returns {:prefs prefs-map :error-strategy kw}.
Throws on invalid provider/model combinations.
`:reasoning` in the routing opts (abstract level — :quick/:balanced/:deep
or strings/aliases) implies `:require-reasoning? true` in prefs, which
filters model selection to `:reasoning? true` models in `resolve-model`.
This makes `{:optimize :cost :reasoning :deep}` pick the cheapest
*reasoning-capable* model rather than silently dropping `:deep` when the
cost-cheapest model happens to be non-reasoning.(router-stats router)Returns cumulative + windowed stats for the router.
Returns cumulative + windowed stats for the router.
(select-provider router prefs)Returns [provider model-map] or nil. Read-only.
Cross-provider ranking: models are scored by :prefer first, provider
priority second. So :optimize :intelligence picks the frontier model
across the whole fleet; ties are broken by provider vector order.
Returns [provider model-map] or nil. Read-only. Cross-provider ranking: models are scored by `:prefer` first, provider priority second. So `:optimize :intelligence` picks the frontier model across the whole fleet; ties are broken by provider vector order.
(truncate-text model text max-tokens)(truncate-text model
text
max-tokens
{:keys [truncation-marker from] :or {from :end}})Truncates text to fit within a token limit. Uses proper tokenization to ensure accurate truncation.
Truncates text to fit within a token limit. Uses proper tokenization to ensure accurate truncation.
cljdoc builds & hosts documentation for Clojure/Script libraries
| Ctrl+k | Jump to recent docs |
| ← | Move to previous article |
| → | Move to next article |
| Ctrl+/ | Jump to the search field |