The Deflection Stack
Every incoming request passes through a sequence of progressively heavier local processing layers. Only prompts that require genuine, complex reasoning survive the Deflection Stack and reach the cloud.
Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic

- L1a hit → Response
- L1b hit → Response
- L2 simple intent → Local Response
- L2.5 → Optimised (compressed) prompt forwarded to L3
- L3 → Cloud Response
Layers at a Glance
| Layer | Algorithm / Mechanism | What It Does | Typical Latency |
|---|---|---|---|
| L1a — Exact Cache | Fast Hashing (ahash) | Sub-millisecond duplicate detection. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Cosine Similarity (Embeddings) | Computes mathematical meaning via pure-Rust candle models (all-MiniLM-L6-v2) to catch variations ("Price?" ≈ "Cost?"). | 1–5 ms |
| L2 — SLM Router | Neural Classification (LLM) | Triages intent using an embedded Small Language Model (e.g. Qwen-1.5B) to resolve simple data extraction tasks. | 50–200 ms |
| L2.5 — Context Optimiser | Instruction Dedup + Minify | Compresses repeated instruction files (CLAUDE.md, copilot-instructions.md) via session dedup and static minification to reduce cloud input tokens. | < 1 ms |
| L3 — Cloud Logic | Provider Chain + Retries | Routes surviving complex prompts to 23+ providers (OpenAI, Anthropic, Azure, Gemini, etc.) with ordered fallback chain, per-provider retry budgets, multi-key rotation, and quota enforcement. | Network-bound |
Layers 1a and 1b deflect 71% of repetitive agentic traffic (FAQ/agent loop patterns) and 38% of diverse task traffic before any neural inference runs.
Layer Details
L1a — Exact Cache
Algorithm: Fast hashing with ahash
L1a is the first line of defence. It computes a hash of the incoming prompt and checks it against an in-memory LRU cache (single-binary mode) or a shared Redis cluster (enterprise mode).
- Hit: Returns the cached response immediately (sub-millisecond).
- Miss: The request continues to L1b.
Cache keys are namespaced before hashing (`native|prompt`, `openai|prompt`, `anthropic|prompt`, etc.) to ensure one endpoint never returns another endpoint's response schema. On a cache hit, `ChatResponse.layer` is normalised to 1 regardless of which layer originally produced the response.
| Mode | Implementation |
|---|---|
| Minimalist | In-memory LRU (ahash + parking_lot) |
| Enterprise | Redis cluster (shared across replicas, async redis crate) |
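A minimal sketch of the L1a lookup path in minimalist mode, assuming a hypothetical `exact_cache_key` helper and the `lru` crate for the in-memory store; the real cache module's names and types may differ:

```rust
use std::hash::{Hash, Hasher};

/// Namespaced key scheme: prefix the prompt with the endpoint family before
/// hashing with ahash, so schemas from different endpoints never collide.
fn exact_cache_key(namespace: &str, prompt: &str) -> u64 {
    let mut hasher = ahash::AHasher::default();
    format!("{namespace}|{prompt}").hash(&mut hasher);
    hasher.finish()
}

/// Sub-millisecond lookup against an in-memory LRU guarded by parking_lot.
fn l1a_lookup(
    cache: &parking_lot::Mutex<lru::LruCache<u64, String>>,
    namespace: &str,
    prompt: &str,
) -> Option<String> {
    cache.lock().get(&exact_cache_key(namespace, prompt)).cloned()
}
```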
L1b — Semantic Cache
Algorithm: Cosine similarity over sentence embeddings (all-MiniLM-L6-v2)
L1b catches semantically equivalent prompts that differ in wording. A sentence embedding is computed for the incoming prompt using a pure-Rust candle BertModel, then compared against the vector cache using cosine similarity.
- Hit (similarity above threshold): Returns the cached response (1–5 ms).
- Miss: The request continues to L2.
Embedding pipeline:
- Model: `sentence-transformers/all-MiniLM-L6-v2` — 384-dimensional embeddings (~90 MB).
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: `BertModel` is wrapped in `std::sync::Mutex`; inference runs on `tokio::task::spawn_blocking`.
- Architecture: `TextEmbedder` is initialised once at startup, stored as `Arc<TextEmbedder>` in `AppState`.
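The pooling step can be illustrated without the candle tensor API. The sketch below works on plain slices; the real `TextEmbedder` operates on candle tensors and batched inputs:

```rust
/// Mean pooling over token embeddings weighted by the attention mask,
/// followed by L2 normalisation (so cosine similarity becomes a dot product).
fn mean_pool_and_normalise(token_embeddings: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let dim = token_embeddings.first().map_or(0, |t| t.len());
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (token, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 1 {
            for (p, v) in pooled.iter_mut().zip(token) {
                *p += *v;
            }
            count += 1.0;
        }
    }
    if count > 0.0 {
        pooled.iter_mut().for_each(|p| *p /= count);
    }
    // L2 normalisation.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        pooled.iter_mut().for_each(|p| *p /= norm);
    }
    pooled
}
```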
The vector cache is maintained in tandem with exact cache entries. Insertions and evictions update the index automatically, providing sub-millisecond vector search latency for thousands of embeddings.
| Mode | Implementation |
|---|---|
| Minimalist | In-process candle BertModel |
| Enterprise | External TEI sidecar (optional) |
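The hit test itself is a cosine comparison against the cached vectors. A sketch, with `best_match` as a hypothetical helper (the similarity threshold is configuration-dependent and not shown here):

```rust
/// Cosine similarity; for already L2-normalised embeddings this is just the
/// dot product, but the general form is shown for clarity.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Returns the cached response with the highest similarity above the threshold.
fn best_match<'a>(query: &[f32], cache: &'a [(Vec<f32>, String)], threshold: f32) -> Option<&'a str> {
    cache
        .iter()
        .map(|(vector, response)| (cosine_similarity(query, vector), response))
        .filter(|(score, _)| *score >= threshold)
        .max_by(|(a, _), (b, _)| a.total_cmp(b))
        .map(|(_, response)| response.as_str())
}
```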
L2 — SLM Router
Algorithm: Neural classification via Small Language Model
L2 runs a lightweight language model to classify the prompt's intent. Simple requests (data extraction, FAQ-style queries) can be resolved locally without reaching the cloud.
- Simple intent: Returns a locally generated response (50–200 ms).
- Complex intent: The request continues to L2.5.
- Disabled (`enable_slm_router = false`): The layer is a no-op; the request falls through to L3.
Two-phase execution (classify then generate):
Classification uses `local_slm_url` + `local_slm_model` — a CPU-friendly Ollama endpoint that is always-on and lightweight. The classifier input is built by `extract_classifier_context()`, which includes both the system prompt and the last user message so agentic tasks with short user turns and large system prompts are correctly identified as complex.
Answer generation uses `layer2.sidecar_url` + `layer2.model_name` — the heavier GPU sidecar, only invoked when the classifier returns a deflectable (SIMPLE / TEMPLATE / SNIPPET) result.
Both calls respect `layer2.timeout_seconds`; a timeout triggers a clean fallthrough to L3.
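A sketch of the two-phase control flow, with `classify` and `generate` standing in for the real HTTP clients and `Intent` for the classifier's output format (both are assumptions, not the actual types):

```rust
use std::{future::Future, time::Duration};

/// Intent labels mirroring the classifier outcomes described above.
enum Intent { Simple, Template, Snippet, Complex }

enum L2Outcome { Deflected(String), FallThrough }

/// Two-phase flow: a cheap classification call first, then the heavier
/// generation sidecar only for deflectable intents. Both share one timeout.
async fn run_l2(
    timeout: Duration,
    classify: impl Future<Output = Option<Intent>>,
    generate: impl Future<Output = Option<String>>,
) -> L2Outcome {
    // Phase 1: classification via the always-on, CPU-friendly endpoint.
    let intent = match tokio::time::timeout(timeout, classify).await {
        Ok(Some(intent)) => intent,
        _ => return L2Outcome::FallThrough, // timeout or error: clean fallthrough to L3
    };
    // Phase 2: only SIMPLE / TEMPLATE / SNIPPET results invoke the GPU sidecar.
    match intent {
        Intent::Simple | Intent::Template | Intent::Snippet => {
            match tokio::time::timeout(timeout, generate).await {
                Ok(Some(answer)) => L2Outcome::Deflected(answer),
                _ => L2Outcome::FallThrough,
            }
        }
        Intent::Complex => L2Outcome::FallThrough,
    }
}
```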
| Mode | Implementation |
|---|---|
| Minimalist | Embedded candle GGUF inference (e.g. Gemma-2-2B-IT, CPU) |
| Enterprise | Remote vLLM / TGI server (GPU pool) |
L2.5 — Context Optimiser
Algorithm: CompressionPipeline — Modular staged compression
Agentic coding tools (Copilot, Claude Code, Cursor) send large instruction files (CLAUDE.md, copilot-instructions.md, skills blocks) with every turn. L2.5 detects and compresses these payloads before they reach the cloud, saving input tokens on every L3 call.
Pipeline architecture (src/compression/):
L2.5 uses a modular `CompressionPipeline` with pluggable stages that execute in order. Each stage is a stateless `CompressionStage` trait object. If a stage sets `short_circuit = true`, subsequent stages are skipped.
Built-in stages (run in order):
- ContentClassifier — Gate stage: detects instruction vs conversational content. Short-circuits on conversational messages so downstream stages skip work.
- DedupStage — Session-aware cross-turn deduplication. Hashes instruction content per session; on repeat turns, replaces with a compact hash reference. Short-circuits on dedup hit.
- LogCrunchStage — Static minification: strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
Adding custom stages:
Implement the `CompressionStage` trait and add your stage to the pipeline via `build_pipeline()` in `src/compression/optimize.rs`.
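As a sketch of what a custom stage could look like — the trait below is a simplified stand-in, not the actual `CompressionStage` signature from `src/compression/`:

```rust
/// Simplified stand-in for the stage trait; the real trait may differ in shape.
trait CompressionStage: Send + Sync {
    fn name(&self) -> &'static str;
    /// Returns the rewritten content and whether later stages should be skipped.
    fn apply(&self, content: &str) -> (String, bool);
}

/// Hypothetical custom stage: strips trailing whitespace from every line.
struct TrailingWhitespaceStrip;

impl CompressionStage for TrailingWhitespaceStrip {
    fn name(&self) -> &'static str { "trailing_whitespace_strip" }
    fn apply(&self, content: &str) -> (String, bool) {
        let cleaned = content
            .lines()
            .map(str::trim_end)
            .collect::<Vec<_>>()
            .join("\n");
        (cleaned, false) // never short-circuits, so later stages still run
    }
}
```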
Configuration:
| Variable | Default | Description |
|---|---|---|
| `ISARTOR__ENABLE_CONTEXT_OPTIMIZER` | true | Master switch for L2.5 |
| `ISARTOR__CONTEXT_OPTIMIZER_DEDUP` | true | Enable cross-turn instruction deduplication |
| `ISARTOR__CONTEXT_OPTIMIZER_MINIFY` | true | Enable static minification |
Observability:
- Instrumented as: `layer2_5_context_optimizer` span in distributed traces.
- Response header: `x-isartor-context-optimized: bytes_saved=<N>` on optimised requests.
- Span fields: `context.bytes_saved`, `context.strategy` (e.g. "classifier+dedup", "classifier+log_crunch").
| Mode | Implementation |
|---|---|
| Minimalist | In-process CompressionPipeline (classifier → dedup → log_crunch) |
| Enterprise | In-process CompressionPipeline (extensible with custom stages) |
L3 — Cloud Logic
Algorithm: Ordered provider chain with per-provider retry budgets
L3 is the final layer. Only the hardest prompts — those not resolved by cache, SLM, or context optimisation — reach the external cloud LLMs.
Provider chain execution:
- Isartor evaluates quota for the current provider (daily/weekly/monthly token and cost windows). If the provider is over quota, the request either blocks (429), warns, or falls through to the next provider depending on the `action_on_limit` policy.
- The request is dispatched to the provider with retry logic (exponential backoff, jitter). Each provider has its own independent retry budget.
- On exhausting retries with a retry-safe upstream error (429, 5xx, timeout), Isartor advances to the next fallback provider in the chain.
- Successful responses are annotated with the `x-isartor-provider` header.
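A condensed sketch of the chain loop. `dispatch` and `Upstream` stand in for the real async provider clients and error taxonomy; production code also adds jitter and records the response headers:

```rust
use std::{thread, time::Duration};

enum Upstream { Ok(String), RetrySafe, Fatal }

/// Ordered fallback chain: an independent retry budget per provider,
/// exponential backoff on retry-safe errors, and advance-on-exhaustion.
fn route_through_chain(
    providers: &[&str],
    retries_per_provider: u32,
    mut dispatch: impl FnMut(&str) -> Upstream,
) -> Option<String> {
    for &provider in providers {
        for attempt in 0..=retries_per_provider {
            match dispatch(provider) {
                Upstream::Ok(body) => return Some(body),
                Upstream::RetrySafe => {
                    // 429 / 5xx / timeout: back off, then retry the same provider.
                    thread::sleep(Duration::from_millis(100 * 2u64.pow(attempt)));
                }
                Upstream::Fatal => break, // non-retryable: move to the next provider
            }
        }
        // Retry budget exhausted: fall through to the next provider in the chain.
    }
    None // chain exhausted; the caller may try the stale-cache fallback
}
```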
Multi-key rotation: Each provider can own an in-memory key pool. When multiple credentials are configured, keys are selected with a `round_robin` or `priority` strategy. Only the rate-limited key is cooled down after 429/quota failures — other keys continue serving.
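A sketch of the cooldown behaviour for the round-robin case; the struct and method names are illustrative, not the real key-pool API:

```rust
use std::time::{Duration, Instant};

// A key that hit a 429/quota error is parked until its cooldown expires,
// while the remaining keys keep serving.
struct PooledKey { key: String, cooldown_until: Option<Instant> }

struct KeyPool { keys: Vec<PooledKey>, cursor: usize }

impl KeyPool {
    /// Round-robin over keys that are not currently cooling down.
    fn next_key(&mut self) -> Option<&str> {
        let n = self.keys.len();
        if n == 0 { return None; }
        for _ in 0..n {
            let idx = self.cursor % n;
            self.cursor += 1;
            let cooling = self.keys[idx]
                .cooldown_until
                .map_or(false, |until| Instant::now() < until);
            if !cooling {
                return Some(self.keys[idx].key.as_str());
            }
        }
        None // every key is cooling down
    }

    /// Called after a 429/quota failure on a specific key.
    fn cool_down(&mut self, key: &str, period: Duration) {
        if let Some(k) = self.keys.iter_mut().find(|k| k.key == key) {
            k.cooldown_until = Some(Instant::now() + period);
        }
    }
}
```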
Supported providers (23+):
- Full client: OpenAI, Azure OpenAI, Anthropic, Copilot (GitHub), Gemini, Cohere, xAI
- OpenAI-compatible registry: Groq, Cerebras, Nebius, SiliconFlow, Fireworks, NVIDIA, Chutes, DeepSeek, Galadriel, Hyperbolic, HuggingFace, Mira, Moonshot, Ollama, OpenRouter, Perplexity, Together
Safety nets:
- Offline mode (`offline_mode = true`): Blocks L3 routing explicitly with HTTP 503.
- Stale fallback: On L3 failure, checks the namespaced exact-cache key first, then a legacy un-namespaced key for backward compatibility (sketched below).
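A sketch of the stale-fallback order, with `cache_get` as a stand-in for the real cache accessor:

```rust
/// Try the namespaced exact-cache key first, then the legacy un-namespaced key.
fn stale_fallback(
    mut cache_get: impl FnMut(&str) -> Option<String>,
    namespace: &str,
    prompt: &str,
) -> Option<String> {
    if let Some(stale) = cache_get(&format!("{namespace}|{prompt}")) {
        return Some(stale);
    }
    // Legacy un-namespaced key, kept for backward compatibility.
    cache_get(prompt)
}
```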
| Mode | Implementation |
|---|---|
| Minimalist | Direct to cloud providers via rig-core |
| Enterprise | Direct to cloud providers via rig-core |
How Layers Interact
The deflection stack is implemented as Axum middleware plus a final handler. For authenticated routes, the execution order is:
- Body buffer — `BufferedBody` stores the request body so multiple layers can read it.
- Request-level monitoring — Observability instrumentation.
- Auth — API key validation.
- Layer 1 cache — L1a exact match, then L1b semantic match.
- Layer 2 SLM triage — Intent classification and local response.
- Layer 2.5 context optimiser — Instruction dedup + minification via `CompressionPipeline`.
- Layer 3 handler — Cloud LLM fallback.
Implementation note: Axum middleware wraps inside-out — the last `.layer(...)` added runs first. The stack order in `src/main.rs` documents this explicitly and must be preserved.
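A sketch of the layering rule with axum 0.7-style middleware; the handler and middleware here are placeholders, not the real symbols from `src/main.rs`:

```rust
use axum::{extract::Request, middleware::{self, Next}, response::Response, routing::post, Router};

// Placeholder handler and middleware; this only illustrates the inside-out
// rule: the layer added LAST with .layer(...) is outermost and runs FIRST.
async fn layer3_handler() -> &'static str { "cloud response" }
async fn passthrough(req: Request, next: Next) -> Response { next.run(req).await }

fn router() -> Router {
    Router::new()
        .route("/v1/chat/completions", post(layer3_handler))
        .layer(middleware::from_fn(passthrough)) // L2.5 context optimiser (innermost, runs just before the handler)
        .layer(middleware::from_fn(passthrough)) // L2 SLM triage
        .layer(middleware::from_fn(passthrough)) // L1 cache (exact, then semantic)
        .layer(middleware::from_fn(passthrough)) // auth
        .layer(middleware::from_fn(passthrough)) // request-level monitoring
        .layer(middleware::from_fn(passthrough)) // body buffer (outermost, runs first)
}
```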
Public health routes (`/health`, `/healthz`) intentionally bypass the deflection stack. The authenticated routes are `/api/chat`, `/api/v1/chat`, `/v1/chat/completions`, `/v1/messages`, and `/v1beta/models/{model}:generateContent`.
See Also
- Architecture — high-level system design and pluggable providers
- Architecture Decision Records — rationale behind the deflection stack design (ADR-001)
- Configuration Reference