Architecture

Pattern: Hexagonal Architecture (Ports & Adapters)
Location: src/core/, src/adapters/, src/factory.rs

High-Level Overview

Isartor is an AI Prompt Firewall that intercepts LLM traffic and routes it through a multi-layer Deflection Stack. Each layer can short-circuit and return a response without reaching the cloud, dramatically reducing cost and latency.

Documentation contract: if an implementation changes this architecture, the request flow, supported surfaces, deployment shape, or other durable design assumptions, update this page and the ADR pages in the same patch. User-visible capability changes should also be reflected in the README.md feature list and the relevant docs under docs/ and docs-site/src/.

Request Surfaces

Isartor accepts traffic through seven inbound surfaces. All share the same deflection stack but keep cache keys namespaced by response shape so one endpoint never returns another endpoint's schema. Format detection is header-aware: Cursor and Kiro both hit /v1/chat/completions but receive their own isolated cache namespaces.

| Surface | Route(s) | Cache Namespace | Detection |
|---|---|---|---|
| Native | POST /api/chat, POST /api/v1/chat | native | Path |
| OpenAI-compatible | POST /v1/chat/completions | openai | Path |
| Anthropic-compatible | POST /v1/messages | anthropic | Path |
| Gemini-native | POST /v1beta/models/{model}:generateContent, :streamGenerateContent | gemini | Path |
| Cursor IDE | POST /v1/chat/completions | cursor | X-Cursor-Checksum / X-Cursor-Client-Version / X-Ghost-Mode header |
| Kiro (AWS IDE) | POST /v1/chat/completions | kiro | X-Kiro-Version / X-Kiro-Client-Id header |
| CONNECT proxy | MITM intercept on allowlisted Copilot domains (:8081) | openai (proxied) | Port 8081 |
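The header-aware detection described above can be sketched as a pure function over the request path and headers. This is an illustrative std-only sketch, not Isartor's actual routing code; the function name and the fallback to the native namespace for unknown paths are assumptions:

```rust
/// Sketch of header-aware surface detection: path routing picks the default
/// namespace, and Cursor/Kiro headers override it on /v1/chat/completions.
/// Names and the unknown-path fallback are illustrative assumptions.
fn cache_namespace(path: &str, headers: &[(&str, &str)]) -> &'static str {
    let has = |name: &str| headers.iter().any(|(k, _)| k.eq_ignore_ascii_case(name));
    match path {
        "/v1/chat/completions" => {
            if has("x-cursor-checksum")
                || has("x-cursor-client-version")
                || has("x-ghost-mode")
            {
                "cursor"
            } else if has("x-kiro-version") || has("x-kiro-client-id") {
                "kiro"
            } else {
                "openai"
            }
        }
        "/api/chat" | "/api/v1/chat" => "native",
        "/v1/messages" => "anthropic",
        p if p.starts_with("/v1beta/models/") => "gemini",
        _ => "native", // assumption: unmatched paths fall back to the native surface
    }
}
```

Because the override only inspects headers, Cursor and Kiro traffic stays isolated from plain OpenAI-compatible clients even though all three share one route.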

Streaming is a boundary concern: handlers always return canonical JSON. The cache middleware converts cached or downstream responses into surface-specific SSE (text/event-stream) when the client requests streaming ("stream": true). Cursor and Kiro use OpenAI-compatible SSE format.
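The boundary conversion can be sketched as follows. This is a deliberate simplification (the real middleware emits surface-specific incremental delta chunks rather than a single event), and the function name is invented:

```rust
/// Frame a canonical JSON payload as OpenAI-compatible SSE.
/// Simplified sketch: a production implementation would split the
/// response into incremental delta events instead of one data frame.
fn to_sse(json_body: &str) -> String {
    format!("data: {}\n\ndata: [DONE]\n\n", json_body)
}
```

Keeping handlers JSON-only and doing this framing in middleware means every surface gets streaming for free, including cache hits that never touched an upstream.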

MCP Server

Isartor also exposes a Model Context Protocol (MCP) server for tool integrations:

  • stdio mode: isartor mcp — used by Copilot CLI, Claude Desktop, Cursor IDE
  • HTTP/SSE mode: GET/POST/DELETE /mcp/ — used by any MCP-compatible client

The MCP server provides cache-lookup and cache-store tools so external agents can query and populate the deflection cache directly.

CONNECT Proxy

When started with isartor up copilot (or claude, antigravity), Isartor runs a transparent HTTPS CONNECT proxy on port 8081 alongside the API gateway on :8080. The proxy intercepts allowlisted GitHub Copilot domains, terminates TLS using a local CA (auto-generated at ~/.isartor/ca/), and reuses the same L1/L2/L3 cache and LLM layers for /v1/chat/completions traffic. This lets tools like gh copilot benefit from Isartor's deflection stack without any client-side configuration changes beyond an HTTPS proxy setting.

Pre-Routing Normalization

At the HTTP boundary, request-time model aliases are normalized to real provider model IDs before Layer 1 cache keys are built or Layer 3 routing runs. That keeps aliases like fast and their canonical target model on the same routing and cache path.
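A minimal sketch of this normalization step, assuming a simple alias table (the canonical target shown in the test is invented for illustration):

```rust
use std::collections::HashMap;

/// Resolve a request-time model alias to its canonical provider model ID.
/// Sketch only: the alias table shape is an assumption. Running this before
/// cache-key generation keeps an alias and its target on one cache path.
fn normalize_model<'a>(aliases: &HashMap<&'a str, &'a str>, requested: &'a str) -> &'a str {
    aliases.get(requested).copied().unwrap_or(requested)
}
```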

When operators need full payload troubleshooting, the outer monitoring middleware emits a separate opt-in JSONL request log with redacted auth headers. This is intentionally kept separate from normal tracing/startup logs.

For a detailed breakdown of the deflection layers, see the Deflection Stack page.

```mermaid
flowchart TD
    A[Request] --> B[Body Buffer + Monitoring]
    B --> C[Auth — L0]
    C --> D[MiniLM Router — L0.5]
    D --> E[Cache — L1a Exact / L1b Semantic]
    E --> F[SLM Router — L2]
    F --> G[Context Optimiser — L2.5]
    G --> H[Cloud Fallback Chain — L3]
    H --> I[Response]

    subgraph G_detail [L2.5 CompressionPipeline]
        direction LR
        G1[ContentClassifier] --> G2[DedupStage]
        G2 --> G3[LogCrunchStage]
    end
```

Layer 0 — Authentication & Concurrency

Layer 0 is the operational defense perimeter. It runs before any cache lookup or inference.

  • Authentication: API key validation via the X-API-Key header. When gateway_api_key is empty (the local-first default), authentication is disabled.
  • Body buffering: The BufferedBody middleware reads the request body once and stores a clone in request extensions. All downstream layers (cache key extraction, prompt parsing, monitoring, retries) read from this buffer instead of consuming the stream.
  • Monitoring: Root request-level OpenTelemetry tracing span, optional JSONL request logging.

Public health routes (/health, /healthz) and the MCP endpoint (/mcp/) bypass Layer 0 and the entire deflection stack.

Layer 0.5 — MiniLM Classifier Routing

An optional pre-cache routing pass can reuse the same in-process all-MiniLM-L6-v2 embedder used by L1b. Instead of running a second encoder, Isartor generates one embedding and scores four lightweight linear heads:

  • task_type
  • complexity
  • persona
  • domain

The classifier reads the buffered request body through extract_classifier_context(), which keeps enough agent/tool context to route coding-agent traffic even when the last user turn is short. Matched rules can:

  • prefer a specific provider earlier in the Layer 3 fallback chain
  • override the request model before cache-key generation and Layer 3 execution

Because provider-directed routing changes the semantics of the downstream answer, the cache key is prefixed with the selected provider fragment before L1 lookup and writeback. That prevents a classifier-routed provider response from colliding with the default-provider cache entry for the same prompt.

If classifier_routing.fallback_to_existing_routing = true, any classifier miss, load failure, or no-match path simply falls through to the existing routing behavior. If it is false, the gateway fails closed with 503 instead.
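The fail-open versus fail-closed behavior can be summarized in a small decision function. The `Decision` enum and function shape are illustrative assumptions, not Isartor's actual types:

```rust
/// Sketch of the fail-open / fail-closed choice for classifier routing.
/// Types and names here are illustrative, not the real Isartor API.
#[derive(Debug, PartialEq)]
enum Decision {
    UseProvider(&'static str),
    DefaultRouting,       // fail open: fall through to existing routing
    ServiceUnavailable,   // fail closed: surfaced as HTTP 503
}

fn resolve(classifier_hit: Option<&'static str>, fallback_to_existing: bool) -> Decision {
    match (classifier_hit, fallback_to_existing) {
        (Some(provider), _) => Decision::UseProvider(provider),
        (None, true) => Decision::DefaultRouting,
        (None, false) => Decision::ServiceUnavailable,
    }
}
```

Failing closed is the right choice only when classifier routing is load-bearing for policy; the fail-open default keeps an optional optimization from becoming an availability risk.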

Layer 1 — Cache

Layer 1 has two sub-layers that execute in sequence:

L1a — Exact Cache

Fast-hash (ahash) lookup against an in-memory LRU or shared Redis cluster. Sub-millisecond on hit. Cache keys include the API surface namespace and an optional session scope (from x-isartor-session-id or similar headers) so different conversations and endpoints never cross-pollinate.
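Key construction along these lines can be sketched as below, using std's DefaultHasher in place of ahash; the field layout of the key string is an assumption, not Isartor's actual format:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch of a namespaced, session-scoped exact-cache key.
/// DefaultHasher stands in for ahash; the layout is illustrative.
fn exact_cache_key(namespace: &str, session: Option<&str>, prompt: &str) -> String {
    let mut h = DefaultHasher::new();
    prompt.hash(&mut h);
    format!(
        "{}:{}:{:016x}",
        namespace,
        session.unwrap_or("global"), // assumption: unscoped requests share one bucket
        h.finish()
    )
}
```

Because the namespace and session participate in the key itself, two surfaces (or two conversations) can never return each other's cached responses even when the prompt hash collides on content.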

On a cache hit, ChatResponse.layer is normalized to 1 regardless of which layer originally produced the response.

L1b — Semantic Cache

Cosine similarity over 384-dimensional sentence embeddings from an in-process candle BertModel (all-MiniLM-L6-v2). Catches semantically equivalent prompts that differ in wording (e.g. "Price?" ≈ "Cost?").
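The similarity measure itself is standard cosine similarity; a self-contained version (the real gateway compares 384-dimensional MiniLM embeddings against a configurable threshold):

```rust
/// Cosine similarity between two embedding vectors.
/// Zero-norm inputs return 0.0 rather than NaN.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```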

Important: L1b semantic cache is intentionally disabled for /v1/messages (Anthropic/Claude Code traffic) because the large, repetitive system/context payloads caused false cache hits across different user questions. Exact cache (L1a) remains active for that surface.

| Component | Minimalist | Enterprise |
|---|---|---|
| L1a Exact Cache | In-memory LRU (ahash + parking_lot) | Redis cluster (shared across replicas) |
| L1b Semantic Cache | In-process candle BertModel | External TEI sidecar (optional) |

Layer 2 — SLM Router

Neural classification via a Small Language Model (e.g. Qwen-1.5B via llama.cpp sidecar). Classifies the prompt's intent and resolves simple data extraction tasks locally without reaching the cloud. Typical latency: 50–200 ms.

  • Disabled by default (enable_slm_router = false): Layer is a no-op; request falls through to L2.5.
  • Classifier modes: tiered (default, multi-level confidence) or binary (simple/complex split).

Two-phase execution:

| Phase | Config | Purpose |
|---|---|---|
| Classification | local_slm_url + local_slm_model | Lightweight CPU-friendly Ollama path (always-on). Sees system prompt + last user message for full agentic context. |
| Answer generation | layer2.sidecar_url + layer2.model_name | GPU sidecar. Only invoked when classification returns a deflectable intent. |

Splitting the two phases means classification never competes with generation for GPU resources and degrades gracefully when the sidecar is busy. Both calls respect layer2.timeout_seconds.

| Component | Minimalist | Enterprise |
|---|---|---|
| L2 SLM Router | Embedded candle GGUF inference (CPU) | Remote vLLM / TGI server (GPU pool) |

Layer 2.5 — Context Optimiser

A modular CompressionPipeline with pluggable stages that reduce cloud input tokens by compressing repeated instruction payloads (CLAUDE.md, copilot-instructions.md, skills blocks).

Built-in stages (execute in order):

  1. ContentClassifier — Gate: detects instruction vs conversational content. Short-circuits on conversational messages.
  2. DedupStage — Session-aware cross-turn instruction deduplication. Hashes instruction content; on repeat turns, replaces with compact hash reference.
  3. LogCrunchStage — Static minification: strips comments, decorative rules, consecutive blank lines.

Each stage is a stateless CompressionStage trait object. Shared state (the InstructionCache) is passed as input. If a stage sets short_circuit = true, subsequent stages are skipped.
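The trait-object pipeline with short-circuiting can be sketched as follows. The trait name follows the text, but the method signature, `StageOutput` struct, and the example stage are illustrative assumptions (the example stage mimics LogCrunchStage's blank-line stripping):

```rust
/// Output of one pipeline stage; setting short_circuit skips the rest.
struct StageOutput {
    text: String,
    short_circuit: bool,
}

/// Stateless stage interface, in the spirit of the CompressionStage trait.
/// The exact signature is an assumption for this sketch.
trait CompressionStage {
    fn run(&self, input: &str) -> StageOutput;
}

/// Illustrative stage: drop consecutive blank lines, LogCrunch-style.
struct StripBlankLines;

impl CompressionStage for StripBlankLines {
    fn run(&self, input: &str) -> StageOutput {
        let text = input
            .lines()
            .filter(|l| !l.trim().is_empty())
            .collect::<Vec<_>>()
            .join("\n");
        StageOutput { text, short_circuit: false }
    }
}

/// Execute stages in order, honoring short_circuit.
fn run_pipeline(stages: &[Box<dyn CompressionStage>], input: &str) -> String {
    let mut text = input.to_string();
    for stage in stages {
        let out = stage.run(&text);
        text = out.text;
        if out.short_circuit {
            break; // remaining stages are skipped
        }
    }
    text
}
```

Keeping stages stateless and threading shared state through the input is what lets operators add custom stages without touching the executor.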

| Component | Minimalist | Enterprise |
|---|---|---|
| L2.5 Context Optimiser | In-process CompressionPipeline | In-process CompressionPipeline (extensible with custom stages) |

Layer 3 — Cloud Logic

Only the hardest prompts — those not resolved by cache, SLM, or context optimisation — reach Layer 3.

Provider Chain

The running AppState maintains an ordered provider chain: one primary provider plus zero or more fallback providers. Each provider keeps its own retry budget (exponential backoff with jitter). Isartor advances to the next provider only when the current one exhausts retries with a retry-safe upstream error (429, 5xx, timeout). Successful responses are annotated with x-isartor-provider so clients can see which upstream answered.
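The advance rule can be condensed into a sketch like the one below. Types are invented for illustration, and the exponential-backoff timing between retries is elided; only the decision logic is shown:

```rust
/// Outcome of one upstream attempt. Illustrative types, not Isartor's.
#[derive(Clone, Copy)]
enum Upstream {
    Ok(&'static str),
    Status(u16),
    Timeout,
}

/// Retry-safe errors: 429, 5xx, timeout (per the provider-chain rule above).
fn retry_safe(e: Upstream) -> bool {
    matches!(e, Upstream::Status(429) | Upstream::Status(500..=599) | Upstream::Timeout)
}

/// `attempts` holds, per provider in chain order, each retry's outcome.
/// Stay on a provider through its retry budget; advance only when it
/// exhausts retries with retry-safe errors. Backoff delays are elided.
fn run_chain(attempts: &[Vec<Upstream>]) -> Option<&'static str> {
    for provider_attempts in attempts {
        for &outcome in provider_attempts {
            match outcome {
                Upstream::Ok(body) => return Some(body),
                e if retry_safe(e) => continue, // burn this provider's retry budget
                _ => return None,               // non-retryable: fail the request
            }
        }
        // budget exhausted on retry-safe errors: advance to the next provider
    }
    None
}
```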

If a provider has no explicit api_key configured, the resolved chain also does a best-effort lookup in the encrypted local token store (~/.isartor/tokens/) before falling back to an empty key. That lets operators authenticate once with isartor auth <provider> and keep long-lived OAuth/API credentials out of isartor.toml.

Provider Registry

Layer 3 supports 23+ LLM providers through rig-core:

Full client: OpenAI, Azure OpenAI, Anthropic, Copilot (GitHub), Gemini, Cohere, xAI

OpenAI-compatible registry (shared runtime path with provider-specific default endpoints): Groq, Cerebras, Nebius, SiliconFlow, Fireworks, NVIDIA, Chutes, DeepSeek, Galadriel, Hyperbolic, HuggingFace, Mira, Moonshot, Ollama, OpenRouter, Perplexity, Together

Multi-Key Rotation

Each provider can own an in-memory key pool. When multiple credentials are configured, Isartor selects keys with round_robin or priority rotation and temporarily cools down only the rate-limited key after 429/quota-style failures. Key rotation is separate from provider-level fallback.
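Round-robin selection with per-key cooldown can be sketched with std types only; struct and method names are assumptions for this sketch:

```rust
use std::time::{Duration, Instant};

/// Sketch of a per-provider key pool with round-robin rotation and
/// per-key cooldown after 429/quota failures. Names are illustrative.
struct KeyPool {
    keys: Vec<String>,
    cooldown_until: Vec<Option<Instant>>,
    next: usize,
}

impl KeyPool {
    fn new(keys: Vec<String>) -> Self {
        let n = keys.len();
        Self { keys, cooldown_until: vec![None; n], next: 0 }
    }

    /// Round-robin over keys, skipping any that are still cooling down.
    /// Returns None only when every key is rate-limited.
    fn select(&mut self) -> Option<&str> {
        for _ in 0..self.keys.len() {
            let i = self.next;
            self.next = (self.next + 1) % self.keys.len();
            let cooling = matches!(self.cooldown_until[i], Some(t) if t > Instant::now());
            if !cooling {
                return Some(&self.keys[i]);
            }
        }
        None
    }

    /// Record a 429/quota-style failure: only this key cools down.
    fn cool_down(&mut self, i: usize, d: Duration) {
        self.cooldown_until[i] = Some(Instant::now() + d);
    }
}
```

Note how cooling down one key never touches the provider's position in the fallback chain, matching the separation between key rotation and provider-level fallback.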

Stored OAuth Credentials

src/auth/ adds a provider-agnostic authentication layer:

  • OAuthProvider trait for device-flow, refresh-token, and manual API-key providers
  • TokenStore for AES-256-GCM encrypted credentials on disk
  • provider implementations for Copilot, Gemini, Kiro, Anthropic, and OpenAI
  • isartor auth <provider>, isartor auth status, and isartor auth logout <provider>

Copilot, Gemini, and Kiro use interactive device authorization. Anthropic and OpenAI do not expose a public device flow, so the same encrypted store is used for securely pasted API keys.

Optional Encrypted Config Sync

src/sync/ adds an opt-in config sync path for operators who use multiple machines:

  • isartor sync init creates a local sync profile with server URL, user hash, and encryption salt
  • isartor sync push filters the shareable subset of isartor.toml, encrypts it client-side, and uploads the encrypted blob
  • isartor sync pull downloads, decrypts, and merges only the syncable keys/tables back into the local config file
  • isartor sync serve runs the self-hostable zero-knowledge blob server

The sync server only stores { user_hash, salt, encrypted_blob, updated_at }. It never sees plaintext config. The synced subset includes provider/model settings, model aliases, fallback providers, and quota/pricing preferences. It explicitly excludes OAuth tokens, cache contents, usage history, bind addresses, and other machine-local runtime paths.

Provider Health

A small in-memory provider-health snapshot tracks request/error counts, last success/failure, and masked key-pool entries for the entire configured chain. Exposed via GET /debug/providers and isartor providers.

Health state now advances from three sources:

  • real routed Layer 3 successes/failures
  • manual dashboard connectivity tests (POST /api/admin/providers/test)
  • an optional background ping loop driven by provider_health_check_interval_secs (default 300, 0 disables)

Web Management Dashboard

An embedded single-page application (SPA) is served at /dashboard. The HTML, CSS, JavaScript, and logo PNG are compiled directly into the binary via include_str! / include_bytes! — no separate static-file directory or CDN required.

The dashboard has five tabs, each backed by authenticated JSON admin endpoints:

| Tab | Route(s) | Key features |
|---|---|---|
| Overview | GET /api/admin/overview | Deflection rate sparkline (7-day SVG), uptime pill, L1a/L1b cache counts, quota-warning banner, provider/model cards, cost/savings |
| Providers | GET /api/admin/providers, POST /api/admin/providers/test | Health per provider, key-pool status, connectivity test (latency + HTTP status), Add/Edit/Remove/Reorder modal flows; manual test updates health immediately |
| Usage | GET /api/admin/usage, GET /api/admin/usage/breakdown | Window summary, daily request bar chart, per-provider/model breakdown table, per-provider quota status |
| Request Log | GET /api/admin/requests | Last 100 JSONL request-log entries, expandable rows showing full JSON details |
| Configuration | GET /api/admin/config, POST /api/admin/config | Form-based editor for all isartor.toml settings, including provider-health ping cadence; toml_edit write preserves comments; restart-required banner |

All /api/admin/* routes require the gateway API key (X-API-Key header). The SPA stores the key in sessionStorage and never transmits it to any third party. Static assets (/dashboard/, /dashboard/logo.png) are served without authentication.

AppState carries a started_at: Instant field (set in AppState::new()) used by the overview endpoint to compute the gateway uptime.

Quota Enforcement

Per-provider quota enforcement is built on top of the usage-event stream (see below). Before a request is dispatched to Layer 3, Isartor projects the request's token/cost impact against the provider's configured daily, weekly, and monthly windows, then either warns, blocks with 429, or falls through to the next provider in the ordered fallback chain.
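A single window's warn/block decision can be sketched as a projection check. The warn-ratio mechanism and the cost-denominated units here are illustrative assumptions, not Isartor's exact policy:

```rust
/// Quota outcome for one window. Illustrative, not the real Isartor type.
#[derive(Debug, PartialEq)]
enum QuotaAction {
    Allow,
    Warn,
    Block, // surfaced as 429, or fallthrough to the next provider
}

/// Project the request's cost impact against one configured window.
/// `warn_ratio` (e.g. 0.8) is an assumed soft-warning threshold.
fn check_window(used_usd: f64, projected_usd: f64, limit_usd: f64, warn_ratio: f64) -> QuotaAction {
    let after = used_usd + projected_usd;
    if after > limit_usd {
        QuotaAction::Block
    } else if after > limit_usd * warn_ratio {
        QuotaAction::Warn
    } else {
        QuotaAction::Allow
    }
}
```

In the real gateway this check runs per provider across the daily, weekly, and monthly windows before the request is dispatched to Layer 3.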

Stale Fallback

On L3 failure, the handler checks the namespaced exact-cache key first, then a legacy un-namespaced key for backward compatibility.

Offline Mode

When offline_mode = true, Layer 3 is blocked explicitly — returns HTTP 503 instead of silently pretending success.

Usage Analytics

Isartor records per-request provider/model usage events for both cloud calls and pre-L3 deflections. Events are persisted as append-only JSONL under usage_log_path, aggregated in-memory with retention pruning, and exposed through:

  • isartor stats --usage — CLI usage breakdown by provider/model
  • isartor stats --by-tool — CLI usage breakdown by client tool

Deflected requests are recorded as saved cost against the configured primary provider/model, while actual cloud calls record estimated prompt/completion usage against the provider/model that served the request.

Pluggable Trait Provider Pattern

All layers are implemented as Rust traits and adapters. Backends are selected at startup via ISARTOR__ environment variables — no code changes or recompilation required.

Rather than feature-flag every call-site, we define Ports (trait interfaces in src/core/ports.rs) and swap the concrete Adapter at startup. This keeps the Deflection Stack logic completely agnostic to the backing implementation.

Adding a New Adapter

  1. Define the struct in src/adapters/cache.rs or src/adapters/router.rs.
  2. Implement the port trait (ExactCache or SlmRouter).
  3. Add a variant to the config enum (CacheBackend or RouterBackend) in src/config.rs.
  4. Wire it in src/factory.rs with a new match arm.
  5. Write tests — each adapter module has a #[cfg(test)] mod tests.

No other files need to change. The middleware and pipeline code operate only on Arc<dyn ExactCache> / Arc<dyn SlmRouter>.
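The port/adapter/factory wiring described above looks roughly like this. Trait and enum names follow the text, but the method signature and the stand-in adapter body are assumptions:

```rust
use std::sync::Arc;

/// Port: the trait interface the deflection stack depends on.
/// The method signature here is a sketch, not the real ports.rs API.
trait ExactCache: Send + Sync {
    fn get(&self, key: &str) -> Option<String>;
}

/// Adapter: one concrete backend (stand-in body that always misses).
struct InMemoryCache;

impl ExactCache for InMemoryCache {
    fn get(&self, _key: &str) -> Option<String> {
        None
    }
}

/// Config enum: each new adapter adds one variant.
enum CacheBackend {
    Memory,
    // Redis, ...
}

/// factory.rs-style constructor: the only place that names concrete types.
/// Everything downstream sees Arc<dyn ExactCache>.
fn build_exact_cache(backend: CacheBackend) -> Arc<dyn ExactCache> {
    match backend {
        CacheBackend::Memory => Arc::new(InMemoryCache),
    }
}
```

Adding a backend touches the adapter file, the enum, and this match arm; the pipeline code compiles unchanged because it only ever holds the trait object.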

Scalability Model (3-Tier)

Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. The same binary serves all three tiers; the runtime behaviour is entirely configuration-driven.

```
Level 1 (Edge)            Level 2 (Compose)         Level 3 (K8s)
┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│ Single Process  │       │ Firewall + GPU  │       │ N Firewall Pods │
│ memory cache    │ ──▶   │ Sidecar         │ ──▶   │ + Redis Cluster │
│ embedded candle │       │ memory cache    │       │ + vLLM Pool     │
│ context opt.    │       │ (optional)      │       │ (optional)      │
└─────────────────┘       └─────────────────┘       └─────────────────┘
```

Key insight: Switching to cache_backend=redis unlocks true multi-replica scaling. Without it, each firewall pod maintains an independent cache.

See the deployment guides for tier-specific setup.

Directory Layout

```
src/
├── sync/
│   └── mod.rs               # Encrypted config sync, profile storage, blob server
├── auth/
│   ├── mod.rs               # OAuthProvider trait + provider registry
│   ├── device_flow.rs       # Shared RFC 8628 polling loop
│   ├── token_store.rs       # AES-GCM encrypted token persistence
│   └── providers/           # Copilot, Gemini, Kiro, Anthropic, OpenAI auth backends
├── core/
│   ├── mod.rs               # Re-exports + is_internal_endpoint()
│   ├── ports.rs             # Trait interfaces (ExactCache, SlmRouter)
│   ├── prompt.rs            # Stable prompt extraction for cache keys
│   ├── cache_scope.rs       # Session-aware cache key namespacing
│   ├── retry.rs             # Retry logic with exponential backoff
│   ├── usage.rs             # Usage event tracking + JSONL persistence
│   ├── quota.rs             # Per-provider quota enforcement
│   ├── request_logger.rs    # Opt-in JSONL request/response logging
│   └── context_compress.rs  # Re-export shim (backward compat)
├── adapters/
│   ├── cache.rs             # InMemoryCache, RedisExactCache
│   └── router.rs            # EmbeddedCandleRouter, RemoteVllmRouter
├── compression/
│   ├── pipeline.rs          # CompressionPipeline executor + CompressionStage trait
│   ├── cache.rs             # InstructionCache (per-session dedup state)
│   ├── optimize.rs          # Request body rewriting (JSON → pipeline → reassembly)
│   └── stages/
│       ├── content_classifier.rs  # Gate: instruction vs conversational
│       ├── dedup.rs               # Cross-turn instruction dedup
│       └── log_crunch.rs          # Static minification
├── middleware/
│   ├── body_buffer.rs       # BufferedBody preservation
│   ├── monitoring.rs        # OTel tracing + request logging
│   ├── auth.rs              # API key validation
│   ├── cache.rs             # L1a exact + L1b semantic cache
│   ├── slm_triage.rs        # L2 SLM intent classification
│   └── context_optimizer.rs # L2.5 compression entry point
├── providers/               # L3 provider implementations
│   └── copilot.rs           # GitHub Copilot token exchange
├── proxy/
│   ├── connect.rs           # HTTPS CONNECT interception
│   └── tls.rs               # Local CA generation + TLS termination
├── dashboard/
│   ├── mod.rs               # Admin API handlers + static SPA route
│   └── index.html           # Embedded single-page dashboard (compiled into binary)
├── formats/                 # Client wire-format adapters + translation helpers
├── handler.rs               # All API surface handlers + provider chain execution
├── state.rs                 # AppState: shared runtime wiring hub
├── factory.rs               # build_exact_cache(), build_slm_router()
├── config.rs                # AppConfig + all configuration types
├── errors.rs                # GatewayError formatting and error chains
├── mcp.rs                   # MCP server (stdio + HTTP/SSE)
├── anthropic_sse.rs         # Anthropic SSE streaming helpers
├── gemini_sse.rs            # Gemini SSE streaming helpers
└── openai_sse.rs            # OpenAI SSE streaming helpers
```

See Also