Welcome to Isartor
Open-source Prompt Firewall — deflect up to 95% of redundant LLM traffic before it leaves your infrastructure.
Pure Rust · Single Binary · Zero Hidden Telemetry · Air-Gappable
AI coding agents and personal assistants repeat themselves — a lot. Copilot, Claude Code, Cursor, and OpenClaw send the same system instructions, the same context preambles, and often the same user prompts across every turn. Standard API gateways forward all of it to cloud LLMs regardless.
Isartor sits between your tools and the cloud. It intercepts every prompt and runs a cascade of local algorithms — from sub-millisecond hashing to in-process neural inference — to resolve requests before they reach the network. Only the genuinely hard prompts make it through.
The Deflection Stack
Every incoming request passes through a sequence of smart computing layers. Only prompts requiring genuine, complex reasoning survive the stack to reach the cloud.
```
Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit               │ hit                │ simple           │ compressed               │
                 ▼                   ▼                    ▼                  ▼                          ▼
              Response            Response          Local Response     Optimised Prompt          Cloud Response
```
| Layer | What It Does | Typical Latency |
|---|---|---|
| L1a — Exact Cache | Sub-millisecond duplicate detection via fast hashing. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Catches meaning-equivalent prompts ("Price?" ≈ "Cost?") using pure-Rust embeddings. | 1–5 ms |
| L2 — SLM Router | Triages intent with an embedded Small Language Model to resolve simple tasks locally. | 50–200 ms |
| L2.5 — Context Optimiser | Compresses repeated instruction payloads (CLAUDE.md, copilot-instructions) via session dedup and minification. | < 1 ms |
| L3 — Cloud Logic | Routes surviving complex prompts to OpenAI, Anthropic, or Azure with fallback resilience. | Network-bound |
Layers 1a and 1b deflect 71% of repetitive agentic traffic and 38% of diverse task traffic before any neural inference runs.
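The cascade's short-circuit behaviour can be sketched in a few lines of Python (the layer logic is a toy stand-in, not Isartor's actual code):

```python
# Toy sketch of a short-circuiting deflection cascade.
# Layer logic is illustrative, not Isartor's actual implementation.

def exact_cache(prompt, state):
    # L1a: serve an exact repeat straight from the cache
    return state["cache"].get(prompt)

def slm_router(prompt, state):
    # L2: pretend simple "Calculate ..." prompts resolve locally
    if prompt.strip().startswith("Calculate"):
        return "4"
    return None

def cloud(prompt, state):
    # L3: placeholder for the upstream LLM call
    return f"cloud-answer({prompt})"

def deflect(prompt, state):
    for layer in (exact_cache, slm_router, cloud):
        response = layer(prompt, state)
        if response is not None:          # first hit short-circuits the rest
            state["cache"][prompt] = response
            return response

state = {"cache": {}}
deflect("Calculate 2+2", state)   # resolved locally by the SLM layer
deflect("Explain BGP", state)     # falls through to the cloud layer
deflect("Explain BGP", state)     # now served by the exact cache
```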
How It Works
Getting started with Isartor takes three steps:
1. Install
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Or use Docker:
docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest
2. Connect
Point any OpenAI-compatible client at Isartor — just change the base URL:
import openai
client = openai.OpenAI(
base_url="http://localhost:8080/v1",
api_key="your-api-key",
)
Works with the official SDKs, LangChain, LlamaIndex, AutoGen, GitHub Copilot, OpenClaw, and any other OpenAI-compatible tool.
Recent OpenAI-compatible improvements for coding agents include:
- `GET /v1/models` for model discovery
- `stream: true` support on `/v1/chat/completions` with proper SSE chunks
- `tools`, `tool_choice`, `functions`, and `function_call` passthrough
- `tool_calls` preserved in upstream responses
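As an illustration, a request body exercising streaming and tool passthrough might look like this (the model name and tool schema are invented for the example):

```python
import json

# Illustrative /v1/chat/completions body using streaming and
# OpenAI-style function calling. The tool name and schema are
# made up for the example; the field names follow the OpenAI API.
payload = {
    "model": "gpt-4",
    "stream": True,                    # responses arrive as SSE chunks
    "messages": [{"role": "user", "content": "What's the weather in Munich?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",             # forwarded to the upstream provider
}
body = json.dumps(payload)
```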
3. Save
Isartor deflects repetitive and simple prompts locally. You keep the same responses, pay for fewer tokens, and get lower latency — with zero code changes beyond the URL.
Explore the Docs
🚀 Getting Started Install Isartor and send your first request.
🔌 Integrations Connect Copilot CLI, Cursor, Claude Code, and more.
📦 Deployment From a single binary to a multi-replica K8s cluster.
⚙️ Configuration Every environment variable and config key.
🏗️ Architecture Deep dive into the Deflection Stack and trait providers.
📊 Observability OpenTelemetry traces, Prometheus metrics, Grafana dashboards.
Installation
Isartor ships as a single statically linked binary — no runtime dependencies required.
macOS / Linux — Single Command (Recommended)
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Docker
The image ships a statically linked isartor binary and downloads the embedding model on first start (then reuses the on-disk hf-hub cache). No API key is needed for the cache layers.
docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest
To persist the model cache across restarts (recommended):
docker run -p 8080:8080 \
-e HF_HOME=/tmp/huggingface \
-v isartor-hf:/tmp/huggingface \
ghcr.io/isartor-ai/isartor:latest
To use Azure OpenAI for Layer 3, set the provider variables below (recommended: pass secrets via Docker `*_FILE` variables). Important: `ISARTOR__EXTERNAL_LLM_URL` must be the base Azure endpoint only (no `/openai/...` path), e.g. `https://<resource>.openai.azure.com`:
# Put your key in a file (no trailing newline is ideal, but Isartor trims whitespace)
echo -n "YOUR_AZURE_OPENAI_KEY" > ./azure_openai_key
docker run -p 8080:8080 \
-e ISARTOR__LLM_PROVIDER=azure \
-e ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com \
-e ISARTOR__AZURE_DEPLOYMENT_ID=<deployment> \
-e ISARTOR__AZURE_API_VERSION=2024-08-01-preview \
-e ISARTOR__EXTERNAL_LLM_API_KEY_FILE=/run/secrets/azure_openai_key \
-v $(pwd)/azure_openai_key:/run/secrets/azure_openai_key:ro \
ghcr.io/isartor-ai/isartor:latest
The startup banner appears after all layers are ready (< 30 s on a modern machine).
Image size: ~120 MB compressed / ~260 MB on disk (includes the `all-MiniLM-L6-v2` embedding model and the statically linked Rust binary).
Windows (PowerShell) — Single Command
irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex
Build from Source
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
./target/release/isartor up
Requires Rust 1.75 or later.
Verify Installation
Check that the binary is available:
isartor --version
Run the built-in demo. It works without an API key, but if you configure a provider first it also shows a live upstream round-trip:
isartor set-key -p groq
isartor check
isartor demo
Verify the health endpoint:
curl http://localhost:8080/health
# {"status":"ok","version":"0.1.0","layers":{...},"uptime_seconds":5,"demo_mode":true}
Quick Start
This guide walks you through starting Isartor, making your first request, observing a cache hit, and checking stats. If you haven't installed Isartor yet, see the Installation guide.
Starting Isartor
isartor up # start the API gateway only
isartor up --detach # start in background and return to the shell
isartor up copilot # start gateway + CONNECT proxy for Copilot CLI
Other useful commands:
isartor init # generate a commented config scaffold
isartor set-key -p openai # configure your LLM provider API key
isartor check # verify provider/model/key masking and live connectivity
isartor demo # run the post-install showcase
isartor stop # stop a running Isartor instance (uses PID file)
isartor update # self-update to the latest version from GitHub releases
Making Your First Request
Isartor exposes an OpenAI-compatible API. Send a request to the /v1/chat/completions endpoint:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma-2-2b-it",
"messages": [
{"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
]
}'
Expected JSON Response (snippet):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "The quantum Hall effect is a phenomenon..."
}
}
],
"usage": { ... }
}
Console Log (snippet):
INFO [cache] Layer 1a miss: quantum Hall effect prompt
INFO [slm_triage] Layer 3 fallback: OpenAI
The first request is a cache miss — Layer 2 triages it and Layer 3 routes it to your configured cloud provider.
OpenAI-compatible clients can also:
- call `GET /v1/models` to discover the configured model
- send `"stream": true` and receive OpenAI-style SSE responses
- use tool/function calling fields such as `tools`, `tool_choice`, and `functions`
You can also use the native API:
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Calculate 2+2"}'
Seeing a Cache Hit
Repeat the same request:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma-2-2b-it",
"messages": [
{"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
]
}'
Expected JSON Response (snippet):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "The quantum Hall effect is a phenomenon..."
}
}
],
"usage": { ... }
}
Console Log (snippet):
INFO [cache] Layer 1a exact match: quantum Hall effect prompt
INFO [slm_triage] Short-circuit: cache hit
This time the response comes from the Layer 1a exact cache — sub-millisecond, zero tokens consumed, no cloud call.
Checking Stats
View prompt totals, layer hit rates, and recent routing history:
isartor stats
Connecting an AI Tool
Isartor works as a drop-in replacement for any OpenAI-compatible client. Point your favourite AI tool at http://localhost:8080/v1 and it will route through the Deflection Stack automatically.
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Summarise this document."}],
)
If your client probes models first, this also works:
curl -sS http://localhost:8080/v1/models
For detailed setup guides for GitHub Copilot CLI, Claude Code, Cursor, and other tools, see the Integrations section.
For advanced configuration, see the Configuration Reference and Architecture.
Architecture
Pattern: Hexagonal Architecture (Ports & Adapters)
Location: `src/core/`, `src/adapters/`, `src/factory.rs`
High-Level Overview
Isartor is an AI Prompt Firewall that intercepts LLM traffic and routes it through a multi-layer Deflection Stack. Each layer can short-circuit and return a response without reaching the cloud, dramatically reducing cost and latency.
For a detailed breakdown of the deflection layers, see the Deflection Stack page.
```mermaid
flowchart TD
    A[Request] --> B[Auth]
    B --> C[Cache L1a: LRU/Redis]
    C --> D[Cache L1b: Candle/TEI]
    D --> E[SLM Router: Candle/vLLM]
    E --> F[Context Optimiser: CompressionPipeline]
    F --> G[Cloud Fallback: OpenAI/Anthropic]
    G --> H[Response]
    subgraph F_detail [L2.5 CompressionPipeline]
        direction LR
        F1[ContentClassifier] --> F2[DedupStage]
        F2 --> F3[LogCrunchStage]
    end
```
Pluggable Trait Provider Pattern
All layers are implemented as Rust traits and adapters. Backends are selected at startup via ISARTOR__ environment variables — no code changes or recompilation required.
Rather than feature-flag every call-site, we define Ports (trait interfaces in src/core/ports.rs) and swap the concrete Adapter at startup. This keeps the Deflection Stack logic completely agnostic to the backing implementation.
| Component | Minimalist (Single Binary) | Enterprise (K8s) |
|---|---|---|
| L1a Exact Cache | In-memory LRU (ahash + parking_lot) | Redis cluster (shared across replicas) |
| L1b Semantic Cache | In-process candle BertModel | External TEI sidecar (optional) |
| L2 SLM Router | Embedded candle GGUF inference | Remote vLLM / TGI server (GPU pool) |
| L2.5 Context Optimiser | In-process CompressionPipeline (classifier → dedup → log_crunch) | In-process CompressionPipeline (extensible with custom stages) |
| L3 Cloud Logic | Direct to OpenAI / Anthropic | Direct to OpenAI / Anthropic |
Adding a New Adapter
1. Define the struct in `src/adapters/cache.rs` or `src/adapters/router.rs`.
2. Implement the port trait (`ExactCache` or `SlmRouter`).
3. Add a variant to the config enum (`CacheBackend` or `RouterBackend`) in `src/config.rs`.
4. Wire it in `src/factory.rs` with a new `match` arm.
5. Write tests — each adapter module has a `#[cfg(test)] mod tests`.
No other files need to change. The middleware and pipeline code operate only on Arc<dyn ExactCache> / Arc<dyn SlmRouter>.
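The same ports-and-adapters shape can be sketched in Python, with a Protocol standing in for the Rust trait (names mirror the docs, but this is not Isartor's code):

```python
from typing import Optional, Protocol

# Sketch of the port/adapter pattern: the pipeline depends only on
# the port (Protocol); the factory is the one place that knows
# concrete adapters. Illustrative only, not Isartor's actual code.

class ExactCache(Protocol):                    # the "port"
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...

class InMemoryCache:                           # a concrete "adapter"
    def __init__(self) -> None:
        self._store: dict = {}
    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)
    def put(self, key: str, value: str) -> None:
        self._store[key] = value

def build_exact_cache(backend: str) -> ExactCache:
    # factory: selects the adapter from configuration at startup
    if backend == "memory":
        return InMemoryCache()
    raise ValueError(f"unknown cache backend: {backend}")

cache = build_exact_cache("memory")
cache.put("prompt", "response")
```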
Scalability Model (3-Tier)
Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. The same binary serves all three tiers; the runtime behaviour is entirely configuration-driven.
```
Level 1 (Edge)         Level 2 (Compose)       Level 3 (K8s)
┌─────────────────┐    ┌─────────────────┐     ┌─────────────────┐
│ Single Process  │    │ Firewall + GPU  │     │ N Firewall Pods │
│ memory cache    │──▶ │ Sidecar         │──▶  │ + Redis Cluster │
│ embedded candle │    │ memory cache    │     │ + vLLM Pool     │
│ context opt.    │    │ (optional)      │     │ (optional)      │
└─────────────────┘    └─────────────────┘     └─────────────────┘
```
Key insight: Switching to cache_backend=redis unlocks true multi-replica scaling. Without it, each firewall pod maintains an independent cache.
See the deployment guides for tier-specific setup.
Directory Layout
```
src/
├── core/
│   ├── mod.rs                  # Re-exports
│   ├── ports.rs                # Trait interfaces (ExactCache, SlmRouter)
│   └── context_compress.rs     # Re-export shim (backward compat)
├── adapters/
│   ├── mod.rs                  # Re-exports
│   ├── cache.rs                # InMemoryCache, RedisExactCache
│   └── router.rs               # EmbeddedCandleRouter, RemoteVllmRouter
├── compression/
│   ├── mod.rs                  # Re-exports all pipeline types
│   ├── pipeline.rs             # CompressionPipeline executor + CompressionStage trait
│   ├── cache.rs                # InstructionCache (per-session dedup state)
│   ├── optimize.rs             # Request body rewriting (JSON → pipeline → reassembly)
│   └── stages/
│       ├── content_classifier.rs  # Gate: instruction vs conversational
│       ├── dedup.rs               # Cross-turn instruction dedup
│       └── log_crunch.rs          # Static minification
├── middleware/
│   └── context_optimizer.rs    # L2.5 Axum middleware
├── factory.rs                  # build_exact_cache(), build_slm_router()
└── config.rs                   # CacheBackend, RouterBackend enums + AppConfig
```
See Also
- Deflection Stack — detailed layer-by-layer breakdown
- Architecture Decision Records — rationale behind key design choices
- Configuration Reference
The Deflection Stack
Every incoming request passes through a sequence of smart computing layers. Only prompts requiring genuine, complex reasoning survive the Deflection Stack to reach the cloud.
```
Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit               │ hit                │ simple           │ compressed               │
                 ▼                   ▼                    ▼                  ▼                          ▼
              Response            Response          Local Response     Optimised Prompt          Cloud Response
```
Layers at a Glance
| Layer | Algorithm / Mechanism | What It Does | Typical Latency |
|---|---|---|---|
| L1a — Exact Cache | Fast Hashing (ahash) | Sub-millisecond duplicate detection. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Cosine Similarity (Embeddings) | Computes mathematical meaning via pure-Rust candle models (all-MiniLM-L6-v2) to catch variations ("Price?" ≈ "Cost?"). | 1–5 ms |
| L2 — SLM Router | Neural Classification (LLM) | Triages intent using an embedded Small Language Model (e.g. Qwen-1.5B) to resolve simple data extraction tasks. | 50–200 ms |
| L2.5 — Context Optimiser | Instruction Dedup + Minify | Compresses repeated instruction files (CLAUDE.md, copilot-instructions.md) via session dedup and static minification to reduce cloud input tokens. | < 1 ms |
| L3 — Cloud Logic | Load Balancing & Retries | Routes surviving complex prompts to OpenAI, Anthropic, or Azure, with built-in fallback resilience. | Network-bound |
Layers 1a and 1b deflect 71% of repetitive agentic traffic (FAQ/agent loop patterns) and 38% of diverse task traffic before any neural inference runs.
Layer Details
L1a — Exact Cache
Algorithm: Fast hashing with ahash
L1a is the first line of defence. It computes a hash of the incoming prompt and checks it against an in-memory LRU cache (single-binary mode) or a shared Redis cluster (enterprise mode).
- Hit: Returns the cached response immediately (sub-millisecond).
- Miss: The request continues to L1b.
Cache keys are namespaced before hashing (native|prompt, openai|prompt, anthropic|prompt, etc.) to ensure one endpoint never returns another endpoint's response schema. On a cache hit, ChatResponse.layer is normalised to 1 regardless of which layer originally produced the response.
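The namespacing idea can be sketched as follows (Isartor hashes with ahash; sha256 stands in here, and the key format is illustrative):

```python
import hashlib

# Illustrative namespaced cache keys: the same prompt yields a
# different key per API surface, so one endpoint can never be served
# another endpoint's response schema. sha256 stands in for ahash.
def cache_key(namespace: str, prompt: str) -> str:
    return hashlib.sha256(f"{namespace}|{prompt}".encode()).hexdigest()

k_native = cache_key("native", "Explain BGP")
k_openai = cache_key("openai", "Explain BGP")
assert k_native != k_openai        # same prompt, separate namespaces
```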
| Mode | Implementation |
|---|---|
| Minimalist | In-memory LRU (ahash + parking_lot) |
| Enterprise | Redis cluster (shared across replicas, async redis crate) |
L1b — Semantic Cache
Algorithm: Cosine similarity over sentence embeddings (all-MiniLM-L6-v2)
L1b catches semantically equivalent prompts that differ in wording. A sentence embedding is computed for the incoming prompt using a pure-Rust candle BertModel, then compared against the vector cache using cosine similarity.
- Hit (similarity above threshold): Returns the cached response (1–5 ms).
- Miss: The request continues to L2.
Embedding pipeline:
- Model: `sentence-transformers/all-MiniLM-L6-v2` — 384-dimensional embeddings (~90 MB).
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: `BertModel` is wrapped in `std::sync::Mutex`; inference runs on `tokio::task::spawn_blocking`.
- Architecture: `TextEmbedder` is initialised once at startup, stored as `Arc<TextEmbedder>` in `AppState`.
The vector cache is maintained in tandem with exact cache entries. Insertions and evictions update the index automatically, providing sub-millisecond vector search latency for thousands of embeddings.
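The pooling and similarity arithmetic reduces to a few lines; here is a toy sketch with 3-dimensional vectors (the real model emits 384 dimensions, and the 0.85 threshold is illustrative):

```python
import math

# Toy sketch of the L1b math: mean pooling with an attention mask,
# L2 normalisation, then cosine similarity against cached vectors.
# Vectors are 3-d toys; real embeddings are 384-dimensional.

def mean_pool(token_vecs, mask):
    n = sum(mask)
    return [sum(v[i] for v, m in zip(token_vecs, mask) if m) / n
            for i in range(len(token_vecs[0]))]

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # plain dot product, since both inputs are already unit-length
    return sum(x * y for x, y in zip(a, b))

tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]                   # padding token excluded by the mask
emb = l2_normalise(mean_pool(tokens, mask))
hit = cosine(emb, emb) >= 0.85     # threshold check (illustrative value)
```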
| Mode | Implementation |
|---|---|
| Minimalist | In-process candle BertModel |
| Enterprise | External TEI sidecar (optional) |
L2 — SLM Router
Algorithm: Neural classification via Small Language Model
L2 runs a lightweight language model to classify the prompt's intent. Simple requests (data extraction, FAQ-style queries) can be resolved locally without reaching the cloud.
- Simple intent: Returns a locally generated response (50–200 ms).
- Complex intent: The request continues to L2.5.
- Disabled (`enable_slm_router = false`): Layer is a no-op; the request falls through to L3.
| Mode | Implementation |
|---|---|
| Minimalist | Embedded candle GGUF inference (e.g. Gemma-2-2B-IT, CPU) |
| Enterprise | Remote vLLM / TGI server (GPU pool) |
L2.5 — Context Optimiser
Algorithm: CompressionPipeline — Modular staged compression
Agentic coding tools (Copilot, Claude Code, Cursor) send large instruction files (CLAUDE.md, copilot-instructions.md, skills blocks) with every turn. L2.5 detects and compresses these payloads before they reach the cloud, saving input tokens on every L3 call.
Pipeline architecture (src/compression/):
L2.5 uses a modular CompressionPipeline with pluggable stages that execute in
order. Each stage is a stateless CompressionStage trait object. If a stage sets
short_circuit = true, subsequent stages are skipped.
Built-in stages (run in order):
- ContentClassifier — Gate stage: detects instruction vs conversational content. Short-circuits on conversational messages so downstream stages skip work.
- DedupStage — Session-aware cross-turn deduplication. Hashes instruction content per session; on repeat turns, replaces with a compact hash reference. Short-circuits on dedup hit.
- LogCrunchStage — Static minification: strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
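A minimal sketch of the stage contract and short-circuit flow (stage logic drastically simplified; the real implementations live in src/compression/stages/):

```python
import hashlib

# Drastically simplified sketch of the CompressionPipeline contract:
# stages run in order, and any stage may short-circuit the rest.
# Each stage returns (text, short_circuit).

def classifier(text, session):
    # Gate: toy heuristic — treat "#"-headed text as instructions,
    # short-circuit (skip downstream work) on conversational content
    if not text.lstrip().startswith("#"):
        return text, True
    return text, False

def dedup(text, session):
    # Session-aware dedup: replace repeated instructions with a hash ref
    h = hashlib.sha256(text.encode()).hexdigest()[:12]
    if h in session["seen"]:
        return f"[instructions unchanged: {h}]", True
    session["seen"].add(h)
    return text, False

def minify(text, session):
    # Static minification: drop blank lines and decorative rules
    lines = [l for l in text.splitlines() if l.strip() and l.strip() != "---"]
    return "\n".join(lines), False

def run(text, session, pipeline=(classifier, dedup, minify)):
    for stage in pipeline:
        text, short_circuit = stage(text, session)
        if short_circuit:
            break
    return text

session = {"seen": set()}
first = run("# CLAUDE.md\n---\n\nAlways write tests.", session)   # minified
repeat = run("# CLAUDE.md\n---\n\nAlways write tests.", session)  # dedup hit
```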
Adding custom stages:
Implement the CompressionStage trait and add your stage to the pipeline via
build_pipeline() in src/compression/optimize.rs.
Configuration:
| Variable | Default | Description |
|---|---|---|
| `ISARTOR__ENABLE_CONTEXT_OPTIMIZER` | `true` | Master switch for L2.5 |
| `ISARTOR__CONTEXT_OPTIMIZER_DEDUP` | `true` | Enable cross-turn instruction deduplication |
| `ISARTOR__CONTEXT_OPTIMIZER_MINIFY` | `true` | Enable static minification |
Observability:
- Instrumented as: `layer2_5_context_optimizer` span in distributed traces.
- Response header: `x-isartor-context-optimized: bytes_saved=<N>` on optimised requests.
- Span fields: `context.bytes_saved`, `context.strategy` (e.g. "classifier+dedup", "classifier+log_crunch").
| Mode | Implementation |
|---|---|
| Minimalist | In-process CompressionPipeline (classifier → dedup → log_crunch) |
| Enterprise | In-process CompressionPipeline (extensible with custom stages) |
L3 — Cloud Logic
Algorithm: Load balancing & retries
L3 is the final layer. Only the hardest prompts — those not resolved by cache, SLM, or context optimisation — reach the external cloud LLMs.
- Routes to OpenAI, Anthropic, Azure OpenAI, or xAI via rig-core.
- Built-in fallback resilience with load balancing and retries.
- Offline mode (`offline_mode = true`): Blocks L3 routing explicitly instead of silently pretending success.
- Stale fallback: On L3 failure, checks the namespaced exact-cache key first, then a legacy un-namespaced key for backward compatibility.
| Mode | Implementation |
|---|---|
| Minimalist | Direct to OpenAI / Anthropic |
| Enterprise | Direct to OpenAI / Anthropic |
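The stale-fallback lookup order can be sketched as a two-key probe (key shapes illustrative, matching the namespacing described for L1a):

```python
# Sketch of the stale-fallback order on an L3 failure: probe the
# namespaced key first, then the legacy un-namespaced key.
# Key formats are illustrative, not Isartor's wire format.
def stale_fallback(cache: dict, namespace: str, prompt: str):
    for key in (f"{namespace}|{prompt}", prompt):   # namespaced, then legacy
        if key in cache:
            return cache[key]
    return None

legacy_only = {"explain BGP": "legacy cached answer"}
stale_fallback(legacy_only, "openai", "explain BGP")  # legacy entry still served
```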
How Layers Interact
The deflection stack is implemented as Axum middleware plus a final handler. For authenticated routes, the execution order is:
1. Body buffer — `BufferedBody` stores the request body so multiple layers can read it.
2. Request-level monitoring — Observability instrumentation.
3. Auth — API key validation.
4. Layer 1 cache — L1a exact match, then L1b semantic match.
5. Layer 2 SLM triage — Intent classification and local response.
6. Layer 2.5 context optimiser — Instruction dedup + minification via `CompressionPipeline`.
7. Layer 3 handler — Cloud LLM fallback.
Implementation note: Axum middleware wraps inside-out — the last `.layer(...)` added runs first. The stack order in `src/main.rs` documents this explicitly and must be preserved.
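The wrapping rule is easiest to see with plain function composition (a generic illustration, not Axum):

```python
# Generic illustration of "the layer added last runs first":
# wrapping a handler in middleware composes inside-out, like
# chained .layer(...) calls. Not Isartor code.
def make_layer(name, order):
    def layer(inner):
        def wrapped(req):
            order.append(name)     # record execution order
            return inner(req)
        return wrapped
    return layer

order = []
handler = lambda req: "response"
# add "auth" first, then "cache" — like .layer(auth) then .layer(cache)
for mw in (make_layer("auth", order), make_layer("cache", order)):
    handler = mw(handler)

result = handler("req")
# "cache", the layer added last, executed before "auth"
```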
Public health routes (/health, /healthz) intentionally bypass the deflection stack. The authenticated routes are /api/chat, /api/v1/chat, /v1/chat/completions, and /v1/messages.
See Also
- Architecture — high-level system design and pluggable providers
- Architecture Decision Records — rationale behind the deflection stack design (ADR-001)
- Configuration Reference
Architecture Decision Records
Key design decisions, trade-offs, and rationale behind Isartor's architecture.
Each ADR follows a lightweight format: Context → Decision → Consequences.
ADR-001: Multi-Layer Deflection Stack Architecture
Date: 2024 · Status: Accepted
Context
AI Prompt Firewall traffic follows a power-law distribution: the majority of prompts are simple or repetitive, while only a small fraction requires expensive cloud LLMs. Sending all traffic to a single provider wastes tokens and money.
Decision
Implement a sequential Deflection Stack with 4+ layers, each capable of short-circuiting:
- Layer 0 — Operational defense (auth, rate limiting, concurrency control)
- Layer 1 — Semantic + exact cache (zero-cost hits)
- Layer 2 — Local SLM triage (classify intent, execute simple tasks locally)
- Layer 2.5 — Context optimiser (retrieve + rerank to minimise token usage)
- Layer 3 — Cloud LLM fallback (only the hardest prompts)
Layer 2.5 (Context Optimiser):
Retrieves and reranks candidate documents or responses to minimize downstream token usage. Typically implements top-K selection, reranking, or context window optimization before forwarding to the LLM. Instrumented as the context_optimise span in observability.
Consequences
- Positive: 60–80% of traffic can be resolved before Layer 3, dramatically reducing cost.
- Positive: Each layer adds latency only when needed — cache hits are sub-millisecond.
- Positive: Clear separation of concerns; each layer is independently testable.
- Negative: Deflection Stack adds conceptual complexity vs. a simple reverse proxy.
- Negative: Each layer needs its own error handling and timeout strategy.
ADR-002: Axum + Tokio as Runtime Foundation
Date: 2024 · Status: Accepted
Context
The firewall must handle high concurrency (thousands of simultaneous connections) with low latency overhead. The binary should be small, statically linked, and deployable to minimal environments.
Decision
Use Axum 0.8 on Tokio 1.x for the async HTTP server. Build with --target x86_64-unknown-linux-musl and opt-level = "z" + LTO for a ~5 MB static binary.
Consequences
- Positive: Tokio's work-stealing scheduler handles 10K+ concurrent connections efficiently.
- Positive: Axum's type-safe extractors catch errors at compile time.
- Positive: Static musl binary runs in distroless containers (no libc, no shell).
- Negative: Rust's compilation times are longer than Go/Node.js equivalents.
- Negative: Ecosystem is smaller — fewer off-the-shelf middleware components.
ADR-003: Embedded Candle Classifier (Layer 2)
Date: 2024 · Status: Accepted
Context
For minimal deployments (edge, VPS, air-gapped), requiring an external sidecar (llama.cpp, Ollama, TGI) adds operational complexity. Many classification tasks can be handled by a 2B parameter model on CPU.
Decision
Embed a Gemma-2-2B-IT GGUF model directly in the Rust process using the candle framework. The model is loaded on first start via hf-hub (auto-downloaded from Hugging Face) and wrapped in a tokio::sync::Mutex for thread-safe inference on spawn_blocking.
Consequences
- Positive: Zero external dependencies for Layer 2 classification — a single binary handles everything.
- Positive: No HTTP overhead for classification calls; inference is an in-process function call.
- Positive: Works in air-gapped environments with pre-cached models.
- Negative: ~1.5 GB memory overhead for the Q4_K_M model weights.
- Negative: CPU inference is slower than GPU (50–200 ms classification, 200–2000 ms generation).
- Negative: `Mutex` serialises inference calls — throughput is limited to one inference at a time.
- Trade-off: For higher throughput, upgrade to Level 2 (llama.cpp sidecar on GPU).
ADR-004: Three Deployment Tiers
Date: 2024 · Status: Accepted
Context
Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. A single deployment model cannot serve all use cases optimally.
Decision
Define three explicit deployment tiers that share the same binary and configuration surface:
| Tier | Strategy | Target |
|---|---|---|
| Level 1 | Monolithic binary, embedded candle | VPS, edge, bare metal |
| Level 2 | Firewall + llama.cpp sidecars | Docker Compose, single host + GPU |
| Level 3 | Stateless pods + inference pools | Kubernetes, Helm, HPA |
The tier is selected purely by environment variables and infrastructure, not by code changes.
Consequences
- Positive: A single codebase and binary serves all deployment scenarios.
- Positive: Users start at Level 1 and upgrade incrementally — no migrations.
- Positive: Clear documentation entry points for each tier.
- Negative: Some config variables are irrelevant at certain tiers (e.g., `ISARTOR__LAYER2__SIDECAR_URL` is unused at Level 1 with embedded candle).
- Negative: Testing all three tiers requires different infrastructure setups.
ADR-005: llama.cpp as Sidecar (Level 2) Instead of Ollama
Date: 2024 · Status: Accepted
Context
The original design used Ollama (~1.5 GB image) as the local SLM engine. While Ollama has a convenient API and model management, it's heavyweight for a sidecar.
Decision
Replace Ollama with llama.cpp server (ghcr.io/ggml-org/llama.cpp:server, ~30 MB) as the default sidecar in docker-compose.sidecar.yml. Two instances run side by side:
- slm-generation (port 8081) — Phi-3-mini for classification and generation
- slm-embedding (port 8082) — all-MiniLM-L6-v2 with the `--embedding` flag
Consequences
- Positive: 50× smaller container images (30 MB vs. 1.5 GB).
- Positive: Faster cold starts; no model pull step needed (uses `--hf-repo` auto-download).
- Positive: OpenAI-compatible API — firewall code doesn't need to change.
- Negative: Ollama's model management UX (pull, list, delete) is lost.
- Negative: Each model needs its own llama.cpp instance (no multi-model serving).
- Migration: Ollama-based Compose files (`docker-compose.yml`, `docker-compose.azure.yml`) are retained for backward compatibility.
- Update (ADR-011): The slm-embedding sidecar (port 8082) is now optional. Layer 1 semantic cache embeddings are generated in-process via candle (pure-Rust BertModel).
ADR-006: rig-core for Multi-Provider LLM Client
Date: 2024 · Status: Accepted
Context
Layer 3 must route to multiple cloud LLM providers (OpenAI, Azure OpenAI, Anthropic, xAI). Implementing each provider's API client from scratch would be error-prone and hard to maintain.
Decision
Use rig-core (v0.32.0) as the unified LLM client. Rig provides a consistent CompletionModel abstraction over all supported providers.
Consequences
- Positive: Single configuration surface (`ISARTOR__LLM_PROVIDER` + `ISARTOR__EXTERNAL_LLM_API_KEY`) switches providers.
- Positive: Provider-specific quirks (Azure deployment IDs, Anthropic versioning) handled by rig.
- Negative: Adds a dependency; rig's release cadence may not match our needs.
- Negative: Limited to providers rig supports (but covers all major ones).
ADR-007: AIMD Adaptive Concurrency Control
Date: 2024 · Status: Accepted
Context
A fixed concurrency limit either over-provisions (wasting resources) or under-provisions (rejecting requests during traffic spikes). The firewall needs to dynamically adjust its limit based on real-time latency.
Decision
Implement an Additive Increase / Multiplicative Decrease (AIMD) concurrency limiter at Layer 0:
- If P95 latency < target → `limit += 1` (additive increase).
- If P95 latency > target → `limit *= 0.5` (multiplicative decrease).
- Bounded by configurable min/max concurrency limits.
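The control rule reduces to a few lines (the target and bounds here are illustrative, not Isartor's defaults):

```python
# Sketch of the AIMD rule: additive increase while latency is good,
# multiplicative decrease when it degrades, clamped to bounds.
# Target/bounds are illustrative values.
def aimd_step(limit, p95_ms, target_ms=250, lo=1, hi=512):
    if p95_ms < target_ms:
        limit += 1                     # additive increase
    elif p95_ms > target_ms:
        limit = int(limit * 0.5)       # multiplicative decrease
    return max(lo, min(hi, limit))

limit = 100
limit = aimd_step(limit, p95_ms=120)   # fast turn: limit grows by one
limit = aimd_step(limit, p95_ms=900)   # slow turn: limit halves
```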
Consequences
- Positive: Self-tuning: the limit converges to the optimal value for the current load.
- Positive: Protects downstream services (sidecars, cloud LLMs) from overload.
- Negative: During cold start, the limit starts low and ramps up — initial requests may see 503s.
- Tuning: Target latency must be calibrated per deployment tier.
ADR-008: Unified API Surface
Date: 2024 · Status: Superseded
Context
The original design maintained two API versions: a v1 middleware-based pipeline (/api/chat) and a v2 orchestrator-based pipeline (/api/v2/chat). Maintaining two code paths increased complexity with no clear benefit once the middleware pipeline matured.
Decision
Consolidate into a single endpoint:
- `/api/chat` — Middleware-based pipeline. Each layer is an Axum middleware (auth → cache → SLM triage → handler).
- The v2 endpoint (`/api/v2/chat`) and its `pipeline_*` configuration fields have been removed.
- Orchestrator and trait-based pipeline components remain in `src/pipeline/` for potential future reintegration.
Consequences
- Positive: Single code path to maintain, test, and observe.
- Positive: Simplified configuration surface — no more
PIPELINE_*env vars. - Positive: Eliminates user confusion about which endpoint to use.
- Negative: Orchestrator-based features (structured `processing_log`, explicit `PipelineContext`) are not exposed until reintegrated.
ADR-009: Distroless Container Image
Date: 2024 · Status: Accepted
Context
The firewall binary is statically linked (musl). The runtime container only needs to execute a single binary.
Decision
Use gcr.io/distroless/static-debian12 as the runtime base image. It contains no shell, no package manager, no libc — only the static binary.
Consequences
- Positive: Minimal attack surface — no shell to exec into, no tools for attackers.
- Positive: Tiny image size (base ~2 MB + binary ~5 MB = ~7 MB total).
- Positive: Passes most container security scanners with zero CVEs.
- Negative: Cannot `docker exec` into the container for debugging (no shell).
- Negative: Cannot install additional tools at runtime.
- Workaround: Use `docker logs`, Jaeger traces, and Prometheus metrics for debugging.
ADR-010: OpenTelemetry for Observability
Date: 2024 · Status: Accepted
Context
The firewall needs distributed tracing and metrics. Vendor-specific SDKs (Datadog, New Relic, etc.) create lock-in.
Decision
Use OpenTelemetry (OTLP gRPC) as the sole telemetry interface. Traces and metrics are exported to an OTel Collector, which can forward to any backend (Jaeger, Prometheus, Grafana, Datadog, etc.).
Consequences
- Positive: Vendor-neutral — switch backends by reconfiguring the collector, not the app.
- Positive: OTLP is a CNCF standard with wide ecosystem support.
- Positive: When `ISARTOR__ENABLE_MONITORING=false`, no OTel SDK is initialised — zero overhead.
- Negative: Requires an OTel Collector as middleware (adds one more service in Level 2/3).
- Negative: Auto-instrumentation is less mature in Rust than in Java/Python.
ADR-011: Pure-Rust Candle for In-Process Sentence Embeddings
| Status | Accepted (superseded: fastembed → candle) |
| Date | 2025-06 (updated 2025-07) |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-005 (llama.cpp sidecar) |
Context
Layer 1 (semantic cache) must generate sentence embeddings for every incoming prompt to compute cosine similarity against the vector cache. Previously, this was done via fastembed (ONNX Runtime, BAAI/bge-small-en-v1.5), which introduced a C++ dependency (onnxruntime-sys) that broke cross-compilation on ARM64 macOS and complicated the build matrix.
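The hot-path computation here is one cosine similarity per cached entry. A minimal sketch in plain Rust (the function name is illustrative; the real code compares 384-dimensional MiniLM embeddings):

```rust
/// Cosine similarity between two embedding vectors, as computed on the
/// L1b hot path. Illustrative sketch, not Isartor's actual source.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Identical vectors score 1.0; orthogonal vectors score 0.0.
    let v = [0.6f32, 0.8];
    assert!((cosine_similarity(&v, &v) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```

Since the embedder L2-normalises its outputs, the division by the norms is a no-op in practice and the comparison reduces to a dot product.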
Decision
Use candle (candle-core, candle-nn, candle-transformers 0.9) with hf-hub and tokenizers to run sentence-transformers/all-MiniLM-L6-v2 in-process via a pure-Rust BertModel. The model weights (~90 MB) are downloaded once from Hugging Face Hub on first startup and cached in ~/.cache/huggingface/. Inference is invoked through tokio::task::spawn_blocking since BERT forward passes are CPU-bound.
- Model: sentence-transformers/all-MiniLM-L6-v2 — 384-dimensional embeddings, optimised for sentence similarity.
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies, seamless cross-compilation to any `rustc` target.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: The inner `BertModel` is wrapped in `std::sync::Mutex` because `forward()` takes `&mut self`. This is acceptable because inference is always called from `spawn_blocking`, never holding the lock across `.await` points.
- Architecture: `TextEmbedder` is initialised once at startup, stored as `Arc<TextEmbedder>` in `AppState`, and injected into the cache middleware.
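The pooling step above can be sketched in plain Rust. The real implementation operates on candle tensors; the nested-`Vec` shapes and the function name below are illustrative only:

```rust
// Mean pooling over token embeddings with an attention mask, followed by
// L2 normalisation — the scheme ADR-011 describes. Illustrative sketch.
fn mean_pool_l2(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in token_embeddings.iter().zip(attention_mask) {
        if m == 1 {
            // Only non-padding tokens contribute to the mean.
            for (p, t) in pooled.iter_mut().zip(tok) {
                *p += *t;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count;
    }
    // L2-normalise so downstream cosine similarity is a plain dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    pooled.iter().map(|x| x / norm).collect()
}

fn main() {
    // Two real tokens plus one padding token (mask 0) that must be ignored.
    let toks = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![9.0, 9.0]];
    let out = mean_pool_l2(&toks, &[1, 1, 0]);
    let norm: f32 = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((norm - 1.0).abs() < 1e-5);
}
```

Because the output is unit-length, the semantic cache's similarity check needs no per-query renormalisation.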
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| fastembed (ONNX Runtime) | C++ dependency (onnxruntime-sys) breaks ARM64 cross-compilation; ~5 MB shared library |
| llama.cpp sidecar (all-MiniLM-L6-v2) | Network round-trip on hot path, extra container to manage |
| sentence-transformers (Python) | Crosses FFI boundary, adds Python runtime dependency |
| ort (raw ONNX Runtime bindings) | Same C++ dependency problem as fastembed |
Consequences
- Positive: Eliminates ~2–5 ms network latency per embedding call on the cache hot path.
- Positive: Zero C/C++ dependencies — `cargo build` works on any platform without cmake or pre-built binaries.
- Positive: Zero sidecar dependency for Level 1 — the minimal Dockerfile runs self-contained.
- Positive: Model weights are auto-downloaded from Hugging Face Hub; reproducible builds.
- Negative: First startup downloads model weights (~90 MB) if not pre-cached.
- Negative: `Mutex` serialises concurrent embedding calls within a single process (acceptable at current scale; can be replaced with a pool of models if needed).
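The locking discipline behind that trade-off can be sketched with plain threads, `std::thread` standing in for `tokio::task::spawn_blocking` and `FakeModel` standing in for `BertModel` (both stand-ins are assumptions, not Isartor's code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// forward() takes &mut self, so callers must hold the Mutex for exactly
// one synchronous inference — never across an await point.
struct FakeModel {
    calls: usize,
}

impl FakeModel {
    fn forward(&mut self, input: f32) -> f32 {
        self.calls += 1; // mutation is why the Mutex exists
        input * 2.0
    }
}

// Run n concurrent "inference" calls against one shared model.
fn run_concurrent(n: usize) -> usize {
    let model = Arc::new(Mutex::new(FakeModel { calls: 0 }));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let m = Arc::clone(&model);
            // Lock scope is confined to the blocking forward pass.
            thread::spawn(move || m.lock().unwrap().forward(i as f32))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let calls = model.lock().unwrap().calls;
    calls
}

fn main() {
    // All four calls complete; the Mutex serialises them.
    assert_eq!(run_concurrent(4), 4);
}
```

Swapping the single `Mutex` for a pool of models is the escape hatch the ADR mentions if serialisation ever becomes the bottleneck.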
ADR-012: Pluggable Trait Provider (Hexagonal Architecture)
| Status | Accepted |
| Date | 2025-06 |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-004 (Three Deployment Tiers) |
Context
As Isartor grew from a single-process binary (Level 1) to a multi-tier deployment (Level 1 → 2 → 3), the cache and SLM router components became tightly coupled to their in-process implementations. Scaling to Level 3 (Kubernetes, multiple replicas) requires:
- Shared cache — in-process LRU caches are isolated per pod; cache hits are inconsistent, duplicating work.
- GPU-backed inference — in-process Candle inference is CPU-bound; Level 3 needs a dedicated GPU inference pool (vLLM / TGI) that can scale independently.
Hard-coding these choices into the firewall binary would require compile-time feature flags or code branching, making the binary non-portable across tiers.
Decision
Adopt the Ports & Adapters (Hexagonal Architecture) pattern:
- Ports (`src/core/ports.rs`) — Define `ExactCache` and `SlmRouter` as `async_trait` traits (`Send + Sync`), representing the interfaces the firewall depends on.
- Adapters (`src/adapters/`) — Provide concrete implementations: `InMemoryCache` (ahash + LRU + parking_lot) and `RedisExactCache` for `ExactCache`; `EmbeddedCandleRouter` and `RemoteVllmRouter` for `SlmRouter`.
- Factory (`src/factory.rs`) — `build_exact_cache(&config)` and `build_slm_router(&config, &http_client)` read `AppConfig.cache_backend` and `AppConfig.router_backend` at startup and return the appropriate `Box<dyn Trait>`.
- Configuration (`src/config.rs`) — `CacheBackend` enum (Memory | Redis) and `RouterBackend` enum (Embedded | Vllm) with associated connection URLs, selectable via `ISARTOR__CACHE_BACKEND` and `ISARTOR__ROUTER_BACKEND` env vars.
The same binary serves all three deployment tiers; the runtime behaviour is entirely configuration-driven.
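A compressed, synchronous sketch of that wiring (the real traits are async and include Redis/vLLM adapters; every name below mirrors the ADR but is illustrative, not the actual source):

```rust
use std::collections::HashMap;

// Port: the interface the firewall depends on.
trait ExactCache: Send + Sync {
    fn get(&mut self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, value: &str);
}

// Adapter: one concrete implementation behind the port.
struct InMemoryCache {
    map: HashMap<String, String>,
}

impl ExactCache for InMemoryCache {
    fn get(&mut self, key: &str) -> Option<String> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: &str) {
        self.map.insert(key.to_string(), value.to_string());
    }
}

// Configuration drives the choice; a Redis variant would carry its URL.
enum CacheBackend {
    Memory,
}

// Factory: resolve config to a trait object once, at startup.
fn build_exact_cache(backend: &CacheBackend) -> Box<dyn ExactCache> {
    match backend {
        CacheBackend::Memory => Box::new(InMemoryCache { map: HashMap::new() }),
    }
}

fn main() {
    let mut cache = build_exact_cache(&CacheBackend::Memory);
    cache.put("2 + 2?", "4");
    assert_eq!(cache.get("2 + 2?"), Some("4".to_string()));
    assert_eq!(cache.get("unseen"), None);
}
```

The rest of the binary only ever sees `Box<dyn ExactCache>`, which is what lets one artifact serve every tier.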
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| Compile-time feature flags (#[cfg(feature = "redis")]) | Produces different binaries per tier; complicates CI and container builds |
| Service mesh sidecar (Envoy filter for caching) | Adds infrastructure complexity; cache logic is domain-specific |
| Plugin system (dynamic .so loading) | Over-engineered; dyn Trait with compile-time-known variants is simpler |
| Runtime scripting (Lua / Wasm policy) | Unnecessary indirection; Rust trait dispatch is zero-cost |
Consequences
- Positive: One binary, all tiers — only env vars change between Level 1 (embedded everything) and Level 3 (Redis + vLLM).
- Positive: Horizontal scalability — with `cache_backend=redis`, all pods share the same cache; with `router_backend=vllm`, GPU inference scales independently.
- Positive: Testability — unit tests inject mock adapters via the trait interface.
- Positive: Extensibility — adding a new backend (e.g., Memcached, Triton) requires only a new adapter implementing the trait.
- Negative: Minor runtime overhead from `dyn Trait` dynamic dispatch (single vtable lookup per call — negligible vs. network I/O).
- Negative: `EmbeddedCandleRouter` remains a skeleton; full candle-based classification requires the `embedded-inference` feature flag to be completed.
← Back to Architecture
AI Tool Integrations
Isartor is an OpenAI-compatible and Anthropic-compatible gateway that deflects repeated or simple prompts at Layer 1 (cache) and Layer 2 (local SLM) before they reach the cloud. Clients integrate by overriding their base URL to point at Isartor or by registering Isartor as an MCP server — no proxy, no MITM, no CA certificates.
Endpoints
Isartor's server defaults to: http://localhost:8080.
Authenticated chat endpoints:
| Endpoint | Protocol | Path |
|---|---|---|
| Native Isartor (recommended for direct use) | Native | POST /api/chat / POST /api/v1/chat |
| OpenAI Models | OpenAI | GET /v1/models |
| OpenAI Chat Completions | OpenAI | POST /v1/chat/completions |
| Anthropic Messages | Anthropic | POST /v1/messages |
| Cache lookup / store (used by MCP clients) | Native | POST /api/v1/cache/lookup / POST /api/v1/cache/store |
Authentication
Isartor can enforce a gateway key on authenticated routes when Layer 0 auth is enabled.
Supported headers:
- `X-API-Key: <gateway_api_key>`
- `Authorization: Bearer <gateway_api_key>` (useful for OpenAI/Anthropic-compatible clients)
By default, gateway_api_key is empty and auth is disabled (local-first). To
enable gateway authentication, set ISARTOR__GATEWAY_API_KEY to a secret value. In
production, always set a strong key.
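The resulting gate can be sketched as a single predicate. This is an illustrative assumption about the header handling, not Isartor's actual source:

```rust
// Layer 0 gateway-key check: accept either X-API-Key or a Bearer token,
// and treat an empty configured key as "auth disabled" (local-first).
fn is_authorized(
    gateway_api_key: &str,
    x_api_key: Option<&str>,
    authorization: Option<&str>,
) -> bool {
    if gateway_api_key.is_empty() {
        return true; // default: no key configured, auth off
    }
    if x_api_key == Some(gateway_api_key) {
        return true;
    }
    // Strip the "Bearer " prefix and compare the remaining token.
    match authorization.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(token) => token == gateway_api_key,
        None => false,
    }
}

fn main() {
    assert!(is_authorized("", None, None)); // empty key = auth disabled
    assert!(is_authorized("s3cret", Some("s3cret"), None));
    assert!(is_authorized("s3cret", None, Some("Bearer s3cret")));
    assert!(!is_authorized("s3cret", None, Some("Bearer wrong")));
}
```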
Observability headers
All endpoints in the Deflection Stack include:
- `X-Isartor-Layer`: `l1a` | `l1b` | `l2` | `l3` | `l0`
- `X-Isartor-Deflected: true` if resolved locally (no cloud call)
Example: OpenAI-compatible request
curl -sS http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "2 + 2?"}
]
}'
If gateway auth is enabled, also add:
-H 'Authorization: Bearer your-secret-key'
Many OpenAI-compatible SDKs and coding agents also call:
curl -sS http://localhost:8080/v1/models
OpenAI-compatible agent features supported by Isartor:
- `GET /v1/models` for model discovery
- `stream: true` on `/v1/chat/completions` with OpenAI-style SSE and `data: [DONE]`
- `tools`, `tool_choice`, `functions`, and `function_call` passthrough
- `tool_calls` preserved in provider responses
- tool-aware exact cache keys, with semantic cache skipped for tool-use flows
Example: Anthropic-compatible request
curl -sS http://localhost:8080/v1/messages \
-H 'Content-Type: application/json' \
-d '{
"model": "claude-sonnet-4-6",
"system": "Be concise.",
"max_tokens": 100,
"messages": [
{
"role": "user",
"content": [{"type": "text", "text": "What is the capital of France?"}]
}
]
}'
If gateway auth is enabled, also add:
-H 'X-API-Key: your-secret-key'
Supported tools at a glance
| Tool | Command | Mechanism |
|---|---|---|
| GitHub Copilot CLI | isartor connect copilot | MCP server (cache-only) |
| GitHub Copilot in VS Code | isartor connect copilot-vscode | Managed settings.json debug overrides |
| OpenClaw | isartor connect openclaw | Managed OpenClaw provider config (openclaw.json) |
| OpenCode | isartor connect opencode | Global provider + auth config |
| Claude Code + GitHub Copilot | isartor connect claude-copilot | Claude base URL override + Copilot-backed L3 |
| Claude Code | isartor connect claude | Base URL override |
| Claude Desktop | isartor connect claude-desktop | Managed local MCP registration (isartor mcp) |
| Cursor IDE | isartor connect cursor | Base URL override + MCP |
| OpenAI Codex CLI | isartor connect codex | Base URL override |
| Gemini CLI | isartor connect gemini | Base URL override |
| Antigravity | isartor connect antigravity | Base URL override |
| Generic / other tools | isartor connect generic | Base URL override |
Add --gateway-api-key <key> to any connect command only if you have explicitly
enabled gateway auth.
Connection status
# Check all connected clients
isartor connect status
Global troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "connection refused" | Isartor not running | Run isartor up first |
| Gateway returns 401 | Auth enabled but key not configured | Add --gateway-api-key to connect command |
For tool-specific troubleshooting, see each integration page above.
GitHub Copilot CLI
Copilot CLI integrates via an MCP (Model Context Protocol) server that
Isartor registers as a stdio subprocess. Isartor also exposes the same MCP
tools over Streamable HTTP at http://localhost:8080/mcp/ for editors and
web agents that prefer HTTP/SSE transport. Both transports expose two tools:
- `isartor_chat` — cache lookup only. Returns the cached answer on hit (L1a exact or L1b semantic), or an empty string on miss. On a miss, Copilot uses its own LLM to answer — Isartor never routes through its configured L3 provider for Copilot traffic.
- `isartor_cache_store` — stores a prompt/response pair in Isartor's cache so future identical or similar prompts are deflected locally.
This design means Copilot still owns the conversation loop, while Isartor acts as a transparent cache layer that reduces redundant cloud calls. On a cache hit, Isartor returns the cached text and does not call its own Layer 3 provider. Copilot CLI may still emit its normal final-answer event after the tool result, but that is a Copilot-side render step rather than an Isartor L3 forward.
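The division of labour between the two tools can be sketched in a few lines. The `HashMap` stands in for Isartor's real L1a/L1b caches, and `answer` models the Copilot-side loop; only the two tool names come from the docs above:

```rust
use std::collections::HashMap;

struct IsartorCache {
    entries: HashMap<String, String>,
}

impl IsartorCache {
    // Tool 1: cache lookup only. Empty string signals a miss.
    fn isartor_chat(&self, prompt: &str) -> String {
        self.entries.get(prompt).cloned().unwrap_or_default()
    }
    // Tool 2: write-back after Copilot's own model has answered.
    fn isartor_cache_store(&mut self, prompt: &str, response: &str) {
        self.entries.insert(prompt.to_string(), response.to_string());
    }
}

// The Copilot-side flow: lookup first, own-model fallback, then store.
fn answer(cache: &mut IsartorCache, prompt: &str) -> String {
    let hit = cache.isartor_chat(prompt);
    if !hit.is_empty() {
        return hit; // deflected locally, zero cloud calls
    }
    let response = format!("model answer to: {prompt}"); // Copilot's own LLM
    cache.isartor_cache_store(prompt, &response);
    response
}

fn main() {
    let mut cache = IsartorCache { entries: HashMap::new() };
    let first = answer(&mut cache, "capital of France");
    let second = answer(&mut cache, "capital of France");
    assert_eq!(first, second); // the second turn is a cache hit
}
```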
Prerequisites
- Isartor installed (`curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh`)
- GitHub Copilot CLI installed
Step-by-step setup
# 1. Start Isartor
isartor up --detach
# 2. Register the MCP server with Copilot CLI
isartor connect copilot
# 3. Start Copilot normally — plain chat prompts will use Isartor cache first
copilot
How it works
- `isartor connect copilot` adds an `isartor` entry to `~/.copilot/mcp-config.json`
- `isartor connect copilot` also installs a managed instruction block in `~/.copilot/copilot-instructions.md`
- When Copilot CLI starts, it launches `isartor mcp` as a stdio subprocess and loads the Isartor instruction block
- The MCP server exposes `isartor_chat` (cache lookup) and `isartor_cache_store` (cache write)
- For plain conversational prompts, Copilot now prefers this flow:
  - Call `isartor_chat` with the user's prompt
  - Cache hit: return the cached answer immediately, verbatim
  - Cache miss: answer with Copilot's own model, then call `isartor_cache_store`
- When Copilot calls `isartor_chat`:
  - Cache hit (L1a exact or L1b semantic): returns the cached answer instantly
  - Cache miss: returns empty → Copilot uses its own LLM
- After Copilot gets an answer from its LLM, it can call `isartor_cache_store` to populate the cache for future requests
HTTP/SSE MCP endpoint
Isartor now exposes the same MCP tool surface at /mcp/ using Streamable HTTP:
- `POST /mcp/` — client → server JSON-RPC
- `GET /mcp/` — server → client SSE stream
- `DELETE /mcp/` — explicit session teardown
The HTTP transport uses the MCP Mcp-Session-Id header after initialize, and
supports both JSON responses and SSE responses for POST requests. A minimal
editor config looks like:
{"servers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}
Important note about "still going to L3"
If you inspect Copilot CLI JSON traces, you may still see a normal
final_answer event after isartor_chat returns a cache hit. That does not
mean Isartor forwarded the prompt to its own Layer 3 provider. The important
signal is Isartor's own log and headers:
- `Cache lookup: L1a exact hit` or `Cache lookup: L1b semantic hit`
- no new `Layer 3: Forwarding to LLM via Rig` entry for that prompt
In other words:
- Isartor L3 call = bad for a cache hit
- Copilot final-answer render after a tool hit = expected CLI behavior
Isartor now installs stricter Copilot instructions that tell Copilot to emit the cached tool result verbatim on cache hits, without paraphrasing or extra tool calls.
Cache endpoints (used by MCP internally)
The MCP server calls these HTTP endpoints on the Isartor gateway:
# Cache lookup — returns cached response or 204 No Content
curl -X POST http://localhost:8080/api/v1/cache/lookup \
-H "Content-Type: application/json" \
-d '{"prompt": "capital of France"}'
# Cache store — saves a prompt/response pair
curl -X POST http://localhost:8080/api/v1/cache/store \
-H "Content-Type: application/json" \
-d '{"prompt": "capital of France", "response": "The capital of France is Paris."}'
Custom gateway URL
# If Isartor runs on a non-default port
isartor connect copilot --gateway-url http://localhost:18080
Disconnecting
isartor connect copilot --disconnect
This removes the isartor entry from ~/.copilot/mcp-config.json.
It also removes the managed Isartor block from ~/.copilot/copilot-instructions.md.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Copilot has no isartor_chat tool | MCP server not registered | Run isartor connect copilot |
| Copilot works but bypasses cache | Isartor instructions not installed or custom instructions disabled | Run isartor connect copilot again and do not launch Copilot with --no-custom-instructions |
| Cache never hits for Copilot | Responses not stored after LLM answers | Ask Copilot to call isartor_cache_store after answering |
GitHub Copilot in VS Code
Route GitHub Copilot's code completions and chat requests in VS Code through
Isartor, so repetitive prompts are deflected locally via the L1a/L1b cache
layers. This reduces cloud API calls, lowers latency for repeated patterns,
and gives you per-tool visibility in isartor stats.
How is this different from Copilot CLI? The Copilot CLI integration uses an MCP server for the terminal-based `copilot` command. This page covers VS Code — the editor extension that provides inline completions and Copilot Chat.
Prerequisites
- Isartor installed and running (`isartor up --detach`)
- GitHub Copilot VS Code extension installed (requires a Copilot subscription)
- An LLM provider API key configured in Isartor for Layer 3 fallback (`isartor set-key -p openai` or similar)
Step 1 — Start Isartor
# Install (if not already)
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
# Configure your LLM provider key (OpenAI, Anthropic, Azure, etc.)
isartor set-key -p openai
# Start the gateway in the background
isartor up --detach
Verify it's running:
curl http://localhost:8080/health
# {"status":"ok", ...}
Step 2 — Configure VS Code
Recommended:
isartor connect copilot-vscode
This command:
- auto-detects the VS Code `settings.json` path on macOS, Linux, and Windows
- backs up the original file to `settings.json.isartor-backup`
- writes the three `github.copilot.advanced.debug.*` overrides
- refuses to write if Isartor is not reachable
Manual alternative: open your VS Code User Settings (JSON) and add:
{
"github.copilot.advanced": {
"debug.overrideProxyUrl": "http://localhost:8080",
"debug.overrideCAPIUrl": "http://localhost:8080/v1",
"debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
}
}
| Setting | What It Does |
|---|---|
| debug.overrideProxyUrl | Routes Copilot's main API traffic through Isartor |
| debug.overrideCAPIUrl | Overrides the completions API endpoint (inline suggestions) |
| debug.chatOverrideProxyUrl | Overrides the Copilot Chat endpoint |
Custom port? If Isartor runs on a different port, replace `8080` with your port everywhere above.
Step 3 — Restart VS Code
Close and reopen VS Code (or run "Developer: Reload Window" from the command palette). Copilot will now route requests through Isartor.
Step 4 — Verify
Open any code file and trigger a Copilot suggestion (start typing a comment or function). Then check Isartor's stats:
isartor stats
You should see requests flowing through Isartor's layers. Repeat the same prompt and you'll see L1a cache hits — Isartor deflected the duplicate without a cloud call.
For per-tool breakdown:
isartor stats --by-tool
Copilot VS Code traffic appears as copilot in the tool column (identified
from the User-Agent header). The table now includes requests, cache
hits/misses, average latency, retries, errors, and L1a/L1b safety.
How It Works
VS Code Copilot Extension
│
▼ (HTTP request to overrideProxyUrl)
┌─────────────┐
│ Isartor │
│ Gateway │
│ │
│ L1a ──► L1b ──► L3 (Cloud)
│ hit? hit? forward
└─────────────┘
│
▼
Response back to VS Code
- Copilot sends completion/chat requests to Isartor instead of GitHub's servers
- L1a Exact Cache — sub-millisecond hit for identical prompts (< 1 ms)
- L1b Semantic Cache — catches variations of the same prompt (1–5 ms)
- L3 Cloud — only genuinely new prompts reach your configured LLM provider
- Response flows back to Copilot transparently — no change to the editor UX
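The cascade in the diagram can be sketched as a single fall-through function. Everything here is a stand-in (literal match for L1a, a toy case-insensitive compare for L1b), so only the control flow is real:

```rust
// Which layer resolved the request, mirroring X-Isartor-Layer.
#[derive(Debug, PartialEq)]
enum Layer {
    L1a,
    L1b,
    L3,
}

// Try L1a exact, then L1b "semantic" (toy stand-in), else forward to L3.
fn deflect(prompt: &str, cached: &[(&str, &str)]) -> (Layer, String) {
    // L1a: exact duplicate of a cached prompt.
    if let Some((_, r)) = cached.iter().find(|(p, _)| *p == prompt) {
        return (Layer::L1a, r.to_string());
    }
    // L1b: meaning-equivalent prompt — here faked with a case-insensitive
    // compare; the real layer uses embedding cosine similarity.
    if let Some((_, r)) = cached.iter().find(|(p, _)| p.eq_ignore_ascii_case(prompt)) {
        return (Layer::L1b, r.to_string());
    }
    // L3: only genuinely new prompts reach the cloud provider.
    (Layer::L3, format!("cloud answer to: {prompt}"))
}

fn main() {
    let cached = [("Price?", "42 EUR")];
    assert_eq!(deflect("Price?", &cached).0, Layer::L1a);
    assert_eq!(deflect("price?", &cached).0, Layer::L1b);
    assert_eq!(deflect("What is Rust?", &cached).0, Layer::L3);
}
```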
Disconnecting
isartor connect copilot-vscode --disconnect
If a backup exists, Isartor restores it. Otherwise it removes only the three
managed github.copilot.advanced.debug.* keys.
Benefits
| Benefit | How |
|---|---|
| Reduced API costs | Repetitive completions are served from cache |
| Lower latency | Cache hits return in < 5 ms vs hundreds of ms for cloud |
| Visibility | isartor stats --by-tool shows Copilot request counts, cache hit/miss safety, latency, retries, and errors |
| Privacy | Cached prompts never leave your machine on repeat requests |
| Model flexibility | Route L3 to any provider (OpenAI, Anthropic, Azure, local Ollama) |
Advanced Configuration
Use a specific LLM provider for Layer 3
Isartor routes surviving (non-cached) prompts to your configured L3 provider. You can use any supported provider:
# OpenAI (default)
isartor set-key -p openai
# Anthropic
isartor set-key -p anthropic
# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure
export ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com
export ISARTOR__AZURE_DEPLOYMENT_ID=<deployment>
isartor set-key -p azure
Adjust cache sensitivity
Tune the semantic cache threshold to control how similar a prompt must be to trigger an L1b hit:
# Default: 0.92 (higher = stricter matching)
export ISARTOR__SIMILARITY_THRESHOLD=0.90
See the Configuration Reference for all available options.
Enable monitoring
export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317
See Metrics & Tracing for Grafana dashboards and OTel setup.
Known Limitations
- Copilot Chat override — The `debug.chatOverrideProxyUrl` setting may not be fully respected by all versions of the Copilot Chat extension (tracking issue). Inline code completions (`debug.overrideCAPIUrl`) work reliably. If chat requests bypass Isartor, try using the global VS Code proxy setting as a workaround: `{ "http.proxy": "http://localhost:8080" }`. Note: this routes all VS Code HTTP traffic through Isartor, not just Copilot. Use a PAC script if you need finer control.
- Authentication — These `debug.*` settings bypass Copilot's normal GitHub authentication. Isartor handles the LLM provider auth via its own API key configuration. Your Copilot subscription is still required for the extension to activate.
- Extension updates — VS Code may update the Copilot extension automatically. If the proxy stops working after an update, verify the settings are still present in `settings.json` and restart VS Code.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Copilot suggestions stop working | Isartor not running | Run isartor up --detach and verify with curl http://localhost:8080/health |
| isartor connect copilot-vscode cannot find VS Code settings | Non-standard editor config path | Fall back to editing settings.json manually |
No requests in isartor stats | Settings not applied | Verify settings.json has the override block, then reload VS Code |
| Chat works but completions don't | Wrong endpoint URL | Ensure debug.overrideCAPIUrl ends with /v1 |
| Completions work but chat doesn't | Known chat override limitation | Add debug.chatOverrideProxyUrl or use http.proxy as workaround |
| Auth errors from Copilot | Missing L3 provider key | Run isartor set-key -p openai (or your provider) |
| High latency on first request | Model loading | First request downloads the embedding model (~25 MB); subsequent requests are fast |
Reverting
To stop routing Copilot through Isartor, remove the github.copilot.advanced
block from your settings.json and reload VS Code:
// Remove this entire block:
"github.copilot.advanced": {
"debug.overrideProxyUrl": "http://localhost:8080",
"debug.overrideCAPIUrl": "http://localhost:8080/v1",
"debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
}
OpenClaw
OpenClaw is a self-hosted AI assistant that can connect chat apps and agent workflows to LLM providers. The pragmatic Isartor setup is to register Isartor as a custom OpenAI-compatible OpenClaw provider and let OpenClaw use that provider as its primary model path.
This is similar in spirit to the LiteLLM integration docs, but with one important difference:
- LiteLLM is a multi-model gateway and catalog
- Isartor is a prompt firewall / gateway that currently exposes the upstream model you configured in Isartor itself
So the best OpenClaw UX is: configure the model in Isartor first, then let isartor connect openclaw mirror that model into OpenClaw's provider config.
Pragmatic setup
# 1. Configure Isartor's upstream provider/model
isartor set-key -p groq
isartor check
# 2. Start Isartor
isartor up --detach
# 3. Make sure OpenClaw is onboarded
openclaw onboard --install-daemon
# 4. Register Isartor as an OpenClaw provider
isartor connect openclaw
# 5. Verify OpenClaw sees the provider/model and auth
openclaw models status --agent main --probe
# 6. Smoke test a prompt
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"
What isartor connect openclaw does
It writes or updates your OpenClaw config (default: ~/.openclaw/openclaw.json) with:
models.providers.isartor- a single managed model entry matching Isartor's current upstream model
agents.defaults.model.primary = "isartor/<your-model>"- the
main/ default agent model override when one is present - a refresh of stale per-agent
models.jsonregistries so OpenClaw regenerates them with the latestbaseUrlandapiKey
Example generated provider block:
models: {
providers: {
isartor: {
baseUrl: "http://localhost:8080/v1",
apiKey: "isartor-local",
api: "openai-completions",
models: [
{
id: "openai/gpt-oss-120b",
name: "Isartor (openai/gpt-oss-120b)"
}
]
}
}
}
And the default model becomes:
agents: {
defaults: {
model: {
primary: "isartor/openai/gpt-oss-120b"
}
}
}
Base URL and auth path
OpenClaw must talk to Isartor's OpenAI-compatible /v1 surface.
- Correct base URL: `http://localhost:8080/v1`
- Wrong base URL: `http://localhost:8080`
Why this matters:
- OpenClaw appends `/chat/completions` for OpenAI-compatible custom providers
- Isartor exposes that route as `/v1/chat/completions`
- using the root gateway URL can produce `404` errors such as `gateway unknown L0 via chat/completions`
isartor connect openclaw writes the /v1 path for you, so prefer the connector over hand-editing the provider block.
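The failure mode is plain string concatenation. A sketch of what an OpenAI-compatible client does with the base URL (the `resolve` helper is illustrative, not OpenClaw's code):

```rust
// OpenAI-compatible clients append /chat/completions to the configured
// base URL. Only a base ending in /v1 lands on Isartor's actual route.
fn resolve(base_url: &str) -> String {
    format!("{}/chat/completions", base_url.trim_end_matches('/'))
}

fn main() {
    // Correct: base includes /v1, so the request hits /v1/chat/completions.
    assert_eq!(
        resolve("http://localhost:8080/v1"),
        "http://localhost:8080/v1/chat/completions"
    );
    // Wrong: a root base URL yields a path Isartor does not serve (404).
    assert_eq!(
        resolve("http://localhost:8080"),
        "http://localhost:8080/chat/completions"
    );
}
```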
Reconnecting after changing the gateway API key
OpenClaw stores custom-provider state in two places:
- `~/.openclaw/openclaw.json`
- per-agent `models.json` registries under `~/.openclaw/agents/<agentId>/agent/`
Those per-agent registries can keep an old apiKey or baseUrl even after openclaw.json changes. That is why you can still see 401 after fixing the key in the top-level config.
The supported fix is simply:
isartor connect openclaw --gateway-api-key <your-key>
openclaw models status --agent main --probe
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"
The connector now refreshes openclaw.json, updates the main / default agent model override, and removes stale per-agent models.json files so OpenClaw regenerates them with the new auth.
Why this is the best fit
The upstream LiteLLM/OpenClaw docs assume the gateway can expose a multi-model catalog and route among many providers behind one endpoint.
Isartor is different today:
- OpenClaw talks to Isartor over the OpenAI-compatible `/v1/chat/completions` surface
- Isartor forwards using its configured upstream provider/model
- OpenClaw model refs should therefore mirror the model currently configured in Isartor
That means:
- if you change Isartor's provider/model later, rerun `isartor connect openclaw`
- if you change Isartor's gateway API key later, rerun `isartor connect openclaw --gateway-api-key ...`
- do not expect `isartor/openai/...` and `isartor/anthropic/...` fallbacks to behave like LiteLLM provider switching unless Isartor itself grows multi-provider routing later
Options
| Flag | Default | Description |
|---|---|---|
| --model | Isartor's configured upstream model | Override the single model ID exposed to OpenClaw |
| --config-path | auto-detected | Path to openclaw.json |
| --gateway-api-key | (none) | Gateway key if auth is enabled |
Files written
- `~/.openclaw/openclaw.json` — managed OpenClaw provider config
- `~/.openclaw/agents/<agentId>/agent/models.json` — regenerated by OpenClaw after Isartor clears stale custom-provider caches
- `openclaw.json.isartor-backup` — backup, when a prior config existed
Disconnecting
isartor connect openclaw --disconnect
If a backup exists, Isartor restores it. Otherwise it removes only the managed models.providers.isartor entry and related isartor/... default-model references.
Recommended user workflow
For day-to-day use:
- Pick your upstream provider with `isartor set-key`
- Validate with `isartor check`
- Keep Isartor running with `isartor up --detach`
- Let OpenClaw use `isartor/<configured-model>` as its primary model
- Use `openclaw models status --agent main --probe` whenever you want to confirm what OpenClaw currently sees
If you later switch Isartor from, for example, Groq to OpenAI or Azure:
isartor set-key -p openai
isartor check
isartor connect openclaw
That refreshes OpenClaw's provider model to match the new Isartor config.
What Isartor does for OpenClaw
| Benefit | How |
|---|---|
| Cache repeated agent prompts | OpenClaw often repeats the same context and system framing. L1a exact cache resolves those instantly. |
| Catch paraphrases | L1b semantic cache resolves similar follow-ups locally when safe. |
| Compress repeated instructions | L2.5 trims repeated context before cloud fallback. |
| Keep one stable gateway URL | OpenClaw only needs isartor/<model> while Isartor owns the upstream provider configuration. |
| Observability | isartor stats --by-tool lets you track OpenClaw cache hits, latency, and savings. |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| OpenClaw cannot reach the provider | Isartor not running | Run isartor up --detach first |
| OpenClaw onboarding/custom provider returns 404 | Base URL points at http://localhost:8080 instead of http://localhost:8080/v1 | Use isartor connect openclaw or update the custom provider base URL to end with /v1 |
| OpenClaw still shows the old model | Isartor model changed after initial connect | Re-run isartor connect openclaw |
| Auth errors (401) after reconnecting | OpenClaw is still using stale per-agent provider state | Re-run isartor connect openclaw --gateway-api-key <key> so Isartor refreshes openclaw.json and clears stale per-agent models.json registries |
| "Model is not allowed" | OpenClaw allowlist still excludes the managed model | Re-run isartor connect openclaw so the managed model is re-added to the allowlist |
OpenCode
OpenCode integrates via a global provider config and auth store. Isartor
registers an isartor provider backed by @ai-sdk/openai-compatible and points
it at the gateway's /v1 endpoint.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure OpenCode
isartor connect opencode
# 3. Start OpenCode
opencode
How it works
- `isartor connect opencode` backs up `~/.config/opencode/opencode.json`
- It writes an `isartor` provider definition to that config file
- It writes a matching auth entry to `~/.local/share/opencode/auth.json`
- The provider uses `@ai-sdk/openai-compatible` with `baseURL` set to `http://localhost:8080/v1`
- If gateway auth is disabled, Isartor writes a dummy local auth key so OpenCode still has a credential to send
Files written
- `~/.config/opencode/opencode.json`
- `~/.local/share/opencode/auth.json`
Backups:
- `~/.config/opencode/opencode.json.isartor-backup`
- `~/.local/share/opencode/auth.json.isartor-backup`
Disconnecting
isartor connect opencode --disconnect
Disconnect restores the original files from backup when available. If no backup
exists, it removes only the managed isartor entries.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| OpenCode cannot see the Isartor provider | Config file not written | Run isartor connect opencode again |
| OpenCode shows auth errors | Gateway auth mismatch | Re-run with --gateway-api-key or update ISARTOR__GATEWAY_API_KEY |
| OpenCode cannot list models | /v1/models unreachable | Verify curl http://localhost:8080/v1/models |
Claude Code + GitHub Copilot
Use Claude Code's editor and CLI workflow while routing Layer 3 through your existing GitHub Copilot subscription via Isartor. Repeated prompts are still deflected first by Isartor's L1a/L1b cache layers and the L2 SLM (if enabled), so cache hits consume zero Copilot quota.
Current status: experimental. The connector and Copilot-backed L3 routing are implemented, but Isartor's Anthropic compatibility surface is still text-oriented today. That means plain Claude Code prompting works best right now; more advanced Anthropic tool-use blocks may still require follow-up work.
Prerequisites
- Active GitHub Copilot subscription
- Isartor installed
- Claude Code installed
# Install Isartor
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
# Install Claude Code
npm install -g @anthropic-ai/claude-code
Setup
Path A — Interactive authentication (recommended)
isartor connect claude-copilot
This starts GitHub device-flow authentication, stores the OAuth token locally,
updates ./isartor.toml, and writes Claude Code settings into
~/.claude/settings.json.
When no --github-token is provided, Isartor now prefers browser/device-flow
OAuth first. It will reuse a previously saved OAuth credential, but it will
not silently reuse legacy saved PATs.
Path B — Use an existing GitHub token
isartor connect claude-copilot --github-token ghp_YOUR_TOKEN
Use --github-token only when you intentionally want to override the default
browser login flow with a PAT.
Path C — Choose custom Copilot models
isartor connect claude-copilot \
--github-token ghp_YOUR_TOKEN \
--model gpt-4.1 \
--fast-model gpt-4o-mini
After the command finishes, restart Isartor so the new Layer 3 config is loaded:
isartor stop
isartor up --detach
claude
One-click smoke test
./scripts/claude-copilot-smoke-test.sh
# or
make smoke-claude-copilot
The script automatically:
- reads the saved Copilot credential from `~/.isartor/providers/copilot.json`
- picks a supported Copilot-backed model
- starts a temporary Isartor instance
- runs a Claude Code smoke prompt
- prints an ROI demo showing L3, L1a exact-hit, and L1b semantic-hit behavior
What the command changes
~/.claude/settings.json
The command writes these Claude Code environment overrides:
| Setting | Value | Purpose |
|---|---|---|
| ANTHROPIC_BASE_URL | http://localhost:8080 (or your gateway URL) | Routes Claude Code to Isartor |
| ANTHROPIC_AUTH_TOKEN | dummy or your gateway key | Satisfies Claude Code auth requirements |
| ANTHROPIC_MODEL | selected model | Primary Copilot-backed model |
| ANTHROPIC_DEFAULT_SONNET_MODEL | selected model | Default Claude Code Sonnet mapping |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | fast model | Lightweight/background tasks |
| DISABLE_NON_ESSENTIAL_MODEL_CALLS | 1 | Reduce unnecessary quota burn |
| CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC | 1 | Compatibility flag across Claude Code versions |
| ENABLE_TOOL_SEARCH | true | Preserve Claude Code tool search behavior |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS | 16000 | Stay under Copilot's output cap |
./isartor.toml
The command also sets Isartor Layer 3 to use the Copilot provider:
llm_provider = "copilot"
external_llm_model = "claude-sonnet-4.5"
external_llm_api_key = "ghp_..."
external_llm_url = "https://api.githubcopilot.com/chat/completions"
Available Copilot-backed models
| Model | Type | Notes |
|---|---|---|
| claude-sonnet-4.5 | Balanced | Good default for Claude-style behavior |
| claude-haiku-4.5 | Fast | Lower-latency Claude-family option |
| gpt-4o | Strong general model | Good for broad coding tasks |
| gpt-4o-mini | Fast + cheap | Good default fast/background model |
| gpt-4.1 | Included | Safe fallback choice |
| o3-mini | Reasoning | Higher-latency reasoning model |
What Isartor saves
Without Isartor:
Every Claude Code prompt -> GitHub Copilot API -> quota consumed
With Isartor:
Repeated prompt (L1a hit) -> served locally -> 0 Copilot quota
Similar prompt (L1b hit) -> served locally -> 0 Copilot quota
Novel prompt (cache miss) -> forwarded to Copilot -> quota consumed
Example session:
100 Claude Code prompts
40 exact repeats -> L1a -> 0 quota
25 semantic variants -> L1b -> 0 quota
35 novel prompts -> L3 -> 35 Copilot-backed requests
Result: 35 routed requests instead of 100
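The savings arithmetic above can be sketched directly (a hypothetical tally; hit counts are taken from the example session):

```shell
# Hypothetical session tally (counts from the example above).
total=100
l1a_hits=40      # exact repeats served by L1a
l1b_hits=25      # semantic variants served by L1b
routed=$(( total - l1a_hits - l1b_hits ))
deflected_pct=$(( (l1a_hits + l1b_hits) * 100 / total ))
echo "routed=$routed deflected=${deflected_pct}%"
```

Only the 35 routed requests consume Copilot quota; the other 65% never leave the machine.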
Limitations
- GitHub Copilot output is capped; Isartor writes `CLAUDE_CODE_MAX_OUTPUT_TOKENS=16000`
- The current `/v1/messages` compatibility path is still text-oriented, so some advanced Anthropic tool-use flows may not yet behave exactly like direct Anthropic routing
- Extended-thinking / provider-specific Anthropic features are not preserved
- If the chosen Copilot model is unavailable to your account, requests fail instead of silently falling back to Anthropic
Disconnect
isartor connect claude-copilot --disconnect
This restores the backed-up ~/.claude/settings.json and ./isartor.toml.
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| Authentication failed | Browser login incomplete, token invalid, or expired | Re-run isartor connect claude-copilot and finish GitHub sign-in |
| No active GitHub Copilot subscription | Signed-in GitHub user has no active Copilot seat / entitlement | Check https://github.com/features/copilot and enterprise seat assignment |
| Model not found | Account cannot access the requested model | Retry with --model gpt-4.1 |
| Claude Code still uses Anthropic | Isartor not restarted after config change | Run isartor stop && isartor up --detach |
| 401 from Isartor | Gateway auth enabled but Claude settings use dummy token | Re-run with the gateway key available in local config |
| Tool call failed | Current Anthropic compatibility is still text-first | Use simpler prompting for now; full tool-use compatibility is follow-up work |
Claude Code
Claude Code integrates via ANTHROPIC_BASE_URL, pointing all API traffic at
Isartor's /v1/messages endpoint.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Claude Code
isartor connect claude
# 3. Claude Code now routes through Isartor automatically
How it works
- `isartor connect claude` sets `ANTHROPIC_BASE_URL` in `~/.claude/settings.json`
- Claude Code sends requests to Isartor's `/v1/messages` endpoint
- Isartor forwards to the Anthropic API as Layer 3 when the request is not deflected
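For reference, a minimal Anthropic-style request body of the kind Claude Code sends to /v1/messages looks like this (a sketch; the model name is illustrative and the exact fields Claude Code sets vary by version):

```json
{
  "model": "claude-sonnet-4.5",
  "max_tokens": 1024,
  "messages": [
    { "role": "user", "content": "Explain what this function does." }
  ]
}
```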
Disconnecting
isartor connect claude --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Claude not routing through Isartor | settings.json not updated | Run isartor connect claude |
Claude Desktop
Claude Desktop integrates with Isartor via a local MCP server. The recommended setup is isartor connect claude-desktop, which registers isartor mcp in Claude Desktop's config so Claude can use Isartor's cache-aware tools.
Step-by-step setup
# 1. Start Isartor
isartor up --detach
# 2. Register Isartor in Claude Desktop
isartor connect claude-desktop
# 3. Restart Claude Desktop
After restart, open Claude Desktop's tools/connectors UI and confirm the isartor MCP server is present.
What the connector writes
isartor connect claude-desktop updates Claude Desktop's local MCP config and keeps a backup next to it.
Typical config paths:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux (best-effort path): `~/.config/Claude/claude_desktop_config.json`
The generated MCP entry looks like:
{
"mcpServers": {
"isartor": {
"command": "/path/to/isartor",
"args": ["mcp"],
"env": {
"ISARTOR_GATEWAY_URL": "http://localhost:8080"
}
}
}
}
If gateway auth is enabled, the connector also writes ISARTOR__GATEWAY_API_KEY into the managed server env block.
What Claude Desktop gets
The Isartor MCP server exposes these tools:
- `isartor_chat` — cache-first lookup through Isartor's L1a/L1b layers
- `isartor_cache_store` — store prompt/response pairs back into Isartor after a cache miss
This gives Claude Desktop a low-risk integration path that fits the current MCP model without relying on Anthropic base-URL overrides.
Advanced / manual setup
If you prefer to edit the config yourself, add a local MCP server entry that runs:
isartor mcp
Isartor also exposes MCP over HTTP/SSE at:
http://localhost:8080/mcp/
That remote MCP surface is useful for clients that support HTTP/SSE registration directly, but isartor connect claude-desktop currently uses the local stdio flow because it is the most reliable Claude Desktop path today.
Disconnecting
isartor connect claude-desktop --disconnect
This restores the backup when one exists; otherwise it removes only the managed mcpServers.isartor entry.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Claude Desktop shows no isartor tools | Claude Desktop was not restarted | Quit and relaunch Claude Desktop after isartor connect claude-desktop |
| Tools appear but calls fail | Isartor is not running | Start the gateway with isartor up --detach |
| MCP server is present but unauthorized | Gateway auth enabled | Re-run isartor connect claude-desktop --gateway-api-key <key> |
| You want the original config back | Managed config needs rollback | Run isartor connect claude-desktop --disconnect |
Note on desktop extensions
Claude Desktop now supports desktop extensions, but Isartor's first-class integration in this repo uses the simpler local MCP server flow today. That keeps setup light and works with the existing isartor mcp implementation immediately.
Cursor IDE
Cursor IDE integrates via the OpenAI Base URL override in Cursor's model settings, and optionally via MCP server registration for tool-based integration.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Cursor
isartor connect cursor
# 3. Open Cursor → Settings → Cursor Settings → Models
# 4. Enable "Override OpenAI Base URL" and enter: http://localhost:8080/v1
# 5. Paste the API key shown in the connect output
# 6. Add a custom model name (e.g. gpt-4o) and enable it
# 7. Use Ask or Plan mode (Agent mode doesn't support custom keys yet)
How it works
- `isartor connect cursor` writes a reference env file to `~/.isartor/env/cursor.sh`
- It also registers Isartor as an MCP server in `~/.cursor/mcp.json`
- In Cursor, override the OpenAI Base URL to point at Isartor's `/v1` endpoint
- Cursor can use Isartor's `GET /v1/models` endpoint to discover the configured model
- All chat completions requests route through Isartor's L1/L2/L3 deflection stack
- Isartor supports OpenAI streaming SSE, tool-call passthrough, and HTTP/SSE MCP at `http://localhost:8080/mcp/` for compatible Cursor workflows
- Cursor's Ask and Plan modes are supported; Agent mode requires native keys
Cursor's generated MCP config points at:
{"mcpServers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}
Disconnecting
isartor connect cursor --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Cursor not routing through Isartor | Base URL override not set | Open Cursor Settings → Models → enable Override OpenAI Base URL |
| Cursor model picker is empty | Cursor cannot reach model discovery | Verify http://localhost:8080/v1/models is reachable from Cursor |
OpenAI Codex CLI
OpenAI Codex CLI integrates via OPENAI_BASE_URL, routing requests through
Isartor's OpenAI-compatible /v1 surface, including /v1/chat/completions and
/v1/models.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Codex
isartor connect codex
# 3. Source the env file
source ~/.isartor/env/codex.sh
# 4. Run Codex
codex --model o3-mini
How it works
- `isartor connect codex` writes `OPENAI_BASE_URL` and `OPENAI_API_KEY` to `~/.isartor/env/codex.sh`
- Codex can query `/v1/models` to discover the configured model
- Codex sends chat requests to Isartor's `/v1/chat/completions` endpoint
- Isartor supports OpenAI streaming SSE and tool-call passthrough for compatible agent workflows
- Isartor forwards to the configured upstream as Layer 3 when not deflected
- Use `--model` to select any model name configured in your L3 provider
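The chat requests Codex issues through Isartor follow the standard OpenAI schema; a minimal body (model name and prompt illustrative) looks like:

```json
{
  "model": "o3-mini",
  "messages": [
    { "role": "user", "content": "Write a unit test for this function." }
  ],
  "stream": true
}
```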
Disconnecting
isartor connect codex --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Codex not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/codex.sh in your shell |
| Codex cannot list models | /v1/models unreachable or auth mismatch | Test curl http://localhost:8080/v1/models with the same auth settings |
Gemini CLI
Gemini CLI integrates via GEMINI_API_BASE_URL, routing requests through
Isartor's gateway.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Gemini CLI
isartor connect gemini
# 3. Source the env file
source ~/.isartor/env/gemini.sh
# 4. Run Gemini CLI
gemini
How it works
- `isartor connect gemini` writes `GEMINI_API_BASE_URL` and `GEMINI_API_KEY` to `~/.isartor/env/gemini.sh`
- Gemini CLI sends requests to Isartor's gateway
- Isartor forwards to the configured upstream as Layer 3 when not deflected
Disconnecting
isartor connect gemini --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Gemini not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/gemini.sh in your shell |
Antigravity
Antigravity integrates via an OpenAI-compatible base URL override. Isartor
generates a shell env file that sets OPENAI_BASE_URL and OPENAI_API_KEY
to route all LLM calls through the Deflection Stack.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Generate the env file
isartor connect antigravity
# 3. Activate the environment
source ~/.isartor/env/antigravity.sh
# 4. Start Antigravity
# (it will now use Isartor as its OpenAI endpoint)
How it works
- `isartor connect antigravity` creates `~/.isartor/env/antigravity.sh`
- The file exports `OPENAI_BASE_URL` pointing at `http://localhost:8080/v1`
- It exports `OPENAI_API_KEY` with your gateway key (or a local placeholder)
- When sourced, Antigravity sends all OpenAI-compatible calls through Isartor
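The generated file is a plain shell script; its contents look roughly like this (a sketch; the key value is illustrative and depends on whether gateway auth is enabled):

```shell
# Sketch of ~/.isartor/env/antigravity.sh (values illustrative)
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="my-secret-key"   # gateway key, or a local placeholder
```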
Files written
~/.isartor/env/antigravity.sh
Disconnecting
isartor connect antigravity --disconnect
Then restart your shell to clear the exported variables.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused | Isartor not running | Run isartor up first |
| Auth errors (401) | Gateway auth enabled | Re-run with --gateway-api-key |
| Env not applied | Shell not sourced | Run source ~/.isartor/env/antigravity.sh |
Generic Connector
For tools not explicitly supported, use the generic connector to generate an env script that sets the tool's base URL environment variable to point at Isartor.
Compatible tools
The generic connector works with any OpenAI-compatible tool, including:
- Windsurf
- Zed
- Cline
- Roo Code
- Aider
- Continue
- Antigravity (also available via `isartor connect antigravity`)
- OpenClaw (also available via `isartor connect openclaw`)
- Any other tool that reads an `OPENAI_BASE_URL` or similar environment variable
OpenAI-compatible features exposed by Isartor include:
- `GET /v1/models` for model discovery
- `POST /v1/chat/completions` with `stream: true` SSE responses
- tool/function calling passthrough (`tools`, `tool_choice`, `functions`, `tool_calls`)
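With stream: true, responses arrive as standard OpenAI-style server-sent events, for example (payloads abbreviated):

```
data: {"choices":[{"delta":{"content":"Hel"},"index":0}]}

data: {"choices":[{"delta":{"content":"lo"},"index":0}]}

data: [DONE]
```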
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure the tool (example: Windsurf)
isartor connect generic \
--tool-name Windsurf \
--base-url-var OPENAI_BASE_URL \
--api-key-var OPENAI_API_KEY
# 3. Source the env file
source ~/.isartor/env/windsurf.sh
# 4. Start the tool
Arguments
| Flag | Required | Description |
|---|---|---|
| --tool-name | yes | Display name (also used for env script filename) |
| --base-url-var | yes | Env var the tool reads for its API base URL |
| --api-key-var | no | Env var the tool reads for its API key |
| --no-append-v1 | no | Don't append /v1 to the gateway URL |
Disconnecting
isartor connect generic \
--tool-name Windsurf \
--base-url-var OPENAI_BASE_URL \
--disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Tool not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/<tool>.sh in your shell |
| Tool says no models are available | It expects OpenAI model discovery | Verify it can reach http://localhost:8080/v1/models |
Level 1 — Minimal Deployment
Single static binary, embedded candle inference + in-process candle sentence embeddings, zero C/C++ dependencies.
This guide covers deploying Isartor as a standalone process — no sidecars, no Docker Compose, no orchestrator. The firewall binary embeds a Gemma-2-2B-IT GGUF model via candle for Layer 2 classification and uses candle's BertModel (sentence-transformers/all-MiniLM-L6-v2) for Layer 1 semantic cache embeddings — all entirely in-process, pure Rust.
When to Use Level 1
| ✅ Good Fit | ❌ Consider Level 2/3 Instead |
|---|---|
| €5–€20/month VPS (Hetzner, DigitalOcean, Linode) | GPU inference for generation quality |
| ARM edge devices (Raspberry Pi 5, Jetson Nano) | More than ~50 concurrent users |
| Air-gapped / offline environments | Production observability stack required |
| Development & local experimentation | Multi-node high-availability |
| CI/CD test runners | |
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 2 GB free | 4 GB free |
| Disk | 2 GB (model download) | 5 GB |
| CPU | 2 cores | 4+ cores (AVX2 recommended) |
| Rust (build from source) | 1.75+ | Latest stable |
| OS | Linux (x86_64 / aarch64), macOS | Ubuntu 22.04 LTS |
Memory budget: Gemma-2-2B Q4_K_M ≈ 1.5 GB, candle BertModel ≈ 90 MB, tokenizer ≈ 4 MB, firewall runtime ≈ 50 MB. Total: ~1.7 GB resident.
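As a sanity check, the budget adds up as follows (MB values taken from the line above):

```shell
# Sum the Level 1 memory budget (MB)
gemma=1500; bert=90; tokenizer=4; runtime=50
total=$(( gemma + bert + tokenizer + runtime ))
echo "total ~= ${total} MB"   # roughly 1.6-1.7 GB resident
```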
Option A: One-Click Install (Recommended)
The fastest way to get started is to leverage the pre-built, cross-platform binaries generated by the CI/CD pipeline.
Install via script:
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex
This script detects your target OS and processor architecture, downloads the correct release binary, and adds it to your path automatically.
Option B: Build from Source
1. Clone & Build
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
The release binary is at ./target/release/isartor (~5 MB statically linked).
2. Configure Environment
Create a minimal .env file or export variables directly:
# Required — your cloud LLM key for Layer 3 fallback
export ISARTOR__EXTERNAL_LLM_API_KEY="sk-..."
# Optional — override defaults
export ISARTOR__GATEWAY_API_KEY="my-secret-key"
export ISARTOR__HOST_PORT="0.0.0.0:8080"
export ISARTOR__LLM_PROVIDER="openai" # openai | azure | anthropic | xai
export ISARTOR__EXTERNAL_LLM_MODEL="gpt-4o-mini"
# Cache mode — "both" enables exact + semantic cache. Semantic embeddings
# are generated in-process via candle BertModel — no sidecar needed.
export ISARTOR__CACHE_MODE="both"
# Pluggable backends — Level 1 uses the defaults (no change needed):
# ISARTOR__CACHE_BACKEND=memory — in-process LRU (ahash + parking_lot)
# ISARTOR__ROUTER_BACKEND=embedded — in-process Candle GGUF SLM
# These are ideal for a single-process deployment with zero dependencies.
3. Start the Firewall
./target/release/isartor up
On first start, the embedded classifier will auto-download the Gemma-2-2B-IT GGUF model from Hugging Face Hub (~1.5 GB). Subsequent starts load from the local cache (~/.cache/huggingface/).
INFO isartor > Listening on 0.0.0.0:8080
INFO isartor::layer1::embeddings > Initialising candle TextEmbedder (all-MiniLM-L6-v2)...
INFO isartor::layer1::embeddings > TextEmbedder ready (~90 MB BertModel loaded)
INFO isartor::services::local_inference > Downloading model from mradermacher/gemma-2-2b-it-GGUF...
INFO isartor::services::local_inference > Model loaded (1.5 GB), ready for inference
4. Verify
# Health check
curl http://localhost:8080/health
# Test the firewall
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: my-secret-key" \
-d '{"prompt": "Hello, how are you?"}' | jq .
Option C: Docker (Single Container)
For environments where you prefer a container but don't need a full Compose stack.
Build the Image
cd isartor
docker build -t isartor:latest -f docker/Dockerfile .
Run
docker run -d \
--name isartor \
-p 8080:8080 \
-e ISARTOR__GATEWAY_API_KEY="my-secret-key" \
-e ISARTOR__EXTERNAL_LLM_API_KEY="sk-..." \
-e ISARTOR__CACHE_MODE="both" \
-e HF_HOME=/tmp/huggingface \
-v isartor-models:/tmp/huggingface \
isartor:latest
Note: The `-v` flag mounts a named volume for the Hugging Face cache so the model downloads persist across container restarts. The official Docker image runs as non-root and uses `HF_HOME=/tmp/huggingface` to ensure the cache is writable.
Option D: systemd Service (Production Linux)
For long-running production deployments on bare metal or VPS.
1. Install the Binary
# Build
cargo build --release
# Install to /usr/local/bin
sudo cp target/release/isartor /usr/local/bin/isartor
sudo chmod +x /usr/local/bin/isartor
2. Create a System User
sudo useradd --system --no-create-home --shell /usr/sbin/nologin isartor
3. Create Environment File
sudo mkdir -p /etc/isartor
sudo tee /etc/isartor/env <<'EOF'
ISARTOR__HOST_PORT=0.0.0.0:8080
ISARTOR__GATEWAY_API_KEY=your-production-key
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__CACHE_MODE=both
ISARTOR__CACHE_BACKEND=memory
ISARTOR__ROUTER_BACKEND=embedded
RUST_LOG=isartor=info
EOF
sudo chmod 600 /etc/isartor/env
4. Create systemd Unit
sudo tee /etc/systemd/system/isartor.service <<'EOF'
[Unit]
Description=Isartor Prompt Firewall
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=isartor
Group=isartor
EnvironmentFile=/etc/isartor/env
ExecStart=/usr/local/bin/isartor
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/cache/isartor
[Install]
WantedBy=multi-user.target
EOF
5. Create Model Cache Directory
sudo mkdir -p /var/cache/isartor
sudo chown isartor:isartor /var/cache/isartor
6. Enable & Start
sudo systemctl daemon-reload
sudo systemctl enable isartor
sudo systemctl start isartor
# Check status
sudo systemctl status isartor
sudo journalctl -u isartor -f
Model Pre-Caching (Air-Gapped / Offline)
If the deployment target has no internet access, pre-download the model on a connected machine and copy it over.
On the Connected Machine
# Install huggingface-cli
pip install huggingface-hub
# Download the GGUF file
huggingface-cli download mradermacher/gemma-2-2b-it-GGUF \
gemma-2-2b-it.Q4_K_M.gguf \
--local-dir ./models
# Also grab the tokenizer (from the base model)
huggingface-cli download google/gemma-2-2b-it \
tokenizer.json \
--local-dir ./models
Transfer to Target
scp -r ./models/ user@target-host:/var/cache/isartor/
By default, hf-hub uses ~/.cache/huggingface/. In the official Docker image, Isartor sets HF_HOME=/tmp/huggingface (non-root safe). Set HF_HOME or ISARTOR_HF_CACHE_DIR to point to your pre-cached directory if needed.
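On the air-gapped target, pointing the loader at the transferred directory is just an environment override (a sketch; whether `HF_HOME` or `ISARTOR_HF_CACHE_DIR` is the right knob depends on your setup, as noted above):

```shell
# Use the pre-cached models instead of downloading (offline host)
export HF_HOME=/var/cache/isartor
# then start the firewall as usual:
#   isartor up
```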
Level 1 Configuration Reference
These are the most relevant ISARTOR__* variables for Level 1 deployments. For the full reference, see the Configuration Reference.
| Variable | Default | Level 1 Notes |
|---|---|---|
| ISARTOR__HOST_PORT | 0.0.0.0:8080 | Bind address |
| ISARTOR__GATEWAY_API_KEY | "" | Set to enable gateway auth |
| ISARTOR__CACHE_MODE | both | both recommended — candle BertModel provides in-process semantic embeddings |
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-process Level 1 |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle GGUF SLM — zero external dependencies |
| ISARTOR__CACHE_TTL_SECS | 300 | Cache TTL in seconds |
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries per cache |
| ISARTOR__LLM_PROVIDER | openai | openai · azure · anthropic · xai |
| ISARTOR__EXTERNAL_LLM_API_KEY | (empty) | Required for Layer 3 fallback |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Cloud LLM model name |
| ISARTOR__ENABLE_MONITORING | false | Enable for stdout OTel (no collector needed) |
Embedded Classifier Defaults (Compiled)
| Setting | Default Value | Description |
|---|---|---|
| repo_id | mradermacher/gemma-2-2b-it-GGUF | HF repo for the GGUF model |
| gguf_filename | gemma-2-2b-it.Q4_K_M.gguf | Model file (~1.5 GB) |
| max_classify_tokens | 20 | Token limit for classification |
| max_generate_tokens | 256 | Token limit for simple task execution |
| temperature | 0.0 | Greedy decoding for classification |
| repetition_penalty | 1.1 | Avoids degenerate loops |
Performance Expectations
| Metric | Typical Value (4-core x86_64) |
|---|---|
| Cold start (model download) | 30–120 s (depends on bandwidth; ~1.5 GB Gemma + ~90 MB candle BertModel) |
| Warm start (cached model) | 3–8 s |
| Classification latency | 50–200 ms |
| Simple task execution | 200–2000 ms |
| Firewall overhead (no inference) | < 1 ms |
| Memory (steady state) | ~1.6 GB |
| Binary size | ~5 MB |
Upgrading to Level 2
When your traffic outgrows Level 1, the migration path is straightforward:
- Add the generation sidecar — `ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081` (replaces embedded candle with the more powerful Phi-3-mini on GPU).
- Optionally add an embedding sidecar — `ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082` (only needed for external embedding inference; the default L1b semantic cache already uses in-process candle BertModel).
- Deploy via Docker Compose — see Level 2 — Sidecar Deployment.

Note: The pluggable backend defaults (`cache_backend=memory`, `router_backend=embedded`) remain appropriate for Level 2 single-host deployments. You only need to switch to `cache_backend=redis` and `router_backend=vllm` at Level 3 when scaling horizontally.
No code changes required — only environment variables and infrastructure.
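When you eventually reach Level 3, the backend switch is likewise just environment variables (a sketch; connection details for Redis and vLLM are deployment-specific and not shown here):

```shell
# Level 3 backends for horizontal scaling (values per the note above)
export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__ROUTER_BACKEND=vllm
```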
Level 2 — Sidecar Deployment
Split architecture: Isartor firewall + llama.cpp generation sidecar on a single host.
This guide covers deploying Isartor with a dedicated AI sidecar for generation. The firewall delegates Layer 2 inference to a lightweight llama.cpp container via HTTP, while Layer 1 semantic cache embeddings run in-process via candle BertModel (no embedding sidecar required). The overall stack runs on a single machine via Docker Compose.
When to Use Level 2
| ✅ Good Fit | ❌ Consider Level 1 or Level 3 |
|---|---|
| Single host with GPU (NVIDIA, AMD) | No GPU available → Level 1 embedded candle |
| Want GPU-accelerated Layer 2 generation | Multi-node scaling → Level 3 Kubernetes |
| Want full observability stack (Jaeger, Grafana) | Budget VPS (< 4 GB RAM) → Level 1 |
| Development with production-like topology | Auto-scaling inference pools → Level 3 |
| 10–100 concurrent users | > 100 concurrent users → Level 3 |
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 10 GB | 20 GB (model cache) |
| CPU | 4 cores | 8+ cores |
| GPU (optional) | NVIDIA with 4 GB VRAM | NVIDIA with 8+ GB VRAM |
| Docker | 24.0+ | Latest |
| Docker Compose | v2.20+ | Latest |
| NVIDIA Container Toolkit (GPU) | Latest | Latest |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Single Host │
│ │
│ ┌─────────────┐ ┌───────────────────┐ ┌──────────────┐ │
│ │ Client │───▶│ Isartor Firewall │ │ Jaeger UI │ │
│ │ │ │ :8080 │ │ :16686 │ │
│ └─────────────┘ │ (candle L1 │ └──────────────┘ │
│ │ embeddings │ │
│ │ built-in) │ │
│ └──┬────────────────┘ │
│ │ │
│ HTTP :8081│ │
│ ▼ │
│ ┌────────────┐ ┌──────────────┐ │
│ │ slm-gen │ │ Grafana │ │
│ │ Phi-3-mini │ │ :3000 │ │
│ │ (llama.cpp)│ └──────────────┘ │
│ └────────────┘ │
│ ┌──────────────┐ │
│ ┌─────────────────────────┐ │ Prometheus │ │
│ │ OTel Collector :4317 │────▶│ :9090 │ │
│ └─────────────────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Optional: slm-embed :8082 (llama.cpp) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Services
| Service | Image | Port | Purpose | Memory Limit |
|---|---|---|---|---|
| gateway | isartor:latest (built) | 8080 | Prompt Firewall (includes candle BertModel for Layer 1 embeddings) | 256 MB |
| slm-generation | ghcr.io/ggml-org/llama.cpp:server | 8081 | Phi-3-mini-4k (Q4_K_M) — intent classification + generation | 4 GB |
| slm-embedding (optional) | ghcr.io/ggml-org/llama.cpp:server | 8082 | all-MiniLM-L6-v2 (Q8_0) — external embedding sidecar (default uses in-process candle) | 512 MB |
| otel-collector | otel/opentelemetry-collector-contrib:0.96.0 | 4317 | OTLP gRPC receiver | 128 MB |
| jaeger | jaegertracing/all-in-one:1.55 | 16686 | Distributed tracing UI | 256 MB |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics storage (7d retention) | 256 MB |
| grafana | grafana/grafana:10.4.0 | 3000 | Dashboards | 256 MB |
Quick Start (CPU Only)
1. Clone the Repository
git clone https://github.com/isartor-ai/isartor.git
cd isartor/docker
2. Configure Layer 3 (Optional)
Layers 0–2 work without a cloud LLM key. If you want Layer 3 fallback:
cp .env.full.example .env.full
Edit .env.full and set your provider:
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
3. Start the Full Stack
docker compose -f docker-compose.sidecar.yml up --build
First launch downloads model files (~1.5 GB for Phi-3 + ~50 MB for MiniLM). Subsequent starts use the cached isartor-slm-models volume.
4. Wait for Health Checks
The firewall waits for both sidecars to become healthy before starting:
docker compose -f docker-compose.sidecar.yml ps
All services should show healthy or running.
5. Verify
# Health check
curl http://localhost:8080/healthz
# Test the firewall
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "What is 2+2?"}' | jq .
# If you enabled gateway auth, add:
# -H "X-API-Key: your-secret-key"
# Check traces in Jaeger
open http://localhost:16686
GPU Passthrough (NVIDIA)
To enable GPU acceleration for the llama.cpp sidecars:
1. Install NVIDIA Container Toolkit
# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
2. Add GPU Resources to Compose
Create a docker-compose.gpu.override.yml:
services:
slm-generation:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
# The default --n-gpu-layers 99 in docker-compose.sidecar.yml
# already offloads all layers to GPU when available.
slm-embedding:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
3. Start with GPU Override
docker compose \
-f docker-compose.sidecar.yml \
-f docker-compose.gpu.override.yml \
up --build
Expected GPU Impact
| Metric | CPU Only (8-core) | GPU (RTX 3060 12 GB) |
|---|---|---|
| Phi-3 classification | 500–2000 ms | 30–100 ms |
| Phi-3 generation (256 tokens) | 5–15 s | 0.5–2 s |
| MiniLM embedding | 20–50 ms | 5–10 ms |
Available Compose Files
The docker/ directory contains several Compose configurations for different use cases:
| File | Description | Provider |
|---|---|---|
| docker-compose.sidecar.yml | Recommended. Full stack with llama.cpp sidecars + observability | Any (configurable) |
| docker-compose.yml | Legacy stack with Ollama (heavier) | OpenAI |
| docker-compose.azure.yml | Legacy stack with Ollama, pre-configured for Azure OpenAI | Azure |
| docker-compose.observability.yml | Observability-focused stack (Ollama + OTel + Jaeger + Grafana) | Azure |
We recommend `docker-compose.sidecar.yml` for all new deployments. The llama.cpp sidecars are ~30 MB each vs. Ollama's ~1.5 GB.
Environment Variables (Level 2 Specific)
These variables are relevant to the sidecar architecture. For the full reference, see the Configuration Reference.
Firewall ↔ Sidecar Communication
| Variable | Default | Description |
|---|---|---|
| ISARTOR__LAYER2__SIDECAR_URL | http://127.0.0.1:8081 | Generation sidecar URL (use Docker service name in Compose: http://slm-generation:8081) |
| ISARTOR__LAYER2__MODEL_NAME | phi-3-mini | Model name for OpenAI-compatible requests |
| ISARTOR__LAYER2__TIMEOUT_SECONDS | 30 | HTTP timeout for generation calls |
| ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL | http://127.0.0.1:8082 | Embedding sidecar URL — optional (default uses in-process candle; use http://slm-embedding:8082 in Compose) |
| ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME | all-minilm | Embedding model name (sidecar only) |
| ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS | 10 | HTTP timeout for embedding calls (sidecar only) |
Pluggable Backends
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-host Docker Compose |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle SLM classification — no external dependency |

Scalability note: These defaults are appropriate for Level 2 (single host). When moving to Level 3 (multi-replica K8s), switch to `cache_backend=redis` and `router_backend=vllm` for horizontal scaling.
Cache
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MODE | both | Use both — in-process candle BertModel provides semantic embeddings at all tiers |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for cache hits |
Observability
| Variable | Default | Description |
|---|---|---|
| ISARTOR__ENABLE_MONITORING | true (in Compose) | Enable OTel trace/metric export |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://otel-collector:4317 | OTel Collector gRPC endpoint |
Operational Commands
Logs
# All services
docker compose -f docker-compose.sidecar.yml logs -f
# Firewall only
docker compose -f docker-compose.sidecar.yml logs -f gateway
# Sidecars
docker compose -f docker-compose.sidecar.yml logs -f slm-generation slm-embedding
Restart a Service
docker compose -f docker-compose.sidecar.yml restart gateway
Tear Down (Preserve Model Cache)
docker compose -f docker-compose.sidecar.yml down
# Models persist in the 'isartor-slm-models' volume
Tear Down (Clean Everything)
docker compose -f docker-compose.sidecar.yml down -v
# Removes all volumes including model cache — next start re-downloads models
View Model Cache Size
docker volume inspect isartor-slm-models
Networking Notes
- All services share a Docker bridge network created by Compose.
- The firewall references sidecars by Docker service name (slm-generation, slm-embedding), not localhost.
- Only the firewall (8080), Jaeger UI (16686), Grafana (3000), and Prometheus (9090) are exposed to the host.
- Sidecar ports (8081, 8082) are also exposed for debugging but can be removed in production by deleting the ports: mapping.
Scaling Within Level 2
Before moving to Level 3, you can vertically scale Level 2:
| Optimisation | How |
|---|---|
| More GPU VRAM | Use larger quantisation (Q8_0 instead of Q4_K_M) for better quality |
| Bigger model | Swap Phi-3-mini for Phi-3-medium or Qwen2-7B in the Compose command |
| More cache | Increase ISARTOR__CACHE_MAX_CAPACITY and ISARTOR__CACHE_TTL_SECS |
| Faster embedding | Use nomic-embed-text (768-dim) for richer semantic matching |
| More concurrency | Scale horizontally with multiple firewall replicas behind a load balancer |
Upgrading to Level 3
When a single host is no longer sufficient:
- Extract the firewall into stateless Kubernetes pods (it's already stateless).
- Replace sidecars with an auto-scaling inference pool (vLLM, TGI, or Triton).
- Add an internal load balancer between firewall pods and the inference pool.
- Move observability to a managed solution (Datadog, Grafana Cloud, Azure Monitor).
See Level 3 — Enterprise Deployment for the full Kubernetes guide.
Level 3 — Enterprise Deployment
Fully decoupled microservices: stateless firewall pods + auto-scaling GPU inference pools.
This guide covers deploying Isartor on Kubernetes with Helm, horizontal pod autoscaling, dedicated GPU inference pools (vLLM or TGI), service mesh integration, and production-grade observability.
When to Use Level 3
| ✅ Good Fit | ❌ Overkill For |
|---|---|
| 100+ concurrent users | < 50 users → Level 2 Docker Compose |
| Multi-region / multi-zone HA | Single-machine development → Level 1 |
| Auto-scaling GPU inference | No GPU budget → Level 1 embedded candle |
| Compliance: mTLS, audit logs, RBAC | Hobby projects / PoCs |
| Cost optimisation via scale-to-zero | Teams without Kubernetes experience |
Architecture
┌────────────────────┐
│ Ingress / ALB │
│ (TLS termination) │
└──────────┬─────────┘
│
┌──────────────┴──────────────┐
│ Firewall Deployment │
│ (N stateless pods) │
│ │
│ ┌────────┐ ┌────────┐ │
│ │ Pod 1 │ │ Pod N │ │
│ │isartor │ │isartor │ │
│ └────────┘ └────────┘ │
│ │
│ HPA: CPU / custom metrics │
└──────────────┬───────────────┘
│
Internal ClusterIP
│
┌────────────────────┼────────────────────┐
│ │ │
┌────────▼───────┐ ┌────────▼───────┐ ┌────────▼───────┐
│ Inference Pool │ │ Embedding Pool │ │ Cloud LLM │
│ (vLLM / TGI) │ │ (TEI / llama) │ │ (OpenAI / etc) │
│ │ │ │ │ (Layer 3 only) │
│ GPU Nodes │ │ CPU/GPU Nodes │ └────────────────┘
│ HPA on GPU util │ │ HPA on RPS │
└─────────────────┘ └─────────────────┘
Component Summary
| Component | Replicas | Scaling Metric | Resource |
|---|---|---|---|
| Firewall | 2–20 | CPU utilisation / request rate | CPU nodes |
| Inference Pool (vLLM) | 1–N | GPU utilisation / queue depth | GPU nodes |
| Embedding Pool (TEI) | 1–N | Requests per second | CPU or GPU nodes (optional; default uses in-process candle) |
| OTel Collector | 1 (DaemonSet or Deployment) | — | CPU nodes |
| Ingress Controller | 1–2 | — | CPU nodes |
Prerequisites
| Requirement | Details |
|---|---|
| Kubernetes cluster | 1.28+ (EKS, GKE, AKS, or bare metal) |
| Helm | v3.12+ |
| kubectl | Matching cluster version |
| GPU nodes (for inference pool) | NVIDIA GPU Operator installed, or GKE/EKS GPU node pools |
| Container registry | For pushing the Isartor firewall image |
| Ingress controller | nginx-ingress, Istio, or cloud ALB |
Step 1: Build & Push the Firewall Image
# Build
docker build -t your-registry.io/isartor:v0.1.0 -f docker/Dockerfile .
# Push
docker push your-registry.io/isartor:v0.1.0
Step 2: Namespace & Secrets
kubectl create namespace isartor
# Cloud LLM API key (Layer 3 fallback)
kubectl create secret generic isartor-llm-secret \
--namespace isartor \
--from-literal=api-key='sk-...'
# Firewall API key (Layer 0 auth)
kubectl create secret generic isartor-gateway-secret \
--namespace isartor \
--from-literal=gateway-api-key='your-production-key'
Step 3: Firewall Deployment
# k8s/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-gateway
namespace: isartor
labels:
app: isartor-gateway
spec:
replicas: 2
selector:
matchLabels:
app: isartor-gateway
template:
metadata:
labels:
app: isartor-gateway
spec:
containers:
- name: gateway
image: your-registry.io/isartor:v0.1.0
ports:
- containerPort: 8080
name: http
env:
- name: ISARTOR__HOST_PORT
value: "0.0.0.0:8080"
- name: ISARTOR__GATEWAY_API_KEY
valueFrom:
secretKeyRef:
name: isartor-gateway-secret
key: gateway-api-key
# Pluggable backends — scaled for multi-replica K8s
- name: ISARTOR__CACHE_BACKEND
value: "redis" # Shared cache across all firewall pods
- name: ISARTOR__REDIS_URL
value: "redis://redis.isartor:6379"
- name: ISARTOR__ROUTER_BACKEND
value: "vllm" # GPU-backed vLLM inference pool
- name: ISARTOR__VLLM_URL
value: "http://isartor-inference:8081"
- name: ISARTOR__VLLM_MODEL
value: "gemma-2-2b-it"
# Cache
- name: ISARTOR__CACHE_MODE
value: "both"
- name: ISARTOR__SIMILARITY_THRESHOLD
value: "0.85"
- name: ISARTOR__CACHE_TTL_SECS
value: "300"
- name: ISARTOR__CACHE_MAX_CAPACITY
value: "50000"
# Inference pool (internal service)
- name: ISARTOR__LAYER2__SIDECAR_URL
value: "http://isartor-inference:8081"
- name: ISARTOR__LAYER2__MODEL_NAME
value: "phi-3-mini"
- name: ISARTOR__LAYER2__TIMEOUT_SECONDS
value: "30"
# Embedding pool (optional — default uses in-process candle)
- name: ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL
value: "http://isartor-embedding:8082"
- name: ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME
value: "all-minilm"
- name: ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS
value: "10"
# Layer 3 — Cloud LLM
- name: ISARTOR__LLM_PROVIDER
value: "openai"
- name: ISARTOR__EXTERNAL_LLM_MODEL
value: "gpt-4o-mini"
- name: ISARTOR__EXTERNAL_LLM_API_KEY
valueFrom:
secretKeyRef:
name: isartor-llm-secret
key: api-key
# Observability
- name: ISARTOR__ENABLE_MONITORING
value: "true"
- name: ISARTOR__OTEL_EXPORTER_ENDPOINT
value: "http://otel-collector.isartor:4317"
resources:
requests:
cpu: "250m"
memory: "128Mi"
limits:
cpu: "1000m"
memory: "256Mi"
readinessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 10
periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
name: isartor-gateway
namespace: isartor
spec:
selector:
app: isartor-gateway
ports:
- port: 8080
targetPort: http
name: http
type: ClusterIP
Step 4: Inference Pool (vLLM)
vLLM provides high-throughput, GPU-optimised inference with continuous batching.
# k8s/inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-inference
namespace: isartor
labels:
app: isartor-inference
spec:
replicas: 1
selector:
matchLabels:
app: isartor-inference
template:
metadata:
labels:
app: isartor-inference
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "microsoft/Phi-3-mini-4k-instruct"
- "--host"
- "0.0.0.0"
- "--port"
- "8081"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.9"
ports:
- containerPort: 8081
name: http
resources:
requests:
nvidia.com/gpu: 1
memory: "8Gi"
limits:
nvidia.com/gpu: 1
memory: "16Gi"
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 60
periodSeconds: 10
nodeSelector:
nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: isartor-inference
namespace: isartor
spec:
selector:
app: isartor-inference
ports:
- port: 8081
targetPort: http
name: http
type: ClusterIP
Alternative: Text Generation Inference (TGI)
Replace vLLM with TGI if you prefer Hugging Face's inference server:
containers:
- name: tgi
image: ghcr.io/huggingface/text-generation-inference:latest
args:
- "--model-id"
- "microsoft/Phi-3-mini-4k-instruct"
- "--port"
- "8081"
- "--max-input-length"
- "4096"
- "--max-total-tokens"
- "8192"
Alternative: llama.cpp Server (CPU / Light GPU)
For budget clusters without heavy GPU nodes:
containers:
- name: llama-cpp
image: ghcr.io/ggml-org/llama.cpp:server
args:
- "--host"
- "0.0.0.0"
- "--port"
- "8081"
- "--hf-repo"
- "microsoft/Phi-3-mini-4k-instruct-gguf"
- "--hf-file"
- "Phi-3-mini-4k-instruct-q4.gguf"
- "--ctx-size"
- "4096"
- "--n-gpu-layers"
- "99"
Step 5: Embedding Pool (TEI) — Optional
Note: The gateway generates Layer 1 embeddings in-process via candle BertModel. This external embedding pool is optional for high-throughput deployments that want to offload embedding generation.
Text Embeddings Inference (TEI) provides optimised embedding generation.
# k8s/embedding-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-embedding
namespace: isartor
labels:
app: isartor-embedding
spec:
replicas: 2
selector:
matchLabels:
app: isartor-embedding
template:
metadata:
labels:
app: isartor-embedding
spec:
containers:
- name: tei
image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
args:
- "--model-id"
- "sentence-transformers/all-MiniLM-L6-v2"
- "--port"
- "8082"
ports:
- containerPort: 8082
name: http
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "1Gi"
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: isartor-embedding
namespace: isartor
spec:
selector:
app: isartor-embedding
ports:
- port: 8082
targetPort: http
name: http
type: ClusterIP
Step 6: Horizontal Pod Autoscaler
Gateway HPA
# k8s/gateway-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: isartor-gateway-hpa
namespace: isartor
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: isartor-gateway
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Pods
value: 4
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 2
periodSeconds: 120
Inference Pool HPA (Custom Metrics)
For GPU-based scaling, use custom metrics from Prometheus:
# k8s/inference-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: isartor-inference-hpa
namespace: isartor
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: isartor-inference
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "80"
Note: GPU-based HPA requires the Prometheus Adapter or KEDA to expose GPU metrics to the HPA controller.
Step 7: Ingress
nginx-ingress Example
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: isartor-ingress
namespace: isartor
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.isartor.example.com
secretName: isartor-tls
rules:
- host: api.isartor.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: isartor-gateway
port:
number: 8080
Istio VirtualService (Service Mesh)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: isartor-vs
namespace: isartor
spec:
hosts:
- api.isartor.example.com
gateways:
- isartor-gateway
http:
- match:
- uri:
prefix: /api/
route:
- destination:
host: isartor-gateway
port:
number: 8080
timeout: 120s
retries:
attempts: 2
perTryTimeout: 60s
Step 8: Apply Everything
# Apply in order
kubectl apply -f k8s/gateway-deployment.yaml
kubectl apply -f k8s/inference-deployment.yaml
kubectl apply -f k8s/embedding-deployment.yaml
kubectl apply -f k8s/gateway-hpa.yaml
kubectl apply -f k8s/inference-hpa.yaml
kubectl apply -f k8s/ingress.yaml
# Verify
kubectl get pods -n isartor
kubectl get svc -n isartor
kubectl get hpa -n isartor
Redis Configuration for Distributed Cache
Enterprise deployments use Redis to share the exact-match cache across all firewall pods. Configure the cache provider via environment variables or isartor.yaml:
Environment Variables
ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
YAML Configuration
exact_cache:
provider: redis
redis_url: "redis://redis-cluster.svc:6379"
# Optional: redis_db: 0
Kubernetes Topology with Redis
Deploy Redis as a StatefulSet within the cluster, accessible only via ClusterIP:
[Ingress]
|
[Isartor Deployment] <--> [Redis StatefulSet]
|
+--> [vLLM Deployment (GPU nodes)]
- Isartor pods scale horizontally for network I/O and cache hits.
- Redis ensures cache consistency across all pods.
- The vLLM GPU pool scales independently for inference throughput.
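The exact-match layer only needs a stable key per prompt, which is what makes the cache safe to share. The sketch below illustrates the idea with a dict standing in for the Redis StatefulSet; the key-derivation scheme (SHA-256 over whitespace-normalised text) is illustrative, not Isartor's actual hashing implementation.

```python
import hashlib

shared_cache = {}  # stands in for the Redis StatefulSet

def cache_key(prompt: str) -> str:
    # Normalise whitespace so trivially re-serialised prompts still collide.
    normalised = " ".join(prompt.split())
    return "isartor:l1a:" + hashlib.sha256(normalised.encode()).hexdigest()

def lookup_or_store(prompt, compute):
    """Return (response, 'hit'|'miss') against the shared cache."""
    key = cache_key(prompt)
    if key in shared_cache:
        return shared_cache[key], "hit"
    response = compute(prompt)
    shared_cache[key] = response
    return response, "miss"

# Pod 1 misses and fills the cache; Pod 2 hits the same shared entry,
# even though it received the prompt with slightly different whitespace.
r1 = lookup_or_store("What is the capital of France?", lambda p: "Paris")
r2 = lookup_or_store("What is  the capital of France?", lambda p: "Paris")
print(r1, r2)
```

Because every pod derives the same key, a new replica starts with a warm cache instead of a cold one.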
vLLM Configuration for SLM Routing
Enterprise deployments replace the embedded candle SLM with a remote vLLM inference pool for higher throughput. Configure the router backend via environment variables or isartor.yaml:
Environment Variables
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm-openai.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
YAML Configuration
slm_router:
provider: remote_http
remote_url: "http://vllm-openai.svc:8000"
model: "meta-llama/Llama-3-8B-Instruct"
Docker Compose Example (Enterprise Sidecar)
For development or staging environments that mirror enterprise topology:
services:
isartor:
image: isartor-ai/isartor:latest
ports:
- "8080:8080"
environment:
- ISARTOR__CACHE_BACKEND=redis
- ISARTOR__REDIS_URL=redis://redis-cluster:6379
- ISARTOR__ROUTER_BACKEND=vllm
- ISARTOR__VLLM_URL=http://vllm-openai:8000
- ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
depends_on:
- redis
- vllm-openai
redis:
image: redis:7
ports:
- "6379:6379"
vllm-openai:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
Observability in Level 3
For Kubernetes deployments, you have several options:
| Approach | Stack | Effort |
|---|---|---|
| Self-managed | OTel Collector DaemonSet → Jaeger + Prometheus + Grafana | Medium |
| Managed (AWS) | AWS X-Ray + CloudWatch + Managed Grafana | Low |
| Managed (GCP) | Cloud Trace + Cloud Monitoring | Low |
| Managed (Azure) | Azure Monitor + Application Insights | Low |
| Third-party | Datadog / New Relic / Grafana Cloud | Low |
The gateway exports traces and metrics via OTLP gRPC to whatever ISARTOR__OTEL_EXPORTER_ENDPOINT points at. See Metrics & Tracing for detailed setup.
Scalability Deep-Dive
Level 3 is designed for horizontal scaling. The Pluggable Trait Provider architecture ensures every component can scale independently:
Stateless Gateway Pods
The Isartor gateway binary is fully stateless when configured with cache_backend=redis and router_backend=vllm. All request-scoped state (cache, inference) is offloaded to external services, meaning:
- Gateway pods scale linearly — add replicas via HPA without coordination overhead.
- Zero warm-up penalty — new pods serve requests immediately (no model loading, no cache priming).
- Rolling updates — deploy new versions with zero downtime; old and new pods share the same Redis cache.
Shared Cache via Redis
With ISARTOR__CACHE_BACKEND=redis:
| Benefit | Impact |
|---|---|
| Consistent hit rate | All pods read/write the same cache — no per-pod cold caches |
| Memory efficiency | Cache memory is centralised, not duplicated N times |
| Persistence | Redis AOF/RDB survives pod restarts |
| Cluster mode | Redis Cluster or ElastiCache provides sharded, HA caching |
GPU Inference Pool (vLLM)
With ISARTOR__ROUTER_BACKEND=vllm:
| Benefit | Impact |
|---|---|
| Independent GPU scaling | Scale inference replicas separately from gateway pods |
| Continuous batching | vLLM's PagedAttention maximises GPU utilisation |
| Mixed hardware | Gateway runs on cheap CPU nodes; inference on GPU nodes |
| Cost control | Scale inference to zero when idle (KEDA + queue-depth trigger) |
Scaling Dimensions
| Dimension | Knob | Metric |
|---|---|---|
| Gateway replicas | HPA minReplicas / maxReplicas | CPU utilisation, request rate |
| Inference replicas | HPA on custom GPU metrics | GPU utilisation, queue depth |
| Cache capacity | ISARTOR__CACHE_MAX_CAPACITY | Cache hit rate, memory usage |
| Concurrency | HPA + replica scaling | P95 latency, request rate |
| Redis | Redis Cluster nodes | Key count, memory, eviction rate |
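These dimensions interact: every cache hit removes load from the inference and cloud tiers downstream. A back-of-envelope sizing sketch, using the 71% L1 deflection figure quoted for repetitive agentic traffic; the SLM local-resolution share and per-replica throughput are placeholders to replace with your own measurements.

```python
import math

total_rps = 500            # incoming requests/second at the gateway
l1_deflection = 0.71       # L1a+L1b hit rate for repetitive agentic traffic
slm_local_share = 0.5      # placeholder: fraction of survivors L2 resolves locally
rps_per_vllm_replica = 40  # placeholder: measure on your own GPUs

after_l1 = total_rps * (1 - l1_deflection)            # load reaching L2
to_cloud = after_l1 * (1 - slm_local_share)           # load reaching L3
replicas = math.ceil(after_l1 / rps_per_vllm_replica)

print(f"L2 inference load: {after_l1:.0f} rps -> {replicas} vLLM replicas")
print(f"L3 cloud load: {to_cloud:.1f} rps")
```

Note how the gateway tier sees 500 rps while the GPU pool only needs to absorb ~145 rps, which is why the two scale on different metrics.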
Cost Optimisation
| Strategy | Description |
|---|---|
| Spot / preemptible nodes | Use for inference pods (they're stateless and restart quickly) |
| Scale-to-zero | Use KEDA with queue-depth trigger to scale inference to 0 when idle |
| Right-size GPU | A100 80 GB for large models, T4/L4 for Phi-3-mini (4 GB VRAM is sufficient) |
| Shared GPU | NVIDIA MPS or MIG to run multiple inference pods per GPU |
| Semantic cache | Higher ISARTOR__CACHE_MAX_CAPACITY = fewer inference calls |
| Smaller quantisation | Q4_K_M uses less VRAM at marginal quality cost |
Security Checklist
- TLS termination at ingress (cert-manager + Let's Encrypt or cloud certs)
- mTLS between services (Istio / Linkerd / Cilium)
- ISARTOR__GATEWAY_API_KEY from a Kubernetes Secret, not plaintext
- ISARTOR__EXTERNAL_LLM_API_KEY from a Kubernetes Secret
- Network policies restricting pod-to-pod communication
- RBAC: least-privilege ServiceAccounts for each workload
- Pod security standards: restricted or baseline
- Image scanning (Trivy, Snyk) in CI pipeline
- Audit logging enabled on the cluster
Downgrading to Level 2
If Kubernetes overhead doesn't justify the scale:
- Export your env vars from the Kubernetes ConfigMap/Secret.
- Map them into docker/.env.full.
- Run docker compose -f docker-compose.sidecar.yml up --build.
No code changes — the binary is identical across all three tiers.
Air-Gapped / Offline Deployment
Overview
Isartor is architecturally the most air-gap-friendly LLM gateway available. Its pure-Rust statically compiled binary embeds all inference models at build time, requires no runtime dependencies, and validates licenses with an offline HMAC check — so Isartor itself does not initiate unsolicited telemetry or license calls to external services.
The zero-phone-home guarantee applies to Isartor-managed network paths: the
--offline flag disables L3 cloud routing and external observability backends
at the application layer, and our CI phone-home audit test (see
tests/phone_home_audit.rs) exercises these code paths on every commit.
Supported regulated industries: defense, healthcare (HIPAA), finance (SOX), and government (FedRAMP).
Pre-Deployment Checklist
Complete these steps before deploying Isartor in an air-gapped environment:
1. Download the air-gapped Docker image:

   docker pull ghcr.io/isartor-ai/isartor:latest-airgapped

   This image includes local copies of the L1b embedding models to minimize or avoid external downloads during normal operation in most setups. See Image Size Comparison for size details, and follow any additional configuration steps your environment requires to operate fully offline.

2. Transfer to your air-gapped environment via your organisation's approved media transfer process (USB, air-gap data diode, etc.).

3. Enable offline mode:

   export ISARTOR__OFFLINE_MODE=true

   Alternatively, pass --offline on the command line: isartor --offline

4. Disable L3 or point it at an internal LLM endpoint:

   - For strictly air-gapped / zero-egress deployments, you must enable offline mode (step 3). Leaving ISARTOR__EXTERNAL_LLM_API_KEY unset alone does not prevent the gateway from attempting outbound L3 calls to the default external endpoint on cache misses.
   - To run fully local (cache + SLM only) with no outbound attempts, enable offline mode and leave ISARTOR__EXTERNAL_LLM_API_KEY unset.
   - To route L3 to a self-hosted model, see Connecting to an Internal LLM.

5. Run isartor check to confirm zero external connections:

   isartor check

   Expected output (with offline mode active):

   Isartor Connectivity Audit
   ──────────────────────────
   Required (L3 cloud routing):
     → api.openai.com:443 [NOT CONFIGURED] (BLOCKED — offline mode active)
   Optional (observability / monitoring):
     → http://localhost:4317 [NOT CONFIGURED]
   Internal only (no external):
     → (in-memory cache — no network connection) [CONFIGURED - internal]
   Zero hidden telemetry connections: ✓ VERIFIED
   Air-gap compatible: ✓ YES (L3 disabled or offline mode active)

6. Run isartor audit verify (planned — see issue #3) to confirm the signed audit log is functioning correctly.
Connecting to an Internal LLM
In this configuration Isartor acts as a fully air-gapped deflection layer in front of an internal LLM. 100% of traffic stays inside the perimeter: L1a and L1b handle cached / semantically similar prompts locally, and only genuine cache misses are forwarded to your self-hosted model over the internal network.
# Route L3 to a self-hosted vLLM instance on the internal network.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai # vLLM exposes an OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct
# Enable offline mode to block any accidental external connections.
export ISARTOR__OFFLINE_MODE=true
# Start the gateway.
isartor
Note: ISARTOR__EXTERNAL_LLM_URL sets the L3 endpoint URL. Point it at your internal vLLM or TGI server.
With this configuration:
- L1a (exact cache) deflects duplicate prompts instantly (< 1 ms).
- L1b (semantic cache) deflects semantically similar prompts (1–5 ms).
- L3 forwards surviving cache-miss prompts to your internal vLLM.
- Zero bytes leave the network perimeter.
Startup Status Banner
When offline mode is active, Isartor prints a status banner at startup so operators can confirm the configuration at a glance:
┌──────────────────────────────────────────────────────┐
│ [Isartor] OFFLINE MODE ACTIVE │
├──────────────────────────────────────────────────────┤
│ ✓ L1a Exact Cache: active │
│ ✓ L1b Semantic Cache: active │
│ - L2 SLM Router: disabled (ENABLE_SLM_ROUTER=false)│
│ ✗ L3 Cloud Logic: DISABLED (offline mode) │
│ ✗ Telemetry export: DISABLED if external endpoint │
│ ✓ License validation: offline HMAC check │
└──────────────────────────────────────────────────────┘
Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| ISARTOR__OFFLINE_MODE | false | Enable air-gap mode. Blocks L3 cloud calls. |
| ISARTOR__EXTERNAL_LLM_URL | — | Internal LLM endpoint (vLLM, TGI, etc.). |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Model name passed to the internal LLM. |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for L1b cache hits. Lower values increase local deflection. |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://localhost:4317 | OTel collector endpoint. External URLs are suppressed in offline mode. |
For the complete variable listing, see the Configuration Reference.
Image Size Comparison
| Image | Tag | Includes models | Compressed size |
|---|---|---|---|
| Base | latest | No (downloads on first run) | ~120 MB |
| Air-gapped | latest-airgapped | Yes (all-MiniLM-L6-v2 embedded) | ~210 MB |
The latest-airgapped image is approximately 90 MB larger due to the
pre-bundled embedding model. This is the recommended image for any environment
with restricted outbound internet access.
Compliance Notes
FedRAMP / NIST 800-53
This deployment posture supports the following NIST 800-53 controls:
| Control | Description | How Isartor Supports It |
|---|---|---|
| AU-2 | Audit Logging | Every prompt, deflection decision, and L3 call is logged as a structured JSON event with tracing spans. |
| SC-7 | Boundary Protection | ISARTOR__OFFLINE_MODE=true enforces a hard block on all outbound connections. The phone-home audit CI test verifies this. |
| SI-4 | Information System Monitoring | OpenTelemetry traces + Prometheus metrics provide real-time visibility into the deflection stack. Internal-only OTel endpoints are supported. |
| CM-6 | Configuration Settings | All settings are controlled via environment variables with documented defaults. No runtime code changes are needed. |
HIPAA
When ISARTOR__OFFLINE_MODE=true and L3 is pointed at an internal model:
- PHI in prompts never leaves the network perimeter.
- The L1b semantic cache computes embeddings in-process using a pure-Rust candle model — no external API calls.
- Audit logs are written to stdout for ingestion by your internal SIEM.
Disclaimer
This document describes deployment architecture. The controls described above are architectural claims based on code behaviour — they are not a formal compliance certification. Consult your compliance team and engage a qualified assessor for formal FedRAMP authorization or HIPAA compliance review.
Further Reading
Configuration Reference
Complete reference for every Isartor configuration variable, CLI command, and provider option.
Configuration Loading Order
Isartor loads configuration in the following order (later sources override earlier ones):
1. Compiled defaults — baked into the binary
2. isartor.toml — if present in the working directory or ~/.isartor/
3. Environment variables — ISARTOR__... with double-underscore separators
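The double-underscore separator maps each environment variable onto a nested configuration key. The sketch below shows the convention only; the real loader is implemented in Rust, and the folding logic here is illustrative.

```python
def env_to_nested(environ, prefix="ISARTOR__"):
    """Fold ISARTOR__A__B=v style variables into {'a': {'b': 'v'}}."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

env = {"ISARTOR__LAYER2__MODEL_NAME": "phi-3-mini", "ISARTOR__PORT": "8080"}
result = env_to_nested(env)
print(result)
```

So ISARTOR__LAYER2__MODEL_NAME corresponds to the layer2.model_name key, overriding any value from isartor.toml or the compiled defaults.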
Generate a starter config file with:
isartor init
Master Configuration Table
| YAML Key | Environment Variable | Type | Default | Description |
|---|---|---|---|---|
| server.host | ISARTOR__HOST | string | 0.0.0.0 | Host address for server binding |
| server.port | ISARTOR__PORT | int | 8080 | Port for HTTP server |
| exact_cache.provider | ISARTOR__CACHE_BACKEND | string | memory | Layer 1a cache backend: memory or redis |
| exact_cache.redis_url | ISARTOR__REDIS_URL | string | (none) | Redis connection string (if provider=redis) |
| exact_cache.redis_db | ISARTOR__REDIS_DB | int | 0 | Redis database index |
| semantic_cache.provider | ISARTOR__SEMANTIC_BACKEND | string | candle | Layer 1b semantic cache: candle (in-process) or tei (external) |
| semantic_cache.remote_url | ISARTOR__TEI_URL | string | (none) | TEI endpoint (if provider=tei) |
| slm_router.provider | ISARTOR__ROUTER_BACKEND | string | embedded | Layer 2 router: embedded or vllm |
| slm_router.remote_url | ISARTOR__VLLM_URL | string | (none) | vLLM/TGI endpoint (if provider=vllm) |
| slm_router.model | ISARTOR__VLLM_MODEL | string | gemma-2-2b-it | Model name/path for SLM router |
| slm_router.model_path | ISARTOR__MODEL_PATH | string | (baked-in) | Path to GGUF model file (embedded mode) |
| slm_router.classifier_mode | ISARTOR__LAYER2__CLASSIFIER_MODE | string | tiered | Classifier mode: tiered (TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX) |
| slm_router.max_answer_tokens | ISARTOR__LAYER2__MAX_ANSWER_TOKENS | u64 | 2048 | Max tokens the SLM may generate for a local answer |
| fallback.openai_api_key | ISARTOR__OPENAI_API_KEY | string | (none) | OpenAI API key for Layer 3 fallback |
| fallback.anthropic_api_key | ISARTOR__ANTHROPIC_API_KEY | string | (none) | Anthropic API key for Layer 3 fallback |
| llm_provider | ISARTOR__LLM_PROVIDER | string | openai | LLM provider (see below for full list) |
| external_llm_model | ISARTOR__EXTERNAL_LLM_MODEL | string | gpt-4o-mini | Model name to request from the provider |
| external_llm_api_key | ISARTOR__EXTERNAL_LLM_API_KEY | string | (none) | API key for the configured LLM provider (not needed for ollama) |
| l3_timeout_secs | ISARTOR__L3_TIMEOUT_SECS | u64 | 120 | HTTP timeout applied to all Layer 3 provider requests |
| enable_context_optimizer | ISARTOR__ENABLE_CONTEXT_OPTIMIZER | bool | true | Master switch for L2.5 context optimiser |
| context_optimizer_dedup | ISARTOR__CONTEXT_OPTIMIZER_DEDUP | bool | true | Enable cross-turn instruction deduplication |
| context_optimizer_minify | ISARTOR__CONTEXT_OPTIMIZER_MINIFY | bool | true | Enable static minification (comments, rules, blanks) |
Sections
Server
- server.host, server.port: Bind address and port.
Layer 1a: Exact Match Cache
- exact_cache.provider: memory or redis
- exact_cache.redis_url, exact_cache.redis_db: Redis config
Layer 1b: Semantic Cache
- semantic_cache.provider: candle or tei
- semantic_cache.remote_url: TEI endpoint
- Requests that carry x-isartor-session-id, x-thread-id, x-session-id, or x-conversation-id are isolated into a session-aware cache scope. The same scope can also be provided in request bodies via session_id, thread_id, conversation_id, or metadata.*. If no session identifier is present, Isartor keeps the legacy global-cache behavior.
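A sketch of how a session-aware cache scope can be derived from request headers. The fallback ordering follows the list above but is illustrative; Isartor's actual precedence between headers and body fields may differ.

```python
# Candidate headers, checked in order; the first one present wins.
SESSION_HEADERS = [
    "x-isartor-session-id",
    "x-thread-id",
    "x-session-id",
    "x-conversation-id",
]

def cache_scope(headers: dict) -> str:
    """Return a session-scoped cache namespace, or the legacy global scope."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in SESSION_HEADERS:
        if name in lowered:
            return f"session:{lowered[name]}"
    return "global"

print(cache_scope({"X-Thread-Id": "abc123"}))
print(cache_scope({}))
```

Scoping prevents one agent session's cached answers from leaking into an unrelated session, at the cost of a colder per-session cache.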
Layer 2: SLM Router
- slm_router.provider: embedded or vllm
- slm_router.remote_url, slm_router.model, slm_router.model_path: Router config
- slm_router.classifier_mode: tiered (default — TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX)
- slm_router.max_answer_tokens: Max tokens the SLM may generate for a local answer (default 2048)
Layer 2.5: Context Optimiser
L2.5 compresses repeated instruction payloads (CLAUDE.md, copilot-instructions.md, skills blocks) before they reach the cloud, reducing input tokens on every L3 call.
- enable_context_optimizer: Master switch (default true). Set to false to disable L2.5 entirely.
- context_optimizer_dedup: Enable cross-turn instruction deduplication (default true). When the same instruction block is seen in consecutive turns of the same session, it is replaced with a compact hash reference.
- context_optimizer_minify: Enable static minification (default true). Strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
The pipeline processes system/instruction messages from OpenAI, Anthropic, and native request formats. See Deflection Stack — L2.5 for architecture details.
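A sketch of what static minification does to an instruction payload. The regexes below are illustrative examples of the comment, horizontal-rule, and blank-line stripping described above, not Isartor's exact rules.

```python
import re

def minify(text: str) -> str:
    # Strip HTML/XML comments (often internal notes in CLAUDE.md files).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    # Strip decorative horizontal-rule lines (---, ===, ***, ...).
    text = re.sub(r"^\s*[-=*_]{3,}\s*$", "", text, flags=re.M)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

instructions = """# Project rules
<!-- internal note: do not ship -->
---


Always answer in English."""

minified = minify(instructions)
print(minified)
```

Every character removed here is an input token saved on each subsequent L3 call for the session.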
Layer 3: Cloud Fallbacks
- fallback.openai_api_key, fallback.anthropic_api_key: API keys for external LLMs
- llm_provider: Select the active provider. All providers are powered by rig-core except copilot, which uses Isartor's native GitHub Copilot adapter: openai (default), azure, anthropic, xai, gemini, mistral, groq, deepseek, cohere, galadriel, hyperbolic, huggingface, mira, moonshot, ollama (local, no key), openrouter, perplexity, together, copilot (GitHub Copilot subscription-backed L3)
- external_llm_model: Model name for the selected provider (e.g. gpt-4o-mini, gemini-2.0-flash, mistral-small-latest, llama-3.1-8b-instant, deepseek-chat, command-r, sonar, moonshot-v1-128k)
- external_llm_api_key: API key for the configured provider (not needed for ollama)
- l3_timeout_secs: Shared timeout, in seconds, for all Layer 3 provider HTTP calls
TOML Config Example
Generate a scaffold with isartor init, then edit isartor.toml:
[server]
host = "0.0.0.0"
port = 8080
[exact_cache]
provider = "memory" # "memory" or "redis"
# redis_url = "redis://127.0.0.1:6379"
# redis_db = 0
[semantic_cache]
provider = "candle" # "candle" or "tei"
# remote_url = "http://localhost:8082"
[slm_router]
provider = "embedded" # "embedded" or "vllm"
# remote_url = "http://localhost:8000"
# model = "gemma-2-2b-it"
# L2.5 Context Optimiser (all enabled by default)
# enable_context_optimizer = true
# context_optimizer_dedup = true
# context_optimizer_minify = true
[fallback]
# openai_api_key = "sk-..."
# anthropic_api_key = "sk-ant-..."
# llm_provider = "openai"
# external_llm_model = "gpt-4o-mini"
# external_llm_api_key = "sk-..."
Per-Tier Defaults
| Setting | Level 1 (Minimal) | Level 2 (Sidecar) | Level 3 (Enterprise) |
|---|---|---|---|
| Cache backend | memory | memory | redis |
| Semantic backend | candle | candle | tei (optional) |
| SLM router | embedded | embedded or sidecar | vllm |
| LLM provider | openai | openai | any |
| Monitoring | false | true | true |
Provider-Specific Configuration
Each provider requires ISARTOR__EXTERNAL_LLM_API_KEY (except Ollama) and a matching ISARTOR__LLM_PROVIDER value:
# OpenAI (default)
export ISARTOR__LLM_PROVIDER=openai
export ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure
# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic
export ISARTOR__EXTERNAL_LLM_MODEL=claude-3-haiku-20240307
# xAI (Grok)
export ISARTOR__LLM_PROVIDER=xai
# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini
export ISARTOR__EXTERNAL_LLM_MODEL=gemini-2.0-flash
# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama
export ISARTOR__EXTERNAL_LLM_MODEL=llama3
# GitHub Copilot (configured automatically by `isartor connect claude-copilot`)
export ISARTOR__LLM_PROVIDER=copilot
export ISARTOR__EXTERNAL_LLM_MODEL=claude-sonnet-4.5
Setting API Keys with the CLI
Use isartor set-key for interactive key management:
isartor set-key --provider openai
isartor set-key --provider anthropic
isartor set-key --provider xai
This writes the key to isartor.toml or the appropriate env file.
CLI Commands
| Command | Description |
|---|---|
isartor up | Start the API gateway only (recommended default). Flag: --detach to run in background |
isartor up <copilot|claude|antigravity> | Start the gateway plus the CONNECT proxy for that client |
isartor init | Generate a commented isartor.toml config scaffold |
isartor demo | Run the post-install showcase (cache-only, or live + cache when a provider is configured) |
isartor check | Audit outbound connections |
isartor connect <client> | Configure AI clients to route through Isartor |
isartor connect copilot | Configure Copilot CLI with CONNECT proxy + TLS MITM |
isartor connect claude-copilot | Configure Claude Code to use GitHub Copilot through Isartor |
isartor stats | Show total prompts, counts by layer, and recent prompt routing history |
isartor set-key --provider <name> | Set LLM provider API key (writes to isartor.toml or env file) |
isartor stop | Stop a running Isartor instance (uses PID file). Flags: --force (SIGKILL), --pid-file <path> |
isartor update | Self-update to the latest (or specific) version. Flags: --version <tag>, --dry-run, --force |
See also: Architecture · Metrics & Tracing · Troubleshooting
Metrics & Tracing
Definitive reference for Isartor's OpenTelemetry traces, metrics, structured logging, and observability stack — from local development to Kubernetes.
Overview
Isartor uses OpenTelemetry for distributed
tracing and metrics, plus tracing-subscriber with a JSON layer for
structured logging.
| Signal | Protocol | Default Endpoint |
|---|---|---|
| Traces | OTLP gRPC | http://localhost:4317 |
| Metrics | OTLP gRPC | http://localhost:4317 |
| Logs | stdout (JSON) | — |
When ISARTOR__ENABLE_MONITORING=false (default), only the console log
layer is active — zero OTel overhead.
Architecture
┌─────────────┐ ┌──────────────────┐
│ Isartor │ OTLP gRPC │ OTel Collector │
│ Gateway │─────────────────▶│ :4317 │
│ │ (traces + │ │
│ │ metrics) │ Pipelines: │
└─────────────┘ │ traces → Jaeger │
│ metrics → Prom │
└───┬──────────┬────┘
│ │
┌──────────▼──┐ ┌────▼──────────┐
│ Jaeger │ │ Prometheus │
│ :16686 │ │ :9090 │
│ (UI) │ │ (scrape) │
└─────────────┘ └───────┬───────┘
│
┌───────▼───────┐
│ Grafana │
│ :3000 │
│ (dashboards) │
└───────────────┘
Enabling Monitoring
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://localhost:4317
RUST_LOG=info,h2=warn,hyper=warn,tower=warn # optional override
When ISARTOR__ENABLE_MONITORING=false (the default), Isartor uses console-only logging via tracing-subscriber with RUST_LOG filtering. No OTel SDK is initialised — zero overhead.
Telemetry Initialisation (src/telemetry.rs)
init_telemetry() returns an OtelGuard (RAII). The guard holds the
SdkTracerProvider and SdkMeterProvider; dropping it flushes pending
telemetry and shuts down exporters gracefully.
| Component | Description |
|---|---|
| JSON stdout layer | Structured logs emitted as JSON when monitoring is on |
| Pretty console layer | Human-readable output when monitoring is off |
| OTLP trace exporter | gRPC via opentelemetry-otlp → Collector |
| OTLP metric exporter | gRPC via opentelemetry-otlp → Collector |
| EnvFilter | Reads RUST_LOG, defaults to info,h2=warn,hyper=warn,tower=warn |
Service identity:
service.name = "isartor-gateway"
service.version = env!("CARGO_PKG_VERSION") # e.g. "0.1.0"
Distributed Traces — Span Reference
Every request gets a root span (gateway_request) from the monitoring
middleware. Child spans are created per-layer:
Root Span
| Span Name | Source | Key Attributes |
|---|---|---|
gateway_request | src/middleware/monitoring.rs | http.method, http.route, http.status_code, client.address, isartor.final_layer |
http.status_code and isartor.final_layer are recorded after the
response returns (empty → filled pattern).
Layer 0 — Auth
| Span Name | Source | Key Attributes |
|---|---|---|
(inline tracing::debug!/warn!) | src/middleware/auth.rs | — |
Auth is lightweight; no dedicated span is created. Events are logged at debug/warn level.
Layer 1a — Exact Cache
| Span Name | Source | Key Attributes |
|---|---|---|
l1a_exact_cache_get | src/adapters/cache.rs | cache.backend (memory|redis), cache.key, cache.hit |
l1a_exact_cache_put | src/adapters/cache.rs | cache.backend, cache.key, response_len |
Layer 1b — Semantic Cache
| Span Name | Source | Key Attributes |
|---|---|---|
l1b_semantic_cache_search | src/vector_cache.rs | cache.entries_scanned, cache.hit, cosine_similarity |
l1b_semantic_cache_insert | src/vector_cache.rs | cache.evicted, cache.size_after |
cosine_similarity — the best-match score, formatted to 4 decimal places. This is the key attribute for tuning the similarity threshold.
Layer 2 — SLM Triage
| Span Name | Source | Key Attributes |
|---|---|---|
layer2_slm | src/middleware/slm_triage.rs | slm.complexity_score (TEMPLATE|SNIPPET|COMPLEX; legacy binary mode: SIMPLE|COMPLEX) |
l2_classify_intent | src/adapters/router.rs | router.backend (embedded_candle|remote_vllm), router.decision, router.model, router.url, prompt_len |
Layer 2.5 — Context Optimiser
| Span Name | Source | Key Attributes |
|---|---|---|
layer2_5_context_optimizer | src/middleware/context_optimizer.rs | context.bytes_saved, context.strategy (e.g. "classifier+dedup", "classifier+log_crunch") |
When L2.5 modifies the request body, it also sets the response header x-isartor-context-optimized: bytes_saved=<N>.
Layer 3 — Cloud LLM
| Span Name | Source | Key Attributes |
|---|---|---|
layer3_llm | src/handler.rs | ai.prompt.length_bytes, provider.name, model |
Custom Span Attributes — Quick Reference
These are the Isartor-specific attributes (beyond standard OTel semantic conventions) that appear on spans and are useful for filtering in Jaeger / Tempo:
| Attribute | Type | Where Set | Purpose |
|---|---|---|---|
isartor.final_layer | string | Root gateway_request span | Which layer resolved the request |
cache.hit | bool | L1a and L1b spans | Whether the cache lookup succeeded |
cosine_similarity | string | L1b search span | Best cosine-similarity score (4 d.p.)
cache.entries_scanned | u64 | L1b search span | Entries scanned during similarity search |
cache.backend | string | L1a get/put spans | "memory" or "redis" |
router.decision | string | L2 classify span | "TEMPLATE", "SNIPPET", or "COMPLEX" (tiered mode); "SIMPLE" or "COMPLEX" (binary mode) |
router.backend | string | L2 classify span | "embedded_candle" or "remote_vllm" |
context.bytes_saved | u64 | L2.5 optimizer span | Bytes removed by compression pipeline |
context.strategy | string | L2.5 optimizer span | Pipeline stages that modified content (e.g. "classifier+dedup") |
provider.name | string | L3 handler span | e.g. "openai", "xai", "azure" |
model | string | L3 handler span | e.g. "gpt-4o", "grok-beta" |
http.status_code | u16 | Root span | HTTP response status code |
client.address | string | Root span | Client IP (from x-forwarded-for) |
OTel Metrics (src/metrics.rs)
Seven instruments are registered as a singleton GatewayMetrics via OnceLock:
| Metric Name | Type | Attributes | Description |
|---|---|---|---|
isartor_requests_total | Counter | final_layer, status_code, traffic_surface, client, endpoint_family, tool | Total prompts processed |
isartor_request_duration_seconds | Histogram | final_layer, status_code, traffic_surface, client, endpoint_family | End-to-end request duration |
isartor_layer_duration_seconds | Histogram | layer_name, tool | Per-layer latency |
isartor_tokens_saved_total | Counter | final_layer, traffic_surface, client, endpoint_family, tool | Estimated tokens saved by early resolve |
isartor_errors_total | Counter | layer, error_class, tool | Error occurrences by layer / agent |
isartor_retries_total | Counter | operation, attempts, outcome, tool | Retry outcomes by agent |
isartor_cache_events_total | Counter | cache_layer, outcome, tool | L1 / L1a / L1b hit-miss safety by agent |
Where Metrics Are Recorded
| Call Site | Metrics Recorded |
|---|---|
root_monitoring_middleware | record_request_with_context(), record_tokens_saved_with_context() (if early) |
proxy::connect::emit_proxy_decision() | record_request_with_context(), record_tokens_saved_with_context() (if early) |
cache_middleware (L1 hit) | record_layer_duration("L1a_ExactCache" | "L1b_SemanticCache") |
slm_triage_middleware (L2 hit) | record_layer_duration("L2_SLM") |
context_optimizer_middleware | record_layer_duration("L2_5_ContextOptimiser") (when bytes saved > 0) |
chat_handler (L3) | record_layer_duration("L3_Cloud") |
Request Dimensions
Unified prompt telemetry distinguishes:
- traffic_surface: gateway or proxy
- client: direct, openai, anthropic, copilot, claude, antigravity, etc.
- endpoint_family: native, openai, or anthropic
Token Estimation
estimate_tokens(prompt) uses the heuristic: max(1, prompt.len() / 4).
This is intentionally conservative — the metric tracks relative savings
rather than precise token counts.
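The heuristic is trivial to transcribe; a Python equivalent of the documented formula:

```python
def estimate_tokens(prompt: str) -> int:
    # Mirrors the documented heuristic: max(1, prompt.len() / 4)
    return max(1, len(prompt) // 4)

print(estimate_tokens(""))         # 1 — an empty prompt still counts as one token
print(estimate_tokens("a" * 400))  # 100
```

The floor at 1 guarantees every deflected request contributes at least one saved token to the counter.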
ROI — isartor_tokens_saved_total
This is the headline business metric. Every request resolved before Layer 3 (exact cache, semantic cache, or local SLM) avoids a round-trip to the external LLM provider.
# Daily token savings
sum(increase(isartor_tokens_saved_total[24h]))
# Savings by layer
sum by (final_layer) (rate(isartor_tokens_saved_total[1h]))
# Prompt volume by traffic surface
sum by (traffic_surface) (rate(isartor_requests_total[5m]))
# Prompt volume by client
sum by (client) (rate(isartor_requests_total[5m]))
# Estimated cost savings (assuming $0.01 per 1K tokens)
sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01
Use this metric to justify infrastructure spend for the caching / SLM layers.
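As a worked example of the cost query above, sketched in Python (the daily token volume and the $0.01/1K price are assumed figures, not measurements):

```python
def daily_savings_usd(tokens_saved: int, usd_per_1k: float = 0.01) -> float:
    # Same arithmetic as the PromQL cost query: tokens / 1000 * price
    return tokens_saved / 1000 * usd_per_1k

# e.g. 1.2M deflected tokens per day at $0.01 per 1K tokens
print(daily_savings_usd(1_200_000))  # 12.0 (USD/day)
```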
Docker Compose — Local Observability Stack
Use the provided compose file for local development:
cd docker
docker compose -f docker-compose.observability.yml up -d
| Service | Port | Purpose |
|---|---|---|
| OTel Collector | 4317 | OTLP gRPC receiver |
| Jaeger | 16686 | Trace UI |
| Prometheus | 9090 | Metrics scrape + query |
| Grafana | 3000 | Dashboards (anonymous admin) |
Configuration files:
| File | Purpose |
|---|---|
docker/otel-collector-config.yaml | Collector pipelines |
docker/prometheus.yml | Scrape targets |
Pipeline Flow
Isartor ──OTLP gRPC──▶ OTel Collector ──▶ Jaeger (traces)
└──▶ Prometheus (metrics)
│
▼
Grafana
OTel Collector Configuration
The collector config is at docker/otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
otlp:
endpoint: "jaeger:4317"
tls:
insecure: true
debug:
verbosity: basic
service:
pipelines:
traces:
receivers: [otlp]
exporters: [otlp, debug]
metrics:
receivers: [otlp]
exporters: [prometheus, debug]
Prometheus Configuration
The Prometheus config is at docker/prometheus.yml:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 5s
static_configs:
- targets: ['otel-collector:8889']
Prometheus scrapes the OTel Collector's Prometheus exporter on port 8889 every 5 seconds.
Per-Tier Setup
Level 1 — Minimal (Console Logs Only)
No observability stack is needed. Use RUST_LOG for structured console output:
ISARTOR__ENABLE_MONITORING=false
RUST_LOG=isartor=info
For debug-level output during development:
RUST_LOG=isartor=debug,tower_http=trace
Level 2 — Docker Compose (Full Stack)
The docker-compose.sidecar.yml includes the complete observability stack:
cd docker
docker compose -f docker-compose.sidecar.yml up --build
Services included:
| Service | URL | Purpose |
|---|---|---|
| OTel Collector | localhost:4317 (gRPC) | Receives OTLP from gateway |
| Jaeger UI | http://localhost:16686 | View distributed traces |
| Prometheus | http://localhost:9090 | Query metrics |
| Grafana | http://localhost:3000 | Dashboards (anonymous admin access) |
The gateway is pre-configured with:
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317
Level 3 — Kubernetes (Managed or Self-Hosted)
| Approach | Recommended Stack | Notes |
|---|---|---|
| Self-managed | OTel Collector DaemonSet + Jaeger Operator + kube-prometheus-stack | Full control, higher ops burden |
| AWS | AWS X-Ray + CloudWatch + Managed Grafana | ADOT Collector as sidecar/DaemonSet |
| GCP | Cloud Trace + Cloud Monitoring + Cloud Logging | Use OTLP exporter to Cloud Trace |
| Azure | Application Insights + Azure Monitor | Use Azure Monitor OpenTelemetry exporter |
| Grafana Cloud | Grafana Alloy + Grafana Cloud | Low ops, managed Prometheus + Tempo |
| Datadog | Datadog Agent + OTel Collector | Enterprise APM |
For all options, point the gateway at the collector:
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.isartor:4317
Grafana Dashboard Queries (PromQL)
| Panel | PromQL |
|---|---|
| Request Rate | rate(isartor_requests_total[5m]) |
| P95 Latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| Layer Resolution | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| Traffic Surface Split | sum by (traffic_surface) (rate(isartor_requests_total[5m])) |
| Client Split | sum by (client) (rate(isartor_requests_total[5m])) |
| Per-Layer Latency | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens Saved / Hour | sum(increase(isartor_tokens_saved_total[1h])) |
| Tokens Saved by Layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Cache Hit Rate | rate(isartor_requests_total{final_layer=~"L1.*"}[5m]) / rate(isartor_requests_total[5m]) |
Jaeger — Useful Searches
| Goal | Search |
|---|---|
| Slow requests (> 500 ms) | Service isartor-gateway, Min Duration 500ms |
| Cache misses | Tag cache.hit=false |
| Semantic cache tuning | Tag cosine_similarity — sort by value |
| Layer 3 fallbacks | Tag isartor.final_layer=L3_Cloud |
| SLM local resolutions | Tag router.decision=TEMPLATE or router.decision=SNIPPET (tiered); router.decision=SIMPLE (binary) |
Trace Anatomy
A typical trace for a cache-miss, locally-resolved request:
isartor-gateway
└─ HTTP POST /api/chat [250ms]
├─ Layer0_AuthCheck [0.1ms]
├─ Layer1_SemanticCache (MISS) [5ms]
├─ Layer2_IntentClassifier [80ms]
│ intent=TEMPLATE, confidence=0.97
└─ Layer2_LocalExecutor [160ms]
model=phi-3-mini, tokens=42
Built-in User Views
For quick operator checks without a separate telemetry stack:
isartor stats --gateway-url http://localhost:8080
isartor stats --gateway-url http://localhost:8080 --by-tool
Add --gateway-api-key <key> only when gateway auth is enabled.
--by-tool prints richer per-agent stats: requests, cache hits/misses,
average latency, retry count, error count, and L1a/L1b safety ratios.
Built-in JSON endpoints:
- GET /health
- GET /debug/proxy/recent
- GET /debug/stats/prompts
- GET /debug/stats/agents
Alerting Rules
Prometheus Alerting Rules
Create docker/prometheus-alerts.yml:
groups:
- name: isartor
rules:
- alert: HighErrorRate
expr: |
sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(isartor_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Isartor error rate > 5% for 5 minutes"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Isartor P95 latency > 2s for 5 minutes"
- alert: LowCacheHitRate
expr: >
rate(isartor_requests_total{final_layer=~"L1.*"}[15m]) /
rate(isartor_requests_total[15m]) < 0.3
for: 15m
labels:
severity: info
annotations:
summary: "Cache hit rate below 30% — consider tuning similarity threshold"
- alert: LowDeflectionRate
expr: |
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(rate(isartor_requests_total[1h]))
) < 0.5
for: 30m
labels:
severity: warning
annotations:
summary: "Isartor deflection rate below 50%"
- alert: FirewallDown
expr: up{job="isartor"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Isartor gateway is down"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No traces in Jaeger | Monitoring disabled | Set ISARTOR__ENABLE_MONITORING=true |
| No traces in Jaeger | Collector unreachable | Verify OTEL_EXPORTER_ENDPOINT + port 4317 |
| No metrics in Prometheus | Prometheus can't scrape collector | Check prometheus.yml targets |
| Grafana "No data" | Data source misconfigured | URL should be http://prometheus:9090 |
| Console shows "OTel disabled" | Config precedence | Check whether env vars are overriding the file config |
isartor_layer_duration_seconds empty | No requests yet | Send a test request |
See also: Configuration Reference · Performance Tuning · Troubleshooting
Performance Tuning
How to measure, tune, and operate Isartor for maximum deflection and minimum latency.
Table of Contents
- Understanding Deflection
- Measuring Deflection Rate
- Tuning Configuration for Deflection
- Tuning Latency
- Memory & Resource Tuning
- Cache Tuning Deep-Dive
- SLM Router Tuning
- Embedder Tuning
- SLO / SLA Goal Templates
- Scenario-Based Tuning Recipes
- PromQL Cheat Sheet
Understanding Deflection
Deflection = the percentage of requests resolved before Layer 3 (the external cloud LLM). A request is "deflected" if it is served by:
| Layer | Mechanism | Cost |
|---|---|---|
| L1a — Exact Cache | SHA-256 hash match | $0 |
| L1b — Semantic Cache | Cosine similarity match | $0 |
| L2 — SLM Triage | Local SLM classifies requests as TEMPLATE, SNIPPET, or COMPLEX (tiered mode) and answers TEMPLATE/SNIPPET locally | $0 |
The deflection rate directly maps to cost savings. A 70 % deflection rate means only 30 % of requests reach the paid cloud LLM.
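The rate can be computed from any per-layer request breakdown; a short Python sketch (the counts below are invented for illustration):

```python
def deflection_rate(by_layer: dict) -> float:
    """Deflection = share of requests resolved before L3_Cloud."""
    total = sum(by_layer.values())
    return 1 - by_layer.get("L3_Cloud", 0) / total

counts = {"L1a_ExactCache": 520, "L1b_SemanticCache": 190,
          "L2_SLM": 90, "L3_Cloud": 200}
print(round(deflection_rate(counts), 2))  # 0.8 — only 20% of traffic hit the cloud
```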
Measuring Deflection Rate
Via Prometheus / Grafana
The gateway emits isartor_requests_total with a final_layer label.
Use the following PromQL to compute the deflection rate:
# Overall deflection rate (last 1 hour)
1 - (
sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
)
# Deflection rate by layer (pie chart)
sum by (final_layer) (rate(isartor_requests_total[5m]))
# Exact-cache deflection only
sum(increase(isartor_requests_total{final_layer="L1a_ExactCache"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
Via the API
Send a test batch and count response layer values:
# Send 100 identical requests — expect 99 cache hits
for i in $(seq 1 100); do
curl -s -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: $ISARTOR_API_KEY" \
-d '{"prompt": "What is the capital of France?"}' \
| jq '.layer'
done | sort | uniq -c
Expected output (ideal):
1 3 ← first request → cloud
99 1 ← remaining → exact cache
Via Structured Logs
When ISARTOR__ENABLE_MONITORING=true, every request logs the final layer:
# grep JSON logs for final-layer distribution
cat logs.json | jq '.isartor.final_layer' | sort | uniq -c
Via Jaeger / Tempo
Filter traces by the isartor.final_layer tag:
| Goal | Search |
|---|---|
| All cache hits | Tag isartor.final_layer=L1a_ExactCache or L1b_SemanticCache |
| SLM resolutions | Tag isartor.final_layer=L2_SLM |
| Cloud fallbacks | Tag isartor.final_layer=L3_Cloud |
Tuning Configuration for Deflection
Cache Mode
| Variable | Values | Recommended |
|---|---|---|
ISARTOR__CACHE_MODE | exact, semantic, both | both (default) |
- exact — Only identical prompts hit. Good for deterministic agent loops.
- semantic — Catches paraphrases ("Price?" ≈ "Cost?"). Higher hit rate but adds ~1–5 ms embedding cost.
- both — Exact check first (< 1 ms), then semantic if no exact hit. Best of both worlds.
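A minimal Python sketch of the two-stage lookup in both mode. The lookup shape, toy embedder, and entry layout are hypothetical illustrations of the idea, not the Rust implementation:

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.85  # mirrors ISARTOR__SIMILARITY_THRESHOLD

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(prompt, exact_cache, semantic_cache, embed):
    # Stage 1 (L1a): sub-millisecond exact match on a hash of the prompt
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Stage 2 (L1b): nearest-neighbour search over stored embeddings
    qv = embed(prompt)
    best = max(semantic_cache, key=lambda e: cosine(qv, e["vec"]), default=None)
    if best and cosine(qv, best["vec"]) >= SIMILARITY_THRESHOLD:
        return best["response"]
    return None  # miss: the request continues down the stack

# toy two-dimensional embedder (stand-in for the real 384-dim model)
def embed(text):
    t = text.lower()
    return [1.0, 0.0] if "price" in t or "cost" in t else [0.0, 1.0]

sem = [{"vec": [1.0, 0.0], "response": "It costs $10."}]
print(lookup("Price?", {}, sem, embed))  # semantic hit despite no exact match
```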
Similarity Threshold
| Variable | Default | Range |
|---|---|---|
ISARTOR__SIMILARITY_THRESHOLD | 0.85 | 0.0–1.0 |
| Value | Effect |
|---|---|
0.95 | Very strict — only near-identical prompts match. Low false positives, lower hit rate. |
0.85 | Balanced — catches common paraphrases. Recommended starting point. |
0.75 | Aggressive — higher hit rate but risk of returning wrong cached answers. |
0.60 | Dangerous — high false-positive rate. Not recommended for production. |
How to tune:
- Set ISARTOR__ENABLE_MONITORING=true.
- Send representative traffic for 1 hour.
- In Jaeger, search for the cosine_similarity attribute on l1b_semantic_cache_search spans.
- Plot the distribution. If most similarity scores cluster between 0.80–0.90, a threshold of 0.85 is good.
- If you see many scores at 0.82–0.84 that should be hits, lower the threshold to 0.80.
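The distribution step can be approximated offline. A small Python sketch that scores candidate thresholds against sampled cosine_similarity values (the sample scores here are invented for illustration):

```python
def hit_rate_at(scores, threshold):
    # fraction of observed best-match scores that would count as a cache hit
    return sum(s >= threshold for s in scores) / len(scores)

# similarity scores sampled from l1b_semantic_cache_search spans (illustrative)
scores = [0.97, 0.91, 0.88, 0.86, 0.84, 0.83, 0.82, 0.79, 0.61, 0.44]
for t in (0.95, 0.85, 0.80):
    print(f"threshold {t:.2f} -> hit rate {hit_rate_at(scores, t):.0%}")
```

With this sample, dropping the threshold from 0.85 to 0.80 would roughly double the semantic hit rate, which is exactly the trade-off the tuning steps above are probing.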
Cache TTL
| Variable | Default | Description |
|---|---|---|
ISARTOR__CACHE_TTL_SECS | 300 (5 min) | Time-to-live for cached responses |
- Short TTL (60–120 s): Good for rapidly changing data, real-time dashboards.
- Medium TTL (300–600 s): Balanced for most workloads.
- Long TTL (1800+ s): Maximises deflection for static Q&A / documentation bots.
Cache Capacity
| Variable | Default | Description |
|---|---|---|
ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries in each cache (LRU eviction) |
- Monitor eviction rate via the cache.evicted span attribute on l1b_semantic_cache_insert.
- If eviction rate > 5 % of inserts, increase capacity or shorten TTL.
- Each cache entry ≈ 2–4 KB (prompt hash + response + optional 384-dim vector).
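The interaction of capacity (LRU eviction) and TTL can be modelled in a few lines of Python. This is a toy stand-in for intuition only, not the actual Rust cache:

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Toy model of the in-memory exact cache: LRU eviction + TTL expiry."""
    def __init__(self, max_capacity=10000, ttl_secs=300):
        self.max_capacity, self.ttl = max_capacity, ttl_secs
        self.entries = OrderedDict()  # key -> (inserted_at, value)

    def put(self, key, value):
        self.entries[key] = (time.monotonic(), value)
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_capacity:
            self.entries.popitem(last=False)  # evict least-recently used

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None or time.monotonic() - hit[0] > self.ttl:
            return None  # miss, or expired by TTL
        self.entries.move_to_end(key)  # refresh recency on hit
        return hit[1]

cache = TtlLruCache(max_capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)  # "a" is evicted
print(cache.get("a"), cache.get("c"))  # None 3
```

At capacity, every insert evicts the least-recently used entry, which is why a sustained eviction rate above a few percent signals that max_capacity is too small for the working set.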
Tuning Latency
Target Latencies by Layer
| Layer | Target (p95) | Typical Range |
|---|---|---|
| L1a — Exact Cache | < 1 ms | 0.1–0.5 ms |
| L1b — Semantic Cache | < 10 ms | 1–5 ms |
| L2 — SLM Triage | < 300 ms | 50–200 ms (embedded), 100–500 ms (sidecar) |
| L3 — Cloud LLM | < 3 s | 500 ms – 5 s (network-bound) |
Measure with PromQL
# P95 latency by layer
histogram_quantile(0.95,
sum by (le, layer_name) (
rate(isartor_layer_duration_seconds_bucket[5m])
)
)
# P95 end-to-end latency
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
Reducing Latency
| Bottleneck | Symptom | Fix |
|---|---|---|
| Embedding | L1b > 10 ms | Use a lighter model or increase CPU allocation |
| SLM inference | L2 > 500 ms | Use quantised model (Q4_K_M GGUF), switch to embedded engine |
| Redis | L1a > 5 ms | Check network latency, use Redis cluster with read replicas |
| Cloud LLM | L3 > 5 s | Switch provider, use a smaller model, enable request timeout |
Memory & Resource Tuning
Memory Budget
| Component | Memory Usage | Notes |
|---|---|---|
| Exact cache (in-memory, 10K entries) | ~20–40 MB | Scales linearly with cache_max_capacity |
| Semantic cache (in-memory, 10K entries) | ~30–60 MB | 384-dim float32 vectors + response strings |
| candle embedder (all-MiniLM-L6-v2) | ~90 MB | Loaded at startup, constant |
| Candle GGUF model (embedded SLM) | ~1–4 GB | Depends on model quantisation |
| Tokio runtime | ~10–20 MB | Async task pool |
| Total (minimalist mode) | ~150–200 MB | No embedded SLM |
| Total (embedded mode) | ~1.5–4.5 GB | With embedded Candle SLM |
CPU Considerations
- Embedding generation runs on spawn_blocking (dedicated thread pool).
- Candle GGUF inference is CPU-bound; allocate ≥ 4 cores for embedded mode.
- The Tokio async runtime uses the default thread count (num_cpus).
Container Limits
# docker-compose example
services:
gateway:
deploy:
resources:
limits:
memory: 512M # minimalist mode
cpus: "2"
# For embedded SLM mode:
# limits:
# memory: 4G
# cpus: "4"
Cache Tuning Deep-Dive
Exact vs. Semantic Cache Hit Analysis
# Exact cache hit rate
sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
# Semantic cache hit rate
sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
Cache Backend: Memory vs. Redis
| Factor | In-Memory | Redis |
|---|---|---|
| Latency | ~0.1 ms | ~1–5 ms (network hop) |
| Capacity | Limited by process RAM | Limited by Redis memory |
| Multi-replica | ❌ No sharing | ✅ Shared across pods |
| Persistence | ❌ Lost on restart | ✅ Optional AOF/RDB |
| Recommended for | Single-instance, dev, edge | K8s, multi-replica, production |
Switch with:
export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__REDIS_URL=redis://redis.svc:6379
When to Disable Semantic Cache
- Traffic is 100 % deterministic (exact same prompts repeated).
- Embedding overhead is unacceptable (< 1 ms budget).
- To disable, set ISARTOR__CACHE_MODE=exact.
SLM Router Tuning
Embedded vs. Sidecar
| Mode | Variable | Latency | Resource Usage |
|---|---|---|---|
| Embedded (Candle) | ISARTOR__INFERENCE_ENGINE=embedded | 50–200 ms | High CPU, 1–4 GB RAM |
| Sidecar (llama.cpp) | ISARTOR__INFERENCE_ENGINE=sidecar | 100–500 ms | Separate process, GPU optional |
| Remote (vLLM/TGI) | ISARTOR__ROUTER_BACKEND=vllm | 100–500 ms | Separate server, GPU recommended |
Model Selection
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| Phi-3-mini (Q4_K_M) | ~2 GB | Fast | Good |
| Gemma-2-2B-IT (Q4) | ~1.5 GB | Very fast | Good |
| Qwen-1.5-1.8B (Q4) | ~1.2 GB | Fastest | Adequate |
| Llama-3-8B (Q4) | ~4.5 GB | Slower | Best |
For intent classification (TEMPLATE/SNIPPET/COMPLEX in tiered mode, or SIMPLE/COMPLEX in legacy binary mode), smaller models (1–3 B params) are sufficient. Use the smallest model that meets your accuracy needs.
Tuning the Classification Prompt
The system prompt in src/middleware/slm_triage.rs determines classification
accuracy. If too many COMPLEX requests are misclassified as TEMPLATE or
SNIPPET (resulting in bad local answers), consider:
- Making the system prompt more specific to your domain.
- Adding examples to the prompt (few-shot).
- Switching to a larger model.
- Setting ISARTOR__LAYER2__MAX_ANSWER_TOKENS to allow longer SLM responses (default 2048).
- Falling back to binary mode via ISARTOR__LAYER2__CLASSIFIER_MODE=binary if the three-tier split does not suit your workload.
Embedder Tuning
In-Process (candle)
The default embedder uses candle with sentence-transformers/all-MiniLM-L6-v2 (pure-Rust BertModel):
- 384-dimensional vectors
- ~90 MB model footprint
- 1–5 ms per embedding (CPU)
- Runs on spawn_blocking to avoid starving the Tokio runtime
Sidecar Embedder
For higher throughput or GPU acceleration:
export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
export ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME=all-minilm
export ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS=10
Embedding Model Selection
| Model | Dims | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fastest | Good |
| bge-small-en-v1.5 | 384 | Fast | Better |
| bge-base-en-v1.5 | 768 | Moderate | Best |
Use 384-dim models for production. 768-dim models double memory usage for marginal quality improvement in most use cases.
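The memory claim is simple arithmetic over raw float32 vectors (response strings and bookkeeping add more on top):

```python
def vector_store_mb(entries, dims, bytes_per_float=4):
    # raw embedding storage only; cached responses and keys add overhead
    return entries * dims * bytes_per_float / 1_000_000

print(vector_store_mb(10_000, 384))  # 15.36 MB of raw vectors
print(vector_store_mb(10_000, 768))  # 30.72 MB — the footprint doubles
```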
SLO / SLA Goal Templates
Developer / Internal SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.5 % | up{job="isartor"} over 30-day window |
| P95 latency (cache hit) | < 10 ms | histogram_quantile(0.95, ...) on L1 |
| P95 latency (end-to-end) | < 3 s | histogram_quantile(0.95, ...) on all |
| Deflection rate | > 50 % | 1 - (L3 / total) over 24 h |
| Error rate | < 1 % | rate(isartor_requests_total{status_code=~"5.."}[5m]) |
Production / Enterprise SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9 % | Multi-replica, health check monitoring |
| P95 latency (cache hit) | < 5 ms | Requires Redis or fast in-memory |
| P95 latency (end-to-end) | < 2 s | Optimised models, provider SLAs |
| P99 latency (end-to-end) | < 5 s | Tail latency budget |
| Deflection rate | > 70 % | Tuned thresholds + warm cache |
| Error rate | < 0.1 % | Circuit breakers, retries |
| Token savings | > 60 % | isartor_tokens_saved_total vs estimated total |
SLA Template (for downstream consumers)
## Isartor Prompt Firewall SLA
**Availability:** 99.9 % monthly uptime (< 43.8 min downtime/month)
**Latency:** P95 end-to-end < 2 seconds
**Error Budget:** 0.1 % of requests may return 5xx
**Maintenance Window:** Sundays 02:00–04:00 UTC (excluded from SLA)
### Remediation
- Cache tier failure: automatic fallback to cloud LLM (degraded mode)
- SLM failure: automatic fallback to cloud LLM (degraded mode)
- Cloud LLM failure: 502 Bad Gateway returned, retry recommended
### Monitoring
- Health endpoint: GET /healthz
- Metrics endpoint: Prometheus scrape via OTel Collector on port 8889
- Dashboard: Grafana at http://<grafana-host>:3000
Alert Rules (Prometheus)
groups:
- name: isartor-slo
rules:
- alert: HighErrorRate
expr: |
sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(isartor_requests_total[5m]))
> 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "Isartor error rate exceeds 1%"
- alert: HighP95Latency
expr: |
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
> 3
for: 5m
labels:
severity: warning
annotations:
summary: "Isartor P95 latency exceeds 3 seconds"
- alert: LowDeflectionRate
expr: |
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(rate(isartor_requests_total[1h]))
) < 0.5
for: 30m
labels:
severity: warning
annotations:
summary: "Isartor deflection rate below 50%"
- alert: FirewallDown
expr: up{job="isartor"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Isartor gateway is down"
Scenario-Based Tuning Recipes
Scenario A: Agentic Loop (High-Volume Identical Prompts)
Profile: Autonomous agent sends the same prompt hundreds of times per minute.
ISARTOR__CACHE_MODE=exact # Semantic unnecessary for identical prompts
ISARTOR__CACHE_TTL_SECS=3600 # Long TTL — agent prompts are stable
ISARTOR__CACHE_MAX_CAPACITY=50000 # Large cache for many unique prompts
Expected deflection: 95–99 % (after warm-up).
Scenario B: Customer Support Bot (Paraphrased Questions)
Profile: End users ask the same questions in different ways.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.80 # Lower threshold to catch paraphrases
ISARTOR__CACHE_TTL_SECS=1800 # 30 min — support answers change slowly
ISARTOR__CACHE_MAX_CAPACITY=10000
Expected deflection: 60–80 %.
Scenario C: Code Generation (Low Cache Hit Rate)
Profile: Developers ask unique, complex coding questions.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.92 # High threshold — wrong cached code is costly
ISARTOR__CACHE_TTL_SECS=600 # Short TTL — code context changes quickly
ISARTOR__INFERENCE_ENGINE=embedded # Let SLM handle simple code questions
Expected deflection: 20–40 % (SLM handles simple extraction).
Scenario D: RAG Pipeline (Document Q&A)
Profile: Queries against a knowledge base; similar questions are common.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.83 # Moderate threshold
ISARTOR__CACHE_TTL_SECS=3600 # Documents change infrequently
ISARTOR__CACHE_MAX_CAPACITY=20000 # Large cache for document variation
Expected deflection: 50–70 %.
Scenario E: Multi-Replica Kubernetes
Profile: Horizontally scaled behind a load balancer.
ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.85
Benefit: All replicas share the same cache → deflection rate applies cluster-wide.
PromQL Cheat Sheet
| What | Query |
|---|---|
| Deflection rate (1 h) | 1 - (sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h])) / sum(increase(isartor_requests_total[1h]))) |
| Request rate | rate(isartor_requests_total[5m]) |
| Request rate by layer | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| P50 latency | histogram_quantile(0.50, rate(isartor_request_duration_seconds_bucket[5m])) |
| P95 latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| P99 latency | histogram_quantile(0.99, rate(isartor_request_duration_seconds_bucket[5m])) |
| Per-layer P95 | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens saved (daily) | sum(increase(isartor_tokens_saved_total[24h])) |
| Tokens saved by layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Est. daily cost savings ($0.01/1K tok) | sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01 |
| Error rate | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (exact) | sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (semantic) | sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
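The deflection-rate query in the first row reduces to simple counter arithmetic. A quick sanity check with sample numbers (the counts below are illustrative, not measurements):

```shell
# Deflection rate = 1 - (requests that reached L3_Cloud / all requests).
total=1000   # stands in for sum(increase(isartor_requests_total[1h]))
cloud=280    # stands in for sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
deflection=$(awk -v t="$total" -v c="$cloud" 'BEGIN { printf "%.2f", 1 - c / t }')
echo "deflection rate: $deflection"   # → deflection rate: 0.72
```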
See also: Metrics & Tracing · Configuration Reference · Troubleshooting
Testing
Complete test runbook for Isartor — from automated test suites to manual feature verification and Copilot CLI integration testing.
Prerequisites
| Requirement | Check |
|---|---|
| Rust toolchain | cargo --version |
| Built binary | cargo build --release |
| curl + jq | curl --version && jq --version |
Quick Start — Automated
Unit & Integration Tests
# Run the full test suite
cargo test --all-features
# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite
# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture
cargo test --test integration_suite body_survives_all_middleware -- --nocapture
Smoke Test Script
Run the entire manual test suite in one command:
# Start a fresh server, run all tests, stop after
./scripts/smoke-test.sh --stop-after
# Test an already-running server
./scripts/smoke-test.sh --no-start
# Full run including demo + verbose response bodies
./scripts/smoke-test.sh --run-demo --verbose
# Custom URL / API key
./scripts/smoke-test.sh --url http://localhost:9090 --api-key mykey --no-start
Lint & Format Checks
Run the same checks CI runs:
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
Compression Pipeline Tests
Run the L2.5 compression module tests specifically:
# All compression tests (pipeline, stages, cache, optimize)
cargo test --all-features compression
# Specific modules
cargo test --all-features content_classifier
cargo test --all-features dedup_cache
cargo test --all-features log_crunch
cargo test --all-features optimize_request_body
Manual Step-by-Step
Note: Isartor runs without gateway auth by default (local-first). The test commands below explicitly set ISARTOR__GATEWAY_API_KEY to exercise authenticated request handling.
1 Start the Server
# Gateway-only startup (local API testing)
ISARTOR__FIRST_RUN_COMPLETE=1 \
./target/release/isartor up
# Full startup for proxy-aware testing (recommended for this guide)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
./target/release/isartor up copilot
# With an OpenAI key (enables real L3 fallback)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
ISARTOR__EXTERNAL_LLM_API_KEY=sk-... \
./target/release/isartor up copilot
Server is ready when you see:
INFO isartor: API gateway listening, addr: 0.0.0.0:8080
INFO isartor: CONNECT proxy starting, addr: 0.0.0.0:8081
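If you script this runbook, you can wait for readiness by polling /healthz instead of watching the logs. A small helper sketch (the default URL and retry count are assumptions; adjust for your setup):

```shell
# Poll the liveness endpoint until the gateway answers, or give up.
wait_for_ready() {
  local url="${1:-http://localhost:8080/healthz}"
  local tries="${2:-30}" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}
# Usage: wait_for_ready http://localhost:8080/healthz 30 && run_tests
```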
2 Health & Liveness
# Liveness probe (no auth needed)
curl http://localhost:8080/healthz
# Rich health (shows layer status, proxy, prompt totals)
curl http://localhost:8080/health | jq .
Expected /health response shape:
{
"status": "ok",
"version": "0.1.25",
"layers": { "l1a": "active", "l1b": "active", "l2": "active", "l3": "no_api_key" },
"uptime_seconds": 5,
"proxy": "active",
"proxy_layer3": "native_upstream_passthrough",
"prompt_total_requests": 0,
"prompt_total_deflected_requests": 0
}
3 OpenAI-Compatible Endpoint (/v1/chat/completions)
API_KEY=changeme
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}' | jq .
Send the same prompt twice to confirm L1a exact-cache kicks in:
for i in 1 2; do
echo "--- Request $i ---"
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is 2+2?"}]}' \
| jq '.choices[0].message.content, .model'
done
4 Anthropic-Compatible Endpoint (/v1/messages)
curl -sS http://localhost:8080/v1/messages \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-3-haiku-20240307",
"max_tokens": 64,
"messages": [{"role": "user", "content": "What is 2+2?"}]
}' | jq .
Expected shape: {"id":..., "type":"message", "role":"assistant", "content":[...], "model":...}
5 Native Endpoint (/api/chat)
curl -sS http://localhost:8080/api/chat \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "ping"}]}' | jq .
6 L1a — Exact Cache Hit
# Seed the cache with first request
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
-o /dev/null
# Second identical request — should be served from L1a
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
| jq '.model'
# → "isartor-cache" or similar (not "gpt-4o-mini")
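To script this check, parse the model field out of the response. A grep/sed-based sketch that avoids a jq dependency (the isartor-cache sentinel follows the example above; treat the exact value as an assumption about your build):

```shell
# Classify a response as cache-served or upstream-served by its "model" field.
resp='{"model":"isartor-cache","choices":[]}'   # e.g. resp=$(curl -sS ... )
model=$(printf '%s' "$resp" | sed -n 's/.*"model":"\([^"]*\)".*/\1/p')
case "$model" in
  isartor-*) echo "served from cache ($model)" ;;   # → served from cache (isartor-cache)
  *)         echo "served upstream ($model)" ;;
esac
```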
7 L1b — Semantic Cache Hit
# Seed
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is the capital of France?"}]}' \
-o /dev/null
# Paraphrase — should hit L1b (cosine similarity ≥ 0.85)
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Which city is France capital?"}]}' \
| jq '.model'
8 Authentication Rejection
# No API key — should return 401/403
curl -sS -w "\nHTTP %{http_code}" http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hello"}]}'
9 Prompt Stats
# JSON endpoint
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/stats/prompts?limit=10" | jq .
# Per-agent observability endpoint
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/stats/agents" | jq .
# CLI command
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY
# CLI per-agent view
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY \
--by-tool
Expected isartor stats output:
Isartor Prompt Stats
URL: http://localhost:8080
Total: 7
Deflected: 3
By Layer
L1A 3
L3 4
By Surface
gateway 7
By Client
openai 5
anthropic 2
Recent Prompts
2026-03-19T09:00:00Z gateway openai L1A via /v1/chat/completions (1ms, HTTP 200)
10 Proxy Recent Decisions
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/proxy/recent?limit=5" | jq .
11 isartor connect status
./target/release/isartor connect status \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY
12 Run the Built-in Demo
./target/release/isartor demo
# Replays 50 bundled prompts through L1a/L1b, prints deflection rate.
# Writes isartor_demo_result.txt
13 Stop the Server
./target/release/isartor stop
Copilot CLI Integration Test
Step 1 — Connect Copilot CLI
./target/release/isartor connect copilot \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
This writes ~/.isartor/env/copilot.sh with:
export HTTPS_PROXY="http://localhost:8081"
export NODE_EXTRA_CA_CERTS="/Users/<you>/.isartor/ca/isartor-ca.pem"
export ISARTOR_COPILOT_ENABLED=true
Step 2 — Activate the Proxy Environment
Critical: You must source the env file in the same shell where you run Copilot CLI:
source ~/.isartor/env/copilot.sh
# Verify the env is active
echo $HTTPS_PROXY # → http://localhost:8081
echo $NODE_EXTRA_CA_CERTS # → /Users/<you>/.isartor/ca/isartor-ca.pem
Step 3 — Use Copilot CLI (same shell)
# Ask Copilot a question — traffic will route through Isartor proxy
gh copilot suggest "list all files in a directory"
# Or explain
gh copilot explain "what does git rebase do"
Step 4 — Verify Traffic Hit Isartor
# Check proxy recent decisions
./target/release/isartor connect status \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
# Check prompt stats
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
You should see proxy_recent_requests > 0 and Copilot entries in By Client.
Step 5 — Ask Repeated Questions (cache test)
# Ask the same thing twice — second hit should be L1a
gh copilot suggest "list all files in a directory"
gh copilot suggest "list all files in a directory"
# Check stats — deflected count should have increased
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
Disconnect
./target/release/isartor connect copilot --disconnect
# then unset in your shell:
unset HTTPS_PROXY NODE_EXTRA_CA_CERTS ISARTOR_COPILOT_ENABLED
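After disconnecting, it is worth verifying the shell is actually clean. A convenience sketch over the variables named above (note it also flags gateway config variables like ISARTOR__..., which you may want to keep):

```shell
# List any lingering Isartor proxy/CA variables in the current shell.
leftover=$(env | grep -iE '^(https?_proxy|all_proxy|node_extra_ca_certs|isartor_)' || true)
if [ -z "$leftover" ]; then
  echo "shell is clean"
else
  printf 'still set:\n%s\n' "$leftover"
fi
```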
Feature Coverage Matrix
| Feature | Test | Section |
|---|---|---|
| Health endpoint | curl /health | §2 |
| Liveness probe | curl /healthz | §2 |
| OpenAI /v1/chat/completions | curl + jq | §3 |
| Anthropic /v1/messages | curl + jq | §4 |
| Native /api/chat | curl + jq | §5 |
| L1a exact-cache deflection | repeated prompt | §6 |
| L1b semantic-cache deflection | paraphrased prompt | §7 |
| Auth rejection | no X-API-Key | §8 |
| Prompt stats endpoint | /debug/stats/prompts | §9 |
| isartor stats CLI | isartor stats | §9 |
| Proxy decisions endpoint | /debug/proxy/recent | §10 |
| Connect status CLI | isartor connect status | §11 |
| Built-in demo | isartor demo | §12 |
| Copilot CLI proxy routing | source env + gh copilot | Copilot CLI |
| Cache hit via Copilot | repeated gh copilot | Copilot CLI §5 |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused :8080 | Server not started | Run ./target/release/isartor up |
| isartor update fails after stop | Stale HTTPS_PROXY in shell | unset HTTPS_PROXY HTTP_PROXY |
| Copilot traffic not showing in stats | Wrong shell / env not sourced | source ~/.isartor/env/copilot.sh then restart Copilot CLI |
| L1b miss on paraphrase | Semantic index cold | Send several prompts first to warm the index |
| l3: no_api_key in health | No LLM key set | Set ISARTOR__EXTERNAL_LLM_API_KEY or use cache/demo mode |
See also: Troubleshooting · Contributing
Contributing
Thanks for your interest in contributing to Isartor! Isartor is maintained by one developer as a side project. Here's how to make your contribution land quickly.
Before You Open a PR
- Check existing issues — your idea may already be tracked.
- Open an issue first for any non-trivial change.
- One PR per issue — keep scope tight.
Looking for something to work on? Check out the good first issues label on GitHub.
Development Setup
Prerequisites
- Rust 1.75+ — install via rustup
- Docker — required for integration tests and the observability stack
- curl + jq — for manual testing
Clone and Build
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build
Run the Test Suite
# Full test suite
cargo test --all-features
# Or use Make
make test
# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite
# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture
Lint & Format
# Format check (same as CI)
cargo fmt --all -- --check
# Apply formatting
cargo fmt --all
# Clippy lint check (same as CI)
cargo clippy --all-targets --all-features -- -D warnings
Release Build
cargo build --release
# or
make build
Benchmarks
# Criterion micro-benchmarks
cargo bench --bench cache_latency
cargo bench --bench e2e_pipeline
# Full benchmark harness (requires running Isartor instance)
make benchmark
# Dry-run smoke test (no server needed)
make benchmark-dry-run
PR Checklist
- cargo test --all-features passes
- cargo clippy --all-targets --all-features -- -D warnings has no new warnings
- cargo fmt --all -- --check passes
- PR description explains WHY, not just WHAT
- Documentation updated if behaviour changes
What Gets Merged Quickly
- Bug fixes with a test that reproduces the bug
- Documentation improvements
- Performance improvements with benchmark evidence
What Takes Longer
- New features — needs design discussion in an issue first
- Changes to the deflection layer logic — core path changes require careful review
Code Conventions
- Tests are grouped into integration-test binaries (unit_suite, integration_suite, scenario_suite) that re-export submodules. When adding a test, place it in the appropriate binary rather than creating a standalone file.
- Configuration uses ISARTOR__... environment variables with double underscores as separators.
- The Axum middleware stack wraps inside-out. See src/main.rs for the documented layer order.
- Use spawn_blocking for CPU-intensive work (embeddings, model inference) to avoid starving the Tokio runtime.
- The src/compression/ module uses a Fusion Pipeline pattern: stateless CompressionStage trait objects executed in order. To add a new compression stage, implement the CompressionStage trait and wire it in src/compression/optimize.rs::build_pipeline().
Response Time
Issues and PRs are reviewed within 24–48 hours on weekdays. Weekend responses are not guaranteed.
See also: Testing · Architecture · Troubleshooting
Troubleshooting
Common issues, diagnostic steps, and FAQ for operating Isartor.
Table of Contents
- Startup Errors
- Cache Issues
- Embedding & SLM Issues
- Cloud LLM Issues
- Observability Issues
- Performance & Degraded Operation
- Docker & Deployment Issues
- FAQ
Startup Errors
Failed to initialize candle TextEmbedder
Symptom: Gateway panics on startup with:
Failed to initialize candle TextEmbedder (all-MiniLM-L6-v2)
Causes & Fixes:
| Cause | Fix |
|---|---|
| Model files not downloaded | Run once with internet access; candle auto-downloads to ~/.cache/huggingface/ |
| Corrupted model cache | Delete ~/.cache/huggingface/ and restart |
| Cache directory not writable (Permission denied (os error 13)) | Set HF_HOME (or ISARTOR_HF_CACHE_DIR) to a writable path (e.g. /tmp/huggingface). In Docker, mount a volume there: -e HF_HOME=/tmp/huggingface -v isartor-hf:/tmp/huggingface. |
| Insufficient memory | Ensure ≥ 256 MB available for the embedding model |
Address already in use
Symptom:
Error: error creating server listener: Address already in use (os error 48)
Fix:
# Find the process using port 8080
lsof -i :8080
# Kill it, or change the port:
export ISARTOR__HOST_PORT=0.0.0.0:9090
missing field or config deserialization errors
Symptom:
Error: missing field `layer2` in config
Fix: Ensure all required environment variables have the correct prefix
and separator. Isartor uses double-underscore (__) as separator:
# Correct:
export ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081
# Wrong:
export ISARTOR_LAYER2_SIDECAR_URL=http://127.0.0.1:8081
See the Configuration Reference for the full list of variables.
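Since single-underscore names are silently ignored, a quick scan for mis-prefixed variables can save debugging time. A sketch (note that intentional client-side flags such as ISARTOR_COPILOT_ENABLED will also match):

```shell
# Find ISARTOR_ variables that lack the double-underscore separator
# and therefore will not be picked up as gateway config.
env | grep '^ISARTOR_' | grep -v '^ISARTOR__' || echo "no mis-prefixed variables"
```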
Gateway auth / 401 Unauthorized
Symptom: All requests return 401 Unauthorized.
By default, gateway_api_key is empty and auth is disabled — you should not see 401 errors unless you (or your deployment) explicitly set ISARTOR__GATEWAY_API_KEY.
If you enabled auth by setting a key, every request must include it:
export ISARTOR__GATEWAY_API_KEY=your-secret-key
Common causes of unexpected 401s:
- The key in your request header doesn't match ISARTOR__GATEWAY_API_KEY.
- You forgot to include X-API-Key or Authorization: Bearer in the request.
Cache Issues
Low Cache Hit Rate
Symptom: Deflection rate below expected levels despite repeated traffic.
Diagnostic steps:
1. Check cache mode:
   echo $ISARTOR__CACHE_MODE   # should be "both" for most workloads
2. Check similarity threshold:
   echo $ISARTOR__SIMILARITY_THRESHOLD   # default: 0.85
   If too high (> 0.92), similar prompts won't match. Try lowering to 0.80.
3. Check TTL:
   echo $ISARTOR__CACHE_TTL_SECS   # default: 300
   Short TTL evicts entries before they can be reused.
4. Check Jaeger for cosine_similarity values on semantic cache spans. If scores sit just below the threshold, lower it.
Stale Cache Responses
Symptom: Users receive outdated answers from cache.
Fix: Reduce TTL or restart the gateway to clear in-memory caches:
export ISARTOR__CACHE_TTL_SECS=60 # 1 minute
For Redis-backed caches, you can flush explicitly:
redis-cli -u $ISARTOR__REDIS_URL FLUSHDB
Redis Connection Refused
Symptom:
Layer 1a: Redis connection error — falling through
Diagnostic steps:
1. Verify Redis is running:
   redis-cli -u $ISARTOR__REDIS_URL ping   # Expected: PONG
2. Check network connectivity (especially in Docker/K8s):
   # Inside the gateway container:
   curl -v telnet://redis:6379
3. Verify the URL format:
   # Correct formats:
   export ISARTOR__REDIS_URL=redis://127.0.0.1:6379
   export ISARTOR__REDIS_URL=redis://user:password@redis.svc:6379/0
4. Check the Redis memory limit: an OOM Redis rejects writes.
Fallback behaviour: When Redis is unreachable, Isartor falls through to the next layer. No data is lost, but deflection rate drops.
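Step 3's URL formats can also be checked mechanically before deploying. A minimal sketch, not a full URI parser (rediss:// is the TLS variant of the scheme):

```shell
# Accept only the redis:// (or TLS rediss://) scheme, as in the examples above.
check_redis_url() {
  case "$1" in
    redis://*|rediss://*) echo ok ;;
    *)                    echo "invalid: expected redis:// or rediss:// scheme" ;;
  esac
}
check_redis_url "redis://user:password@redis.svc:6379/0"   # → ok
check_redis_url "redis.svc:6379"                           # → invalid: expected redis:// or rediss:// scheme
```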
Cache Memory Growing Unbounded
Symptom: Gateway memory usage increases over time.
Fix: The in-memory cache uses bounded LRU eviction. Check:
echo $ISARTOR__CACHE_MAX_CAPACITY # default: 10000
If set too high, reduce it. Each entry ≈ 2–4 KB, so 10K entries ≈ 20–40 MB.
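The 2–4 KB/entry figure gives a quick worst-case estimate for any capacity setting (an approximation, not a measured bound):

```shell
# Back-of-envelope memory estimate for the in-memory cache.
capacity=${ISARTOR__CACHE_MAX_CAPACITY:-10000}
echo "low estimate:  $(( capacity * 2 / 1024 )) MB"   # 10000 entries → 19 MB
echo "high estimate: $(( capacity * 4 / 1024 )) MB"   # 10000 entries → 39 MB
```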
Embedding & SLM Issues
Slow Embedding Generation
Symptom: L1b latency > 10 ms.
Causes & Fixes:
| Cause | Fix |
|---|---|
| CPU-bound contention | Increase CPU allocation for the container |
| Large prompt text | Embedder truncates to model max length (512 tokens), but longer text = more CPU |
| Cold start | First embedding call warms up the candle BertModel (~2 s). Subsequent calls are fast. |
SLM Sidecar Unreachable
Symptom:
Layer 2: Failed to connect to SLM sidecar — falling through
Diagnostic steps:
1. Check if the sidecar is running:
   curl http://127.0.0.1:8081/v1/models
2. Verify configuration:
   echo $ISARTOR__LAYER2__SIDECAR_URL   # default: http://127.0.0.1:8081
3. Check the sidecar logs for errors (model loading, OOM, etc.).
4. Increase the timeout if the sidecar is slow:
   export ISARTOR__LAYER2__TIMEOUT_SECONDS=60
Fallback behaviour: When the SLM sidecar is unreachable, Isartor treats all requests as COMPLEX and forwards to Layer 3.
SLM Misclassification (Tiered: TEMPLATE / SNIPPET / COMPLEX)
The default classifier mode is tiered, which sorts requests into three categories instead of the legacy binary SIMPLE/COMPLEX split:
| Tier | Description |
|---|---|
| TEMPLATE | Config files, type definitions, documentation, boilerplate |
| SNIPPET | Short single-function code, simple middleware (<50 lines) |
| COMPLEX | Multi-file implementations, test suites, full endpoints |
TEMPLATE and SNIPPET requests are answered locally by the SLM; COMPLEX
requests are forwarded to Layer 3. The legacy binary mode (SIMPLE/COMPLEX)
is still available via ISARTOR__LAYER2__CLASSIFIER_MODE=binary.
An answer quality guard also rejects SLM answers that are too short (<10 chars) or start with uncertainty phrases, escalating them to Layer 3.
Symptom: Users receive low-quality answers for complex questions (misclassified as TEMPLATE/SNIPPET) or unnecessarily hit the cloud for simple ones.
Diagnostic steps:
1. In Jaeger, search for the router.decision attribute to see the classification distribution across TEMPLATE, SNIPPET, and COMPLEX.
2. Send known-simple and known-complex prompts and check the classification:
   curl -s -X POST http://localhost:8080/api/chat \
     -H "Content-Type: application/json" \
     -H "X-API-Key: $KEY" \
     -d '{"prompt": "Generate a tsconfig.json"}' | jq '.layer'
   # Expected: layer 2 (TEMPLATE)
3. Consider switching to a larger SLM model for better classification accuracy.
4. To fall back to the legacy binary classifier, set ISARTOR__LAYER2__CLASSIFIER_MODE=binary.
Embedded Candle Engine Errors
Symptom:
Layer 2: Embedded classification failed – falling through
Causes & Fixes:
| Cause | Fix |
|---|---|
| Model file missing | Set ISARTOR__EMBEDDED__MODEL_PATH to a valid GGUF file |
| Insufficient memory | Candle GGUF models need 1–4 GB RAM |
| Feature not compiled | Build with --features embedded-inference |
Cloud LLM Issues
502 Bad Gateway from Layer 3
Symptom: Requests that reach Layer 3 return 502.
Diagnostic steps:
1. Check provider connectivity:
   curl -s $ISARTOR__EXTERNAL_LLM_URL \
     -H "Authorization: Bearer $ISARTOR__EXTERNAL_LLM_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}'
2. Verify the API key is valid and has quota.
3. For Azure OpenAI, check the deployment ID and API version:
   echo $ISARTOR__AZURE_DEPLOYMENT_ID
   echo $ISARTOR__AZURE_API_VERSION
Rate Limiting from Cloud Provider
Symptom: Intermittent 429 errors from the cloud LLM.
Fix:
- Increase deflection rate (lower threshold, longer TTL) to reduce cloud traffic.
- Request higher rate limits from your provider.
- Implement client-side retry with exponential backoff (application level).
Wrong Provider Configured
Symptom: Authentication errors or unexpected response formats.
Fix: Verify the provider matches the URL and API key:
# OpenAI
export ISARTOR__LLM_PROVIDER=openai
# Azure
export ISARTOR__LLM_PROVIDER=azure
# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic
# xAI
export ISARTOR__LLM_PROVIDER=xai
# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini
# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama
See the Configuration Reference for the full list of supported providers.
Observability Issues
No Traces in Jaeger
| Cause | Fix |
|---|---|
| Monitoring disabled | export ISARTOR__ENABLE_MONITORING=true |
| Wrong endpoint | export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317 |
| Collector not running | docker compose -f docker-compose.observability.yml up otel-collector |
| Firewall blocking gRPC | Ensure port 4317 is open between gateway and collector |
No Metrics in Prometheus
| Cause | Fix |
|---|---|
| Prometheus not scraping collector | Check prometheus.yml targets include otel-collector:8889 |
| Collector metrics pipeline broken | Verify otel-collector-config.yaml exports to Prometheus |
| No requests sent yet | Send a test request — metrics appear after first request |
Grafana Shows "No Data"
| Cause | Fix |
|---|---|
| Data source not configured | Add Prometheus source: URL http://prometheus:9090 |
| Wrong time range | Expand the time range in Grafana to cover the test period |
| Dashboard not provisioned | Check docker/grafana/provisioning/ paths are mounted |
Console Shows "OTel disabled" Despite Setting env var
Cause: Config file takes precedence, or the env var prefix is wrong.
Fix:
# Correct (double underscore):
export ISARTOR__ENABLE_MONITORING=true
# Wrong (single underscore):
export ISARTOR_ENABLE_MONITORING=true # ❌ not picked up
Performance & Degraded Operation
High Tail Latency (P99 > 10 s)
Diagnostic steps:
1. Check which layer is the bottleneck:
   histogram_quantile(0.99,
     sum by (le, layer_name) (
       rate(isartor_layer_duration_seconds_bucket[5m])
     )
   )
2. Common causes:
   - L3 Cloud: provider is slow → switch to a faster model or provider.
   - L2 SLM: model inference is slow → use a smaller quantised model.
   - L1b Semantic: embedding is slow → check CPU contention.
Gateway OOM (Out of Memory)
Diagnostic steps:
1. Check cache capacity:
   echo $ISARTOR__CACHE_MAX_CAPACITY
2. Reduce capacity or switch to the Redis backend.
3. If using the embedded SLM, check model size against the container memory limit.
Requests Queuing / High Connection Count
Symptom: Clients see connection timeouts or slow responses even for cache hits.
Causes & Fixes:
| Cause | Fix |
|---|---|
| Too many concurrent requests | Scale horizontally (add replicas) |
| spawn_blocking pool exhaustion | Increase Tokio blocking threads: TOKIO_WORKER_THREADS=8 |
| SLM inference blocking async runtime | Ensure SLM runs on blocking pool (default in Isartor) |
Degraded Mode (SLM Down, Cache Only)
When the SLM sidecar is unreachable, Isartor automatically degrades:
- L1a/L1b cache still works → cached requests are served.
- L2 SLM → all requests treated as COMPLEX (regardless of classifier mode) → forwarded to L3.
- Impact: Higher cloud costs, but no downtime.
Monitor with:
# If SLM layer stops resolving requests, something is wrong
sum(rate(isartor_requests_total{final_layer="L2_SLM"}[5m])) == 0
Docker & Deployment Issues
Docker Build Fails
Symptom: cargo build fails inside Docker.
Common fixes:
- Ensure Dockerfile uses the correct Rust toolchain version.
- For
aws-lc-rs(TLS): installcmake,gcc,makein build stage. - Check that
.dockerignoreisn't excluding required files.
Container Can't Reach Host Services
Symptom: Gateway inside Docker can't connect to sidecar on localhost.
Fix: Use Docker network names or host.docker.internal:
# docker-compose.yml
environment:
- ISARTOR__LAYER2__SIDECAR_URL=http://sidecar:8081 # service name
# or for host:
- ISARTOR__LAYER2__SIDECAR_URL=http://host.docker.internal:8081
Health Check Failing
Symptom: Orchestrator keeps restarting the container.
Fix: The health endpoint is GET /healthz. Ensure the health check
matches:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
interval: 10s
timeout: 5s
retries: 3
FAQ
Q: What is cache_mode and which should I use?
A: cache_mode controls which cache layers are active:
| Mode | What it does | Best for |
|---|---|---|
| exact | Only SHA-256 hash match | Deterministic agent loops |
| semantic | Only cosine similarity | Diverse user queries |
| both | Exact first, then semantic | Most workloads (default) |
Q: What happens if Redis goes down?
A: Isartor gracefully falls through. The exact cache layer logs a warning and forwards the request downstream. No crash, no data loss. Deflection rate drops until Redis recovers, and more requests reach the cloud LLM (higher cost).
Q: Can I change the embedding model?
A: Yes. The in-process embedder uses candle with a pure-Rust BertModel, which supports multiple models. Set:
export ISARTOR__EMBEDDING_MODEL=bge-small-en-v1.5
The model is auto-downloaded on first startup. Note: changing the model invalidates the semantic cache (different embedding dimensions/space).
Q: How much does Isartor cost to run?
A: Isartor itself is free (Apache 2.0). The infrastructure cost depends on your deployment:
| Mode | Estimated Cost |
|---|---|
| Minimalist (single binary, no GPU) | ~$5–15/month (small VM or container) |
| With SLM sidecar (CPU) | ~$20–50/month (4-core VM) |
| With SLM on GPU | ~$50–200/month (GPU instance) |
| Enterprise (K8s + Redis + vLLM) | ~$200–500/month |
The ROI comes from cloud LLM savings. At 70 % deflection and $0.01/1K tokens, Isartor typically pays for itself within the first week.
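The payback claim is easy to sanity-check against your own traffic. A sketch with an assumed 50M tokens/month (an illustrative volume, not a benchmark; swap in your numbers):

```shell
# Monthly savings = tokens x deflection% x price per 1K tokens.
tokens_per_month=50000000   # assumed example volume
deflection_pct=70
price_cents_per_1k=1        # $0.01 per 1K tokens
saved_cents=$(( tokens_per_month / 100 * deflection_pct / 1000 * price_cents_per_1k ))
echo "estimated savings: \$$(( saved_cents / 100 )) per month"   # → estimated savings: $350 per month
```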
Q: Is Isartor production-ready?
A: Isartor is designed for production use with:
- ✅ Bounded, concurrent caches (no unbounded memory growth)
- ✅ Graceful degradation (every layer has a fallback)
- ✅ OpenTelemetry observability (traces, metrics, structured logs)
- ✅ Health check endpoint (
/healthz) - ✅ Configurable via environment variables (12-factor app)
- ✅ Integration tests covering all middleware layers
For enterprise deployments, use Redis-backed caches and a production Kubernetes cluster. See the Enterprise Guide.
Q: Can I use Isartor with LangChain / LlamaIndex / AutoGen?
A: Yes. Isartor exposes an OpenAI-compatible API. Point any SDK at the gateway URL:
import openai
client = openai.OpenAI(
base_url="http://your-isartor-host:8080/v1",
api_key="your-gateway-key",
)
See Integrations for full examples.
Q: How do I upgrade Isartor?
A:
# Binary
cargo install --path . --force
# Docker
docker pull ghcr.io/isartor-ai/isartor:latest
docker compose up -d --pull always
In-memory caches are cleared on restart. Redis caches persist.
Q: Why does isartor update or GitHub access fail with localhost:8081 / Connection refused after I stopped Isartor?
A: Your shell likely still has proxy environment variables from a prior
isartor connect ... session, so non-Isartor commands are still trying to
reach GitHub through the local CONNECT proxy on localhost:8081.
Fix on macOS / Linux:
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY https_proxy http_proxy all_proxy
unset NODE_EXTRA_CA_CERTS SSL_CERT_FILE REQUESTS_CA_BUNDLE
unset ISARTOR_COPILOT_ENABLED ISARTOR_ANTIGRAVITY_ENABLED
Then confirm the shell is clean:
env | grep -i proxy
You can also clean up client-side configuration:
isartor connect copilot --disconnect
isartor connect claude --disconnect
isartor connect antigravity --disconnect
Q: Why does isartor update fail with Permission denied (os error 13)?
A: Your current isartor binary is installed in a system-managed directory.
Recommended fix: move to a user-writable install location:
mkdir -p ~/.local/bin
cp /usr/local/bin/isartor ~/.local/bin/isartor
chmod +x ~/.local/bin/isartor
export PATH="$HOME/.local/bin:$PATH"
hash -r
Then confirm: which isartor
Q: Why does isartor keep my terminal busy?
A: isartor runs the API gateway in the foreground by default. Start in detached mode:
isartor up --detach
Stop later with: isartor stop
Q: How do I monitor deflection rate in real-time?
A: Use the Grafana dashboard included in dashboards/prometheus-grafana.json
or the PromQL query:
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
)
Q: Can I run Isartor without any cloud LLM?
A: Partially. Layers 1 and 2 work standalone (cache + SLM), but Layer 3 requires a cloud LLM API key. Without one, uncached COMPLEX requests return a 502 error. For fully local operation, ensure your SLM can handle all traffic (classify aggressively toward TEMPLATE/SNIPPET, or toward SIMPLE in legacy binary mode).
See also: Performance Tuning · Metrics & Tracing · Configuration Reference
Why Most LLM Gateways Can't Pass a FedRAMP Review
Published on the Isartor blog — targeting platform engineers and security architects at regulated enterprises.
The CISO's Nightmare
Picture this: a CISO at a federal agency is six months into an LLM gateway evaluation. The vendor has given assurances — "our gateway is secure, all data stays in your environment." The compliance team runs a network capture during the proof-of-concept. Three unexpected domains light up:
- telemetry.vendor.io — anonymous usage metrics
- license.vendor.io — license key validation on every startup
- registry.vendor.io — model version checks
The FedRAMP audit fails. The project is cancelled. Six months of engineering work discarded because nobody read the gateway's egress behavior carefully enough before the evaluation began.
This is not a hypothetical. It happens routinely in regulated environments. The mistake is usually honest — gateway teams build their products for cloud-native deployments and add telemetry and license checks as an afterthought, without thinking about what happens when those systems need to run in an air-gapped facility.
The Hidden Phone-Home Problem
Most LLM gateways have outbound connection patterns that are not documented in their README. Let's be specific about what these are and why each one is a blocker in a FedRAMP or HIPAA environment:
License validation servers. A gateway that validates its licence key against a remote server cannot operate in a network segment with no outbound internet access. Worse, the validation traffic typically contains the licence key and the server's hostname — both of which may be considered sensitive data in a classified environment. Under FedRAMP Moderate, SC-7 (Boundary Protection) requires that external connections be explicitly authorised and documented. An undocumented licence-check endpoint fails this control.
Anonymous usage telemetry. Many open-source gateways ship with opt-out telemetry that sends aggregate usage statistics to the developer's servers. Even "anonymous" telemetry can include prompt length distributions, model names, or error rates that a regulated environment may consider sensitive. Under HIPAA, any data that could be used to identify a patient — including metadata about the prompts that process PHI — must stay within the covered entity's environment.
Model registry lookups. Gateways that support automatic model updates or capability discovery make outbound calls to check for new model versions. In an air-gapped environment, there is no path for these calls to succeed — and if the gateway blocks on a registry timeout, latency spikes cascade through the application.
OTel exporters enabled by default. OpenTelemetry is essential for observability, but a gateway that ships with OTLP_EXPORTER_ENDPOINT pointing at a cloud-hosted collector creates a data exfiltration risk. Trace data contains prompt content, response content, latency, and error messages. An OTel exporter sending this to an external endpoint in a HIPAA environment would be a reportable breach.
Each of these problems has the same root cause: the gateway was designed for cloud-native deployments and retrofitted for security requirements, rather than designed with air-gap constraints from the start.
What "Truly Air-Gapped" Actually Means
A gateway that can genuinely pass an air-gap review must satisfy three requirements:
1. A static binary with no runtime dependencies. Every runtime dependency — a Python interpreter, a Node.js runtime, a JVM — is a potential attack surface and a source of unexpected network calls. A statically compiled binary eliminates the entire class of "your dependency phoned home without you knowing" vulnerabilities. It also eliminates the download-on-first-run pattern where models or plugins are fetched from the internet when the gateway starts.
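One quick way to spot-check the no-runtime-dependency claim on any candidate binary (the path below is illustrative) is to inspect it for dynamic linkage with standard tools:

```shell
# A musl-static build has no dynamic section; ldd refuses to resolve it.
ldd ./isartor
#   typically prints: "not a dynamic executable"
# Cross-check with file(1):
file ./isartor
#   should report: "... statically linked ..."
```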
2. Offline licence validation. Licence validation must work without a network call. The correct approach is offline cryptographic validation: the licence key embeds a MAC or signature that the binary verifies locally against key material baked in at compile time (a shared secret for HMAC, or a public key for an asymmetric signature scheme). No server call required. No licence-check traffic to document in your FedRAMP boundary diagram.
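A minimal sketch of the offline-verification idea, assuming a hypothetical "payload.hex-mac" licence layout (the real Isartor licence format is not documented here). The MAC is recomputed locally from a key embedded at build time and compared; no network is involved at any point:

```shell
# Hypothetical licence layout: "<payload>.<hex-hmac-sha256>" (illustrative).
EMBEDDED_KEY="baked-in-at-compile-time"      # shipped inside the binary
payload="org=acme;tier=enterprise;exp=2027-01-01"
# Issuer side: mint a licence by MAC-ing the payload.
mac=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$EMBEDDED_KEY" | awk '{print $NF}')
licence="${payload}.${mac}"
# Verifier side (inside the binary): recompute and compare. No server call.
check=$(printf '%s' "${licence%.*}" | openssl dgst -sha256 -hmac "$EMBEDDED_KEY" | awk '{print $NF}')
if [ "$check" = "${licence##*.}" ]; then echo "licence valid"; else echo "licence invalid"; fi
# → licence valid
```

Tampering with any byte of the payload changes the recomputed MAC, so the comparison fails without ever contacting a licence server.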
3. All models bundled — no download on first run. Any model that is downloaded at runtime creates a bootstrap dependency on internet connectivity. For an air-gapped deployment, all models must be available in the container image (or on a mounted volume) before the gateway starts. This is non-negotiable for environments where the deployment system has no outbound internet access at all.
Isartor is designed to meet all three requirements. The binary is compiled with Rust's --target x86_64-unknown-linux-musl, producing a fully static binary with zero shared-library dependencies. Licence validation uses HMAC offline verification. The latest-airgapped Docker image is built to pre-bundle (or pre-cache) all embedding models so that, once the image is transferred to the air-gapped environment and ISARTOR__OFFLINE_MODE=true is set, no additional model downloads or outbound internet access are required at runtime.
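The transfer itself follows the standard offline Docker workflow: pull on a connected staging host, export to a tarball, move it across the gap on approved media, and import on the other side. The image tag comes from this page; everything else is the stock docker CLI:

```shell
# On a connected staging host:
docker pull ghcr.io/isartor-ai/isartor:latest-airgapped
docker save ghcr.io/isartor-ai/isartor:latest-airgapped -o isartor-airgapped.tar
# ... move isartor-airgapped.tar across the air gap on approved media ...
# On the air-gapped host:
docker load -i isartor-airgapped.tar
```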
The Configuration
Here is the complete environment variable configuration for a compliant air-gapped deployment of Isartor in front of a self-hosted vLLM instance:
# ── Air-gap enforcement ──────────────────────────────────────────────
# Block all outbound cloud connections at the application layer.
export ISARTOR__OFFLINE_MODE=true
# ── Internal LLM routing (L3) ────────────────────────────────────────
# Route surviving cache-misses to your internal model server.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai # vLLM exposes OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct
# ── Observability (internal collector only) ──────────────────────────
export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.internal.corp:4317
Running isartor connectivity-check with this configuration produces:
Isartor Connectivity Audit
──────────────────────────
Required (L3 cloud routing):
→ http://vllm.internal.corp:8000/v1 [CONFIGURED]
(BLOCKED — offline mode active)
Optional (observability / monitoring):
→ http://otel-collector.internal.corp:4317 [CONFIGURED]
Internal only (no external):
→ (in-memory cache — no network connection) [CONFIGURED - internal]
Zero hidden telemetry connections: ✓ VERIFIED
Air-gap compatible: ✓ YES (L3 disabled or offline mode active)
This output is the screenshot your compliance team needs. Every connection Isartor makes is explicit, documented, and internal.
The FedRAMP Control Mapping
Understanding how a deployment posture maps to specific NIST 800-53 controls is what separates a security claim from a security argument. Here are the four controls most directly supported by Isartor's air-gapped deployment posture:
AU-2 (Audit Logging): AU-2 requires that the system generate audit records for events relevant to security. Isartor logs every prompt, every deflection decision, and every L3 forwarding event as a structured JSON record with a distributed tracing span. The logs include the layer that handled the request (L1a, L1b, L2, L3), the latency, and whether the request was deflected or forwarded. These records can be ingested by any SIEM that accepts JSON log streams.
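Because the records are line-delimited JSON, a SIEM-less spot check works with plain text tools. The field names below are illustrative assumptions about the record shape, not the documented schema:

```shell
# Two illustrative audit records (field names are assumptions).
cat > /tmp/isartor-audit.log <<'EOF'
{"ts":"2025-06-01T12:00:00Z","layer":"L1a","deflected":true,"latency_ms":0.4}
{"ts":"2025-06-01T12:00:01Z","layer":"L3","deflected":false,"latency_ms":812.0}
EOF
# Count requests that escaped the stack and were forwarded to L3:
grep -c '"deflected":false' /tmp/isartor-audit.log
# → 1
```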
SC-7 (Boundary Protection): SC-7 requires the system to monitor and control communications at external boundary points. ISARTOR__OFFLINE_MODE=true implements a hard application-layer block on all outbound connections to non-internal endpoints. This is verified by the phone-home audit test in tests/phone_home_audit.rs, which runs on every commit to main in CI. The CI badge on the repository proves continuous enforcement.
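Reviewers can also reproduce the check independently with an ordinary packet capture during startup. The interface, duration, and internal address range below are illustrative, and the capture requires root:

```shell
# Capture outbound TCP SYNs for 60 s while the gateway starts (run as root).
# 10.0.0.0/8 stands in for your documented internal range.
tcpdump -n -i any 'tcp[tcpflags] & tcp-syn != 0 and not dst net 10.0.0.0/8' \
        -w /tmp/isartor-egress.pcap &
CAP_PID=$!
isartor up --detach
sleep 60
kill "$CAP_PID"
# An empty capture beyond your documented internal endpoints is the evidence.
tcpdump -n -r /tmp/isartor-egress.pcap | wc -l
```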
SI-4 (Information System Monitoring): SI-4 requires monitoring of the information system to detect attacks and indicators of compromise. Isartor's OpenTelemetry integration exports traces and metrics to an internal collector. The deflection stack metrics — cache hit rate, L3 call rate, latency per layer — provide a real-time signal that can be baselined and alerted on. An anomalous spike in L3 calls could indicate a cache poisoning attempt.
CM-6 (Configuration Settings): CM-6 requires the organisation to establish and document configuration settings. Every Isartor configuration parameter is controlled by an environment variable with a documented default and a documented security implication. The ISARTOR__OFFLINE_MODE flag, in particular, has a documented effect: it is a single switch that moves the system from "possibly communicates with cloud" to "provably does not communicate with cloud."
Call to Action
If you are a platform engineer or security architect at a regulated enterprise evaluating LLM gateway options, start here:
- Read the Air-Gapped Deployment Guide for the complete pre-deployment checklist.
- Pull ghcr.io/isartor-ai/isartor:latest-airgapped and run isartor connectivity-check in your environment.
- Review the phone-home audit test to understand exactly what is being verified in CI.
- Open an issue on GitHub if you have compliance requirements not covered here — FedRAMP High, IL5, ITAR, and sector-specific requirements are all on the roadmap.
The binary that passes your network capture is the binary that passes your FedRAMP review.