Welcome to Isartor

Open-source Prompt Firewall — deflect up to 95% of redundant LLM traffic before it leaves your infrastructure.

Pure Rust · Single Binary · Zero Hidden Telemetry · Air-Gappable


AI coding agents and personal assistants repeat themselves — a lot. Copilot, Claude Code, Cursor, and OpenClaw send the same system instructions, the same context preambles, and often the same user prompts across every turn. Standard API gateways forward all of it to cloud LLMs regardless.

Isartor sits between your tools and the cloud. It intercepts every prompt and runs a cascade of local algorithms — from sub-millisecond hashing to in-process neural inference — to resolve requests before they reach the network. Only the genuinely hard prompts make it through.

The Deflection Stack

Every incoming request passes through a cascade of increasingly capable layers. Only prompts requiring genuine, complex reasoning survive the stack to reach the cloud.

Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit                │ hit                 │ simple             │ compressed                │
                 ▼                    ▼                     ▼                    ▼                           ▼
              Response             Response            Local Response     Optimised Prompt            Cloud Response
| Layer | What It Does | Typical Latency |
| --- | --- | --- |
| L1a — Exact Cache | Sub-millisecond duplicate detection via fast hashing. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Catches meaning-equivalent prompts ("Price?" ≈ "Cost?") using pure-Rust embeddings. | 1–5 ms |
| L2 — SLM Router | Triages intent with an embedded Small Language Model to resolve simple tasks locally. | 50–200 ms |
| L2.5 — Context Optimiser | Compresses repeated instruction payloads (CLAUDE.md, copilot-instructions) via session dedup and minification. | < 1 ms |
| L3 — Cloud Logic | Routes surviving complex prompts to OpenAI, Anthropic, or Azure with fallback resilience. | Network-bound |

Layers 1a and 1b deflect 71% of repetitive agentic traffic and 38% of diverse task traffic before any neural inference runs.
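
The cascade above can be sketched as a single short-circuiting decision. This is an illustrative sketch only — the types, names, and signals here are assumptions, not Isartor's actual API:

```rust
// Hypothetical sketch of the Deflection Stack's short-circuit logic.
#[derive(Debug, PartialEq)]
enum Outcome {
    ExactHit,    // L1a: < 1 ms, zero tokens
    SemanticHit, // L1b: 1–5 ms, zero tokens
    LocalAnswer, // L2: 50–200 ms, resolved by the embedded SLM
    Cloud,       // L3: only genuinely hard prompts reach here
}

// Each flag stands in for the corresponding layer's check; the first
// layer that can answer short-circuits the rest of the stack.
fn deflect(exact_hit: bool, semantic_hit: bool, simple_intent: bool) -> Outcome {
    if exact_hit {
        Outcome::ExactHit
    } else if semantic_hit {
        Outcome::SemanticHit
    } else if simple_intent {
        Outcome::LocalAnswer
    } else {
        Outcome::Cloud // L2.5 may still compress the prompt on the way out
    }
}
```

Only requests that fall all the way through this chain pay network latency and cloud tokens.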

How It Works

Getting started with Isartor takes three steps:

1. Install

curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh

Or use Docker:

docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest

2. Connect

Point any OpenAI-compatible client at Isartor — just change the base URL:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
)

Works with the official SDKs, LangChain, LlamaIndex, AutoGen, GitHub Copilot, OpenClaw, and any other OpenAI-compatible tool.

Recent OpenAI-compatible improvements for coding agents include:

  • GET /v1/models for model discovery
  • stream: true support on /v1/chat/completions with proper SSE chunks
  • tools, tool_choice, functions, and function_call passthrough
  • tool_calls preserved in upstream responses

3. Save

Isartor deflects repetitive and simple prompts locally. You keep the same responses, pay for fewer tokens, and get lower latency — with zero code changes beyond the URL.


Explore the Docs

🚀 Getting Started Install Isartor and send your first request.

🔌 Integrations Connect Copilot CLI, Cursor, Claude Code, and more.

📦 Deployment From a single binary to a multi-replica K8s cluster.

⚙️ Configuration Every environment variable and config key.

🏗️ Architecture Deep dive into the Deflection Stack and trait providers.

📊 Observability OpenTelemetry traces, Prometheus metrics, Grafana dashboards.

Installation

Isartor ships as a single statically linked binary — no runtime dependencies required.

curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh

Docker

The image ships a statically linked isartor binary and downloads the embedding model on first start (then reuses the on-disk hf-hub cache). No API key is needed for the cache layers.

docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest

To persist the model cache across restarts (recommended):

docker run -p 8080:8080 \
  -e HF_HOME=/tmp/huggingface \
  -v isartor-hf:/tmp/huggingface \
  ghcr.io/isartor-ai/isartor:latest

To use Azure OpenAI for Layer 3, supply the key via Docker secrets (the *_FILE variables, recommended). Important: ISARTOR__EXTERNAL_LLM_URL must be the base Azure endpoint only (no /openai/... path), e.g. https://<resource>.openai.azure.com:

# Put your key in a file (no trailing newline is ideal, but Isartor trims whitespace)
echo -n "YOUR_AZURE_OPENAI_KEY" > ./azure_openai_key

docker run -p 8080:8080 \
  -e ISARTOR__LLM_PROVIDER=azure \
  -e ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com \
  -e ISARTOR__AZURE_DEPLOYMENT_ID=<deployment> \
  -e ISARTOR__AZURE_API_VERSION=2024-08-01-preview \
  -e ISARTOR__EXTERNAL_LLM_API_KEY_FILE=/run/secrets/azure_openai_key \
  -v $(pwd)/azure_openai_key:/run/secrets/azure_openai_key:ro \
  ghcr.io/isartor-ai/isartor:latest

The startup banner appears after all layers are ready (< 30 s on a modern machine).

Image size: ~120 MB compressed / ~260 MB on disk (includes all-MiniLM-L6-v2 embedding model, statically linked Rust binary).

Windows (PowerShell) — Single Command

irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex

Build from Source

git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
./target/release/isartor up

Requires Rust 1.75 or later.

Verify Installation

Check that the binary is available:

isartor --version

Run the built-in demo. It works without an API key, but if you configure a provider first it also shows a live upstream round-trip:

isartor set-key -p groq
isartor check
isartor demo

Verify the health endpoint:

curl http://localhost:8080/health
# {"status":"ok","version":"0.1.0","layers":{...},"uptime_seconds":5,"demo_mode":true}

Quick Start

This guide walks you through starting Isartor, making your first request, observing a cache hit, and checking stats. If you haven't installed Isartor yet, see the Installation guide.

Starting Isartor

isartor up           # start the API gateway only
isartor up --detach  # start in background and return to the shell
isartor up copilot   # start gateway + CONNECT proxy for Copilot CLI

Other useful commands:

isartor init         # generate a commented config scaffold
isartor set-key -p openai  # configure your LLM provider API key
isartor check        # verify provider/model/key masking and live connectivity
isartor demo         # run the post-install showcase
isartor stop         # stop a running Isartor instance (uses PID file)
isartor update       # self-update to the latest version from GitHub releases

Making Your First Request

Isartor exposes an OpenAI-compatible API. Send a request to the /v1/chat/completions endpoint:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma-2-2b-it",
    "messages": [
      {"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
    ]
  }'

Expected JSON Response (snippet):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The quantum Hall effect is a phenomenon..."
      }
    }
  ],
  "usage": { ... }
}

Console Log (snippet):

INFO  [cache] Layer 1a miss: quantum Hall effect prompt
INFO  [slm_triage] Layer 3 fallback: OpenAI

The first request is a cache miss — Layer 2 triages it and Layer 3 routes it to your configured cloud provider.

OpenAI-compatible clients can also:

  • call GET /v1/models to discover the configured model
  • send "stream": true and receive OpenAI-style SSE responses
  • use tool/function calling fields such as tools, tool_choice, and functions

You can also use the native API:

curl -s http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Calculate 2+2"}'

Seeing a Cache Hit

Repeat the same request:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma-2-2b-it",
    "messages": [
      {"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
    ]
  }'

Expected JSON Response (snippet):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The quantum Hall effect is a phenomenon..."
      }
    }
  ],
  "usage": { ... }
}

Console Log (snippet):

INFO  [cache] Layer 1a exact match: quantum Hall effect prompt
INFO  [slm_triage] Short-circuit: cache hit

This time the response comes from the Layer 1a exact cache — sub-millisecond, zero tokens consumed, no cloud call.

Checking Stats

View prompt totals, layer hit rates, and recent routing history:

isartor stats

Connecting an AI Tool

Isartor works as a drop-in replacement for any OpenAI-compatible client. Point your favourite AI tool at http://localhost:8080/v1 and it will route through the Deflection Stack automatically.

import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarise this document."}],
)

If your client probes models first, this also works:

curl -sS http://localhost:8080/v1/models

For detailed setup guides for GitHub Copilot CLI, Claude Code, Cursor, and other tools, see the Integrations section.


For advanced configuration, see the Configuration Reference and Architecture.

Architecture

Pattern: Hexagonal Architecture (Ports & Adapters)
Location: src/core/, src/adapters/, src/factory.rs

High-Level Overview

Isartor is an AI Prompt Firewall that intercepts LLM traffic and routes it through a multi-layer Deflection Stack. Each layer can short-circuit and return a response without reaching the cloud, dramatically reducing cost and latency.

For a detailed breakdown of the deflection layers, see the Deflection Stack page.

flowchart TD
    A[Request] --> B[Auth]
    B --> C[Cache L1a: LRU/Redis]
    C --> D[Cache L1b: Candle/TEI]
    D --> E[SLM Router: Candle/vLLM]
    E --> F[Context Optimiser: CompressionPipeline]
    F --> G[Cloud Fallback: OpenAI/Anthropic]
    G --> H[Response]

    subgraph F_detail [L2.5 CompressionPipeline]
        direction LR
        F1[ContentClassifier] --> F2[DedupStage]
        F2 --> F3[LogCrunchStage]
    end

Pluggable Trait Provider Pattern

All layers are implemented as Rust traits and adapters. Backends are selected at startup via ISARTOR__ environment variables — no code changes or recompilation required.

Rather than feature-flag every call-site, we define Ports (trait interfaces in src/core/ports.rs) and swap the concrete Adapter at startup. This keeps the Deflection Stack logic completely agnostic to the backing implementation.

| Component | Minimalist (Single Binary) | Enterprise (K8s) |
| --- | --- | --- |
| L1a Exact Cache | In-memory LRU (ahash + parking_lot) | Redis cluster (shared across replicas) |
| L1b Semantic Cache | In-process candle BertModel | External TEI sidecar (optional) |
| L2 SLM Router | Embedded candle GGUF inference | Remote vLLM / TGI server (GPU pool) |
| L2.5 Context Optimiser | In-process CompressionPipeline (classifier → dedup → log_crunch) | In-process CompressionPipeline (extensible with custom stages) |
| L3 Cloud Logic | Direct to OpenAI / Anthropic | Direct to OpenAI / Anthropic |

Adding a New Adapter

  1. Define the struct in src/adapters/cache.rs or src/adapters/router.rs.
  2. Implement the port trait (ExactCache or SlmRouter).
  3. Add a variant to the config enum (CacheBackend or RouterBackend) in src/config.rs.
  4. Wire it in src/factory.rs with a new match arm.
  5. Write tests — each adapter module has a #[cfg(test)] mod tests.

No other files need to change. The middleware and pipeline code operate only on Arc<dyn ExactCache> / Arc<dyn SlmRouter>.
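
The pattern can be sketched roughly as follows. The trait shape and factory signature here are assumptions for illustration — the real interfaces live in src/core/ports.rs and src/factory.rs and may differ:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Port: the trait interface the pipeline depends on.
trait ExactCache: Send + Sync {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&self, key: &str, value: String);
}

// Minimalist adapter: an in-process map (the real one is an LRU on ahash).
struct InMemoryCache {
    map: Mutex<HashMap<String, String>>,
}

impl ExactCache for InMemoryCache {
    fn get(&self, key: &str) -> Option<String> {
        self.map.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: &str, value: String) {
        self.map.lock().unwrap().insert(key.to_owned(), value);
    }
}

// Factory: selects the adapter from configuration. Pipeline code only ever
// sees Arc<dyn ExactCache>, so swapping backends touches nothing else.
fn build_exact_cache(backend: &str) -> Arc<dyn ExactCache> {
    match backend {
        // "redis" => enterprise-tier adapter would go here
        _ => Arc::new(InMemoryCache { map: Mutex::new(HashMap::new()) }),
    }
}
```

Adding a Redis adapter means one new struct, one new match arm, and no changes to any caller.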

Scalability Model (3-Tier)

Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. The same binary serves all three tiers; the runtime behaviour is entirely configuration-driven.

Level 1 (Edge)           Level 2 (Compose)        Level 3 (K8s)
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│ Single Process  │      │ Firewall + GPU  │      │ N Firewall Pods │
│ memory cache    │ ───▶ │ Sidecar         │ ───▶ │ + Redis Cluster │
│ embedded candle │      │ memory cache    │      │ + vLLM Pool     │
│ context opt.    │      │ (optional)      │      │ (optional)      │
└─────────────────┘      └─────────────────┘      └─────────────────┘

Key insight: Switching to cache_backend=redis unlocks true multi-replica scaling. Without it, each firewall pod maintains an independent cache.

See the deployment guides for tier-specific setup.

Directory Layout

src/
├── core/
│   ├── mod.rs            # Re-exports
│   ├── ports.rs          # Trait interfaces (ExactCache, SlmRouter)
│   └── context_compress.rs # Re-export shim (backward compat)
├── adapters/
│   ├── mod.rs            # Re-exports
│   ├── cache.rs          # InMemoryCache, RedisExactCache
│   └── router.rs         # EmbeddedCandleRouter, RemoteVllmRouter
├── compression/
│   ├── mod.rs            # Re-exports all pipeline types
│   ├── pipeline.rs       # CompressionPipeline executor + CompressionStage trait
│   ├── cache.rs          # InstructionCache (per-session dedup state)
│   ├── optimize.rs       # Request body rewriting (JSON → pipeline → reassembly)
│   └── stages/
│       ├── content_classifier.rs  # Gate: instruction vs conversational
│       ├── dedup.rs               # Cross-turn instruction dedup
│       └── log_crunch.rs          # Static minification
├── middleware/
│   └── context_optimizer.rs  # L2.5 Axum middleware
├── factory.rs            # build_exact_cache(), build_slm_router()
└── config.rs             # CacheBackend, RouterBackend enums + AppConfig

The Deflection Stack

Every incoming request passes through a cascade of increasingly capable layers. Only prompts requiring genuine, complex reasoning survive the Deflection Stack to reach the cloud.

Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit                │ hit                 │ simple             │ compressed                │
                 ▼                    ▼                     ▼                    ▼                           ▼
              Response             Response            Local Response     Optimised Prompt            Cloud Response

Layers at a Glance

| Layer | Algorithm / Mechanism | What It Does | Typical Latency |
| --- | --- | --- | --- |
| L1a — Exact Cache | Fast hashing (ahash) | Sub-millisecond duplicate detection. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Cosine similarity (embeddings) | Computes mathematical meaning via pure-Rust candle models (all-MiniLM-L6-v2) to catch variations ("Price?" ≈ "Cost?"). | 1–5 ms |
| L2 — SLM Router | Neural classification (SLM) | Triages intent using an embedded Small Language Model (e.g. Qwen-1.5B) to resolve simple data extraction tasks. | 50–200 ms |
| L2.5 — Context Optimiser | Instruction dedup + minify | Compresses repeated instruction files (CLAUDE.md, copilot-instructions.md) via session dedup and static minification to reduce cloud input tokens. | < 1 ms |
| L3 — Cloud Logic | Load balancing & retries | Routes surviving complex prompts to OpenAI, Anthropic, or Azure, with built-in fallback resilience. | Network-bound |

Layers 1a and 1b deflect 71% of repetitive agentic traffic (FAQ/agent loop patterns) and 38% of diverse task traffic before any neural inference runs.

Layer Details

L1a — Exact Cache

Algorithm: Fast hashing with ahash

L1a is the first line of defence. It computes a hash of the incoming prompt and checks it against an in-memory LRU cache (single-binary mode) or a shared Redis cluster (enterprise mode).

  • Hit: Returns the cached response immediately (sub-millisecond).
  • Miss: The request continues to L1b.

Cache keys are namespaced before hashing (native|prompt, openai|prompt, anthropic|prompt, etc.) to ensure one endpoint never returns another endpoint's response schema. On a cache hit, ChatResponse.layer is normalised to 1 regardless of which layer originally produced the response.
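
The namespacing idea can be sketched in a few lines. Isartor uses ahash; std's DefaultHasher stands in here so the snippet needs no extra crates, and the function name is illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the endpoint namespace before the prompt, so "openai|prompt" and
// "anthropic|prompt" produce different keys even for identical prompt text.
fn cache_key(endpoint: &str, prompt: &str) -> u64 {
    let mut h = DefaultHasher::new();
    endpoint.hash(&mut h);
    b'|'.hash(&mut h); // explicit separator between namespace and payload
    prompt.hash(&mut h);
    h.finish()
}
```

The same prompt sent to two different endpoints therefore never collides in the exact cache.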

| Mode | Implementation |
| --- | --- |
| Minimalist | In-memory LRU (ahash + parking_lot) |
| Enterprise | Redis cluster (shared across replicas, async redis crate) |

L1b — Semantic Cache

Algorithm: Cosine similarity over sentence embeddings (all-MiniLM-L6-v2)

L1b catches semantically equivalent prompts that differ in wording. A sentence embedding is computed for the incoming prompt using a pure-Rust candle BertModel, then compared against the vector cache using cosine similarity.

  • Hit (similarity above threshold): Returns the cached response (1–5 ms).
  • Miss: The request continues to L2.

Embedding pipeline:

  • Model: sentence-transformers/all-MiniLM-L6-v2 — 384-dimensional embeddings (~90 MB).
  • Runtime: Pure-Rust candle stack — zero C/C++ dependencies.
  • Pooling: Mean pooling with attention mask, followed by L2 normalisation.
  • Thread safety: BertModel is wrapped in std::sync::Mutex; inference runs on tokio::task::spawn_blocking.
  • Architecture: TextEmbedder is initialised once at startup, stored as Arc<TextEmbedder> in AppState.

The vector cache is maintained in tandem with exact cache entries. Insertions and evictions update the index automatically, providing sub-millisecond vector search latency for thousands of embeddings.
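
The hit test itself is compact. A minimal sketch, assuming illustrative function names and an illustrative threshold (not Isartor's default):

```rust
// Embeddings are L2-normalised, so cosine similarity reduces to a dot product.
fn l2_normalise(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// An L1b hit: the query embedding is close enough to a cached entry.
fn is_semantic_hit(query: &[f32], cached: &[f32], threshold: f32) -> bool {
    cosine(&l2_normalise(query), &l2_normalise(cached)) >= threshold
}
```

In the real pipeline the 384-dimensional vectors come from the candle BertModel; the comparison logic is the same.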

| Mode | Implementation |
| --- | --- |
| Minimalist | In-process candle BertModel |
| Enterprise | External TEI sidecar (optional) |

L2 — SLM Router

Algorithm: Neural classification via Small Language Model

L2 runs a lightweight language model to classify the prompt's intent. Simple requests (data extraction, FAQ-style queries) can be resolved locally without reaching the cloud.

  • Simple intent: Returns a locally generated response (50–200 ms).
  • Complex intent: The request continues to L2.5.
  • Disabled (enable_slm_router = false): Layer is a no-op; request falls through to L3.

| Mode | Implementation |
| --- | --- |
| Minimalist | Embedded candle GGUF inference (e.g. Gemma-2-2B-IT, CPU) |
| Enterprise | Remote vLLM / TGI server (GPU pool) |

L2.5 — Context Optimiser

Algorithm: CompressionPipeline — Modular staged compression

Agentic coding tools (Copilot, Claude Code, Cursor) send large instruction files (CLAUDE.md, copilot-instructions.md, skills blocks) with every turn. L2.5 detects and compresses these payloads before they reach the cloud, saving input tokens on every L3 call.

Pipeline architecture (src/compression/):

L2.5 uses a modular CompressionPipeline with pluggable stages that execute in order. Each stage is a stateless CompressionStage trait object. If a stage sets short_circuit = true, subsequent stages are skipped.

Built-in stages (run in order):

  1. ContentClassifier — Gate stage: detects instruction vs conversational content. Short-circuits on conversational messages so downstream stages skip work.
  2. DedupStage — Session-aware cross-turn deduplication. Hashes instruction content per session; on repeat turns, replaces with a compact hash reference. Short-circuits on dedup hit.
  3. LogCrunchStage — Static minification: strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
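
A rough sketch of the stage contract, assuming hypothetical method and field names (the real trait lives in src/compression/pipeline.rs and may differ):

```rust
struct StageOutput {
    text: String,
    short_circuit: bool, // true => skip all remaining stages
}

trait CompressionStage {
    fn run(&self, input: &str) -> StageOutput;
}

// A LogCrunch-style stage: collapse runs of blank lines to a single one.
struct BlankLineCrunch;

impl CompressionStage for BlankLineCrunch {
    fn run(&self, input: &str) -> StageOutput {
        let mut out = String::new();
        let mut prev_blank = false;
        for line in input.lines() {
            if line.trim().is_empty() {
                if !prev_blank {
                    out.push('\n'); // keep exactly one blank line
                }
                prev_blank = true;
            } else {
                out.push_str(line);
                out.push('\n');
                prev_blank = false;
            }
        }
        StageOutput { text: out, short_circuit: false }
    }
}

// The pipeline executor runs stages in order, honouring short-circuits.
fn run_pipeline(stages: &[Box<dyn CompressionStage>], mut text: String) -> String {
    for stage in stages {
        let out = stage.run(&text);
        text = out.text;
        if out.short_circuit {
            break;
        }
    }
    text
}
```

A gate stage like ContentClassifier would simply return `short_circuit: true` for conversational content, so the dedup and minify stages never touch it.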

Adding custom stages:

Implement the CompressionStage trait and add your stage to the pipeline via build_pipeline() in src/compression/optimize.rs.

Configuration:

| Variable | Default | Description |
| --- | --- | --- |
| ISARTOR__ENABLE_CONTEXT_OPTIMIZER | true | Master switch for L2.5 |
| ISARTOR__CONTEXT_OPTIMIZER_DEDUP | true | Enable cross-turn instruction deduplication |
| ISARTOR__CONTEXT_OPTIMIZER_MINIFY | true | Enable static minification |

Observability:

  • Instrumented as: layer2_5_context_optimizer span in distributed traces.
  • Response header: x-isartor-context-optimized: bytes_saved=<N> on optimised requests.
  • Span fields: context.bytes_saved, context.strategy (e.g. "classifier+dedup", "classifier+log_crunch").

| Mode | Implementation |
| --- | --- |
| Minimalist | In-process CompressionPipeline (classifier → dedup → log_crunch) |
| Enterprise | In-process CompressionPipeline (extensible with custom stages) |

L3 — Cloud Logic

Algorithm: Load balancing & retries

L3 is the final layer. Only the hardest prompts — those not resolved by cache, SLM, or context optimisation — reach the external cloud LLMs.

  • Routes to OpenAI, Anthropic, Azure OpenAI, or xAI via rig-core.
  • Built-in fallback resilience with load balancing and retries.
  • Offline mode (offline_mode = true): Blocks L3 routing explicitly instead of silently pretending success.
  • Stale fallback: On L3 failure, checks the namespaced exact-cache key first, then a legacy un-namespaced key for backward compatibility.

| Mode | Implementation |
| --- | --- |
| Minimalist | Direct to OpenAI / Anthropic |
| Enterprise | Direct to OpenAI / Anthropic |

How Layers Interact

The deflection stack is implemented as Axum middleware plus a final handler. For authenticated routes, the execution order is:

  1. Body buffer — BufferedBody stores the request body so multiple layers can read it.
  2. Request-level monitoring — Observability instrumentation.
  3. Auth — API key validation.
  4. Layer 1 cache — L1a exact match, then L1b semantic match.
  5. Layer 2 SLM triage — Intent classification and local response.
  6. Layer 2.5 context optimiser — Instruction dedup + minification via CompressionPipeline.
  7. Layer 3 handler — Cloud LLM fallback.

Implementation note: Axum middleware wraps inside-out — the last .layer(...) added runs first. The stack order in src/main.rs documents this explicitly and must be preserved.
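
The inside-out behaviour can be demonstrated without axum. A sketch with plain closures — each wrapper encloses the stack built so far, so the last one added sits outermost and runs first (all names here are illustrative):

```rust
type Handler = Box<dyn Fn(&mut Vec<&'static str>)>;

// Wrap `inner` with a named layer that records itself before delegating,
// mirroring how .layer(...) wraps the existing service stack.
fn layer(name: &'static str, inner: Handler) -> Handler {
    Box::new(move |trace| {
        trace.push(name); // outer layers see the request earlier
        inner(trace);
    })
}

fn execution_order() -> Vec<&'static str> {
    let handler: Handler = Box::new(|trace| trace.push("handler"));
    let stack = layer("cache", handler); // added first => innermost
    let stack = layer("auth", stack);    // added last => outermost, runs first
    let mut trace = Vec::new();
    stack(&mut trace);
    trace
}
```

This is why the `.layer(...)` ordering in src/main.rs must be preserved: reordering the calls reorders the whole pipeline.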

Public health routes (/health, /healthz) intentionally bypass the deflection stack. The authenticated routes are /api/chat, /api/v1/chat, /v1/chat/completions, and /v1/messages.

Architecture Decision Records

Key design decisions, trade-offs, and rationale behind Isartor's architecture.

Each ADR follows a lightweight format: Context → Decision → Consequences.


ADR-001: Multi-Layer Deflection Stack Architecture

Date: 2024 · Status: Accepted

Context

AI Prompt Firewall traffic follows a power-law distribution: the majority of prompts are simple or repetitive, while only a small fraction requires expensive cloud LLMs. Sending all traffic to a single provider wastes tokens and money.

Decision

Implement a sequential Deflection Stack with 4+ layers, each capable of short-circuiting:

  • Layer 0 — Operational defence (auth, rate limiting, concurrency control)
  • Layer 1 — Semantic + exact cache (zero-cost hits)
  • Layer 2 — Local SLM triage (classify intent, execute simple tasks locally)
  • Layer 2.5 — Context optimiser (retrieve + rerank to minimise token usage)
  • Layer 3 — Cloud LLM fallback (only the hardest prompts)

Layer 2.5 (Context Optimiser): Retrieves and reranks candidate documents or responses to minimise downstream token usage. Typically implements top-K selection, reranking, or context-window optimisation before forwarding to the LLM. Instrumented as the context_optimise span in observability.

Consequences

  • Positive: 60–80% of traffic can be resolved before Layer 3, dramatically reducing cost.
  • Positive: Each layer adds latency only when needed — cache hits are sub-millisecond.
  • Positive: Clear separation of concerns; each layer is independently testable.
  • Negative: Deflection Stack adds conceptual complexity vs. a simple reverse proxy.
  • Negative: Each layer needs its own error handling and timeout strategy.

ADR-002: Axum + Tokio as Runtime Foundation

Date: 2024 · Status: Accepted

Context

The firewall must handle high concurrency (thousands of simultaneous connections) with low latency overhead. The binary should be small, statically linked, and deployable to minimal environments.

Decision

Use Axum 0.8 on Tokio 1.x for the async HTTP server. Build with --target x86_64-unknown-linux-musl and opt-level = "z" + LTO for a ~5 MB static binary.
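
For reference, a release profile consistent with the flags described above might look like this (a sketch — the exact settings in Isartor's Cargo.toml may differ, and `strip` is an assumption):

```toml
[profile.release]
opt-level = "z"   # optimise for binary size rather than speed
lto = true        # link-time optimisation across crates
strip = true      # assumption: debug symbols stripped for the ~5 MB figure
```

Combined with `cargo build --release --target x86_64-unknown-linux-musl`, this yields a fully static binary suitable for distroless images.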

Consequences

  • Positive: Tokio's work-stealing scheduler handles 10K+ concurrent connections efficiently.
  • Positive: Axum's type-safe extractors catch errors at compile time.
  • Positive: Static musl binary runs in distroless containers (no libc, no shell).
  • Negative: Rust's compilation times are longer than Go/Node.js equivalents.
  • Negative: Ecosystem is smaller — fewer off-the-shelf middleware components.

ADR-003: Embedded Candle Classifier (Layer 2)

Date: 2024 · Status: Accepted

Context

For minimal deployments (edge, VPS, air-gapped), requiring an external sidecar (llama.cpp, Ollama, TGI) adds operational complexity. Many classification tasks can be handled by a 2B parameter model on CPU.

Decision

Embed a Gemma-2-2B-IT GGUF model directly in the Rust process using the candle framework. The model is loaded on first start via hf-hub (auto-downloaded from Hugging Face) and wrapped in a tokio::sync::Mutex for thread-safe inference on spawn_blocking.

Consequences

  • Positive: Zero external dependencies for Layer 2 classification — a single binary handles everything.
  • Positive: No HTTP overhead for classification calls; inference is an in-process function call.
  • Positive: Works in air-gapped environments with pre-cached models.
  • Negative: ~1.5 GB memory overhead for the Q4_K_M model weights.
  • Negative: CPU inference is slower than GPU (50–200 ms classification, 200–2000 ms generation).
  • Negative: Mutex serialises inference calls — throughput limited to one inference at a time.
  • Trade-off: For higher throughput, upgrade to Level 2 (llama.cpp sidecar on GPU).

ADR-004: Three Deployment Tiers

Date: 2024 · Status: Accepted

Context

Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. A single deployment model cannot serve all use cases optimally.

Decision

Define three explicit deployment tiers that share the same binary and configuration surface:

| Tier | Strategy | Target |
| --- | --- | --- |
| Level 1 | Monolithic binary, embedded candle | VPS, edge, bare metal |
| Level 2 | Firewall + llama.cpp sidecars | Docker Compose, single host + GPU |
| Level 3 | Stateless pods + inference pools | Kubernetes, Helm, HPA |

The tier is selected purely by environment variables and infrastructure, not by code changes.

Consequences

  • Positive: A single codebase and binary serves all deployment scenarios.
  • Positive: Users start at Level 1 and upgrade incrementally — no migrations.
  • Positive: Clear documentation entry points for each tier.
  • Negative: Some config variables are irrelevant at certain tiers (e.g., ISARTOR__LAYER2__SIDECAR_URL is unused at Level 1 with embedded candle).
  • Negative: Testing all three tiers requires different infrastructure setups.

ADR-005: llama.cpp as Sidecar (Level 2) Instead of Ollama

Date: 2024 · Status: Accepted

Context

The original design used Ollama (~1.5 GB image) as the local SLM engine. While Ollama has a convenient API and model management, it's heavyweight for a sidecar.

Decision

Replace Ollama with llama.cpp server (ghcr.io/ggml-org/llama.cpp:server, ~30 MB) as the default sidecar in docker-compose.sidecar.yml. Two instances run side by side:

  • slm-generation (port 8081) — Phi-3-mini for classification and generation
  • slm-embedding (port 8082) — all-MiniLM-L6-v2 with --embedding flag

Consequences

  • Positive: 50× smaller container images (30 MB vs. 1.5 GB).
  • Positive: Faster cold starts; no model pull step needed (uses --hf-repo auto-download).
  • Positive: OpenAI-compatible API — firewall code doesn't need to change.
  • Negative: Ollama's model management UX (pull, list, delete) is lost.
  • Negative: Each model needs its own llama.cpp instance (no multi-model serving).
  • Migration: Ollama-based Compose files (docker-compose.yml, docker-compose.azure.yml) are retained for backward compatibility.
  • Update (ADR-011): The slm-embedding sidecar (port 8082) is now optional. Layer 1 semantic cache embeddings are generated in-process via candle (pure-Rust BertModel).

ADR-006: rig-core for Multi-Provider LLM Client

Date: 2024 · Status: Accepted

Context

Layer 3 must route to multiple cloud LLM providers (OpenAI, Azure OpenAI, Anthropic, xAI). Implementing each provider's API client from scratch would be error-prone and hard to maintain.

Decision

Use rig-core (v0.32.0) as the unified LLM client. Rig provides a consistent CompletionModel abstraction over all supported providers.

Consequences

  • Positive: Single configuration surface (ISARTOR__LLM_PROVIDER + ISARTOR__EXTERNAL_LLM_API_KEY) switches providers.
  • Positive: Provider-specific quirks (Azure deployment IDs, Anthropic versioning) handled by rig.
  • Negative: Adds a dependency; rig's release cadence may not match our needs.
  • Negative: Limited to providers rig supports (but covers all major ones).

ADR-007: AIMD Adaptive Concurrency Control

Date: 2024 · Status: Accepted

Context

A fixed concurrency limit either over-provisions (wasting resources) or under-provisions (rejecting requests during traffic spikes). The firewall needs to dynamically adjust its limit based on real-time latency.

Decision

Implement an Additive Increase / Multiplicative Decrease (AIMD) concurrency limiter at Layer 0:

  • If P95 latency < target → limit += 1 (additive increase).
  • If P95 latency > target → limit *= 0.5 (multiplicative decrease).
  • Bounded by configurable min/max concurrency limits.
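
The update rule can be written as a small pure function. A sketch under stated assumptions — the function name is invented, and the behaviour when P95 exactly equals the target (treated as a decrease here) is not specified by the ADR:

```rust
// AIMD: additive increase on good latency, multiplicative (×0.5) decrease
// on bad latency, clamped to configured bounds.
fn aimd_update(limit: usize, p95_ms: f64, target_ms: f64, min: usize, max: usize) -> usize {
    let next = if p95_ms < target_ms {
        limit + 1                        // additive increase
    } else {
        ((limit as f64) * 0.5) as usize  // multiplicative decrease
    };
    next.clamp(min, max)
}
```

Run once per measurement window, this converges the limit toward the largest value that keeps P95 under the target.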

Consequences

  • Positive: Self-tuning: the limit converges to the optimal value for the current load.
  • Positive: Protects downstream services (sidecars, cloud LLMs) from overload.
  • Negative: During cold start, the limit starts low and ramps up — initial requests may see 503s.
  • Tuning: Target latency must be calibrated per deployment tier.

ADR-008: Unified API Surface

Date: 2024 · Status: Superseded

Context

The original design maintained two API versions: a v1 middleware-based pipeline (/api/chat) and a v2 orchestrator-based pipeline (/api/v2/chat). Maintaining two code paths increased complexity with no clear benefit once the middleware pipeline matured.

Decision

Consolidate into a single endpoint:

  • /api/chat — Middleware-based pipeline. Each layer is an Axum middleware (auth → cache → SLM triage → handler).
  • The v2 endpoint (/api/v2/chat) and its pipeline_* configuration fields have been removed.
  • Orchestrator and trait-based pipeline components remain in src/pipeline/ for potential future reintegration.

Consequences

  • Positive: Single code path to maintain, test, and observe.
  • Positive: Simplified configuration surface — no more PIPELINE_* env vars.
  • Positive: Eliminates user confusion about which endpoint to use.
  • Negative: Orchestrator-based features (structured processing_log, explicit PipelineContext) are not exposed until reintegrated.

ADR-009: Distroless Container Image

Date: 2024 · Status: Accepted

Context

The firewall binary is statically linked (musl). The runtime container only needs to execute a single binary.

Decision

Use gcr.io/distroless/static-debian12 as the runtime base image. It contains no shell, no package manager, no libc — only the static binary.

Consequences

  • Positive: Minimal attack surface — no shell to exec into, no tools for attackers.
  • Positive: Tiny image size (base ~2 MB + binary ~5 MB = ~7 MB total).
  • Positive: Passes most container security scanners with zero CVEs.
  • Negative: Cannot docker exec into the container for debugging (no shell).
  • Negative: Cannot install additional tools at runtime.
  • Workaround: Use docker logs, Jaeger traces, and Prometheus metrics for debugging.

ADR-010: OpenTelemetry for Observability

Date: 2024 · Status: Accepted

Context

The firewall needs distributed tracing and metrics. Vendor-specific SDKs (Datadog, New Relic, etc.) create lock-in.

Decision

Use OpenTelemetry (OTLP gRPC) as the sole telemetry interface. Traces and metrics are exported to an OTel Collector, which can forward to any backend (Jaeger, Prometheus, Grafana, Datadog, etc.).

Consequences

  • Positive: Vendor-neutral — switch backends by reconfiguring the collector, not the app.
  • Positive: OTLP is a CNCF standard with wide ecosystem support.
  • Positive: When ISARTOR__ENABLE_MONITORING=false, no OTel SDK is initialised — zero overhead.
  • Negative: Requires an OTel Collector as middleware (adds one more service in Level 2/3).
  • Negative: Auto-instrumentation is less mature in Rust than in Java/Python.

ADR-011: Pure-Rust Candle for In-Process Sentence Embeddings

Status: Accepted (supersedes: fastembed → candle)
Date: 2025-06 (updated 2025-07)
Deciders: Core team
Relates to: ADR-003 (Embedded Candle), ADR-005 (llama.cpp sidecar)

Context

Layer 1 (semantic cache) must generate sentence embeddings for every incoming prompt to compute cosine similarity against the vector cache. Previously, this was done via fastembed (ONNX Runtime, BAAI/bge-small-en-v1.5), which introduced a C++ dependency (onnxruntime-sys) that broke cross-compilation on ARM64 macOS and complicated the build matrix.

Decision

Use candle (candle-core, candle-nn, candle-transformers 0.9) with hf-hub and tokenizers to run sentence-transformers/all-MiniLM-L6-v2 in-process via a pure-Rust BertModel. The model weights (~90 MB) are downloaded once from Hugging Face Hub on first startup and cached in ~/.cache/huggingface/. Inference is invoked through tokio::task::spawn_blocking since BERT forward passes are CPU-bound.

  • Model: sentence-transformers/all-MiniLM-L6-v2 — 384-dimensional embeddings, optimised for sentence similarity.
  • Runtime: Pure-Rust candle stack — zero C/C++ dependencies, seamless cross-compilation to any rustc target.
  • Pooling: Mean pooling with attention mask, followed by L2 normalisation.
  • Thread safety: The inner BertModel is wrapped in std::sync::Mutex because forward() takes &mut self. This is acceptable because inference is always called from spawn_blocking, never holding the lock across .await points.
  • Architecture: TextEmbedder is initialised once at startup, stored as Arc<TextEmbedder> in AppState, and injected into the cache middleware.
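The pooling step described above can be sketched on plain vectors — a simplified stand-in for the candle tensor ops, with an illustrative function name:

```rust
/// Mean-pool token embeddings using the attention mask, then L2-normalise,
/// mirroring the ADR-011 pooling description on plain `Vec<f32>` rows.
fn pool_and_normalize(token_embs: &[Vec<f32>], mask: &[u32]) -> Vec<f32> {
    let dim = token_embs[0].len();
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (emb, &m) in token_embs.iter().zip(mask) {
        if m == 1 {
            for (p, v) in pooled.iter_mut().zip(emb) {
                *p += *v;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count.max(1.0);
    }
    // L2 normalisation: cosine similarity then reduces to a dot product.
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in pooled.iter_mut() {
            *p /= norm;
        }
    }
    pooled
}

fn main() {
    let tokens = vec![vec![1.0, 0.0], vec![3.0, 4.0], vec![9.0, 9.0]];
    let mask = [1, 1, 0]; // third token is padding, excluded by the mask
    let v = pool_and_normalize(&tokens, &mask);
    // mean of the two real tokens is [2.0, 2.0], normalised to ~[0.707, 0.707]
    assert!((v[0] - 0.7071).abs() < 1e-3 && (v[1] - 0.7071).abs() < 1e-3);
}
```

In the real implementation this runs inside `spawn_blocking`, so the `Mutex` around `BertModel` is only ever held on a blocking thread, never across an `.await`.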

Alternatives Considered

Alternative | Why rejected
fastembed (ONNX Runtime) | C++ dependency (onnxruntime-sys) breaks ARM64 cross-compilation; ~5 MB shared library
llama.cpp sidecar (all-MiniLM-L6-v2) | Network round-trip on hot path, extra container to manage
sentence-transformers (Python) | Crosses FFI boundary, adds Python runtime dependency
ort (raw ONNX Runtime bindings) | Same C++ dependency problem as fastembed

Consequences

  • Positive: Eliminates ~2–5 ms network latency per embedding call on the cache hot path.
  • Positive: Zero C/C++ dependencies — cargo build works on any platform without cmake or pre-built binaries.
  • Positive: Zero sidecar dependency for Level 1 — the minimal Dockerfile runs self-contained.
  • Positive: Model weights are auto-downloaded from Hugging Face Hub; reproducible builds.
  • Negative: First startup downloads model weights (~90 MB) if not pre-cached.
  • Negative: Mutex serialises concurrent embedding calls within a single process (acceptable at current scale; can be replaced with a pool of models if needed).

ADR-012: Pluggable Trait Provider (Hexagonal Architecture)

Status: Accepted
Date: 2025-06
Deciders: Core team
Relates to: ADR-003 (Embedded Candle), ADR-004 (Three Deployment Tiers)

Context

As Isartor grew from a single-process binary (Level 1) to a multi-tier deployment (Level 1 → 2 → 3), the cache and SLM router components became tightly coupled to their in-process implementations. Scaling to Level 3 (Kubernetes, multiple replicas) requires:

  1. Shared cache — in-process LRU caches are isolated per pod; cache hits are inconsistent, duplicating work.
  2. GPU-backed inference — in-process Candle inference is CPU-bound; Level 3 needs a dedicated GPU inference pool (vLLM / TGI) that can scale independently.

Hard-coding these choices into the firewall binary would require compile-time feature flags or code branching, making the binary non-portable across tiers.

Decision

Adopt the Ports & Adapters (Hexagonal Architecture) pattern:

  • Ports (src/core/ports.rs) — Define ExactCache and SlmRouter as async_trait traits (Send + Sync), representing the interfaces the firewall depends on.
  • Adapters (src/adapters/) — Provide concrete implementations:
    • InMemoryCache (ahash + LRU + parking_lot) and RedisExactCache for ExactCache
    • EmbeddedCandleRouter and RemoteVllmRouter for SlmRouter
  • Factory (src/factory.rs) — build_exact_cache(&config) and build_slm_router(&config, &http_client) read AppConfig.cache_backend and AppConfig.router_backend at startup and return the appropriate Box<dyn Trait>.
  • Configuration (src/config.rs) — CacheBackend enum (Memory | Redis) and RouterBackend enum (Embedded | Vllm) with associated connection URLs, selectable via ISARTOR__CACHE_BACKEND and ISARTOR__ROUTER_BACKEND env vars.

The same binary serves all three deployment tiers; the runtime behaviour is entirely configuration-driven.
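A condensed, synchronous sketch of the port → adapter → factory split (the real traits are async_trait and include Redis/vLLM adapters; names follow the ADR, logic is simplified):

```rust
use std::collections::HashMap;

/// Port: the cache interface the firewall depends on (simplified to sync;
/// the real trait in src/core/ports.rs is async).
trait ExactCache: Send + Sync {
    fn get(&mut self, key: &str) -> Option<String>;
    fn put(&mut self, key: String, value: String);
}

/// Adapter: in-process implementation (the real one layers ahash + LRU).
struct InMemoryCache {
    map: HashMap<String, String>,
}

impl ExactCache for InMemoryCache {
    fn get(&mut self, key: &str) -> Option<String> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: String, value: String) {
        self.map.insert(key, value);
    }
}

/// Configuration enum mirroring AppConfig.cache_backend.
enum CacheBackend {
    Memory,
    Redis, // the real factory returns a RedisExactCache here
}

/// Factory: choose the adapter once at startup from configuration.
fn build_exact_cache(backend: &CacheBackend) -> Box<dyn ExactCache> {
    match backend {
        CacheBackend::Memory => Box::new(InMemoryCache { map: HashMap::new() }),
        CacheBackend::Redis => unimplemented!("Redis adapter elided in this sketch"),
    }
}

fn main() {
    let mut cache = build_exact_cache(&CacheBackend::Memory);
    cache.put("2 + 2?".into(), "4".into());
    assert_eq!(cache.get("2 + 2?").as_deref(), Some("4"));
}
```

Because the caller only sees `Box<dyn ExactCache>`, swapping Memory for Redis is purely a configuration change, which is what lets one binary serve all three tiers.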

Alternatives Considered

Alternative | Why rejected
Compile-time feature flags (#[cfg(feature = "redis")]) | Produces different binaries per tier; complicates CI and container builds
Service mesh sidecar (Envoy filter for caching) | Adds infrastructure complexity; cache logic is domain-specific
Plugin system (dynamic .so loading) | Over-engineered; dyn Trait with compile-time-known variants is simpler
Runtime scripting (Lua / Wasm policy) | Unnecessary indirection; Rust trait dispatch is zero-cost

Consequences

  • Positive: One binary, all tiers — only env vars change between Level 1 (embedded everything) and Level 3 (Redis + vLLM).
  • Positive: Horizontal scalability — with cache_backend=redis, all pods share the same cache; with router_backend=vllm, GPU inference scales independently.
  • Positive: Testability — unit tests inject mock adapters via the trait interface.
  • Positive: Extensibility — adding a new backend (e.g., Memcached, Triton) requires only a new adapter implementing the trait.
  • Negative: Minor runtime overhead from dyn Trait dynamic dispatch (single vtable lookup per call — negligible vs. network I/O).
  • Negative: EmbeddedCandleRouter remains a skeleton; full candle-based classification requires the embedded-inference feature flag to be completed.

← Back to Architecture

AI Tool Integrations

Isartor is an OpenAI-compatible and Anthropic-compatible gateway that deflects repeated or simple prompts at Layer 1 (cache) and Layer 2 (local SLM) before they reach the cloud. Clients integrate by overriding their base URL to point at Isartor or by registering Isartor as an MCP server — no proxy, no MITM, no CA certificates.

Endpoints

Isartor's server defaults to: http://localhost:8080.

Authenticated chat endpoints:

Endpoint | Protocol | Path
Native Isartor (recommended for direct use) | Native | POST /api/chat / POST /api/v1/chat
OpenAI Models | OpenAI | GET /v1/models
OpenAI Chat Completions | OpenAI | POST /v1/chat/completions
Anthropic Messages | Anthropic | POST /v1/messages
Cache lookup / store (used by MCP clients) | Native | POST /api/v1/cache/lookup / POST /api/v1/cache/store

Authentication

Isartor can enforce a gateway key on authenticated routes when Layer 0 auth is enabled.

Supported headers:

  • X-API-Key: <gateway_api_key>
  • Authorization: Bearer <gateway_api_key> (useful for OpenAI/Anthropic-compatible clients)

By default, gateway_api_key is empty and auth is disabled (local-first). To enable gateway authentication, set ISARTOR__GATEWAY_API_KEY to a secret value. In production, always set a strong key.

Observability headers

All endpoints in the Deflection Stack include:

  • X-Isartor-Layer: l1a | l1b | l2 | l3 | l0
  • X-Isartor-Deflected: true if resolved locally (no cloud call)

Example: OpenAI-compatible request

curl -sS http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "2 + 2?"}
    ]
  }'

If gateway auth is enabled, also add:

-H 'Authorization: Bearer your-secret-key'

Many OpenAI-compatible SDKs and coding agents also call:

curl -sS http://localhost:8080/v1/models

OpenAI-compatible agent features supported by Isartor:

  • GET /v1/models for model discovery
  • stream: true on /v1/chat/completions with OpenAI-style SSE and data: [DONE]
  • tools, tool_choice, functions, and function_call passthrough
  • tool_calls preserved in provider responses
  • tool-aware exact cache keys, with semantic cache skipped for tool-use flows
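One way to picture a tool-aware exact cache key: hash the serialized tool definitions alongside the model and messages, so the same prompt with a different tool set never collides. This is an illustrative scheme, not Isartor's actual key derivation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative tool-aware cache key: tools are part of the key, so a
/// tool-bearing request never reuses a plain-chat cache entry.
fn cache_key(model: &str, messages_json: &str, tools_json: Option<&str>) -> u64 {
    let mut h = DefaultHasher::new();
    model.hash(&mut h);
    messages_json.hash(&mut h);
    tools_json.unwrap_or("").hash(&mut h); // absent tools hash as empty
    h.finish()
}

fn main() {
    let msgs = r#"[{"role":"user","content":"2 + 2?"}]"#;
    let plain = cache_key("gpt-4o-mini", msgs, None);
    let with_tools = cache_key("gpt-4o-mini", msgs, Some(r#"[{"name":"calculator"}]"#));
    // Same prompt, different tool set -> different cache entries.
    assert_ne!(plain, with_tools);
}
```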

Example: Anthropic-compatible request

curl -sS http://localhost:8080/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "claude-sonnet-4-6",
    "system": "Be concise.",
    "max_tokens": 100,
    "messages": [
      {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of France?"}]
      }
    ]
  }'

If gateway auth is enabled, also add:

-H 'X-API-Key: your-secret-key'

Supported tools at a glance

Tool | Command | Mechanism
GitHub Copilot CLI | isartor connect copilot | MCP server (cache-only)
GitHub Copilot in VS Code | isartor connect copilot-vscode | Managed settings.json debug overrides
OpenClaw | isartor connect openclaw | Managed OpenClaw provider config (openclaw.json)
OpenCode | isartor connect opencode | Global provider + auth config
Claude Code + GitHub Copilot | isartor connect claude-copilot | Claude base URL override + Copilot-backed L3
Claude Code | isartor connect claude | Base URL override
Claude Desktop | isartor connect claude-desktop | Managed local MCP registration (isartor mcp)
Cursor IDE | isartor connect cursor | Base URL override + MCP
OpenAI Codex CLI | isartor connect codex | Base URL override
Gemini CLI | isartor connect gemini | Base URL override
Antigravity | isartor connect antigravity | Base URL override
Generic / other tools | isartor connect generic | Base URL override

Add --gateway-api-key <key> to any connect command only if you have explicitly enabled gateway auth.

Connection status

# Check all connected clients
isartor connect status

Global troubleshooting

Symptom | Cause | Fix
"connection refused" | Isartor not running | Run isartor up first
Gateway returns 401 | Auth enabled but key not configured | Add --gateway-api-key to connect command

For tool-specific troubleshooting, see each integration page above.

GitHub Copilot CLI

Copilot CLI integrates via an MCP (Model Context Protocol) server that Isartor registers as a stdio subprocess. Isartor also exposes the same MCP tools over Streamable HTTP at http://localhost:8080/mcp/ for editors and web agents that prefer HTTP/SSE transport. Both transports expose two tools:

  • isartor_chat — cache lookup only. Returns the cached answer on hit (L1a exact or L1b semantic), or an empty string on miss. On a miss, Copilot uses its own LLM to answer — Isartor never routes through its configured L3 provider for Copilot traffic.
  • isartor_cache_store — stores a prompt/response pair in Isartor's cache so future identical or similar prompts are deflected locally.

This design means Copilot still owns the conversation loop, while Isartor acts as a transparent cache layer that reduces redundant cloud calls. On a cache hit, Isartor returns the cached text and does not call its own Layer 3 provider. Copilot CLI may still emit its normal final-answer event after the tool result, but that is a Copilot-side render step rather than an Isartor L3 forward.

Prerequisites

  • Isartor installed (curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh)
  • GitHub Copilot CLI installed

Step-by-step setup

# 1. Start Isartor
isartor up --detach

# 2. Register the MCP server with Copilot CLI
isartor connect copilot

# 3. Start Copilot normally — plain chat prompts will use Isartor cache first
copilot

How it works

  1. isartor connect copilot adds an isartor entry to ~/.copilot/mcp-config.json
  2. isartor connect copilot also installs a managed instruction block in ~/.copilot/copilot-instructions.md
  3. When Copilot CLI starts, it launches isartor mcp as a stdio subprocess and loads the Isartor instruction block
  4. The MCP server exposes isartor_chat (cache lookup) and isartor_cache_store (cache write)
  5. For plain conversational prompts, Copilot now prefers this flow:
    • Call isartor_chat with the user's prompt
    • Cache hit: return the cached answer immediately, verbatim
    • Cache miss: answer with Copilot's own model, then call isartor_cache_store
  6. When Copilot calls isartor_chat:
    • Cache hit (L1a exact or L1b semantic): returns the cached answer instantly
    • Cache miss: returns empty → Copilot uses its own LLM
  7. After Copilot gets an answer from its LLM, it can call isartor_cache_store to populate the cache for future requests
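The lookup-or-answer-then-store flow above can be condensed into pseudologic (illustrative only — this is not Copilot's implementation, just the contract the two MCP tools create):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Illustrative client-side flow around the two MCP tools: `lookup` stands
/// in for isartor_chat, `store` for isartor_cache_store.
fn answer(
    prompt: &str,
    lookup: impl Fn(&str) -> Option<String>,
    own_llm: impl Fn(&str) -> String,
    store: impl Fn(&str, &str),
) -> String {
    if let Some(hit) = lookup(prompt) {
        return hit; // cache hit: emit verbatim, no LLM call, no L3 forward
    }
    let reply = own_llm(prompt); // miss: Copilot answers with its own model
    store(prompt, &reply);       // then populates the cache for next time
    reply
}

fn main() {
    let cache = RefCell::new(HashMap::<String, String>::new());
    let reply = answer(
        "capital of France",
        |p| cache.borrow().get(p).cloned(),
        |_| "Paris".to_string(),
        |p, r| { cache.borrow_mut().insert(p.to_string(), r.to_string()); },
    );
    assert_eq!(reply, "Paris");
    // A repeat of the same prompt would now hit the cache.
    assert!(cache.borrow().contains_key("capital of France"));
}
```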

HTTP/SSE MCP endpoint

Isartor now exposes the same MCP tool surface at /mcp/ using Streamable HTTP:

  • POST /mcp/ — client → server JSON-RPC
  • GET /mcp/ — server → client SSE stream
  • DELETE /mcp/ — explicit session teardown

The HTTP transport uses the MCP Mcp-Session-Id header after initialize, and supports both JSON responses and SSE responses for POST requests. A minimal editor config looks like:

{"servers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}

Important note about "still going to L3"

If you inspect Copilot CLI JSON traces, you may still see a normal final_answer event after isartor_chat returns a cache hit. That does not mean Isartor forwarded the prompt to its own Layer 3 provider. The important signal is Isartor's own log and headers:

  • Cache lookup: L1a exact hit or Cache lookup: L1b semantic hit
  • no new Layer 3: Forwarding to LLM via Rig entry for that prompt

In other words:

  • An Isartor L3 call on a cache hit would be a bug
  • A Copilot final-answer render after a tool hit is expected CLI behavior

Isartor now installs stricter Copilot instructions that tell Copilot to emit the cached tool result verbatim on cache hits, without paraphrasing or extra tool calls.

Cache endpoints (used by MCP internally)

The MCP server calls these HTTP endpoints on the Isartor gateway:

# Cache lookup — returns cached response or 204 No Content
curl -X POST http://localhost:8080/api/v1/cache/lookup \
  -H "Content-Type: application/json" \
  -d '{"prompt": "capital of France"}'

# Cache store — saves a prompt/response pair
curl -X POST http://localhost:8080/api/v1/cache/store \
  -H "Content-Type: application/json" \
  -d '{"prompt": "capital of France", "response": "The capital of France is Paris."}'

Custom gateway URL

# If Isartor runs on a non-default port
isartor connect copilot --gateway-url http://localhost:18080

Disconnecting

isartor connect copilot --disconnect

This removes the isartor entry from ~/.copilot/mcp-config.json. It also removes the managed Isartor block from ~/.copilot/copilot-instructions.md.

Troubleshooting

Symptom | Cause | Fix
Copilot has no isartor_chat tool | MCP server not registered | Run isartor connect copilot
Copilot works but bypasses cache | Isartor instructions not installed or custom instructions disabled | Run isartor connect copilot again and do not launch Copilot with --no-custom-instructions
Cache never hits for Copilot | Responses not stored after LLM answers | Ask Copilot to call isartor_cache_store after answering

GitHub Copilot in VS Code

Route GitHub Copilot's code completions and chat requests in VS Code through Isartor, so repetitive prompts are deflected locally via the L1a/L1b cache layers. This reduces cloud API calls, lowers latency for repeated patterns, and gives you per-tool visibility in isartor stats.

How is this different from Copilot CLI? The Copilot CLI integration uses an MCP server for the terminal-based copilot command. This page covers VS Code — the editor extension that provides inline completions and Copilot Chat.


Prerequisites

  • Isartor installed and running (isartor up --detach)
  • GitHub Copilot VS Code extension installed (requires a Copilot subscription)
  • An LLM provider API key configured in Isartor for Layer 3 fallback (isartor set-key -p openai or similar)

Step 1 — Start Isartor

# Install (if not already)
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh

# Configure your LLM provider key (OpenAI, Anthropic, Azure, etc.)
isartor set-key -p openai

# Start the gateway in the background
isartor up --detach

Verify it's running:

curl http://localhost:8080/health
# {"status":"ok", ...}

Step 2 — Configure VS Code

Recommended:

isartor connect copilot-vscode

This command:

  • auto-detects the VS Code settings.json path on macOS, Linux, and Windows
  • backs up the original file to settings.json.isartor-backup
  • writes the three github.copilot.advanced.debug.* overrides
  • refuses to write if Isartor is not reachable

Manual alternative: open your VS Code User Settings (JSON) and add:

{
  "github.copilot.advanced": {
    "debug.overrideProxyUrl": "http://localhost:8080",
    "debug.overrideCAPIUrl": "http://localhost:8080/v1",
    "debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
  }
}

Setting | What It Does
debug.overrideProxyUrl | Routes Copilot's main API traffic through Isartor
debug.overrideCAPIUrl | Overrides the completions API endpoint (inline suggestions)
debug.chatOverrideProxyUrl | Overrides the Copilot Chat endpoint

Custom port? If Isartor runs on a different port, replace 8080 with your port everywhere above.

Step 3 — Restart VS Code

Close and reopen VS Code (or run "Developer: Reload Window" from the command palette). Copilot will now route requests through Isartor.

Step 4 — Verify

Open any code file and trigger a Copilot suggestion (start typing a comment or function). Then check Isartor's stats:

isartor stats

You should see requests flowing through Isartor's layers. Repeat the same prompt and you'll see L1a cache hits — Isartor deflected the duplicate without a cloud call.

For per-tool breakdown:

isartor stats --by-tool

Copilot VS Code traffic appears as copilot in the tool column (identified from the User-Agent header). The table now includes requests, cache hits/misses, average latency, retries, errors, and L1a/L1b savings.


How It Works

VS Code Copilot Extension
        │
        ▼ (HTTP request to overrideProxyUrl)
   ┌─────────────┐
   │   Isartor    │
   │  Gateway     │
   │              │
   │  L1a ──► L1b ──► L3 (Cloud)
   │  hit?    hit?    forward
   └─────────────┘
        │
        ▼
   Response back to VS Code
  1. Copilot sends completion/chat requests to Isartor instead of GitHub's servers
  2. L1a Exact Cache — sub-millisecond hit for identical prompts (< 1 ms)
  3. L1b Semantic Cache — catches variations of the same prompt (1–5 ms)
  4. L3 Cloud — only genuinely new prompts reach your configured LLM provider
  5. Response flows back to Copilot transparently — no change to the editor UX
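The cascade above, condensed into a toy function (in Isartor these layers are Axum middlewares, and the L1b semantic step is elided here):

```rust
use std::collections::HashMap;

/// Toy sketch of the deflection order: exact cache first, then (elided)
/// semantic cache, then cloud forward. Returns which layer answered.
fn resolve(prompt: &str, exact_cache: &HashMap<String, String>) -> (&'static str, String) {
    if let Some(hit) = exact_cache.get(prompt) {
        return ("l1a", hit.clone()); // deflected locally, no cloud call
    }
    // L1b semantic lookup (embedding + cosine threshold) would run here.
    ("l3", format!("<forwarded to cloud: {prompt}>")) // only new prompts survive
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("2 + 2?".to_string(), "4".to_string());
    assert_eq!(resolve("2 + 2?", &cache), ("l1a", "4".to_string()));
    assert_eq!(resolve("what is Rust?", &cache).0, "l3");
}
```

The first element of the returned tuple corresponds to the X-Isartor-Layer header a client would observe.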

Disconnecting

isartor connect copilot-vscode --disconnect

If a backup exists, Isartor restores it. Otherwise it removes only the three managed github.copilot.advanced.debug.* keys.

Benefits

Benefit | How
Reduced API costs | Repetitive completions are served from cache
Lower latency | Cache hits return in < 5 ms vs hundreds of ms for cloud
Visibility | isartor stats --by-tool shows Copilot request counts, cache hit/miss savings, latency, retries, and errors
Privacy | Cached prompts never leave your machine on repeat requests
Model flexibility | Route L3 to any provider (OpenAI, Anthropic, Azure, local Ollama)

Advanced Configuration

Use a specific LLM provider for Layer 3

Isartor routes surviving (non-cached) prompts to your configured L3 provider. You can use any supported provider:

# OpenAI (default)
isartor set-key -p openai

# Anthropic
isartor set-key -p anthropic

# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure
export ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com
export ISARTOR__AZURE_DEPLOYMENT_ID=<deployment>
isartor set-key -p azure

Adjust cache sensitivity

Tune the semantic cache threshold to control how similar a prompt must be to trigger an L1b hit:

# Default: 0.92 (higher = stricter matching)
export ISARTOR__SIMILARITY_THRESHOLD=0.90
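The threshold is a cosine-similarity cutoff. Since the embeddings are L2-normalised (see ADR-011), the comparison reduces to a dot product — a minimal sketch of the decision, assuming unit-length vectors:

```rust
/// L1b hit test: cosine similarity against the configured threshold.
/// With unit-length embeddings, cosine is just the dot product.
fn is_semantic_hit(a: &[f32], b: &[f32], threshold: f32) -> bool {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    dot >= threshold
}

fn main() {
    let cached = [0.6, 0.8];
    let near = [0.59, 0.81]; // nearly identical direction -> hit at 0.92
    let far = [0.8, 0.6];    // same length, different direction -> miss at 0.99
    assert!(is_semantic_hit(&cached, &near, 0.92));
    assert!(!is_semantic_hit(&cached, &far, 0.99));
}
```

Lowering the threshold (e.g. 0.90) deflects more paraphrases but risks serving a cached answer to a subtly different question; raising it does the opposite.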

See the Configuration Reference for all available options.

Enable monitoring

export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317

See Metrics & Tracing for Grafana dashboards and OTel setup.


Known Limitations

  1. Copilot Chat override — The debug.chatOverrideProxyUrl setting may not be fully respected by all versions of the Copilot Chat extension (tracking issue). Inline code completions (debug.overrideCAPIUrl) work reliably. If chat requests bypass Isartor, try using the global VS Code proxy setting as a workaround:

    {
      "http.proxy": "http://localhost:8080"
    }
    

    Note: This routes all VS Code HTTP traffic through Isartor, not just Copilot. Use a PAC script if you need finer control.

  2. Authentication — These debug.* settings bypass Copilot's normal GitHub authentication. Isartor handles the LLM provider auth via its own API key configuration. Your Copilot subscription is still required for the extension to activate.

  3. Extension updates — VS Code may update the Copilot extension automatically. If the proxy stops working after an update, verify the settings are still present in settings.json and restart VS Code.


Troubleshooting

Symptom | Cause | Fix
Copilot suggestions stop working | Isartor not running | Run isartor up --detach and verify with curl http://localhost:8080/health
isartor connect copilot-vscode cannot find VS Code settings | Non-standard editor config path | Fall back to manual JSON editing
No requests in isartor stats | Settings not applied | Verify settings.json has the override block, then reload VS Code
Chat works but completions don't | Wrong endpoint URL | Ensure debug.overrideCAPIUrl ends with /v1
Completions work but chat doesn't | Known chat override limitation | Add debug.chatOverrideProxyUrl or use http.proxy as workaround
Auth errors from Copilot | Missing L3 provider key | Run isartor set-key -p openai (or your provider)
High latency on first request | Model loading | First request downloads the embedding model (~90 MB); subsequent requests are fast

Reverting

To stop routing Copilot through Isartor, remove the github.copilot.advanced block from your settings.json and reload VS Code:

// Remove this entire block:
"github.copilot.advanced": {
    "debug.overrideProxyUrl": "http://localhost:8080",
    "debug.overrideCAPIUrl": "http://localhost:8080/v1",
    "debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
}

OpenClaw

OpenClaw is a self-hosted AI assistant that can connect chat apps and agent workflows to LLM providers. The pragmatic Isartor setup is to register Isartor as a custom OpenAI-compatible OpenClaw provider and let OpenClaw use that provider as its primary model path.

This is similar in spirit to the LiteLLM integration docs, but with one important difference:

  • LiteLLM is a multi-model gateway and catalog
  • Isartor is a prompt firewall / gateway that currently exposes the upstream model you configured in Isartor itself

So the best OpenClaw UX is: configure the model in Isartor first, then let isartor connect openclaw mirror that model into OpenClaw's provider config.

Pragmatic setup

# 1. Configure Isartor's upstream provider/model
isartor set-key -p groq
isartor check

# 2. Start Isartor
isartor up --detach

# 3. Make sure OpenClaw is onboarded
openclaw onboard --install-daemon

# 4. Register Isartor as an OpenClaw provider
isartor connect openclaw

# 5. Verify OpenClaw sees the provider/model and auth
openclaw models status --agent main --probe

# 6. Smoke test a prompt
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"

What isartor connect openclaw does

It writes or updates your OpenClaw config (default: ~/.openclaw/openclaw.json) with:

  1. models.providers.isartor
  2. a single managed model entry matching Isartor's current upstream model
  3. agents.defaults.model.primary = "isartor/<your-model>"
  4. the main / default agent model override when one is present
  5. a refresh of stale per-agent models.json registries so OpenClaw regenerates them with the latest baseUrl and apiKey

Example generated provider block:

models: {
  providers: {
    isartor: {
      baseUrl: "http://localhost:8080/v1",
      apiKey: "isartor-local",
      api: "openai-completions",
      models: [
        {
          id: "openai/gpt-oss-120b",
          name: "Isartor (openai/gpt-oss-120b)"
        }
      ]
    }
  }
}

And the default model becomes:

agents: {
  defaults: {
    model: {
      primary: "isartor/openai/gpt-oss-120b"
    }
  }
}

Base URL and auth path

OpenClaw must talk to Isartor's OpenAI-compatible /v1 surface.

  • Correct base URL: http://localhost:8080/v1
  • Wrong base URL: http://localhost:8080

Why this matters:

  • OpenClaw appends /chat/completions for OpenAI-compatible custom providers
  • Isartor exposes that route as /v1/chat/completions
  • using the root gateway URL can produce 404 errors such as gateway unknown L0 via chat/completions

isartor connect openclaw writes the /v1 path for you, so prefer the connector over hand-editing the provider block.

Reconnecting after changing the gateway API key

OpenClaw stores custom-provider state in two places:

  1. ~/.openclaw/openclaw.json
  2. per-agent models.json registries under ~/.openclaw/agents/<agentId>/agent/

Those per-agent registries can keep an old apiKey or baseUrl even after openclaw.json changes, which is why you can still see 401 errors after fixing the key in the top-level config.

The supported fix is simply:

isartor connect openclaw --gateway-api-key <your-key>
openclaw models status --agent main --probe
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"

The connector now refreshes openclaw.json, updates the main / default agent model override, and removes stale per-agent models.json files so OpenClaw regenerates them with the new auth.

Why this is the best fit

The upstream LiteLLM/OpenClaw docs assume the gateway can expose a multi-model catalog and route among many providers behind one endpoint.

Isartor is different today:

  • OpenClaw talks to Isartor over the OpenAI-compatible /v1/chat/completions surface
  • Isartor forwards using its configured upstream provider/model
  • OpenClaw model refs should therefore mirror the model currently configured in Isartor

That means:

  • if you change Isartor's provider/model later, rerun isartor connect openclaw
  • if you change Isartor's gateway API key later, rerun isartor connect openclaw --gateway-api-key ...
  • do not expect isartor/openai/... and isartor/anthropic/... fallbacks to behave like LiteLLM provider switching unless Isartor itself grows multi-provider routing later

Options

Flag | Default | Description
--model | Isartor's configured upstream model | Override the single model ID exposed to OpenClaw
--config-path | auto-detected | Path to openclaw.json
--gateway-api-key | (none) | Gateway key if auth is enabled

Files written

  • ~/.openclaw/openclaw.json — managed OpenClaw provider config
  • ~/.openclaw/agents/<agentId>/agent/models.json — regenerated by OpenClaw after Isartor clears stale custom-provider caches
  • openclaw.json.isartor-backup — backup, when a prior config existed

Disconnecting

isartor connect openclaw --disconnect

If a backup exists, Isartor restores it. Otherwise it removes only the managed models.providers.isartor entry and related isartor/... default-model references.

For day-to-day use:

  1. Pick your upstream provider with isartor set-key
  2. Validate with isartor check
  3. Keep Isartor running with isartor up --detach
  4. Let OpenClaw use isartor/<configured-model> as its primary model
  5. Use openclaw models status --agent main --probe whenever you want to confirm what OpenClaw currently sees

If you later switch Isartor from, for example, Groq to OpenAI or Azure:

isartor set-key -p openai
isartor check
isartor connect openclaw

That refreshes OpenClaw's provider model to match the new Isartor config.

What Isartor does for OpenClaw

Benefit | How
Cache repeated agent prompts | OpenClaw often repeats the same context and system framing. L1a exact cache resolves those instantly.
Catch paraphrases | L1b semantic cache resolves similar follow-ups locally when safe.
Compress repeated instructions | L2.5 trims repeated context before cloud fallback.
Keep one stable gateway URL | OpenClaw only needs isartor/<model> while Isartor owns the upstream provider configuration.
Observability | isartor stats --by-tool lets you track OpenClaw cache hits, latency, and savings.

Troubleshooting

Symptom | Cause | Fix
OpenClaw cannot reach the provider | Isartor not running | Run isartor up --detach first
OpenClaw onboarding/custom provider returns 404 | Base URL points at http://localhost:8080 instead of http://localhost:8080/v1 | Use isartor connect openclaw or update the custom provider base URL to end with /v1
OpenClaw still shows the old model | Isartor model changed after initial connect | Re-run isartor connect openclaw
Auth errors (401) after reconnecting | OpenClaw is still using stale per-agent provider state | Re-run isartor connect openclaw --gateway-api-key <key> so Isartor refreshes openclaw.json and clears stale per-agent models.json registries
"Model is not allowed" | OpenClaw allowlist still excludes the managed model | Re-run isartor connect openclaw so the managed model is re-added to the allowlist

OpenCode

OpenCode integrates via a global provider config and auth store. Isartor registers an isartor provider backed by @ai-sdk/openai-compatible and points it at the gateway's /v1 endpoint.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure OpenCode
isartor connect opencode

# 3. Start OpenCode
opencode

How it works

  1. isartor connect opencode backs up ~/.config/opencode/opencode.json
  2. It writes an isartor provider definition to that config file
  3. It writes a matching auth entry to ~/.local/share/opencode/auth.json
  4. The provider uses @ai-sdk/openai-compatible with baseURL set to http://localhost:8080/v1
  5. If gateway auth is disabled, Isartor writes a dummy local auth key so OpenCode still has a credential to send

Files written

  • ~/.config/opencode/opencode.json
  • ~/.local/share/opencode/auth.json

Backups:

  • ~/.config/opencode/opencode.json.isartor-backup
  • ~/.local/share/opencode/auth.json.isartor-backup

Disconnecting

isartor connect opencode --disconnect

Disconnect restores the original files from backup when available. If no backup exists, it removes only the managed isartor entries.

Troubleshooting

Symptom | Cause | Fix
OpenCode cannot see the Isartor provider | Config file not written | Run isartor connect opencode again
OpenCode shows auth errors | Gateway auth mismatch | Re-run with --gateway-api-key or update ISARTOR__GATEWAY_API_KEY
OpenCode cannot list models | /v1/models unreachable | Verify curl http://localhost:8080/v1/models

Claude Code + GitHub Copilot

Use Claude Code's editor and CLI workflow while routing Layer 3 through your existing GitHub Copilot subscription via Isartor. Repeated prompts are still deflected first by Isartor's L1a/L1b cache and, if enabled, the L2 SLM, so cache hits consume zero Copilot quota.

Current status: experimental. The connector and Copilot-backed L3 routing are implemented, but Isartor's Anthropic compatibility surface is still text-oriented today. That means plain Claude Code prompting works best right now; more advanced Anthropic tool-use blocks may still require follow-up work.

Prerequisites

  1. Active GitHub Copilot subscription
  2. Isartor installed
  3. Claude Code installed
# Install Isartor
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh

# Install Claude Code
npm install -g @anthropic-ai/claude-code

Setup

Path A — Browser / device-flow login (default)

isartor connect claude-copilot

This starts GitHub device-flow authentication, stores the OAuth token locally, updates ./isartor.toml, and writes Claude Code settings into ~/.claude/settings.json.

When no --github-token is provided, Isartor defaults to browser/device-flow OAuth. It will reuse a previously saved OAuth credential, but it will not silently reuse legacy saved PATs.

Path B — Use an existing GitHub token

isartor connect claude-copilot --github-token ghp_YOUR_TOKEN

Use --github-token only when you intentionally want to override the default browser login flow with a PAT.

Path C — Choose custom Copilot models

isartor connect claude-copilot \
  --github-token ghp_YOUR_TOKEN \
  --model gpt-4.1 \
  --fast-model gpt-4o-mini

After the command finishes, restart Isartor so the new Layer 3 config is loaded:

isartor stop
isartor up --detach
claude

One-click smoke test

./scripts/claude-copilot-smoke-test.sh
# or
make smoke-claude-copilot

The script automatically:

  • reads the saved Copilot credential from ~/.isartor/providers/copilot.json
  • picks a supported Copilot-backed model
  • starts a temporary Isartor instance
  • runs a Claude Code smoke prompt
  • prints an ROI demo showing L3, L1a exact-hit, and L1b semantic-hit behavior

What the command changes

~/.claude/settings.json

The command writes these Claude Code environment overrides:

| Setting | Value | Purpose |
|---|---|---|
| ANTHROPIC_BASE_URL | http://localhost:8080 (or your gateway URL) | Routes Claude Code to Isartor |
| ANTHROPIC_AUTH_TOKEN | dummy or your gateway key | Satisfies Claude Code auth requirements |
| ANTHROPIC_MODEL | selected model | Primary Copilot-backed model |
| ANTHROPIC_DEFAULT_SONNET_MODEL | selected model | Default Claude Code Sonnet mapping |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | fast model | Lightweight/background tasks |
| DISABLE_NON_ESSENTIAL_MODEL_CALLS | 1 | Reduce unnecessary quota burn |
| CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC | 1 | Compatibility flag across Claude Code versions |
| ENABLE_TOOL_SEARCH | true | Preserve Claude Code tool search behavior |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS | 16000 | Stay under Copilot's output cap |

./isartor.toml

The command also sets Isartor Layer 3 to use the Copilot provider:

llm_provider = "copilot"
external_llm_model = "claude-sonnet-4.5"
external_llm_api_key = "ghp_..."
external_llm_url = "https://api.githubcopilot.com/chat/completions"

Available Copilot-backed models

| Model | Type | Notes |
|---|---|---|
| claude-sonnet-4.5 | Balanced | Good default for Claude-style behavior |
| claude-haiku-4.5 | Fast | Lower-latency Claude-family option |
| gpt-4o | Strong general model | Good for broad coding tasks |
| gpt-4o-mini | Fast + cheap | Good default fast/background model |
| gpt-4.1 | Included | Safe fallback choice |
| o3-mini | Reasoning | Higher-latency reasoning model |

What Isartor saves

Without Isartor:

Every Claude Code prompt -> GitHub Copilot API -> quota consumed

With Isartor:

Repeated prompt (L1a hit) -> served locally -> 0 Copilot quota
Similar prompt (L1b hit)  -> served locally -> 0 Copilot quota
Novel prompt (cache miss) -> forwarded to Copilot -> quota consumed

Example session:

100 Claude Code prompts
  40 exact repeats      -> L1a -> 0 quota
  25 semantic variants  -> L1b -> 0 quota
  35 novel prompts      -> L3  -> 35 Copilot-backed requests

Result: 35 routed requests instead of 100
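
The savings arithmetic above can be checked with a few lines of shell (numbers taken from the example session):

```shell
# Tally from the example session: 100 prompts, 40 L1a hits, 25 L1b hits
total=100; l1a_hits=40; l1b_hits=25

routed=$((total - l1a_hits - l1b_hits))                  # prompts that reach Copilot
deflected_pct=$(( (l1a_hits + l1b_hits) * 100 / total ))

echo "routed=$routed deflected=${deflected_pct}%"        # routed=35 deflected=65%
```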

Limitations

  • GitHub Copilot output is capped; Isartor writes CLAUDE_CODE_MAX_OUTPUT_TOKENS=16000
  • The current /v1/messages compatibility path is still text-oriented, so some advanced Anthropic tool-use flows may not yet behave exactly like direct Anthropic routing
  • Extended-thinking / provider-specific Anthropic features are not preserved
  • If the chosen Copilot model is unavailable to your account, requests fail instead of silently falling back to Anthropic

Disconnect

isartor connect claude-copilot --disconnect

This restores the backed-up ~/.claude/settings.json and ./isartor.toml.

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| Authentication failed | Browser login incomplete, token invalid, or expired | Re-run isartor connect claude-copilot and finish GitHub sign-in |
| No active GitHub Copilot subscription | Signed-in GitHub user has no active Copilot seat / entitlement | Check https://github.com/features/copilot and enterprise seat assignment |
| Model not found | Account cannot access the requested model | Retry with --model gpt-4.1 |
| Claude Code still uses Anthropic | Isartor not restarted after config change | Run isartor stop && isartor up --detach |
| 401 from Isartor | Gateway auth enabled but Claude settings use dummy token | Re-run with the gateway key available in local config |
| Tool call failed | Current Anthropic compatibility is still text-first | Use simpler prompting for now; full tool-use compatibility is follow-up work |

Claude Code

Claude Code integrates via ANTHROPIC_BASE_URL, pointing all API traffic at Isartor's /v1/messages endpoint.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure Claude Code
isartor connect claude

# 3. Claude Code now routes through Isartor automatically

How it works

  1. isartor connect claude sets ANTHROPIC_BASE_URL in ~/.claude/settings.json
  2. Claude Code sends requests to Isartor's /v1/messages endpoint
  3. Isartor forwards to the Anthropic API as Layer 3 when the request is not deflected
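
The resulting override in ~/.claude/settings.json is minimal; a sketch, assuming the default local gateway address (the env-block shape follows Claude Code's settings file format):

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8080"
  }
}
```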

Disconnecting

isartor connect claude --disconnect

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Claude not routing through Isartor | settings.json not updated | Run isartor connect claude |

Claude Desktop

Claude Desktop integrates with Isartor via a local MCP server. The recommended setup is isartor connect claude-desktop, which registers isartor mcp in Claude Desktop's config so Claude can use Isartor's cache-aware tools.

Step-by-step setup

# 1. Start Isartor
isartor up --detach

# 2. Register Isartor in Claude Desktop
isartor connect claude-desktop

# 3. Restart Claude Desktop

After restart, open Claude Desktop's tools/connectors UI and confirm the isartor MCP server is present.

What the connector writes

isartor connect claude-desktop updates Claude Desktop's local MCP config and keeps a backup next to it.

Typical config paths:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux (best-effort path): ~/.config/Claude/claude_desktop_config.json

The generated MCP entry looks like:

{
  "mcpServers": {
    "isartor": {
      "command": "/path/to/isartor",
      "args": ["mcp"],
      "env": {
        "ISARTOR_GATEWAY_URL": "http://localhost:8080"
      }
    }
  }
}

If gateway auth is enabled, the connector also writes ISARTOR__GATEWAY_API_KEY into the managed server env block.

What Claude Desktop gets

The Isartor MCP server exposes these tools:

  • isartor_chat — cache-first lookup through Isartor's L1a/L1b layers
  • isartor_cache_store — store prompt/response pairs back into Isartor after a cache miss

This gives Claude Desktop a low-risk integration path that fits the current MCP model without relying on Anthropic base-URL overrides.

Advanced / manual setup

If you prefer to edit the config yourself, add a local MCP server entry that runs:

isartor mcp

Isartor also exposes MCP over HTTP/SSE at:

http://localhost:8080/mcp/

That remote MCP surface is useful for clients that support HTTP/SSE registration directly, but isartor connect claude-desktop currently uses the local stdio flow because it is the most reliable Claude Desktop path today.

Disconnecting

isartor connect claude-desktop --disconnect

This restores the backup when one exists; otherwise it removes only the managed mcpServers.isartor entry.

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Claude Desktop shows no isartor tools | Claude Desktop was not restarted | Quit and relaunch Claude Desktop after isartor connect claude-desktop |
| Tools appear but calls fail | Isartor is not running | Start the gateway with isartor up --detach |
| MCP server is present but unauthorized | Gateway auth enabled | Re-run isartor connect claude-desktop --gateway-api-key <key> |
| You want the original config back | Managed config needs rollback | Run isartor connect claude-desktop --disconnect |

Note on desktop extensions

Claude Desktop now supports desktop extensions, but Isartor's first-class integration in this repo uses the simpler local MCP server flow today. That keeps setup light and works with the existing isartor mcp implementation immediately.

Cursor IDE

Cursor IDE integrates via the OpenAI Base URL override in Cursor's model settings, and optionally via MCP server registration for tool-based integration.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure Cursor
isartor connect cursor

# 3. Open Cursor → Settings → Cursor Settings → Models
# 4. Enable "Override OpenAI Base URL" and enter: http://localhost:8080/v1
# 5. Paste the API key shown in the connect output
# 6. Add a custom model name (e.g. gpt-4o) and enable it
# 7. Use Ask or Plan mode (Agent mode doesn't support custom keys yet)

How it works

  1. isartor connect cursor writes a reference env file to ~/.isartor/env/cursor.sh
  2. It also registers Isartor as an MCP server in ~/.cursor/mcp.json
  3. In Cursor, override the OpenAI Base URL to point at Isartor's /v1 endpoint
  4. Cursor can use Isartor's GET /v1/models endpoint to discover the configured model
  5. All chat completions requests route through Isartor's L1/L2/L3 deflection stack
  6. Isartor supports OpenAI streaming SSE, tool-call passthrough, and HTTP/SSE MCP at http://localhost:8080/mcp/ for compatible Cursor workflows
  7. Cursor's Ask and Plan modes are supported; Agent mode requires native keys

Cursor's generated MCP config points at:

{"mcpServers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}

Disconnecting

isartor connect cursor --disconnect

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Cursor not routing through Isartor | Base URL override not set | Open Cursor Settings → Models → enable Override OpenAI Base URL |
| Cursor model picker is empty | Cursor cannot reach model discovery | Verify http://localhost:8080/v1/models is reachable from Cursor |

OpenAI Codex CLI

OpenAI Codex CLI integrates via OPENAI_BASE_URL, routing requests through Isartor's OpenAI-compatible /v1 surface, including /v1/chat/completions and /v1/models.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure Codex
isartor connect codex

# 3. Source the env file
source ~/.isartor/env/codex.sh

# 4. Run Codex
codex --model o3-mini

How it works

  1. isartor connect codex writes OPENAI_BASE_URL and OPENAI_API_KEY to ~/.isartor/env/codex.sh
  2. Codex can query /v1/models to discover the configured model
  3. Codex sends chat requests to Isartor's /v1/chat/completions endpoint
  4. Isartor supports OpenAI streaming SSE and tool-call passthrough for compatible agent workflows
  5. Isartor forwards to the configured upstream as Layer 3 when not deflected
  6. Use --model to select any model name configured in your L3 provider
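
The generated env file boils down to two exports; a sketch, assuming the default gateway address and a placeholder key (the real file may contain additional comments):

```shell
# ~/.isartor/env/codex.sh (sketch)
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="my-secret-key"   # gateway key, or a placeholder if auth is disabled
```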

Disconnecting

isartor connect codex --disconnect

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Codex not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/codex.sh in your shell |
| Codex cannot list models | /v1/models unreachable or auth mismatch | Test curl http://localhost:8080/v1/models with the same auth settings |

Gemini CLI

Gemini CLI integrates via GEMINI_API_BASE_URL, routing requests through Isartor's gateway.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure Gemini CLI
isartor connect gemini

# 3. Source the env file
source ~/.isartor/env/gemini.sh

# 4. Run Gemini CLI
gemini

How it works

  1. isartor connect gemini writes GEMINI_API_BASE_URL and GEMINI_API_KEY to ~/.isartor/env/gemini.sh
  2. Gemini CLI sends requests to Isartor's gateway
  3. Isartor forwards to the configured upstream as Layer 3 when not deflected

Disconnecting

isartor connect gemini --disconnect

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Gemini not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/gemini.sh in your shell |

Antigravity

Antigravity integrates via an OpenAI-compatible base URL override. Isartor generates a shell env file that sets OPENAI_BASE_URL and OPENAI_API_KEY to route all LLM calls through the Deflection Stack.

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Generate the env file
isartor connect antigravity

# 3. Activate the environment
source ~/.isartor/env/antigravity.sh

# 4. Start Antigravity
# (it will now use Isartor as its OpenAI endpoint)

How it works

  1. isartor connect antigravity creates ~/.isartor/env/antigravity.sh
  2. The file exports OPENAI_BASE_URL pointing at http://localhost:8080/v1
  3. It exports OPENAI_API_KEY with your gateway key (or a local placeholder)
  4. When sourced, Antigravity sends all OpenAI-compatible calls through Isartor

Files written

  • ~/.isartor/env/antigravity.sh

Disconnecting

isartor connect antigravity --disconnect

Then restart your shell to clear the exported variables.

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Connection refused | Isartor not running | Run isartor up first |
| Auth errors (401) | Gateway auth enabled | Re-run with --gateway-api-key |
| Env not applied | Shell not sourced | Run source ~/.isartor/env/antigravity.sh |

Generic Connector

For tools not explicitly supported, use the generic connector to generate an env script that sets the tool's base URL environment variable to point at Isartor.

Compatible tools

The generic connector works with any OpenAI-compatible tool, including:

  • Windsurf
  • Zed
  • Cline
  • Roo Code
  • Aider
  • Continue
  • Antigravity (also available via isartor connect antigravity)
  • OpenClaw (also available via isartor connect openclaw)
  • Any other tool that reads an OPENAI_BASE_URL or similar environment variable

OpenAI-compatible features exposed by Isartor include:

  • GET /v1/models for model discovery
  • POST /v1/chat/completions
  • stream: true SSE responses
  • tool/function calling passthrough (tools, tool_choice, functions, tool_calls)

Step-by-step setup

# 1. Start Isartor
isartor up

# 2. Configure the tool (example: Windsurf)
isartor connect generic \
  --tool-name Windsurf \
  --base-url-var OPENAI_BASE_URL \
  --api-key-var OPENAI_API_KEY

# 3. Source the env file
source ~/.isartor/env/windsurf.sh

# 4. Start the tool

Arguments

| Flag | Required | Description |
|---|---|---|
| --tool-name | yes | Display name (also used for env script filename) |
| --base-url-var | yes | Env var the tool reads for its API base URL |
| --api-key-var | no | Env var the tool reads for its API key |
| --no-append-v1 | no | Don't append /v1 to the gateway URL |

Disconnecting

isartor connect generic \
  --tool-name Windsurf \
  --base-url-var OPENAI_BASE_URL \
  --disconnect

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Tool not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/<tool>.sh in your shell |
| Tool says no models are available | It expects OpenAI model discovery | Verify it can reach http://localhost:8080/v1/models |

Level 1 — Minimal Deployment

Single static binary, embedded candle inference + in-process candle sentence embeddings, zero C/C++ dependencies.

This guide covers deploying Isartor as a standalone process — no sidecars, no Docker Compose, no orchestrator. The firewall binary embeds a Gemma-2-2B-IT GGUF model via candle for Layer 2 classification and uses candle's BertModel (sentence-transformers/all-MiniLM-L6-v2) for Layer 1 semantic cache embeddings — all entirely in-process, pure Rust.


When to Use Level 1

| ✅ Good Fit | ❌ Consider Level 2/3 Instead |
|---|---|
| €5–€20/month VPS (Hetzner, DigitalOcean, Linode) | GPU inference for generation quality |
| ARM edge devices (Raspberry Pi 5, Jetson Nano) | More than ~50 concurrent users |
| Air-gapped / offline environments | Production observability stack required |
| Development & local experimentation | Multi-node high-availability |
| CI/CD test runners | |

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 2 GB free | 4 GB free |
| Disk | 2 GB (model download) | 5 GB |
| CPU | 2 cores | 4+ cores (AVX2 recommended) |
| Rust (build from source) | 1.75+ | Latest stable |
| OS | Linux (x86_64 / aarch64), macOS | Ubuntu 22.04 LTS |

Memory budget: Gemma-2-2B Q4_K_M ≈ 1.5 GB, candle BertModel ≈ 90 MB, tokenizer ≈ 4 MB, firewall runtime ≈ 50 MB. Total: ~1.7 GB resident.
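
As a sanity check, the component budgets sum to roughly the quoted resident figure:

```shell
# Component budgets in GB: Gemma Q4_K_M + candle BertModel + tokenizer + runtime
total=$(awk 'BEGIN { printf "%.2f", 1.5 + 0.09 + 0.004 + 0.05 }')
echo "${total} GB"   # 1.64 GB
```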


Option A: Install a Pre-built Binary

The fastest way to get started is to use the pre-built, cross-platform binaries generated by the CI/CD pipeline.

Install via script:

curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh

Windows (PowerShell):

irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex

This script detects your OS and processor architecture, downloads the correct release binary, and adds it to your PATH automatically.


Option B: Build from Source

1. Clone & Build

git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release

The release binary is at ./target/release/isartor (~5 MB statically linked).

2. Configure Environment

Create a minimal .env file or export variables directly:

# Required — your cloud LLM key for Layer 3 fallback
export ISARTOR__EXTERNAL_LLM_API_KEY="sk-..."

# Optional — override defaults
export ISARTOR__GATEWAY_API_KEY="my-secret-key"
export ISARTOR__HOST_PORT="0.0.0.0:8080"
export ISARTOR__LLM_PROVIDER="openai"          # openai | azure | anthropic | xai
export ISARTOR__EXTERNAL_LLM_MODEL="gpt-4o-mini"

# Cache mode — "both" enables exact + semantic cache. Semantic embeddings
# are generated in-process via candle BertModel — no sidecar needed.
export ISARTOR__CACHE_MODE="both"

# Pluggable backends — Level 1 uses the defaults (no change needed):
#   ISARTOR__CACHE_BACKEND=memory     — in-process LRU (ahash + parking_lot)
#   ISARTOR__ROUTER_BACKEND=embedded  — in-process Candle GGUF SLM
# These are ideal for a single-process deployment with zero dependencies.

3. Start the Firewall

./target/release/isartor up

On first start, the embedded classifier will auto-download the Gemma-2-2B-IT GGUF model from Hugging Face Hub (~1.5 GB). Subsequent starts load from the local cache (~/.cache/huggingface/).

INFO  isartor > Listening on 0.0.0.0:8080
INFO  isartor::layer1::embeddings > Initialising candle TextEmbedder (all-MiniLM-L6-v2)...
INFO  isartor::layer1::embeddings > TextEmbedder ready (~90 MB BertModel loaded)
INFO  isartor::services::local_inference > Downloading model from mradermacher/gemma-2-2b-it-GGUF...
INFO  isartor::services::local_inference > Model loaded (1.5 GB), ready for inference

4. Verify

# Health check
curl http://localhost:8080/health

# Test the firewall
curl -s http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -H "X-API-Key: my-secret-key" \
  -d '{"prompt": "Hello, how are you?"}' | jq .

Option C: Docker (Single Container)

For environments where you prefer a container but don't need a full Compose stack.

Build the Image

cd isartor
docker build -t isartor:latest -f docker/Dockerfile .

Run

docker run -d \
  --name isartor \
  -p 8080:8080 \
  -e ISARTOR__GATEWAY_API_KEY="my-secret-key" \
  -e ISARTOR__EXTERNAL_LLM_API_KEY="sk-..." \
  -e ISARTOR__CACHE_MODE="both" \
  -e HF_HOME=/tmp/huggingface \
  -v isartor-models:/tmp/huggingface \
  isartor:latest

Note: The -v flag mounts a named volume for the Hugging Face cache so the model downloads persist across container restarts.

The official Docker image runs as non-root and uses HF_HOME=/tmp/huggingface to ensure the cache is writable.


Option D: systemd Service (Production Linux)

For long-running production deployments on bare metal or VPS.

1. Install the Binary

# Build
cargo build --release

# Install to /usr/local/bin
sudo cp target/release/isartor /usr/local/bin/isartor
sudo chmod +x /usr/local/bin/isartor

2. Create a System User

sudo useradd --system --no-create-home --shell /usr/sbin/nologin isartor

3. Create Environment File

sudo mkdir -p /etc/isartor
sudo tee /etc/isartor/env <<'EOF'
ISARTOR__HOST_PORT=0.0.0.0:8080
ISARTOR__GATEWAY_API_KEY=your-production-key
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__CACHE_MODE=both
ISARTOR__CACHE_BACKEND=memory
ISARTOR__ROUTER_BACKEND=embedded
RUST_LOG=isartor=info
EOF
sudo chmod 600 /etc/isartor/env

4. Create systemd Unit

sudo tee /etc/systemd/system/isartor.service <<'EOF'
[Unit]
Description=Isartor Prompt Firewall
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=isartor
Group=isartor
EnvironmentFile=/etc/isartor/env
ExecStart=/usr/local/bin/isartor
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/cache/isartor

[Install]
WantedBy=multi-user.target
EOF

5. Create Model Cache Directory

sudo mkdir -p /var/cache/isartor
sudo chown isartor:isartor /var/cache/isartor

6. Enable & Start

sudo systemctl daemon-reload
sudo systemctl enable isartor
sudo systemctl start isartor

# Check status
sudo systemctl status isartor
sudo journalctl -u isartor -f

Model Pre-Caching (Air-Gapped / Offline)

If the deployment target has no internet access, pre-download the model on a connected machine and copy it over.

On the Connected Machine

# Install huggingface-cli
pip install huggingface-hub

# Download the GGUF file
huggingface-cli download mradermacher/gemma-2-2b-it-GGUF \
  gemma-2-2b-it.Q4_K_M.gguf \
  --local-dir ./models

# Also grab the tokenizer (from the base model)
huggingface-cli download google/gemma-2-2b-it \
  tokenizer.json \
  --local-dir ./models

Transfer to Target

scp -r ./models/ user@target-host:/var/cache/isartor/

By default, hf-hub uses ~/.cache/huggingface/. In the official Docker image, Isartor sets HF_HOME=/tmp/huggingface (non-root safe). Set HF_HOME or ISARTOR_HF_CACHE_DIR to point to your pre-cached directory if needed.
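
Putting that together for the systemd layout above, a sketch of the addition to /etc/isartor/env on an air-gapped host (note that hf-hub may expect its usual hub/ subdirectory structure inside the cache directory rather than bare files):

```shell
# Air-gapped host: point the model loader at the pre-copied cache
HF_HOME=/var/cache/isartor
```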


Level 1 Configuration Reference

These are the most relevant ISARTOR__* variables for Level 1 deployments. For the full reference, see the Configuration Reference.

| Variable | Default | Level 1 Notes |
|---|---|---|
| ISARTOR__HOST_PORT | 0.0.0.0:8080 | Bind address |
| ISARTOR__GATEWAY_API_KEY | "" | Set to enable gateway auth |
| ISARTOR__CACHE_MODE | both | both recommended — candle BertModel provides in-process semantic embeddings |
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-process Level 1 |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle GGUF SLM — zero external dependencies |
| ISARTOR__CACHE_TTL_SECS | 300 | Cache TTL in seconds |
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries per cache |
| ISARTOR__LLM_PROVIDER | openai | openai · azure · anthropic · xai |
| ISARTOR__EXTERNAL_LLM_API_KEY | (empty) | Required for Layer 3 fallback |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Cloud LLM model name |
| ISARTOR__ENABLE_MONITORING | false | Enable for stdout OTel (no collector needed) |

Embedded Classifier Defaults (Compiled)

| Setting | Default Value | Description |
|---|---|---|
| repo_id | mradermacher/gemma-2-2b-it-GGUF | HF repo for the GGUF model |
| gguf_filename | gemma-2-2b-it.Q4_K_M.gguf | Model file (~1.5 GB) |
| max_classify_tokens | 20 | Token limit for classification |
| max_generate_tokens | 256 | Token limit for simple task execution |
| temperature | 0.0 | Greedy decoding for classification |
| repetition_penalty | 1.1 | Avoids degenerate loops |

Performance Expectations

| Metric | Typical Value (4-core x86_64) |
|---|---|
| Cold start (model download) | 30–120 s (depends on bandwidth; ~1.5 GB Gemma + ~90 MB candle BertModel) |
| Warm start (cached model) | 3–8 s |
| Classification latency | 50–200 ms |
| Simple task execution | 200–2000 ms |
| Firewall overhead (no inference) | < 1 ms |
| Memory (steady state) | ~1.6 GB |
| Binary size | ~5 MB |

Upgrading to Level 2

When your traffic outgrows Level 1, the migration path is straightforward:

  1. Add the generation sidecar — set ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081 (replaces embedded candle with the more powerful Phi-3-mini on GPU).
  2. Optionally add an embedding sidecar — set ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082 (only needed for external embedding inference; the default L1b semantic cache already uses in-process candle BertModel).
  3. Deploy via Docker Compose — see Level 2 — Sidecar Deployment.

Note: The pluggable backend defaults (cache_backend=memory, router_backend=embedded) remain appropriate for Level 2 single-host deployments. You only need to switch to cache_backend=redis and router_backend=vllm at Level 3 when scaling horizontally.

No code changes required — only environment variables and infrastructure.

Level 2 — Sidecar Deployment

Split architecture: Isartor firewall + llama.cpp generation sidecar on a single host.

This guide covers deploying Isartor with a dedicated AI sidecar for generation. The firewall delegates Layer 2 inference to a lightweight llama.cpp container via HTTP, while Layer 1 semantic cache embeddings run in-process via candle BertModel (no embedding sidecar required). The overall stack runs on a single machine via Docker Compose.


When to Use Level 2

| ✅ Good Fit | ❌ Consider Level 1 or Level 3 |
|---|---|
| Single host with GPU (NVIDIA, AMD) | No GPU available → Level 1 embedded candle |
| Want GPU-accelerated Layer 2 generation | Multi-node scaling → Level 3 Kubernetes |
| Want full observability stack (Jaeger, Grafana) | Budget VPS (< 4 GB RAM) → Level 1 |
| Development with production-like topology | Auto-scaling inference pools → Level 3 |
| 10–100 concurrent users | > 100 concurrent users → Level 3 |

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 10 GB | 20 GB (model cache) |
| CPU | 4 cores | 8+ cores |
| GPU (optional) | NVIDIA with 4 GB VRAM | NVIDIA with 8+ GB VRAM |
| Docker | 24.0+ | Latest |
| Docker Compose | v2.20+ | Latest |
| NVIDIA Container Toolkit (GPU) | Latest | Latest |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Single Host                              │
│                                                                 │
│  ┌─────────────┐    ┌───────────────────┐    ┌──────────────┐  │
│  │   Client     │───▶│  Isartor Firewall │    │  Jaeger UI   │  │
│  │             │    │  :8080             │    │  :16686      │  │
│  └─────────────┘    │  (candle L1        │    └──────────────┘  │
│                     │   embeddings       │                      │
│                     │   built-in)        │                      │
│                     └──┬────────────────┘                       │
│                        │                                        │
│              HTTP :8081│                                        │
│                        ▼                                        │
│               ┌────────────┐                  ┌──────────────┐ │
│               │ slm-gen    │                  │  Grafana     │ │
│               │ Phi-3-mini │                  │  :3000       │ │
│               │ (llama.cpp)│                  └──────────────┘ │
│               └────────────┘                                    │
│                                               ┌──────────────┐ │
│               ┌─────────────────────────┐     │  Prometheus  │ │
│               │    OTel Collector :4317  │────▶│  :9090       │ │
│               └─────────────────────────┘     └──────────────┘ │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Optional: slm-embed :8082 (llama.cpp)                    │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Services

| Service | Image | Port | Purpose | Memory Limit |
|---|---|---|---|---|
| gateway | isartor:latest (built) | 8080 | Prompt Firewall (includes candle BertModel for Layer 1 embeddings) | 256 MB |
| slm-generation | ghcr.io/ggml-org/llama.cpp:server | 8081 | Phi-3-mini-4k (Q4_K_M) — intent classification + generation | 4 GB |
| slm-embedding (optional) | ghcr.io/ggml-org/llama.cpp:server | 8082 | all-MiniLM-L6-v2 (Q8_0) — external embedding sidecar (default uses in-process candle) | 512 MB |
| otel-collector | otel/opentelemetry-collector-contrib:0.96.0 | 4317 | OTLP gRPC receiver | 128 MB |
| jaeger | jaegertracing/all-in-one:1.55 | 16686 | Distributed tracing UI | 256 MB |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics storage (7d retention) | 256 MB |
| grafana | grafana/grafana:10.4.0 | 3000 | Dashboards | 256 MB |

Quick Start (CPU Only)

1. Clone the Repository

git clone https://github.com/isartor-ai/isartor.git
cd isartor/docker

2. Configure Layer 3 (Optional)

Layers 0–2 work without a cloud LLM key. If you want Layer 3 fallback:

cp .env.full.example .env.full

Edit .env.full and set your provider:

ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...

3. Start the Full Stack

docker compose -f docker-compose.sidecar.yml up --build

First launch downloads model files (~1.5 GB for Phi-3 + ~50 MB for MiniLM). Subsequent starts use the cached isartor-slm-models volume.

4. Wait for Health Checks

The firewall waits for both sidecars to become healthy before starting:

docker compose -f docker-compose.sidecar.yml ps

All services should show healthy or running.

5. Verify

# Health check
curl http://localhost:8080/healthz

# Test the firewall
curl -s http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?"}' | jq .

# If you enabled gateway auth, add:
#   -H "X-API-Key: your-secret-key"

# Check traces in Jaeger
open http://localhost:16686

GPU Passthrough (NVIDIA)

To enable GPU acceleration for the llama.cpp sidecars:

1. Install NVIDIA Container Toolkit

# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

2. Add GPU Resources to Compose

Create a docker-compose.gpu.override.yml:

services:
  slm-generation:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # The default --n-gpu-layers 99 in docker-compose.sidecar.yml
    # already offloads all layers to GPU when available.

  slm-embedding:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

3. Start with GPU Override

docker compose \
  -f docker-compose.sidecar.yml \
  -f docker-compose.gpu.override.yml \
  up --build

Expected GPU Impact

| Metric | CPU Only (8-core) | GPU (RTX 3060 12 GB) |
|---|---|---|
| Phi-3 classification | 500–2000 ms | 30–100 ms |
| Phi-3 generation (256 tokens) | 5–15 s | 0.5–2 s |
| MiniLM embedding | 20–50 ms | 5–10 ms |

Available Compose Files

The docker/ directory contains several Compose configurations for different use cases:

| File | Description | Provider |
|---|---|---|
| docker-compose.sidecar.yml | Recommended. Full stack with llama.cpp sidecars + observability | Any (configurable) |
| docker-compose.yml | Legacy stack with Ollama (heavier) | OpenAI |
| docker-compose.azure.yml | Legacy stack with Ollama, pre-configured for Azure OpenAI | Azure |
| docker-compose.observability.yml | Observability-focused stack (Ollama + OTel + Jaeger + Grafana) | Azure |

We recommend docker-compose.sidecar.yml for all new deployments. The llama.cpp sidecars are ~30 MB each vs. Ollama's ~1.5 GB.


Environment Variables (Level 2 Specific)

These variables are relevant to the sidecar architecture. For the full reference, see the Configuration Reference.

Firewall ↔ Sidecar Communication

| Variable | Default | Description |
|---|---|---|
| ISARTOR__LAYER2__SIDECAR_URL | http://127.0.0.1:8081 | Generation sidecar URL (use Docker service name in Compose: http://slm-generation:8081) |
| ISARTOR__LAYER2__MODEL_NAME | phi-3-mini | Model name for OpenAI-compatible requests |
| ISARTOR__LAYER2__TIMEOUT_SECONDS | 30 | HTTP timeout for generation calls |
| ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL | http://127.0.0.1:8082 | Embedding sidecar URL — optional (default uses in-process candle; use http://slm-embedding:8082 in Compose) |
| ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME | all-minilm | Embedding model name (sidecar only) |
| ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS | 10 | HTTP timeout for embedding calls (sidecar only) |

Pluggable Backends

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-host Docker Compose |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle SLM classification — no external dependency |

Scalability note: These defaults are appropriate for Level 2 (single host). When moving to Level 3 (multi-replica K8s), switch to cache_backend=redis and router_backend=vllm for horizontal scaling.

Cache

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MODE | both | Use both — in-process candle BertModel provides semantic embeddings at all tiers |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for cache hits |
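As a rough intuition for how the 0.85 threshold gates cache hits, here is a minimal sketch in plain Python. The 3-dimensional vectors are toys for illustration only — the real MiniLM embeddings are 384-dimensional and computed in-process by candle:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.85  # ISARTOR__SIMILARITY_THRESHOLD

# Toy "embeddings": similar prompts point in similar directions.
price = [0.9, 0.4, 0.1]     # "Price?"
cost = [0.85, 0.45, 0.15]   # "Cost?"
weather = [0.1, 0.2, 0.95]  # "Will it rain tomorrow?"

print(cosine_similarity(price, cost) >= THRESHOLD)     # True  — cache hit
print(cosine_similarity(price, weather) >= THRESHOLD)  # False — cache miss
```

Lowering the threshold deflects more prompts locally at the risk of returning a cached answer for a question that only looks similar.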

Observability

| Variable | Default | Description |
|---|---|---|
| ISARTOR__ENABLE_MONITORING | true (in Compose) | Enable OTel trace/metric export |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://otel-collector:4317 | OTel Collector gRPC endpoint |

Operational Commands

Logs

# All services
docker compose -f docker-compose.sidecar.yml logs -f

# Firewall only
docker compose -f docker-compose.sidecar.yml logs -f gateway

# Sidecars
docker compose -f docker-compose.sidecar.yml logs -f slm-generation slm-embedding

Restart a Service

docker compose -f docker-compose.sidecar.yml restart gateway

Tear Down (Preserve Model Cache)

docker compose -f docker-compose.sidecar.yml down
# Models persist in the 'isartor-slm-models' volume

Tear Down (Clean Everything)

docker compose -f docker-compose.sidecar.yml down -v
# Removes all volumes including model cache — next start re-downloads models

View Model Cache Size

docker volume inspect isartor-slm-models

Networking Notes

  • All services share a Docker bridge network created by Compose.
  • The firewall references sidecars by Docker service name (slm-generation, slm-embedding), not localhost.
  • Only the firewall (8080), Jaeger UI (16686), Grafana (3000), and Prometheus (9090) are exposed to the host.
  • Sidecar ports (8081, 8082) are also exposed for debugging but can be removed in production by deleting the ports: mapping.

Scaling Within Level 2

Before moving to Level 3, you can push Level 2 further with these options:

| Optimisation | How |
|---|---|
| More GPU VRAM | Use larger quantisation (Q8_0 instead of Q4_K_M) for better quality |
| Bigger model | Swap Phi-3-mini for Phi-3-medium or Qwen2-7B in the Compose command |
| More cache | Increase ISARTOR__CACHE_MAX_CAPACITY and ISARTOR__CACHE_TTL_SECS |
| Faster embedding | Use nomic-embed-text (768-dim) for richer semantic matching |
| More concurrency | Scale horizontally with multiple firewall replicas behind a load balancer |

Upgrading to Level 3

When a single host is no longer sufficient:

  1. Extract the firewall into stateless Kubernetes pods (it's already stateless).
  2. Replace sidecars with an auto-scaling inference pool (vLLM, TGI, or Triton).
  3. Add an internal load balancer between firewall pods and the inference pool.
  4. Move observability to a managed solution (Datadog, Grafana Cloud, Azure Monitor).

See Level 3 — Enterprise Deployment for the full Kubernetes guide.

Level 3 — Enterprise Deployment

Fully decoupled microservices: stateless firewall pods + auto-scaling GPU inference pools.

This guide covers deploying Isartor on Kubernetes with Helm, horizontal pod autoscaling, dedicated GPU inference pools (vLLM or TGI), service mesh integration, and production-grade observability.


When to Use Level 3

| ✅ Good Fit | ❌ Overkill For |
|---|---|
| 100+ concurrent users | < 50 users → Level 2 Docker Compose |
| Multi-region / multi-zone HA | Single-machine development → Level 1 |
| Auto-scaling GPU inference | No GPU budget → Level 1 embedded candle |
| Compliance: mTLS, audit logs, RBAC | Hobby projects / PoCs |
| Cost optimisation via scale-to-zero | Teams without Kubernetes experience |

Architecture

                        ┌─────────────────────┐
                        │    Ingress / ALB    │
                        │  (TLS termination)  │
                        └──────────┬──────────┘
                                   │
                    ┌──────────────┴───────────────┐
                    │     Firewall Deployment      │
                    │      (N stateless pods)      │
                    │                              │
                    │   ┌────────┐    ┌────────┐   │
                    │   │ Pod 1  │    │ Pod N  │   │
                    │   │isartor │    │isartor │   │
                    │   └────────┘    └────────┘   │
                    │                              │
                    │  HPA: CPU / custom metrics   │
                    └──────────────┬───────────────┘
                                   │
                          Internal ClusterIP
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
     ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
     │ Inference Pool  │  │ Embedding Pool  │  │ Cloud LLM       │
     │ (vLLM / TGI)    │  │ (TEI / llama)   │  │ (OpenAI / etc)  │
     │                 │  │                 │  │ (Layer 3 only)  │
     │ GPU Nodes       │  │ CPU/GPU Nodes   │  └─────────────────┘
     │ HPA on GPU util │  │ HPA on RPS      │
     └─────────────────┘  └─────────────────┘

Component Summary

| Component | Replicas | Scaling Metric | Resource |
|---|---|---|---|
| Firewall | 2–20 | CPU utilisation / request rate | CPU nodes |
| Inference Pool (vLLM) | 1–N | GPU utilisation / queue depth | GPU nodes |
| Embedding Pool (TEI) | 1–N | Requests per second | CPU or GPU nodes (optional; default uses in-process candle) |
| OTel Collector | 1 (DaemonSet or Deployment) | n/a | CPU nodes |
| Ingress Controller | 1–2 | n/a | CPU nodes |

Prerequisites

| Requirement | Details |
|---|---|
| Kubernetes cluster | 1.28+ (EKS, GKE, AKS, or bare metal) |
| Helm | v3.12+ |
| kubectl | Matching cluster version |
| GPU nodes (for inference pool) | NVIDIA GPU Operator installed, or GKE/EKS GPU node pools |
| Container registry | For pushing the Isartor firewall image |
| Ingress controller | nginx-ingress, Istio, or cloud ALB |

Step 1: Build & Push the Firewall Image

# Build
docker build -t your-registry.io/isartor:v0.1.0 -f docker/Dockerfile .

# Push
docker push your-registry.io/isartor:v0.1.0

Step 2: Namespace & Secrets

kubectl create namespace isartor

# Cloud LLM API key (Layer 3 fallback)
kubectl create secret generic isartor-llm-secret \
  --namespace isartor \
  --from-literal=api-key='sk-...'

# Firewall API key (Layer 0 auth)
kubectl create secret generic isartor-gateway-secret \
  --namespace isartor \
  --from-literal=gateway-api-key='your-production-key'

Step 3: Firewall Deployment

# k8s/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: isartor-gateway
  namespace: isartor
  labels:
    app: isartor-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: isartor-gateway
  template:
    metadata:
      labels:
        app: isartor-gateway
    spec:
      containers:
        - name: gateway
          image: your-registry.io/isartor:v0.1.0
          ports:
            - containerPort: 8080
              name: http
          env:
            - name: ISARTOR__HOST_PORT
              value: "0.0.0.0:8080"
            - name: ISARTOR__GATEWAY_API_KEY
              valueFrom:
                secretKeyRef:
                  name: isartor-gateway-secret
                  key: gateway-api-key
            # Pluggable backends — scaled for multi-replica K8s
            - name: ISARTOR__CACHE_BACKEND
              value: "redis"          # Shared cache across all firewall pods
            - name: ISARTOR__REDIS_URL
              value: "redis://redis.isartor:6379"
            - name: ISARTOR__ROUTER_BACKEND
              value: "vllm"           # GPU-backed vLLM inference pool
            - name: ISARTOR__VLLM_URL
              value: "http://isartor-inference:8081"
            - name: ISARTOR__VLLM_MODEL
              value: "gemma-2-2b-it"
            # Cache
            - name: ISARTOR__CACHE_MODE
              value: "both"
            - name: ISARTOR__SIMILARITY_THRESHOLD
              value: "0.85"
            - name: ISARTOR__CACHE_TTL_SECS
              value: "300"
            - name: ISARTOR__CACHE_MAX_CAPACITY
              value: "50000"
            # Inference pool (internal service)
            - name: ISARTOR__LAYER2__SIDECAR_URL
              value: "http://isartor-inference:8081"
            - name: ISARTOR__LAYER2__MODEL_NAME
              value: "phi-3-mini"
            - name: ISARTOR__LAYER2__TIMEOUT_SECONDS
              value: "30"
            # Embedding pool (optional — default uses in-process candle)
            - name: ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL
              value: "http://isartor-embedding:8082"
            - name: ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME
              value: "all-minilm"
            - name: ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS
              value: "10"
            # Layer 3 — Cloud LLM
            - name: ISARTOR__LLM_PROVIDER
              value: "openai"
            - name: ISARTOR__EXTERNAL_LLM_MODEL
              value: "gpt-4o-mini"
            - name: ISARTOR__EXTERNAL_LLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: isartor-llm-secret
                  key: api-key
            # Observability
            - name: ISARTOR__ENABLE_MONITORING
              value: "true"
            - name: ISARTOR__OTEL_EXPORTER_ENDPOINT
              value: "http://otel-collector.isartor:4317"
          resources:
            requests:
              cpu: "250m"
              memory: "128Mi"
            limits:
              cpu: "1000m"
              memory: "256Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: isartor-gateway
  namespace: isartor
spec:
  selector:
    app: isartor-gateway
  ports:
    - port: 8080
      targetPort: http
      name: http
  type: ClusterIP

Step 4: Inference Pool (vLLM)

vLLM provides high-throughput, GPU-optimised inference with continuous batching.

# k8s/inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: isartor-inference
  namespace: isartor
  labels:
    app: isartor-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: isartor-inference
  template:
    metadata:
      labels:
        app: isartor-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "microsoft/Phi-3-mini-4k-instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8081"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.9"
          ports:
            - containerPort: 8081
              name: http
          resources:
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 60
            periodSeconds: 10
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: isartor-inference
  namespace: isartor
spec:
  selector:
    app: isartor-inference
  ports:
    - port: 8081
      targetPort: http
      name: http
  type: ClusterIP

Alternative: Text Generation Inference (TGI)

Replace vLLM with TGI if you prefer Hugging Face's inference server:

containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - "--model-id"
      - "microsoft/Phi-3-mini-4k-instruct"
      - "--port"
      - "8081"
      - "--max-input-length"
      - "4096"
      - "--max-total-tokens"
      - "8192"

Alternative: llama.cpp Server (CPU / Light GPU)

For budget clusters without heavy GPU nodes:

containers:
  - name: llama-cpp
    image: ghcr.io/ggml-org/llama.cpp:server
    args:
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8081"
      - "--hf-repo"
      - "microsoft/Phi-3-mini-4k-instruct-gguf"
      - "--hf-file"
      - "Phi-3-mini-4k-instruct-q4.gguf"
      - "--ctx-size"
      - "4096"
      - "--n-gpu-layers"
      - "99"

Step 5: Embedding Pool (TEI) — Optional

Note: The gateway generates Layer 1 embeddings in-process via candle BertModel. This external embedding pool is optional for high-throughput deployments that want to offload embedding generation.

Text Embeddings Inference (TEI) provides optimised embedding generation.

# k8s/embedding-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: isartor-embedding
  namespace: isartor
  labels:
    app: isartor-embedding
spec:
  replicas: 2
  selector:
    matchLabels:
      app: isartor-embedding
  template:
    metadata:
      labels:
        app: isartor-embedding
    spec:
      containers:
        - name: tei
          image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
          args:
            - "--model-id"
            - "sentence-transformers/all-MiniLM-L6-v2"
            - "--port"
            - "8082"
          ports:
            - containerPort: 8082
              name: http
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: isartor-embedding
  namespace: isartor
spec:
  selector:
    app: isartor-embedding
  ports:
    - port: 8082
      targetPort: http
      name: http
  type: ClusterIP

Step 6: Horizontal Pod Autoscaler

Gateway HPA

# k8s/gateway-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: isartor-gateway-hpa
  namespace: isartor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: isartor-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120

Inference Pool HPA (Custom Metrics)

For GPU-based scaling, use custom metrics from Prometheus:

# k8s/inference-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: isartor-inference-hpa
  namespace: isartor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: isartor-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"

Note: GPU-based HPA requires the Prometheus Adapter or KEDA to expose GPU metrics to the HPA controller.


Step 7: Ingress

nginx-ingress Example

# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: isartor-ingress
  namespace: isartor
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.isartor.example.com
      secretName: isartor-tls
  rules:
    - host: api.isartor.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: isartor-gateway
                port:
                  number: 8080

Istio VirtualService (Service Mesh)

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: isartor-vs
  namespace: isartor
spec:
  hosts:
    - api.isartor.example.com
  gateways:
    - isartor-gateway
  http:
    - match:
        - uri:
            prefix: /api/
      route:
        - destination:
            host: isartor-gateway
            port:
              number: 8080
      timeout: 120s
      retries:
        attempts: 2
        perTryTimeout: 60s

Step 8: Apply Everything

# Apply in order
kubectl apply -f k8s/gateway-deployment.yaml
kubectl apply -f k8s/inference-deployment.yaml
kubectl apply -f k8s/embedding-deployment.yaml
kubectl apply -f k8s/gateway-hpa.yaml
kubectl apply -f k8s/inference-hpa.yaml
kubectl apply -f k8s/ingress.yaml

# Verify
kubectl get pods -n isartor
kubectl get svc -n isartor
kubectl get hpa -n isartor

Redis Configuration for Distributed Cache

Enterprise deployments use Redis to share the exact-match cache across all firewall pods. Configure the cache provider via environment variables or isartor.yaml:

Environment Variables

ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379

YAML Configuration

exact_cache:
  provider: redis
  redis_url: "redis://redis-cluster.svc:6379"
  # Optional: redis_db: 0

Kubernetes Topology with Redis

Deploy Redis as a StatefulSet within the cluster, accessible only via ClusterIP:

[Ingress]
   |
[Isartor Deployment] <--> [Redis StatefulSet]
   |
   +--> [vLLM Deployment (GPU nodes)]

  • Isartor pods scale horizontally for network I/O and cache hits.
  • Redis ensures cache consistency across all pods.
  • The vLLM GPU pool scales independently for inference throughput.

vLLM Configuration for SLM Routing

Enterprise deployments replace the embedded candle SLM with a remote vLLM inference pool for higher throughput. Configure the router backend via environment variables or isartor.yaml:

Environment Variables

ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm-openai.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct

YAML Configuration

slm_router:
  provider: remote_http
  remote_url: "http://vllm-openai.svc:8000"
  model: "meta-llama/Llama-3-8B-Instruct"

Docker Compose Example (Enterprise Sidecar)

For development or staging environments that mirror enterprise topology:

services:
  isartor:
    image: isartor-ai/isartor:latest
    ports:
      - "8080:8080"
    environment:
      - ISARTOR__CACHE_BACKEND=redis
      - ISARTOR__REDIS_URL=redis://redis-cluster:6379
      - ISARTOR__ROUTER_BACKEND=vllm
      - ISARTOR__VLLM_URL=http://vllm-openai:8000
      - ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
    depends_on:
      - redis
      - vllm-openai

  redis:
    image: redis:7
    ports:
      - "6379:6379"

  vllm-openai:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"

Observability in Level 3

For Kubernetes deployments, you have several options:

| Approach | Stack | Effort |
|---|---|---|
| Self-managed | OTel Collector DaemonSet → Jaeger + Prometheus + Grafana | Medium |
| Managed (AWS) | AWS X-Ray + CloudWatch + Managed Grafana | Low |
| Managed (GCP) | Cloud Trace + Cloud Monitoring | Low |
| Managed (Azure) | Azure Monitor + Application Insights | Low |
| Third-party | Datadog / New Relic / Grafana Cloud | Low |

The gateway exports traces and metrics via OTLP gRPC to whatever ISARTOR__OTEL_EXPORTER_ENDPOINT points at. See Metrics & Tracing for detailed setup.


Scalability Deep-Dive

Level 3 is designed for horizontal scaling. The Pluggable Trait Provider architecture ensures every component can scale independently:

Stateless Gateway Pods

The Isartor gateway binary is fully stateless when configured with cache_backend=redis and router_backend=vllm. All request-scoped state (cache, inference) is offloaded to external services, meaning:

  • Gateway pods scale linearly — add replicas via HPA without coordination overhead.
  • Zero warm-up penalty — new pods serve requests immediately (no model loading, no cache priming).
  • Rolling updates — deploy new versions with zero downtime; old and new pods share the same Redis cache.

Shared Cache via Redis

With ISARTOR__CACHE_BACKEND=redis:

| Benefit | Impact |
|---|---|
| Consistent hit rate | All pods read/write the same cache — no per-pod cold caches |
| Memory efficiency | Cache memory is centralised, not duplicated N times |
| Persistence | Redis AOF/RDB survives pod restarts |
| Cluster mode | Redis Cluster or ElastiCache provides sharded, HA caching |
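The consistent-hit-rate benefit can be sketched in a few lines of Python: two simulated pods either share one cache (the Redis case) or hold private caches (per-pod memory). This is purely illustrative — real traffic distribution and eviction are more complex:

```python
import hashlib

def key(prompt: str) -> str:
    """L1a exact-match cache key."""
    return hashlib.sha256(prompt.encode()).hexdigest()

def serve(requests, caches):
    """Round-robin requests across pods; count cache hits."""
    hits = 0
    for i, prompt in enumerate(requests):
        cache = caches[i % len(caches)]  # which pod's cache handles this request
        if key(prompt) in cache:
            hits += 1
        else:
            cache[key(prompt)] = "response"
    return hits

requests = ["same prompt"] * 4

shared = {}
print(serve(requests, [shared, shared]))  # 2 pods, one shared cache → 3 hits
print(serve(requests, [{}, {}]))          # 2 pods, private caches    → 2 hits
```

With N pods and private caches, every pod pays its own cold miss for each prompt; the shared backend pays exactly one.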

GPU Inference Pool (vLLM)

With ISARTOR__ROUTER_BACKEND=vllm:

| Benefit | Impact |
|---|---|
| Independent GPU scaling | Scale inference replicas separately from gateway pods |
| Continuous batching | vLLM's PagedAttention maximises GPU utilisation |
| Mixed hardware | Gateway runs on cheap CPU nodes; inference on GPU nodes |
| Cost control | Scale inference to zero when idle (KEDA + queue-depth trigger) |

Scaling Dimensions

| Dimension | Knob | Metric |
|---|---|---|
| Gateway replicas | HPA minReplicas / maxReplicas | CPU utilisation, request rate |
| Inference replicas | HPA on custom GPU metrics | GPU utilisation, queue depth |
| Cache capacity | ISARTOR__CACHE_MAX_CAPACITY | Cache hit rate, memory usage |
| Concurrency | HPA + replica scaling | P95 latency, request rate |
| Redis | Redis Cluster nodes | Key count, memory, eviction rate |

Cost Optimisation

| Strategy | Description |
|---|---|
| Spot / preemptible nodes | Use for inference pods (they're stateless and restart quickly) |
| Scale-to-zero | Use KEDA with queue-depth trigger to scale inference to 0 when idle |
| Right-size GPU | A100 80 GB for large models, T4/L4 for Phi-3-mini (4 GB VRAM is sufficient) |
| Shared GPU | NVIDIA MPS or MIG to run multiple inference pods per GPU |
| Semantic cache | Higher ISARTOR__CACHE_MAX_CAPACITY = fewer inference calls |
| Smaller quantisation | Q4_K_M uses less VRAM at marginal quality cost |

Security Checklist

  • TLS termination at ingress (cert-manager + Let's Encrypt or cloud certs)
  • mTLS between services (Istio / Linkerd / Cilium)
  • ISARTOR__GATEWAY_API_KEY from Kubernetes Secret, not plaintext
  • ISARTOR__EXTERNAL_LLM_API_KEY from Kubernetes Secret
  • Network policies restricting pod-to-pod communication
  • RBAC: least-privilege ServiceAccounts for each workload
  • Pod security standards: restricted or baseline
  • Image scanning (Trivy, Snyk) in CI pipeline
  • Audit logging enabled on the cluster

Downgrading to Level 2

If Kubernetes overhead doesn't justify the scale:

  1. Export your env vars from the Kubernetes ConfigMap/Secret.
  2. Map them into docker/.env.full.
  3. Run docker compose -f docker-compose.sidecar.yml up --build.

No code changes — the binary is identical across all three tiers.

Air-Gapped / Offline Deployment

Overview

Isartor is designed from the ground up for air-gapped operation. Its pure-Rust, statically compiled binary embeds all inference models at build time, requires no runtime dependencies, and validates licenses with an offline HMAC check — so Isartor itself does not initiate unsolicited telemetry or license calls to external services.
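The shape of an offline HMAC check can be sketched as follows — the token format and signing key here are hypothetical illustrations, not Isartor's actual license scheme:

```python
import hmac
import hashlib

SECRET = b"vendor-signing-key"  # hypothetical; real keys ship out-of-band

def sign(payload: str) -> str:
    """Issue a license token: payload plus an HMAC-SHA256 tag."""
    tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{tag}"

def verify(token: str) -> bool:
    """Validate entirely offline — no network call, no license server."""
    payload, _, tag = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

token = sign("org=acme;tier=enterprise;exp=2027-01-01")
print(verify(token))                                     # True
print(verify(token.replace("enterprise", "ultimate")))   # False — tampered
```

Because verification needs only the token and a locally held key, it works identically behind an air gap.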

The zero-phone-home guarantee applies to Isartor-managed network paths: the --offline flag disables L3 cloud routing and external observability backends at the application layer, and our CI phone-home audit test (see tests/phone_home_audit.rs) exercises these code paths on every commit.

Supported regulated industries: defense, healthcare (HIPAA), finance (SOX), and government (FedRAMP).


Pre-Deployment Checklist

Complete these steps before deploying Isartor in an air-gapped environment:

  1. Download the airgapped Docker image

    docker pull ghcr.io/isartor-ai/isartor:latest-airgapped
    

    This image bundles local copies of the L1b embedding models so that, in most setups, normal operation requires no external downloads. See Image Size Comparison for size details, and follow any additional configuration steps your environment requires to operate fully offline.

  2. Transfer to your air-gapped environment via your organisation's approved media transfer process (USB, air-gap data diode, etc.).

  3. Enable offline mode

    export ISARTOR__OFFLINE_MODE=true
    

    Alternatively, pass --offline on the command line:

    isartor --offline
    
  4. Disable L3 or point it at an internal LLM endpoint

    • For strictly air-gapped / zero-egress deployments, you must enable offline mode (step 3). Leaving ISARTOR__EXTERNAL_LLM_API_KEY unset alone does not prevent the gateway from attempting outbound L3 calls to the default external endpoint on cache misses.
    • To run fully local (cache + SLM only) with no outbound attempts, enable offline mode and leave ISARTOR__EXTERNAL_LLM_API_KEY unset.
    • To route L3 to a self-hosted model, see Connecting to an Internal LLM.
  5. Run isartor check to confirm zero external connections:

    isartor check
    

    Expected output (with offline mode active):

    Isartor Connectivity Audit
    ──────────────────────────
    Required (L3 cloud routing):
      → api.openai.com:443     [NOT CONFIGURED]
        (BLOCKED — offline mode active)
    
    Optional (observability / monitoring):
      → http://localhost:4317  [NOT CONFIGURED]
    
    Internal only (no external):
      → (in-memory cache — no network connection)  [CONFIGURED - internal]
    
    Zero hidden telemetry connections: ✓ VERIFIED
    Air-gap compatible: ✓ YES (L3 disabled or offline mode active)
    
  6. Run isartor audit verify (planned — see issue #3) to confirm the signed audit log is functioning correctly.


Connecting to an Internal LLM

In this configuration Isartor acts as a fully air-gapped deflection layer in front of an internal LLM. 100% of traffic stays inside the perimeter: L1a and L1b handle cached / semantically similar prompts locally, and only genuine cache misses are forwarded to your self-hosted model over the internal network.

# Route L3 to a self-hosted vLLM instance on the internal network.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai          # vLLM exposes an OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct

# Enable offline mode to block any accidental external connections.
export ISARTOR__OFFLINE_MODE=true

# Start the gateway.
isartor

Note: ISARTOR__EXTERNAL_LLM_URL sets the L3 endpoint URL. Point it at your internal vLLM or TGI server.

With this configuration:

  • L1a (exact cache) deflects duplicate prompts instantly (< 1 ms).
  • L1b (semantic cache) deflects semantically similar prompts (1–5 ms).
  • L3 forwards surviving cache-miss prompts to your internal vLLM.
  • Zero bytes leave the network perimeter.

Startup Status Banner

When offline mode is active, Isartor prints a status banner at startup so operators can confirm the configuration at a glance:

  ┌──────────────────────────────────────────────────────────────────┐
  │  [Isartor] OFFLINE MODE ACTIVE                                   │
  ├──────────────────────────────────────────────────────────────────┤
  │  ✓ L1a Exact Cache:     active                                   │
  │  ✓ L1b Semantic Cache:  active                                   │
  │  - L2 SLM Router:       disabled (ENABLE_SLM_ROUTER=false)       │
  │  ✗ L3 Cloud Logic:      DISABLED (offline mode)                  │
  │  ✗ Telemetry export:    DISABLED (external endpoints suppressed) │
  │  ✓ License validation:  offline HMAC check                       │
  └──────────────────────────────────────────────────────────────────┘

Environment Variables Reference

| Variable | Default | Description |
|---|---|---|
| ISARTOR__OFFLINE_MODE | false | Enable air-gap mode. Blocks L3 cloud calls. |
| ISARTOR__EXTERNAL_LLM_URL | (none) | Internal LLM endpoint (vLLM, TGI, etc.). |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Model name passed to the internal LLM. |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for L1b cache hits. Lower values increase local deflection. |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://localhost:4317 | OTel collector endpoint. External URLs are suppressed in offline mode. |

For the complete variable listing, see the Configuration Reference.


Image Size Comparison

| Image | Tag | Includes models | Compressed size |
|---|---|---|---|
| Base | latest | No (downloads on first run) | ~120 MB |
| Air-gapped | latest-airgapped | Yes (all-MiniLM-L6-v2 embedded) | ~210 MB |

The latest-airgapped image is approximately 90 MB larger due to the pre-bundled embedding model. This is the recommended image for any environment with restricted outbound internet access.


Compliance Notes

FedRAMP / NIST 800-53

This deployment posture supports the following NIST 800-53 controls:

| Control | Description | How Isartor Supports It |
|---|---|---|
| AU-2 | Audit Logging | Every prompt, deflection decision, and L3 call is logged as a structured JSON event with tracing spans. |
| SC-7 | Boundary Protection | ISARTOR__OFFLINE_MODE=true enforces a hard block on all outbound connections. The phone-home audit CI test verifies this. |
| SI-4 | Information System Monitoring | OpenTelemetry traces + Prometheus metrics provide real-time visibility into the deflection stack. Internal-only OTel endpoints are supported. |
| CM-6 | Configuration Settings | All settings are controlled via environment variables with documented defaults. No runtime code changes are needed. |

HIPAA

When ISARTOR__OFFLINE_MODE=true and L3 is pointed at an internal model:

  • PHI in prompts never leaves the network perimeter.
  • The L1b semantic cache computes embeddings in-process using a pure-Rust candle model — no external API calls.
  • Audit logs are written to stdout for ingestion by your internal SIEM.

Disclaimer

This document describes deployment architecture. The controls described above are architectural claims based on code behaviour — they are not a formal compliance certification. Consult your compliance team and engage a qualified assessor for formal FedRAMP authorization or HIPAA compliance review.


Further Reading

Configuration Reference

Complete reference for every Isartor configuration variable, CLI command, and provider option.


Configuration Loading Order

Isartor loads configuration in the following order (later sources override earlier ones):

  1. Compiled defaults — baked into the binary
  2. isartor.yaml — if present in the working directory or ~/.isartor/
  3. Environment variables — ISARTOR__... keys with double-underscore separators

Generate a starter config file with:

isartor init
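The double-underscore convention and the env-over-file precedence can be sketched as follows (illustrative, not Isartor's actual loader):

```python
import os

def parse_isartor_env(environ) -> dict:
    """ISARTOR__LAYER2__SIDECAR_URL -> {"layer2": {"sidecar_url": ...}}."""
    tree: dict = {}
    for key, value in environ.items():
        if not key.startswith("ISARTOR__"):
            continue
        parts = key[len("ISARTOR__"):].lower().split("__")
        node = tree
        for part in parts[:-1]:           # descend through nested sections
            node = node.setdefault(part, {})
        node[parts[-1]] = value           # leaf value overrides earlier sources
    return tree

os.environ["ISARTOR__LAYER2__TIMEOUT_SECONDS"] = "30"
os.environ["ISARTOR__PORT"] = "9090"

overrides = parse_isartor_env(os.environ)
print(overrides["layer2"]["timeout_seconds"])  # 30
print(overrides["port"])                       # 9090
```

Values parsed this way would be layered on top of the config file, which in turn sits on top of the compiled defaults.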

Master Configuration Table

| YAML Key | Environment Variable | Type | Default | Description |
|---|---|---|---|---|
| server.host | ISARTOR__HOST | string | 0.0.0.0 | Host address for server binding |
| server.port | ISARTOR__PORT | int | 8080 | Port for HTTP server |
| exact_cache.provider | ISARTOR__CACHE_BACKEND | string | memory | Layer 1a cache backend: memory or redis |
| exact_cache.redis_url | ISARTOR__REDIS_URL | string | (none) | Redis connection string (if provider=redis) |
| exact_cache.redis_db | ISARTOR__REDIS_DB | int | 0 | Redis database index |
| semantic_cache.provider | ISARTOR__SEMANTIC_BACKEND | string | candle | Layer 1b semantic cache: candle (in-process) or tei (external) |
| semantic_cache.remote_url | ISARTOR__TEI_URL | string | (none) | TEI endpoint (if provider=tei) |
| slm_router.provider | ISARTOR__ROUTER_BACKEND | string | embedded | Layer 2 router: embedded or vllm |
| slm_router.remote_url | ISARTOR__VLLM_URL | string | (none) | vLLM/TGI endpoint (if provider=vllm) |
| slm_router.model | ISARTOR__VLLM_MODEL | string | gemma-2-2b-it | Model name/path for SLM router |
| slm_router.model_path | ISARTOR__MODEL_PATH | string | (baked-in) | Path to GGUF model file (embedded mode) |
| slm_router.classifier_mode | ISARTOR__LAYER2__CLASSIFIER_MODE | string | tiered | Classifier mode: tiered (TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX) |
| slm_router.max_answer_tokens | ISARTOR__LAYER2__MAX_ANSWER_TOKENS | u64 | 2048 | Max tokens the SLM may generate for a local answer |
| fallback.openai_api_key | ISARTOR__OPENAI_API_KEY | string | (none) | OpenAI API key for Layer 3 fallback |
| fallback.anthropic_api_key | ISARTOR__ANTHROPIC_API_KEY | string | (none) | Anthropic API key for Layer 3 fallback |
| llm_provider | ISARTOR__LLM_PROVIDER | string | openai | LLM provider (see below for full list) |
| external_llm_model | ISARTOR__EXTERNAL_LLM_MODEL | string | gpt-4o-mini | Model name to request from the provider |
| external_llm_api_key | ISARTOR__EXTERNAL_LLM_API_KEY | string | (none) | API key for the configured LLM provider (not needed for ollama) |
| l3_timeout_secs | ISARTOR__L3_TIMEOUT_SECS | u64 | 120 | HTTP timeout applied to all Layer 3 provider requests |
| enable_context_optimizer | ISARTOR__ENABLE_CONTEXT_OPTIMIZER | bool | true | Master switch for L2.5 context optimiser |
| context_optimizer_dedup | ISARTOR__CONTEXT_OPTIMIZER_DEDUP | bool | true | Enable cross-turn instruction deduplication |
| context_optimizer_minify | ISARTOR__CONTEXT_OPTIMIZER_MINIFY | bool | true | Enable static minification (comments, rules, blanks) |

Sections

Server

  • server.host, server.port: Bind address and port.

Layer 1a: Exact Match Cache

  • exact_cache.provider: memory or redis
  • exact_cache.redis_url, exact_cache.redis_db: Redis config

Layer 1b: Semantic Cache

  • semantic_cache.provider: candle or tei
  • semantic_cache.remote_url: TEI endpoint
  • Requests that carry x-isartor-session-id, x-thread-id, x-session-id, or x-conversation-id are isolated into a session-aware cache scope. The same scope can also be provided in request bodies via session_id, thread_id, conversation_id, or metadata.*. If no session identifier is present, Isartor keeps the legacy global-cache behavior.
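The effect of session scoping can be illustrated with a minimal sketch (a hypothetical `scoped_cache_key` helper, not Isartor's actual implementation): the session identifier is folded into the cache key, so identical prompts from different sessions occupy separate entries, while requests without any session identifier fall back to a single global scope.

```python
import hashlib
from typing import Optional

def scoped_cache_key(prompt: str, session_id: Optional[str] = None) -> str:
    """Derive a cache key; a session id isolates the entry into its own scope."""
    scope = session_id if session_id is not None else "__global__"
    payload = f"{scope}\x1f{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Same prompt, different sessions -> different cache entries
a = scoped_cache_key("What is the capital of France?", session_id="thread-1")
b = scoped_cache_key("What is the capital of France?", session_id="thread-2")
g = scoped_cache_key("What is the capital of France?")  # legacy global scope
assert a != b and a != g
```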

Layer 2: SLM Router

  • slm_router.provider: embedded or vllm
  • slm_router.remote_url, slm_router.model, slm_router.model_path: Router config
  • slm_router.classifier_mode: tiered (default — TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX)
  • slm_router.max_answer_tokens: Max tokens the SLM may generate for a local answer (default 2048)

Layer 2.5: Context Optimiser

L2.5 compresses repeated instruction payloads (CLAUDE.md, copilot-instructions.md, skills blocks) before they reach the cloud, reducing input tokens on every L3 call.

  • enable_context_optimizer: Master switch (default true). Set to false to disable L2.5 entirely.
  • context_optimizer_dedup: Enable cross-turn instruction deduplication (default true). When the same instruction block is seen in consecutive turns of the same session, it is replaced with a compact hash reference.
  • context_optimizer_minify: Enable static minification (default true). Strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.

The pipeline processes system/instruction messages from OpenAI, Anthropic, and native request formats. See Deflection Stack — L2.5 for architecture details.
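The two L2.5 stages can be sketched as plain text transforms. This is a simplified illustration, not the gateway's actual pipeline; the `minify` and `dedup` helpers are assumptions for demonstration. Minification strips comments and decoration; dedup replaces an instruction block already seen this session with a compact hash reference.

```python
import hashlib
import re

def minify(text: str) -> str:
    """Strip HTML/XML comments, decorative rules, and runs of blank lines."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"^[-=_]{3,}$", "", text, flags=re.MULTILINE)  # horizontal rules
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def dedup(block: str, seen: set) -> str:
    """Replace a previously seen instruction block with a hash reference."""
    digest = hashlib.sha256(block.encode()).hexdigest()[:12]
    if digest in seen:
        return f"[instructions unchanged, ref {digest}]"
    seen.add(digest)
    return block

seen = set()
claude_md = ("<!-- internal note -->\n# Rules\n\n\n\n"
             "Always use Rust for new code.\nNever commit secrets.\n"
             "Keep answers short.\n---\n")
turn1 = dedup(minify(claude_md), seen)   # full (minified) block on first turn
turn2 = dedup(minify(claude_md), seen)   # compact reference on repeat turns
assert len(turn2) < len(turn1)
```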

Layer 3: Cloud Fallbacks

  • fallback.openai_api_key, fallback.anthropic_api_key: API keys for external LLMs
  • llm_provider: Select the active provider. All providers are powered by rig-core except copilot, which uses Isartor's native GitHub Copilot adapter:
    • openai (default), azure, anthropic, xai
    • gemini, mistral, groq, deepseek
    • cohere, galadriel, hyperbolic, huggingface
    • mira, moonshot, ollama (local, no key), openrouter
    • perplexity, together
    • copilot (GitHub Copilot subscription-backed L3)
  • external_llm_model: Model name for the selected provider (e.g. gpt-4o-mini, gemini-2.0-flash, mistral-small-latest, llama-3.1-8b-instant, deepseek-chat, command-r, sonar, moonshot-v1-128k)
  • external_llm_api_key: API key for the configured provider (not needed for ollama)
  • l3_timeout_secs: Shared timeout, in seconds, for all Layer 3 provider HTTP calls

TOML Config Example

Generate a scaffold with isartor init, then edit isartor.toml:

[server]
host = "0.0.0.0"
port = 8080

[exact_cache]
provider = "memory"           # "memory" or "redis"
# redis_url = "redis://127.0.0.1:6379"
# redis_db = 0

[semantic_cache]
provider = "candle"           # "candle" or "tei"
# remote_url = "http://localhost:8082"

[slm_router]
provider = "embedded"         # "embedded" or "vllm"
# remote_url = "http://localhost:8000"
# model = "gemma-2-2b-it"

# L2.5 Context Optimiser (all enabled by default)
# enable_context_optimizer = true
# context_optimizer_dedup = true
# context_optimizer_minify = true

[fallback]
# openai_api_key = "sk-..."
# anthropic_api_key = "sk-ant-..."

# llm_provider = "openai"
# external_llm_model = "gpt-4o-mini"
# external_llm_api_key = "sk-..."

Per-Tier Defaults

SettingLevel 1 (Minimal)Level 2 (Sidecar)Level 3 (Enterprise)
Cache backendmemorymemoryredis
Semantic backendcandlecandletei (optional)
SLM routerembeddedembedded or sidecarvllm
LLM provideropenaiopenaiany
Monitoringfalsetruetrue

Provider-Specific Configuration

Each provider requires ISARTOR__EXTERNAL_LLM_API_KEY (except Ollama) and a matching ISARTOR__LLM_PROVIDER value:

# OpenAI (default)
export ISARTOR__LLM_PROVIDER=openai
export ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini

# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure

# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic
export ISARTOR__EXTERNAL_LLM_MODEL=claude-3-haiku-20240307

# xAI (Grok)
export ISARTOR__LLM_PROVIDER=xai

# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini
export ISARTOR__EXTERNAL_LLM_MODEL=gemini-2.0-flash

# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama
export ISARTOR__EXTERNAL_LLM_MODEL=llama3

# GitHub Copilot (configured automatically by `isartor connect claude-copilot`)
export ISARTOR__LLM_PROVIDER=copilot
export ISARTOR__EXTERNAL_LLM_MODEL=claude-sonnet-4.5

Setting API Keys with the CLI

Use isartor set-key for interactive key management:

isartor set-key --provider openai
isartor set-key --provider anthropic
isartor set-key --provider xai

This writes the key to isartor.toml or the appropriate env file.


CLI Commands

CommandDescription
isartor upStart the API gateway only (recommended default). Flag: --detach to run in background
isartor up <copilot|claude|antigravity>Start the gateway plus the CONNECT proxy for that client
isartor initGenerate a commented isartor.toml config scaffold
isartor demoRun the post-install showcase (cache-only, or live + cache when a provider is configured)
isartor checkAudit outbound connections
isartor connect <client>Configure AI clients to route through Isartor
isartor connect copilotConfigure Copilot CLI with CONNECT proxy + TLS MITM
isartor connect claude-copilotConfigure Claude Code to use GitHub Copilot through Isartor
isartor statsShow total prompts, counts by layer, and recent prompt routing history
isartor set-key --provider <name>Set LLM provider API key (writes to isartor.toml or env file)
isartor stopStop a running Isartor instance (uses PID file). Flags: --force (SIGKILL), --pid-file <path>
isartor updateSelf-update to the latest (or specific) version. Flags: --version <tag>, --dry-run, --force

See also: Architecture · Metrics & Tracing · Troubleshooting

Metrics & Tracing

Definitive reference for Isartor's OpenTelemetry traces, metrics, structured logging, and observability stack — from local development to Kubernetes.


Overview

Isartor uses OpenTelemetry for distributed tracing and metrics, plus tracing-subscriber with a JSON layer for structured logging.

SignalProtocolDefault Endpoint
TracesOTLP gRPChttp://localhost:4317
MetricsOTLP gRPChttp://localhost:4317
Logsstdout (JSON)

When ISARTOR__ENABLE_MONITORING=false (default), only the console log layer is active — zero OTel overhead.

Architecture

┌─────────────┐                  ┌──────────────────┐
│  Isartor    │  OTLP gRPC      │  OTel Collector   │
│  Gateway    │─────────────────▶│  :4317            │
│             │  (traces +       │                   │
│             │   metrics)       │  Pipelines:       │
└─────────────┘                  │  traces → Jaeger  │
                                 │  metrics → Prom   │
                                 └───┬──────────┬────┘
                                     │          │
                          ┌──────────▼──┐  ┌────▼──────────┐
                          │   Jaeger    │  │  Prometheus   │
                          │   :16686    │  │  :9090        │
                          │   (UI)      │  │  (scrape)     │
                          └─────────────┘  └───────┬───────┘
                                                   │
                                           ┌───────▼───────┐
                                           │   Grafana     │
                                           │   :3000       │
                                           │  (dashboards) │
                                           └───────────────┘

Enabling Monitoring

ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://localhost:4317
RUST_LOG=info,h2=warn,hyper=warn,tower=warn       # optional override

When ISARTOR__ENABLE_MONITORING=false (the default), Isartor uses console-only logging via tracing-subscriber with RUST_LOG filtering. No OTel SDK is initialised — zero overhead.


Telemetry Initialisation (src/telemetry.rs)

init_telemetry() returns an OtelGuard (RAII). The guard holds the SdkTracerProvider and SdkMeterProvider; dropping it flushes pending telemetry and shuts down exporters gracefully.

ComponentDescription
JSON stdout layerStructured logs emitted as JSON when monitoring is on
Pretty console layerHuman-readable output when monitoring is off
OTLP trace exportergRPC via opentelemetry-otlp → Collector
OTLP metric exportergRPC via opentelemetry-otlp → Collector
EnvFilterReads RUST_LOG, defaults to info,h2=warn,hyper=warn,tower=warn

Service identity:

service.name    = "isartor-gateway"
service.version = env!("CARGO_PKG_VERSION")   # e.g. "0.1.0"

Distributed Traces — Span Reference

Every request gets a root span (gateway_request) from the monitoring middleware. Child spans are created per-layer:

Root Span

Span NameSourceKey Attributes
gateway_requestsrc/middleware/monitoring.rshttp.method, http.route, http.status_code, client.address, isartor.final_layer

http.status_code and isartor.final_layer are recorded after the response returns (the attributes are registered empty at span creation and filled in once the response is available).

Layer 0 — Auth

Span NameSourceKey Attributes
(inline tracing::debug!/warn!)src/middleware/auth.rs

Auth is lightweight; no dedicated span is created. Events are logged at debug/warn level.

Layer 1a — Exact Cache

Span NameSourceKey Attributes
l1a_exact_cache_getsrc/adapters/cache.rscache.backend (memory|redis), cache.key, cache.hit
l1a_exact_cache_putsrc/adapters/cache.rscache.backend, cache.key, response_len

Layer 1b — Semantic Cache

Span NameSourceKey Attributes
l1b_semantic_cache_searchsrc/vector_cache.rscache.entries_scanned, cache.hit, cosine_similarity
l1b_semantic_cache_insertsrc/vector_cache.rscache.evicted, cache.size_after

cosine_similarity — the best-match score formatted to 4 decimal places. This is the key attribute for tuning the similarity threshold.

Layer 2 — SLM Triage

Span NameSourceKey Attributes
layer2_slmsrc/middleware/slm_triage.rsslm.complexity_score (TEMPLATE|SNIPPET|COMPLEX; legacy binary mode: SIMPLE|COMPLEX)
l2_classify_intentsrc/adapters/router.rsrouter.backend (embedded_candle|remote_vllm), router.decision, router.model, router.url, prompt_len

Layer 2.5 — Context Optimiser

Span NameSourceKey Attributes
layer2_5_context_optimizersrc/middleware/context_optimizer.rscontext.bytes_saved, context.strategy (e.g. "classifier+dedup", "classifier+log_crunch")

When L2.5 modifies the request body, it also sets the response header x-isartor-context-optimized: bytes_saved=<N>.

Layer 3 — Cloud LLM

Span NameSourceKey Attributes
layer3_llmsrc/handler.rsai.prompt.length_bytes, provider.name, model

Custom Span Attributes — Quick Reference

These are the Isartor-specific attributes (beyond standard OTel semantic conventions) that appear on spans and are useful for filtering in Jaeger / Tempo:

AttributeTypeWhere SetPurpose
isartor.final_layerstringRoot gateway_request spanWhich layer resolved the request
cache.hitboolL1a and L1b spansWhether the cache lookup succeeded
cosine_similaritystringL1b search spanBest cosine-similarity score (4 d.p.)
cache.entries_scannedu64L1b search spanEntries scanned during similarity search
cache.backendstringL1a get/put spans"memory" or "redis"
router.decisionstringL2 classify span"TEMPLATE", "SNIPPET", or "COMPLEX" (tiered mode); "SIMPLE" or "COMPLEX" (binary mode)
router.backendstringL2 classify span"embedded_candle" or "remote_vllm"
context.bytes_savedu64L2.5 optimizer spanBytes removed by compression pipeline
context.strategystringL2.5 optimizer spanPipeline stages that modified content (e.g. "classifier+dedup")
provider.namestringL3 handler spane.g. "openai", "xai", "azure"
modelstringL3 handler spane.g. "gpt-4o", "grok-beta"
http.status_codeu16Root spanHTTP response status code
client.addressstringRoot spanClient IP (from x-forwarded-for)

OTel Metrics (src/metrics.rs)

Seven instruments are registered as a singleton GatewayMetrics via OnceLock:

Metric NameTypeAttributesDescription
isartor_requests_totalCounterfinal_layer, status_code, traffic_surface, client, endpoint_family, toolTotal prompts processed
isartor_request_duration_secondsHistogramfinal_layer, status_code, traffic_surface, client, endpoint_familyEnd-to-end request duration
isartor_layer_duration_secondsHistogramlayer_name, toolPer-layer latency
isartor_tokens_saved_totalCounterfinal_layer, traffic_surface, client, endpoint_family, toolEstimated tokens saved by early resolve
isartor_errors_totalCounterlayer, error_class, toolError occurrences by layer / agent
isartor_retries_totalCounteroperation, attempts, outcome, toolRetry outcomes by agent
isartor_cache_events_totalCountercache_layer, outcome, toolL1 / L1a / L1b hit-miss safety by agent

Where Metrics Are Recorded

Call SiteMetrics Recorded
root_monitoring_middlewarerecord_request_with_context(), record_tokens_saved_with_context() (if early)
proxy::connect::emit_proxy_decision()record_request_with_context(), record_tokens_saved_with_context() (if early)
cache_middleware (L1 hit)record_layer_duration("L1a_ExactCache" | "L1b_SemanticCache")
slm_triage_middleware (L2 hit)record_layer_duration("L2_SLM")
context_optimizer_middlewarerecord_layer_duration("L2_5_ContextOptimiser") (when bytes saved > 0)
chat_handler (L3)record_layer_duration("L3_Cloud")

Request Dimensions

Unified prompt telemetry distinguishes:

  • traffic_surface: gateway or proxy
  • client: direct, openai, anthropic, copilot, claude, antigravity, etc.
  • endpoint_family: native, openai, or anthropic

Token Estimation

estimate_tokens(prompt) uses the heuristic: max(1, prompt.len() / 4). This is intentionally conservative — the metric tracks relative savings rather than precise token counts.
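In Python terms the documented heuristic is simply (note that Rust's prompt.len() counts bytes, while Python's len counts characters; for ASCII prompts they agree):

```python
def estimate_tokens(prompt: str) -> int:
    """Isartor's documented heuristic: roughly 4 characters per token, minimum 1."""
    return max(1, len(prompt) // 4)

assert estimate_tokens("") == 1
assert estimate_tokens("What is the capital of France?") == 7  # 30 chars // 4
```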


ROI — isartor_tokens_saved_total

This is the headline business metric. Every request resolved before Layer 3 (exact cache, semantic cache, or local SLM) avoids a round-trip to the external LLM provider.

# Daily token savings
sum(increase(isartor_tokens_saved_total[24h]))

# Savings by layer
sum by (final_layer) (rate(isartor_tokens_saved_total[1h]))

# Prompt volume by traffic surface
sum by (traffic_surface) (rate(isartor_requests_total[5m]))

# Prompt volume by client
sum by (client) (rate(isartor_requests_total[5m]))

# Estimated cost savings (assuming $0.01 per 1K tokens)
sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01

Use this metric to justify infrastructure spend for the caching / SLM layers.
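As a sanity check on the cost formula above, the same arithmetic in Python (the $0.01 per 1K tokens rate is illustrative, as in the PromQL query; substitute your provider's actual pricing):

```python
def estimated_savings_usd(tokens_saved: int, usd_per_1k_tokens: float = 0.01) -> float:
    """Mirror of the PromQL: tokens / 1000 * price-per-1K-tokens."""
    return tokens_saved / 1000 * usd_per_1k_tokens

# 5M tokens deflected in a day at $0.01/1K is roughly $50
assert abs(estimated_savings_usd(5_000_000) - 50.0) < 1e-9
```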


Docker Compose — Local Observability Stack

Use the provided compose file for local development:

cd docker
docker compose -f docker-compose.observability.yml up -d

ServicePortPurpose
OTel Collector4317OTLP gRPC receiver
Jaeger16686Trace UI
Prometheus9090Metrics scrape + query
Grafana3000Dashboards (anonymous admin)

Configuration files:

FilePurpose
docker/otel-collector-config.yamlCollector pipelines
docker/prometheus.ymlScrape targets

Pipeline Flow

Isartor  ──OTLP gRPC──▶  OTel Collector ──▶  Jaeger    (traces)
                                          └──▶  Prometheus (metrics)
                                                     │
                                                     ▼
                                                  Grafana

OTel Collector Configuration

The collector config is at docker/otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, debug]
    metrics:
      receivers: [otlp]
      exporters: [prometheus, debug]

Prometheus Configuration

The Prometheus config is at docker/prometheus.yml:

scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 5s
    static_configs:
      - targets: ['otel-collector:8889']

Prometheus scrapes the OTel Collector's Prometheus exporter on port 8889 every 5 seconds.


Per-Tier Setup

Level 1 — Minimal (Console Logs Only)

No observability stack is needed. Use RUST_LOG for structured console output:

ISARTOR__ENABLE_MONITORING=false
RUST_LOG=isartor=info

For debug-level output during development:

RUST_LOG=isartor=debug,tower_http=trace

Level 2 — Docker Compose (Full Stack)

The docker-compose.sidecar.yml includes the complete observability stack:

cd docker
docker compose -f docker-compose.sidecar.yml up --build

Services included:

ServiceURLPurpose
OTel Collectorlocalhost:4317 (gRPC)Receives OTLP from gateway
Jaeger UIhttp://localhost:16686View distributed traces
Prometheushttp://localhost:9090Query metrics
Grafanahttp://localhost:3000Dashboards (anonymous admin access)

The gateway is pre-configured with:

ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317

Level 3 — Kubernetes (Managed or Self-Hosted)

ApproachRecommended StackNotes
Self-managedOTel Collector DaemonSet + Jaeger Operator + kube-prometheus-stackFull control, higher ops burden
AWSAWS X-Ray + CloudWatch + Managed GrafanaADOT Collector as sidecar/DaemonSet
GCPCloud Trace + Cloud Monitoring + Cloud LoggingUse OTLP exporter to Cloud Trace
AzureApplication Insights + Azure MonitorUse Azure Monitor OpenTelemetry exporter
Grafana CloudGrafana Alloy + Grafana CloudLow ops, managed Prometheus + Tempo
DatadogDatadog Agent + OTel CollectorEnterprise APM

For all options, point the gateway at the collector:

ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.isartor:4317

Grafana Dashboard Queries (PromQL)

PanelPromQL
Request Raterate(isartor_requests_total[5m])
P95 Latencyhistogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
Layer Resolutionsum by (final_layer) (rate(isartor_requests_total[5m]))
Traffic Surface Splitsum by (traffic_surface) (rate(isartor_requests_total[5m]))
Client Splitsum by (client) (rate(isartor_requests_total[5m]))
Per-Layer Latencyhistogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m])))
Tokens Saved / Hoursum(increase(isartor_tokens_saved_total[1h]))
Tokens Saved by Layersum by (final_layer) (rate(isartor_tokens_saved_total[5m]))
Cache Hit Ratesum(rate(isartor_requests_total{final_layer=~"L1.*"}[5m])) / sum(rate(isartor_requests_total[5m]))

Jaeger — Useful Searches

GoalSearch
Slow requests (> 500 ms)Service isartor-gateway, Min Duration 500ms
Cache missesTag cache.hit=false
Semantic cache tuningTag cosine_similarity — sort by value
Layer 3 fallbacksTag isartor.final_layer=L3_Cloud
SLM local resolutionsTag router.decision=TEMPLATE or router.decision=SNIPPET (tiered); router.decision=SIMPLE (binary)

Trace Anatomy

A typical trace for a cache-miss, locally-resolved request:

isartor-gateway
  └─ HTTP POST /api/chat                       [250ms]
       ├─ Layer0_AuthCheck                       [0.1ms]
       ├─ Layer1_SemanticCache (MISS)            [5ms]
       ├─ Layer2_IntentClassifier                [80ms]
       │     intent=TEMPLATE, confidence=0.97
       └─ Layer2_LocalExecutor                   [160ms]
             model=phi-3-mini, tokens=42

Built-in User Views

For quick operator checks without a separate telemetry stack:

isartor stats --gateway-url http://localhost:8080
isartor stats --gateway-url http://localhost:8080 --by-tool

Add --gateway-api-key <key> only when gateway auth is enabled.

--by-tool prints richer per-agent stats: requests, cache hits/misses, average latency, retry count, error count, and L1a/L1b safety ratios.

Built-in JSON endpoints:

  • GET /health
  • GET /debug/proxy/recent
  • GET /debug/stats/prompts
  • GET /debug/stats/agents

Alerting Rules

Prometheus Alerting Rules

Create docker/prometheus-alerts.yml:

groups:
  - name: isartor
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(isartor_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Isartor error rate > 5% for 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Isartor P95 latency > 2s for 5 minutes"

      - alert: LowCacheHitRate
        expr: >
          sum(rate(isartor_requests_total{final_layer=~"L1.*"}[15m])) /
          sum(rate(isartor_requests_total[15m])) < 0.3
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below 30% — consider tuning similarity threshold"

      - alert: LowDeflectionRate
        expr: |
          1 - (
            sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
            /
            sum(rate(isartor_requests_total[1h]))
          ) < 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Isartor deflection rate below 50%"

      - alert: FirewallDown
        expr: up{job="isartor"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Isartor gateway is down"

Troubleshooting

SymptomCauseFix
No traces in JaegerMonitoring disabledSet ISARTOR__ENABLE_MONITORING=true
No traces in JaegerCollector unreachableVerify ISARTOR__OTEL_EXPORTER_ENDPOINT + port 4317
No metrics in PrometheusPrometheus can't scrape collectorCheck prometheus.yml targets
Grafana "No data"Data source misconfiguredURL should be http://prometheus:9090
Console shows "OTel disabled"Config precedenceCheck whether an env var is overriding the file config
isartor_layer_duration_seconds emptyNo requests yetSend a test request

See also: Configuration Reference · Performance Tuning · Troubleshooting

Performance Tuning

How to measure, tune, and operate Isartor for maximum deflection and minimum latency.


Table of Contents

  1. Understanding Deflection
  2. Measuring Deflection Rate
  3. Tuning Configuration for Deflection
  4. Tuning Latency
  5. Memory & Resource Tuning
  6. Cache Tuning Deep-Dive
  7. SLM Router Tuning
  8. Embedder Tuning
  9. SLO / SLA Goal Templates
  10. Scenario-Based Tuning Recipes
  11. PromQL Cheat Sheet

Understanding Deflection

Deflection = the percentage of requests resolved before Layer 3 (the external cloud LLM). A request is "deflected" if it is served by:

LayerMechanismCost
L1a — Exact CacheSHA-256 hash match$0
L1b — Semantic CacheCosine similarity match$0
L2 — SLM TriageLocal SLM classifies requests as TEMPLATE, SNIPPET, or COMPLEX (tiered mode) and answers TEMPLATE/SNIPPET locally$0

The deflection rate directly maps to cost savings. A 70 % deflection rate means only 30 % of requests reach the paid cloud LLM.
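Given per-layer request counts, the deflection rate is one minus the cloud fraction. A minimal helper (layer names follow the final_layer label values used in the metrics section):

```python
def deflection_rate(counts: dict) -> float:
    """1 - (requests that reached L3_Cloud / total requests)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts.get("L3_Cloud", 0) / total

counts = {"L1a_ExactCache": 40, "L1b_SemanticCache": 20, "L2_SLM": 10, "L3_Cloud": 30}
assert abs(deflection_rate(counts) - 0.70) < 1e-9  # 70% deflected, 30% reach the cloud
```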


Measuring Deflection Rate

Via Prometheus / Grafana

The gateway emits isartor_requests_total with a final_layer label. Use the following PromQL to compute the deflection rate:

# Overall deflection rate (last 1 hour)
1 - (
  sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
  /
  sum(increase(isartor_requests_total[1h]))
)
# Deflection rate by layer (pie chart)
sum by (final_layer) (rate(isartor_requests_total[5m]))
# Exact-cache deflection only
sum(increase(isartor_requests_total{final_layer="L1a_ExactCache"}[1h]))
/
sum(increase(isartor_requests_total[1h]))

Via the API

Send a test batch and count response layer values:

# Send 100 identical requests — expect 99 cache hits
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/api/chat \
    -H "Content-Type: application/json" \
    -H "X-API-Key: $ISARTOR_API_KEY" \
    -d '{"prompt": "What is the capital of France?"}' \
  | jq '.layer'
done | sort | uniq -c

Expected output (ideal):

  99 1       ← remaining 99 → exact cache (layer 1)
   1 3       ← first request → cloud (layer 3)

Via Structured Logs

When ISARTOR__ENABLE_MONITORING=true, every request logs the final layer:

# grep JSON logs for final-layer distribution
cat logs.json | jq '.isartor.final_layer' | sort | uniq -c

Via Jaeger / Tempo

Filter traces by the isartor.final_layer tag:

GoalSearch
All cache hitsTag isartor.final_layer=L1a_ExactCache or L1b_SemanticCache
SLM resolutionsTag isartor.final_layer=L2_SLM
Cloud fallbacksTag isartor.final_layer=L3_Cloud

Tuning Configuration for Deflection

Cache Mode

VariableValuesRecommended
ISARTOR__CACHE_MODEexact, semantic, bothboth (default)
  • exact — Only identical prompts hit. Good for deterministic agent loops.
  • semantic — Catches paraphrases ("Price?" ≈ "Cost?"). Higher hit rate but adds ~1–5 ms embedding cost.
  • both — Exact check first (< 1 ms), then semantic if no exact hit. Best of both worlds.

Similarity Threshold

VariableDefaultRange
ISARTOR__SIMILARITY_THRESHOLD0.850.0–1.0
ValueEffect
0.95Very strict — only near-identical prompts match. Low false positives, lower hit rate.
0.85Balanced — catches common paraphrases. Recommended starting point.
0.75Aggressive — higher hit rate but risk of returning wrong cached answers.
0.60Dangerous — high false-positive rate. Not recommended for production.

How to tune:

  1. Set ISARTOR__ENABLE_MONITORING=true.
  2. Send representative traffic for 1 hour.
  3. In Jaeger, search for cosine_similarity attribute on l1b_semantic_cache_search spans.
  4. Plot the distribution. If most similarity scores cluster between 0.80–0.90, a threshold of 0.85 is good.
  5. If you see many scores at 0.82–0.84 that should be hits, lower to 0.80.
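The gate being tuned here is a straightforward cosine comparison. A toy version, with hypothetical 3-dimensional vectors standing in for the real 384-dimensional embeddings:

```python
import math

def cosine_similarity(a, b) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_semantic_hit(query_vec, cached_vec, threshold: float = 0.85) -> bool:
    return cosine_similarity(query_vec, cached_vec) >= threshold

assert is_semantic_hit([1.0, 0.0, 0.0], [0.95, 0.1, 0.0])      # near-duplicate
assert not is_semantic_hit([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # unrelated
```

Raising the threshold shrinks the set of vectors that pass the gate; the Jaeger distribution tells you where real paraphrases land.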

Cache TTL

VariableDefaultDescription
ISARTOR__CACHE_TTL_SECS300 (5 min)Time-to-live for cached responses
  • Short TTL (60–120 s): Good for rapidly changing data, real-time dashboards.
  • Medium TTL (300–600 s): Balanced for most workloads.
  • Long TTL (1800+ s): Maximises deflection for static Q&A / documentation bots.

Cache Capacity

VariableDefaultDescription
ISARTOR__CACHE_MAX_CAPACITY10000Max entries in each cache (LRU eviction)
  • Monitor eviction rate via cache.evicted span attribute on l1b_semantic_cache_insert.
  • If eviction rate > 5 % of inserts, increase capacity or shorten TTL.
  • Each cache entry ≈ 2–4 KB (prompt hash + response + optional 384-dim vector).
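A back-of-the-envelope memory estimate helps size cache_max_capacity (using the ~2–4 KB per-entry figure above; the helper is illustrative, not a gateway API):

```python
def cache_memory_mb(max_capacity: int, avg_entry_kb: float = 3.0) -> float:
    """Rough cache footprint: entries x average entry size (2-4 KB per the table)."""
    return max_capacity * avg_entry_kb / 1024

# Default 10,000 entries at ~3 KB each is roughly 29 MB
assert 20 <= cache_memory_mb(10_000) <= 40
```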

Tuning Latency

Target Latencies by Layer

LayerTarget (p95)Typical Range
L1a — Exact Cache< 1 ms0.1–0.5 ms
L1b — Semantic Cache< 10 ms1–5 ms
L2 — SLM Triage< 300 ms50–200 ms (embedded), 100–500 ms (sidecar)
L3 — Cloud LLM< 3 s500 ms – 5 s (network-bound)

Measure with PromQL

# P95 latency by layer
histogram_quantile(0.95,
  sum by (le, layer_name) (
    rate(isartor_layer_duration_seconds_bucket[5m])
  )
)
# P95 end-to-end latency
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))

Reducing Latency

BottleneckSymptomFix
EmbeddingL1b > 10 msUse a lighter model or increase CPU allocation
SLM inferenceL2 > 500 msUse quantised model (Q4_K_M GGUF), switch to embedded engine
RedisL1a > 5 msCheck network latency, use Redis cluster with read replicas
Cloud LLML3 > 5 sSwitch provider, use a smaller model, enable request timeout

Memory & Resource Tuning

Memory Budget

ComponentMemory UsageNotes
Exact cache (in-memory, 10K entries)~20–40 MBScales linearly with cache_max_capacity
Semantic cache (in-memory, 10K entries)~30–60 MB384-dim float32 vectors + response strings
candle embedder (all-MiniLM-L6-v2)~90 MBLoaded at startup, constant
Candle GGUF model (embedded SLM)~1–4 GBDepends on model quantisation
Tokio runtime~10–20 MBAsync task pool
Total (minimalist mode)~150–200 MBNo embedded SLM
Total (embedded mode)~1.5–4.5 GBWith embedded Candle SLM

CPU Considerations

  • Embedding generation runs on spawn_blocking (dedicated thread pool).
  • Candle GGUF inference is CPU-bound; allocate ≥ 4 cores for embedded mode.
  • The Tokio async runtime uses the default thread count (num_cpus).

Container Limits

# docker-compose example
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 512M    # minimalist mode
          cpus: "2"
        # For embedded SLM mode:
        # limits:
        #   memory: 4G
        #   cpus: "4"

Cache Tuning Deep-Dive

Exact vs. Semantic Cache Hit Analysis

# Exact cache hit rate
sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))

# Semantic cache hit rate
sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))

Cache Backend: Memory vs. Redis

FactorIn-MemoryRedis
Latency~0.1 ms~1–5 ms (network hop)
CapacityLimited by process RAMLimited by Redis memory
Multi-replica❌ No sharing✅ Shared across pods
Persistence❌ Lost on restart✅ Optional AOF/RDB
Recommended forSingle-instance, dev, edgeK8s, multi-replica, production

Switch with:

export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__REDIS_URL=redis://redis.svc:6379

When to Disable Semantic Cache

  • Traffic is 100 % deterministic (exact same prompts repeated).
  • Embedding overhead is unacceptable (< 1 ms budget).
  • Set ISARTOR__CACHE_MODE=exact.

SLM Router Tuning

Embedded vs. Sidecar

ModeVariableLatencyResource Usage
Embedded (Candle)ISARTOR__INFERENCE_ENGINE=embedded50–200 msHigh CPU, 1–4 GB RAM
Sidecar (llama.cpp)ISARTOR__INFERENCE_ENGINE=sidecar100–500 msSeparate process, GPU optional
Remote (vLLM/TGI)ISARTOR__ROUTER_BACKEND=vllm100–500 msSeparate server, GPU recommended

Model Selection

ModelSizeSpeedAccuracy
Phi-3-mini (Q4_K_M)~2 GBFastGood
Gemma-2-2B-IT (Q4)~1.5 GBVery fastGood
Qwen-1.5-1.8B (Q4)~1.2 GBFastestAdequate
Llama-3-8B (Q4)~4.5 GBSlowerBest

For intent classification (TEMPLATE/SNIPPET/COMPLEX in tiered mode, or SIMPLE/COMPLEX in legacy binary mode), smaller models (1–3 B params) are sufficient. Use the smallest model that meets your accuracy needs.

Tuning the Classification Prompt

The system prompt in src/middleware/slm_triage.rs determines classification accuracy. If too many COMPLEX requests are misclassified as TEMPLATE or SNIPPET (resulting in bad local answers), consider:

  1. Making the system prompt more specific to your domain.
  2. Adding examples to the prompt (few-shot).
  3. Switching to a larger model.
  4. Setting ISARTOR__LAYER2__MAX_ANSWER_TOKENS to allow longer SLM responses (default 2048).
  5. Falling back to binary mode via ISARTOR__LAYER2__CLASSIFIER_MODE=binary if the three-tier split does not suit your workload.
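A few-shot variant of the classification prompt might look like the following. This is an illustrative sketch, not the prompt shipped in src/middleware/slm_triage.rs; the example requests and the helper name are assumptions:

```python
# Hypothetical few-shot examples mapping requests to the tiered labels
FEW_SHOT_EXAMPLES = [
    ("What does HTTP 404 mean?", "TEMPLATE"),
    ("Write a one-line Python function to reverse a string.", "SNIPPET"),
    ("Refactor our auth service to support multi-tenant SSO.", "COMPLEX"),
]

def build_classifier_prompt(user_prompt: str) -> str:
    """Assemble a few-shot classification prompt ending at the label to predict."""
    lines = ["Classify the request as TEMPLATE, SNIPPET, or COMPLEX.", ""]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Request: {example}\nLabel: {label}\n")
    lines.append(f"Request: {user_prompt}\nLabel:")
    return "\n".join(lines)

prompt = build_classifier_prompt("Summarise this changelog.")
assert "COMPLEX" in prompt and prompt.endswith("Label:")
```

Domain-specific examples in place of the generic ones above are usually the cheapest accuracy win before reaching for a larger model.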

Embedder Tuning

In-Process (candle)

The default embedder uses candle with sentence-transformers/all-MiniLM-L6-v2 (pure-Rust BertModel):

  • 384-dimensional vectors
  • ~90 MB model footprint
  • 1–5 ms per embedding (CPU)
  • Runs on spawn_blocking to avoid starving the Tokio runtime

Sidecar Embedder

For higher throughput or GPU acceleration:

export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
export ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME=all-minilm
export ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS=10

Embedding Model Selection

ModelDimsSpeedQuality
all-MiniLM-L6-v2384FastestGood
bge-small-en-v1.5384FastBetter
bge-base-en-v1.5768ModerateBest

Use 384-dim models for production. 768-dim models double memory usage for marginal quality improvement in most use cases.
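The memory claim is easy to verify: float32 vectors cost dims × 4 bytes per entry, so doubling the dimensions exactly doubles vector storage:

```python
def vector_store_mb(entries: int, dims: int, bytes_per_float: int = 4) -> float:
    """Memory for a flat float32 vector store, in MiB."""
    return entries * dims * bytes_per_float / (1024 * 1024)

# 10K cached entries: 384-dim is ~14.6 MB, 768-dim is ~29.3 MB
assert round(vector_store_mb(10_000, 768) / vector_store_mb(10_000, 384), 1) == 2.0
```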


SLO / SLA Goal Templates

Developer / Internal SLO

MetricTargetMeasurement
Availability99.5 %up{job="isartor"} over 30-day window
P95 latency (cache hit)< 10 mshistogram_quantile(0.95, ...) on L1
P95 latency (end-to-end)< 3 shistogram_quantile(0.95, ...) on all
Deflection rate> 50 %1 - (L3 / total) over 24 h
Error rate< 1 %rate(isartor_requests_total{status_code=~"5.."}[5m])

Production / Enterprise SLO

| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9 % | Multi-replica, health check monitoring |
| P95 latency (cache hit) | < 5 ms | Requires Redis or fast in-memory |
| P95 latency (end-to-end) | < 2 s | Optimised models, provider SLAs |
| P99 latency (end-to-end) | < 5 s | Tail latency budget |
| Deflection rate | > 70 % | Tuned thresholds + warm cache |
| Error rate | < 0.1 % | Circuit breakers, retries |
| Token savings | > 60 % | isartor_tokens_saved_total vs estimated total |

SLA Template (for downstream consumers)

## Isartor Prompt Firewall SLA

**Availability:** 99.9 % monthly uptime (< 43.8 min downtime/month)
**Latency:** P95 end-to-end < 2 seconds
**Error Budget:** 0.1 % of requests may return 5xx
**Maintenance Window:** Sundays 02:00–04:00 UTC (excluded from SLA)

### Remediation
- Cache tier failure: automatic fallback to cloud LLM (degraded mode)
- SLM failure: automatic fallback to cloud LLM (degraded mode)
- Cloud LLM failure: 502 Bad Gateway returned, retry recommended

### Monitoring
- Health endpoint: GET /healthz
- Metrics endpoint: Prometheus scrape via OTel Collector on port 8889
- Dashboard: Grafana at http://<grafana-host>:3000
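The 43.8-minute figure follows directly from the availability target. A quick check, assuming an average month of ~30.44 days:

```python
def downtime_budget_minutes(availability: float, days: float = 30.44) -> float:
    """Allowed downtime per period for a given availability target."""
    return days * 24 * 60 * (1 - availability)

# 99.9 % monthly -> ~43.8 minutes; 99.5 % (internal SLO) -> ~3.65 hours.
monthly_999 = downtime_budget_minutes(0.999)
monthly_995 = downtime_budget_minutes(0.995)
```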

Alert Rules (Prometheus)

groups:
  - name: isartor-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(isartor_requests_total{http_status=~"5.."}[5m]))
          /
          sum(rate(isartor_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Isartor error rate exceeds 1%"

      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
          > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Isartor P95 latency exceeds 3 seconds"

      - alert: LowDeflectionRate
        expr: |
          1 - (
            sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
            /
            sum(rate(isartor_requests_total[1h]))
          ) < 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Isartor deflection rate below 50%"

      - alert: FirewallDown
        expr: up{job="isartor"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Isartor gateway is down"

Scenario-Based Tuning Recipes

Scenario A: Agentic Loop (High-Volume Identical Prompts)

Profile: Autonomous agent sends the same prompt hundreds of times per minute.

ISARTOR__CACHE_MODE=exact           # Semantic unnecessary for identical prompts
ISARTOR__CACHE_TTL_SECS=3600       # Long TTL — agent prompts are stable
ISARTOR__CACHE_MAX_CAPACITY=50000  # Large cache for many unique prompts

Expected deflection: 95–99 % (after warm-up).

Scenario B: Customer Support Bot (Paraphrased Questions)

Profile: End users ask the same questions in different ways.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.80  # Lower threshold to catch paraphrases
ISARTOR__CACHE_TTL_SECS=1800       # 30 min — support answers change slowly
ISARTOR__CACHE_MAX_CAPACITY=10000

Expected deflection: 60–80 %.
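The threshold in this recipe is applied to the cosine similarity between prompt embeddings. A minimal sketch of that decision in pure Python (the vectors here are hypothetical; Isartor computes them with its candle embedder):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_semantic_hit(similarity: float, threshold: float = 0.80) -> bool:
    # A cached answer is reused only when similarity meets the threshold.
    return similarity >= threshold
```

Lowering the threshold from 0.85 to 0.80 widens the band of paraphrases that count as hits, at the cost of occasionally reusing an answer for a question that is only loosely related.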

Scenario C: Code Generation (Low Cache Hit Rate)

Profile: Developers ask unique, complex coding questions.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.92  # High threshold — wrong cached code is costly
ISARTOR__CACHE_TTL_SECS=600        # Short TTL — code context changes quickly
ISARTOR__INFERENCE_ENGINE=embedded   # Let SLM handle simple code questions

Expected deflection: 20–40 % (SLM handles simple extraction).

Scenario D: RAG Pipeline (Document Q&A)

Profile: Queries against a knowledge base; similar questions are common.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.83  # Moderate threshold
ISARTOR__CACHE_TTL_SECS=3600       # Documents change infrequently
ISARTOR__CACHE_MAX_CAPACITY=20000  # Large cache for document variation

Expected deflection: 50–70 %.

Scenario E: Multi-Replica Kubernetes

Profile: Horizontally scaled behind a load balancer.

ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.85

Benefit: All replicas share the same cache → deflection rate applies cluster-wide.


PromQL Cheat Sheet

| What | Query |
|---|---|
| Deflection rate (1 h) | 1 - (sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h])) / sum(increase(isartor_requests_total[1h]))) |
| Request rate | rate(isartor_requests_total[5m]) |
| Request rate by layer | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| P50 latency | histogram_quantile(0.50, rate(isartor_request_duration_seconds_bucket[5m])) |
| P95 latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| P99 latency | histogram_quantile(0.99, rate(isartor_request_duration_seconds_bucket[5m])) |
| Per-layer P95 | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens saved (daily) | sum(increase(isartor_tokens_saved_total[24h])) |
| Tokens saved by layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Est. daily cost savings ($0.01/1K tok) | sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01 |
| Error rate | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (exact) | sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (semantic) | sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
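The deflection-rate and cost-savings queries above reduce to simple counter arithmetic, which you can also check offline against raw counter increments (a sketch; the argument names mirror the metric labels above):

```python
def deflection_rate(l3_requests: float, total_requests: float) -> float:
    """1 - (cloud-bound / total), same arithmetic as the PromQL query."""
    if total_requests == 0:
        return 0.0
    return 1.0 - (l3_requests / total_requests)

def daily_savings_usd(tokens_saved: float, usd_per_1k_tokens: float = 0.01) -> float:
    """Estimated cost avoided, matching the cheat-sheet savings query."""
    return tokens_saved / 1000 * usd_per_1k_tokens
```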

See also: Metrics & Tracing · Configuration Reference · Troubleshooting

Testing

Complete test runbook for Isartor — from automated test suites to manual feature verification and Copilot CLI integration testing.


Prerequisites

| Requirement | Check |
|---|---|
| Rust toolchain | cargo --version |
| Built binary | cargo build --release |
| curl + jq | curl --version && jq --version |

Quick Start — Automated

Unit & Integration Tests

# Run the full test suite
cargo test --all-features

# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite

# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture
cargo test --test integration_suite body_survives_all_middleware -- --nocapture

Smoke Test Script

Run the entire manual test suite in one command:

# Start a fresh server, run all tests, stop after
./scripts/smoke-test.sh --stop-after

# Test an already-running server
./scripts/smoke-test.sh --no-start

# Full run including demo + verbose response bodies
./scripts/smoke-test.sh --run-demo --verbose

# Custom URL / API key
./scripts/smoke-test.sh --url http://localhost:9090 --api-key mykey --no-start

Lint & Format Checks

Run the same checks CI runs:

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

Compression Pipeline Tests

Run the L2.5 compression module tests specifically:

# All compression tests (pipeline, stages, cache, optimize)
cargo test --all-features compression

# Specific modules
cargo test --all-features content_classifier
cargo test --all-features dedup_cache
cargo test --all-features log_crunch
cargo test --all-features optimize_request_body

Manual Step-by-Step

Note: Isartor runs without gateway auth by default (local-first). The test commands below explicitly set ISARTOR__GATEWAY_API_KEY to exercise authenticated request handling.

1 Start the Server

# Gateway-only startup (local API testing)
ISARTOR__FIRST_RUN_COMPLETE=1 \
./target/release/isartor up

# Full startup for proxy-aware testing (recommended for this guide)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
./target/release/isartor up copilot

# With an OpenAI key (enables real L3 fallback)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
ISARTOR__EXTERNAL_LLM_API_KEY=sk-... \
./target/release/isartor up copilot

Server is ready when you see:

INFO isartor: API gateway listening, addr: 0.0.0.0:8080
INFO isartor: CONNECT proxy starting, addr: 0.0.0.0:8081

2 Health & Liveness

# Liveness probe (no auth needed)
curl http://localhost:8080/healthz

# Rich health (shows layer status, proxy, prompt totals)
curl http://localhost:8080/health | jq .

Expected /health response shape:

{
  "status": "ok",
  "version": "0.1.25",
  "layers": { "l1a": "active", "l1b": "active", "l2": "active", "l3": "no_api_key" },
  "uptime_seconds": 5,
  "proxy": "active",
  "proxy_layer3": "native_upstream_passthrough",
  "prompt_total_requests": 0,
  "prompt_total_deflected_requests": 0
}

3 OpenAI-Compatible Endpoint (/v1/chat/completions)

API_KEY=changeme

curl -sS http://localhost:8080/v1/chat/completions \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | jq .

Send the same prompt twice to confirm L1a exact-cache kicks in:

for i in 1 2; do
  echo "--- Request $i ---"
  curl -sS http://localhost:8080/v1/chat/completions \
    -H "X-API-Key: $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is 2+2?"}]}' \
    | jq '.choices[0].message.content, .model'
done

4 Anthropic-Compatible Endpoint (/v1/messages)

curl -sS http://localhost:8080/v1/messages \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-haiku-20240307",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "What is 2+2?"}]
  }' | jq .

Expected shape: {"id":..., "type":"message", "role":"assistant", "content":[...], "model":...}


5 Native Endpoint (/api/chat)

curl -sS http://localhost:8080/api/chat \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}' | jq .

6 L1a — Exact Cache Hit

# Seed the cache with first request
curl -sS http://localhost:8080/v1/chat/completions \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
  -o /dev/null

# Second identical request — should be served from L1a
curl -sS http://localhost:8080/v1/chat/completions \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
  | jq '.model'
# → "isartor-cache" or similar (not "gpt-4o-mini")

7 L1b — Semantic Cache Hit

# Seed
curl -sS http://localhost:8080/v1/chat/completions \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is the capital of France?"}]}' \
  -o /dev/null

# Paraphrase — should hit L1b (cosine similarity ≥ 0.85)
curl -sS http://localhost:8080/v1/chat/completions \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Which city is France'\''s capital?"}]}' \
  | jq '.model'

8 Authentication Rejection

# No API key — should return 401/403
curl -sS -w "\nHTTP %{http_code}" http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hello"}]}'

9 Prompt Stats

# JSON endpoint
curl -sS -H "X-API-Key: $API_KEY" \
  "http://localhost:8080/debug/stats/prompts?limit=10" | jq .

# Per-agent observability endpoint
curl -sS -H "X-API-Key: $API_KEY" \
  "http://localhost:8080/debug/stats/agents" | jq .

# CLI command
./target/release/isartor stats \
  --gateway-url http://localhost:8080 \
  --gateway-api-key $API_KEY

# CLI per-agent view
./target/release/isartor stats \
  --gateway-url http://localhost:8080 \
  --gateway-api-key $API_KEY \
  --by-tool

Expected isartor stats output:

Isartor Prompt Stats
  URL:        http://localhost:8080
  Total:      7
  Deflected:  3

By Layer
  L1A  3
  L3   4

By Surface
  gateway  7

By Client
  openai   5
  anthropic 2

Recent Prompts
  2026-03-19T09:00:00Z gateway openai L1A via /v1/chat/completions (1ms, HTTP 200)

10 Proxy Recent Decisions

curl -sS -H "X-API-Key: $API_KEY" \
  "http://localhost:8080/debug/proxy/recent?limit=5" | jq .

11 isartor connect status

./target/release/isartor connect status \
  --gateway-url http://localhost:8080 \
  --gateway-api-key $API_KEY

12 Run the Built-in Demo

./target/release/isartor demo
# Replays 50 bundled prompts through L1a/L1b, prints deflection rate.
# Writes isartor_demo_result.txt

13 Stop the Server

./target/release/isartor stop

Copilot CLI Integration Test

Step 1 — Connect Copilot CLI

./target/release/isartor connect copilot \
  --gateway-url http://localhost:8080 \
  --gateway-api-key changeme

This writes ~/.isartor/env/copilot.sh with:

export HTTPS_PROXY="http://localhost:8081"
export NODE_EXTRA_CA_CERTS="/Users/<you>/.isartor/ca/isartor-ca.pem"
export ISARTOR_COPILOT_ENABLED=true

Step 2 — Activate the Proxy Environment

Critical: You must source the env file in the same shell where you run Copilot CLI:

source ~/.isartor/env/copilot.sh

# Verify the env is active
echo $HTTPS_PROXY        # → http://localhost:8081
echo $NODE_EXTRA_CA_CERTS  # → /Users/<you>/.isartor/ca/isartor-ca.pem

Step 3 — Use Copilot CLI (same shell)

# Ask Copilot a question — traffic will route through Isartor proxy
gh copilot suggest "list all files in a directory"

# Or explain
gh copilot explain "what does git rebase do"

Step 4 — Verify Traffic Hit Isartor

# Check proxy recent decisions
./target/release/isartor connect status \
  --gateway-url http://localhost:8080 \
  --gateway-api-key changeme

# Check prompt stats
./target/release/isartor stats \
  --gateway-url http://localhost:8080 \
  --gateway-api-key changeme

You should see proxy_recent_requests > 0 and Copilot entries in By Client.

Step 5 — Ask Repeated Questions (cache test)

# Ask the same thing twice — second hit should be L1a
gh copilot suggest "list all files in a directory"
gh copilot suggest "list all files in a directory"

# Check stats — deflected count should have increased
./target/release/isartor stats \
  --gateway-url http://localhost:8080 \
  --gateway-api-key changeme

Disconnect

./target/release/isartor connect copilot --disconnect
# then unset in your shell:
unset HTTPS_PROXY NODE_EXTRA_CA_CERTS ISARTOR_COPILOT_ENABLED

Feature Coverage Matrix

| Feature | Test | Section |
|---|---|---|
| Health endpoint | curl /health | §2 |
| Liveness probe | curl /healthz | §2 |
| OpenAI /v1/chat/completions | curl + jq | §3 |
| Anthropic /v1/messages | curl + jq | §4 |
| Native /api/chat | curl + jq | §5 |
| L1a exact-cache deflection | repeated prompt | §6 |
| L1b semantic-cache deflection | paraphrased prompt | §7 |
| Auth rejection | no X-API-Key | §8 |
| Prompt stats endpoint | /debug/stats/prompts | §9 |
| isartor stats CLI | isartor stats | §9 |
| Proxy decisions endpoint | /debug/proxy/recent | §10 |
| Connect status CLI | isartor connect status | §11 |
| Built-in demo | isartor demo | §12 |
| Copilot CLI proxy routing | source env + gh copilot | Copilot CLI |
| Cache hit via Copilot | repeated gh copilot | Copilot CLI §5 |

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| Connection refused :8080 | Server not started | Run ./target/release/isartor up |
| isartor update fails after stop | Stale HTTPS_PROXY in shell | unset HTTPS_PROXY HTTP_PROXY |
| Copilot traffic not showing in stats | Wrong shell / env not sourced | source ~/.isartor/env/copilot.sh then restart Copilot CLI |
| L1b miss on paraphrase | Semantic index cold | Send several prompts first to warm the index |
| l3: no_api_key in health | No LLM key set | Set ISARTOR__EXTERNAL_LLM_API_KEY or use cache/demo mode |

See also: Troubleshooting · Contributing

Contributing

Thanks for your interest in contributing to Isartor! Isartor is maintained by one developer as a side project. Here's how to make your contribution land quickly.


Before You Open a PR

  1. Check existing issues — your idea may already be tracked.
  2. Open an issue first for any non-trivial change.
  3. One PR per issue — keep scope tight.

Looking for something to work on? Check out the good first issues label on GitHub.


Development Setup

Prerequisites

  • Rust 1.75+ — install via rustup
  • Docker — required for integration tests and the observability stack
  • curl + jq — for manual testing

Clone and Build

git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build

Run the Test Suite

# Full test suite
cargo test --all-features

# Or use Make
make test

# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite

# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture

Lint & Format

# Format check (same as CI)
cargo fmt --all -- --check

# Apply formatting
cargo fmt --all

# Clippy lint check (same as CI)
cargo clippy --all-targets --all-features -- -D warnings

Release Build

cargo build --release
# or
make build

Benchmarks

# Criterion micro-benchmarks
cargo bench --bench cache_latency
cargo bench --bench e2e_pipeline

# Full benchmark harness (requires running Isartor instance)
make benchmark

# Dry-run smoke test (no server needed)
make benchmark-dry-run

PR Checklist

  • cargo test --all-features passes
  • cargo clippy --all-targets --all-features -- -D warnings has no new warnings
  • cargo fmt --all -- --check passes
  • PR description explains WHY, not just WHAT
  • Documentation updated if behaviour changes

What Gets Merged Quickly

  • Bug fixes with a test that reproduces the bug
  • Documentation improvements
  • Performance improvements with benchmark evidence

What Takes Longer

  • New features — needs design discussion in an issue first
  • Changes to the deflection layer logic — core path changes require careful review

Code Conventions

  • Tests are grouped into integration-test binaries (unit_suite, integration_suite, scenario_suite) that re-export submodules. When adding a test, place it in the appropriate binary rather than creating a standalone file.
  • Configuration uses ISARTOR__... environment variables with double underscores as separators.
  • The Axum middleware stack wraps inside-out. See src/main.rs for the documented layer order.
  • Use spawn_blocking for CPU-intensive work (embeddings, model inference) to avoid starving the Tokio runtime.
  • The src/compression/ module uses a Fusion Pipeline pattern: stateless CompressionStage trait objects executed in order. To add a new compression stage, implement the CompressionStage trait and wire it in src/compression/optimize.rs::build_pipeline().

Response Time

Issues and PRs are reviewed within 24–48 hours on weekdays. Weekend responses are not guaranteed.


See also: Testing · Architecture · Troubleshooting

Troubleshooting

Common issues, diagnostic steps, and FAQ for operating Isartor.


Table of Contents

  1. Startup Errors
  2. Cache Issues
  3. Embedding & SLM Issues
  4. Cloud LLM Issues
  5. Observability Issues
  6. Performance & Degraded Operation
  7. Docker & Deployment Issues
  8. FAQ

Startup Errors

Failed to initialize candle TextEmbedder

Symptom: Gateway panics on startup with:

Failed to initialize candle TextEmbedder (all-MiniLM-L6-v2)

Causes & Fixes:

| Cause | Fix |
|---|---|
| Model files not downloaded | Run once with internet access; candle auto-downloads to ~/.cache/huggingface/ |
| Corrupted model cache | Delete ~/.cache/huggingface/ and restart |
| Cache directory not writable (Permission denied (os error 13)) | Set HF_HOME (or ISARTOR_HF_CACHE_DIR) to a writable path (e.g. /tmp/huggingface). In Docker, mount a volume there: -e HF_HOME=/tmp/huggingface -v isartor-hf:/tmp/huggingface. |
| Insufficient memory | Ensure ≥ 256 MB available for the embedding model |

Address already in use

Symptom:

Error: error creating server listener: Address already in use (os error 48)

Fix:

# Find the process using port 8080
lsof -i :8080
# Kill it, or change the port:
export ISARTOR__HOST_PORT=0.0.0.0:9090

missing field or config deserialization errors

Symptom:

Error: missing field `layer2` in config

Fix: Ensure all required environment variables have the correct prefix and separator. Isartor uses double-underscore (__) as separator:

# Correct:
export ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081

# Wrong:
export ISARTOR_LAYER2_SIDECAR_URL=http://127.0.0.1:8081

See the Configuration Reference for the full list of variables.
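The double-underscore convention maps a flat environment variable onto a nested config key path. A simplified illustration of the rule (not Isartor's actual parser):

```python
def env_to_config_path(var: str, prefix: str = "ISARTOR"):
    """Split ISARTOR__A__B into the nested key path ["a", "b"]."""
    parts = var.split("__")
    if parts[0] != prefix:
        raise ValueError(f"missing {prefix} prefix: {var}")
    return [p.lower() for p in parts[1:] if p]

# ISARTOR__LAYER2__SIDECAR_URL -> ["layer2", "sidecar_url"]
```

With single underscores there is nothing to split on, so the variable never reaches the nested config field, which is why the "wrong" form above is silently ignored.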

Gateway auth / 401 Unauthorized

Symptom: All requests return 401 Unauthorized.

By default, gateway_api_key is empty and auth is disabled — you should not see 401 errors unless you (or your deployment) explicitly set ISARTOR__GATEWAY_API_KEY.

If you enabled auth by setting a key, every request must include it:

export ISARTOR__GATEWAY_API_KEY=your-secret-key

Common causes of unexpected 401s:

  • The key in your request header doesn't match ISARTOR__GATEWAY_API_KEY.
  • You forgot to include X-API-Key or Authorization: Bearer in the request.

Cache Issues

Low Cache Hit Rate

Symptom: Deflection rate below expected levels despite repeated traffic.

Diagnostic steps:

  1. Check cache mode:

    echo $ISARTOR__CACHE_MODE   # should be "both" for most workloads
    
  2. Check similarity threshold:

    echo $ISARTOR__SIMILARITY_THRESHOLD   # default: 0.85
    

    If too high (> 0.92), similar prompts won't match. Try lowering to 0.80.

  3. Check TTL:

    echo $ISARTOR__CACHE_TTL_SECS   # default: 300
    

    Short TTL evicts entries before they can be reused.

  4. Check Jaeger for cosine_similarity values on semantic cache spans. If scores are just below the threshold, lower it.

Stale Cache Responses

Symptom: Users receive outdated answers from cache.

Fix: Reduce TTL or restart the gateway to clear in-memory caches:

export ISARTOR__CACHE_TTL_SECS=60   # 1 minute

For Redis-backed caches, you can flush explicitly:

redis-cli -u $ISARTOR__REDIS_URL FLUSHDB

Redis Connection Refused

Symptom:

Layer 1a: Redis connection error — falling through

Diagnostic steps:

  1. Verify Redis is running:

    redis-cli -u $ISARTOR__REDIS_URL ping
    # Expected: PONG
    
  2. Check network connectivity (especially in Docker/K8s):

    # Inside the gateway container:
    curl -v telnet://redis:6379
    
  3. Verify the URL format:

    # Correct formats:
    export ISARTOR__REDIS_URL=redis://127.0.0.1:6379
    export ISARTOR__REDIS_URL=redis://user:password@redis.svc:6379/0
    
  4. Check Redis memory limit — if Redis is OOM, it will reject writes.

Fallback behaviour: When Redis is unreachable, Isartor falls through to the next layer. No data is lost, but deflection rate drops.

Cache Memory Growing Unbounded

Symptom: Gateway memory usage increases over time.

Fix: The in-memory cache uses bounded LRU eviction. Check:

echo $ISARTOR__CACHE_MAX_CAPACITY   # default: 10000

If set too high, reduce it. Each entry ≈ 2–4 KB, so 10K entries ≈ 20–40 MB.
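The back-of-envelope estimate above can be written out directly (a sketch; the per-entry size is the 2–4 KB assumption from the text, not a measured constant):

```python
def cache_memory_mb(max_capacity: int, kb_per_entry: float = 3.0) -> float:
    """Approximate in-memory cache footprint: entries x average entry size."""
    return max_capacity * kb_per_entry / 1024

# Default 10K entries at 2-4 KB each -> roughly 20-40 MB.
low = cache_memory_mb(10_000, 2.0)
high = cache_memory_mb(10_000, 4.0)
```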


Embedding & SLM Issues

Slow Embedding Generation

Symptom: L1b latency > 10 ms.

Causes & Fixes:

| Cause | Fix |
|---|---|
| CPU-bound contention | Increase CPU allocation for the container |
| Large prompt text | Embedder truncates to model max length (512 tokens), but longer text = more CPU |
| Cold start | First embedding call warms up the candle BertModel (~2 s). Subsequent calls are fast. |

SLM Sidecar Unreachable

Symptom:

Layer 2: Failed to connect to SLM sidecar — falling through

Diagnostic steps:

  1. Check if the sidecar is running:

    curl http://127.0.0.1:8081/v1/models
    
  2. Verify configuration:

    echo $ISARTOR__LAYER2__SIDECAR_URL   # default: http://127.0.0.1:8081
    
  3. Check the sidecar logs for errors (model loading, OOM, etc.).

  4. Increase timeout if the sidecar is slow:

    export ISARTOR__LAYER2__TIMEOUT_SECONDS=60
    

Fallback behaviour: When the SLM sidecar is unreachable, Isartor treats all requests as COMPLEX and forwards to Layer 3.

SLM Misclassification (Tiered: TEMPLATE / SNIPPET / COMPLEX)

The default classifier mode is tiered, which sorts requests into three categories instead of the legacy binary SIMPLE/COMPLEX split:

| Tier | Description |
|---|---|
| TEMPLATE | Config files, type definitions, documentation, boilerplate |
| SNIPPET | Short single-function code, simple middleware (<50 lines) |
| COMPLEX | Multi-file implementations, test suites, full endpoints |

TEMPLATE and SNIPPET requests are answered locally by the SLM; COMPLEX requests are forwarded to Layer 3. The legacy binary mode (SIMPLE/COMPLEX) is still available via ISARTOR__LAYER2__CLASSIFIER_MODE=binary.

An answer quality guard also rejects SLM answers that are too short (<10 chars) or start with uncertainty phrases, escalating them to Layer 3.
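In sketch form, a guard of that shape might look like this (the exact phrase list and thresholds below are illustrative, not Isartor's actual ones):

```python
# Illustrative uncertainty phrases; Isartor's real list may differ.
UNCERTAINTY_PREFIXES = ("i'm not sure", "i don't know", "it depends")

def accept_slm_answer(answer: str, min_chars: int = 10) -> bool:
    """Reject answers that are too short or open with an uncertainty phrase."""
    text = answer.strip().lower()
    if len(text) < min_chars:
        return False
    return not text.startswith(UNCERTAINTY_PREFIXES)
```

Rejected answers are escalated to Layer 3 rather than returned to the client.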

Symptom: Users receive low-quality answers for complex questions (misclassified as TEMPLATE/SNIPPET) or unnecessarily hit the cloud for simple ones.

Diagnostic steps:

  1. In Jaeger, search for router.decision attribute to see classification distribution across TEMPLATE, SNIPPET, and COMPLEX.

  2. Send known-simple and known-complex prompts and check the classification:

    curl -s -X POST http://localhost:8080/api/chat \
      -H "Content-Type: application/json" \
      -H "X-API-Key: $KEY" \
      -d '{"prompt": "Generate a tsconfig.json"}' | jq '.layer'
    # Expected: layer 2 (TEMPLATE)
    
  3. Consider switching to a larger SLM model for better classification accuracy.

  4. To fall back to the legacy binary classifier, set ISARTOR__LAYER2__CLASSIFIER_MODE=binary.

Embedded Candle Engine Errors

Symptom:

Layer 2: Embedded classification failed – falling through

Causes & Fixes:

| Cause | Fix |
|---|---|
| Model file missing | Set ISARTOR__EMBEDDED__MODEL_PATH to a valid GGUF file |
| Insufficient memory | Candle GGUF models need 1–4 GB RAM |
| Feature not compiled | Build with --features embedded-inference |

Cloud LLM Issues

502 Bad Gateway from Layer 3

Symptom: Requests that reach Layer 3 return 502.

Diagnostic steps:

  1. Check provider connectivity:

    curl -s $ISARTOR__EXTERNAL_LLM_URL \
      -H "Authorization: Bearer $ISARTOR__EXTERNAL_LLM_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}'
    
  2. Verify API key is valid and has quota.

  3. For Azure OpenAI, check deployment ID and API version:

    echo $ISARTOR__AZURE_DEPLOYMENT_ID
    echo $ISARTOR__AZURE_API_VERSION
    

Rate Limiting from Cloud Provider

Symptom: Intermittent 429 errors from the cloud LLM.

Fix:

  • Increase deflection rate (lower threshold, longer TTL) to reduce cloud traffic.
  • Request higher rate limits from your provider.
  • Implement client-side retry with exponential backoff (application level).
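A minimal client-side backoff sketch for the last point (generic retry wrapper; the delay base, cap, and jitter fraction are illustrative, and RuntimeError stands in for your HTTP client's 429 error):

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5, cap: float = 30.0):
    """Retry fn() on RuntimeError with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```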

Wrong Provider Configured

Symptom: Authentication errors or unexpected response formats.

Fix: Verify the provider matches the URL and API key:

# OpenAI
export ISARTOR__LLM_PROVIDER=openai

# Azure
export ISARTOR__LLM_PROVIDER=azure

# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic

# xAI
export ISARTOR__LLM_PROVIDER=xai

# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini

# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama

See the Configuration Reference for the full list of supported providers.


Observability Issues

No Traces in Jaeger

| Cause | Fix |
|---|---|
| Monitoring disabled | export ISARTOR__ENABLE_MONITORING=true |
| Wrong endpoint | export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317 |
| Collector not running | docker compose -f docker-compose.observability.yml up otel-collector |
| Firewall blocking gRPC | Ensure port 4317 is open between gateway and collector |

No Metrics in Prometheus

| Cause | Fix |
|---|---|
| Prometheus not scraping collector | Check prometheus.yml targets include otel-collector:8889 |
| Collector metrics pipeline broken | Verify otel-collector-config.yaml exports to Prometheus |
| No requests sent yet | Send a test request — metrics appear after first request |

Grafana Shows "No Data"

| Cause | Fix |
|---|---|
| Data source not configured | Add Prometheus source: URL http://prometheus:9090 |
| Wrong time range | Expand the time range in Grafana to cover the test period |
| Dashboard not provisioned | Check docker/grafana/provisioning/ paths are mounted |

Console Shows "OTel disabled" Despite Setting env var

Cause: Config file takes precedence, or the env var prefix is wrong.

Fix:

# Correct (double underscore):
export ISARTOR__ENABLE_MONITORING=true

# Wrong (single underscore):
export ISARTOR_ENABLE_MONITORING=true  # ❌ not picked up

Performance & Degraded Operation

High Tail Latency (P99 > 10 s)

Diagnostic steps:

  1. Check which layer is the bottleneck:

    histogram_quantile(0.99,
      sum by (le, layer_name) (
        rate(isartor_layer_duration_seconds_bucket[5m])
      )
    )
    
  2. Common causes:

    • L3 Cloud: provider is slow → switch to a faster model or provider.
    • L2 SLM: model inference is slow → use a smaller quantised model.
    • L1b Semantic: embedding is slow → check CPU contention.

Gateway OOM (Out of Memory)

Diagnostic steps:

  1. Check cache capacity:

    echo $ISARTOR__CACHE_MAX_CAPACITY
    
  2. Reduce capacity or switch to Redis backend.

  3. If using embedded SLM, check model size vs. container memory limit.

Requests Queuing / High Connection Count

Symptom: Clients see connection timeouts or slow responses even for cache hits.

Causes & Fixes:

| Cause | Fix |
|---|---|
| Too many concurrent requests | Scale horizontally (add replicas) |
| spawn_blocking pool exhaustion | Increase Tokio blocking threads: TOKIO_WORKER_THREADS=8 |
| SLM inference blocking async runtime | Ensure SLM runs on blocking pool (default in Isartor) |

Degraded Mode (SLM Down, Cache Only)

When the SLM sidecar is unreachable, Isartor automatically degrades:

  • L1a/L1b cache still works → cached requests are served.
  • L2 SLM → all requests treated as COMPLEX (regardless of classifier mode) → forwarded to L3.
  • Impact: Higher cloud costs, but no downtime.

Monitor with:

# If SLM layer stops resolving requests, something is wrong
sum(rate(isartor_requests_total{final_layer="L2_SLM"}[5m])) == 0

Docker & Deployment Issues

Docker Build Fails

Symptom: cargo build fails inside Docker.

Common fixes:

  • Ensure Dockerfile uses the correct Rust toolchain version.
  • For aws-lc-rs (TLS): install cmake, gcc, make in build stage.
  • Check that .dockerignore isn't excluding required files.

Container Can't Reach Host Services

Symptom: Gateway inside Docker can't connect to sidecar on localhost.

Fix: Use Docker network names or host.docker.internal:

# docker-compose.yml
environment:
  - ISARTOR__LAYER2__SIDECAR_URL=http://sidecar:8081   # service name
  # or for host:
  - ISARTOR__LAYER2__SIDECAR_URL=http://host.docker.internal:8081

Health Check Failing

Symptom: Orchestrator keeps restarting the container.

Fix: The health endpoint is GET /healthz. Ensure the health check matches:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
  interval: 10s
  timeout: 5s
  retries: 3

FAQ

Q: What is cache_mode and which should I use?

A: cache_mode controls which cache layers are active:

| Mode | What it does | Best for |
|---|---|---|
| exact | Only SHA-256 hash match | Deterministic agent loops |
| semantic | Only cosine similarity | Diverse user queries |
| both | Exact first, then semantic | Most workloads (default) |
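A sketch of what an exact-match key could look like, assuming a SHA-256 digest over the serialised model and messages (Isartor's actual key derivation may include additional fields):

```python
import hashlib
import json

def exact_cache_key(model: str, messages) -> str:
    """Deterministic key: identical model + messages always produce the same digest."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the digest is deterministic, an agent replaying the same request hits the same key every time, which is why exact mode traps agent loops so effectively.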

Q: What happens if Redis goes down?

A: Isartor gracefully falls through. The exact cache layer logs a warning and forwards the request downstream. No crash, no data loss. Deflection rate drops until Redis recovers, and more requests reach the cloud LLM (higher cost).

Q: Can I change the embedding model?

A: Yes. The in-process embedder uses candle with a pure-Rust BertModel, which supports multiple models. Set:

export ISARTOR__EMBEDDING_MODEL=bge-small-en-v1.5

The model is auto-downloaded on first startup. Note: changing the model invalidates the semantic cache (different embedding dimensions/space).

Q: How much does Isartor cost to run?

A: Isartor itself is free (Apache 2.0). The infrastructure cost depends on your deployment:

| Mode                                | Estimated Cost                       |
|-------------------------------------|--------------------------------------|
| Minimalist (single binary, no GPU)  | ~$5–15/month (small VM or container) |
| With SLM sidecar (CPU)              | ~$20–50/month (4-core VM)            |
| With SLM on GPU                     | ~$50–200/month (GPU instance)        |
| Enterprise (K8s + Redis + vLLM)     | ~$200–500/month                      |

The ROI comes from cloud LLM savings. At 70% deflection and $0.01/1K tokens, Isartor typically pays for itself within the first week.
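
A back-of-envelope version of that arithmetic (the traffic volume and token count are assumptions for illustration, not benchmarks):

```python
# assumed workload: 1M requests/month, 2K tokens each
requests_per_month = 1_000_000
tokens_per_request = 2_000
price_per_1k_tokens = 0.01   # $/1K tokens, as in the figure above
deflection_rate = 0.70

monthly_llm_spend = requests_per_month * tokens_per_request / 1_000 * price_per_1k_tokens
monthly_savings = monthly_llm_spend * deflection_rate

print(f"baseline spend:  ${monthly_llm_spend:,.0f}/month")
print(f"saved at 70% deflection: ${monthly_savings:,.0f}/month")
```

Against those assumed numbers, even the enterprise deployment tier (~$200–500/month) is recovered in the first days of operation.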

Q: Is Isartor production-ready?

A: Isartor is designed for production use with:

  • ✅ Bounded, concurrent caches (no unbounded memory growth)
  • ✅ Graceful degradation (every layer has a fallback)
  • ✅ OpenTelemetry observability (traces, metrics, structured logs)
  • ✅ Health check endpoint (/healthz)
  • ✅ Configurable via environment variables (12-factor app)
  • ✅ Integration tests covering all middleware layers

For enterprise deployments, use Redis-backed caches and a production Kubernetes cluster. See the Enterprise Guide.

Q: Can I use Isartor with LangChain / LlamaIndex / AutoGen?

A: Yes. Isartor exposes an OpenAI-compatible API. Point any SDK at the gateway URL:

import openai
client = openai.OpenAI(
    base_url="http://your-isartor-host:8080/v1",
    api_key="your-gateway-key",
)

See Integrations for full examples.

Q: How do I upgrade Isartor?

A:

# Binary
cargo install --path . --force

# Docker
docker pull ghcr.io/isartor-ai/isartor:latest
docker compose up -d --pull always

In-memory caches are cleared on restart. Redis caches persist.

Q: Why does isartor update or GitHub access fail with localhost:8081 / Connection refused after I stopped Isartor?

A: Your shell likely still has proxy environment variables from a prior isartor connect ... session, so non-Isartor commands are still trying to reach GitHub through the local CONNECT proxy on localhost:8081.

Fix on macOS / Linux:

unset HTTPS_PROXY HTTP_PROXY ALL_PROXY https_proxy http_proxy all_proxy
unset NODE_EXTRA_CA_CERTS SSL_CERT_FILE REQUESTS_CA_BUNDLE
unset ISARTOR_COPILOT_ENABLED ISARTOR_ANTIGRAVITY_ENABLED

Then confirm the shell is clean:

env | grep -i proxy

You can also clean up client-side configuration:

isartor connect copilot --disconnect
isartor connect claude --disconnect
isartor connect antigravity --disconnect

Q: Why does isartor update fail with Permission denied (os error 13)?

A: Your current isartor binary is installed in a system-managed directory.

Recommended fix: move to a user-writable install location:

mkdir -p ~/.local/bin
cp /usr/local/bin/isartor ~/.local/bin/isartor
chmod +x ~/.local/bin/isartor
export PATH="$HOME/.local/bin:$PATH"
hash -r

Then confirm: which isartor

Q: Why does isartor keep my terminal busy?

A: isartor runs the API gateway in the foreground by default. Start in detached mode:

isartor up --detach

Stop later with: isartor stop

Q: How do I monitor deflection rate in real-time?

A: Use the Grafana dashboard included in dashboards/prometheus-grafana.json or the PromQL query:

1 - (
  sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[5m]))
  /
  sum(rate(isartor_requests_total[5m]))
)
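
The same arithmetic can be checked offline against raw per-layer request counts (a sketch; the layer label values mirror the final_layer label used in the metric above):

```python
def deflection_rate(requests_by_layer: dict) -> float:
    # deflection rate = 1 - (requests forwarded to cloud / total requests)
    total = sum(requests_by_layer.values())
    cloud = requests_by_layer.get("L3_Cloud", 0)
    return 1 - cloud / total

# 800 of 1000 requests resolved before the cloud layer -> 0.8
print(deflection_rate({
    "L1a_Exact": 500,
    "L1b_Semantic": 200,
    "L2_SLM": 100,
    "L3_Cloud": 200,
}))
```
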

Q: Can I run Isartor without any cloud LLM?

A: Partially. Layers 1 and 2 work standalone (cache + SLM). Layer 3, however, requires a cloud LLM API key; without one, uncached COMPLEX requests return a 502 error. For fully local operation, ensure your SLM can handle all traffic by configuring the router to classify aggressively toward SIMPLE.


See also: Performance Tuning · Metrics & Tracing · Configuration Reference

Why Most LLM Gateways Can't Pass a FedRAMP Review

Published on the Isartor blog — targeting platform engineers and security architects at regulated enterprises.


The CISO's Nightmare

Picture this: a CISO at a federal agency is six months into an LLM gateway evaluation. The vendor has given assurances — "our gateway is secure, all data stays in your environment." The compliance team runs a network capture during the proof-of-concept. Three unexpected domains light up:

  • telemetry.vendor.io — anonymous usage metrics
  • license.vendor.io — license key validation on every startup
  • registry.vendor.io — model version checks

The FedRAMP audit fails. The project is cancelled. Six months of engineering work discarded because nobody examined the gateway's egress behaviour before the evaluation began.

This is not a hypothetical. It happens routinely in regulated environments. The mistake is usually honest — gateway teams build their products for cloud-native deployments and add telemetry and license checks as an afterthought, without thinking about what happens when those systems need to run in an air-gapped facility.


The Hidden Phone-Home Problem

Most LLM gateways have outbound connection patterns that are not documented in their README. Let's be specific about what these are and why each one is a blocker in a FedRAMP or HIPAA environment:

License validation servers. A gateway that validates its licence key against a remote server cannot operate in a network segment with no outbound internet access. Worse, the validation traffic typically contains the licence key and the server's hostname — both of which may be considered sensitive data in a classified environment. Under FedRAMP Moderate, SC-7 (Boundary Protection) requires that external connections be explicitly authorised and documented. An undocumented licence-check endpoint fails this control.

Anonymous usage telemetry. Many open-source gateways ship with opt-out telemetry that sends aggregate usage statistics to the developer's servers. Even "anonymous" telemetry can include prompt length distributions, model names, or error rates that a regulated environment may consider sensitive. Under HIPAA, any data that could be used to identify a patient — including metadata about the prompts that process PHI — must stay within the covered entity's environment.

Model registry lookups. Gateways that support automatic model updates or capability discovery make outbound calls to check for new model versions. In an air-gapped environment, there is no path for these calls to succeed — and if the gateway blocks on a registry timeout, latency spikes cascade through the application.

OTel exporters enabled by default. OpenTelemetry is essential for observability, but a gateway that ships with OTLP_EXPORTER_ENDPOINT pointing at a cloud-hosted collector creates a data exfiltration risk. Trace data contains prompt content, response content, latency, and error messages. An OTel exporter sending this to an external endpoint in a HIPAA environment would be a reportable breach.

Each of these problems has the same root cause: the gateway was designed for cloud-native deployments and retrofitted for security requirements, rather than designed with air-gap constraints from the start.


What "Truly Air-Gapped" Actually Means

A gateway that can genuinely pass an air-gap review must satisfy three requirements:

1. A static binary with no runtime dependencies. Every runtime dependency — a Python interpreter, a Node.js runtime, a JVM — is a potential attack surface and a source of unexpected network calls. A statically compiled binary eliminates the entire class of "your dependency phoned home without you knowing" vulnerabilities. It also eliminates the download-on-first-run pattern where models or plugins are fetched from the internet when the gateway starts.

2. Offline licence validation. Licence validation must work without a network call. The correct approach is HMAC-based offline validation: the licence key embeds a cryptographic tag that the binary verifies locally against a key baked in at compile time. No server call required. No licence-check traffic to document in your FedRAMP boundary diagram.
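
The scheme can be sketched in a few lines (a hypothetical illustration; the key, licence format, and field names are assumptions, not Isartor's actual scheme):

```python
import hashlib
import hmac

# key embedded at build time -- validation never touches the network
BUILD_TIME_KEY = b"embedded-at-compile-time"

def issue(payload: str) -> str:
    # vendor side: append an HMAC tag to the licence payload
    tag = hmac.new(BUILD_TIME_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{tag}"

def validate(licence: str) -> bool:
    # binary side: recompute the tag locally and compare in constant time
    payload, _, tag = licence.rpartition(".")
    expected = hmac.new(BUILD_TIME_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

lic = issue("org=acme;tier=enterprise;exp=2027-01-01")
print(validate(lic))                            # True
print(validate(lic.replace("acme", "evil")))    # False: payload tampered
```

Because verification is a local recomputation, there is no licence server to list in the boundary diagram and no validation traffic to capture.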

3. All models bundled — no download on first run. Any model that is downloaded at runtime creates a bootstrap dependency on internet connectivity. For an air-gapped deployment, all models must be available in the container image (or on a mounted volume) before the gateway starts. This is non-negotiable for environments where the deployment system has no outbound internet access at all.

Isartor is designed to meet all three requirements. The binary is compiled against Rust's x86_64-unknown-linux-musl target, producing a fully static binary with zero shared library dependencies. Licence validation uses offline HMAC verification. The latest-airgapped Docker image pre-bundles (or pre-caches) all embedding models, so that once the image is transferred into the air-gapped environment and ISARTOR__OFFLINE_MODE=true is set, no model downloads or outbound internet access are required at runtime.


The Configuration

Here is the complete environment variable configuration for a compliant air-gapped deployment of Isartor in front of a self-hosted vLLM instance:

# ── Air-gap enforcement ──────────────────────────────────────────────
# Block all outbound cloud connections at the application layer.
export ISARTOR__OFFLINE_MODE=true

# ── Internal LLM routing (L3) ────────────────────────────────────────
# Route surviving cache-misses to your internal model server.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai          # vLLM exposes OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct

# ── Observability (internal collector only) ──────────────────────────
export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.internal.corp:4317

Running isartor connectivity-check with this configuration produces:

Isartor Connectivity Audit
──────────────────────────
Required (L3 cloud routing):
  → http://vllm.internal.corp:8000/v1  [CONFIGURED]
    (BLOCKED — offline mode active)

Optional (observability / monitoring):
  → http://otel-collector.internal.corp:4317  [CONFIGURED]

Internal only (no external):
  → (in-memory cache — no network connection)  [CONFIGURED - internal]

Zero hidden telemetry connections: ✓ VERIFIED
Air-gap compatible: ✓ YES (L3 disabled or offline mode active)

This output is the screenshot your compliance team needs. Every connection Isartor makes is explicit, documented, and internal.


The FedRAMP Control Mapping

Understanding how a deployment posture maps to specific NIST 800-53 controls is what separates a security claim from a security argument. Here are the four controls most directly supported by Isartor's air-gapped deployment posture:

AU-2 (Audit Logging): AU-2 requires that the system generate audit records for events relevant to security. Isartor logs every prompt, every deflection decision, and every L3 forwarding event as a structured JSON record with a distributed tracing span. The logs include the layer that handled the request (L1a, L1b, L2, L3), the latency, and whether the request was deflected or forwarded. These records can be ingested by any SIEM that accepts JSON log streams.
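
A record of the kind described above might look like this (a hedged sketch; the field names are illustrative, not Isartor's exact log schema):

```python
import json
import time
import uuid

def audit_record(layer: str, deflected: bool, latency_ms: float) -> str:
    # one structured JSON record per request, suitable for SIEM ingestion
    return json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trace_id": uuid.uuid4().hex,        # ties into the distributed trace
        "final_layer": layer,                # L1a, L1b, L2, or L3
        "deflected": deflected,
        "latency_ms": latency_ms,
    })

record = json.loads(audit_record("L1a", True, 0.4))
print(record["final_layer"], record["deflected"])
```
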

SC-7 (Boundary Protection): SC-7 requires the system to monitor and control communications at external boundary points. ISARTOR__OFFLINE_MODE=true implements a hard application-layer block on all outbound connections to non-internal endpoints. This is verified by the phone-home audit test in tests/phone_home_audit.rs, which runs on every commit to main in CI. The CI badge on the repository proves continuous enforcement.

SI-4 (Information System Monitoring): SI-4 requires monitoring of the information system to detect attacks and indicators of compromise. Isartor's OpenTelemetry integration exports traces and metrics to an internal collector. The deflection stack metrics — cache hit rate, L3 call rate, latency per layer — provide a real-time signal that can be baselined and alerted on. An anomalous spike in L3 calls could indicate a cache poisoning attempt.

CM-6 (Configuration Settings): CM-6 requires the organisation to establish and document configuration settings. Every Isartor configuration parameter is controlled by an environment variable with a documented default and a documented security implication. The ISARTOR__OFFLINE_MODE flag, in particular, has a documented effect: it is a single switch that moves the system from "possibly communicates with cloud" to "provably does not communicate with cloud."


Call to Action

If you are a platform engineer or security architect at a regulated enterprise evaluating LLM gateway options, start here:

  1. Read the Air-Gapped Deployment Guide for the complete pre-deployment checklist.
  2. Pull ghcr.io/isartor-ai/isartor:latest-airgapped and run isartor connectivity-check in your environment.
  3. Review the phone-home audit test to understand exactly what is being verified in CI.
  4. Open an issue on GitHub if you have compliance requirements not covered here — FedRAMP High, IL5, ITAR, and sector-specific requirements are all on the roadmap.

The binary that passes your network capture is the binary that passes your FedRAMP review.