Welcome to Isartor
Open-source Prompt Firewall — deflect up to 95% of redundant LLM traffic before it leaves your infrastructure.
Pure Rust · Single Binary · Zero Hidden Telemetry · Air-Gappable
AI coding agents and personal assistants repeat themselves — a lot. Copilot, Claude Code, Cursor, and OpenClaw send the same system instructions, the same context preambles, and often the same user prompts across every turn. Standard API gateways forward all of it to cloud LLMs regardless.
Isartor sits between your tools and the cloud. It intercepts every prompt and runs a cascade of local algorithms — from sub-millisecond hashing to in-process neural inference — to resolve requests before they reach the network. Only the genuinely hard prompts make it through.
The Deflection Stack
Every incoming request passes through a sequence of smart computing layers. Only prompts requiring genuine, complex reasoning survive the stack to reach the cloud.
```
Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit               │ hit                │ simple           │ compressed               │
                 ▼                   ▼                    ▼                  ▼                          ▼
              Response            Response          Local Response     Optimised Prompt          Cloud Response
```
| Layer | What It Does | Typical Latency |
|---|---|---|
| L1a — Exact Cache | Sub-millisecond duplicate detection via fast hashing. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Catches meaning-equivalent prompts ("Price?" ≈ "Cost?") using pure-Rust embeddings. | 1–5 ms |
| L2 — SLM Router | Triages intent with an embedded Small Language Model to resolve simple tasks locally. | 50–200 ms |
| L2.5 — Context Optimiser | Compresses repeated instruction payloads (CLAUDE.md, copilot-instructions) via session dedup and minification. | < 1 ms |
| L3 — Cloud Logic | Routes surviving complex prompts to OpenAI, Anthropic, or Azure with fallback resilience. | Network-bound |
Layers 1a and 1b deflect 71% of repetitive agentic traffic and 38% of diverse task traffic before any neural inference runs.
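The cascade's short-circuit behaviour can be sketched in a few lines of Python (the layer logic is a toy stand-in, not Isartor's actual code):

```python
# Toy sketch of a short-circuiting deflection cascade.
# Layer logic is illustrative, not Isartor's actual implementation.

def exact_cache(prompt, state):
    # L1a: serve an exact repeat straight from the cache
    return state["cache"].get(prompt)

def slm_router(prompt, state):
    # L2: pretend simple "Calculate ..." prompts resolve locally
    if prompt.strip().startswith("Calculate"):
        return "4"
    return None

def cloud(prompt, state):
    # L3: placeholder for the upstream LLM call
    return f"cloud-answer({prompt})"

def deflect(prompt, state):
    for layer in (exact_cache, slm_router, cloud):
        response = layer(prompt, state)
        if response is not None:          # first hit short-circuits the rest
            state["cache"][prompt] = response
            return response

state = {"cache": {}}
deflect("Calculate 2+2", state)   # resolved locally by the SLM layer
deflect("Explain BGP", state)     # falls through to the cloud layer
deflect("Explain BGP", state)     # now served by the exact cache
```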
How It Works
Getting started with Isartor takes three steps:
1. Install
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Or use Docker:
docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest
2. Connect
Point any OpenAI-compatible client at Isartor — just change the base URL:
import openai
client = openai.OpenAI(
base_url="http://localhost:8080/v1",
api_key="your-api-key",
)
Works with the official SDKs, LangChain, LlamaIndex, AutoGen, GitHub Copilot, OpenClaw, and any other OpenAI-compatible tool.
Recent OpenAI-compatible improvements for coding agents include:
- `GET /v1/models` for model discovery
- `stream: true` support on `/v1/chat/completions` with proper SSE chunks
- `tools`, `tool_choice`, `functions`, and `function_call` passthrough
- `tool_calls` preserved in upstream responses
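As an illustration, a request body exercising streaming and tool passthrough might look like this (the model name and tool schema are invented for the example):

```python
import json

# Illustrative /v1/chat/completions body using streaming and
# OpenAI-style function calling. The tool name and schema are
# made up for the example; the field names follow the OpenAI API.
payload = {
    "model": "gpt-4",
    "stream": True,                    # responses arrive as SSE chunks
    "messages": [{"role": "user", "content": "What's the weather in Munich?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",             # forwarded to the upstream provider
}
body = json.dumps(payload)
```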
3. Save
Isartor deflects repetitive and simple prompts locally. You keep the same responses, pay for fewer tokens, and get lower latency — with zero code changes beyond the URL.
Explore the Docs
🚀 Getting Started Install Isartor and send your first request.
🔌 Integrations Connect Copilot CLI, Cursor, Claude Code, and more.
📦 Deployment From a single binary to a multi-replica K8s cluster.
⚙️ Configuration Every environment variable and config key.
🏗️ Architecture Deep dive into the Deflection Stack and trait providers.
📊 Observability OpenTelemetry traces, Prometheus metrics, Grafana dashboards.
Installation
Isartor ships as a single statically linked binary — no runtime dependencies required.
macOS / Linux — Single Command (Recommended)
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Docker
The image ships a statically linked isartor binary and downloads the embedding model on first start (then reuses the on-disk hf-hub cache). No API key is needed for the cache layers.
docker run -p 8080:8080 ghcr.io/isartor-ai/isartor:latest
To persist the model cache across restarts (recommended):
docker run -p 8080:8080 \
-e HF_HOME=/tmp/huggingface \
-v isartor-hf:/tmp/huggingface \
ghcr.io/isartor-ai/isartor:latest
To use Azure OpenAI for Layer 3, set the provider variables below (recommended: pass secrets via Docker `*_FILE` variables). Important: `ISARTOR__EXTERNAL_LLM_URL` must be the base Azure endpoint only (no `/openai/...` path), e.g. `https://<resource>.openai.azure.com`:
# Put your key in a file (no trailing newline is ideal, but Isartor trims whitespace)
echo -n "YOUR_AZURE_OPENAI_KEY" > ./azure_openai_key
docker run -p 8080:8080 \
-e ISARTOR__LLM_PROVIDER=azure \
-e ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com \
-e ISARTOR__AZURE_DEPLOYMENT_ID=<deployment> \
-e ISARTOR__AZURE_API_VERSION=2024-08-01-preview \
-e ISARTOR__EXTERNAL_LLM_API_KEY_FILE=/run/secrets/azure_openai_key \
-v $(pwd)/azure_openai_key:/run/secrets/azure_openai_key:ro \
ghcr.io/isartor-ai/isartor:latest
The startup banner appears after all layers are ready (< 30 s on a modern machine).
Image size: ~120 MB compressed / ~260 MB on disk (includes the `all-MiniLM-L6-v2` embedding model and the statically linked Rust binary).
Windows (PowerShell) — Single Command
irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex
Build from Source
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
./target/release/isartor up
Requires Rust 1.75 or later.
Verify Installation
Check that the binary is available:
isartor --version
Run the built-in demo. It works without an API key, but if you configure a provider first it also shows a live upstream round-trip:
isartor set-key -p groq
isartor check
isartor demo
Verify the health endpoint:
curl http://localhost:8080/health
# {"status":"ok","version":"0.1.0","layers":{...},"uptime_seconds":5,"demo_mode":true}
Quick Start
This guide walks you through starting Isartor, making your first request, observing a cache hit, and checking stats. If you haven't installed Isartor yet, see the Installation guide.
Starting Isartor
isartor up # start the API gateway only
isartor up --detach # start in background and return to the shell
isartor up copilot # start gateway + CONNECT proxy for Copilot CLI
Other useful commands:
isartor init # generate a commented config scaffold
isartor set-key -p openai # configure your LLM provider API key
isartor check # verify provider/model/key masking and live connectivity
isartor demo # run the post-install showcase
isartor stop # stop a running Isartor instance (uses PID file)
isartor update # self-update to the latest version from GitHub releases
Making Your First Request
Isartor exposes an OpenAI-compatible API. Send a request to the /v1/chat/completions endpoint:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma-2-2b-it",
"messages": [
{"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
]
}'
Expected JSON Response (snippet):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "The quantum Hall effect is a phenomenon..."
}
}
],
"usage": { ... }
}
Console Log (snippet):
INFO [cache] Layer 1a miss: quantum Hall effect prompt
INFO [slm_triage] Layer 3 fallback: OpenAI
The first request is a cache miss — Layer 2 triages it and Layer 3 routes it to your configured cloud provider.
OpenAI-compatible clients can also:
- call `GET /v1/models` to discover the configured model
- send `"stream": true` and receive OpenAI-style SSE responses
- use tool/function calling fields such as `tools`, `tool_choice`, and `functions`
You can also use the native API:
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Calculate 2+2"}'
Seeing a Cache Hit
Repeat the same request:
curl -X POST http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma-2-2b-it",
"messages": [
{"role": "user", "content": "Explain the quantum Hall effect in detail, including its significance for condensed matter physics and any applications in modern technology."}
]
}'
Expected JSON Response (snippet):
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "The quantum Hall effect is a phenomenon..."
}
}
],
"usage": { ... }
}
Console Log (snippet):
INFO [cache] Layer 1a exact match: quantum Hall effect prompt
INFO [slm_triage] Short-circuit: cache hit
This time the response comes from the Layer 1a exact cache — sub-millisecond, zero tokens consumed, no cloud call.
Checking Stats
View prompt totals, layer hit rates, and recent routing history:
isartor stats
Connecting an AI Tool
Isartor works as a drop-in replacement for any OpenAI-compatible client. Point your favourite AI tool at http://localhost:8080/v1 and it will route through the Deflection Stack automatically.
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Summarise this document."}],
)
If your client probes models first, this also works:
curl -sS http://localhost:8080/v1/models
For detailed setup guides for GitHub Copilot CLI, Claude Code, Cursor, and other tools, see the Integrations section.
For advanced configuration, see the Configuration Reference and Architecture.
Architecture
Pattern: Hexagonal Architecture (Ports & Adapters)
Location: `src/core/`, `src/adapters/`, `src/factory.rs`
High-Level Overview
Isartor is an AI Prompt Firewall that intercepts LLM traffic and routes it through a multi-layer Deflection Stack. Each layer can short-circuit and return a response without reaching the cloud, dramatically reducing cost and latency.
For a detailed breakdown of the deflection layers, see the Deflection Stack page.
```mermaid
flowchart TD
    A[Request] --> B[Auth]
    B --> C[Cache L1a: LRU/Redis]
    C --> D[Cache L1b: Candle/TEI]
    D --> E[SLM Router: Candle/vLLM]
    E --> F[Context Optimiser: CompressionPipeline]
    F --> G[Cloud Fallback: OpenAI/Anthropic]
    G --> H[Response]
    subgraph F_detail [L2.5 CompressionPipeline]
        direction LR
        F1[ContentClassifier] --> F2[DedupStage]
        F2 --> F3[LogCrunchStage]
    end
```
Pluggable Trait Provider Pattern
All layers are implemented as Rust traits and adapters. Backends are selected at startup via ISARTOR__ environment variables — no code changes or recompilation required.
Rather than feature-flag every call-site, we define Ports (trait interfaces in src/core/ports.rs) and swap the concrete Adapter at startup. This keeps the Deflection Stack logic completely agnostic to the backing implementation.
| Component | Minimalist (Single Binary) | Enterprise (K8s) |
|---|---|---|
| L1a Exact Cache | In-memory LRU (ahash + parking_lot) | Redis cluster (shared across replicas) |
| L1b Semantic Cache | In-process candle BertModel | External TEI sidecar (optional) |
| L2 SLM Router | Embedded candle GGUF inference | Remote vLLM / TGI server (GPU pool) |
| L2.5 Context Optimiser | In-process CompressionPipeline (classifier → dedup → log_crunch) | In-process CompressionPipeline (extensible with custom stages) |
| L3 Cloud Logic | Direct to OpenAI / Anthropic | Direct to OpenAI / Anthropic |
Adding a New Adapter
1. Define the struct in `src/adapters/cache.rs` or `src/adapters/router.rs`.
2. Implement the port trait (`ExactCache` or `SlmRouter`).
3. Add a variant to the config enum (`CacheBackend` or `RouterBackend`) in `src/config.rs`.
4. Wire it in `src/factory.rs` with a new `match` arm.
5. Write tests — each adapter module has a `#[cfg(test)] mod tests`.
No other files need to change. The middleware and pipeline code operate only on Arc<dyn ExactCache> / Arc<dyn SlmRouter>.
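The same ports-and-adapters shape can be sketched in Python, with a Protocol standing in for the Rust trait (names mirror the docs, but this is not Isartor's code):

```python
from typing import Optional, Protocol

# Sketch of the port/adapter pattern: the pipeline depends only on
# the port (Protocol); the factory is the one place that knows
# concrete adapters. Illustrative only, not Isartor's actual code.

class ExactCache(Protocol):                    # the "port"
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...

class InMemoryCache:                           # a concrete "adapter"
    def __init__(self) -> None:
        self._store: dict = {}
    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)
    def put(self, key: str, value: str) -> None:
        self._store[key] = value

def build_exact_cache(backend: str) -> ExactCache:
    # factory: selects the adapter from configuration at startup
    if backend == "memory":
        return InMemoryCache()
    raise ValueError(f"unknown cache backend: {backend}")

cache = build_exact_cache("memory")
cache.put("prompt", "response")
```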
Scalability Model (3-Tier)
Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. The same binary serves all three tiers; the runtime behaviour is entirely configuration-driven.
```
Level 1 (Edge)         Level 2 (Compose)       Level 3 (K8s)
┌─────────────────┐    ┌─────────────────┐     ┌─────────────────┐
│ Single Process  │    │ Firewall + GPU  │     │ N Firewall Pods │
│ memory cache    │──▶ │ Sidecar         │──▶  │ + Redis Cluster │
│ embedded candle │    │ memory cache    │     │ + vLLM Pool     │
│ context opt.    │    │ (optional)      │     │ (optional)      │
└─────────────────┘    └─────────────────┘     └─────────────────┘
```
Key insight: Switching to cache_backend=redis unlocks true multi-replica scaling. Without it, each firewall pod maintains an independent cache.
See the deployment guides for tier-specific setup.
Directory Layout
```
src/
├── core/
│   ├── mod.rs                  # Re-exports
│   ├── ports.rs                # Trait interfaces (ExactCache, SlmRouter)
│   └── context_compress.rs     # Re-export shim (backward compat)
├── adapters/
│   ├── mod.rs                  # Re-exports
│   ├── cache.rs                # InMemoryCache, RedisExactCache
│   └── router.rs               # EmbeddedCandleRouter, RemoteVllmRouter
├── compression/
│   ├── mod.rs                  # Re-exports all pipeline types
│   ├── pipeline.rs             # CompressionPipeline executor + CompressionStage trait
│   ├── cache.rs                # InstructionCache (per-session dedup state)
│   ├── optimize.rs             # Request body rewriting (JSON → pipeline → reassembly)
│   └── stages/
│       ├── content_classifier.rs  # Gate: instruction vs conversational
│       ├── dedup.rs               # Cross-turn instruction dedup
│       └── log_crunch.rs          # Static minification
├── middleware/
│   └── context_optimizer.rs    # L2.5 Axum middleware
├── factory.rs                  # build_exact_cache(), build_slm_router()
└── config.rs                   # CacheBackend, RouterBackend enums + AppConfig
```
See Also
- Deflection Stack — detailed layer-by-layer breakdown
- Architecture Decision Records — rationale behind key design choices
- Configuration Reference
The Deflection Stack
Every incoming request passes through a sequence of smart computing layers. Only prompts requiring genuine, complex reasoning survive the Deflection Stack to reach the cloud.
```
Request ──► L1a Exact Cache ──► L1b Semantic Cache ──► L2 SLM Router ──► L2.5 Context Optimiser ──► L3 Cloud Logic
                 │ hit               │ hit                │ simple           │ compressed               │
                 ▼                   ▼                    ▼                  ▼                          ▼
              Response            Response          Local Response     Optimised Prompt          Cloud Response
```
Layers at a Glance
| Layer | Algorithm / Mechanism | What It Does | Typical Latency |
|---|---|---|---|
| L1a — Exact Cache | Fast Hashing (ahash) | Sub-millisecond duplicate detection. Traps infinite agent loops instantly. | < 1 ms |
| L1b — Semantic Cache | Cosine Similarity (Embeddings) | Computes mathematical meaning via pure-Rust candle models (all-MiniLM-L6-v2) to catch variations ("Price?" ≈ "Cost?"). | 1–5 ms |
| L2 — SLM Router | Neural Classification (LLM) | Triages intent using an embedded Small Language Model (e.g. Qwen-1.5B) to resolve simple data extraction tasks. | 50–200 ms |
| L2.5 — Context Optimiser | Instruction Dedup + Minify | Compresses repeated instruction files (CLAUDE.md, copilot-instructions.md) via session dedup and static minification to reduce cloud input tokens. | < 1 ms |
| L3 — Cloud Logic | Load Balancing & Retries | Routes surviving complex prompts to OpenAI, Anthropic, or Azure, with built-in fallback resilience. | Network-bound |
Layers 1a and 1b deflect 71% of repetitive agentic traffic (FAQ/agent loop patterns) and 38% of diverse task traffic before any neural inference runs.
Layer Details
L1a — Exact Cache
Algorithm: Fast hashing with ahash
L1a is the first line of defence. It computes a hash of the incoming prompt and checks it against an in-memory LRU cache (single-binary mode) or a shared Redis cluster (enterprise mode).
- Hit: Returns the cached response immediately (sub-millisecond).
- Miss: The request continues to L1b.
Cache keys are namespaced before hashing (native|prompt, openai|prompt, anthropic|prompt, etc.) to ensure one endpoint never returns another endpoint's response schema. On a cache hit, ChatResponse.layer is normalised to 1 regardless of which layer originally produced the response.
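The namespacing idea can be sketched as follows (Isartor hashes with ahash; sha256 stands in here, and the key format is illustrative):

```python
import hashlib

# Illustrative namespaced cache keys: the same prompt yields a
# different key per API surface, so one endpoint can never be served
# another endpoint's response schema. sha256 stands in for ahash.
def cache_key(namespace: str, prompt: str) -> str:
    return hashlib.sha256(f"{namespace}|{prompt}".encode()).hexdigest()

k_native = cache_key("native", "Explain BGP")
k_openai = cache_key("openai", "Explain BGP")
assert k_native != k_openai        # same prompt, separate namespaces
```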
| Mode | Implementation |
|---|---|
| Minimalist | In-memory LRU (ahash + parking_lot) |
| Enterprise | Redis cluster (shared across replicas, async redis crate) |
L1b — Semantic Cache
Algorithm: Cosine similarity over sentence embeddings (all-MiniLM-L6-v2)
L1b catches semantically equivalent prompts that differ in wording. A sentence embedding is computed for the incoming prompt using a pure-Rust candle BertModel, then compared against the vector cache using cosine similarity.
- Hit (similarity above threshold): Returns the cached response (1–5 ms).
- Miss: The request continues to L2.
Embedding pipeline:
- Model: `sentence-transformers/all-MiniLM-L6-v2` — 384-dimensional embeddings (~90 MB).
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: `BertModel` is wrapped in `std::sync::Mutex`; inference runs on `tokio::task::spawn_blocking`.
- Architecture: `TextEmbedder` is initialised once at startup, stored as `Arc<TextEmbedder>` in `AppState`.
The vector cache is maintained in tandem with exact cache entries. Insertions and evictions update the index automatically, providing sub-millisecond vector search latency for thousands of embeddings.
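The pooling and similarity arithmetic reduces to a few lines; here is a toy sketch with 3-dimensional vectors (the real model emits 384 dimensions, and the 0.85 threshold is illustrative):

```python
import math

# Toy sketch of the L1b math: mean pooling with an attention mask,
# L2 normalisation, then cosine similarity against cached vectors.
# Vectors are 3-d toys; real embeddings are 384-dimensional.

def mean_pool(token_vecs, mask):
    n = sum(mask)
    return [sum(v[i] for v, m in zip(token_vecs, mask) if m) / n
            for i in range(len(token_vecs[0]))]

def l2_normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # plain dot product, since both inputs are already unit-length
    return sum(x * y for x, y in zip(a, b))

tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]                   # padding token excluded by the mask
emb = l2_normalise(mean_pool(tokens, mask))
hit = cosine(emb, emb) >= 0.85     # threshold check (illustrative value)
```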
| Mode | Implementation |
|---|---|
| Minimalist | In-process candle BertModel |
| Enterprise | External TEI sidecar (optional) |
L2 — SLM Router
Algorithm: Neural classification via Small Language Model
L2 runs a lightweight language model to classify the prompt's intent. Simple requests (data extraction, FAQ-style queries) can be resolved locally without reaching the cloud.
- Simple intent: Returns a locally generated response (50–200 ms).
- Complex intent: The request continues to L2.5.
- Disabled (`enable_slm_router = false`): Layer is a no-op; the request falls through to L3.
| Mode | Implementation |
|---|---|
| Minimalist | Embedded candle GGUF inference (e.g. Gemma-2-2B-IT, CPU) |
| Enterprise | Remote vLLM / TGI server (GPU pool) |
L2.5 — Context Optimiser
Algorithm: CompressionPipeline — Modular staged compression
Agentic coding tools (Copilot, Claude Code, Cursor) send large instruction files (CLAUDE.md, copilot-instructions.md, skills blocks) with every turn. L2.5 detects and compresses these payloads before they reach the cloud, saving input tokens on every L3 call.
Pipeline architecture (src/compression/):
L2.5 uses a modular CompressionPipeline with pluggable stages that execute in
order. Each stage is a stateless CompressionStage trait object. If a stage sets
short_circuit = true, subsequent stages are skipped.
Built-in stages (run in order):
- ContentClassifier — Gate stage: detects instruction vs conversational content. Short-circuits on conversational messages so downstream stages skip work.
- DedupStage — Session-aware cross-turn deduplication. Hashes instruction content per session; on repeat turns, replaces with a compact hash reference. Short-circuits on dedup hit.
- LogCrunchStage — Static minification: strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
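A minimal sketch of the stage contract and short-circuit flow (stage logic drastically simplified; the real implementations live in src/compression/stages/):

```python
import hashlib

# Drastically simplified sketch of the CompressionPipeline contract:
# stages run in order, and any stage may short-circuit the rest.
# Each stage returns (text, short_circuit).

def classifier(text, session):
    # Gate: toy heuristic — treat "#"-headed text as instructions,
    # short-circuit (skip downstream work) on conversational content
    if not text.lstrip().startswith("#"):
        return text, True
    return text, False

def dedup(text, session):
    # Session-aware dedup: replace repeated instructions with a hash ref
    h = hashlib.sha256(text.encode()).hexdigest()[:12]
    if h in session["seen"]:
        return f"[instructions unchanged: {h}]", True
    session["seen"].add(h)
    return text, False

def minify(text, session):
    # Static minification: drop blank lines and decorative rules
    lines = [l for l in text.splitlines() if l.strip() and l.strip() != "---"]
    return "\n".join(lines), False

def run(text, session, pipeline=(classifier, dedup, minify)):
    for stage in pipeline:
        text, short_circuit = stage(text, session)
        if short_circuit:
            break
    return text

session = {"seen": set()}
first = run("# CLAUDE.md\n---\n\nAlways write tests.", session)   # minified
repeat = run("# CLAUDE.md\n---\n\nAlways write tests.", session)  # dedup hit
```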
Adding custom stages:
Implement the CompressionStage trait and add your stage to the pipeline via
build_pipeline() in src/compression/optimize.rs.
Configuration:
| Variable | Default | Description |
|---|---|---|
| `ISARTOR__ENABLE_CONTEXT_OPTIMIZER` | `true` | Master switch for L2.5 |
| `ISARTOR__CONTEXT_OPTIMIZER_DEDUP` | `true` | Enable cross-turn instruction deduplication |
| `ISARTOR__CONTEXT_OPTIMIZER_MINIFY` | `true` | Enable static minification |
Observability:
- Instrumented as: `layer2_5_context_optimizer` span in distributed traces.
- Response header: `x-isartor-context-optimized: bytes_saved=<N>` on optimised requests.
- Span fields: `context.bytes_saved`, `context.strategy` (e.g. "classifier+dedup", "classifier+log_crunch").
| Mode | Implementation |
|---|---|
| Minimalist | In-process CompressionPipeline (classifier → dedup → log_crunch) |
| Enterprise | In-process CompressionPipeline (extensible with custom stages) |
L3 — Cloud Logic
Algorithm: Load balancing & retries
L3 is the final layer. Only the hardest prompts — those not resolved by cache, SLM, or context optimisation — reach the external cloud LLMs.
- Routes to OpenAI, Anthropic, Azure OpenAI, or xAI via rig-core.
- Built-in fallback resilience with load balancing and retries.
- Offline mode (`offline_mode = true`): Blocks L3 routing explicitly instead of silently pretending success.
- Stale fallback: On L3 failure, checks the namespaced exact-cache key first, then a legacy un-namespaced key for backward compatibility.
| Mode | Implementation |
|---|---|
| Minimalist | Direct to OpenAI / Anthropic |
| Enterprise | Direct to OpenAI / Anthropic |
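The stale-fallback lookup order can be sketched as a two-key probe (key shapes illustrative, matching the namespacing described for L1a):

```python
# Sketch of the stale-fallback order on an L3 failure: probe the
# namespaced key first, then the legacy un-namespaced key.
# Key formats are illustrative, not Isartor's wire format.
def stale_fallback(cache: dict, namespace: str, prompt: str):
    for key in (f"{namespace}|{prompt}", prompt):   # namespaced, then legacy
        if key in cache:
            return cache[key]
    return None

legacy_only = {"explain BGP": "legacy cached answer"}
stale_fallback(legacy_only, "openai", "explain BGP")  # legacy entry still served
```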
How Layers Interact
The deflection stack is implemented as Axum middleware plus a final handler. For authenticated routes, the execution order is:
1. Body buffer — `BufferedBody` stores the request body so multiple layers can read it.
2. Request-level monitoring — Observability instrumentation.
3. Auth — API key validation.
4. Layer 1 cache — L1a exact match, then L1b semantic match.
5. Layer 2 SLM triage — Intent classification and local response.
6. Layer 2.5 context optimiser — Instruction dedup + minification via `CompressionPipeline`.
7. Layer 3 handler — Cloud LLM fallback.
Implementation note: Axum middleware wraps inside-out — the last `.layer(...)` added runs first. The stack order in `src/main.rs` documents this explicitly and must be preserved.
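The wrapping rule is easiest to see with plain function composition (a generic illustration, not Axum):

```python
# Generic illustration of "the layer added last runs first":
# wrapping a handler in middleware composes inside-out, like
# chained .layer(...) calls. Not Isartor code.
def make_layer(name, order):
    def layer(inner):
        def wrapped(req):
            order.append(name)     # record execution order
            return inner(req)
        return wrapped
    return layer

order = []
handler = lambda req: "response"
# add "auth" first, then "cache" — like .layer(auth) then .layer(cache)
for mw in (make_layer("auth", order), make_layer("cache", order)):
    handler = mw(handler)

result = handler("req")
# "cache", the layer added last, executed before "auth"
```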
Public health routes (/health, /healthz) intentionally bypass the deflection stack. The authenticated routes are /api/chat, /api/v1/chat, /v1/chat/completions, and /v1/messages.
See Also
- Architecture — high-level system design and pluggable providers
- Architecture Decision Records — rationale behind the deflection stack design (ADR-001)
- Configuration Reference
Architecture Decision Records
Key design decisions, trade-offs, and rationale behind Isartor's architecture.
Each ADR follows a lightweight format: Context → Decision → Consequences.
ADR-001: Multi-Layer Deflection Stack Architecture
Date: 2024 · Status: Accepted
Context
AI Prompt Firewall traffic follows a power-law distribution: the majority of prompts are simple or repetitive, while only a small fraction requires expensive cloud LLMs. Sending all traffic to a single provider wastes tokens and money.
Decision
Implement a sequential Deflection Stack with 4+ layers, each capable of short-circuiting:
- Layer 0 — Operational defense (auth, rate limiting, concurrency control)
- Layer 1 — Semantic + exact cache (zero-cost hits)
- Layer 2 — Local SLM triage (classify intent, execute simple tasks locally)
- Layer 2.5 — Context optimiser (retrieve + rerank to minimise token usage)
- Layer 3 — Cloud LLM fallback (only the hardest prompts)
Layer 2.5 (Context Optimiser):
Retrieves and reranks candidate documents or responses to minimize downstream token usage. Typically implements top-K selection, reranking, or context window optimization before forwarding to the LLM. Instrumented as the context_optimise span in observability.
Consequences
- Positive: 60–80% of traffic can be resolved before Layer 3, dramatically reducing cost.
- Positive: Each layer adds latency only when needed — cache hits are sub-millisecond.
- Positive: Clear separation of concerns; each layer is independently testable.
- Negative: Deflection Stack adds conceptual complexity vs. a simple reverse proxy.
- Negative: Each layer needs its own error handling and timeout strategy.
ADR-002: Axum + Tokio as Runtime Foundation
Date: 2024 · Status: Accepted
Context
The firewall must handle high concurrency (thousands of simultaneous connections) with low latency overhead. The binary should be small, statically linked, and deployable to minimal environments.
Decision
Use Axum 0.8 on Tokio 1.x for the async HTTP server. Build with --target x86_64-unknown-linux-musl and opt-level = "z" + LTO for a ~5 MB static binary.
Consequences
- Positive: Tokio's work-stealing scheduler handles 10K+ concurrent connections efficiently.
- Positive: Axum's type-safe extractors catch errors at compile time.
- Positive: Static musl binary runs in distroless containers (no libc, no shell).
- Negative: Rust's compilation times are longer than Go/Node.js equivalents.
- Negative: Ecosystem is smaller — fewer off-the-shelf middleware components.
ADR-003: Embedded Candle Classifier (Layer 2)
Date: 2024 · Status: Accepted
Context
For minimal deployments (edge, VPS, air-gapped), requiring an external sidecar (llama.cpp, Ollama, TGI) adds operational complexity. Many classification tasks can be handled by a 2B parameter model on CPU.
Decision
Embed a Gemma-2-2B-IT GGUF model directly in the Rust process using the candle framework. The model is loaded on first start via hf-hub (auto-downloaded from Hugging Face) and wrapped in a tokio::sync::Mutex for thread-safe inference on spawn_blocking.
Consequences
- Positive: Zero external dependencies for Layer 2 classification — a single binary handles everything.
- Positive: No HTTP overhead for classification calls; inference is an in-process function call.
- Positive: Works in air-gapped environments with pre-cached models.
- Negative: ~1.5 GB memory overhead for the Q4_K_M model weights.
- Negative: CPU inference is slower than GPU (50–200 ms classification, 200–2000 ms generation).
- Negative: `Mutex` serialises inference calls — throughput is limited to one inference at a time.
- Trade-off: For higher throughput, upgrade to Level 2 (llama.cpp sidecar on GPU).
ADR-004: Three Deployment Tiers
Date: 2024 · Status: Accepted
Context
Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. A single deployment model cannot serve all use cases optimally.
Decision
Define three explicit deployment tiers that share the same binary and configuration surface:
| Tier | Strategy | Target |
|---|---|---|
| Level 1 | Monolithic binary, embedded candle | VPS, edge, bare metal |
| Level 2 | Firewall + llama.cpp sidecars | Docker Compose, single host + GPU |
| Level 3 | Stateless pods + inference pools | Kubernetes, Helm, HPA |
The tier is selected purely by environment variables and infrastructure, not by code changes.
Consequences
- Positive: A single codebase and binary serves all deployment scenarios.
- Positive: Users start at Level 1 and upgrade incrementally — no migrations.
- Positive: Clear documentation entry points for each tier.
- Negative: Some config variables are irrelevant at certain tiers (e.g., `ISARTOR__LAYER2__SIDECAR_URL` is unused at Level 1 with embedded candle).
- Negative: Testing all three tiers requires different infrastructure setups.
ADR-005: llama.cpp as Sidecar (Level 2) Instead of Ollama
Date: 2024 · Status: Accepted
Context
The original design used Ollama (~1.5 GB image) as the local SLM engine. While Ollama has a convenient API and model management, it's heavyweight for a sidecar.
Decision
Replace Ollama with llama.cpp server (ghcr.io/ggml-org/llama.cpp:server, ~30 MB) as the default sidecar in docker-compose.sidecar.yml. Two instances run side by side:
- slm-generation (port 8081) — Phi-3-mini for classification and generation
- slm-embedding (port 8082) — all-MiniLM-L6-v2 with the `--embedding` flag
Consequences
- Positive: 50× smaller container images (30 MB vs. 1.5 GB).
- Positive: Faster cold starts; no model pull step needed (uses `--hf-repo` auto-download).
- Positive: OpenAI-compatible API — firewall code doesn't need to change.
- Negative: Ollama's model management UX (pull, list, delete) is lost.
- Negative: Each model needs its own llama.cpp instance (no multi-model serving).
- Migration: Ollama-based Compose files (`docker-compose.yml`, `docker-compose.azure.yml`) are retained for backward compatibility.
- Update (ADR-011): The slm-embedding sidecar (port 8082) is now optional. Layer 1 semantic cache embeddings are generated in-process via candle (pure-Rust BertModel).
ADR-006: rig-core for Multi-Provider LLM Client
Date: 2024 · Status: Accepted
Context
Layer 3 must route to multiple cloud LLM providers (OpenAI, Azure OpenAI, Anthropic, xAI). Implementing each provider's API client from scratch would be error-prone and hard to maintain.
Decision
Use rig-core (v0.32.0) as the unified LLM client. Rig provides a consistent CompletionModel abstraction over all supported providers.
Consequences
- Positive: Single configuration surface (`ISARTOR__LLM_PROVIDER` + `ISARTOR__EXTERNAL_LLM_API_KEY`) switches providers.
- Positive: Provider-specific quirks (Azure deployment IDs, Anthropic versioning) handled by rig.
- Negative: Adds a dependency; rig's release cadence may not match our needs.
- Negative: Limited to providers rig supports (but covers all major ones).
ADR-007: AIMD Adaptive Concurrency Control
Date: 2024 · Status: Accepted
Context
A fixed concurrency limit either over-provisions (wasting resources) or under-provisions (rejecting requests during traffic spikes). The firewall needs to dynamically adjust its limit based on real-time latency.
Decision
Implement an Additive Increase / Multiplicative Decrease (AIMD) concurrency limiter at Layer 0:
- If P95 latency < target → `limit += 1` (additive increase).
- If P95 latency > target → `limit *= 0.5` (multiplicative decrease).
- Bounded by configurable min/max concurrency limits.
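The control rule reduces to a few lines (the target and bounds here are illustrative, not Isartor's defaults):

```python
# Sketch of the AIMD rule: additive increase while latency is good,
# multiplicative decrease when it degrades, clamped to bounds.
# Target/bounds are illustrative values.
def aimd_step(limit, p95_ms, target_ms=250, lo=1, hi=512):
    if p95_ms < target_ms:
        limit += 1                     # additive increase
    elif p95_ms > target_ms:
        limit = int(limit * 0.5)       # multiplicative decrease
    return max(lo, min(hi, limit))

limit = 100
limit = aimd_step(limit, p95_ms=120)   # fast turn: limit grows by one
limit = aimd_step(limit, p95_ms=900)   # slow turn: limit halves
```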
Consequences
- Positive: Self-tuning: the limit converges to the optimal value for the current load.
- Positive: Protects downstream services (sidecars, cloud LLMs) from overload.
- Negative: During cold start, the limit starts low and ramps up — initial requests may see 503s.
- Tuning: Target latency must be calibrated per deployment tier.
ADR-008: Unified API Surface
Date: 2024 · Status: Superseded
Context
The original design maintained two API versions: a v1 middleware-based pipeline (/api/chat) and a v2 orchestrator-based pipeline (/api/v2/chat). Maintaining two code paths increased complexity with no clear benefit once the middleware pipeline matured.
Decision
Consolidate into a single endpoint:
- `/api/chat` — Middleware-based pipeline. Each layer is an Axum middleware (auth → cache → SLM triage → handler).
- The v2 endpoint (`/api/v2/chat`) and its `pipeline_*` configuration fields have been removed.
- Orchestrator and trait-based pipeline components remain in `src/pipeline/` for potential future reintegration.
Consequences
- Positive: Single code path to maintain, test, and observe.
- Positive: Simplified configuration surface — no more
PIPELINE_*env vars. - Positive: Eliminates user confusion about which endpoint to use.
- Negative: Orchestrator-based features (structured `processing_log`, explicit `PipelineContext`) are not exposed until reintegrated.
ADR-009: Distroless Container Image
Date: 2024 · Status: Accepted
Context
The firewall binary is statically linked (musl). The runtime container only needs to execute a single binary.
Decision
Use gcr.io/distroless/static-debian12 as the runtime base image. It contains no shell, no package manager, no libc — only the static binary.
Consequences
- Positive: Minimal attack surface — no shell to exec into, no tools for attackers.
- Positive: Tiny image size (base ~2 MB + binary ~5 MB = ~7 MB total).
- Positive: Passes most container security scanners with zero CVEs.
- Negative: Cannot `docker exec` into the container for debugging (no shell).
- Negative: Cannot install additional tools at runtime.
- Workaround: Use `docker logs`, Jaeger traces, and Prometheus metrics for debugging.
ADR-010: OpenTelemetry for Observability
Date: 2024 · Status: Accepted
Context
The firewall needs distributed tracing and metrics. Vendor-specific SDKs (Datadog, New Relic, etc.) create lock-in.
Decision
Use OpenTelemetry (OTLP gRPC) as the sole telemetry interface. Traces and metrics are exported to an OTel Collector, which can forward to any backend (Jaeger, Prometheus, Grafana, Datadog, etc.).
Consequences
- Positive: Vendor-neutral — switch backends by reconfiguring the collector, not the app.
- Positive: OTLP is a CNCF standard with wide ecosystem support.
- Positive: When `ISARTOR__ENABLE_MONITORING=false`, no OTel SDK is initialised — zero overhead.
- Negative: Requires an OTel Collector as middleware (adds one more service in Level 2/3).
- Negative: Auto-instrumentation is less mature in Rust than in Java/Python.
ADR-011: Pure-Rust Candle for In-Process Sentence Embeddings
| Status | Accepted (superseded: fastembed → candle) |
| Date | 2025-06 (updated 2025-07) |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-005 (llama.cpp sidecar) |
Context
Layer 1 (semantic cache) must generate sentence embeddings for every incoming prompt to compute cosine similarity against the vector cache. Previously, this was done via fastembed (ONNX Runtime, BAAI/bge-small-en-v1.5), which introduced a C++ dependency (onnxruntime-sys) that broke cross-compilation on ARM64 macOS and complicated the build matrix.
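The hot-path computation here is one cosine similarity per cached entry. A minimal sketch in plain Rust (the function name is illustrative; the real code compares 384-dimensional MiniLM embeddings):

```rust
/// Cosine similarity between two embedding vectors, as computed on the
/// L1b hot path. Illustrative sketch, not Isartor's actual source.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Identical vectors score 1.0; orthogonal vectors score 0.0.
    let v = [0.6f32, 0.8];
    assert!((cosine_similarity(&v, &v) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```

Since the embedder L2-normalises its outputs, the division by the norms is a no-op in practice and the comparison reduces to a dot product.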
Decision
Use candle (candle-core, candle-nn, candle-transformers 0.9) with hf-hub and tokenizers to run sentence-transformers/all-MiniLM-L6-v2 in-process via a pure-Rust BertModel. The model weights (~90 MB) are downloaded once from Hugging Face Hub on first startup and cached in ~/.cache/huggingface/. Inference is invoked through tokio::task::spawn_blocking since BERT forward passes are CPU-bound.
- Model: sentence-transformers/all-MiniLM-L6-v2 — 384-dimensional embeddings, optimised for sentence similarity.
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies, seamless cross-compilation to any `rustc` target.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: The inner `BertModel` is wrapped in `std::sync::Mutex` because `forward()` takes `&mut self`. This is acceptable because inference is always called from `spawn_blocking`, never holding the lock across `.await` points.
- Architecture: `TextEmbedder` is initialised once at startup, stored as `Arc<TextEmbedder>` in `AppState`, and injected into the cache middleware.
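The pooling step above can be sketched in plain Rust. The real implementation operates on candle tensors; the nested-`Vec` shapes and the function name below are illustrative only:

```rust
// Mean pooling over token embeddings with an attention mask, followed by
// L2 normalisation — the scheme ADR-011 describes. Illustrative sketch.
fn mean_pool_l2(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in token_embeddings.iter().zip(attention_mask) {
        if m == 1 {
            // Only non-padding tokens contribute to the mean.
            for (p, t) in pooled.iter_mut().zip(tok) {
                *p += *t;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count;
    }
    // L2-normalise so downstream cosine similarity is a plain dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    pooled.iter().map(|x| x / norm).collect()
}

fn main() {
    // Two real tokens plus one padding token (mask 0) that must be ignored.
    let toks = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![9.0, 9.0]];
    let out = mean_pool_l2(&toks, &[1, 1, 0]);
    let norm: f32 = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((norm - 1.0).abs() < 1e-5);
}
```

Because the output is unit-length, the semantic cache's similarity check needs no per-query renormalisation.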
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| fastembed (ONNX Runtime) | C++ dependency (onnxruntime-sys) breaks ARM64 cross-compilation; ~5 MB shared library |
| llama.cpp sidecar (all-MiniLM-L6-v2) | Network round-trip on hot path, extra container to manage |
| sentence-transformers (Python) | Crosses FFI boundary, adds Python runtime dependency |
| ort (raw ONNX Runtime bindings) | Same C++ dependency problem as fastembed |
Consequences
- Positive: Eliminates ~2–5 ms network latency per embedding call on the cache hot path.
- Positive: Zero C/C++ dependencies — `cargo build` works on any platform without cmake or pre-built binaries.
- Positive: Zero sidecar dependency for Level 1 — the minimal Dockerfile runs self-contained.
- Positive: Model weights are auto-downloaded from Hugging Face Hub; reproducible builds.
- Negative: First startup downloads model weights (~90 MB) if not pre-cached.
- Negative: `Mutex` serialises concurrent embedding calls within a single process (acceptable at current scale; can be replaced with a pool of models if needed).
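The locking discipline behind that trade-off can be sketched with plain threads, `std::thread` standing in for `tokio::task::spawn_blocking` and `FakeModel` standing in for `BertModel` (both stand-ins are assumptions, not Isartor's code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// forward() takes &mut self, so callers must hold the Mutex for exactly
// one synchronous inference — never across an await point.
struct FakeModel {
    calls: usize,
}

impl FakeModel {
    fn forward(&mut self, input: f32) -> f32 {
        self.calls += 1; // mutation is why the Mutex exists
        input * 2.0
    }
}

// Run n concurrent "inference" calls against one shared model.
fn run_concurrent(n: usize) -> usize {
    let model = Arc::new(Mutex::new(FakeModel { calls: 0 }));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let m = Arc::clone(&model);
            // Lock scope is confined to the blocking forward pass.
            thread::spawn(move || m.lock().unwrap().forward(i as f32))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let calls = model.lock().unwrap().calls;
    calls
}

fn main() {
    // All four calls complete; the Mutex serialises them.
    assert_eq!(run_concurrent(4), 4);
}
```

Swapping the single `Mutex` for a pool of models is the escape hatch the ADR mentions if serialisation ever becomes the bottleneck.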
ADR-012: Pluggable Trait Provider (Hexagonal Architecture)
| Status | Accepted |
| Date | 2025-06 |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-004 (Three Deployment Tiers) |
Context
As Isartor grew from a single-process binary (Level 1) to a multi-tier deployment (Level 1 → 2 → 3), the cache and SLM router components became tightly coupled to their in-process implementations. Scaling to Level 3 (Kubernetes, multiple replicas) requires:
- Shared cache — in-process LRU caches are isolated per pod; cache hits are inconsistent, duplicating work.
- GPU-backed inference — in-process Candle inference is CPU-bound; Level 3 needs a dedicated GPU inference pool (vLLM / TGI) that can scale independently.
Hard-coding these choices into the firewall binary would require compile-time feature flags or code branching, making the binary non-portable across tiers.
Decision
Adopt the Ports & Adapters (Hexagonal Architecture) pattern:
- Ports (`src/core/ports.rs`) — Define `ExactCache` and `SlmRouter` as `async_trait` traits (`Send + Sync`), representing the interfaces the firewall depends on.
- Adapters (`src/adapters/`) — Provide concrete implementations: `InMemoryCache` (ahash + LRU + parking_lot) and `RedisExactCache` for `ExactCache`; `EmbeddedCandleRouter` and `RemoteVllmRouter` for `SlmRouter`.
- Factory (`src/factory.rs`) — `build_exact_cache(&config)` and `build_slm_router(&config, &http_client)` read `AppConfig.cache_backend` and `AppConfig.router_backend` at startup and return the appropriate `Box<dyn Trait>`.
- Configuration (`src/config.rs`) — `CacheBackend` enum (Memory | Redis) and `RouterBackend` enum (Embedded | Vllm) with associated connection URLs, selectable via `ISARTOR__CACHE_BACKEND` and `ISARTOR__ROUTER_BACKEND` env vars.
The same binary serves all three deployment tiers; the runtime behaviour is entirely configuration-driven.
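A compressed, synchronous sketch of that wiring (the real traits are async and include Redis/vLLM adapters; every name below mirrors the ADR but is illustrative, not the actual source):

```rust
use std::collections::HashMap;

// Port: the interface the firewall depends on.
trait ExactCache: Send + Sync {
    fn get(&mut self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, value: &str);
}

// Adapter: one concrete implementation behind the port.
struct InMemoryCache {
    map: HashMap<String, String>,
}

impl ExactCache for InMemoryCache {
    fn get(&mut self, key: &str) -> Option<String> {
        self.map.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: &str) {
        self.map.insert(key.to_string(), value.to_string());
    }
}

// Configuration drives the choice; a Redis variant would carry its URL.
enum CacheBackend {
    Memory,
}

// Factory: resolve config to a trait object once, at startup.
fn build_exact_cache(backend: &CacheBackend) -> Box<dyn ExactCache> {
    match backend {
        CacheBackend::Memory => Box::new(InMemoryCache { map: HashMap::new() }),
    }
}

fn main() {
    let mut cache = build_exact_cache(&CacheBackend::Memory);
    cache.put("2 + 2?", "4");
    assert_eq!(cache.get("2 + 2?"), Some("4".to_string()));
    assert_eq!(cache.get("unseen"), None);
}
```

The rest of the binary only ever sees `Box<dyn ExactCache>`, which is what lets one artifact serve every tier.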
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| Compile-time feature flags (#[cfg(feature = "redis")]) | Produces different binaries per tier; complicates CI and container builds |
| Service mesh sidecar (Envoy filter for caching) | Adds infrastructure complexity; cache logic is domain-specific |
| Plugin system (dynamic .so loading) | Over-engineered; dyn Trait with compile-time-known variants is simpler |
| Runtime scripting (Lua / Wasm policy) | Unnecessary indirection; Rust trait dispatch is zero-cost |
Consequences
- Positive: One binary, all tiers — only env vars change between Level 1 (embedded everything) and Level 3 (Redis + vLLM).
- Positive: Horizontal scalability — with `cache_backend=redis`, all pods share the same cache; with `router_backend=vllm`, GPU inference scales independently.
- Positive: Testability — unit tests inject mock adapters via the trait interface.
- Positive: Extensibility — adding a new backend (e.g., Memcached, Triton) requires only a new adapter implementing the trait.
- Negative: Minor runtime overhead from `dyn Trait` dynamic dispatch (single vtable lookup per call — negligible vs. network I/O).
- Negative: `EmbeddedCandleRouter` remains a skeleton; full candle-based classification requires the `embedded-inference` feature flag to be completed.
← Back to Architecture
AI Tool Integrations
Isartor is an OpenAI-compatible and Anthropic-compatible gateway that deflects repeated or simple prompts at Layer 1 (cache) and Layer 2 (local SLM) before they reach the cloud. Clients integrate by overriding their base URL to point at Isartor or by registering Isartor as an MCP server — no proxy, no MITM, no CA certificates.
Endpoints
Isartor's server defaults to: http://localhost:8080.
Authenticated chat endpoints:
| Endpoint | Protocol | Path |
|---|---|---|
| Native Isartor (recommended for direct use) | Native | POST /api/chat / POST /api/v1/chat |
| OpenAI Models | OpenAI | GET /v1/models |
| OpenAI Chat Completions | OpenAI | POST /v1/chat/completions |
| Anthropic Messages | Anthropic | POST /v1/messages |
| Cache lookup / store (used by MCP clients) | Native | POST /api/v1/cache/lookup / POST /api/v1/cache/store |
Authentication
Isartor can enforce a gateway key on authenticated routes when Layer 0 auth is enabled.
Supported headers:
- `X-API-Key: <gateway_api_key>`
- `Authorization: Bearer <gateway_api_key>` (useful for OpenAI/Anthropic-compatible clients)
By default, gateway_api_key is empty and auth is disabled (local-first). To
enable gateway authentication, set ISARTOR__GATEWAY_API_KEY to a secret value. In
production, always set a strong key.
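The resulting gate can be sketched as a single predicate. This is an illustrative assumption about the header handling, not Isartor's actual source:

```rust
// Layer 0 gateway-key check: accept either X-API-Key or a Bearer token,
// and treat an empty configured key as "auth disabled" (local-first).
fn is_authorized(
    gateway_api_key: &str,
    x_api_key: Option<&str>,
    authorization: Option<&str>,
) -> bool {
    if gateway_api_key.is_empty() {
        return true; // default: no key configured, auth off
    }
    if x_api_key == Some(gateway_api_key) {
        return true;
    }
    // Strip the "Bearer " prefix and compare the remaining token.
    match authorization.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(token) => token == gateway_api_key,
        None => false,
    }
}

fn main() {
    assert!(is_authorized("", None, None)); // empty key = auth disabled
    assert!(is_authorized("s3cret", Some("s3cret"), None));
    assert!(is_authorized("s3cret", None, Some("Bearer s3cret")));
    assert!(!is_authorized("s3cret", None, Some("Bearer wrong")));
}
```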
Observability headers
All endpoints in the Deflection Stack include:
- `X-Isartor-Layer`: `l1a` | `l1b` | `l2` | `l3` | `l0`
- `X-Isartor-Deflected: true` if resolved locally (no cloud call)
Example: OpenAI-compatible request
curl -sS http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "2 + 2?"}
]
}'
If gateway auth is enabled, also add:
-H 'Authorization: Bearer your-secret-key'
Many OpenAI-compatible SDKs and coding agents also call:
curl -sS http://localhost:8080/v1/models
OpenAI-compatible agent features supported by Isartor:
- `GET /v1/models` for model discovery
- `stream: true` on `/v1/chat/completions` with OpenAI-style SSE and `data: [DONE]`
- `tools`, `tool_choice`, `functions`, and `function_call` passthrough
- `tool_calls` preserved in provider responses
- tool-aware exact cache keys, with semantic cache skipped for tool-use flows
Example: Anthropic-compatible request
curl -sS http://localhost:8080/v1/messages \
-H 'Content-Type: application/json' \
-d '{
"model": "claude-sonnet-4-6",
"system": "Be concise.",
"max_tokens": 100,
"messages": [
{
"role": "user",
"content": [{"type": "text", "text": "What is the capital of France?"}]
}
]
}'
If gateway auth is enabled, also add:
-H 'X-API-Key: your-secret-key'
Supported tools at a glance
| Tool | Command | Mechanism |
|---|---|---|
| GitHub Copilot CLI | isartor connect copilot | MCP server (cache-only) |
| GitHub Copilot in VS Code | isartor connect copilot-vscode | Managed settings.json debug overrides |
| OpenClaw | isartor connect openclaw | Managed OpenClaw provider config (openclaw.json) |
| OpenCode | isartor connect opencode | Global provider + auth config |
| Claude Code + GitHub Copilot | isartor connect claude-copilot | Claude base URL override + Copilot-backed L3 |
| Claude Code | isartor connect claude | Base URL override |
| Claude Desktop | isartor connect claude-desktop | Managed local MCP registration (isartor mcp) |
| Cursor IDE | isartor connect cursor | Base URL override + MCP |
| OpenAI Codex CLI | isartor connect codex | Base URL override |
| Gemini CLI | isartor connect gemini | Base URL override |
| Antigravity | isartor connect antigravity | Base URL override |
| Generic / other tools | isartor connect generic | Base URL override |
Add --gateway-api-key <key> to any connect command only if you have explicitly
enabled gateway auth.
Connection status
# Check all connected clients
isartor connect status
Global troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "connection refused" | Isartor not running | Run isartor up first |
| Gateway returns 401 | Auth enabled but key not configured | Add --gateway-api-key to connect command |
For tool-specific troubleshooting, see each integration page above.
GitHub Copilot CLI
Copilot CLI integrates via an MCP (Model Context Protocol) server that
Isartor registers as a stdio subprocess. Isartor also exposes the same MCP
tools over Streamable HTTP at http://localhost:8080/mcp/ for editors and
web agents that prefer HTTP/SSE transport. Both transports expose two tools:
- `isartor_chat` — cache lookup only. Returns the cached answer on hit (L1a exact or L1b semantic), or an empty string on miss. On a miss, Copilot uses its own LLM to answer — Isartor never routes through its configured L3 provider for Copilot traffic.
- `isartor_cache_store` — stores a prompt/response pair in Isartor's cache so future identical or similar prompts are deflected locally.
This design means Copilot still owns the conversation loop, while Isartor acts as a transparent cache layer that reduces redundant cloud calls. On a cache hit, Isartor returns the cached text and does not call its own Layer 3 provider. Copilot CLI may still emit its normal final-answer event after the tool result, but that is a Copilot-side render step rather than an Isartor L3 forward.
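The division of labour between the two tools can be sketched in a few lines. The `HashMap` stands in for Isartor's real L1a/L1b caches, and `answer` models the Copilot-side loop; only the two tool names come from the docs above:

```rust
use std::collections::HashMap;

struct IsartorCache {
    entries: HashMap<String, String>,
}

impl IsartorCache {
    // Tool 1: cache lookup only. Empty string signals a miss.
    fn isartor_chat(&self, prompt: &str) -> String {
        self.entries.get(prompt).cloned().unwrap_or_default()
    }
    // Tool 2: write-back after Copilot's own model has answered.
    fn isartor_cache_store(&mut self, prompt: &str, response: &str) {
        self.entries.insert(prompt.to_string(), response.to_string());
    }
}

// The Copilot-side flow: lookup first, own-model fallback, then store.
fn answer(cache: &mut IsartorCache, prompt: &str) -> String {
    let hit = cache.isartor_chat(prompt);
    if !hit.is_empty() {
        return hit; // deflected locally, zero cloud calls
    }
    let response = format!("model answer to: {prompt}"); // Copilot's own LLM
    cache.isartor_cache_store(prompt, &response);
    response
}

fn main() {
    let mut cache = IsartorCache { entries: HashMap::new() };
    let first = answer(&mut cache, "capital of France");
    let second = answer(&mut cache, "capital of France");
    assert_eq!(first, second); // the second turn is a cache hit
}
```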
Prerequisites
- Isartor installed (`curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh`)
- GitHub Copilot CLI installed
Step-by-step setup
# 1. Start Isartor
isartor up --detach
# 2. Register the MCP server with Copilot CLI
isartor connect copilot
# 3. Start Copilot normally — plain chat prompts will use Isartor cache first
copilot
How it works
- `isartor connect copilot` adds an `isartor` entry to `~/.copilot/mcp-config.json`
- `isartor connect copilot` also installs a managed instruction block in `~/.copilot/copilot-instructions.md`
- When Copilot CLI starts, it launches `isartor mcp` as a stdio subprocess and loads the Isartor instruction block
- The MCP server exposes `isartor_chat` (cache lookup) and `isartor_cache_store` (cache write)
- For plain conversational prompts, Copilot now prefers this flow:
  - Call `isartor_chat` with the user's prompt
  - Cache hit: return the cached answer immediately, verbatim
  - Cache miss: answer with Copilot's own model, then call `isartor_cache_store`
- When Copilot calls `isartor_chat`:
  - Cache hit (L1a exact or L1b semantic): returns the cached answer instantly
  - Cache miss: returns empty → Copilot uses its own LLM
- After Copilot gets an answer from its LLM, it can call `isartor_cache_store` to populate the cache for future requests
HTTP/SSE MCP endpoint
Isartor now exposes the same MCP tool surface at /mcp/ using Streamable HTTP:
- `POST /mcp/` — client → server JSON-RPC
- `GET /mcp/` — server → client SSE stream
- `DELETE /mcp/` — explicit session teardown
The HTTP transport uses the MCP Mcp-Session-Id header after initialize, and
supports both JSON responses and SSE responses for POST requests. A minimal
editor config looks like:
{"servers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}
Important note about "still going to L3"
If you inspect Copilot CLI JSON traces, you may still see a normal
final_answer event after isartor_chat returns a cache hit. That does not
mean Isartor forwarded the prompt to its own Layer 3 provider. The important
signal is Isartor's own log and headers:
- `Cache lookup: L1a exact hit` or `Cache lookup: L1b semantic hit`
- no new `Layer 3: Forwarding to LLM via Rig` entry for that prompt
In other words:
- Isartor L3 call = bad for a cache hit
- Copilot final-answer render after a tool hit = expected CLI behavior
Isartor now installs stricter Copilot instructions that tell Copilot to emit the cached tool result verbatim on cache hits, without paraphrasing or extra tool calls.
Cache endpoints (used by MCP internally)
The MCP server calls these HTTP endpoints on the Isartor gateway:
# Cache lookup — returns cached response or 204 No Content
curl -X POST http://localhost:8080/api/v1/cache/lookup \
-H "Content-Type: application/json" \
-d '{"prompt": "capital of France"}'
# Cache store — saves a prompt/response pair
curl -X POST http://localhost:8080/api/v1/cache/store \
-H "Content-Type: application/json" \
-d '{"prompt": "capital of France", "response": "The capital of France is Paris."}'
Custom gateway URL
# If Isartor runs on a non-default port
isartor connect copilot --gateway-url http://localhost:18080
Disconnecting
isartor connect copilot --disconnect
This removes the isartor entry from ~/.copilot/mcp-config.json.
It also removes the managed Isartor block from ~/.copilot/copilot-instructions.md.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Copilot has no isartor_chat tool | MCP server not registered | Run isartor connect copilot |
| Copilot works but bypasses cache | Isartor instructions not installed or custom instructions disabled | Run isartor connect copilot again and do not launch Copilot with --no-custom-instructions |
| Cache never hits for Copilot | Responses not stored after LLM answers | Ask Copilot to call isartor_cache_store after answering |
GitHub Copilot in VS Code
Route GitHub Copilot's code completions and chat requests in VS Code through
Isartor, so repetitive prompts are deflected locally via the L1a/L1b cache
layers. This reduces cloud API calls, lowers latency for repeated patterns,
and gives you per-tool visibility in isartor stats.
How is this different from Copilot CLI? The Copilot CLI integration uses an MCP server for the terminal-based `copilot` command. This page covers VS Code — the editor extension that provides inline completions and Copilot Chat.
Prerequisites
- Isartor installed and running (`isartor up --detach`)
- GitHub Copilot VS Code extension installed (requires a Copilot subscription)
- An LLM provider API key configured in Isartor for Layer 3 fallback (`isartor set-key -p openai` or similar)
Step 1 — Start Isartor
# Install (if not already)
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
# Configure your LLM provider key (OpenAI, Anthropic, Azure, etc.)
isartor set-key -p openai
# Start the gateway in the background
isartor up --detach
Verify it's running:
curl http://localhost:8080/health
# {"status":"ok", ...}
Step 2 — Configure VS Code
Recommended:
isartor connect copilot-vscode
This command:
- auto-detects the VS Code `settings.json` path on macOS, Linux, and Windows
- backs up the original file to `settings.json.isartor-backup`
- writes the three `github.copilot.advanced.debug.*` overrides
- refuses to write if Isartor is not reachable
Manual alternative: open your VS Code User Settings (JSON) and add:
{
"github.copilot.advanced": {
"debug.overrideProxyUrl": "http://localhost:8080",
"debug.overrideCAPIUrl": "http://localhost:8080/v1",
"debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
}
}
| Setting | What It Does |
|---|---|
| debug.overrideProxyUrl | Routes Copilot's main API traffic through Isartor |
| debug.overrideCAPIUrl | Overrides the completions API endpoint (inline suggestions) |
| debug.chatOverrideProxyUrl | Overrides the Copilot Chat endpoint |
Custom port? If Isartor runs on a different port, replace `8080` with your port everywhere above.
Step 3 — Restart VS Code
Close and reopen VS Code (or run "Developer: Reload Window" from the command palette). Copilot will now route requests through Isartor.
Step 4 — Verify
Open any code file and trigger a Copilot suggestion (start typing a comment or function). Then check Isartor's stats:
isartor stats
You should see requests flowing through Isartor's layers. Repeat the same prompt and you'll see L1a cache hits — Isartor deflected the duplicate without a cloud call.
For per-tool breakdown:
isartor stats --by-tool
Copilot VS Code traffic appears as copilot in the tool column (identified
from the User-Agent header). The table now includes requests, cache
hits/misses, average latency, retries, errors, and L1a/L1b safety.
How It Works
VS Code Copilot Extension
│
▼ (HTTP request to overrideProxyUrl)
┌─────────────┐
│ Isartor │
│ Gateway │
│ │
│ L1a ──► L1b ──► L3 (Cloud)
│ hit? hit? forward
└─────────────┘
│
▼
Response back to VS Code
- Copilot sends completion/chat requests to Isartor instead of GitHub's servers
- L1a Exact Cache — sub-millisecond hit for identical prompts (< 1 ms)
- L1b Semantic Cache — catches variations of the same prompt (1–5 ms)
- L3 Cloud — only genuinely new prompts reach your configured LLM provider
- Response flows back to Copilot transparently — no change to the editor UX
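The cascade in the diagram can be sketched as a single fall-through function. Everything here is a stand-in (literal match for L1a, a toy case-insensitive compare for L1b), so only the control flow is real:

```rust
// Which layer resolved the request, mirroring X-Isartor-Layer.
#[derive(Debug, PartialEq)]
enum Layer {
    L1a,
    L1b,
    L3,
}

// Try L1a exact, then L1b "semantic" (toy stand-in), else forward to L3.
fn deflect(prompt: &str, cached: &[(&str, &str)]) -> (Layer, String) {
    // L1a: exact duplicate of a cached prompt.
    if let Some((_, r)) = cached.iter().find(|(p, _)| *p == prompt) {
        return (Layer::L1a, r.to_string());
    }
    // L1b: meaning-equivalent prompt — here faked with a case-insensitive
    // compare; the real layer uses embedding cosine similarity.
    if let Some((_, r)) = cached.iter().find(|(p, _)| p.eq_ignore_ascii_case(prompt)) {
        return (Layer::L1b, r.to_string());
    }
    // L3: only genuinely new prompts reach the cloud provider.
    (Layer::L3, format!("cloud answer to: {prompt}"))
}

fn main() {
    let cached = [("Price?", "42 EUR")];
    assert_eq!(deflect("Price?", &cached).0, Layer::L1a);
    assert_eq!(deflect("price?", &cached).0, Layer::L1b);
    assert_eq!(deflect("What is Rust?", &cached).0, Layer::L3);
}
```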
Disconnecting
isartor connect copilot-vscode --disconnect
If a backup exists, Isartor restores it. Otherwise it removes only the three
managed github.copilot.advanced.debug.* keys.
Benefits
| Benefit | How |
|---|---|
| Reduced API costs | Repetitive completions are served from cache |
| Lower latency | Cache hits return in < 5 ms vs hundreds of ms for cloud |
| Visibility | isartor stats --by-tool shows Copilot request counts, cache hit/miss safety, latency, retries, and errors |
| Privacy | Cached prompts never leave your machine on repeat requests |
| Model flexibility | Route L3 to any provider (OpenAI, Anthropic, Azure, local Ollama) |
Advanced Configuration
Use a specific LLM provider for Layer 3
Isartor routes surviving (non-cached) prompts to your configured L3 provider. You can use any supported provider:
# OpenAI (default)
isartor set-key -p openai
# Anthropic
isartor set-key -p anthropic
# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure
export ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com
export ISARTOR__AZURE_DEPLOYMENT_ID=<deployment>
isartor set-key -p azure
Adjust cache sensitivity
Tune the semantic cache threshold to control how similar a prompt must be to trigger an L1b hit:
# Default: 0.92 (higher = stricter matching)
export ISARTOR__SIMILARITY_THRESHOLD=0.90
See the Configuration Reference for all available options.
Enable monitoring
export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317
See Metrics & Tracing for Grafana dashboards and OTel setup.
Known Limitations
- Copilot Chat override — The `debug.chatOverrideProxyUrl` setting may not be fully respected by all versions of the Copilot Chat extension (tracking issue). Inline code completions (`debug.overrideCAPIUrl`) work reliably. If chat requests bypass Isartor, try using the global VS Code proxy setting as a workaround: `{ "http.proxy": "http://localhost:8080" }`. Note: this routes all VS Code HTTP traffic through Isartor, not just Copilot. Use a PAC script if you need finer control.
- Authentication — These `debug.*` settings bypass Copilot's normal GitHub authentication. Isartor handles the LLM provider auth via its own API key configuration. Your Copilot subscription is still required for the extension to activate.
- Extension updates — VS Code may update the Copilot extension automatically. If the proxy stops working after an update, verify the settings are still present in `settings.json` and restart VS Code.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Copilot suggestions stop working | Isartor not running | Run isartor up --detach and verify with curl http://localhost:8080/health |
| isartor connect copilot-vscode cannot find VS Code settings | Non-standard editor config path | Fall back to editing settings.json manually |
No requests in isartor stats | Settings not applied | Verify settings.json has the override block, then reload VS Code |
| Chat works but completions don't | Wrong endpoint URL | Ensure debug.overrideCAPIUrl ends with /v1 |
| Completions work but chat doesn't | Known chat override limitation | Add debug.chatOverrideProxyUrl or use http.proxy as workaround |
| Auth errors from Copilot | Missing L3 provider key | Run isartor set-key -p openai (or your provider) |
| High latency on first request | Model loading | First request downloads the embedding model (~25 MB); subsequent requests are fast |
Reverting
To stop routing Copilot through Isartor, remove the github.copilot.advanced
block from your settings.json and reload VS Code:
// Remove this entire block:
"github.copilot.advanced": {
"debug.overrideProxyUrl": "http://localhost:8080",
"debug.overrideCAPIUrl": "http://localhost:8080/v1",
"debug.chatOverrideProxyUrl": "http://localhost:8080/v1/chat/completions"
}
OpenClaw
OpenClaw is a self-hosted AI assistant that can connect chat apps and agent workflows to LLM providers. The pragmatic Isartor setup is to register Isartor as a custom OpenAI-compatible OpenClaw provider and let OpenClaw use that provider as its primary model path.
This is similar in spirit to the LiteLLM integration docs, but with one important difference:
- LiteLLM is a multi-model gateway and catalog
- Isartor is a prompt firewall / gateway that currently exposes the upstream model you configured in Isartor itself
So the best OpenClaw UX is: configure the model in Isartor first, then let isartor connect openclaw mirror that model into OpenClaw's provider config.
Pragmatic setup
# 1. Configure Isartor's upstream provider/model
isartor set-key -p groq
isartor check
# 2. Start Isartor
isartor up --detach
# 3. Make sure OpenClaw is onboarded
openclaw onboard --install-daemon
# 4. Register Isartor as an OpenClaw provider
isartor connect openclaw
# 5. Verify OpenClaw sees the provider/model and auth
openclaw models status --agent main --probe
# 6. Smoke test a prompt
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"
What isartor connect openclaw does
It writes or updates your OpenClaw config (default: ~/.openclaw/openclaw.json) with:
models.providers.isartor- a single managed model entry matching Isartor's current upstream model
agents.defaults.model.primary = "isartor/<your-model>"- the
main/ default agent model override when one is present - a refresh of stale per-agent
models.jsonregistries so OpenClaw regenerates them with the latestbaseUrlandapiKey
Example generated provider block:
models: {
providers: {
isartor: {
baseUrl: "http://localhost:8080/v1",
apiKey: "isartor-local",
api: "openai-completions",
models: [
{
id: "openai/gpt-oss-120b",
name: "Isartor (openai/gpt-oss-120b)"
}
]
}
}
}
And the default model becomes:
agents: {
defaults: {
model: {
primary: "isartor/openai/gpt-oss-120b"
}
}
}
Base URL and auth path
OpenClaw must talk to Isartor's OpenAI-compatible /v1 surface.
- Correct base URL: `http://localhost:8080/v1`
- Wrong base URL: `http://localhost:8080`
Why this matters:
- OpenClaw appends `/chat/completions` for OpenAI-compatible custom providers
- Isartor exposes that route as `/v1/chat/completions`
- using the root gateway URL can produce `404` errors such as `gateway unknown L0 via chat/completions`
isartor connect openclaw writes the /v1 path for you, so prefer the connector over hand-editing the provider block.
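The failure mode is plain string concatenation. A sketch of what an OpenAI-compatible client does with the base URL (the `resolve` helper is illustrative, not OpenClaw's code):

```rust
// OpenAI-compatible clients append /chat/completions to the configured
// base URL. Only a base ending in /v1 lands on Isartor's actual route.
fn resolve(base_url: &str) -> String {
    format!("{}/chat/completions", base_url.trim_end_matches('/'))
}

fn main() {
    // Correct: base includes /v1, so the request hits /v1/chat/completions.
    assert_eq!(
        resolve("http://localhost:8080/v1"),
        "http://localhost:8080/v1/chat/completions"
    );
    // Wrong: a root base URL yields a path Isartor does not serve (404).
    assert_eq!(
        resolve("http://localhost:8080"),
        "http://localhost:8080/chat/completions"
    );
}
```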
Reconnecting after changing the gateway API key
OpenClaw stores custom-provider state in two places:
- `~/.openclaw/openclaw.json`
- per-agent `models.json` registries under `~/.openclaw/agents/<agentId>/agent/`
Those per-agent registries can keep an old apiKey or baseUrl even after openclaw.json changes. That is why you can still see 401 after fixing the key in the top-level config.
The supported fix is simply:
isartor connect openclaw --gateway-api-key <your-key>
openclaw models status --agent main --probe
openclaw agent --agent main -m "Hello from OpenClaw through Isartor"
The connector now refreshes openclaw.json, updates the main / default agent model override, and removes stale per-agent models.json files so OpenClaw regenerates them with the new auth.
Why this is the best fit
The upstream LiteLLM/OpenClaw docs assume the gateway can expose a multi-model catalog and route among many providers behind one endpoint.
Isartor is different today:
- OpenClaw talks to Isartor over the OpenAI-compatible `/v1/chat/completions` surface
- Isartor forwards using its configured upstream provider/model
- OpenClaw model refs should therefore mirror the model currently configured in Isartor
That means:
- if you change Isartor's provider/model later, rerun `isartor connect openclaw`
- if you change Isartor's gateway API key later, rerun `isartor connect openclaw --gateway-api-key ...`
- do not expect `isartor/openai/...` and `isartor/anthropic/...` fallbacks to behave like LiteLLM provider switching unless Isartor itself grows multi-provider routing later
Options
| Flag | Default | Description |
|---|---|---|
| --model | Isartor's configured upstream model | Override the single model ID exposed to OpenClaw |
| --config-path | auto-detected | Path to openclaw.json |
| --gateway-api-key | (none) | Gateway key if auth is enabled |
Files written
- `~/.openclaw/openclaw.json` — managed OpenClaw provider config
- `~/.openclaw/agents/<agentId>/agent/models.json` — regenerated by OpenClaw after Isartor clears stale custom-provider caches
- `openclaw.json.isartor-backup` — backup, when a prior config existed
Disconnecting
isartor connect openclaw --disconnect
If a backup exists, Isartor restores it. Otherwise it removes only the managed models.providers.isartor entry and related isartor/... default-model references.
Recommended user workflow
For day-to-day use:
- Pick your upstream provider with `isartor set-key`
- Validate with `isartor check`
- Keep Isartor running with `isartor up --detach`
- Let OpenClaw use `isartor/<configured-model>` as its primary model
- Use `openclaw models status --agent main --probe` whenever you want to confirm what OpenClaw currently sees
If you later switch Isartor from, for example, Groq to OpenAI or Azure:
isartor set-key -p openai
isartor check
isartor connect openclaw
That refreshes OpenClaw's provider model to match the new Isartor config.
What Isartor does for OpenClaw
| Benefit | How |
|---|---|
| Cache repeated agent prompts | OpenClaw often repeats the same context and system framing. L1a exact cache resolves those instantly. |
| Catch paraphrases | L1b semantic cache resolves similar follow-ups locally when safe. |
| Compress repeated instructions | L2.5 trims repeated context before cloud fallback. |
| Keep one stable gateway URL | OpenClaw only needs isartor/<model> while Isartor owns the upstream provider configuration. |
| Observability | isartor stats --by-tool lets you track OpenClaw cache hits, latency, and savings. |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| OpenClaw cannot reach the provider | Isartor not running | Run isartor up --detach first |
| OpenClaw onboarding/custom provider returns 404 | Base URL points at http://localhost:8080 instead of http://localhost:8080/v1 | Use isartor connect openclaw or update the custom provider base URL to end with /v1 |
| OpenClaw still shows the old model | Isartor model changed after initial connect | Re-run isartor connect openclaw |
| Auth errors (401) after reconnecting | OpenClaw is still using stale per-agent provider state | Re-run isartor connect openclaw --gateway-api-key <key> so Isartor refreshes openclaw.json and clears stale per-agent models.json registries |
| "Model is not allowed" | OpenClaw allowlist still excludes the managed model | Re-run isartor connect openclaw so the managed model is re-added to the allowlist |
OpenCode
OpenCode integrates via a global provider config and auth store. Isartor
registers an isartor provider backed by @ai-sdk/openai-compatible and points
it at the gateway's /v1 endpoint.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure OpenCode
isartor connect opencode
# 3. Start OpenCode
opencode
How it works
- `isartor connect opencode` backs up `~/.config/opencode/opencode.json`
- It writes an `isartor` provider definition to that config file
- It writes a matching auth entry to `~/.local/share/opencode/auth.json`
- The provider uses `@ai-sdk/openai-compatible` with `baseURL` set to `http://localhost:8080/v1`
- If gateway auth is disabled, Isartor writes a dummy local auth key so OpenCode still has a credential to send
Files written
- `~/.config/opencode/opencode.json`
- `~/.local/share/opencode/auth.json`
Backups:
- `~/.config/opencode/opencode.json.isartor-backup`
- `~/.local/share/opencode/auth.json.isartor-backup`
Disconnecting
isartor connect opencode --disconnect
Disconnect restores the original files from backup when available. If no backup
exists, it removes only the managed isartor entries.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| OpenCode cannot see the Isartor provider | Config file not written | Run isartor connect opencode again |
| OpenCode shows auth errors | Gateway auth mismatch | Re-run with --gateway-api-key or update ISARTOR__GATEWAY_API_KEY |
| OpenCode cannot list models | /v1/models unreachable | Verify curl http://localhost:8080/v1/models |
Claude Code + GitHub Copilot
Use Claude Code's editor and CLI workflow while routing Layer 3 through your existing GitHub Copilot subscription via Isartor. Repeated prompts are still deflected first by Isartor's L1a/L1b cache layers and the L2 SLM (if enabled), so cache hits consume zero Copilot quota.
Current status: experimental. The connector and Copilot-backed L3 routing are implemented, but Isartor's Anthropic compatibility surface is still text-oriented today. That means plain Claude Code prompting works best right now; more advanced Anthropic tool-use blocks may still require follow-up work.
Prerequisites
- Active GitHub Copilot subscription
- Isartor installed
- Claude Code installed
# Install Isartor
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
# Install Claude Code
npm install -g @anthropic-ai/claude-code
Setup
Path A — Interactive authentication (recommended)
isartor connect claude-copilot
This starts GitHub device-flow authentication, stores the OAuth token locally,
updates ./isartor.toml, and writes Claude Code settings into
~/.claude/settings.json.
When no --github-token is provided, Isartor now prefers browser/device-flow
OAuth first. It will reuse a previously saved OAuth credential, but it will
not silently reuse legacy saved PATs.
Path B — Use an existing GitHub token
isartor connect claude-copilot --github-token ghp_YOUR_TOKEN
Use --github-token only when you intentionally want to override the default
browser login flow with a PAT.
Path C — Choose custom Copilot models
isartor connect claude-copilot \
--github-token ghp_YOUR_TOKEN \
--model gpt-4.1 \
--fast-model gpt-4o-mini
After the command finishes, restart Isartor so the new Layer 3 config is loaded:
isartor stop
isartor up --detach
claude
One-click smoke test
./scripts/claude-copilot-smoke-test.sh
# or
make smoke-claude-copilot
The script automatically:
- reads the saved Copilot credential from `~/.isartor/providers/copilot.json`
- picks a supported Copilot-backed model
- starts a temporary Isartor instance
- runs a Claude Code smoke prompt
- prints an ROI demo showing L3, L1a exact-hit, and L1b semantic-hit behavior
What the command changes
~/.claude/settings.json
The command writes these Claude Code environment overrides:
| Setting | Value | Purpose |
|---|---|---|
| ANTHROPIC_BASE_URL | http://localhost:8080 (or your gateway URL) | Routes Claude Code to Isartor |
| ANTHROPIC_AUTH_TOKEN | dummy or your gateway key | Satisfies Claude Code auth requirements |
| ANTHROPIC_MODEL | selected model | Primary Copilot-backed model |
| ANTHROPIC_DEFAULT_SONNET_MODEL | selected model | Default Claude Code Sonnet mapping |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | fast model | Lightweight/background tasks |
| DISABLE_NON_ESSENTIAL_MODEL_CALLS | 1 | Reduce unnecessary quota burn |
| CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC | 1 | Compatibility flag across Claude Code versions |
| ENABLE_TOOL_SEARCH | true | Preserve Claude Code tool search behavior |
| CLAUDE_CODE_MAX_OUTPUT_TOKENS | 16000 | Stay under Copilot's output cap |
./isartor.toml
The command also sets Isartor Layer 3 to use the Copilot provider:
llm_provider = "copilot"
external_llm_model = "claude-sonnet-4.5"
external_llm_api_key = "ghp_..."
external_llm_url = "https://api.githubcopilot.com/chat/completions"
Available Copilot-backed models
| Model | Type | Notes |
|---|---|---|
| claude-sonnet-4.5 | Balanced | Good default for Claude-style behavior |
| claude-haiku-4.5 | Fast | Lower-latency Claude-family option |
| gpt-4o | Strong general model | Good for broad coding tasks |
| gpt-4o-mini | Fast + cheap | Good default fast/background model |
| gpt-4.1 | Included | Safe fallback choice |
| o3-mini | Reasoning | Higher-latency reasoning model |
What Isartor saves
Without Isartor:
Every Claude Code prompt -> GitHub Copilot API -> quota consumed
With Isartor:
Repeated prompt (L1a hit) -> served locally -> 0 Copilot quota
Similar prompt (L1b hit) -> served locally -> 0 Copilot quota
Novel prompt (cache miss) -> forwarded to Copilot -> quota consumed
Example session:
100 Claude Code prompts
40 exact repeats -> L1a -> 0 quota
25 semantic variants -> L1b -> 0 quota
35 novel prompts -> L3 -> 35 Copilot-backed requests
Result: 35 routed requests instead of 100
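The savings arithmetic above can be sketched directly (a hypothetical tally; hit counts are taken from the example session):

```shell
# Hypothetical session tally (counts from the example above).
total=100
l1a_hits=40      # exact repeats served by L1a
l1b_hits=25      # semantic variants served by L1b
routed=$(( total - l1a_hits - l1b_hits ))
deflected_pct=$(( (l1a_hits + l1b_hits) * 100 / total ))
echo "routed=$routed deflected=${deflected_pct}%"
```

Only the 35 routed requests consume Copilot quota; the other 65% never leave the machine.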
Limitations
- GitHub Copilot output is capped; Isartor writes `CLAUDE_CODE_MAX_OUTPUT_TOKENS=16000`
- The current `/v1/messages` compatibility path is still text-oriented, so some advanced Anthropic tool-use flows may not yet behave exactly like direct Anthropic routing
- Extended-thinking / provider-specific Anthropic features are not preserved
- If the chosen Copilot model is unavailable to your account, requests fail instead of silently falling back to Anthropic
Disconnect
isartor connect claude-copilot --disconnect
This restores the backed-up ~/.claude/settings.json and ./isartor.toml.
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| Authentication failed | Browser login incomplete, token invalid, or expired | Re-run isartor connect claude-copilot and finish GitHub sign-in |
| No active GitHub Copilot subscription | Signed-in GitHub user has no active Copilot seat / entitlement | Check https://github.com/features/copilot and enterprise seat assignment |
| Model not found | Account cannot access the requested model | Retry with --model gpt-4.1 |
| Claude Code still uses Anthropic | Isartor not restarted after config change | Run isartor stop && isartor up --detach |
| 401 from Isartor | Gateway auth enabled but Claude settings use dummy token | Re-run with the gateway key available in local config |
| Tool call failed | Current Anthropic compatibility is still text-first | Use simpler prompting for now; full tool-use compatibility is follow-up work |
Claude Code
Claude Code integrates via ANTHROPIC_BASE_URL, pointing all API traffic at
Isartor's /v1/messages endpoint.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Claude Code
isartor connect claude
# 3. Claude Code now routes through Isartor automatically
How it works
- `isartor connect claude` sets `ANTHROPIC_BASE_URL` in `~/.claude/settings.json`
- Claude Code sends requests to Isartor's `/v1/messages` endpoint
- Isartor forwards to the Anthropic API as Layer 3 when the request is not deflected
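For reference, a minimal Anthropic-style request body of the kind Claude Code sends to /v1/messages looks like this (a sketch; the model name is illustrative and the exact fields Claude Code sets vary by version):

```json
{
  "model": "claude-sonnet-4.5",
  "max_tokens": 1024,
  "messages": [
    { "role": "user", "content": "Explain what this function does." }
  ]
}
```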
Disconnecting
isartor connect claude --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Claude not routing through Isartor | settings.json not updated | Run isartor connect claude |
Claude Desktop
Claude Desktop integrates with Isartor via a local MCP server. The recommended setup is isartor connect claude-desktop, which registers isartor mcp in Claude Desktop's config so Claude can use Isartor's cache-aware tools.
Step-by-step setup
# 1. Start Isartor
isartor up --detach
# 2. Register Isartor in Claude Desktop
isartor connect claude-desktop
# 3. Restart Claude Desktop
After restart, open Claude Desktop's tools/connectors UI and confirm the isartor MCP server is present.
What the connector writes
isartor connect claude-desktop updates Claude Desktop's local MCP config and keeps a backup next to it.
Typical config paths:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux (best-effort path): `~/.config/Claude/claude_desktop_config.json`
The generated MCP entry looks like:
{
"mcpServers": {
"isartor": {
"command": "/path/to/isartor",
"args": ["mcp"],
"env": {
"ISARTOR_GATEWAY_URL": "http://localhost:8080"
}
}
}
}
If gateway auth is enabled, the connector also writes ISARTOR__GATEWAY_API_KEY into the managed server env block.
What Claude Desktop gets
The Isartor MCP server exposes these tools:
- `isartor_chat` — cache-first lookup through Isartor's L1a/L1b layers
- `isartor_cache_store` — store prompt/response pairs back into Isartor after a cache miss
This gives Claude Desktop a low-risk integration path that fits the current MCP model without relying on Anthropic base-URL overrides.
Advanced / manual setup
If you prefer to edit the config yourself, add a local MCP server entry that runs:
isartor mcp
Isartor also exposes MCP over HTTP/SSE at:
http://localhost:8080/mcp/
That remote MCP surface is useful for clients that support HTTP/SSE registration directly, but isartor connect claude-desktop currently uses the local stdio flow because it is the most reliable Claude Desktop path today.
Disconnecting
isartor connect claude-desktop --disconnect
This restores the backup when one exists; otherwise it removes only the managed mcpServers.isartor entry.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Claude Desktop shows no isartor tools | Claude Desktop was not restarted | Quit and relaunch Claude Desktop after isartor connect claude-desktop |
| Tools appear but calls fail | Isartor is not running | Start the gateway with isartor up --detach |
| MCP server is present but unauthorized | Gateway auth enabled | Re-run isartor connect claude-desktop --gateway-api-key <key> |
| You want the original config back | Managed config needs rollback | Run isartor connect claude-desktop --disconnect |
Note on desktop extensions
Claude Desktop now supports desktop extensions, but Isartor's first-class integration in this repo uses the simpler local MCP server flow today. That keeps setup light and works with the existing isartor mcp implementation immediately.
Cursor IDE
Cursor IDE integrates via the OpenAI Base URL override in Cursor's model settings, and optionally via MCP server registration for tool-based integration.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Cursor
isartor connect cursor
# 3. Open Cursor → Settings → Cursor Settings → Models
# 4. Enable "Override OpenAI Base URL" and enter: http://localhost:8080/v1
# 5. Paste the API key shown in the connect output
# 6. Add a custom model name (e.g. gpt-4o) and enable it
# 7. Use Ask or Plan mode (Agent mode doesn't support custom keys yet)
How it works
- `isartor connect cursor` writes a reference env file to `~/.isartor/env/cursor.sh`
- It also registers Isartor as an MCP server in `~/.cursor/mcp.json`
- In Cursor, override the OpenAI Base URL to point at Isartor's `/v1` endpoint
- Cursor can use Isartor's `GET /v1/models` endpoint to discover the configured model
- All chat completions requests route through Isartor's L1/L2/L3 deflection stack
- Isartor supports OpenAI streaming SSE, tool-call passthrough, and HTTP/SSE MCP at `http://localhost:8080/mcp/` for compatible Cursor workflows
- Cursor's Ask and Plan modes are supported; Agent mode requires native keys
Cursor's generated MCP config points at:
{"mcpServers":{"isartor":{"type":"http","url":"http://localhost:8080/mcp/"}}}
Disconnecting
isartor connect cursor --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Cursor not routing through Isartor | Base URL override not set | Open Cursor Settings → Models → enable Override OpenAI Base URL |
| Cursor model picker is empty | Cursor cannot reach model discovery | Verify http://localhost:8080/v1/models is reachable from Cursor |
OpenAI Codex CLI
OpenAI Codex CLI integrates via OPENAI_BASE_URL, routing requests through
Isartor's OpenAI-compatible /v1 surface, including /v1/chat/completions and
/v1/models.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Codex
isartor connect codex
# 3. Source the env file
source ~/.isartor/env/codex.sh
# 4. Run Codex
codex --model o3-mini
How it works
- `isartor connect codex` writes `OPENAI_BASE_URL` and `OPENAI_API_KEY` to `~/.isartor/env/codex.sh`
- Codex can query `/v1/models` to discover the configured model
- Codex sends chat requests to Isartor's `/v1/chat/completions` endpoint
- Isartor supports OpenAI streaming SSE and tool-call passthrough for compatible agent workflows
- Isartor forwards to the configured upstream as Layer 3 when not deflected
- Use `--model` to select any model name configured in your L3 provider
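The chat requests Codex issues through Isartor follow the standard OpenAI schema; a minimal body (model name and prompt illustrative) looks like:

```json
{
  "model": "o3-mini",
  "messages": [
    { "role": "user", "content": "Write a unit test for this function." }
  ],
  "stream": true
}
```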
Disconnecting
isartor connect codex --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Codex not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/codex.sh in your shell |
| Codex cannot list models | /v1/models unreachable or auth mismatch | Test curl http://localhost:8080/v1/models with the same auth settings |
Gemini CLI
Gemini CLI integrates via GEMINI_API_BASE_URL, routing requests through
Isartor's gateway.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure Gemini CLI
isartor connect gemini
# 3. Source the env file
source ~/.isartor/env/gemini.sh
# 4. Run Gemini CLI
gemini
How it works
- `isartor connect gemini` writes `GEMINI_API_BASE_URL` and `GEMINI_API_KEY` to `~/.isartor/env/gemini.sh`
- Gemini CLI sends requests to Isartor's gateway
- Isartor forwards to the configured upstream as Layer 3 when not deflected
Disconnecting
isartor connect gemini --disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Gemini not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/gemini.sh in your shell |
Antigravity
Antigravity integrates via an OpenAI-compatible base URL override. Isartor
generates a shell env file that sets OPENAI_BASE_URL and OPENAI_API_KEY
to route all LLM calls through the Deflection Stack.
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Generate the env file
isartor connect antigravity
# 3. Activate the environment
source ~/.isartor/env/antigravity.sh
# 4. Start Antigravity
# (it will now use Isartor as its OpenAI endpoint)
How it works
- `isartor connect antigravity` creates `~/.isartor/env/antigravity.sh`
- The file exports `OPENAI_BASE_URL` pointing at `http://localhost:8080/v1`
- It exports `OPENAI_API_KEY` with your gateway key (or a local placeholder)
- When sourced, Antigravity sends all OpenAI-compatible calls through Isartor
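The generated file is a plain shell script; its contents look roughly like this (a sketch; the key value is illustrative and depends on whether gateway auth is enabled):

```shell
# Sketch of ~/.isartor/env/antigravity.sh (values illustrative)
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="my-secret-key"   # gateway key, or a local placeholder
```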
Files written
~/.isartor/env/antigravity.sh
Disconnecting
isartor connect antigravity --disconnect
Then restart your shell to clear the exported variables.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused | Isartor not running | Run isartor up first |
| Auth errors (401) | Gateway auth enabled | Re-run with --gateway-api-key |
| Env not applied | Shell not sourced | Run source ~/.isartor/env/antigravity.sh |
Generic Connector
For tools not explicitly supported, use the generic connector to generate an env script that sets the tool's base URL environment variable to point at Isartor.
Compatible tools
The generic connector works with any OpenAI-compatible tool, including:
- Windsurf
- Zed
- Cline
- Roo Code
- Aider
- Continue
- Antigravity (also available via `isartor connect antigravity`)
- OpenClaw (also available via `isartor connect openclaw`)
- Any other tool that reads an `OPENAI_BASE_URL` or similar environment variable
OpenAI-compatible features exposed by Isartor include:
- `GET /v1/models` for model discovery
- `POST /v1/chat/completions` with `stream: true` SSE responses
- tool/function calling passthrough (`tools`, `tool_choice`, `functions`, `tool_calls`)
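With stream: true, responses arrive as standard OpenAI-style server-sent events, for example (payloads abbreviated):

```
data: {"choices":[{"delta":{"content":"Hel"},"index":0}]}

data: {"choices":[{"delta":{"content":"lo"},"index":0}]}

data: [DONE]
```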
Step-by-step setup
# 1. Start Isartor
isartor up
# 2. Configure the tool (example: Windsurf)
isartor connect generic \
--tool-name Windsurf \
--base-url-var OPENAI_BASE_URL \
--api-key-var OPENAI_API_KEY
# 3. Source the env file
source ~/.isartor/env/windsurf.sh
# 4. Start the tool
Arguments
| Flag | Required | Description |
|---|---|---|
| --tool-name | yes | Display name (also used for env script filename) |
| --base-url-var | yes | Env var the tool reads for its API base URL |
| --api-key-var | no | Env var the tool reads for its API key |
| --no-append-v1 | no | Don't append /v1 to the gateway URL |
Disconnecting
isartor connect generic \
--tool-name Windsurf \
--base-url-var OPENAI_BASE_URL \
--disconnect
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Tool not routing through Isartor | Env vars not loaded | Run source ~/.isartor/env/<tool>.sh in your shell |
| Tool says no models are available | It expects OpenAI model discovery | Verify it can reach http://localhost:8080/v1/models |
Level 1 — Minimal Deployment
Single static binary, embedded candle inference + in-process candle sentence embeddings, zero C/C++ dependencies.
This guide covers deploying Isartor as a standalone process — no sidecars, no Docker Compose, no orchestrator. The firewall binary embeds a Gemma-2-2B-IT GGUF model via candle for Layer 2 classification and uses candle's BertModel (sentence-transformers/all-MiniLM-L6-v2) for Layer 1 semantic cache embeddings — all entirely in-process, pure Rust.
When to Use Level 1
| ✅ Good Fit | ❌ Consider Level 2/3 Instead |
|---|---|
| €5–€20/month VPS (Hetzner, DigitalOcean, Linode) | GPU inference for generation quality |
| ARM edge devices (Raspberry Pi 5, Jetson Nano) | More than ~50 concurrent users |
| Air-gapped / offline environments | Production observability stack required |
| Development & local experimentation | Multi-node high-availability |
| CI/CD test runners | |
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 2 GB free | 4 GB free |
| Disk | 2 GB (model download) | 5 GB |
| CPU | 2 cores | 4+ cores (AVX2 recommended) |
| Rust (build from source) | 1.75+ | Latest stable |
| OS | Linux (x86_64 / aarch64), macOS | Ubuntu 22.04 LTS |
Memory budget: Gemma-2-2B Q4_K_M ≈ 1.5 GB, candle BertModel ≈ 90 MB, tokenizer ≈ 4 MB, firewall runtime ≈ 50 MB. Total: ~1.7 GB resident.
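As a sanity check, the budget adds up as follows (MB values taken from the line above):

```shell
# Sum the Level 1 memory budget (MB)
gemma=1500; bert=90; tokenizer=4; runtime=50
total=$(( gemma + bert + tokenizer + runtime ))
echo "total ~= ${total} MB"   # roughly 1.6-1.7 GB resident
```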
Option A: One-Click Install (Recommended)
The fastest way to get started is to leverage the pre-built, cross-platform binaries generated by the CI/CD pipeline.
Install via script:
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex
This script detects your target OS and processor architecture, downloads the correct release binary, and adds it to your path automatically.
Option B: Build from Source
1. Clone & Build
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
The release binary is at ./target/release/isartor (~5 MB statically linked).
2. Configure Environment
Create a minimal .env file or export variables directly:
# Required — your cloud LLM key for Layer 3 fallback
export ISARTOR__EXTERNAL_LLM_API_KEY="sk-..."
# Optional — override defaults
export ISARTOR__GATEWAY_API_KEY="my-secret-key"
export ISARTOR__HOST_PORT="0.0.0.0:8080"
export ISARTOR__LLM_PROVIDER="openai" # openai | azure | anthropic | xai
export ISARTOR__EXTERNAL_LLM_MODEL="gpt-4o-mini"
# Cache mode — "both" enables exact + semantic cache. Semantic embeddings
# are generated in-process via candle BertModel — no sidecar needed.
export ISARTOR__CACHE_MODE="both"
# Pluggable backends — Level 1 uses the defaults (no change needed):
# ISARTOR__CACHE_BACKEND=memory — in-process LRU (ahash + parking_lot)
# ISARTOR__ROUTER_BACKEND=embedded — in-process Candle GGUF SLM
# These are ideal for a single-process deployment with zero dependencies.
3. Start the Firewall
./target/release/isartor up
On first start, the embedded classifier will auto-download the Gemma-2-2B-IT GGUF model from Hugging Face Hub (~1.5 GB). Subsequent starts load from the local cache (~/.cache/huggingface/).
INFO isartor > Listening on 0.0.0.0:8080
INFO isartor::layer1::embeddings > Initialising candle TextEmbedder (all-MiniLM-L6-v2)...
INFO isartor::layer1::embeddings > TextEmbedder ready (~90 MB BertModel loaded)
INFO isartor::services::local_inference > Downloading model from mradermacher/gemma-2-2b-it-GGUF...
INFO isartor::services::local_inference > Model loaded (1.5 GB), ready for inference
4. Verify
# Health check
curl http://localhost:8080/health
# Test the firewall
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: my-secret-key" \
-d '{"prompt": "Hello, how are you?"}' | jq .
Option C: Docker (Single Container)
For environments where you prefer a container but don't need a full Compose stack.
Build the Image
cd isartor
docker build -t isartor:latest -f docker/Dockerfile .
Run
docker run -d \
--name isartor \
-p 8080:8080 \
-e ISARTOR__GATEWAY_API_KEY="my-secret-key" \
-e ISARTOR__EXTERNAL_LLM_API_KEY="sk-..." \
-e ISARTOR__CACHE_MODE="both" \
-e HF_HOME=/tmp/huggingface \
-v isartor-models:/tmp/huggingface \
isartor:latest
Note: The `-v` flag mounts a named volume for the Hugging Face cache so the model downloads persist across container restarts. The official Docker image runs as non-root and uses `HF_HOME=/tmp/huggingface` to ensure the cache is writable.
Option D: systemd Service (Production Linux)
For long-running production deployments on bare metal or VPS.
1. Install the Binary
# Build
cargo build --release
# Install to /usr/local/bin
sudo cp target/release/isartor /usr/local/bin/isartor
sudo chmod +x /usr/local/bin/isartor
2. Create a System User
sudo useradd --system --no-create-home --shell /usr/sbin/nologin isartor
3. Create Environment File
sudo mkdir -p /etc/isartor
sudo tee /etc/isartor/env <<'EOF'
ISARTOR__HOST_PORT=0.0.0.0:8080
ISARTOR__GATEWAY_API_KEY=your-production-key
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__CACHE_MODE=both
ISARTOR__CACHE_BACKEND=memory
ISARTOR__ROUTER_BACKEND=embedded
RUST_LOG=isartor=info
EOF
sudo chmod 600 /etc/isartor/env
4. Create systemd Unit
sudo tee /etc/systemd/system/isartor.service <<'EOF'
[Unit]
Description=Isartor Prompt Firewall
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=isartor
Group=isartor
EnvironmentFile=/etc/isartor/env
ExecStart=/usr/local/bin/isartor
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/cache/isartor
[Install]
WantedBy=multi-user.target
EOF
5. Create Model Cache Directory
sudo mkdir -p /var/cache/isartor
sudo chown isartor:isartor /var/cache/isartor
6. Enable & Start
sudo systemctl daemon-reload
sudo systemctl enable isartor
sudo systemctl start isartor
# Check status
sudo systemctl status isartor
sudo journalctl -u isartor -f
Model Pre-Caching (Air-Gapped / Offline)
If the deployment target has no internet access, pre-download the model on a connected machine and copy it over.
On the Connected Machine
# Install huggingface-cli
pip install huggingface-hub
# Download the GGUF file
huggingface-cli download mradermacher/gemma-2-2b-it-GGUF \
gemma-2-2b-it.Q4_K_M.gguf \
--local-dir ./models
# Also grab the tokenizer (from the base model)
huggingface-cli download google/gemma-2-2b-it \
tokenizer.json \
--local-dir ./models
Transfer to Target
scp -r ./models/ user@target-host:/var/cache/isartor/
By default, hf-hub uses ~/.cache/huggingface/. In the official Docker image, Isartor sets HF_HOME=/tmp/huggingface (non-root safe). Set HF_HOME or ISARTOR_HF_CACHE_DIR to point to your pre-cached directory if needed.
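On the air-gapped target, pointing the loader at the transferred directory is just an environment override (a sketch; whether `HF_HOME` or `ISARTOR_HF_CACHE_DIR` is the right knob depends on your setup, as noted above):

```shell
# Use the pre-cached models instead of downloading (offline host)
export HF_HOME=/var/cache/isartor
# then start the firewall as usual:
#   isartor up
```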
Level 1 Configuration Reference
These are the most relevant ISARTOR__* variables for Level 1 deployments. For the full reference, see the Configuration Reference.
| Variable | Default | Level 1 Notes |
|---|---|---|
| ISARTOR__HOST_PORT | 0.0.0.0:8080 | Bind address |
| ISARTOR__GATEWAY_API_KEY | "" | Set to enable gateway auth |
| ISARTOR__CACHE_MODE | both | both recommended — candle BertModel provides in-process semantic embeddings |
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-process Level 1 |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle GGUF SLM — zero external dependencies |
| ISARTOR__CACHE_TTL_SECS | 300 | Cache TTL in seconds |
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries per cache |
| ISARTOR__LLM_PROVIDER | openai | openai · azure · anthropic · xai |
| ISARTOR__EXTERNAL_LLM_API_KEY | (empty) | Required for Layer 3 fallback |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Cloud LLM model name |
| ISARTOR__ENABLE_MONITORING | false | Enable for stdout OTel (no collector needed) |
Embedded Classifier Defaults (Compiled)
| Setting | Default Value | Description |
|---|---|---|
| repo_id | mradermacher/gemma-2-2b-it-GGUF | HF repo for the GGUF model |
| gguf_filename | gemma-2-2b-it.Q4_K_M.gguf | Model file (~1.5 GB) |
| max_classify_tokens | 20 | Token limit for classification |
| max_generate_tokens | 256 | Token limit for simple task execution |
| temperature | 0.0 | Greedy decoding for classification |
| repetition_penalty | 1.1 | Avoids degenerate loops |
Performance Expectations
| Metric | Typical Value (4-core x86_64) |
|---|---|
| Cold start (model download) | 30–120 s (depends on bandwidth; ~1.5 GB Gemma + ~90 MB candle BertModel) |
| Warm start (cached model) | 3–8 s |
| Classification latency | 50–200 ms |
| Simple task execution | 200–2000 ms |
| Firewall overhead (no inference) | < 1 ms |
| Memory (steady state) | ~1.6 GB |
| Binary size | ~5 MB |
Upgrading to Level 2
When your traffic outgrows Level 1, the migration path is straightforward:
- Add the generation sidecar — `ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081` (replaces embedded candle with the more powerful Phi-3-mini on GPU).
- Optionally add an embedding sidecar — `ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082` (only needed for external embedding inference; the default L1b semantic cache already uses in-process candle BertModel).
- Deploy via Docker Compose — see Level 2 — Sidecar Deployment.

Note: The pluggable backend defaults (`cache_backend=memory`, `router_backend=embedded`) remain appropriate for Level 2 single-host deployments. You only need to switch to `cache_backend=redis` and `router_backend=vllm` at Level 3 when scaling horizontally.
No code changes required — only environment variables and infrastructure.
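When you eventually reach Level 3, the backend switch is likewise just environment variables (a sketch; connection details for Redis and vLLM are deployment-specific and not shown here):

```shell
# Level 3 backends for horizontal scaling (values per the note above)
export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__ROUTER_BACKEND=vllm
```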
Level 2 — Sidecar Deployment
Split architecture: Isartor firewall + llama.cpp generation sidecar on a single host.
This guide covers deploying Isartor with a dedicated AI sidecar for generation. The firewall delegates Layer 2 inference to a lightweight llama.cpp container via HTTP, while Layer 1 semantic cache embeddings run in-process via candle BertModel (no embedding sidecar required). The overall stack runs on a single machine via Docker Compose.
When to Use Level 2
| ✅ Good Fit | ❌ Consider Level 1 or Level 3 |
|---|---|
| Single host with GPU (NVIDIA, AMD) | No GPU available → Level 1 embedded candle |
| Want GPU-accelerated Layer 2 generation | Multi-node scaling → Level 3 Kubernetes |
| Want full observability stack (Jaeger, Grafana) | Budget VPS (< 4 GB RAM) → Level 1 |
| Development with production-like topology | Auto-scaling inference pools → Level 3 |
| 10–100 concurrent users | > 100 concurrent users → Level 3 |
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 10 GB | 20 GB (model cache) |
| CPU | 4 cores | 8+ cores |
| GPU (optional) | NVIDIA with 4 GB VRAM | NVIDIA with 8+ GB VRAM |
| Docker | 24.0+ | Latest |
| Docker Compose | v2.20+ | Latest |
| NVIDIA Container Toolkit (GPU) | Latest | Latest |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Single Host │
│ │
│ ┌─────────────┐ ┌───────────────────┐ ┌──────────────┐ │
│ │ Client │───▶│ Isartor Firewall │ │ Jaeger UI │ │
│ │ │ │ :8080 │ │ :16686 │ │
│ └─────────────┘ │ (candle L1 │ └──────────────┘ │
│ │ embeddings │ │
│ │ built-in) │ │
│ └──┬────────────────┘ │
│ │ │
│ HTTP :8081│ │
│ ▼ │
│ ┌────────────┐ ┌──────────────┐ │
│ │ slm-gen │ │ Grafana │ │
│ │ Phi-3-mini │ │ :3000 │ │
│ │ (llama.cpp)│ └──────────────┘ │
│ └────────────┘ │
│ ┌──────────────┐ │
│ ┌─────────────────────────┐ │ Prometheus │ │
│ │ OTel Collector :4317 │────▶│ :9090 │ │
│ └─────────────────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Optional: slm-embed :8082 (llama.cpp) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Services
| Service | Image | Port | Purpose | Memory Limit |
|---|---|---|---|---|
| gateway | isartor:latest (built) | 8080 | Prompt Firewall (includes candle BertModel for Layer 1 embeddings) | 256 MB |
| slm-generation | ghcr.io/ggml-org/llama.cpp:server | 8081 | Phi-3-mini-4k (Q4_K_M) — intent classification + generation | 4 GB |
| slm-embedding (optional) | ghcr.io/ggml-org/llama.cpp:server | 8082 | all-MiniLM-L6-v2 (Q8_0) — external embedding sidecar (default uses in-process candle) | 512 MB |
| otel-collector | otel/opentelemetry-collector-contrib:0.96.0 | 4317 | OTLP gRPC receiver | 128 MB |
| jaeger | jaegertracing/all-in-one:1.55 | 16686 | Distributed tracing UI | 256 MB |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics storage (7d retention) | 256 MB |
| grafana | grafana/grafana:10.4.0 | 3000 | Dashboards | 256 MB |
Quick Start (CPU Only)
1. Clone the Repository
git clone https://github.com/isartor-ai/isartor.git
cd isartor/docker
2. Configure Layer 3 (Optional)
Layers 0–2 work without a cloud LLM key. If you want Layer 3 fallback:
cp .env.full.example .env.full
Edit .env.full and set your provider:
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
3. Start the Full Stack
docker compose -f docker-compose.sidecar.yml up --build
First launch downloads model files (~1.5 GB for Phi-3 + ~50 MB for MiniLM). Subsequent starts use the cached isartor-slm-models volume.
4. Wait for Health Checks
The firewall waits for both sidecars to become healthy before starting:
docker compose -f docker-compose.sidecar.yml ps
All services should show healthy or running.
5. Verify
# Health check
curl http://localhost:8080/healthz
# Test the firewall
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "What is 2+2?"}' | jq .
# If you enabled gateway auth, add:
# -H "X-API-Key: your-secret-key"
# Check traces in Jaeger
open http://localhost:16686
GPU Passthrough (NVIDIA)
To enable GPU acceleration for the llama.cpp sidecars:
1. Install NVIDIA Container Toolkit
# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
2. Add GPU Resources to Compose
Create a docker-compose.gpu.override.yml:
services:
slm-generation:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
# The default --n-gpu-layers 99 in docker-compose.sidecar.yml
# already offloads all layers to GPU when available.
slm-embedding:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
3. Start with GPU Override
docker compose \
-f docker-compose.sidecar.yml \
-f docker-compose.gpu.override.yml \
up --build
Expected GPU Impact
| Metric | CPU Only (8-core) | GPU (RTX 3060 12 GB) |
|---|---|---|
| Phi-3 classification | 500–2000 ms | 30–100 ms |
| Phi-3 generation (256 tokens) | 5–15 s | 0.5–2 s |
| MiniLM embedding | 20–50 ms | 5–10 ms |
Available Compose Files
The docker/ directory contains several Compose configurations for different use cases:
| File | Description | Provider |
|---|---|---|
| docker-compose.sidecar.yml | Recommended. Full stack with llama.cpp sidecars + observability | Any (configurable) |
| docker-compose.yml | Legacy stack with Ollama (heavier) | OpenAI |
| docker-compose.azure.yml | Legacy stack with Ollama, pre-configured for Azure OpenAI | Azure |
| docker-compose.observability.yml | Observability-focused stack (Ollama + OTel + Jaeger + Grafana) | Azure |
We recommend `docker-compose.sidecar.yml` for all new deployments. The llama.cpp sidecars are ~30 MB each vs. Ollama's ~1.5 GB.
Environment Variables (Level 2 Specific)
These variables are relevant to the sidecar architecture. For the full reference, see the Configuration Reference.
Firewall ↔ Sidecar Communication
| Variable | Default | Description |
|---|---|---|
| ISARTOR__LAYER2__SIDECAR_URL | http://127.0.0.1:8081 | Generation sidecar URL (use Docker service name in Compose: http://slm-generation:8081) |
| ISARTOR__LAYER2__MODEL_NAME | phi-3-mini | Model name for OpenAI-compatible requests |
| ISARTOR__LAYER2__TIMEOUT_SECONDS | 30 | HTTP timeout for generation calls |
| ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL | http://127.0.0.1:8082 | Embedding sidecar URL — optional (default uses in-process candle; use http://slm-embedding:8082 in Compose) |
| ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME | all-minilm | Embedding model name (sidecar only) |
| ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS | 10 | HTTP timeout for embedding calls (sidecar only) |
Pluggable Backends
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-host Docker Compose |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle SLM classification — no external dependency |

Scalability note: These defaults are appropriate for Level 2 (single host). When moving to Level 3 (multi-replica K8s), switch to `cache_backend=redis` and `router_backend=vllm` for horizontal scaling.
Cache
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MODE | both | Use both — in-process candle BertModel provides semantic embeddings at all tiers |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for cache hits |
Observability
| Variable | Default | Description |
|---|---|---|
| ISARTOR__ENABLE_MONITORING | true (in Compose) | Enable OTel trace/metric export |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://otel-collector:4317 | OTel Collector gRPC endpoint |
Operational Commands
Logs
# All services
docker compose -f docker-compose.sidecar.yml logs -f
# Firewall only
docker compose -f docker-compose.sidecar.yml logs -f gateway
# Sidecars
docker compose -f docker-compose.sidecar.yml logs -f slm-generation slm-embedding
Restart a Service
docker compose -f docker-compose.sidecar.yml restart gateway
Tear Down (Preserve Model Cache)
docker compose -f docker-compose.sidecar.yml down
# Models persist in the 'isartor-slm-models' volume
Tear Down (Clean Everything)
docker compose -f docker-compose.sidecar.yml down -v
# Removes all volumes including model cache — next start re-downloads models
View Model Cache Size
docker volume inspect isartor-slm-models
Networking Notes
- All services share a Docker bridge network created by Compose.
- The firewall references sidecars by Docker service name (slm-generation, slm-embedding), not localhost.
- Only the firewall (8080), Jaeger UI (16686), Grafana (3000), and Prometheus (9090) are exposed to the host.
- Sidecar ports (8081, 8082) are also exposed for debugging but can be removed in production by deleting the ports: mapping.
Scaling Within Level 2
Before moving to Level 3, you can vertically scale Level 2:
| Optimisation | How |
|---|---|
| More GPU VRAM | Use larger quantisation (Q8_0 instead of Q4_K_M) for better quality |
| Bigger model | Swap Phi-3-mini for Phi-3-medium or Qwen2-7B in the Compose command |
| More cache | Increase ISARTOR__CACHE_MAX_CAPACITY and ISARTOR__CACHE_TTL_SECS |
| Faster embedding | Use nomic-embed-text (768-dim) for richer semantic matching |
| More concurrency | Scale horizontally with multiple firewall replicas behind a load balancer |
Upgrading to Level 3
When a single host is no longer sufficient:
- Extract the firewall into stateless Kubernetes pods (it's already stateless).
- Replace sidecars with an auto-scaling inference pool (vLLM, TGI, or Triton).
- Add an internal load balancer between firewall pods and the inference pool.
- Move observability to a managed solution (Datadog, Grafana Cloud, Azure Monitor).
See Level 3 — Enterprise Deployment for the full Kubernetes guide.
Level 3 — Enterprise Deployment
Fully decoupled microservices: stateless firewall pods + auto-scaling GPU inference pools.
This guide covers deploying Isartor on Kubernetes with Helm, horizontal pod autoscaling, dedicated GPU inference pools (vLLM or TGI), service mesh integration, and production-grade observability.
When to Use Level 3
| ✅ Good Fit | ❌ Overkill For |
|---|---|
| 100+ concurrent users | < 50 users → Level 2 Docker Compose |
| Multi-region / multi-zone HA | Single-machine development → Level 1 |
| Auto-scaling GPU inference | No GPU budget → Level 1 embedded candle |
| Compliance: mTLS, audit logs, RBAC | Hobby projects / PoCs |
| Cost optimisation via scale-to-zero | Teams without Kubernetes experience |
Architecture
┌────────────────────┐
│ Ingress / ALB │
│ (TLS termination) │
└──────────┬─────────┘
│
┌──────────────┴──────────────┐
│ Firewall Deployment │
│ (N stateless pods) │
│ │
│ ┌────────┐ ┌────────┐ │
│ │ Pod 1 │ │ Pod N │ │
│ │isartor │ │isartor │ │
│ └────────┘ └────────┘ │
│ │
│ HPA: CPU / custom metrics │
└──────────────┬───────────────┘
│
Internal ClusterIP
│
┌────────────────────┼────────────────────┐
│ │ │
┌────────▼───────┐ ┌────────▼───────┐ ┌────────▼───────┐
│ Inference Pool │ │ Embedding Pool │ │ Cloud LLM │
│ (vLLM / TGI) │ │ (TEI / llama) │ │ (OpenAI / etc) │
│ │ │ │ │ (Layer 3 only) │
│ GPU Nodes │ │ CPU/GPU Nodes │ └────────────────┘
│ HPA on GPU util │ │ HPA on RPS │
└─────────────────┘ └─────────────────┘
Component Summary
| Component | Replicas | Scaling Metric | Resource |
|---|---|---|---|
| Firewall | 2–20 | CPU utilisation / request rate | CPU nodes |
| Inference Pool (vLLM) | 1–N | GPU utilisation / queue depth | GPU nodes |
| Embedding Pool (TEI) | 1–N | Requests per second | CPU or GPU nodes (optional; default uses in-process candle) |
| OTel Collector | 1 (DaemonSet or Deployment) | — | CPU nodes |
| Ingress Controller | 1–2 | — | CPU nodes |
Prerequisites
| Requirement | Details |
|---|---|
| Kubernetes cluster | 1.28+ (EKS, GKE, AKS, or bare metal) |
| Helm | v3.12+ |
| kubectl | Matching cluster version |
| GPU nodes (for inference pool) | NVIDIA GPU Operator installed, or GKE/EKS GPU node pools |
| Container registry | For pushing the Isartor firewall image |
| Ingress controller | nginx-ingress, Istio, or cloud ALB |
Step 1: Build & Push the Firewall Image
# Build
docker build -t your-registry.io/isartor:v0.1.0 -f docker/Dockerfile .
# Push
docker push your-registry.io/isartor:v0.1.0
Step 2: Namespace & Secrets
kubectl create namespace isartor
# Cloud LLM API key (Layer 3 fallback)
kubectl create secret generic isartor-llm-secret \
--namespace isartor \
--from-literal=api-key='sk-...'
# Firewall API key (Layer 0 auth)
kubectl create secret generic isartor-gateway-secret \
--namespace isartor \
--from-literal=gateway-api-key='your-production-key'
Step 3: Firewall Deployment
# k8s/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-gateway
namespace: isartor
labels:
app: isartor-gateway
spec:
replicas: 2
selector:
matchLabels:
app: isartor-gateway
template:
metadata:
labels:
app: isartor-gateway
spec:
containers:
- name: gateway
image: your-registry.io/isartor:v0.1.0
ports:
- containerPort: 8080
name: http
env:
- name: ISARTOR__HOST_PORT
value: "0.0.0.0:8080"
- name: ISARTOR__GATEWAY_API_KEY
valueFrom:
secretKeyRef:
name: isartor-gateway-secret
key: gateway-api-key
# Pluggable backends — scaled for multi-replica K8s
- name: ISARTOR__CACHE_BACKEND
value: "redis" # Shared cache across all firewall pods
- name: ISARTOR__REDIS_URL
value: "redis://redis.isartor:6379"
- name: ISARTOR__ROUTER_BACKEND
value: "vllm" # GPU-backed vLLM inference pool
- name: ISARTOR__VLLM_URL
value: "http://isartor-inference:8081"
- name: ISARTOR__VLLM_MODEL
value: "gemma-2-2b-it"
# Cache
- name: ISARTOR__CACHE_MODE
value: "both"
- name: ISARTOR__SIMILARITY_THRESHOLD
value: "0.85"
- name: ISARTOR__CACHE_TTL_SECS
value: "300"
- name: ISARTOR__CACHE_MAX_CAPACITY
value: "50000"
# Inference pool (internal service)
- name: ISARTOR__LAYER2__SIDECAR_URL
value: "http://isartor-inference:8081"
- name: ISARTOR__LAYER2__MODEL_NAME
value: "phi-3-mini"
- name: ISARTOR__LAYER2__TIMEOUT_SECONDS
value: "30"
# Embedding pool (optional — default uses in-process candle)
- name: ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL
value: "http://isartor-embedding:8082"
- name: ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME
value: "all-minilm"
- name: ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS
value: "10"
# Layer 3 — Cloud LLM
- name: ISARTOR__LLM_PROVIDER
value: "openai"
- name: ISARTOR__EXTERNAL_LLM_MODEL
value: "gpt-4o-mini"
- name: ISARTOR__EXTERNAL_LLM_API_KEY
valueFrom:
secretKeyRef:
name: isartor-llm-secret
key: api-key
# Observability
- name: ISARTOR__ENABLE_MONITORING
value: "true"
- name: ISARTOR__OTEL_EXPORTER_ENDPOINT
value: "http://otel-collector.isartor:4317"
resources:
requests:
cpu: "250m"
memory: "128Mi"
limits:
cpu: "1000m"
memory: "256Mi"
readinessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 10
periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
name: isartor-gateway
namespace: isartor
spec:
selector:
app: isartor-gateway
ports:
- port: 8080
targetPort: http
name: http
type: ClusterIP
Step 4: Inference Pool (vLLM)
vLLM provides high-throughput, GPU-optimised inference with continuous batching.
# k8s/inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-inference
namespace: isartor
labels:
app: isartor-inference
spec:
replicas: 1
selector:
matchLabels:
app: isartor-inference
template:
metadata:
labels:
app: isartor-inference
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "microsoft/Phi-3-mini-4k-instruct"
- "--host"
- "0.0.0.0"
- "--port"
- "8081"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.9"
ports:
- containerPort: 8081
name: http
resources:
requests:
nvidia.com/gpu: 1
memory: "8Gi"
limits:
nvidia.com/gpu: 1
memory: "16Gi"
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 60
periodSeconds: 10
nodeSelector:
nvidia.com/gpu.present: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: isartor-inference
namespace: isartor
spec:
selector:
app: isartor-inference
ports:
- port: 8081
targetPort: http
name: http
type: ClusterIP
Alternative: Text Generation Inference (TGI)
Replace vLLM with TGI if you prefer Hugging Face's inference server:
containers:
- name: tgi
image: ghcr.io/huggingface/text-generation-inference:latest
args:
- "--model-id"
- "microsoft/Phi-3-mini-4k-instruct"
- "--port"
- "8081"
- "--max-input-length"
- "4096"
- "--max-total-tokens"
- "8192"
Alternative: llama.cpp Server (CPU / Light GPU)
For budget clusters without heavy GPU nodes:
containers:
- name: llama-cpp
image: ghcr.io/ggml-org/llama.cpp:server
args:
- "--host"
- "0.0.0.0"
- "--port"
- "8081"
- "--hf-repo"
- "microsoft/Phi-3-mini-4k-instruct-gguf"
- "--hf-file"
- "Phi-3-mini-4k-instruct-q4.gguf"
- "--ctx-size"
- "4096"
- "--n-gpu-layers"
- "99"
Step 5: Embedding Pool (TEI) — Optional
Note: The gateway generates Layer 1 embeddings in-process via candle BertModel. This external embedding pool is optional for high-throughput deployments that want to offload embedding generation.
Text Embeddings Inference (TEI) provides optimised embedding generation.
# k8s/embedding-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: isartor-embedding
namespace: isartor
labels:
app: isartor-embedding
spec:
replicas: 2
selector:
matchLabels:
app: isartor-embedding
template:
metadata:
labels:
app: isartor-embedding
spec:
containers:
- name: tei
image: ghcr.io/huggingface/text-embeddings-inference:cpu-latest
args:
- "--model-id"
- "sentence-transformers/all-MiniLM-L6-v2"
- "--port"
- "8082"
ports:
- containerPort: 8082
name: http
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "1Gi"
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: isartor-embedding
namespace: isartor
spec:
selector:
app: isartor-embedding
ports:
- port: 8082
targetPort: http
name: http
type: ClusterIP
Step 6: Horizontal Pod Autoscaler
Gateway HPA
# k8s/gateway-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: isartor-gateway-hpa
namespace: isartor
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: isartor-gateway
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Pods
value: 4
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 2
periodSeconds: 120
Inference Pool HPA (Custom Metrics)
For GPU-based scaling, use custom metrics from Prometheus:
# k8s/inference-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: isartor-inference-hpa
namespace: isartor
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: isartor-inference
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "80"
Note: GPU-based HPA requires the Prometheus Adapter or KEDA to expose GPU metrics to the HPA controller.
Step 7: Ingress
nginx-ingress Example
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: isartor-ingress
namespace: isartor
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.isartor.example.com
secretName: isartor-tls
rules:
- host: api.isartor.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: isartor-gateway
port:
number: 8080
Istio VirtualService (Service Mesh)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: isartor-vs
namespace: isartor
spec:
hosts:
- api.isartor.example.com
gateways:
- isartor-gateway
http:
- match:
- uri:
prefix: /api/
route:
- destination:
host: isartor-gateway
port:
number: 8080
timeout: 120s
retries:
attempts: 2
perTryTimeout: 60s
Step 8: Apply Everything
# Apply in order
kubectl apply -f k8s/gateway-deployment.yaml
kubectl apply -f k8s/inference-deployment.yaml
kubectl apply -f k8s/embedding-deployment.yaml
kubectl apply -f k8s/gateway-hpa.yaml
kubectl apply -f k8s/inference-hpa.yaml
kubectl apply -f k8s/ingress.yaml
# Verify
kubectl get pods -n isartor
kubectl get svc -n isartor
kubectl get hpa -n isartor
Redis Configuration for Distributed Cache
Enterprise deployments use Redis to share the exact-match cache across all firewall pods. Configure the cache provider via environment variables or isartor.yaml:
Environment Variables
ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
YAML Configuration
exact_cache:
provider: redis
redis_url: "redis://redis-cluster.svc:6379"
# Optional: redis_db: 0
Kubernetes Topology with Redis
Deploy Redis as a StatefulSet within the cluster, accessible only via ClusterIP:
[Ingress]
|
[Isartor Deployment] <--> [Redis StatefulSet]
|
+--> [vLLM Deployment (GPU nodes)]
- Isartor pods scale horizontally for network I/O and cache hits.
- Redis ensures cache consistency across all pods.
- The vLLM GPU pool scales independently for inference throughput.
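The exact-match layer only needs a stable key per prompt, which is what makes the cache safe to share. The sketch below illustrates the idea with a dict standing in for the Redis StatefulSet; the key-derivation scheme (SHA-256 over whitespace-normalised text) is illustrative, not Isartor's actual hashing implementation.

```python
import hashlib

shared_cache = {}  # stands in for the Redis StatefulSet

def cache_key(prompt: str) -> str:
    # Normalise whitespace so trivially re-serialised prompts still collide.
    normalised = " ".join(prompt.split())
    return "isartor:l1a:" + hashlib.sha256(normalised.encode()).hexdigest()

def lookup_or_store(prompt, compute):
    """Return (response, 'hit'|'miss') against the shared cache."""
    key = cache_key(prompt)
    if key in shared_cache:
        return shared_cache[key], "hit"
    response = compute(prompt)
    shared_cache[key] = response
    return response, "miss"

# Pod 1 misses and fills the cache; Pod 2 hits the same shared entry,
# even though it received the prompt with slightly different whitespace.
r1 = lookup_or_store("What is the capital of France?", lambda p: "Paris")
r2 = lookup_or_store("What is  the capital of France?", lambda p: "Paris")
print(r1, r2)
```

Because every pod derives the same key, a new replica starts with a warm cache instead of a cold one.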
vLLM Configuration for SLM Routing
Enterprise deployments replace the embedded candle SLM with a remote vLLM inference pool for higher throughput. Configure the router backend via environment variables or isartor.yaml:
Environment Variables
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm-openai.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
YAML Configuration
slm_router:
provider: remote_http
remote_url: "http://vllm-openai.svc:8000"
model: "meta-llama/Llama-3-8B-Instruct"
Docker Compose Example (Enterprise Sidecar)
For development or staging environments that mirror enterprise topology:
services:
isartor:
image: isartor-ai/isartor:latest
ports:
- "8080:8080"
environment:
- ISARTOR__CACHE_BACKEND=redis
- ISARTOR__REDIS_URL=redis://redis-cluster:6379
- ISARTOR__ROUTER_BACKEND=vllm
- ISARTOR__VLLM_URL=http://vllm-openai:8000
- ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
depends_on:
- redis
- vllm-openai
redis:
image: redis:7
ports:
- "6379:6379"
vllm-openai:
image: vllm/vllm-openai:latest
ports:
- "8000:8000"
Observability in Level 3
For Kubernetes deployments, you have several options:
| Approach | Stack | Effort |
|---|---|---|
| Self-managed | OTel Collector DaemonSet → Jaeger + Prometheus + Grafana | Medium |
| Managed (AWS) | AWS X-Ray + CloudWatch + Managed Grafana | Low |
| Managed (GCP) | Cloud Trace + Cloud Monitoring | Low |
| Managed (Azure) | Azure Monitor + Application Insights | Low |
| Third-party | Datadog / New Relic / Grafana Cloud | Low |
The gateway exports traces and metrics via OTLP gRPC to whatever ISARTOR__OTEL_EXPORTER_ENDPOINT points at. See Metrics & Tracing for detailed setup.
Scalability Deep-Dive
Level 3 is designed for horizontal scaling. The Pluggable Trait Provider architecture ensures every component can scale independently:
Stateless Gateway Pods
The Isartor gateway binary is fully stateless when configured with cache_backend=redis and router_backend=vllm. All request-scoped state (cache, inference) is offloaded to external services, meaning:
- Gateway pods scale linearly — add replicas via HPA without coordination overhead.
- Zero warm-up penalty — new pods serve requests immediately (no model loading, no cache priming).
- Rolling updates — deploy new versions with zero downtime; old and new pods share the same Redis cache.
Shared Cache via Redis
With ISARTOR__CACHE_BACKEND=redis:
| Benefit | Impact |
|---|---|
| Consistent hit rate | All pods read/write the same cache — no per-pod cold caches |
| Memory efficiency | Cache memory is centralised, not duplicated N times |
| Persistence | Redis AOF/RDB survives pod restarts |
| Cluster mode | Redis Cluster or ElastiCache provides sharded, HA caching |
GPU Inference Pool (vLLM)
With ISARTOR__ROUTER_BACKEND=vllm:
| Benefit | Impact |
|---|---|
| Independent GPU scaling | Scale inference replicas separately from gateway pods |
| Continuous batching | vLLM's PagedAttention maximises GPU utilisation |
| Mixed hardware | Gateway runs on cheap CPU nodes; inference on GPU nodes |
| Cost control | Scale inference to zero when idle (KEDA + queue-depth trigger) |
Scaling Dimensions
| Dimension | Knob | Metric |
|---|---|---|
| Gateway replicas | HPA minReplicas / maxReplicas | CPU utilisation, request rate |
| Inference replicas | HPA on custom GPU metrics | GPU utilisation, queue depth |
| Cache capacity | ISARTOR__CACHE_MAX_CAPACITY | Cache hit rate, memory usage |
| Concurrency | HPA + replica scaling | P95 latency, request rate |
| Redis | Redis Cluster nodes | Key count, memory, eviction rate |
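These dimensions interact: every cache hit removes load from the inference and cloud tiers downstream. A back-of-envelope sizing sketch, using the 71% L1 deflection figure quoted for repetitive agentic traffic; the SLM local-resolution share and per-replica throughput are placeholders to replace with your own measurements.

```python
import math

total_rps = 500            # incoming requests/second at the gateway
l1_deflection = 0.71       # L1a+L1b hit rate for repetitive agentic traffic
slm_local_share = 0.5      # placeholder: fraction of survivors L2 resolves locally
rps_per_vllm_replica = 40  # placeholder: measure on your own GPUs

after_l1 = total_rps * (1 - l1_deflection)            # load reaching L2
to_cloud = after_l1 * (1 - slm_local_share)           # load reaching L3
replicas = math.ceil(after_l1 / rps_per_vllm_replica)

print(f"L2 inference load: {after_l1:.0f} rps -> {replicas} vLLM replicas")
print(f"L3 cloud load: {to_cloud:.1f} rps")
```

Note how the gateway tier sees 500 rps while the GPU pool only needs to absorb ~145 rps, which is why the two scale on different metrics.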
Cost Optimisation
| Strategy | Description |
|---|---|
| Spot / preemptible nodes | Use for inference pods (they're stateless and restart quickly) |
| Scale-to-zero | Use KEDA with queue-depth trigger to scale inference to 0 when idle |
| Right-size GPU | A100 80 GB for large models, T4/L4 for Phi-3-mini (4 GB VRAM is sufficient) |
| Shared GPU | NVIDIA MPS or MIG to run multiple inference pods per GPU |
| Semantic cache | Higher ISARTOR__CACHE_MAX_CAPACITY = fewer inference calls |
| Smaller quantisation | Q4_K_M uses less VRAM at marginal quality cost |
Security Checklist
- TLS termination at ingress (cert-manager + Let's Encrypt or cloud certs)
- mTLS between services (Istio / Linkerd / Cilium)
- ISARTOR__GATEWAY_API_KEY from a Kubernetes Secret, not plaintext
- ISARTOR__EXTERNAL_LLM_API_KEY from a Kubernetes Secret
- Network policies restricting pod-to-pod communication
- RBAC: least-privilege ServiceAccounts for each workload
- Pod security standards: restricted or baseline
- Image scanning (Trivy, Snyk) in CI pipeline
- Audit logging enabled on the cluster
Downgrading to Level 2
If Kubernetes overhead doesn't justify the scale:
- Export your env vars from the Kubernetes ConfigMap/Secret.
- Map them into docker/.env.full.
- Run docker compose -f docker-compose.sidecar.yml up --build.
No code changes — the binary is identical across all three tiers.
Air-Gapped / Offline Deployment
Overview
Isartor is architecturally the most air-gap-friendly LLM gateway available. Its pure-Rust statically compiled binary embeds all inference models at build time, requires no runtime dependencies, and validates licenses with an offline HMAC check — so Isartor itself does not initiate unsolicited telemetry or license calls to external services.
The zero-phone-home guarantee applies to Isartor-managed network paths: the
--offline flag disables L3 cloud routing and external observability backends
at the application layer, and our CI phone-home audit test (see
tests/phone_home_audit.rs) exercises these code paths on every commit.
Supported regulated industries: defense, healthcare (HIPAA), finance (SOX), and government (FedRAMP).
Pre-Deployment Checklist
Complete these steps before deploying Isartor in an air-gapped environment:
1. Download the air-gapped Docker image:

   docker pull ghcr.io/isartor-ai/isartor:latest-airgapped

   This image includes local copies of the L1b embedding models to minimize or avoid external downloads during normal operation in most setups. See Image Size Comparison for size details, and follow any additional configuration steps your environment requires to operate fully offline.

2. Transfer to your air-gapped environment via your organisation's approved media transfer process (USB, air-gap data diode, etc.).

3. Enable offline mode:

   export ISARTOR__OFFLINE_MODE=true

   Alternatively, pass --offline on the command line: isartor --offline

4. Disable L3 or point it at an internal LLM endpoint:

   - For strictly air-gapped / zero-egress deployments, you must enable offline mode (step 3). Leaving ISARTOR__EXTERNAL_LLM_API_KEY unset alone does not prevent the gateway from attempting outbound L3 calls to the default external endpoint on cache misses.
   - To run fully local (cache + SLM only) with no outbound attempts, enable offline mode and leave ISARTOR__EXTERNAL_LLM_API_KEY unset.
   - To route L3 to a self-hosted model, see Connecting to an Internal LLM.

5. Run isartor check to confirm zero external connections:

   isartor check

   Expected output (with offline mode active):

   Isartor Connectivity Audit
   ──────────────────────────
   Required (L3 cloud routing):
     → api.openai.com:443 [NOT CONFIGURED] (BLOCKED — offline mode active)
   Optional (observability / monitoring):
     → http://localhost:4317 [NOT CONFIGURED]
   Internal only (no external):
     → (in-memory cache — no network connection) [CONFIGURED - internal]
   Zero hidden telemetry connections: ✓ VERIFIED
   Air-gap compatible: ✓ YES (L3 disabled or offline mode active)

6. Run isartor audit verify (planned — see issue #3) to confirm the signed audit log is functioning correctly.
Connecting to an Internal LLM
In this configuration Isartor acts as a fully air-gapped deflection layer in front of an internal LLM. 100% of traffic stays inside the perimeter: L1a and L1b handle cached / semantically similar prompts locally, and only genuine cache misses are forwarded to your self-hosted model over the internal network.
# Route L3 to a self-hosted vLLM instance on the internal network.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai # vLLM exposes an OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct
# Enable offline mode to block any accidental external connections.
export ISARTOR__OFFLINE_MODE=true
# Start the gateway.
isartor
Note: ISARTOR__EXTERNAL_LLM_URL sets the L3 endpoint URL. Point it at your internal vLLM or TGI server.
With this configuration:
- L1a (exact cache) deflects duplicate prompts instantly (< 1 ms).
- L1b (semantic cache) deflects semantically similar prompts (1–5 ms).
- L3 forwards surviving cache-miss prompts to your internal vLLM.
- Zero bytes leave the network perimeter.
Startup Status Banner
When offline mode is active, Isartor prints a status banner at startup so operators can confirm the configuration at a glance:
┌──────────────────────────────────────────────────────┐
│ [Isartor] OFFLINE MODE ACTIVE │
├──────────────────────────────────────────────────────┤
│ ✓ L1a Exact Cache: active │
│ ✓ L1b Semantic Cache: active │
│ - L2 SLM Router: disabled (ENABLE_SLM_ROUTER=false)│
│ ✗ L3 Cloud Logic: DISABLED (offline mode) │
│ ✗ Telemetry export: DISABLED if external endpoint │
│ ✓ License validation: offline HMAC check │
└──────────────────────────────────────────────────────┘
Environment Variables Reference
| Variable | Default | Description |
|---|---|---|
| ISARTOR__OFFLINE_MODE | false | Enable air-gap mode. Blocks L3 cloud calls. |
| ISARTOR__EXTERNAL_LLM_URL | — | Internal LLM endpoint (vLLM, TGI, etc.). |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Model name passed to the internal LLM. |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for L1b cache hits. Lower values increase local deflection. |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://localhost:4317 | OTel collector endpoint. External URLs are suppressed in offline mode. |
For the complete variable listing, see the Configuration Reference.
Image Size Comparison
| Image | Tag | Includes models | Compressed size |
|---|---|---|---|
| Base | latest | No (downloads on first run) | ~120 MB |
| Air-gapped | latest-airgapped | Yes (all-MiniLM-L6-v2 embedded) | ~210 MB |
The latest-airgapped image is approximately 90 MB larger due to the
pre-bundled embedding model. This is the recommended image for any environment
with restricted outbound internet access.
Compliance Notes
FedRAMP / NIST 800-53
This deployment posture supports the following NIST 800-53 controls:
| Control | Description | How Isartor Supports It |
|---|---|---|
| AU-2 | Audit Logging | Every prompt, deflection decision, and L3 call is logged as a structured JSON event with tracing spans. |
| SC-7 | Boundary Protection | ISARTOR__OFFLINE_MODE=true enforces a hard block on all outbound connections. The phone-home audit CI test verifies this. |
| SI-4 | Information System Monitoring | OpenTelemetry traces + Prometheus metrics provide real-time visibility into the deflection stack. Internal-only OTel endpoints are supported. |
| CM-6 | Configuration Settings | All settings are controlled via environment variables with documented defaults. No runtime code changes are needed. |
HIPAA
When ISARTOR__OFFLINE_MODE=true and L3 is pointed at an internal model:
- PHI in prompts never leaves the network perimeter.
- The L1b semantic cache computes embeddings in-process using a pure-Rust candle model — no external API calls.
- Audit logs are written to stdout for ingestion by your internal SIEM.
Disclaimer
This document describes deployment architecture. The controls described above are architectural claims based on code behaviour — they are not a formal compliance certification. Consult your compliance team and engage a qualified assessor for formal FedRAMP authorization or HIPAA compliance review.
Further Reading
Configuration Reference
Complete reference for every Isartor configuration variable, CLI command, and provider option.
Configuration Loading Order
Isartor loads configuration in the following order (later sources override earlier ones):
1. Compiled defaults — baked into the binary
2. isartor.toml — if present in the working directory or ~/.isartor/
3. Environment variables — ISARTOR__... with double-underscore separators
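The double-underscore separator maps each environment variable onto a nested configuration key. The sketch below shows the convention only; the real loader is implemented in Rust, and the folding logic here is illustrative.

```python
def env_to_nested(environ, prefix="ISARTOR__"):
    """Fold ISARTOR__A__B=v style variables into {'a': {'b': 'v'}}."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config

env = {"ISARTOR__LAYER2__MODEL_NAME": "phi-3-mini", "ISARTOR__PORT": "8080"}
result = env_to_nested(env)
print(result)
```

So ISARTOR__LAYER2__MODEL_NAME corresponds to the layer2.model_name key, overriding any value from isartor.toml or the compiled defaults.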
Generate a starter config file with:
isartor init
Master Configuration Table
| YAML Key | Environment Variable | Type | Default | Description |
|---|---|---|---|---|
| server.host | ISARTOR__HOST | string | 0.0.0.0 | Host address for server binding |
| server.port | ISARTOR__PORT | int | 8080 | Port for HTTP server |
| exact_cache.provider | ISARTOR__CACHE_BACKEND | string | memory | Layer 1a cache backend: memory or redis |
| exact_cache.redis_url | ISARTOR__REDIS_URL | string | (none) | Redis connection string (if provider=redis) |
| exact_cache.redis_db | ISARTOR__REDIS_DB | int | 0 | Redis database index |
| semantic_cache.provider | ISARTOR__SEMANTIC_BACKEND | string | candle | Layer 1b semantic cache: candle (in-process) or tei (external) |
| semantic_cache.remote_url | ISARTOR__TEI_URL | string | (none) | TEI endpoint (if provider=tei) |
| slm_router.provider | ISARTOR__ROUTER_BACKEND | string | embedded | Layer 2 router: embedded or vllm |
| slm_router.remote_url | ISARTOR__VLLM_URL | string | (none) | vLLM/TGI endpoint (if provider=vllm) |
| slm_router.model | ISARTOR__VLLM_MODEL | string | gemma-2-2b-it | Model name/path for SLM router |
| slm_router.model_path | ISARTOR__MODEL_PATH | string | (baked-in) | Path to GGUF model file (embedded mode) |
| slm_router.classifier_mode | ISARTOR__LAYER2__CLASSIFIER_MODE | string | tiered | Classifier mode: tiered (TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX) |
| slm_router.max_answer_tokens | ISARTOR__LAYER2__MAX_ANSWER_TOKENS | u64 | 2048 | Max tokens the SLM may generate for a local answer |
| fallback.openai_api_key | ISARTOR__OPENAI_API_KEY | string | (none) | OpenAI API key for Layer 3 fallback |
| fallback.anthropic_api_key | ISARTOR__ANTHROPIC_API_KEY | string | (none) | Anthropic API key for Layer 3 fallback |
| llm_provider | ISARTOR__LLM_PROVIDER | string | openai | LLM provider (see below for full list) |
| external_llm_model | ISARTOR__EXTERNAL_LLM_MODEL | string | gpt-4o-mini | Model name to request from the provider |
| external_llm_api_key | ISARTOR__EXTERNAL_LLM_API_KEY | string | (none) | API key for the configured LLM provider (not needed for ollama) |
| l3_timeout_secs | ISARTOR__L3_TIMEOUT_SECS | u64 | 120 | HTTP timeout applied to all Layer 3 provider requests |
| enable_context_optimizer | ISARTOR__ENABLE_CONTEXT_OPTIMIZER | bool | true | Master switch for L2.5 context optimiser |
| context_optimizer_dedup | ISARTOR__CONTEXT_OPTIMIZER_DEDUP | bool | true | Enable cross-turn instruction deduplication |
| context_optimizer_minify | ISARTOR__CONTEXT_OPTIMIZER_MINIFY | bool | true | Enable static minification (comments, rules, blanks) |
Sections
Server
- server.host, server.port: Bind address and port.
Layer 1a: Exact Match Cache
- exact_cache.provider: memory or redis
- exact_cache.redis_url, exact_cache.redis_db: Redis config
Layer 1b: Semantic Cache
- semantic_cache.provider: candle or tei
- semantic_cache.remote_url: TEI endpoint
- Requests that carry x-isartor-session-id, x-thread-id, x-session-id, or x-conversation-id are isolated into a session-aware cache scope. The same scope can also be provided in request bodies via session_id, thread_id, conversation_id, or metadata.*. If no session identifier is present, Isartor keeps the legacy global-cache behavior.
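A sketch of how a session-aware cache scope can be derived from request headers. The fallback ordering follows the list above but is illustrative; Isartor's actual precedence between headers and body fields may differ.

```python
# Candidate headers, checked in order; the first one present wins.
SESSION_HEADERS = [
    "x-isartor-session-id",
    "x-thread-id",
    "x-session-id",
    "x-conversation-id",
]

def cache_scope(headers: dict) -> str:
    """Return a session-scoped cache namespace, or the legacy global scope."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in SESSION_HEADERS:
        if name in lowered:
            return f"session:{lowered[name]}"
    return "global"

print(cache_scope({"X-Thread-Id": "abc123"}))
print(cache_scope({}))
```

Scoping prevents one agent session's cached answers from leaking into an unrelated session, at the cost of a colder per-session cache.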
Layer 2: SLM Router
- slm_router.provider: embedded or vllm
- slm_router.remote_url, slm_router.model, slm_router.model_path: Router config
- slm_router.classifier_mode: tiered (default — TEMPLATE/SNIPPET/COMPLEX) or binary (legacy SIMPLE/COMPLEX)
- slm_router.max_answer_tokens: Max tokens the SLM may generate for a local answer (default 2048)
Layer 2.5: Context Optimiser
L2.5 compresses repeated instruction payloads (CLAUDE.md, copilot-instructions.md, skills blocks) before they reach the cloud, reducing input tokens on every L3 call.
- enable_context_optimizer: Master switch (default true). Set to false to disable L2.5 entirely.
- context_optimizer_dedup: Enable cross-turn instruction deduplication (default true). When the same instruction block is seen in consecutive turns of the same session, it is replaced with a compact hash reference.
- context_optimizer_minify: Enable static minification (default true). Strips HTML/XML comments, decorative horizontal rules, consecutive blank lines, and Unicode box-drawing decoration.
The pipeline processes system/instruction messages from OpenAI, Anthropic, and native request formats. See Deflection Stack — L2.5 for architecture details.
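A sketch of what static minification does to an instruction payload. The regexes below are illustrative examples of the comment, horizontal-rule, and blank-line stripping described above, not Isartor's exact rules.

```python
import re

def minify(text: str) -> str:
    # Strip HTML/XML comments (often internal notes in CLAUDE.md files).
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    # Strip decorative horizontal-rule lines (---, ===, ***, ...).
    text = re.sub(r"^\s*[-=*_]{3,}\s*$", "", text, flags=re.M)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

instructions = """# Project rules
<!-- internal note: do not ship -->
---


Always answer in English."""

minified = minify(instructions)
print(minified)
```

Every character removed here is an input token saved on each subsequent L3 call for the session.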
Layer 3: Cloud Fallbacks
- fallback.openai_api_key, fallback.anthropic_api_key: API keys for external LLMs
- llm_provider: Select the active provider. All providers are powered by rig-core except copilot, which uses Isartor's native GitHub Copilot adapter: openai (default), azure, anthropic, xai, gemini, mistral, groq, deepseek, cohere, galadriel, hyperbolic, huggingface, mira, moonshot, ollama (local, no key), openrouter, perplexity, together, copilot (GitHub Copilot subscription-backed L3)
- external_llm_model: Model name for the selected provider (e.g. gpt-4o-mini, gemini-2.0-flash, mistral-small-latest, llama-3.1-8b-instant, deepseek-chat, command-r, sonar, moonshot-v1-128k)
- external_llm_api_key: API key for the configured provider (not needed for ollama)
- l3_timeout_secs: Shared timeout, in seconds, for all Layer 3 provider HTTP calls
TOML Config Example
Generate a scaffold with isartor init, then edit isartor.toml:
[server]
host = "0.0.0.0"
port = 8080
[exact_cache]
provider = "memory" # "memory" or "redis"
# redis_url = "redis://127.0.0.1:6379"
# redis_db = 0
[semantic_cache]
provider = "candle" # "candle" or "tei"
# remote_url = "http://localhost:8082"
[slm_router]
provider = "embedded" # "embedded" or "vllm"
# remote_url = "http://localhost:8000"
# model = "gemma-2-2b-it"
# L2.5 Context Optimiser (all enabled by default)
# enable_context_optimizer = true
# context_optimizer_dedup = true
# context_optimizer_minify = true
[fallback]
# openai_api_key = "sk-..."
# anthropic_api_key = "sk-ant-..."
# llm_provider = "openai"
# external_llm_model = "gpt-4o-mini"
# external_llm_api_key = "sk-..."
Per-Tier Defaults
| Setting | Level 1 (Minimal) | Level 2 (Sidecar) | Level 3 (Enterprise) |
|---|---|---|---|
| Cache backend | memory | memory | redis |
| Semantic backend | candle | candle | tei (optional) |
| SLM router | embedded | embedded or sidecar | vllm |
| LLM provider | openai | openai | any |
| Monitoring | false | true | true |
Provider-Specific Configuration
Each provider requires ISARTOR__EXTERNAL_LLM_API_KEY (except Ollama) and a matching ISARTOR__LLM_PROVIDER value:
# OpenAI (default)
export ISARTOR__LLM_PROVIDER=openai
export ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
# Azure OpenAI
export ISARTOR__LLM_PROVIDER=azure
# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic
export ISARTOR__EXTERNAL_LLM_MODEL=claude-3-haiku-20240307
# xAI (Grok)
export ISARTOR__LLM_PROVIDER=xai
# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini
export ISARTOR__EXTERNAL_LLM_MODEL=gemini-2.0-flash
# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama
export ISARTOR__EXTERNAL_LLM_MODEL=llama3
# GitHub Copilot (configured automatically by `isartor connect claude-copilot`)
export ISARTOR__LLM_PROVIDER=copilot
export ISARTOR__EXTERNAL_LLM_MODEL=claude-sonnet-4.5
Setting API Keys with the CLI
Use isartor set-key for interactive key management:
isartor set-key --provider openai
isartor set-key --provider anthropic
isartor set-key --provider xai
This writes the key to isartor.toml or the appropriate env file.
CLI Commands
| Command | Description |
|---|---|
isartor up | Start the API gateway only (recommended default). Flag: --detach to run in background |
isartor up <copilot|claude|antigravity> | Start the gateway plus the CONNECT proxy for that client |
isartor init | Generate a commented isartor.toml config scaffold |
isartor demo | Run the post-install showcase (cache-only, or live + cache when a provider is configured) |
isartor check | Audit outbound connections |
isartor connect <client> | Configure AI clients to route through Isartor |
isartor connect copilot | Configure Copilot CLI with CONNECT proxy + TLS MITM |
isartor connect claude-copilot | Configure Claude Code to use GitHub Copilot through Isartor |
isartor stats | Show total prompts, counts by layer, and recent prompt routing history |
isartor set-key --provider <name> | Set LLM provider API key (writes to isartor.toml or env file) |
isartor stop | Stop a running Isartor instance (uses PID file). Flags: --force (SIGKILL), --pid-file <path> |
isartor update | Self-update to the latest (or specific) version. Flags: --version <tag>, --dry-run, --force |
See also: Architecture · Metrics & Tracing · Troubleshooting
Metrics & Tracing
Definitive reference for Isartor's OpenTelemetry traces, metrics, structured logging, and observability stack — from local development to Kubernetes.
Overview
Isartor uses OpenTelemetry for distributed
tracing and metrics, plus tracing-subscriber with a JSON layer for
structured logging.
| Signal | Protocol | Default Endpoint |
|---|---|---|
| Traces | OTLP gRPC | http://localhost:4317 |
| Metrics | OTLP gRPC | http://localhost:4317 |
| Logs | stdout (JSON) | — |
When ISARTOR__ENABLE_MONITORING=false (default), only the console log
layer is active — zero OTel overhead.
Architecture
┌─────────────┐ ┌──────────────────┐
│ Isartor │ OTLP gRPC │ OTel Collector │
│ Gateway │─────────────────▶│ :4317 │
│ │ (traces + │ │
│ │ metrics) │ Pipelines: │
└─────────────┘ │ traces → Jaeger │
│ metrics → Prom │
└───┬──────────┬────┘
│ │
┌──────────▼──┐ ┌────▼──────────┐
│ Jaeger │ │ Prometheus │
│ :16686 │ │ :9090 │
│ (UI) │ │ (scrape) │
└─────────────┘ └───────┬───────┘
│
┌───────▼───────┐
│ Grafana │
│ :3000 │
│ (dashboards) │
└───────────────┘
Enabling Monitoring
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://localhost:4317
RUST_LOG=info,h2=warn,hyper=warn,tower=warn # optional override
When ISARTOR__ENABLE_MONITORING=false (the default), Isartor uses console-only logging via tracing-subscriber with RUST_LOG filtering. No OTel SDK is initialised — zero overhead.
Telemetry Initialisation (src/telemetry.rs)
init_telemetry() returns an OtelGuard (RAII). The guard holds the
SdkTracerProvider and SdkMeterProvider; dropping it flushes pending
telemetry and shuts down exporters gracefully.
| Component | Description |
|---|---|
| JSON stdout layer | Structured logs emitted as JSON when monitoring is on |
| Pretty console layer | Human-readable output when monitoring is off |
| OTLP trace exporter | gRPC via opentelemetry-otlp → Collector |
| OTLP metric exporter | gRPC via opentelemetry-otlp → Collector |
| EnvFilter | Reads RUST_LOG, defaults to info,h2=warn,hyper=warn,tower=warn |
Service identity:
service.name = "isartor-gateway"
service.version = env!("CARGO_PKG_VERSION") # e.g. "0.1.0"
Distributed Traces — Span Reference
Every request gets a root span (gateway_request) from the monitoring
middleware. Child spans are created per-layer:
Root Span
| Span Name | Source | Key Attributes |
|---|---|---|
gateway_request | src/middleware/monitoring.rs | http.method, http.route, http.status_code, client.address, isartor.final_layer |
http.status_code and isartor.final_layer are recorded after the
response returns (empty → filled pattern).
Layer 0 — Auth
| Span Name | Source | Key Attributes |
|---|---|---|
(inline tracing::debug!/warn!) | src/middleware/auth.rs | — |
Auth is lightweight; no dedicated span is created. Events are logged at debug/warn level.
Layer 1a — Exact Cache
| Span Name | Source | Key Attributes |
|---|---|---|
l1a_exact_cache_get | src/adapters/cache.rs | cache.backend (memory|redis), cache.key, cache.hit |
l1a_exact_cache_put | src/adapters/cache.rs | cache.backend, cache.key, response_len |
Layer 1b — Semantic Cache
| Span Name | Source | Key Attributes |
|---|---|---|
l1b_semantic_cache_search | src/vector_cache.rs | cache.entries_scanned, cache.hit, cosine_similarity |
l1b_semantic_cache_insert | src/vector_cache.rs | cache.evicted, cache.size_after |
cosine_similarity — the best-match score, formatted to 4 decimal places. This is the key attribute for tuning the similarity threshold.
Layer 2 — SLM Triage
| Span Name | Source | Key Attributes |
|---|---|---|
layer2_slm | src/middleware/slm_triage.rs | slm.complexity_score (TEMPLATE|SNIPPET|COMPLEX; legacy binary mode: SIMPLE|COMPLEX) |
l2_classify_intent | src/adapters/router.rs | router.backend (embedded_candle|remote_vllm), router.decision, router.model, router.url, prompt_len |
Layer 2.5 — Context Optimiser
| Span Name | Source | Key Attributes |
|---|---|---|
layer2_5_context_optimizer | src/middleware/context_optimizer.rs | context.bytes_saved, context.strategy (e.g. "classifier+dedup", "classifier+log_crunch") |
When L2.5 modifies the request body, it also sets the response header x-isartor-context-optimized: bytes_saved=<N>.
Layer 3 — Cloud LLM
| Span Name | Source | Key Attributes |
|---|---|---|
layer3_llm | src/handler.rs | ai.prompt.length_bytes, provider.name, model |
Custom Span Attributes — Quick Reference
These are the Isartor-specific attributes (beyond standard OTel semantic conventions) that appear on spans and are useful for filtering in Jaeger / Tempo:
| Attribute | Type | Where Set | Purpose |
|---|---|---|---|
isartor.final_layer | string | Root gateway_request span | Which layer resolved the request |
cache.hit | bool | L1a and L1b spans | Whether the cache lookup succeeded |
cosine_similarity | string | L1b search span | Best cosine-similarity score (4 d.p.)
cache.entries_scanned | u64 | L1b search span | Entries scanned during similarity search |
cache.backend | string | L1a get/put spans | "memory" or "redis" |
router.decision | string | L2 classify span | "TEMPLATE", "SNIPPET", or "COMPLEX" (tiered mode); "SIMPLE" or "COMPLEX" (binary mode) |
router.backend | string | L2 classify span | "embedded_candle" or "remote_vllm" |
context.bytes_saved | u64 | L2.5 optimizer span | Bytes removed by compression pipeline |
context.strategy | string | L2.5 optimizer span | Pipeline stages that modified content (e.g. "classifier+dedup") |
provider.name | string | L3 handler span | e.g. "openai", "xai", "azure" |
model | string | L3 handler span | e.g. "gpt-4o", "grok-beta" |
http.status_code | u16 | Root span | HTTP response status code |
client.address | string | Root span | Client IP (from x-forwarded-for) |
OTel Metrics (src/metrics.rs)
Seven instruments are registered as a singleton GatewayMetrics via OnceLock:
| Metric Name | Type | Attributes | Description |
|---|---|---|---|
isartor_requests_total | Counter | final_layer, status_code, traffic_surface, client, endpoint_family, tool | Total prompts processed |
isartor_request_duration_seconds | Histogram | final_layer, status_code, traffic_surface, client, endpoint_family | End-to-end request duration |
isartor_layer_duration_seconds | Histogram | layer_name, tool | Per-layer latency |
isartor_tokens_saved_total | Counter | final_layer, traffic_surface, client, endpoint_family, tool | Estimated tokens saved by early resolve |
isartor_errors_total | Counter | layer, error_class, tool | Error occurrences by layer / agent |
isartor_retries_total | Counter | operation, attempts, outcome, tool | Retry outcomes by agent |
isartor_cache_events_total | Counter | cache_layer, outcome, tool | L1 / L1a / L1b hit-miss safety by agent |
Where Metrics Are Recorded
| Call Site | Metrics Recorded |
|---|---|
root_monitoring_middleware | record_request_with_context(), record_tokens_saved_with_context() (if early) |
proxy::connect::emit_proxy_decision() | record_request_with_context(), record_tokens_saved_with_context() (if early) |
cache_middleware (L1 hit) | record_layer_duration("L1a_ExactCache" | "L1b_SemanticCache") |
slm_triage_middleware (L2 hit) | record_layer_duration("L2_SLM") |
context_optimizer_middleware | record_layer_duration("L2_5_ContextOptimiser") (when bytes saved > 0) |
chat_handler (L3) | record_layer_duration("L3_Cloud") |
Request Dimensions
Unified prompt telemetry distinguishes:
- traffic_surface: gateway or proxy
- client: direct, openai, anthropic, copilot, claude, antigravity, etc.
- endpoint_family: native, openai, or anthropic
Token Estimation
estimate_tokens(prompt) uses the heuristic: max(1, prompt.len() / 4).
This is intentionally conservative — the metric tracks relative savings
rather than precise token counts.
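The heuristic is trivial to transcribe; a Python equivalent of the documented formula:

```python
def estimate_tokens(prompt: str) -> int:
    # Mirrors the documented heuristic: max(1, prompt.len() / 4)
    return max(1, len(prompt) // 4)

print(estimate_tokens(""))         # 1 — an empty prompt still counts as one token
print(estimate_tokens("a" * 400))  # 100
```

The floor at 1 guarantees every deflected request contributes at least one saved token to the counter.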
ROI — isartor_tokens_saved_total
This is the headline business metric. Every request resolved before Layer 3 (exact cache, semantic cache, or local SLM) avoids a round-trip to the external LLM provider.
# Daily token savings
sum(increase(isartor_tokens_saved_total[24h]))
# Savings by layer
sum by (final_layer) (rate(isartor_tokens_saved_total[1h]))
# Prompt volume by traffic surface
sum by (traffic_surface) (rate(isartor_requests_total[5m]))
# Prompt volume by client
sum by (client) (rate(isartor_requests_total[5m]))
# Estimated cost savings (assuming $0.01 per 1K tokens)
sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01
Use this metric to justify infrastructure spend for the caching / SLM layers.
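As a worked example of the cost query above, sketched in Python (the daily token volume and the $0.01/1K price are assumed figures, not measurements):

```python
def daily_savings_usd(tokens_saved: int, usd_per_1k: float = 0.01) -> float:
    # Same arithmetic as the PromQL cost query: tokens / 1000 * price
    return tokens_saved / 1000 * usd_per_1k

# e.g. 1.2M deflected tokens per day at $0.01 per 1K tokens
print(daily_savings_usd(1_200_000))  # 12.0 (USD/day)
```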
Docker Compose — Local Observability Stack
Use the provided compose file for local development:
cd docker
docker compose -f docker-compose.observability.yml up -d
| Service | Port | Purpose |
|---|---|---|
| OTel Collector | 4317 | OTLP gRPC receiver |
| Jaeger | 16686 | Trace UI |
| Prometheus | 9090 | Metrics scrape + query |
| Grafana | 3000 | Dashboards (anonymous admin) |
Configuration files:
| File | Purpose |
|---|---|
docker/otel-collector-config.yaml | Collector pipelines |
docker/prometheus.yml | Scrape targets |
Pipeline Flow
Isartor ──OTLP gRPC──▶ OTel Collector ──▶ Jaeger (traces)
└──▶ Prometheus (metrics)
│
▼
Grafana
OTel Collector Configuration
The collector config is at docker/otel-collector-config.yaml:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
otlp:
endpoint: "jaeger:4317"
tls:
insecure: true
debug:
verbosity: basic
service:
pipelines:
traces:
receivers: [otlp]
exporters: [otlp, debug]
metrics:
receivers: [otlp]
exporters: [prometheus, debug]
Prometheus Configuration
The Prometheus config is at docker/prometheus.yml:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 5s
static_configs:
- targets: ['otel-collector:8889']
Prometheus scrapes the OTel Collector's Prometheus exporter on port 8889 every 5 seconds.
Per-Tier Setup
Level 1 — Minimal (Console Logs Only)
No observability stack is needed. Use RUST_LOG for structured console output:
ISARTOR__ENABLE_MONITORING=false
RUST_LOG=isartor=info
For debug-level output during development:
RUST_LOG=isartor=debug,tower_http=trace
Level 2 — Docker Compose (Full Stack)
The docker-compose.sidecar.yml includes the complete observability stack:
cd docker
docker compose -f docker-compose.sidecar.yml up --build
Services included:
| Service | URL | Purpose |
|---|---|---|
| OTel Collector | localhost:4317 (gRPC) | Receives OTLP from gateway |
| Jaeger UI | http://localhost:16686 | View distributed traces |
| Prometheus | http://localhost:9090 | Query metrics |
| Grafana | http://localhost:3000 | Dashboards (anonymous admin access) |
The gateway is pre-configured with:
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317
Level 3 — Kubernetes (Managed or Self-Hosted)
| Approach | Recommended Stack | Notes |
|---|---|---|
| Self-managed | OTel Collector DaemonSet + Jaeger Operator + kube-prometheus-stack | Full control, higher ops burden |
| AWS | AWS X-Ray + CloudWatch + Managed Grafana | ADOT Collector as sidecar/DaemonSet |
| GCP | Cloud Trace + Cloud Monitoring + Cloud Logging | Use OTLP exporter to Cloud Trace |
| Azure | Application Insights + Azure Monitor | Use Azure Monitor OpenTelemetry exporter |
| Grafana Cloud | Grafana Alloy + Grafana Cloud | Low ops, managed Prometheus + Tempo |
| Datadog | Datadog Agent + OTel Collector | Enterprise APM |
For all options, point the gateway at the collector:
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.isartor:4317
Grafana Dashboard Queries (PromQL)
| Panel | PromQL |
|---|---|
| Request Rate | rate(isartor_requests_total[5m]) |
| P95 Latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| Layer Resolution | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| Traffic Surface Split | sum by (traffic_surface) (rate(isartor_requests_total[5m])) |
| Client Split | sum by (client) (rate(isartor_requests_total[5m])) |
| Per-Layer Latency | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens Saved / Hour | sum(increase(isartor_tokens_saved_total[1h])) |
| Tokens Saved by Layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Cache Hit Rate | rate(isartor_requests_total{final_layer=~"L1.*"}[5m]) / rate(isartor_requests_total[5m]) |
Jaeger — Useful Searches
| Goal | Search |
|---|---|
| Slow requests (> 500 ms) | Service isartor-gateway, Min Duration 500ms |
| Cache misses | Tag cache.hit=false |
| Semantic cache tuning | Tag cosine_similarity — sort by value |
| Layer 3 fallbacks | Tag isartor.final_layer=L3_Cloud |
| SLM local resolutions | Tag router.decision=TEMPLATE or router.decision=SNIPPET (tiered); router.decision=SIMPLE (binary) |
Trace Anatomy
A typical trace for a cache-miss, locally-resolved request:
isartor-gateway
└─ HTTP POST /api/chat [250ms]
├─ Layer0_AuthCheck [0.1ms]
├─ Layer1_SemanticCache (MISS) [5ms]
├─ Layer2_IntentClassifier [80ms]
│ intent=TEMPLATE, confidence=0.97
└─ Layer2_LocalExecutor [160ms]
model=phi-3-mini, tokens=42
Built-in User Views
For quick operator checks without a separate telemetry stack:
isartor stats --gateway-url http://localhost:8080
isartor stats --gateway-url http://localhost:8080 --by-tool
Add --gateway-api-key <key> only when gateway auth is enabled.
--by-tool prints richer per-agent stats: requests, cache hits/misses,
average latency, retry count, error count, and L1a/L1b safety ratios.
Built-in JSON endpoints:
- GET /health
- GET /debug/proxy/recent
- GET /debug/stats/prompts
- GET /debug/stats/agents
Alerting Rules
Prometheus Alerting Rules
Create docker/prometheus-alerts.yml:
groups:
- name: isartor
rules:
- alert: HighErrorRate
expr: |
sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(isartor_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Isartor error rate > 5% for 5 minutes"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Isartor P95 latency > 2s for 5 minutes"
- alert: LowCacheHitRate
expr: >
rate(isartor_requests_total{final_layer=~"L1.*"}[15m]) /
rate(isartor_requests_total[15m]) < 0.3
for: 15m
labels:
severity: info
annotations:
summary: "Cache hit rate below 30% — consider tuning similarity threshold"
- alert: LowDeflectionRate
expr: |
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(rate(isartor_requests_total[1h]))
) < 0.5
for: 30m
labels:
severity: warning
annotations:
summary: "Isartor deflection rate below 50%"
- alert: FirewallDown
expr: up{job="isartor"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Isartor gateway is down"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No traces in Jaeger | Monitoring disabled | Set ISARTOR__ENABLE_MONITORING=true |
| No traces in Jaeger | Collector unreachable | Verify OTEL_EXPORTER_ENDPOINT + port 4317 |
| No metrics in Prometheus | Prometheus can't scrape collector | Check prometheus.yml targets |
| Grafana "No data" | Data source misconfigured | URL should be http://prometheus:9090 |
| Console shows "OTel disabled" | Config precedence | Check whether env vars are overriding the file config |
isartor_layer_duration_seconds empty | No requests yet | Send a test request |
See also: Configuration Reference · Performance Tuning · Troubleshooting
Performance Tuning
How to measure, tune, and operate Isartor for maximum deflection and minimum latency.
Table of Contents
- Understanding Deflection
- Measuring Deflection Rate
- Tuning Configuration for Deflection
- Tuning Latency
- Memory & Resource Tuning
- Cache Tuning Deep-Dive
- SLM Router Tuning
- Embedder Tuning
- SLO / SLA Goal Templates
- Scenario-Based Tuning Recipes
- PromQL Cheat Sheet
Understanding Deflection
Deflection = the percentage of requests resolved before Layer 3 (the external cloud LLM). A request is "deflected" if it is served by:
| Layer | Mechanism | Cost |
|---|---|---|
| L1a — Exact Cache | SHA-256 hash match | $0 |
| L1b — Semantic Cache | Cosine similarity match | $0 |
| L2 — SLM Triage | Local SLM classifies requests as TEMPLATE, SNIPPET, or COMPLEX (tiered mode) and answers TEMPLATE/SNIPPET locally | $0 |
The deflection rate directly maps to cost savings. A 70 % deflection rate means only 30 % of requests reach the paid cloud LLM.
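The rate can be computed from any per-layer request breakdown; a short Python sketch (the counts below are invented for illustration):

```python
def deflection_rate(by_layer: dict) -> float:
    """Deflection = share of requests resolved before L3_Cloud."""
    total = sum(by_layer.values())
    return 1 - by_layer.get("L3_Cloud", 0) / total

counts = {"L1a_ExactCache": 520, "L1b_SemanticCache": 190,
          "L2_SLM": 90, "L3_Cloud": 200}
print(round(deflection_rate(counts), 2))  # 0.8 — only 20% of traffic hit the cloud
```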
Measuring Deflection Rate
Via Prometheus / Grafana
The gateway emits isartor_requests_total with a final_layer label.
Use the following PromQL to compute the deflection rate:
# Overall deflection rate (last 1 hour)
1 - (
sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
)
# Deflection rate by layer (pie chart)
sum by (final_layer) (rate(isartor_requests_total[5m]))
# Exact-cache deflection only
sum(increase(isartor_requests_total{final_layer="L1a_ExactCache"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
Via the API
Send a test batch and count response layer values:
# Send 100 identical requests — expect 99 cache hits
for i in $(seq 1 100); do
curl -s -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: $ISARTOR_API_KEY" \
-d '{"prompt": "What is the capital of France?"}' \
| jq '.layer'
done | sort | uniq -c
Expected output (ideal):
1 3 ← first request → cloud
99 1 ← remaining → exact cache
Via Structured Logs
When ISARTOR__ENABLE_MONITORING=true, every request logs the final layer:
# grep JSON logs for final-layer distribution
cat logs.json | jq '.isartor.final_layer' | sort | uniq -c
Via Jaeger / Tempo
Filter traces by the isartor.final_layer tag:
| Goal | Search |
|---|---|
| All cache hits | Tag isartor.final_layer=L1a_ExactCache or L1b_SemanticCache |
| SLM resolutions | Tag isartor.final_layer=L2_SLM |
| Cloud fallbacks | Tag isartor.final_layer=L3_Cloud |
Tuning Configuration for Deflection
Cache Mode
| Variable | Values | Recommended |
|---|---|---|
ISARTOR__CACHE_MODE | exact, semantic, both | both (default) |
- exact — Only identical prompts hit. Good for deterministic agent loops.
- semantic — Catches paraphrases ("Price?" ≈ "Cost?"). Higher hit rate but adds ~1–5 ms embedding cost.
- both — Exact check first (< 1 ms), then semantic if no exact hit. Best of both worlds.
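A minimal Python sketch of the two-stage lookup in both mode. The lookup shape, toy embedder, and entry layout are hypothetical illustrations of the idea, not the Rust implementation:

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.85  # mirrors ISARTOR__SIMILARITY_THRESHOLD

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(prompt, exact_cache, semantic_cache, embed):
    # Stage 1 (L1a): sub-millisecond exact match on a hash of the prompt
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Stage 2 (L1b): nearest-neighbour search over stored embeddings
    qv = embed(prompt)
    best = max(semantic_cache, key=lambda e: cosine(qv, e["vec"]), default=None)
    if best and cosine(qv, best["vec"]) >= SIMILARITY_THRESHOLD:
        return best["response"]
    return None  # miss: the request continues down the stack

# toy two-dimensional embedder (stand-in for the real 384-dim model)
def embed(text):
    t = text.lower()
    return [1.0, 0.0] if "price" in t or "cost" in t else [0.0, 1.0]

sem = [{"vec": [1.0, 0.0], "response": "It costs $10."}]
print(lookup("Price?", {}, sem, embed))  # semantic hit despite no exact match
```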
Similarity Threshold
| Variable | Default | Range |
|---|---|---|
ISARTOR__SIMILARITY_THRESHOLD | 0.85 | 0.0–1.0 |
| Value | Effect |
|---|---|
0.95 | Very strict — only near-identical prompts match. Low false positives, lower hit rate. |
0.85 | Balanced — catches common paraphrases. Recommended starting point. |
0.75 | Aggressive — higher hit rate but risk of returning wrong cached answers. |
0.60 | Dangerous — high false-positive rate. Not recommended for production. |
How to tune:
- Set ISARTOR__ENABLE_MONITORING=true.
- Send representative traffic for 1 hour.
- In Jaeger, search for the cosine_similarity attribute on l1b_semantic_cache_search spans.
- Plot the distribution. If most similarity scores cluster between 0.80–0.90, a threshold of 0.85 is good.
- If you see many scores at 0.82–0.84 that should be hits, lower the threshold to 0.80.
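The distribution step can be approximated offline. A small Python sketch that scores candidate thresholds against sampled cosine_similarity values (the sample scores here are invented for illustration):

```python
def hit_rate_at(scores, threshold):
    # fraction of observed best-match scores that would count as a cache hit
    return sum(s >= threshold for s in scores) / len(scores)

# similarity scores sampled from l1b_semantic_cache_search spans (illustrative)
scores = [0.97, 0.91, 0.88, 0.86, 0.84, 0.83, 0.82, 0.79, 0.61, 0.44]
for t in (0.95, 0.85, 0.80):
    print(f"threshold {t:.2f} -> hit rate {hit_rate_at(scores, t):.0%}")
```

With this sample, dropping the threshold from 0.85 to 0.80 would roughly double the semantic hit rate, which is exactly the trade-off the tuning steps above are probing.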
Cache TTL
| Variable | Default | Description |
|---|---|---|
ISARTOR__CACHE_TTL_SECS | 300 (5 min) | Time-to-live for cached responses |
- Short TTL (60–120 s): Good for rapidly changing data, real-time dashboards.
- Medium TTL (300–600 s): Balanced for most workloads.
- Long TTL (1800+ s): Maximises deflection for static Q&A / documentation bots.
Cache Capacity
| Variable | Default | Description |
|---|---|---|
ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries in each cache (LRU eviction) |
- Monitor eviction rate via the cache.evicted span attribute on l1b_semantic_cache_insert.
- If eviction rate > 5 % of inserts, increase capacity or shorten TTL.
- Each cache entry ≈ 2–4 KB (prompt hash + response + optional 384-dim vector).
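The interaction of capacity (LRU eviction) and TTL can be modelled in a few lines of Python. This is a toy stand-in for intuition only, not the actual Rust cache:

```python
import time
from collections import OrderedDict

class TtlLruCache:
    """Toy model of the in-memory exact cache: LRU eviction + TTL expiry."""
    def __init__(self, max_capacity=10000, ttl_secs=300):
        self.max_capacity, self.ttl = max_capacity, ttl_secs
        self.entries = OrderedDict()  # key -> (inserted_at, value)

    def put(self, key, value):
        self.entries[key] = (time.monotonic(), value)
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_capacity:
            self.entries.popitem(last=False)  # evict least-recently used

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None or time.monotonic() - hit[0] > self.ttl:
            return None  # miss, or expired by TTL
        self.entries.move_to_end(key)  # refresh recency on hit
        return hit[1]

cache = TtlLruCache(max_capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)  # "a" is evicted
print(cache.get("a"), cache.get("c"))  # None 3
```

At capacity, every insert evicts the least-recently used entry, which is why a sustained eviction rate above a few percent signals that max_capacity is too small for the working set.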
Tuning Latency
Target Latencies by Layer
| Layer | Target (p95) | Typical Range |
|---|---|---|
| L1a — Exact Cache | < 1 ms | 0.1–0.5 ms |
| L1b — Semantic Cache | < 10 ms | 1–5 ms |
| L2 — SLM Triage | < 300 ms | 50–200 ms (embedded), 100–500 ms (sidecar) |
| L3 — Cloud LLM | < 3 s | 500 ms – 5 s (network-bound) |
Measure with PromQL
# P95 latency by layer
histogram_quantile(0.95,
sum by (le, layer_name) (
rate(isartor_layer_duration_seconds_bucket[5m])
)
)
# P95 end-to-end latency
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
Reducing Latency
| Bottleneck | Symptom | Fix |
|---|---|---|
| Embedding | L1b > 10 ms | Use a lighter model or increase CPU allocation |
| SLM inference | L2 > 500 ms | Use quantised model (Q4_K_M GGUF), switch to embedded engine |
| Redis | L1a > 5 ms | Check network latency, use Redis cluster with read replicas |
| Cloud LLM | L3 > 5 s | Switch provider, use a smaller model, enable request timeout |
Memory & Resource Tuning
Memory Budget
| Component | Memory Usage | Notes |
|---|---|---|
| Exact cache (in-memory, 10K entries) | ~20–40 MB | Scales linearly with cache_max_capacity |
| Semantic cache (in-memory, 10K entries) | ~30–60 MB | 384-dim float32 vectors + response strings |
| candle embedder (all-MiniLM-L6-v2) | ~90 MB | Loaded at startup, constant |
| Candle GGUF model (embedded SLM) | ~1–4 GB | Depends on model quantisation |
| Tokio runtime | ~10–20 MB | Async task pool |
| Total (minimalist mode) | ~150–200 MB | No embedded SLM |
| Total (embedded mode) | ~1.5–4.5 GB | With embedded Candle SLM |
CPU Considerations
- Embedding generation runs on spawn_blocking (dedicated thread pool).
- Candle GGUF inference is CPU-bound; allocate ≥ 4 cores for embedded mode.
- The Tokio async runtime uses the default thread count (num_cpus).
Container Limits
# docker-compose example
services:
gateway:
deploy:
resources:
limits:
memory: 512M # minimalist mode
cpus: "2"
# For embedded SLM mode:
# limits:
# memory: 4G
# cpus: "4"
Cache Tuning Deep-Dive
Exact vs. Semantic Cache Hit Analysis
# Exact cache hit rate
sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
# Semantic cache hit rate
sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
Cache Backend: Memory vs. Redis
| Factor | In-Memory | Redis |
|---|---|---|
| Latency | ~0.1 ms | ~1–5 ms (network hop) |
| Capacity | Limited by process RAM | Limited by Redis memory |
| Multi-replica | ❌ No sharing | ✅ Shared across pods |
| Persistence | ❌ Lost on restart | ✅ Optional AOF/RDB |
| Recommended for | Single-instance, dev, edge | K8s, multi-replica, production |
Switch with:
export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__REDIS_URL=redis://redis.svc:6379
When to Disable Semantic Cache
- Traffic is 100 % deterministic (exact same prompts repeated).
- Embedding overhead is unacceptable (< 1 ms budget).
- To disable, set ISARTOR__CACHE_MODE=exact.
SLM Router Tuning
Embedded vs. Sidecar
| Mode | Variable | Latency | Resource Usage |
|---|---|---|---|
| Embedded (Candle) | ISARTOR__INFERENCE_ENGINE=embedded | 50–200 ms | High CPU, 1–4 GB RAM |
| Sidecar (llama.cpp) | ISARTOR__INFERENCE_ENGINE=sidecar | 100–500 ms | Separate process, GPU optional |
| Remote (vLLM/TGI) | ISARTOR__ROUTER_BACKEND=vllm | 100–500 ms | Separate server, GPU recommended |
Model Selection
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| Phi-3-mini (Q4_K_M) | ~2 GB | Fast | Good |
| Gemma-2-2B-IT (Q4) | ~1.5 GB | Very fast | Good |
| Qwen-1.5-1.8B (Q4) | ~1.2 GB | Fastest | Adequate |
| Llama-3-8B (Q4) | ~4.5 GB | Slower | Best |
For intent classification (TEMPLATE/SNIPPET/COMPLEX in tiered mode, or SIMPLE/COMPLEX in legacy binary mode), smaller models (1–3 B params) are sufficient. Use the smallest model that meets your accuracy needs.
Tuning the Classification Prompt
The system prompt in src/middleware/slm_triage.rs determines classification
accuracy. If too many COMPLEX requests are misclassified as TEMPLATE or
SNIPPET (resulting in bad local answers), consider:
- Making the system prompt more specific to your domain.
- Adding examples to the prompt (few-shot).
- Switching to a larger model.
- Setting ISARTOR__LAYER2__MAX_ANSWER_TOKENS to allow longer SLM responses (default 2048).
- Falling back to binary mode via ISARTOR__LAYER2__CLASSIFIER_MODE=binary if the three-tier split does not suit your workload.
Embedder Tuning
In-Process (candle)
The default embedder uses candle with sentence-transformers/all-MiniLM-L6-v2 (pure-Rust BertModel):
- 384-dimensional vectors
- ~90 MB model footprint
- 1–5 ms per embedding (CPU)
- Runs on spawn_blocking to avoid starving the Tokio runtime
Sidecar Embedder
For higher throughput or GPU acceleration:
export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
export ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME=all-minilm
export ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS=10
Embedding Model Selection
| Model | Dims | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fastest | Good |
| bge-small-en-v1.5 | 384 | Fast | Better |
| bge-base-en-v1.5 | 768 | Moderate | Best |
Use 384-dim models for production. 768-dim models double memory usage for marginal quality improvement in most use cases.
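The memory claim is simple arithmetic over raw float32 vectors (response strings and bookkeeping add more on top):

```python
def vector_store_mb(entries, dims, bytes_per_float=4):
    # raw embedding storage only; cached responses and keys add overhead
    return entries * dims * bytes_per_float / 1_000_000

print(vector_store_mb(10_000, 384))  # 15.36 MB of raw vectors
print(vector_store_mb(10_000, 768))  # 30.72 MB — the footprint doubles
```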
SLO / SLA Goal Templates
Developer / Internal SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.5 % | up{job="isartor"} over 30-day window |
| P95 latency (cache hit) | < 10 ms | histogram_quantile(0.95, ...) on L1 |
| P95 latency (end-to-end) | < 3 s | histogram_quantile(0.95, ...) on all |
| Deflection rate | > 50 % | 1 - (L3 / total) over 24 h |
| Error rate | < 1 % | rate(isartor_requests_total{status_code=~"5.."}[5m]) |
Production / Enterprise SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9 % | Multi-replica, health check monitoring |
| P95 latency (cache hit) | < 5 ms | Requires Redis or fast in-memory |
| P95 latency (end-to-end) | < 2 s | Optimised models, provider SLAs |
| P99 latency (end-to-end) | < 5 s | Tail latency budget |
| Deflection rate | > 70 % | Tuned thresholds + warm cache |
| Error rate | < 0.1 % | Circuit breakers, retries |
| Token savings | > 60 % | isartor_tokens_saved_total vs estimated total |
SLA Template (for downstream consumers)
## Isartor Prompt Firewall SLA
**Availability:** 99.9 % monthly uptime (< 43.8 min downtime/month)
**Latency:** P95 end-to-end < 2 seconds
**Error Budget:** 0.1 % of requests may return 5xx
**Maintenance Window:** Sundays 02:00–04:00 UTC (excluded from SLA)
### Remediation
- Cache tier failure: automatic fallback to cloud LLM (degraded mode)
- SLM failure: automatic fallback to cloud LLM (degraded mode)
- Cloud LLM failure: 502 Bad Gateway returned, retry recommended
### Monitoring
- Health endpoint: GET /healthz
- Metrics endpoint: Prometheus scrape via OTel Collector on port 8889
- Dashboard: Grafana at http://<grafana-host>:3000
Alert Rules (Prometheus)
groups:
- name: isartor-slo
rules:
- alert: HighErrorRate
expr: |
sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(isartor_requests_total[5m]))
> 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "Isartor error rate exceeds 1%"
- alert: HighP95Latency
expr: |
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
> 3
for: 5m
labels:
severity: warning
annotations:
summary: "Isartor P95 latency exceeds 3 seconds"
- alert: LowDeflectionRate
expr: |
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(rate(isartor_requests_total[1h]))
) < 0.5
for: 30m
labels:
severity: warning
annotations:
summary: "Isartor deflection rate below 50%"
- alert: FirewallDown
expr: up{job="isartor"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Isartor gateway is down"
Scenario-Based Tuning Recipes
Scenario A: Agentic Loop (High-Volume Identical Prompts)
Profile: Autonomous agent sends the same prompt hundreds of times per minute.
ISARTOR__CACHE_MODE=exact # Semantic unnecessary for identical prompts
ISARTOR__CACHE_TTL_SECS=3600 # Long TTL — agent prompts are stable
ISARTOR__CACHE_MAX_CAPACITY=50000 # Large cache for many unique prompts
Expected deflection: 95–99 % (after warm-up).
Scenario B: Customer Support Bot (Paraphrased Questions)
Profile: End users ask the same questions in different ways.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.80 # Lower threshold to catch paraphrases
ISARTOR__CACHE_TTL_SECS=1800 # 30 min — support answers change slowly
ISARTOR__CACHE_MAX_CAPACITY=10000
Expected deflection: 60–80 %.
Scenario C: Code Generation (Low Cache Hit Rate)
Profile: Developers ask unique, complex coding questions.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.92 # High threshold — wrong cached code is costly
ISARTOR__CACHE_TTL_SECS=600 # Short TTL — code context changes quickly
ISARTOR__INFERENCE_ENGINE=embedded # Let SLM handle simple code questions
Expected deflection: 20–40 % (SLM handles simple extraction).
Scenario D: RAG Pipeline (Document Q&A)
Profile: Queries against a knowledge base; similar questions are common.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.83 # Moderate threshold
ISARTOR__CACHE_TTL_SECS=3600 # Documents change infrequently
ISARTOR__CACHE_MAX_CAPACITY=20000 # Large cache for document variation
Expected deflection: 50–70 %.
Scenario E: Multi-Replica Kubernetes
Profile: Horizontally scaled behind a load balancer.
ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.85
Benefit: All replicas share the same cache → deflection rate applies cluster-wide.
PromQL Cheat Sheet
| What | Query |
|---|---|
| Deflection rate (1 h) | 1 - (sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h])) / sum(increase(isartor_requests_total[1h]))) |
| Request rate | rate(isartor_requests_total[5m]) |
| Request rate by layer | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| P50 latency | histogram_quantile(0.50, rate(isartor_request_duration_seconds_bucket[5m])) |
| P95 latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| P99 latency | histogram_quantile(0.99, rate(isartor_request_duration_seconds_bucket[5m])) |
| Per-layer P95 | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens saved (daily) | sum(increase(isartor_tokens_saved_total[24h])) |
| Tokens saved by layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Est. daily cost savings ($0.01/1K tok) | sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01 |
| Error rate | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (exact) | sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (semantic) | sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
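The deflection-rate query in the first row reduces to simple counter arithmetic. A quick sanity check with sample numbers (the counts below are illustrative, not measurements):

```shell
# Deflection rate = 1 - (requests that reached L3_Cloud / all requests).
total=1000   # stands in for sum(increase(isartor_requests_total[1h]))
cloud=280    # stands in for sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
deflection=$(awk -v t="$total" -v c="$cloud" 'BEGIN { printf "%.2f", 1 - c / t }')
echo "deflection rate: $deflection"   # → deflection rate: 0.72
```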
See also: Metrics & Tracing · Configuration Reference · Troubleshooting
Testing
Complete test runbook for Isartor — from automated test suites to manual feature verification and Copilot CLI integration testing.
Prerequisites
| Requirement | Check |
|---|---|
| Rust toolchain | cargo --version |
| Built binary | cargo build --release |
| curl + jq | curl --version && jq --version |
Quick Start — Automated
Unit & Integration Tests
# Run the full test suite
cargo test --all-features
# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite
# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture
cargo test --test integration_suite body_survives_all_middleware -- --nocapture
Smoke Test Script
Run the entire manual test suite in one command:
# Start a fresh server, run all tests, stop after
./scripts/smoke-test.sh --stop-after
# Test an already-running server
./scripts/smoke-test.sh --no-start
# Full run including demo + verbose response bodies
./scripts/smoke-test.sh --run-demo --verbose
# Custom URL / API key
./scripts/smoke-test.sh --url http://localhost:9090 --api-key mykey --no-start
Lint & Format Checks
Run the same checks CI runs:
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
Compression Pipeline Tests
Run the L2.5 compression module tests specifically:
# All compression tests (pipeline, stages, cache, optimize)
cargo test --all-features compression
# Specific modules
cargo test --all-features content_classifier
cargo test --all-features dedup_cache
cargo test --all-features log_crunch
cargo test --all-features optimize_request_body
Manual Step-by-Step
Note: Isartor runs without gateway auth by default (local-first). The test commands below explicitly set ISARTOR__GATEWAY_API_KEY to exercise authenticated request handling.
1 Start the Server
# Gateway-only startup (local API testing)
ISARTOR__FIRST_RUN_COMPLETE=1 \
./target/release/isartor up
# Full startup for proxy-aware testing (recommended for this guide)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
./target/release/isartor up copilot
# With an OpenAI key (enables real L3 fallback)
ISARTOR__FIRST_RUN_COMPLETE=1 \
ISARTOR__GATEWAY_API_KEY=changeme \
ISARTOR__EXTERNAL_LLM_API_KEY=sk-... \
./target/release/isartor up copilot
Server is ready when you see:
INFO isartor: API gateway listening, addr: 0.0.0.0:8080
INFO isartor: CONNECT proxy starting, addr: 0.0.0.0:8081
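If you script this runbook, you can wait for readiness by polling /healthz instead of watching the logs. A small helper sketch (the default URL and retry count are assumptions; adjust for your setup):

```shell
# Poll the liveness endpoint until the gateway answers, or give up.
wait_for_ready() {
  local url="${1:-http://localhost:8080/healthz}"
  local tries="${2:-30}" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}
# Usage: wait_for_ready http://localhost:8080/healthz 30 && run_tests
```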
2 Health & Liveness
# Liveness probe (no auth needed)
curl http://localhost:8080/healthz
# Rich health (shows layer status, proxy, prompt totals)
curl http://localhost:8080/health | jq .
Expected /health response shape:
{
"status": "ok",
"version": "0.1.25",
"layers": { "l1a": "active", "l1b": "active", "l2": "active", "l3": "no_api_key" },
"uptime_seconds": 5,
"proxy": "active",
"proxy_layer3": "native_upstream_passthrough",
"prompt_total_requests": 0,
"prompt_total_deflected_requests": 0
}
3 OpenAI-Compatible Endpoint (/v1/chat/completions)
API_KEY=changeme
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}' | jq .
Send the same prompt twice to confirm L1a exact-cache kicks in:
for i in 1 2; do
echo "--- Request $i ---"
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is 2+2?"}]}' \
| jq '.choices[0].message.content, .model'
done
4 Anthropic-Compatible Endpoint (/v1/messages)
curl -sS http://localhost:8080/v1/messages \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-3-haiku-20240307",
"max_tokens": 64,
"messages": [{"role": "user", "content": "What is 2+2?"}]
}' | jq .
Expected shape: {"id":..., "type":"message", "role":"assistant", "content":[...], "model":...}
5 Native Endpoint (/api/chat)
curl -sS http://localhost:8080/api/chat \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "ping"}]}' | jq .
6 L1a — Exact Cache Hit
# Seed the cache with first request
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
-o /dev/null
# Second identical request — should be served from L1a
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"capital of France?"}]}' \
| jq '.model'
# → "isartor-cache" or similar (not "gpt-4o-mini")
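To script this check, parse the model field out of the response. A grep/sed-based sketch that avoids a jq dependency (the isartor-cache sentinel follows the example above; treat the exact value as an assumption about your build):

```shell
# Classify a response as cache-served or upstream-served by its "model" field.
resp='{"model":"isartor-cache","choices":[]}'   # e.g. resp=$(curl -sS ... )
model=$(printf '%s' "$resp" | sed -n 's/.*"model":"\([^"]*\)".*/\1/p')
case "$model" in
  isartor-*) echo "served from cache ($model)" ;;   # → served from cache (isartor-cache)
  *)         echo "served upstream ($model)" ;;
esac
```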
7 L1b — Semantic Cache Hit
# Seed
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"What is the capital of France?"}]}' \
-o /dev/null
# Paraphrase — should hit L1b (cosine similarity ≥ 0.85)
curl -sS http://localhost:8080/v1/chat/completions \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Which city is France capital?"}]}' \
| jq '.model'
8 Authentication Rejection
# No API key — should return 401/403
curl -sS -w "\nHTTP %{http_code}" http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hello"}]}'
9 Prompt Stats
# JSON endpoint
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/stats/prompts?limit=10" | jq .
# Per-agent observability endpoint
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/stats/agents" | jq .
# CLI command
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY
# CLI per-agent view
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY \
--by-tool
Expected isartor stats output:
Isartor Prompt Stats
URL: http://localhost:8080
Total: 7
Deflected: 3
By Layer
L1A 3
L3 4
By Surface
gateway 7
By Client
openai 5
anthropic 2
Recent Prompts
2026-03-19T09:00:00Z gateway openai L1A via /v1/chat/completions (1ms, HTTP 200)
10 Proxy Recent Decisions
curl -sS -H "X-API-Key: $API_KEY" \
"http://localhost:8080/debug/proxy/recent?limit=5" | jq .
11 isartor connect status
./target/release/isartor connect status \
--gateway-url http://localhost:8080 \
--gateway-api-key $API_KEY
12 Run the Built-in Demo
./target/release/isartor demo
# Replays 50 bundled prompts through L1a/L1b, prints deflection rate.
# Writes isartor_demo_result.txt
13 Stop the Server
./target/release/isartor stop
Copilot CLI Integration Test
Step 1 — Connect Copilot CLI
./target/release/isartor connect copilot \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
This writes ~/.isartor/env/copilot.sh with:
export HTTPS_PROXY="http://localhost:8081"
export NODE_EXTRA_CA_CERTS="/Users/<you>/.isartor/ca/isartor-ca.pem"
export ISARTOR_COPILOT_ENABLED=true
Step 2 — Activate the Proxy Environment
Critical: You must source the env file in the same shell where you run Copilot CLI:
source ~/.isartor/env/copilot.sh
# Verify the env is active
echo $HTTPS_PROXY # → http://localhost:8081
echo $NODE_EXTRA_CA_CERTS # → /Users/<you>/.isartor/ca/isartor-ca.pem
Step 3 — Use Copilot CLI (same shell)
# Ask Copilot a question — traffic will route through Isartor proxy
gh copilot suggest "list all files in a directory"
# Or explain
gh copilot explain "what does git rebase do"
Step 4 — Verify Traffic Hit Isartor
# Check proxy recent decisions
./target/release/isartor connect status \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
# Check prompt stats
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
You should see proxy_recent_requests > 0 and Copilot entries in By Client.
Step 5 — Ask Repeated Questions (cache test)
# Ask the same thing twice — second hit should be L1a
gh copilot suggest "list all files in a directory"
gh copilot suggest "list all files in a directory"
# Check stats — deflected count should have increased
./target/release/isartor stats \
--gateway-url http://localhost:8080 \
--gateway-api-key changeme
Disconnect
./target/release/isartor connect copilot --disconnect
# then unset in your shell:
unset HTTPS_PROXY NODE_EXTRA_CA_CERTS ISARTOR_COPILOT_ENABLED
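After disconnecting, it is worth verifying the shell is actually clean. A convenience sketch over the variables named above (note it also flags gateway config variables like ISARTOR__..., which you may want to keep):

```shell
# List any lingering Isartor proxy/CA variables in the current shell.
leftover=$(env | grep -iE '^(https?_proxy|all_proxy|node_extra_ca_certs|isartor_)' || true)
if [ -z "$leftover" ]; then
  echo "shell is clean"
else
  printf 'still set:\n%s\n' "$leftover"
fi
```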
Feature Coverage Matrix
| Feature | Test | Section |
|---|---|---|
| Health endpoint | curl /health | §2 |
| Liveness probe | curl /healthz | §2 |
| OpenAI /v1/chat/completions | curl + jq | §3 |
| Anthropic /v1/messages | curl + jq | §4 |
| Native /api/chat | curl + jq | §5 |
| L1a exact-cache deflection | repeated prompt | §6 |
| L1b semantic-cache deflection | paraphrased prompt | §7 |
| Auth rejection | no X-API-Key | §8 |
| Prompt stats endpoint | /debug/stats/prompts | §9 |
| isartor stats CLI | isartor stats | §9 |
| Proxy decisions endpoint | /debug/proxy/recent | §10 |
| Connect status CLI | isartor connect status | §11 |
| Built-in demo | isartor demo | §12 |
| Copilot CLI proxy routing | source env + gh copilot | Copilot CLI |
| Cache hit via Copilot | repeated gh copilot | Copilot CLI §5 |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused :8080 | Server not started | Run ./target/release/isartor up |
| isartor update fails after stop | Stale HTTPS_PROXY in shell | unset HTTPS_PROXY HTTP_PROXY |
| Copilot traffic not showing in stats | Wrong shell / env not sourced | source ~/.isartor/env/copilot.sh then restart Copilot CLI |
| L1b miss on paraphrase | Semantic index cold | Send several prompts first to warm the index |
| l3: no_api_key in health | No LLM key set | Set ISARTOR__EXTERNAL_LLM_API_KEY or use cache/demo mode |
See also: Troubleshooting · Contributing
Contributing
Thanks for your interest in contributing to Isartor! Isartor is maintained by one developer as a side project. Here's how to make your contribution land quickly.
Before You Open a PR
- Check existing issues — your idea may already be tracked.
- Open an issue first for any non-trivial change.
- One PR per issue — keep scope tight.
Looking for something to work on? Check out the good first issues label on GitHub.
Development Setup
Prerequisites
- Rust 1.75+ — install via rustup
- Docker — required for integration tests and the observability stack
- curl + jq — for manual testing
Clone and Build
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build
Run the Test Suite
# Full test suite
cargo test --all-features
# Or use Make
make test
# Run a specific test binary
cargo test --test unit_suite
cargo test --test integration_suite
cargo test --test scenario_suite
# Run a single test with output
cargo test --test scenario_suite deflection_rate_at_least_60_percent -- --nocapture
Lint & Format
# Format check (same as CI)
cargo fmt --all -- --check
# Apply formatting
cargo fmt --all
# Clippy lint check (same as CI)
cargo clippy --all-targets --all-features -- -D warnings
Release Build
cargo build --release
# or
make build
Benchmarks
# Criterion micro-benchmarks
cargo bench --bench cache_latency
cargo bench --bench e2e_pipeline
# Full benchmark harness (requires running Isartor instance)
make benchmark
# Dry-run smoke test (no server needed)
make benchmark-dry-run
PR Checklist
- cargo test --all-features passes
- cargo clippy --all-targets --all-features -- -D warnings has no new warnings
- cargo fmt --all -- --check passes
- PR description explains WHY, not just WHAT
- Documentation updated if behaviour changes
What Gets Merged Quickly
- Bug fixes with a test that reproduces the bug
- Documentation improvements
- Performance improvements with benchmark evidence
What Takes Longer
- New features — needs design discussion in an issue first
- Changes to the deflection layer logic — core path changes require careful review
Code Conventions
- Tests are grouped into integration-test binaries (unit_suite, integration_suite, scenario_suite) that re-export submodules. When adding a test, place it in the appropriate binary rather than creating a standalone file.
- Configuration uses ISARTOR__... environment variables with double underscores as separators.
- The Axum middleware stack wraps inside-out. See src/main.rs for the documented layer order.
- Use spawn_blocking for CPU-intensive work (embeddings, model inference) to avoid starving the Tokio runtime.
- The src/compression/ module uses a Fusion Pipeline pattern: stateless CompressionStage trait objects executed in order. To add a new compression stage, implement the CompressionStage trait and wire it in src/compression/optimize.rs::build_pipeline().
Response Time
Issues and PRs are reviewed within 24–48 hours on weekdays. Weekend responses are not guaranteed.
See also: Testing · Architecture · Troubleshooting
Troubleshooting
Common issues, diagnostic steps, and FAQ for operating Isartor.
Table of Contents
- Startup Errors
- Cache Issues
- Embedding & SLM Issues
- Cloud LLM Issues
- Observability Issues
- Performance & Degraded Operation
- Docker & Deployment Issues
- FAQ
Startup Errors
Failed to initialize candle TextEmbedder
Symptom: Gateway panics on startup with:
Failed to initialize candle TextEmbedder (all-MiniLM-L6-v2)
Causes & Fixes:
| Cause | Fix |
|---|---|
| Model files not downloaded | Run once with internet access; candle auto-downloads to ~/.cache/huggingface/ |
| Corrupted model cache | Delete ~/.cache/huggingface/ and restart |
| Cache directory not writable (Permission denied (os error 13)) | Set HF_HOME (or ISARTOR_HF_CACHE_DIR) to a writable path (e.g. /tmp/huggingface). In Docker, mount a volume there: -e HF_HOME=/tmp/huggingface -v isartor-hf:/tmp/huggingface. |
| Insufficient memory | Ensure ≥ 256 MB available for the embedding model |
Address already in use
Symptom:
Error: error creating server listener: Address already in use (os error 48)
Fix:
# Find the process using port 8080
lsof -i :8080
# Kill it, or change the port:
export ISARTOR__HOST_PORT=0.0.0.0:9090
missing field or config deserialization errors
Symptom:
Error: missing field `layer2` in config
Fix: Ensure all required environment variables have the correct prefix
and separator. Isartor uses double-underscore (__) as separator:
# Correct:
export ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081
# Wrong:
export ISARTOR_LAYER2_SIDECAR_URL=http://127.0.0.1:8081
See the Configuration Reference for the full list of variables.
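Since single-underscore names are silently ignored, a quick scan for mis-prefixed variables can save debugging time. A sketch (note that intentional client-side flags such as ISARTOR_COPILOT_ENABLED will also match):

```shell
# Find ISARTOR_ variables that lack the double-underscore separator
# and therefore will not be picked up as gateway config.
env | grep '^ISARTOR_' | grep -v '^ISARTOR__' || echo "no mis-prefixed variables"
```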
Gateway auth / 401 Unauthorized
Symptom: All requests return 401 Unauthorized.
By default, gateway_api_key is empty and auth is disabled — you should not see 401 errors unless you (or your deployment) explicitly set ISARTOR__GATEWAY_API_KEY.
If you enabled auth by setting a key, every request must include it:
export ISARTOR__GATEWAY_API_KEY=your-secret-key
Common causes of unexpected 401s:
- The key in your request header doesn't match ISARTOR__GATEWAY_API_KEY.
- You forgot to include X-API-Key or Authorization: Bearer in the request.
Cache Issues
Low Cache Hit Rate
Symptom: Deflection rate below expected levels despite repeated traffic.
Diagnostic steps:
1. Check cache mode:
   echo $ISARTOR__CACHE_MODE   # should be "both" for most workloads
2. Check similarity threshold:
   echo $ISARTOR__SIMILARITY_THRESHOLD   # default: 0.85
   If too high (> 0.92), similar prompts won't match. Try lowering to 0.80.
3. Check TTL:
   echo $ISARTOR__CACHE_TTL_SECS   # default: 300
   Short TTL evicts entries before they can be reused.
4. Check Jaeger for cosine_similarity values on semantic cache spans. If scores sit just below the threshold, lower it.
Stale Cache Responses
Symptom: Users receive outdated answers from cache.
Fix: Reduce TTL or restart the gateway to clear in-memory caches:
export ISARTOR__CACHE_TTL_SECS=60 # 1 minute
For Redis-backed caches, you can flush explicitly:
redis-cli -u $ISARTOR__REDIS_URL FLUSHDB
Redis Connection Refused
Symptom:
Layer 1a: Redis connection error — falling through
Diagnostic steps:
1. Verify Redis is running:
   redis-cli -u $ISARTOR__REDIS_URL ping   # Expected: PONG
2. Check network connectivity (especially in Docker/K8s):
   # Inside the gateway container:
   curl -v telnet://redis:6379
3. Verify the URL format:
   # Correct formats:
   export ISARTOR__REDIS_URL=redis://127.0.0.1:6379
   export ISARTOR__REDIS_URL=redis://user:password@redis.svc:6379/0
4. Check the Redis memory limit: an OOM Redis rejects writes.
Fallback behaviour: When Redis is unreachable, Isartor falls through to the next layer. No data is lost, but deflection rate drops.
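Step 3's URL formats can also be checked mechanically before deploying. A minimal sketch, not a full URI parser (rediss:// is the TLS variant of the scheme):

```shell
# Accept only the redis:// (or TLS rediss://) scheme, as in the examples above.
check_redis_url() {
  case "$1" in
    redis://*|rediss://*) echo ok ;;
    *)                    echo "invalid: expected redis:// or rediss:// scheme" ;;
  esac
}
check_redis_url "redis://user:password@redis.svc:6379/0"   # → ok
check_redis_url "redis.svc:6379"                           # → invalid: expected redis:// or rediss:// scheme
```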
Cache Memory Growing Unbounded
Symptom: Gateway memory usage increases over time.
Fix: The in-memory cache uses bounded LRU eviction. Check:
echo $ISARTOR__CACHE_MAX_CAPACITY # default: 10000
If set too high, reduce it. Each entry ≈ 2–4 KB, so 10K entries ≈ 20–40 MB.
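The 2–4 KB/entry figure gives a quick worst-case estimate for any capacity setting (an approximation, not a measured bound):

```shell
# Back-of-envelope memory estimate for the in-memory cache.
capacity=${ISARTOR__CACHE_MAX_CAPACITY:-10000}
echo "low estimate:  $(( capacity * 2 / 1024 )) MB"   # 10000 entries → 19 MB
echo "high estimate: $(( capacity * 4 / 1024 )) MB"   # 10000 entries → 39 MB
```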
Embedding & SLM Issues
Slow Embedding Generation
Symptom: L1b latency > 10 ms.
Causes & Fixes:
| Cause | Fix |
|---|---|
| CPU-bound contention | Increase CPU allocation for the container |
| Large prompt text | Embedder truncates to model max length (512 tokens), but longer text = more CPU |
| Cold start | First embedding call warms up the candle BertModel (~2 s). Subsequent calls are fast. |
SLM Sidecar Unreachable
Symptom:
Layer 2: Failed to connect to SLM sidecar — falling through
Diagnostic steps:
1. Check if the sidecar is running:
   curl http://127.0.0.1:8081/v1/models
2. Verify configuration:
   echo $ISARTOR__LAYER2__SIDECAR_URL   # default: http://127.0.0.1:8081
3. Check the sidecar logs for errors (model loading, OOM, etc.).
4. Increase the timeout if the sidecar is slow:
   export ISARTOR__LAYER2__TIMEOUT_SECONDS=60
Fallback behaviour: When the SLM sidecar is unreachable, Isartor treats all requests as COMPLEX and forwards to Layer 3.
SLM Misclassification (Tiered: TEMPLATE / SNIPPET / COMPLEX)
The default classifier mode is tiered, which sorts requests into three categories instead of the legacy binary SIMPLE/COMPLEX split:
| Tier | Description |
|---|---|
| TEMPLATE | Config files, type definitions, documentation, boilerplate |
| SNIPPET | Short single-function code, simple middleware (<50 lines) |
| COMPLEX | Multi-file implementations, test suites, full endpoints |
TEMPLATE and SNIPPET requests are answered locally by the SLM; COMPLEX
requests are forwarded to Layer 3. The legacy binary mode (SIMPLE/COMPLEX)
is still available via ISARTOR__LAYER2__CLASSIFIER_MODE=binary.
An answer quality guard also rejects SLM answers that are too short (<10 chars) or start with uncertainty phrases, escalating them to Layer 3.
Symptom: Users receive low-quality answers for complex questions (misclassified as TEMPLATE/SNIPPET) or unnecessarily hit the cloud for simple ones.
Diagnostic steps:
1. In Jaeger, search for the router.decision attribute to see the classification distribution across TEMPLATE, SNIPPET, and COMPLEX.
2. Send known-simple and known-complex prompts and check the classification:
   curl -s -X POST http://localhost:8080/api/chat \
     -H "Content-Type: application/json" \
     -H "X-API-Key: $KEY" \
     -d '{"prompt": "Generate a tsconfig.json"}' | jq '.layer'
   # Expected: layer 2 (TEMPLATE)
3. Consider switching to a larger SLM model for better classification accuracy.
4. To fall back to the legacy binary classifier, set ISARTOR__LAYER2__CLASSIFIER_MODE=binary.
Embedded Candle Engine Errors
Symptom:
Layer 2: Embedded classification failed – falling through
Causes & Fixes:
| Cause | Fix |
|---|---|
| Model file missing | Set ISARTOR__EMBEDDED__MODEL_PATH to a valid GGUF file |
| Insufficient memory | Candle GGUF models need 1–4 GB RAM |
| Feature not compiled | Build with --features embedded-inference |
Cloud LLM Issues
502 Bad Gateway from Layer 3
Symptom: Requests that reach Layer 3 return 502.
Diagnostic steps:
1. Check provider connectivity:
   curl -s $ISARTOR__EXTERNAL_LLM_URL \
     -H "Authorization: Bearer $ISARTOR__EXTERNAL_LLM_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}'
2. Verify the API key is valid and has quota.
3. For Azure OpenAI, check the deployment ID and API version:
   echo $ISARTOR__AZURE_DEPLOYMENT_ID
   echo $ISARTOR__AZURE_API_VERSION
Rate Limiting from Cloud Provider
Symptom: Intermittent 429 errors from the cloud LLM.
Fix:
- Increase deflection rate (lower threshold, longer TTL) to reduce cloud traffic.
- Request higher rate limits from your provider.
- Implement client-side retry with exponential backoff (application level).
Wrong Provider Configured
Symptom: Authentication errors or unexpected response formats.
Fix: Verify the provider matches the URL and API key:
# OpenAI
export ISARTOR__LLM_PROVIDER=openai
# Azure
export ISARTOR__LLM_PROVIDER=azure
# Anthropic
export ISARTOR__LLM_PROVIDER=anthropic
# xAI
export ISARTOR__LLM_PROVIDER=xai
# Google Gemini
export ISARTOR__LLM_PROVIDER=gemini
# Ollama (local — no API key required)
export ISARTOR__LLM_PROVIDER=ollama
See the Configuration Reference for the full list of supported providers.
Observability Issues
No Traces in Jaeger
| Cause | Fix |
|---|---|
| Monitoring disabled | export ISARTOR__ENABLE_MONITORING=true |
| Wrong endpoint | export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317 |
| Collector not running | docker compose -f docker-compose.observability.yml up otel-collector |
| Firewall blocking gRPC | Ensure port 4317 is open between gateway and collector |
No Metrics in Prometheus
| Cause | Fix |
|---|---|
| Prometheus not scraping collector | Check prometheus.yml targets include otel-collector:8889 |
| Collector metrics pipeline broken | Verify otel-collector-config.yaml exports to Prometheus |
| No requests sent yet | Send a test request — metrics appear after first request |
Grafana Shows "No Data"
| Cause | Fix |
|---|---|
| Data source not configured | Add Prometheus source: URL http://prometheus:9090 |
| Wrong time range | Expand the time range in Grafana to cover the test period |
| Dashboard not provisioned | Check docker/grafana/provisioning/ paths are mounted |
Console Shows "OTel disabled" Despite Setting env var
Cause: Config file takes precedence, or the env var prefix is wrong.
Fix:
# Correct (double underscore):
export ISARTOR__ENABLE_MONITORING=true
# Wrong (single underscore):
export ISARTOR_ENABLE_MONITORING=true # ❌ not picked up
Performance & Degraded Operation
High Tail Latency (P99 > 10 s)
Diagnostic steps:
1. Check which layer is the bottleneck:
   histogram_quantile(0.99,
     sum by (le, layer_name) (
       rate(isartor_layer_duration_seconds_bucket[5m])
     )
   )
2. Common causes:
   - L3 Cloud: provider is slow → switch to a faster model or provider.
   - L2 SLM: model inference is slow → use a smaller quantised model.
   - L1b Semantic: embedding is slow → check CPU contention.
Gateway OOM (Out of Memory)
Diagnostic steps:
1. Check cache capacity:
   echo $ISARTOR__CACHE_MAX_CAPACITY
2. Reduce capacity or switch to the Redis backend.
3. If using the embedded SLM, check model size against the container memory limit.
Requests Queuing / High Connection Count
Symptom: Clients see connection timeouts or slow responses even for cache hits.
Causes & Fixes:
| Cause | Fix |
|---|---|
| Too many concurrent requests | Scale horizontally (add replicas) |
| spawn_blocking pool exhaustion | Increase Tokio blocking threads: TOKIO_WORKER_THREADS=8 |
| SLM inference blocking async runtime | Ensure SLM runs on blocking pool (default in Isartor) |
Degraded Mode (SLM Down, Cache Only)
When the SLM sidecar is unreachable, Isartor automatically degrades:
- L1a/L1b cache still works → cached requests are served.
- L2 SLM → all requests treated as COMPLEX (regardless of classifier mode) → forwarded to L3.
- Impact: Higher cloud costs, but no downtime.
Monitor with:
# If SLM layer stops resolving requests, something is wrong
sum(rate(isartor_requests_total{final_layer="L2_SLM"}[5m])) == 0
Docker & Deployment Issues
Docker Build Fails
Symptom: cargo build fails inside Docker.
Common fixes:
- Ensure Dockerfile uses the correct Rust toolchain version.
- For
aws-lc-rs(TLS): installcmake,gcc,makein build stage. - Check that
.dockerignoreisn't excluding required files.
Container Can't Reach Host Services
Symptom: Gateway inside Docker can't connect to sidecar on localhost.
Fix: Use Docker network names or host.docker.internal:
# docker-compose.yml
environment:
- ISARTOR__LAYER2__SIDECAR_URL=http://sidecar:8081 # service name
# or for host:
- ISARTOR__LAYER2__SIDECAR_URL=http://host.docker.internal:8081
Health Check Failing
Symptom: Orchestrator keeps restarting the container.
Fix: The health endpoint is GET /healthz. Ensure the health check
matches:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
interval: 10s
timeout: 5s
retries: 3
FAQ
Q: What is cache_mode and which should I use?
A: cache_mode controls which cache layers are active:
| Mode | What it does | Best for |
|---|---|---|
| exact | Only SHA-256 hash match | Deterministic agent loops |
| semantic | Only cosine similarity | Diverse user queries |
| both | Exact first, then semantic | Most workloads (default) |
Q: What happens if Redis goes down?
A: Isartor gracefully falls through. The exact cache layer logs a warning and forwards the request downstream. No crash, no data loss. Deflection rate drops until Redis recovers, and more requests reach the cloud LLM (higher cost).
Q: Can I change the embedding model?
A: Yes. The in-process embedder uses candle with a pure-Rust BertModel, which supports multiple models. Set:
export ISARTOR__EMBEDDING_MODEL=bge-small-en-v1.5
The model is auto-downloaded on first startup. Note: changing the model invalidates the semantic cache (different embedding dimensions/space).
Q: How much does Isartor cost to run?
A: Isartor itself is free (Apache 2.0). The infrastructure cost depends on your deployment:
| Mode | Estimated Cost |
|---|---|
| Minimalist (single binary, no GPU) | ~$5–15/month (small VM or container) |
| With SLM sidecar (CPU) | ~$20–50/month (4-core VM) |
| With SLM on GPU | ~$50–200/month (GPU instance) |
| Enterprise (K8s + Redis + vLLM) | ~$200–500/month |
The ROI comes from cloud LLM savings. At 70 % deflection and $0.01/1K tokens, Isartor typically pays for itself within the first week.
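The payback claim is easy to sanity-check against your own traffic. A sketch with an assumed 50M tokens/month (an illustrative volume, not a benchmark; swap in your numbers):

```shell
# Monthly savings = tokens x deflection% x price per 1K tokens.
tokens_per_month=50000000   # assumed example volume
deflection_pct=70
price_cents_per_1k=1        # $0.01 per 1K tokens
saved_cents=$(( tokens_per_month / 100 * deflection_pct / 1000 * price_cents_per_1k ))
echo "estimated savings: \$$(( saved_cents / 100 )) per month"   # → estimated savings: $350 per month
```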
Q: Is Isartor production-ready?
A: Isartor is designed for production use with:
- ✅ Bounded, concurrent caches (no unbounded memory growth)
- ✅ Graceful degradation (every layer has a fallback)
- ✅ OpenTelemetry observability (traces, metrics, structured logs)
- ✅ Health check endpoint (
/healthz) - ✅ Configurable via environment variables (12-factor app)
- ✅ Integration tests covering all middleware layers
For enterprise deployments, use Redis-backed caches and a production Kubernetes cluster. See the Enterprise Guide.
Q: Can I use Isartor with LangChain / LlamaIndex / AutoGen?
A: Yes. Isartor exposes an OpenAI-compatible API. Point any SDK at the gateway URL:
import openai
client = openai.OpenAI(
base_url="http://your-isartor-host:8080/v1",
api_key="your-gateway-key",
)
See Integrations for full examples.
Q: How do I upgrade Isartor?
A:
# Binary
cargo install --path . --force
# Docker
docker pull ghcr.io/isartor-ai/isartor:latest
docker compose up -d --pull always
In-memory caches are cleared on restart. Redis caches persist.
Q: Why does isartor update or GitHub access fail with localhost:8081 / Connection refused after I stopped Isartor?
A: Your shell likely still has proxy environment variables from a prior
isartor connect ... session, so non-Isartor commands are still trying to
reach GitHub through the local CONNECT proxy on localhost:8081.
Fix on macOS / Linux:
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY https_proxy http_proxy all_proxy
unset NODE_EXTRA_CA_CERTS SSL_CERT_FILE REQUESTS_CA_BUNDLE
unset ISARTOR_COPILOT_ENABLED ISARTOR_ANTIGRAVITY_ENABLED
Then confirm the shell is clean:
env | grep -i proxy
You can also clean up client-side configuration:
isartor connect copilot --disconnect
isartor connect claude --disconnect
isartor connect antigravity --disconnect
Q: Why does isartor update fail with Permission denied (os error 13)?
A: Your current isartor binary is installed in a system-managed directory.
Recommended fix: move to a user-writable install location:
mkdir -p ~/.local/bin
cp /usr/local/bin/isartor ~/.local/bin/isartor
chmod +x ~/.local/bin/isartor
export PATH="$HOME/.local/bin:$PATH"
hash -r
Then confirm: which isartor
Q: Why does isartor keep my terminal busy?
A: isartor runs the API gateway in the foreground by default. Start in detached mode:
isartor up --detach
Stop later with: isartor stop
Q: How do I monitor deflection rate in real-time?
A: Use the Grafana dashboard included in dashboards/prometheus-grafana.json
or the PromQL query:
1 - (
sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
)
Q: Can I run Isartor without any cloud LLM?
A: Partially. Layers 1 and 2 work standalone (cache + SLM), but Layer 3 requires a cloud LLM API key. Without one, uncached COMPLEX requests return a 502 error. For fully local operation, ensure your SLM can handle all traffic (classify aggressively toward TEMPLATE/SNIPPET, or toward SIMPLE in legacy binary mode).
See also: Performance Tuning · Metrics & Tracing · Configuration Reference
Why Most LLM Gateways Can't Pass a FedRAMP Review
Published on the Isartor blog — targeting platform engineers and security architects at regulated enterprises.
The CISO's Nightmare
Picture this: a CISO at a federal agency is six months into an LLM gateway evaluation. The vendor has given assurances — "our gateway is secure, all data stays in your environment." The compliance team runs a network capture during the proof-of-concept. Three unexpected domains light up:
- telemetry.vendor.io — anonymous usage metrics
- license.vendor.io — license key validation on every startup
- registry.vendor.io — model version checks
The FedRAMP audit fails. The project is cancelled. Six months of engineering work discarded because nobody read the gateway's egress behavior carefully enough before the evaluation began.
This is not a hypothetical. It happens routinely in regulated environments. The mistake is usually honest — gateway teams build their products for cloud-native deployments and add telemetry and license checks as an afterthought, without thinking about what happens when those systems need to run in an air-gapped facility.
The Hidden Phone-Home Problem
Most LLM gateways have outbound connection patterns that are not documented in their README. Let's be specific about what these are and why each one is a blocker in a FedRAMP or HIPAA environment:
License validation servers. A gateway that validates its licence key against a remote server cannot operate in a network segment with no outbound internet access. Worse, the validation traffic typically contains the licence key and the server's hostname — both of which may be considered sensitive data in a classified environment. Under FedRAMP Moderate, SC-7 (Boundary Protection) requires that external connections be explicitly authorised and documented. An undocumented licence-check endpoint fails this control.
Anonymous usage telemetry. Many open-source gateways ship with opt-out telemetry that sends aggregate usage statistics to the developer's servers. Even "anonymous" telemetry can include prompt length distributions, model names, or error rates that a regulated environment may consider sensitive. Under HIPAA, any data that could be used to identify a patient — including metadata about the prompts that process PHI — must stay within the covered entity's environment.
Model registry lookups. Gateways that support automatic model updates or capability discovery make outbound calls to check for new model versions. In an air-gapped environment, there is no path for these calls to succeed — and if the gateway blocks on a registry timeout, latency spikes cascade through the application.
OTel exporters enabled by default. OpenTelemetry is essential for observability, but a gateway that ships with OTLP_EXPORTER_ENDPOINT pointing at a cloud-hosted collector creates a data exfiltration risk. Trace data contains prompt content, response content, latency, and error messages. An OTel exporter sending this to an external endpoint in a HIPAA environment would be a reportable breach.
Each of these problems has the same root cause: the gateway was designed for cloud-native deployments and retrofitted for security requirements, rather than designed with air-gap constraints from the start.
What "Truly Air-Gapped" Actually Means
A gateway that can genuinely pass an air-gap review must satisfy three requirements:
1. A static binary with no runtime dependencies. Every runtime dependency — a Python interpreter, a Node.js runtime, a JVM — is a potential attack surface and a source of unexpected network calls. A statically compiled binary eliminates the entire class of "your dependency phoned home without you knowing" vulnerabilities. It also eliminates the download-on-first-run pattern where models or plugins are fetched from the internet when the gateway starts.
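One quick way to spot-check the no-runtime-dependency claim on any candidate binary (the path below is illustrative) is to inspect it for dynamic linkage with standard tools:

```shell
# A musl-static build has no dynamic section; ldd refuses to resolve it.
ldd ./isartor
#   typically prints: "not a dynamic executable"
# Cross-check with file(1):
file ./isartor
#   should report: "... statically linked ..."
```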
2. Offline licence validation. Licence validation must work without a network call. The correct approach is offline cryptographic validation: the licence key embeds a MAC or signature that the binary verifies locally against key material baked in at compile time (a shared secret for HMAC, or a public key for an asymmetric signature scheme). No server call required. No licence-check traffic to document in your FedRAMP boundary diagram.
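A minimal sketch of the offline-verification idea, assuming a hypothetical "payload.hex-mac" licence layout (the real Isartor licence format is not documented here). The MAC is recomputed locally from a key embedded at build time and compared; no network is involved at any point:

```shell
# Hypothetical licence layout: "<payload>.<hex-hmac-sha256>" (illustrative).
EMBEDDED_KEY="baked-in-at-compile-time"      # shipped inside the binary
payload="org=acme;tier=enterprise;exp=2027-01-01"
# Issuer side: mint a licence by MAC-ing the payload.
mac=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$EMBEDDED_KEY" | awk '{print $NF}')
licence="${payload}.${mac}"
# Verifier side (inside the binary): recompute and compare. No server call.
check=$(printf '%s' "${licence%.*}" | openssl dgst -sha256 -hmac "$EMBEDDED_KEY" | awk '{print $NF}')
if [ "$check" = "${licence##*.}" ]; then echo "licence valid"; else echo "licence invalid"; fi
# → licence valid
```

Tampering with any byte of the payload changes the recomputed MAC, so the comparison fails without ever contacting a licence server.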
3. All models bundled — no download on first run. Any model that is downloaded at runtime creates a bootstrap dependency on internet connectivity. For an air-gapped deployment, all models must be available in the container image (or on a mounted volume) before the gateway starts. This is non-negotiable for environments where the deployment system has no outbound internet access at all.
Isartor is designed to meet all three requirements. The binary is compiled with Rust's --target x86_64-unknown-linux-musl, producing a fully static binary with zero shared-library dependencies. Licence validation uses HMAC offline verification. The latest-airgapped Docker image is built to pre-bundle (or pre-cache) all embedding models so that, once the image is transferred to the air-gapped environment and ISARTOR__OFFLINE_MODE=true is set, no additional model downloads or outbound internet access are required at runtime.
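The transfer itself follows the standard offline Docker workflow: pull on a connected staging host, export to a tarball, move it across the gap on approved media, and import on the other side. The image tag comes from this page; everything else is the stock docker CLI:

```shell
# On a connected staging host:
docker pull ghcr.io/isartor-ai/isartor:latest-airgapped
docker save ghcr.io/isartor-ai/isartor:latest-airgapped -o isartor-airgapped.tar
# ... move isartor-airgapped.tar across the air gap on approved media ...
# On the air-gapped host:
docker load -i isartor-airgapped.tar
```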
The Configuration
Here is the complete environment variable configuration for a compliant air-gapped deployment of Isartor in front of a self-hosted vLLM instance:
# ── Air-gap enforcement ──────────────────────────────────────────────
# Block all outbound cloud connections at the application layer.
export ISARTOR__OFFLINE_MODE=true
# ── Internal LLM routing (L3) ────────────────────────────────────────
# Route surviving cache-misses to your internal model server.
export ISARTOR__EXTERNAL_LLM_URL=http://vllm.internal.corp:8000/v1
export ISARTOR__LLM_PROVIDER=openai # vLLM exposes OpenAI-compat API
export ISARTOR__EXTERNAL_LLM_MODEL=meta-llama/Llama-3-8B-Instruct
# ── Observability (internal collector only) ──────────────────────────
export ISARTOR__ENABLE_MONITORING=true
export ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.internal.corp:4317
Running isartor connectivity-check with this configuration produces:
Isartor Connectivity Audit
──────────────────────────
Required (L3 cloud routing):
→ http://vllm.internal.corp:8000/v1 [CONFIGURED]
(BLOCKED — offline mode active)
Optional (observability / monitoring):
→ http://otel-collector.internal.corp:4317 [CONFIGURED]
Internal only (no external):
→ (in-memory cache — no network connection) [CONFIGURED - internal]
Zero hidden telemetry connections: ✓ VERIFIED
Air-gap compatible: ✓ YES (L3 disabled or offline mode active)
This output is the screenshot your compliance team needs. Every connection Isartor makes is explicit, documented, and internal.
The FedRAMP Control Mapping
Understanding how a deployment posture maps to specific NIST 800-53 controls is what separates a security claim from a security argument. Here are the four controls most directly supported by Isartor's air-gapped deployment posture:
AU-2 (Audit Logging): AU-2 requires that the system generate audit records for events relevant to security. Isartor logs every prompt, every deflection decision, and every L3 forwarding event as a structured JSON record with a distributed tracing span. The logs include the layer that handled the request (L1a, L1b, L2, L3), the latency, and whether the request was deflected or forwarded. These records can be ingested by any SIEM that accepts JSON log streams.
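Because the records are line-delimited JSON, a SIEM-less spot check works with plain text tools. The field names below are illustrative assumptions about the record shape, not the documented schema:

```shell
# Two illustrative audit records (field names are assumptions).
cat > /tmp/isartor-audit.log <<'EOF'
{"ts":"2025-06-01T12:00:00Z","layer":"L1a","deflected":true,"latency_ms":0.4}
{"ts":"2025-06-01T12:00:01Z","layer":"L3","deflected":false,"latency_ms":812.0}
EOF
# Count requests that escaped the stack and were forwarded to L3:
grep -c '"deflected":false' /tmp/isartor-audit.log
# → 1
```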
SC-7 (Boundary Protection): SC-7 requires the system to monitor and control communications at external boundary points. ISARTOR__OFFLINE_MODE=true implements a hard application-layer block on all outbound connections to non-internal endpoints. This is verified by the phone-home audit test in tests/phone_home_audit.rs, which runs on every commit to main in CI. The CI badge on the repository proves continuous enforcement.
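Reviewers can also reproduce the check independently with an ordinary packet capture during startup. The interface, duration, and internal address range below are illustrative, and the capture requires root:

```shell
# Capture outbound TCP SYNs for 60 s while the gateway starts (run as root).
# 10.0.0.0/8 stands in for your documented internal range.
tcpdump -n -i any 'tcp[tcpflags] & tcp-syn != 0 and not dst net 10.0.0.0/8' \
        -w /tmp/isartor-egress.pcap &
CAP_PID=$!
isartor up --detach
sleep 60
kill "$CAP_PID"
# An empty capture beyond your documented internal endpoints is the evidence.
tcpdump -n -r /tmp/isartor-egress.pcap | wc -l
```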
SI-4 (Information System Monitoring): SI-4 requires monitoring of the information system to detect attacks and indicators of compromise. Isartor's OpenTelemetry integration exports traces and metrics to an internal collector. The deflection stack metrics — cache hit rate, L3 call rate, latency per layer — provide a real-time signal that can be baselined and alerted on. An anomalous spike in L3 calls could indicate a cache poisoning attempt.
CM-6 (Configuration Settings): CM-6 requires the organisation to establish and document configuration settings. Every Isartor configuration parameter is controlled by an environment variable with a documented default and a documented security implication. The ISARTOR__OFFLINE_MODE flag, in particular, has a documented effect: it is a single switch that moves the system from "possibly communicates with cloud" to "provably does not communicate with cloud."
Call to Action
If you are a platform engineer or security architect at a regulated enterprise evaluating LLM gateway options, start here:
- Read the Air-Gapped Deployment Guide for the complete pre-deployment checklist.
- Pull ghcr.io/isartor-ai/isartor:latest-airgapped and run isartor connectivity-check in your environment.
- Review the phone-home audit test to understand exactly what is being verified in CI.
- Open an issue on GitHub if you have compliance requirements not covered here — FedRAMP High, IL5, ITAR, and sector-specific requirements are all on the roadmap.
The binary that passes your network capture is the binary that passes your FedRAMP review.