Level 1 — Minimal Deployment
Single static binary, embedded candle inference + in-process candle sentence embeddings, zero C/C++ dependencies.
This guide covers deploying Isartor as a standalone process — no sidecars, no Docker Compose, no orchestrator. The firewall binary embeds a Gemma-2-2B-IT GGUF model via candle for Layer 2 classification and uses candle's BertModel (sentence-transformers/all-MiniLM-L6-v2) for Layer 1 semantic cache embeddings — all entirely in-process, pure Rust.
When to Use Level 1
| ✅ Good Fit | ❌ Consider Level 2/3 Instead |
|---|---|
| €5–€20/month VPS (Hetzner, DigitalOcean, Linode) | GPU inference for generation quality |
| ARM edge devices (Raspberry Pi 5, Jetson Nano) | More than ~50 concurrent users |
| Air-gapped / offline environments | Production observability stack required |
| Development & local experimentation | Multi-node high-availability |
| CI/CD test runners | |
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 2 GB free | 4 GB free |
| Disk | 2 GB (model download) | 5 GB |
| CPU | 2 cores | 4+ cores (AVX2 recommended) |
| Rust (build from source) | 1.75+ | Latest stable |
| OS | Linux (x86_64 / aarch64), macOS | Ubuntu 22.04 LTS |
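Before installing, you can check a candidate host against these minimums (Linux-specific; free and /proc/cpuinfo do not exist on macOS):

```shell
# Report CPU cores, free RAM, and AVX2 support on a Linux host
nproc
free -h | awk '/^Mem:/ {print "free RAM:", $4}'
if grep -qm1 avx2 /proc/cpuinfo; then echo "AVX2: yes"; else echo "AVX2: no"; fi
```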
Memory budget: Gemma-2-2B Q4_K_M ≈ 1.5 GB, candle BertModel ≈ 90 MB, tokenizer ≈ 4 MB, firewall runtime ≈ 50 MB. Total: ~1.7 GB resident.
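As a sanity check, these components sum as follows (sizes in MB, rounded from the figures above):

```shell
# Level 1 memory budget, component by component (MB)
model=1536      # Gemma-2-2B Q4_K_M, ~1.5 GB
embedder=90     # candle BertModel
tokenizer=4
runtime=50      # firewall runtime
total=$((model + embedder + tokenizer + runtime))
echo "estimated resident: ${total} MB"
```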
Option A: One-Click Install (Recommended)
The fastest way to get started is to use the pre-built, cross-platform binaries published by the CI/CD pipeline.
Install via script:
curl -fsSL https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.sh | sh
Windows (PowerShell):
irm https://raw.githubusercontent.com/isartor-ai/Isartor/main/install.ps1 | iex
This script detects your operating system and CPU architecture, downloads the matching release binary, and adds it to your PATH automatically.
Option B: Build from Source
1. Clone & Build
git clone https://github.com/isartor-ai/Isartor.git
cd Isartor
cargo build --release
The release binary is at ./target/release/isartor (~5 MB statically linked).
2. Configure Environment
Create a minimal .env file or export variables directly:
# Required — your cloud LLM key for Layer 3 fallback
export ISARTOR__EXTERNAL_LLM_API_KEY="sk-..."
# Optional — override defaults
export ISARTOR__GATEWAY_API_KEY="my-secret-key"
export ISARTOR__HOST_PORT="0.0.0.0:8080"
export ISARTOR__LLM_PROVIDER="openai" # openai | azure | anthropic | xai
export ISARTOR__EXTERNAL_LLM_MODEL="gpt-4o-mini"
# Cache mode — "both" enables exact + semantic cache. Semantic embeddings
# are generated in-process via candle BertModel — no sidecar needed.
export ISARTOR__CACHE_MODE="both"
# Pluggable backends — Level 1 uses the defaults (no change needed):
# ISARTOR__CACHE_BACKEND=memory — in-process LRU (ahash + parking_lot)
# ISARTOR__ROUTER_BACKEND=embedded — in-process Candle GGUF SLM
# These are ideal for a single-process deployment with zero dependencies.
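If you keep these settings in the .env file rather than exporting them by hand, a common shell pattern is to source the file with auto-export before launching (the firewall itself simply reads process environment variables):

```shell
# Write a minimal .env and source it with auto-export (set -a), so the
# ISARTOR__* variables reach any process launched from this shell
cat > .env <<'EOF'
ISARTOR__GATEWAY_API_KEY=my-secret-key
ISARTOR__CACHE_MODE=both
EOF
set -a; . ./.env; set +a
echo "cache mode: $ISARTOR__CACHE_MODE"
```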
3. Start the Firewall
./target/release/isartor up
On first start, the embedded classifier will auto-download the Gemma-2-2B-IT GGUF model from Hugging Face Hub (~1.5 GB). Subsequent starts load from the local cache (~/.cache/huggingface/).
INFO isartor > Listening on 0.0.0.0:8080
INFO isartor::layer1::embeddings > Initialising candle TextEmbedder (all-MiniLM-L6-v2)...
INFO isartor::layer1::embeddings > TextEmbedder ready (~90 MB BertModel loaded)
INFO isartor::services::local_inference > Downloading model from mradermacher/gemma-2-2b-it-GGUF...
INFO isartor::services::local_inference > Model loaded (1.5 GB), ready for inference
4. Verify
# Health check
curl http://localhost:8080/health
# Test the firewall
curl -s http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: my-secret-key" \
-d '{"prompt": "Hello, how are you?"}' | jq .
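For CI runners (a Level 1 use case above), the same check can be scripted so a pipeline step reports a non-200 response; a sketch assuming the firewall is running locally with the key configured earlier:

```shell
# Scripted smoke test: capture the HTTP status code of a firewall round-trip
status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
  -H "Content-Type: application/json" \
  -H "X-API-Key: my-secret-key" \
  -d '{"prompt": "Hello, how are you?"}' \
  http://localhost:8080/api/chat || true)
echo "HTTP $status"
[ "$status" = "200" ] || echo "firewall check failed (HTTP $status)" >&2
```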
Option C: Docker (Single Container)
For environments where you prefer a container but don't need a full Compose stack.
Build the Image
cd Isartor
docker build -t isartor:latest -f docker/Dockerfile .
Run
docker run -d \
--name isartor \
-p 8080:8080 \
-e ISARTOR__GATEWAY_API_KEY="my-secret-key" \
-e ISARTOR__EXTERNAL_LLM_API_KEY="sk-..." \
-e ISARTOR__CACHE_MODE="both" \
-e HF_HOME=/tmp/huggingface \
-v isartor-models:/tmp/huggingface \
isartor:latest
Note: The -v flag mounts a named volume for the Hugging Face cache so the model downloads persist across container restarts. The official Docker image runs as non-root and uses HF_HOME=/tmp/huggingface to ensure the cache is writable.
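Docker can also track liveness against the /health endpoint from the Verify section; a hedged sketch using Docker's built-in health-check flags (it assumes curl exists inside the image, which this guide does not state, and the intervals are illustrative — the long start period covers the first-run model download):

```shell
docker run -d \
  --name isartor \
  -p 8080:8080 \
  -e ISARTOR__GATEWAY_API_KEY="my-secret-key" \
  -e ISARTOR__EXTERNAL_LLM_API_KEY="sk-..." \
  -e HF_HOME=/tmp/huggingface \
  -v isartor-models:/tmp/huggingface \
  --health-cmd 'curl -fsS http://localhost:8080/health || exit 1' \
  --health-interval 30s \
  --health-start-period 120s \
  --health-retries 3 \
  isartor:latest
```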
Option D: systemd Service (Production Linux)
For long-running production deployments on bare metal or VPS.
1. Install the Binary
# Build
cargo build --release
# Install to /usr/local/bin
sudo cp target/release/isartor /usr/local/bin/isartor
sudo chmod +x /usr/local/bin/isartor
2. Create a System User
sudo useradd --system --no-create-home --shell /usr/sbin/nologin isartor
3. Create Environment File
sudo mkdir -p /etc/isartor
sudo tee /etc/isartor/env <<'EOF'
ISARTOR__HOST_PORT=0.0.0.0:8080
ISARTOR__GATEWAY_API_KEY=your-production-key
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...
ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__CACHE_MODE=both
ISARTOR__CACHE_BACKEND=memory
ISARTOR__ROUTER_BACKEND=embedded
# Model cache location; must be writable under the hardened unit (ReadWritePaths)
HF_HOME=/var/cache/isartor
RUST_LOG=isartor=info
EOF
sudo chmod 600 /etc/isartor/env
4. Create systemd Unit
sudo tee /etc/systemd/system/isartor.service <<'EOF'
[Unit]
Description=Isartor Prompt Firewall
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=isartor
Group=isartor
EnvironmentFile=/etc/isartor/env
ExecStart=/usr/local/bin/isartor up
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/cache/isartor
[Install]
WantedBy=multi-user.target
EOF
5. Create Model Cache Directory
sudo mkdir -p /var/cache/isartor
sudo chown isartor:isartor /var/cache/isartor
6. Enable & Start
sudo systemctl daemon-reload
sudo systemctl enable isartor
sudo systemctl start isartor
# Check status
sudo systemctl status isartor
sudo journalctl -u isartor -f
Model Pre-Caching (Air-Gapped / Offline)
If the deployment target has no internet access, pre-download the model on a connected machine and copy it over.
On the Connected Machine
# Install huggingface-cli
pip install huggingface-hub
# Download the GGUF file
huggingface-cli download mradermacher/gemma-2-2b-it-GGUF \
gemma-2-2b-it.Q4_K_M.gguf \
--local-dir ./models
# Also grab the tokenizer (from the base model)
huggingface-cli download google/gemma-2-2b-it \
tokenizer.json \
--local-dir ./models
Transfer to Target
scp -r ./models/ user@target-host:/var/cache/isartor/
By default, hf-hub uses ~/.cache/huggingface/. In the official Docker image, Isartor sets HF_HOME=/tmp/huggingface (non-root safe). Set HF_HOME or ISARTOR_HF_CACHE_DIR to point to your pre-cached directory if needed.
Level 1 Configuration Reference
These are the most relevant ISARTOR__* variables for Level 1 deployments. For the full reference, see the Configuration Reference.
| Variable | Default | Level 1 Notes |
|---|---|---|
| ISARTOR__HOST_PORT | 0.0.0.0:8080 | Bind address |
| ISARTOR__GATEWAY_API_KEY | (empty) | Set to enable gateway auth |
| ISARTOR__CACHE_MODE | both | both recommended — candle BertModel provides in-process semantic embeddings |
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-process Level 1 |
| ISARTOR__ROUTER_BACKEND | embedded | In-process Candle GGUF SLM — zero external dependencies |
| ISARTOR__CACHE_TTL_SECS | 300 | Cache TTL in seconds |
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries per cache |
| ISARTOR__LLM_PROVIDER | openai | openai · azure · anthropic · xai |
| ISARTOR__EXTERNAL_LLM_API_KEY | (empty) | Required for Layer 3 fallback |
| ISARTOR__EXTERNAL_LLM_MODEL | gpt-4o-mini | Cloud LLM model name |
| ISARTOR__ENABLE_MONITORING | false | Enable for stdout OTel (no collector needed) |
Embedded Classifier Defaults (Compiled)
| Setting | Default Value | Description |
|---|---|---|
| repo_id | mradermacher/gemma-2-2b-it-GGUF | HF repo for the GGUF model |
| gguf_filename | gemma-2-2b-it.Q4_K_M.gguf | Model file (~1.5 GB) |
| max_classify_tokens | 20 | Token limit for classification |
| max_generate_tokens | 256 | Token limit for simple task execution |
| temperature | 0.0 | Greedy decoding for classification |
| repetition_penalty | 1.1 | Avoids degenerate loops |
Performance Expectations
| Metric | Typical Value (4-core x86_64) |
|---|---|
| Cold start (model download) | 30–120 s (depends on bandwidth; ~1.5 GB Gemma + ~90 MB candle BertModel) |
| Warm start (cached model) | 3–8 s |
| Classification latency | 50–200 ms |
| Simple task execution | 200–2000 ms |
| Firewall overhead (no inference) | < 1 ms |
| Memory (steady state) | ~1.6 GB |
| Binary size | ~5 MB |
Upgrading to Level 2
When your traffic outgrows Level 1, the migration path is straightforward:
- Add the generation sidecar — ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081 (replaces embedded candle with the more powerful Phi-3-mini on GPU).
- Optionally add an embedding sidecar — ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082 (only needed for external embedding inference; the default L1b semantic cache already uses in-process candle BertModel).
- Deploy via Docker Compose — see Level 2 — Sidecar Deployment.
Note: The pluggable backend defaults (cache_backend=memory, router_backend=embedded) remain appropriate for Level 2 single-host deployments. You only need to switch to cache_backend=redis and router_backend=vllm at Level 3 when scaling horizontally.
No code changes required — only environment variables and infrastructure.
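In other words, the upgrade is an environment-variable delta; the values below are exactly those listed above, and the URLs assume sidecars on the same host:

```shell
# Level 2 delta: point Layer 2 at the generation sidecar
export ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8081
# Optional: external embedding inference via an embedding sidecar
export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
echo "router sidecar: $ISARTOR__LAYER2__SIDECAR_URL"
```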