Level 2 — Sidecar Deployment

Split architecture: Isartor firewall + llama.cpp generation sidecar on a single host.

This guide covers deploying Isartor with a dedicated AI sidecar for generation. The firewall delegates Layer 2 inference to a lightweight llama.cpp container via HTTP, while Layer 1 semantic cache embeddings run in-process via candle BertModel (no embedding sidecar required). The overall stack runs on a single machine via Docker Compose.
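To make the delegation concrete: llama.cpp's server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so the Layer 2 call is an ordinary chat-completion request. The sketch below is illustrative only (the firewall's internal payload may differ); the model name mirrors the ISARTOR__LAYER2__MODEL_NAME default.

```shell
# Hedged sketch: the kind of OpenAI-compatible request the firewall
# sends to the llama.cpp generation sidecar. Not the firewall's exact payload.
PAYLOAD='{"model": "phi-3-mini", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 64}'

# Against a running stack (service name inside Compose, localhost outside):
#   curl -s http://localhost:8081/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
printf '%s\n' "$PAYLOAD"
```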


When to Use Level 2

| ✅ Good Fit | ❌ Consider Level 1 or Level 3 |
|---|---|
| Single host with GPU (NVIDIA, AMD) | No GPU available → Level 1 embedded candle |
| Want GPU-accelerated Layer 2 generation | Multi-node scaling → Level 3 Kubernetes |
| Want full observability stack (Jaeger, Grafana) | Budget VPS (< 4 GB RAM) → Level 1 |
| Development with production-like topology | Auto-scaling inference pools → Level 3 |
| 10–100 concurrent users | > 100 concurrent users → Level 3 |

Prerequisites

| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 10 GB | 20 GB (model cache) |
| CPU | 4 cores | 8+ cores |
| GPU (optional) | NVIDIA with 4 GB VRAM | NVIDIA with 8+ GB VRAM |
| Docker | 24.0+ | Latest |
| Docker Compose | v2.20+ | Latest |
| NVIDIA Container Toolkit (GPU) | Latest | Latest |
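A quick preflight check before the first `docker compose up` can save a failed boot. The helper below is a generic sketch, not part of the Isartor repository; it only verifies that the required binaries are on PATH.

```shell
# Hedged preflight sketch: verify required tools exist before starting
# the stack. Not shipped with the Isartor repo.
check_bins() {
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing: $bin" >&2
      return 1
    fi
  done
  echo "ok"
}

# CPU-only stack:
#   check_bins docker
# GPU passthrough additionally needs:
#   check_bins docker nvidia-ctk nvidia-smi
```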

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Single Host                              │
│                                                                 │
│  ┌─────────────┐    ┌───────────────────┐    ┌──────────────┐   │
│  │   Client    │───▶│ Isartor Firewall  │    │  Jaeger UI   │   │
│  │             │    │ :8080             │    │  :16686      │   │
│  └─────────────┘    │ (candle L1        │    └──────────────┘   │
│                     │  embeddings       │                       │
│                     │  built-in)        │                       │
│                     └──┬────────────────┘                       │
│                        │                                        │
│              HTTP :8081│                                        │
│                        ▼                                        │
│               ┌────────────┐                  ┌──────────────┐  │
│               │ slm-gen    │                  │  Grafana     │  │
│               │ Phi-3-mini │                  │  :3000       │  │
│               │ (llama.cpp)│                  └──────────────┘  │
│               └────────────┘                                    │
│                                               ┌──────────────┐  │
│               ┌─────────────────────────┐     │  Prometheus  │  │
│               │ OTel Collector :4317    │────▶│  :9090       │  │
│               └─────────────────────────┘     └──────────────┘  │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ Optional: slm-embed :8082 (llama.cpp)                    │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Services

| Service | Image | Port | Purpose | Memory Limit |
|---|---|---|---|---|
| gateway | isartor:latest (built) | 8080 | Prompt Firewall (includes candle BertModel for Layer 1 embeddings) | 256 MB |
| slm-generation | ghcr.io/ggml-org/llama.cpp:server | 8081 | Phi-3-mini-4k (Q4_K_M) — intent classification + generation | 4 GB |
| slm-embedding (optional) | ghcr.io/ggml-org/llama.cpp:server | 8082 | all-MiniLM-L6-v2 (Q8_0) — external embedding sidecar (default uses in-process candle) | 512 MB |
| otel-collector | otel/opentelemetry-collector-contrib:0.96.0 | 4317 | OTLP gRPC receiver | 128 MB |
| jaeger | jaegertracing/all-in-one:1.55 | 16686 | Distributed tracing UI | 256 MB |
| prometheus | prom/prometheus:v2.51.0 | 9090 | Metrics storage (7d retention) | 256 MB |
| grafana | grafana/grafana:10.4.0 | 3000 | Dashboards | 256 MB |
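For orientation, the generation sidecar's entry in docker-compose.sidecar.yml looks roughly like the sketch below. This is an assumption-laden illustration (the image tag matches the table above, but the model filename and exact flags are guesses); the repository's Compose file is the source of truth. The flags shown are standard llama.cpp server options.

```yaml
# Illustrative sketch only — check docker-compose.sidecar.yml for the
# real definition. Model filename is hypothetical.
services:
  slm-generation:
    image: ghcr.io/ggml-org/llama.cpp:server
    ports:
      - "8081:8081"
    volumes:
      - isartor-slm-models:/models
    command: >
      -m /models/phi-3-mini-4k-instruct-q4_k_m.gguf
      --host 0.0.0.0 --port 8081
      --n-gpu-layers 99
    deploy:
      resources:
        limits:
          memory: 4g
```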

Quick Start (CPU Only)

1. Clone the Repository

git clone https://github.com/isartor-ai/isartor.git
cd isartor/docker

2. Configure Layer 3 (Optional)

Layers 0–2 work without a cloud LLM key. If you want Layer 3 fallback:

cp .env.full.example .env.full

Edit .env.full and set your provider:

ISARTOR__LLM_PROVIDER=openai
ISARTOR__EXTERNAL_LLM_MODEL=gpt-4o-mini
ISARTOR__EXTERNAL_LLM_API_KEY=sk-...

3. Start the Full Stack

docker compose -f docker-compose.sidecar.yml up --build

First launch downloads model files (~1.5 GB for Phi-3 + ~50 MB for MiniLM). Subsequent starts use the cached isartor-slm-models volume.

4. Wait for Health Checks

The firewall waits for both sidecars to become healthy before starting:

docker compose -f docker-compose.sidecar.yml ps

All services should show healthy or running.
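If you script the startup, a small polling helper can replace watching ps by hand. This is a generic sketch, not something shipped with the repo:

```shell
# Hedged sketch: retry a health probe until it succeeds or we give up.
wait_healthy() {
  tries=$1; shift
  i=0
  until "$@" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example against the firewall's health endpoint:
#   wait_healthy 60 curl -sf http://localhost:8080/healthz
```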

5. Verify

# Health check
curl http://localhost:8080/healthz

# Test the firewall
curl -s http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?"}' | jq .

# If you enabled gateway auth, add:
#   -H "X-API-Key: your-secret-key"

# Check traces in Jaeger
open http://localhost:16686

GPU Passthrough (NVIDIA)

To enable GPU acceleration for the llama.cpp sidecars:

1. Install NVIDIA Container Toolkit

# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

2. Add GPU Resources to Compose

Create a docker-compose.gpu.override.yml:

services:
  slm-generation:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # The default --n-gpu-layers 99 in docker-compose.sidecar.yml
    # already offloads all layers to GPU when available.

  slm-embedding:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

3. Start with GPU Override

docker compose \
  -f docker-compose.sidecar.yml \
  -f docker-compose.gpu.override.yml \
  up --build

Expected GPU Impact

| Metric | CPU Only (8-core) | GPU (RTX 3060 12 GB) |
|---|---|---|
| Phi-3 classification | 500–2000 ms | 30–100 ms |
| Phi-3 generation (256 tokens) | 5–15 s | 0.5–2 s |
| MiniLM embedding | 20–50 ms | 5–10 ms |

Available Compose Files

The docker/ directory contains several Compose configurations for different use cases:

| File | Description | Provider |
|---|---|---|
| docker-compose.sidecar.yml | Recommended. Full stack with llama.cpp sidecars + observability | Any (configurable) |
| docker-compose.yml | Legacy stack with Ollama (heavier) | OpenAI |
| docker-compose.azure.yml | Legacy stack with Ollama, pre-configured for Azure OpenAI | Azure |
| docker-compose.observability.yml | Observability-focused stack (Ollama + OTel + Jaeger + Grafana) | Azure |

We recommend docker-compose.sidecar.yml for all new deployments. The llama.cpp sidecars are ~30 MB each vs. Ollama's ~1.5 GB.


Environment Variables (Level 2 Specific)

These variables are relevant to the sidecar architecture. For the full reference, see the Configuration Reference.

Firewall ↔ Sidecar Communication

| Variable | Default | Description |
|---|---|---|
| ISARTOR__LAYER2__SIDECAR_URL | http://127.0.0.1:8081 | Generation sidecar URL (use Docker service name in Compose: http://slm-generation:8081) |
| ISARTOR__LAYER2__MODEL_NAME | phi-3-mini | Model name for OpenAI-compatible requests |
| ISARTOR__LAYER2__TIMEOUT_SECONDS | 30 | HTTP timeout for generation calls |
| ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL | http://127.0.0.1:8082 | Embedding sidecar URL — optional (default uses in-process candle; use http://slm-embedding:8082 in Compose) |
| ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME | all-minilm | Embedding model name (sidecar only) |
| ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS | 10 | HTTP timeout for embedding calls (sidecar only) |
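When running under Compose, the localhost defaults must be overridden with service-name URLs, as the table notes. In an .env file that amounts to:

```
ISARTOR__LAYER2__SIDECAR_URL=http://slm-generation:8081
ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://slm-embedding:8082
```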

Pluggable Backends

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_BACKEND | memory | In-process LRU — ideal for single-host Docker Compose |
| ISARTOR__ROUTER_BACKEND | embedded | In-process candle SLM classification — no external dependency |

Scalability note: These defaults are appropriate for Level 2 (single host). When moving to Level 3 (multi-replica K8s), switch to cache_backend=redis and router_backend=vllm for horizontal scaling.
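In env-file form, the Level 3 switch mentioned in the note amounts to the two lines below; any backend-specific connection settings (Redis URL, vLLM endpoint) are covered in the Configuration Reference, not here.

```
ISARTOR__CACHE_BACKEND=redis
ISARTOR__ROUTER_BACKEND=vllm
```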

Cache

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MODE | both | Use both — in-process candle BertModel provides semantic embeddings at all tiers |
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | Cosine similarity threshold for cache hits |

Observability

| Variable | Default | Description |
|---|---|---|
| ISARTOR__ENABLE_MONITORING | true (in Compose) | Enable OTel trace/metric export |
| ISARTOR__OTEL_EXPORTER_ENDPOINT | http://otel-collector:4317 | OTel Collector gRPC endpoint |
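A minimal collector configuration matching those endpoints might look like the sketch below. This is an assumption-labeled illustration, not the repository's shipped config. Note that recent collector releases dropped the dedicated Jaeger exporter, so traces reach Jaeger over OTLP, which jaegertracing/all-in-one accepts on 4317.

```yaml
# Hedged sketch of an OTel Collector pipeline (check the repo's actual
# collector config). Receives OTLP gRPC on 4317, forwards traces to
# Jaeger via OTLP, and exposes metrics for Prometheus to scrape.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```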

Operational Commands

Logs

# All services
docker compose -f docker-compose.sidecar.yml logs -f

# Firewall only
docker compose -f docker-compose.sidecar.yml logs -f gateway

# Sidecars
docker compose -f docker-compose.sidecar.yml logs -f slm-generation slm-embedding

Restart a Service

docker compose -f docker-compose.sidecar.yml restart gateway

Tear Down (Preserve Model Cache)

docker compose -f docker-compose.sidecar.yml down
# Models persist in the 'isartor-slm-models' volume

Tear Down (Clean Everything)

docker compose -f docker-compose.sidecar.yml down -v
# Removes all volumes including model cache — next start re-downloads models

View Model Cache Size

docker volume inspect isartor-slm-models

Networking Notes

  • All services share a Docker bridge network created by Compose.
  • The firewall references sidecars by Docker service name (slm-generation, slm-embedding), not localhost.
  • Only the firewall (8080), Jaeger UI (16686), Grafana (3000), and Prometheus (9090) are exposed to the host.
  • Sidecar ports (8081, 8082) are also exposed for debugging but can be removed in production by deleting the ports: mapping.

Scaling Within Level 2

Before moving to Level 3, you can scale Level 2 further:

| Optimisation | How |
|---|---|
| More GPU VRAM | Use a higher-precision quantisation (Q8_0 instead of Q4_K_M) for better quality |
| Bigger model | Swap Phi-3-mini for Phi-3-medium or Qwen2-7B in the Compose command |
| More cache | Increase ISARTOR__CACHE_MAX_CAPACITY and ISARTOR__CACHE_TTL_SECS |
| Richer embeddings | Use nomic-embed-text (768-dim) for richer semantic matching |
| More concurrency | Run multiple firewall replicas behind a load balancer |
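For instance, the cache knobs can be raised with a small Compose override file. The filename and values below are illustrative, not recommended defaults:

```yaml
# docker-compose.cache.override.yml — hypothetical override file;
# values are illustrative only.
services:
  gateway:
    environment:
      ISARTOR__CACHE_MAX_CAPACITY: "20000"
      ISARTOR__CACHE_TTL_SECS: "3600"
```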

Upgrading to Level 3

When a single host is no longer sufficient:

  1. Extract the firewall into stateless Kubernetes pods (it's already stateless).
  2. Replace sidecars with an auto-scaling inference pool (vLLM, TGI, or Triton).
  3. Add an internal load balancer between firewall pods and the inference pool.
  4. Move observability to a managed solution (Datadog, Grafana Cloud, Azure Monitor).

See Level 3 — Enterprise Deployment for the full Kubernetes guide.