Architecture Decision Records
Key design decisions, trade-offs, and rationale behind Isartor's architecture.
Each ADR follows a lightweight format: Context → Decision → Consequences.
ADR-001: Multi-Layer Deflection Stack Architecture
Date: 2024 · Status: Accepted
Context
AI Prompt Firewall traffic follows a power-law distribution: the majority of prompts are simple or repetitive, while only a small fraction requires expensive cloud LLMs. Sending all traffic to a single provider wastes tokens and money.
Decision
Implement a sequential Deflection Stack with 4+ layers, each capable of short-circuiting:
- Layer 0 — Operational defense (auth, rate limiting, concurrency control)
- Layer 1 — Semantic + exact cache (zero-cost hits)
- Layer 2 — Local SLM triage (classify intent, execute simple tasks locally)
- Layer 2.5 — Context optimiser (retrieve + rerank to minimise token usage)
- Layer 3 — Cloud LLM fallback (only the hardest prompts)
Layer 2.5 (Context Optimiser):
Retrieves and reranks candidate documents or responses to minimize downstream token usage. Typically implements top-K selection, reranking, or context window optimization before forwarding to the LLM. Instrumented as the context_optimise span in observability.
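The short-circuiting behaviour of the stack can be sketched as a simple trait-based loop. This is an illustrative sketch only (the type and layer names here are hypothetical; the real layers are implemented as Axum middleware, see ADR-008):

```rust
// Hypothetical sketch of the Deflection Stack's short-circuit control flow.
// Each layer either resolves the prompt (short-circuits) or passes it on.

enum LayerOutcome {
    Resolved(String), // layer produced a final response
    Pass,             // fall through to the next layer
}

trait DeflectionLayer {
    fn name(&self) -> &'static str;
    fn handle(&self, prompt: &str) -> LayerOutcome;
}

// Toy stand-in for Layer 1 (exact cache): resolves only known prompts.
struct ExactCache;
impl DeflectionLayer for ExactCache {
    fn name(&self) -> &'static str { "layer1_cache" }
    fn handle(&self, prompt: &str) -> LayerOutcome {
        if prompt == "ping" {
            LayerOutcome::Resolved("pong (cache hit)".into())
        } else {
            LayerOutcome::Pass
        }
    }
}

// Toy stand-in for Layer 3 (cloud fallback): always resolves.
struct CloudFallback;
impl DeflectionLayer for CloudFallback {
    fn name(&self) -> &'static str { "layer3_cloud" }
    fn handle(&self, _prompt: &str) -> LayerOutcome {
        LayerOutcome::Resolved("answer from cloud LLM".into())
    }
}

/// Walk the stack in order; the first layer that resolves wins.
fn run_stack(layers: &[Box<dyn DeflectionLayer>], prompt: &str) -> (String, &'static str) {
    for layer in layers {
        if let LayerOutcome::Resolved(resp) = layer.handle(prompt) {
            return (resp, layer.name());
        }
    }
    unreachable!("the final layer always resolves");
}
```

Because each layer returns early on a hit, cost accrues only for prompts that actually reach the expensive layers.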
Consequences
- Positive: 60–80% of traffic can be resolved before Layer 3, dramatically reducing cost.
- Positive: Each layer adds latency only when needed — cache hits are sub-millisecond.
- Positive: Clear separation of concerns; each layer is independently testable.
- Negative: Deflection Stack adds conceptual complexity vs. a simple reverse proxy.
- Negative: Each layer needs its own error handling and timeout strategy.
ADR-002: Axum + Tokio as Runtime Foundation
Date: 2024 · Status: Accepted
Context
The firewall must handle high concurrency (thousands of simultaneous connections) with low latency overhead. The binary should be small, statically linked, and deployable to minimal environments.
Decision
Use Axum 0.8 on Tokio 1.x for the async HTTP server. Build with --target x86_64-unknown-linux-musl and opt-level = "z" + LTO for a ~5 MB static binary.
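The size-optimised build described above corresponds to a Cargo release profile along these lines (a sketch; the exact profile values beyond opt-level and LTO are assumptions):

```toml
# Cargo.toml — size-optimised release profile (sketch)
[profile.release]
opt-level = "z"   # optimise for binary size rather than speed
lto = true        # link-time optimisation across crate boundaries
```

Combined with `cargo build --release --target x86_64-unknown-linux-musl`, this yields the statically linked binary.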
Consequences
- Positive: Tokio's work-stealing scheduler handles 10K+ concurrent connections efficiently.
- Positive: Axum's type-safe extractors catch errors at compile time.
- Positive: Static musl binary runs in distroless containers (no libc, no shell).
- Negative: Rust's compilation times are longer than Go/Node.js equivalents.
- Negative: Ecosystem is smaller — fewer off-the-shelf middleware components.
ADR-003: Embedded Candle Classifier (Layer 2)
Date: 2024 · Status: Accepted
Context
For minimal deployments (edge, VPS, air-gapped), requiring an external sidecar (llama.cpp, Ollama, TGI) adds operational complexity. Many classification tasks can be handled by a 2B parameter model on CPU.
Decision
Embed a Gemma-2-2B-IT GGUF model directly in the Rust process using the candle framework. The model is loaded on first start via hf-hub (auto-downloaded from Hugging Face) and wrapped in a tokio::sync::Mutex for thread-safe inference on spawn_blocking.
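The locking pattern can be sketched with std-only types. This is a simplified illustration (the real code uses tokio::sync::Mutex with tokio::task::spawn_blocking; the Classifier type and its labels here are hypothetical):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the embedded GGUF model; the real code wraps
// a candle model and invokes it from tokio::task::spawn_blocking.
struct Classifier {
    calls: u32,
}

impl Classifier {
    // Inference needs &mut self, hence the Mutex around the model.
    fn classify(&mut self, prompt: &str) -> &'static str {
        self.calls += 1;
        if prompt.len() < 20 { "simple" } else { "complex" }
    }
}

fn classify_blocking(model: &Arc<Mutex<Classifier>>, prompt: &str) -> &'static str {
    // The lock is held only for the duration of the CPU-bound call,
    // never across an await point.
    let mut guard = model.lock().unwrap();
    guard.classify(prompt)
}
```

The Mutex is what serialises inference (noted as a negative below); swapping in a pool of model instances would lift that limit at the cost of memory.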
Consequences
- Positive: Zero external dependencies for Layer 2 classification — a single binary handles everything.
- Positive: No HTTP overhead for classification calls; inference is an in-process function call.
- Positive: Works in air-gapped environments with pre-cached models.
- Negative: ~1.5 GB memory overhead for the Q4_K_M model weights.
- Negative: CPU inference is slower than GPU (50–200 ms classification, 200–2000 ms generation).
- Negative: Mutex serialises inference calls — throughput is limited to one inference at a time.
- Trade-off: For higher throughput, upgrade to Level 2 (llama.cpp sidecar on GPU).
ADR-004: Three Deployment Tiers
Date: 2024 · Status: Accepted
Context
Isartor targets a wide range of deployments, from a developer's laptop to enterprise Kubernetes clusters. A single deployment model cannot serve all use cases optimally.
Decision
Define three explicit deployment tiers that share the same binary and configuration surface:
| Tier | Strategy | Target |
|---|---|---|
| Level 1 | Monolithic binary, embedded candle | VPS, edge, bare metal |
| Level 2 | Firewall + llama.cpp sidecars | Docker Compose, single host + GPU |
| Level 3 | Stateless pods + inference pools | Kubernetes, Helm, HPA |
The tier is selected purely by environment variables and infrastructure, not by code changes.
Consequences
- Positive: A single codebase and binary serves all deployment scenarios.
- Positive: Users start at Level 1 and upgrade incrementally — no migrations.
- Positive: Clear documentation entry points for each tier.
- Negative: Some config variables are irrelevant at certain tiers (e.g., ISARTOR__LAYER2__SIDECAR_URL is unused at Level 1 with embedded candle).
- Negative: Testing all three tiers requires different infrastructure setups.
ADR-005: llama.cpp as Sidecar (Level 2) Instead of Ollama
Date: 2024 · Status: Accepted
Context
The original design used Ollama (~1.5 GB image) as the local SLM engine. While Ollama has a convenient API and model management, it's heavyweight for a sidecar.
Decision
Replace Ollama with llama.cpp server (ghcr.io/ggml-org/llama.cpp:server, ~30 MB) as the default sidecar in docker-compose.sidecar.yml. Two instances run side by side:
- slm-generation (port 8081) — Phi-3-mini for classification and generation
- slm-embedding (port 8082) — all-MiniLM-L6-v2 with the --embedding flag
Consequences
- Positive: 50× smaller container images (30 MB vs. 1.5 GB).
- Positive: Faster cold starts; no model pull step needed (uses --hf-repo auto-download).
- Positive: OpenAI-compatible API — firewall code doesn't need to change.
- Negative: Ollama's model management UX (pull, list, delete) is lost.
- Negative: Each model needs its own llama.cpp instance (no multi-model serving).
- Migration: Ollama-based Compose files (docker-compose.yml, docker-compose.azure.yml) are retained for backward compatibility.
- Update (ADR-011): The slm-embedding sidecar (port 8082) is now optional. Layer 1 semantic cache embeddings are generated in-process via candle (pure-Rust BertModel).
ADR-006: rig-core for Multi-Provider LLM Client
Date: 2024 · Status: Accepted
Context
Layer 3 must route to multiple cloud LLM providers (OpenAI, Azure OpenAI, Anthropic, xAI). Implementing each provider's API client from scratch would be error-prone and hard to maintain.
Decision
Use rig-core (v0.32.0) as the unified LLM client. Rig provides a consistent CompletionModel abstraction over all supported providers.
Consequences
- Positive: Single configuration surface (ISARTOR__LLM_PROVIDER + ISARTOR__EXTERNAL_LLM_API_KEY) switches providers.
- Positive: Provider-specific quirks (Azure deployment IDs, Anthropic versioning) handled by rig.
- Negative: Adds a dependency; rig's release cadence may not match our needs.
- Negative: Limited to providers rig supports (but covers all major ones).
ADR-007: AIMD Adaptive Concurrency Control
Date: 2024 · Status: Accepted
Context
A fixed concurrency limit either over-provisions (wasting resources) or under-provisions (rejecting requests during traffic spikes). The firewall needs to dynamically adjust its limit based on real-time latency.
Decision
Implement an Additive Increase / Multiplicative Decrease (AIMD) concurrency limiter at Layer 0:
- If P95 latency < target → limit += 1 (additive increase).
- If P95 latency > target → limit *= 0.5 (multiplicative decrease).
- Bounded by configurable min/max concurrency limits.
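The update rule above can be sketched as a small pure function (field names and the bound values are illustrative, not the actual configuration defaults):

```rust
// Sketch of the AIMD update rule from ADR-007. Bounds and target are
// illustrative; the real limiter reads them from configuration.
struct AimdLimiter {
    limit: f64,
    min_limit: f64,
    max_limit: f64,
    target_p95_ms: f64,
}

impl AimdLimiter {
    /// Adjust the concurrency limit based on the observed P95 latency.
    fn observe(&mut self, p95_ms: f64) {
        if p95_ms < self.target_p95_ms {
            self.limit += 1.0; // additive increase: probe for headroom
        } else if p95_ms > self.target_p95_ms {
            self.limit *= 0.5; // multiplicative decrease: back off fast
        }
        self.limit = self.limit.clamp(self.min_limit, self.max_limit);
    }
}
```

The asymmetry (slow probe up, fast back-off) is what makes the limit converge without oscillating into overload.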
Consequences
- Positive: Self-tuning: the limit converges to the optimal value for the current load.
- Positive: Protects downstream services (sidecars, cloud LLMs) from overload.
- Negative: During cold start, the limit starts low and ramps up — initial requests may see 503s.
- Tuning: Target latency must be calibrated per deployment tier.
ADR-008: Unified API Surface
Date: 2024 · Status: Superseded
Context
The original design maintained two API versions: a v1 middleware-based pipeline (/api/chat) and a v2 orchestrator-based pipeline (/api/v2/chat). Maintaining two code paths increased complexity with no clear benefit once the middleware pipeline matured.
Decision
Consolidate into a single endpoint:
- /api/chat — Middleware-based pipeline. Each layer is an Axum middleware (auth → cache → SLM triage → handler).
- The v2 endpoint (/api/v2/chat) and its pipeline_* configuration fields have been removed.
- Orchestrator and trait-based pipeline components remain in src/pipeline/ for potential future reintegration.
Consequences
- Positive: Single code path to maintain, test, and observe.
- Positive: Simplified configuration surface — no more PIPELINE_* env vars.
- Positive: Eliminates user confusion about which endpoint to use.
- Negative: Orchestrator-based features (structured processing_log, explicit PipelineContext) are not exposed until reintegrated.
ADR-009: Distroless Container Image
Date: 2024 · Status: Accepted
Context
The firewall binary is statically linked (musl). The runtime container only needs to execute a single binary.
Decision
Use gcr.io/distroless/static-debian12 as the runtime base image. It contains no shell, no package manager, no libc — only the static binary.
Consequences
- Positive: Minimal attack surface — no shell to exec into, no tools for attackers.
- Positive: Tiny image size (base ~2 MB + binary ~5 MB = ~7 MB total).
- Positive: Passes most container security scanners with zero CVEs.
- Negative: Cannot docker exec into the container for debugging (no shell).
- Negative: Cannot install additional tools at runtime.
- Workaround: Use docker logs, Jaeger traces, and Prometheus metrics for debugging.
ADR-010: OpenTelemetry for Observability
Date: 2024 · Status: Accepted
Context
The firewall needs distributed tracing and metrics. Vendor-specific SDKs (Datadog, New Relic, etc.) create lock-in.
Decision
Use OpenTelemetry (OTLP gRPC) as the sole telemetry interface. Traces and metrics are exported to an OTel Collector, which can forward to any backend (Jaeger, Prometheus, Grafana, Datadog, etc.).
Consequences
- Positive: Vendor-neutral — switch backends by reconfiguring the collector, not the app.
- Positive: OTLP is a CNCF standard with wide ecosystem support.
- Positive: When ISARTOR__ENABLE_MONITORING=false, no OTel SDK is initialised — zero overhead.
- Negative: Requires an OTel Collector as middleware (adds one more service in Level 2/3).
- Negative: Auto-instrumentation is less mature in Rust than in Java/Python.
ADR-011: Pure-Rust Candle for In-Process Sentence Embeddings
| Status | Accepted (superseded: fastembed → candle) |
| Date | 2025-06 (updated 2025-07) |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-005 (llama.cpp sidecar) |
Context
Layer 1 (semantic cache) must generate sentence embeddings for every incoming prompt to compute cosine similarity against the vector cache. Previously, this was done via fastembed (ONNX Runtime, BAAI/bge-small-en-v1.5), which introduced a C++ dependency (onnxruntime-sys) that broke cross-compilation on ARM64 macOS and complicated the build matrix.
Decision
Use candle (candle-core, candle-nn, candle-transformers 0.9) with hf-hub and tokenizers to run sentence-transformers/all-MiniLM-L6-v2 in-process via a pure-Rust BertModel. The model weights (~90 MB) are downloaded once from Hugging Face Hub on first startup and cached in ~/.cache/huggingface/. Inference is invoked through tokio::task::spawn_blocking since BERT forward passes are CPU-bound.
- Model: sentence-transformers/all-MiniLM-L6-v2 — 384-dimensional embeddings, optimised for sentence similarity.
- Runtime: Pure-Rust candle stack — zero C/C++ dependencies, seamless cross-compilation to any rustc target.
- Pooling: Mean pooling with attention mask, followed by L2 normalisation.
- Thread safety: The inner BertModel is wrapped in std::sync::Mutex because forward() takes &mut self. This is acceptable because inference is always called from spawn_blocking, never holding the lock across .await points.
- Architecture: TextEmbedder is initialised once at startup, stored as Arc<TextEmbedder> in AppState, and injected into the cache middleware.
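The pooling step above can be illustrated over plain f32 slices. This is a didactic sketch of the arithmetic only; the real implementation performs the same operations on candle tensors:

```rust
/// Mean pooling with attention mask followed by L2 normalisation,
/// written over plain f32 vectors for illustration (the real code
/// uses candle tensors). `token_embeddings` holds one vector per token.
fn mean_pool_normalise(token_embeddings: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (emb, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 1 {
            // Sum only real (unmasked) tokens; padding is ignored.
            for (p, &v) in pooled.iter_mut().zip(emb) {
                *p += v;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count.max(1.0); // mean over unmasked tokens
    }
    // L2 normalisation so cosine similarity reduces to a dot product.
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-12);
    pooled.iter_mut().for_each(|v| *v /= norm);
    pooled
}
```

Normalising here is what lets the semantic cache compare embeddings with a plain dot product on the hot path.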
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| fastembed (ONNX Runtime) | C++ dependency (onnxruntime-sys) breaks ARM64 cross-compilation; ~5 MB shared library |
| llama.cpp sidecar (all-MiniLM-L6-v2) | Network round-trip on hot path, extra container to manage |
| sentence-transformers (Python) | Crosses FFI boundary, adds Python runtime dependency |
| ort (raw ONNX Runtime bindings) | Same C++ dependency problem as fastembed |
Consequences
- Positive: Eliminates ~2–5 ms network latency per embedding call on the cache hot path.
- Positive: Zero C/C++ dependencies — cargo build works on any platform without cmake or pre-built binaries.
- Positive: Zero sidecar dependency for Level 1 — the minimal Dockerfile runs self-contained.
- Positive: Model weights are auto-downloaded from Hugging Face Hub; reproducible builds.
- Negative: First startup downloads model weights (~90 MB) if not pre-cached.
- Negative: Mutex serialises concurrent embedding calls within a single process (acceptable at current scale; can be replaced with a pool of models if needed).
ADR-012: Pluggable Trait Provider (Hexagonal Architecture)
| Status | Accepted |
| Date | 2025-06 |
| Deciders | Core team |
| Relates to | ADR-003 (Embedded Candle), ADR-004 (Three Deployment Tiers) |
Context
As Isartor grew from a single-process binary (Level 1) to a multi-tier deployment (Level 1 → 2 → 3), the cache and SLM router components became tightly coupled to their in-process implementations. Scaling to Level 3 (Kubernetes, multiple replicas) requires:
- Shared cache — in-process LRU caches are isolated per pod; cache hits are inconsistent, duplicating work.
- GPU-backed inference — in-process Candle inference is CPU-bound; Level 3 needs a dedicated GPU inference pool (vLLM / TGI) that can scale independently.
Hard-coding these choices into the firewall binary would require compile-time feature flags or code branching, making the binary non-portable across tiers.
Decision
Adopt the Ports & Adapters (Hexagonal Architecture) pattern:
- Ports (src/core/ports.rs) — Define ExactCache and SlmRouter as async_trait traits (Send + Sync), representing the interfaces the firewall depends on.
- Adapters (src/adapters/) — Provide concrete implementations: InMemoryCache (ahash + LRU + parking_lot) and RedisExactCache for ExactCache; EmbeddedCandleRouter and RemoteVllmRouter for SlmRouter.
- Factory (src/factory.rs) — build_exact_cache(&config) and build_slm_router(&config, &http_client) read AppConfig.cache_backend and AppConfig.router_backend at startup and return the appropriate Box<dyn Trait>.
- Configuration (src/config.rs) — CacheBackend enum (Memory | Redis) and RouterBackend enum (Embedded | Vllm) with associated connection URLs, selectable via ISARTOR__CACHE_BACKEND and ISARTOR__ROUTER_BACKEND env vars.
The same binary serves all three deployment tiers; the runtime behaviour is entirely configuration-driven.
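A condensed sketch of the port → adapter → factory wiring. It is simplified in two ways flagged in the comments: the real traits are async (async_trait), and the Redis adapter is stubbed here to keep the sketch self-contained:

```rust
// Condensed sketch of the ports-and-adapters wiring. The real ExactCache
// trait is async (async_trait); this sync version shows the shape only.
trait ExactCache: Send + Sync {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&mut self, key: &str, value: String);
}

// Adapter: in-process cache (the real one uses ahash + LRU + parking_lot).
#[derive(Default)]
struct InMemoryCache {
    entries: std::collections::HashMap<String, String>,
}

impl ExactCache for InMemoryCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: String) {
        self.entries.insert(key.to_string(), value);
    }
}

// Selected via ISARTOR__CACHE_BACKEND in the real configuration.
enum CacheBackend {
    Memory,
    Redis,
}

/// Factory: same binary, backend chosen at startup from config.
fn build_exact_cache(backend: &CacheBackend) -> Box<dyn ExactCache> {
    match backend {
        CacheBackend::Memory => Box::new(InMemoryCache::default()),
        // The real Redis adapter takes a connection URL; stubbed with the
        // in-memory adapter here so the sketch compiles standalone.
        CacheBackend::Redis => Box::new(InMemoryCache::default()),
    }
}
```

Call sites depend only on Box<dyn ExactCache>, so swapping Memory for Redis is purely a configuration change, which is exactly what keeps the binary tier-agnostic.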
Alternatives Considered
| Alternative | Why rejected |
|---|---|
| Compile-time feature flags (#[cfg(feature = "redis")]) | Produces different binaries per tier; complicates CI and container builds |
| Service mesh sidecar (Envoy filter for caching) | Adds infrastructure complexity; cache logic is domain-specific |
| Plugin system (dynamic .so loading) | Over-engineered; dyn Trait with compile-time-known variants is simpler |
| Runtime scripting (Lua / Wasm policy) | Unnecessary indirection; Rust trait dispatch is zero-cost |
Consequences
- Positive: One binary, all tiers — only env vars change between Level 1 (embedded everything) and Level 3 (Redis + vLLM).
- Positive: Horizontal scalability — with cache_backend=redis, all pods share the same cache; with router_backend=vllm, GPU inference scales independently.
- Positive: Testability — unit tests inject mock adapters via the trait interface.
- Positive: Extensibility — adding a new backend (e.g., Memcached, Triton) requires only a new adapter implementing the trait.
- Negative: Minor runtime overhead from dyn Trait dynamic dispatch (single vtable lookup per call — negligible vs. network I/O).
- Negative: EmbeddedCandleRouter remains a skeleton; full candle-based classification requires the embedded-inference feature flag to be completed.