Performance Tuning

How to measure, tune, and operate Isartor for maximum deflection and minimum latency.


Table of Contents

  1. Understanding Deflection
  2. Measuring Deflection Rate
  3. Tuning Configuration for Deflection
  4. Tuning Latency
  5. Memory & Resource Tuning
  6. Cache Tuning Deep-Dive
  7. SLM Router Tuning
  8. Embedder Tuning
  9. SLO / SLA Goal Templates
  10. Scenario-Based Tuning Recipes
  11. PromQL Cheat Sheet

Understanding Deflection

Deflection = the percentage of requests resolved before Layer 3 (the external cloud LLM). A request is "deflected" if it is served by:

| Layer | Mechanism | Cost |
|---|---|---|
| L1a — Exact Cache | SHA-256 hash match | $0 |
| L1b — Semantic Cache | Cosine similarity match | $0 |
| L2 — SLM Triage | Local SLM classifies requests as TEMPLATE, SNIPPET, or COMPLEX (tiered mode) and answers TEMPLATE/SNIPPET locally | $0 |

The deflection rate directly maps to cost savings. A 70 % deflection rate means only 30 % of requests reach the paid cloud LLM.
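The arithmetic can be sketched in a few lines of Python; the request volume and per-call price below are illustrative assumptions, not measured Isartor figures:

```python
def monthly_llm_cost(requests, deflection_rate, cost_per_call):
    """Only requests that fall through to L3 (the cloud LLM) incur cost."""
    return requests * (1 - deflection_rate) * cost_per_call

# Hypothetical workload: 1M requests/month at $0.002 per cloud call.
baseline = monthly_llm_cost(1_000_000, 0.0, 0.002)       # no firewall
with_firewall = monthly_llm_cost(1_000_000, 0.7, 0.002)  # 70 % deflection
print(baseline, with_firewall)
```

At these assumed numbers, 70 % deflection cuts the cloud bill from $2,000 to $600 per month.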


Measuring Deflection Rate

Via Prometheus / Grafana

The gateway emits isartor_requests_total with a final_layer label. Use the following PromQL to compute the deflection rate:

# Overall deflection rate (last 1 hour)
1 - (
  sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
  /
  sum(increase(isartor_requests_total[1h]))
)
# Deflection rate by layer (pie chart)
sum by (final_layer) (rate(isartor_requests_total[5m]))
# Exact-cache deflection only
sum(increase(isartor_requests_total{final_layer="L1a_ExactCache"}[1h]))
/
sum(increase(isartor_requests_total[1h]))

Via the API

Send a test batch and count response layer values:

# Send 100 identical requests — expect 99 cache hits
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/api/chat \
    -H "Content-Type: application/json" \
    -H "X-API-Key: $ISARTOR_API_KEY" \
    -d '{"prompt": "What is the capital of France?"}' \
  | jq '.layer'
done | sort | uniq -c

Expected output (ideal):

  99 1       ← remaining 99 → exact cache
   1 3       ← first request → cloud

(`sort | uniq -c` lists the layer values in sorted order, so the 99 cache hits appear first.)

Via Structured Logs

When ISARTOR__ENABLE_MONITORING=true, every request logs the final layer:

# grep JSON logs for final-layer distribution
cat logs.json | jq '.isartor.final_layer' | sort | uniq -c

Via Jaeger / Tempo

Filter traces by the isartor.final_layer tag:

| Goal | Search |
|---|---|
| All cache hits | Tag isartor.final_layer=L1a_ExactCache or L1b_SemanticCache |
| SLM resolutions | Tag isartor.final_layer=L2_SLM |
| Cloud fallbacks | Tag isartor.final_layer=L3_Cloud |

Tuning Configuration for Deflection

Cache Mode

| Variable | Values | Recommended |
|---|---|---|
| ISARTOR__CACHE_MODE | exact, semantic, both | both (default) |
  • exact — Only identical prompts hit. Good for deterministic agent loops.
  • semantic — Catches paraphrases ("Price?" ≈ "Cost?"). Higher hit rate but adds ~1–5 ms embedding cost.
  • both — Exact check first (< 1 ms), then semantic if no exact hit. Best of both worlds.

Similarity Threshold

| Variable | Default | Range |
|---|---|---|
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | 0.0–1.0 |

| Value | Effect |
|---|---|
| 0.95 | Very strict — only near-identical prompts match. Low false positives, lower hit rate. |
| 0.85 | Balanced — catches common paraphrases. Recommended starting point. |
| 0.75 | Aggressive — higher hit rate but risk of returning wrong cached answers. |
| 0.60 | Dangerous — high false-positive rate. Not recommended for production. |

How to tune:

  1. Set ISARTOR__ENABLE_MONITORING=true.
  2. Send representative traffic for 1 hour.
  3. In Jaeger, search for cosine_similarity attribute on l1b_semantic_cache_search spans.
  4. Plot the distribution. If most similarity scores cluster between 0.80–0.90, a threshold of 0.85 is good.
  5. If you see many scores at 0.82–0.84 that should be hits, lower to 0.80.
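The steps above can be automated once the cosine_similarity scores are exported from Jaeger. The helper below is a hypothetical sketch, not part of Isartor: it picks the highest threshold that would still have matched a chosen fraction of the observed lookups.

```python
def suggest_threshold(scores, target_hit_rate=0.5):
    """Given observed cosine_similarity scores, return the highest
    threshold that still yields the target hit rate."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * target_hit_rate))
    return round(ranked[k - 1], 2)

# Hypothetical sample of scores exported from Jaeger:
sample = [0.97, 0.91, 0.88, 0.86, 0.84, 0.83, 0.79, 0.72, 0.65, 0.51]
print(suggest_threshold(sample, target_hit_rate=0.5))
```

For this sample, a threshold of 0.84 would turn half of the observed lookups into hits; sanity-check a handful of matched prompt pairs at that level before lowering the production setting.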

Cache TTL

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_TTL_SECS | 300 (5 min) | Time-to-live for cached responses |
  • Short TTL (60–120 s): Good for rapidly changing data, real-time dashboards.
  • Medium TTL (300–600 s): Balanced for most workloads.
  • Long TTL (1800+ s): Maximises deflection for static Q&A / documentation bots.

Cache Capacity

| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries in each cache (LRU eviction) |
  • Monitor eviction rate via cache.evicted span attribute on l1b_semantic_cache_insert.
  • If eviction rate > 5 % of inserts, increase capacity or shorten TTL.
  • Each cache entry ≈ 2–4 KB (prompt hash + response + optional 384-dim vector).
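The 2–4 KB/entry rule of thumb makes capacity planning simple arithmetic; the sizes below are assumptions for illustration, so measure on real traffic before committing to a limit:

```python
def cache_memory_mb(entries, avg_response_bytes=2048, vector_dims=384):
    """Rough semantic-cache footprint: response text plus one
    float32 embedding (4 bytes per dimension) per entry."""
    per_entry = avg_response_bytes + vector_dims * 4
    return entries * per_entry / 1_000_000

print(cache_memory_mb(10_000))  # default capacity
print(cache_memory_mb(50_000))  # e.g. a large agent-loop cache
```

At the default 10K capacity this lands around 36 MB, comfortably inside the ~30–60 MB range quoted in the memory budget below.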

Tuning Latency

Target Latencies by Layer

| Layer | Target (p95) | Typical Range |
|---|---|---|
| L1a — Exact Cache | < 1 ms | 0.1–0.5 ms |
| L1b — Semantic Cache | < 10 ms | 1–5 ms |
| L2 — SLM Triage | < 300 ms | 50–200 ms (embedded), 100–500 ms (sidecar) |
| L3 — Cloud LLM | < 3 s | 500 ms – 5 s (network-bound) |

Measure with PromQL

# P95 latency by layer
histogram_quantile(0.95,
  sum by (le, layer_name) (
    rate(isartor_layer_duration_seconds_bucket[5m])
  )
)
# P95 end-to-end latency
histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))

Reducing Latency

| Bottleneck | Symptom | Fix |
|---|---|---|
| Embedding | L1b > 10 ms | Use a lighter model or increase CPU allocation |
| SLM inference | L2 > 500 ms | Use quantised model (Q4_K_M GGUF), switch to embedded engine |
| Redis | L1a > 5 ms | Check network latency, use Redis cluster with read replicas |
| Cloud LLM | L3 > 5 s | Switch provider, use a smaller model, enable request timeout |

Memory & Resource Tuning

Memory Budget

| Component | Memory Usage | Notes |
|---|---|---|
| Exact cache (in-memory, 10K entries) | ~20–40 MB | Scales linearly with cache_max_capacity |
| Semantic cache (in-memory, 10K entries) | ~30–60 MB | 384-dim float32 vectors + response strings |
| candle embedder (all-MiniLM-L6-v2) | ~90 MB | Loaded at startup, constant |
| Candle GGUF model (embedded SLM) | ~1–4 GB | Depends on model quantisation |
| Tokio runtime | ~10–20 MB | Async task pool |
| Total (minimalist mode) | ~150–200 MB | No embedded SLM |
| Total (embedded mode) | ~1.5–4.5 GB | With embedded Candle SLM |

CPU Considerations

  • Embedding generation runs on spawn_blocking (dedicated thread pool).
  • Candle GGUF inference is CPU-bound; allocate ≥ 4 cores for embedded mode.
  • The Tokio async runtime uses the default thread count (num_cpus).

Container Limits

# docker-compose example
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 512M    # minimalist mode
          cpus: "2"
        # For embedded SLM mode:
        # limits:
        #   memory: 4G
        #   cpus: "4"

Cache Tuning Deep-Dive

Exact vs. Semantic Cache Hit Analysis

# Exact cache hit rate
sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))

# Semantic cache hit rate
sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))

Cache Backend: Memory vs. Redis

| Factor | In-Memory | Redis |
|---|---|---|
| Latency | ~0.1 ms | ~1–5 ms (network hop) |
| Capacity | Limited by process RAM | Limited by Redis memory |
| Multi-replica | ❌ No sharing | ✅ Shared across pods |
| Persistence | ❌ Lost on restart | ✅ Optional AOF/RDB |
| Recommended for | Single-instance, dev, edge | K8s, multi-replica, production |

Switch with:

export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__REDIS_URL=redis://redis.svc:6379

When to Disable Semantic Cache

  • Traffic is 100 % deterministic (exact same prompts repeated).
  • Embedding overhead is unacceptable (< 1 ms budget).
  • Set ISARTOR__CACHE_MODE=exact.

SLM Router Tuning

Embedded vs. Sidecar

| Mode | Variable | Latency | Resource Usage |
|---|---|---|---|
| Embedded (Candle) | ISARTOR__INFERENCE_ENGINE=embedded | 50–200 ms | High CPU, 1–4 GB RAM |
| Sidecar (llama.cpp) | ISARTOR__INFERENCE_ENGINE=sidecar | 100–500 ms | Separate process, GPU optional |
| Remote (vLLM/TGI) | ISARTOR__ROUTER_BACKEND=vllm | 100–500 ms | Separate server, GPU recommended |

Model Selection

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| Phi-3-mini (Q4_K_M) | ~2 GB | Fast | Good |
| Gemma-2-2B-IT (Q4) | ~1.5 GB | Very fast | Good |
| Qwen-1.5-1.8B (Q4) | ~1.2 GB | Fastest | Adequate |
| Llama-3-8B (Q4) | ~4.5 GB | Slower | Best |

For intent classification (TEMPLATE/SNIPPET/COMPLEX in tiered mode, or SIMPLE/COMPLEX in legacy binary mode), smaller models (1–3 B params) are sufficient. Use the smallest model that meets your accuracy needs.

Tuning the Classification Prompt

The system prompt in src/middleware/slm_triage.rs determines classification accuracy. If too many COMPLEX requests are misclassified as TEMPLATE or SNIPPET (resulting in bad local answers), consider:

  1. Making the system prompt more specific to your domain.
  2. Adding examples to the prompt (few-shot).
  3. Switching to a larger model.
  4. Setting ISARTOR__LAYER2__MAX_ANSWER_TOKENS to allow longer SLM responses (default 2048).
  5. Falling back to binary mode via ISARTOR__LAYER2__CLASSIFIER_MODE=binary if the three-tier split does not suit your workload.
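As a sketch of the few-shot approach from step 2, a classifier prompt might be structured like the following. This is purely illustrative: the label definitions and example requests are assumptions, and the real prompt lives in src/middleware/slm_triage.rs.

```python
# Hypothetical few-shot prompt for the tiered classifier.
# Label meanings and examples are illustrative assumptions.
FEW_SHOT_PROMPT = """\
Classify the user request as TEMPLATE, SNIPPET, or COMPLEX.

TEMPLATE - boilerplate answerable from a fixed pattern
SNIPPET  - short factual lookup or extraction
COMPLEX  - multi-step reasoning; escalate to the cloud LLM

Request: "What is your refund policy?"
Label: TEMPLATE
Request: "Extract the port number from this config snippet."
Label: SNIPPET
Request: "Refactor this module to use async I/O and explain the trade-offs."
Label: COMPLEX

Request: "{user_prompt}"
Label:"""

def build_prompt(user_prompt):
    """Insert the incoming request into the few-shot template."""
    return FEW_SHOT_PROMPT.format(user_prompt=user_prompt)
```

Domain-specific examples in place of the generic ones above typically matter more than adding additional labels.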

Embedder Tuning

In-Process (candle)

The default embedder uses candle with sentence-transformers/all-MiniLM-L6-v2 (pure-Rust BertModel):

  • 384-dimensional vectors
  • ~90 MB model footprint
  • 1–5 ms per embedding (CPU)
  • Runs on spawn_blocking to avoid starving the Tokio runtime
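The matching logic those vectors feed into can be sketched in a few lines. This is a minimal illustration of threshold-gated cosine lookup, not Isartor's actual implementation, and it uses toy 3-dim vectors in place of the real 384-dim embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_lookup(query_vec, cache, threshold=0.85):
    """cache: list of (embedding, response) pairs. Return the best
    response above the threshold, or None (miss -> fall through to L2/L3)."""
    if not cache:
        return None
    best_vec, best_resp = max(cache, key=lambda e: cosine(query_vec, e[0]))
    return best_resp if cosine(query_vec, best_vec) >= threshold else None

# Toy 3-dim embeddings stand in for the real 384-dim vectors:
cache = [([1.0, 0.0, 0.0], "answer A"), ([0.0, 1.0, 0.0], "answer B")]
print(semantic_lookup([0.9, 0.1, 0.0], cache))  # near "answer A"'s vector: hit
print(semantic_lookup([0.5, 0.5, 0.7], cache))  # far from both: miss
```

The threshold gate is why ISARTOR__SIMILARITY_THRESHOLD trades hit rate against wrong-answer risk: everything below it is treated as a miss.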

Sidecar Embedder

For higher throughput or GPU acceleration:

export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
export ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME=all-minilm
export ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS=10

Embedding Model Selection

| Model | Dims | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fastest | Good |
| bge-small-en-v1.5 | 384 | Fast | Better |
| bge-base-en-v1.5 | 768 | Moderate | Best |

Use 384-dim models for production. 768-dim models double memory usage for marginal quality improvement in most use cases.
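The doubling is direct arithmetic on the vector store (float32, 4 bytes per dimension), ignoring response strings and bookkeeping:

```python
def vector_store_mb(entries, dims):
    """Memory for the embedding vectors alone, float32 (4 bytes/dim)."""
    return entries * dims * 4 / 1_000_000

print(vector_store_mb(10_000, 384))  # 384-dim models
print(vector_store_mb(10_000, 768))  # 768-dim models: exactly double
```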


SLO / SLA Goal Templates

Developer / Internal SLO

| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.5 % | up{job="isartor"} over 30-day window |
| P95 latency (cache hit) | < 10 ms | histogram_quantile(0.95, ...) on L1 |
| P95 latency (end-to-end) | < 3 s | histogram_quantile(0.95, ...) on all |
| Deflection rate | > 50 % | 1 - (L3 / total) over 24 h |
| Error rate | < 1 % | rate(isartor_requests_total{http_status=~"5.."}[5m]) |

Production / Enterprise SLO

| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9 % | Multi-replica, health check monitoring |
| P95 latency (cache hit) | < 5 ms | Requires Redis or fast in-memory |
| P95 latency (end-to-end) | < 2 s | Optimised models, provider SLAs |
| P99 latency (end-to-end) | < 5 s | Tail latency budget |
| Deflection rate | > 70 % | Tuned thresholds + warm cache |
| Error rate | < 0.1 % | Circuit breakers, retries |
| Token savings | > 60 % | isartor_tokens_saved_total vs estimated total |

SLA Template (for downstream consumers)

## Isartor Prompt Firewall SLA

**Availability:** 99.9 % monthly uptime (< 43.8 min downtime/month)
**Latency:** P95 end-to-end < 2 seconds
**Error Budget:** 0.1 % of requests may return 5xx
**Maintenance Window:** Sundays 02:00–04:00 UTC (excluded from SLA)

### Remediation
- Cache tier failure: automatic fallback to cloud LLM (degraded mode)
- SLM failure: automatic fallback to cloud LLM (degraded mode)
- Cloud LLM failure: 502 Bad Gateway returned, retry recommended

### Monitoring
- Health endpoint: GET /healthz
- Metrics endpoint: Prometheus scrape via OTel Collector on port 8889
- Dashboard: Grafana at http://<grafana-host>:3000
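The availability figures in the template translate into concrete downtime budgets; a small helper (assuming a ~730-hour month) makes the arithmetic explicit:

```python
def error_budget_minutes(availability, period_hours=730):
    """Allowed downtime per period. 730 h is roughly one month."""
    return (1 - availability) * period_hours * 60

print(error_budget_minutes(0.999))  # the 99.9 % SLA above
print(error_budget_minutes(0.995))  # the 99.5 % internal SLO
```

99.9 % leaves about 43.8 minutes of downtime per month, which is why the SLA carves maintenance windows out of the measurement.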

Alert Rules (Prometheus)

groups:
  - name: isartor-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(isartor_requests_total{http_status=~"5.."}[5m]))
          /
          sum(rate(isartor_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Isartor error rate exceeds 1%"

      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
          > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Isartor P95 latency exceeds 3 seconds"

      - alert: LowDeflectionRate
        expr: |
          1 - (
            sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
            /
            sum(rate(isartor_requests_total[1h]))
          ) < 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Isartor deflection rate below 50%"

      - alert: FirewallDown
        expr: up{job="isartor"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Isartor gateway is down"

Scenario-Based Tuning Recipes

Scenario A: Agentic Loop (High-Volume Identical Prompts)

Profile: Autonomous agent sends the same prompt hundreds of times per minute.

ISARTOR__CACHE_MODE=exact           # Semantic unnecessary for identical prompts
ISARTOR__CACHE_TTL_SECS=3600       # Long TTL — agent prompts are stable
ISARTOR__CACHE_MAX_CAPACITY=50000  # Large cache for many unique prompts

Expected deflection: 95–99 % (after warm-up).

Scenario B: Customer Support Bot (Paraphrased Questions)

Profile: End users ask the same questions in different ways.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.80  # Lower threshold to catch paraphrases
ISARTOR__CACHE_TTL_SECS=1800       # 30 min — support answers change slowly
ISARTOR__CACHE_MAX_CAPACITY=10000

Expected deflection: 60–80 %.

Scenario C: Code Generation (Low Cache Hit Rate)

Profile: Developers ask unique, complex coding questions.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.92  # High threshold — wrong cached code is costly
ISARTOR__CACHE_TTL_SECS=600        # Short TTL — code context changes quickly
ISARTOR__INFERENCE_ENGINE=embedded   # Let SLM handle simple code questions

Expected deflection: 20–40 % (SLM handles simple extraction).

Scenario D: RAG Pipeline (Document Q&A)

Profile: Queries against a knowledge base; similar questions are common.

ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.83  # Moderate threshold
ISARTOR__CACHE_TTL_SECS=3600       # Documents change infrequently
ISARTOR__CACHE_MAX_CAPACITY=20000  # Large cache for document variation

Expected deflection: 50–70 %.

Scenario E: Multi-Replica Kubernetes

Profile: Horizontally scaled behind a load balancer.

ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.85

Benefit: All replicas share the same cache → deflection rate applies cluster-wide.


PromQL Cheat Sheet

| What | Query |
|---|---|
| Deflection rate (1 h) | 1 - (sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h])) / sum(increase(isartor_requests_total[1h]))) |
| Request rate | rate(isartor_requests_total[5m]) |
| Request rate by layer | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| P50 latency | histogram_quantile(0.50, rate(isartor_request_duration_seconds_bucket[5m])) |
| P95 latency | histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m])) |
| P99 latency | histogram_quantile(0.99, rate(isartor_request_duration_seconds_bucket[5m])) |
| Per-layer P95 | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens saved (daily) | sum(increase(isartor_tokens_saved_total[24h])) |
| Tokens saved by layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Est. daily cost savings ($0.01/1K tok) | sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01 |
| Error rate | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (exact) | sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (semantic) | sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |

See also: Metrics & Tracing · Configuration Reference · Troubleshooting