Performance Tuning
How to measure, tune, and operate Isartor for maximum deflection and minimum latency.
Table of Contents
- Understanding Deflection
- Measuring Deflection Rate
- Tuning Configuration for Deflection
- Tuning Latency
- Memory & Resource Tuning
- Cache Tuning Deep-Dive
- SLM Router Tuning
- Embedder Tuning
- SLO / SLA Goal Templates
- Scenario-Based Tuning Recipes
- PromQL Cheat Sheet
Understanding Deflection
Deflection = the percentage of requests resolved before Layer 3 (the external cloud LLM). A request is "deflected" if it is served by:
| Layer | Mechanism | Cost |
|---|---|---|
| L1a — Exact Cache | SHA-256 hash match | $0 |
| L1b — Semantic Cache | Cosine similarity match | $0 |
| L2 — SLM Triage | Local SLM classifies requests as TEMPLATE, SNIPPET, or COMPLEX (tiered mode) and answers TEMPLATE/SNIPPET locally | $0 |
The deflection rate directly maps to cost savings. A 70 % deflection rate means only 30 % of requests reach the paid cloud LLM.
Measuring Deflection Rate
Via Prometheus / Grafana
The gateway emits isartor_requests_total with a final_layer label.
Use the following PromQL to compute the deflection rate:
# Overall deflection rate (last 1 hour)
1 - (
sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
)
# Deflection rate by layer (pie chart)
sum by (final_layer) (rate(isartor_requests_total[5m]))
# Exact-cache deflection only
sum(increase(isartor_requests_total{final_layer="L1a_ExactCache"}[1h]))
/
sum(increase(isartor_requests_total[1h]))
Via the API
Send a test batch and count response layer values:
# Send 100 identical requests — expect 99 cache hits
for i in $(seq 1 100); do
curl -s -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-H "X-API-Key: $ISARTOR_API_KEY" \
-d '{"prompt": "What is the capital of France?"}' \
| jq '.layer'
done | sort | uniq -c
Expected output (ideal):
99 1 ← remaining → exact cache
1 3 ← first request → cloud
Via Structured Logs
When ISARTOR__ENABLE_MONITORING=true, every request logs the final layer:
# Count requests per final layer in the JSON logs
jq '.isartor.final_layer' logs.json | sort | uniq -c
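The same count can be turned into a deflection rate in a few lines of Python. This is a minimal sketch, assuming one JSON object per log line with the layer nested under `isartor.final_layer` (as the jq path above implies); the function names and sample records are illustrative and not part of Isartor.

```python
import json
from collections import Counter


def layer_distribution(lines):
    """Count requests per final layer from an iterable of JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        isartor = record.get("isartor") if isinstance(record, dict) else None
        if isinstance(isartor, dict) and isartor.get("final_layer"):
            counts[isartor["final_layer"]] += 1
    return counts


def deflection_rate(counts, cloud_layer="L3_Cloud"):
    """Share of requests resolved before the cloud layer: 1 - (L3 / total)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - counts.get(cloud_layer, 0) / total


# Hypothetical sample records; real log lines will carry more fields.
sample_lines = [
    '{"isartor": {"final_layer": "L3_Cloud"}}',
    '{"isartor": {"final_layer": "L1a_ExactCache"}}',
    '{"isartor": {"final_layer": "L1a_ExactCache"}}',
    '{"isartor": {"final_layer": "L2_SLM"}}',
    'not json',
]
counts = layer_distribution(sample_lines)
print(dict(counts), deflection_rate(counts))
```

To run it against a real file, pass the open file handle instead of `sample_lines`, for example `layer_distribution(open("logs.json"))`.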
Via Jaeger / Tempo
Filter traces by the isartor.final_layer tag:
| Goal | Search |
|---|---|
| All cache hits | Tag isartor.final_layer=L1a_ExactCache or L1b_SemanticCache |
| SLM resolutions | Tag isartor.final_layer=L2_SLM |
| Cloud fallbacks | Tag isartor.final_layer=L3_Cloud |
Tuning Configuration for Deflection
Cache Mode
| Variable | Values | Recommended |
|---|---|---|
| ISARTOR__CACHE_MODE | exact, semantic, both | both (default) |
- exact: Only identical prompts hit. Good for deterministic agent loops.
- semantic: Catches paraphrases ("Price?" ≈ "Cost?"). Higher hit rate but adds ~1–5 ms embedding cost.
- both: Exact check first (< 1 ms), then semantic if no exact hit. Best of both worlds.
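The lookup order in `both` mode can be pictured with a small sketch. This is an illustration of the idea only, not Isartor's implementation: the `TwoTierCache` class, the toy embedding (which maps synonyms to the same dimension), and the linear scan are all assumptions made for the example. A real deployment uses a sentence-embedding model.

```python
import hashlib
import math
import re

# Toy embedding for illustration only: synonyms share a dimension.
_TOPICS = {"price": 0, "cost": 0, "refund": 1}


def toy_embed(prompt):
    vec = [0.0, 0.0]
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in _TOPICS:
            vec[_TOPICS[word]] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def _key(prompt):
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class TwoTierCache:
    def __init__(self, embed, mode="both", threshold=0.85):
        self.embed, self.mode, self.threshold = embed, mode, threshold
        self.exact = {}     # SHA-256 hex digest -> response
        self.semantic = []  # (embedding, response) pairs

    def put(self, prompt, response):
        if self.mode in ("exact", "both"):
            self.exact[_key(prompt)] = response
        if self.mode in ("semantic", "both"):
            self.semantic.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return (layer, response), or (None, None) on a miss."""
        if self.mode in ("exact", "both"):
            key = _key(prompt)
            if key in self.exact:
                return "L1a_ExactCache", self.exact[key]
        if self.mode in ("semantic", "both"):
            query = self.embed(prompt)
            best = max(self.semantic,
                       key=lambda item: cosine(query, item[0]),
                       default=None)
            if best is not None and cosine(query, best[0]) >= self.threshold:
                return "L1b_SemanticCache", best[1]
        return None, None


cache = TwoTierCache(toy_embed)
cache.put("What is the price?", "42 EUR")
print(cache.get("What is the price?"))      # identical prompt
print(cache.get("What is the cost?"))       # paraphrase
print(cache.get("How do I get a refund?"))  # unrelated prompt
```

The exact check runs first because a hash lookup is cheap; the embedding is computed only when that check misses.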
Similarity Threshold
| Variable | Default | Range |
|---|---|---|
| ISARTOR__SIMILARITY_THRESHOLD | 0.85 | 0.0–1.0 |
| Value | Effect |
|---|---|
| 0.95 | Very strict: only near-identical prompts match. Low false positives, lower hit rate. |
| 0.85 | Balanced: catches common paraphrases. Recommended starting point. |
| 0.75 | Aggressive: higher hit rate but risk of returning wrong cached answers. |
| 0.60 | Dangerous: high false-positive rate. Not recommended for production. |
How to tune:
- Set ISARTOR__ENABLE_MONITORING=true.
- Send representative traffic for 1 hour.
- In Jaeger, search for the cosine_similarity attribute on l1b_semantic_cache_search spans.
- Plot the distribution. If most similarity scores cluster between 0.80–0.90, a threshold of 0.85 is good.
- If you see many scores at 0.82–0.84 that should be hits, lower to 0.80.
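Once the similarity scores are exported, the effect of each candidate threshold can be compared directly. The sketch below assumes the scores have already been collected into a plain list of floats; the function name and the sample values are hypothetical.

```python
def hit_rate_by_threshold(scores, thresholds=(0.95, 0.85, 0.80, 0.75)):
    """Fraction of lookups whose best similarity would count as a hit."""
    if not scores:
        return {t: 0.0 for t in thresholds}
    return {t: sum(s >= t for s in scores) / len(scores) for t in thresholds}


# Hypothetical cosine_similarity values taken from semantic-cache search spans.
scores = [0.97, 0.91, 0.88, 0.84, 0.83, 0.82, 0.71, 0.55]

for threshold, rate in hit_rate_by_threshold(scores).items():
    print(f"threshold {threshold:.2f}: {rate:.1%} of lookups would hit")
```

This only shows how many lookups would hit. Whether those hits return the right answer still needs a manual review of sample pairs near the threshold.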
Cache TTL
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_TTL_SECS | 300 (5 min) | Time-to-live for cached responses |
- Short TTL (60–120 s): Good for rapidly changing data, real-time dashboards.
- Medium TTL (300–600 s): Balanced for most workloads.
- Long TTL (1800+ s): Maximises deflection for static Q&A / documentation bots.
Cache Capacity
| Variable | Default | Description |
|---|---|---|
| ISARTOR__CACHE_MAX_CAPACITY | 10000 | Max entries in each cache (LRU eviction) |
- Monitor eviction rate via the cache.evicted span attribute on l1b_semantic_cache_insert.
- If eviction rate > 5 % of inserts, increase capacity or shorten TTL.
- Each cache entry ≈ 2–4 KB (prompt hash + response + optional 384-dim vector).
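The 5 % rule can be checked against exported spans with a short script. This is a sketch under assumptions: spans are represented as dictionaries with a `name` and an `attributes` map, and `cache.evicted` is taken to be a boolean. The actual export format depends on your tracing backend.

```python
def eviction_rate(spans, span_name="l1b_semantic_cache_insert"):
    """Share of insert spans that reported an eviction."""
    inserts = [s for s in spans if s.get("name") == span_name]
    if not inserts:
        return 0.0
    evicted = sum(
        1 for s in inserts if s.get("attributes", {}).get("cache.evicted") is True
    )
    return evicted / len(inserts)


# Hypothetical export: 20 insert spans (2 with an eviction) plus one search span.
spans = (
    [{"name": "l1b_semantic_cache_insert", "attributes": {"cache.evicted": False}}] * 18
    + [{"name": "l1b_semantic_cache_insert", "attributes": {"cache.evicted": True}}] * 2
    + [{"name": "l1b_semantic_cache_search", "attributes": {}}]
)

rate = eviction_rate(spans)
needs_tuning = rate > 0.05  # threshold from the guidance above
print(f"eviction rate: {rate:.1%}, increase capacity or shorten TTL: {needs_tuning}")
```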
Tuning Latency
Target Latencies by Layer
| Layer | Target (p95) | Typical Range |
|---|---|---|
| L1a — Exact Cache | < 1 ms | 0.1–0.5 ms |
| L1b — Semantic Cache | < 10 ms | 1–5 ms |
| L2 — SLM Triage | < 300 ms | 50–200 ms (embedded), 100–500 ms (sidecar) |
| L3 — Cloud LLM | < 3 s | 500 ms – 5 s (network-bound) |
Measure with PromQL
# P95 latency by layer
histogram_quantile(0.95,
sum by (le, layer_name) (
rate(isartor_layer_duration_seconds_bucket[5m])
)
)
# P95 end-to-end latency
histogram_quantile(0.95, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m])))
Reducing Latency
| Bottleneck | Symptom | Fix |
|---|---|---|
| Embedding | L1b > 10 ms | Use a lighter model or increase CPU allocation |
| SLM inference | L2 > 500 ms | Use quantised model (Q4_K_M GGUF), switch to embedded engine |
| Redis | L1a > 5 ms | Check network latency, use Redis cluster with read replicas |
| Cloud LLM | L3 > 5 s | Switch provider, use a smaller model, enable request timeout |
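The table above can be applied mechanically to measured p95 values. In this sketch the layer keys (`L1a`, `L1b`, `L2`, `L3`) are shorthand chosen for the example and may not match the values of the `layer_name` label in your metrics; the limits and fixes are copied from the table.

```python
# (limit in seconds, suggested fix) per layer, copied from the table above.
BOTTLENECKS = {
    "L1a": (0.005, "Redis: check network latency, use Redis cluster with read replicas"),
    "L1b": (0.010, "Embedding: use a lighter model or increase CPU allocation"),
    "L2": (0.500, "SLM inference: use quantised model (Q4_K_M GGUF), switch to embedded engine"),
    "L3": (5.0, "Cloud LLM: switch provider, use a smaller model, enable request timeout"),
}


def diagnose(p95_seconds):
    """Return (layer, fix) pairs for every layer whose p95 exceeds its limit."""
    findings = []
    for layer, (limit, fix) in BOTTLENECKS.items():
        observed = p95_seconds.get(layer)
        if observed is not None and observed > limit:
            findings.append((layer, fix))
    return findings


# Hypothetical measurements, in seconds.
measured = {"L1a": 0.0004, "L1b": 0.012, "L2": 0.2, "L3": 6.0}
for layer, fix in diagnose(measured):
    print(f"{layer} is over its limit -> {fix}")
```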
Memory & Resource Tuning
Memory Budget
| Component | Memory Usage | Notes |
|---|---|---|
| Exact cache (in-memory, 10K entries) | ~20–40 MB | Scales linearly with cache_max_capacity |
| Semantic cache (in-memory, 10K entries) | ~30–60 MB | 384-dim float32 vectors + response strings |
| candle embedder (all-MiniLM-L6-v2) | ~90 MB | Loaded at startup, constant |
| Candle GGUF model (embedded SLM) | ~1–4 GB | Depends on model quantisation |
| Tokio runtime | ~10–20 MB | Async task pool |
| Total (minimalist mode) | ~150–200 MB | No embedded SLM |
| Total (embedded mode) | ~1.5–4.5 GB | With embedded Candle SLM |
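Because cache memory scales linearly with the configured capacity, the minimalist-mode budget can be estimated for other capacities. The sketch below derives per-entry costs from the 10K-entry rows of the table and assumes 1 MB = 1000 KB; treat the result as a rough planning figure, not a measurement.

```python
# Per-entry and fixed costs derived from the table above (10K entries per cache).
EXACT_KB_PER_ENTRY = (2, 4)     # 20-40 MB per 10K entries
SEMANTIC_KB_PER_ENTRY = (3, 6)  # 30-60 MB per 10K entries
EMBEDDER_MB = (90, 90)          # all-MiniLM-L6-v2, constant
TOKIO_MB = (10, 20)             # async runtime


def minimalist_memory_mb(cache_max_capacity):
    """Return a (low, high) estimate in MB for minimalist mode (no embedded SLM)."""
    low = high = 0.0
    for kb_low, kb_high in (EXACT_KB_PER_ENTRY, SEMANTIC_KB_PER_ENTRY):
        low += cache_max_capacity * kb_low / 1000
        high += cache_max_capacity * kb_high / 1000
    low += EMBEDDER_MB[0] + TOKIO_MB[0]
    high += EMBEDDER_MB[1] + TOKIO_MB[1]
    return low, high


for capacity in (10_000, 50_000):
    low, high = minimalist_memory_mb(capacity)
    print(f"capacity {capacity}: ~{low:.0f}-{high:.0f} MB")
```

At the default capacity of 10000 the estimate is close to the table's minimalist total. At 50000 entries the upper estimate exceeds a 512M container limit, so raise the limit together with the capacity.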
CPU Considerations
- Embedding generation runs on spawn_blocking (dedicated thread pool).
- Candle GGUF inference is CPU-bound; allocate ≥ 4 cores for embedded mode.
- The Tokio async runtime uses the default thread count (num_cpus).
Container Limits
# docker-compose example
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 512M # minimalist mode
          cpus: "2"
        # For embedded SLM mode:
        # limits:
        #   memory: 4G
        #   cpus: "4"
Cache Tuning Deep-Dive
Exact vs. Semantic Cache Hit Analysis
# Exact cache hit rate
sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
# Semantic cache hit rate
sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m]))
/
sum(rate(isartor_requests_total[5m]))
Cache Backend: Memory vs. Redis
| Factor | In-Memory | Redis |
|---|---|---|
| Latency | ~0.1 ms | ~1–5 ms (network hop) |
| Capacity | Limited by process RAM | Limited by Redis memory |
| Multi-replica | ❌ No sharing | ✅ Shared across pods |
| Persistence | ❌ Lost on restart | ✅ Optional AOF/RDB |
| Recommended for | Single-instance, dev, edge | K8s, multi-replica, production |
Switch with:
export ISARTOR__CACHE_BACKEND=redis
export ISARTOR__REDIS_URL=redis://redis.svc:6379
When to Disable Semantic Cache
- Traffic is 100 % deterministic (exact same prompts repeated).
- Embedding overhead is unacceptable (< 1 ms budget).
- To disable it, set ISARTOR__CACHE_MODE=exact.
SLM Router Tuning
Embedded vs. Sidecar
| Mode | Variable | Latency | Resource Usage |
|---|---|---|---|
| Embedded (Candle) | ISARTOR__INFERENCE_ENGINE=embedded | 50–200 ms | High CPU, 1–4 GB RAM |
| Sidecar (llama.cpp) | ISARTOR__INFERENCE_ENGINE=sidecar | 100–500 ms | Separate process, GPU optional |
| Remote (vLLM/TGI) | ISARTOR__ROUTER_BACKEND=vllm | 100–500 ms | Separate server, GPU recommended |
Model Selection
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| Phi-3-mini (Q4_K_M) | ~2 GB | Fast | Good |
| Gemma-2-2B-IT (Q4) | ~1.5 GB | Very fast | Good |
| Qwen-1.5-1.8B (Q4) | ~1.2 GB | Fastest | Adequate |
| Llama-3-8B (Q4) | ~4.5 GB | Slower | Best |
For intent classification (TEMPLATE/SNIPPET/COMPLEX in tiered mode, or SIMPLE/COMPLEX in legacy binary mode), smaller models (1–3 B params) are sufficient. Use the smallest model that meets your accuracy needs.
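The routing decision that follows classification is simple enough to sketch. This is illustrative only and not taken from the Isartor source: it assumes that TEMPLATE and SNIPPET (tiered mode) or SIMPLE (binary mode) are answered locally, and that any unrecognised label is sent to the cloud as a safe default.

```python
# Labels answered locally by the SLM, per classifier mode (assumption for this sketch).
LOCAL_LABELS = {
    "tiered": {"TEMPLATE", "SNIPPET"},
    "binary": {"SIMPLE"},
}


def route(label, mode="tiered"):
    """Map a classifier label to the layer that should answer the request."""
    if mode not in LOCAL_LABELS:
        raise ValueError(f"unknown classifier mode: {mode!r}")
    if label.strip().upper() in LOCAL_LABELS[mode]:
        return "L2_SLM"
    return "L3_Cloud"  # COMPLEX and anything unrecognised go to the cloud


print(route("TEMPLATE"))               # tiered mode, answered locally
print(route("COMPLEX"))                # forwarded to the cloud LLM
print(route("SIMPLE", mode="binary"))  # binary mode, answered locally
```

Sending unknown labels to the cloud trades some deflection for answer quality, which matters most with small models whose output format can drift.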
Tuning the Classification Prompt
The system prompt in src/middleware/slm_triage.rs determines classification
accuracy. If too many COMPLEX requests are misclassified as TEMPLATE or
SNIPPET (resulting in bad local answers), consider:
- Making the system prompt more specific to your domain.
- Adding examples to the prompt (few-shot).
- Switching to a larger model.
- Setting ISARTOR__LAYER2__MAX_ANSWER_TOKENS to allow longer SLM responses (default 2048).
- Falling back to binary mode via ISARTOR__LAYER2__CLASSIFIER_MODE=binary if the three-tier split does not suit your workload.
Embedder Tuning
In-Process (candle)
The default embedder uses candle with sentence-transformers/all-MiniLM-L6-v2 (pure-Rust BertModel):
- 384-dimensional vectors
- ~90 MB model footprint
- 1–5 ms per embedding (CPU)
- Runs on spawn_blocking to avoid starving the Tokio runtime
Sidecar Embedder
For higher throughput or GPU acceleration:
export ISARTOR__EMBEDDING_SIDECAR__SIDECAR_URL=http://127.0.0.1:8082
export ISARTOR__EMBEDDING_SIDECAR__MODEL_NAME=all-minilm
export ISARTOR__EMBEDDING_SIDECAR__TIMEOUT_SECONDS=10
Embedding Model Selection
| Model | Dims | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fastest | Good |
| bge-small-en-v1.5 | 384 | Fast | Better |
| bge-base-en-v1.5 | 768 | Moderate | Best |
Use 384-dim models for production. 768-dim models double the memory used by cached vectors for marginal quality improvement in most use cases.
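The memory difference follows from the vector size alone. A quick estimate, assuming float32 values (4 bytes each) and counting only the raw vectors, not the cached responses or any index overhead:

```python
def vector_memory_mb(dims, entries, bytes_per_value=4):
    """Raw storage for `entries` embedding vectors, in MB (1 MB = 1,000,000 bytes)."""
    return dims * bytes_per_value * entries / 1_000_000


for dims in (384, 768):
    print(f"{dims}-dim, 10000 entries: {vector_memory_mb(dims, 10_000):.2f} MB")
```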
SLO / SLA Goal Templates
Developer / Internal SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.5 % | up{job="isartor"} over 30-day window |
| P95 latency (cache hit) | < 10 ms | histogram_quantile(0.95, ...) on L1 |
| P95 latency (end-to-end) | < 3 s | histogram_quantile(0.95, ...) on all |
| Deflection rate | > 50 % | 1 - (L3 / total) over 24 h |
| Error rate | < 1 % | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
Production / Enterprise SLO
| Metric | Target | Measurement |
|---|---|---|
| Availability | 99.9 % | Multi-replica, health check monitoring |
| P95 latency (cache hit) | < 5 ms | Requires Redis or fast in-memory |
| P95 latency (end-to-end) | < 2 s | Optimised models, provider SLAs |
| P99 latency (end-to-end) | < 5 s | Tail latency budget |
| Deflection rate | > 70 % | Tuned thresholds + warm cache |
| Error rate | < 0.1 % | Circuit breakers, retries |
| Token savings | > 60 % | isartor_tokens_saved_total vs estimated total |
SLA Template (for downstream consumers)
## Isartor Prompt Firewall SLA
**Availability:** 99.9 % monthly uptime (< 43.8 min downtime/month)
**Latency:** P95 end-to-end < 2 seconds
**Error Budget:** 0.1 % of requests may return 5xx
**Maintenance Window:** Sundays 02:00–04:00 UTC (excluded from SLA)
### Remediation
- Cache tier failure: automatic fallback to cloud LLM (degraded mode)
- SLM failure: automatic fallback to cloud LLM (degraded mode)
- Cloud LLM failure: 502 Bad Gateway returned, retry recommended
### Monitoring
- Health endpoint: GET /healthz
- Metrics endpoint: Prometheus scrape via OTel Collector on port 8889
- Dashboard: Grafana at http://<grafana-host>:3000
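The downtime figure in the template follows from the availability target. A small helper, assuming an average month of 365.25 / 12 days (a plain 30-day month gives a slightly smaller budget), makes it easy to recompute for other targets:

```python
def allowed_downtime_minutes(availability_pct, days=365.25 / 12):
    """Downtime budget in minutes for a given availability target over `days`."""
    return (1 - availability_pct / 100) * days * 24 * 60


for target in (99.5, 99.9):
    print(f"{target} % -> {allowed_downtime_minutes(target):.1f} min per month")
```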
Alert Rules (Prometheus)
groups:
  - name: isartor-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(isartor_requests_total{http_status=~"5.."}[5m]))
          /
          sum(rate(isartor_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Isartor error rate exceeds 1%"
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95, rate(isartor_request_duration_seconds_bucket[5m]))
          > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Isartor P95 latency exceeds 3 seconds"
      - alert: LowDeflectionRate
        expr: |
          1 - (
            sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
            /
            sum(rate(isartor_requests_total[1h]))
          ) < 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Isartor deflection rate below 50%"
      - alert: FirewallDown
        expr: up{job="isartor"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Isartor gateway is down"
Scenario-Based Tuning Recipes
Scenario A: Agentic Loop (High-Volume Identical Prompts)
Profile: Autonomous agent sends the same prompt hundreds of times per minute.
ISARTOR__CACHE_MODE=exact # Semantic unnecessary for identical prompts
ISARTOR__CACHE_TTL_SECS=3600 # Long TTL — agent prompts are stable
ISARTOR__CACHE_MAX_CAPACITY=50000 # Large cache for many unique prompts
Expected deflection: 95–99 % (after warm-up).
Scenario B: Customer Support Bot (Paraphrased Questions)
Profile: End users ask the same questions in different ways.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.80 # Lower threshold to catch paraphrases
ISARTOR__CACHE_TTL_SECS=1800 # 30 min — support answers change slowly
ISARTOR__CACHE_MAX_CAPACITY=10000
Expected deflection: 60–80 %.
Scenario C: Code Generation (Low Cache Hit Rate)
Profile: Developers ask unique, complex coding questions.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.92 # High threshold — wrong cached code is costly
ISARTOR__CACHE_TTL_SECS=600 # Medium TTL (shorter than the other recipes): code context changes quickly
ISARTOR__INFERENCE_ENGINE=embedded # Let SLM handle simple code questions
Expected deflection: 20–40 % (SLM handles simple extraction).
Scenario D: RAG Pipeline (Document Q&A)
Profile: Queries against a knowledge base; similar questions are common.
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.83 # Moderate threshold
ISARTOR__CACHE_TTL_SECS=3600 # Documents change infrequently
ISARTOR__CACHE_MAX_CAPACITY=20000 # Large cache for document variation
Expected deflection: 50–70 %.
Scenario E: Multi-Replica Kubernetes
Profile: Horizontally scaled behind a load balancer.
ISARTOR__CACHE_BACKEND=redis
ISARTOR__REDIS_URL=redis://redis-cluster.svc:6379
ISARTOR__ROUTER_BACKEND=vllm
ISARTOR__VLLM_URL=http://vllm.svc:8000
ISARTOR__VLLM_MODEL=meta-llama/Llama-3-8B-Instruct
ISARTOR__CACHE_MODE=both
ISARTOR__SIMILARITY_THRESHOLD=0.85
Benefit: All replicas share the same cache → deflection rate applies cluster-wide.
PromQL Cheat Sheet
| What | Query |
|---|---|
| Deflection rate (1 h) | 1 - (sum(increase(isartor_requests_total{final_layer="L3_Cloud"}[1h])) / sum(increase(isartor_requests_total[1h]))) |
| Request rate | rate(isartor_requests_total[5m]) |
| Request rate by layer | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| P50 latency | histogram_quantile(0.50, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m]))) |
| P95 latency | histogram_quantile(0.95, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m]))) |
| P99 latency | histogram_quantile(0.99, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m]))) |
| Per-layer P95 | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens saved (daily) | sum(increase(isartor_tokens_saved_total[24h])) |
| Tokens saved by layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Est. daily cost savings ($0.01/1K tok) | sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01 |
| Error rate | sum(rate(isartor_requests_total{http_status=~"5.."}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (exact) | sum(rate(isartor_requests_total{final_layer="L1a_ExactCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
| Cache hit ratio (semantic) | sum(rate(isartor_requests_total{final_layer="L1b_SemanticCache"}[5m])) / sum(rate(isartor_requests_total[5m])) |
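The cost-savings row is plain arithmetic on the token counter. As a worked example, using the same illustrative price of $0.01 per 1K tokens (substitute your provider's actual rate) and a hypothetical daily total:

```python
def estimated_savings_usd(tokens_saved, usd_per_1k_tokens=0.01):
    """Convert a saved-token count into an estimated dollar amount."""
    return tokens_saved / 1000 * usd_per_1k_tokens


# Hypothetical value of sum(increase(isartor_tokens_saved_total[24h])).
tokens_saved_today = 2_500_000
print(f"~${estimated_savings_usd(tokens_saved_today):.2f} saved today")
```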
See also: Metrics & Tracing · Configuration Reference · Troubleshooting