Metrics & Tracing
Definitive reference for Isartor's OpenTelemetry traces, metrics, structured logging, and observability stack — from local development to Kubernetes.
Overview
Isartor uses OpenTelemetry for distributed
tracing and metrics, plus tracing-subscriber with a JSON layer for
structured logging.
| Signal | Protocol | Default Endpoint |
|---|---|---|
| Traces | OTLP gRPC | http://localhost:4317 |
| Metrics | OTLP gRPC | http://localhost:4317 |
| Logs | stdout (JSON) | — |
When ISARTOR__ENABLE_MONITORING=false (default), only the console log
layer is active — zero OTel overhead.
Architecture
┌─────────────┐ ┌──────────────────┐
│ Isartor │ OTLP gRPC │ OTel Collector │
│ Gateway │─────────────────▶│ :4317 │
│ │ (traces + │ │
│ │ metrics) │ Pipelines: │
└─────────────┘ │ traces → Jaeger │
│ metrics → Prom │
└───┬──────────┬────┘
│ │
┌──────────▼──┐ ┌────▼──────────┐
│ Jaeger │ │ Prometheus │
│ :16686 │ │ :9090 │
│ (UI) │ │ (scrape) │
└─────────────┘ └───────┬───────┘
│
┌───────▼───────┐
│ Grafana │
│ :3000 │
│ (dashboards) │
└───────────────┘
Enabling Monitoring
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://localhost:4317
RUST_LOG=info,h2=warn,hyper=warn,tower=warn # optional override
When ISARTOR__ENABLE_MONITORING=false (the default), Isartor uses console-only logging via tracing-subscriber with RUST_LOG filtering; no OTel SDK is initialised.
Telemetry Initialisation (src/telemetry.rs)
init_telemetry() returns an OtelGuard (RAII). The guard holds the
SdkTracerProvider and SdkMeterProvider; dropping it flushes pending
telemetry and shuts down exporters gracefully.
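The RAII contract can be sketched with std types only. Provider below is a hypothetical stand-in for the SDK providers, not the OpenTelemetry API; the real guard wraps SdkTracerProvider and SdkMeterProvider:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for an SDK provider (SdkTracerProvider / SdkMeterProvider).
struct Provider {
    flushed: Arc<AtomicBool>,
}

impl Provider {
    fn shutdown(&self) {
        // The real providers flush pending spans/metrics over OTLP here.
        self.flushed.store(true, Ordering::SeqCst);
    }
}

// Holding the guard keeps telemetry alive; dropping it shuts both
// providers down, mirroring the OtelGuard behaviour described above.
struct OtelGuard {
    tracer: Provider,
    meter: Provider,
}

impl Drop for OtelGuard {
    fn drop(&mut self) {
        self.tracer.shutdown();
        self.meter.shutdown();
    }
}

fn main() {
    let traces = Arc::new(AtomicBool::new(false));
    let metrics = Arc::new(AtomicBool::new(false));
    {
        let _guard = OtelGuard {
            tracer: Provider { flushed: traces.clone() },
            meter: Provider { flushed: metrics.clone() },
        };
        // ... application runs while the guard is in scope ...
    } // guard dropped here: both providers are shut down exactly once
    assert!(traces.load(Ordering::SeqCst));
    assert!(metrics.load(Ordering::SeqCst));
    println!("telemetry flushed");
}
```

This is why the guard must be bound to a named variable (`let _guard = ...`) in main: binding to `_` would drop it immediately and flush before any telemetry is produced.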
| Component | Description |
|---|---|
| JSON stdout layer | Structured logs emitted as JSON when monitoring is on |
| Pretty console layer | Human-readable output when monitoring is off |
| OTLP trace exporter | gRPC via opentelemetry-otlp → Collector |
| OTLP metric exporter | gRPC via opentelemetry-otlp → Collector |
| EnvFilter | Reads RUST_LOG, defaults to info,h2=warn,hyper=warn,tower=warn |
Service identity:
service.name = "isartor-gateway"
service.version = env!("CARGO_PKG_VERSION") # e.g. "0.1.0"
Distributed Traces — Span Reference
Every request gets a root span (gateway_request) from the monitoring
middleware. Child spans are created per layer:
Root Span
| Span Name | Source | Key Attributes |
|---|---|---|
| gateway_request | src/middleware/monitoring.rs | http.method, http.route, http.status_code, client.address, isartor.final_layer |
http.status_code and isartor.final_layer are declared as empty fields when the span is created and recorded once the response is available (empty → filled pattern).
Layer 0 — Auth
| Span Name | Source | Key Attributes |
|---|---|---|
| (inline tracing::debug!/warn!) | src/middleware/auth.rs | — |
Auth is lightweight; no dedicated span is created. Events are logged at debug/warn level.
Layer 1a — Exact Cache
| Span Name | Source | Key Attributes |
|---|---|---|
| l1a_exact_cache_get | src/adapters/cache.rs | cache.backend (memory or redis), cache.key, cache.hit |
| l1a_exact_cache_put | src/adapters/cache.rs | cache.backend, cache.key, response_len |
Layer 1b — Semantic Cache
| Span Name | Source | Key Attributes |
|---|---|---|
| l1b_semantic_cache_search | src/vector_cache.rs | cache.entries_scanned, cache.hit, cosine_similarity |
| l1b_semantic_cache_insert | src/vector_cache.rs | cache.evicted, cache.size_after |
cosine_similarity — the best-match score, formatted to 4 decimal places. This is the key attribute for tuning the similarity threshold.
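The score itself is the standard cosine similarity between the query embedding and each cached embedding. A minimal std-only sketch (illustrative; the actual src/vector_cache.rs implementation may differ):

```rust
/// Cosine similarity between two equal-length embedding vectors.
/// Returns a value in [-1.0, 1.0]; 1.0 means identical direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // degenerate (all-zero) embedding: treat as no similarity
    }
    dot / (norm_a * norm_b)
}

fn main() {
    // Identical vectors score 1.0; orthogonal vectors score 0.0.
    println!("{:.4}", cosine_similarity(&[1.0, 0.0], &[1.0, 0.0])); // 1.0000
    println!("{:.4}", cosine_similarity(&[1.0, 0.0], &[0.0, 1.0])); // 0.0000
}
```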
Layer 2 — SLM Triage
| Span Name | Source | Key Attributes |
|---|---|---|
| layer2_slm | src/middleware/slm_triage.rs | slm.complexity_score (TEMPLATE / SNIPPET / COMPLEX; legacy binary mode: SIMPLE / COMPLEX) |
| l2_classify_intent | src/adapters/router.rs | router.backend (embedded_candle or remote_vllm), router.decision, router.model, router.url, prompt_len |
Layer 2.5 — Context Optimiser
| Span Name | Source | Key Attributes |
|---|---|---|
| layer2_5_context_optimizer | src/middleware/context_optimizer.rs | context.bytes_saved, context.strategy (e.g. "classifier+dedup", "classifier+log_crunch") |
When L2.5 modifies the request body, it also sets the response header x-isartor-context-optimized: bytes_saved=<N>.
Layer 3 — Cloud LLM
| Span Name | Source | Key Attributes |
|---|---|---|
| layer3_llm | src/handler.rs | ai.prompt.length_bytes, provider.name, model |
Custom Span Attributes — Quick Reference
These are the Isartor-specific attributes (beyond standard OTel semantic conventions) that appear on spans and are useful for filtering in Jaeger / Tempo:
| Attribute | Type | Where Set | Purpose |
|---|---|---|---|
| isartor.final_layer | string | Root gateway_request span | Which layer resolved the request |
| cache.hit | bool | L1a and L1b spans | Whether the cache lookup succeeded |
| cosine_similarity | string | L1b search span | Best cosine-similarity score (4 d.p.) |
| cache.entries_scanned | u64 | L1b search span | Entries scanned during similarity search |
| cache.backend | string | L1a get/put spans | "memory" or "redis" |
| router.decision | string | L2 classify span | "TEMPLATE", "SNIPPET", or "COMPLEX" (tiered mode); "SIMPLE" or "COMPLEX" (binary mode) |
| router.backend | string | L2 classify span | "embedded_candle" or "remote_vllm" |
| context.bytes_saved | u64 | L2.5 optimizer span | Bytes removed by compression pipeline |
| context.strategy | string | L2.5 optimizer span | Pipeline stages that modified content (e.g. "classifier+dedup") |
| provider.name | string | L3 handler span | e.g. "openai", "xai", "azure" |
| model | string | L3 handler span | e.g. "gpt-4o", "grok-beta" |
| http.status_code | u16 | Root span | HTTP response status code |
| client.address | string | Root span | Client IP (from x-forwarded-for) |
OTel Metrics (src/metrics.rs)
Seven instruments are registered as a singleton GatewayMetrics via OnceLock:
| Metric Name | Type | Attributes | Description |
|---|---|---|---|
| isartor_requests_total | Counter | final_layer, status_code, traffic_surface, client, endpoint_family, tool | Total prompts processed |
| isartor_request_duration_seconds | Histogram | final_layer, status_code, traffic_surface, client, endpoint_family | End-to-end request duration |
| isartor_layer_duration_seconds | Histogram | layer_name, tool | Per-layer latency |
| isartor_tokens_saved_total | Counter | final_layer, traffic_surface, client, endpoint_family, tool | Estimated tokens saved by early resolve |
| isartor_errors_total | Counter | layer, error_class, tool | Error occurrences by layer / agent |
| isartor_retries_total | Counter | operation, attempts, outcome, tool | Retry outcomes by agent |
| isartor_cache_events_total | Counter | cache_layer, outcome, tool | L1 / L1a / L1b hit-miss safety by agent |
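The OnceLock registration pattern can be sketched with std types alone; GatewayMetrics here is a simplified stand-in (an atomic counter instead of real OTel instruments):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

// Simplified stand-in: the real struct holds OTel Counter/Histogram handles.
struct GatewayMetrics {
    requests_total: AtomicU64,
}

impl GatewayMetrics {
    fn record_request(&self) {
        self.requests_total.fetch_add(1, Ordering::Relaxed);
    }
}

// Registered exactly once; every later call returns the same instance,
// so middleware and handlers share one set of instruments.
static METRICS: OnceLock<GatewayMetrics> = OnceLock::new();

fn metrics() -> &'static GatewayMetrics {
    METRICS.get_or_init(|| GatewayMetrics {
        requests_total: AtomicU64::new(0),
    })
}

fn main() {
    metrics().record_request();
    metrics().record_request();
    assert_eq!(metrics().requests_total.load(Ordering::Relaxed), 2);
    println!("requests_total = 2");
}
```

Using OnceLock (std, Rust 1.70+) avoids both a lazy-static dependency and any locking on the hot path: after initialisation, `get_or_init` is a cheap atomic read.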
Where Metrics Are Recorded
| Call Site | Metrics Recorded |
|---|---|
| root_monitoring_middleware | record_request_with_context(), record_tokens_saved_with_context() (if early) |
| proxy::connect::emit_proxy_decision() | record_request_with_context(), record_tokens_saved_with_context() (if early) |
| cache_middleware (L1 hit) | record_layer_duration("L1a_ExactCache") or record_layer_duration("L1b_SemanticCache") |
| slm_triage_middleware (L2 hit) | record_layer_duration("L2_SLM") |
| context_optimizer_middleware | record_layer_duration("L2_5_ContextOptimiser") (when bytes saved > 0) |
| chat_handler (L3) | record_layer_duration("L3_Cloud") |
Request Dimensions
Unified prompt telemetry distinguishes:
- traffic_surface: gateway or proxy
- client: direct, openai, anthropic, copilot, claude, antigravity, etc.
- endpoint_family: native, openai, or anthropic
Token Estimation
estimate_tokens(prompt) uses the heuristic: max(1, prompt.len() / 4).
This is intentionally conservative — the metric tracks relative savings
rather than precise token counts.
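As a one-function sketch (mirroring the heuristic stated above):

```rust
/// Rough token estimate: ~4 bytes per token, never less than 1.
fn estimate_tokens(prompt: &str) -> usize {
    std::cmp::max(1, prompt.len() / 4)
}

fn main() {
    assert_eq!(estimate_tokens(""), 1);     // floor of 1, even for empty input
    assert_eq!(estimate_tokens("abcd"), 1); // 4 bytes → 1 token
    assert_eq!(estimate_tokens("hello world, how are you?"), 6); // 25 bytes → 6
    println!("ok");
}
```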
ROI — isartor_tokens_saved_total
This is the headline business metric. Every request resolved before Layer 3 (exact cache, semantic cache, or local SLM) avoids a round-trip to the external LLM provider.
# Daily token savings
sum(increase(isartor_tokens_saved_total[24h]))
# Savings by layer
sum by (final_layer) (rate(isartor_tokens_saved_total[1h]))
# Prompt volume by traffic surface
sum by (traffic_surface) (rate(isartor_requests_total[5m]))
# Prompt volume by client
sum by (client) (rate(isartor_requests_total[5m]))
# Estimated cost savings (assuming $0.01 per 1K tokens)
sum(increase(isartor_tokens_saved_total[24h])) / 1000 * 0.01
Use this metric to justify infrastructure spend for the caching / SLM layers.
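The last query's arithmetic, as a small sketch (the $0.01 per 1K tokens price is an assumed example rate, not a real provider price):

```rust
/// Estimated cost saved, given tokens saved and a price per 1K tokens.
fn cost_saved_usd(tokens_saved: u64, price_per_1k_usd: f64) -> f64 {
    tokens_saved as f64 / 1000.0 * price_per_1k_usd
}

fn main() {
    // 2.5M tokens deflected in a day at $0.01 per 1K tokens → $25.00
    let saved = cost_saved_usd(2_500_000, 0.01);
    assert!((saved - 25.0).abs() < 1e-9);
    println!("${saved:.2}");
}
```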
Docker Compose — Local Observability Stack
Use the provided compose file for local development:
cd docker
docker compose -f docker-compose.observability.yml up -d
| Service | Port | Purpose |
|---|---|---|
| OTel Collector | 4317 | OTLP gRPC receiver |
| Jaeger | 16686 | Trace UI |
| Prometheus | 9090 | Metrics scrape + query |
| Grafana | 3000 | Dashboards (anonymous admin) |
Configuration files:
| File | Purpose |
|---|---|
| docker/otel-collector-config.yaml | Collector pipelines |
| docker/prometheus.yml | Scrape targets |
Pipeline Flow
Isartor ──OTLP gRPC──▶ OTel Collector ──▶ Jaeger (traces)
└──▶ Prometheus (metrics)
│
▼
Grafana
OTel Collector Configuration
The collector config is at docker/otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, debug]
    metrics:
      receivers: [otlp]
      exporters: [prometheus, debug]
Prometheus Configuration
The Prometheus config is at docker/prometheus.yml:
scrape_configs:
  - job_name: 'otel-collector'
    scrape_interval: 5s
    static_configs:
      - targets: ['otel-collector:8889']
Prometheus scrapes the OTel Collector's Prometheus exporter on port 8889 every 5 seconds.
Per-Tier Setup
Level 1 — Minimal (Console Logs Only)
No observability stack is needed. Use RUST_LOG for structured console output:
ISARTOR__ENABLE_MONITORING=false
RUST_LOG=isartor=info
For debug-level output during development:
RUST_LOG=isartor=debug,tower_http=trace
Level 2 — Docker Compose (Full Stack)
The docker-compose.sidecar.yml includes the complete observability stack:
cd docker
docker compose -f docker-compose.sidecar.yml up --build
Services included:
| Service | URL | Purpose |
|---|---|---|
| OTel Collector | localhost:4317 (gRPC) | Receives OTLP from gateway |
| Jaeger UI | http://localhost:16686 | View distributed traces |
| Prometheus | http://localhost:9090 | Query metrics |
| Grafana | http://localhost:3000 | Dashboards (anonymous admin access) |
The gateway is pre-configured with:
ISARTOR__ENABLE_MONITORING=true
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector:4317
Level 3 — Kubernetes (Managed or Self-Hosted)
| Approach | Recommended Stack | Notes |
|---|---|---|
| Self-managed | OTel Collector DaemonSet + Jaeger Operator + kube-prometheus-stack | Full control, higher ops burden |
| AWS | AWS X-Ray + CloudWatch + Managed Grafana | ADOT Collector as sidecar/DaemonSet |
| GCP | Cloud Trace + Cloud Monitoring + Cloud Logging | Use OTLP exporter to Cloud Trace |
| Azure | Application Insights + Azure Monitor | Use Azure Monitor OpenTelemetry exporter |
| Grafana Cloud | Grafana Alloy + Grafana Cloud | Low ops, managed Prometheus + Tempo |
| Datadog | Datadog Agent + OTel Collector | Enterprise APM |
For all options, point the gateway at the collector:
ISARTOR__OTEL_EXPORTER_ENDPOINT=http://otel-collector.isartor:4317
Grafana Dashboard Queries (PromQL)
| Panel | PromQL |
|---|---|
| Request Rate | rate(isartor_requests_total[5m]) |
| P95 Latency | histogram_quantile(0.95, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m]))) |
| Layer Resolution | sum by (final_layer) (rate(isartor_requests_total[5m])) |
| Traffic Surface Split | sum by (traffic_surface) (rate(isartor_requests_total[5m])) |
| Client Split | sum by (client) (rate(isartor_requests_total[5m])) |
| Per-Layer Latency | histogram_quantile(0.95, sum by (le, layer_name) (rate(isartor_layer_duration_seconds_bucket[5m]))) |
| Tokens Saved / Hour | sum(increase(isartor_tokens_saved_total[1h])) |
| Tokens Saved by Layer | sum by (final_layer) (rate(isartor_tokens_saved_total[5m])) |
| Cache Hit Rate | sum(rate(isartor_requests_total{final_layer=~"L1.*"}[5m])) / sum(rate(isartor_requests_total[5m])) |
Jaeger — Useful Searches
| Goal | Search |
|---|---|
| Slow requests (> 500 ms) | Service isartor-gateway, Min Duration 500ms |
| Cache misses | Tag cache.hit=false |
| Semantic cache tuning | Tag cosine_similarity — sort by value |
| Layer 3 fallbacks | Tag isartor.final_layer=L3_Cloud |
| SLM local resolutions | Tag router.decision=TEMPLATE or router.decision=SNIPPET (tiered); router.decision=SIMPLE (binary) |
Trace Anatomy
A typical trace for a request that misses both caches and is resolved locally at Layer 2 (using the span names documented above; auth emits no span of its own):
isartor-gateway
└─ gateway_request POST /api/chat [250ms] (isartor.final_layer=L2_SLM)
   ├─ l1a_exact_cache_get (cache.hit=false) [0.1ms]
   ├─ l1b_semantic_cache_search (cache.hit=false) [5ms]
   ├─ l2_classify_intent (router.decision=TEMPLATE) [80ms]
   └─ layer2_slm [160ms]
Built-in User Views
For quick operator checks without a separate telemetry stack:
isartor stats --gateway-url http://localhost:8080
isartor stats --gateway-url http://localhost:8080 --by-tool
Add --gateway-api-key <key> only when gateway auth is enabled.
--by-tool prints richer per-agent stats: requests, cache hits/misses,
average latency, retry count, error count, and L1a/L1b safety ratios.
Built-in JSON endpoints:
GET /health
GET /debug/proxy/recent
GET /debug/stats/prompts
GET /debug/stats/agents
Alerting Rules
Prometheus Alerting Rules
Create docker/prometheus-alerts.yml:
groups:
  - name: isartor
    rules:
      - alert: HighErrorRate
        expr: >
          sum(rate(isartor_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(isartor_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Isartor error rate > 5% for 5 minutes"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(isartor_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Isartor P95 latency > 2s for 5 minutes"
      - alert: LowCacheHitRate
        expr: >
          sum(rate(isartor_requests_total{final_layer=~"L1.*"}[15m]))
          /
          sum(rate(isartor_requests_total[15m])) < 0.3
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below 30% — consider tuning similarity threshold"
      - alert: LowDeflectionRate
        expr: |
          1 - (
            sum(rate(isartor_requests_total{final_layer="L3_Cloud"}[1h]))
            /
            sum(rate(isartor_requests_total[1h]))
          ) < 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Isartor deflection rate below 50%"
      - alert: FirewallDown
        expr: up{job="isartor"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Isartor gateway is down"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No traces in Jaeger | Monitoring disabled | Set ISARTOR__ENABLE_MONITORING=true |
| No traces in Jaeger | Collector unreachable | Verify OTEL_EXPORTER_ENDPOINT + port 4317 |
| No metrics in Prometheus | Prometheus can't scrape collector | Check prometheus.yml targets |
| Grafana "No data" | Data source misconfigured | URL should be http://prometheus:9090 |
| Console shows "OTel disabled" | Config precedence | Env vars override file config — verify both sources |
| isartor_layer_duration_seconds empty | No requests yet | Send a test request |
See also: Configuration Reference · Performance Tuning · Troubleshooting