📡 Live Production Metrics

Routing benchmarks, pulled from prod

Semantic KNN routing, latency percentiles, token savings, embedding coverage. Every number sourced from the live /api/routing/stats endpoint at page load.

Fetching live data…
Agents deployed
Avg routing latency
Token savings vs broadcast
Embedding coverage

Semantic KNN vs. Broadcast

Why O(log N) routing matters when you have 200 agents and cost scales with every evaluation token.

✓ Sturna (live)
O(log N)
Semantic KNN Routing
Each intent is embedded with text-embedding-3-large (1024-d, L2-normalized). A cosine KNN search over the agent manifest index (IVFFlat, 20 partitions) retrieves the top-20 candidates (0.5 soft threshold). A cross-encoder (ms-marco-MiniLM-L-6-v2, 22M params) re-ranks all 20 query-candidate pairs simultaneously. The top 5 by CE score above the 0.70 hard threshold enter the auction (retrieval query sketched after this comparison).
Index type IVFFlat (lists=20)
Embedding model text-embedding-3-large
Vector dimensions 1024-d, cosine
Similarity threshold ≥ 0.70
Auction ceiling (top-K) 5 agents
Fallback Tag-based preflight
✗ Legacy broadcast
O(N)
Blind Broadcast
Every intent is sent to all N agents. Each agent evaluates the full prompt to decide whether to respond. At 201 agents × ~800 tokens per eval, every intent costs 160,800 evaluation tokens before a single useful response is generated.
Agents evaluated 201 (all)
Token cost per intent ~160,800
Scales with agent pool? Yes — linear
Domain separation None
Cross-domain noise All 201 agents
Fallback N/A
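
The retrieval step above reduces to a single pgvector query. A minimal sketch, assuming node-postgres and the agent_manifests.capability_embedding column named in the claims audit below; everything else is illustrative:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // reads standard PG* env vars

// Top-20 candidate retrieval over the IVFFlat index. pgvector's <=> operator
// returns cosine distance, so similarity = 1 - distance; 0.5 is the soft
// threshold that gates entry into cross-encoder re-ranking.
async function knnCandidates(intentEmbedding: number[]) {
  const { rows } = await pool.query(
    `SELECT agent_id,
            1 - (capability_embedding <=> $1::vector) AS similarity
       FROM agent_manifests
      WHERE 1 - (capability_embedding <=> $1::vector) >= 0.5
      ORDER BY capability_embedding <=> $1::vector
      LIMIT 20`,
    [`[${intentEmbedding.join(',')}]`],
  );
  return rows as { agent_id: string; similarity: number }[];
}
```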

Production numbers, right now

Sourced from /api/routing/stats and /api/agents/manifest-coverage on page load.

— ms
Avg Routing Latency
Mean KNN query time over the last 24 hours. Includes embedding lookup + cosine distance scan over the IVFFlat index.
📡 live · /api/routing/stats
🤖
—
Agents Deployed
Active agents in the registry with capability embeddings. Auction ceiling is 5 agents per intent — no matter how large the pool grows.
📡 live · agent_manifests
🏁
21.1s
E2E Time-to-Result
End-to-end: intent submitted → first agent result delivered. Covers KNN routing, auction, execution, and response serialization. Verified benchmark run Apr 28 2026.
✓ verified · Apr 28 2026
🧬
— %
Embedding Coverage
Percentage of agents with capability_embedding populated. 100% = all agents are routable via KNN. Galaxy Phase complete.
📡 live · manifest-coverage
💰
— %
Token Savings vs Broadcast
Tokens avoided by routing to 5 agents instead of all 201. Conservative estimate: 800 tokens/agent eval. Scales with query volume.
📡 live · token_savings
🔒
RLS + RBAC
Tenant Isolation
Row-Level Security enforced at the Postgres layer via the app.current_tenant session variable (sketched below). RBAC: owner > admin > member > viewer. Cross-tenant attempts logged severity=high.
✓ verified · migration 037
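
A minimal sketch of the RLS pattern described above, assuming node-postgres; the intents table and tenant_id column are illustrative, while app.current_tenant and the role ladder come from the card:

```ts
import type { PoolClient } from 'pg';

// Policy shape in the style of migration 037 ('intents' / 'tenant_id' assumed).
const policySql = `
  ALTER TABLE intents ENABLE ROW LEVEL SECURITY;
  CREATE POLICY tenant_isolation ON intents
    USING (tenant_id = current_setting('app.current_tenant')::uuid);
`;

// Per request: pin the transaction to the caller's tenant before any query.
// set_config(..., true) scopes the value to the current transaction only.
async function withTenant(client: PoolClient, tenantId: string): Promise<void> {
  await client.query('BEGIN');
  await client.query(`SELECT set_config('app.current_tenant', $1, true)`, [tenantId]);
}
```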

Routing query percentiles

p50 / p90 / p99 for KNN queries over the 24h window. Data pulled live from routing_query_log; the underlying query is sketched after the stats below.

KNN Routing Latency
p50
p90
p99
Query Volume (24h)
Total queries
Matched (≥ 0.70)
Rejected (below threshold)
Avg agents matched
Avg similarity (matched)
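
The percentile block maps onto a single Postgres aggregate. A sketch using the routing_query_log columns (query_ms, created_at) cited in the claims audit:

```ts
// p50 / p90 / p99 over the 24h window in one pass.
const percentileSql = `
  SELECT percentile_cont(0.50) WITHIN GROUP (ORDER BY query_ms) AS p50,
         percentile_cont(0.90) WITHIN GROUP (ORDER BY query_ms) AS p90,
         percentile_cont(0.99) WITHIN GROUP (ORDER BY query_ms) AS p99,
         COUNT(*) AS total_queries
    FROM routing_query_log
   WHERE created_at > NOW() - INTERVAL '24 hours'
`;
```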

Cross-encoder re-ranker

KNN retrieves top-20 candidates (0.5 soft threshold). A 22M-parameter cross-encoder (ms-marco-MiniLM-L-6-v2) attends to both query and candidate simultaneously — producing significantly more accurate relevance scores than bi-encoder cosine similarity alone. Top-5 by CE score enter the auction.
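
A minimal sketch of the pair-scoring step with transformers.js, following the standard cross-encoder pattern for this model; how candidate text is assembled from agent manifests is an assumption:

```ts
import { AutoTokenizer, AutoModelForSequenceClassification } from '@xenova/transformers';

const modelId = 'Xenova/ms-marco-MiniLM-L-6-v2';
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForSequenceClassification.from_pretrained(modelId);

// Score all 20 intent/candidate pairs in one batch; higher logit = more relevant.
async function ceScores(intent: string, candidates: string[]): Promise<number[]> {
  const inputs = tokenizer(new Array(candidates.length).fill(intent), {
    text_pair: candidates,
    padding: true,
    truncation: true,
  });
  const { logits } = await model(inputs);
  return Array.from(logits.data as Float32Array);
}
```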

Re-ranker Performance (24h)
Status  
Re-ranked queries
KNN fallback count
Re-rank rate
Quality Lift Metrics
Avg inference latency
Avg score divergence
CE promotions (24h)
KNN retrieve width top-20 @ 0.5
Routing Pipeline
Intent embed → KNN top-20 (0.5 soft) → ⚡ CE re-rank (22M ONNX) → hard 0.7 gate → top-5 → auction
Fallback: KNN top-5 with hard 0.7 threshold — activated if CE model unavailable or inference > 500ms
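
The 500ms budget and hard-threshold fallback can be enforced with a simple race. A sketch reusing ceScores from above; squashing raw CE logits through a sigmoid before the 0.70 gate is an assumption, since logits are unbounded:

```ts
type Candidate = { agentId: string; text: string; similarity: number };

const sigmoid = (x: number) => 1 / (1 + Math.exp(-x));

// Race CE inference against the 500ms budget; on timeout or model failure,
// fall back to KNN top-5 with the hard 0.70 cosine threshold.
async function rankWithFallback(intent: string, candidates: Candidate[]) {
  const budget = new Promise<number[]>((_, reject) =>
    setTimeout(() => reject(new Error('ce-timeout')), 500),
  );
  try {
    const logits = await Promise.race([ceScores(intent, candidates.map(c => c.text)), budget]);
    return candidates
      .map((c, i) => ({ ...c, score: sigmoid(logits[i]) })) // assumed squash
      .filter(c => c.score >= 0.7)
      .sort((a, b) => b.score - a.score)
      .slice(0, 5);
  } catch {
    return candidates
      .filter(c => c.similarity >= 0.7)
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, 5);
  }
}
```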

Similarity score distribution

The 0.70 threshold works because domain-matched agents cluster above it — and cross-domain agents land well below. These are the verified production ranges.

Same-domain routing
When intent and agent share a domain (e.g. both e-commerce)
Exact-domain agent 1.00
Near-domain agent 0.78–0.92
Adjacent capability 0.70–0.78
0.70 routing threshold
Cross-domain isolation
When intent and agent are from different domains — proves 0.70 threshold holds
Unrelated agent (typical) 0.27–0.44
Loosely related agent 0.44–0.62
False-positive risk zone 0.62–0.70
0.70 threshold — cross-domain agents land left of this line

Cost per intent: semantic vs. broadcast

At 201 agents, naive broadcast is economically unviable. The per-intent token delta widens with every agent you add (see the sketch below).

✗ Broadcast (201 agents)
160,800
tokens per intent
201 agents × 800 tokens/eval
SAVED
✓ Semantic KNN (top-5)
4,000
tokens per intent
5 agents × 800 tokens/eval
Tokens saved (24h)
Per-intent savings
156,800
Matched intents (24h)
Token model
Conservative, 800 tok/eval
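
The whole token model reduces to one subtraction and one multiplication; a sketch of the figures above:

```ts
// Conservative token model from the cards above.
const TOKENS_PER_EVAL = 800;
const AGENT_POOL = 201; // 201 × 800 = 160,800 tokens per broadcast intent
const TOP_K = 5;        //   5 × 800 =   4,000 tokens per routed intent

const perIntentSavings = (AGENT_POOL - TOP_K) * TOKENS_PER_EVAL; // 156,800

// 24h savings scale linearly with matched intent volume.
const savings24h = (matchedIntents: number) => matchedIntents * perIntentSavings;
```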

What's been proven

Infrastructure shipped, tested, and verified in production — not roadmap items.

🔍
KNN Vector Search
Live
🧬
200/200 Embeddings
Live
🔒
RLS + RBAC
Verified
📊
Routing Audit Log
Live
IVFFlat Index
Verified
🛡️
0.70 Threshold Gate
Verified
🔄
Tag-based Fallback
Verified
📍
Tenant Isolation
Verified
Cross-Encoder Re-ranker
Live

All public claims, audited

Every metric on this page with its source, verification method, and status.

Claim Source Verification Status
Semantic KNN, O(log N) Architecture IVFFlat index on agent_manifests.capability_embedding, migration 038 Verified
200 agents with embeddings /api/agents/manifest-coverage SELECT COUNT(*) FROM agent_manifests WHERE capability_embedding IS NOT NULL Live
5-agent auction ceiling Routing config TOP_K = 5 constant in routes/routing-stats.js Verified
Avg routing latency (live) /api/routing/stats AVG(query_ms) FROM routing_query_log WHERE created_at > NOW() - '24h' Live
~97.5% token savings vs broadcast /api/routing/stats Conservative model: 800 tok/eval × (201 − 5) agents = 156,800 saved per intent Live
E2E time-to-result: 21.1s Benchmark run Apr 28 2026 Measured: intent submit → first proposal delivered. Instance d9gdw. Cold start. Verified
Same-domain similarity > 0.70 Embedding validation Startup backfill verified domain-matched agents at 1.0 cosine; near-domain 0.78–0.92 Verified
Cross-domain floor 0.27–0.44 Embedding validation Unrelated domain agent pairs tested at deploy — all below 0.70 threshold Verified
RLS + RBAC tenant isolation Security layer PostgreSQL RLS via app.current_tenant. Cross-tenant attempts logged severity=high. Migration 037. Verified
100% embedding coverage /api/agents/manifest-coverage galaxy_phase_ready = true, missing_embeddings = 0 Live
Cross-encoder re-ranker active (KNN top-20 → CE → top-5) /api/routing/stats reranker.reranker_enabled = true, model Xenova/ms-marco-MiniLM-L-6-v2, migration 039 Live

Toroidal feedback — live mid-execution telemetry

Instead of a linear Input → Process → Output → Learn pipeline, the routing engine runs a continuous refinement loop. Agent execution signals flow back into routing weights in real time, every 10 seconds, so the next incoming task benefits from the current execution cycle's performance data (a minimal EMA sketch follows the loop steps below).

Live feedback torus
Telemetry flow
Routing weight update
Active execution
Torus cycle
Agents tracked
Active now
Avg EMA score
EMA parameters
Window: 60s
α (alpha): 0.154
Batch interval: 10s
Auction weight: 5%
Performance threshold
Agents below 0.35 threshold
Below-threshold agents trigger backup pre-spin on next execution.
🔒 Telemetry is system-internal only. Never exposed in user-facing transparency cards. Resets on every deploy.
Per-agent live EMA scores Loading…
No agent telemetry yet — scores appear here once executions run.
📥
Intent arrives
Live EMA scores applied as 5% multiplier in auction
Agent executes
Gate pass rate, latency captured during execution
📡
Signals batch
Every 10s, EMA update computed. α = 0.154 (60s window)
🔄
Loop closes
Next task routing sees updated scores — torus completes
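
A minimal sketch of the loop's update step, using the α, batch, and auction-weight parameters shown above. How the 5% multiplier enters the bid is an assumption (one plausible reading):

```ts
// Parameters from the panel above.
const ALPHA = 0.154;         // EMA smoothing over the 60s window
const AUCTION_WEIGHT = 0.05; // live score contributes 5% at auction

// Every 10s batch: fold the cycle's signal into the running score.
const emaUpdate = (prev: number, signal: number) =>
  ALPHA * signal + (1 - ALPHA) * prev;

// Assumed reading of "5% multiplier": scale the base bid ±5% around a
// neutral EMA of 0.5 (ema = 1 → ×1.05, ema = 0 → ×0.95).
const adjustedBid = (baseBid: number, ema: number) =>
  baseBid * (1 + 2 * AUCTION_WEIGHT * (ema - 0.5));
```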

Agent Memory Intelligence

MemGPT/Letta tiered memory across all 201 agents. Routing quality compounds over time: agents with relevant recall/archival memories bid stronger and execute with richer context. Where the 60s EMA above resets on every deploy, this tier persists; it is the durable replacement.

Total Memories
Across all tiers
Recall Tier
30-day indexed executions
Archival Tier
Permanent domain patterns
Written 24h
New entries consolidated
Memory Tier Distribution
Loading…
In-process Core (RAM)
Active agents with core context loaded this session
Memory Health
Avg relevance weight
Permanent memories (no decay)
Short-lived (decay < 30d)
Bid boost active
Agents with recall/archival hits receive up to +12% bid score at auction. Memory compounds over executions.
Top Agents by Memory Depth
Agent Recall Archival Avg Accesses Last Written Status
Loading memory data…
Memory Pipeline
Intent received → Query recall + archival (200ms bounded) → Memory boost +0–12% bid → Winner pre-paged into core → LLM executes with memory context → Consolidate into recall + archival
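
A sketch of the 200ms-bounded lookup and capped bid boost from the pipeline above; the store API and the relevance-scaling rule are assumptions:

```ts
type MemoryHit = { weight: number };
interface MemoryStore {
  query(agentId: string, intent: string): Promise<MemoryHit[]>; // recall + archival tiers
}
declare const memoryStore: MemoryStore; // illustrative

// 200ms-bounded lookup: on timeout the agent simply bids without a boost.
async function memoryBoost(agentId: string, intent: string): Promise<number> {
  const budget = new Promise<MemoryHit[]>(resolve => setTimeout(() => resolve([]), 200));
  const hits = await Promise.race([memoryStore.query(agentId, intent), budget]);
  // Cap at +12% bid score, scaled by the best hit's relevance weight (assumed rule).
  const relevance = hits.reduce((best, h) => Math.max(best, h.weight), 0);
  return Math.min(0.12, 0.12 * relevance);
}
```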

Coalition Adjacency Graph

Agent pairs ranked by joint success rate. Pairs that consistently succeed together earn an adjacency bonus in the multi-objective auction — compounding on top of the coalition synergy bonus.

🌸
No pairs with 3+ joint tasks yet — graph builds as multi-agent executions complete.

Shapley Value Coalition Attribution

Marginal contribution decomposition for multi-agent coalitions. Each agent's Shapley value (φ) represents its actual contribution to coalition success — preventing free-riders from earning undeserved trust boosts. Monte Carlo approximation, 100 permutation samples.
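
For reference, the Monte Carlo estimator is short: average each agent's marginal contribution over random join orders. A sketch, assuming some coalition-success metric v(S):

```ts
// Shapley values via permutation sampling (100 samples, per above).
function shapley(
  agents: string[],
  value: (coalition: Set<string>) => number, // v(S): coalition success metric (assumed)
  samples = 100,
): Map<string, number> {
  const phi = new Map(agents.map(a => [a, 0]));
  for (let s = 0; s < samples; s++) {
    const perm = [...agents];
    for (let i = perm.length - 1; i > 0; i--) { // Fisher–Yates shuffle
      const j = Math.floor(Math.random() * (i + 1));
      [perm[i], perm[j]] = [perm[j], perm[i]];
    }
    const coalition = new Set<string>();
    let prev = value(coalition);
    for (const a of perm) {
      coalition.add(a);
      const curr = value(coalition);
      phi.set(a, phi.get(a)! + (curr - prev)); // marginal contribution
      prev = curr;
    }
  }
  for (const a of agents) phi.set(a, phi.get(a)! / samples);
  return phi; // free-rider alert fires when φ < 0.1 for 10+ consecutive runs
}
```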

Attributions (30d)
coalition executions attributed
Free-Rider Alerts
φ < 0.1 for 10+ consecutive runs
Avg φ (top contributor)
rolling 30-day Shapley value
Trust Weighting
● Active
Ev-Trust updates weighted by φ
No coalition attributions yet — Shapley values are computed after each 2+ agent coalition execution completes.

Self-Healing Router (§3A)

When an agent fails, the coalition graph routes around it. These counters show live heal events — every failure that was absorbed before reaching the user.

Heals (24h)
total heal events
Heal Success Rate
healed without user impact
Latency Added p99
ms added on heal path
Router Status
● Active
SELF_HEALING_ROUTER_ENABLED

Query the live API

Every metric on this page is publicly accessible. Hit the endpoints directly and verify the numbers yourself.

GET https://sturna.ai/api/routing/stats
GET https://sturna.ai/api/agents/manifest-coverage
GET https://sturna.ai/api/routing/coalitions
GET https://sturna.ai/api/routing/shapley-stats
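
For example, from any ESM script:

```ts
// Pull every public endpoint and print the raw JSON.
const base = 'https://sturna.ai';
const endpoints = [
  '/api/routing/stats',
  '/api/agents/manifest-coverage',
  '/api/routing/coalitions',
  '/api/routing/shapley-stats',
];
for (const path of endpoints) {
  const res = await fetch(base + path);
  console.log(path, await res.json());
}
```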

SwarmScore Reputation & Clustered Bidding

Two-dimensional agent reputation: technical execution quality + commercial reliability (consistency). Per-cluster MARL coordinators apply Q-learning bid multipliers [0.5×–1.5×] to stabilize auction convergence across 15–25 agent clusters. Based on Jin et al. 2018 + the SwarmScore V1 protocol.
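
A sketch of the profile quadrants and the multiplier clamp; the 0.5 quadrant cut and score scale are assumptions, while the profile names and the [0.5×, 1.5×] range come from the panels below:

```ts
type Profile = 'ideal' | 'volatile' | 'mediocre' | 'poor';

// Two-dimensional reputation → profile quadrant (0.5 cut assumed).
function profile(technical: number, reliability: number): Profile {
  if (technical >= 0.5) return reliability >= 0.5 ? 'ideal' : 'volatile';
  return reliability >= 0.5 ? 'mediocre' : 'poor';
}

// Q-learned bid multipliers are clamped to the stated [0.5×, 1.5×] range.
const clampMultiplier = (q: number) => Math.min(1.5, Math.max(0.5, q));
```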

Avg SwarmScore
Across active agents
Avg Technical
Execution success rate
Avg Reliability
Consistency over 50 intents
Ideal Agents
High tech + high reliability
SwarmScore Profile Distribution
Loading…
Ideal — high tech, high reliability → bid boost
Volatile — high tech, low reliability → conservative
Mediocre — low tech, high reliability → diversify routing
Poor — low tech, low reliability → suppressed bids
Top 20 Agents by SwarmScore
# Agent Class Technical Reliability SwarmScore Bid × Profile
Loading…
MARL Cluster Convergence · clusters
Bid variance ↓ = convergence
Cluster Class Agents Avg SwarmScore Bid Variance 7d Win Diversity Wins 7d
Loading…

Stress Test & Routing Accuracy

147K+ multi-agent rollout trajectories are ingested as synthetic load; their initial intents are then replayed through the Galaxy routing layer at up to 100 req/sec. Agreement with the trajectories' ground-truth agent selections measures routing accuracy. Synthetic traffic is flagged source: 'miroverse' and excluded from real user metrics.
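
The agreement metric itself is simple: the fraction of replayed intents where the live router's top pick matches the trajectory's recorded selection. A sketch; field names are illustrative:

```ts
type Trajectory = { intent: string; groundTruthAgent: string };

async function routingAgreement(
  rows: Trajectory[],
  route: (intent: string) => Promise<string>, // live Galaxy routing call (assumed)
): Promise<number> {
  let matches = 0;
  for (const t of rows) {
    if ((await route(t.intent)) === t.groundTruthAgent) matches++;
  }
  return matches / rows.length;
}
```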

Trajectories Loaded
MiroVerse-v0.1 rows ingested
Stress Test Runs
Total executions recorded
Routing Agreement
vs MiroVerse ground truth
Peak IPS (latest run)
Intents routed per second
Routing Accuracy by Complexity
Loading…
Grounding Gate Training Data
Loading…
Recent Stress Test Runs
Synthetic traffic excluded from real metrics
Run ID Status Rate Count Succeeded Avg ms p99 ms Actual IPS Duration Started
Loading…
§2A M-RMARL · Live

Meta-Learning Routing

KNN-first architecture with batch RL policy updates. Reward = (star/5) × (1000/latency) × gateBonus. Weights evolve every 100 intents per category, never replacing KNN, just sharpening it (see the sketch after the parameter list below).

A/B Split
exploitation / exploration
Agents w/ Learned Weights
weight_offset ≠ 0
Total Batches Run
100-intent policy episodes
Pending in Buffer
feedback events queued
Slow Path Triggers
failure-detected batches (7d)
TOP WEIGHT MOVERS
Loading…
CATEGORY POLICY EVOLUTION
Loading…
ROUTING DRIFT BY CATEGORY
Avg |weight_offset| per category — higher = more divergence from static KNN baseline
Loading…
Batch size: 100 intents
Learning rate: 0.10
Decay factor: 0.95 / batch
Max Δ/batch: ±5%
Offset range: −30% … +30%
Failure threshold: >40% gate fail → slow path
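
A sketch of one batch update using the parameters above; the advantage signal and the exact composition of decay and step are assumptions:

```ts
// Reward shape as stated: (star/5) × (1000/latency) × gateBonus.
const reward = (star: number, latencyMs: number, gateBonus: number) =>
  (star / 5) * (1000 / latencyMs) * gateBonus;

const LR = 0.10, DECAY = 0.95, MAX_STEP = 0.05, MAX_OFFSET = 0.30;

// One 100-intent batch: decay the old offset, step toward the batch's
// advantage signal, clamp the step to ±5% and the total to ±30%.
function updateOffset(offset: number, batchAdvantage: number): number {
  const step = Math.max(-MAX_STEP, Math.min(MAX_STEP, LR * batchAdvantage));
  const next = offset * DECAY + step;
  return Math.max(-MAX_OFFSET, Math.min(MAX_OFFSET, next));
}
```
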
Research: M-RMARL · "Rethinking Predictive Modeling for LLM Routing"
🏭

Industry Vertical Benchmarks

Loading…

Domain-specialist agents across four verticals. Each class is scored against 10 benchmark intents; the acceptance gate is >80% first-attempt success.

Loading vertical data…
Benchmark Intents
Loading…