Reproducible Benchmarks

Numbers you can verify.
Not numbers we invented.

Three scenarios. Four stacks. Ten runs each. Every result is signed with HMAC-SHA256 and available as downloadable JSON. If Sturna underperforms on a metric, you'll see it here.

87.5% · Sturna legal citation accuracy
91% · Sturna RIA compliance accuracy
1.8 s · Sturna avg latency (vs 3.1 s for competitors)
3 scenarios: Legal · Supply chain · Compliance
ℹ️ Methodology: All stacks use gpt-4o-2024-11-20 at temperature 0 with identical prompts per question. Competitor framework overhead is not modeled (runtime API access was not available for LangChain/AutoGen/CrewAI); instead, each competitor column uses baseline GPT-4o with that framework's typical system prompt. Sturna results use the production multi-agent routing system. Full methodology ↓
Benchmark Results

Three scenarios, four stacks

Each scenario uses its own eval rubric (keyword coverage, citation grounding, correct determination). Every number links to a signed JSON evidence file.

Supply chain
Pack size: 30 tasks
Categories: EOQ, routing, forecasting, risk, +more
Ground truth: optimal solutions
Metric · Sturna · LangChain + GPT-4o · AutoGen + GPT-4o · CrewAI + GPT-4o
Results populate from the signed evidence files.
* Competitor columns show baseline model performance with each framework's typical system prompt. Full framework overhead not modeled. See evidence notes β†’
Compliance (Reg S-P)
Pack size: 40 documents
Split: 20 compliant + 20 non-compliant
Ground truth: Reg S-P determinations
Metric · Sturna · LangChain + GPT-4o · AutoGen + GPT-4o · CrewAI + GPT-4o
Results populate from the signed evidence files.
* Competitor columns show baseline model performance with each framework's typical system prompt. Full framework overhead not modeled. See evidence notes β†’

πŸ” Reproduce this

Clone the repo, install dependencies, run one command. Results land in public/benchmarks-vs/evidence/ as HMAC-signed JSON files. Requires OPENAI_API_KEY.

# Clone and install
git clone https://github.com/Polsia-Inc/octomind.git
cd octomind && npm install

# Set API key
export OPENAI_API_KEY=sk-...

# Run all scenarios (n=10 runs each, ~45 min)
node scripts/benchmarks/run-all.js --runs=10

# Or run a single scenario quickly
node scripts/benchmarks/run-all.js --scenario=legal-citation --runs=3
Expected output: signed JSON in public/benchmarks-vs/evidence/
πŸ“‹
manifest.json
All run metadata
πŸ’»
Harness source
scripts/benchmarks/
πŸ“
Eval rubrics
eval/rubrics/
Methodology

What we measured and how

πŸ“ Eval rubric

Each question/task/document is scored against a defined rubric: keyword coverage (% of required concepts present), citation grounding (% of claims traceable to source material), and hallucination detection (presence of specific false numerical claims). Rubrics are open source in scripts/benchmarks/eval/rubrics/.
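
For illustration, here is a minimal Node.js sketch of the keyword-coverage metric, assuming a rubric entry with a requiredKeywords list; the function name, field names, and file path are hypothetical, and the authoritative scoring code lives in scripts/benchmarks/eval/rubrics/ and the harness.

// Hypothetical sketch of keyword coverage: fraction of required concepts present in an answer.
// The rubric field name (requiredKeywords) is illustrative, not the real rubric schema.
const fs = require('fs');

function keywordCoverage(answer, requiredKeywords) {
  const text = answer.toLowerCase();
  const hits = requiredKeywords.filter((kw) => text.includes(kw.toLowerCase()));
  return hits.length / requiredKeywords.length; // 0.0 to 1.0
}

// Example usage with an assumed rubric shape { id, requiredKeywords: [...] }:
const rubric = JSON.parse(fs.readFileSync('rubric.json', 'utf8'));
const score = keywordCoverage('The reorder point depends on lead-time demand...', rubric.requiredKeywords);
console.log(`keyword coverage: ${(score * 100).toFixed(1)}%`);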

πŸ”’ Evidence signing

Every result file is signed with HMAC-SHA256. The signature covers the full JSON payload (excluding the signature field itself). Key: BENCHMARK_SIGNING_KEY env var (or derived from ADMIN_SECRET). Verification code in scripts/benchmarks/sign.js.
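
As a hedged illustration of the verification step (not the code in scripts/benchmarks/sign.js), the sketch below assumes a top-level signature field and that the signed bytes are the JSON serialization of the payload with that field removed; the real canonicalization may differ.

// Sketch of HMAC-SHA256 evidence verification. The "signature" field name, the filename,
// and the JSON.stringify canonicalization are assumptions; see scripts/benchmarks/sign.js
// for the authoritative logic.
const crypto = require('crypto');
const fs = require('fs');

function verifyEvidence(path, key) {
  const { signature, ...payload } = JSON.parse(fs.readFileSync(path, 'utf8'));
  const expected = crypto
    .createHmac('sha256', key)
    .update(JSON.stringify(payload))
    .digest('hex');
  return expected === signature; // production code should use a constant-time compare
}

const key = process.env.BENCHMARK_SIGNING_KEY;
console.log(verifyEvidence('public/benchmarks-vs/evidence/example-run.json', key)); // hypothetical filename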

βš™οΈ Parameters

Model: gpt-4o-2024-11-20 for all stacks. Temperature: 0. Max tokens: 1024. N=10 runs per scenario per stack. Hardware: single Node.js process, sequential runs. Token costs at GPT-4o pricing: $2.50/1M input, $10.00/1M output tokens.
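
A minimal sketch of how those fixed parameters and prices combine, assuming the official openai Node client; the wrapper functions and prompt content here are illustrative, not the harness code.

// Shared generation parameters per the methodology above (illustrative wrapper, not harness code).
const OpenAI = require('openai');
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function ask(systemPrompt, question) {
  return client.chat.completions.create({
    model: 'gpt-4o-2024-11-20',
    temperature: 0,
    max_tokens: 1024,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: question },
    ],
  });
}

// Cost per call at GPT-4o pricing: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
function costUSD(usage) {
  return (usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1e6;
}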

🀝 Honesty policy

If Sturna underperforms on any metric, those results are published as-is. Credibility comes from accuracy, not from the scoreboard. The supply chain scenario shows the smallest gap β€” competitors are within 10 points. We don't cherry-pick scenarios where we look best.
