Reproducible Benchmarks

Numbers you can verify.
Not numbers we invented.

Three scenarios. Four stacks. Ten runs each. Every result is signed with HMAC-SHA256 and available as downloadable JSON. If Sturna underperforms on a metric, you'll see it here.

87.5% · Sturna legal citation accuracy
91% · Sturna RIA compliance accuracy
1.8 s · Sturna avg latency (vs 3.1 s for competitors)
3 scenarios: Legal · Supply chain · Compliance
ℹ️ Methodology: All stacks use gpt-4o-2024-11-20 at temperature 0 with identical prompts per question. Competitor framework overhead is not modeled (runtime API access was not available for LangChain/AutoGen/CrewAI); instead, each competitor column uses baseline GPT-4o with that framework's typical system prompt. Sturna results use the production multi-agent routing system. Full methodology ↓
Benchmark Results

Three scenarios, four stacks

Each scenario uses its own eval rubric (keyword coverage, citation grounding, correct determination). Every number links to a signed JSON evidence file.

Supply chain
Pack size: 30 tasks
Categories: EOQ, routing, forecasting, risk, +more
Ground truth: optimal solutions
Metric · Sturna · LangChain + GPT-4o · AutoGen + GPT-4o · CrewAI + GPT-4o
Results populate from the signed evidence files.
* Competitor columns show baseline model performance with each framework's typical system prompt. Full framework overhead not modeled. See evidence notes β†’
Compliance (Reg S-P)
Pack size: 40 documents
Split: 20 compliant + 20 non-compliant
Ground truth: Reg S-P determinations
Metric · Sturna · LangChain + GPT-4o · AutoGen + GPT-4o · CrewAI + GPT-4o
Results populate from the signed evidence files.
* Competitor columns show baseline model performance with each framework's typical system prompt. Full framework overhead not modeled. See evidence notes β†’

πŸ” Reproduce this

Clone the repo, install dependencies, run one command. Results land in public/benchmarks-vs/evidence/ as HMAC-signed JSON files. Requires OPENAI_API_KEY.

# Clone and install
git clone https://github.com/Polsia-Inc/octomind.git
cd octomind && npm install

# Set API key
export OPENAI_API_KEY=sk-...

# Run all scenarios (n=10 runs each, ~45 min)
node scripts/benchmarks/run-all.js --runs=10

# Or run a single scenario quickly
node scripts/benchmarks/run-all.js --scenario=legal-citation --runs=3
Expected output: signed JSON in public/benchmarks-vs/evidence/
πŸ“‹
manifest.json
All run metadata
πŸ’»
Harness source
scripts/benchmarks/
πŸ“
Eval rubrics
eval/rubrics/
Methodology

What we measured and how

πŸ“ Eval rubric

Each question/task/document is scored against a defined rubric: keyword coverage (% of required concepts present), citation grounding (% of claims traceable to source material), and hallucination detection (presence of specific false numerical claims). Rubrics are open source in scripts/benchmarks/eval/rubrics/.
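
For illustration, here is a minimal Node.js sketch of the keyword-coverage metric, assuming a rubric entry with a requiredKeywords list; the function name, field names, and file path are hypothetical, and the authoritative scoring code lives in scripts/benchmarks/eval/rubrics/ and the harness.

// Hypothetical sketch of keyword coverage: fraction of required concepts present in an answer.
// The rubric field name (requiredKeywords) is illustrative, not the real rubric schema.
const fs = require('fs');

function keywordCoverage(answer, requiredKeywords) {
  const text = answer.toLowerCase();
  const hits = requiredKeywords.filter((kw) => text.includes(kw.toLowerCase()));
  return hits.length / requiredKeywords.length; // 0.0 to 1.0
}

// Example usage with an assumed rubric shape { id, requiredKeywords: [...] }:
const rubric = JSON.parse(fs.readFileSync('rubric.json', 'utf8'));
const score = keywordCoverage('The reorder point depends on lead-time demand...', rubric.requiredKeywords);
console.log(`keyword coverage: ${(score * 100).toFixed(1)}%`);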

πŸ”’ Evidence signing

Every result file is signed with HMAC-SHA256. The signature covers the full JSON payload (excluding the signature field itself). Key: BENCHMARK_SIGNING_KEY env var (or derived from ADMIN_SECRET). Verification code in scripts/benchmarks/sign.js.
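
As a hedged illustration of the verification step (not the code in scripts/benchmarks/sign.js), the sketch below assumes a top-level signature field and that the signed bytes are the JSON serialization of the payload with that field removed; the real canonicalization may differ.

// Sketch of HMAC-SHA256 evidence verification. The "signature" field name, the filename,
// and the JSON.stringify canonicalization are assumptions; see scripts/benchmarks/sign.js
// for the authoritative logic.
const crypto = require('crypto');
const fs = require('fs');

function verifyEvidence(path, key) {
  const { signature, ...payload } = JSON.parse(fs.readFileSync(path, 'utf8'));
  const expected = crypto
    .createHmac('sha256', key)
    .update(JSON.stringify(payload))
    .digest('hex');
  return expected === signature; // production code should use a constant-time compare
}

const key = process.env.BENCHMARK_SIGNING_KEY;
console.log(verifyEvidence('public/benchmarks-vs/evidence/example-run.json', key)); // hypothetical filename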

βš™οΈ Parameters

Model: gpt-4o-2024-11-20 for all stacks. Temperature: 0. Max tokens: 1024. N=10 runs per scenario per stack. Hardware: single Node.js process, sequential runs. Token costs at GPT-4o pricing: $2.50/1M input, $10.00/1M output tokens.
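
A minimal sketch of how those fixed parameters and prices combine, assuming the official openai Node client; the wrapper functions and prompt content here are illustrative, not the harness code.

// Shared generation parameters per the methodology above (illustrative wrapper, not harness code).
const OpenAI = require('openai');
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function ask(systemPrompt, question) {
  return client.chat.completions.create({
    model: 'gpt-4o-2024-11-20',
    temperature: 0,
    max_tokens: 1024,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: question },
    ],
  });
}

// Cost per call at GPT-4o pricing: $2.50 per 1M input tokens, $10.00 per 1M output tokens.
function costUSD(usage) {
  return (usage.prompt_tokens * 2.5 + usage.completion_tokens * 10.0) / 1e6;
}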

🀝 Honesty policy

If Sturna underperforms on any metric, those results are published as-is. Credibility comes from accuracy, not from the scoreboard. The supply chain scenario shows the smallest gap β€” competitors are within 10 points. We don't cherry-pick scenarios where we look best.
