Benchmark-Driven Routing
for OpenClaw

Stop defaulting to one model for everything. The OpenMark router uses real evaluation data from your own tasks to route each prompt to the best model — with fallbacks, cost savings, and full visibility.

See It In Action

Watch how the router classifies tasks and picks optimal models from your benchmark data.

What the Router Does

Most routing solutions use "simple vs complex" heuristics or generic capability tiers to pick a model. That's a guess dressed as a system. The OpenMark router for OpenClaw takes a different approach: it uses your actual benchmark results to make every routing decision.

You benchmark your recurring tasks on OpenMark AI, export the results, and the router uses that data to match incoming prompts to the best-performing model for each task category. No keyword matching, no complexity scoring — just measured performance on your real work.

Key difference: the router doesn't guess which model is "good enough" — it knows, because you already tested it on your own task with deterministic scoring.

What You See

You send a prompt. The router classifies it, finds the matching benchmark, picks the winner, and the routed model answers — all in a single turn. A routing card shows what happened:

Routed to gpt-5.4-nano (openai) — Content Creation Benchmark
Benchmark: 92.9% score  |  $0.002731/call  |  30.28s

Why this route: better score than gemini-3.1-pro, 97.6% cheaper, 4.2x faster
Over 10K calls: $27.31 vs $1148.36

Strategy: balanced  |  Benchmark data: fresh

[actual response from gpt-5.4-nano follows here...]

The routed model generates the real reply. The classifier only identifies the task category — it never produces the user-visible answer.
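The savings figures on the routing card are plain arithmetic over the benchmark's per-call costs. A minimal sketch, using the example numbers from the card above (the baseline per-call cost is implied by the 10K-call total):

```python
# Per-call costs from the example routing card.
routed_cost = 0.002731    # gpt-5.4-nano, $/call (from the card)
baseline_cost = 0.114836  # gemini-3.1-pro, $/call (implied by $1148.36 / 10K)

calls = 10_000
routed_total = routed_cost * calls      # projected spend with the routed model
baseline_total = baseline_cost * calls  # projected spend with the baseline

savings_pct = (1 - routed_cost / baseline_cost) * 100
print(f"${routed_total:.2f} vs ${baseline_total:.2f}, {savings_pct:.1f}% cheaper")
```

Running this reproduces the card's "$27.31 vs $1148.36" and "97.6% cheaper" figures.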

How It Works

The plugin uses an internal two-phase architecture. To the user, it looks like a single reply.

Phase 1
Classify & Route — A lightweight LLM call (through OpenClaw's gateway) classifies the user message against your benchmark category names. The deterministic routing engine then ranks available models by your chosen strategy and selects the winner plus fallbacks. This takes ~60ms after classification.
Phase 2
Generate — OpenClaw immediately runs the real reply with the routed model, using full session context, system prompt, and conversation history. Authentication and streaming are handled by OpenClaw.

No direct provider API calls from the plugin. Classification goes through the OpenClaw gateway. Provider authentication and model execution stay inside OpenClaw — you don't hand API keys to the router.
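The two-phase flow above can be sketched as follows. This is an illustrative outline, not the plugin's actual API: names like `classify_task` and `rank_models` are hypothetical, the classifier is stubbed out, and the real engine uses the 6-step cascade sort described later rather than a single-key sort.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    fallbacks: list[str]

def classify_task(message: str, categories: list[str]) -> str:
    """Phase 1a: a lightweight LLM call through OpenClaw's gateway maps the
    message to one of your benchmark category names. Stubbed here."""
    return categories[0]  # placeholder for the gateway classification call

def rank_models(benchmark_rows: list[dict], strategy: str = "balanced") -> Route:
    """Phase 1b: deterministic ranking over your benchmark rows.
    A simple sort by score stands in for the real cascade sort."""
    ranked = sorted(benchmark_rows, key=lambda r: r["score"], reverse=True)
    return Route(model=ranked[0]["model"],
                 fallbacks=[r["model"] for r in ranked[1:]])

def route(message: str, benchmarks: dict[str, list[dict]]) -> Route:
    category = classify_task(message, list(benchmarks))
    return rank_models(benchmarks[category])

# Phase 2 (generation) then runs inside OpenClaw with the routed model and
# full session context; the plugin never calls a provider API directly.
```

The key property is that Phase 1 is cheap and deterministic after classification, while Phase 2 is ordinary OpenClaw model execution.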

Quick Start

  1. Benchmark your recurring tasks on OpenMark AI — test across 100+ models with deterministic scoring.
  2. Export — click Export → OpenClaw on the Results tab. The CSV includes dual model keys, scores, costs, and metadata.
  3. Install — run openclaw plugins install openmark-router and restart the gateway.
  4. Import — place CSVs in the benchmarks directory, or use the local dashboard's import flow. The router activates automatically.

That's it. The router registers as a provider, sets openmark/auto as your default model, and starts routing. Unmatched tasks pass through to your original default model unchanged.

Five Routing Strategies

Choose how the router ranks models from your benchmark data:

balanced

Weighted composite: accuracy (40%) + cost-efficiency (20%) + speed (25%) + stability (15%). Best for most workloads.

best_score

Highest benchmark accuracy regardless of cost or speed.

best_cost_efficiency

Best accuracy per dollar among viable models. Models below the viability floor are excluded.

best_under_budget

Highest score within your cost ceiling. Set cost_ceiling in config.

best_under_latency

Highest score within your latency ceiling. Set latency_ceiling_s in config.

All strategies use a 6-step cascade sort and a viability floor (max(top_score - 15pp, top_score * 0.5)) to exclude underperforming models. Fallback models are ranked from the same benchmark data.
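The viability floor and the `balanced` weights can be sketched directly from the numbers above. The floor formula and the 40/20/25/15 weights come from this page; how the engine normalizes cost-efficiency, speed, and stability onto a common scale is not specified, so the 0-100 inputs below are assumptions:

```python
def viability_floor(top_score: float) -> float:
    # From the docs: max(top_score - 15pp, top_score * 0.5), scores in percent.
    return max(top_score - 15.0, top_score * 0.5)

def balanced_score(accuracy: float, cost_eff: float,
                   speed: float, stability: float) -> float:
    # 'balanced' composite: accuracy 40%, cost-efficiency 20%,
    # speed 25%, stability 15%. Inputs assumed pre-normalized to 0-100.
    return 0.40 * accuracy + 0.20 * cost_eff + 0.25 * speed + 0.15 * stability

# Hypothetical normalized benchmark rows for two models.
models = {
    "gpt-5.4-nano":   dict(accuracy=92.9, cost_eff=98.0, speed=90.0, stability=95.0),
    "gemini-3.1-pro": dict(accuracy=91.0, cost_eff=40.0, speed=55.0, stability=97.0),
}

floor = viability_floor(max(m["accuracy"] for m in models.values()))  # 77.9
viable = {name: m for name, m in models.items() if m["accuracy"] >= floor}
winner = max(viable, key=lambda name: balanced_score(**viable[name]))
```

With a top score of 92.9%, the floor is max(77.9, 46.45) = 77.9, so both models stay viable and the composite decides the winner.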

Why Custom Benchmarks Matter for Routing

Every routing solution that uses generic categorization breaks in practice. "Email tasks" lumps cold outreach, complaint triage, and legal notices together — but model performance varies dramatically across these subtypes.

Generic benchmarks are equally broad. MMLU, Arena Elo, and HumanEval test general capabilities. A model scoring well on "writing" tells you nothing about your email templates with your tone requirements.

When you benchmark on OpenMark AI, you test models on your specific task, with your prompts, against your criteria. That's the data the router needs to make decisions you can trust.

Local Dashboard

The router ships with a local dashboard at http://127.0.0.1:2098/dashboard, which includes the benchmark CSV import flow used in the Quick Start steps above.

Works With Your Existing Setup

The router detects which providers your OpenClaw install can use and filters benchmark candidates accordingly. Direct provider keys are preferred; if a model's direct provider isn't available but OpenRouter is, the router falls back to the OpenRouter key for that row.

Single-provider setups still benefit — you can benchmark and route within one provider's model lineup. The router is also useful with subscriptions, hosted access, or OAuth-backed providers, as long as OpenClaw can execute the model IDs involved.

Frequently Asked Questions

What is the OpenClaw Model Router?

An open-source plugin for OpenClaw that routes prompts to the best AI model for each task category, using benchmark results from OpenMark AI. It uses a lightweight classifier to identify the task, then deterministically selects the optimal model from your data.

Do I need API keys from multiple providers?

No. The router works with whatever providers your OpenClaw install already has configured. Single-provider setups still benefit by routing within that provider's lineup. OpenRouter fallback is supported when benchmark rows include OR keys.

How much can I save?

Savings depend on your workload. Teams overusing flagship models for routine tasks commonly see 50-80% cost reduction. The router doesn't guarantee specific savings — it routes based on your measured benchmark data.

Is it open source?

Yes. Apache-2.0 licensed on GitHub. The Python routing engine uses only stdlib — no pip dependencies, no external network calls for model ranking.

Does it handle my API keys?

No. All model execution goes through OpenClaw's existing auth and gateway. The plugin never asks for or directly uses provider API keys. It reads local benchmark CSVs and communicates with OpenClaw's local gateway for classification.

Start Routing With Real Data

Benchmark your recurring tasks, export for OpenClaw, and let the router handle model selection.
100+ models, deterministic scoring, real cost tracking.
50 free credits — no API keys, no setup.

Start Benchmarking — Free →