Expert Template Guide¶
Expert Templates define how MoE Sovereign routes, processes, and synthesises requests. Every setting in a template has a concrete effect on latency, quality, and cost. This guide explains each field and illustrates it with three reference templates shipped with the system.
Template Fields Reference¶
Top-Level Fields¶
| Field | Type | Default | Description |
|---|---|---|---|
judge_model |
"model@endpoint" |
global judge | LLM that synthesises all expert outputs into the final response. Also used as the tool-calling model in the two-phase Kanban handler. |
planner_model |
"model@endpoint" |
global planner | LLM that decomposes the user request into 1–4 expert subtasks. Should be fast and instruction-following (not a thinking model). |
judge_prompt |
string | built-in | Overrides the default synthesis instruction. Use to enforce output format (e.g. ANSWER: <value> for benchmarks). |
planner_prompt |
string | built-in | Overrides the routing ruleset. Add domain-specific routing rules and MCP tool hints here. |
enable_cache |
bool | true |
When true, semantically similar past responses are retrieved from the L1 vector cache before starting the pipeline. Disable for benchmarks to prevent cross-question contamination. |
enable_graphrag |
bool | true |
Injects relevant facts from the Neo4j knowledge graph into expert context. Disable when the graph adds noise (e.g. short factual GAIA questions). |
enable_web_research |
bool | true |
Triggers a SearXNG web search when the query complexity score exceeds the routing threshold. |
max_agentic_rounds |
int | unlimited | Hard cap on pipeline re-planning cycles. 3 prevents unbounded 49-minute loops while allowing enough rounds for multi-step research. |
force_think |
bool | false |
Forces the extended thinking mode for ALL experts, regardless of their individual thinking_mode. |
history_max_turns |
int | 0 (global) |
Maximum conversation turns injected into expert context. 0 = use global default. |
graphrag_max_chars |
int | 0 (global) |
Maximum characters from GraphRAG context injection. |
The experts Object¶
Each key in experts defines a domain specialist. The planner routes
subtasks to these categories by name.
"experts": {
"category_name": {
"context_window": 262144,
"system_prompt": "...",
"thinking_mode": true,
"models": [
{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
]
}
}
Expert Sub-Fields¶
| Field | Type | Description |
|---|---|---|
context_window |
int | Actual model context window in tokens. The pipeline uses this to truncate input so the model never receives more tokens than it can process. Setting this incorrectly (too high) causes silent truncation by the model itself; too low wastes available context. |
system_prompt |
string | The expert's role instruction. For creation tasks (games, services) use "generate complete, runnable code". For review tasks use "identify OWASP issues". |
thinking_mode |
bool | When false, the pipeline prepends /no_think to the user message — this disables qwen3's Extended Thinking, reducing latency by 10–20×. Set false for factual lookup categories (general, data_analysis), true for categories that benefit from reasoning (reasoning, science, security_analysis). |
models[].role |
"always" / "primary" / "fallback" |
always = only this model is used. primary / fallback = Two-Tier escalation: T1 (primary) runs first; if confidence is low, T2 (fallback) runs. |
Standard Expert Categories¶
| Category | Routing Use Case | Thinking |
|---|---|---|
vision |
Image, chart, diagram, chess position analysis | off |
math |
Calculations, formulas (precision_tools has priority for exact arithmetic) | on |
code_reviewer |
Code creation AND review, full implementations, OWASP analysis | off |
reasoning |
Logic puzzles, formal deduction, probability trees | on |
science |
Physics, chemistry, biology, earth science | on |
data_analysis |
Statistics, pandas/SQL, data wrangling | off |
creative_writing |
Poems, stories, constrained text generation | off |
devops_sre |
Kubernetes, Docker, CI/CD, incident diagnosis | off |
security_analysis |
CVEs, threat modelling, zero-trust hardening | on |
translation |
Cross-lingual tasks, cultural nuance | off |
legal_advisor |
Statutory interpretation, case law | on |
medical_consult |
Evidence-based medical knowledge | on |
long_context |
Documents/codebases >32K tokens, full conversation histories | off |
general |
Factual lookups, summaries, everything else | off |
Reference Template 1: moe-n04rtx-specialist¶
All experts run exclusively on N04-RTX local hardware. Maximum quality through specialised local models. No external API dependencies.
{
"judge_model": "qwen3.6:35b@N04-RTX",
"planner_model": "phi4:14b-fp16@N04-RTX",
"enable_cache": true,
"enable_graphrag": true,
"enable_web_research": true,
"experts": {
"vision": {
"context_window": 128000,
"system_prompt": "Visual analysis expert. Analyze the image carefully and answer precisely.",
"thinking_mode": false,
"models": [{"model": "qwen2.5vl:32b", "endpoint": "N04-RTX", "role": "always"}]
},
"math": {
"context_window": 32768,
"system_prompt": "Mathematics expert. Use MCP calculate for numeric results. Show rigorous steps.",
"thinking_mode": true,
"models": [
{"model": "mathstral:7b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
]
},
"code_reviewer": {
"context_window": 32768,
"system_prompt": "Senior full-stack engineer. For creation: complete, runnable code — no placeholders. For review: OWASP Top 10, performance, style.",
"thinking_mode": false,
"models": [{"model": "qwen2.5-coder:32b", "endpoint": "N04-RTX", "role": "always"}]
},
"reasoning": {
"context_window": 262144,
"system_prompt": "Analytical reasoning expert. Formal logic, identify hidden assumptions.",
"thinking_mode": true,
"models": [
{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
]
},
"science": {
"context_window": 262144,
"system_prompt": "Natural scientist. Precise terminology, correct units, formulas, scientific consensus.",
"thinking_mode": true,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "always"}]
},
"devops_sre": {
"context_window": 393216,
"system_prompt": "DevOps/SRE expert. Production-ready configs, Kubernetes, Docker, incident diagnosis.",
"thinking_mode": false,
"models": [
{"model": "devstral-small-2:24b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "qwen3-coder:30b", "endpoint": "N04-RTX", "role": "fallback"}
]
},
"security_analysis": {
"context_window": 131072,
"system_prompt": "Cybersecurity expert. CVEs, threat modelling, zero-trust hardening.",
"thinking_mode": true,
"models": [{"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "always"}]
},
"translation": {
"context_window": 131072,
"system_prompt": "Professional translator. Preserve meaning, tone, cultural nuance.",
"thinking_mode": false,
"models": [
{"model": "translategemma:27b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "mistral-small:24b", "endpoint": "N04-RTX", "role": "fallback"}
]
},
"long_context": {
"context_window": 1048576,
"system_prompt": "Long-context expert with 1M token window. Process full codebases, long documents, histories >32K tokens. Extract key facts, structured summaries.",
"thinking_mode": false,
"models": [
{"model": "mistral-nemo:12b", "endpoint": "N04-RTX", "role": "primary"},
{"model": "llama3-gradient:8b", "endpoint": "N04-RTX", "role": "fallback"}
]
},
"general": {
"context_window": 262144,
"system_prompt": "Knowledgeable assistant. Factual, accurate, concise.",
"thinking_mode": false,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "always"}]
}
}
}
Design decisions:
devops_sreusesdevstral-small-2:24b(Mistral's DevOps-specialised model) with 393K context — ideal for processing full Dockerfiles, Helm charts, and IaC configs in a single pass.reasoningandsciencehavethinking_mode: truebecause extended Chain-of-Thought is essential for correctness in logic and STEM domains.creative_writingusessolar-pro:22bwhich only has 4K native context — the template must reflect this (context_window: 4096) to prevent the pipeline from sending overlong prompts.long_contextusesmistral-nemo:12b(1M context) for tasks where the full codebase or document history must be read without chunking.
Reference Template 2: moe-quality-optimal¶
Production-grade template. Best-per-category model from the Constellation Benchmark (N04-RTX). GraphRAG and web research enabled for maximum accuracy.
{
"judge_model": "qwen3.6:35b@N04-RTX",
"planner_model": "phi4:14b-fp16@N04-RTX",
"enable_cache": true,
"enable_graphrag": true,
"enable_web_research": true,
"experts": {
"general": {"context_window": 262144, "thinking_mode": false,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
"math": {"context_window": 32768, "thinking_mode": true,
"models": [{"model": "mathstral:7b", "endpoint": "N04-RTX"}]},
"code_reviewer": {"context_window": 32768, "thinking_mode": false,
"models": [{"model": "qwen2.5-coder:32b", "endpoint": "N04-RTX"}]},
"reasoning": {"context_window": 262144, "thinking_mode": true,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
"science": {"context_window": 262144, "thinking_mode": true,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
"data_analysis": {"context_window": 262144, "thinking_mode": false,
"models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
"creative_writing": {"context_window": 4096, "thinking_mode": false,
"models": [{"model": "solar-pro:22b", "endpoint": "N04-RTX"}]},
"devops_sre": {"context_window": 393216, "thinking_mode": false,
"models": [{"model": "devstral-small-2:24b", "endpoint": "N04-RTX"},
{"model": "qwen3-coder:30b", "endpoint": "N04-RTX", "role": "fallback"}]},
"security_analysis": {"context_window": 131072, "thinking_mode": true,
"models": [{"model": "deepseek-r1:32b", "endpoint": "N04-RTX"}]},
"long_context": {"context_window": 1048576, "thinking_mode": false,
"models": [{"model": "mistral-nemo:12b", "endpoint": "N04-RTX"},
{"model": "llama3-gradient:8b", "endpoint": "N04-RTX", "role": "fallback"}]}
}
}
Reference Template 3: moe-openrouter-free¶
All experts route through the openrouterai user connection using free-tier models. No local GPU required. 13 specialised categories.
{
"judge_model": "nvidia/nemotron-3-ultra-550b-a55b:free@openrouterai",
"planner_model": "meta-llama/llama-3.3-70b-instruct:free@openrouterai",
"enable_cache": true,
"enable_graphrag": false,
"enable_web_research": true,
"experts": {
"vision": {"models": [{"model": "nvidia/nemotron-nano-12b-v2-vl:free", "endpoint": "openrouterai"}]},
"math": {"models": [{"model": "nvidia/nemotron-3-ultra-550b-a55b:free", "endpoint": "openrouterai"},
{"model": "openai/gpt-oss-120b:free", "endpoint": "openrouterai", "role": "fallback"}]},
"code_reviewer": {"models": [{"model": "poolside/laguna-m.1:free", "endpoint": "openrouterai"},
{"model": "qwen/qwen3-coder:free", "endpoint": "openrouterai", "role": "fallback"}]},
"reasoning": {"models": [{"model": "nvidia/nemotron-3-ultra-550b-a55b:free", "endpoint": "openrouterai"},
{"model": "nousresearch/hermes-3-llama-3.1-405b:free", "endpoint": "openrouterai", "role": "fallback"}]},
"science": {"models": [{"model": "nvidia/nemotron-3-super-120b-a12b:free", "endpoint": "openrouterai"}]},
"data_analysis": {"models": [{"model": "qwen/qwen3-next-80b-a3b-instruct:free", "endpoint": "openrouterai"}]},
"creative_writing": {"models": [{"model": "z-ai/glm-4.5-air:free", "endpoint": "openrouterai"}]},
"devops_sre": {"models": [{"model": "poolside/laguna-xs.2:free", "endpoint": "openrouterai"}]},
"security_analysis": {"models": [{"model": "nvidia/nemotron-3.5-content-safety:free", "endpoint": "openrouterai"}]},
"translation": {"context_window": 131072,
"models": [{"model": "moonshotai/kimi-k2.6:free", "endpoint": "openrouterai"}]},
"legal_advisor": {"models": [{"model": "openai/gpt-oss-120b:free", "endpoint": "openrouterai"}]},
"medical_consult": {"models": [{"model": "openai/gpt-oss-120b:free", "endpoint": "openrouterai"}]},
"long_context": {"context_window": 131072,
"models": [{"model": "moonshotai/kimi-k2.6:free", "endpoint": "openrouterai"}]},
"general": {"models": [{"model": "google/gemma-4-31b-it:free", "endpoint": "openrouterai"}]}
}
}
Design decisions:
enable_graphrag: false— GraphRAG is only useful if the local Neo4j knowledge graph is populated with domain facts. For a pure API setup without local infra, it adds latency without benefit.- Judge is
nemotron-3-ultra-550B— the largest available free model (550B parameters). Synthesis quality scales with judge size. - Planner is
llama-3.3-70b— fast and reliable at JSON instruction following, appropriate for the simple decomposition task. translationuseskimi-k2.6with 131K context — Moonshot AI's model excels at multilingual tasks and has the largest free-tier context window.- Rate limiting: Free models route through shared upstream providers (Venice, etc.). A 429 response triggers automatic
retry_afterbackoff (29s by default) without marking the endpoint as degraded.
Claude Code Profile: openrouterai-deep¶
CC Profiles extend Expert Templates for Claude Code CLI use. They add a dedicated Tooling LLM for function-calling alongside the MoE pipeline.
{
"tool_model": "moonshotai/kimi-k2.6:free",
"tool_endpoint": "openrouterai",
"moe_mode": "moe_orchestrated",
"expert_template_id": "<id-of-moe-openrouter-free>",
"tool_max_tokens": 8192,
"reasoning_max_tokens": 32768,
"tool_choice": "required",
"stream_think": false,
"system_prompt_prefix": "You are a principal-level software engineer. Deliver production-grade solutions with security analysis, test coverage strategy, and architectural rationale."
}
| Field | Description |
|---|---|
tool_model |
The LLM that handles Claude Code's function calls (read_file, bash, write_file). Must support OpenAI function-calling format reliably. kimi-k2.6 is chosen for its 131K context and strong agentic capabilities. |
moe_mode |
moe_orchestrated = tool calls go directly to tool_model; content generation uses the MoE expert pipeline. native = bypass MoE entirely (Claude Code controls its own model). |
expert_template_id |
Links to the Expert Template that handles content requests. The Tooling LLM handles structure; the MoE pipeline handles knowledge. |
tool_max_tokens |
Max tokens for a single tool response. 8192 is sufficient for most file reads and bash outputs. |
tool_choice |
required forces Claude Code to always call a tool rather than generating freeform text, which prevents hallucinated file paths. |
Key Design Principles¶
1. Context Window Accuracy¶
The context_window field must match the actual model context window:
qwen3.6:35b → 262,144 (256K — MoE architecture, not 32K!)
devstral-small-2 → 393,216 (384K)
deepseek-r1:32b → 131,072 (128K)
solar-pro:22b → 4,096 (4K — CRITICAL: must not exceed this!)
mistral-nemo:12b → 1,024,000 (1M)
The pipeline uses context_window to truncate task input. An incorrect value
either wastes context (too low) or causes silent mid-sentence truncation by
the model itself (too high).
2. Thinking Mode Granularity¶
thinking_mode: false prepends /no_think to the user message, suppressing
qwen3's Extended Thinking. This reduces generation by 10–20× tokens for
categories that do not need deep reasoning:
Off → general, data_analysis, devops_sre, translation, creative_writing
On → reasoning, science, security_analysis, legal_advisor, medical_consult
Never disable thinking for security or reasoning experts — correctness suffers.
3. Two-Tier Expert Escalation¶
role: "primary" / role: "fallback" implements T1/T2 escalation. The
primary runs first; if its confidence score is low, the fallback runs and
the judge merges both answers:
"math": {
"models": [
{"model": "mathstral:7b", "role": "primary"}, // T1: fast, specialised
{"model": "deepseek-r1:32b", "role": "fallback"} // T2: larger, more thorough
]
}
Use T1 for speed, T2 for quality. The cost of T2 is only paid when needed.
4. The Long-Context Expert¶
For tasks exceeding 32K tokens (full codebases, long documents, 1000-turn
conversation histories), route to long_context:
"long_context": {
"context_window": 1048576, // 1M tokens
"models": [
{"model": "mistral-nemo:12b"}, // 1,024,000 tokens
{"model": "llama3-gradient:8b"} // 1,048,576 tokens
]
}
These models process the entire input without chunking or summarisation — important for code review of large repositories where global dependencies matter.