Expert Template Guide¶

Expert Templates define how MoE Sovereign routes, processes, and synthesises requests. Every setting in a template has a concrete effect on latency, quality, and cost. This guide explains each field and illustrates it with three reference templates shipped with the system.

Template Fields Reference¶

Top-Level Fields¶

Field	Type	Default	Description
`judge_model`	`"model@endpoint"`	global judge	LLM that synthesises all expert outputs into the final response. Also used as the tool-calling model in the two-phase Kanban handler.
`planner_model`	`"model@endpoint"`	global planner	LLM that decomposes the user request into 1–4 expert subtasks. Should be fast and instruction-following (not a thinking model).
`judge_prompt`	string	built-in	Overrides the default synthesis instruction. Use to enforce output format (e.g. `ANSWER: <value>` for benchmarks).
`planner_prompt`	string	built-in	Overrides the routing ruleset. Add domain-specific routing rules and MCP tool hints here.
`enable_cache`	bool	`true`	When `true`, semantically similar past responses are retrieved from the L1 vector cache before starting the pipeline. Disable for benchmarks to prevent cross-question contamination.
`enable_graphrag`	bool	`true`	Injects relevant facts from the Neo4j knowledge graph into expert context. Disable when the graph adds noise (e.g. short factual GAIA questions).
`enable_web_research`	bool	`true`	Triggers a SearXNG web search when the query complexity score exceeds the routing threshold.
`max_agentic_rounds`	int	unlimited	Hard cap on pipeline re-planning cycles. A cap of `3` prevents runaway 49-minute loops while still allowing enough rounds for multi-step research.
`force_think`	bool	`false`	Forces the extended thinking mode for ALL experts, regardless of their individual `thinking_mode`.
`history_max_turns`	int	`0` (global)	Maximum conversation turns injected into expert context. `0` = use global default.
`graphrag_max_chars`	int	`0` (global)	Maximum characters from GraphRAG context injection.
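
The `model@endpoint` notation and the "`0` = use global default" convention from the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual loader: `GLOBAL_DEFAULTS`, `split_model_ref`, and `resolve_settings` are hypothetical names, and the global values for `history_max_turns` and `graphrag_max_chars` are placeholders.

```python
from typing import Any

# Hypothetical global defaults; the real values live in the system configuration.
GLOBAL_DEFAULTS: dict[str, Any] = {
    "enable_cache": True,
    "enable_graphrag": True,
    "enable_web_research": True,
    "force_think": False,
    "history_max_turns": 20,     # placeholder value
    "graphrag_max_chars": 8000,  # placeholder value
}

# Fields for which 0 means "fall back to the global default".
ZERO_MEANS_GLOBAL = {"history_max_turns", "graphrag_max_chars"}


def split_model_ref(ref: str) -> tuple[str, str]:
    """Split "model@endpoint" into (model, endpoint) at the last "@"."""
    model, sep, endpoint = ref.rpartition("@")
    if not sep or not model or not endpoint:
        raise ValueError(f"expected 'model@endpoint', got {ref!r}")
    return model, endpoint


def resolve_settings(template: dict[str, Any]) -> dict[str, Any]:
    """Overlay template values on the global defaults."""
    resolved = dict(GLOBAL_DEFAULTS)
    for key, value in template.items():
        if key in ZERO_MEANS_GLOBAL and value == 0:
            continue  # 0 keeps the global default
        if key in GLOBAL_DEFAULTS:
            resolved[key] = value
    return resolved
```

For a benchmark run, a template would typically pass `{"enable_cache": False}` and leave the remaining fields at their defaults.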

The `experts` Object¶

Each key in `experts` defines a domain specialist. The planner routes subtasks to these categories by name.

"experts": {
  "category_name": {
    "context_window": 262144,
    "system_prompt": "...",
    "thinking_mode": true,
    "models": [
      {"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "primary"},
      {"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
    ]
  }
}

Expert Sub-Fields¶

Field	Type	Description
`context_window`	int	Actual model context window in tokens. The pipeline uses this to truncate input so the model never receives more tokens than it can process. A value that is too high causes silent truncation by the model itself; a value that is too low wastes available context.
`system_prompt`	string	The expert's role instruction. For creation tasks (games, services) use "generate complete, runnable code". For review tasks use "identify OWASP issues".
`thinking_mode`	bool	When `false`, the pipeline prepends `/no_think` to the user message — this disables qwen3's Extended Thinking, reducing latency by 10–20×. Set `false` for factual lookup categories (`general`, `data_analysis`), `true` for categories that benefit from reasoning (`reasoning`, `science`, `security_analysis`).
`models[].role`	`"always"` / `"primary"` / `"fallback"`	`always` = only this model is used. `primary` / `fallback` = Two-Tier escalation: T1 (primary) runs first; if confidence is low, T2 (fallback) runs.
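
A quick sanity check for a single expert entry might look like the sketch below. It only encodes rules stated in the table above (positive `context_window`, the three known roles). The rule that `always` implies a single model is an assumption drawn from the phrase "only this model is used", and `validate_expert` is a hypothetical helper, not part of the system.

```python
VALID_ROLES = {"always", "primary", "fallback"}


def validate_expert(name: str, expert: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # context_window is optional here, but must be a positive int when given.
    window = expert.get("context_window")
    if window is not None and (not isinstance(window, int) or window <= 0):
        problems.append(f"{name}: context_window must be a positive integer")

    models = expert.get("models", [])
    if not models:
        problems.append(f"{name}: at least one model is required")

    # A missing role is tolerated because the reference templates omit it in places.
    roles = [m.get("role") for m in models]
    for role in roles:
        if role is not None and role not in VALID_ROLES:
            problems.append(f"{name}: unknown role {role!r}")

    # Assumption: "always" means the model is the only entry in the list.
    if "always" in roles and len(models) > 1:
        problems.append(f"{name}: 'always' implies a single model")

    return problems
```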

Standard Expert Categories¶

Category	Routing Use Case	Thinking
`vision`	Image, chart, diagram, chess position analysis	off
`math`	Calculations, formulas (precision_tools has priority for exact arithmetic)	on
`code_reviewer`	Code creation AND review, full implementations, OWASP analysis	off
`reasoning`	Logic puzzles, formal deduction, probability trees	on
`science`	Physics, chemistry, biology, earth science	on
`data_analysis`	Statistics, pandas/SQL, data wrangling	off
`creative_writing`	Poems, stories, constrained text generation	off
`devops_sre`	Kubernetes, Docker, CI/CD, incident diagnosis	off
`security_analysis`	CVEs, threat modelling, zero-trust hardening	on
`translation`	Cross-lingual tasks, cultural nuance	off
`legal_advisor`	Statutory interpretation, case law	on
`medical_consult`	Evidence-based medical knowledge	on
`long_context`	Documents/codebases >32K tokens, full conversation histories	off
`general`	Factual lookups, summaries, everything else	off

Reference Template 1: `moe-n04rtx-specialist`¶

All experts run exclusively on N04-RTX local hardware. Maximum quality through specialised local models. No external API dependencies.

{
  "judge_model": "qwen3.6:35b@N04-RTX",
  "planner_model": "phi4:14b-fp16@N04-RTX",
  "enable_cache": true,
  "enable_graphrag": true,
  "enable_web_research": true,
  "experts": {
    "vision": {
      "context_window": 128000,
      "system_prompt": "Visual analysis expert. Analyze the image carefully and answer precisely.",
      "thinking_mode": false,
      "models": [{"model": "qwen2.5vl:32b", "endpoint": "N04-RTX", "role": "always"}]
    },
    "math": {
      "context_window": 32768,
      "system_prompt": "Mathematics expert. Use MCP calculate for numeric results. Show rigorous steps.",
      "thinking_mode": true,
      "models": [
        {"model": "mathstral:7b",    "endpoint": "N04-RTX", "role": "primary"},
        {"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "code_reviewer": {
      "context_window": 32768,
      "system_prompt": "Senior full-stack engineer. For creation: complete, runnable code — no placeholders. For review: OWASP Top 10, performance, style.",
      "thinking_mode": false,
      "models": [{"model": "qwen2.5-coder:32b", "endpoint": "N04-RTX", "role": "always"}]
    },
    "reasoning": {
      "context_window": 262144,
      "system_prompt": "Analytical reasoning expert. Formal logic, identify hidden assumptions.",
      "thinking_mode": true,
      "models": [
        {"model": "qwen3.6:35b",     "endpoint": "N04-RTX", "role": "primary"},
        {"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "science": {
      "context_window": 262144,
      "system_prompt": "Natural scientist. Precise terminology, correct units, formulas, scientific consensus.",
      "thinking_mode": true,
      "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "always"}]
    },
    "devops_sre": {
      "context_window": 393216,
      "system_prompt": "DevOps/SRE expert. Production-ready configs, Kubernetes, Docker, incident diagnosis.",
      "thinking_mode": false,
      "models": [
        {"model": "devstral-small-2:24b", "endpoint": "N04-RTX", "role": "primary"},
        {"model": "qwen3-coder:30b",      "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "security_analysis": {
      "context_window": 131072,
      "system_prompt": "Cybersecurity expert. CVEs, threat modelling, zero-trust hardening.",
      "thinking_mode": true,
      "models": [{"model": "deepseek-r1:32b", "endpoint": "N04-RTX", "role": "always"}]
    },
    "translation": {
      "context_window": 131072,
      "system_prompt": "Professional translator. Preserve meaning, tone, cultural nuance.",
      "thinking_mode": false,
      "models": [
        {"model": "translategemma:27b", "endpoint": "N04-RTX", "role": "primary"},
        {"model": "mistral-small:24b",  "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "long_context": {
      "context_window": 1048576,
      "system_prompt": "Long-context expert with 1M token window. Process full codebases, long documents, histories >32K tokens. Extract key facts, structured summaries.",
      "thinking_mode": false,
      "models": [
        {"model": "mistral-nemo:12b",    "endpoint": "N04-RTX", "role": "primary"},
        {"model": "llama3-gradient:8b",  "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "general": {
      "context_window": 262144,
      "system_prompt": "Knowledgeable assistant. Factual, accurate, concise.",
      "thinking_mode": false,
      "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX", "role": "always"}]
    }
  }
}

Design decisions:

`devops_sre` uses `devstral-small-2:24b` (Mistral's model for agentic software engineering) with 393K context, which suits processing full Dockerfiles, Helm charts, and IaC configs in a single pass.
`reasoning` and `science` set `thinking_mode: true` because extended chain-of-thought is essential for correctness in logic and STEM domains.
`creative_writing` (omitted from the listing above; see Reference Template 2) uses `solar-pro:22b`, which has only 4K native context. The template must reflect this (`context_window: 4096`) to prevent the pipeline from sending overlong prompts.
`long_context` uses `mistral-nemo:12b` (1M context) for tasks where the full codebase or document history must be read without chunking.

Reference Template 2: `moe-quality-optimal`¶

Production-grade template. Each category uses its best-scoring model from the Constellation Benchmark (N04-RTX). GraphRAG and web research are enabled for maximum accuracy.

{
  "judge_model": "qwen3.6:35b@N04-RTX",
  "planner_model": "phi4:14b-fp16@N04-RTX",
  "enable_cache": true,
  "enable_graphrag": true,
  "enable_web_research": true,
  "experts": {
    "general":    {"context_window": 262144, "thinking_mode": false,
                   "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
    "math":       {"context_window": 32768,  "thinking_mode": true,
                   "models": [{"model": "mathstral:7b", "endpoint": "N04-RTX"}]},
    "code_reviewer": {"context_window": 32768, "thinking_mode": false,
                   "models": [{"model": "qwen2.5-coder:32b", "endpoint": "N04-RTX"}]},
    "reasoning":  {"context_window": 262144, "thinking_mode": true,
                   "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
    "science":    {"context_window": 262144, "thinking_mode": true,
                   "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
    "data_analysis": {"context_window": 262144, "thinking_mode": false,
                   "models": [{"model": "qwen3.6:35b", "endpoint": "N04-RTX"}]},
    "creative_writing": {"context_window": 4096, "thinking_mode": false,
                   "models": [{"model": "solar-pro:22b", "endpoint": "N04-RTX"}]},
    "devops_sre": {"context_window": 393216, "thinking_mode": false,
                   "models": [{"model": "devstral-small-2:24b", "endpoint": "N04-RTX"},
                               {"model": "qwen3-coder:30b",      "endpoint": "N04-RTX", "role": "fallback"}]},
    "security_analysis": {"context_window": 131072, "thinking_mode": true,
                   "models": [{"model": "deepseek-r1:32b", "endpoint": "N04-RTX"}]},
    "long_context": {"context_window": 1048576, "thinking_mode": false,
                   "models": [{"model": "mistral-nemo:12b",   "endpoint": "N04-RTX"},
                               {"model": "llama3-gradient:8b", "endpoint": "N04-RTX", "role": "fallback"}]}
  }
}

Reference Template 3: `moe-openrouter-free`¶

All experts route through the `openrouterai` user connection using free-tier models. No local GPU required. 13 specialised categories plus the `general` catch-all.

{
  "judge_model":   "nvidia/nemotron-3-ultra-550b-a55b:free@openrouterai",
  "planner_model": "meta-llama/llama-3.3-70b-instruct:free@openrouterai",
  "enable_cache": true,
  "enable_graphrag": false,
  "enable_web_research": true,
  "experts": {
    "vision":        {"models": [{"model": "nvidia/nemotron-nano-12b-v2-vl:free", "endpoint": "openrouterai"}]},
    "math":          {"models": [{"model": "nvidia/nemotron-3-ultra-550b-a55b:free", "endpoint": "openrouterai"},
                                  {"model": "openai/gpt-oss-120b:free",              "endpoint": "openrouterai", "role": "fallback"}]},
    "code_reviewer": {"models": [{"model": "poolside/laguna-m.1:free",  "endpoint": "openrouterai"},
                                  {"model": "qwen/qwen3-coder:free",     "endpoint": "openrouterai", "role": "fallback"}]},
    "reasoning":     {"models": [{"model": "nvidia/nemotron-3-ultra-550b-a55b:free", "endpoint": "openrouterai"},
                                  {"model": "nousresearch/hermes-3-llama-3.1-405b:free", "endpoint": "openrouterai", "role": "fallback"}]},
    "science":       {"models": [{"model": "nvidia/nemotron-3-super-120b-a12b:free", "endpoint": "openrouterai"}]},
    "data_analysis": {"models": [{"model": "qwen/qwen3-next-80b-a3b-instruct:free", "endpoint": "openrouterai"}]},
    "creative_writing": {"models": [{"model": "z-ai/glm-4.5-air:free",              "endpoint": "openrouterai"}]},
    "devops_sre":    {"models": [{"model": "poolside/laguna-xs.2:free",  "endpoint": "openrouterai"}]},
    "security_analysis": {"models": [{"model": "nvidia/nemotron-3.5-content-safety:free", "endpoint": "openrouterai"}]},
    "translation":   {"context_window": 131072,
                      "models": [{"model": "moonshotai/kimi-k2.6:free",  "endpoint": "openrouterai"}]},
    "legal_advisor": {"models": [{"model": "openai/gpt-oss-120b:free",   "endpoint": "openrouterai"}]},
    "medical_consult": {"models": [{"model": "openai/gpt-oss-120b:free", "endpoint": "openrouterai"}]},
    "long_context":  {"context_window": 131072,
                      "models": [{"model": "moonshotai/kimi-k2.6:free",  "endpoint": "openrouterai"}]},
    "general":       {"models": [{"model": "google/gemma-4-31b-it:free",  "endpoint": "openrouterai"}]}
  }
}

Design decisions:

`enable_graphrag: false`: GraphRAG is only useful if the local Neo4j knowledge graph is populated with domain facts. For a pure API setup without local infrastructure, it adds latency without benefit.
The judge is `nemotron-3-ultra-550b-a55b`, the largest available free model (550B parameters). Synthesis quality scales with judge size.
The planner is `llama-3.3-70b`, which is fast and reliable at JSON instruction following and therefore appropriate for the simple decomposition task.
`translation` uses `kimi-k2.6` with 131K context. Moonshot AI's model excels at multilingual tasks and has the largest free-tier context window.
Rate limiting: free models route through shared upstream providers (Venice, etc.). A 429 response triggers automatic `retry_after` backoff (29 s by default) without marking the endpoint as degraded.
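
The rate-limit behaviour described above can be sketched as a small retry loop. This is an illustration under assumptions, not the system's client: `call_with_backoff` and the `send` callable (returning status, headers, body) are hypothetical, the `Retry-After` header is read as plain seconds only, and endpoint health tracking is left out entirely, so nothing is ever marked as degraded.

```python
import time
from typing import Callable

DEFAULT_RETRY_AFTER_S = 29.0  # default backoff named in the text


def call_with_backoff(
    send: Callable[[], tuple[int, dict, str]],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call send(); on HTTP 429, wait and retry up to max_retries times.

    send() is a hypothetical callable returning (status, headers, body).
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        # Anything other than 429, or the last allowed attempt, ends the loop.
        if status != 429 or attempt == max_retries:
            return status, body
        raw = headers.get("Retry-After")
        try:
            delay = float(raw) if raw is not None else DEFAULT_RETRY_AFTER_S
        except ValueError:
            # The HTTP-date form of Retry-After is not handled in this sketch.
            delay = DEFAULT_RETRY_AFTER_S
        sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.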

Claude Code Profile: `openrouterai-deep`¶

Claude Code (CC) Profiles extend Expert Templates for use with the Claude Code CLI. They add a dedicated Tooling LLM for function calling alongside the MoE pipeline.

{
  "tool_model":         "moonshotai/kimi-k2.6:free",
  "tool_endpoint":      "openrouterai",
  "moe_mode":           "moe_orchestrated",
  "expert_template_id": "<id-of-moe-openrouter-free>",
  "tool_max_tokens":    8192,
  "reasoning_max_tokens": 32768,
  "tool_choice":        "required",
  "stream_think":       false,
  "system_prompt_prefix": "You are a principal-level software engineer. Deliver production-grade solutions with security analysis, test coverage strategy, and architectural rationale."
}

Field	Description
`tool_model`	The LLM that handles Claude Code's function calls (`read_file`, `bash`, `write_file`). Must support the OpenAI function-calling format reliably. `kimi-k2.6` is chosen for its 131K context and strong agentic capabilities.
`moe_mode`	`moe_orchestrated` = tool calls go directly to `tool_model`; content generation uses the MoE expert pipeline. `native` = bypass MoE entirely (Claude Code controls its own model).
`expert_template_id`	Links to the Expert Template that handles content requests. The Tooling LLM handles structure; the MoE pipeline handles knowledge.
`tool_max_tokens`	Max tokens for a single tool response. 8192 is sufficient for most file reads and bash outputs.
`tool_choice`	`required` forces the Tooling LLM to always call a tool rather than generate freeform text, which prevents hallucinated file paths.
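
The split between the two `moe_mode` values can be pictured with the sketch below. It is a simplification under assumptions: `route_request` and the `is_tool_call` flag are hypothetical, and how the real system tells a tool call from a content request is not specified here.

```python
def route_request(profile: dict, is_tool_call: bool) -> str:
    """Return a label for the backend that would handle the request."""
    mode = profile.get("moe_mode")
    if mode == "native":
        # MoE is bypassed entirely; Claude Code controls its own model.
        return "native"
    if mode == "moe_orchestrated":
        if is_tool_call:
            # Tool calls go directly to the Tooling LLM.
            return f'{profile["tool_model"]}@{profile["tool_endpoint"]}'
        # Content generation goes through the linked Expert Template.
        return f'template:{profile["expert_template_id"]}'
    raise ValueError(f"unknown moe_mode: {mode!r}")
```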

Key Design Principles¶

1. Context Window Accuracy¶

The `context_window` field must match the actual model context window:

qwen3.6:35b      → 262,144   (256K — MoE architecture, not 32K!)
devstral-small-2 → 393,216   (384K)
deepseek-r1:32b  → 131,072   (128K)
solar-pro:22b    →   4,096   (4K — CRITICAL: must not exceed this!)
mistral-nemo:12b → 1,024,000 (1M)

The pipeline uses `context_window` to truncate task input. A value that is too low wastes context; a value that is too high causes silent mid-sentence truncation by the model itself.
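
To make the effect of `context_window` concrete, here is a rough sketch of budget-based truncation. Several things are assumptions: whitespace splitting stands in for the model's real tokenizer, the head of the input is kept, and `reserved_tokens` (room for system prompt and output) is a hypothetical parameter. `fit_to_window` is not a function of the pipeline.

```python
def fit_to_window(
    text: str, context_window: int, reserved_tokens: int = 1024
) -> tuple[str, int]:
    """Trim text to the token budget; return (kept_text, dropped_token_count)."""
    tokens = text.split()  # crude stand-in for a real tokenizer
    budget = max(context_window - reserved_tokens, 0)
    kept = tokens[:budget]
    return " ".join(kept), len(tokens) - len(kept)
```

With `context_window=4096` (the `solar-pro:22b` value), a 5,000-token input loses its tail in the pipeline, where the loss is at least deliberate and measurable. With a value set too high, the same input would pass through untouched and be cut by the model instead.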

2. Thinking Mode Granularity¶

`thinking_mode: false` prepends `/no_think` to the user message, suppressing qwen3's Extended Thinking. This cuts the number of generated tokens by 10–20× for categories that do not need deep reasoning:

Off  → general, data_analysis, devops_sre, translation, creative_writing
On   → reasoning, science, security_analysis, legal_advisor, medical_consult

Never disable thinking for security or reasoning experts — correctness suffers.
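
The interaction between the per-expert `thinking_mode` and the template-level `force_think` can be sketched in a few lines. `apply_thinking_mode` is a hypothetical helper, and the single space between `/no_think` and the message is an assumption about the exact formatting.

```python
def apply_thinking_mode(
    user_message: str, thinking_mode: bool, force_think: bool = False
) -> str:
    """Prepend /no_think unless thinking is enabled for this expert."""
    # force_think overrides the per-expert setting for all experts.
    if thinking_mode or force_think:
        return user_message
    return f"/no_think {user_message}"
```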

3. Two-Tier Expert Escalation¶

`role: "primary"` and `role: "fallback"` implement T1/T2 escalation. The primary runs first; if its confidence score is low, the fallback runs and the judge merges both answers:

"math": {
  "models": [
    {"model": "mathstral:7b",    "role": "primary"},   // T1: fast, specialised
    {"model": "deepseek-r1:32b", "role": "fallback"}   // T2: larger, more thorough
  ]
}

Use T1 for speed, T2 for quality. The cost of T2 is only paid when needed.
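
The escalation logic can be sketched with plain callables. This is a minimal sketch under assumptions: how the confidence score is computed is not described here, so each expert simply returns one; the `0.7` threshold is a placeholder; and `run_two_tier` is a hypothetical name.

```python
from typing import Callable

# An expert takes a task and returns (answer, confidence in [0, 1]).
Expert = Callable[[str], tuple[str, float]]


def run_two_tier(
    task: str,
    primary: Expert,
    fallback: Expert,
    judge: Callable[[list[str]], str],
    threshold: float = 0.7,  # placeholder value
) -> str:
    """Run T1; escalate to T2 and merge via the judge only if T1 is unsure."""
    answer, confidence = primary(task)
    if confidence >= threshold:
        return answer  # T2 is never called, so its cost is not paid
    fallback_answer, _ = fallback(task)
    return judge([answer, fallback_answer])
```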

4. The Long-Context Expert¶

For tasks exceeding 32K tokens (full codebases, long documents, 1000-turn conversation histories), route to `long_context`:

"long_context": {
  "context_window": 1048576,        // 1M tokens
  "models": [
    {"model": "mistral-nemo:12b"},  // 1,024,000 tokens
    {"model": "llama3-gradient:8b"} // 1,048,576 tokens
  ]
}

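These models process the entire input without chunking or summarisation, which matters for code review of large repositories where global dependencies are involved. Note that a `context_window` of 1,048,576 matches `llama3-gradient:8b` but exceeds the 1,024,000 tokens listed for `mistral-nemo:12b` by 24,576; following principle 1, 1,024,000 is the safer value when `mistral-nemo:12b` is the primary model.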
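
A size-based override for the routing rule above might look like this sketch. In the real system the planner decides the category; `pick_category`, `safe_window`, and reading "32K" as 32,768 tokens are assumptions made for illustration. The model windows are the figures from the listing above.

```python
LONG_CONTEXT_THRESHOLD = 32_768  # "32K tokens", read here as 32 * 1024

# Context windows as given in the listing above.
MODEL_WINDOWS = {
    "mistral-nemo:12b": 1_024_000,
    "llama3-gradient:8b": 1_048_576,
}


def pick_category(planned_category: str, input_tokens: int) -> str:
    """Send oversized inputs to long_context, otherwise keep the plan."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        return "long_context"
    return planned_category


def safe_window(model_windows: dict[str, int]) -> int:
    """Largest window that every model of the expert can handle."""
    return min(model_windows.values())
```

`safe_window(MODEL_WINDOWS)` gives a conservative `context_window` when the primary and fallback models differ in size.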