MoE Sovereign – User Guide¶
Self-hosted • OpenAI-compatible • 12 specialist experts • GraphRAG • Vision • Skills
Quick Start¶
Claude Code¶
# Option A — per-session flag
claude --model moe-orchestrator \
--api-key any \
--base-url http://localhost:8002/v1
# Option B — persistent config in ~/.claude/settings.json
{
"env": {
"ANTHROPIC_BASE_URL": "http://localhost:8002/v1",
"ANTHROPIC_API_KEY": "any-string"
}
}
OpenAI-compatible clients (Continue.dev, Open Code, curl)¶
# Chat completion (streaming)
curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moe-orchestrator",
    "stream": true,
    "messages": [{"role": "user", "content": "Explain transformer architectures to me."}]
  }'
# List available model IDs
curl -s http://localhost:8002/v1/models | jq '.data[].id'
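The streaming call above can also be consumed from Python without an SDK. The sketch below assumes the server emits the usual OpenAI-style server-sent events (lines of the form `data: {...}`, terminated by `data: [DONE]`); this guide does not spell out the wire format, so treat that as an assumption. The names `build_chat_request` and `iter_stream_text` are illustrative, not part of MoE Sovereign.

```python
import json
from typing import Iterable, Iterator

BASE_URL = "http://localhost:8002/v1"  # endpoint from the Quick Start


def build_chat_request(prompt: str, model: str = "moe-orchestrator",
                       stream: bool = True) -> dict:
    """Build the JSON body for POST {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "stream": stream,
        "messages": [{"role": "user", "content": prompt}],
    }


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

To use it against a live server, send `build_chat_request(...)` as the JSON body with any HTTP client and pass the decoded response lines to `iter_stream_text`.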
Modes¶
Select a mode by setting the model field. Each mode is optimized for a different workflow.
| Model ID | Mode | Best for | Output format |
|---|---|---|---|
| moe-orchestrator | default | General questions, explanations, analysis | Full answer with KONFIDENZ block |
| moe-orchestrator-code | code | Code generation, no prose | Code only, # KONFIDENZ: first line |
| moe-orchestrator-concise | concise | Quick answers, mobile, tight context | ≤ 120 words |
| moe-orchestrator-agent | agent | Continue.dev, Open Code (coding workflows) | Clean Markdown, no confidence block |
| moe-orchestrator-agent-orchestrated | agent_orchestrated | Claude Code (maximum quality) | Full pipeline + reasoning, clean output |
| moe-orchestrator-research | research | Deep-dive topics, citations needed | Structured research report with sources |
| moe-orchestrator-report | report | Professional documents, decision support | Full Markdown report with Executive Summary |
| moe-orchestrator-plan | plan | Multi-step planning, project breakdowns | Full pipeline, planning structure |
Choosing a mode¶
Is it a coding task?
├── Yes → moe-orchestrator-code (or moe-orchestrator-agent for coding agents)
└── No
├── Need deep research with sources? → moe-orchestrator-research
├── Writing a document or report? → moe-orchestrator-report
├── Planning a multi-step project? → moe-orchestrator-plan
├── Quick one-liner answer? → moe-orchestrator-concise
└── Default → moe-orchestrator
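The decision tree above maps directly onto a small helper. This is only a sketch for client-side scripts; `choose_mode` and its keyword flags are hypothetical names, and the branch order simply mirrors the tree.

```python
def choose_mode(*, coding: bool = False, coding_agent: bool = False,
                research: bool = False, report: bool = False,
                plan: bool = False, quick: bool = False) -> str:
    """Return the model ID for a task, checking branches in tree order."""
    if coding:
        # Coding agents (Continue.dev, Open Code) get the agent mode.
        return "moe-orchestrator-agent" if coding_agent else "moe-orchestrator-code"
    if research:
        return "moe-orchestrator-research"
    if report:
        return "moe-orchestrator-report"
    if plan:
        return "moe-orchestrator-plan"
    if quick:
        return "moe-orchestrator-concise"
    return "moe-orchestrator"
```

For example, `choose_mode(research=True)` selects the research mode, and calling it with no flags falls through to the default model.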
Expert Categories¶
The planner routes your query to one or more of these 12 specialists automatically. You don't need to select them manually.
| Category | Specialization | Example prompts |
|---|---|---|
| general | Facts, definitions, explanations across domains | "What is quantum entanglement?" / "Explain Docker volumes." |
| math | Mathematics, physics formulas, LaTeX output | "Derive the derivative of sin(x²)." / "Solve 3x² + 5x - 2 = 0" |
| technical_support | IT, DevOps, networking, debugging, servers | "My Nginx returns 502. What should I check first?" |
| code_reviewer | Code review, security audit, refactoring | "Review this Python function for security issues." |
| creative_writer | Creative writing, storytelling, marketing copy | "Write a blog post intro about AI safety." |
| medical_consult | Medical information (with disclaimer) | "What are the symptoms of hypothyroidism?" |
| legal_advisor | German law: BGB, StGB, GG, HGB, EU law | "What does §433 BGB regulate?" / "Notice periods under the TzBfG" |
| translation | Professional multi-language translation | "Translate this text into English, formal register." |
| reasoning | Complex analysis, logical chains, strategy | "Analyze the pros and cons of microservices." |
| vision | Image, screenshot, document analysis | (send image in message content) |
| data_analyst | Statistics, data analysis, pandas/numpy code | "Write pandas code for a pivot analysis of this CSV." |
| science | Chemistry, biology, physics, environmental | "How does CRISPR-Cas9 work at the molecular level?" |
Multi-expert queries: for a prompt such as "Write a Python script that analyzes and documents legal questions", the planner routes to code_reviewer, legal_advisor, and creative_writer simultaneously.
Skills System¶
Skills are Markdown files with YAML frontmatter. When you call /skill-name [arguments], the server resolves the skill before it enters the pipeline — this works with any client (Claude Code, Continue.dev, Open Code, curl).
Calling a skill¶
Skill file format¶
Skills live in ~/.claude/commands/ (mounted into the container at /app/skills):
---
name: commit
description: Create a well-formatted git commit
---
Create a git commit with this message: $ARGUMENTS
Follow conventional commits format. Include a concise subject line
and a body explaining the "why", not just the "what".
- $ARGUMENTS is replaced with everything after /skill-name
- A skill without $ARGUMENTS ignores any arguments
- Disable a skill: rename skill.md → skill.md.disabled
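The resolution rules above can be sketched in a few lines of Python. This is not the server's actual code: `split_frontmatter` and `resolve_skill` are hypothetical names, the frontmatter parser only handles flat `key: value` lines, and the skills directory is passed in explicitly.

```python
import re
from pathlib import Path

# "/skill-name" optionally followed by whitespace and free-form arguments.
SKILL_CALL = re.compile(r"^/(?P<name>[A-Za-z0-9_-]+)(?:\s+(?P<args>.*))?$", re.DOTALL)


def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split a skill file into (frontmatter fields, prompt body)."""
    parts = text.split("---", 2)
    if not text.startswith("---") or len(parts) < 3:
        return {}, text  # content-only skill without frontmatter
    fields = {}
    for line in parts[1].strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, parts[2]


def resolve_skill(message: str, skills_dir: Path) -> str:
    """Expand "/skill-name args" into the skill prompt, else return the message."""
    match = SKILL_CALL.match(message.strip())
    if not match:
        return message
    skill_file = skills_dir / f"{match['name']}.md"
    if not skill_file.is_file():
        return message  # unknown skill, or renamed to skill.md.disabled
    _, body = split_frontmatter(skill_file.read_text(encoding="utf-8"))
    # A body without $ARGUMENTS is returned unchanged, so arguments are ignored.
    return body.strip().replace("$ARGUMENTS", match["args"] or "")
```

With the commit skill shown earlier, `resolve_skill("/commit fix typo", Path.home() / ".claude" / "commands")` would return the skill body with `fix typo` substituted for the placeholder.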
Upstream skills (Anthropic)¶
The skills-upstream/ directory contains a clone of anthropics/skills. Sync and import via:
- Admin UI → Skills → Upstream (Anthropic) — browse, pull updates, import individual skills
- Or via API: POST /api/skills/upstream/pull, then POST /api/skills/upstream/import/{name}
Available upstream skills (examples)¶
pdf, docx, xlsx, pptx, claude-api, webapp-testing, frontend-design, and more.
Vision & Multimodal¶
Send images as base64-encoded content blocks in the OpenAI messages format:
{
  "model": "moe-orchestrator",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,<BASE64_DATA>"
        }
      },
      {
        "type": "text",
        "text": "What does this diagram show?"
      }
    ]
  }]
}
Claude Code sends images automatically when you paste or attach them.
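For other clients, the request body has to be assembled by hand. The following sketch builds the same structure from an image file; `image_block` and `vision_request` are illustrative helper names, and the MIME type is guessed from the file extension with PNG as a fallback.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content block."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def vision_request(image_path: Path, question: str,
                   model: str = "moe-orchestrator") -> dict:
    """Build a chat completion body with one image and one text block."""
    mime_type = mimetypes.guess_type(image_path.name)[0] or "image/png"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                image_block(image_path.read_bytes(), mime_type),
                {"type": "text", "text": question},
            ],
        }],
    }
```

The returned dictionary can be serialized with `json.dumps` and posted to the chat completions endpoint.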
What happens internally:
- The API layer extracts image blocks → AgentState.images
- The planner detects the [BILD-EINGABE: N Bild(er)] annotation and routes to the vision category
- The vision expert receives the full multimodal message (text + images)
- Models used: llama3.2-vision:11b (T1), llava:34b (T2), or configured models
Supported content: screenshots, diagrams, charts, photos, scanned documents, UI mockups.
KONFIDENZ System¶
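Most responses include a structured confidence block. The full format has KERNAUSSAGE, KONFIDENZ, and DETAILS parts; the shorter variants are listed under Per-mode format.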
Confidence levels¶
| Level | Meaning | What to do |
|---|---|---|
| hoch | Expert answered with high certainty; factual, cross-validated | Trust and use directly |
| mittel | Reasonable answer but some uncertainty; may need verification | Verify important details |
| niedrig | Low confidence, especially in safety-critical categories | Consult a professional; treat as starting point only |
Per-mode format¶
| Mode | Format |
|---|---|
| default, research, report, plan | Full KERNAUSSAGE / KONFIDENZ / DETAILS block |
| code | First-line comment: # KONFIDENZ: hoch |
| concise | Inline prefix: KONFIDENZ: hoch — [answer] |
| agent, agent_orchestrated | No confidence block (clean output for coding agents) |
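A client that wants to act on the confidence level has to parse it out of the response text. The sketch below covers only the two single-line formats documented above (code and concise modes); the exact layout of the full KERNAUSSAGE / KONFIDENZ / DETAILS block is not shown in this guide, so it is deliberately not handled. `extract_confidence` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches "# KONFIDENZ: hoch" (code mode) and "KONFIDENZ: hoch ..." (concise mode).
_CONFIDENCE = re.compile(r"^\s*#?\s*KONFIDENZ:\s*(hoch|mittel|niedrig)\b", re.IGNORECASE)


def extract_confidence(response_text: str) -> Optional[str]:
    """Return the level found on the first non-empty line, or None."""
    for line in response_text.splitlines():
        if line.strip():
            match = _CONFIDENCE.match(line)
            return match.group(1).lower() if match else None
    return None  # empty response
```

Responses from the agent modes carry no confidence block, so the function returns `None` for them.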
Submit feedback¶
Help the system learn which models perform best:
curl -X POST http://localhost:8002/v1/feedback \
-H "Content-Type: application/json" \
-d '{"response_id": "<id-from-response>", "rating": 5}'
Ratings 4–5 are positive, 1–2 are negative. After 5 feedback points per model/category pair, the routing algorithm uses Laplace-smoothed scores to prefer better-performing models.
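To make the scoring idea concrete, here is a minimal sketch of Laplace (add-one) smoothing over ratings. The exact formula the router uses is not documented here, so this is an assumption: ratings of 4 or 5 count as positive, 1 or 2 as negative, a rating of 3 is treated as neutral and ignored, and a model is only scored once it has five counted ratings. `laplace_score` and `pick_model` are illustrative names.

```python
from typing import Optional

MIN_FEEDBACK = 5  # threshold named in the text above


def laplace_score(ratings: list[int], min_feedback: int = MIN_FEEDBACK) -> Optional[float]:
    """Add-one smoothed share of positive ratings, or None with too little data."""
    positive = sum(1 for r in ratings if r >= 4)
    negative = sum(1 for r in ratings if r <= 2)
    if positive + negative < min_feedback:
        return None
    # One pseudo-positive and one pseudo-negative keep the score away from 0 and 1.
    return (positive + 1) / (positive + negative + 2)


def pick_model(feedback: dict[str, list[int]], default: str) -> str:
    """Prefer the best-scored model for a category, else fall back to the default."""
    scored = {}
    for model, ratings in feedback.items():
        score = laplace_score(ratings)
        if score is not None:
            scored[model] = score
    return max(scored, key=scored.get) if scored else default
```

The smoothing means a model with five perfect ratings scores 6/7 rather than 1.0, so a handful of early ratings cannot lock in a winner.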
Best Practices¶
Choose the right mode¶
| Task | Recommended mode |
|---|---|
| Explain a concept | moe-orchestrator |
| Write or review code | moe-orchestrator-code |
| Quick factual lookup | moe-orchestrator-concise |
| Research a topic with sources | moe-orchestrator-research |
| Write a business document | moe-orchestrator-report |
| Plan a project or architecture | moe-orchestrator-plan |
| Integrate with Claude Code | moe-orchestrator-agent-orchestrated |
| Integrate with Open Code / Continue.dev | moe-orchestrator-agent |
Use skills for repetitive workflows¶
Instead of re-typing the same prompt pattern, create a skill:
---
name: explain-code
description: Explain a code block in simple terms
---
Explain the following code in simple, clear German. Describe what it does,
why it does it, and any potential issues:
$ARGUMENTS
Then just: /explain-code def foo(x): return x * 2
Give feedback¶
Every niedrig confidence response you rate with 1–2 trains the routing to prefer better models for that category. Every hoch response you rate 5 reinforces the winner. After ~20 ratings per category, the system self-optimizes.
Use research mode for current events¶
moe-orchestrator-research runs multiple SearXNG queries in parallel and structures the result as a citable research report. This works better than asking a single model about events that happened after its training cutoff.
Avoid overloading the context window¶
- The concise mode limits answers to ~120 words, which is useful for in-editor tooltips
- History is automatically compressed (older turns → […]) after 3,000 chars
- For long documents, use the pdf or docx skill from upstream
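The history compression mentioned in the list can be pictured roughly as follows. The server's real algorithm is not documented here; this sketch assumes that the newest turns are kept verbatim while they fit into the 3,000-character budget and that every older turn keeps its role but has its content replaced by the placeholder. `compress_history` is a hypothetical name.

```python
HISTORY_BUDGET = 3000  # characters, the limit named above
PLACEHOLDER = "[…]"


def compress_history(turns: list[dict], budget: int = HISTORY_BUDGET) -> list[dict]:
    """Keep recent turns within the budget; collapse older ones to a placeholder."""
    result = []
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        used += len(turn["content"])
        if used <= budget or not result:
            result.append(dict(turn))  # the newest turn is always kept
        else:
            result.append({**turn, "content": PLACEHOLDER})
    result.reverse()  # restore chronological order
    return result
```

A conversation that fits into the budget passes through unchanged.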
Configure models without rebuilding¶
All expert models, system prompts, and Claude Code profiles can be changed live via Admin UI → Servers / Skills / Profiles. Changes take effect immediately (loaded from .env volume on each request) — no container rebuild required.
Admin UI (Port 8088)¶
Open http://localhost:8088 in your browser.
| Page | Path | What you can do |
|---|---|---|
| Dashboard | / | System health, service status, recent activity |
| Profiles | /profiles | Create/edit/activate Claude Code integration profiles |
| Skills | /skills | CRUD for custom skills + upstream Anthropic skills sync |
| Servers | /servers | Check Ollama server health, list available models |
| MCP Tools | /mcp-tools | Enable/disable individual precision tools |
| Monitoring | /monitoring | Grafana + Prometheus metrics overview |
| Tool Eval | /tool-eval | Log of all MCP tool invocations |
Integration Profiles¶
A profile bundles all Claude Code settings into one named configuration:
| Field | Description |
|---|---|
| name | Display name (e.g., "Coding Heavy", "Fast & Concise") |
| tool_model | Model for tool-use calls (default: devstral:24b) |
| reasoning_model | Model for extended thinking (optional) |
| moe_mode | Default mode: agent_orchestrated, agent, default, ... |
| tool_max_tokens | Max tokens for tool responses (default: 8192) |
| reasoning_max_tokens | Max tokens for reasoning (default: 16384) |
Activate a profile in the Admin UI: it is written to .env and takes effect on the next request.
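When profiles are prepared in scripts before being entered in the Admin UI, the fields can be modelled as a small data class. This is only a client-side sketch: the field names and defaults come from the table above, while the class itself, the set of accepted modes (taken from the Modes table), and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Mode names as listed in the Modes table of this guide.
KNOWN_MODES = {"default", "code", "concise", "agent", "agent_orchestrated",
               "research", "report", "plan"}


@dataclass(frozen=True)
class IntegrationProfile:
    name: str
    moe_mode: str
    tool_model: str = "devstral:24b"
    reasoning_model: Optional[str] = None  # optional per the table
    tool_max_tokens: int = 8192
    reasoning_max_tokens: int = 16384

    def __post_init__(self) -> None:
        # Reject typos early instead of discovering them at request time.
        if self.moe_mode not in KNOWN_MODES:
            raise ValueError(f"unknown moe_mode: {self.moe_mode!r}")
        if self.tool_max_tokens <= 0 or self.reasoning_max_tokens <= 0:
            raise ValueError("token limits must be positive")
```

For example, `IntegrationProfile(name="Coding Heavy", moe_mode="agent_orchestrated")` picks up the documented defaults for every other field.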
MCP Precision Tools¶
The MCP server (http://localhost:8003) provides 20 deterministic tools for computations where LLMs are unreliable. The planner routes precision_tools tasks here automatically.
| Tool | Description | Example input |
|---|---|---|
| calculate | Safe math expressions (AST eval + SymPy fallback) | "sqrt(2) * pi" |
| solve_equation | Algebraic equation solving | "x**2 - 4 = 0", var="x" |
| date_diff | Days between two dates | "2024-01-01", "2025-04-01" |
| date_add | Add/subtract days from a date | date="2025-01-01", days=90 |
| day_of_week | Weekday from date | "2026-04-01" → "Wednesday" |
| unit_convert | Unit conversion | value=100, from="km", to="miles" |
| statistics_calc | Mean, median, stdev, variance, percentiles | data=[1,2,3,4,5], op="mean" |
| hash_text | MD5 / SHA256 / SHA512 | text="hello", alg="sha256" |
| base64_codec | Encode / decode Base64 | data="hello", mode="encode" |
| regex_extract | Pattern matching & extraction | text="IP: 192.168.1.1", pattern="\d+\.\d+\.\d+\.\d+" |
| subnet_calc | IP/CIDR network analysis | cidr="192.168.1.0/24" |
| text_analyze | Word count, char count, sentence count | text="..." |
| prime_factorize | Prime factorization | number=360 |
| gcd_lcm | GCD and LCM of two integers | a=48, b=18 |
| json_query | JSON path extraction | json="...", path="$.users[0].name" |
| roman_numeral | Arabic ↔ Roman numeral conversion | input="2024", dir="to_roman" |
| legal_search_laws | Find German laws by keyword | keyword="Kündigung" |
| legal_get_law_overview | List sections of a law | law="BGB" |
| legal_get_paragraph | Exact text of a paragraph | law="BGB", paragraph="433" |
| legal_fulltext_search | Full-text search within a law | law="StGB", query="Betrug" |
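To illustrate what "AST eval" means for a tool like calculate, here is a minimal sketch of a whitelist-based expression evaluator. It is not the MCP server's implementation: the allowed operators, names, and functions are chosen for the example, the SymPy fallback is omitted, and there are no resource limits (a very large exponent would still be slow).

```python
import ast
import math
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_NAMES = {"pi": math.pi, "e": math.e}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos, "log": math.log}


def calculate(expression: str) -> float:
    """Evaluate an arithmetic expression by walking its AST; never calls eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        if isinstance(node, ast.Name) and node.id in _NAMES:
            return _NAMES[node.id]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*(ev(arg) for arg in node.args))
        # Anything else (attribute access, imports, lambdas, ...) is rejected.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return ev(ast.parse(expression, mode="eval"))
```

Because only whitelisted node types are evaluated, an input such as `__import__('os')` raises `ValueError` instead of executing.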
Monitoring¶
Key Prometheus metrics¶
| Metric | Labels | Description |
|---|---|---|
| moe_tokens_total | model, token_type, node | Cumulative token usage per model/node |
| moe_expert_calls_total | model, category, node | Expert invocation count |
| moe_confidence_total | level, category | Confidence distribution per category |
| moe_cache_hits_total | type | Cache hit count (hard, soft) |
| moe_cache_misses_total | (none) | Cache miss count |
Grafana dashboards¶
Open http://localhost:3001 (admin / configured password).
Default dashboard: MoE Operations — includes cache hit rate, token spend per node, expert call distribution, confidence levels over time, response latency percentiles.
Useful Prometheus queries¶
# Cache hit rate (last 1h)
sum(rate(moe_cache_hits_total[1h])) / (sum(rate(moe_cache_hits_total[1h])) + sum(rate(moe_cache_misses_total[1h])))
# Token spend per model (last 24h)
sum by (model) (increase(moe_tokens_total[24h]))
# Expert calls by category (last 1h)
sum by (category) (rate(moe_expert_calls_total[1h]))
# Confidence distribution
sum by (level) (moe_confidence_total)
Troubleshooting¶
Response is slow on the first call¶
The first call warms up the model in Ollama VRAM. Subsequent calls within the keep-alive window are fast. Check which models are loaded: curl http://<ollama-host>:11434/api/ps
"niedrig" confidence on every response¶
Either:
- The assigned model is too small for the category → configure a larger model in Admin UI → Servers
- No feedback has been given yet → rate responses with 4–5 for the good ones so the scorer learns
Skills not resolving¶
- Check the skill file exists: ls ~/.claude/commands/skill-name.md
- Check it doesn't end in .disabled
- The skill must start with --- (YAML frontmatter) or just have content directly
Cache returns outdated answers¶
Flag the response via feedback (rating 1–2). The orchestrator automatically marks flagged cache entries and skips them on future queries.
Vision queries don't work¶
- Confirm a vision model is configured in Admin UI (e.g., llama3.2-vision:11b)
- Images must be sent as image_url with a data:image/...;base64,... URL
- Check logs: sudo docker logs langgraph-orchestrator | grep vision