# MoE Sovereign — System Architecture

## Overview
MoE Sovereign is a LangGraph-based Mixture-of-Experts orchestrator. Each incoming query is decomposed by a planner LLM into typed tasks, routed to specialist models in parallel, enriched with knowledge graph context and optional web research, then synthesized by a judge LLM into a single coherent response.
Caching is multi-layered: a semantic vector cache (ChromaDB), a plan cache (Redis), a GraphRAG cache (Redis), and performance-scored expert routing (Redis). The API is fully OpenAI-compatible.
## LangGraph Pipeline

```mermaid
flowchart TD
IN([Client Request]) --> CACHE
CACHE{cache_lookup\nChromaDB semantic\ndistance < 0.15}
CACHE -->|HIT ⚡| MERGE
CACHE -->|MISS| PLAN
PLAN[planner\nphi4:14b\nRedis plan cache\nTTL 30 min]
PLAN --> PAR
subgraph PAR [Parallel Execution]
direction LR
W[workers\nTier 1 + Tier 2\nexpert models]
R[research\nSearXNG\nweb search]
M[math\nSymPy\ncalculation]
MCP[mcp\nPrecision Tools\n20 deterministic tools]
GR[graph_rag\nNeo4j\nRedis cache TTL 1h]
end
PAR --> RF[research_fallback\nconditional\nweb fallback]
RF --> THINK[thinking\nchain-of-thought\nreasoning trace]
THINK --> MERGE
MERGE{merger\nJudge LLM\nor Fast-Path ⚡}
MERGE -->|single expert, hoch confidence\nno extra context| FP[⚡ Fast-Path\ndirect return]
MERGE -->|ensemble / multi| JUDGE[Judge LLM\nsynthesis]
JUDGE --> CRIT[critic\npost-validation\nself-evaluation]
FP --> CRIT
CRIT --> OUT([Streaming Response])
style CACHE fill:#1e3a5f,color:#fff
style MERGE fill:#1e3a5f,color:#fff
style PAR fill:#0d2137,color:#ccc
style FP fill:#1a4a1a,color:#fff
style OUT fill:#2d1b4e,color:#fff
```
## Node Descriptions

| Node | Function | Key Logic |
|---|---|---|
| cache_lookup | ChromaDB semantic similarity | Distance < 0.15 → hard hit; 0.15–0.50 → soft hit (few-shot examples) |
| planner | Task decomposition (phi4:14b) | Produces `[{task, category, search_query?, mcp_tool?}]`; Redis plan cache, TTL 30 min |
| workers | Parallel expert execution | Two-tier routing: T1 (≤20B) first, T2 (>20B) only if T1 confidence is below threshold |
| research | SearXNG web search | Single- or multi-query deep search; runs whenever the plan contains a research category |
| math | SymPy calculation | Runs only if the plan contains a math category and no precision_tools task |
| mcp | MCP Precision Tools | 20 deterministic tools via HTTP; runs if precision_tools is in the plan |
| graph_rag | Neo4j knowledge graph | Entity and relation context; Redis cache, TTL 1 h |
| research_fallback | Conditional extra search | Triggers if the merger needs more context |
| thinking | Chain-of-thought reasoning | Generates reasoning_trace; activated by force_think modes |
| merger | Response synthesis (Judge LLM) | Fast-path bypasses the Judge for a single high-confidence expert |
| critic | Post-generation validation | Async self-evaluation; flags low-quality cache entries |
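The cache_lookup decision above can be sketched as a small classifier over the documented distance thresholds (0.15 hard hit, 0.50 soft hit). The function name is illustrative, not the orchestrator's actual code:

```python
# Sketch of the cache_lookup outcome logic; thresholds match the
# CACHE_HIT_THRESHOLD / SOFT_CACHE_THRESHOLD defaults documented below.
HARD_HIT_THRESHOLD = 0.15
SOFT_HIT_THRESHOLD = 0.50

def classify_cache_result(distance: float) -> str:
    """Map a ChromaDB cosine distance to a cache outcome."""
    if distance < HARD_HIT_THRESHOLD:
        return "hard_hit"   # return the cached response, skip the pipeline
    if distance < SOFT_HIT_THRESHOLD:
        return "soft_hit"   # inject the match as few-shot examples
    return "miss"           # fall through to the planner

print(classify_cache_result(0.08))  # hard_hit
```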
## Service Topology

```mermaid
graph LR
subgraph Clients
CC[Claude Code]
OC[Open Code]
CD[Continue.dev]
CU[curl / any OpenAI client]
end
subgraph Core [:8002]
ORCH[langgraph-orchestrator\nFastAPI + LangGraph]
end
subgraph Storage
REDIS[(terra_cache\nRedis Stack :6379)]
CHROMA[(chromadb-vector\nChromaDB :8001)]
NEO4J[(neo4j-knowledge\nNeo4j :7687/:7474)]
KAFKA[moe-kafka\nKafka :9092]
end
subgraph Tools
MCP[mcp-precision\nMCP Server :8003]
SEARX[SearXNG\nexternal]
end
subgraph GPU_Inference
RTX[Ollama RTX\nconfigured via\nINFERENCE_SERVERS]
TESLA[Ollama Tesla\noptional]
end
subgraph Observability
PROM[moe-prometheus :9090]
GRAF[moe-grafana :3001]
NODE[node-exporter :9100]
CADV[cadvisor :9338]
end
subgraph Admin
ADMUI[moe-admin :8088]
end
CC & OC & CD & CU -->|OpenAI API| ORCH
ORCH --> REDIS
ORCH --> CHROMA
ORCH --> NEO4J
ORCH --> KAFKA
ORCH --> MCP
ORCH --> SEARX
ORCH --> RTX
ORCH -.-> TESLA
ADMUI --> ORCH
ADMUI -->|/var/run/docker.sock| DOCKER[(Docker API)]
ADMUI --> PROM
PROM --> ORCH
PROM --> NODE
PROM --> CADV
GRAF --> PROM
KAFKA -->|moe.ingest| ORCH
KAFKA -->|moe.feedback| ORCH
```
## Kafka Topics

| Topic | Publisher | Consumer | Purpose |
|---|---|---|---|
| moe.ingest | orchestrator | orchestrator | GraphRAG entity ingestion from responses |
| moe.requests | orchestrator | orchestrator | Audit log (input, answer snippet, models used) |
| moe.feedback | orchestrator | orchestrator | User ratings → plan-pattern learning and model scoring |
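A moe.feedback event might be built as below; the field names are assumptions for illustration, not the orchestrator's actual schema:

```python
# Hypothetical moe.feedback payload builder; field names are assumed.
import json
import time

def build_feedback_event(response_id: str, rating: int) -> bytes:
    """Serialize a user rating (1-5) for the moe.feedback topic."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    event = {
        "response_id": response_id,  # matches moe:response:{response_id}
        "rating": rating,
        "ts": int(time.time()),
    }
    return json.dumps(event).encode("utf-8")

# With kafka-python this could be published roughly as:
# KafkaProducer(bootstrap_servers="moe-kafka:9092").send(
#     "moe.feedback", build_feedback_event(rid, 5))
```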
## Caching Architecture

```mermaid
graph TD
Q([Query]) --> L1
L1{L1: ChromaDB\nSemantic Cache\ncosine distance}
L1 -->|< 0.15 hard hit| DONE([Return cached response])
L1 -->|0.15–0.50 soft hit| FEW[Few-shot examples\nfor experts]
L1 -->|> 0.50 miss| L2
L2{L2: Redis\nPlan Cache\nmoe:plan:sha256[:16]}
L2 -->|TTL 30 min hit| SKIP_PLAN[Skip planner LLM\n~1,600 tokens saved]
L2 -->|miss| PLAN_LLM[Planner LLM call]
PLAN_LLM -->|write-back| L2
SKIP_PLAN --> L3
L3{L3: Redis\nGraphRAG Cache\nmoe:graph:sha256[:16]}
L3 -->|TTL 1h hit| SKIP_NEO4J[Skip Neo4j query\n1–3s saved]
L3 -->|miss| NEO4J_Q[Neo4j query]
NEO4J_Q -->|write-back| L3
SKIP_NEO4J --> L4
L4{L4: Redis\nPerformance Scores\nmoe:perf:model:category}
L4 -->|Laplace-smoothed\nscore ≥ 0.3| TIER1[Prefer high-scoring\nT1 model]
L4 -->|score < 0.3| TIER2[Fallback to T2]
style L1 fill:#1e3a5f,color:#fff
style L2 fill:#3a1e5f,color:#fff
style L3 fill:#1e5f3a,color:#fff
style L4 fill:#5f3a1e,color:#fff
style DONE fill:#1a4a1a,color:#fff
```
Cache Key Reference
| Cache |
Key Pattern |
TTL |
Storage |
| Semantic cache |
ChromaDB collection moe_fact_cache |
permanent (flagged if bad) |
ChromaDB |
| Plan cache |
moe:plan:{sha256(query[:300])[:16]} |
30 min |
Redis |
| GraphRAG cache |
moe:graph:{sha256(query[:200]+categories)[:16]} |
1 h |
Redis |
| Perf scores |
moe:perf:{model}:{category} |
permanent |
Redis Hash |
| Response metadata |
moe:response:{response_id} |
7 days |
Redis Hash |
| Planner patterns |
moe:planner_success (sorted set) |
180 days |
Redis ZSet |
| Ontology gaps |
moe:ontology_gaps (sorted set) |
90 days |
Redis ZSet |
## Expert Routing

```mermaid
flowchart LR
PLAN([Plan Tasks]) --> SEL
SEL{Category\nin plan?}
SEL -->|precision_tools| MCP[MCP Node\ndeterministic]
SEL -->|research| WEB[Research Node\nSearXNG]
SEL -->|math| MATH[Math Node\nSymPy]
SEL -->|expert category| ROUTE
ROUTE{Expert\nRouting}
ROUTE -->|forced ensemble| BOTH[T1 + T2\nin parallel]
ROUTE -->|normal| T1[Tier 1\n≤20B params\nfast]
T1 -->|confidence == hoch| MERGE_CHECK
T1 -->|confidence < hoch| T2[Tier 2\n>20B params\nhigh quality]
T2 --> MERGE_CHECK
MERGE_CHECK{Merger\nFast-Path\ncheck}
MERGE_CHECK -->|1 expert, hoch\nno web/mcp/graph| FP[⚡ Fast-Path\nskip Judge LLM\n1,500–4,000 tokens saved]
MERGE_CHECK -->|multi / ensemble\nor extra context| JUDGE[Judge LLM\nsynthesis]
```
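The two-tier escalation above amounts to: run the small model, escalate only when it does not report `hoch` (high) confidence. A minimal sketch, with the expert calls as hypothetical stand-ins for the real Ollama requests:

```python
# Sketch of two-tier escalation; call_t1/call_t2 stand in for real
# expert-model calls and return (answer, confidence) tuples.
from typing import Callable, Tuple

ExpertCall = Callable[[str], Tuple[str, str]]

def route_two_tier(task: str, call_t1: ExpertCall, call_t2: ExpertCall) -> str:
    """Run Tier 1 first; escalate to Tier 2 unless T1 reports 'hoch'."""
    answer, confidence = call_t1(task)
    if confidence == "hoch":      # high confidence → skip the larger model
        return answer
    answer, _ = call_t2(task)     # Tier-2 fallback (>20B params)
    return answer

# Usage with stub experts:
print(route_two_tier("2+2?", lambda t: ("4", "hoch"), lambda t: ("four", "hoch")))
```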
## Expert Categories

| Category | Planner Trigger Keywords | Tier Preference |
|---|---|---|
| general | General knowledge questions, definitions, explanations | T1 |
| math | Calculation, equations, formulas, statistics | T1 |
| technical_support | IT, servers, Docker, networking, debugging, DevOps | T1 |
| creative_writer | Writing, creativity, storytelling, marketing | T1 |
| code_reviewer | Code, programming, reviews, security, refactoring | T1 |
| medical_consult | Medicine, symptoms, diagnoses, medication | T1 |
| legal_advisor | Law, statutes, BGB, StGB, contracts, court rulings | T1 |
| translation | Translating, languages, translation | T1 |
| reasoning | Analysis, logic, complex argumentation, strategy | T2 |
| vision | Images, screenshots, documents, photos, recognition | T2 |
| data_analyst | Data, CSV, tables, visualization, pandas | T1 |
| science | Chemistry, biology, physics, environment, research | T1 |
## AgentState

The LangGraph state object passed through all nodes:

| Field | Type | Description |
|---|---|---|
| input | str | Original user query (after skill resolution) |
| response_id | str | UUID for feedback tracking |
| mode | str | Active mode: default, code, concise, agent, agent_orchestrated, research, report, plan |
| system_prompt | str | Client system prompt (e.g., file context from Claude Code) |
| plan | List[Dict] | [{task, category, search_query?, mcp_tool?, mcp_args?}] |
| expert_results | List[str] | Accumulated expert outputs (reducer: operator.add) |
| expert_models_used | List[str] | ["model::category", ...] for metrics |
| web_research | str | SearXNG results with inline citations |
| cached_facts | str | ChromaDB hard-cache hit content |
| cache_hit | bool | True on a hard cache hit — skips most nodes |
| math_result | str | SymPy output |
| mcp_result | str | MCP precision tool output |
| graph_context | str | Neo4j entity and relation context |
| final_response | str | Synthesized answer from the merger |
| prompt_tokens | int | Cumulative across all nodes (reducer: operator.add) |
| completion_tokens | int | Cumulative across all nodes |
| chat_history | List[Dict] | Compressed conversation turns |
| reasoning_trace | str | Chain-of-thought from thinking_node |
| soft_cache_examples | str | Few-shot examples from the soft cache |
| images | List[Dict] | Extracted image blocks for the vision expert |
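The `operator.add` annotations are how LangGraph merges updates from the parallel nodes. A minimal sketch of the state class with a subset of the fields above:

```python
# Sketch of the AgentState shape with Annotated reducers; field subset only.
import operator
from typing import Annotated, List, TypedDict

class AgentState(TypedDict, total=False):
    input: str
    # The Annotated reducer tells LangGraph to concatenate/sum updates
    # emitted by parallel nodes instead of overwriting them.
    expert_results: Annotated[List[str], operator.add]
    prompt_tokens: Annotated[int, operator.add]
    final_response: str

# Effect of the reducer when two parallel nodes both emit expert_results:
merged = operator.add(["T1 answer"], ["research snippet"])
print(merged)  # ['T1 answer', 'research snippet']
```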
## Configuration Reference

### Core

| Variable | Default | Description |
|---|---|---|
| URL_RTX | — | Ollama base URL for the primary GPU (e.g., http://192.168.1.10:11434/v1) |
| URL_TESLA | — | Ollama base URL for the secondary GPU (optional) |
| INFERENCE_SERVERS | "" | JSON array of server configs (overrides URL_RTX/URL_TESLA) |
| JUDGE_ENDPOINT | RTX | Server that runs the judge/merger LLM |
| PLANNER_MODEL | phi4:14b | Model for task decomposition |
| PLANNER_ENDPOINT | RTX | Server that runs the planner |
| EXPERT_MODELS | {} | JSON: expert category → model list (set via Admin UI) |
| MCP_URL | http://mcp-precision:8003 | MCP precision tools server |
| SEARXNG_URL | — | SearXNG instance for web research |
### Caching & Thresholds

| Variable | Default | Description |
|---|---|---|
| CACHE_HIT_THRESHOLD | 0.15 | ChromaDB cosine distance for a hard cache hit |
| SOFT_CACHE_THRESHOLD | 0.50 | Distance threshold for few-shot examples |
| SOFT_CACHE_MAX_EXAMPLES | 2 | Max few-shot examples per query |
| CACHE_MIN_RESPONSE_LEN | 150 | Min chars to store a response in the cache |
| MAX_EXPERT_OUTPUT_CHARS | 2400 | Max chars per expert output (~600 tokens) |
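The MAX_EXPERT_OUTPUT_CHARS cap could be applied as below; the helper name and truncation marker are assumptions, not the orchestrator's actual implementation:

```python
# Illustrative clipping of expert output to the documented default
# (2,400 chars ≈ 600 tokens at ~4 chars/token).
MAX_EXPERT_OUTPUT_CHARS = 2400

def clip_expert_output(text: str) -> str:
    """Truncate an expert's output and mark the cut with an ellipsis."""
    if len(text) <= MAX_EXPERT_OUTPUT_CHARS:
        return text
    return text[:MAX_EXPERT_OUTPUT_CHARS].rstrip() + " […]"
```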
### Expert Routing

| Variable | Default | Description |
|---|---|---|
| EXPERT_TIER_BOUNDARY_B | 20 | Parameter-count boundary in billions: models ≤20B are Tier 1, larger models Tier 2 |
| EXPERT_MIN_SCORE | 0.3 | Laplace-score threshold for considering a model |
| EXPERT_MIN_DATAPOINTS | 5 | Minimum feedback points before the score is used |
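One plausible form of the Laplace-smoothed score, with the documented thresholds; the exact formula the orchestrator uses is not specified here, so add-one smoothing is an assumption:

```python
# Assumed add-one (Laplace) smoothing over feedback counts.
EXPERT_MIN_SCORE = 0.3
EXPERT_MIN_DATAPOINTS = 5

def laplace_score(positive: int, total: int) -> float:
    """(positive + 1) / (total + 2): stable even with few data points."""
    return (positive + 1) / (total + 2)

def model_eligible(positive: int, total: int) -> bool:
    """Below EXPERT_MIN_DATAPOINTS, don't penalize an unproven model."""
    if total < EXPERT_MIN_DATAPOINTS:
        return True
    return laplace_score(positive, total) >= EXPERT_MIN_SCORE

print(round(laplace_score(4, 10), 3))  # 0.417
```

Smoothing keeps a model with one bad rating from being scored 0.0 outright; it needs sustained negative feedback to fall below the 0.3 floor.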
### History & Timeouts

| Variable | Default | Description |
|---|---|---|
| HISTORY_MAX_TURNS | 4 | Conversation turns to include |
| HISTORY_MAX_CHARS | 3000 | Max total history chars |
| JUDGE_TIMEOUT | 900 | Merger/judge LLM timeout (seconds) |
| EXPERT_TIMEOUT | 900 | Expert model timeout (seconds) |
| PLANNER_TIMEOUT | 300 | Planner timeout (seconds) |
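History compression under these limits could look like the following sketch: keep the newest turns up to the char budget and collapse older ones to a placeholder. The collapsing strategy is an assumption:

```python
# Assumed history-compression logic: newest turns kept verbatim, older
# turns collapsed to "[…]" once the char budget is exhausted.
HISTORY_MAX_TURNS = 4
HISTORY_MAX_CHARS = 3000

def compress_history(turns: list[dict]) -> list[dict]:
    recent = turns[-HISTORY_MAX_TURNS:]
    kept: list[dict] = []
    total = 0
    for turn in reversed(recent):            # walk newest → oldest
        total += len(turn["content"])
        if total <= HISTORY_MAX_CHARS:
            kept.append(turn)
        else:
            kept.append({**turn, "content": "[…]"})
    return list(reversed(kept))
```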
### Claude Code Integration

| Variable | Default | Description |
|---|---|---|
| CLAUDE_CODE_PROFILES | [] | JSON array of integration profiles (set via Admin UI) |
| CLAUDE_CODE_MODELS | (8 claude-* model IDs) | Comma-separated Anthropic model IDs to route through MoE |
| TOOL_MAX_TOKENS | 8192 | Max tokens for tool-use responses |
| REASONING_MAX_TOKENS | 16384 | Max tokens for extended thinking |
### Infrastructure

| Variable | Default | Description |
|---|---|---|
| REDIS_URL | redis://terra_cache:6379 | Redis connection |
| NEO4J_URI | bolt://neo4j-knowledge:7687 | Neo4j Bolt endpoint |
| NEO4J_USER | neo4j | Neo4j username |
| NEO4J_PASS | moe-sovereign | Neo4j password |
| KAFKA_URL | kafka://moe-kafka:9092 | Kafka broker |
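These settings would typically be read from the environment with the table's defaults as fallbacks; a sketch, not the orchestrator's actual config module:

```python
# Environment-driven config with the documented defaults as fallbacks.
import os

REDIS_URL = os.environ.get("REDIS_URL", "redis://terra_cache:6379")
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://neo4j-knowledge:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
KAFKA_URL = os.environ.get("KAFKA_URL", "kafka://moe-kafka:9092")
```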
## API Endpoints

### Orchestrator (:8002)

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Main chat endpoint (OpenAI-compatible, streaming) |
| POST | /v1/messages | Anthropic Messages API format |
| GET | /v1/models | Lists all modes as model IDs |
| POST | /v1/feedback | Submit a rating (1–5) for a response |
| GET | /v1/provider-status | Rate-limit status for Claude Code |
| GET | /metrics | Prometheus metrics scrape |
| GET | /graph/stats | Neo4j entity/relation counts |
| GET | /graph/search?q=term | Semantic search in the knowledge graph |
| GET | /v1/admin/ontology-gaps | Unknown terms found in queries |
| GET | /v1/admin/planner-patterns | Learned expert-combination patterns |
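A minimal OpenAI-compatible request body for the main endpoint; the "model" value here ("default") is one of the mode names exposed via /v1/models, used as an example:

```python
# Minimal /v1/chat/completions request body (OpenAI-compatible format).
import json

payload = {
    "model": "default",     # a MoE mode name from /v1/models
    "stream": True,
    "messages": [
        {"role": "user", "content": "Explain two-tier expert routing."}
    ],
}
body = json.dumps(payload)
# POST this to http://<host>:8002/v1/chat/completions with requests/httpx.
```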
### Admin UI (:8088)

| Path | Description |
|---|---|
| / | Dashboard — system overview |
| /profiles | Claude Code integration profiles |
| /skills | Skill management (CRUD + upstream sync) |
| /servers | Inference server health & model list |
| /mcp-tools | MCP tool enable/disable |
| /monitoring | Prometheus/Grafana integration |
| /tool-eval | Tool invocation logs |
## Optimization Summary

| Optimization | Savings | Condition |
|---|---|---|
| ChromaDB hard cache | Full pipeline skip | Cosine distance < 0.15 |
| Redis plan cache (TTL 30 min) | ~1,600 tokens, 2–5 s | Same query within 30 min |
| Redis GraphRAG cache (TTL 1 h) | 1–3 s, one Neo4j query | Same query + categories within 1 h |
| Merger Fast-Path | ~1,500–4,000 tokens, 3–8 s | Single expert, hoch confidence, no extra context |
| Query normalization | +20–30% cache hit rate | Lowercase + strip punctuation before lookup |
| History compression | ~600–1,800 tokens | History > 2,000 chars → old turns collapsed to […] |
| Two-tier routing | T2 LLM call skipped | T1 expert returns hoch confidence |
| VRAM unload after inference | VRAM freed for the judge | Async keep_alive=0 after each expert |
| Soft-cache few-shot | Better accuracy without a hit | Distance 0.15–0.50 → in-context examples |
| Feedback-driven scoring | Better model selection | Laplace score from user feedback |
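One plausible implementation of the query-normalization step (lowercase plus punctuation stripping before the cache lookup); the exact rules the orchestrator applies are assumed:

```python
# Assumed normalization: lowercase, trim, strip ASCII punctuation.
import string

def normalize_query(query: str) -> str:
    """Canonicalize a query so near-identical phrasings hit the same cache entry."""
    lowered = query.lower().strip()
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalize_query("What is LangGraph?!"))  # what is langgraph
```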