# MoE Sovereign — System Architecture

## Overview
MoE Sovereign is a LangGraph-based Mixture-of-Experts orchestrator. Each incoming query is decomposed by a planner LLM into typed tasks, routed to specialist models in parallel, enriched with knowledge graph context and optional web research, then synthesized by a judge LLM into a single coherent response.
Caching is multi-layered: a semantic vector cache (ChromaDB), a plan cache (Redis), a GraphRAG cache (Redis), and performance-scored expert routing (Redis). The API is fully OpenAI-compatible.
## LangGraph Pipeline

```mermaid
flowchart TD
IN([Client Request]) --> CACHE
CACHE{cache_lookup\nChromaDB semantic\ndistance < 0.15}
CACHE -->|HIT ⚡| MERGE
CACHE -->|MISS| PLAN
PLAN[planner\nphi4:14b\nRedis plan cache\nTTL 30 min]
PLAN --> PAR
subgraph PAR [Parallel Execution]
direction LR
W[workers\nTier 1 + Tier 2\nexpert models]
R[research\nSearXNG\nweb search]
M[math\nSymPy\ncalculation]
MCP[mcp\nPrecision Tools\n20 deterministic tools]
GR[graph_rag\nNeo4j\nRedis cache TTL 1h]
end
PAR --> RF[research_fallback\nconditional\nweb fallback]
RF --> THINK[thinking\nchain-of-thought\nreasoning trace]
THINK --> MERGE
MERGE{merger\nJudge LLM\nor Fast-Path ⚡}
MERGE -->|single expert, hoch confidence\nno extra context| FP[⚡ Fast-Path\ndirect return]
MERGE -->|ensemble / multi| JUDGE[Judge LLM\nsynthesis]
JUDGE --> CRIT[critic\npost-validation\nself-evaluation]
FP --> CRIT
CRIT --> OUT([Streaming Response])
style CACHE fill:#1e3a5f,color:#fff
style MERGE fill:#1e3a5f,color:#fff
style PAR fill:#0d2137,color:#ccc
style FP fill:#1a4a1a,color:#fff
style OUT fill:#2d1b4e,color:#fff
```
## Node Descriptions

| Node | Function | Key Logic |
|---|---|---|
| cache_lookup | ChromaDB semantic similarity | Distance < 0.15 → hard hit; 0.15–0.50 → soft hit (few-shot examples) |
| planner | Task decomposition (phi4:14b) | Produces `[{task, category, search_query?, mcp_tool?}]`; Redis plan cache, TTL 30 min |
| workers | Parallel expert execution | Two-tier routing: T1 (≤20B) first, T2 (>20B) only if T1 confidence is below threshold |
| research | SearXNG web search | Single- or multi-query deep search; runs whenever the plan contains a research category |
| math | SymPy calculation | Runs only if the plan contains a math category and no precision_tools task |
| mcp | MCP Precision Tools | 20 deterministic tools via HTTP; runs if precision_tools is in the plan |
| graph_rag | Neo4j knowledge graph | Entity and relation context; Redis cache, TTL 1 h |
| research_fallback | Conditional extra search | Triggers if the merger needs more context |
| thinking | Chain-of-thought reasoning | Generates reasoning_trace; activated by force_think modes |
| merger | Response synthesis (Judge LLM) | Fast-path bypasses the Judge for a single high-confidence expert |
| critic | Post-generation validation | Async self-evaluation; flags low-quality cache entries |
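The cache_lookup decision above can be sketched as a small classifier over the documented distance thresholds (0.15 hard hit, 0.50 soft hit). The function name is illustrative, not the orchestrator's actual code:

```python
# Sketch of the cache_lookup outcome logic; thresholds match the
# CACHE_HIT_THRESHOLD / SOFT_CACHE_THRESHOLD defaults documented below.
HARD_HIT_THRESHOLD = 0.15
SOFT_HIT_THRESHOLD = 0.50

def classify_cache_result(distance: float) -> str:
    """Map a ChromaDB cosine distance to a cache outcome."""
    if distance < HARD_HIT_THRESHOLD:
        return "hard_hit"   # return the cached response, skip the pipeline
    if distance < SOFT_HIT_THRESHOLD:
        return "soft_hit"   # inject the match as few-shot examples
    return "miss"           # fall through to the planner

print(classify_cache_result(0.08))  # hard_hit
```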
## Service Topology

```mermaid
graph LR
subgraph Clients
CC[Claude Code]
OC[Open Code]
CD[Continue.dev]
CU[curl / any OpenAI client]
end
subgraph Core [:8002]
ORCH[langgraph-orchestrator\nFastAPI + LangGraph]
end
subgraph Storage
REDIS[(terra_cache\nRedis Stack :6379)]
CHROMA[(chromadb-vector\nChromaDB :8001)]
NEO4J[(neo4j-knowledge\nNeo4j :7687/:7474)]
KAFKA[moe-kafka\nKafka :9092]
end
subgraph Tools
MCP[mcp-precision\nMCP Server :8003]
SEARX[SearXNG\nexternal]
end
subgraph GPU_Inference
RTX[Ollama RTX\nconfigured via\nINFERENCE_SERVERS]
TESLA[Ollama Tesla\noptional]
end
subgraph Observability
PROM[moe-prometheus :9090]
GRAF[moe-grafana :3001]
NODE[node-exporter :9100]
CADV[cadvisor :9338]
end
subgraph Admin
ADMUI[moe-admin :8088]
end
CC & OC & CD & CU -->|OpenAI API| ORCH
ORCH --> REDIS
ORCH --> CHROMA
ORCH --> NEO4J
ORCH --> KAFKA
ORCH --> MCP
ORCH --> SEARX
ORCH --> RTX
ORCH -.-> TESLA
ADMUI --> ORCH
ADMUI -->|/var/run/docker.sock| DOCKER[(Docker API)]
ADMUI --> PROM
PROM --> ORCH
PROM --> NODE
PROM --> CADV
GRAF --> PROM
KAFKA -->|moe.ingest| ORCH
KAFKA -->|moe.feedback| ORCH
```
## Kafka Topics

| Topic | Publisher | Consumer | Purpose |
|---|---|---|---|
| moe.ingest | orchestrator | orchestrator | GraphRAG entity ingestion from responses |
| moe.requests | orchestrator | orchestrator | Audit log (input, answer snippet, models used) |
| moe.feedback | orchestrator | orchestrator | User ratings → plan-pattern learning and model scoring |
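A moe.feedback event might be built as below; the field names are assumptions for illustration, not the orchestrator's actual schema:

```python
# Hypothetical moe.feedback payload builder; field names are assumed.
import json
import time

def build_feedback_event(response_id: str, rating: int) -> bytes:
    """Serialize a user rating (1-5) for the moe.feedback topic."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    event = {
        "response_id": response_id,  # matches moe:response:{response_id}
        "rating": rating,
        "ts": int(time.time()),
    }
    return json.dumps(event).encode("utf-8")

# With kafka-python this could be published roughly as:
# KafkaProducer(bootstrap_servers="moe-kafka:9092").send(
#     "moe.feedback", build_feedback_event(rid, 5))
```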
## Caching Architecture

```mermaid
graph TD
Q([Query]) --> L1
L1{L1: ChromaDB\nSemantic Cache\ncosine distance}
L1 -->|< 0.15 hard hit| DONE([Return cached response])
L1 -->|0.15–0.50 soft hit| FEW[Few-shot examples\nfor experts]
L1 -->|> 0.50 miss| L2
L2{L2: Redis\nPlan Cache\nmoe:plan:sha256[:16]}
L2 -->|TTL 30 min hit| SKIP_PLAN[Skip planner LLM\n~1,600 tokens saved]
L2 -->|miss| PLAN_LLM[Planner LLM call]
PLAN_LLM -->|write-back| L2
SKIP_PLAN --> L3
L3{L3: Redis\nGraphRAG Cache\nmoe:graph:sha256[:16]}
L3 -->|TTL 1h hit| SKIP_NEO4J[Skip Neo4j query\n1–3s saved]
L3 -->|miss| NEO4J_Q[Neo4j query]
NEO4J_Q -->|write-back| L3
SKIP_NEO4J --> L4
L4{L4: Redis\nPerformance Scores\nmoe:perf:model:category}
L4 -->|Laplace-smoothed\nscore ≥ 0.3| TIER1[Prefer high-scoring\nT1 model]
L4 -->|score < 0.3| TIER2[Fallback to T2]
style L1 fill:#1e3a5f,color:#fff
style L2 fill:#3a1e5f,color:#fff
style L3 fill:#1e5f3a,color:#fff
style L4 fill:#5f3a1e,color:#fff
style DONE fill:#1a4a1a,color:#fff
```
Cache Key Reference
| Cache |
Key Pattern |
TTL |
Storage |
| Semantic cache |
ChromaDB collection moe_fact_cache |
permanent (flagged if bad) |
ChromaDB |
| Plan cache |
moe:plan:{sha256(query[:300])[:16]} |
30 min |
Redis |
| GraphRAG cache |
moe:graph:{sha256(query[:200]+categories)[:16]} |
1 h |
Redis |
| Perf scores |
moe:perf:{model}:{category} |
permanent |
Redis Hash |
| Response metadata |
moe:response:{response_id} |
7 days |
Redis Hash |
| Planner patterns |
moe:planner_success (sorted set) |
180 days |
Redis ZSet |
| Ontology gaps |
moe:ontology_gaps (sorted set) |
90 days |
Redis ZSet |
## Expert Routing

```mermaid
flowchart LR
PLAN([Plan Tasks]) --> SEL
SEL{Category\nin plan?}
SEL -->|precision_tools| MCP[MCP Node\ndeterministic]
SEL -->|research| WEB[Research Node\nSearXNG]
SEL -->|math| MATH[Math Node\nSymPy]
SEL -->|expert category| ROUTE
ROUTE{Expert\nRouting}
ROUTE -->|forced ensemble| BOTH[T1 + T2\nin parallel]
ROUTE -->|normal| T1[Tier 1\n≤20B params\nfast]
T1 -->|confidence == hoch| MERGE_CHECK
T1 -->|confidence < hoch| T2[Tier 2\n>20B params\nhigh quality]
T2 --> MERGE_CHECK
MERGE_CHECK{Merger\nFast-Path\ncheck}
MERGE_CHECK -->|1 expert, hoch\nno web/mcp/graph| FP[⚡ Fast-Path\nskip Judge LLM\n1,500–4,000 tokens saved]
MERGE_CHECK -->|multi / ensemble\nor extra context| JUDGE[Judge LLM\nsynthesis]
```
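The two-tier escalation above amounts to: run the small model, escalate only when it does not report `hoch` (high) confidence. A minimal sketch, with the expert calls as hypothetical stand-ins for the real Ollama requests:

```python
# Sketch of two-tier escalation; call_t1/call_t2 stand in for real
# expert-model calls and return (answer, confidence) tuples.
from typing import Callable, Tuple

ExpertCall = Callable[[str], Tuple[str, str]]

def route_two_tier(task: str, call_t1: ExpertCall, call_t2: ExpertCall) -> str:
    """Run Tier 1 first; escalate to Tier 2 unless T1 reports 'hoch'."""
    answer, confidence = call_t1(task)
    if confidence == "hoch":      # high confidence → skip the larger model
        return answer
    answer, _ = call_t2(task)     # Tier-2 fallback (>20B params)
    return answer

# Usage with stub experts:
print(route_two_tier("2+2?", lambda t: ("4", "hoch"), lambda t: ("four", "hoch")))
```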
## Expert Categories

| Category | Planner Trigger Keywords | Tier Preference |
|---|---|---|
| general | General knowledge questions, definitions, explanations | T1 |
| math | Calculation, equations, formulas, statistics | T1 |
| technical_support | IT, servers, Docker, networking, debugging, DevOps | T1 |
| creative_writer | Writing, creativity, storytelling, marketing | T1 |
| code_reviewer | Code, programming, reviews, security, refactoring | T1 |
| medical_consult | Medicine, symptoms, diagnoses, medication | T1 |
| legal_advisor | Law, statutes, BGB, StGB, contracts, court rulings | T1 |
| translation | Translating, languages, translation | T1 |
| reasoning | Analysis, logic, complex argumentation, strategy | T2 |
| vision | Images, screenshots, documents, photos, recognition | T2 |
| data_analyst | Data, CSV, tables, visualization, pandas | T1 |
| science | Chemistry, biology, physics, environment, research | T1 |
## AgentState

The LangGraph state object passed through all nodes:

| Field | Type | Description |
|---|---|---|
| input | str | Original user query (after skill resolution) |
| response_id | str | UUID for feedback tracking |
| mode | str | Active mode: default, code, concise, agent, agent_orchestrated, research, report, plan |
| system_prompt | str | Client system prompt (e.g., file context from Claude Code) |
| plan | List[Dict] | [{task, category, search_query?, mcp_tool?, mcp_args?}] |
| expert_results | List[str] | Accumulated expert outputs (reducer: operator.add) |
| expert_models_used | List[str] | ["model::category", ...] for metrics |
| web_research | str | SearXNG results with inline citations |
| cached_facts | str | ChromaDB hard-cache hit content |
| cache_hit | bool | True on a hard cache hit — skips most nodes |
| math_result | str | SymPy output |
| mcp_result | str | MCP precision tool output |
| graph_context | str | Neo4j entity and relation context |
| final_response | str | Synthesized answer from the merger |
| prompt_tokens | int | Cumulative across all nodes (reducer: operator.add) |
| completion_tokens | int | Cumulative across all nodes |
| chat_history | List[Dict] | Compressed conversation turns |
| reasoning_trace | str | Chain-of-thought from thinking_node |
| soft_cache_examples | str | Few-shot examples from the soft cache |
| images | List[Dict] | Extracted image blocks for the vision expert |
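The `operator.add` annotations are how LangGraph merges updates from the parallel nodes. A minimal sketch of the state class with a subset of the fields above:

```python
# Sketch of the AgentState shape with Annotated reducers; field subset only.
import operator
from typing import Annotated, List, TypedDict

class AgentState(TypedDict, total=False):
    input: str
    # The Annotated reducer tells LangGraph to concatenate/sum updates
    # emitted by parallel nodes instead of overwriting them.
    expert_results: Annotated[List[str], operator.add]
    prompt_tokens: Annotated[int, operator.add]
    final_response: str

# Effect of the reducer when two parallel nodes both emit expert_results:
merged = operator.add(["T1 answer"], ["research snippet"])
print(merged)  # ['T1 answer', 'research snippet']
```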
## Configuration Reference

### Core

| Variable | Default | Description |
|---|---|---|
| URL_RTX | — | Ollama base URL for the primary GPU (e.g., http://192.168.1.10:11434/v1) |
| URL_TESLA | — | Ollama base URL for the secondary GPU (optional) |
| INFERENCE_SERVERS | "" | JSON array of server configs (overrides URL_RTX/URL_TESLA) |
| JUDGE_ENDPOINT | RTX | Server that runs the judge/merger LLM |
| PLANNER_MODEL | phi4:14b | Model for task decomposition |
| PLANNER_ENDPOINT | RTX | Server that runs the planner |
| EXPERT_MODELS | {} | JSON: expert category → model list (set via Admin UI) |
| MCP_URL | http://mcp-precision:8003 | MCP precision tools server |
| SEARXNG_URL | — | SearXNG instance for web research |
### Caching & Thresholds

| Variable | Default | Description |
|---|---|---|
| CACHE_HIT_THRESHOLD | 0.15 | ChromaDB cosine distance for a hard cache hit |
| SOFT_CACHE_THRESHOLD | 0.50 | Distance threshold for few-shot examples |
| SOFT_CACHE_MAX_EXAMPLES | 2 | Max few-shot examples per query |
| CACHE_MIN_RESPONSE_LEN | 150 | Min chars to store a response in the cache |
| MAX_EXPERT_OUTPUT_CHARS | 2400 | Max chars per expert output (~600 tokens) |
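The MAX_EXPERT_OUTPUT_CHARS cap could be applied as below; the helper name and truncation marker are assumptions, not the orchestrator's actual implementation:

```python
# Illustrative clipping of expert output to the documented default
# (2,400 chars ≈ 600 tokens at ~4 chars/token).
MAX_EXPERT_OUTPUT_CHARS = 2400

def clip_expert_output(text: str) -> str:
    """Truncate an expert's output and mark the cut with an ellipsis."""
    if len(text) <= MAX_EXPERT_OUTPUT_CHARS:
        return text
    return text[:MAX_EXPERT_OUTPUT_CHARS].rstrip() + " […]"
```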
### Expert Routing

| Variable | Default | Description |
|---|---|---|
| EXPERT_TIER_BOUNDARY_B | 20 | Parameter-count boundary in billions: models ≤20B are Tier 1, larger models Tier 2 |
| EXPERT_MIN_SCORE | 0.3 | Laplace-score threshold for considering a model |
| EXPERT_MIN_DATAPOINTS | 5 | Minimum feedback points before the score is used |
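One plausible form of the Laplace-smoothed score, with the documented thresholds; the exact formula the orchestrator uses is not specified here, so add-one smoothing is an assumption:

```python
# Assumed add-one (Laplace) smoothing over feedback counts.
EXPERT_MIN_SCORE = 0.3
EXPERT_MIN_DATAPOINTS = 5

def laplace_score(positive: int, total: int) -> float:
    """(positive + 1) / (total + 2): stable even with few data points."""
    return (positive + 1) / (total + 2)

def model_eligible(positive: int, total: int) -> bool:
    """Below EXPERT_MIN_DATAPOINTS, don't penalize an unproven model."""
    if total < EXPERT_MIN_DATAPOINTS:
        return True
    return laplace_score(positive, total) >= EXPERT_MIN_SCORE

print(round(laplace_score(4, 10), 3))  # 0.417
```

Smoothing keeps a model with one bad rating from being scored 0.0 outright; it needs sustained negative feedback to fall below the 0.3 floor.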
### History & Timeouts

| Variable | Default | Description |
|---|---|---|
| HISTORY_MAX_TURNS | 4 | Conversation turns to include |
| HISTORY_MAX_CHARS | 3000 | Max total history chars |
| JUDGE_TIMEOUT | 900 | Merger/judge LLM timeout (seconds) |
| EXPERT_TIMEOUT | 900 | Expert model timeout (seconds) |
| PLANNER_TIMEOUT | 300 | Planner timeout (seconds) |
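History compression under these limits could look like the following sketch: keep the newest turns up to the char budget and collapse older ones to a placeholder. The collapsing strategy is an assumption:

```python
# Assumed history-compression logic: newest turns kept verbatim, older
# turns collapsed to "[…]" once the char budget is exhausted.
HISTORY_MAX_TURNS = 4
HISTORY_MAX_CHARS = 3000

def compress_history(turns: list[dict]) -> list[dict]:
    recent = turns[-HISTORY_MAX_TURNS:]
    kept: list[dict] = []
    total = 0
    for turn in reversed(recent):            # walk newest → oldest
        total += len(turn["content"])
        if total <= HISTORY_MAX_CHARS:
            kept.append(turn)
        else:
            kept.append({**turn, "content": "[…]"})
    return list(reversed(kept))
```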
### Claude Code Integration

| Variable | Default | Description |
|---|---|---|
| CLAUDE_CODE_PROFILES | [] | JSON array of integration profiles (set via Admin UI) |
| CLAUDE_CODE_MODELS | (8 claude-* model IDs) | Comma-separated Anthropic model IDs to route through MoE |
| TOOL_MAX_TOKENS | 8192 | Max tokens for tool-use responses |
| REASONING_MAX_TOKENS | 16384 | Max tokens for extended thinking |
### Infrastructure

| Variable | Default | Description |
|---|---|---|
| REDIS_URL | redis://terra_cache:6379 | Redis connection |
| NEO4J_URI | bolt://neo4j-knowledge:7687 | Neo4j Bolt endpoint |
| NEO4J_USER | neo4j | Neo4j username |
| NEO4J_PASS | moe-sovereign | Neo4j password |
| KAFKA_URL | kafka://moe-kafka:9092 | Kafka broker |
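These settings would typically be read from the environment with the table's defaults as fallbacks; a sketch, not the orchestrator's actual config module:

```python
# Environment-driven config with the documented defaults as fallbacks.
import os

REDIS_URL = os.environ.get("REDIS_URL", "redis://terra_cache:6379")
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://neo4j-knowledge:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
KAFKA_URL = os.environ.get("KAFKA_URL", "kafka://moe-kafka:9092")
```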
## API Endpoints

### Orchestrator (:8002)

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Main chat endpoint (OpenAI-compatible, streaming) |
| POST | /v1/messages | Anthropic Messages API format |
| GET | /v1/models | Lists all modes as model IDs |
| POST | /v1/feedback | Submit a rating (1–5) for a response |
| GET | /v1/provider-status | Rate-limit status for Claude Code |
| GET | /metrics | Prometheus metrics scrape |
| GET | /graph/stats | Neo4j entity/relation counts |
| GET | /graph/search?q=term | Semantic search in the knowledge graph |
| GET | /v1/admin/ontology-gaps | Unknown terms found in queries |
| GET | /v1/admin/planner-patterns | Learned expert-combination patterns |
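A minimal OpenAI-compatible request body for the main endpoint; the "model" value here ("default") is one of the mode names exposed via /v1/models, used as an example:

```python
# Minimal /v1/chat/completions request body (OpenAI-compatible format).
import json

payload = {
    "model": "default",     # a MoE mode name from /v1/models
    "stream": True,
    "messages": [
        {"role": "user", "content": "Explain two-tier expert routing."}
    ],
}
body = json.dumps(payload)
# POST this to http://<host>:8002/v1/chat/completions with requests/httpx.
```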
### Admin UI (:8088)

| Path | Description |
|---|---|
| / | Dashboard — system overview |
| /profiles | Claude Code integration profiles |
| /skills | Skill management (CRUD + upstream sync) |
| /servers | Inference server health & model list |
| /mcp-tools | MCP tool enable/disable |
| /monitoring | Prometheus/Grafana integration |
| /tool-eval | Tool invocation logs |
## Optimization Summary

| Optimization | Savings | Condition |
|---|---|---|
| ChromaDB hard cache | Full pipeline skip | Cosine distance < 0.15 |
| Redis plan cache (TTL 30 min) | ~1,600 tokens, 2–5 s | Same query within 30 min |
| Redis GraphRAG cache (TTL 1 h) | 1–3 s, one Neo4j query | Same query + categories within 1 h |
| Merger Fast-Path | ~1,500–4,000 tokens, 3–8 s | Single expert, hoch confidence, no extra context |
| Query normalization | +20–30% cache hit rate | Lowercase + strip punctuation before lookup |
| History compression | ~600–1,800 tokens | History > 2,000 chars → old turns collapsed to […] |
| Two-tier routing | T2 LLM call skipped | T1 expert returns hoch confidence |
| VRAM unload after inference | VRAM freed for the judge | Async keep_alive=0 after each expert |
| Soft-cache few-shot | Better accuracy without a hit | Distance 0.15–0.50 → in-context examples |
| Feedback-driven scoring | Better model selection | Laplace score from user feedback |
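One plausible implementation of the query-normalization step (lowercase plus punctuation stripping before the cache lookup); the exact rules the orchestrator applies are assumed:

```python
# Assumed normalization: lowercase, trim, strip ASCII punctuation.
import string

def normalize_query(query: str) -> str:
    """Canonicalize a query so near-identical phrasings hit the same cache entry."""
    lowered = query.lower().strip()
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalize_query("What is LangGraph?!"))  # what is langgraph
```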