Memory Architecture¶

MoE Sovereign implements a four-tier memory architecture rooted in Tulving's cognitive taxonomy (Tulving 1972) and the LLM-agent memory research of Park et al. (Generative Agents, 2023) and Packer et al. (MemGPT, 2023).

Each tier covers a different time horizon without increasing inference token costs.

Overview: Four-Tier Memory¶

Tier 1 — HOT        (LLM context, verbatim)    Last N conversation turns
Tier 2 — WARM       (ChromaDB, disk-bound)     ANN retrieval of evicted turns
Tier 3 — COLD       (Neo4j, disk-bound)        GraphRAG entity/fact extraction
Tier 4 — EPISODIC   (Neo4j, :Episode nodes)    Past task outcomes + routing hints

CC tool-call path note: The Claude Code (CC) tool path layers two additional, request-scoped compression mechanisms on top of T1/T2 above — a session-scoped Context Index and a Summarization-on-Drop step. These are distinct from the long-term T3 (Neo4j GraphRAG) / T4 (Episodic) tiers despite similar numbering. See CC Tool-Path Context Compression.

Tier	Type (Tulving)	Backend	Capacity	Retrieval	TTL
T1 — Hot	Working memory	LLM native context	Model-dependent (4k–128k)	Verbatim, instant	Session duration
T2 — Warm	Episodic (conversation)	ChromaDB + nomic-embed-text	Effectively unlimited	ANN + hybrid keyword ranking	6 hours
T3 — Cold	Semantic	Neo4j knowledge graph	Unlimited (disk)	GraphRAG cypher queries	Permanent
T4 — Episodic	Episodic (task-level)	Neo4j `:Episode` nodes	Unlimited (disk)	Sørensen–Dice similarity	90 days (configurable)

How It Works¶

Turn Eviction and Storage (Tier-2)¶

When a conversation exceeds the configured hot-window size (default_max_turns), the oldest turns are evicted from the LLM context. Instead of discarding them, the orchestrator stores the evicted turns in ChromaDB as dense vector embeddings:

nomic-embed-text (768 dim, via Ollama) → ChromaDB HttpClient

Each stored turn includes: - Document: the raw text ([role] content) - Metadata: session_id, turn_index, role, keywords, timestamp - Embedding: 768-dimensional float32 vector (nomic-embed-text)

Collections are versioned by embedding slug (conversation_memory_nomic-embed-text) to prevent data corruption when switching embedding models.

Retrieval at Query Time¶

When a request arrives, the orchestrator:

Query reformulation: strips interrogative prefixes ("Was ist X?" → "X") for better ANN match quality.
Session-scoped retrieval: fetches all documents for the current session_id from ChromaDB (not total collection count).
Hybrid ranking (for small sessions ≤ 50 turns):
Direct numpy cosine similarity over all session turns
Topic-overlap fallback: content word matching for low-confidence ANN results
Keyword metadata filter: exact token matching as final fallback
Context injection: relevant turns are prepended to the current messages as a [WARM CONTEXT — SEMANTIC MEMORY] block before the expert prompt.

The memory_recall expert bypasses the LLM planner entirely (fast-path) to minimise latency overhead for pure recall queries.

Configuration¶

Enable per Template¶

In the Admin UI → Expert Templates → Edit → config_json:

{
  "enable_semantic_memory": true
}

No model change or container restart required. The flag activates Tier-2 retrieval for all requests processed by that template.

Environment Variables¶

Variable	Default	Description
`SEMANTIC_MEMORY_EMBED_MODEL`	`""` (all-MiniLM-L6-v2)	`ollama:nomic-embed-text` for 768-dim embeddings
`SEMANTIC_MEMORY_EMBED_URL`	`http://localhost:11434`	Ollama base URL for embedding inference
`SEMANTIC_MEMORY_MAX_TURNS`	`4`	Hot-window size (turns kept in LLM context)
`SEMANTIC_MEMORY_N_RESULTS`	`6`	Max warm turns injected per request
`SEMANTIC_MEMORY_TTL_HOURS`	`6`	ChromaDB entry TTL (cleanup runs every 6h)
`CHROMA_HOST`	`chromadb-vector`	ChromaDB service hostname
`CHROMA_PORT`	`8001`	ChromaDB HTTP port

Recommended Embedding Model¶

ollama:nomic-embed-text (768 dimensions) is strongly preferred over the default all-MiniLM-L6-v2 (384 dimensions). It doubles the semantic resolution and markedly improves recall at deeper needle depths (20–100 turns).

# .env
SEMANTIC_MEMORY_EMBED_MODEL=ollama:nomic-embed-text
SEMANTIC_MEMORY_EMBED_URL=http://<ollama-host>:11434

Benchmark: MRCR-lite v2¶

The Multi-turn Recall Comprehension Recall (MRCR-lite v2) benchmark measures how far back the system can reliably retrieve specific injected facts ("needles") with and without Tier-2 memory.

Protocol¶

A synthetic conversation is constructed as:

[depth × filler turns]             ← pre-needle (evicted from hot window)
[NEEDLE injection]                 ← fact to remember (evicted)
[5 × recent filler turns]          ← recent context (stays in hot window)
RECALL QUESTION: "What was X?"

With 5 recent filler pairs and a hot window of 4 pairs, the needle is always evicted. The orchestrator must rely entirely on Tier-2 retrieval to answer correctly.

A/B Conditions¶

Condition	ChromaDB	Template
`with_prepopulation`	Pre-seeded with evicted turns	`moe-memory-aihub-hybrid`
`without_prepopulation`	Empty (baseline)	`moe-memory-aihub-nosm`

Running the Benchmark¶

# Full run (depths 5/10/20/50/100, 2 reps each)
MOE_API_KEY=moe-sk-... python3 benchmarks/mrcr_lite_runner.py

# Quick smoke test (depth 5/10 only)
MOE_API_KEY=moe-sk-... MRCR_MAX_DEPTH=10 python3 benchmarks/mrcr_lite_runner.py

# A/B comparison with custom templates
MRCR_TEMPLATE_WITH=my-template-with-sm \
MRCR_TEMPLATE_NO=my-template-no-sm \
  python3 benchmarks/mrcr_lite_runner.py

Measured Results (April 2026)¶

Template: moe-memory-aihub-hybrid | Embedding: nomic-embed-text 768-dim
Retrieval method: direct numpy cosine ranking (no HNSW approximation)

By condition¶

Condition	Recall score	Notes
`with_prepopulation` (Tier-2 SM enabled)	1.000	All 5 needle types, all tested depths
`without_prepopulation` (baseline)	0.000	Needle confirmed evicted from hot window

A/B delta: +1.000 — the entire recall improvement is attributable to Tier-2 retrieval.

By needle type (WITH semantic memory)¶

Needle type	Pre-fix score	Post-fix score	Root cause of pre-fix failure
`number`	0.20	1.00	Session-scoped count bug → HNSW used instead of numpy
`person`	0.40	1.00	Same bug; HNSW missed low-frequency proper nouns
`date`	1.00	1.00	Unaffected (high ANN similarity for date patterns)
`name`	1.00	1.00	Unaffected
`technical`	1.00	1.00	Unaffected

Root cause of pre-fix failures (documented)¶

The original code used self._collection.count() (total collection count) as the threshold for switching between numpy direct ranking and HNSW approximation. With hundreds of sessions in ChromaDB, the total count always exceeded the threshold, causing HNSW to be used for all sessions — including small ones where numpy would have found the needle at rank #1. Fix: count = len(collection.get(where={"session_id": ...})).

After the fix, numpy direct ranking runs for all session sizes. HNSW is retained only as a last-resort fallback when embeddings are unavailable.

Comparison to Native LLM Context Windows¶

System	Native window	Effective window	Privacy	Cost per inference
GPT-4o	128,000 tokens	128,000 tokens	Cloud	Per token
Claude 3.5 Sonnet	200,000 tokens	200,000 tokens	Cloud	Per token
Local 7B (no SM)	4,000–32,000 tokens	4,000–32,000 tokens	Local	0
MoE Sovereign + Tier-2 SM only	4,000–32,000 (model)	1,000,000+ (infra, conversation history)	Local	0
MoE Sovereign + Tier-2 SM + Tier-3 Context Index + Summarization-on-Drop	4,000–32,000 (model)	*1,000,000+ (infra, conversation history and* per-request documents/codebase)**	Local	0

Key insight: The effective context window is no longer a model property — it is an infrastructure property. Upgrading from a 7B to a 70B model does not increase the recall range. Enabling Tier-2 Semantic Memory does, for any model; the CC tool path's Tier-3 Context Index and Summarization-on-Drop (see below) extend this further to large per-request system_prompt content (codebases, documents) and long CC sessions.

Accuracy comparison at different depths¶

Depth	Local 7B (no SM)	GPT-4o (128k native)	MoE + Tier-2 SM
5 turns	1.00 (in window)	1.00	1.00
10 turns	0.00 (evicted)	1.00	1.00
50 turns	0.00 (evicted)	1.00	1.00*
100 turns	0.00 (evicted)	1.00	1.00*

*Unit-test verified retrieval at depth 100; end-to-end LLM benchmark pending.

CC Tool-Path Context Compression (Layers 1–4)¶

The Claude Code (CC) tool-call path (services/pipeline/anthropic.py) extends the effective context window of the configured tool_model through four layers. The numbering below parallels the T1–T4 memory tiers above but is request-scoped: "Layer 3" (Context Index) and "Layer 4" (Summarization-on-Drop) are ephemeral, per-session mechanisms — distinct from the long-term T3 (Neo4j GraphRAG) / T4 (Episodic) tiers, which persist across sessions indefinitely.

Layer	Mechanism	Component	Status
1 — Hot	Native LLM context (`tool_max_tokens` / `context_window`, verbatim)	Model	Always on
2 — Warm retrieval	Tier-2 Semantic Memory — cross-turn ANN retrieval (see above)	`memory_retrieval.py`	Per-template opt-in (`enable_semantic_memory`)
3 — Context Index	Chunk + ChromaDB-index a large `system_prompt` for the session; retrieve semantically relevant chunks per expert call	`services/context_index.py`	`CC_CONTEXT_INDEX_ENABLED=false` (default; see context-variables.md)
4 — Summarization-on-Drop	When conversation history must still be trimmed to fit `avail_input_tokens`, the dropped message groups are LLM-summarized into `cc:work:{session_id}["dropped_history_summary"]` and re-injected on the next request	`services/pipeline/anthropic.py` (`_trim_oai_to_budget_async`)	Active when `CC_HISTORY_COMPRESS_LLM` resolves to a non-empty model (falls back to `GRAPH_COMPRESS_LLM`)

Layers 3 and 4 are also the targets of pre-flight overflow monitoring (estimate_overflow() / PROM_BUDGET_EXCEEDED): an overflowing CC request triggers Tier-3 indexing on the spot, regardless of the normal CONTEXT_INDEX_THRESHOLD.

Full variable/threshold reference, including resolve_io_budget() (the shared input/output budget split used by the CC tool path, graph/expert.py, and graph/synthesis.py): docs/system/context-variables.md.

Important: No context_window / num_ctx value configured anywhere in this system is ever 1,000,000. The static context-window heuristic (_PARAM_CTX_HEURISTIC in context_budget.py) caps at 32768 for models ≥ 25B parameters, and all current CC profiles set context_window: 32768 explicitly. "1M+" in the comparison table above refers to the aggregate retrievable context across Layers 1–4 — how much prior conversation/document content can influence a response — not any single model's num_ctx.

Compatibility¶

Tier-2 Semantic Memory is fully OpenAI API-compatible. No client changes are required. Any client that sends POST /v1/chat/completions benefits automatically once the template has enable_semantic_memory: true.

Client	Compatible	Notes
Open WebUI	✓	Session ID derived from conversation header
Claude Code	✓	Works via `X-Session-Id` or fingerprint
OpenAI Python SDK	✓	Pass `extra_headers={"X-Session-Id": "..."}` for explicit session
curl / httpie	✓	Add `-H "X-Session-Id: <uuid>"` header
Any OpenAI-compatible client	✓	No changes needed; session auto-fingerprinted

Needle Types and Scoring¶

Type	Example	Score Logic
`number`	"7342"	Exact digit match (ignoring spaces/separators)
`technical`	`http://api-staging.internal:9977/v2`	Exact match → 1.0; hostname-only match → 0.5
`date`	"14. November 2026"	Exact → 1.0; year + day or month → 0.5
`name`/`person`	"Dr. Katharina Breitfeld"	All tokens matched → 1.0; one token → 0.5

Cross-Session Memory¶

Tier-2 can optionally retrieve relevant turns from past sessions of the same user or from team-shared sessions — extending memory across conversation boundaries.

Privacy hierarchy¶

Scope	Who can retrieve	Stored when
`private`	Owner only (matching `user_id`)	Default for all turns
`team`	All members of `team_id`	User has `memory_share_with_team = true`
`shared`	Team + linked tenants (Mandanten)	Explicit admin action (future)

Enable cross-session in a template¶

{
  "enable_semantic_memory": true,
  "enable_cross_session_memory": true,
  "cross_session_scopes": ["private"],
  "cross_session_ttl_days": 30
}

User preferences¶

Users control their memory behaviour in the User Portal → Profile → Conversation Memory:

Setting	Effect
Fresh Start	Disables cross-session; every conversation begins clean. No old session data injected.
Share with Team	Stores turns as `scope=team`; team members with cross-session enabled can retrieve them.

Implementation Reference¶

Component	File	Description
Memory store	`memory_retrieval.py`	`ConversationMemoryStore` — storage, retrieval, merge
Embedding function	`memory_retrieval.HttpxOllamaEF`	httpx-based Ollama embedding, no `ollama` package required; reused by `services/context_index.py` for Tier-3
Retrieval strategy	`memory_retrieval._retrieve_sync()`	Always-numpy cosine ranking; HNSW last resort only
Cross-session retrieval	`memory_retrieval.retrieve_cross_session()`	Privacy-scoped retrieval across sessions
Merge strategy	`memory_retrieval.merge_session_results()`	Recency-first + hard cap (current always precedes cross)
Orchestrator integration	`main.py:_apply_semantic_memory()`	Eviction, storage, retrieval, context injection
Planner fast-path	`main.py:planner_node()`	Bypasses LLM planner for `memory_recall` complexity class
User preferences	`admin_ui/database.py:get_user_memory_prefs()`	`prefer_fresh`, `share_with_team` per user
Benchmark runner	`benchmarks/mrcr_lite_runner.py`	MRCR-lite v2, A/B design, configurable warmup
Dataset	`benchmarks/datasets/mrcr_lite_v1.json`	5 needles, filler turns, test matrix

Tier 4 — Episodic Memory (Task-Level)¶

Scientific basis: Tulving (1972) episodic/semantic memory distinction; Park et al. 2023, Generative Agents (arXiv:2304.03442); Packer et al. 2023, MemGPT (arXiv:2310.08560).

Tier-4 complements the conversation-level Warm memory (T2) with task-level experience. Every successful pipeline run is logged as a :Episode node in Neo4j. On similar future queries, routing hints from past episodes are injected into graph_context alongside the regular GraphRAG output.

What is stored¶

Each :Episode node holds:

Field	Content
`hash`	SHA-256 fingerprint of normalised query + task type (deduplication key)
`query_pattern`	Normalised query string (first 300 chars, lowercase)
`task_type`	Primary expert category from the planner
`routing_path`	Ordered list of categories executed (e.g. `["technical_support", "math"]`)
`tools_used`	Active pipeline tools: `graphrag`, `mcp`, `math`, `web`, `cache`
`model_signature`	Sorted unique list of `expert_models_used`
`confidence`	Weighted estimate: `0.7 × expert_confidence + 0.3 × response_completeness`
`total_tokens`	Total prompt + completion tokens
`expires_at`	ISO-8601 expiry timestamp (default: 90 days from creation)
`user_id`	Originating user (for auditing, not used in retrieval)

How retrieval works¶

get_episode_hint() is called in graph_rag_node before the Neo4j query.
Past episodes for the same task_type are ranked by Sørensen–Dice string similarity against the current query pattern (requires Neo4j APOC).
Episodes scoring above EPISODIC_MIN_CONFIDENCE (default 0.6) and within their TTL are returned as a [Episode Hint] block appended to graph_context.
Without APOC, a recency-based fallback is used automatically.

What the judge sees¶

[Episode Hint — past similar tasks]
• Routing: technical_support → math | Tools: graphrag, mcp | Confidence: 87% | Recalled 4×
• Routing: technical_support | Tools: graphrag | Confidence: 72% | Recalled 1×
[End of Episode Hint]

The hint informs which routing strategies and tools have historically produced high-confidence answers — without prescribing the current answer.

Configuration¶

Variable	Default	Description
`EPISODIC_MEMORY_ENABLED`	`1`	Set to `0` to disable entirely
`EPISODIC_MAX_HINTS`	`2`	Max episodes injected per request
`EPISODIC_MIN_CONFIDENCE`	`0.6`	Minimum stored confidence to recall
`EPISODIC_TTL_DAYS`	`90`	Days before `:Episode` nodes expire

Implementation reference¶

Component	File	Description
Schema setup	`episodic_memory.ensure_schema()`	Creates `:Episode` uniqueness constraint
Logging	`episodic_memory.log_episode()`	Fire-and-forget write after merger completion
Retrieval	`episodic_memory.get_episode_hint()`	Sørensen–Dice + recency fallback
Integration — log	`graph/synthesis.py:merger_node()`	`asyncio.create_task(log_episode(...))`
Integration — hint	`graph/tool_nodes.py:graph_rag_node()`	Called before Neo4j query; appended to `graph_context`