Claude Code Profiles¶
Claude Code profiles control how the MoE Sovereign orchestrator handles requests from Claude Code CLI, VS Code extension, and other Anthropic API clients. Each profile maps to a different processing mode with distinct trade-offs.
Three Reference Profiles¶
Native (Direct LLM)¶
The request is forwarded directly to a single LLM (e.g., phi4:14b-fp16)
without any MoE pipeline involvement. The model handles tool calls natively.
- Latency: 5-30 seconds
- Use case: Quick edits, simple bug fixes, interactive coding
- Trade-off: No multi-expert synthesis, no GraphRAG, no knowledge accumulation
Reasoning (Thinking Node)¶
The request passes through the MoE pipeline with the Thinking Node enabled.
The LLM performs chain-of-thought reasoning with <think> blocks before generating
the response.
- Latency: 30-120 seconds
- Use case: Architecture decisions, complex debugging, code review
- Trade-off: Deeper analysis but slower; no parallel expert routing
Orchestrated (Full Pipeline)¶
The full MoE pipeline: Planner decomposes the task, parallel expert LLMs process sub-tasks, Merger synthesizes results, Judge evaluates quality, and GraphRAG accumulates knowledge for future requests.
- Latency: 2-10 minutes
- Use case: Deep research, multi-domain synthesis, knowledge-enriched analysis
- Trade-off: Highest quality but impractical for interactive coding
Choosing a Profile¶
| Scenario | Recommended Profile |
|---|---|
| Fix a typo or syntax error | Native |
| Add a simple feature | Native |
| Debug a complex race condition | Reasoning |
| Architecture review | Reasoning |
| Security audit of a codebase | Orchestrated |
| Research + implementation plan | Orchestrated |
| Multi-file refactoring with tests | Reasoning |
Configuration¶
Admin UI¶
Navigate to CC Profile in the admin navigation. Each profile has:
| Field | Description |
|---|---|
name |
Display name shown in clients |
moe_mode |
native, moe_reasoning, or moe_orchestrated |
tool_model |
LLM for tool execution (e.g., phi4:14b-fp16) |
tool_endpoint |
Inference server node (e.g., N04-RTX) |
expert_template_id |
Expert template for orchestrated mode (optional) |
tool_max_tokens |
Max output tokens for tool calls — see note below |
reasoning_max_tokens |
Max tokens for thinking blocks — see note below |
tool_choice |
auto, required, or any |
tool_max_tokens and reasoning_max_tokens — Template is the Source of Truth¶
Profile values must not exceed the template's context_window
When a profile references an expert template via expert_template_id, the template's
context_window per expert category is the authoritative upper bound.
Setting tool_max_tokens: 65536 in the profile while the template defines
context_window: 32768 is a misconfiguration: the orchestrator will silently cap the
value and log a warning — but Claude Code never learns that its advertised token budget
is wrong. The session appears to work, but responses are shorter than expected with no
explanation.
Rule: Set tool_max_tokens ≤ template context_window. When you change the
hardware or switch to a model with a different context window, update the
template first, then align the profile.
Template context_window: 32768 ← source of truth (set in Admin UI → Expert Templates)
Profile tool_max_tokens: 32768 ← must be ≤ template value
Profile reasoning_max_tokens: 32768
The orchestrator validates this at session build time (Phase 6 of CC session resolution)
and logs a WARNING with a fix hint if the values diverge.
User Portal¶
Users can create personal profiles under My Templates > CC Profiles. Personal profiles override admin-assigned profiles.
API Key Binding¶
Each API key can be bound to a specific CC profile:
- Admin UI > Users > select user > API Keys
- Set the CC Profile dropdown for the key
- All requests with that key will use the bound profile
Client Configuration¶
Point Claude Code to your MoE Sovereign instance:
# Claude Code CLI
export ANTHROPIC_BASE_URL=https://your-moe-instance.example.com
export ANTHROPIC_API_KEY=moe-sk-xxxxxxxx...
# VS Code settings.json
{
"claude-code.apiEndpoint": "https://your-moe-instance.example.com",
"claude-code.apiKey": "moe-sk-xxxxxxxx..."
}
Innovator Profiles¶
The Innovator profile family (cc-innovator-*) is designed for Claude Code
power users who want the full MoE pipeline with different quality/speed trade-offs.
All three profiles use moe_orchestrated mode with dedicated expert templates.
Profile Comparison¶
| Profile | ID | Tool Model | Target Latency | Thinking | Max Tokens |
|---|---|---|---|---|---|
| Fast | cc-innovator-fast |
phi4:14b-fp16 |
30-90s | off | 4,096 |
| Balanced | cc-innovator-balanced |
Qwen3-Coder-Next |
2-5 min | on | 8,192 / 16K reasoning |
| Deep | cc-innovator-deep |
Qwen3-Coder-Next |
5-15 min | on | 8,192 / 32K reasoning |
Key differences:
- Fast uses
tool_choice: requiredwith lightweight models andstream_think: falsefor minimal overhead. Best for rapid iteration cycles. - Balanced enables thinking blocks and escalates to domain-specialist models via T2 fallback. Good default for everyday development.
- Deep uses the largest available models with a security-analyst expert category and an assertive system prompt requiring complete, production-grade code. Best for security audits, architecture reviews, and complex refactoring.
5-Epoch Benchmark Results¶
A controlled benchmark across 5 consecutive runs measures the accumulation effect of the MoE knowledge pipeline. Each epoch re-runs the same test suite; GraphRAG accumulates knowledge from prior runs, improving accuracy and reducing latency.
| Epoch | Avg Score | Avg Latency | Latency vs Epoch 1 |
|---|---|---|---|
| 1 | 5.2 / 10 | 280s | baseline |
| 2 | 6.4 / 10 | 125s | 0.45x |
| 3 | 7.1 / 10 | 72s | 0.26x |
| 4 | 7.8 / 10 | 45s | 0.16x |
| 5 | 8.1 / 10 | 30s | 0.11x |
Accumulation effect: By Epoch 5, the system delivers 9.3x faster responses than Epoch 1 while simultaneously improving answer quality by 56%. This is driven by three mechanisms:
- GraphRAG context enrichment — prior synthesis results are stored as
SYNTHESIS_INSIGHTrelations and injected into future expert prompts - L2 plan cache — identical task decompositions hit the Valkey SHA-256 plan cache, skipping the planner LLM entirely
- Model warmth — sticky sessions and the model registry keep frequently used models loaded in VRAM, eliminating cold-start overhead
The accumulation effect is most pronounced in Epochs 1-3 (steep improvement) and plateaus around Epoch 4-5 as the knowledge graph saturates for the test domain.
Download Reference Profiles¶
Pre-configured profile JSONs are available for download:
cc-ref-native.json— Direct LLMcc-ref-reasoning.json— Thinking Nodecc-ref-orchestrated.json— Full Pipeline
Replace <YOUR_OLLAMA_HOST> and <YOUR_TEMPLATE_ID> with your actual values.
Conversation History Compression¶
Long MoE synthesis responses in the conversation history can fill the tool model's context window over many turns. The orchestrator compresses older assistant and tool messages automatically before sending the history to the tool LLM.
How It Works¶
- After converting the incoming message history to OpenAI format, assistant and tool
messages older than
CC_HISTORY_COMPRESS_KEEP_TURNSturns are inspected. - Messages exceeding
CC_HISTORY_COMPRESS_THRESHOLDcharacters are condensed: the first 800 chars and last 200 chars are kept; the middle is replaced with[…N chars — condensed for context window]. - The full original content is cached in Redis under
cc:hist:<session_id>:<sha1[:12]>with a 1-hour TTL — no data is permanently lost. - The existing
_trim_oai_to_budget()then operates on the already-compressed history, so it needs to drop far fewer complete turns.
Configuration (.env)¶
| Variable | Default | Effect |
|---|---|---|
CC_HISTORY_COMPRESS_THRESHOLD |
3000 |
Chars above which a message is compressed |
CC_HISTORY_COMPRESS_KEEP_TURNS |
2 |
Most recent N turns are never compressed |
Increase CC_HISTORY_COMPRESS_THRESHOLD for models with large context windows where
you want to retain more verbatim history. Decrease it if the tool model runs out of
context quickly.
API Compatibility¶
The /v1/messages endpoint is fully compatible with the Anthropic Messages API.
Claude Code CLI, Anthropic Python SDK, and VS Code extensions work without
modification — just point ANTHROPIC_BASE_URL to your MoE Sovereign instance.
Supported features:
- Streaming responses (SSE)
- Tool use / function calling
- Multi-turn conversations
- System prompts
- Thinking blocks (reasoning mode)
- Image inputs (forwarded to vision-capable models)