Monitoring & Processes¶

The admin backend provides four monitoring layers:

System Monitoring (/monitoring) — aggregated Prometheus metrics and charts
Live Monitoring (/live-monitoring) — real-time process tracking with kill functionality
Starfleet (/starfleet) — ambient intelligence dashboard with proactive alerts, live node health, mission context, and feature toggles
Pipeline Transparency Log (/pipeline-log) — per-request routing decisions, expert domains, complexity levels, latency, and agentic rounds; filterable and CSV-exportable

Starfleet documentation: Starfleet — Ambient Intelligence

Observability architecture at a glance¶

flowchart LR
    subgraph App["Orchestrator (main.py)"]
        M[prometheus_client<br/>/metrics endpoint]
        A[Valkey key<br/>moe:active:*]
        H[Valkey sorted-set<br/>moe:admin:completed]
    end
    subgraph Admin["Admin UI"]
        SM["/monitoring<br/>(Chart.js + PromQL)"]
        LM["/live-monitoring<br/>(polling 5 s)"]
    end
    subgraph Ext["External"]
        P[(Prometheus)]
        G[(Grafana)]
    end

    M -- scrape 15 s --> P
    P -- PromQL --> SM
    P -- datasource --> G
    A -- Valkey SCAN --> LM
    H -- ZRANGE --> LM
    LM -- POST kill-request --> A

    classDef app fill:#eef2ff,stroke:#6366f1;
    classDef ui  fill:#f0fdf4,stroke:#16a34a;
    classDef ext fill:#fef3c7,stroke:#d97706;
    class M,A,H app;
    class SM,LM ui;
    class P,G ext;

System Monitoring is pull-based: it queries Prometheus on demand when the operator opens the page.
Live Monitoring is poll-based: the browser calls Valkey-backed REST endpoints every 5 s.
The two layers read different data surfaces (Prometheus metrics versus Valkey keys) and serve different latency needs.

Screenshots¶

System Monitoring¶

Fully populated dashboard with the grouped navigation bar: six system gauges (ChromaDB documents, Neo4j entities and relations, ontology entities, planner patterns, ontology gaps), LLM server status cards per inference node, and Chart.js widgets for token usage, cache performance, expert calls, and latency.

System Monitoring

Live Monitoring — Active Processes & History¶

Real-time process table (5 s polling). User, IP, and request ID columns are blurred for privacy.

Live Monitoring — Active Processes

Live Monitoring — LLM Instances¶

Per-server cards: loaded models with VRAM / quantisation / TTL, Ollama metrics, and the expandable available-models list.

Live Monitoring — LLM Instances

Idle detection

Cards without any loaded models are labelled "Kein Modell geladen (idle)" (German for "No model loaded"), which is useful for spotting cold nodes during load-balancing reviews.

Starfleet — Ambient Intelligence Dashboard¶

LCARS-style dashboard with live node grid (14/14 UP), active alerts, feature toggle table, and Watchdog alert feed.

Starfleet Dashboard

Pipeline Transparency Log¶

Routing metadata per request — filterable by user, model, mode, complexity, date range, and cache hit. Columns are sortable (▼/▲). Data blurred for privacy.

Pipeline Log

Grafana — MoE System Overview¶

Grafana MoE Overview

Grafana — LLM & Expert Usage¶

Grafana LLM Usage

Grafana — Knowledge Base Health¶

Grafana Knowledge Base

Grafana — GPU & Inference Nodes¶

The moe-gpu-nodes dashboard provides per-node, per-GPU panels for VRAM usage, GPU utilization, RAM, and disk. Data is scraped from node-exporter instances via the inference-nodes Prometheus job.

Grafana GPU Nodes

Panel	Metric
VRAM Usage	`node_gpu_memory_used_bytes` / `node_gpu_memory_total_bytes`
GPU Utilization	`node_gpu_utilization_percent`
RAM Usage	`node_memory_MemTotal_bytes` - `node_memory_MemAvailable_bytes`
Disk Usage	`node_filesystem_size_bytes` - `node_filesystem_avail_bytes`

Grafana — User Metrics¶

Grafana User Metrics

Prometheus — Scrape Targets¶

Prometheus Targets

System Monitoring (`/monitoring`)¶

Endpoint Availability (24 h)¶

A stepped line chart at the top of the monitoring page shows the availability history of every configured inference server over the last 24 hours.

Data source: Prometheus query_range on moe_inference_server_up{server} (5-minute resolution)
Y-axis: UP (1) / DOWN (0); the tick labels show UP and DOWN instead of the raw 1 and 0 values for readability
One line per server — colors are assigned round-robin and match the legend below the chart
API: GET /api/endpoints/availability — returns a list of {server, values: [[ts, v], …]} objects

[
  { "server": "N04-RTX",  "values": [[1745400000, 1.0], [1745400300, 1.0], …] },
  { "server": "AIHUB",    "values": [[1745400000, 1.0], …] }
]

If Prometheus has no data yet (fresh install, no traffic), the chart area is replaced by a "No data" notice.
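The availability payload can be reduced to a single uptime figure per server. The following is a minimal sketch, assuming the response has already been decoded from JSON into the list shape shown above; the function name `uptime_percent` and the sample values are illustrative and not part of the Admin UI.

```python
from typing import Dict, List


def uptime_percent(series: List[dict]) -> Dict[str, float]:
    """Return the share of samples with value 1 for each server, in percent."""
    result = {}
    for entry in series:
        values = entry.get("values", [])
        if not values:
            continue  # no samples yet, e.g. on a fresh install
        up = sum(1 for _ts, v in values if float(v) >= 1.0)
        result[entry["server"]] = round(100.0 * up / len(values), 2)
    return result


# Hypothetical sample in the shape returned by GET /api/endpoints/availability.
sample = [
    {"server": "N04-RTX", "values": [[1745400000, 1.0], [1745400300, 1.0],
                                     [1745400600, 0.0], [1745400900, 1.0]]},
    {"server": "AIHUB", "values": [[1745400000, 1.0], [1745400300, 1.0]]},
]
print(uptime_percent(sample))
```

Servers without samples are left out of the result, which mirrors the "No data" case rather than reporting a misleading 0 %.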

API Endpoint Budget¶

For every OpenAI-compatible inference server (e.g. AIHUB / LiteLLM), the monitoring page shows a live budget card:

Element	Description
Server name	Display name from the server configuration
Spend / Max	USD spend and maximum budget read from LiteLLM response headers
Progress bar	Visual fill: green < 70 % · orange 70–90 % · red ≥ 90 %
Percentage	Exact `(spend / max) × 100 %`

How it works: The Admin UI makes a lightweight GET /v1/models request to each OpenAI-compatible endpoint on every page load and reads the LiteLLM-specific response headers:

Header	Meaning
`x-litellm-key-spend`	Current cumulative spend in USD
`x-litellm-key-max-budget`	Maximum budget configured for this API key

API: GET /api/endpoints/budget — returns [{name, url, spend_usd, max_usd, pct}]
Ollama servers are skipped (they have no budget concept)
On network error the card shows the error message instead of values

Budget alerts

When the fill reaches orange (70 %) or red (90 %), plan for a budget top-up or switch to a local-only template to avoid 402/429 errors from the provider.
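The budget card logic can be sketched as a small pure function. This is an illustration under assumptions, not the Admin UI's actual code: it assumes the response headers are available as a mapping with lower-case keys, and the function name `budget_status` and the returned field `color` are hypothetical. The thresholds and header names are the ones listed above.

```python
from typing import Mapping, Optional


def budget_status(headers: Mapping[str, str]) -> Optional[dict]:
    """Derive spend, limit, percentage and card color from LiteLLM headers."""
    spend = headers.get("x-litellm-key-spend")
    max_budget = headers.get("x-litellm-key-max-budget")
    if spend is None or max_budget is None:
        return None  # endpoint does not report a budget
    try:
        spend_usd = float(spend)
        max_usd = float(max_budget)
    except ValueError:
        return None  # header present but not numeric
    if max_usd <= 0:
        return None  # avoid dividing by zero for unlimited or unset budgets
    pct = spend_usd * 100 / max_usd
    if pct >= 90:
        color = "red"
    elif pct >= 70:
        color = "orange"
    else:
        color = "green"
    return {"spend_usd": spend_usd, "max_usd": max_usd,
            "pct": round(pct, 1), "color": color}


print(budget_status({"x-litellm-key-spend": "35.0",
                     "x-litellm-key-max-budget": "50"}))
```

Returning `None` for missing or unusable headers keeps the caller free to decide how to render such a card.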

Provider Rate Limits¶

When Claude Code is active, this section shows the remaining API quota for each endpoint:

Column	Description
Endpoint	Server/provider name
Remaining	Remaining tokens until next reset
Limit	Total limit
%	Usage as progress bar
Reset	Next reset time

Color coding: Green (> 20 % remaining) · Yellow (≤ 20 % remaining) · Red (exhausted)

System Gauges¶

Six real-time indicators for the state of the knowledge stack:

Gauge	Metric	Description
ChromaDB Documents	`moe_chroma_documents_total`	Number of vector documents in cache
Neo4j Entities	`moe_graph_entities_total`	Nodes in knowledge graph
Neo4j Relations	`moe_graph_relations_total`	Edges in knowledge graph
Ontology Entities	`moe_ontology_entities_total`	Ontology concepts
Planner Patterns	`moe_planner_patterns_total`	Learned routing patterns
Ontology Gaps	`moe_ontology_gaps_total`	Topics not covered

LLM Server Status¶

Compact overview of all configured inference servers:

Online / Offline badge
API type badge (Ollama / OpenAI)
Latency (ms)
GPU count
Loaded models with VRAM usage (Ollama) or model count (OpenAI)
Error message when offline

Hardware Metrics from Node Exporter

Each server status card also displays GPU, VRAM, RAM, and disk metrics when a node-exporter instance is reachable on port 9100 of the inference node's host. The Admin UI derives the host IP from the Ollama URL and queries the /metrics endpoint directly. GPU metrics (node_gpu_memory_used_bytes, node_gpu_memory_total_bytes, node_gpu_utilization_percent) are expected from a textfile collector (e.g., a cron job running nvidia-smi and writing to the collector directory).
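A textfile collector for the GPU metrics could look roughly like the sketch below. It assumes the cron job calls `nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`, which reports memory in MiB, and converts that output into the three metric names listed above. The `gpu` label and the function name `to_prom_lines` are assumptions; the actual label set used by the dashboards is not documented here.

```python
MIB = 1024 * 1024  # nvidia-smi reports memory in MiB when units are suppressed


def to_prom_lines(csv_text: str) -> str:
    """Convert nvidia-smi CSV rows into Prometheus text exposition lines."""
    lines = []
    for row in csv_text.strip().splitlines():
        index, used_mib, total_mib, util = [col.strip() for col in row.split(",")]
        label = f'{{gpu="{index}"}}'  # hypothetical label name
        lines.append(f"node_gpu_memory_used_bytes{label} {int(used_mib) * MIB}")
        lines.append(f"node_gpu_memory_total_bytes{label} {int(total_mib) * MIB}")
        lines.append(f"node_gpu_utilization_percent{label} {int(util)}")
    return "\n".join(lines) + "\n"


# Example row as printed by nvidia-smi for one GPU.
print(to_prom_lines("0, 1024, 24576, 37\n"), end="")
```

In a real job the returned text would be written to a `.prom` file in the node-exporter textfile directory, ideally via a temporary file and rename so that a scrape never reads a half-written file.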

Metrics Charts¶

Chart data is fetched through /api/monitoring, which queries the Prometheus API, and is rendered with Chart.js.

Chart	Metric	Type
Token usage by model	`moe_tokens_total` (by model)	Bar chart
Cache performance	`moe_cache_hits_total` / `moe_cache_misses_total`	Donut
Expert calls by category	`moe_expert_calls_total` (by category)	Bar chart
Expert calls by model	`moe_expert_calls_total` (by model)	Bar chart
Expert calls by model & node	Grouped	Bar chart
Requests by mode	`moe_requests_total` (by mode)	Donut
Confidence distribution	`moe_expert_confidence_total`	Donut
Latency & scores	P50/P95, self-evaluation, feedback	Table

Latency Metrics¶

Metric	Formula
P50 (Median)	`histogram_quantile(0.50, rate(moe_response_duration_seconds_bucket[1h]))`
P95 (95th percentile)	`histogram_quantile(0.95, rate(moe_response_duration_seconds_bucket[1h]))`
Self-evaluation avg	`moe_self_eval_score_bucket` (avg)
User feedback avg	`moe_feedback_score_bucket` (avg)
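The latency expressions can also be evaluated outside the Admin UI through the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses an instant-query response; the helper names are illustrative, and the sample response body is made up to show the shape.

```python
import json
from urllib.parse import urlencode

P95 = "histogram_quantile(0.95, rate(moe_response_duration_seconds_bucket[1h]))"


def instant_query_url(base_url: str, expr: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def first_value(response_body: str):
    """Return the first sample value of an instant-query response, or None."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return None
    result = payload["data"]["result"]
    if not result:
        return None  # no series matched, e.g. no traffic in the window
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


print(instant_query_url("http://localhost:9090", P95))

# Made-up response body in the documented instant-query format.
body = ('{"status":"success","data":{"resultType":"vector",'
        '"result":[{"metric":{},"value":[1745400000,"1.25"]}]}}')
print(first_value(body))
```

Fetching the URL with any HTTP client and passing the body to `first_value` yields the P95 latency in seconds.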

Live Monitoring (`/live-monitoring`)¶

Live Monitoring provides real-time insight into running processes with the ability to terminate individual requests.

Tab layout¶

flowchart TB
    P[/live-monitoring/]:::page
    P --> T1["Tab 1: Active Processes<br/>(badge = count)"]:::tab
    P --> T2["Tab 2: LLM Instances"]:::tab

    T1 --> R["Running API Requests<br/>(green/yellow/red by runtime)"]
    T1 --> K["Kill button<br/>per row"]
    T1 --> H["Process History<br/>up to HISTORY_MAX_ENTRIES"]

    T2 --> SC["Per-server cards"]
    SC --> LM["Loaded models<br/>(VRAM, quant, family, TTL)"]
    SC --> MC["Ollama Metrics<br/>(in-progress, queued, avg)"]
    SC --> AV["Available models<br/>(expandable list)"]

    classDef page fill:#eef2ff,stroke:#6366f1,font-weight:bold;
    classDef tab  fill:#f0fdf4,stroke:#16a34a;

All tab panes share the same 5-second polling loop but hit different REST endpoints:

Tab	Endpoint	Data source
Active Processes — running	`GET /api/live/active-requests`	Valkey `SCAN moe:active:*`
Active Processes — history	`GET /api/live/history`	Valkey `ZRANGE moe:admin:completed`
LLM Instances	`GET /api/live/llm-instances`	Direct HTTP fan-out to each configured inference server (`/api/ps`, `/api/tags` for Ollama; `/v1/models` for OpenAI-compatible)

Tab: Active Processes¶

Running API Requests¶

Table of all currently running requests (auto-refresh every 5 seconds):

Column	Description
Started	Request start time
Duration	Runtime in seconds
User	Username
Model	LLM in use
Mode	MoE mode (`native`, `moe_reasoning`, etc.)
Template	Expert template used (if set)
Type	`streaming` or `standard`
Client IP	Client IP address
Request ID	Unique chat ID
Kill	Button to terminate

Color coding by runtime:

Color	Runtime	Meaning
Green	≤ 30s	Normal
Yellow	> 30s and ≤ 120s	Longer than usual
Red	> 120s	Potential timeout

Killing a Process¶

sequenceDiagram
    actor Admin
    participant Browser as Admin UI<br/>(live_monitoring.html)
    participant Backend as admin_ui/app.py
    participant Valkey as Valkey<br/>moe:active:*
    participant Orch as Orchestrator<br/>main.py
    participant Client as API client

    Admin->>Browser: click "Kill"
    Browser->>Browser: confirmation dialog
    Browser->>Backend: POST /api/live/kill-request/{chatId}
    Backend->>Valkey: DEL moe:active:{chatId}
    Backend->>Valkey: ZADD moe:admin:completed status=killed
    Backend-->>Browser: 200 OK
    Orch->>Valkey: GET moe:active:{chatId}  (next checkpoint)
    Valkey-->>Orch: nil → abort
    Orch-->>Client: StreamingResponse closes
    Browser->>Backend: GET /api/live/active-requests  (next poll)
    Backend-->>Browser: row gone

What happens on kill:

The Valkey key moe:active:{chatId} is deleted
The request is moved to moe:admin:completed (Sorted Set) with status killed
The orchestrator finds the key missing at the next LangGraph checkpoint and aborts the running node
The client receives an abort error

Streaming Requests

For streaming requests, it may take a few seconds for the kill command to take effect (at the next LangGraph checkpoint).
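The kill flow is a form of cooperative cancellation: the admin backend removes a key, and the orchestrator checks for that key between steps. The sketch below reproduces the idea against an in-memory stand-in so it runs without a Valkey server. `FakeValkey`, `kill_request`, `run_pipeline`, and the sorted-set member format are all hypothetical; only the key names `moe:active:{chatId}` and `moe:admin:completed` come from the description above.

```python
import time


class FakeValkey:
    """In-memory stand-in with the four commands this sketch uses."""

    def __init__(self):
        self.keys = {}
        self.zsets = {}

    def set(self, key, value):
        self.keys[key] = value

    def get(self, key):
        return self.keys.get(key)

    def delete(self, key):
        # Like DEL: report how many keys were removed.
        return 1 if self.keys.pop(key, None) is not None else 0

    def zadd(self, name, mapping):
        self.zsets.setdefault(name, {}).update(mapping)


def kill_request(store, chat_id):
    """Admin side: drop the active key and record the request as killed."""
    removed = store.delete(f"moe:active:{chat_id}")
    if removed:
        # Member format is hypothetical; the score is the kill timestamp.
        store.zadd("moe:admin:completed", {f"{chat_id}:killed": time.time()})
    return bool(removed)


def run_pipeline(store, chat_id, steps):
    """Orchestrator side: check the key before each step (the checkpoint)."""
    done = []
    for step in steps:
        if store.get(f"moe:active:{chat_id}") is None:
            return done, "killed"
        done.append(step())
    return done, "completed"


store = FakeValkey()
store.set("moe:active:abc", "running")
# The second step simulates an admin clicking "Kill" while it is running.
steps = [lambda: "plan", lambda: kill_request(store, "abc") and "experts", lambda: "merge"]
print(run_pipeline(store, "abc", steps))
```

The step that is running when the kill arrives still finishes; the abort only takes effect at the following check, which matches the delay described for streaming requests.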

Process History¶

Table of all completed requests (up to HISTORY_MAX_ENTRIES, default: 5000):

Column	Description
Started	Start time
Ended	End time
Duration	Total runtime
User	Username
Model	LLM
Mode	MoE mode
Type	streaming / standard
Status	`completed` (green) or `killed` (red)

Use the "Clear History" button (top right in the history panel) to delete the entire history from Valkey.

The history limit can be configured via environment variable:

HISTORY_MAX_ENTRIES=5000   # Maximum number of entries before cleanup (default: 5000)

Tab: LLM Instances¶

Detailed status of all inference servers:

Per Server (Ollama)¶

Loaded Models:

Column	Description
Model	Model name:tag
VRAM	Currently used VRAM (MB)
Total	Total model size (MB)
Parameters	Parameter count
Quant.	Quantization level
Family	Model family
Expires	When the model will be unloaded from VRAM

Ollama Metrics (chips):

Metric	Meaning	Alert
In Progress	Current requests	Red if > 0
Queued	Queue length	Red if > 0
Loaded	Number of loaded models	–
Total Requests	Lifetime requests	–
Avg / Request	Average duration	–
↑ Input	Request size (MB)	–
↓ Output	Response size (MB)	–

Available Models (expandable):

All installed models with name, size (GB), parameter count, quantization.

Per Server (OpenAI-compatible)¶

Model count
Available models (list)

Refresh Controls¶

Element	Function
Last Updated	Timestamp of last query
Refresh manually	Immediate query
Auto 5s ☑	5-second polling (default: active)

Prometheus Metrics – Full List¶

Metric	Labels	Type	Description
`moe_tokens_total`	model, token_type, node, user_id	Counter	Processed tokens
`moe_expert_calls_total`	category, model, node	Counter	Expert invocations
`moe_requests_total`	mode	Counter	Requests by mode
`moe_response_duration_seconds`	–	Histogram	Response times
`moe_cache_hits_total`	–	Counter	Cache hits
`moe_cache_misses_total`	–	Counter	Cache misses
`moe_expert_confidence_total`	level	Counter	Confidence distribution
`moe_self_eval_score_bucket`	le	Histogram	Self-evaluation scores
`moe_feedback_score_bucket`	le	Histogram	User feedback scores
`moe_chroma_documents_total`	–	Gauge	ChromaDB documents
`moe_graph_entities_total`	–	Gauge	Neo4j entities
`moe_graph_relations_total`	–	Gauge	Neo4j relations
`moe_ontology_entities_total`	–	Gauge	Ontology entities
`moe_planner_patterns_total`	–	Gauge	Planner patterns
`moe_ontology_gaps_total`	–	Gauge	Ontology gaps

All metrics are also directly accessible via Prometheus (http://localhost:9090) and Grafana (http://localhost:3001).

Pipeline Transparency Log¶

URL: /pipeline-log
API: GET /v1/admin/pipeline-log (admin key required) or /api/pipeline-log (session)

The Pipeline Transparency Log records per-request routing metadata for every request processed by the MoE pipeline. It answers questions like: which expert domains were engaged, what complexity level was assigned, how long did the pipeline take, and how many agentic re-planning rounds occurred.

Available fields¶

Field	Description
`requested_at`	ISO timestamp of the request
`user_id` / `username`	Requesting user
`model`	Template/model used
`moe_mode`	Pipeline mode (`default`, `research`, `code`, …)
`complexity_level`	Planner complexity estimate (`trivial`, `moderate`, `complex`, `memory_recall`)
`expert_domains`	Comma-separated expert categories engaged (e.g. `reasoning,web_researcher`)
`prompt_tokens` / `completion_tokens`	Token counts
`latency_ms`	End-to-end pipeline latency in milliseconds
`cache_hit`	Whether the L0 Redis or L1 ChromaDB cache was hit
`agentic_rounds`	Number of Judge-triggered re-planning iterations
`status`	`ok` or error indicator

Filters & Sorting¶

All filters are optional. The UI exposes them as inputs at the top of the table; all are also available as query parameters on the API:

Filter	UI control	API param	Notes
User	Text input	`username`	Partial match (ILIKE)
Model	Text input	`model`	Partial match — e.g. `wcc` finds all WCC templates
Mode	Dropdown	`moe_mode`	Exact match
Complexity	Dropdown	`complexity_level`	`trivial` / `moderate` / `complex` / `memory_recall`
Cache	Dropdown	`cache_hit`	`true` / `false`
Date from / to	Date picker	`from_date` / `to_date`	ISO date `YYYY-MM-DD`
Limit	Dropdown	`limit`	50 / 100 / 500

Sorting: Click any column header (Time, User, Model, Mode, Complexity, Tokens, Latency) to sort ascending (▲) or descending (▼). Repeated clicks toggle direction. Sorting is server-side and respects pagination boundaries.

API params: sort_by (one of requested_at, model, moe_mode, username, total_tokens, latency_ms, complexity_level) and sort_dir (asc / desc).

Clear all filters via the Clear button below the date pickers.
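For scripted access, the filter and sort parameters can be assembled into a query string. This is a minimal sketch: the host name is a placeholder, the defaults chosen for `sort_by`, `sort_dir`, and `limit` are assumptions, and authentication with the admin key is deliberately left out because its mechanism is not described here.

```python
from urllib.parse import urlencode

SORT_FIELDS = {"requested_at", "model", "moe_mode", "username",
               "total_tokens", "latency_ms", "complexity_level"}
FILTER_FIELDS = {"username", "model", "moe_mode", "complexity_level",
                 "cache_hit", "from_date", "to_date"}


def pipeline_log_url(base_url, *, sort_by="requested_at", sort_dir="desc",
                     limit=50, **filters):
    """Build a GET /v1/admin/pipeline-log URL from documented parameters."""
    if sort_by not in SORT_FIELDS:
        raise ValueError(f"unsupported sort_by: {sort_by}")
    if sort_dir not in ("asc", "desc"):
        raise ValueError(f"unsupported sort_dir: {sort_dir}")
    unknown = set(filters) - FILTER_FIELDS
    if unknown:
        raise ValueError(f"unknown filters: {sorted(unknown)}")
    # Drop filters that were passed as None so they stay optional.
    params = {k: v for k, v in filters.items() if v is not None}
    params.update(sort_by=sort_by, sort_dir=sort_dir, limit=limit)
    return f"{base_url.rstrip('/')}/v1/admin/pipeline-log?{urlencode(params)}"


# "orchestrator.example" is a placeholder host.
print(pipeline_log_url("http://orchestrator.example", username="ali",
                       moe_mode="research", from_date="2025-01-01",
                       sort_by="latency_ms"))
```

Validating `sort_by` on the client side catches typos before the request is sent.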

Export¶

Append ?format=csv for a CSV download suitable for BI tools. The export respects all active filters (limit is raised to 10 000 automatically for CSV).
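Once downloaded, the CSV can be summarised with the standard library. The sketch below averages latency per complexity level. It assumes the CSV header uses the field names from the "Available fields" table (`complexity_level`, `latency_ms`), which is not confirmed by this page, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_latency_by_complexity(csv_text: str) -> dict:
    """Average latency_ms per complexity_level from an exported CSV."""
    buckets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("latency_ms"):  # skip rows without a latency value
            buckets[row["complexity_level"]].append(float(row["latency_ms"]))
    return {level: mean(values) for level, values in buckets.items()}


# Invented sample rows; a real export contains more columns.
sample_csv = (
    "complexity_level,latency_ms\n"
    "trivial,100\n"
    "trivial,300\n"
    "complex,2000\n"
)
print(mean_latency_by_complexity(sample_csv))
```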

Schema migration¶

The usage_log table is extended automatically on first startup with the new columns via ALTER TABLE … ADD COLUMN IF NOT EXISTS — no manual migration required.
