How MCPWorks Uses Grafana and Prometheus to Monitor AI Agent Infrastructure
Running infrastructure for AI agents introduces monitoring challenges that traditional web application observability doesn't cover. When your platform executes LLM-authored code in sandboxes, proxies tool calls to external MCP servers, and orchestrates multi-step agent workflows, you need metrics that go beyond request rates and error codes.
This is how we built the observability stack for MCPWorks — our open-source, namespace-based function hosting platform for AI assistants.
The stack
Our monitoring infrastructure runs on a dedicated management node, separated from the production API:
- Prometheus scrapes the /metrics endpoint on the API server every 15 seconds
- Grafana visualizes metrics and provides alerting dashboards
- Loki aggregates structured logs from all containers
- Promtail ships Docker container logs and syslog from production nodes to Loki
- Node Exporter provides system-level metrics (CPU, memory, disk, network)
- Uptime Kuma handles external uptime monitoring independently
All of these run as Docker containers on the management node with 30-day retention, communicating over a private VPC network to scrape the production server.
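Schematically, the Prometheus side of that looks like the following scrape config. The job name matches the one our alert rules reference; the VPC address and port are placeholders, not our actual values:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: mcpworks-api
    metrics_path: /metrics
    static_configs:
      - targets: ["10.0.0.2:8000"]  # production node's private VPC address (placeholder)
```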
Instrumenting FastAPI with Prometheus
The foundation is prometheus-fastapi-instrumentator, which gives us HTTP-level metrics with zero boilerplate:
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

def setup_metrics(app: FastAPI) -> Instrumentator:
    instrumentator = Instrumentator(
        should_group_status_codes=False,
        should_ignore_untemplated=True,
        should_instrument_requests_inprogress=True,
        excluded_handlers=["/metrics", "/health", "/health/live", "/health/ready"],
        inprogress_name="http_requests_inprogress",
        inprogress_labels=True,
    )
    instrumentator.instrument(app).expose(app, endpoint="/metrics", include_in_schema=False)
    return instrumentator
This gives us http_requests_total, http_request_duration_seconds, and http_requests_inprogress out of the box — broken down by method, endpoint, and status code. Health check and metrics endpoints are excluded to keep the data clean.
The instrumentator attaches after all routers are registered so every route gets captured, and it's gated behind a prometheus_enabled config flag for environments where you don't need it.
Custom business metrics: where it gets interesting
HTTP metrics tell you your API is up. Business metrics tell you your platform is working. We define these in a centralized observability module with one-line helper functions that mirror the fire-and-forget pattern we use for analytics events:
Agent orchestration
from prometheus_client import Counter, Gauge, Histogram

agent_runs_total = Counter(
"mcpworks_agent_runs_total",
"Total agent orchestration runs",
["namespace", "trigger_type", "status"],
)
agent_run_duration_seconds = Histogram(
"mcpworks_agent_run_duration_seconds",
"Agent orchestration run duration",
["namespace", "trigger_type"],
buckets=[0.5, 1, 2, 5, 10, 30, 60, 120, 300],
)
agents_running = Gauge(
"mcpworks_agents_running",
"Number of agent orchestrations currently in progress",
["namespace"],
)
The histogram buckets are tuned for agent workloads — an agent run might complete a simple tool call in under a second or run a multi-step orchestration for five minutes. The trigger_type label distinguishes scheduled runs, webhook-triggered runs, and chat-initiated runs, which have very different performance profiles.
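With those buckets in place, the p95 run duration per trigger type falls out of a standard histogram_quantile query. This is a sketch of the shape of the query, not necessarily the exact expression our dashboards use:

```promql
histogram_quantile(
  0.95,
  sum by (le, trigger_type) (
    rate(mcpworks_agent_run_duration_seconds_bucket[5m])
  )
)
```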
MCP server proxy
When agents call tools on external MCP servers, we measure the full proxy lifecycle:
mcp_proxy_calls_total = Counter(
"mcpworks_mcp_proxy_calls_total",
"Total MCP proxy calls to external servers",
["namespace", "server_name", "tool_name", "status"],
)
mcp_proxy_latency_seconds = Histogram(
"mcpworks_mcp_proxy_latency_seconds",
"MCP proxy call latency",
["namespace", "server_name"],
buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30],
)
mcp_proxy_injections_total = Counter(
"mcpworks_mcp_proxy_injections_total",
"Prompt injection attempts detected in MCP proxy responses",
["namespace", "server_name"],
)
That last one — mcp_proxy_injections_total — is unique to AI infrastructure. We scan MCP server responses for prompt injection patterns before they reach the orchestrating LLM, and this counter lets us track which external servers are returning suspicious content. A spike in this metric is an immediate security signal.
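A minimal sketch of that scanning step follows. The patterns here are purely illustrative assumptions — the real detector's rule set is more extensive and isn't reproduced in this post:

```python
import re

# Illustrative patterns only; the production rule set is larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"<\s*system\s*>", re.IGNORECASE),
]

def scan_for_injection(response_text: str) -> bool:
    """Return True if an MCP response matches a known injection pattern."""
    return any(p.search(response_text) for p in INJECTION_PATTERNS)

# On a hit, the proxy would increment the counter:
#   mcp_proxy_injections_total.labels(namespace=ns, server_name=srv).inc()
```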
Sandbox execution
Code sandbox metrics are critical because we're executing LLM-authored code in nsjail-isolated environments:
sandbox_executions_total = Counter(
"sandbox_executions_total",
"Total sandbox executions",
["tier", "status", "namespace"],
)
sandbox_violations_total = Counter(
"sandbox_violations_total",
"Total sandbox seccomp/resource violations",
["tier"],
)
sandbox_executions_in_progress = Gauge(
"sandbox_executions_in_progress",
"Sandbox executions currently in progress",
["tier"],
)
The tier label maps to subscription tiers — sandbox resource limits (CPU time, memory, network access) vary by plan, so we need to know if one tier is hitting resource ceilings more than others. The violations counter tracks seccomp policy violations, which would indicate either a sandbox escape attempt or a function trying to make syscalls it shouldn't.
We also expose an async context manager for tracking in-progress executions:
from contextlib import asynccontextmanager

@asynccontextmanager
async def track_execution(tier: str, namespace: str = "unknown"):
    sandbox_executions_in_progress.labels(tier=tier).inc()
    try:
        yield
    finally:
        sandbox_executions_in_progress.labels(tier=tier).dec()
This guarantees the gauge stays accurate even when executions fail or time out — the finally block always decrements.
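To see why the finally block matters, here's a self-contained toy version of the pattern with a stand-in gauge class (the real code uses prometheus_client's Gauge):

```python
import asyncio
from contextlib import asynccontextmanager

class InProgressGauge:
    """Stand-in for prometheus_client.Gauge, just for this sketch."""
    def __init__(self) -> None:
        self.value = 0
    def inc(self) -> None:
        self.value += 1
    def dec(self) -> None:
        self.value -= 1

gauge = InProgressGauge()

@asynccontextmanager
async def track_execution(tier: str):
    gauge.inc()
    try:
        yield
    finally:
        gauge.dec()  # runs on success, failure, and cancellation alike

async def crashing_run() -> None:
    async with track_execution("pro"):
        raise RuntimeError("sandbox timed out")

try:
    asyncio.run(crashing_run())
except RuntimeError:
    pass
# gauge.value is back to 0 despite the exception
```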
Token savings measurement
Because MCPWorks' identity is token efficiency, we measure the actual bytes flowing through MCP tool calls:
mcp_response_bytes = Histogram(
"mcpworks_mcp_response_bytes",
"MCP tool response size in bytes (proxy for token usage)",
["endpoint_type", "tool_name"],
buckets=[100, 250, 500, 1000, 2500, 5000, 10000, 50000],
)
Response size in bytes is a proxy for token consumption. By comparing this across different function implementations and MCP server configurations, we can quantify the token savings our platform provides and validate our efficiency claims with real data.
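To turn sizes into rough token counts, a simple heuristic works: around 4 bytes per token for English-heavy JSON. That ratio is an assumption for illustration, not a measured constant, and the label values in the comment below are hypothetical:

```python
# Rough heuristic: ~4 bytes per token for English-heavy JSON payloads.
BYTES_PER_TOKEN = 4

def estimate_tokens(response_bytes: int) -> int:
    """Convert an MCP response size in bytes to an approximate token count."""
    return max(1, response_bytes // BYTES_PER_TOKEN)

# At the proxy boundary, the histogram records the raw size:
#   mcp_response_bytes.labels(endpoint_type="mcp", tool_name=tool).observe(len(body))
```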
Grafana dashboards
Our API dashboard covers six panels, provisioned from JSON so they're version-controlled alongside the infrastructure:
- Request Rate — sum(rate(http_requests_total[5m])) overall and by method
- Error Rate (5xx) — the ratio that triggers alerts above 5%
- P95/P50 Latency — histogram_quantile over request duration buckets
- Status Codes — rate by status code for spotting unusual patterns
- In-Progress Requests — concurrency gauge for capacity planning
- Request Rate by Endpoint — top 10 endpoints by traffic volume
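The by-endpoint panel, for instance, is a topk query. This is a sketch; handler is the default path label that prometheus-fastapi-instrumentator emits:

```promql
topk(10, sum by (handler) (rate(http_requests_total[5m])))
```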
Dashboards are provisioned automatically when Grafana starts — the JSON files live in the repository under infra/mgmt/grafana/dashboards/ and a provisioning config points Grafana at that directory.
Alerting
Prometheus evaluates alert rules every 15 seconds. Our current rules:
- alert: ApiDown
  expr: up{job="mcpworks-api"} == 0
  for: 1m
  labels:
    severity: critical

- alert: HighErrorRate
  expr: >
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning

- alert: HighLatency
  expr: >
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) > 1
  for: 5m
  labels:
    severity: warning
The for durations prevent flapping — a single slow request won't page anyone, but five minutes of sustained high latency will. System-level alerts (CPU > 80%, memory > 85%, disk < 15% free) come from Node Exporter metrics.
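A disk alert along those lines might look like this sketch — the alert name, filesystem filters, and for duration here are illustrative, not our exact rule:

```yaml
- alert: LowDiskSpace
  expr: >
    node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
    / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.15
  for: 10m
  labels:
    severity: warning
```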
Log aggregation with Loki
Metrics tell you what happened. Logs tell you why. Promtail runs on the production node and ships Docker container logs and syslog to Loki on the management node.
The pipeline is configured to parse structured JSON logs from the API container, extracting level and event fields as labels:
pipeline_stages:
  - docker: {}
  - match:
      selector: '{container_name="mcpworks-api"}'
      stages:
        - json:
            expressions:
              level: level
              event: event
        - labels:
            level:
            event:
This works because the API uses structlog throughout — every log line is structured JSON with consistent fields. In Grafana, you can query Loki with LogQL to correlate logs with metric spikes:
{container_name="mcpworks-api", level="error"} |= "sandbox"
This instantly surfaces sandbox-related errors during an incident, without grepping through files on the production server.
Health checks: three tiers
The API exposes three health check endpoints at different depths:
- /health — is the process running? Returns {"status": "healthy"} with no dependencies.
- /health/live — liveness probe for orchestrators. Same as above but semantically distinct.
- /health/ready — readiness probe. Verifies database connectivity, Redis connectivity, and sandbox binary availability. Returns component-level status so you know exactly what's degraded.
Docker Compose health checks hit /v1/health to determine container health. The readiness endpoint is what you'd wire into a load balancer to drain traffic from a node that's lost its database connection.
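Wired into Docker Compose, that check might look like the following sketch — the service name, port, and timing values are assumptions:

```yaml
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```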
Architecture: management node separation
We deliberately run monitoring infrastructure on a separate node from production. This means:
- A production outage doesn't take down your ability to observe it
- Monitoring containers don't compete with the API for CPU and memory
- Prometheus and Loki retention (30 days each) has its own disk budget
- The management node also runs secrets management (Infisical), keeping sensitive infrastructure off the API server
Communication happens over DigitalOcean's private VPC network — Prometheus scrapes the API's /metrics endpoint over the VPC IP, and Promtail pushes logs to Loki over the same network. No monitoring traffic touches the public internet.
What's next
As we approach our open-source launch, the observability stack evolves with the platform:
- Per-namespace dashboards — Grafana variables scoped to individual namespaces so operators can see their own metrics
- Token savings dashboards — visualizing the actual token efficiency gains across different function backends
- Agent orchestration traces — correlating agent runs with their individual tool calls and sandbox executions
- Self-hosted operator experience — making the monitoring stack easy to deploy for self-hosted MCPWorks installations
The full infrastructure configuration is open-source in the MCPWorks API repository. If you're building AI agent infrastructure and want to see how we instrument it, the infra/mgmt/ directory has everything — Prometheus configs, Grafana dashboards, Loki setup, and alert rules.
MCPWorks is the open-source standard for token-efficient AI agents. Learn more at mcpworks.io.
Self-host free forever, or try MCPWorks Cloud — 14-day Pro trial, no credit card.