Monitoring & Dashboards
SynthOrg exposes runtime telemetry via a Prometheus /metrics endpoint plus structured JSON logs. This guide walks through every metric the application emits, a ready-to-import Grafana dashboard, and suggested alert rules. The canonical metric registration lives in src/synthorg/observability/prometheus_collector.py (pull-refreshed families) and src/synthorg/observability/prometheus_push_metrics.py (push-updated families); bounded label allowlists live in src/synthorg/observability/prometheus_labels.py.
Scraping
Point any Prometheus-compatible scraper at the running app:
scrape_configs:
  - job_name: synthorg
    scrape_interval: 30s
    static_configs:
      - targets: ['synthorg:8000']
The endpoint is unauthenticated by default; put it behind your normal scrape-ACL (firewall, sidecar proxy, Kubernetes NetworkPolicy). All metric names are prefixed with synthorg_.
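As a quick sanity check on the naming rule, the following minimal sketch scans a payload in the Prometheus text exposition format and reports any sample name that lacks the synthorg_ prefix. The function name and the sample payload are hypothetical; the python_gc line is injected on purpose to show what a violation looks like, not something the endpoint is known to emit.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format (comment lines start with '#').
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def unprefixed_metric_names(exposition_text: str, prefix: str = "synthorg_") -> set[str]:
    """Return sample names in a /metrics payload that lack the prefix."""
    offenders = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_NAME.match(line)
        if match and not match.group(1).startswith(prefix):
            offenders.add(match.group(1))
    return offenders


# Hypothetical payload; the last line is a deliberate violation.
SAMPLE = """\
# HELP synthorg_cost_total Total accumulated cost.
# TYPE synthorg_cost_total gauge
synthorg_cost_total 12.5
synthorg_tasks_total{status="running",agent="a1"} 3
python_gc_objects_collected_total{generation="0"} 100
"""

print(unprefixed_metric_names(SAMPLE))
```

In practice the text would come from an HTTP GET against the scrape target rather than from an inline string.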
Metric inventory
The Dashboard column maps each metric to a row in the default Grafana overview dashboard (monitoring/grafana/synthorg-overview.json). Rows are collapsible; the only row expanded by default is Health & SLO. The dashboard exposes four filter variables ($agent_id, $agent, $workflow_definition_id, $department) that drill panels down per-entity; the two agent-named variables exist because synthorg_tasks_total uses agent while the per-agent cost metrics use agent_id. Default queries aggregate across the full set so the unfiltered view is always meaningful.
Bounded-label values are enforced at record time in src/synthorg/observability/prometheus_labels.py; PromQL filters that reference values outside those allowlists will never match data. The formerly unbounded labels (agent_id, agent, department, and workflow_definition_id, on the metrics noted below) are validated against a registry-bound snapshot that is rebuilt on every Prometheus scrape. An unknown value drops only that one sample and emits one metrics.scrape.failed WARN log per unknown label per scrape; the log repeats on the next scrape if the value is still unknown.
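The two validation paths can be pictured with a minimal sketch. It is not the code in prometheus_labels.py: accept_sample, scrape, and TOOL_OUTCOMES are hypothetical names, and the logging call is illustrative. Only the event name metrics.scrape.failed and the allowlist values come from this guide.

```python
import logging

logger = logging.getLogger("synthorg.observability.sketch")

# Hypothetical static allowlist, mirroring the bounded outcome values
# documented for synthorg_tool_invocations_total.
TOOL_OUTCOMES = frozenset({"success", "error", "timeout"})


def accept_sample(label: str, value: str, allowed: frozenset[str]) -> bool:
    """Return True if the sample may be recorded, else log and drop it."""
    if value in allowed:
        return True
    # One WARN per unknown label value; the event name is the one quoted
    # in this guide, the call shape is an assumption.
    logger.warning("metrics.scrape.failed label=%s value=%r", label, value)
    return False


def scrape(registry_agent_ids: set[str], samples: list[tuple[str, float]]) -> dict[str, float]:
    """Rebuild the registry snapshot, then keep only known agent_id samples."""
    snapshot = frozenset(registry_agent_ids)  # rebuilt on every scrape
    return {
        agent_id: value
        for agent_id, value in samples
        if accept_sample("agent_id", agent_id, snapshot)
    }
```

For example, scrape({"a1"}, [("a1", 1.0), ("ghost", 2.0)]) keeps only the a1 sample and logs one warning for ghost; every other sample in the scrape is unaffected.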
Info
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_app | Info | version | Application build info. | Client Health |
Coordination (push-updated per multi-agent run)
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_coordination_efficiency | Gauge | - | 0.0-1.0 efficiency ratio. | Health & SLO |
| synthorg_coordination_overhead_percent | Gauge | - | % of wall time spent coordinating. | Health & SLO |
Cost & budget (pull-refreshed at scrape)
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_cost_total | Gauge | - | Total accumulated cost. | Cost & Budget |
| synthorg_budget_used_percent | Gauge | - | Monthly budget utilisation. | Health & SLO |
| synthorg_budget_monthly_cost | Gauge | - | Monthly budget in configured currency. | Cost & Budget |
| synthorg_budget_daily_used_percent | Gauge | - | Daily utilisation (prorated). | Cost & Budget |
| synthorg_agent_cost_total | Gauge | agent_id (registry-bound) | Per-agent accumulated cost. | Cost & Budget |
| synthorg_agent_budget_used_percent | Gauge | agent_id (registry-bound) | Per-agent daily utilisation. | Cost & Budget |
Agents & tasks
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_active_agents_total | Gauge | status, trust_level | Active agent count by status. | Health & SLO |
| synthorg_tasks_total | Gauge | status, agent (registry-bound) | Task count per status per agent. | Tasks |
| synthorg_task_runs_total | Counter | outcome | Emitted task outcomes by bounded outcome (succeeded / failed / cancelled / rejected). One increment per terminal-status hop on a task; a task that transitions through failed and is later retried therefore counts as one failed and one succeeded (or another terminal value) -- the counter records emitted outcomes, not unique task ids. | Tasks |
| synthorg_task_duration_seconds | Histogram | outcome | Task execution duration in seconds, partitioned by the same outcome values as synthorg_task_runs_total (buckets 0.1s-600s). Observed only when the engine has a recorded creation timestamp; transitions where the timestamp is unavailable (e.g. a task created before a process restart) skip the histogram and emit task_engine.timing_fallback WARN with synthorg_task_runs_total still incremented so the count and histogram percentages remain comparable. | Tasks |
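The counting rules for synthorg_task_runs_total and synthorg_task_duration_seconds are easy to misread, so here is a minimal sketch of them using plain Python containers in place of real Prometheus metric objects. The names record_terminal, runs_total, and durations are hypothetical, and the timestamps are made-up values.

```python
from collections import Counter
from typing import Optional

TERMINAL_OUTCOMES = ("succeeded", "failed", "cancelled", "rejected")

runs_total: Counter = Counter()          # stands in for synthorg_task_runs_total
durations: dict[str, list[float]] = {}   # stands in for the duration histogram


def record_terminal(outcome: str, created_at: Optional[float], now: float) -> None:
    """Record one terminal-status hop of a task."""
    if outcome not in TERMINAL_OUTCOMES:
        raise ValueError(f"unbounded outcome: {outcome!r}")
    runs_total[outcome] += 1  # the counter is always incremented
    if created_at is None:
        return  # creation timestamp unavailable: skip the duration observation
    durations.setdefault(outcome, []).append(now - created_at)


# One task fails, is retried, then succeeds: two emitted outcomes.
record_terminal("failed", created_at=100.0, now=130.0)
record_terminal("succeeded", created_at=100.0, now=190.0)
# A task whose creation timestamp was lost is counted but not timed.
record_terminal("succeeded", created_at=None, now=200.0)
```

After these three calls the counter holds two succeeded and one failed, while the duration store holds only two observations, which is why sums over the counter can exceed the histogram's _count.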
Providers
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_provider_tokens_total | Counter | provider, model, direction | Input/output tokens by model (direction bounded to input/output). | Tools & Providers |
| synthorg_provider_cost_total | Counter | provider, model | Cost per provider call. | Tools & Providers |
| synthorg_provider_errors_total | Counter | provider, model, error_class | Provider-call failures classified by rate_limit / timeout / connection / internal / invalid_request / auth / content_filter / not_found / other. | Tools & Providers |
Tools
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_tool_invocations_total | Counter | tool_name, outcome | Tool invocations by bounded outcome (success / error / timeout). | Tools & Providers |
| synthorg_tool_duration_seconds | Histogram | tool_name, outcome | Tool invocation duration (buckets 5ms-120s). | Tools & Providers |
API
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_api_request_duration_seconds | Histogram | method, route, status_class | HTTP request handler duration (buckets 5ms-10s). The auto-emitted _count series is the per-label request counter; use it for request-rate PromQL. | Client Health |
| synthorg_api_error_classification_total | Counter | category, status_class | 4xx/5xx response counter partitioned by RFC 9457 category (auth / validation / not_found / conflict / rate_limit / budget_exhausted / provider_error / internal) and status class. | Audit & Security |
Caches
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_cache_operations_total | Counter | cache_name, outcome | In-process cache operations (cache_name bounded to mcp_result / reranker; outcome bounded to hit / miss / evict). | Client Health |
Security
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_security_evaluations_total | Counter | verdict | Pre-tool security verdicts (verdict bounded to allow / deny / escalate / output_scan). | Audit & Security |
Audit chain
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_audit_chain_appends_total | Counter | status | Audit chain append operations (status bounded to signed / fallback / error). | Audit & Security |
| synthorg_audit_chain_depth | Gauge | - | Current hash chain length. | Audit & Security |
| synthorg_audit_chain_last_append_timestamp_seconds | Gauge | - | Unix timestamp of the most recent append. | Audit & Security |
| synthorg_security_audit_log_fill_ratio | Gauge | - | Security audit log occupancy as a fraction of max_entries (0.0 empty, 1.0 full). Alert at 0.9: increase retention or archive older entries before the ring buffer wraps and overwrites unread evidence. | Audit & Security |
OTLP export health
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_otlp_export_batches_total | Counter | kind, outcome | Export batches by kind (logs / traces) and outcome (success / failure). | Client Health |
| synthorg_otlp_export_dropped_records_total | Counter | kind | Records dropped because the queue was full or the retry budget exhausted. | Client Health |
Client transport
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_client_disconnects_total | Counter | transport, reason | Client transport disconnections (transport bounded to sse / websocket / mcp_stdio / mcp_http; reason bounded to client_initiated / transport_error / cancelled / timeout). | Client Health |
Escalation + identity + workflow
| Metric | Type | Labels | Description | Dashboard |
| --- | --- | --- | --- | --- |
| synthorg_escalation_queue_depth | Gauge | department (registry-bound) | Pending escalations awaiting decision. | Health & SLO |
| synthorg_agent_identity_version_changes_total | Counter | agent_id (registry-bound), change_type | Identity-version lifecycle events (change_type bounded to created / updated / rolled_back / archived). | Audit & Security |
| synthorg_workflow_execution_seconds | Histogram | workflow_definition_id (registry-bound), status | Workflow execution duration (status bounded to completed / failed / cancelled / timeout; buckets 0.5s-3600s). | Workflows |
Suggested PromQL queries
Saturation / backlog
# Escalation backlog (any department) peaked above 5 within the last 10m
# (max_over_time fires on a single spike; use min_over_time for a sustained breach)
max_over_time(synthorg_escalation_queue_depth[10m]) > 5
# Workflow p95 latency exceeds 60s
histogram_quantile(0.95, sum by (le) (rate(synthorg_workflow_execution_seconds_bucket[5m]))) > 60
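To make the p95 query less opaque, here is a minimal sketch of the estimate that histogram_quantile computes for classic bucketed histograms: find the bucket containing the target rank, then interpolate linearly inside it. The bucket bounds and counts are made-up values, and the real function operates on per-second rates from rate() rather than on raw counts, though the interpolation is the same.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The highest bucket must be the +Inf bucket, as in Prometheus.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("this sketch only handles 0 < q <= 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    upper, count = ordered[idx]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return ordered[-2][0]
    # The lowest bucket is assumed to start at 0.
    lower, below = (0.0, 0.0) if idx == 0 else ordered[idx - 1]
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for workflow durations (seconds).
BUCKETS = [(30.0, 50.0), (60.0, 90.0), (120.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, BUCKETS)  # lands in the (60, 120] bucket
```

Because the result is interpolated, its precision is limited by bucket width: here any p95 between 60s and 120s is an estimate, not a measured value.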
Cost / budget
# Burned 80% of the monthly budget
synthorg_budget_used_percent > 80
# Per-agent cost top 5 (most expensive right now)
topk(5, synthorg_agent_cost_total)
Coordination health
# Coordination overhead sustained above 40% for 10 minutes
avg_over_time(synthorg_coordination_overhead_percent[10m]) > 40
# Coordination efficiency dropped below 0.5 (half of runs wasted)
avg_over_time(synthorg_coordination_efficiency[15m]) < 0.5
Identity lifecycle
# Rollback rate over the last hour (audit-relevant spike check)
sum(rate(synthorg_agent_identity_version_changes_total{change_type="rolled_back"}[1h]))
# Churn rate: identity-version changes per second, by change_type (multiply by 60 for per-minute)
sum by (change_type) (rate(synthorg_agent_identity_version_changes_total[5m]))
API health
# 5xx rate as a fraction of total (clamp_min avoids NaN/Inf in idle windows;
# with a floor of 1 the ratio is understated whenever total traffic is below 1 req/s)
sum(rate(synthorg_api_request_duration_seconds_count{status_class="5xx"}[5m]))
/ clamp_min(sum(rate(synthorg_api_request_duration_seconds_count[5m])), 1)
# Request rate by status class (histogram's auto-emitted _count series)
sum by (status_class) (rate(synthorg_api_request_duration_seconds_count[1m]))
# Error rate by RFC 9457 category
sum by (category) (rate(synthorg_api_error_classification_total[5m]))
# 5xx rate by category (internal vs rate_limit vs provider_error, etc.)
sum by (category) (rate(synthorg_api_error_classification_total{status_class="5xx"}[5m]))
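The clamp_min(..., 1) guard used in the 5xx-fraction query can be checked with plain arithmetic. The sketch below mimics the division for three hypothetical traffic levels; clamped_ratio is an illustrative helper, not part of SynthOrg or Prometheus.

```python
def clamped_ratio(numerator: float, denominator: float, floor: float = 1.0) -> float:
    """Mimic `num / clamp_min(den, floor)` for two per-second rates."""
    return numerator / max(denominator, floor)


# Idle window: 0 / max(0, 1) = 0 instead of NaN.
idle = clamped_ratio(0.0, 0.0)
# Busy window (20 req/s, 1 of them 5xx): the clamp is a no-op.
busy = clamped_ratio(1.0, 20.0)
# Quiet window (0.2 req/s, half of them 5xx): the true share is 0.5,
# but the floor of 1 reports 0.1.
quiet = clamped_ratio(0.1, 0.2)
```

On low-traffic deployments a smaller floor gives a closer ratio while still avoiding division by zero; the right value depends on the expected request rate.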
Provider health
# Provider error rate per class (hot loop: rate_limit + timeout + connection)
sum by (provider, error_class) (rate(synthorg_provider_errors_total[5m]))
# Token-normalized provider error rate (error events per token volume)
sum by (provider) (rate(synthorg_provider_errors_total[5m]))
/ clamp_min(sum by (provider) (rate(synthorg_provider_tokens_total[5m])), 1)
Cache hit rate
# Hit rate per cache (0.0-1.0)
sum by (cache_name) (rate(synthorg_cache_operations_total{outcome="hit"}[5m]))
/ clamp_min(sum by (cache_name) (rate(synthorg_cache_operations_total[5m])), 1)
# Eviction spike (may indicate undersized cache)
sum by (cache_name) (rate(synthorg_cache_operations_total{outcome="evict"}[5m]))
Security posture
# Denial rate (should be low; spike indicates policy tightening or attack)
rate(synthorg_security_evaluations_total{verdict="deny"}[5m])
# Escalation rate (per second, averaged over the last 1m)
rate(synthorg_security_evaluations_total{verdict="escalate"}[1m])
Audit chain health
# Append-error rate (non-zero = signing pipeline is broken)
rate(synthorg_audit_chain_appends_total{status="error"}[5m])
# Seconds since last append (a value climbing past 300 means no append for > 5m, which is suspicious)
time() - synthorg_audit_chain_last_append_timestamp_seconds
# Audit log fill ratio (alert when the ring buffer is near capacity).
# At >0.9 the next bursts of activity overwrite the oldest entries
# before an operator can read them; rotate retention or archive.
synthorg_security_audit_log_fill_ratio
Audit log fill ratio
The synthorg_security_audit_log_fill_ratio gauge reports the
occupancy of the in-memory security audit log as a fraction of its
configured max_entries capacity. The log is a ring buffer: once
full, the oldest entries are overwritten as new audit events land.
A sustained value above 0.9 means the buffer is about to wrap; any
unread evidence beyond that point is permanently lost.
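The wrap behaviour can be reproduced with a minimal sketch built on collections.deque, whose maxlen option discards the oldest item once the buffer is full. AuditLogSketch is a hypothetical stand-in, not the actual audit log class, and the capacity of 10 is chosen only to keep the numbers readable.

```python
from collections import deque


class AuditLogSketch:
    """Hypothetical stand-in for the in-memory security audit log."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        # deque(maxlen=...) drops the oldest item once full, which is
        # the ring-buffer behaviour described above.
        self._entries: deque = deque(maxlen=max_entries)

    def append(self, event: str) -> None:
        self._entries.append(event)

    def fill_ratio(self) -> float:
        """Occupancy as a fraction of capacity (0.0 empty, 1.0 full)."""
        return len(self._entries) / self.max_entries

    def oldest(self) -> str:
        return self._entries[0]


log = AuditLogSketch(max_entries=10)
for i in range(9):
    log.append(f"event-{i}")
ratio_before_wrap = log.fill_ratio()  # 9 of 10 slots used
for i in range(9, 12):
    log.append(f"event-{i}")  # the last two appends overwrite event-0 and event-1
```

Once the ratio reaches 1.0 it stays there, so the gauge alone cannot say how many entries have been overwritten; that is the reason to alert before the wrap rather than after.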
Recommended alert rule:
- alert: SynthorgSecurityAuditLogNearCapacity
  expr: synthorg_security_audit_log_fill_ratio > 0.9
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Security audit log is {{ $value | humanizePercentage }} full"
    runbook: "increase max_entries, archive entries to long-term storage, or shorten retention"
Grafana panel definition (drop into the Audit & Security row of
monitoring/grafana/synthorg-overview.json):
{
  "title": "Security audit log fill ratio",
  "type": "gauge",
  "datasource": "${DS_PROMETHEUS}",
  "fieldConfig": {
    "defaults": {
      "min": 0,
      "max": 1,
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.75},
          {"color": "red", "value": 0.9}
        ]
      }
    }
  },
  "targets": [
    {"expr": "synthorg_security_audit_log_fill_ratio", "refId": "A"}
  ]
}
OTLP export health
# Export-failure rate per kind
sum by (kind) (rate(synthorg_otlp_export_batches_total{outcome="failure"}[5m]))
# Dropped records per kind (queue overflow or retries exhausted)
sum by (kind) (rate(synthorg_otlp_export_dropped_records_total[5m]))
Client transport health
# Disconnect rate by transport (alert on transport_error spikes)
sum by (transport, reason) (rate(synthorg_client_disconnects_total[5m]))
# Transport-error rate as a fraction of all disconnects
sum(rate(synthorg_client_disconnects_total{reason="transport_error"}[5m]))
/ clamp_min(sum(rate(synthorg_client_disconnects_total[5m])), 1)
Grafana dashboard
Import monitoring/grafana/synthorg-overview.json into any Grafana v10+ instance. The file is Grafana v10-compatible dashboard JSON (authored against the v11 editor, which emits a schema readable by v10) with a single ${DS_PROMETHEUS} template variable bound to your Prometheus data source plus four filter variables: $agent_id (sourced from synthorg_agent_cost_total's agent_id label, used by Cost & Budget + Audit & Security panels), $agent (sourced from synthorg_tasks_total's agent label, used by the Tasks row's per-agent panel), $workflow_definition_id, and $department. The two agent-named variables exist because synthorg_tasks_total and synthorg_agent_cost_total use different label names (agent vs agent_id); panels filter on whichever variable matches their underlying metric.
The dashboard organises 30+ panels into seven collapsible rows. Only Health & SLO is expanded by default, which keeps the unfiltered view scannable; expand the other rows as needed.
| Row | Default | Panels |
| --- | --- | --- |
| Health & SLO | expanded | Coordination efficiency, coordination overhead, budget utilisation, active agents, escalation queue depth |
| Tasks | collapsed | Task completion rate, task duration p50/p95, tasks-by-status, task-runs-by-outcome, tasks per agent |
| Workflows | collapsed | Workflow duration p50/p95, workflow execution rate by status, top-N workflow definitions |
| Tools & Providers | collapsed | Tool invocation rate, tool duration p95 by tool_name, provider tokens, provider cost, provider errors by class |
| Cost & Budget | collapsed | synthorg_cost_total, monthly cost, daily used %, top-25 per-agent cost, agent budget used % |
| Audit & Security | collapsed | Audit chain append rate, depth, last-append age, audit-log fill-ratio gauge, security verdicts, agent identity version changes, API error categories |
| Client Health | collapsed | Client disconnects by transport+reason, API request rate by status class, OTLP export batches, OTLP dropped records, cache hit rate, app info |
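To install via the Grafana UI: Dashboards → New → Import → Upload JSON file. Via the HTTP API: POST /api/dashboards/import with {"dashboard": <file>, "overwrite": true, "inputs": [...]}; the inputs array is what binds ${DS_PROMETHEUS} to a data source. (POST /api/dashboards/db also saves a dashboard, but its documented body has no inputs field.)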
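For scripted installs, the payload quoted above can be assembled with the standard library. This is a sketch under stated assumptions: build_import_payload and build_request are hypothetical helpers, the shape of each inputs entry (name, type, pluginId, value) is an assumption about what Grafana's import expects, and the URL and token are supplied by the caller rather than hard-coded.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, datasource_uid: str) -> dict:
    """Wrap dashboard JSON in the payload shape quoted above."""
    return {
        "dashboard": dashboard,
        "overwrite": True,
        # Assumed shape of one inputs entry; it binds ${DS_PROMETHEUS}.
        "inputs": [
            {
                "name": "DS_PROMETHEUS",
                "type": "datasource",
                "pluginId": "prometheus",
                "value": datasource_uid,
            }
        ],
    }


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated JSON POST."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A caller would load monitoring/grafana/synthorg-overview.json with json.load, pass the result to build_import_payload, and send the prepared request with urllib.request.urlopen.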
Alerts
The dashboard file does not ship alert rules because thresholds are deployment-specific. The suggested PromQL above is ready to drop into Prometheus' rules.yml; pair each query with a severity label (warning or critical) and a for: duration. Example:
groups:
  - name: synthorg
    rules:
      - alert: SynthorgCoordinationOverheadHigh
        expr: avg_over_time(synthorg_coordination_overhead_percent[10m]) > 40
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Coordination overhead is {{ $value }}%"
          runbook: "https://synthorg.io/docs/runbooks/coordination-overhead"
Logfire
Logfire's Prometheus integration can scrape the same /metrics endpoint directly; no additional wiring is required on the SynthOrg side. Follow the Logfire documentation for the Prometheus setup and point it at http://synthorg:8000/metrics. All metrics documented above will appear under the same names in Logfire dashboards.
Further reading
- Observability design: sink layout, correlation IDs, per-domain routing
- Reference: errors: RFC 9457 error categories
- src/synthorg/observability/prometheus_collector.py: canonical metric registration
- src/synthorg/observability/prometheus_push_metrics.py: push-updated metric families
- src/synthorg/observability/prometheus_labels.py: bounded label value sets