Monitoring & Dashboards¶

SynthOrg exposes runtime telemetry via a Prometheus /metrics endpoint plus structured JSON logs. This guide walks through every metric the application emits, a ready-to-import Grafana dashboard, and suggested alert rules. The canonical metric registration lives in src/synthorg/observability/prometheus_collector.py (pull-refreshed families) and src/synthorg/observability/prometheus_push_metrics.py (push-updated families); bounded label allowlists live in src/synthorg/observability/prometheus_labels.py.

Scraping¶

Point any Prometheus-compatible scraper at the running app:

scrape_configs:
  - job_name: synthorg
    scrape_interval: 30s
    static_configs:
      - targets: ['synthorg:8000']

The endpoint is unauthenticated by default; put it behind your normal scrape-ACL (firewall, sidecar proxy, Kubernetes NetworkPolicy). All metric names are prefixed with synthorg_.
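
One way to check the prefix claim against a live scrape is to parse the exposition text and list any sample names that lack `synthorg_`. The sketch below is a minimal, stdlib-only illustration, not part of SynthOrg: `metric_names`, `unprefixed`, and `fetch_metrics` are hypothetical helper names, the `SAMPLE` payload (including its label values) is made up for the example, and the URL is the target from the scrape config above.

```python
import urllib.request


def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def unprefixed(names, prefix="synthorg_"):
    """Return the sorted names that do not start with the expected prefix."""
    return sorted(n for n in names if not n.startswith(prefix))


def fetch_metrics(url="http://synthorg:8000/metrics", timeout=5.0) -> str:
    """Fetch the raw exposition text (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Made-up payload for illustration only; the label values are assumptions.
SAMPLE = """\
# HELP synthorg_cost_total Total accumulated cost.
# TYPE synthorg_cost_total gauge
synthorg_cost_total 12.5
synthorg_tasks_total{status="done",agent="a1"} 3
"""

print(unprefixed(metric_names(SAMPLE)))
```

Against a running instance, `unprefixed(metric_names(fetch_metrics()))` would list anything exposed without the prefix. Histogram families appear with `_bucket`, `_sum`, and `_count` suffixes, which still carry the prefix.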

Metric inventory¶

The Dashboard column maps each metric to a row in the default Grafana overview dashboard (monitoring/grafana/synthorg-overview.json). Rows are collapsible; the only row expanded by default is Health & SLO. The dashboard exposes four filter variables ($agent_id, $agent, $workflow_definition_id, $department) that drill panels down per-entity; the two agent-named variables exist because synthorg_tasks_total uses agent while the per-agent cost metrics use agent_id. Default queries aggregate across the full set so the unfiltered view is always meaningful.

Bounded-label values are enforced at record time in src/synthorg/observability/prometheus_labels.py; PromQL filters that reference values outside those allowlists will never match data. The four formerly unbounded labels (agent_id, agent, department, and workflow_definition_id, on the metrics marked registry-bound below) are validated against a registry-bound snapshot that is rebuilt on every Prometheus scrape. An unknown value drops only that one sample and emits one metrics.scrape.failed WARN log per unknown label per scrape; the log repeats on the next scrape if the value is still unknown.
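
The drop-and-warn behaviour for registry-bound labels can be pictured with a small sketch. It is an illustration under assumptions, not the code in prometheus_labels.py: `filter_registry_bound`, the logger name, and the sample shape are all hypothetical, and "one warning per unknown value per call" is one reading of the per-scrape logging rule described above.

```python
import logging

# Hypothetical logger name, chosen for the example only.
logger = logging.getLogger("synthorg.observability.example")


def filter_registry_bound(samples, label, allowed):
    """Keep samples whose `label` value is in the registry snapshot.

    `samples` is a list of (labels_dict, value) pairs and `allowed` is the
    snapshot of known values built for this scrape. Unknown values are
    dropped and logged once per distinct value per call (one call = one scrape).
    """
    kept = []
    warned = set()
    for labels, value in samples:
        label_value = labels.get(label)
        if label_value in allowed:
            kept.append((labels, value))
        elif label_value not in warned:
            warned.add(label_value)
            logger.warning(
                "metrics.scrape.failed label=%s value=%r", label, label_value
            )
    return kept


snapshot = {"agent-1", "agent-2"}  # assumed registry contents
scraped = [
    ({"agent_id": "agent-1"}, 1.25),
    ({"agent_id": "ghost"}, 9.0),   # unknown: dropped, warned once
    ({"agent_id": "ghost"}, 3.0),   # unknown again: dropped, no second warning
]
print(filter_registry_bound(scraped, "agent_id", snapshot))
```

Because the snapshot is rebuilt per scrape, a value that becomes known later starts appearing without a restart; until then, a dashboard filter on that value matches nothing.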

Info¶

Metric	Type	Labels	Description	Dashboard
`synthorg_app`	Info	`version`	Application build info.	`Client Health`

Coordination (push-updated per multi-agent run)¶

Metric	Type	Labels	Description	Dashboard
`synthorg_coordination_efficiency`	Gauge	-	0.0-1.0 efficiency ratio.	`Health & SLO`
`synthorg_coordination_overhead_percent`	Gauge	-	% of wall time spent coordinating.	`Health & SLO`

Cost & budget (pull-refreshed at scrape)¶

Metric	Type	Labels	Description	Dashboard
`synthorg_cost_total`	Gauge	-	Total accumulated cost.	`Cost & Budget`
`synthorg_budget_used_percent`	Gauge	-	Monthly budget utilisation.	`Health & SLO`
`synthorg_budget_monthly_cost`	Gauge	-	Monthly budget in configured currency.	`Cost & Budget`
`synthorg_budget_daily_used_percent`	Gauge	-	Daily utilisation (prorated).	`Cost & Budget`
`synthorg_agent_cost_total`	Gauge	`agent_id` (registry-bound)	Per-agent accumulated cost.	`Cost & Budget`
`synthorg_agent_budget_used_percent`	Gauge	`agent_id` (registry-bound)	Per-agent daily utilisation.	`Cost & Budget`

Agents & tasks¶

Metric	Type	Labels	Description	Dashboard
`synthorg_active_agents_total`	Gauge	`status`, `trust_level`	Active agent count by status and trust level.	`Health & SLO`
`synthorg_tasks_total`	Gauge	`status`, `agent` (registry-bound)	Task count per status per agent.	`Tasks`
`synthorg_task_runs_total`	Counter	`outcome`	Emitted task outcomes by bounded `outcome` (`succeeded` / `failed` / `cancelled` / `rejected`). One increment per terminal-status hop on a task; a task that transitions through `failed` and is later retried therefore counts as one `failed` and one `succeeded` (or another terminal value) -- the counter records emitted outcomes, not unique task ids.	`Tasks`
`synthorg_task_duration_seconds`	Histogram	`outcome`	Task execution duration in seconds, partitioned by the same `outcome` values as `synthorg_task_runs_total` (buckets 0.1s-600s). Observed only when the engine has a recorded creation timestamp; transitions where the timestamp is unavailable (e.g. a task created before a process restart) skip the histogram and emit `task_engine.timing_fallback` WARN with `synthorg_task_runs_total` still incremented so the count and histogram percentages remain comparable.	`Tasks`
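
The "emitted outcomes, not unique task ids" rule for `synthorg_task_runs_total` is easy to misread, so here is a toy model of it. This is a sketch under assumptions: `count_outcomes` is a hypothetical helper, the transition list is invented, and `running` stands in for whatever non-terminal statuses the engine actually uses.

```python
from collections import Counter

# Terminal outcomes as listed in the table above.
TERMINAL = {"succeeded", "failed", "cancelled", "rejected"}


def count_outcomes(transitions):
    """Count one increment per terminal-status hop, regardless of task id."""
    runs = Counter()
    for _task_id, status in transitions:
        if status in TERMINAL:
            runs[status] += 1
    return runs


# Invented history: task t1 fails, is retried, then succeeds; t2 is cancelled.
transitions = [
    ("t1", "running"), ("t1", "failed"),
    ("t1", "running"), ("t1", "succeeded"),
    ("t2", "running"), ("t2", "cancelled"),
]
runs = count_outcomes(transitions)
unique_tasks = {task_id for task_id, _ in transitions}
print(dict(runs), "emitted:", sum(runs.values()), "unique tasks:", len(unique_tasks))
```

Here three outcomes are emitted for two tasks, so a success ratio built from this counter describes attempts rather than tasks.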

Providers¶

Metric	Type	Labels	Description	Dashboard
`synthorg_provider_tokens_total`	Counter	`provider`, `model`, `direction`	Input/output tokens by model (`direction` bounded to `input`/`output`).	`Tools & Providers`
`synthorg_provider_cost_total`	Counter	`provider`, `model`	Cost per provider call.	`Tools & Providers`
`synthorg_provider_errors_total`	Counter	`provider`, `model`, `error_class`	Provider-call failures classified by `rate_limit` / `timeout` / `connection` / `internal` / `invalid_request` / `auth` / `content_filter` / `not_found` / `other`.	`Tools & Providers`

Tools¶

Metric	Type	Labels	Description	Dashboard
`synthorg_tool_invocations_total`	Counter	`tool_name`, `outcome`	Tool invocations by bounded outcome (`success` / `error` / `timeout`).	`Tools & Providers`
`synthorg_tool_duration_seconds`	Histogram	`tool_name`, `outcome`	Tool invocation duration (buckets 5ms-120s).	`Tools & Providers`

API¶

Metric	Type	Labels	Description	Dashboard
`synthorg_api_request_duration_seconds`	Histogram	`method`, `route`, `status_class`	HTTP request handler duration (buckets 5ms-10s). The auto-emitted `_count` series is the per-label request counter; use it for request-rate PromQL.	`Client Health`
`synthorg_api_error_classification_total`	Counter	`category`, `status_class`	4xx/5xx response counter partitioned by RFC 9457 category (`auth` / `validation` / `not_found` / `conflict` / `rate_limit` / `budget_exhausted` / `provider_error` / `internal`) and status class.	`Audit & Security`

Caches¶

Metric	Type	Labels	Description	Dashboard
`synthorg_cache_operations_total`	Counter	`cache_name`, `outcome`	In-process cache operations (`cache_name` bounded to `mcp_result` / `reranker`; `outcome` bounded to `hit` / `miss` / `evict`).	`Client Health`

Security¶

Metric	Type	Labels	Description	Dashboard
`synthorg_security_evaluations_total`	Counter	`verdict`	Pre-tool security verdicts (`verdict` bounded to `allow` / `deny` / `escalate` / `output_scan`).	`Audit & Security`

Audit chain¶

Metric	Type	Labels	Description	Dashboard
`synthorg_audit_chain_appends_total`	Counter	`status`	Audit chain append operations (`status` bounded to `signed` / `fallback` / `error`).	`Audit & Security`
`synthorg_audit_chain_depth`	Gauge	-	Current hash chain length.	`Audit & Security`
`synthorg_audit_chain_last_append_timestamp_seconds`	Gauge	-	Unix timestamp of the most recent append.	`Audit & Security`
`synthorg_security_audit_log_fill_ratio`	Gauge	-	Security audit log occupancy as a fraction of `max_entries` (0.0 empty, 1.0 full). Alert at 0.9: increase `max_entries` or archive older entries before the ring buffer wraps and overwrites unread evidence.	`Audit & Security`

OTLP export health¶

Metric	Type	Labels	Description	Dashboard
`synthorg_otlp_export_batches_total`	Counter	`kind`, `outcome`	Export batches by kind (`logs` / `traces`) and outcome (`success` / `failure`).	`Client Health`
`synthorg_otlp_export_dropped_records_total`	Counter	`kind`	Records dropped because the queue was full or the retry budget was exhausted.	`Client Health`

Client transport¶

Metric	Type	Labels	Description	Dashboard
`synthorg_client_disconnects_total`	Counter	`transport`, `reason`	Client transport disconnections (`transport` bounded to `sse` / `websocket` / `mcp_stdio` / `mcp_http`; `reason` bounded to `client_initiated` / `transport_error` / `cancelled` / `timeout`).	`Client Health`

Escalation + identity + workflow¶

Metric	Type	Labels	Description	Dashboard
`synthorg_escalation_queue_depth`	Gauge	`department` (registry-bound)	Pending escalations awaiting decision.	`Health & SLO`
`synthorg_agent_identity_version_changes_total`	Counter	`agent_id` (registry-bound), `change_type`	Identity-version lifecycle events (`change_type` bounded to `created` / `updated` / `rolled_back` / `archived`).	`Audit & Security`
`synthorg_workflow_execution_seconds`	Histogram	`workflow_definition_id` (registry-bound), `status`	Workflow execution duration (`status` bounded to `completed` / `failed` / `cancelled` / `timeout`; buckets 0.5s-3600s).	`Workflows`

Suggested PromQL queries¶

Saturation / backlog¶

# Escalation backlog (any department) sustained above 5 for 10m
# (min_over_time: every sample in the window exceeded 5)
min_over_time(synthorg_escalation_queue_depth[10m]) > 5

# Workflow p95 latency exceeds 60s
histogram_quantile(0.95, sum by (le) (rate(synthorg_workflow_execution_seconds_bucket[5m]))) > 60
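
To make the p95 query less opaque, the sketch below mimics how a quantile is estimated from cumulative histogram buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus source code; the function name mirrors the PromQL function for readability, and the bucket bounds and counts are invented (the real bounds span 0.5s-3600s but are not listed in this guide).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. The lowest bucket is assumed
    to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented bucket data: 100 workflow runs, cumulative counts per upper bound.
example = [(30.0, 50), (60.0, 90), (120.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))  # about 97.5, so this p95 exceeds 60s
```

The interpolation means the result is only as precise as the bucket layout: a p95 that lands in a wide bucket can move a long way on small changes in the underlying data.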

Cost / budget¶

# Burned 80% of the monthly budget
synthorg_budget_used_percent > 80

# Per-agent cost top 5 (most expensive right now)
topk(5, synthorg_agent_cost_total)

Coordination health¶

# Coordination overhead averaging above 40% over the last 10 minutes
avg_over_time(synthorg_coordination_overhead_percent[10m]) > 40

# Coordination efficiency dropped below 0.5 (half of runs wasted)
avg_over_time(synthorg_coordination_efficiency[15m]) < 0.5

Identity lifecycle¶

# Rollback rate over the last hour (audit-relevant spike check)
sum(rate(synthorg_agent_identity_version_changes_total{change_type="rolled_back"}[1h]))

# Churn rate -- identity-version changes per second by change_type (multiply by 60 for per-minute)
sum by (change_type) (rate(synthorg_agent_identity_version_changes_total[5m]))
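
When reading these numbers, keep in mind that PromQL's rate() yields a per-second average over the window, so a per-minute figure needs a factor of 60. The sketch below is a deliberately simplified model of that calculation: `simple_rate` is a hypothetical helper, the samples are invented, and the boundary extrapolation that Prometheus applies is left out.

```python
def simple_rate(samples):
    """Per-second average increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value) pairs, oldest
    first. A drop in value is treated as a counter reset, in which case the
    post-reset value counts as the increase for that step.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev_value = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < prev_value else value - prev_value
        prev_value = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Invented series: one scrape every 30s for 5 minutes, +6 changes per scrape.
steady = [(t, 100 + t // 5) for t in range(0, 301, 30)]
per_second = simple_rate(steady)
print(per_second, "per second,", per_second * 60, "per minute")

# Invented series with a process restart between t=30 and t=60.
with_reset = [(0, 50), (30, 56), (60, 4), (90, 10)]
print(simple_rate(with_reset))
```

The second series shows why counters are safe across restarts: the reset does not produce a negative rate.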

API health¶

# 5xx rate as a fraction of total. rate() is per-second, so the clamp_min floor is kept
# tiny: it avoids NaN in idle windows without understating the ratio below 1 req/s.
sum(rate(synthorg_api_request_duration_seconds_count{status_class="5xx"}[5m]))
  / clamp_min(sum(rate(synthorg_api_request_duration_seconds_count[5m])), 1e-9)

# Request rate by status class (histogram's auto-emitted _count series)
sum by (status_class) (rate(synthorg_api_request_duration_seconds_count[1m]))

# Error rate by RFC 9457 category
sum by (category) (rate(synthorg_api_error_classification_total[5m]))

# 5xx rate by category (internal vs rate_limit vs provider_error, etc.)
sum by (category) (rate(synthorg_api_error_classification_total{status_class="5xx"}[5m]))

Provider health¶

# Provider error rate per class (hot loop: rate_limit + timeout + connection)
sum by (provider, error_class) (rate(synthorg_provider_errors_total[5m]))

# Token-normalized provider error rate (error events per token volume)
sum by (provider) (rate(synthorg_provider_errors_total[5m]))
  / clamp_min(sum by (provider) (rate(synthorg_provider_tokens_total[5m])), 1)

Cache hit rate¶

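# Hit rate per cache (0.0-1.0): hits over lookups (hit + miss). Evictions are not lookups,
# so they stay out of the denominator. rate() is per-second, hence the tiny clamp_min floor.
sum by (cache_name) (rate(synthorg_cache_operations_total{outcome="hit"}[5m]))
  / clamp_min(sum by (cache_name) (rate(synthorg_cache_operations_total{outcome=~"hit|miss"}[5m])), 1e-9)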

# Eviction spike (may indicate undersized cache)
sum by (cache_name) (rate(synthorg_cache_operations_total{outcome="evict"}[5m]))

Security posture¶

# Denial rate (should be low; spike indicates policy tightening or attack)
rate(synthorg_security_evaluations_total{verdict="deny"}[5m])

# Escalation rate (per second, averaged over a 1m window)
rate(synthorg_security_evaluations_total{verdict="escalate"}[1m])

Audit chain health¶

# Append-error rate (non-zero = signing pipeline is broken)
rate(synthorg_audit_chain_appends_total{status="error"}[5m])

# Seconds since last append (a value climbing past 300 means no append for over 5m, which is suspicious)
time() - synthorg_audit_chain_last_append_timestamp_seconds

# Audit log fill ratio (alert when the ring buffer is near capacity).
# At >0.9 the next bursts of activity overwrite the oldest entries
# before an operator can read them; rotate retention or archive.
synthorg_security_audit_log_fill_ratio

Audit log fill ratio¶

The synthorg_security_audit_log_fill_ratio gauge reports the occupancy of the in-memory security audit log as a fraction of its configured max_entries capacity. The log is a ring buffer: once full, the oldest entries are overwritten as new audit events land. A sustained value above 0.9 means the buffer is about to wrap; once it does, any overwritten entries that were never read or archived are permanently lost.
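
The wrap-around behaviour can be modelled in a few lines. This is a toy sketch under assumptions, not SynthOrg's audit log: the `AuditRing` class and its `overwritten` counter are hypothetical, and only `max_entries` and the fill-ratio definition come from the text above.

```python
from collections import deque


class AuditRing:
    """Toy ring buffer that reports occupancy as a fraction of capacity."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        # deque(maxlen=...) silently discards the oldest item when full.
        self._entries = deque(maxlen=max_entries)
        self.overwritten = 0  # entries lost to wrap-around (illustrative only)

    def append(self, event) -> None:
        if len(self._entries) == self.max_entries:
            self.overwritten += 1
        self._entries.append(event)

    @property
    def fill_ratio(self) -> float:
        return len(self._entries) / self.max_entries

    def oldest(self):
        return self._entries[0]


ring = AuditRing(max_entries=10)
for event_id in range(9):
    ring.append(event_id)
print(ring.fill_ratio)  # 0.9: at the alert threshold, nothing lost yet

for event_id in range(9, 12):
    ring.append(event_id)
print(ring.fill_ratio, ring.overwritten, ring.oldest())
```

Once the buffer is full the ratio stays pinned at 1.0 while entries keep being replaced, which is why alerting on the approach to capacity is more useful than alerting on 1.0 itself.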

Recommended alert rule:

- alert: SynthorgSecurityAuditLogNearCapacity
  expr: synthorg_security_audit_log_fill_ratio > 0.9
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Security audit log is {{ $value | humanizePercentage }} full"
    runbook: "increase max_entries, archive entries to long-term storage, or shorten retention"

Grafana panel definition (drop into the Audit & Security row of monitoring/grafana/synthorg-overview.json):

{
  "title": "Security audit log fill ratio",
  "type": "gauge",
  "datasource": "${DS_PROMETHEUS}",
  "fieldConfig": {
    "defaults": {
      "min": 0,
      "max": 1,
      "unit": "percentunit",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 0.75},
          {"color": "red", "value": 0.9}
        ]
      }
    }
  },
  "targets": [
    {"expr": "synthorg_security_audit_log_fill_ratio", "refId": "A"}
  ]
}
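
If you prefer to script the edit rather than paste the panel by hand, a sketch along these lines may help. It rests on assumptions: `add_panel_to_row` is a hypothetical helper, and it relies on the usual Grafana layout in which a collapsed row keeps its child panels in the row's own "panels" list. Check the actual file before relying on it; a real panel also needs an `id` and a `gridPos`, which are omitted here.

```python
import copy


def add_panel_to_row(dashboard: dict, row_title: str, panel: dict) -> dict:
    """Return a copy of `dashboard` with `panel` appended to the named row."""
    updated = copy.deepcopy(dashboard)
    for item in updated.get("panels", []):
        if item.get("type") == "row" and item.get("title") == row_title:
            # Assumed layout: collapsed rows nest their panels under "panels".
            item.setdefault("panels", []).append(copy.deepcopy(panel))
            return updated
    raise KeyError(f"row {row_title!r} not found")


# Minimal stand-in for the real dashboard file, for illustration only.
dashboard = {
    "title": "SynthOrg overview",
    "panels": [
        {"type": "row", "title": "Health & SLO", "collapsed": False, "panels": []},
        {"type": "row", "title": "Audit & Security", "collapsed": True, "panels": []},
    ],
}
fill_ratio_panel = {
    "title": "Security audit log fill ratio",
    "type": "gauge",
    "targets": [{"expr": "synthorg_security_audit_log_fill_ratio", "refId": "A"}],
}

patched = add_panel_to_row(dashboard, "Audit & Security", fill_ratio_panel)
print([p["title"] for p in patched["panels"][1]["panels"]])
```

With the real file, load it with `json.load`, apply the helper, and write the result back with `json.dump` before importing it.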

OTLP export health¶

# Export-failure rate per kind
sum by (kind) (rate(synthorg_otlp_export_batches_total{outcome="failure"}[5m]))

# Dropped records per kind (queue overflow or retries exhausted)
sum by (kind) (rate(synthorg_otlp_export_dropped_records_total[5m]))

Client transport health¶

# Disconnect rate by transport (alert on transport_error spikes)
sum by (transport, reason) (rate(synthorg_client_disconnects_total[5m]))

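# Transport-error rate as a fraction of all disconnects. rate() is per-second, so the
# clamp_min floor is kept tiny to avoid understating the ratio at low disconnect volume.
sum(rate(synthorg_client_disconnects_total{reason="transport_error"}[5m]))
  / clamp_min(sum(rate(synthorg_client_disconnects_total[5m])), 1e-9)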

Grafana dashboard¶

Import monitoring/grafana/synthorg-overview.json into any Grafana v10+ instance. The file is Grafana v10-compatible dashboard JSON (authored against the v11 editor, which emits a schema readable by v10). It uses a single ${DS_PROMETHEUS} template variable, which you bind to your Prometheus data source, plus four filter variables: $agent_id (sourced from the agent_id label of synthorg_agent_cost_total and used by the Cost & Budget and Audit & Security panels), $agent (sourced from the agent label of synthorg_tasks_total and used by the per-agent panel in the Tasks row), $workflow_definition_id, and $department. The two agent-named variables exist because synthorg_tasks_total and synthorg_agent_cost_total use different label names (agent vs agent_id); each panel filters on whichever variable matches its underlying metric.

The dashboard organises 30+ panels into seven collapsible rows. Only Health & SLO is expanded by default, which keeps the unfiltered view scannable; expand the other rows as needed.

Row	Default	Panels
`Health & SLO`	expanded	Coordination efficiency, coordination overhead, budget utilisation, active agents, escalation queue depth
`Tasks`	collapsed	Task completion rate, task duration p50/p95, tasks-by-status, task-runs-by-outcome, tasks per agent
`Workflows`	collapsed	Workflow duration p50/p95, workflow execution rate by status, top-N workflow definitions
`Tools & Providers`	collapsed	Tool invocation rate, tool duration p95 by `tool_name`, provider tokens, provider cost, provider errors by class
`Cost & Budget`	collapsed	`synthorg_cost_total`, monthly cost, daily used %, top-25 per-agent cost, agent budget used %
`Audit & Security`	collapsed	Audit chain append rate, depth, last-append age, audit-log fill-ratio gauge, security verdicts, agent identity version changes, API error categories
`Client Health`	collapsed	Client disconnects by transport+reason, API request rate by status class, OTLP export batches, OTLP dropped records, cache hit rate, app info

To install via the Grafana UI: Dashboards → New → Import → Upload JSON file. Via the HTTP API: POST /api/dashboards/db with {"dashboard": <file>, "overwrite": true, "inputs": [...]}.

Alerts¶

The dashboard file does not ship alert rules because thresholds are deployment-specific. The suggested PromQL above can be dropped into a Prometheus rules file such as rules.yml; pair each query with a severity label (warning or critical) and a for: duration. Example:

groups:
  - name: synthorg
    rules:
      - alert: SynthorgCoordinationOverheadHigh
        expr: avg_over_time(synthorg_coordination_overhead_percent[10m]) > 40
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Coordination overhead is {{ $value }}%"
          runbook: "https://synthorg.io/docs/runbooks/coordination-overhead"

Logfire¶

Logfire's Prometheus integration can scrape the same /metrics endpoint directly; no additional wiring is required on the SynthOrg side. Follow the Logfire documentation for the Prometheus setup and point it at http://synthorg:8000/metrics. All metrics documented above will appear under the same names in Logfire dashboards.
