Grafana

z4j ships five Grafana dashboards in deploy/grafana/. They are plain JSON files, schema-compatible with Grafana 10.4+, and they work against any Prometheus that is scraping the brain’s /metrics endpoint.

| File | Title | Purpose |
| --- | --- | --- |
| z4j-overview.json | z4j — Brain overview | First stop. Stat panels for agents online, brain RSS, DB pool utilisation, deadlock rate. Task throughput by final state. Task duration p50 / p95 / p99. Queue depth by project. Background-task error flags. Swallowed exceptions by module. |
| z4j-tasks.json | z4j — Tasks | Tail latency or failure rate climbing? Open this. Stacked task throughput by state, failure-rate ratio, full duration heatmap, top-10 failing / slow (p99) / by-volume / retried task names. |
| z4j-agents.json | z4j — Agents and commands | Agent + worker counts per project. Command dispatch flow by status and action. Late-result counter (tuning signal for Z4J_COMMAND_TIMEOUT_SECONDS). Live WebSocket connection count. In-memory state by subsystem. |
| z4j-notifications.json | z4j — Notifications | Send rate by channel type and status. Failure-rate table (per channel, red rows are channels currently failing). Cooldown skip rate per trigger. 24h channel mix donut. Blocked-by-SSRF / host-lock rate. |
| z4j-scheduler.json | z4j-scheduler | Only relevant when the scheduler companion is deployed. Leader status, fire throughput by terminal state, fire latency p50/p99 against the §23 SLI budget, tick drift, per-schedule top-N. |

Each brain dashboard exposes a $project template variable (multi-select, all by default) so multi-tenant deployments can scope panels to a single project without editing JSON.

  1. In Grafana, open Dashboards, then New, then Import.
  2. Upload the JSON.
  3. Pick your Prometheus datasource.
  4. Save.

Mount deploy/grafana/ into the Grafana container and provision a file-based dashboard provider. A complete example sits in deploy/grafana/README.md. Grafana picks up changes within updateIntervalSeconds of a JSON edit, so the dashboards become a normal infra-as-code artefact.
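As a rough illustration of what a file-based provider looks like (this is a minimal sketch, not the example from deploy/grafana/README.md), the following Grafana provisioning file assumes the dashboards are mounted at /var/lib/grafana/dashboards/z4j. The provider name, folder name, mount path, and 30-second interval are all assumptions chosen for the example.

```yaml
# Sketch of a Grafana dashboard provider, e.g. saved as
# /etc/grafana/provisioning/dashboards/z4j.yaml inside the container.
apiVersion: 1
providers:
  - name: z4j                      # assumed provider name
    orgId: 1
    folder: z4j                    # assumed Grafana folder for the dashboards
    type: file
    disableDeletion: false
    allowUiUpdates: false          # keep the JSON files as the source of truth
    updateIntervalSeconds: 30      # assumed; how often Grafana rescans the path
    options:
      # Assumed mount point for the contents of deploy/grafana/
      path: /var/lib/grafana/dashboards/z4j
```

With allowUiUpdates set to false, edits made in the Grafana UI cannot be saved back over the provisioned dashboards, which keeps the repository copy authoritative.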

For the Kubernetes Grafana Helm chart, drop the JSONs into a ConfigMap and point dashboardsConfigMaps at it.
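A possible values fragment for the Grafana Helm chart is sketched below. It assumes a ConfigMap named z4j-dashboards already exists in the release namespace and contains the dashboard JSON files; that ConfigMap name and the provider key z4j are hypothetical. In the chart, each key under dashboardsConfigMaps is expected to match a provider whose path is /var/lib/grafana/dashboards/<key>.

```yaml
# values.yaml fragment for the Grafana Helm chart (sketch under the
# assumptions stated above).
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: z4j                  # must match the key under dashboardsConfigMaps
        orgId: 1
        folder: z4j
        type: file
        disableDeletion: false
        editable: false
        options:
          path: /var/lib/grafana/dashboards/z4j

dashboardsConfigMaps:
  z4j: "z4j-dashboards"            # hypothetical ConfigMap holding the JSON files
```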

Brain (default port 7700, fail-secure metrics auth):

scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    authorization:
      type: Bearer
      credentials: "<Z4J_METRICS_AUTH_TOKEN>"

Z4J_METRICS_AUTH_TOKEN is auto-minted on first boot and persisted to $Z4J_HOME/secret.env. On trusted-LAN deployments you can flip Z4J_METRICS_PUBLIC=true and drop the authorization block; the brain logs a loud WARNING at startup naming the risk.
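If you would rather not inline the token in prometheus.yml, Prometheus also accepts credentials_file in place of credentials (the two are mutually exclusive). The sketch below assumes you have written the token value, and nothing else, to a file readable by Prometheus. The path shown is hypothetical, and it is not $Z4J_HOME/secret.env itself, whose format is not described here.

```yaml
scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    authorization:
      type: Bearer
      # Hypothetical file containing only the value of Z4J_METRICS_AUTH_TOKEN
      credentials_file: /etc/prometheus/secrets/z4j-metrics-token
```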

Scheduler companion (only if z4j-scheduler is deployed):

  - job_name: z4j-scheduler
    metrics_path: /metrics
    static_configs:
      - targets: ["scheduler.internal:9100"]

Shipping five small dashboards instead of one mega-dashboard is deliberate:

  • Each dashboard fits a single screen at 1080p without horizontal scroll.
  • The split matches the operator’s natural drill path: overview, then per-area (tasks / agents / notifications), then per-row (the top-N panels link the operator to the exact task name or channel that is misbehaving).
  • You can permission them independently in Grafana folders; the on-call team often does not need write access to the scheduler dashboard, for example.

A full table sits in deploy/grafana/README.md with starter thresholds for:

  • zero agents online
  • DB pool saturated
  • sustained Postgres deadlocks
  • a self-watch background task failing (audit retention, WAL checkpoint, etc.)
  • task failure rate above 5%
  • late command results (timeout sweeper mistuned)
  • notification channel failing above 10%
  • swallowed-exception spike by module
  • brain not reporting metrics at all (catch-all liveness)

The thresholds are starting points; tune to your fleet’s normal baseline.
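As one illustration, the catch-all liveness case can be expressed using only Prometheus's built-in up series and the z4j-brain job name from the scrape config. This is a sketch rather than the rule from deploy/grafana/README.md; the alert name, group name, 5m window, and severity label are assumptions.

```yaml
# Prometheus alerting rule file (sketch). Load it via rule_files in prometheus.yml.
groups:
  - name: z4j-liveness             # assumed group name
    rules:
      - alert: Z4jBrainMetricsMissing   # assumed alert name
        # Fires when the scrape fails (up == 0) or when the target has
        # disappeared from the scrape config entirely (absent).
        expr: up{job="z4j-brain"} == 0 or absent(up{job="z4j-brain"})
        for: 5m                    # assumed; tune to your tolerance for flapping
        labels:
          severity: page           # assumed label convention
        annotations:
          summary: "z4j brain is not reporting metrics"
```

Because Prometheus records up as 0 for any failed scrape, this rule also fires when the bearer token is wrong or missing, not only when the brain process is down.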

A historical leak-investigation snapshot lives at docs/perf/grafana-dashboard.json (four panels covering RSS slope, deadlock rate, DB pool utilisation, and events context). It is the dashboard we used to validate the 1.5.1 connection-pool leak fix and is kept in the repo for reproducibility. The new deploy/grafana/z4j-overview.json covers the same signals plus much more; new deployments should use the deploy/grafana/ set.