Troubleshooting

Run the doctor first

Before anything else, run the framework-side doctor as the same user the service runs under. It probes everything below in order and reports the first specific failure.

Stack	Command
Django	`python manage.py z4j_doctor`
Flask	`python -m z4j_flask doctor`
FastAPI	`python -m z4j_fastapi doctor`
Bare Python	`python -m z4j_bare doctor`

Add --no-websocket to skip the WS round-trip when z4j is intentionally offline. Add --json for scripts. Exits 0 on all-green, 1 on any failure.
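
For scripted checks, the exit code alone is enough to gate a deploy step. The following is a minimal sketch, assuming the Bare Python command from the table above. `run_doctor` is a hypothetical helper name, and because the shape of the --json output is not described here, the sketch returns it unparsed.

```python
import subprocess
import sys


def run_doctor(command):
    """Run a doctor command and return (passed, stdout).

    `command` is an argv list, for example the Bare Python form from the
    table with --json appended. Exit code 0 means every check was green.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout


# Example invocation (not executed here):
# passed, report = run_doctor([sys.executable, "-m", "z4j_bare", "doctor", "--json"])
```

Run it as the same user the service runs under, otherwise the result may not reflect what the service sees.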

Agent shows offline

Run the doctor - it surfaces the most common failures with a specific reason. The list below covers the rest.
Check the agent’s app logs for [z4j] lines at boot. Look for handshake errors.
Verify Z4J_BRAIN_URL uses wss:// in production, ws:// in local dev.
Verify the token is the one shown at mint time (tokens are not recoverable - re-mint if lost).
Egress firewall: the agent’s host must reach z4j on TCP 443 (or wherever your proxy is).
Proxy WebSocket passthrough: nginx-ingress needs proxy_set_header Upgrade and proxy_set_header Connection.
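
The URL-scheme and egress checks above can be approximated by hand. This is a rough sketch, not the doctor's actual implementation: `check_brain_url` and `can_reach` are hypothetical helpers, and the default ports (443 for wss://, 80 for ws://) are an assumption when the URL does not name one.

```python
import socket
from urllib.parse import urlsplit


def check_brain_url(url, production=True):
    """Return a list of problems found in a Z4J_BRAIN_URL value."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("ws", "wss"):
        problems.append(f"scheme {parts.scheme!r} is not a WebSocket scheme")
    elif production and parts.scheme != "wss":
        problems.append("production should use wss://")
    if not parts.hostname:
        problems.append("URL has no host")
    return problems


def can_reach(url, timeout=3.0):
    """Attempt a plain TCP connect to the URL's host and port (egress check only)."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "wss" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful TCP connect only rules out the firewall. It says nothing about whether the proxy forwards the WebSocket upgrade headers.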

Agent silently fails to start under gunicorn / uvicorn

Symptom: the service log shows PermissionError: ... /var/www/.z4j (or any other path that is not the running user’s writable home). Cause: the service user has an unwritable $HOME. The agent auto-relocates the buffer to $TMPDIR/z4j-{uid}/buffer-{pid}.sqlite and logs a single WARNING instead of crashing. See service-user deployments.
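
To confirm the diagnosis, check whether the service user's home is writable and where the relocated buffer would land. The sketch below only mirrors the path pattern described above; it is not the agent's own code, both function names are hypothetical, and it is POSIX-only because it uses `os.getuid()`.

```python
import os
import tempfile
from pathlib import Path


def home_is_writable():
    """True if the current user's home directory exists and is writable."""
    home = Path(os.path.expanduser("~"))
    return home.is_dir() and os.access(home, os.W_OK)


def fallback_buffer_path():
    """Build the relocation path pattern: $TMPDIR/z4j-{uid}/buffer-{pid}.sqlite."""
    uid = os.getuid()  # POSIX only
    # tempfile.gettempdir() honours $TMPDIR when it is set.
    return Path(tempfile.gettempdir()) / f"z4j-{uid}" / f"buffer-{os.getpid()}.sqlite"
```

Run it as the service user (for example through the same process manager) so that `~` and the uid match the running service.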

Events don’t appear

Agent is online (Agents page shows online)?
Engine is auto-detected (agent drawer → engines list)?
For Django: INSTALLED_APPS includes z4j_django after any Celery apps?
For Flask: z4j.init_app(app, ...) was called on the app factory?
For FastAPI: agent is inside the lifespan context manager?
Task names in registry (agent drawer → registry)?
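
The Django ordering check in the list above can be automated in a settings test. A minimal sketch, assuming the Celery-related apps are `django_celery_beat` and `django_celery_results` (adjust the tuple for your project) and that apps are listed by plain module name rather than by AppConfig path. `z4j_after_celery` is a hypothetical helper.

```python
# Assumption: these are the Celery-related Django apps in the project.
CELERY_APPS = ("django_celery_beat", "django_celery_results")


def z4j_after_celery(installed_apps, celery_apps=CELERY_APPS):
    """True if z4j_django is installed and listed after every Celery app present."""
    apps = list(installed_apps)
    if "z4j_django" not in apps:
        return False
    z4j_index = apps.index("z4j_django")
    return all(apps.index(name) < z4j_index for name in celery_apps if name in apps)
```

In a Django test, call it as `z4j_after_celery(settings.INSTALLED_APPS)`.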

You’ve upgraded z4j to a newer major version than the agent. Re-deploy agents with pip install -U z4j-*. Agents up to one major version behind still work but may lack new features.

Audit chain `verify` fails

Someone modified the audit_log table directly, or a backup restore is incomplete.

Identify first_broken_id.
If known-intentional (e.g., planned DB surgery), document the break externally. The chain from there is not recoverable.
If unexpected, treat as a compromise event - preserve the DB, alert security, investigate.
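
To see why a break cannot be repaired from that point on, consider a generic hash chain. The sketch below is illustrative only: the row layout, the SHA-256 choice, and all function names are assumptions, and z4j's actual audit_log format may differ. Each row's hash covers the previous row's hash, so editing or deleting one row invalidates it and every row after it.

```python
import hashlib


def row_hash(prev_hash, payload):
    """Hash a row's payload together with the previous row's hash."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Build a list of chained rows with ids starting at 1."""
    rows, prev = [], ""
    for row_id, payload in enumerate(payloads, start=1):
        digest = row_hash(prev, payload)
        rows.append({"id": row_id, "payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return rows


def first_broken_id(rows):
    """Return the id of the first row that fails verification, or None."""
    prev = ""
    for row in rows:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return row["id"]
        prev = row["hash"]
    return None
```

In this model, a direct UPDATE on a row changes its payload without changing its stored hash, and a partial restore drops rows so that the next surviving row points at a missing predecessor. Both surface as a first broken id.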

`409 conflict_duplicate_name` on mint-token

Another agent already has that (project_id, name). Pick a different name or delete the old agent first.

Schedules showing `read_only`

The scheduler backend doesn’t support writes (e.g. celery-beat with PersistentScheduler, rq-scheduler currently). See schedulers overview.

Password reset email not arriving

Email channel configured? Use the channel test endpoint (POST /api/v1/projects/{slug}/notifications/channels/{channel_id}/test) or the Test button on the dashboard’s Notifications → Channels page.
From: domain has SPF / DKIM / DMARC?
Sender reputation good? Gmail drops many SMTP senders silently.
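
The channel test endpoint can be called from a script. The sketch below builds the request for the path given above; the base URL, the bearer-token Authorization header, and the helper name are assumptions, so substitute whatever authentication your z4j deployment actually uses.

```python
import urllib.request


def build_channel_test_request(base_url, slug, channel_id, token):
    """Build a POST request for the notification channel test endpoint."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/projects/{slug}"
        f"/notifications/channels/{channel_id}/test"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


# To send it (not executed here):
# with urllib.request.urlopen(build_channel_test_request(...)) as response:
#     print(response.status)
```

If the test succeeds but mail still does not arrive, the problem is downstream of z4j, which points at the SPF / DKIM / DMARC and sender reputation checks.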

Very high CPU

Check rate(z4j_events_persisted_total[1m]) - is one agent emitting millions of events?
Are redaction patterns looping on giant payloads? The redactor has a 2 MiB payload cap.
Hot endpoint - check z4j_http_request_duration_seconds.
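
These metrics can be queried through the Prometheus HTTP API. A small sketch, assuming a Prometheus server that scrapes z4j; the base URL and helper name are placeholders, and the latency query assumes z4j_http_request_duration_seconds is exported as a histogram with `_bucket` series.

```python
from urllib.parse import urlencode


def prometheus_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Top five series by event rate; which labels identify an agent depends on
# how the metric is exported.
TOP_EMITTERS = "topk(5, rate(z4j_events_persisted_total[1m]))"

# 95th percentile request latency (assumes a histogram metric).
P95_LATENCY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(z4j_http_request_duration_seconds_bucket[5m])))"
)
```

Fetch the resulting URL with any HTTP client and read the `data.result` field of the JSON response.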

`503 agent_offline` on a retry

The target agent dropped its connection between the UI showing the action and z4j dispatching it. Wait for the agent to reconnect, or pick a different agent that handles the same engine.
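
When retries are triggered from a script, waiting for reconnect can be expressed as a bounded backoff loop. This is a generic sketch: `dispatch` is a hypothetical callable that performs the retry request and returns the HTTP status code, and the attempt count and delays are arbitrary defaults.

```python
import time


def retry_while_offline(dispatch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call dispatch() until it stops returning 503, doubling the delay each time.

    Returns the last status code seen.
    """
    status = None
    for attempt in range(attempts):
        status = dispatch()
        if status != 503:
            return status
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

If the loop ends with a 503, the agent has not reconnected within the window, and the steps under "Agent shows offline" apply.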

When stuck

File an issue at github.com/z4jdev/z4j/issues. Include z4j logs (with X-Request-Id), agent logs, and a description of the sequence of events that led to the problem.