# Incident response
A control plane needs a clear “what do I do” for the outages you can’t prevent. This page covers the four most likely incidents operators will face.
## Brain is down

Symptoms: dashboard unreachable, agents reporting disconnected, `z4j check` fails.
### Step 1: is the process running?

```sh
# systemd
sudo systemctl status z4j
sudo journalctl -u z4j -n 100 --no-pager

# Docker
docker compose ps
docker compose logs --tail=100 z4j
```

Process crashed → look for a Python traceback in the last 100 log lines. Common causes:
- DB unreachable (Postgres) - network, credentials, or Postgres itself down. Fix Postgres first, then `systemctl restart z4j`.
- Migration failure - alembic error on boot. `Z4J_AUTO_MIGRATE=false z4j serve` to start without migrating, then inspect and run `z4j migrate history` / `z4j migrate current`.
- Disk full (SQLite) - `df -h` shows 100% on the z4j data volume. Free space or rotate backups. SQLite has a minimum free-space requirement for WAL commits.
- Port conflict - something else grabbed 7700. `ss -tlnp | grep 7700` to identify it.
### Step 2: is it accessible?

Brain process is running but the dashboard returns 400 / 502 / connection-refused.
- 400 `invalid_host` - see allowed hosts. `z4j allowed-hosts add <name>` and restart.
- 502 bad gateway - reverse proxy can’t reach z4j. Check the reverse proxy logs; verify z4j is bound to the port the proxy expects (`Z4J_BIND_HOST` / `Z4J_BIND_PORT`).
- connection-refused - firewall / security group blocking inbound. Verify with `telnet <brain-host> 7700` from the client side.
### Step 3: nothing obvious, restore

If z4j is broken beyond `systemctl restart`, restore from the most recent backup:

```sh
# Locate the latest
ls -lt /var/backups/ | head -5

# Restore (brain must be stopped)
sudo systemctl stop z4j
sudo -u z4j /srv/venv/bin/z4j restore /var/backups/z4j-2026-04-24.db --force
sudo systemctl start z4j
sudo -u z4j /srv/venv/bin/z4j check
sudo -u z4j /srv/venv/bin/z4j status
```

See backup and restore.
## Audit chain tampered

The audit log is HMAC-chained (`row_hmac` + `prev_row_hmac`). Any row edit or delete breaks the chain. `z4j audit verify` walks the chain and reports the first row that fails.
### Detection

Run periodically (daily cron, or after any suspicious event):

```sh
z4j audit verify
```

Output on a healthy chain:

```
z4j audit verify: OK (walked 5421 rows, chain intact)
```

Output on a tampered chain:

```
z4j audit verify: FAIL at row 1234 (audit_id=4f2b...)
  expected prev_hmac: a1b2c3...
  got prev_hmac:      00000000...
  first-broken-row timestamp: 2026-04-23 14:22:05 UTC
```

### Response playbook

- Preserve evidence. Snapshot the DB, and keep both `~/.z4j/secret.env` and the `Z4J_SECRET` env var in use when the break happened. `cp ~/.z4j/z4j.db z4j-evidence-$(date +%s).db` - don’t run `VACUUM INTO` here, it rewrites the file.
- Check server access logs (`journalctl -u z4j`, reverse-proxy logs, cloud provider flow logs). The audit chain broke at row N; correlate that row’s timestamp with access patterns.
- Check for unauthorised DB writes. For Postgres: `pg_waldump` + role audit. For SQLite: the file was modified outside z4j.
- Rotate `Z4J_SECRET` if you believe the master HMAC secret was exposed (see the sketch after this list). This will:
  - Invalidate every existing agent token (all agents need new credentials)
  - Make the existing audit chain un-verifiable against the new secret (the old chain still verifies against the compromised secret)
  - Invalidate all user sessions (everyone logs in again)
- Restore from a backup taken before the tamper timestamp if the current DB is compromised.
- File an incident report - what was touched, over what period, what data was exfiltrated (the audit chain rows will tell you).
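
Why a single rotation invalidates all three at once: if agent tokens, session cookies, and the audit chain are all HMACs keyed by `Z4J_SECRET` - an assumption consistent with the list above, not documented z4j internals - then nothing minted under the old key verifies under the new one. A toy sketch:

```python
import hashlib
import hmac

def verifies(secret: bytes, value: bytes, tag: str) -> bool:
    # Recompute the HMAC under `secret` and compare in constant time.
    expected = hmac.new(secret, value, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

old_secret, new_secret = b"old-master-secret", b"new-master-secret"

# A tag minted under the old secret (stand-in for an agent token or session).
tag = hmac.new(old_secret, b"agent-token-payload", hashlib.sha256).hexdigest()

assert verifies(old_secret, b"agent-token-payload", tag)      # valid before rotation
assert not verifies(new_secret, b"agent-token-payload", tag)  # dead after rotation
```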
### Why HMAC chaining matters

The audit log is z4j’s tamper-evident record. A malicious operator (or a compromised operator account) that modifies the audit log can’t hide it - the chain fails verification, and the first-broken-row timestamp tells you when. Unless `z4j audit verify` runs periodically, tampering can go undetected for months.
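
For intuition, here is a minimal sketch of what a verifier like `z4j audit verify` has to do. The field names `row_hmac` and `prev_row_hmac` come from this page; the exact payload bytes fed into each HMAC and the genesis value are assumptions, not z4j’s actual implementation.

```python
import hashlib
import hmac

def row_mac(secret: bytes, payload: bytes, prev_row_hmac: str) -> str:
    # Each row's HMAC covers its own payload *and* the previous row's HMAC,
    # so editing or deleting any historical row breaks every link after it.
    return hmac.new(secret, prev_row_hmac.encode() + payload, hashlib.sha256).hexdigest()

def first_broken_row(secret: bytes, rows) -> int | None:
    """Walk rows oldest-first; return the index of the first broken row, or None.

    `rows` yields (payload: bytes, prev_row_hmac: str, row_hmac: str) tuples.
    """
    expected_prev = ""  # assumed genesis value; z4j's may differ
    for i, (payload, prev_row_hmac, row_hmac) in enumerate(rows):
        if prev_row_hmac != expected_prev:
            return i  # linkage broken: a row was edited, inserted, or removed
        if not hmac.compare_digest(row_mac(secret, payload, prev_row_hmac), row_hmac):
            return i  # payload no longer matches its recorded HMAC
        expected_prev = row_hmac
    return None  # chain intact
```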
## Leaked agent token

An agent bearer token (the plaintext value the mint dialog shows exactly once) was pasted into a public Slack channel, committed to a public git repo, or exposed in a build log.
### Immediate response

- Revoke the token in the dashboard: `/projects/<slug>/agents` → click the agent → revoke.
- Mint a replacement for the legitimate agent (same agent row, new token + new `hmac_secret`).
- Update the legitimate agent’s config with the new credentials. Restart the agent process.

Revoking the token removes its row from the agents table; any WebSocket session using that token is dropped on the next heartbeat (within ~30s) and subsequent connection attempts are rejected at the gateway.
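
A sketch of what that heartbeat-driven drop can look like server-side. This is not z4j’s actual gateway code - `agent_exists`, the close code, and the ping payload are all illustrative:

```python
import asyncio

HEARTBEAT_INTERVAL = 30  # seconds; matches the ~30s drop window described above

async def heartbeat_loop(ws, token_id: str, agent_exists) -> None:
    """Drop the live session as soon as the token's agents-table row is gone.

    `ws` is any object with async send/close; `agent_exists` is an async
    lookup against the agents table. Both names are assumptions.
    """
    while True:
        await asyncio.sleep(HEARTBEAT_INTERVAL)
        if not await agent_exists(token_id):
            # Revocation deleted the row; close the session instead of pinging.
            await ws.close(code=4401, reason="token revoked")
            return
        await ws.send("ping")
```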
### Damage assessment

Inspect the audit log for the window the token was exposed:

```sql
-- SQL query, against z4j's Postgres / SQLite
SELECT occurred_at, action, target_type, source_ip
FROM audit_log
WHERE actor_type = 'agent'
  AND actor_id = '<revoked-agent-uuid>'
  AND occurred_at > '2026-04-23 12:00:00'
ORDER BY occurred_at;
```

An agent token grants:
- Submit/cancel/retry tasks on that project
- Update schedules (if the agent’s engine supports it)
- Read task history + engine state for that project
An agent token does NOT grant:
- Read other projects’ data
- Mint more agents (that’s a user operation)
- Edit users / memberships / projects
- Read the audit log
So the blast radius of a leaked token is “anything you can do in that one project”.
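
One way to picture that scoping rule: every agent-token check pins the request to the token’s own project, and user-level operations never appear in the agent’s action set. A toy sketch - the action names are invented for illustration, not z4j’s real permission model:

```python
# Actions an agent token can perform, per the two lists above.
AGENT_ACTIONS = {
    "task.submit", "task.cancel", "task.retry",  # submit/cancel/retry tasks
    "schedule.update",                           # update schedules
    "task.read", "engine.read",                  # read task history + engine state
}

def authorize_agent(token_project: str, action: str, target_project: str) -> bool:
    # Project-pinned: the request must target the token's own project, and
    # user-level operations (minting agents, editing memberships, reading
    # the audit log) are simply absent from the agent action set.
    return token_project == target_project and action in AGENT_ACTIONS
```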
## Lost last admin

The first-boot setup token is single-use. Once it has been used, the admin user it created is the only way back in. If you lose that user (forgot password, email no longer works, removed by mistake), you need to recover from the CLI.
### Password forgotten

Type or pipe the new password. This works without an admin login - it’s an operator-side recovery tool. The command does two things (sketched below):

- Hashes the new password with Argon2id
- Bumps `password_changed_at`, invalidating every existing session for that user
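
A minimal sketch of those two effects, using the `argon2-cffi` package (whose default mode is Argon2id); z4j’s actual schema and helper names will differ:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

from argon2 import PasswordHasher  # argon2-cffi; defaults to Argon2id

ph = PasswordHasher()

@dataclass
class User:  # stand-in for z4j's real user row
    password_hash: str
    password_changed_at: datetime

def reset_password(user: User, new_password: str) -> None:
    # 1. Store only the Argon2id hash, never the plaintext.
    user.password_hash = ph.hash(new_password)
    # 2. Bumping password_changed_at cuts off every session issued before it.
    user.password_changed_at = datetime.now(timezone.utc)

def session_is_valid(session_created_at: datetime, user: User) -> bool:
    # Sessions minted before the password change are rejected.
    return session_created_at >= user.password_changed_at
```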
### Last admin deleted

If no admin users exist at all (somebody deleted the only owner from Settings → Memberships):

```sh
sudo -u z4j /srv/venv/bin/z4j createsuperuser --email [email protected] --display-name "You" --password-stdin
```

This creates a fresh owner user on the existing install. Existing projects, agents, task history, and the audit chain are untouched.
### Lost email access

If your admin email no longer works and you need to change the address:

- Use `z4j createsuperuser` to create a fresh owner with the new email.
- Log in as the new owner, open the old user’s row in Settings → Admin → Users, and delete it.

There is no “rename email” in the CLI - intentionally. A user’s email is their identity for audit purposes; renaming would mask the trail.
## What to practice

- Once a quarter: do a restore drill. Take a backup, provision a scratch VM, restore the backup there, and verify that `z4j check && z4j status && z4j audit verify` passes (a sketch of the verification step follows this list). This catches backup-process regressions before you need them.
- Monthly: run `z4j audit verify` on the live install. Relying on the on-boot check alone is too late - the break could have happened weeks ago.
- Before any upgrade: take a backup. Every time. See upgrade and rollback.
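
If you want to script the pass/fail step of the drill, here is a minimal sketch. It assumes the restore has already run on the scratch VM and the same install path as the examples above:

```python
import subprocess
import sys

# The three commands the drill must see pass, per the checklist above.
CHECKS = [
    ["/srv/venv/bin/z4j", "check"],
    ["/srv/venv/bin/z4j", "status"],
    ["/srv/venv/bin/z4j", "audit", "verify"],
]

def run_drill() -> None:
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(f"$ {' '.join(cmd)}")
        print(result.stdout, end="")
        if result.returncode != 0:
            # Fail loudly: a broken restore path should surface in the drill,
            # not during a real incident.
            sys.exit(f"restore drill FAILED at: {' '.join(cmd)}\n{result.stderr}")
    print("restore drill passed")

if __name__ == "__main__":
    run_drill()
```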