# Incident response
A control plane needs a clear “what do I do” for the outages you can’t prevent. This page covers the four most likely incidents operators will face.
## Brain is down

Symptoms: dashboard unreachable, agents reporting disconnected, `z4j check` fails.
### Step 1: is the process running?

```sh
# systemd
sudo systemctl status z4j
sudo journalctl -u z4j -n 100 --no-pager

# Docker
docker compose ps
docker compose logs --tail=100 z4j
```

Process crashed → look for a Python traceback in the last 100 log lines. Common causes:
- DB unreachable (Postgres) - network, credentials, or Postgres itself down. Fix Postgres first, then `systemctl restart z4j`.
- Migration failure - alembic error on boot. `Z4J_AUTO_MIGRATE=false z4j serve` to start without migrating, then inspect and run `z4j migrate history` / `z4j migrate current`.
- Disk full (SQLite) - `df -h` shows 100% on the z4j data volume. Free space or rotate backups. SQLite has a minimum free-space requirement for WAL commits.
- Port conflict - something else grabbed 7700. `ss -tlnp | grep 7700` to identify it.
### Step 2: is it accessible?

Brain process is running but the dashboard returns 400 / 502 / connection-refused.
- 400 `invalid_host` - see allowed hosts. `z4j allowed-hosts add <name>` and restart.
- 502 bad gateway - reverse proxy can’t reach z4j. Check the reverse proxy logs; verify z4j is bound to the port the proxy expects (`Z4J_BIND_HOST` / `Z4J_BIND_PORT`).
- connection-refused - firewall / security group blocking inbound. Verify with `telnet <brain-host> 7700` from the client side.
### Step 3: nothing obvious, restore

If z4j is broken beyond `systemctl restart`, restore from the most recent backup:

```sh
# Locate the latest
ls -lt /var/backups/ | head -5

# Restore (brain must be stopped)
sudo systemctl stop z4j
sudo -u z4j /srv/venv/bin/z4j restore /var/backups/z4j-2026-04-24.db --force
sudo systemctl start z4j
sudo -u z4j /srv/venv/bin/z4j check
sudo -u z4j /srv/venv/bin/z4j status
```

See backup and restore.
## Audit chain tampered

The audit log is HMAC-chained (`row_hmac` + `prev_row_hmac`). Any row edit or delete breaks the chain. `z4j audit verify` walks the chain and reports the first row that fails.
### Detection

Run periodically (daily cron, or after any suspicious event):

```sh
z4j audit verify
```

Output on a healthy chain:

```
z4j audit verify: OK (walked 5421 rows, chain intact)
```

Output on a tampered chain:

```
z4j audit verify: FAIL at row 1234 (audit_id=4f2b...)
  expected prev_hmac: a1b2c3...
  got prev_hmac:      00000000...
  first-broken-row timestamp: 2026-04-23 14:22:05 UTC
```

### Response playbook

- Preserve evidence. Snapshot the DB, and keep both `~/.z4j/secret.env` and the `Z4J_SECRET` env var in use when the break happened. `cp ~/.z4j/z4j.db z4j-evidence-$(date +%s).db` - don’t run `VACUUM INTO` here, it rewrites the file.
- Check server access logs (`journalctl -u z4j`, reverse-proxy logs, cloud provider flow logs). The audit chain broke at row N; correlate that row’s timestamp with access patterns.
- Check for unauthorised DB writes. For Postgres: `pg_waldump` + role audit. For SQLite: the file was modified outside z4j.
- Rotate `Z4J_SECRET` if you believe the master HMAC secret was exposed (see the sketch after this list). This will:
  - Invalidate every existing agent token (all agents need new credentials)
  - Make the existing audit chain un-verifiable against the new secret (the old chain still verifies against the compromised secret)
  - Invalidate all user sessions (everyone logs in again)
- Restore from a backup taken before the tamper timestamp if the current DB is compromised.
- File an incident report - what was touched, over what period, what data was exfiltrated (the audit chain rows will tell you).
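
Why a single rotation invalidates all three at once: if agent tokens, session cookies, and the audit chain are all HMACs keyed by `Z4J_SECRET` - an assumption consistent with the list above, not documented z4j internals - then nothing minted under the old key verifies under the new one. A toy sketch:

```python
import hashlib
import hmac

def verifies(secret: bytes, value: bytes, tag: str) -> bool:
    # Recompute the HMAC under `secret` and compare in constant time.
    expected = hmac.new(secret, value, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

old_secret, new_secret = b"old-master-secret", b"new-master-secret"

# A tag minted under the old secret (stand-in for an agent token or session).
tag = hmac.new(old_secret, b"agent-token-payload", hashlib.sha256).hexdigest()

assert verifies(old_secret, b"agent-token-payload", tag)      # valid before rotation
assert not verifies(new_secret, b"agent-token-payload", tag)  # dead after rotation
```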
### Why HMAC chaining matters

The audit log is z4j’s tamper-evident record. A malicious operator (or a compromised operator account) that modifies the audit log can’t hide it - the chain fails verification, and the first-broken-row timestamp tells you when. Unless `z4j audit verify` runs periodically, tampering can go undetected for months.
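
For intuition, here is a minimal sketch of what a verifier like `z4j audit verify` has to do. The field names `row_hmac` and `prev_row_hmac` come from this page; the exact payload bytes fed into each HMAC and the genesis value are assumptions, not z4j’s actual implementation.

```python
import hashlib
import hmac

def row_mac(secret: bytes, payload: bytes, prev_row_hmac: str) -> str:
    # Each row's HMAC covers its own payload *and* the previous row's HMAC,
    # so editing or deleting any historical row breaks every link after it.
    return hmac.new(secret, prev_row_hmac.encode() + payload, hashlib.sha256).hexdigest()

def first_broken_row(secret: bytes, rows) -> int | None:
    """Walk rows oldest-first; return the index of the first broken row, or None.

    `rows` yields (payload: bytes, prev_row_hmac: str, row_hmac: str) tuples.
    """
    expected_prev = ""  # assumed genesis value; z4j's may differ
    for i, (payload, prev_row_hmac, row_hmac) in enumerate(rows):
        if prev_row_hmac != expected_prev:
            return i  # linkage broken: a row was edited, inserted, or removed
        if not hmac.compare_digest(row_mac(secret, payload, prev_row_hmac), row_hmac):
            return i  # payload no longer matches its recorded HMAC
        expected_prev = row_hmac
    return None  # chain intact
```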
## Leaked agent token

An agent bearer token (the plaintext value the mint dialog shows exactly once) was pasted into a public Slack channel, committed to a public git repo, or exposed in a build log.
### Immediate response

- Revoke the token in the dashboard: `/projects/<slug>/agents` → click the agent → revoke.
- Mint a replacement for the legitimate agent (same agent row, new token + new `hmac_secret`).
- Update the legitimate agent’s config with the new credentials. Restart the agent process.

Revoking the token removes its row from the agents table; any WebSocket session using that token is dropped on the next heartbeat (within ~30s) and subsequent connection attempts are rejected at the gateway.
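
A sketch of what that heartbeat-driven drop can look like server-side. This is not z4j’s actual gateway code - `agent_exists`, the close code, and the ping payload are all illustrative:

```python
import asyncio

HEARTBEAT_INTERVAL = 30  # seconds; matches the ~30s drop window described above

async def heartbeat_loop(ws, token_id: str, agent_exists) -> None:
    """Drop the live session as soon as the token's agents-table row is gone.

    `ws` is any object with async send/close; `agent_exists` is an async
    lookup against the agents table. Both names are assumptions.
    """
    while True:
        await asyncio.sleep(HEARTBEAT_INTERVAL)
        if not await agent_exists(token_id):
            # Revocation deleted the row; close the session instead of pinging.
            await ws.close(code=4401, reason="token revoked")
            return
        await ws.send("ping")
```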
### Damage assessment

Inspect the audit log for the window the token was exposed:

```sql
-- SQL query, against z4j's Postgres / SQLite
SELECT occurred_at, action, target_type, source_ip
FROM audit_log
WHERE actor_type = 'agent'
  AND actor_id = '<revoked-agent-uuid>'
  AND occurred_at > '2026-04-23 12:00:00'
ORDER BY occurred_at;
```

An agent token grants:
- Submit/cancel/retry tasks on that project
- Update schedules (if the agent’s engine supports it)
- Read task history + engine state for that project
An agent token does NOT grant:
- Read other projects’ data
- Mint more agents (that’s a user operation)
- Edit users / memberships / projects
- Read the audit log
So the blast radius of a leaked token is “anything you can do in that one project”.
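
One way to picture that scoping rule: every agent-token check pins the request to the token’s own project, and user-level operations never appear in the agent’s action set. A toy sketch - the action names are invented for illustration, not z4j’s real permission model:

```python
# Actions an agent token can perform, per the two lists above.
AGENT_ACTIONS = {
    "task.submit", "task.cancel", "task.retry",  # submit/cancel/retry tasks
    "schedule.update",                           # update schedules
    "task.read", "engine.read",                  # read task history + engine state
}

def authorize_agent(token_project: str, action: str, target_project: str) -> bool:
    # Project-pinned: the request must target the token's own project, and
    # user-level operations (minting agents, editing memberships, reading
    # the audit log) are simply absent from the agent action set.
    return token_project == target_project and action in AGENT_ACTIONS
```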
## Lost last admin

The first-boot setup token is single-use. Once it has been used, the admin user it created is the only way back in. If you lose that user (forgot password, email no longer works, removed by mistake), you need to recover from the CLI.
### Password forgotten

Type or pipe the new password. This works without an admin login - it’s an operator-side recovery tool. The command does two things (sketched below):

- Hashes the new password with Argon2id
- Bumps `password_changed_at`, invalidating every existing session for that user
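
A minimal sketch of those two effects, using the `argon2-cffi` package (whose default mode is Argon2id); z4j’s actual schema and helper names will differ:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

from argon2 import PasswordHasher  # argon2-cffi; defaults to Argon2id

ph = PasswordHasher()

@dataclass
class User:  # stand-in for z4j's real user row
    password_hash: str
    password_changed_at: datetime

def reset_password(user: User, new_password: str) -> None:
    # 1. Store only the Argon2id hash, never the plaintext.
    user.password_hash = ph.hash(new_password)
    # 2. Bumping password_changed_at cuts off every session issued before it.
    user.password_changed_at = datetime.now(timezone.utc)

def session_is_valid(session_created_at: datetime, user: User) -> bool:
    # Sessions minted before the password change are rejected.
    return session_created_at >= user.password_changed_at
```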
### Last admin deleted

If no admin users exist at all (somebody deleted the only owner from Settings → Memberships):

```sh
sudo -u z4j /srv/venv/bin/z4j createsuperuser --email [email protected] --display-name "You" --password-stdin
```

This creates a fresh owner user on the existing install. Existing projects, agents, task history, and the audit chain are untouched.
### Lost email access

If your admin email no longer works and you need to change the address:

- Use `z4j createsuperuser` to create a fresh owner with the new email.
- Log in as the new owner, open the old user’s row in Settings → Admin → Users, and delete it.

There is no “rename email” in the CLI - intentionally. A user’s email is their identity for audit purposes; renaming would mask the trail.
## What to practice

- Once a quarter: do a restore drill. Take a backup, provision a scratch VM, restore the backup there, and verify that `z4j check && z4j status && z4j audit verify` passes (a sketch of the verification step follows this list). This catches backup-process regressions before you need them.
- Monthly: run `z4j audit verify` on the live install. Relying on the on-boot check alone is too late - the break could have happened weeks ago.
- Before any upgrade: take a backup. Every time. See upgrade and rollback.
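
If you want to script the pass/fail step of the drill, here is a minimal sketch. It assumes the restore has already run on the scratch VM and the same install path as the examples above:

```python
import subprocess
import sys

# The three commands the drill must see pass, per the checklist above.
CHECKS = [
    ["/srv/venv/bin/z4j", "check"],
    ["/srv/venv/bin/z4j", "status"],
    ["/srv/venv/bin/z4j", "audit", "verify"],
]

def run_drill() -> None:
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(f"$ {' '.join(cmd)}")
        print(result.stdout, end="")
        if result.returncode != 0:
            # Fail loudly: a broken restore path should surface in the drill,
            # not during a real incident.
            sys.exit(f"restore drill FAILED at: {' '.join(cmd)}\n{result.stderr}")
    print("restore drill passed")

if __name__ == "__main__":
    run_drill()
```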