Service-user deployments

Production deployments often run the host process under a dedicated service user - www-data for nginx + gunicorn, app for systemd-managed uvicorn, or an ephemeral uid via DynamicUser=yes. This page covers what changes about the agent in those environments and how to debug when it doesn’t come online.

What we handle for you

Service users typically don’t have a writable $HOME. www-data resolves to /var/www, nobody to /nonexistent, DynamicUser to a transient mount. The agent’s on-disk SQLite buffer would normally land in ~/.z4j/, which the process can’t create.

The agent detects this at startup and relocates the buffer to a per-uid tmp directory:

$TMPDIR/z4j-{uid}/buffer-{pid}.sqlite     # mode 0700

A WARNING is logged once, naming the original path and the chosen fallback. The buffer is still crash-safe (SQLite WAL mode), still per-process (no counter drift), and still bounded - only the location moved.
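
The relocation described above can be pictured with a short sketch. This is not the agent's actual code: the function name `resolve_buffer_path`, its parameters, and the per-process file name used under `~/.z4j` are assumptions made for illustration. Only the fallback layout `$TMPDIR/z4j-{uid}/buffer-{pid}.sqlite` with mode 0700 comes from this page.

```python
import os
from pathlib import Path


def resolve_buffer_path(home: Path, tmp_root: Path, uid: int, pid: int) -> Path:
    """Hypothetical sketch: prefer <home>/.z4j, else fall back to a per-uid tmp dir."""
    preferred_dir = home / ".z4j"
    try:
        # Raises an OSError subclass when home is missing or not writable.
        preferred_dir.mkdir(mode=0o700, exist_ok=True)
        if os.access(preferred_dir, os.W_OK | os.X_OK):
            # File name under ~/.z4j is an assumption, mirroring the fallback name.
            return preferred_dir / f"buffer-{pid}.sqlite"
    except OSError:
        pass  # fall through to the tmp location

    fallback_dir = tmp_root / f"z4j-{uid}"
    fallback_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    # mkdir's mode is filtered by the umask, so enforce 0700 explicitly.
    fallback_dir.chmod(0o700)
    return fallback_dir / f"buffer-{pid}.sqlite"
```

Calling it with `Path.home()`, `Path(tempfile.gettempdir())`, `os.getuid()` and `os.getpid()` would reproduce the kind of path shown in the sample doctor output further down, for example `/tmp/z4j-33/buffer-7281.sqlite` for uid 33.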

When you want a persistent buffer location

The tmp fallback works, but anything in it is lost when /tmp is cleared. If your host mounts /tmp as tmpfs, that happens on every reboot, so set an explicit path if events buffered during a brain outage need to survive a host reboot:

# 1. Create a writable dir owned by the service user
sudo mkdir -p /var/lib/picker/.z4j
sudo chown www-data:www-data /var/lib/picker/.z4j

# 2. Point z4j at it via systemd unit override
sudo systemctl edit gunicorn

Add to the override:

[Service]
Environment=Z4J_BUFFER_PATH=/var/lib/picker/.z4j/buffer.sqlite

Then reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart gunicorn

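The path must live under one of the two allowed roots (~/.z4j or the per-uid tmp fallback). The security clamp rejects anything else, so a mistake like Z4J_BUFFER_PATH=/etc/passwd is refused rather than written to. Note that the example above only passes the clamp if the service’s HOME resolves to /var/lib/picker, so that /var/lib/picker/.z4j is its ~/.z4j.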
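
To make the clamp concrete, here is a minimal sketch of an allowed-roots check. It is an illustration only: `is_allowed_buffer_path` and its parameters are hypothetical names, and the agent's real validation may differ. The idea is to resolve symlinks and `..` segments first, then require the result to sit strictly inside one of the allowed roots.

```python
from pathlib import Path


def is_allowed_buffer_path(candidate: str, allowed_roots: list[Path]) -> bool:
    """Hypothetical sketch: True if candidate resolves to a path inside an allowed root."""
    # resolve() normalises '..' and follows symlinks, so traversal tricks are flattened
    # before the comparison happens.
    resolved = Path(candidate).expanduser().resolve()
    for root in allowed_roots:
        root_resolved = root.expanduser().resolve()
        # is_relative_to() also matches the root itself; require a path below it.
        if resolved != root_resolved and resolved.is_relative_to(root_resolved):
            return True
    return False
```

Comparing whole path components, rather than string prefixes, is what keeps a sibling directory such as `/tmp/z4j-333` from being accepted when the allowed root is `/tmp/z4j-33`. `Path.is_relative_to` requires Python 3.9 or newer.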

The doctor command

When an agent doesn’t come online, run the framework-side doctor first. It probes the same things the agent runtime would but synchronously and without starting the persistent connection - so you get specific failure reasons instead of “the dashboard shows unknown”.

Django:

sudo -u www-data /srv/picker/venv/bin/python /srv/picker/picker/manage.py z4j_doctor

Flask:

sudo -u www-data /srv/picker/venv/bin/python -m z4j_flask doctor

FastAPI:

sudo -u www-data /srv/picker/venv/bin/python -m z4j_fastapi doctor

Standalone Celery (no framework):

sudo -u www-data /srv/picker/venv/bin/python -m z4j_bare doctor

Always run the doctor as the same user the service runs under. Otherwise Path.home() resolves to a different directory, the buffer-path probe tests the writability of the wrong location, and the result doesn’t reflect what the actual service sees.
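
A quick way to see why the user matters is to print what Python resolves for the current process. The helper below is a hypothetical stand-alone sketch (it is not part of z4j). It relies only on standard-library behaviour: on POSIX, `Path.home()` uses the `HOME` environment variable when it is set and otherwise falls back to the passwd entry for the current uid.

```python
import os
import pwd
from pathlib import Path


def describe_identity() -> dict:
    """Report the uid, the home directory Python resolves, and whether it is writable."""
    uid = os.getuid()
    try:
        passwd_home = pwd.getpwuid(uid).pw_dir
    except KeyError:
        passwd_home = None  # uid has no passwd entry
    try:
        home = Path.home()
    except (RuntimeError, KeyError):
        home = None  # home directory could not be determined
    writable = bool(home is not None and home.is_dir() and os.access(home, os.W_OK))
    return {
        "uid": uid,
        "env_home": os.environ.get("HOME"),
        "passwd_home": passwd_home,
        "resolved_home": None if home is None else str(home),
        "home_writable": writable,
    }


if __name__ == "__main__":
    for key, value in describe_identity().items():
        print(f"{key}: {value}")
```

Running the script once as yourself and once with `sudo -u www-data` against the same interpreter should show different values for `resolved_home` and `home_writable`, which is the difference the buffer-path probe depends on.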

Sample output

z4j-doctor (django)
===================
  brain_url:   https://tasks.example.com/
  project_id:  picker
  agent_name:  picker_django
  buffer_path: /tmp/z4j-33/buffer-7281.sqlite
  transport:   auto

  [OK]   buffer_path  OK: buffer dir /tmp/z4j-33 is writable
  [OK]   dns          OK: tasks.example.com -> 198.51.100.42
  [OK]   tcp          OK: TCP connect to tasks.example.com:443
  [OK]   tls          OK: TLS TLSv1.3 to tasks.example.com (cert CN='tasks.example.com')
  [OK]   websocket    OK: ws upgrade to https://tasks.example.com/ succeeded

  engines auto-detected: celery

Add --no-websocket to skip the WS round-trip when the brain is intentionally offline. Add --json for scripting.

What each probe catches

| Probe | Catches |
| --- | --- |
| `buffer_path` | Service user can’t write to its `$HOME`; tmp fallback also unwritable; explicit `Z4J_BUFFER_PATH` typo |
| `dns` | Bad brain hostname, split-DNS, stale `/etc/hosts` |
| `tcp` | Egress firewall blocks z4j port, NAT timeout, no route |
| `tls` | Cert hostname mismatch, expired cert, untrusted CA, intermediate cert missing |
| `websocket` | Reverse proxy strips `Upgrade:` header, wrong token, wrong project_id, missing HMAC, brain refuses protocol version |
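
For intuition about what the network probes exercise, the first three layers can be approximated with the Python standard library alone. This is a rough sketch, not the doctor's implementation: the function names and message formats are invented for illustration, and the WebSocket step is omitted because it depends on the agent's token and protocol.

```python
import socket
import ssl


def probe_dns(host: str) -> tuple[bool, str]:
    """Resolve the host name; fails on a bad or unresolvable name."""
    try:
        infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"DNS lookup failed: {exc}"
    return True, f"{host} -> {infos[0][4][0]}"


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> tuple[bool, str]:
    """Open and close a TCP connection; fails on firewall blocks, timeouts, no route."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"TCP connect to {host}:{port}"
    except OSError as exc:
        return False, f"TCP connect failed: {exc}"


def probe_tls(host: str, port: int = 443, timeout: float = 5.0) -> tuple[bool, str]:
    """Complete a TLS handshake with certificate and hostname verification."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, f"TLS {tls.version()} to {host}"
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return False, f"TLS handshake failed: {exc}"
```

Each function returns an `(ok, detail)` pair, so the checks can be run in order and stopped at the first failure, which narrows the problem down to one layer.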

systemd unit checklist

Common problems that the doctor will surface:

Environment= vs EnvironmentFile= - env vars set in your shell or a project .env file don’t propagate to the service. Either declare them in the unit override (sudo systemctl edit <service>) or set EnvironmentFile=/path/to/.env so systemd loads it.
ProtectHome=true / PrivateTmp=true - these restrict where the process can write. The agent’s tmp fallback still works under PrivateTmp=true (the service gets its own private /tmp, which systemd removes when the service stops), but a custom Z4J_BUFFER_PATH outside the allowed view will fail at boot. Either drop the protection or move the buffer back to ~/.z4j (with a writable HOME) or $TMPDIR/z4j-{uid}.
User= vs running as root - token files and buffer paths created while testing as root won’t be readable when the service drops to its real user. Always test the doctor as the service user.
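
When it is unclear whether a variable actually reached the service, the environment systemd handed to the process can be read back from `/proc` on Linux. The helper names below are hypothetical, and reading another user's `/proc/<pid>/environ` generally requires root or the same uid.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env: dict[str, str] = {}
    for entry in raw.split(b"\0"):
        if not entry or b"=" not in entry:
            continue  # skip the trailing empty entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def service_environ(pid: int) -> dict[str, str]:
    """Return the environment a running process was started with (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
```

The main PID of a unit can be obtained with `systemctl show -p MainPID --value gunicorn`. If `Z4J_BUFFER_PATH` is missing from the result, the variable was set somewhere systemd does not read, such as a login shell or an unreferenced .env file.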

See also

Allowed hosts - adding the public DNS name to z4j’s host allow-list
Backup and restore - preserving the buffer across host moves
Troubleshooting - general agent symptom guide