Service-user deployments

Production deployments often run the host process under a dedicated service user - www-data for nginx + gunicorn, app for systemd-managed uvicorn, or an ephemeral uid via DynamicUser=yes. This page covers what changes about the agent in those environments and how to debug when it doesn’t come online.

Service users typically don’t have a writable $HOME. www-data resolves to /var/www, nobody to /nonexistent, DynamicUser to a transient mount. The agent’s on-disk SQLite buffer would normally land in ~/.z4j/, which the process can’t create.

The agent detects this at startup and relocates the buffer to a per-uid tmp directory:

$TMPDIR/z4j-{uid}/buffer-{pid}.sqlite # mode 0700

A WARNING is logged once, naming the original path and the chosen fallback. The buffer is still crash-safe (SQLite WAL mode), still per-process (no counter drift), and still bounded - only the location moved.
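The relocation described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the agent's actual code: the function names, the logger name, and the use of the same `buffer-{pid}.sqlite` file name under `~/.z4j` are assumptions made for the example.

```python
import logging
import os
import tempfile
from pathlib import Path
from typing import Optional

log = logging.getLogger("z4j.sketch")  # hypothetical logger name


def ensure_private_dir(path: Path) -> bool:
    """Create `path` with mode 0700 if needed; return True if it is writable."""
    try:
        path.mkdir(mode=0o700, parents=True, exist_ok=True)
    except OSError:
        return False
    return os.access(path, os.W_OK | os.X_OK)


def resolve_buffer_path(home: Optional[Path] = None,
                        tmp_root: Optional[Path] = None) -> Path:
    """Prefer <home>/.z4j; otherwise fall back to <tmp>/z4j-<uid>/."""
    home = home if home is not None else Path.home()
    tmp_root = tmp_root if tmp_root is not None else Path(tempfile.gettempdir())
    filename = f"buffer-{os.getpid()}.sqlite"  # assumed name in both locations

    preferred = home / ".z4j"
    if ensure_private_dir(preferred):
        return preferred / filename

    fallback = tmp_root / f"z4j-{os.getuid()}"
    if ensure_private_dir(fallback):
        # Mirror the documented behaviour: name both paths in one warning.
        log.warning("buffer dir %s not writable, using %s", preferred, fallback)
        return fallback / filename

    raise OSError(f"neither {preferred} nor {fallback} is writable")
```

Keying the fallback directory on the uid keeps different service users on the same host apart, and keying the file on the pid keeps worker processes from sharing one SQLite file.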

When you want a persistent buffer location


The tmp fallback works, but its contents are lost whenever the tmp directory is cleared. If your host mounts /tmp as tmpfs, that includes every reboot. Set an explicit path so events buffered during a brain outage survive a host reboot:

# 1. Create a writable dir owned by the service user
sudo mkdir -p /var/lib/picker/.z4j
sudo chown www-data:www-data /var/lib/picker/.z4j
# 2. Point z4j at it via systemd unit override
sudo systemctl edit gunicorn

Add to the override:

[Service]
Environment=Z4J_BUFFER_PATH=/var/lib/picker/.z4j/buffer.sqlite

Then reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart gunicorn

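The path must live under one of the two allowed roots (~/.z4j or the per-uid tmp fallback), so the /var/lib/picker/.z4j directory above is accepted only if it resolves to the service user’s ~/.z4j. The security clamp rejects anything else, which means a mistaken value such as Z4J_BUFFER_PATH=/etc/passwd is refused rather than opened as a buffer.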
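A root-containment check of this kind might look like the following sketch. It is not the agent's implementation; `is_allowed_buffer_path` and `default_roots` are hypothetical names, and the only behaviour taken from this page is the pair of allowed roots.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List


def default_roots() -> List[Path]:
    """The two roots named on this page, for the current user."""
    return [
        Path.home() / ".z4j",
        Path(tempfile.gettempdir()) / f"z4j-{os.getuid()}",
    ]


def is_allowed_buffer_path(candidate: str, roots: Iterable[Path]) -> bool:
    """Return True only if `candidate` resolves to a location under a root."""
    # resolve() collapses ".." segments and follows symlinks, so a path
    # cannot escape a root by traversal.
    resolved = Path(candidate).resolve()
    for root in roots:
        try:
            resolved.relative_to(root.resolve())
        except ValueError:
            continue  # not under this root, try the next one
        return True
    return False
```

Comparing resolved paths rather than raw strings is the important design choice: a prefix test on the raw string would accept `/tmp/z4j-33/../../etc/passwd`.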

When an agent doesn’t come online, run the framework-side doctor first. It probes the same things the agent runtime would but synchronously and without starting the persistent connection - so you get specific failure reasons instead of “the dashboard shows unknown”.

Django:

sudo -u www-data /srv/picker/venv/bin/python /srv/picker/picker/manage.py z4j_doctor

Flask:

sudo -u www-data /srv/picker/venv/bin/python -m z4j_flask doctor

FastAPI:

sudo -u www-data /srv/picker/venv/bin/python -m z4j_fastapi doctor

Standalone Celery (no framework):

sudo -u www-data /srv/picker/venv/bin/python -m z4j_bare doctor

Always run the doctor as the same user the service runs under. Otherwise Path.home() resolves to a different directory, the buffer-path probe tests writability with the wrong user’s permissions, and the result doesn’t reflect what the actual service sees.
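To see how much the answer depends on the invoking user, a small standalone script can print what the current process resolves for itself. This is a hedged sketch that uses only the Python standard library; `describe_runtime_user` is a hypothetical helper and not part of z4j.

```python
import os
import tempfile
from pathlib import Path


def describe_runtime_user() -> dict:
    """Report the uid, home directory and tmp directory this process sees."""
    try:
        home = Path.home()
    except RuntimeError:
        home = None  # no HOME set and no passwd entry for this uid
    return {
        "uid": os.getuid(),
        "home": str(home) if home is not None else None,
        "home_writable": bool(home and home.is_dir() and os.access(home, os.W_OK)),
        "tmpdir": tempfile.gettempdir(),
    }


if __name__ == "__main__":
    for key, value in describe_runtime_user().items():
        print(f"{key}: {value}")
```

Running it once from your own shell and once via `sudo -u www-data` with the service's interpreter will typically show a different `home` and `home_writable`, which is exactly the difference the doctor would otherwise hide.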

z4j-doctor (django)
===================
brain_url: https://tasks.example.com/
project_id: picker
agent_name: picker_django
buffer_path: /tmp/z4j-33/buffer-7281.sqlite
transport: auto
[OK] buffer_path OK: buffer dir /tmp/z4j-33 is writable
[OK] dns OK: tasks.example.com -> 198.51.100.42
[OK] tcp OK: TCP connect to tasks.example.com:443
[OK] tls OK: TLS TLSv1.3 to tasks.example.com (cert CN='tasks.example.com')
[OK] websocket OK: ws upgrade to https://tasks.example.com/ succeeded
engines auto-detected: celery

Add --no-websocket to skip the WS round-trip when z4j is intentionally offline. Add --json for scripting.

| Probe | Catches |
| --- | --- |
| buffer_path | Service user can’t write to its $HOME; tmp fallback also unwritable; explicit Z4J_BUFFER_PATH typo |
| dns | Bad brain hostname, split-DNS, stale /etc/hosts |
| tcp | Egress firewall blocks z4j port, NAT timeout, no route |
| tls | Cert hostname mismatch, expired cert, untrusted CA, intermediate cert missing |
| websocket | Reverse proxy strips Upgrade: header, wrong token, wrong project_id, missing HMAC, brain refuses protocol version |
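The network probes are layered: each one only makes sense once the previous one has passed. The sketch below shows that staging for the dns, tcp and tls steps with the Python standard library. It is an illustration of the layering, not the doctor's code; `probe_endpoint` is a hypothetical name, and the websocket step is left out because it depends on z4j's token and protocol details.

```python
import socket
import ssl
from typing import List, Tuple

ProbeResult = Tuple[str, bool, str]  # (probe name, passed, detail)


def probe_endpoint(host: str, port: int = 443,
                   timeout: float = 5.0) -> List[ProbeResult]:
    """Run dns -> tcp -> tls in order, stopping at the first failure."""
    results: List[ProbeResult] = []

    # dns: can the hostname be resolved at all?
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        results.append(("dns", False, str(exc)))
        return results
    results.append(("dns", True, f"{host} -> {infos[0][4][0]}"))

    # tcp: is the port reachable (firewall, routing)?
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        results.append(("tcp", False, str(exc)))
        return results
    results.append(("tcp", True, f"TCP connect to {host}:{port}"))

    # tls: does the certificate chain verify for this hostname?
    context = ssl.create_default_context()
    try:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            results.append(("tls", True, f"TLS {tls.version()} to {host}"))
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        results.append(("tls", False, str(exc)))
    finally:
        sock.close()
    return results
```

Stopping at the first failure keeps the report honest: a TLS error reported after a failed TCP connect would only be noise.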

Common problems that the doctor will surface:

  • Environment= vs EnvironmentFile= - env vars set in your shell or a project .env file don’t propagate to the service. Either declare them in the unit override (sudo systemctl edit <service>) or set EnvironmentFile=/path/to/.env so systemd loads it.
  • ProtectHome=true / PrivateTmp=true - these restrict where the process can write. The agent’s tmp fallback still works under PrivateTmp=true (the service gets its own private /tmp mount), but a custom Z4J_BUFFER_PATH outside the allowed view will fail at boot. Either drop the protection or move the buffer back to ~/.z4j (with a writable HOME) or $TMPDIR/z4j-{uid}.
  • User= vs running as root - token files and buffer paths created while testing as root won’t be readable when the service drops to its real user. Always test the doctor as the service user.
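For the first item in the list above, one way to confirm which variables a running service actually received is to read its environment from `/proc/<pid>/environ` on Linux. The sketch below assumes you already know the service's main PID and have permission to read that file (root or the service user). The helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env: Dict[str, str] = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the trailing empty entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def service_environ(pid: int) -> Dict[str, str]:
    """Return the environment the process with `pid` was started with."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
```

If `service_environ(pid).get("Z4J_BUFFER_PATH")` is `None` while the variable is set in your shell, the value never reached the unit. The file reflects the environment at process start, so restart the service after changing the override before checking again.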