Production hardening

z4j ships with backwards-compatible defaults that work out of the box but lean permissive. For production deployments, opt into the fail-closed mode of each subsystem below. None of these settings is required for z4j to function; they are defense-in-depth for operators who want hardened, fail-closed behavior instead of the trust-the-CA / trust-the-operator-config defaults.

Scheduler gRPC – require explicit CN allow-list

When Z4J_SCHEDULER_GRPC_ENABLED=true, the brain accepts mTLS-authenticated gRPC connections from any client cert that the configured CA bundle validates (the “trust the CA” deployment model). For production, populate Z4J_SCHEDULER_GRPC_ALLOWED_CNS with the explicit list of CNs you’ve minted via z4j mint-scheduler-cert, and set Z4J_SCHEDULER_GRPC_REQUIRE_ALLOWLIST=true so that a misconfigured boot fails closed instead of falling back to trust-the-CA:

Z4J_SCHEDULER_GRPC_ENABLED=true
Z4J_SCHEDULER_GRPC_ALLOWED_CNS='["scheduler-prod-1","scheduler-prod-2"]'
Z4J_SCHEDULER_GRPC_REQUIRE_ALLOWLIST=true
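
The fail-closed behavior described above can be pictured with a minimal Python sketch. This is an illustration only, not z4j's implementation: the function names load_allowed_cns and cn_is_accepted are hypothetical, and the sketch assumes the allow-list is a JSON array and the require flag is the literal string "true".

```python
from __future__ import annotations

import json


def load_allowed_cns(env: dict[str, str]) -> frozenset[str] | None:
    """Return the CN allow-list, or None when trust-the-CA mode applies.

    Hypothetical helper: raises ValueError when the require flag is set
    but no allow-list is configured, so a misconfigured boot fails closed.
    """
    raw = env.get("Z4J_SCHEDULER_GRPC_ALLOWED_CNS", "").strip()
    parsed = json.loads(raw) if raw else []
    if not isinstance(parsed, list):
        raise ValueError("Z4J_SCHEDULER_GRPC_ALLOWED_CNS must be a JSON array")
    allowed = frozenset(parsed)

    # Assumption: the flag is enabled by the literal value "true".
    require = env.get("Z4J_SCHEDULER_GRPC_REQUIRE_ALLOWLIST", "").lower() == "true"
    if not allowed:
        if require:
            raise ValueError("allow-list required but no CNs are configured")
        return None  # permissive default: any CA-validated cert is accepted
    return allowed


def cn_is_accepted(cn: str, allowed: frozenset[str] | None) -> bool:
    """Accept every CN in trust-the-CA mode, else only allow-listed CNs."""
    return allowed is None or cn in allowed
```

With the three variables shown above, a cert whose CN is scheduler-prod-1 would pass this check, a CA-valid cert with any other CN would not, and an empty allow-list combined with the require flag would stop the boot rather than silently accept every cert.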

For the brain to push schedule triggers to the scheduler, set the outbound gRPC client variables on the brain. Without Z4J_SCHEDULER_TRIGGER_URL, the brain uses its in-process scheduler path and ignores the TLS variables:

Z4J_SCHEDULER_TRIGGER_URL=scheduler.example.com:50051
Z4J_SCHEDULER_TRIGGER_TLS_CERT=/etc/z4j/pki/brain.crt
Z4J_SCHEDULER_TRIGGER_TLS_KEY=/etc/z4j/pki/brain.key
Z4J_SCHEDULER_TRIGGER_TLS_CA=/etc/z4j/pki/scheduler-ca.crt
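
A deploy-time preflight for these variables might look like the following sketch. The helper trigger_mode is hypothetical and not part of z4j. It also assumes that all three TLS variables must be present whenever the URL is set, which this page implies but does not state.

```python
from __future__ import annotations

TLS_VARS = (
    "Z4J_SCHEDULER_TRIGGER_TLS_CERT",
    "Z4J_SCHEDULER_TRIGGER_TLS_KEY",
    "Z4J_SCHEDULER_TRIGGER_TLS_CA",
)


def trigger_mode(env: dict[str, str]) -> tuple[str, list[str]]:
    """Return the trigger path the brain would use and any missing TLS vars.

    Hypothetical preflight: without the URL the in-process path applies
    and the TLS variables are irrelevant; with the URL, every TLS
    variable is assumed to be needed.
    """
    if not env.get("Z4J_SCHEDULER_TRIGGER_URL"):
        return "in-process", []
    missing = [name for name in TLS_VARS if not env.get(name)]
    return "grpc", missing
```

Running a check like this before rollout surfaces a half-configured client (URL set, key path missing) as an explicit list of missing names.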

The allow-list for the brain’s CN lives on the scheduler side (SCHEDULER_GRPC_ALLOWED_CNS in the scheduler’s own env), not on the brain.

CN project bindings (multi-project deployments)

If you run schedulers per-project (one scheduler instance per tenant, each with its own CN), bind each CN to its project list so a leaked cert can only act on the projects it was minted for:

Z4J_SCHEDULER_GRPC_CN_PROJECT_BINDINGS='{"scheduler-acme":["acme"],"scheduler-globex":["globex"]}'

Without bindings (the default), every allow-listed CN can drive RPCs for any project. With bindings, requests outside the bound project list return PERMISSION_DENIED.
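
The binding rule can be sketched as a small authorization check. This is a hedged illustration rather than z4j's code: authorize is a hypothetical name, and the handling of a CN that has no entry while bindings are configured (denied here) is an assumption this page does not confirm.

```python
from __future__ import annotations


def authorize(cn: str, project: str, bindings: dict[str, list[str]]) -> str:
    """Return a gRPC-style status name for a request from `cn` on `project`."""
    if not bindings:
        # Default: no bindings, so an allow-listed CN may act on any project.
        return "OK"
    # Assumption: a CN absent from the bindings map is bound to no projects.
    if project in bindings.get(cn, []):
        return "OK"
    return "PERMISSION_DENIED"
```

With the example bindings above, scheduler-acme acting on acme yields OK, while the same cert acting on globex yields PERMISSION_DENIED.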

Notification webhooks – HTTPS-only by default

z4j defaults to HTTPS-only for generic webhook channels. Operator-configured http:// URLs are rejected both at config time and at dispatch time, so that payloads and custom headers cannot leak in transit.

If you have a legitimate internal-network HTTP endpoint (intranet receiver, dev rig), opt back in:

Z4J_NOTIFICATIONS_WEBHOOK_ALLOW_HTTP=true

Slack / PagerDuty / Discord / Telegram channels always use the provider’s HTTPS endpoint and are unaffected by this setting.
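
The scheme check for generic webhook URLs can be illustrated as follows. This is a minimal sketch under assumptions: webhook_url_allowed is a hypothetical name, its allow_http parameter stands in for Z4J_NOTIFICATIONS_WEBHOOK_ALLOW_HTTP, and z4j's real validation may check more than the scheme.

```python
from urllib.parse import urlsplit


def webhook_url_allowed(url: str, allow_http: bool = False) -> bool:
    """Accept https:// always; accept http:// only when explicitly opted in."""
    scheme = urlsplit(url).scheme.lower()
    if scheme == "https":
        return True
    # Any other scheme is rejected; plaintext HTTP needs the opt-in.
    return scheme == "http" and allow_http
```

Applying the same function when the channel is saved and again just before dispatch mirrors the two enforcement points described above.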

Audit log retention

The audit log is HMAC-chained and append-only. By default the brain keeps every row forever. To bound storage without breaking chain integrity, enable the audit retention sweeper:

Z4J_AUDIT_RETENTION_DAYS=365   # sweeper runs daily; trims rows older than 365 days

The sweeper preserves chain continuity by recording a “summary” row for each batch it deletes; verifying the chain across a sweep is documented in hmac-audit-chain.
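
To give a feel for how a summary row can keep an HMAC chain verifiable after rows are deleted, here is a toy Python sketch. Everything in it is an assumption made for illustration: the row layout, the key, the GENESIS value, and the function names are hypothetical, and z4j's actual format is the one documented in hmac-audit-chain.

```python
import hashlib
import hmac

KEY = b"demo-key"        # hypothetical key, illustration only
GENESIS = b"\x00" * 32   # hypothetical value the first row links back to


def _mac(prev: bytes, payload: bytes) -> bytes:
    return hmac.new(KEY, prev + payload, hashlib.sha256).digest()


def append(chain: list, payload: bytes) -> None:
    """Append an event row whose MAC covers the previous row's link."""
    prev = chain[-1]["link"] if chain else GENESIS
    mac = _mac(prev, payload)
    chain.append({"kind": "event", "prev": prev, "payload": payload,
                  "mac": mac, "link": mac})


def sweep(chain: list, count: int) -> list:
    """Replace the oldest `count` rows with a single summary row."""
    deleted, kept = chain[:count], chain[count:]
    if count <= 0 or not deleted:
        return list(chain)
    prev = deleted[0]["prev"]
    link = deleted[-1]["link"]  # what the first kept row points back to
    payload = b"summary:%d:" % len(deleted) + link
    summary = {"kind": "summary", "prev": prev, "payload": payload,
               "mac": _mac(prev, payload), "link": link}
    return [summary] + kept


def verify(chain: list) -> bool:
    """Check every row's MAC and that each row links to its predecessor."""
    expected = GENESIS
    for row in chain:
        if row["prev"] != expected:
            return False
        if not hmac.compare_digest(row["mac"], _mac(row["prev"], row["payload"])):
            return False
        if row["kind"] == "event" and row["link"] != row["mac"]:
            return False
        if row["kind"] == "summary" and not row["payload"].endswith(row["link"]):
            return False
        expected = row["link"]
    return True
```

In this toy model the summary row is itself authenticated and carries forward the link of the last deleted row, so the first surviving row still verifies against its predecessor.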

Production environment flag

z4j treats a deployment as “production” only when two signals are both present: a non-dev Z4J_ENVIRONMENT value and a non-empty Z4J_ALLOWED_HOSTS config. Together they gate the strict-error / no-debug-endpoint behavior. See allowed-hosts for the four-layer host header allow-list.
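
The two-signal rule can be expressed as a short predicate. This is a sketch under stated assumptions: is_production is a hypothetical name, the set of values counted as “dev” is a guess (this page does not list them), and Z4J_ALLOWED_HOSTS is assumed to be a JSON array as in the examples on this page.

```python
from __future__ import annotations

import json

# Assumption: which Z4J_ENVIRONMENT values count as "dev" is not stated here.
DEV_ENVIRONMENTS = {"", "dev", "development", "local", "test"}


def is_production(env: dict[str, str]) -> bool:
    """True only when both the environment and allowed-hosts signals are set."""
    environment = env.get("Z4J_ENVIRONMENT", "").strip().lower()
    raw_hosts = env.get("Z4J_ALLOWED_HOSTS", "").strip()
    hosts = json.loads(raw_hosts) if raw_hosts else []
    return environment not in DEV_ENVIRONMENTS and len(hosts) > 0
```

Under this reading, setting Z4J_ENVIRONMENT=production while leaving Z4J_ALLOWED_HOSTS empty would not enable the strict behavior, which is why the recommended configuration sets both.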

Recommended summary

For a production deployment, set:

# Environment
Z4J_ENVIRONMENT=production
Z4J_ALLOWED_HOSTS='["z4j.example.com"]'

# Scheduler gRPC server (brain accepts inbound from z4j-scheduler)
Z4J_SCHEDULER_GRPC_ENABLED=true
Z4J_SCHEDULER_GRPC_REQUIRE_ALLOWLIST=true
Z4J_SCHEDULER_GRPC_ALLOWED_CNS='["..."]'

# Scheduler trigger client (brain dials z4j-scheduler outbound)
Z4J_SCHEDULER_TRIGGER_URL=scheduler.example.com:50051
Z4J_SCHEDULER_TRIGGER_TLS_CERT=/etc/z4j/pki/brain.crt
Z4J_SCHEDULER_TRIGGER_TLS_KEY=/etc/z4j/pki/brain.key
Z4J_SCHEDULER_TRIGGER_TLS_CA=/etc/z4j/pki/scheduler-ca.crt

# Webhooks: HTTPS-only is the default; only set this if you need
# plaintext for an internal endpoint.
# Z4J_NOTIFICATIONS_WEBHOOK_ALLOW_HTTP=true

# Audit retention
Z4J_AUDIT_RETENTION_DAYS=365

Each setting is documented individually in the env vars reference.