OpenTelemetry

The z4j brain ships an optional OpenTelemetry hook. With an OTLP endpoint configured, FastAPI HTTP server requests, SQLAlchemy queries, and outbound httpx calls are traced and exported to a collector of your choosing. The integration is off by default and opt-in via a single env var. The SDK is loaded lazily, so a misconfiguration cannot prevent boot.

| Source | Span kind | Default sample rate |
| --- | --- | --- |
| FastAPI HTTP requests | server | 0 (set Z4J_OTEL_TRACES_SAMPLER_ARG=0.05 to collect 5%) |
| SQLAlchemy queries against the brain’s primary engine | client (DB) | Inherits the parent span’s sampling decision |
| Outbound httpx calls (notification dispatchers, version check) | client (HTTP) | Inherits the parent span’s sampling decision |
| WebSocket dispatch, command issuance, task ingestion | not yet | Deferred; the wire protocol needs a trace-context header for cross-process spans to be useful. Candidate for a later minor. |

/health* and /metrics are excluded by default. They carry so much background traffic that tracing them would consume the sampling budget and swamp the operator’s collector with noise. Set Z4J_OTEL_INCLUDE_HEALTH=true to trace them again.
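The exclusion rule can be sketched as a plain substring match. This is an illustration, not the brain's code: the function names are hypothetical, the default list is copied from the configuration table, and the sketch assumes that Z4J_OTEL_INCLUDE_HEALTH lifts only the health and metrics entries while the auth routes stay excluded.

```python
from __future__ import annotations

# Hypothetical names; the path lists are taken from the configuration table.
HEALTH_EXCLUDES = ("/health", "/api/v1/health", "/api/v1/health/", "/metrics")
# Assumption: auth routes stay excluded even when health tracing is enabled.
AUTH_EXCLUDES = ("/api/v1/auth",)


def build_excludes(include_health: bool, extra_patterns: str = "") -> tuple[str, ...]:
    """Combine the default exclusions with comma-separated operator patterns."""
    extras = tuple(p.strip() for p in extra_patterns.split(",") if p.strip())
    base = AUTH_EXCLUDES if include_health else HEALTH_EXCLUDES + AUTH_EXCLUDES
    return base + extras


def is_excluded(path: str, excludes: tuple[str, ...]) -> bool:
    """Return True when any exclusion substring occurs in the request path."""
    return any(pattern in path for pattern in excludes)


excludes = build_excludes(include_health=False, extra_patterns="/internal, /debug")
print(is_excluded("/health/ready", excludes))  # health probe: excluded
print(is_excluded("/api/v1/tasks", excludes))  # ordinary API route: traced
```

Substring matching is broad by design: a pattern such as /health also matches /health/ready, which is how a single entry can cover the whole /health* family.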

Install the optional dependency, then set the endpoint:

pip install 'z4j[otel]'

# In your env file or systemd unit
Z4J_OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io/v1/traces
Z4J_OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=<your-api-key>
Z4J_OTEL_TRACES_SAMPLER_ARG=0.05

Restart the brain. On boot you should see:

INFO z4j.brain.observability.otel: OpenTelemetry initialised (endpoint=https://api.honeycomb.io/v1/traces, protocol=http/protobuf, sampler_arg=0.050, include_health=False)

If the SDK is not installed when the endpoint is set, the brain logs a single WARNING explaining what to install and continues running without OTel.
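A lazy, non-fatal initialisation of this kind might look roughly like the following. It is a minimal sketch under assumptions: init_tracing, its sdk_module parameter, and the warning text are hypothetical and do not reproduce the brain's actual code or log output.

```python
import importlib
import logging

logger = logging.getLogger("otel_sketch")  # hypothetical logger name


def init_tracing(endpoint, sdk_module="opentelemetry.sdk.trace"):
    """Return True if tracing could be set up, False if it was skipped."""
    if not endpoint:
        # Unset or empty endpoint: do nothing and never import the SDK.
        return False
    try:
        # Import only now, so a missing SDK cannot break process start-up.
        importlib.import_module(sdk_module)
    except ImportError:
        logger.warning(
            "OTLP endpoint is set but the OpenTelemetry SDK is not installed; "
            "install it with: pip install 'z4j[otel]'"
        )
        return False
    # Exporter and tracer-provider wiring would go here.
    return True


print(init_tracing(""))  # False: endpoint unset, nothing is imported
```

The point of the pattern is that the import sits inside the function rather than at module top level, so an ImportError degrades to a warning instead of a crash.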

All settings are prefixed Z4J_ and read from your env file or the process environment.

| Variable | Default | Notes |
| --- | --- | --- |
| Z4J_OTEL_EXPORTER_OTLP_ENDPOINT | unset | When unset OR empty, every other knob below is ignored. SecretStr at the Pydantic layer so a path-embedded API key never lands in startup logs. |
| Z4J_OTEL_PROTOCOL | http/protobuf | One of http/protobuf, http (alias), grpc. gRPC additionally needs pip install opentelemetry-exporter-otlp-proto-grpc. |
| Z4J_OTEL_EXPORTER_OTLP_HEADERS | unset | Comma-separated key=value pairs forwarded as the OTLP exporter’s headers. The standard place to set x-honeycomb-team, authorization, etc. SecretStr. |
| Z4J_OTEL_SERVICE_NAME | z4j-brain | Resource attribute service.name. Multi-brain deployments set this to distinguish them in the collector UI. |
| Z4J_OTEL_SERVICE_NAMESPACE | z4j | Resource attribute service.namespace. Groups every z4j service together in the collector. |
| Z4J_OTEL_ENVIRONMENT | unset | Resource attribute deployment.environment. Defaults to Z4J_ENVIRONMENT (production / staging / dev). |
| Z4J_OTEL_TRACES_SAMPLER_ARG | 0.0 | TraceIdRatioBased sampler argument, 0.0..1.0. Default 0 = no traces sampled. Wrapped in ParentBased(remote_parent_sampled=ALWAYS_OFF, remote_parent_not_sampled=ALWAYS_OFF) so a spoofed inbound traceparent cannot force-sample requests; only this brain’s own ratio sampler decides whether to record. Cross-process trace context propagation is deferred to a later minor. |
| Z4J_OTEL_INCLUDE_HEALTH | false | When true, /health* and /metrics are traced like any other endpoint. The full default exclude list is /health, /api/v1/health, /api/v1/health/, /metrics, /api/v1/auth (auth routes carry credentials in POST bodies; excluded by default). |
| Z4J_OTEL_EXCLUDED_URL_PATTERNS | empty | Comma-separated URL substrings to additionally exclude. Layered on top of the health-exclusion default. |
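To make the comma-separated key=value format of Z4J_OTEL_EXPORTER_OTLP_HEADERS concrete, here is a small parsing sketch. The function name parse_otlp_headers and its error handling are assumptions for illustration; the brain's own parser may differ.

```python
def parse_otlp_headers(raw):
    """Parse 'k1=v1,k2=v2' into a dict, splitting each pair on its first '='."""
    headers = {}
    for position, pair in enumerate((raw or "").split(",")):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and empty input
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            # Report the position only: the pair itself may contain a secret.
            raise ValueError(f"malformed header entry at position {position}")
        headers[key.strip()] = value.strip()
    return headers


print(parse_otlp_headers("x-honeycomb-team=abc123, authorization=Basic dXNlcg=="))
```

Splitting on the first = only keeps base64 padding in values such as Basic dXNlcg== intact. The error message deliberately omits the offending text, in the same spirit as the SecretStr handling described in the table.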

Out-of-range sampler args fail validation at startup. A typo like Z4J_OTEL_TRACES_SAMPLER_ARG=1.5 raises a Pydantic ValidationError before the FastAPI app is built.
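The effect of the sampler argument can be illustrated with a plain-Python sketch of a trace-ID-ratio decision. It is not the OpenTelemetry SDK's code, although to my understanding the SDK's TraceIdRatioBased sampler follows the same idea of comparing the low 64 bits of the trace ID against a bound derived from the rate. The function names are hypothetical, and the ValueError merely stands in for the Pydantic ValidationError the brain raises.

```python
TRACE_ID_MASK = (1 << 64) - 1  # only the low 64 bits of the trace ID are compared


def ratio_bound(rate: float) -> int:
    """Translate a sampling rate in 0.0..1.0 into an integer threshold."""
    if not 0.0 <= rate <= 1.0:
        # Stand-in for the start-up validation failure described above.
        raise ValueError("sampler arg must be within 0.0..1.0")
    return round(rate * (1 << 64))


def should_sample(trace_id: int, rate: float) -> bool:
    """Sample when the masked trace ID falls below the bound for this rate."""
    return (trace_id & TRACE_ID_MASK) < ratio_bound(rate)


print(should_sample(0x1234, 0.05))  # small trace ID: below the 5% bound
print(should_sample(0x1234, 0.0))   # rate 0 samples nothing
```

Because the decision depends only on the trace ID and the configured rate, a rate of 0.0 yields a bound of 0 and no trace is ever recorded, which matches the default.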

Every span carries:

service.name = z4j-brain (or your override)
service.namespace = z4j (or your override)
service.version = <z4j package version> (omitted on editable installs without metadata)
deployment.environment = <Z4J_ENVIRONMENT> (or the Z4J_OTEL_ENVIRONMENT override)

The build_resource_attributes helper is exposed and unit-tested so the attribute set is pinned at the source rather than the collector side.
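For orientation, a helper of this shape could be written as follows. The sketch is not the brain's build_resource_attributes: the function name, parameters, and the decision to omit deployment.environment when no environment is given are all assumptions.

```python
from importlib import metadata


def sketch_resource_attributes(
    service_name="z4j-brain",
    namespace="z4j",
    environment=None,
    package="z4j",  # assumed distribution name used for the version lookup
):
    """Build the resource attribute dict attached to every span."""
    attrs = {
        "service.name": service_name,
        "service.namespace": namespace,
    }
    try:
        attrs["service.version"] = metadata.version(package)
    except metadata.PackageNotFoundError:
        # No installed metadata (for example some editable installs): omit it.
        pass
    if environment:
        attrs["deployment.environment"] = environment
    return attrs


print(sketch_resource_attributes(environment="staging"))
```

Keeping this logic in one pure function is what makes it easy to pin with unit tests: the function takes plain values and returns a plain dict.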

Tested OTLP endpoints:

  • Honeycomb: https://api.honeycomb.io/v1/traces + x-honeycomb-team header. HTTP only.
  • Grafana Tempo: https://tempo-prod-04-prod-us-east-0.grafana.net/tempo + basic-auth header. HTTP or gRPC.
  • Local Jaeger (jaegertracing/all-in-one:latest): http://localhost:4318/v1/traces (HTTP) or localhost:4317 (gRPC).
  • Local OpenTelemetry Collector (otel/opentelemetry-collector-contrib): http://localhost:4318/v1/traces.

The brain only ships traces over OTLP. Metrics export over OTLP is not enabled in this release: the Prometheus /metrics endpoint is the canonical metric surface and dual-exporting would just create reconciliation work.

The OTLP exporter ships span attributes (URLs, DB statement fragments, response codes, header names if the instrumentation captures them) to a collector outside the brain. Pre-1.6 deployments expect no outbound traffic from the brain to anywhere except configured notification destinations; enabling OTel changes that contract. Review what the auto-instrumentations attach BEFORE pointing at a multi-tenant collector. In particular:

  • The FastAPI instrumentation attaches the request path and method. Path parameters become span attributes; if you embed an opaque token in the URL it will appear in spans. Move it to a header.
  • The SQLAlchemy instrumentation runs with enable_commenter=False so it does not embed SQL comments in your queries. The statement text itself is still captured, so sensitive queries (the brain has none by default, but operator-added schema might introduce them) expose table and column names in spans.
  • The httpx instrumentation captures outbound URLs. Webhook dispatch to Slack / Discord / Teams sends to URLs that embed credentials in the path; without scrubbing, the OTLP exporter would ship those URLs to whatever collector you configure. The brain therefore installs a request_hook + response_hook pair on the httpx instrumentation that overwrites http.url, http.target, url.full, url.path, and url.query with /[redacted by z4j] when the destination host suffix-matches the credential-bearing set: outlook.office.com, *.webhook.office.com, *.logic.azure.com, hooks.slack.com, discord.com, discordapp.com, *.slack.com, *.discordapp.com, *.discord.com, *.pagerduty.com. The hook is fail-closed: if the scrubber itself raises (for example, because a future SDK upgrade renames request.url), the URL is overwritten with a generic https://[unknown]/[redacted by z4j] marker and a WARNING is logged. The unscrubbed URL never reaches the OTLP exporter.
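The host matching and fail-closed behaviour described for the httpx hooks can be sketched with the standard library alone. The function names are hypothetical, the host list is copied from the bullet above, and the exact shape of the scrubbed value (scheme and host kept, everything after replaced) is an assumption; the real hooks operate on span attributes rather than returning strings.

```python
from urllib.parse import urlsplit

# Hosts listed without a wildcard must match exactly.
EXACT_HOSTS = {"outlook.office.com", "hooks.slack.com", "discord.com", "discordapp.com"}
# "*.example.com" entries become dot-prefixed suffixes.
WILDCARD_SUFFIXES = (
    ".webhook.office.com", ".logic.azure.com", ".slack.com",
    ".discordapp.com", ".discord.com", ".pagerduty.com",
)
REDACTED_PATH = "/[redacted by z4j]"
FALLBACK = "https://[unknown]/[redacted by z4j]"


def is_credential_bearing(host: str) -> bool:
    """Return True when the host is in the credential-bearing set."""
    host = host.lower().rstrip(".")
    return host in EXACT_HOSTS or host.endswith(WILDCARD_SUFFIXES)


def scrub_url(url: str) -> str:
    """Redact path and query for credential-bearing hosts; fail closed on errors."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if is_credential_bearing(host):
            return f"{parts.scheme}://{host}{REDACTED_PATH}"
        return url
    except Exception:
        # Never let a scrubber failure leak the original URL.
        return FALLBACK


print(scrub_url("https://hooks.slack.com/services/T000/B000/secret"))
```

Matching on a dot-prefixed suffix means a lookalike such as evil-discord.com is not treated as a Discord host, while any real subdomain of discord.com is.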

To disable the integration, unset Z4J_OTEL_EXPORTER_OTLP_ENDPOINT and restart the brain. The SDK does not need to be uninstalled; an unset endpoint is a complete no-op.

Adapter-side workers (z4j-celery, z4j-django, etc.) and the scheduler companion currently do NOT initialise their own OTel SDK; only the brain process does. Cross-process trace context propagation (so a span starting on an agent connects to a span on the brain) requires a wire-protocol header that is a candidate for a later minor.