Deployment & Container Runtime
SynthOrg publishes six container images to ghcr.io/aureliolo/synthorg-{backend,web,sandbox,sidecar,fine-tune-gpu,fine-tune-cpu}. The CLI manages the backend and web images as Docker Compose services. The sandbox, sidecar, and fine-tune-{gpu,cpu} images are not Compose services: the CLI pre-pulls the sandbox image when --sandbox is enabled, and the backend spawns sandbox, sidecar, and fine-tune containers on demand via the Docker API. Before starting anything, the CLI verifies cosign signatures for all enabled images, both Compose-managed and on-demand.
Images we publish
| Image | Purpose | Base |
|---|---|---|
| backend | SynthOrg orchestration engine (Litestar + uvicorn) | apko-composed Wolfi base (docker/backend/apko.yaml, python-3.14 resolved via apko lockfile); thin docker/backend/Dockerfile layers the uv-built venv on top |
| web | React SPA and built docs, served by Caddy | Pure apko (no Dockerfile); composes caddy + ca-certificates-bundle + melange-built synthorg-web-assets apk + /etc/synthorg/Caddyfile |
| sandbox | Ephemeral agent code execution image spawned on demand by the backend | apko-composed Wolfi base (docker/sandbox/apko.yaml) with busybox and git; fully rootless (UID 10001, cap_drop: ALL). Network enforcement handled by a separate sidecar proxy container |
| sidecar | Transparent network proxy sidecar for sandbox containers | apko-composed Wolfi base (docker/sidecar/apko.yaml) with iptables and busybox; Go binary providing dual-layer DNS + DNAT enforcement of allowed_hosts |
| fine-tune-gpu | Ephemeral embedding fine-tuning container (GPU variant, ~4 GB: torch with bundled CUDA runtime). Default when fine-tuning is enabled. amd64 only; requires an NVIDIA GPU + compatible host driver for practical training speed | apko-composed Wolfi base (docker/fine-tune/apko.yaml) with Python 3.14 + openblas; thin docker/fine-tune/Dockerfile layers torch + sentence-transformers on top with FINE_TUNE_EXTRA=fine-tune-gpu |
| fine-tune-cpu | Ephemeral embedding fine-tuning container (CPU variant, ~1.7 GB: torch without CUDA). Safer default for hosts without an NVIDIA GPU; training is slower. amd64 only | Same base + Dockerfile as fine-tune-gpu; torch comes from download.pytorch.org/whl/cpu via [tool.uv.sources] when built with FINE_TUNE_EXTRA=fine-tune-cpu |
Each published image is signed with cosign keyless via GitHub OIDC in .github/workflows/docker.yml and attested with SLSA Level 3 provenance. The signature is bound to the manifest list digest by the main-push run; on release tag-push the workflow's retag jobs apply the version tags ({{version}}, dev, {{major}}.{{minor}}) to the same digest via docker buildx imagetools create, so every tag of a single commit shares the main-run's signature without re-signing. CycloneDX SBOMs are generated per image and uploaded as GitHub Release artifacts. At pull/start time, cli/internal/verify/verify.go verifies cosign signatures and SLSA provenance (bypassable with --skip-verify); SBOM contents are not validated at runtime.
Dev / not-yet-published images
| Image | Purpose | Base |
|---|---|---|
| desktop | Headless virtual-desktop sandbox the agent drives via the desktop tool (Xvfb + fluxbox + xdotool + scrot, plus Python/Tk for GUI deliverables). Spawned on demand by the backend; the desktop_image_pin setting defaults to ghcr.io/aureliolo/synthorg-desktop:latest | debian:trixie-slim pinned by digest in docker/desktop/Dockerfile. Debian rather than apko/Wolfi because the X11/GUI toolchain (Xvfb, fluxbox, Tk) is packaged for glibc Debian, not Wolfi |
Unlike the published images above, desktop is not yet built or published by .github/workflows/docker.yml (wiring it into the publish + signing matrix is tracked in #2033). As a result it is not cosign-signed or SLSA-attested, and the desktop tool's desktop_image_pin default does not resolve until that work lands. Its base-image digest is still kept fresh by Renovate (the dockerfile manager scans every Dockerfile).
apko-composed base images
The backend, sandbox, and sidecar images use a Hybrid A pattern: apko composes the base image declaratively from Wolfi packages (python-3.14, git, etc.) with exact versions resolved via apko.lock.json, and a thin Dockerfile layers the application on top (FROM apko-base@sha256:..., COPY .venv, COPY src, ENTRYPOINT). The sidecar image adds iptables for DNAT setup but the sandbox image is minimal (no iptables, no elevated privileges). The web image is pure apko (no Dockerfile), composing Caddy plus a melange-packaged static site bundle.
Wolfi is a separate distribution from Alpine. It reuses the apk package format but is built against glibc, not musl, so Python manylinux wheels install natively without source rebuilds and uv runs at full speed. This is the decisive reason Wolfi wins over both Alpine and Debian-slim for our workload.
Reconciliation mechanisms:
| Mechanism | Target | Cadence |
|---|---|---|
| Renovate (Docker ecosystem + digest pinning) | Thin Dockerfile FROM lines (apko-base digest) | Daily |
| apko lock cron (.github/workflows/apko-lock.yml) | docker/*/apko.lock.json (backend, sandbox, sidecar, fine-tune). docker/web/apko.yaml is intentionally skipped: it depends on the workflow-build-time synthorg-web-assets@local melange package, which has no stable upstream to lock against | Weekly (Mon 06:00 UTC); the single fine-tune apko base is shared by both -gpu and -cpu runtime images |
Image verification at launch
```mermaid
flowchart LR
    A[synthorg start] --> B[Resolve tags to digests]
    B --> C[Verify cosign signature]
    C --> D[Verify SLSA provenance]
    D --> E[Write verified digests to state]
    E --> F[Regenerate compose.yml with @digest pins]
    F --> G[docker compose pull backend web]
    G --> H{Sandbox?}
    H -- yes --> I[docker pull sandbox digest ref]
    H -- no --> J[docker compose up -d]
    I --> J
    J --> K[Wait for backend healthy]
```
synthorg start runs cli/internal/verify/verify.go, which resolves each tag to a digest, verifies the cosign signature and SLSA provenance, and writes the verified digest into state.VerifiedDigests. The digest-pinned references are then rendered into compose.yml, so the started containers run exactly the images the CLI verified. --skip-verify bypasses this step for air-gapped environments.
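The pinning step can be illustrated with a short Python sketch. This is not the CLI's implementation (which is Go); the helper names pin_reference and pinned_services are hypothetical, and the input dictionary only stands in for state.VerifiedDigests. It shows the general idea: once a digest is verified, any tag is dropped and the reference is addressed by digest alone.

```python
import re

# Hypothetical helpers for illustration; the real logic lives in the Go CLI.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pin_reference(repository: str, digest: str) -> str:
    """Return a digest-pinned image reference of the form repo@sha256:<hex>."""
    if not _DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    name = repository
    # A colon in the last path segment is a tag; a colon earlier is a registry port.
    if ":" in name.rsplit("/", 1)[-1]:
        name = name[: name.rindex(":")]
    return f"{name}@{digest}"


def pinned_services(verified: dict[str, str], prefix: str) -> dict[str, str]:
    """Map each verified image name to the reference a compose file would use."""
    return {
        image: pin_reference(f"{prefix}-{image}", digest)
        for image, digest in verified.items()
    }
```

Calling pinned_services with a mapping of image names to verified digests and the prefix ghcr.io/aureliolo/synthorg yields references that resolve to the same content regardless of where a tag later moves.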
Sandbox image resolution
When --sandbox is enabled, the CLI verifies the sandbox image alongside the others, pre-pulls it via docker pull <digest-ref> (the sandbox is not a compose service; the backend spawns ephemeral sandbox containers on demand via aiodocker), and passes the digest-pinned reference to the backend container as SYNTHORG_SANDBOX_IMAGE. The backend's DockerSandboxConfig.image field reads this env var as its default via a Pydantic default_factory; explicit YAML under sandboxing.docker.image still wins when set. This keeps the CLI pin and the backend pin version-locked.
The backend gets /var/run/docker.sock mounted read-write (it needs create, start, stop, and exec on the daemon). The sandbox image retains a full shell plus git but no iptables; it is fully rootless (UID 10001, cap_drop: ALL, no-new-privileges, read-only root filesystem). Per-host:port allowed_hosts network enforcement is handled by a separate sidecar proxy container that shares the sandbox's network namespace. The sidecar runs with NET_ADMIN (for iptables DNAT setup) and provides dual-layer enforcement: DNS filtering (allowed hostnames forwarded, denied get NXDOMAIN) and transparent TCP proxying (connections to unauthorized hosts are dropped with TCP RST).
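The two enforcement layers can be modelled as two lookups against the same allow-list. The sketch below is a decision model only, not the sidecar's Go implementation: the AllowList class, its method names, and the assumption that each allowed_hosts entry is a plain host:port string are all illustrative.

```python
from dataclasses import dataclass


def _normalize(hostname: str) -> str:
    # Compare case-insensitively and ignore a trailing root dot.
    return hostname.lower().rstrip(".")


@dataclass(frozen=True)
class AllowList:
    """Parsed allowed_hosts entries, assumed to be 'host:port' strings."""

    entries: frozenset[tuple[str, int]]

    @classmethod
    def parse(cls, allowed_hosts: list[str]) -> "AllowList":
        parsed = set()
        for item in allowed_hosts:
            host, sep, port = item.rpartition(":")
            if not sep or not host:
                raise ValueError(f"expected host:port, got {item!r}")
            parsed.add((_normalize(host), int(port)))
        return cls(frozenset(parsed))

    def dns_decision(self, hostname: str) -> str:
        """Layer 1: forward allowed names, answer NXDOMAIN for the rest."""
        name = _normalize(hostname)
        allowed = any(host == name for host, _ in self.entries)
        return "FORWARD" if allowed else "NXDOMAIN"

    def tcp_decision(self, hostname: str, port: int) -> str:
        """Layer 2: proxy allowed host:port pairs, reset everything else."""
        return "PROXY" if (_normalize(hostname), port) in self.entries else "RST"
```

The point of the second layer is visible in the model: a name that passes DNS filtering is still refused on any port that was not explicitly listed.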
Graceful shutdown
The backend tears down in three stages so requests are not cancelled mid-transaction during a rolling restart (#1600 Phase 3):
- HTTP request drain (25 s budget): RequestDrainMiddleware (src/synthorg/api/drain.py) is wrapped around the Litestar ASGI app as the outermost layer. The first on_shutdown hook flips the drain gate; new requests after that return 503 Service Unavailable with Retry-After: 5, while in-flight requests have up to 25 s to finish. A drain that exceeds the budget is logged at WARNING (api.app.drain.timeout) and service teardown begins regardless. The budget lives at _DRAIN_TIMEOUT_SECONDS in src/synthorg/api/lifecycle.py.
- Service teardown (~26 s typical, 42 s absolute cap): _safe_shutdown runs the per-service shutdown budgets in src/synthorg/api/lifecycle.py in this order: approval timeout (1 s), meeting (2 s), TaskEngine drain (8 s, 17 s outer cap with slack), perf (2 s), backup (5 s), settings (2 s), bridge (2 s), distributed queue (3 s), message bus (3 s), persistence (5 s). Most services return well under their cap in practice, hence the ~26 s typical figure; the 42 s absolute cap fires only when every service simultaneously hits its individual budget.
- Uvicorn graceful close: uvicorn.run is invoked with timeout_graceful_shutdown=75, which covers the drain budget plus the full service teardown sequence with ~8 s headroom over the absolute worst case.
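The first stage can be sketched as a plain ASGI wrapper. This is a minimal illustration, not the project's RequestDrainMiddleware: the class name DrainGateSketch, its attributes, and the drain() method are hypothetical, and only the behaviour described above (503 with Retry-After once the gate flips, bounded wait for in-flight requests) is modelled.

```python
import asyncio


class DrainGateSketch:
    """Minimal ASGI wrapper that rejects new HTTP requests once draining starts."""

    def __init__(self, app, retry_after: int = 5) -> None:
        self.app = app
        self.retry_after = retry_after
        self.draining = False
        self.in_flight = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no requests in flight yet

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        if self.draining:
            # Gate is closed: answer immediately without touching the app.
            await send({
                "type": "http.response.start",
                "status": 503,
                "headers": [(b"retry-after", str(self.retry_after).encode())],
            })
            await send({"type": "http.response.body", "body": b""})
            return
        self.in_flight += 1
        self._idle.clear()
        try:
            await self.app(scope, receive, send)
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.set()

    async def drain(self, timeout: float = 25.0) -> bool:
        """Flip the gate, then wait for in-flight requests; False on timeout."""
        self.draining = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
        except asyncio.TimeoutError:
            return False
        return True
```

A shutdown hook would await drain() and, in line with the behaviour described above, proceed to service teardown whether it returns True or False, logging a warning in the latter case.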
Recommended grace period: 75 s, set as terminationGracePeriodSeconds: 75 on Kubernetes pods and stop_grace_period: 75s in Docker Compose stacks. The realistic budget is 25 (drain) + ~26 (services) ≈ 51 s, and the absolute worst case is 25 + 42 = 67 s if every service hits its individual cap; 75 s leaves ~8 s of headroom over that worst case, so the orchestrator never SIGKILLs the process mid-teardown. Operators that consistently hit drain timeouts should raise the grace period and document the incident motivating the change.
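The figures above follow from summing the per-service caps. The short check below reproduces the arithmetic; the dictionary keys are informal labels chosen for this sketch, while the numbers are the budgets listed in the teardown stage (using TaskEngine's 17 s outer cap for the worst case).

```python
# Per-service shutdown caps in seconds, in teardown order.
SERVICE_CAPS = {
    "approval_timeout": 1,
    "meeting": 2,
    "task_engine": 17,  # 8 s drain budget, 17 s outer cap with slack
    "perf": 2,
    "backup": 5,
    "settings": 2,
    "bridge": 2,
    "distributed_queue": 3,
    "message_bus": 3,
    "persistence": 5,
}

DRAIN_BUDGET = 25   # HTTP request drain
GRACE_PERIOD = 75   # uvicorn timeout_graceful_shutdown and orchestrator grace

service_cap = sum(SERVICE_CAPS.values())   # 42
worst_case = DRAIN_BUDGET + service_cap    # 67
headroom = GRACE_PERIOD - worst_case       # 8

print(f"service cap {service_cap} s, worst case {worst_case} s, headroom {headroom} s")
```

If a service budget is raised, rerunning the same sum shows how much the grace period has to grow to keep the headroom positive.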
Kubernetes example:
```yaml
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 75
  containers:
    - name: backend
      image: ghcr.io/aureliolo/synthorg-backend@sha256:...
```
Docker Compose example:
```yaml
services:
  backend:
    image: ghcr.io/aureliolo/synthorg-backend@sha256:...
    stop_grace_period: 75s
    stop_signal: SIGTERM
```
The drain emits three observability log events documented in docs/design/observability.md § "Telemetry collector lifecycle" (sibling table for api.app.drain.*): api.app.drain.started, api.app.drain.completed, api.app.drain.timeout. Tail those during a deploy to confirm a clean drain.
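A deploy script could classify the drain from captured log lines. The sketch below assumes only that each event name appears verbatim somewhere in its log line; the actual log format is not specified here, and the function name drain_outcome and its return labels are hypothetical.

```python
from collections.abc import Iterable

STARTED = "api.app.drain.started"
COMPLETED = "api.app.drain.completed"
TIMEOUT = "api.app.drain.timeout"


def drain_outcome(lines: Iterable[str]) -> str:
    """Classify a drain as 'clean', 'timeout', 'incomplete', or 'not-started'."""
    seen_started = False
    for line in lines:
        if TIMEOUT in line:
            return "timeout"
        if STARTED in line:
            seen_started = True
        elif COMPLETED in line and seen_started:
            return "clean"
    return "incomplete" if seen_started else "not-started"
```

Feeding it the backend's log output captured during a rolling restart gives a single label that a pipeline step can assert on.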
Web server
The web image runs Caddy inside a pure-apko Wolfi image. Caddy serves the React SPA at / and the built documentation at /docs, proxies REST requests at /api/ and WebSocket connections at /api/v1/ws to the backend, and emits a per-request CSP nonce via the templates directive + {http.request.uuid} placeholder. The full security-header set (CSP, HSTS, X-Frame-Options, Referrer-Policy, Permissions-Policy) is configured in web/Caddyfile. Pre-compressed .gz siblings built by melange are served via Caddy's precompressed gzip file_server option.
See Also
- Tools: sandbox backends, lifecycle strategies
- Backup: persistence snapshots and restore
- Design Overview: full index