salt's pip.installed flagged so_pillar_psycopg2_in_salt_python as
failed because pip exits non-zero when it can't find the patchelf
binary to rewrite the psycopg2 wheel's RPATH after extraction. The
wheel is fully installed and importable regardless — the patchelf
step is a cosmetic post-install rewrite, not a build dependency. But
salt's failure cascade then short-circuited so_pillar_initial_import
and the so-yaml mode flip, leaving the install in dual-pillar mode
instead of PG-canonical.
Replaced it with a cmd.run state that runs pip with `|| true` and uses
an `import psycopg2` check as the actual readiness gate — the same
approach salt's own bootstrap takes. Also fixed the require: ref on
so_pillar_initial_import (it was `pip:`, and must be `cmd:` for the new
state type).
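A minimal sketch of the replacement, reusing the state IDs from the failure above; the pip package name and exact arguments are assumptions:

```yaml
so_pillar_psycopg2_in_salt_python:
  cmd.run:
    - name: |
        /opt/saltstack/salt/bin/python3 -m pip install psycopg2 || true
        /opt/saltstack/salt/bin/python3 -c 'import psycopg2'

so_pillar_initial_import:
  cmd.run:
    - name: so-pillar-import
    - require:
      # was `- pip: ...` before the state-type change
      - cmd: so_pillar_psycopg2_in_salt_python
```

The `|| true` swallows the patchelf exit code; the `-c 'import psycopg2'` line is what actually decides pass/fail.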
Supersedes the pre-install placement (right after secrets_pillar) from
the previous commit, which was broken: salt's ext_pillar overlay
shadowed disk pillar's elasticsearch subtree before so-pillar-import
had populated PG, so elasticsearch.enabled.sls failed rendering on
ELASTICSEARCHMERGED.auth.users.so_elastic_user.pass — that key lives
in elasticsearch/auth.sls, which is on the importer's secrets
allowlist and never makes it into so_pillar.pillar_entry. The install
would then hang forever waiting for the elasticsearch container that
the broken state never deployed.
The new placement is right after the final state.highstate completes:
1. drop adv_postgres.sls, flipping the flag to True
2. salt-call saltutil.refresh_pillar so the next state sees it
3. salt-call state.apply postgres.schema_pillar — deploys schema,
ALTERs role login passwords, installs psycopg2 into salt's
bundled python, runs so-pillar-import, writes
/opt/so/conf/so-yaml/mode=postgres
4. salt-call state.apply salt.master — re-renders engines.conf
with the pg_notify_pillar engine block, drops master.d
ext_pillar config, watch_in restarts salt-master and ext_pillar
takes over
verify_setup runs after this so its final checks see PG-canonical
mode in place. Same end state as the previous commit's intent, just
without the bootstrap chicken-and-egg.
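Step 1's flag file is a one-key pillar override; a sketch of the assumed body (the key path matches the postgres:so_pillar:enabled flag used throughout this series):

```yaml
# /opt/so/saltstack/local/pillar/postgres/adv_postgres.sls
postgres:
  so_pillar:
    enabled: True
```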
Drops a local pillar override (postgres.so_pillar.enabled = True) right
after secrets_pillar so the install-time highstate brings up
schema_pillar, ext_pillar_postgres, and the pg_notify_pillar engine
without operator intervention. Without this the whole PG-canonical
stack stays gated off on the default-False flag and the install lands
in legacy disk-pillar mode — which defeats the point of being on the
postsalt branch at all.
The new enable_so_pillar_postgres() function in so-functions is
idempotent (overwrites adv_postgres.sls with a fixed body) and the
generated file is mode 0644 socore:socore so it merges into pillar
under the existing local-pillar directory ownership convention.
Rollback path: edit /opt/so/saltstack/local/pillar/postgres/adv_postgres.sls
to set enabled: False, or delete the file. The schema and engine
config states will tear themselves down on the next highstate via
their existing else-branch absent states.
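A sketch of what enable_so_pillar_postgres() in so-functions might look like, given the idempotency and mode requirements above; the directory parameter is added here for illustration, and the chown is omitted:

```shell
enable_so_pillar_postgres() {
  # Parameter is for illustration; the real helper writes under
  # /opt/so/saltstack/local/pillar/postgres.
  local pillar_dir="${1:-/opt/so/saltstack/local/pillar/postgres}"
  mkdir -p "$pillar_dir"
  # Unconditionally overwrite with a fixed body: re-running is a no-op.
  cat > "$pillar_dir/adv_postgres.sls" <<'EOF'
postgres:
  so_pillar:
    enabled: True
EOF
  chmod 0644 "$pillar_dir/adv_postgres.sls"
  # The real helper also chowns the file socore:socore.
}
```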
Five blockers turned up the first time the so_pillar schema was applied
against a fresh standalone install. Fixing them in order:
1. 006_rls.sql ordering bug
006 issued GRANTs on so_pillar.change_queue and its sequence, but the
table isn't created until 008_change_notify.sql. 006 errored mid-file with
"relation so_pillar.change_queue does not exist", short-circuiting the
rest of the pillar staging chain. Moved the three change_queue grants
into 008 alongside the table creation so each file is self-contained.
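Sketched shape of 008 after the move; the column list and grant verbs are assumptions, the point is only that the file is self-contained:

```sql
-- 008_change_notify.sql: table and its grants travel together.
CREATE TABLE IF NOT EXISTS so_pillar.change_queue (
    id      bigserial PRIMARY KEY,
    locator text NOT NULL,
    queued  timestamptz NOT NULL DEFAULT now()
);
GRANT SELECT, INSERT, UPDATE, DELETE ON so_pillar.change_queue TO so_pillar_master;
GRANT USAGE ON SEQUENCE so_pillar.change_queue_id_seq TO so_pillar_master;
```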
2. so_pillar_* roles unable to log in
006 created the roles as NOLOGIN and set no password. Salt-master's
ext_pillar (postgres) and the pg_notify_pillar engine both connect as
so_pillar_master via TCP, so both came up with "password authentication
failed for user so_pillar_master". Added a templated cmd.run step in
schema_pillar.sls (so_pillar_role_login_passwords) that ALTERs all three
roles WITH LOGIN PASSWORD pulling from secrets:pillar_master_pass — the
same password ext_pillar_postgres.conf.jinja and the engines.conf
pg_notify_pillar block render with.
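A sketch of the templated step for one of the three roles; the docker-exec psql wrapper is an assumption:

```yaml
so_pillar_role_login_passwords:
  cmd.run:
    - name: >
        docker exec so-postgres psql -U postgres -d securityonion -c
        "ALTER ROLE so_pillar_master WITH LOGIN PASSWORD
        '{{ salt['pillar.get']('secrets:pillar_master_pass') }}'"
```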
3. Missing GRANT CONNECT ON DATABASE securityonion
USAGE on the schema is granted in 006 but CONNECT on the database isn't.
Engine + ext_pillar succeeded auth then died with "permission denied
for database securityonion". Added the explicit GRANT CONNECT in 006.
4. psycopg2 missing from salt's bundled python
/opt/saltstack/salt/bin/python3 doesn't ship psycopg2 by default, so
when salt-master tries to load the pg_notify_pillar engine its
`import psycopg2` fails inside salt's loader and the engine silently
doesn't start (no error in the salt log — you only notice when nothing
ever drains so_pillar.change_queue). Added a pip.installed state in
schema_pillar.sls bound to that interpreter via bin_env.
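The added state, roughly (package name is an assumption; the interpreter binding via bin_env is from the text above):

```yaml
so_pillar_psycopg2_in_salt_python:
  pip.installed:
    - name: psycopg2
    - bin_env: /opt/saltstack/salt/bin/python3
```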
5. engines.conf vs pg_notify_pillar_engine.conf list-replace
Salt's master.d/*.conf merge replaces top-level lists rather than
concatenating them. The engine config used to live in its own
master.d/pg_notify_pillar_engine.conf with `engines: [pg_notify_pillar]`
alongside the legacy `engines.conf` carrying `engines: [checkmine,
pillarWatch]`. Whichever loaded last won, so the engine never showed
up in the loaded set even when the file existed. Fold the
pg_notify_pillar declaration into engines.conf (now jinja-rendered,
gated on postgres:so_pillar:enabled), drop the standalone state from
pg_notify_pillar_engine.sls, and delete the now-orphaned conf jinja.
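After the fold, engines.conf carries one authoritative list; a sketch of the jinja-rendered template (the gating expression is an assumption):

```yaml
engines:
  - checkmine
  - pillarWatch
{%- if salt['pillar.get']('postgres:so_pillar:enabled', False) %}
  - pg_notify_pillar
{%- endif %}
```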
End state validated against a live standalone-net install on the dev rig:
salt-master ext_pillar reads from so_pillar.* with no errors, the
pg_notify_pillar engine LISTENs on so_pillar_change and drains the
change_queue (134-row backlog → 0 within seconds), and a so-yaml replace
on a pillar key flows disk → PG → ext_pillar → salt pillar.get with the
new value visible after a saltutil.refresh_pillar.
Same `sls.split('.')[0]` pattern as ext_pillar_postgres + pg_notify_pillar_engine.
For sls='postgres.schema_pillar' the split happened to evaluate to
'postgres', which is in manager_states, so the guard worked accidentally —
but it would break silently if anyone ever moved the file under a deeper
SLS path. Switch to a literal `{% if 'postgres' in allowed_states %}` for
the same intent-revealing pattern as the master.d guards.
Both SLS files used `sls.split('.')[0]` to derive what to look up in
allowed_states. For these files (sls='salt.master.ext_pillar_postgres'
and sls='salt.master.pg_notify_pillar_engine') that returns 'salt',
which is never in any role's allowed_states list — only specific keys
like 'salt.master', 'salt.minion', 'salt.cloud' are. The guard's else
branch fired on every highstate, emitting two cosmetic

    ID: <sls>_state_not_allowed
    Function: test.fail_without_changes
    Comment: Failure!

entries that polluted the so-setup error summary even on green installs.
Both states drop config under /etc/salt/master.d/ and watch_in the
salt-master service, so the natural intent is "only run when this node
hosts the salt master". Switching the guard to a literal
{% if 'salt.master' in allowed_states %}
expresses that directly without string-parsing the SLS path, and
matches the existing membership in manager_states (which is in turn
included in every manager-bearing role: so-eval, so-manager,
so-managerhype, so-managersearch, so-standalone, so-import).
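Before/after of the guard, sketched:

```jinja
{# Before: sls='salt.master.ext_pillar_postgres' → 'salt', never in
   allowed_states, so the else branch always fires. #}
{% if sls.split('.')[0] in allowed_states %}

{# After: says "only on salt-master hosts" directly. #}
{% if 'salt.master' in allowed_states %}
```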
The unscoped `umask 077` on postsalt's secrets_pillar path leaked into
every subsequent file write by so-setup (and the salt-call processes
it spawned) for the rest of the install. Every state-rendered config
file under /opt/so/conf landed at mode 0600 instead of 0644, which
broke any container that bind-mounts its config read-only and runs as
a non-root user after the entrypoint's gosu drop. The first concrete
casualty was the influxdb container, which exits with
"failed to load config file: open /conf/config.yaml: permission denied"
after init mode completes and re-execs as the influxdb user.
The chmod 0400 immediately after the printf already enforces the
intended file mode, so the umask was redundant for the key file
itself; scoping it to a subshell preserves the defense-in-depth
between the printf and the chmod without polluting the parent shell.
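A sketch of the scoped write; the function name and key path are illustrative:

```shell
write_pillar_key() {
  local keyfile="$1" secret="$2"
  # The subshell confines the umask: the file is born 0600 even in the
  # window before chmod, and the parent shell's umask never changes.
  (
    umask 077
    printf '%s\n' "$secret" > "$keyfile"
    chmod 0400 "$keyfile"
  )
}
```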
Two coupled changes that together let so_pillar.* be the canonical
config store, with config edits driving service reloads automatically:
so-yaml PG-canonical mode
- Adds /opt/so/conf/so-yaml/mode (and SO_YAML_BACKEND env override) with
three values: dual (legacy), postgres (PG-only for managed paths),
disk (emergency rollback). Bootstrap files (secrets.sls, ca/init.sls,
*.nodes.sls, top.sls, ...) stay disk-only regardless via the existing
SkipPath allowlist in so_yaml_postgres.locate.
- loadYaml/writeYaml/purgeFile now route to so_pillar.* in postgres
mode: replace/add/get all read+write the database with no disk file
ever appearing. PG failure is fatal in postgres mode (no silent
fallback); dual mode preserves the prior best-effort mirror.
- so_yaml_postgres gains read_yaml(path), is_pg_managed(path), and
is_enabled() so so-yaml can answer "is this path PG-managed and is
PG up" without reaching into private helpers.
- schema_pillar.sls writes /opt/so/conf/so-yaml/mode = postgres after
the importer succeeds, so flipping postgres:so_pillar:enabled flips
so-yaml's behavior in lockstep with the schema being live.
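The resolution order implied above (env override, then mode file, then legacy default) could look like this; function and constant names are illustrative:

```python
import os

MODE_FILE = "/opt/so/conf/so-yaml/mode"
VALID_MODES = {"dual", "postgres", "disk"}

def resolve_backend(env=None, mode_file=MODE_FILE):
    """Pick the so-yaml backend: SO_YAML_BACKEND beats the mode file,
    and anything missing or unrecognized falls back to legacy dual."""
    env = os.environ if env is None else env
    override = env.get("SO_YAML_BACKEND")
    if override in VALID_MODES:
        return override
    try:
        with open(mode_file) as fh:
            mode = fh.read().strip()
        if mode in VALID_MODES:
            return mode
    except OSError:
        pass
    return "dual"
```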
pg_notify-driven change fan-out
- 008_change_notify.sql adds so_pillar.change_queue + an AFTER trigger
on pillar_entry that enqueues the locator and pg_notifies
'so_pillar_change'. Queue is drained at-least-once so engine restarts
don't lose events; pg_notify is just the wakeup signal.
- New salt-master engine pg_notify_pillar.py LISTENs on the channel,
drains the queue with FOR UPDATE SKIP LOCKED, debounces bursts, and
fires 'so/pillar/changed' events grouped by (scope, role, minion).
- Reactor so_pillar_changed.sls catches the tag and dispatches to
orch.so_pillar_reload, which carries a DISPATCH map of pillar-path
prefix -> (state sls, role grain set) so adding a new service to
the auto-reload list is a one-line edit instead of a new reactor.
- Engine + reactor wiring is gated on the same postgres:so_pillar:enabled
flag as the schema and ext_pillar config so the whole stack flips
on/off together.
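One drain cycle, sketched; column names and batch size are assumptions, and $1 stands for the highest id just processed:

```sql
BEGIN;
SELECT id, locator
  FROM so_pillar.change_queue
 ORDER BY id
   FOR UPDATE SKIP LOCKED
 LIMIT 500;
-- ...debounce, group by (scope, role, minion), fire so/pillar/changed...
DELETE FROM so_pillar.change_queue WHERE id <= $1;
COMMIT;
-- A crash before COMMIT leaves the rows queued: at-least-once by construction.
```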
Tests: 21 new cases (112 total, all passing) covering mode resolution,
PG-managed detection, and PG-canonical read/write/purge routing with
the PG client stubbed.
Hooks every so-yaml.py write through a new so_yaml_postgres helper that
mirrors disk YAML mutations into so_pillar.pillar_entry via docker exec
psql. Disk remains canonical during the transition; PG mirror failures
are logged only when a real write error occurs (skipped paths and
postgres-unreachable cases stay silent so existing callers don't see
new noise on stderr).
Adds a `purge YAML_FILE` verb on so-yaml that deletes the file from
disk and removes the matching pillar_entry rows. For minion files it
also drops the so_pillar.minion row, which CASCADEs to pillar_entry +
role_member. Designed for so-minion's delete path (replaces rm -f) so
the audit log captures the deletion.
setup/so-functions::generate_passwords + secrets_pillar generate
secrets:pillar_master_pass and /opt/so/conf/postgres/so_pillar.key on
fresh installs, and append the password to existing secrets.sls files
on upgrade.
- salt/manager/tools/sbin/so_yaml_postgres.py: locate(), write_yaml(),
purge_yaml(), and a small CLI for diagnostics. Skips bootstrap and
mine-driven paths via the same allowlist used by so-pillar-import.
- salt/manager/tools/sbin/so-yaml.py: import the helper, hook
writeYaml() to mirror after every disk write, add purgeFile() and
the purge verb.
- salt/manager/tools/sbin/so-yaml_test.py: 16 new tests covering the
purge verb and the path-locator / write contract of so_yaml_postgres
without contacting Postgres. All 91 tests pass.
- setup/so-functions: generate_passwords adds PILLARMASTERPASS and
SO_PILLAR_KEY; secrets_pillar writes pillar_master_pass and the
pgcrypto master key file.
Idempotent importer that schema_pillar.sls runs once at the end of the
postgres state on first install, and that so-minion can call per-minion
on add/delete. UPSERTs into so_pillar.pillar_entry; the audit trigger handles
versioning so re-runs without SLS edits produce no version bumps.
Connects via docker exec so-postgres psql, so no DSN config is required
at first-install time. Skips bootstrap files (secrets.sls, postgres/
auth.sls, etc.), mine-driven nodes.sls files, and any file containing
Jinja templates — those stay disk-authoritative and ext_pillar_first:
False means they render before the PG overlay.
Auto-syncs to /usr/sbin via the existing manager_sbin file.recurse.
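The per-entry UPSERT presumably takes a shape like this (column names and conflict target are assumptions); an IS DISTINCT FROM guard is one way no-change re-runs stay version-neutral:

```sql
INSERT INTO so_pillar.pillar_entry (scope, locator, value)
VALUES ($1, $2, $3::jsonb)
ON CONFLICT (scope, locator) DO UPDATE
   SET value = EXCLUDED.value
 WHERE pillar_entry.value IS DISTINCT FROM EXCLUDED.value;
```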
Lays the database-backed pillar foundation for the postsalt branch. Salt
continues to read on-disk SLS first; the new ext_pillar config overlays
values from the so_pillar.* schema in so-postgres.
- salt/postgres/files/schema/pillar/00{1..7}_*.sql: idempotent DDL for
scope/role/role_member/minion/pillar_entry/pillar_entry_history/
drift_log, secret pgcrypto helpers, RLS, pg_cron retention.
- salt/postgres/schema_pillar.sls: applies the SQL files inside the
so-postgres container after it's healthy, configures the master_key
GUC, and runs so-pillar-import once. Gated on
postgres:so_pillar:enabled feature flag (default false).
- salt/salt/master/ext_pillar_postgres.{sls,conf.jinja}: drops
/etc/salt/master.d/ext_pillar_postgres.conf with list-form ext_pillar
queries (global/role/minion/secrets) and ext_pillar_first: False so
bootstrap pillars on disk render before the PG overlay.
- salt/postgres/init.sls + salt/salt/master.sls: include the new states.
Both new state branches are guarded so a default install with the flag
off is a no-op.
The static defaults listed postgres only on each role's self-hostgroup,
leaving sensor/searchnode/heavynode/receiver/fleet/idh/desktop/hypervisor
hostgroups unable to reach the manager's so-postgres in distributed
grids. A dynamic block in firewall/map.jinja added postgres to those
hostgroups only when telegraf.output was switched to POSTGRES/BOTH,
which left postgres unreachable by default.
Add postgres statically across manager/managerhype/managersearch/
standalone, mirroring influxdb's placement in every hostgroup that
already lists influxdb, and drop the now-redundant telegraf-gated
dynamic block from firewall/map.jinja.
When so-postgres was wired in (868cd1187), the import role's firewall
defaults were missed while every other manager-class role (manager,
managerhype, managersearch, standalone, eval) had postgres added to
their DOCKER-USER manager-hostgroup portgroups. As a result, on a
fresh import install the so-postgres container starts but tcp/5432 is
dropped at DOCKER-USER, so soc/kratos/telegraf can't reach it.
Add postgres alongside the existing influxdb entry so import nodes
match the other roles.
The soc binary on 3/dev does not register a postgres module, so injecting
postgres into soc.config.server.modules makes soc abort at launch with
'Module does not exist: postgres'. The soc-side module is staged on
feature/postgres but is not landing this release. Drop the injection
until the module ships; salt/postgres state and pillars are unchanged.
The digest-pull logic was added to make `docker push` work for multi-arch
upstream tags. Now that the push step is `docker buildx imagetools create`
pinned to the gpg-verified RepoDigest, the registry-to-registry copy
handles single- and multi-arch sources without help. Reverts the pull
back to the original line and removes the unused PLATFORM_OS/_ARCH
detection.
Replaces `docker push` with a registry-to-registry copy. On Docker 29.x
with the containerd image store, `docker push` of a freshly-pulled image
hits a path that wraps single-platform manifests in a synthetic index
and then can't push the layers it claims to reference, producing
`NotFound: content digest ...` even when the image is fully present.
Keep the local `docker tag` so so-image-pull's `docker images | grep :5000`
existence check continues to work.
docker pull of a multi-arch tag on Docker 29.x leaves the local tag
pointing at the image index rather than the platform-specific manifest.
The subsequent docker push then tries to push every sub-manifest the
index references and fails on layers we never fetched.
Resolve the local-platform manifest digest from the upstream index via
docker buildx imagetools inspect, pull by that digest, and re-tag locally
to the canonical tag. The signing flow and the existing tag/push to the
embedded registry are unchanged.
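The re-tag flow boils down to pinning repo@digest; a sketch with placeholder values (the real digest comes from `docker buildx imagetools inspect`, not run here):

```shell
# Real flow (commands shown for context only):
#   docker buildx imagetools inspect "$TAG"   # pick the local-platform digest
#   docker pull "$REPO@$DIGEST"
#   docker tag  "$REPO@$DIGEST" "$TAG"
TAG="registry.example.com/so-nginx:2.4.x"   # placeholder tag
DIGEST="sha256:deadbeef"                    # placeholder (truncated) digest
REPO="${TAG%:*}"                            # strip the tag suffix
PINNED="${REPO}@${DIGEST}"
echo "$PINNED"
```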