Two coupled changes that together let so_pillar.* be the canonical
config store, with config edits driving service reloads automatically:
so-yaml PG-canonical mode
- Adds /opt/so/conf/so-yaml/mode (and SO_YAML_BACKEND env override) with
three values: dual (legacy), postgres (PG-only for managed paths),
disk (emergency rollback); the resolution order is sketched after this
list. Bootstrap files (secrets.sls, ca/init.sls, *.nodes.sls, top.sls,
...) stay disk-only regardless, via the existing SkipPath allowlist in
so_yaml_postgres.locate.
- loadYaml/writeYaml/purgeFile now route to so_pillar.* in postgres
mode: replace/add/get all read+write the database with no disk file
ever appearing. PG failure is fatal in postgres mode (no silent
fallback); dual mode preserves the prior best-effort mirror.
- so_yaml_postgres gains read_yaml(path), is_pg_managed(path), and
is_enabled() so so-yaml can answer "is this path PG-managed and is
PG up" without reaching into private helpers.
- schema_pillar.sls writes /opt/so/conf/so-yaml/mode = postgres after
the importer succeeds, so flipping postgres:so_pillar:enabled flips
so-yaml's behavior in lockstep with the schema being live.
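
A minimal sketch of the resolution order, assuming the env override wins
over the mode file; the function name is illustrative:

    import os

    MODE_FILE = "/opt/so/conf/so-yaml/mode"
    VALID = {"dual", "postgres", "disk"}

    def resolve_mode():
        # The env var lets an operator force a backend per invocation.
        mode = os.environ.get("SO_YAML_BACKEND")
        if not mode and os.path.isfile(MODE_FILE):
            with open(MODE_FILE) as f:
                mode = f.read().strip()
        # Unknown or missing values fall back to the legacy dual behavior.
        return mode if mode in VALID else "dual"
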
pg_notify-driven change fan-out
- 008_change_notify.sql adds so_pillar.change_queue + an AFTER trigger
on pillar_entry that enqueues the locator and pg_notifies
'so_pillar_change'. Queue is drained at-least-once so engine restarts
don't lose events; pg_notify is just the wakeup signal.
- New salt-master engine pg_notify_pillar.py LISTENs on the channel,
drains the queue with FOR UPDATE SKIP LOCKED, debounces bursts, and
fires 'so/pillar/changed' events grouped by (scope, role, minion); the
core loop is sketched after this list.
- Reactor so_pillar_changed.sls catches the tag and dispatches to
orch.so_pillar_reload, which carries a DISPATCH map of pillar-path
prefix -> (state sls, role grain set) so adding a new service to
the auto-reload list is a one-line edit instead of a new reactor.
- Engine + reactor wiring is gated on the same postgres:so_pillar:enabled
flag as the schema and ext_pillar config so the whole stack flips
on/off together.
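
A sketch of the engine's core loop, assuming psycopg2, the Salt-injected
__opts__, and the master event bus; change_queue columns beyond the
locator, and the per-(scope, role, minion) grouping, are elided:

    import select
    import time

    import psycopg2
    import salt.utils.event

    def start(dsn, debounce=2.0):
        listener = psycopg2.connect(dsn)
        listener.autocommit = True
        listener.cursor().execute("LISTEN so_pillar_change;")
        worker = psycopg2.connect(dsn)
        bus = salt.utils.event.get_master_event(__opts__, __opts__["sock_dir"])
        while True:
            # pg_notify is only the wakeup; the timeout means a restarted
            # engine re-drains anything queued while it was down.
            select.select([listener], [], [], 30)
            listener.poll()
            del listener.notifies[:]
            time.sleep(debounce)  # coalesce bursts into one drain
            with worker, worker.cursor() as cur:
                cur.execute(
                    """DELETE FROM so_pillar.change_queue
                        WHERE id IN (SELECT id FROM so_pillar.change_queue
                                      FOR UPDATE SKIP LOCKED)
                    RETURNING locator""")
                locators = [row[0] for row in cur.fetchall()]
                if locators:
                    # Fire before the delete commits: a crash here means a
                    # duplicate event, never a lost one (at-least-once).
                    bus.fire_event({"locators": locators}, "so/pillar/changed")

The DISPATCH map in orch.so_pillar_reload then only needs entries of this
shape (the entries here are examples, not the shipped list):

    DISPATCH = {
        # pillar-path prefix -> (state sls to apply, role grains to target)
        "telegraf": ("telegraf", {"so-manager", "so-sensor"}),
        "postgres": ("postgres", {"so-manager"}),
    }
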
Tests: 21 new cases (112 total, all passing) covering mode resolution,
PG-managed detection, and PG-canonical read/write/purge routing with
the PG client stubbed.
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). The two drifted on first boot, and there was
no trigger to create DB roles when a new minion joined.
Split responsibilities:
- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
flock, atomic YAML write, PyYAML safe_dump so passwords never
round-trip through so-yaml.py's type coercion. Idempotent add, quiet
remove (write path sketched after this list).
- so-minion's add/remove hooks now shell out to so-telegraf-cred
instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
refreshes the new minion's pillar before the new node highstates,
so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
keys (idempotent) and runs postgres.telegraf_users once to reconcile
the DB.
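
The single-writer pattern in so-telegraf-cred, sketched with PyYAML; the
creds path and top-level key names are illustrative:

    import fcntl
    import os
    import tempfile

    import yaml

    CREDS = "/opt/so/saltstack/local/pillar/telegraf/creds.sls"  # assumed path

    def add_cred(minion_id, user, password):
        with open(CREDS, "a+") as lockf:
            fcntl.flock(lockf, fcntl.LOCK_EX)  # serialize all writers
            lockf.seek(0)
            data = yaml.safe_load(lockf) or {}
            creds = data.setdefault("telegraf", {}).setdefault("creds", {})
            if minion_id in creds:
                return  # idempotent add
            creds[minion_id] = {"user": user, "pass": password}
            # safe_dump keeps the password a plain string (no type coercion);
            # the rename means readers see the old or new file, never half.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CREDS))
            with os.fdopen(fd, "w") as out:
                yaml.safe_dump(data, out, default_flow_style=False)
            os.replace(tmp, CREDS)
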
Simpler, race-free replacement for the reactor + orch + fan-out chain.
- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
generate a random 72-char password, reuse any existing password from
the aggregate pillar, write postgres.telegraf.{user,pass} into the
minion's own pillar file, and update the aggregate pillar so
postgres.telegraf_users can CREATE ROLE on the next manager apply.
Every create<ROLE> function already calls this hook, so add / addVM /
setup dispatches are all covered identically and synchronously (password
handling sketched below).
- salt/postgres/auth.sls: strip the fanout_targets loop and the
postgres_telegraf_minion_pillar_<safe> cmd.run block — it's now
redundant. The state still manages the so_postgres admin user and
writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
registered the reactor on /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
step and the require: it added to the newnode highstate. Back to its
original 3/dev shape.
No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
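
The reuse-or-generate step, sketched in Python (so-minion itself is
shell); the 72-char length is from the change, the charset is an
assumption:

    import secrets
    import string

    ALPHABET = string.ascii_letters + string.digits  # assumed charset

    def telegraf_password(existing):
        # Reusing the aggregate-pillar entry keeps the minion pillar and
        # the CREATE ROLE password in agreement across repeated adds.
        if existing:
            return existing
        return "".join(secrets.choice(ALPHABET) for _ in range(72))
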
Two fixes on the postgres telegraf fan-out path:
1. postgres.auth cmd.run leaked the password to the console because
Salt always prints the Name: field and `show_changes: False` does
not apply to cmd.run. Move the user and password into the `env:`
attribute so the shell body still sees them via $PG_USER / $PG_PASS
but Salt's state reporter never renders them (pattern sketched after
this list).
2. so-minion's addMinion -> setupMinionFiles sequence removes the
minion pillar file and rewrites it from scratch, which wipes the
postgres.telegraf.* entries the reactor may have already written on
salt-key accept. Add a postgres.auth fan-out step to
orch.deploy_newnode (the orch so-minion kicks off after
setupMinionFiles) and require it from the new minion's highstate.
Idempotent via the existing unless: guard in postgres.auth.
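
Why the `env:` move works, shown with the same pattern outside Salt (a
sketch; the role name, password, and SQL are illustrative): the secret
travels through the environment, so anything that prints the command
text renders only the variable names.

    import os
    import subprocess

    cmd = """psql -c "ALTER ROLE \\"$PG_USER\\" LOGIN PASSWORD '$PG_PASS'" """

    # A process listing or a state report shows the literal $PG_USER and
    # $PG_PASS; only the shell running the body expands them.
    subprocess.run(
        cmd, shell=True, check=True,
        env={**os.environ, "PG_USER": "so_telegraf_node1", "PG_PASS": "s3cret"},
    )
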
postgres.auth was running an `unless` shell check for every up minion on
every manager highstate, even when nothing had changed; N fork+python
startups of so-yaml.py add up on large grids. The work is only needed
when a specific minion's key is accepted.
- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
pillar is set (targets that single minion). Manager highstates with
no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
the per-minion state.apply and do an in-shell loop over the minion
pillar files using so-yaml.py directly. Skips minions that already
have postgres.telegraf.user set (guard sketched below).
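
The backfill guard, sketched here (soup does this in shell; so-yaml.py's
argv shape is an assumption based on its get/add operations):

    import glob
    import subprocess

    for pfile in glob.glob("/opt/so/saltstack/local/pillar/minions/*.sls"):
        probe = subprocess.run(
            ["so-yaml.py", "get", pfile, "postgres.telegraf.user"],
            capture_output=True, text=True)
        if probe.returncode == 0 and probe.stdout.strip():
            continue  # already backfilled; skipping keeps the loop idempotent
        # ...generate creds here and write them with so-yaml.py add...
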
The previous MANAGER resolution used pillar.get('setup:manager') with a
fallback to grains.get('master'). Neither works from the reactor:
setup:manager is only populated by the setup workflow (not by reactor
runs), and grains.master returns the minion's master-hostname setting,
not a targetable minion id.
Match the pattern used by orch/delete_hypervisor.sls: compound-target
whichever minion is the manager via role grain.
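
In LocalClient terms the targeting change looks like this (the role
grain value is illustrative):

    import salt.client

    local = salt.client.LocalClient()
    # Compound-target the manager by role grain instead of resolving a
    # name from the setup:manager pillar or grains.master, neither of
    # which is correct from a reactor run.
    local.cmd("G@role:so-manager", "state.apply",
              ["postgres.auth"], tgt_type="compound")
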
New minions run highstate as part of onboarding, which already applies
the telegraf state with the fresh pillar entry we just wrote. Pushing
telegraf a second time from the reactor is redundant.
- Remove the MINION-scoped salt.state block from the orch; keep only
the manager-side postgres.auth + postgres.telegraf_users provisioning.
- Stop passing minion_id as pillar in the reactor; the orch doesn't
reference it anymore.
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, limiting the blast radius of any one credential to that
minion's data.
- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
  (sketched after this list)
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
postgres.telegraf.retention_days (default 14)
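
A sketch of the reconcile that postgres/telegraf_users.sls performs, in
psycopg2 form; the creds map shape follows the pillar described above,
and naming the schema after the role is an assumption:

    import psycopg2
    from psycopg2 import sql

    def reconcile(conn, creds):
        # creds: {minion_id: {"user": ..., "pass": ...}} from the grid pillar
        with conn, conn.cursor() as cur:
            for minion_id, cred in creds.items():
                role = sql.Identifier(cred["user"])  # so_telegraf_<minion>
                cur.execute("SELECT 1 FROM pg_roles WHERE rolname = %s",
                            (cred["user"],))
                stmt = ("ALTER ROLE {} WITH LOGIN PASSWORD %s" if cur.fetchone()
                        else "CREATE ROLE {} WITH LOGIN PASSWORD %s")
                cur.execute(sql.SQL(stmt).format(role), (cred["pass"],))
                # One schema per minion inside the shared so_telegraf
                # database, owned by that minion's role.
                cur.execute(sql.SQL(
                    "CREATE SCHEMA IF NOT EXISTS {} AUTHORIZATION {}"
                ).format(role, role))
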