The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). The two copies drifted on first boot, and
there was no trigger to create DB roles when a new minion joined.
Split responsibilities:
- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer. It
takes an flock, writes the YAML atomically, and serializes with PyYAML
safe_dump so passwords never round-trip through so-yaml.py's type
coercion. Add is idempotent; remove is quiet.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and creates or
alters roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
refreshes the new minion's pillar before the new node highstates,
so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
keys (idempotent) and runs postgres.telegraf_users once to reconcile
the DB.
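The single-writer pattern described above (lock, idempotent add, atomic replace) can be sketched roughly as follows. This is not the actual so-telegraf-cred code. The function name `update_creds`, the `.lock` sidecar file, and the use of JSON in place of PyYAML `safe_dump` (to keep the sketch stdlib-only) are all assumptions made for illustration.

```python
import fcntl
import json
import os
import tempfile


def update_creds(path, minion_id, user, password):
    """Add one credential under an exclusive lock; return True if the file changed.

    Hypothetical helper: the real tool writes YAML via PyYAML safe_dump.
    """
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent writers
        try:
            with open(path) as fh:
                creds = json.load(fh)
        except FileNotFoundError:
            creds = {}
        if minion_id in creds:
            return False  # idempotent add: the existing password is kept
        creds[minion_id] = {"user": user, "pass": password}
        # Write a temp file in the same directory, then rename it over the
        # target so readers never observe a partially written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as out:
                json.dump(creds, out)
                out.flush()
                os.fsync(out.fileno())
            os.replace(tmp, path)  # atomic on POSIX within one filesystem
        except BaseException:
            os.unlink(tmp)
            raise
        return True
```

Calling the helper twice for the same minion leaves the first password in place, which is the property the add/remove hooks rely on when they are re-run.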
Simpler, race-free replacement for the reactor + orch + fan-out chain.
- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
generate a random 72-char password, reuse any existing password from
the aggregate pillar, write postgres.telegraf.{user,pass} into the
minion's own pillar file, and update the aggregate pillar so
postgres.telegraf_users can CREATE ROLE on the next manager apply.
Every create<ROLE> function already calls this hook, so add / addVM /
setup dispatches are all covered identically and synchronously.
- salt/postgres/auth.sls: strip the fanout_targets loop and the
postgres_telegraf_minion_pillar_<safe> cmd.run block, both now
redundant. The state still manages the so_postgres admin user and
writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
registered the reactor in /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
step and the require: it added to the newnode highstate. Back to its
original 3/dev shape.
No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
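The "generate a 72-char password, but reuse any existing one" step could look roughly like this. so-minion is a shell script, so this Python version only illustrates the logic. The function name, the alphanumeric alphabet, and the `{minion_id: {"pass": ...}}` shape of the aggregate are assumptions.

```python
import secrets
import string

# Assumed character set; the source only states the length (72).
ALPHABET = string.ascii_letters + string.digits


def telegraf_password(minion_id, aggregate, length=72):
    """Reuse the aggregate pillar's password for this minion, else generate one."""
    existing = aggregate.get(minion_id, {}).get("pass")
    if existing:
        return existing  # keeps the minion pillar and the DB role in sync
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Reusing the stored value is what makes repeated add / addVM / setup runs safe: a re-run never rotates a password that the DB role already uses.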
Two fixes on the postgres telegraf fan-out path:
1. postgres.auth cmd.run leaked the password to the console because
Salt always prints the Name: field and `show_changes: False` does
not apply to cmd.run. Move the user and password into the `env:`
attribute so the shell body still sees them via $PG_USER / $PG_PASS
but Salt's state reporter never renders them.
2. so-minion's addMinion -> setupMinionFiles sequence removes the
minion pillar file and rewrites it from scratch, which wipes the
postgres.telegraf.* entries the reactor may have already written on
salt-key accept. Add a postgres.auth fan-out step to
orch.deploy_newnode (the orch so-minion kicks off after
setupMinionFiles) and require it from the new minion's highstate.
Idempotent via the existing unless: guard in postgres.auth.
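The idea behind fix 1 is general: keep secrets out of the command text that gets logged or displayed, and hand them to the shell through the environment instead. The sketch below shows the same principle with Python's `subprocess` rather than Salt's `cmd.run`. The helper name and the sample values are hypothetical.

```python
import subprocess


def run_with_secret_env(script, user, password):
    """Run a shell snippet with credentials in the environment, not in argv."""
    env = {"PATH": "/usr/bin:/bin", "PG_USER": user, "PG_PASS": password}
    return subprocess.run(
        ["/bin/sh", "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


# The script body refers to the variables by name only; here it prints the
# user and the password length, never the password itself.
result = run_with_secret_env(
    'printf "%s:%s" "$PG_USER" "${#PG_PASS}"', "so_telegraf_sensor1", "s3cret"
)
```

Anything that renders the command (`result.args` here, the state name in Salt) shows only `$PG_USER` and `$PG_PASS`, not their values.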
postgres.auth was running an `unless` shell check per up-minion on every
manager highstate, even when nothing had changed. N fork+python starts
of so-yaml.py add up on large grids. The work is only needed when a
specific minion's key is accepted.
- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
pillar is set (targets that single minion). Manager highstates with
no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
the per-minion state.apply and do an in-shell loop over the minion
pillar files using so-yaml.py directly. Skips minions that already
have postgres.telegraf.user set.
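The skip rule for the one-time backfill can be sketched as a filter over already-loaded minion pillars. The real soup step is an in-shell loop that calls so-yaml.py. The function name and the in-memory dict input here are assumptions for illustration.

```python
def minions_needing_backfill(pillars):
    """Return ids whose pillar lacks postgres.telegraf.user, in sorted order.

    `pillars` maps minion id -> parsed pillar dict (hypothetical input shape).
    """
    pending = []
    for minion_id, pillar in sorted(pillars.items()):
        telegraf = (pillar.get("postgres") or {}).get("telegraf") or {}
        if not telegraf.get("user"):
            pending.append(minion_id)
    return pending
```

Because minions that already have a user are skipped, re-running the backfill after a partial upgrade only touches the remaining minions.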
The previous MANAGER resolution used pillar.get('setup:manager') with a
fallback to grains.get('master'). Neither works from the reactor:
setup:manager is only populated by the setup workflow (not by reactor
runs), and grains.master returns the minion's master-hostname setting,
not a targetable minion id.
Match the pattern used by orch/delete_hypervisor.sls: compound-target
whichever minion is the manager via role grain.
New minions run highstate as part of onboarding, which already applies
the telegraf state with the fresh pillar entry we just wrote. Pushing
telegraf a second time from the reactor is redundant.
- Remove the MINION-scoped salt.state block from the orch; keep only
the manager-side postgres.auth + postgres.telegraf_users provisioning.
- Stop passing minion_id as pillar in the reactor; the orch doesn't
reference it anymore.
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, which limits each credential's blast radius to that minion's data.
- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
postgres.telegraf.retention_days (default 14)
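As a rough illustration of two settings above, the sketch below maps global.telegraf_output to the outputs it enables and computes the cutoff timestamp for a retention window. Both function names are hypothetical; the actual logic lives in the Salt states and so-telegraf-trim, which are not shown here.

```python
from datetime import datetime, timedelta, timezone

VALID_MODES = ("INFLUXDB", "POSTGRES", "BOTH")


def telegraf_outputs(mode="BOTH"):
    """Map the telegraf_output setting to the outputs that should be enabled."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"telegraf_output must be one of {VALID_MODES}")
    return {
        "influxdb": mode in ("INFLUXDB", "BOTH"),
        # Postgres being active is also the condition under which 5432 is opened.
        "postgres": mode in ("POSTGRES", "BOTH"),
    }


def retention_cutoff(now, retention_days=14):
    """Rows older than the returned timestamp are candidates for pruning."""
    return now - timedelta(days=retention_days)
```

With the default of 14 days, a trim run at 2024-01-15 00:00 UTC would prune rows older than 2024-01-01 00:00 UTC.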