A complete mine is not enough: elasticsearch:nodes, redis:nodes,
logstash:nodes (tgt_type=pillar) and hypervisor:nodes (tgt_type=compound)
resolve their target against the master's per-minion data cache
(grains+pillar in data.p), which is populated only when a minion's pillar
is recompiled -- separately from the mine. After a reboot a node can be in
the mine (so node_data/glob sees it) yet absent from that cache, so it
fails the elasticsearch:enabled:true pillar match and is dropped from
elasticsearch:nodes, which changes so-elasticsearch ExtraHosts and forces
a container recreate.
After the mine-completeness wait, run salt '*' saltutil.refresh_pillar
wait=True to synchronously cache every up node's pillar (the same lever
deploy_newnode.sls uses), then verify with salt-run cache.pillar and retry
stragglers, bounded by MINE_UPDATE_MAX_WAIT. Also log elasticsearch:nodes
alongside node_data for inspection.
Mine-backed pillars (node_data, elasticsearch:nodes, redis:nodes,
logstash:nodes, hypervisor:nodes) include a node only if it returned an
IP from the mine, and the configs they build are rebuilt fresh every
highstate. After a manager reboot with a flushed mine, the first boot
highstate could run before an up node re-reported network.ip_addrs,
dropping it from e.g. so-elasticsearch ExtraHosts and forcing a
container recreate.
After the initial broad mine.update, poll until every currently-up
minion actually has network.ip_addrs in the mine, re-pushing mine.update
to stragglers, before releasing the boot highstate. Shares the existing
MINE_UPDATE_MAX_WAIT backstop so a slow/down node never blocks boot, and
still logs the rendered node_data for inspection.
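
A minimal Python sketch of the polling loop described above. This is not
the helper's actual code: the three callables are hypothetical stand-ins
for "list up minions", "list minions with network.ip_addrs in the mine"
and "push mine.update", and max_wait plays the role of
MINE_UPDATE_MAX_WAIT.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_mine(
    up_minions: Callable[[], Iterable[str]],
    minions_in_mine: Callable[[], Iterable[str]],
    push_mine_update: Callable[[Set[str]], None],
    max_wait: float,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll until every up minion is in the mine or max_wait elapses.

    Returns the up minions still missing from the mine on exit; an empty
    set means the mine is complete for the currently-up set.
    """
    deadline = clock() + max_wait
    while True:
        missing = set(up_minions()) - set(minions_in_mine())
        if not missing or clock() >= deadline:
            return missing
        # Re-push only to the stragglers, not to the whole grid.
        push_mine_update(missing)
        sleep(interval)
```

Because the up set is re-read on every pass, a node that goes down while
waiting simply falls out of the comparison instead of holding up boot.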
Dump the actual rendered node_data pillar (pretty-printed JSON) to the
journal instead of just a rendered/empty verdict, so the boot-time render
attempt is fully inspectable. Empty renders print false/null and still
emit the WARNING.
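
For illustration only, the logging behaviour above could look roughly
like this in Python. The logger name and the function are hypothetical,
and the real helper may do this differently (e.g. in shell).

```python
import json
import logging

log = logging.getLogger("so-boot-mine-update")  # hypothetical logger name


def log_node_data(node_data) -> bool:
    """Log the rendered pillar as pretty-printed JSON; warn when empty."""
    # json.dumps renders False as "false" and None as "null", so an empty
    # render is still printed rather than hidden behind a verdict.
    log.info("rendered node_data pillar:\n%s", json.dumps(node_data, indent=2))
    populated = bool(node_data)
    if not populated:
        log.warning("node_data rendered empty")
    return populated
```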
After the boot-time mine.update, have the manager actually render the
node_data pillar and log whether it came back populated. node_data: False
makes salt/top.sls apply the bootstrap recovery branch instead of the
manager's real config, so surfacing this in the journal makes the
condition visible before so-boot-highstate runs. Best-effort and
non-blocking: always exits 0 so highstate proceeds regardless.
so-boot-mine-update.service is a manager-only Type=oneshot unit that runs
once per boot after salt-master/salt-minion start and before
so-boot-highstate.service. It pushes mine.update to all reachable minions
so mine-backed pillars (node IPs, ES/Redis/Logstash discovery) are fresh
before the boot highstate renders them.
The helper waits for the responsive minion set to settle (plateau) rather
than for every accepted key to report up, so an intentionally powered-off
minion doesn't block the update; MAX_WAIT remains as a backstop.
Before removing this from the apply_hotfix function, first verify that
installs older than 3.1.0 can still upgrade when they reference
'so/0013_input_lumberjack_fleet.conf' via pillar. Otherwise logstash will
fail to start.
The manager's /etc/salt/minion (written by so-functions:configure_minion)
has no file_roots, so salt-call --local falls back to Salt's default
/srv/salt and fails with "No matching sls found for 'postgres.telegraf_users'
in env 'base'". || true was silently swallowing the error, which meant the
DB roles for the pillar entries just populated by the so-telegraf-cred
backfill loop never actually got created.
Route through salt-master instead; its file_roots already points at the
default/local salt trees.
pillar/top.sls now references postgres.soc_postgres / postgres.adv_postgres
unconditionally, but make_some_dirs only runs at install time so managers
upgrading from 3.0.0 have no local/pillar/postgres/ and salt-master fails
pillar render on the first post-upgrade restart. Similarly, secrets_pillar
is a no-op on upgrade (secrets.sls already exists), so secrets:postgres_pass
never gets seeded and the postgres container's POSTGRES_PASSWORD_FILE and
SOC's PG_ADMIN_PASS would land empty after highstate.
Add ensure_postgres_local_pillar and ensure_postgres_secret to up_to_3.1.0
so the stubs and secret exist before masterlock/salt-master restart. Both
are idempotent and safe to re-run.
Exercises the FileNotFoundError and generic-exception branches added to
loadYaml in the previous commit, restoring 100% coverage required by
the build.
so-telegraf-cred was committed with mode 644, causing
`so-telegraf-cred add "$MINION_ID"` in so-minion's add_telegraf_to_minion
to fail with "Permission denied" and log "Failed to provision postgres
telegraf cred for <minion>". Mark it executable.
Also bail early in seed_creds_file if mkdir/printf/chmod fail, and in
so-yaml.py loadYaml surface a clear stderr message with the filename
instead of an unhandled FileNotFoundError traceback.
Swap the ~150-line Python implementation for a 48-line bash script that
delegates YAML mutation to so-yaml.py — the same helper so-minion and
soup already use. Same semantics: seed the creds pillar on first use,
idempotent add, silent remove.
SO minion ids are dot-free by construction (setup/so-functions:1884
strips everything after the first '.'), so using the raw id as the
so-yaml.py key path is safe.
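
To illustrate why the dot matters, here is a small Python sketch. Both
function names are hypothetical; the real stripping happens in
setup/so-functions, and the dotted-path splitting is only assumed to
behave like a plain split on '.'.

```python
def short_minion_id(hostname: str) -> str:
    """Keep only the part before the first '.', as described above."""
    return hostname.split(".", 1)[0]


def key_segments(dotted_key: str) -> list:
    """Split a dotted key path into its nesting levels."""
    return dotted_key.split(".")


# A raw FQDN would be read as three nested keys rather than one id,
# while the stripped id stays a single path segment.
fqdn_levels = key_segments("sensor1.example.com")
safe_levels = key_segments(short_minion_id("sensor1.example.com"))
```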
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). They drifted on first-boot and there was
no trigger to create DB roles when a new minion joined.
Split responsibilities:
- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
  admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
  pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
  flock, atomic YAML write, PyYAML safe_dump so passwords never
  round-trip through so-yaml.py's type coercion. Idempotent add, quiet
  remove.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
  instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
  roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
  refreshes the new minion's pillar before the new node highstates,
  so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
  keys (idempotent) and runs postgres.telegraf_users once to reconcile
  the DB.
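
The "flock, atomic write" pattern named in the single-writer bullet can
be sketched as follows. This is an assumption-laden illustration, not
so-telegraf-cred itself: the ".lock" suffix, the 0o640 mode and the
function name are hypothetical, and serialization (PyYAML safe_dump in
the real tool) is left to the caller, who passes the final text.

```python
import fcntl
import os
import tempfile


def locked_atomic_write(path: str, text: str, mode: int = 0o640) -> None:
    """Write text to path under an exclusive lock, via temp file + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path + ".lock", "w") as lock:  # hypothetical lock file name
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until we hold the lock
        fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(text)
                handle.flush()
                os.fsync(handle.fileno())
            os.chmod(tmp, mode)
            # Rename within one filesystem is atomic: readers see either
            # the old file or the new one, never a partial write.
            os.replace(tmp, path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
    # Closing the lock file releases the flock.
```

The temp file is created in the target directory so the final rename
never crosses a filesystem boundary.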
The reactor path is gone; so-minion now owns add/delete for new
minions. The backfill itself is unchanged — postgres.auth's up_minions
fallback fills the aggregate, postgres.telegraf_users creates the
roles, and the bash loop fans to per-minion pillar files — so the
pre-feature upgrade story still works end-to-end. Just refresh the
comment so it isn't misleading.
Paired with the add path in add_telegraf_to_minion: when a minion is
removed, drop its entry from the aggregate postgres pillar and drop the
matching so_telegraf_<safe> role from the database. Without this, stale
entries and DB roles accumulate over time.
Makes rotate-password and compromise-recovery both a clean delete+add:
    so-minion -o=delete -m=<id>
    so-minion -o=add -m=<id>
The first call drops the role and clears the aggregate pillar; the
second generates a brand-new password.
The cleanup is best-effort — if so-postgres isn't running or the DROP
ROLE fails (e.g., the role owns unexpected objects), we log a warning
and continue so the minion delete itself never gets blocked by postgres
state. Admins can mop up stray roles manually if that happens.
Simpler, race-free replacement for the reactor + orch + fan-out chain.
- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
  generate a random 72-char password, reuse any existing password from
  the aggregate pillar, write postgres.telegraf.{user,pass} into the
  minion's own pillar file, and update the aggregate pillar so
  postgres.telegraf_users can CREATE ROLE on the next manager apply.
  Every create<ROLE> function already calls this hook, so add / addVM /
  setup dispatches are all covered identically and synchronously.
- salt/postgres/auth.sls: strip the fanout_targets loop and the
  postgres_telegraf_minion_pillar_<safe> cmd.run block; it's now
  redundant. The state still manages the so_postgres admin user and
  writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
  registered the reactor on /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
  step and the require: it added to the newnode highstate. Back to its
  original 3/dev shape.
No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
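
The "reuse or generate" rule for the password can be sketched in Python
as below. The alphabet and the {minion_id: {"pass": ...}} shape of the
aggregate are assumptions made for the example; so-minion itself is a
shell script and may generate the value differently.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed character set


def telegraf_password(minion_id: str, aggregate: dict, length: int = 72) -> str:
    """Return the stored password if one exists, else a fresh random one."""
    existing = aggregate.get(minion_id, {}).get("pass")
    if existing:
        # Reusing the stored value keeps repeated adds idempotent.
        return existing
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Reusing the stored value means re-running the add hook for a known
minion does not rotate its credential by accident.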
replace calls removeKey before addKey, so running `so-yaml.py replace`
on a new dotted key whose parent doesn't exist — e.g., postgres.auth
fanning postgres.telegraf.user into a minion pillar file that has
never carried any postgres.* keys — crashed with
    KeyError: 'postgres'
from removeKey recursing into a missing parent dict.
Make removeKey a no-op when an intermediate key is absent so that:
- `remove` has the natural "remove if exists" semantics, and
- `replace` works for brand-new nested keys.
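
A simplified Python sketch of the fixed behaviour. The function names
mirror the ones mentioned above, but the bodies are stand-ins written
for this example and are not so-yaml.py's actual implementation.

```python
def remove_key(content: dict, key: str) -> None:
    """Remove a dotted key; do nothing if any part of the path is absent."""
    head, _, rest = key.partition(".")
    if rest:
        child = content.get(head)
        if isinstance(child, dict):
            remove_key(child, rest)
        return  # missing (or non-dict) parent: nothing to remove
    content.pop(head, None)


def add_key(content: dict, key: str, value) -> None:
    """Set a dotted key, creating intermediate dicts as needed."""
    head, _, rest = key.partition(".")
    if rest:
        add_key(content.setdefault(head, {}), rest, value)
    else:
        content[head] = value


def replace_key(content: dict, key: str, value) -> None:
    """Remove-then-add, which now also works for brand-new nested keys."""
    remove_key(content, key)
    add_key(content, key, value)
```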
postgres.auth was running an `unless` shell check per up-minion on every
manager highstate, even when nothing had changed — N fork+python starts
of so-yaml.py add up on large grids. The work is only needed when a
specific minion's key is accepted.
- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
  pillar is set (targets that single minion). Manager highstates with
  no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
  as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
  salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
  the per-minion state.apply and do an in-shell loop over the minion
  pillar files using so-yaml.py directly. Skips minions that already
  have postgres.telegraf.user set.
state.apply takes a single mods argument; space-separated names are not
a list, so `state.apply postgres.auth postgres.telegraf_users` was only
applying postgres.auth and silently dropping the telegraf_users state.
Use comma-separated mods and add queue=True to match the rest of soup.
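
As an illustration, a small Python helper that builds the corrected
argument vector. The helper name is hypothetical, and the use of
salt-call is an assumption; soup may invoke the state through a
different Salt command.

```python
def state_apply_argv(mods, queue=True):
    """Build a state.apply command with all mods in one comma-joined argument."""
    if not mods:
        raise ValueError("at least one state name is required")
    # One positional argument: "a,b", not two separate words "a" "b".
    argv = ["salt-call", "state.apply", ",".join(mods)]
    if queue:
        argv.append("queue=True")
    return argv
```

The returned list can be handed to subprocess.run as-is, which also
avoids any shell word-splitting of the mods argument.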
feature/postgres had rewritten the 3.1.0 upgrade block, dropping the
elastic upgrade work 3/dev landed for 9.0.8→9.3.3:
elasticsearch_backup_index_templates, the component template state
cleanup, and the /usr/sbin/so-kibana-space-defaults post-upgrade call. It
also carried an older ES upgrade mapping (8.18.8→9.0.8) that was
superseded on 3/dev (9.0.8→9.3.3 for 3.0.0-20260331), and a handful of
latent shell-quoting regressions in verify_es_version_compatibility and
the intermediate-upgrade helpers.
Adopt the 3/dev soup verbatim and only add the new Telegraf Postgres
provisioning to post_to_3.1.0 on top of so-kibana-space-defaults.