A complete mine is not enough: elasticsearch:nodes, redis:nodes,
logstash:nodes (tgt_type=pillar) and hypervisor:nodes (tgt_type=compound)
resolve their target against the master's per-minion data cache
(grains+pillar in data.p), which is populated only when a minion's pillar
is recompiled -- separately from the mine. After a reboot a node can be in
the mine (so node_data/glob sees it) yet absent from that cache, so it
fails the elasticsearch:enabled:true pillar match and is dropped from
elasticsearch:nodes -> so-elasticsearch ExtraHosts -> container recreate.
After the mine-completeness wait, run salt '*' saltutil.refresh_pillar
wait=True to synchronously cache every up node's pillar (the same lever
deploy_newnode.sls uses), then verify with salt-run cache.pillar and retry
stragglers, bounded by MINE_UPDATE_MAX_WAIT. Also log elasticsearch:nodes
alongside node_data for inspection.
Mine-backed pillars (node_data, elasticsearch:nodes, redis:nodes,
logstash:nodes, hypervisor:nodes) include a node only if it returned an
IP from the mine, and the configs they build are rebuilt fresh every
highstate. After a manager reboot with a flushed mine, the first boot
highstate could run before an up node re-reported network.ip_addrs,
dropping it from e.g. so-elasticsearch ExtraHosts and forcing a
container recreate.
After the initial broad mine.update, poll until every currently-up
minion actually has network.ip_addrs in the mine, re-pushing mine.update
to stragglers, before releasing the boot highstate. Shares the existing
MINE_UPDATE_MAX_WAIT backstop so a slow/down node never blocks boot, and
still logs the rendered node_data for inspection.
Dump the actual rendered node_data pillar (pretty-printed JSON) to the
journal instead of just a rendered/empty verdict, so the boot-time render
attempt is fully inspectable. Empty renders print false/null and still
emit the WARNING.
After the boot-time mine.update, have the manager actually render the
node_data pillar and log whether it came back populated. node_data: False
makes salt/top.sls apply the bootstrap recovery branch instead of the
manager's real config, so surfacing this in the journal makes the
condition visible before so-boot-highstate runs. Best-effort and
non-blocking: always exits 0 so highstate proceeds regardless.
so-boot-mine-update.service is a manager-only Type=oneshot unit that runs
once per boot after salt-master/salt-minion start and before
so-boot-highstate.service. It pushes mine.update to all reachable minions
so mine-backed pillars (node IPs, ES/Redis/Logstash discovery) are fresh
before the boot highstate renders them.
The helper waits for the responsive minion set to settle (plateau) rather
than for every accepted key to report up, so an intentionally powered-off
minion doesn't block the update; MAX_WAIT remains as a backstop.
Before removing from apply_hotfix function first verify that older installs < 3.1.0 are still upgradable when referencing 'so/0013_input_lumberjack_fleet.conf' via pillar. Failure to do so will prevent logstash from starting
The manager's /etc/salt/minion (written by so-functions:configure_minion)
has no file_roots, so salt-call --local falls back to Salt's default
/srv/salt and fails with "No matching sls found for 'postgres.telegraf_users'
in env 'base'". || true was silently swallowing the error, which meant the
DB roles for the pillar entries just populated by the so-telegraf-cred
backfill loop never actually got created.
Route through salt-master instead; its file_roots already points at the
default/local salt trees.
pillar/top.sls now references postgres.soc_postgres / postgres.adv_postgres
unconditionally, but make_some_dirs only runs at install time so managers
upgrading from 3.0.0 have no local/pillar/postgres/ and salt-master fails
pillar render on the first post-upgrade restart. Similarly, secrets_pillar
is a no-op on upgrade (secrets.sls already exists), so secrets:postgres_pass
never gets seeded and the postgres container's POSTGRES_PASSWORD_FILE and
SOC's PG_ADMIN_PASS would land empty after highstate.
Add ensure_postgres_local_pillar and ensure_postgres_secret to up_to_3.1.0
so the stubs and secret exist before masterlock/salt-master restart. Both
are idempotent and safe to re-run.
Exercises the FileNotFoundError and generic-exception branches added to
loadYaml in the previous commit, restoring 100% coverage required by
the build.
so-telegraf-cred was committed with mode 644, causing
`so-telegraf-cred add "$MINION_ID"` in so-minion's add_telegraf_to_minion
to fail with "Permission denied" and log "Failed to provision postgres
telegraf cred for <minion>". Mark it executable.
Also bail early in seed_creds_file if mkdir/printf/chmod fail, and in
so-yaml.py loadYaml surface a clear stderr message with the filename
instead of an unhandled FileNotFoundError traceback.