A complete mine is not enough: elasticsearch:nodes, redis:nodes,
logstash:nodes (tgt_type=pillar) and hypervisor:nodes (tgt_type=compound)
resolve their target against the master's per-minion data cache
(grains+pillar in data.p), which is populated only when a minion's pillar
is recompiled -- separately from the mine. After a reboot a node can be in
the mine (so node_data/glob sees it) yet absent from that cache, so it
fails the elasticsearch:enabled:true pillar match and is dropped from
elasticsearch:nodes -> so-elasticsearch ExtraHosts -> container recreate.
After the mine-completeness wait, run salt '*' saltutil.refresh_pillar
wait=True to synchronously cache every up node's pillar (the same lever
deploy_newnode.sls uses), then verify with salt-run cache.pillar and retry
stragglers, bounded by MINE_UPDATE_MAX_WAIT. Also log elasticsearch:nodes
alongside node_data for inspection.
Mine-backed pillars (node_data, elasticsearch:nodes, redis:nodes,
logstash:nodes, hypervisor:nodes) include a node only if it returned an
IP from the mine, and the configs they build are rebuilt fresh every
highstate. After a manager reboot with a flushed mine, the first boot
highstate could run before an up node re-reported network.ip_addrs,
dropping it from e.g. so-elasticsearch ExtraHosts and forcing a
container recreate.
After the initial broad mine.update, poll until every currently-up
minion actually has network.ip_addrs in the mine, re-pushing mine.update
to stragglers, before releasing the boot highstate. Shares the existing
MINE_UPDATE_MAX_WAIT backstop so a slow/down node never blocks boot, and
still logs the rendered node_data for inspection.
Dump the actual rendered node_data pillar (pretty-printed JSON) to the
journal instead of just a rendered/empty verdict, so the boot-time render
attempt is fully inspectable. Empty renders print false/null and still
emit the WARNING.
After the boot-time mine.update, have the manager actually render the
node_data pillar and log whether it came back populated. node_data: False
makes salt/top.sls apply the bootstrap recovery branch instead of the
manager's real config, so surfacing this in the journal makes the
condition visible before so-boot-highstate runs. Best-effort and
non-blocking: always exits 0 so highstate proceeds regardless.
so-boot-mine-update.service is a manager-only Type=oneshot unit that runs
once per boot after salt-master/salt-minion start and before
so-boot-highstate.service. It pushes mine.update to all reachable minions
so mine-backed pillars (node IPs, ES/Redis/Logstash discovery) are fresh
before the boot highstate renders them.
The helper waits for the responsive minion set to settle (plateau) rather
than for every accepted key to report up, so an intentionally powered-off
minion doesn't block the update; MAX_WAIT remains as a backstop.
The setup-complete marker is a runtime-state file, not config, so move it
to /opt/so/state/setup-complete. Updates both writers (mark_setup_complete
in setup/so-functions and the upgrade-path state in minion/init.sls) and the
three readers (so-boot-highstate.service ConditionPathExists, boot_highstate.sls
enable gate, and the so-user_sync cron gate).
Add docker.service to After= and Wants= so the boot-time highstate
starts after docker is up. Uses Wants (soft) so highstate still runs
if docker fails to start.
The active-push feature detected pillar/settings changes via an inotify
beacon on the manager watching /opt/so/saltstack/local/pillar. Replace
that pillar watch with a custom salt beacon (pillar_db) that polls the
SOC so_soc.audit_settings table on a monotonic id watermark, so changes
made through SOC drive immediate pushes from the database instead of the
files. The suricata/strelka rule inotify watches (and pyinotify) are kept
unchanged, since rule-file edits are not recorded in audit_settings.
- salt/_beacons/pillar_db.py: new beacon. Polls audit_settings via
`docker exec so-postgres psql` (unix-socket trust auth), tracks the last
processed id in /opt/so/state/pillar_db_watch.id, seeds to MAX(id) on
first run (no history replay), and emits one event per new row.
- salt/reactor/push_pillar.sls: consume setting_id/node_id from the beacon
event instead of a file path. App = first dotted segment of setting_id,
looked up in pillar_push_map.yaml. Empty node_id -> grid-wide actions as
is; populated node_id -> the app's state(s) retargeted to that one node.
- salt/manager/files/beacons_pushstate.conf.jinja: drop the pillar inotify
block, add the pillar_db beacon (interval = push.drain_interval); keep
the suricata/strelka inotify watches.
- salt/salt/files/reactor_pushstate.conf: map salt/beacon/*/pillar_db/
audit_settings to push_pillar.sls; remove the pillar inotify reactor
lines; keep suricata/strelka.
The intent -> so-push-drainer -> orch.push_batch pipeline is unchanged.
Verified end-to-end on a standalone: a grid-wide telegraf.output change
re-applied telegraf fleetwide (container replaced), and a per-host
ntp.config.servers change applied ntp to only that node.