Salt's stock inotify beacon leaks one kernel inotify instance every time
the minion rebuilds the beacon loader's __context__, because the orphaned
pyinotify.Notifier is never stopped. Leaked instances accumulate against
fs.inotify.max_user_instances=128 until inotify_init() fails with EMFILE
and rule-change push detection silently stops. The leak occurs regardless
of disable_during_state_run.

Add a custom poll-based beacon (salt/_beacons/rules_db.py) modeled on
pillar_db.py: it fingerprints the suricata/strelka rule dirs each interval
(relpath + mtime_ns + size, temp files excluded) against a per-dir
watermark, emitting an event only on change. It holds zero inotify
instances, so the leak is impossible, and it keeps firing during state
runs. Swap the inotify beacon config and reactor tag mappings accordingly;
the push_suricata/push_strelka reactors are unchanged (they read only
data['path']).
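
A minimal sketch of the fingerprint-and-watermark poll described above, not
the actual rules_db.py. The TEMP_SUFFIXES list, the function names, and the
choice to emit an event on the very first poll are assumptions made for
illustration; the real beacon may exclude different files or seed its
watermark silently.

```python
import os

# Hypothetical temp-file suffixes; the real beacon's exclusion list may differ.
TEMP_SUFFIXES = (".tmp", ".swp", "~")


def fingerprint(root):
    """Return a sorted tuple of (relpath, mtime_ns, size) for files under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(TEMP_SUFFIXES):
                continue
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                # File vanished between listing and stat; skip it this round.
                continue
            entries.append((os.path.relpath(full, root), st.st_mtime_ns, st.st_size))
    return tuple(sorted(entries))


def poll(dirs, watermarks):
    """Compare each dir against its watermark; return one event per changed dir."""
    events = []
    for path in dirs:
        current = fingerprint(path)
        # Assumption: an unseen dir counts as changed, so the first poll fires.
        if watermarks.get(path) != current:
            watermarks[path] = current
            events.append({"path": path})
    return events
```

The event carries only a path key because that is the only field the push
reactors are stated to read. Nothing here opens an inotify instance; the cost
is one stat() per file per interval.
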
highstate_interval_hours describes the per-minion highstate schedule, not the
active-push pipeline, so relocate it from salt.auto_apply to a new salt.schedule
settings subtree. Repoint so-salt-minion-check at the new pillar path (it had
been left on the stale global:push path) so its restart grace period tracks the
schedule again.

- Add salt.schedule.highstate_interval_hours to defaults.yaml/soc_salt.yaml and a
  side-effect-free salt/salt/schedule.map.jinja (SCHEDULEMERGED), matching the
  *MERGED map convention. Consumers read SCHEDULEMERGED.highstate_interval_hours.
- Split salt/schedule.sls into salt/salt/highstate_schedule.sls (every minion) and
  salt/salt/push_drain_schedule.sls (managers); update top.sls to apply the
  highstate schedule via '*' and the drainer schedule via the configured-manager
  block. Remove the now-empty schedule.sls aggregator.
- pillar_push_map.yaml and so-push-drainer: comment/doc updates only.

The SOC postgres database was renamed so_soc -> securityonion (see
POSTGRES_DB in salt/postgres/enabled.sls and the SOC postgres config in
salt/soc/defaults.yaml). The pillar_db beacon still hardcoded so_soc, so
every poll failed with 'database "so_soc" does not exist' (rc=2),
silently disabling active-push detection of audit_settings changes.

Update DATABASE to 'securityonion' and refresh the now-stale so_soc
references in the beacon and push_pillar reactor comments.

The active-push tunables (enabled, highstate_interval_hours, debounce_seconds,
drain_interval, batch, batch_wait) describe how Salt auto-applies changes, not
general grid config, so relocate them from the global namespace to a new
salt.auto_apply settings module.

- Add salt/salt/{defaults.yaml,auto_apply.map.jinja,soc_salt.yaml,adv_salt.yaml}.
  auto_apply.map.jinja is a dedicated, side-effect-free merge map (the existing
  salt/salt/map.jinja dereferences pillar.host.mainint at import time).
- Remove the push blocks from salt/global/{defaults,soc_global}.yaml.
- Register salt.soc_salt/salt.adv_salt in pillar/top.sls; seed the local pillar
  stubs for fresh installs (make_some_dirs) and upgrades (ensure_salt_local_pillar
  in soup, wired into up_to_3.2.0).
- Repoint all consumers: GLOBALMERGED.push.* -> AUTOAPPLY.* (schedule, salt
  master, manager beacons, beacons_pushstate, orch.push_batch) and
  pillar.get('global:push...') -> 'salt:auto_apply...' (push reactors,
  so-push-drainer).
- Add a salt: fleetwide-highstate entry to pillar_push_map.yaml so edits keep
  applying immediately, matching the prior global-namespace behavior.

elastic_fleet_load_integrations_dir now buffers each concurrent job's
output (header + API response) to a per-job file and prints them in
submission order after wait, restoring the readable serial-style output
while keeping concurrent writes.

Add --retry-all-errors to the integration create/update curl calls so
transient 409 conflicts from concurrent writes to the same agent policy
are retried (curl --retry alone does not retry 409).
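
The buffer-then-print-in-submission-order idea can be sketched in Python as
follows. This is an analog of the pattern only, not the bash script: run_job,
run_all, and the output text are hypothetical, and in-memory strings stand in
for the per-job files.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(name):
    """Stand-in for one integration load; returns its buffered output."""
    # Hypothetical output: a header line followed by a fake API response.
    return f"== {name} ==\nresponse for {name}\n"


def run_all(names, max_jobs=4):
    """Run jobs concurrently, then return their output in submission order."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = [pool.submit(run_job, n) for n in names]
        # Walking the list (rather than as_completed) keeps submission order.
        return "".join(f.result() for f in futures)
```

Jobs may finish in any order, but because output is collected per job and
only emitted after all of them are done, the log reads as if the loads ran
serially.
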
Fetch each agent policy once and extract integration name/package/version/id
locally via a single jq pass instead of re-fetching the identical policy JSON
1+3N times. Memoize epm/packages latest-version lookups so each package is
queried once instead of per (policy, integration). Dispatch the per-integration
dry-run+upgrade as throttled background jobs (MAX_FLEET_JOBS) with
flock-serialized output and a FAIL_FILE marker, mirroring
elastic_fleet_load_integrations_dir.

Behavior preserved: same elastic-defend-endpoints/fleet_server skips, same
AUTO_UPGRADE_INTEGRATIONS default-package gating (moved into jq, using $defaults
to avoid the jq $def keyword collision), and exit 1 on any failure so salt
retries.
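
The memoized latest-version lookup can be illustrated with a short Python
sketch. It is not the script's code: FAKE_REGISTRY, CALLS, and plan_upgrades
are hypothetical stand-ins, and versions are compared by simple inequality
rather than by whatever comparison the real script performs.

```python
from functools import lru_cache

CALLS = []  # records each simulated epm/packages query

# Hypothetical stand-in for the registry's latest-version answers.
FAKE_REGISTRY = {"system": "1.2.0", "windows": "2.0.1"}


@lru_cache(maxsize=None)
def latest_version(package):
    """Look up a package's latest version; cached so each package is queried once."""
    CALLS.append(package)
    return FAKE_REGISTRY[package]


def plan_upgrades(integrations):
    """integrations: iterable of (policy, package, installed_version) tuples."""
    return [
        (policy, package, latest_version(package))
        for policy, package, installed in integrations
        if installed != latest_version(package)
    ]
```

With two policies that both carry the system package, the lookup runs once
for system and once for windows, however many (policy, integration) pairs
reference them.
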
Load component and index templates as throttled background jobs (max 10
concurrent) instead of sequential curl PUTs, matching the bounded-concurrency
+ flock-serialized-output pattern used by the fleet/ILM load scripts. Keeps a
wait barrier between the component phase and the index phase so index
templates never load before their referenced component templates. Failures are
tracked via per-job marker files since counter increments can't escape
background subshells.
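
A rough Python analog of the two-phase, bounded-concurrency load, assuming a
caller-supplied put(name) callable that returns True on success. The function
names are hypothetical. A thread pool hands results back directly, so this
sketch needs no marker files; the bash script uses them because its jobs run
in background subshells.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_JOBS = 10  # mirrors the "max 10 concurrent" bound described above


def load_phase(names, put):
    """Run put(name) for each name with bounded concurrency; return failed names."""
    with ThreadPoolExecutor(max_workers=MAX_JOBS) as pool:
        results = list(pool.map(put, names))
    # Leaving the with-block is the barrier: every job in this phase is done.
    return [name for name, ok in zip(names, results) if not ok]


def load_templates(components, indexes, put):
    """Load all component templates, then all index templates; return failures."""
    failed = load_phase(components, put)
    failed += load_phase(indexes, put)
    return failed
```

The second phase cannot start until the first pool has drained, which is the
property that keeps index templates from loading before the component
templates they reference.
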