elastic_fleet_load_integrations_dir now buffers each concurrent job's
output (header + API response) to a per-job file and prints them in
submission order after wait, restoring the readable serial-style output
while keeping concurrent writes.
Add --retry-all-errors to the integration create/update curl calls so
transient 409 conflicts from concurrent writes to the same agent policy
are retried (curl --retry alone does not retry 409).
Fetch each agent policy once and extract integration name/package/version/id
locally via a single jq pass instead of re-fetching the identical policy JSON
1+3N times. Memoize epm/packages latest-version lookups so each package is
queried once instead of per (policy, integration). Dispatch the per-integration
dry-run+upgrade as throttled background jobs (MAX_FLEET_JOBS) with
flock-serialized output and a FAIL_FILE marker, mirroring
elastic_fleet_load_integrations_dir.
Behavior preserved: same elastic-defend-endpoints/fleet_server skips, same
AUTO_UPGRADE_INTEGRATIONS default-package gating (moved into jq, using $defaults
to avoid the jq $def keyword collision), and exit 1 on any failure so salt
retries.
Fetch each agent policy once per group instead of refetching the full
policy (plus a fresh Kibana session cookie) for every integration file,
and dispatch the create/update writes as throttled background jobs.
Adds elastic_fleet_load_integrations_dir and elastic_fleet_throttle to
so-elastic-fleet-common, reusing the bounded-concurrency pattern from
so-elasticsearch-ilm-policy-load. Replaces the four serial loops in the
loader with one call per agent policy.
The agent-policy enumeration passed --argjson def, creating a jq
variable $def. 'def' is a reserved keyword in jq and the deployed jq
version rejects it, so the program failed to compile and
in_use_integrations was left empty (silently disabling the in-use
upgrade guard). Rename the arg to $defaults.
Replace the per-package decision loop (which forked ~10 processes per
package and rebuilt a growing JSON file on every add -> O(n^2)) with two
jq passes: one prints the status messages, one builds the bulk install
list. A vnum/needs() jq definition reproduces the previous
version_conversion/compare_versions and excluded/subscription/installed/
upgrade/in-use logic exactly. Also fetch each agent policy once and
extract non-default package names locally instead of re-fetching the
policy per integration (1+K -> 1 GET per policy). Install behavior is
unchanged.
- schedule highstate every 2 hours (was 15 minutes); interval lives in
global:push:highstate_interval_hours so the SOC admin UI can tune it and
so-salt-minion-check derives its threshold as (interval + 1) * 3600
- add inotify beacon on the manager + master reactor + orch.push_batch that
writes per-app intent files, with a so-push-drainer schedule on the manager
that debounces, dedupes, and dispatches a single orchestration
- pillar_push_map.yaml allowlists the apps whose pillar changes trigger an
immediate targeted state.apply (targets verified against salt/top.sls);
edits under pillar/minions/ trigger a state.highstate on that one minion
- host-batch every push orchestration (batch: 25%, batch_wait: 15) so rule
changes don't thundering-herd large fleets
- new global:push:enabled kill-switch tears down the beacon, reactor config,
and drainer schedule on the next highstate for operators who want to keep
highstate-only behavior
- set restart_policy: unless-stopped on 23 container states so docker
recovers crashes without waiting for the next highstate; leave registry
(always), strelka/backend (on-failure), kratos, and hydra alone with
inline comments explaining why