Commit Graph

11867 Commits

Author SHA1 Message Date
Josh Patterson d0bea2ebcb Restore grouped per-integration logging and retry 409s in fleet integration loader
elastic_fleet_load_integrations_dir now buffers each concurrent job's
output (header + API response) to a per-job file and prints them in
submission order after wait, restoring the readable serial-style output
while keeping concurrent writes.

Add --retry-all-errors to the integration create/update curl calls so
transient 409 conflicts from concurrent writes to the same agent policy
are retried (curl --retry alone does not retry 409).
2026-06-18 11:19:36 -04:00
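
The per-job buffering described in d0bea2ebcb can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code from so-elastic-fleet-common: `run_buffered` and `fake_api_call` are hypothetical names, and the random sleep stands in for the real curl call so that jobs finish out of order.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the per-integration API call; the random
# delay makes jobs complete in a different order than they were started.
fake_api_call() {
  sleep "0.$((RANDOM % 3))"
  printf 'response for %s\n' "$1"
}

# Start one background job per argument, buffering each job's output
# (header + response) in its own file, then replay the files in
# submission order once every job has finished.
run_buffered() {
  local tmpdir item i=0 n
  tmpdir=$(mktemp -d) || return 1
  for item in "$@"; do
    i=$((i + 1))
    {
      printf 'Loading %s\n' "$item"
      fake_api_call "$item"
    } >"$tmpdir/$i.out" 2>&1 &
  done
  wait
  for ((n = 1; n <= i; n++)); do
    cat "$tmpdir/$n.out"
  done
  rm -rf "$tmpdir"
}
```

Calling `run_buffered a b c` prints the blocks for a, b, and c in that order regardless of which job finished first. The trade-off is that nothing is printed until all jobs are done.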
Josh Patterson 62c01a9756 Merge remote-tracking branch 'origin/3/dev' into soupmod2 2026-06-18 09:53:44 -04:00
reyesj2 16149df71f formatting 2026-06-16 18:21:28 -05:00
reyesj2 6a18f35020 add context to soup errors and optional soup debug log with xtrace output 2026-06-16 18:21:28 -05:00
Jason Ertel aa58225e8f Merge pull request #15974 from Security-Onion-Solutions/jertel/wip
es|ql defaults
2026-06-16 14:27:54 -04:00
Josh Patterson 8e33d0e1e9 Merge remote-tracking branch 'origin/3/dev' into soupmod2 2026-06-16 12:54:18 -04:00
reyesj2 3daed551df use --fail flag without set -x, since elasticsearch can return a 404 on the template lookup 2026-06-16 11:17:04 -05:00
reyesj2 4456bde1c8 check if template exists without --fail flag 2026-06-16 10:45:53 -05:00
Jorge Reyes 4a6c675223 skip kibana backport if the template doesn't exist 2026-06-16 10:33:11 -05:00
reyesj2 a769d4c680 another unneeded default 2026-06-16 09:32:37 -05:00
reyesj2 f68e3e47a1 remove pillar merge 2026-06-16 09:19:10 -05:00
Jorge Reyes b81257bf45 Merge pull request #15973 from Security-Onion-Solutions/reyesj2/dlm-support
Data stream lifecycle management support
2026-06-15 14:47:51 -05:00
reyesj2 1a423a2434 update message 2026-06-15 14:17:34 -05:00
reyesj2 95cae4c734 remove so-elasticsearch-indices-delete cron when using DLM 2026-06-15 13:32:45 -05:00
reyesj2 596471e140 using new annotation config 2026-06-15 13:31:53 -05:00
reyesj2 d10f21399c remove comments 2026-06-15 13:31:23 -05:00
Jason Ertel ae1ddf3817 es|ql defaults 2026-06-15 12:33:08 -04:00
Josh Patterson 1ee555957a Speed up so-elastic-fleet-integration-upgrade
Fetch each agent policy once and extract integration name/package/version/id
locally via a single jq pass instead of re-fetching the identical policy JSON
1+3N times. Memoize epm/packages latest-version lookups so each package is
queried once instead of per (policy, integration). Dispatch the per-integration
dry-run+upgrade as throttled background jobs (MAX_FLEET_JOBS) with
flock-serialized output and a FAIL_FILE marker, mirroring
elastic_fleet_load_integrations_dir.

Behavior preserved: same elastic-defend-endpoints/fleet_server skips, same
AUTO_UPGRADE_INTEGRATIONS default-package gating (moved into jq, using $defaults
to avoid the jq $def keyword collision), and exit 1 on any failure so salt
retries.
2026-06-12 15:23:43 -04:00
Josh Patterson 43f72c1f9f Parallelize so-elasticsearch-templates-load template PUTs
Load component and index templates as throttled background jobs (max 10
concurrent) instead of sequential curl PUTs, matching the bounded-concurrency
+ flock-serialized-output pattern used by the fleet/ILM load scripts. Keeps a
wait barrier between the component phase and the index phase so index
templates never load before their referenced component templates. Failures are
tracked via per-job marker files since counter increments can't escape
background subshells.
2026-06-12 15:11:34 -04:00
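
The point in 43f72c1f9f about counters not escaping background subshells can be shown with a small sketch. The names `load_template`, `load_all`, `LOST_COUNTER`, and `FAILED_COUNT` are hypothetical, and the "failure" rule is invented purely for the demonstration.

```bash
#!/usr/bin/env bash
# Hypothetical job: "fails" for any name containing "bad".
load_template() {
  [[ $1 != *bad* ]]
}

# Launch one background job per name. A failing job increments a counter
# (which is lost) and also drops a marker file (which the parent can see).
load_all() {
  local faildir name counter=0
  faildir=$(mktemp -d) || return 1
  for name in "$@"; do
    (
      if ! load_template "$name"; then
        counter=$((counter + 1))   # only the subshell's copy changes
        : >"$faildir/$name.fail"   # visible to the parent after wait
      fi
    ) &
  done
  wait
  LOST_COUNTER=$counter
  FAILED_COUNT=$(( $(find "$faildir" -name '*.fail' | wc -l) ))
  rm -rf "$faildir"
}
```

After `load_all a bad1 b bad2`, `LOST_COUNTER` is still 0 while `FAILED_COUNT` is 2, which is why marker files are the more reliable signal here.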
Josh Brower 9031c1fd22 userid vs names 2026-06-12 11:18:59 -04:00
Josh Patterson ae6a705ce1 Speed up so-elastic-fleet-integration-policy-load
Fetch each agent policy once per group instead of refetching the full
policy (plus a fresh Kibana session cookie) for every integration file,
and dispatch the create/update writes as throttled background jobs.

Adds elastic_fleet_load_integrations_dir and elastic_fleet_throttle to
so-elastic-fleet-common, reusing the bounded-concurrency pattern from
so-elasticsearch-ilm-policy-load. Replaces the four serial loops in the
loader with one call per agent policy.
2026-06-12 09:38:41 -04:00
reyesj2 c505160480 set default DLM retention 90d 2026-06-11 15:13:28 -05:00
reyesj2 d9f6cde4e1 remove global setting from data_retention annotation 2026-06-11 15:11:29 -05:00
Josh Patterson b1273573ed Fix jq $def keyword collision in optional-integrations-load
The agent-policy enumeration passed --argjson def, creating a jq
variable $def. 'def' is a reserved keyword in jq and the deployed jq
version rejects it, so the program failed to compile and
in_use_integrations was left empty (silently disabling the in-use
upgrade guard). Rename the arg to $defaults.
2026-06-11 15:50:53 -04:00
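
As a rough illustration of the rename in b1273573ed, the sketch below passes JSON into jq under the name `defaults` rather than `def`. The helper `non_default_packages` and its inputs are hypothetical; only the choice of argument name reflects the commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: given a JSON array of package names and a JSON array
# of default packages, print the packages that are not defaults.
# The argument is named "defaults" because, per the commit, the deployed jq
# refuses to compile a program that refers to $def.
non_default_packages() {
  local packages_json=$1 defaults_json=$2
  jq -nc \
    --argjson pkgs "$packages_json" \
    --argjson defaults "$defaults_json" \
    '$pkgs - $defaults'
}
```

For example, `non_default_packages '["a","b","c"]' '["b"]'` prints `["a","c"]`. Checking jq's exit status in the caller would also keep a compile failure from silently yielding an empty list.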
Josh Patterson 6c42c419e2 Serialize ILM policy-load output with flock to stop interleaving
A single printf per block was not actually one write() call, so
concurrent jobs still occasionally interleaved their label and response
lines. Hold an flock around just the printf (curl still runs in
parallel) so each policy's block prints intact, keeping live
completion-order streaming.
2026-06-11 15:42:41 -04:00
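
The locking described in 6c42c419e2 might look something like the sketch below, assuming the util-linux `flock` utility is available. `emit_block` and `put_all` are hypothetical names, and the sleep stands in for the curl PUT.

```bash
#!/usr/bin/env bash
# Print a label and its response as one uninterrupted block. The exclusive
# lock on fd 9 is held only around the printf and is released when the
# group ends and fd 9 is closed.
emit_block() {
  local lockfile=$1 label=$2 response=$3
  {
    flock 9
    printf '%s\n%s\n' "$label" "$response"
  } 9>>"$lockfile"
}

# Hypothetical driver: the sleep (standing in for curl) runs outside the
# lock, so requests still overlap; only the printing is serialized.
put_all() {
  local lockfile name
  lockfile=$(mktemp) || return 1
  for name in "$@"; do
    (
      sleep "0.$((RANDOM % 3))"
      emit_block "$lockfile" "policy: $name" "ack: $name"
    ) &
  done
  wait
  rm -f "$lockfile"
}
```

Blocks appear in completion order, but each label line is always followed directly by its own response line.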
Josh Patterson f23652397c Speed up so-elastic-fleet-optional-integrations-load decision logic
Replace the per-package decision loop (which forked ~10 processes per
package and rebuilt a growing JSON file on every add -> O(n^2)) with two
jq passes: one prints the status messages, one builds the bulk install
list. A vnum/needs() jq definition reproduces the previous
version_conversion/compare_versions and excluded/subscription/installed/
upgrade/in-use logic exactly. Also fetch each agent policy once and
extract non-default package names locally instead of re-fetching the
policy per integration (1+K -> 1 GET per policy). Install behavior is
unchanged.
2026-06-11 13:57:56 -04:00
Josh Patterson 07d3b148b5 fix output 2026-06-11 13:37:26 -04:00
Josh Patterson 780d9faf0d Parallelize so-elasticsearch-ilm-policy-load PUTs
Run the ~300 ILM policy PUTs concurrently (bounded to 10 in flight via a
throttle gate) instead of one serial curl per policy. Adds a put_policy
helper and waits for all background jobs before exiting. Preserves policy
parity; only the scheduling changes. Drops the dead empty sid cookie arg
(falls back to basic auth from curl.config as before).
2026-06-11 12:08:32 -04:00
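
A throttle gate of the kind 780d9faf0d describes could be sketched as below. This is an assumption-laden illustration rather than the script's actual code: `throttle`, `run_throttled`, and `MAX_JOBS` are hypothetical names, the limit is set to 3 instead of 10 to keep the example small, and `wait -n` requires bash 4.3 or later.

```bash
#!/usr/bin/env bash
MAX_JOBS=3   # hypothetical limit; the commit bounds concurrency to 10

# Block until fewer than MAX_JOBS background jobs are running.
throttle() {
  while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
    wait -n   # returns as soon as any one background job exits
  done
}

# Run "$job <item>" in the background for every item, never exceeding
# MAX_JOBS at once, and wait for all of them before returning.
run_throttled() {
  local job=$1 item
  shift
  for item in "$@"; do
    throttle
    "$job" "$item" &
  done
  wait
}
```

With a hypothetical `put_policy` function that wraps one curl PUT, `run_throttled put_policy policy1 policy2 policy3 policy4` would keep at most three requests in flight.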
reyesj2 4741cc92bd fleet manager: start kibana if it isn't already running and wait for healthy status 2026-06-10 17:52:08 -05:00
reyesj2 46655860e9 http 2026-06-10 17:27:23 -05:00
reyesj2 289ddda5e8 kibana health check for fleet scripts 2026-06-10 17:06:22 -05:00
Josh Patterson 83aaa76f98 allow full highstate on manager when locked 2026-06-10 16:34:10 -04:00
reyesj2 f905afbc6f logging 2026-06-10 15:01:22 -05:00
reyesj2 bd5e77afc5 increase delay in so-elastic-fleet-package-upgrade attempts 2026-06-10 14:59:29 -05:00
reyesj2 944e773759 save exit until all packages have been attempted 2026-06-10 14:58:49 -05:00
reyesj2 cf456dc58c reuse existing index templates 2026-06-09 23:21:43 -05:00
reyesj2 9aa9ea3255 Initial DLM support 2026-06-09 23:19:26 -05:00
Josh Patterson 448668a72e Merge remote-tracking branch 'origin/3/dev' into nostartupstates 2026-06-09 14:02:00 -04:00
Josh Patterson f088a27159 so-boot-mine-update: warm master pillar cache before highstate
A complete mine is not enough: elasticsearch:nodes, redis:nodes,
logstash:nodes (tgt_type=pillar) and hypervisor:nodes (tgt_type=compound)
resolve their target against the master's per-minion data cache
(grains+pillar in data.p), which is populated only when a minion's pillar
is recompiled -- separately from the mine. After a reboot a node can be in
the mine (so node_data/glob sees it) yet absent from that cache, so it
fails the elasticsearch:enabled:true pillar match and is dropped from
elasticsearch:nodes -> so-elasticsearch ExtraHosts -> container recreate.

After the mine-completeness wait, run salt '*' saltutil.refresh_pillar
wait=True to synchronously cache every up node's pillar (the same lever
deploy_newnode.sls uses), then verify with salt-run cache.pillar and retry
stragglers, bounded by MINE_UPDATE_MAX_WAIT. Also log elasticsearch:nodes
alongside node_data for inspection.
2026-06-09 13:52:19 -04:00
Josh Patterson 27c7702325 so-boot-mine-update: wait for a complete mine before highstate
Mine-backed pillars (node_data, elasticsearch:nodes, redis:nodes,
logstash:nodes, hypervisor:nodes) include a node only if it returned an
IP from the mine, and the configs they build are rebuilt fresh every
highstate. After a manager reboot with a flushed mine, the first boot
highstate could run before an up node re-reported network.ip_addrs,
dropping it from e.g. so-elasticsearch ExtraHosts and forcing a
container recreate.

After the initial broad mine.update, poll until every currently-up
minion actually has network.ip_addrs in the mine, re-pushing mine.update
to stragglers, before releasing the boot highstate. Shares the existing
MINE_UPDATE_MAX_WAIT backstop so a slow/down node never blocks boot, and
still logs the rendered node_data for inspection.
2026-06-09 10:10:32 -04:00
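
The bounded polling in 27c7702325 follows a common pattern that can be sketched as below. `wait_until` is a hypothetical helper, not a function from so-boot-mine-update; in the real script the check would be whatever test confirms that every up minion has network.ip_addrs in the mine, and the limit would come from MINE_UPDATE_MAX_WAIT.

```bash
#!/usr/bin/env bash
# Poll a check command until it succeeds or max_wait seconds have elapsed.
# Returns 0 on success and 1 on timeout, so the caller can log the outcome
# and continue either way instead of blocking boot indefinitely.
wait_until() {
  local max_wait=$1 interval=$2
  shift 2
  local deadline=$((SECONDS + max_wait))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

A caller might write `wait_until 60 2 mine_is_complete || echo "WARNING: mine incomplete"`, where `mine_is_complete` is again a hypothetical check function.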
Josh Patterson 8c306eb37d so-boot-mine-update: log the rendered node_data content
Dump the actual rendered node_data pillar (pretty-printed JSON) to the
journal instead of just a rendered/empty verdict, so the boot-time render
attempt is fully inspectable. Empty renders print false/null and still
emit the WARNING.
2026-06-09 09:49:19 -04:00
Josh Patterson e536ffa363 so-boot-mine-update: render node_data after mine.update before highstate
After the boot-time mine.update, have the manager actually render the
node_data pillar and log whether it came back populated. node_data: False
makes salt/top.sls apply the bootstrap recovery branch instead of the
manager's real config, so surfacing this in the journal makes the
condition visible before so-boot-highstate runs. Best-effort and
non-blocking: always exits 0 so highstate proceeds regardless.
2026-06-09 09:35:24 -04:00
Jorge Reyes d7aa7ab228 Merge pull request #15961 from Security-Onion-Solutions/reyesj2/fleet-autoconfigure
respect elasticfleet enable_auto_configuration setting for so-elastic…
2026-06-08 15:09:58 -05:00
Jorge Reyes fe0b68d24c Merge pull request #15958 from Security-Onion-Solutions/reyesj2-patch-template
fix elasticsearch template generation issue
2026-06-08 15:07:49 -05:00
reyesj2 6ad345730b respect elasticfleet enable_auto_configuration setting for so-elastic-fleet-urls-update 2026-06-08 15:02:57 -05:00
Josh Patterson 9580976ba2 Add manager boot-time grid mine.update oneshot before highstate
so-boot-mine-update.service is a manager-only Type=oneshot unit that runs
once per boot after salt-master/salt-minion start and before
so-boot-highstate.service. It pushes mine.update to all reachable minions
so mine-backed pillars (node IPs, ES/Redis/Logstash discovery) are fresh
before the boot highstate renders them.

The helper waits for the responsive minion set to settle (plateau) rather
than for every accepted key to report up, so an intentionally powered-off
minion doesn't block the update; MAX_WAIT remains as a backstop.
2026-06-08 11:05:13 -04:00
reyesj2 ac907ba45f fix elasticsearch template generation issue 2026-06-05 16:42:08 -05:00
Josh Patterson cb3631da81 Move setup-complete marker from /opt/so/conf to /opt/so/state
The setup-complete marker is a runtime-state file, not config, so move it
to /opt/so/state/setup-complete. Updates both writers (mark_setup_complete
in setup/so-functions and the upgrade-path state in minion/init.sls) and the
three readers (so-boot-highstate.service ConditionPathExists, boot_highstate.sls
enable gate, and the so-user_sync cron gate).
2026-06-04 15:07:27 -04:00
Josh Patterson f5d63f585e Merge remote-tracking branch 'origin/3/dev' into nostartupstates 2026-06-04 09:19:01 -04:00
Josh Patterson 13f8be40b5 so-boot-highstate: wait for docker before running highstate
Add docker.service to After= and Wants= so the boot-time highstate
starts after docker is up. Uses Wants (soft) so highstate still runs
if docker fails to start.
2026-06-04 08:46:35 -04:00