Compare commits


33 Commits

Author SHA1 Message Date
Mike Reeves 6bca92da4a fix: stop pip's patchelf 'ERROR' line from polluting sosetup.log
The cmd.run for psycopg2 install was already tolerating pip's
non-zero exit with `|| true`, but pip's stderr — which contains the
literal string "ERROR: Could not install packages due to an OSError:
[Errno 2] No such file or directory: 'patchelf'" — was still being
captured into salt's state-result dict. so-setup logs salt state
output to /root/sosetup.log, and verify_setup() then greps for the
substring "ERROR" to build /root/errors.log. The patchelf line then
shows up at the end of every install as "WARNING: Errors detected
during setup" even though the install is in fact green.

Redirect pip's combined stdout/stderr to
/opt/so/log/so_pillar/psycopg2_install.log so the noise lives in a
dedicated, predictable triage location instead of leaking into salt's
state result. The `unless: import psycopg2` check is still the
actual readiness gate, so a real install failure (rather than just
the patchelf RPATH-rewrite step that has no functional effect on the
wheel) would still surface via the state being re-run on every apply
and `import psycopg2` failing.
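
A quick way to confirm the noise is purely cosmetic after an install
(paths as above; the grep mirrors verify_setup's substring match, not its
exact code):

  /opt/saltstack/salt/bin/python3 -c "import psycopg2" && echo "psycopg2 importable"
  tail -n 5 /opt/so/log/so_pillar/psycopg2_install.log   # patchelf noise lives here now
  grep -c "ERROR" /root/sosetup.log                      # should no longer count the patchelf line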
2026-05-05 10:38:57 -04:00
Mike Reeves a7efabd90d fix: tolerate pip's non-zero exit on psycopg2 patchelf step
salt's pip.installed flagged so_pillar_psycopg2_in_salt_python as
failed because pip exits non-zero when it can't find the patchelf
binary to rewrite the psycopg2 wheel's RPATH after extraction. The
wheel is fully installed and importable regardless — the patchelf
step is a cosmetic post-install rewrite, not a build dependency. But
salt's failure cascade then short-circuited so_pillar_initial_import
and the so-yaml mode flip, leaving the install in dual-pillar mode
instead of PG-canonical.

Replaced with cmd.run that runs pip with `|| true` and uses an
`import psycopg2` check as the actual readiness gate — same idea as
how salt's own bootstrap does it. Also fixed the require: ref on
so_pillar_initial_import (was `pip:`, needs to be `cmd:` for the new
state type).
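
In shell terms the new state boils down to this pattern (a sketch, not the
literal SLS body):

  /opt/saltstack/salt/bin/pip3 install psycopg2-binary || true   # tolerate the patchelf RPATH step failing
  /opt/saltstack/salt/bin/python3 -c "import psycopg2"           # real readiness gate; non-zero here is a real failure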
2026-05-04 22:08:31 -04:00
Mike Reeves b25b221076 postsalt: move PG-canonical enable to AFTER the install highstate
Supersedes the pre-install placement (right after secrets_pillar) from
the previous commit, which was broken: salt's ext_pillar overlay
shadowed disk pillar's elasticsearch subtree before so-pillar-import
had populated PG, so elasticsearch.enabled.sls failed rendering on
ELASTICSEARCHMERGED.auth.users.so_elastic_user.pass — that key lives
in elasticsearch/auth.sls, which is on the importer's secrets
allowlist and never makes it into so_pillar.pillar_entry. The install
would then hang forever waiting for the elasticsearch container that
the broken state never deployed.

The new placement is right after the final state.highstate completes:
  1. drop adv_postgres.sls flipping the flag to True
  2. salt-call saltutil.refresh_pillar so the next state sees it
  3. salt-call state.apply postgres.schema_pillar — deploys schema,
     ALTERs role login passwords, installs psycopg2 into salt's
     bundled python, runs so-pillar-import, writes
     /opt/so/conf/so-yaml/mode=postgres
  4. salt-call state.apply salt.master — re-renders engines.conf
     with the pg_notify_pillar engine block, drops master.d
     ext_pillar config, watch_in restarts salt-master and ext_pillar
     takes over

verify_setup runs after this so its final checks see PG-canonical
mode in place. Same end state as the previous commit's intent, just
without the bootstrap chicken-and-egg.
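
A minimal spot check of the end state (paths and keys as described above;
the expected outputs are assumptions drawn from this message):

  cat /opt/so/conf/so-yaml/mode                              # expect: postgres
  salt-call pillar.get postgres:so_pillar:enabled --out=txt  # expect: True
  salt-call state.apply salt.master test=True                # expect: no pending engine/ext_pillar changes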
2026-05-04 21:02:08 -04:00
Mike Reeves 7b9ab2d9d1 postsalt: enable PG-canonical pillar mode by default during so-setup
Drops a local pillar override (postgres.so_pillar.enabled = True) right
after secrets_pillar so the install-time highstate brings up
schema_pillar, ext_pillar_postgres, and the pg_notify_pillar engine
without operator intervention. Without this the whole PG-canonical
stack stays gated off on the default-False flag and the install lands
in legacy disk-pillar mode — which defeats the point of being on the
postsalt branch at all.

The new enable_so_pillar_postgres() function in so-functions is
idempotent (overwrites adv_postgres.sls with a fixed body) and the
generated file is mode 0644 socore:socore so it merges into pillar
under the existing local-pillar directory ownership convention.

Rollback path: edit /opt/so/saltstack/local/pillar/postgres/adv_postgres.sls
to set enabled: False, or delete the file. The schema and engine
config states will tear themselves down on the next highstate via
their existing else-branch absent states.
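
Sketch of that rollback in shell form (the sed expression is illustrative —
editing the file by hand is equivalent):

  sed -i 's/enabled: True/enabled: False/' /opt/so/saltstack/local/pillar/postgres/adv_postgres.sls
  salt-call saltutil.refresh_pillar
  salt-call state.highstate   # the else-branch absent states tear the schema + engine config back down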
2026-05-04 19:56:14 -04:00
Mike Reeves 92a7bb3053 fix: get postsalt's PG-canonical pillar actually working end-to-end
Five blockers turned up the first time the so_pillar schema was applied
against a fresh standalone install. Fixing them in order:

1. 006_rls.sql ordering bug
   006 GRANTed on so_pillar.change_queue and its sequence, but the table
   isn't created until 008_change_notify.sql. 006 errored mid-file with
   "relation so_pillar.change_queue does not exist", short-circuiting the
   rest of the pillar staging chain. Moved the three change_queue grants
   into 008 alongside the table creation so each file is self-contained.

2. so_pillar_* roles unable to log in
   006 created the roles as NOLOGIN and set no password. Salt-master's
   ext_pillar (postgres) and the pg_notify_pillar engine both connect as
   so_pillar_master via TCP, so both came up with "password authentication
   failed for user so_pillar_master". Added a templated cmd.run step in
   schema_pillar.sls (so_pillar_role_login_passwords) that ALTERs all three
   roles WITH LOGIN PASSWORD pulling from secrets:pillar_master_pass — the
   same password ext_pillar_postgres.conf.jinja and the engines.conf
   pg_notify_pillar block render with.

3. Missing GRANT CONNECT ON DATABASE securityonion
   USAGE on the schema is granted in 006 but CONNECT on the database isn't.
   Engine + ext_pillar succeeded auth then died with "permission denied
   for database securityonion". Added the explicit GRANT CONNECT in 006.

4. psycopg2 missing from salt's bundled python
   /opt/saltstack/salt/bin/python3 doesn't ship psycopg by default, so
   when salt-master tries to load the pg_notify_pillar engine its
   `import psycopg2` fails inside salt's loader and the engine silently
   doesn't start (no error in the salt log — you only notice when nothing
   ever drains so_pillar.change_queue). Added a pip.installed state in
   schema_pillar.sls bound to that interpreter via bin_env.

5. engines.conf vs pg_notify_pillar_engine.conf list-replace
   Salt's master.d/*.conf merge replaces top-level lists rather than
   concatenating them. The engine config used to live in its own
   master.d/pg_notify_pillar_engine.conf with `engines: [pg_notify_pillar]`
   alongside the legacy `engines.conf` carrying `engines: [checkmine,
   pillarWatch]`. Whichever loaded last won, so the engine never showed
   up in the loaded set even when the file existed. Fold the
   pg_notify_pillar declaration into engines.conf (now jinja-rendered,
   gated on postgres:so_pillar:enabled), drop the standalone state from
   pg_notify_pillar_engine.sls, and delete the now-orphaned conf jinja.

End state validated against a live standalone-net install on the dev rig:
salt-master ext_pillar reads from so_pillar.* with no errors, the
pg_notify_pillar engine LISTENs on so_pillar_change and drains the
change_queue (134-row backlog → 0 within seconds), and a so-yaml replace
on a pillar key flows disk → PG → ext_pillar → salt pillar.get with the
new value visible after a saltutil.refresh_pillar.
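
The sort of spot checks behind that validation, sketched (the pillar key is
a placeholder):

  docker exec so-postgres psql -U postgres -d securityonion \
    -c "SELECT count(*) FROM so_pillar.change_queue;"   # backlog should drain toward 0
  salt-call saltutil.refresh_pillar
  salt-call pillar.get some.managed.key                 # should reflect the PG-side edit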
2026-05-04 19:47:38 -04:00
Mike Reeves 155b5c5d66 fix: consistent allowed_states guard in postgres.schema_pillar
Same `sls.split('.')[0]` pattern that ext_pillar_postgres + pg_notify_pillar_engine
used before the previous commit. For sls='postgres.schema_pillar' the split
happened to evaluate to 'postgres', which is in manager_states, so the guard
worked accidentally — but it would break silently if anyone ever moved the
file under a deeper SLS path. Switch to a literal
`{% if 'postgres' in allowed_states %}` for the same intent-revealing pattern
as the master.d guards.
2026-05-04 19:25:14 -04:00
Mike Reeves f1746b0f59 fix: correct allowed_states guard in ext_pillar_postgres + pg_notify_pillar_engine
Both SLS files used `sls.split('.')[0]` to derive what to look up in
allowed_states. For these files (sls='salt.master.ext_pillar_postgres'
and sls='salt.master.pg_notify_pillar_engine') that returns 'salt',
which is never in any role's allowed_states list — only specific keys
like 'salt.master', 'salt.minion', 'salt.cloud' are. The guard's else
branch fired on every highstate, emitting two cosmetic
  ID: <sls>_state_not_allowed
  Function: test.fail_without_changes
  Comment: Failure!
entries that polluted the so-setup error summary even on green installs.

Both states drop config under /etc/salt/master.d/ and watch_in the
salt-master service, so the natural intent is "only run when this node
hosts the salt master". Switching the guard to a literal
  {% if 'salt.master' in allowed_states %}
expresses that directly without string-parsing the SLS path, and
matches the existing membership in manager_states (which is in turn
included in every manager-bearing role: so-eval, so-manager,
so-managerhype, so-managersearch, so-standalone, so-import).
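
Before the fix the cosmetic failures were easy to spot in the setup log
(illustrative grep; the summary format comes from so-setup):

  grep -B1 -A2 "_state_not_allowed" /root/sosetup.log   # should return nothing on a green install after this change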
2026-05-04 19:17:30 -04:00
Mike Reeves 2e411625c4 fix: subshell-scope umask 077 in so_pillar key generation
The unscoped `umask 077` on postsalt's secrets_pillar path leaked into
every subsequent file write by so-setup (and the salt-call processes
it spawned) for the rest of the install. Every state-rendered config
file under /opt/so/conf landed at mode 0600 instead of 0644, which
broke any container that bind-mounts its config read-only and runs as
a non-root user after the entrypoint's gosu drop. The first concrete
casualty was the influxdb container, which exits with
  "failed to load config file: open /conf/config.yaml: permission denied"
after init mode completes and re-execs as the influxdb user.

The chmod 0400 immediately after the printf already enforces the
intended file mode, so the umask was redundant for the key file
itself; scoping it to a subshell preserves the defense-in-depth
between the printf and the chmod without polluting the parent shell.
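
The scoping behaviour is easy to demonstrate in isolation (nothing Security
Onion specific; the /tmp path is just for the demo):

  umask                                  # e.g. 0022 in the parent shell
  ( umask 077; touch /tmp/demo.key )     # 077 applies only inside the subshell
  umask                                  # still 0022 — later file writes keep their normal modes
  stat -c '%a' /tmp/demo.key             # 600: the key itself still gets the restrictive mode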
2026-05-04 18:02:58 -04:00
Mike Reeves e43ad2ff74 Merge remote-tracking branch 'origin/feature/ensure-pyyaml' into postsalt 2026-05-04 16:37:42 -04:00
Mike Reeves b39d259101 Merge remote-tracking branch 'origin/3/dev' into postsalt 2026-05-04 16:19:17 -04:00
Mike Reeves 5bca81d833 Merge pull request #15858 from Security-Onion-Solutions/security-fix
Fix unsafe PyYAML load in filecheck
2026-05-04 16:16:40 -04:00
Mike Reeves b701664e04 Fix unsafe PyYAML load in filecheck 2026-05-04 12:09:35 -04:00
Jorge Reyes bc64f1431d Merge pull request #15857 from Security-Onion-Solutions/reyesj2/package-registry-health
fleet package registry health check
2026-05-04 11:05:23 -05:00
reyesj2 2203037ce7 fleet package registry health check 2026-05-04 10:52:37 -05:00
Jorge Reyes 77a4ad877e Merge pull request #15851 from Security-Onion-Solutions/reyesj2/integration-transforms 2026-05-01 14:11:12 -05:00
reyesj2 702b3585cc excluding additional integration transform job failures 2026-05-01 12:57:59 -05:00
reyesj2 86966d2778 reauthorize unhealthy transform jobs using kibana 9.3.3 auth flow 2026-05-01 12:44:08 -05:00
Jorge Reyes ce3ad3a895 Merge pull request #15844 from Security-Onion-Solutions/reyesj2/elastic-agent-warning
update default elastic agent logging level to warning
2026-04-30 09:46:28 -05:00
Mike Reeves 3a4b7b50de ensure python3-pyyaml is installed before continuing setup 2026-04-30 10:15:09 -04:00
reyesj2 39d0947102 update default elastic agent logging level to warning 2026-04-29 17:38:40 -05:00
Jorge Reyes 0085d9a353 Merge pull request #15842 from Security-Onion-Solutions/reyesj2-patch-1
so-elastic-fleet-outputs-update now checks for cert drift. Remove run…
2026-04-29 12:37:04 -05:00
Jorge Reyes 2f01ce3b23 so-elastic-fleet-outputs-update now checks for cert drift. Remove running --cert arg on cert change to prevent highstate from running outputs-update 2x 2026-04-29 12:33:28 -05:00
Mike Reeves 71b19c1b5f Merge pull request #15840 from Security-Onion-Solutions/fix/import-postgres-firewall
Open postgres in DOCKER-USER firewall everywhere influxdb is open
2026-04-29 09:20:03 -04:00
Mike Reeves 82e55ae87f Open postgres on every hostgroup that opens influxdb
The static defaults only listed postgres on each role's self-hostgroup,
leaving sensor/searchnode/heavynode/receiver/fleet/idh/desktop/hypervisor
hostgroups unable to reach the manager's so-postgres in distributed
grids. A dynamic block in firewall/map.jinja added postgres to those
hostgroups only when telegraf.output was switched to POSTGRES/BOTH,
which left postgres unreachable by default.

Mirror the influxdb entries statically across manager/managerhype/
managersearch/standalone — every hostgroup that already lists influxdb now
also lists postgres — and drop the now-redundant telegraf-gated dynamic
block from firewall/map.jinja.
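
Illustrative check on a manager-class node (the rules may render ports
differently, so treat the greps as a sketch):

  iptables -vnL DOCKER-USER | grep 5432   # postgres accepts should now mirror...
  iptables -vnL DOCKER-USER | grep 8086   # ...the existing influxdb ones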
2026-04-29 09:09:50 -04:00
Mike Reeves 3e02001544 Open postgres port for import role in DOCKER-USER firewall
When so-postgres was wired in (868cd1187), the import role's firewall
defaults were missed while every other manager-class role (manager,
managerhype, managersearch, standalone, eval) had postgres added to
their DOCKER-USER manager-hostgroup portgroups. As a result, on a
fresh import install the so-postgres container starts but tcp/5432 is
dropped at DOCKER-USER, so soc/kratos/telegraf can't reach it.

Add postgres alongside the existing influxdb entry so import nodes
match the other roles.
2026-04-29 08:48:45 -04:00
Mike Reeves 82f70bb53a Merge pull request #15839 from Security-Onion-Solutions/fix/drop-postgres-soc-module-injection
drop postgres module from soc defaults injection
2026-04-28 15:48:49 -04:00
Mike Reeves 2dcded6cca drop postgres module from soc defaults injection
The soc binary on 3/dev does not register a postgres module, so injecting
postgres into soc.config.server.modules makes soc abort at launch with
'Module does not exist: postgres'. The soc-side module is staged on
feature/postgres but is not landing this release. Drop the injection
until the module ships; salt/postgres state and pillars are unchanged.
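
The failure is visible straight from the container log (the so-soc container
name is assumed here):

  docker logs so-soc 2>&1 | grep "Module does not exist"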
2026-04-28 15:46:56 -04:00
Mike Reeves 8ca59e6f0c Merge pull request #15838 from Security-Onion-Solutions/fix/docker-refresh-multiarch-pull
Fix/docker refresh multiarch pull
2026-04-28 15:14:27 -04:00
Mike Reeves 82dac82d15 drop platform/digest pull resolution
The digest-pull logic was added to make `docker push` work for multi-arch
upstream tags. Now that the push step is `docker buildx imagetools create`
pinned to the gpg-verified RepoDigest, the registry-to-registry copy
handles single- and multi-arch sources without help. Reverts the pull
back to the original line and removes the unused PLATFORM_OS/_ARCH
detection.
2026-04-28 14:54:25 -04:00
Mike Reeves 288a823edf push images via buildx imagetools create
Replaces `docker push` with a registry-to-registry copy. On Docker 29.x
with the containerd image store, `docker push` of a freshly-pulled image
hits a path that wraps single-platform manifests in a synthetic index
and then can't push the layers it claims to reference, producing
`NotFound: content digest ...` even when the image is fully present.

Keep the local `docker tag` so so-image-pull's `docker images | grep :5000`
existence check continues to work.
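
To confirm the copy landed, the same imagetools tooling can inspect the
embedded registry (a sketch; assumes the :5000 registry is reachable from
the docker CLI):

  docker buildx imagetools inspect $HOSTNAME:5000/$IMAGEREPO/$image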
2026-04-28 14:49:02 -04:00
Jorge Reyes f9e3d30a71 Merge pull request #15837 from Security-Onion-Solutions/reyesj2/elastic-fleet-cert-check
check current fleet policy cert against cert on disk
2026-04-28 13:47:55 -05:00
reyesj2 9cec79b299 check current fleet policy cert against cert on disk
Co-authored-by: Copilot <copilot@github.com>
2026-04-28 13:34:39 -05:00
Mike Reeves c86399327b fix so-docker-refresh push for multi-arch source images
docker pull of a multi-arch tag on Docker 29.x leaves the local tag
pointing at the image index rather than the platform-specific manifest.
The subsequent docker push then tries to push every sub-manifest the
index references and fails on layers we never fetched.

Resolve the local-platform manifest digest from the upstream index via
docker buildx imagetools inspect, pull by that digest, and re-tag locally
to the canonical tag. The signing flow and the existing tag/push to the
embedded registry are unchanged.
2026-04-28 14:27:59 -04:00
21 changed files with 385 additions and 85 deletions
+18 -5
View File
@@ -164,8 +164,8 @@ update_docker_containers() {
# Pull down the trusted docker image
run_check_net_err \
"docker pull $CONTAINER_REGISTRY/$IMAGEREPO/$image" \
"Could not pull $image, please ensure connectivity to $CONTAINER_REGISTRY" >> "$LOG_FILE" 2>&1
"Could not pull $image, please ensure connectivity to $CONTAINER_REGISTRY" >> "$LOG_FILE" 2>&1
# Get signature
run_check_net_err \
"curl --retry 5 --retry-delay 60 -A '$CURLTYPE/$CURRENTVERSION/$OS/$(uname -r)' $sig_url --output $SIGNPATH/$image.sig" \
@@ -189,11 +189,24 @@ update_docker_containers() {
HOSTNAME=$(hostname)
fi
docker tag $CONTAINER_REGISTRY/$IMAGEREPO/$image $HOSTNAME:5000/$IMAGEREPO/$image >> "$LOG_FILE" 2>&1 || {
echo "Unable to tag $image" >> "$LOG_FILE" 2>&1
echo "Unable to tag $image" >> "$LOG_FILE" 2>&1
exit 1
}
docker push $HOSTNAME:5000/$IMAGEREPO/$image >> "$LOG_FILE" 2>&1 || {
echo "Unable to push $image" >> "$LOG_FILE" 2>&1
# Push to the embedded registry via a registry-to-registry copy. Avoids
# `docker push`, which on Docker 29.x with the containerd image store
# represents freshly-pulled images as an index whose layer content
# isn't reachable through the push path. The local `docker tag` above
# is preserved so so-image-pull's `:5000` existence check still works.
# Pin to the digest already gpg-verified above so we copy exactly the
# bytes we approved.
local VERIFIED_REF
VERIFIED_REF=$(echo "$DOCKERINSPECT" | jq -r ".[0].RepoDigests[] | select(. | contains(\"$CONTAINER_REGISTRY\"))" | head -n 1)
if [ -z "$VERIFIED_REF" ] || [ "$VERIFIED_REF" = "null" ]; then
echo "Unable to determine verified digest for $image" >> "$LOG_FILE" 2>&1
exit 1
fi
docker buildx imagetools create --tag $HOSTNAME:5000/$IMAGEREPO/$image "$VERIFIED_REF" >> "$LOG_FILE" 2>&1 || {
echo "Unable to copy $image to embedded registry" >> "$LOG_FILE" 2>&1
exit 1
}
fi
+1 -1
View File
@@ -227,7 +227,7 @@ if [[ $EXCLUDE_KNOWN_ERRORS == 'Y' ]]; then
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|from NIC checksum offloading" # zeek reporter.log
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|marked for removal" # docker container getting recycled
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|tcp 127.0.0.1:6791: bind: address already in use" # so-elastic-fleet agent restarting. Seen starting w/ 8.18.8 https://github.com/elastic/kibana/issues/201459
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|TransformTask\] \[logs-(tychon|aws_billing|microsoft_defender_endpoint|armis|o365_metrics|microsoft_sentinel|snyk).*user so_kibana lacks the required permissions \[(logs|metrics)-\1" # Known issue with integrations starting transform jobs that are explicitly not allowed to start as a system user. (installed as so_elastic / so_kibana)
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|TransformTask\] \[logs-(tychon|aws_billing|microsoft_defender_endpoint|armis|o365_metrics|microsoft_sentinel|snyk|cyera|island_browser).*user so_kibana lacks the required permissions \[(logs|metrics)-\1" # Known issue with integrations starting transform jobs that are explicitly not allowed to start as a system user. This error should not be seen on fresh ES 9.3.3 installs or after SO 3.1.0 with soups addition of check_transform_health_and_reauthorize()
EXCLUDED_ERRORS="$EXCLUDED_ERRORS|manifest unknown" # appears in so-dockerregistry log for so-tcpreplay following docker upgrade to 29.2.1-1
fi
@@ -51,6 +51,16 @@ so-elastic-fleet-package-registry:
- {{ ULIMIT.name }}={{ ULIMIT.soft }}:{{ ULIMIT.hard }}
{% endfor %}
{% endif %}
wait_for_so-elastic-fleet-package-registry:
http.wait_for_successful_query:
- name: "http://localhost:8080/health"
- status: 200
- wait_for: 300
- request_interval: 15
- require:
- docker_container: so-elastic-fleet-package-registry
delete_so-elastic-fleet-package-registry_so-status.disabled:
file.uncomment:
- name: /opt/so/conf/so-status/so-status.conf
-11
View File
@@ -18,17 +18,6 @@ so-elastic-fleet-auto-configure-logstash-outputs:
- retry:
attempts: 4
interval: 30
{# Separate from above in order to catch elasticfleet-logstash.crt changes and force update to fleet output policy #}
so-elastic-fleet-auto-configure-logstash-outputs-force:
cmd.run:
- name: /usr/sbin/so-elastic-fleet-outputs-update --certs
- retry:
attempts: 4
interval: 30
- onchanges:
- x509: etc_elasticfleet_logstash_crt
- x509: elasticfleet_kafka_crt
{% endif %}
# If enabled, automatically update Fleet Server URLs & ES Connection
@@ -240,7 +240,7 @@ elastic_fleet_policy_create() {
--arg DESC "$DESC" \
--arg TIMEOUT $TIMEOUT \
--arg FLEETSERVER "$FLEETSERVER" \
'{"name": $NAME,"id":$NAME,"description":$DESC,"namespace":"default","monitoring_enabled":["logs"],"inactivity_timeout":$TIMEOUT,"has_fleet_server":$FLEETSERVER}'
'{"name": $NAME,"id":$NAME,"description":$DESC,"namespace":"default","monitoring_enabled":["logs"],"inactivity_timeout":$TIMEOUT,"has_fleet_server":$FLEETSERVER,"advanced_settings":{"agent_logging_level": "warning"}}'
)
# Create Fleet Policy
if ! fleet_api "agent_policies" -XPOST -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d "$JSON_STRING"; then
@@ -235,6 +235,16 @@ function update_kafka_outputs() {
{% endif %}
# Compare the current Elastic Fleet certificate against what is on disk
POLICY_CERT_SHA=$(jq -r '.item.ssl.certificate' <<< $RAW_JSON | openssl x509 -noout -sha256 -fingerprint)
DISK_CERT_SHA=$(openssl x509 -in /etc/pki/elasticfleet-logstash.crt -noout -sha256 -fingerprint)
if [[ "$POLICY_CERT_SHA" != "$DISK_CERT_SHA" ]]; then
printf "Certificate on disk doesn't match certificate in policy - forcing update\n"
UPDATE_CERTS=true
FORCE_UPDATE=true
fi
# Sort & hash the new list of Logstash Outputs
NEW_LIST_JSON=$(jq --compact-output --null-input '$ARGS.positional' --args -- "${NEW_LIST[@]}")
NEW_HASH=$(sha256sum <<< "$NEW_LIST_JSON" | awk '{print $1}')
+33
View File
@@ -398,6 +398,7 @@ firewall:
- elasticsearch_rest
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -410,6 +411,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -427,6 +429,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- sensoroni
searchnode:
portgroups:
@@ -437,6 +440,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -450,6 +454,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -459,6 +464,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -492,6 +498,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- elastic_agent_control
@@ -502,6 +509,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -610,6 +618,7 @@ firewall:
- elasticsearch_rest
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -622,6 +631,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -639,6 +649,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- sensoroni
searchnode:
portgroups:
@@ -649,6 +660,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -662,6 +674,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -671,6 +684,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -702,6 +716,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- elastic_agent_control
@@ -712,6 +727,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -820,6 +836,7 @@ firewall:
- elasticsearch_rest
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -832,6 +849,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -849,6 +867,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- sensoroni
searchnode:
portgroups:
@@ -858,6 +877,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -870,6 +890,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -879,6 +900,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -912,6 +934,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- elastic_agent_control
@@ -922,6 +945,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -1040,6 +1064,7 @@ firewall:
- elasticsearch_rest
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -1052,6 +1077,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -1063,6 +1089,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- beats_5044
@@ -1074,6 +1101,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- redis
@@ -1083,6 +1111,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- redis
@@ -1093,6 +1122,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -1129,6 +1159,7 @@ firewall:
portgroups:
- docker_registry
- influxdb
- postgres
- sensoroni
- yum
- elastic_agent_control
@@ -1139,6 +1170,7 @@ firewall:
- yum
- docker_registry
- influxdb
- postgres
- elastic_agent_control
- elastic_agent_data
- elastic_agent_update
@@ -1482,6 +1514,7 @@ firewall:
- kibana
- redis
- influxdb
- postgres
- elasticsearch_rest
- elasticsearch_node
- elastic_agent_control
-13
View File
@@ -1,6 +1,5 @@
{% from 'vars/globals.map.jinja' import GLOBALS %}
{% from 'docker/docker.map.jinja' import DOCKERMERGED %}
{% from 'telegraf/map.jinja' import TELEGRAFMERGED %}
{% import_yaml 'firewall/defaults.yaml' as FIREWALL_DEFAULT %}
{# add our ip to self #}
@@ -56,16 +55,4 @@
{% endif %}
{# Open Postgres (5432) to minion hostgroups when Telegraf is configured to write to Postgres #}
{% set TG_OUT = TELEGRAFMERGED.output | upper %}
{% if TG_OUT in ['POSTGRES', 'BOTH'] %}
{% if role.startswith('manager') or role == 'standalone' or role == 'eval' %}
{% for r in ['sensor', 'searchnode', 'heavynode', 'receiver', 'fleet', 'idh', 'desktop', 'import'] %}
{% if FIREWALL_DEFAULT.firewall.role[role].chain["DOCKER-USER"].hostgroups[r] is defined %}
{% do FIREWALL_DEFAULT.firewall.role[role].chain["DOCKER-USER"].hostgroups[r].portgroups.append('postgres') %}
{% endif %}
{% endfor %}
{% endif %}
{% endif %}
{% set FIREWALL_MERGED = salt['pillar.get']('firewall', FIREWALL_DEFAULT.firewall, merge=True) %}
+130
View File
@@ -485,6 +485,130 @@ elasticsearch_backup_index_templates() {
tar -czf /nsm/backup/3.0.0_elasticsearch_index_templates.tar.gz -C /opt/so/conf/elasticsearch/templates/index/ .
}
elasticfleet_set_agent_logging_level_warn() {
. /usr/sbin/so-elastic-fleet-common
local current_agent_policies
if ! current_agent_policies=$(fleet_api "agent_policies?perPage=1000"); then
echo "Warning: unable to retrieve Fleet agent policies"
return 0
fi
# Only updating policies that are within Security Onion defaults and do not already have any user configured advanced_settings.
local policies_to_update
policies_to_update=$(jq -c '
.items[]
| select(has("advanced_settings") | not)
| select(
.id == "so-grid-nodes_general"
or .id == "so-grid-nodes_heavy"
or .id == "endpoints-initial"
or (.id | startswith("FleetServer_"))
)
' <<< "$current_agent_policies")
if [[ -z "$policies_to_update" ]]; then
return 0
fi
while IFS= read -r policy; do
[[ -z "$policy" ]] && continue
local policy_id policy_name policy_namespace
policy_id=$(jq -r '.id' <<< "$policy")
policy_name=$(jq -r '.name' <<< "$policy")
policy_namespace=$(jq -r '.namespace' <<< "$policy")
local update_logging
update_logging=$(jq -n \
--arg name "$policy_name" \
--arg namespace "$policy_namespace" \
'{name: $name, namespace: $namespace, advanced_settings: {agent_logging_level: "warning"}}'
)
echo "Setting elastic agent_logging_level to warning on policy '$policy_name' ($policy_id)."
if ! fleet_api "agent_policies/$policy_id" -XPUT -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d "$update_logging" >/dev/null; then
echo " warning: failed to update agent policy '$policy_name' ($policy_id)" >&2
fi
done <<< "$policies_to_update"
}
check_transform_health_and_reauthorize() {
. /usr/sbin/so-elastic-fleet-common
echo "Checking integration transform jobs for unhealthy / unauthorized status..."
local transforms_doc stats_doc installed_doc
if ! transforms_doc=$(so-elasticsearch-query "_transform/_all?size=1000" --fail --retry 3 --retry-delay 5 2>/dev/null); then
echo "Unable to query for transform jobs, skipping reauthorization."
return 0
fi
if ! stats_doc=$(so-elasticsearch-query "_transform/_all/_stats?size=1000" --fail --retry 3 --retry-delay 5 2>/dev/null); then
echo "Unable to query for transform job stats, skipping reauthorization."
return 0
fi
if ! installed_doc=$(fleet_api "epm/packages/installed?perPage=500"); then
echo "Unable to list installed Fleet packages, skipping reauthorization."
return 0
fi
# Get all transforms that meet the following
# - unhealthy (any non-green health status)
# - metadata has run_as_kibana_system: false (this fix is specific to transforms started prior to Kibana 9.3.3)
# - are not orphaned (integration is not somehow missing/corrupt/uninstalled)
local unhealthy_transforms
unhealthy_transforms=$(jq -c -n \
--argjson t "$transforms_doc" \
--argjson s "$stats_doc" \
--argjson i "$installed_doc" '
($i.items | map({key: .name, value: .version}) | from_entries) as $pkg_ver
| ($s.transforms | map({key: .id, value: .health.status}) | from_entries) as $health
| [ $t.transforms[]
| select(._meta.run_as_kibana_system == false)
| select(($health[.id] // "unknown") != "green")
| {id, pkg: ._meta.package.name, ver: ($pkg_ver[._meta.package.name])}
]
| if length == 0 then empty else . end
| (map(select(.ver == null)) | map({orphan: .id})[]),
(map(select(.ver != null))
| group_by(.pkg)
| map({pkg: .[0].pkg, ver: .[0].ver, transformIds: map(.id)})[])
')
if [[ -z "$unhealthy_transforms" ]]; then
return 0
fi
local unhealthy_count
unhealthy_count=$(jq -s '[.[].transformIds? // empty | .[]] | length' <<< "$unhealthy_transforms")
echo "Found $unhealthy_count transform(s) needing reauthorization."
local total_failures=0
while IFS= read -r transform; do
[[ -z "$transform" ]] && continue
if jq -e 'has("orphan")' <<< "$transform" >/dev/null 2>&1; then
echo "Skipping transform not owned by any installed Fleet package: $(jq -r '.orphan' <<< "$transform")"
continue
fi
local pkg ver body resp
pkg=$(jq -r '.pkg' <<< "$transform")
ver=$(jq -r '.ver' <<< "$transform")
body=$(jq -c '{transforms: (.transformIds | map({transformId: .}))}' <<< "$transform")
echo "Reauthorizing transform(s) for ${pkg}-${ver}..."
resp=$(fleet_api "epm/packages/${pkg}/${ver}/transforms/authorize" \
-XPOST -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
-d "$body") || { echo "Could not reauthorize transform(s) for ${pkg}-${ver}"; continue; }
(( total_failures += $(jq 'map(select(.success != true)) | length' <<< "$resp" 2>/dev/null) ))
done <<< "$unhealthy_transforms"
if [[ "$total_failures" -gt 0 ]]; then
echo "Some transform(s) failed to reauthorize."
fi
}
ensure_postgres_local_pillar() {
# Postgres was added as a service after 3.0.0, so the new pillar/top.sls
# references postgres.soc_postgres / postgres.adv_postgres unconditionally.
@@ -553,6 +677,12 @@ post_to_3.1.0() {
# file_roots of its own and --local would fail with "No matching sls found".
salt-call state.apply postgres.telegraf_users queue=True || true
# Update default agent policies to use logging level warn.
elasticfleet_set_agent_logging_level_warn || true
# Check for unhealthy / unauthorized integration transform jobs and attempt reauthorizations
check_transform_health_and_reauthorize || true
POSTVERSION=3.1.0
}
+10 -9
View File
@@ -28,6 +28,14 @@ BEGIN
END
$$;
-- USAGE on the schema is the bare minimum needed to reference its tables.
-- CONNECT on the database is needed before the role can establish a session
-- at all (default privileges on a new DB grant CONNECT to PUBLIC, but if the
-- securityonion database is restricted that grant has to be explicit).
-- Password + LOGIN privileges are set later in schema_pillar.sls because
-- the password lives in pillar (secrets:pillar_master_pass) and plain SQL
-- can't substitute pillar values.
GRANT CONNECT ON DATABASE securityonion TO so_pillar_master, so_pillar_writer, so_pillar_secret_owner;
GRANT USAGE ON SCHEMA so_pillar TO so_pillar_master, so_pillar_writer, so_pillar_secret_owner;
-- Read access for ext_pillar through the views only.
@@ -37,15 +45,8 @@ GRANT SELECT ON so_pillar.v_pillar_global,
TO so_pillar_master;
GRANT EXECUTE ON FUNCTION so_pillar.fn_pillar_secrets(text) TO so_pillar_master;
-- Engine reads + drains the change queue from the salt-master process. It
-- needs SELECT to find unprocessed rows and UPDATE to mark them processed.
-- The queue contains only locator metadata (no pillar data), so the master
-- role's existing privilege footprint is unchanged in practice.
GRANT SELECT, UPDATE ON so_pillar.change_queue TO so_pillar_master;
GRANT USAGE ON SEQUENCE so_pillar.change_queue_id_seq TO so_pillar_master;
-- Writer needs INSERT (the trigger runs as table owner, so this is just for
-- direct testing / manual replays from psql).
GRANT INSERT ON so_pillar.change_queue TO so_pillar_writer;
-- (change_queue grants live in 008_change_notify.sql alongside the table itself,
-- since the table doesn't exist until 008 runs.)
-- Writer needs CRUD on pillar_entry/minion/role_member plus access to seed tables.
GRANT SELECT, INSERT, UPDATE, DELETE
@@ -75,3 +75,15 @@ CREATE TRIGGER tg_pillar_entry_notify
ON so_pillar.pillar_entry
FOR EACH ROW
EXECUTE FUNCTION so_pillar.fn_pillar_entry_notify();
-- Role grants on the change_queue table. Lived in 006_rls.sql historically but
-- moved here so the GRANT references resolve — 006 runs before this file does.
-- Engine reads + drains the change queue from the salt-master process. It
-- needs SELECT to find unprocessed rows and UPDATE to mark them processed.
-- The queue contains only locator metadata (no pillar data), so the master
-- role's existing privilege footprint is unchanged in practice.
GRANT SELECT, UPDATE ON so_pillar.change_queue TO so_pillar_master;
GRANT USAGE ON SEQUENCE so_pillar.change_queue_id_seq TO so_pillar_master;
-- Writer needs INSERT (the trigger runs as table owner, so this is just for
-- direct testing / manual replays from psql).
GRANT INSERT ON so_pillar.change_queue TO so_pillar_writer;
+49 -2
View File
@@ -4,7 +4,7 @@
# Elastic License 2.0.
{% from 'allowed_states.map.jinja' import allowed_states %}
{% if sls.split('.')[0] in allowed_states %}
{% if 'postgres' in allowed_states %}
{% from 'vars/globals.map.jinja' import GLOBALS %}
# Deploys the so_pillar schema (tables, views, audit triggers, secrets,
@@ -93,13 +93,60 @@ so_pillar_master_key_configure:
- require:
- cmd: so_pillar_apply_{{ sql_files[-1] | replace('.', '_') }}
# Set login passwords on the so_pillar_* roles. 006_rls.sql creates the roles
# as NOLOGIN with no password (plain SQL can't substitute pillar values), so
# the salt-master ext_pillar and the pg_notify_pillar engine — both of which
# connect as so_pillar_master via TCP — would fail with "password
# authentication failed" without this step. The password lives in pillar
# under secrets:pillar_master_pass (generated by setup/so-functions::secrets_pillar)
# and is the same one rendered into ext_pillar_postgres.conf.jinja and the
# engines.conf pg_notify_pillar block, so all three sides agree.
so_pillar_role_login_passwords:
cmd.run:
- name: |
docker exec -i so-postgres psql -v ON_ERROR_STOP=1 -U postgres -d securityonion <<EOSQL
ALTER ROLE so_pillar_master WITH LOGIN PASSWORD '{{ pillar['secrets']['pillar_master_pass'] }}';
ALTER ROLE so_pillar_writer WITH LOGIN PASSWORD '{{ pillar['secrets']['pillar_master_pass'] }}';
ALTER ROLE so_pillar_secret_owner WITH LOGIN PASSWORD '{{ pillar['secrets']['pillar_master_pass'] }}';
EOSQL
- require:
- cmd: so_pillar_master_key_configure
# Install psycopg2 into salt-master's bundled python so the pg_notify_pillar
# engine module can `import psycopg2`. Without this the engine's import fails
# silently in salt's loader and the engine just never starts. salt's bundled
# python at /opt/saltstack/salt/bin/python3 doesn't ship psycopg by default.
#
# Uses cmd.run with an `unless` import-test rather than pip.installed because
# pip exits non-zero if patchelf isn't on PATH (it tries to rewrite the
# psycopg2 wheel's RPATH after extraction), even though the wheel is fully
# installed and importable. salt's pip.installed surfaces the non-zero exit
# as a state failure and the cascade kills schema_pillar's downstream work.
# `import psycopg2` succeeds either way, so that's the actual readiness gate.
#
# Pip's stdout/stderr is redirected to /opt/so/log/so_pillar/psycopg2_install.log
# so the literal "ERROR: ... patchelf" line doesn't get hoovered up into
# /root/sosetup.log and then into /root/errors.log by verify_setup's
# substring-grep for "ERROR". The redirect target is preserved for
# triage if `import psycopg2` ever does fail.
so_pillar_psycopg2_in_salt_python:
cmd.run:
- name: |
mkdir -p /opt/so/log/so_pillar
/opt/saltstack/salt/bin/pip3 install --quiet psycopg2-binary \
>/opt/so/log/so_pillar/psycopg2_install.log 2>&1 \
|| true
- unless: /opt/saltstack/salt/bin/python3 -c "import psycopg2"
- require:
- cmd: so_pillar_role_login_passwords
# Run the importer once after the schema is in place. Idempotent — re-runs
# with no SLS edits produce zero row changes.
so_pillar_initial_import:
cmd.run:
- name: /usr/sbin/so-pillar-import --yes --reason 'schema_pillar.sls initial import'
- require:
- cmd: so_pillar_master_key_configure
- cmd: so_pillar_psycopg2_in_salt_python
# Flip so-yaml from dual-write to PG-canonical for managed paths now that
# the schema and importer are both in place. Bootstrap files (secrets.sls,
+20
View File
@@ -1,7 +1,27 @@
engines_dirs:
- /etc/salt/engines
# All salt-master engines must be declared in this single file.
# Salt's master.d/*.conf merge replaces top-level lists rather than
# concatenating them, so a sibling .conf with its own `engines:` list
# would silently overwrite this one (only the last loaded file's list
# would survive). Anything new — including postsalt's pg_notify_pillar
# engine, gated on postgres:so_pillar:enabled below — gets appended
# here under the same `engines:` key.
engines:
{% if salt['pillar.get']('postgres:so_pillar:enabled', False) %}
- pg_notify_pillar:
host: {{ pillar.get('postgres', {}).get('host', '127.0.0.1') }}
port: {{ pillar.get('postgres', {}).get('port', 5432) }}
dbname: securityonion
user: so_pillar_master
password: {{ pillar['secrets']['pillar_master_pass'] }}
channel: so_pillar_change
debounce_ms: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_debounce_ms', 500) }}
reconnect_backoff: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_reconnect_backoff', 5) }}
backlog_interval: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_backlog_interval', 30) }}
batch_limit: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_batch_limit', 500) }}
{% endif %}
- checkmine:
interval: 60
- pillarWatch:
+1
View File
@@ -63,6 +63,7 @@ engines_config:
file.managed:
- name: /etc/salt/master.d/engines.conf
- source: salt://salt/files/engines.conf
- template: jinja
# update the bootstrap script when used for salt-cloud
salt_bootstrap_cloud:
+1 -1
View File
@@ -10,7 +10,7 @@
# and the importer has run at least once.
{% from 'allowed_states.map.jinja' import allowed_states %}
{% if sls.split('.')[0] in allowed_states %}
{% if 'salt.master' in allowed_states %}
{% if salt['pillar.get']('postgres:so_pillar:enabled', False) %}
@@ -1,20 +0,0 @@
# /etc/salt/master.d/pg_notify_pillar_engine.conf
# Rendered by salt/salt/master/pg_notify_pillar_engine.sls.
#
# Subscribes the salt-master to so_pillar.change_queue via LISTEN
# so_pillar_change. The engine drains queued changes and re-publishes
# them on the event bus as 'so/pillar/changed'. Reactor wiring is in
# so_pillar_reactor.conf.
engines:
- pg_notify_pillar:
host: {{ pillar.get('postgres', {}).get('host', '127.0.0.1') }}
port: {{ pillar.get('postgres', {}).get('port', 5432) }}
dbname: securityonion
user: so_pillar_master
password: {{ pillar['secrets']['pillar_master_pass'] }}
channel: so_pillar_change
debounce_ms: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_debounce_ms', 500) }}
reconnect_backoff: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_reconnect_backoff', 5) }}
backlog_interval: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_backlog_interval', 30) }}
batch_limit: {{ pillar.get('postgres', {}).get('so_pillar', {}).get('engine_batch_limit', 500) }}
+12 -13
View File
@@ -3,16 +3,22 @@
# https://securityonion.net/license; you may not use this file except in compliance with the
# Elastic License 2.0.
# Deploys the pg_notify_pillar engine module + its master.d config so the
# Deploys the pg_notify_pillar engine module + its reactor config so the
# salt-master subscribes to so_pillar.change_queue and republishes changes
# on the salt event bus as so/pillar/changed. Reactor (so_pillar_changed.sls)
# matches that tag and dispatches the appropriate orch.
#
# The actual `engines:` declaration lives in salt/salt/files/engines.conf
# (jinja-rendered, also gated on postgres:so_pillar:enabled). It has to live
# in a single file because salt's master.d/*.conf merge replaces top-level
# lists rather than concatenating them — splitting `engines:` across multiple
# .conf files leaves only one loaded.
#
# Gated on the same postgres:so_pillar:enabled flag as the schema and
# ext_pillar config so the three components flip together.
{% from 'allowed_states.map.jinja' import allowed_states %}
{% if sls.split('.')[0] in allowed_states %}
{% if 'salt.master' in allowed_states %}
{% if salt['pillar.get']('postgres:so_pillar:enabled', False) %}
@@ -27,17 +33,6 @@ pg_notify_pillar_engine_module:
- watch_in:
- service: salt_master_service
pg_notify_pillar_engine_config:
file.managed:
- name: /etc/salt/master.d/pg_notify_pillar_engine.conf
- source: salt://salt/master/files/pg_notify_pillar_engine.conf.jinja
- template: jinja
- mode: '0640'
- user: root
- group: salt
- watch_in:
- service: salt_master_service
pg_notify_pillar_reactor_config:
file.managed:
- name: /etc/salt/master.d/so_pillar_reactor.conf
@@ -59,6 +54,10 @@ pg_notify_pillar_engine_module_absent:
- service: salt_master_service
pg_notify_pillar_engine_config_absent:
# No-op now: the engine config used to live in master.d/pg_notify_pillar_engine.conf
# but was folded into engines.conf to work around salt's master.d list-replace
# merge. Keep this state alive (no-op test.nop) so any old installs that
# still have the file get it cleaned up.
file.absent:
- name: /etc/salt/master.d/pg_notify_pillar_engine.conf
- watch_in:
-5
View File
@@ -24,11 +24,6 @@
{% do SOCDEFAULTS.soc.config.server.modules.elastic.update({'username': GLOBALS.elasticsearch.auth.users.so_elastic_user.user, 'password': GLOBALS.elasticsearch.auth.users.so_elastic_user.pass}) %}
{% if GLOBALS.postgres is defined and GLOBALS.postgres.auth is defined %}
{% set PG_ADMIN_PASS = salt['pillar.get']('secrets:postgres_pass', '') %}
{% do SOCDEFAULTS.soc.config.server.modules.update({'postgres': {'hostUrl': GLOBALS.manager_ip, 'port': 5432, 'username': GLOBALS.postgres.auth.users.so_postgres_user.user, 'password': GLOBALS.postgres.auth.users.so_postgres_user.pass, 'adminUser': 'postgres', 'adminPassword': PG_ADMIN_PASS, 'dbname': 'securityonion', 'sslMode': 'require', 'assistantEnabled': true, 'esHostUrl': 'https://' ~ GLOBALS.manager_ip ~ ':9200', 'esUsername': GLOBALS.elasticsearch.auth.users.so_elastic_user.user, 'esPassword': GLOBALS.elasticsearch.auth.users.so_elastic_user.pass, 'esVerifyCert': false}}) %}
{% endif %}
{% do SOCDEFAULTS.soc.config.server.modules.influxdb.update({'hostUrl': 'https://' ~ GLOBALS.influxdb_host ~ ':8086'}) %}
{% do SOCDEFAULTS.soc.config.server.modules.influxdb.update({'token': INFLUXDB_TOKEN}) %}
{% for tool in SOCDEFAULTS.soc.config.server.client.tools %}
+1 -1
View File
@@ -15,7 +15,7 @@ from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
with open("/opt/so/conf/strelka/filecheck.yaml", "r") as ymlfile:
cfg = yaml.load(ymlfile, Loader=yaml.Loader)
cfg = yaml.safe_load(ymlfile)
extract_path = cfg["filecheck"]["extract_path"]
historypath = cfg["filecheck"]["historypath"]
+51 -2
View File
@@ -1706,6 +1706,24 @@ remove_package() {
fi
}
ensure_pyyaml() {
title "Ensuring python3-pyyaml is installed"
if rpm -q python3-pyyaml >/dev/null 2>&1; then
info "python3-pyyaml already installed"
return 0
fi
info "python3-pyyaml not found, attempting to install"
set -o pipefail
dnf -y install python3-pyyaml 2>&1 | tee -a "$setup_log"
local result=$?
set +o pipefail
if [[ $result -ne 0 ]] || ! rpm -q python3-pyyaml >/dev/null 2>&1; then
error "Failed to install python3-pyyaml (exit=$result)"
fail_setup
fi
info "python3-pyyaml installed successfully"
}
# When updating the salt version, also update the version in securityonion-builds/images/iso-task/Dockerfile and salt/salt/master.defaults.yaml and salt/salt/minion.defaults.yaml
# CAUTION! SALT VERSION UPDATES - READ BELOW
# When updating the salt version, also update the version in:
@@ -1882,13 +1900,44 @@ secrets_pillar(){
if [ -z "$SO_PILLAR_KEY" ]; then
SO_PILLAR_KEY=$(get_random_value 64)
fi
umask 077
printf '%s' "$SO_PILLAR_KEY" > /opt/so/conf/postgres/so_pillar.key
# Subshell-scope the umask so it doesn't leak into subsequent so-setup
# (and salt-call) file writes. Without the (...) wrapper the umask 077
# persists for the rest of the install and every state-rendered config
# file under /opt/so/conf lands at 0600 — which breaks containers that
# bind-mount their config and run as a non-root user (the influxdb
# container, in particular, exits with "permission denied" on
# /conf/config.yaml after the gosu drop).
(
umask 077
printf '%s' "$SO_PILLAR_KEY" > /opt/so/conf/postgres/so_pillar.key
)
chmod 0400 /opt/so/conf/postgres/so_pillar.key
chown root:root /opt/so/conf/postgres/so_pillar.key
fi
}
# postsalt: flip postgres:so_pillar:enabled to True in the local pillar so
# the schema_pillar / ext_pillar_postgres / pg_notify_pillar engine states
# all activate during the install highstate. Without this the entire
# PG-canonical pillar stack short-circuits on its default-False gate and
# the install ends in legacy disk-pillar mode — defeating the point of
# being on postsalt at all. The companion enabled=False rollback just
# rewrites this file or removes the flag.
enable_so_pillar_postgres() {
local pillar_dir=/opt/so/saltstack/local/pillar/postgres
mkdir -p "$pillar_dir"
cat > "$pillar_dir/adv_postgres.sls" <<'EOPILLAR'
# postsalt: enable PG-canonical pillar mode. Generated by setup/so-functions
# during install. Flip to False here (or delete this file) to roll back to
# disk-pillar mode without wiping the so_pillar.* schema in so-postgres.
postgres:
so_pillar:
enabled: True
EOPILLAR
chown -R socore:socore "$pillar_dir"
chmod 0644 "$pillar_dir/adv_postgres.sls"
}
set_network_dev_status_list() {
readarray -t nmcli_dev_status_list <<< "$(nmcli -t -f DEVICE,STATE -c no dev status)"
export nmcli_dev_status_list
+25 -1
View File
@@ -66,6 +66,9 @@ set_timezone
# Let's see what OS we are dealing with here
detect_os
# Ensure python3-pyyaml is available before any code that may need so-yaml/PyYAML
ensure_pyyaml
# Check to see if this is the setup type of "desktop".
is_desktop=
@@ -792,10 +795,31 @@ if ! [[ -f $install_opt_file ]]; then
checkin_at_boot
set_initial_firewall_access
initialize_elasticsearch_indices "so-case so-casehistory so-assistant-session so-assistant-chat"
# run a final highstate before enabling scheduled highstates.
# run a final highstate before enabling scheduled highstates.
# this will ensure so-elasticsearch-ilm-policy-load and so-elasticsearch-templates-load have a chance to run after elasticfleet is setup
info "Running final highstate for setup"
logCmd "salt-call state.highstate -l info"
# postsalt: enable PG-canonical pillar mode now that the install is
# fully on disk. We can't flip the flag earlier — ext_pillar overlay
# would replace the elasticsearch subtree (and others) with what's
# in PG before the importer has run, dropping secrets-allowlisted
# subkeys like elasticsearch.auth.users.so_elastic_user.pass that
# elasticsearch.enabled.sls needs to render. Order:
# 1. drop adv_postgres.sls flipping the flag
# 2. refresh_pillar so the next state sees enabled=True
# 3. apply postgres.schema_pillar — deploys schema, ALTERs role
# passwords, installs psycopg2 into salt's bundled python,
# runs so-pillar-import, writes /opt/so/conf/so-yaml/mode=postgres
# 4. apply salt.master — re-renders engines.conf with the
# pg_notify_pillar engine block, drops master.d ext_pillar
# config, watch_in restarts salt-master, ext_pillar takes over
info "Enabling postsalt PG-canonical pillar mode"
enable_so_pillar_postgres
logCmd "salt-call saltutil.refresh_pillar"
logCmd "salt-call state.apply postgres.schema_pillar -l info"
logCmd "salt-call state.apply salt.master -l info"
logCmd "salt-call schedule.enable -linfo --local"
verify_setup
else