275 Commits

Author SHA1 Message Date
Mike Reeves d5dc28e526 Fan postgres telegraf cred for manager on every auth run
The empty-pillar case produced a telegraf.conf with `user= password=`
which libpq misparses ("password=" gets consumed as the user value),
yielding `password authentication failed for user "password="` on
every manager without a prior fan-out (fresh install, not the salt-key
path the reactor handles).

Two fixes:

- salt/postgres/auth.sls: always fan for grains.id in addition to any
  postgres_fanout_minion from the reactor, so the manager's own pillar
  is populated on every postgres.auth run. The existing `unless` guard
  keeps re-runs idempotent.
- salt/telegraf/etc/telegraf.conf: gate the [[outputs.postgresql]]
  block on PG_USER and PG_PASS being non-empty. If a minion hasn't
  received its pillar yet the output block simply isn't rendered — the
  next highstate picks up the creds once the fan-out completes, and in
  the meantime telegraf keeps running the other outputs instead of
  erroring with a malformed connection string.
2026-04-21 14:40:19 -04:00
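
The telegraf.conf gate from the commit above can be sketched in Python. This is a minimal illustration, not the actual Jinja template: the helper name `render_pg_output`, the `host` default, and the exact connection-string layout are assumptions. Only the rule "render the block when PG_USER and PG_PASS are both non-empty" comes from the commit message.

```python
from typing import Optional


def render_pg_output(pg_user: str, pg_pass: str, host: str = "manager") -> Optional[str]:
    """Return an [[outputs.postgresql]] block, or None when creds are missing.

    Hypothetical helper: the real gate lives in the Jinja template for
    telegraf.conf. The host default and key layout are illustrative only.
    """
    # Skip the block entirely rather than emit "user= password=",
    # which libpq would misparse as user "password=".
    if not pg_user or not pg_pass:
        return None
    return (
        "[[outputs.postgresql]]\n"
        f'  connection = "host={host} user={pg_user} password={pg_pass}"\n'
    )
```

With an empty pillar the function returns `None`, so the caller emits no Postgres output and the other outputs keep running.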
Mike Reeves bb71e44614 Write per-minion telegraf creds to each minion's own pillar file
pillar/top.sls only distributes postgres.auth to manager-class roles,
so sensors / heavynodes / searchnodes / receivers / fleet / idh /
hypervisor / desktop minions never received the postgres telegraf
password they need to write metrics. Broadcasting the aggregate
postgres.auth pillar to every role would leak the so_postgres admin
password and every other minion's cred.

Fan out per-minion credentials into each minion's own pillar file at
/opt/so/saltstack/local/pillar/minions/<id>.sls. That file is already
distributed by pillar/top.sls exclusively to the matching minion via
`- minions.{{ grains.id }}`, so each minion sees only its own
postgres.telegraf.{user,pass} and nothing else.

- salt/postgres/auth.sls: after writing the manager-scoped aggregate
  pillar, fan the per-minion creds out via so-yaml.py replace for every
  up-minion. Creates the minion pillar file if missing. Requires
  postgres_auth_pillar so the manager pillar lands first.
- salt/telegraf/etc/telegraf.conf: consume postgres:telegraf:user and
  postgres:telegraf:pass directly from the minion's own pillar instead
  of walking postgres:auth:users which isn't visible off the manager.
2026-04-21 09:57:35 -04:00
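
A rough Python sketch of the fan-out idea described above, under stated assumptions: the shape of the aggregate `users` mapping and the function name `fan_out_creds` are hypothetical, and the real state shells out to so-yaml.py rather than building dictionaries. The pillar path and the `postgres.telegraf.{user,pass}` keys are taken from the commit message.

```python
from typing import Dict


def fan_out_creds(users: Dict[str, Dict[str, str]]) -> Dict[str, dict]:
    """Map each minion's pillar file path to the content only it should see.

    `users` is a hypothetical aggregate: minion id -> {"user": ..., "pass": ...}.
    """
    per_minion = {}
    for minion_id, cred in users.items():
        path = f"/opt/so/saltstack/local/pillar/minions/{minion_id}.sls"
        # Each file carries exactly one credential pair: the minion's own.
        per_minion[path] = {
            "postgres": {"telegraf": {"user": cred["user"], "pass": cred["pass"]}}
        }
    return per_minion
```

Because every output entry is built from a single input entry, no minion's file can contain another minion's password or the admin credential.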
Mike Reeves 3ecd19d085 Move telegraf_output from global pillar to telegraf pillar
The Telegraf backend selector lived at global.telegraf_output but it is
a Telegraf-scoped setting, not a cross-cutting grid global. Move both
the value and the UI annotation under the telegraf pillar so it shows
up alongside the other Telegraf tuning knobs in the Configuration UI.

- salt/telegraf/defaults.yaml:    add telegraf.output: BOTH
- salt/telegraf/soc_telegraf.yaml: add telegraf.output annotation
- salt/global/defaults.yaml:      remove global.telegraf_output
- salt/global/soc_global.yaml:    remove global.telegraf_output annotation
- salt/vars/globals.map.jinja:    drop telegraf_output from GLOBALS
- salt/firewall/map.jinja:        read via pillar.get('telegraf:output')
- salt/postgres/telegraf_users.sls: read via pillar.get('telegraf:output')
- salt/telegraf/etc/telegraf.conf: read via TELEGRAFMERGED.output
- salt/postgres/tools/sbin/so-stats-show: update user-facing docs

No behavioral change — default stays BOTH.
2026-04-20 16:03:02 -04:00
Mike Reeves 31383bd9d0 Make Telegraf Postgres templates idempotent
Use CREATE TABLE IF NOT EXISTS and a WHERE-guarded create_parent() so
a Telegraf restart can re-run the templates safely after manual DB
surgery. Add an explicit tag_table_create_templates mirroring the
plugin default with IF NOT EXISTS for the same reason.
2026-04-17 15:43:50 -04:00
Mike Reeves f11e9da83a Mark time column NOT NULL before partman.create_parent
pg_partman 5.x requires the control column to be NOT NULL; Telegraf's
generated columns are nullable by default.
2026-04-17 15:27:06 -04:00
Mike Reeves 0fddcd8fe7 Pass unquoted schema.name to partman.create_parent
pg_partman 5.x splits p_parent_table on '.' and looks up the parts as
raw identifiers, so the literal must be 'schema.name' rather than the
double-quoted form quoteLiteral emits for .table.
2026-04-17 15:22:57 -04:00
Mike Reeves af9330a9dd Escape Go-template placeholders from Jinja in telegraf.conf 2026-04-17 15:04:37 -04:00
Mike Reeves b3fbd5c7a4 Use Go-template placeholders and shell-guarded CREATE DATABASE
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
  not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
  were passed through to Postgres literally, producing syntax errors
  on every metric's first write. Switch to {{ .table }}, {{ .columns }},
  and {{ .table|quoteLiteral }} so partitioned parents and the partman
  create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
  both init-users.sh and telegraf_users.sls with an explicit shell
  conditional. The prior idiom occasionally fired CREATE DATABASE even
  when so_telegraf already existed, producing duplicate-key failures.
2026-04-17 14:55:13 -04:00
Mike Reeves 5228668be0 Fix Telegraf→Postgres table creation and state.apply race
- Telegraf's partman template passed p_type:='native', which pg_partman
  5.x (the version shipped by postgresql-17-partman on Debian) rejects.
  Switched to 'range' so partman.create_parent() actually creates
  partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
  don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
  setup log. Dropped the matching entry from so-log-check, which scans
  container stdout where that token never appears.
2026-04-17 13:00:12 -04:00
Mike Reeves d9a9029ce5 Adopt pg_partman + pg_cron for Telegraf metric tables
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.

- Load pg_cron at server start via shared_preload_libraries and point
  cron.database_name at so_telegraf so job metadata lives alongside
  the metrics
- Telegraf create_templates override makes every new metric table a
  PARTITION BY RANGE (time) parent registered with partman.create_parent
  in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
  extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
  from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
  disk as a manual fallback
2026-04-16 17:27:15 -04:00
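
To illustrate why partition-based retention is cheaper than row-by-row DELETE, here is a simplified Python model of choosing which daily partitions to drop. It is a sketch, not pg_partman's logic: the function name and the date-granularity cutoff are assumptions, and the real work is done by partman's maintenance procedure using `partman.part_config.retention`.

```python
from datetime import date, timedelta
from typing import List


def partitions_to_drop(partition_days: List[date], today: date,
                       retention_days: int) -> List[date]:
    """Return the daily partitions that lie entirely before the cutoff.

    Each partition is identified by the day it covers. Simplified model:
    a partition is dropped when its whole day is older than the cutoff.
    """
    cutoff = today - timedelta(days=retention_days)
    # A partition for day d covers [d, d + 1 day); it ends at or before
    # the cutoff exactly when d < cutoff.
    return sorted(d for d in partition_days if d < cutoff)
```

Retention then costs one drop per expired day, regardless of how many rows that day's partition holds.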
Mike Reeves 9fe53d9ccc Use JSONB for Telegraf fields/tags to avoid 1600-column limit
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).

- so-stats-show rewritten to use JSONB accessors
  ((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
  sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
  match the convention upstream used when removing them from
  mdengine/pcapengine/pipeline; options: list drives validation
2026-04-16 17:02:21 -04:00
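
The column-growth problem described above can be modelled in a few lines of Python. This is an illustration only: both helper names are hypothetical, and the sketch models table shape, not Telegraf's plugin behaviour.

```python
import json
from typing import Dict, List, Set, Tuple


def wide_columns(metrics: List[Dict[str, float]]) -> Set[str]:
    """Columns a wide table accumulates: one per distinct field name."""
    cols = {"time", "tag_id"}
    for fields in metrics:
        cols.update(fields)  # each new field name means another ADD COLUMN
    return cols


def jsonb_row(time: str, tag_id: int, fields: Dict[str, float]) -> Tuple[str, int, str]:
    """Row for the fixed (time, tag_id, fields jsonb) shape."""
    return (time, tag_id, json.dumps(fields, sort_keys=True))
```

In the wide layout the column set grows with every new field name, while every `jsonb_row` result has three elements no matter how many fields a metric carries.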
Mike Reeves 470b3bd4da Commingle Telegraf metrics into shared schema
Per-minion schemas cause table count to explode (N minions * M metrics)
and the per-minion revocation story isn't worth it when retention is
short. Move all minions to a shared 'telegraf' schema while keeping
per-minion login credentials for audit.

- New so_telegraf NOLOGIN group role owns the telegraf schema; each
  per-minion role is a member and inherits insert/select via role
  inheritance
- Telegraf connection string uses options='-c role=so_telegraf' so
  tables auto-created on first write belong to the group role
- so-telegraf-trim walks the flat telegraf.* table set instead of
  per-minion schemas
- so-stats-show filters by host tag; CLI arg is now the hostname as
  tagged by Telegraf rather than a sanitized schema suffix
- Also renames so-show-stats -> so-stats-show
2026-04-16 15:40:54 -04:00
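
The "N minions * M metrics" growth mentioned above is simple arithmetic, sketched here in Python. The function name is hypothetical and the count ignores tag tables and partitions.

```python
def metric_table_count(minions: int, metrics: int, shared_schema: bool) -> int:
    """Number of metric tables under per-minion schemas vs one shared schema."""
    if shared_schema:
        return metrics           # one table per metric, shared by all minions
    return minions * metrics     # every minion gets its own copy of each table
```

For example, with 50 minions and 40 metrics the per-minion layout needs 2000 tables, while the shared schema needs 40.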
Mike Reeves cefbe01333 Add telegraf_output selector for InfluxDB/Postgres dual-write
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.

- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
  minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
  postgres.telegraf.retention_days (default 14)
2026-04-15 14:32:10 -04:00
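
A minimal Python sketch of how a three-way selector like the one above might be interpreted. The function names are hypothetical. The allowed values, the BOTH default, and the rule that port 5432 opens only when Postgres output is active come from the commit message.

```python
from typing import Set

VALID_OUTPUTS = ("INFLUXDB", "POSTGRES", "BOTH")


def enabled_outputs(telegraf_output: str = "BOTH") -> Set[str]:
    """Translate the selector value into the set of active backends."""
    if telegraf_output not in VALID_OUTPUTS:
        raise ValueError(f"telegraf_output must be one of {VALID_OUTPUTS}")
    if telegraf_output == "BOTH":
        return {"influxdb", "postgres"}
    return {telegraf_output.lower()}


def postgres_port_open(telegraf_output: str = "BOTH") -> bool:
    """Port 5432 is opened to minions only when Postgres output is active."""
    return "postgres" in enabled_outputs(telegraf_output)
```

Deriving the firewall decision from the same function that picks the outputs keeps the two from drifting apart.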
Josh Patterson 2186872317 update telegraf lower true/false 2026-03-20 09:19:22 -04:00
Josh Patterson 7ece93d7e0 ensure bool sliders telegraf 2026-03-19 15:12:47 -04:00
Josh Patterson c2c5aea244 ensure bool sliders for each state:enabled annotation 2026-03-19 12:35:38 -04:00
Josh Patterson 74ad2990a7 Merge remote-tracking branch 'origin/3/dev' into delta 2026-03-18 13:05:02 -04:00
Josh Patterson e19e83bebb allow user defined ulimits 2026-03-18 10:38:15 -04:00
Doug Burks 930985b770 update helpLink references for new documentation 2026-03-18 09:46:45 -04:00
Josh Patterson 341471d38e DOCKER to DOCKERMERGED 2026-03-17 16:19:36 -04:00
Josh Patterson 00986dc2fd Merge remote-tracking branch 'origin/delta' into customulimit 2026-03-17 16:04:09 -04:00
Mike Reeves 2d97dfc8a1 Add customizable ulimit settings for all Docker containers
Add ulimits as a configurable advanced setting for every container,
allowing customization through the web UI. Move hardcoded ulimits
from elasticsearch and zeek into defaults.yaml and fix elasticsearch
ulimits that were incorrectly nested under the environment key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 15:10:42 -04:00
Josh Patterson 4dc377c99f DOCKER to DOCKERMERGED 2026-03-17 15:06:06 -04:00
Josh Patterson 94f454c311 cleanup file.absent 2026-03-16 15:57:15 -04:00
Jason Ertel 71839bc87f remove steno 2026-03-06 15:45:36 -05:00
Jason Ertel 2c4d833a5b update 2.4 references to 3 2026-03-05 11:05:19 -05:00
reyesj2 12b3081a62 fix agentstatus script 2026-02-25 16:39:33 -06:00
reyesj2 a99c553ada use logstash merged values for logstash metric collection 2026-01-30 11:40:12 -06:00
reyesj2 e5226b50ed disable logstash metrics collection on nodes not running logstash + fleet nodes 2026-01-27 16:37:23 -06:00
Josh Patterson 00fbc1c259 add back individual signing policies 2026-01-12 09:25:15 -05:00
Josh Patterson ee70d94e15 remove old key/crt used for telegraf on non managers 2026-01-08 17:15:35 -05:00
Josh Patterson 9960db200c Merge remote-tracking branch 'origin/2.4/dev' into bravo 2025-12-11 17:30:43 -05:00
Josh Patterson b9ff1704b0 the great ssl refactor 2025-12-11 17:30:06 -05:00
DefensiveDepth 5ab6bda639 Fixup logic 2025-12-10 17:16:35 -05:00
DefensiveDepth 9304513ce8 Add support for suricata rules load status 2025-12-04 12:26:13 -05:00
reyesj2 835b2609b6 telegraf - increase esindexsize.sh script timeout 2025-10-29 13:45:55 -05:00
Josh Patterson b0a8191f59 Merge remote-tracking branch 'origin/2.4/dev' into vlb2 2025-05-19 10:02:26 -04:00
reyesj2 870a9ff80c dedup 2025-05-16 10:24:09 -05:00
reyesj2 689db57f5f logstash isn't running on receivers or manager when kafka is the global.pipeline 2025-05-16 10:05:38 -05:00
Josh Patterson 8c37a4454c merge and fix conflicts 2025-05-06 11:55:42 -04:00
reyesj2 b4214f73f4 typo 2025-05-06 09:01:22 -05:00
reyesj2 b9da7eb35b missing globals.is_manager swap 2025-05-06 08:58:47 -05:00
reyesj2 fd02950864 use globals.is_manager 2025-05-02 13:36:28 -05:00
reyesj2 044d230158 get 200 from es before collecting metrics 2025-04-30 13:05:36 -05:00
reyesj2 b918a5e256 old attempt 2025-04-29 16:05:55 -05:00
reyesj2 1ddc653a52 fix input error in agentstatus script 2025-04-29 13:40:39 -05:00
reyesj2 85f5f75c84 use salt location for es curl.config 2025-04-29 12:42:05 -05:00
reyesj2 3cb3281cd5 add metrics for es index sizes 2025-04-29 12:38:41 -05:00
Josh Patterson 142609ea67 Merge remote-tracking branch 'origin/2.4/dev' into vlb2 2025-04-24 09:41:27 -04:00
reyesj2 400739736d add monitored mounts, ignores docker overlays 2025-04-23 15:02:23 -05:00