106 Commits

Author SHA1 Message Date
Mike Reeves
614f32c5e0 Split postgres auth from per-minion telegraf creds
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). They drifted on first-boot and there was
no trigger to create DB roles when a new minion joined.

Split responsibilities:

- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
  admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
  pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
  flock, atomic YAML write, PyYAML safe_dump so passwords never
  round-trip through so-yaml.py's type coercion. Idempotent add, quiet
  remove.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
  instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
  roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
  refreshes the new minion's pillar before the new node highstates,
  so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
  keys (idempotent) and runs postgres.telegraf_users once to reconcile
  the DB.
2026-04-22 10:55:15 -04:00
Mike Reeves
d5dc28e526 Fan postgres telegraf cred for manager on every auth run
The empty-pillar case produced a telegraf.conf with `user= password=`
which libpq misparses ("password=" gets consumed as the user value),
yielding `password authentication failed for user "password="` on
every manager without a prior fan-out (fresh install, not the salt-key
path the reactor handles).

Two fixes:

- salt/postgres/auth.sls: always fan for grains.id in addition to any
  postgres_fanout_minion from the reactor, so the manager's own pillar
  is populated on every postgres.auth run. The existing `unless` guard
  keeps re-runs idempotent.
- salt/telegraf/etc/telegraf.conf: gate the [[outputs.postgresql]]
  block on PG_USER and PG_PASS being non-empty. If a minion hasn't
  received its pillar yet the output block simply isn't rendered — the
  next highstate picks up the creds once the fan-out completes, and in
  the meantime telegraf keeps running the other outputs instead of
  erroring with a malformed connection string.
2026-04-21 14:40:19 -04:00
Mike Reeves
bb71e44614 Write per-minion telegraf creds to each minion's own pillar file
pillar/top.sls only distributes postgres.auth to manager-class roles,
so sensors / heavynodes / searchnodes / receivers / fleet / idh /
hypervisor / desktop minions never received the postgres telegraf
password they need to write metrics. Broadcasting the aggregate
postgres.auth pillar to every role would leak the so_postgres admin
password and every other minion's cred.

Fan out per-minion credentials into each minion's own pillar file at
/opt/so/saltstack/local/pillar/minions/<id>.sls. That file is already
distributed by pillar/top.sls exclusively to the matching minion via
`- minions.{{ grains.id }}`, so each minion sees only its own
postgres.telegraf.{user,pass} and nothing else.

- salt/postgres/auth.sls: after writing the manager-scoped aggregate
  pillar, fan the per-minion creds out via so-yaml.py replace for every
  up-minion. Creates the minion pillar file if missing. Requires
  postgres_auth_pillar so the manager pillar lands first.
- salt/telegraf/etc/telegraf.conf: consume postgres:telegraf:user and
  postgres:telegraf:pass directly from the minion's own pillar instead
  of walking postgres:auth:users which isn't visible off the manager.
2026-04-21 09:57:35 -04:00
Mike Reeves
3ecd19d085 Move telegraf_output from global pillar to telegraf pillar
The Telegraf backend selector lived at global.telegraf_output but it is
a Telegraf-scoped setting, not a cross-cutting grid global. Move both
the value and the UI annotation under the telegraf pillar so it shows
up alongside the other Telegraf tuning knobs in the Configuration UI.

- salt/telegraf/defaults.yaml:    add telegraf.output: BOTH
- salt/telegraf/soc_telegraf.yaml: add telegraf.output annotation
- salt/global/defaults.yaml:      remove global.telegraf_output
- salt/global/soc_global.yaml:    remove global.telegraf_output annotation
- salt/vars/globals.map.jinja:    drop telegraf_output from GLOBALS
- salt/firewall/map.jinja:        read via pillar.get('telegraf:output')
- salt/postgres/telegraf_users.sls: read via pillar.get('telegraf:output')
- salt/telegraf/etc/telegraf.conf: read via TELEGRAFMERGED.output
- salt/postgres/tools/sbin/so-stats-show: update user-facing docs

No behavioral change — default stays BOTH.
2026-04-20 16:03:02 -04:00
Mike Reeves
31383bd9d0 Make Telegraf Postgres templates idempotent
Use CREATE TABLE IF NOT EXISTS and a WHERE-guarded create_parent() so
a Telegraf restart can re-run the templates safely after manual DB
surgery. Add an explicit tag_table_create_templates mirroring the
plugin default with IF NOT EXISTS for the same reason.
2026-04-17 15:43:50 -04:00
Mike Reeves
f11e9da83a Mark time column NOT NULL before partman.create_parent
pg_partman 5.x requires the control column to be NOT NULL; Telegraf's
generated columns are nullable by default.
2026-04-17 15:27:06 -04:00
Mike Reeves
0fddcd8fe7 Pass unquoted schema.name to partman.create_parent
pg_partman 5.x splits p_parent_table on '.' and looks up the parts as
raw identifiers, so the literal must be 'schema.name' rather than the
double-quoted form quoteLiteral emits for .table.
2026-04-17 15:22:57 -04:00
Mike Reeves
af9330a9dd Escape Go-template placeholders from Jinja in telegraf.conf 2026-04-17 15:04:37 -04:00
Mike Reeves
b3fbd5c7a4 Use Go-template placeholders and shell-guarded CREATE DATABASE
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
  not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
  were passed through to Postgres literally, producing syntax errors
  on every metric's first write. Switch to {{ .table }}, {{ .columns }},
  and {{ .table|quoteLiteral }} so partitioned parents and the partman
  create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
  both init-users.sh and telegraf_users.sls with an explicit shell
  conditional. The prior idiom occasionally fired CREATE DATABASE even
  when so_telegraf already existed, producing duplicate-key failures.
2026-04-17 14:55:13 -04:00
Mike Reeves
5228668be0 Fix Telegraf→Postgres table creation and state.apply race
- Telegraf's partman template passed p_type:='native', which pg_partman
  5.x (the version shipped by postgresql-17-partman on Debian) rejects.
  Switched to 'range' so partman.create_parent() actually creates
  partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
  don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
  setup log. Dropped the matching entry from so-log-check, which scans
  container stdout where that token never appears.
2026-04-17 13:00:12 -04:00
Mike Reeves
d9a9029ce5 Adopt pg_partman + pg_cron for Telegraf metric tables
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.

- Load pg_cron at server start via shared_preload_libraries and point
  cron.database_name at so_telegraf so job metadata lives alongside
  the metrics
- Telegraf create_templates override makes every new metric table a
  PARTITION BY RANGE (time) parent registered with partman.create_parent
  in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
  extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
  from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
  disk as a manual fallback
2026-04-16 17:27:15 -04:00
Mike Reeves
9fe53d9ccc Use JSONB for Telegraf fields/tags to avoid 1600-column limit
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).

- so-stats-show rewritten to use JSONB accessors
  ((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
  sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
  match the convention upstream used when removing them from
  mdengine/pcapengine/pipeline; options: list drives validation
2026-04-16 17:02:21 -04:00
Mike Reeves
470b3bd4da Comingle Telegraf metrics into shared schema
Per-minion schemas cause table count to explode (N minions * M metrics)
and the per-minion revocation story isn't worth it when retention is
short. Move all minions to a shared 'telegraf' schema while keeping
per-minion login credentials for audit.

- New so_telegraf NOLOGIN group role owns the telegraf schema; each
  per-minion role is a member and inherits insert/select via role
  inheritance
- Telegraf connection string uses options='-c role=so_telegraf' so
  tables auto-created on first write belong to the group role
- so-telegraf-trim walks the flat telegraf.* table set instead of
  per-minion schemas
- so-stats-show filters by host tag; CLI arg is now the hostname as
  tagged by Telegraf rather than a sanitized schema suffix
- Also renames so-show-stats -> so-stats-show
2026-04-16 15:40:54 -04:00
Mike Reeves
cefbe01333 Add telegraf_output selector for InfluxDB/Postgres dual-write
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.

- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
  minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
  postgres.telegraf.retention_days (default 14)
2026-04-15 14:32:10 -04:00
Josh Patterson
2186872317 update telegraf lower true/false 2026-03-20 09:19:22 -04:00
Josh Patterson
7ece93d7e0 ensure bool sliders telegraf 2026-03-19 15:12:47 -04:00
reyesj2
a99c553ada use logstash merged values for logstash metric collection 2026-01-30 11:40:12 -06:00
reyesj2
e5226b50ed disable logstash metrics collection on nodes not running logstash + fleet nodes 2026-01-27 16:37:23 -06:00
reyesj2
835b2609b6 telegraf - increase esindexsize.sh script timeout 2025-10-29 13:45:55 -05:00
reyesj2
870a9ff80c dedup 2025-05-16 10:24:09 -05:00
reyesj2
689db57f5f logstash isn't running on receivers or manager when kafka is the global.pipeline 2025-05-16 10:05:38 -05:00
reyesj2
fd02950864 use globals.is_manager 2025-05-02 13:36:28 -05:00
reyesj2
b918a5e256 old attempt 2025-04-29 16:05:55 -05:00
reyesj2
3cb3281cd5 add metrics for es index sizes 2025-04-29 12:38:41 -05:00
reyesj2
400739736d add monitored mounts, ignores docker overlays 2025-04-23 15:02:23 -05:00
reyesj2
80b1d51f76 wrong location for global.pipeline check
Signed-off-by: reyesj2 <94730068+reyesj2@users.noreply.github.com>
2024-06-13 08:50:53 -04:00
reyesj2
9c31622598 telegraft should only include jolokia config when Kafka is set as the global.pipeline
Signed-off-by: reyesj2 <94730068+reyesj2@users.noreply.github.com>
2024-06-12 15:42:00 -04:00
reyesj2
59097070ef Revert "Remove unneeded jolokia aggregate metrics to reduce data ingested to influx"
This reverts commit 1c1a1a1d3f.
2024-05-28 12:17:43 -04:00
reyesj2
1c1a1a1d3f Remove unneeded jolokia aggregate metrics to reduce data ingested to influx
Signed-off-by: reyesj2 <94730068+reyesj2@users.noreply.github.com>
2024-05-28 11:14:19 -04:00
reyesj2
15a0b959aa Add jolokia metrics for influxdb dashboard
Signed-off-by: reyesj2 <94730068+reyesj2@users.noreply.github.com>
2024-05-28 10:51:39 -04:00
reyesj2
dff609d829 Add basic read-only metric collection from Kafka
Signed-off-by: reyesj2 <94730068+reyesj2@users.noreply.github.com>
2024-05-08 16:13:09 -04:00
Jason Ertel
25c39540c8 fix import stats 2023-12-11 14:48:46 -05:00
m0duspwnens
5278601e5d manage telegraf scripts with a defaults file assigned per node type 2023-08-07 11:18:35 -04:00
Jason Ertel
46371aaaf5 Monitor all mount points for simplicity 2023-06-09 09:14:36 -04:00
m0duspwnens
20f706f165 enable/disable telegraf in ui 2023-05-11 12:12:25 -04:00
m0duspwnens
b6d55bedc8 make influxdb token accessible to all nodes 2023-03-06 13:50:17 -05:00
Jason Ertel
0eec8b22a2 influx upgrade 2023-02-09 18:27:14 -05:00
Jason Ertel
0e50d36da6 upgrade influx 2023-02-09 16:18:04 -05:00
m0duspwnens
88107fe0df remove filebeat and redis(commented out) from telegraf config 2023-01-24 08:59:51 -05:00
Mike Reeves
308228620a Specify Influxdb host 2022-12-22 13:05:33 -05:00
Mike Reeves
676aec7576 Add config map 2022-12-16 11:22:53 -05:00
Mike Reeves
5badfb9cf5 Fix pillar 2022-12-16 08:38:31 -05:00
Mike Reeves
8a0991afd0 Fix pillar 2022-12-15 15:05:57 -05:00
Mike Reeves
121d07733f Merge the defaults and pillar for telegraf 2022-12-15 13:29:31 -05:00
Mike Reeves
e55086230d Merge the defaults and pillar for telegraf 2022-12-15 13:28:29 -05:00
Mike Reeves
d37a4b14ca Spelling error 2022-12-15 12:02:01 -05:00
Mike Reeves
fd27044471 Spelling error 2022-12-15 11:57:06 -05:00
Mike Reeves
ed87b08fc1 Spelling error 2022-12-15 10:59:07 -05:00
Mike Reeves
28e8c54443 Wire telegraf initial commit 2022-12-15 10:43:58 -05:00
m0duspwnens
b526532ab6 use global vars in states 2022-10-11 11:57:15 -04:00