37 Commits

Author SHA1 Message Date
Mike Reeves a7efabd90d fix: tolerate pip's non-zero exit on psycopg2 patchelf step
salt's pip.installed flagged so_pillar_psycopg2_in_salt_python as
failed because pip exits non-zero when it can't find the patchelf
binary to rewrite the psycopg2 wheel's RPATH after extraction. The
wheel is fully installed and importable regardless — the patchelf
step is a cosmetic post-install rewrite, not a build dependency. But
salt's failure cascade then short-circuited so_pillar_initial_import
and the so-yaml mode flip, leaving the install in dual-pillar mode
instead of PG-canonical.

Replaced with cmd.run that runs pip with `|| true` and uses an
`import psycopg2` check as the actual readiness gate — same idea as
how salt's own bootstrap does it. Also fixed the require: ref on
so_pillar_initial_import (was `pip:`, needs to be `cmd:` for the new
state type).
2026-05-04 22:08:31 -04:00
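
The readiness gate described in a7efabd90d can be sketched roughly as follows. This is a minimal Python illustration, not the Salt state itself: `ensure_importable` and its parameters are hypothetical names, and the install command stands in for the real pip invocation. The point is only that the installer's exit status is ignored and the import check decides success.

```python
import subprocess
import sys


def ensure_importable(python, module, install_cmd):
    """Run install_cmd, ignore its exit status, then gate on a real import."""
    # Equivalent of `pip install ... || true`: a non-zero exit is tolerated.
    subprocess.run(install_cmd, check=False, capture_output=True)
    # The actual readiness gate: can the target interpreter import the module?
    probe = subprocess.run([python, "-c", f"import {module}"], capture_output=True)
    return probe.returncode == 0


# Hypothetical stand-in for an installer that exits non-zero after succeeding.
failing_install = [sys.executable, "-c", "import sys; sys.exit(1)"]
ready = ensure_importable(sys.executable, "json", failing_install)
```

In the state itself, `python` would presumably be salt's bundled interpreter and `module` would be `psycopg2`.
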
Mike Reeves 92a7bb3053 fix: get postsalt's PG-canonical pillar actually working end-to-end
Five blockers turned up the first time the so_pillar schema was applied
against a fresh standalone install. Fixing them in order:

1. 006_rls.sql ordering bug
   006 GRANTed on so_pillar.change_queue and its sequence, but the table
   isn't created until 008_change_notify.sql. 006 errored mid-file with
   "relation so_pillar.change_queue does not exist", short-circuiting the
   rest of the pillar staging chain. Moved the three change_queue grants
   into 008 alongside the table creation so each file is self-contained.

2. so_pillar_* roles unable to log in
   006 created the roles as NOLOGIN and set no password. Salt-master's
   ext_pillar (postgres) and the pg_notify_pillar engine both connect as
   so_pillar_master via TCP, so both came up with "password authentication
   failed for user so_pillar_master". Added a templated cmd.run step in
   schema_pillar.sls (so_pillar_role_login_passwords) that ALTERs all three
   roles WITH LOGIN PASSWORD pulling from secrets:pillar_master_pass — the
   same password ext_pillar_postgres.conf.jinja and the engines.conf
   pg_notify_pillar block render with.

3. Missing GRANT CONNECT ON DATABASE securityonion
   USAGE on the schema is granted in 006 but CONNECT on the database isn't.
   The engine and ext_pillar authenticated successfully, then died with "permission denied
   for database securityonion". Added the explicit GRANT CONNECT in 006.

4. psycopg2 missing from salt's bundled python
   /opt/saltstack/salt/bin/python3 doesn't ship psycopg2 by default, so
   when salt-master tries to load the pg_notify_pillar engine its
   `import psycopg2` fails inside salt's loader and the engine silently
   doesn't start (no error in the salt log — you only notice when nothing
   ever drains so_pillar.change_queue). Added a pip.installed state in
   schema_pillar.sls bound to that interpreter via bin_env.

5. engines.conf vs pg_notify_pillar_engine.conf list-replace
   Salt's master.d/*.conf merge replaces top-level lists rather than
   concatenating them. The engine config used to live in its own
   master.d/pg_notify_pillar_engine.conf with `engines: [pg_notify_pillar]`
   alongside the legacy `engines.conf` carrying `engines: [checkmine,
   pillarWatch]`. Whichever loaded last won, so the engine never showed
   up in the loaded set even when the file existed. Folded the
   pg_notify_pillar declaration into engines.conf (now jinja-rendered,
   gated on postgres:so_pillar:enabled), dropped the standalone state from
   pg_notify_pillar_engine.sls, and deleted the now-orphaned conf jinja.

End state validated against a live standalone-net install on the dev rig:
salt-master ext_pillar reads from so_pillar.* with no errors, the
pg_notify_pillar engine LISTENs on so_pillar_change and drains the
change_queue (134-row backlog → 0 within seconds), and a so-yaml replace
on a pillar key flows disk → PG → ext_pillar → salt pillar.get with the
new value visible after a saltutil.refresh_pillar.
2026-05-04 19:47:38 -04:00
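
Blocker 5 in 92a7bb3053 (list replace rather than concatenate) can be modeled in a few lines. This is a deliberately simplified sketch of the behavior the commit describes, not Salt's loader code; `merge_master_confs` is a hypothetical helper and the dicts stand in for parsed master.d files.

```python
def merge_master_confs(confs):
    """Merge config dicts in load order; a later top-level key replaces an earlier one."""
    merged = {}
    for conf in confs:
        merged.update(conf)  # lists are replaced wholesale, never concatenated
    return merged


engines_conf = {"engines": ["checkmine", "pillarWatch"]}
pg_notify_conf = {"engines": ["pg_notify_pillar"]}

# Two files declaring the same top-level key: whichever loads last wins.
split = merge_master_confs([pg_notify_conf, engines_conf])

# One file carrying the full list: nothing is left to clobber it.
folded = merge_master_confs(
    [{"engines": ["checkmine", "pillarWatch", "pg_notify_pillar"]}]
)
```

Under this model `split` loses `pg_notify_pillar` entirely, which matches the symptom in the commit message, while `folded` keeps all three engines.
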
Mike Reeves 155b5c5d66 fix: consistent allowed_states guard in postgres.schema_pillar
Same `sls.split('.')[0]` pattern as ext_pillar_postgres + pg_notify_pillar_engine.
For sls='postgres.schema_pillar' the split happened to evaluate to 'postgres',
which is in manager_states, so the guard worked accidentally — but it would
break silently if anyone ever moved the file under a deeper SLS path. Switch
to a literal `{% if 'postgres' in allowed_states %}` for the same intent-
revealing pattern as the master.d guards.
2026-05-04 19:25:14 -04:00
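
Why the old guard in 155b5c5d66 only worked by accident is easy to see outside Jinja. The snippet below is a small Python illustration; the deeper SLS path and the contents of `allowed_states` are made up for the example.

```python
def guard_key(sls):
    """Return the first dotted component, as `sls.split('.')[0]` does."""
    return sls.split(".")[0]


allowed_states = ["postgres", "salt"]  # hypothetical subset, for illustration only

current = guard_key("postgres.schema_pillar")
moved = guard_key("manager.postgres.schema_pillar")  # hypothetical deeper path

old_guard_passes_today = current in allowed_states
old_guard_passes_after_move = moved in allowed_states
literal_guard_passes = "postgres" in allowed_states  # independent of the file path
```

The literal check gives the same answer wherever the file lives, whereas the split-based check depends on the first path component.
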
Mike Reeves 3d11694d51 make so-yaml PG-canonical and add pillar-change reactor stack
Two coupled changes that together let so_pillar.* be the canonical
config store, with config edits driving service reloads automatically:

so-yaml PG-canonical mode
- Adds /opt/so/conf/so-yaml/mode (and SO_YAML_BACKEND env override) with
  three values: dual (legacy), postgres (PG-only for managed paths),
  disk (emergency rollback). Bootstrap files (secrets.sls, ca/init.sls,
  *.nodes.sls, top.sls, ...) stay disk-only regardless via the existing
  SkipPath allowlist in so_yaml_postgres.locate.
- loadYaml/writeYaml/purgeFile now route to so_pillar.* in postgres
  mode: replace/add/get all read+write the database with no disk file
  ever appearing. PG failure is fatal in postgres mode (no silent
  fallback); dual mode preserves the prior best-effort mirror.
- so_yaml_postgres gains read_yaml(path), is_pg_managed(path), and
  is_enabled() so so-yaml can answer "is this path PG-managed and is
  PG up" without reaching into private helpers.
- schema_pillar.sls writes /opt/so/conf/so-yaml/mode = postgres after
  the importer succeeds, so flipping postgres:so_pillar:enabled flips
  so-yaml's behavior in lockstep with the schema being live.

pg_notify-driven change fan-out
- 008_change_notify.sql adds so_pillar.change_queue + an AFTER trigger
  on pillar_entry that enqueues the locator and pg_notifies
  'so_pillar_change'. Queue is drained at-least-once so engine restarts
  don't lose events; pg_notify is just the wakeup signal.
- New salt-master engine pg_notify_pillar.py LISTENs on the channel,
  drains the queue with FOR UPDATE SKIP LOCKED, debounces bursts, and
  fires 'so/pillar/changed' events grouped by (scope, role, minion).
- Reactor so_pillar_changed.sls catches the tag and dispatches to
  orch.so_pillar_reload, which carries a DISPATCH map of pillar-path
  prefix -> (state sls, role grain set) so adding a new service to
  the auto-reload list is a one-line edit instead of a new reactor.
- Engine + reactor wiring is gated on the same postgres:so_pillar:enabled
  flag as the schema and ext_pillar config so the whole stack flips
  on/off together.

Tests: 21 new cases (112 total, all passing) covering mode resolution,
PG-managed detection, and PG-canonical read/write/purge routing with
the PG client stubbed.
2026-05-01 09:31:48 -04:00
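
The mode resolution introduced in 3d11694d51 might look something like the sketch below. The file path, the `SO_YAML_BACKEND` variable, and the three mode names come from the commit message. Everything else is an assumption made for illustration: the function name, the default of `dual`, the case-insensitive matching, and the fallback on unreadable or invalid values.

```python
import os

VALID_MODES = ("dual", "postgres", "disk")


def resolve_mode(environ=None, mode_file="/opt/so/conf/so-yaml/mode", default="dual"):
    """Pick the so-yaml backend: env override first, then the mode file."""
    environ = os.environ if environ is None else environ
    override = environ.get("SO_YAML_BACKEND", "").strip().lower()
    if override in VALID_MODES:
        return override
    try:
        with open(mode_file, encoding="utf-8") as fh:
            value = fh.read().strip().lower()
    except OSError:
        return default  # assumed: a missing or unreadable file means legacy behavior
    return value if value in VALID_MODES else default
```

Passing `environ` explicitly keeps the function testable without touching the real process environment.
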
Mike Reeves 3fad895d6a add so_pillar schema + ext_pillar wiring (postsalt foundation)
Lays the database-backed pillar foundation for the postsalt branch. Salt
continues to read on-disk SLS first; the new ext_pillar config overlays
values from the so_pillar.* schema in so-postgres.

- salt/postgres/files/schema/pillar/00{1..7}_*.sql: idempotent DDL for
  scope/role/role_member/minion/pillar_entry/pillar_entry_history/
  drift_log, secret pgcrypto helpers, RLS, pg_cron retention.
- salt/postgres/schema_pillar.sls: applies the SQL files inside the
  so-postgres container after it's healthy, configures the master_key
  GUC, and runs so-pillar-import once. Gated on
  postgres:so_pillar:enabled feature flag (default false).
- salt/salt/master/ext_pillar_postgres.{sls,conf.jinja}: drops
  /etc/salt/master.d/ext_pillar_postgres.conf with list-form ext_pillar
  queries (global/role/minion/secrets) and ext_pillar_first: False so
  bootstrap pillars on disk render before the PG overlay.
- salt/postgres/init.sls + salt/salt/master.sls: include the new states.

Both new state branches are guarded so a default install with the flag
off is a no-op.
2026-04-30 16:30:57 -04:00
Mike Reeves 614f32c5e0 Split postgres auth from per-minion telegraf creds
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). They drifted on first-boot and there was
no trigger to create DB roles when a new minion joined.

Split responsibilities:

- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
  admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
  pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
  flock, atomic YAML write, PyYAML safe_dump so passwords never
  round-trip through so-yaml.py's type coercion. Idempotent add, quiet
  remove.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
  instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
  roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
  refreshes the new minion's pillar before the new node highstates,
  so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
  keys (idempotent) and runs postgres.telegraf_users once to reconcile
  the DB.
2026-04-22 10:55:15 -04:00
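
The single-writer pattern 614f32c5e0 describes for so-telegraf-cred (flock, atomic write, idempotent add) can be sketched as below. This is not the script's code. JSON is swapped in for PyYAML's `safe_dump` so the example needs only the standard library, and the function name, lock-file name, user naming, and password generation are all assumptions.

```python
import fcntl
import json
import os
import secrets
import tempfile


def add_cred(path, minion_id):
    """Idempotently add a {user, pass} entry for minion_id and return it."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialize concurrent writers
        try:
            with open(path, encoding="utf-8") as fh:
                creds = json.load(fh)
        except FileNotFoundError:
            creds = {}
        if minion_id in creds:
            return creds[minion_id]  # idempotent: existing password is kept
        creds[minion_id] = {
            "user": "so_telegraf_" + minion_id,  # assumed naming
            "pass": secrets.token_urlsafe(48),   # assumed generator
        }
        # Atomic write: temp file in the same directory, then rename over the target.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(creds, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
        return creds[minion_id]
```

The lock is released when the lock file is closed at the end of the `with` block, and readers only ever see either the old file or the complete new one.
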
Mike Reeves 5f28e9b191 Move per-minion telegraf cred provisioning into so-minion
Simpler, race-free replacement for the reactor + orch + fan-out chain.

- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
  generate a random 72-char password, reuse any existing password from
  the aggregate pillar, write postgres.telegraf.{user,pass} into the
  minion's own pillar file, and update the aggregate pillar so
  postgres.telegraf_users can CREATE ROLE on the next manager apply.
  Every create<ROLE> function already calls this hook, so add / addVM /
  setup dispatches are all covered identically and synchronously.
- salt/postgres/auth.sls: strip the fanout_targets loop and the
  postgres_telegraf_minion_pillar_<safe> cmd.run block — it's now
  redundant. The state still manages the so_postgres admin user and
  writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
  registered the reactor on /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
  step and the require: it added to the newnode highstate. Back to its
  original 3/dev shape.

No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
2026-04-21 15:34:15 -04:00
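
The "generate a random 72-char password, reuse any existing password" rule from 5f28e9b191 reduces to something like this sketch. so-minion's own implementation is not shown here; the alphabet, the function name, and the shape of the aggregate mapping are assumptions.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed character set


def telegraf_password(minion_id, aggregate, length=72):
    """Reuse the password already recorded for minion_id, else generate one."""
    existing = aggregate.get(minion_id, {}).get("pass")
    if existing:
        return existing
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Reusing the recorded value is what keeps a re-run of the hook from rotating a credential that a running Telegraf is already using.
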
Mike Reeves 1abfd77351 Hide telegraf password from console and close so-minion race
Two fixes on the postgres telegraf fan-out path:

1. postgres.auth cmd.run leaked the password to the console because
   Salt always prints the Name: field and `show_changes: False` does
   not apply to cmd.run. Move the user and password into the `env:`
   attribute so the shell body still sees them via $PG_USER / $PG_PASS
   but Salt's state reporter never renders them.

2. so-minion's addMinion -> setupMinionFiles sequence removes the
   minion pillar file and rewrites it from scratch, which wipes the
   postgres.telegraf.* entries the reactor may have already written on
   salt-key accept. Add a postgres.auth fan-out step to
   orch.deploy_newnode (the orch so-minion kicks off after
   setupMinionFiles) and require it from the new minion's highstate.
   Idempotent via the existing unless: guard in postgres.auth.
2026-04-21 15:10:57 -04:00
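
The first fix in 1abfd77351 relies on a general principle: keep secrets out of the command text and hand them over through the environment. The sketch below shows the same idea with Python's `subprocess` rather than Salt's `cmd.run`; the function name and the child script are invented for the example.

```python
import os
import subprocess
import sys


def run_with_secret_env(pg_user, pg_pass):
    """Run a child whose command line names the variables but never their values."""
    script = "import os; print(os.environ['PG_USER'], len(os.environ['PG_PASS']))"
    argv = [sys.executable, "-c", script]
    env = dict(os.environ, PG_USER=pg_user, PG_PASS=pg_pass)
    done = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return argv, done.stdout.strip()
```

Anything that logs or prints `argv` sees only the variable names, which is the property the commit wants from Salt's state reporter.
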
Mike Reeves d5dc28e526 Fan postgres telegraf cred for manager on every auth run
The empty-pillar case produced a telegraf.conf with `user= password=`
which libpq misparses ("password=" gets consumed as the user value),
yielding `password authentication failed for user "password="` on
every manager without a prior fan-out (fresh install, not the salt-key
path the reactor handles).

Two fixes:

- salt/postgres/auth.sls: always fan for grains.id in addition to any
  postgres_fanout_minion from the reactor, so the manager's own pillar
  is populated on every postgres.auth run. The existing `unless` guard
  keeps re-runs idempotent.
- salt/telegraf/etc/telegraf.conf: gate the [[outputs.postgresql]]
  block on PG_USER and PG_PASS being non-empty. If a minion hasn't
  received its pillar yet the output block simply isn't rendered — the
  next highstate picks up the creds once the fan-out completes, and in
  the meantime telegraf keeps running the other outputs instead of
  erroring with a malformed connection string.
2026-04-21 14:40:19 -04:00
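
The telegraf.conf gate from d5dc28e526 amounts to "render the block only when both credentials are non-empty". A rough Python equivalent of that template logic is shown below; the function name, the `host` default, and the exact connection-string layout are assumptions, not the template's real contents.

```python
def render_postgres_output(pg_user, pg_pass, host="manager", dbname="so_telegraf"):
    """Return the output block, or an empty string when either credential is empty."""
    if not pg_user or not pg_pass:
        return ""  # block omitted; the other outputs keep running
    connection = f"host={host} user={pg_user} password={pg_pass} dbname={dbname}"
    return f'[[outputs.postgresql]]\n  connection = "{connection}"\n'
```

With the gate in place, the malformed `user= password=` string is never produced.
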
Mike Reeves 05f6503d61 Gate postgres telegraf fan-out on reactor-provided minion id
postgres.auth was running an `unless` shell check per up-minion on every
manager highstate, even when nothing had changed — N fork+python starts
of so-yaml.py add up on large grids. The work is only needed when a
specific minion's key is accepted.

- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
  pillar is set (targets that single minion). Manager highstates with
  no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
  as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
  salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
  the per-minion state.apply and do an in-shell loop over the minion
  pillar files using so-yaml.py directly. Skips minions that already
  have postgres.telegraf.user set.
2026-04-21 10:05:08 -04:00
Mike Reeves a149ea7e8f Skip per-minion pillar fan-out when cred is already in place
Every postgres.auth run was rewriting every minion pillar file via
two so-yaml.py replace calls, even when nothing had changed. Passwords
are only generated on first encounter (see the `if key not in
telegraf_users` guard) and never rotate, so re-writing the same values
on every apply is wasted work and noisy state output.

Add an `unless:` check that compares the already-written
postgres.telegraf.user to the one we'd set. If they match, skip the
fan-out entirely. On first apply for a new minion the key isn't there,
so the replace runs; on subsequent applies it's a no-op.
2026-04-21 09:59:46 -04:00
Mike Reeves bb71e44614 Write per-minion telegraf creds to each minion's own pillar file
pillar/top.sls only distributes postgres.auth to manager-class roles,
so sensors / heavynodes / searchnodes / receivers / fleet / idh /
hypervisor / desktop minions never received the postgres telegraf
password they need to write metrics. Broadcasting the aggregate
postgres.auth pillar to every role would leak the so_postgres admin
password and every other minion's cred.

Fan out per-minion credentials into each minion's own pillar file at
/opt/so/saltstack/local/pillar/minions/<id>.sls. That file is already
distributed by pillar/top.sls exclusively to the matching minion via
`- minions.{{ grains.id }}`, so each minion sees only its own
postgres.telegraf.{user,pass} and nothing else.

- salt/postgres/auth.sls: after writing the manager-scoped aggregate
  pillar, fan the per-minion creds out via so-yaml.py replace for every
  up-minion. Creates the minion pillar file if missing. Requires
  postgres_auth_pillar so the manager pillar lands first.
- salt/telegraf/etc/telegraf.conf: consume postgres:telegraf:user and
  postgres:telegraf:pass directly from the minion's own pillar instead
  of walking postgres:auth:users which isn't visible off the manager.
2026-04-21 09:57:35 -04:00
Mike Reeves 84197fb33b Move postgres backup script and cron to the postgres states
The so-postgres-backup script and its cron were living under
salt/backup/config_backup.sls, which meant the backup script and cron
were deployed independently of whether postgres was enabled/disabled.

- Relocate salt/backup/tools/sbin/so-postgres-backup to
  salt/postgres/tools/sbin/so-postgres-backup so the existing
  postgres_sbin file.recurse in postgres/config.sls picks it up with
  everything else — no separate file.managed needed.
- Remove postgres_backup_script and so_postgres_backup from
  salt/backup/config_backup.sls.
- Add cron.present for so_postgres_backup to salt/postgres/enabled.sls
  and the matching cron.absent to salt/postgres/disabled.sls so the
  cron follows the container's lifecycle.
2026-04-21 09:42:41 -04:00
Mike Reeves 89a6e7c0dd Tidy config.sls makedirs and postgres helpLinks
- config.sls: postgresconfdir creates /opt/so/conf/postgres, so the
  two subdirectories under it (postgressecretsdir, postgresinitdir)
  don't need their own makedirs — require the parent instead.
- soc_postgres.yaml: helpLink for every annotated key now points to
  'postgres' instead of the carried-over 'influxdb' slug.
2026-04-21 09:39:58 -04:00
Mike Reeves f72c30abd0 Have postgres.telegraf_users include postgres.enabled
postgres_wait_ready requires docker_container: so-postgres, which is
declared in postgres.enabled. Running postgres.telegraf_users on its own
— as the reactor orch and the soup post-upgrade step both do — errored
because Salt couldn't resolve the require.

Include postgres.enabled from postgres.telegraf_users so the container
state is always in the render. postgres.enabled already includes
telegraf_users; Salt de-duplicates the circular include and the included
states are all idempotent, so repeated application is a no-op.
2026-04-21 09:35:59 -04:00
Mike Reeves 80bf07ffd8 Flesh out soc_postgres.yaml annotations
Add Configuration-UI annotations for every postgres pillar key defined
in defaults.yaml, not just telegraf.retention_days:

- postgres.enabled          — readonly; admin-visible but toggled via state
- postgres.telegraf.retention_days — drop advanced so user-tunable knobs
  surface in the default view
- postgres.config.max_connections, shared_buffers, log_min_messages —
  user-tunable performance/verbosity knobs, not advanced
- postgres.config.listen_addresses, port, ssl, ssl_cert_file, ssl_key_file,
  ssl_ca_file, hba_file, log_destination, logging_collector,
  shared_preload_libraries, cron.database_name — infra/Salt-managed,
  marked advanced so they're visible but out of the way

No defaults.yaml change; value-side stays the same.
2026-04-20 16:36:37 -04:00
Mike Reeves b69e50542a Use TELEGRAFMERGED for telegraf.output and de-jinja pg_hba.conf
- firewall/map.jinja and postgres/telegraf_users.sls now pull the
  telegraf output selector through TELEGRAFMERGED so the defaults.yaml
  value (BOTH) is the source of truth and pillar overrides merge in
  cleanly. pillar.get with a hardcoded fallback was brittle and would
  disagree with defaults.yaml if the two ever diverged.
- Rename salt/postgres/files/pg_hba.conf.jinja to pg_hba.conf and drop
  template: jinja from config.sls — the file has no jinja besides the
  comment header.
2026-04-20 16:06:01 -04:00
Mike Reeves 3ecd19d085 Move telegraf_output from global pillar to telegraf pillar
The Telegraf backend selector lived at global.telegraf_output but it is
a Telegraf-scoped setting, not a cross-cutting grid global. Move both
the value and the UI annotation under the telegraf pillar so it shows
up alongside the other Telegraf tuning knobs in the Configuration UI.

- salt/telegraf/defaults.yaml:    add telegraf.output: BOTH
- salt/telegraf/soc_telegraf.yaml: add telegraf.output annotation
- salt/global/defaults.yaml:      remove global.telegraf_output
- salt/global/soc_global.yaml:    remove global.telegraf_output annotation
- salt/vars/globals.map.jinja:    drop telegraf_output from GLOBALS
- salt/firewall/map.jinja:        read via pillar.get('telegraf:output')
- salt/postgres/telegraf_users.sls: read via pillar.get('telegraf:output')
- salt/telegraf/etc/telegraf.conf: read via TELEGRAFMERGED.output
- salt/postgres/tools/sbin/so-stats-show: update user-facing docs

No behavioral change — default stays BOTH.
2026-04-20 16:03:02 -04:00
Mike Reeves 8225d41661 Harden postgres secrets, TLS enforcement, and admin tooling
- Deliver postgres super and app passwords via mounted 0600 secret files
  (POSTGRES_PASSWORD_FILE, SO_POSTGRES_PASS_FILE) instead of plaintext env
  vars visible in docker inspect output
- Mount a managed pg_hba.conf that only allows local trust and hostssl
  scram-sha-256 so TCP clients cannot negotiate cleartext sessions
- Restrict postgres.key to 0400 and ensure owner/group 939
- Set umask 0077 on so-postgres-backup output
- Validate host values in so-stats-show against [A-Za-z0-9._-] before SQL
  interpolation so a compromised minion cannot inject SQL via a tag value
- Coerce postgres:telegraf:retention_days to int before rendering into SQL
- Escape single quotes when rendering pillar values into postgresql.conf
- Own postgres tooling in /usr/sbin as root:root so a container escape
  cannot rewrite admin scripts
- Gate ES migration TLS verification on esVerifyCert (default false,
  matching the elastic module's existing pattern)
2026-04-20 12:36:17 -04:00
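
Two of the hardening steps in 8225d41661, the host allowlist and the integer coercion, follow a common validate-before-interpolate pattern. The sketch below shows that pattern in Python; it is not the code from so-stats-show or the Salt templates, and the function names are hypothetical.

```python
import re

HOST_RE = re.compile(r"[A-Za-z0-9._-]+")


def safe_host(value):
    """Accept only host values made of the allowlisted characters."""
    if not HOST_RE.fullmatch(value):
        raise ValueError(f"unsafe host value: {value!r}")
    return value


def retention_days(value):
    """Coerce the pillar value to int so nothing but digits reaches the SQL text."""
    return int(value)
```

`fullmatch` matters here: a plain `match` would accept a value that merely starts with safe characters.
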
Mike Reeves 3f46caaf02 Revoke PUBLIC CONNECT on securityonion database
Per-minion telegraf roles inherit CONNECT via PUBLIC by default and
could open sessions to the SOC database (though they have no readable
grants inside). Close the soft edge by revoking PUBLIC's CONNECT and
re-granting it to so_postgres only.
2026-04-17 19:10:07 -04:00
Mike Reeves f3181b204a Remove so-telegraf-trim and update retention description
pg_partman drops old partitions hourly; row-DELETE retention is
obsolete and a confusing emergency fallback on partitioned tables.
2026-04-17 19:06:16 -04:00
Mike Reeves dd39db4584 Drop so_telegraf_trim cron.absent tombstone
feature/postgres never shipped the original cron.present, so this
cleanup state is a no-op on every fresh install. The script itself
stays on disk for emergency use.
2026-04-17 18:59:39 -04:00
Mike Reeves 759880a800 Wait for TCP-ready postgres, not the init-phase Unix socket
docker-entrypoint.sh runs the init-scripts phase with listen_addresses=''
(Unix socket only). The old pg_isready check passed there and then raced
the docker_temp_server_stop shutdown before the final postgres started.
pg_isready -h 127.0.0.1 only returns success once the real CMD binds
TCP, so downstream psql execs never land during the shutdown window.
2026-04-17 16:43:41 -04:00
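
The idea behind 759880a800 is to wait until something accepts TCP connections on 127.0.0.1 instead of trusting the init-phase Unix socket. A bare-bones Python version of that wait loop is sketched below. It is a weaker check than `pg_isready -h 127.0.0.1`, which speaks the Postgres protocol; a plain TCP connect only shows that a listener exists. The function name and defaults are assumptions.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connect to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Because the init phase runs with `listen_addresses=''`, a loop like this cannot succeed until the final server has bound its TCP port.
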
Mike Reeves 21076af01e Grant so_telegraf CREATE on partman schema
pg_partman 5.x's create_parent() creates a per-parent template
table inside the partman schema at runtime, which requires CREATE on
that schema. Also extend ALTER DEFAULT PRIVILEGES so the runtime-
created template tables are accessible to so_telegraf.
2026-04-17 15:34:19 -04:00
Mike Reeves 927eba566c Grant so_telegraf access to partman schema
Telegraf calls partman.create_parent() on first write of each metric,
which needs USAGE on the partman schema, EXECUTE on its functions and
procedures, and DML on partman.part_config.
2026-04-17 15:13:08 -04:00
Mike Reeves b3fbd5c7a4 Use Go-template placeholders and shell-guarded CREATE DATABASE
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
  not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
  were passed through to Postgres literally, producing syntax errors
  on every metric's first write. Switch to {{ .table }}, {{ .columns }},
  and {{ .table|quoteLiteral }} so partitioned parents and the partman
  create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
  both init-users.sh and telegraf_users.sls with an explicit shell
  conditional. The prior idiom occasionally fired CREATE DATABASE even
  when so_telegraf already existed, producing duplicate-key failures.
2026-04-17 14:55:13 -04:00
Mike Reeves 5228668be0 Fix Telegraf→Postgres table creation and state.apply race
- Telegraf's partman template passed p_type:='native', which pg_partman
  5.x (the version shipped by postgresql-17-partman on Debian) rejects.
  Switched to 'range' so partman.create_parent() actually creates
  partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
  don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
  setup log. Dropped the matching entry from so-log-check, which scans
  container stdout where that token never appears.
2026-04-17 13:00:12 -04:00
Mike Reeves 7d07f3c8fe Create so_telegraf DB from Salt and pin pg_partman schema
init-users.sh only runs on a fresh data dir, so upgrades onto an
existing /nsm/postgres volume never got so_telegraf. Pinning partman's
schema also makes partman.part_config reliably resolvable.
2026-04-17 10:51:08 -04:00
Mike Reeves d9a9029ce5 Adopt pg_partman + pg_cron for Telegraf metric tables
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.

- Load pg_cron at server start via shared_preload_libraries and point
  cron.database_name at so_telegraf so job metadata lives alongside
  the metrics
- Telegraf create_templates override makes every new metric table a
  PARTITION BY RANGE (time) parent registered with partman.create_parent
  in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
  extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
  from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
  disk as a manual fallback
2026-04-16 17:27:15 -04:00
Mike Reeves 9fe53d9ccc Use JSONB for Telegraf fields/tags to avoid 1600-column limit
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).

- so-stats-show rewritten to use JSONB accessors
  ((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
  sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
  match the convention upstream used when removing them from
  mdengine/pcapengine/pipeline; options: list drives validation
2026-04-16 17:02:21 -04:00
Mike Reeves 470b3bd4da Commingle Telegraf metrics into shared schema
Per-minion schemas cause table count to explode (N minions * M metrics)
and the per-minion revocation story isn't worth it when retention is
short. Move all minions to a shared 'telegraf' schema while keeping
per-minion login credentials for audit.

- New so_telegraf NOLOGIN group role owns the telegraf schema; each
  per-minion role is a member and inherits insert/select via role
  inheritance
- Telegraf connection string uses options='-c role=so_telegraf' so
  tables auto-created on first write belong to the group role
- so-telegraf-trim walks the flat telegraf.* table set instead of
  per-minion schemas
- so-stats-show filters by host tag; CLI arg is now the hostname as
  tagged by Telegraf rather than a sanitized schema suffix
- Also renames so-show-stats -> so-stats-show
2026-04-16 15:40:54 -04:00
Mike Reeves d24808ff98 Fix so-show-stats tag column resolution
Telegraf's postgresql output stores tag values either as individual
columns on <metric>_tag or as a single JSONB 'tags' column, depending
on plugin version. Introspect information_schema.columns and build the
right accessor per tag instead of assuming one layout.
2026-04-15 19:28:10 -04:00
Mike Reeves cefbe01333 Add telegraf_output selector for InfluxDB/Postgres dual-write
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.

- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
  minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
  postgres.telegraf.retention_days (default 14)
2026-04-15 14:32:10 -04:00
Mike Reeves da1045e052 Fix init-users.sh password escaping for special characters
Use format() with %L for SQL literal escaping instead of raw
string interpolation. Also ALTER ROLE if user already exists
to keep password in sync with pillar.
2026-04-09 21:52:20 -04:00
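
The escaping problem fixed in da1045e052 can be illustrated outside SQL. The helper below is a simplified Python stand-in for what `format()` with `%L` does for ordinary strings: it doubles embedded single quotes. Postgres's own `%L` handling also covers backslashes and NULL, which this sketch ignores. Both function names are hypothetical, and the role name is assumed to be a trusted identifier.

```python
def quote_literal(value):
    """Wrap value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def alter_role_sql(role, password):
    """Build an ALTER ROLE statement with the password as a quoted literal."""
    return f"ALTER ROLE {role} WITH PASSWORD {quote_literal(password)}"
```

With raw interpolation, a password containing a single quote would end the literal early; with quoting, it stays inside the string.
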
Mike Reeves 46e38d39bb Enable postgres by default
Safe because postgres states are only applied to manager-type
nodes via top.sls and allowed_states.map.jinja.
2026-04-09 12:23:47 -04:00
Mike Reeves 762e73faf5 Add so-postgres host management scripts
- so-postgres-manage: wraps docker exec for psql operations
  (sql, sqlfile, shell, dblist, userlist)
- so-postgres-start/stop/restart: standard container lifecycle
- Scripts installed to /usr/sbin via file.recurse in config.sls
2026-04-09 09:55:42 -04:00
Mike Reeves 868cd11874 Add so-postgres Salt states and integration wiring
Phase 1 of the PostgreSQL central data platform:
- Salt states: init, enabled, disabled, config, ssl, auth, sostatus
- TLS via SO CA-signed certs with postgresql.conf template
- Two-tier auth: postgres superuser + so_postgres application user
- Firewall restricts port 5432 to manager-only (HA-ready)
- Wired into top.sls, pillar/top.sls, allowed_states, firewall
  containers map, docker defaults, CA signing policies, and setup
  scripts for all manager-type roles
2026-04-08 10:58:52 -04:00