Commit Graph

34 Commits

Author SHA1 Message Date
Mike Reeves 3d11694d51 make so-yaml PG-canonical and add pillar-change reactor stack
Two coupled changes that together let so_pillar.* be the canonical
config store, with config edits driving service reloads automatically:

so-yaml PG-canonical mode
- Adds /opt/so/conf/so-yaml/mode (and an SO_YAML_BACKEND env override) with
  three values: dual (legacy), postgres (PG-only for managed paths), and
  disk (emergency rollback); mode resolution is sketched after this list.
  Bootstrap files (secrets.sls, ca/init.sls, *.nodes.sls, top.sls, ...)
  stay disk-only regardless via the existing SkipPath allowlist in
  so_yaml_postgres.locate.
- loadYaml/writeYaml/purgeFile now route to so_pillar.* in postgres
  mode: replace/add/get all read+write the database with no disk file
  ever appearing. PG failure is fatal in postgres mode (no silent
  fallback); dual mode preserves the prior best-effort mirror.
- so_yaml_postgres gains read_yaml(path), is_pg_managed(path), and
  is_enabled() so so-yaml can answer "is this path PG-managed and is
  PG up" without reaching into private helpers.
- schema_pillar.sls writes /opt/so/conf/so-yaml/mode = postgres after
  the importer succeeds, so flipping postgres:so_pillar:enabled flips
  so-yaml's behavior in lockstep with the schema being live.
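
A minimal sketch of the mode resolution from the first bullet, assuming only
the mode-file path and env-var name given above; the function name and the
fall-back-to-dual behavior are illustrative, not the actual so-yaml.py code:

```python
import os

MODE_FILE = "/opt/so/conf/so-yaml/mode"
VALID_MODES = {"dual", "postgres", "disk"}

def resolve_backend_mode():
    # Env override wins, then the mode file, then the legacy default.
    mode = os.environ.get("SO_YAML_BACKEND", "").strip()
    if not mode and os.path.isfile(MODE_FILE):
        with open(MODE_FILE) as f:
            mode = f.read().strip()
    mode = mode.lower() or "dual"
    # Unknown values fall back to dual rather than failing hard (assumption).
    return mode if mode in VALID_MODES else "dual"
```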

pg_notify-driven change fan-out
- 008_change_notify.sql adds so_pillar.change_queue + an AFTER trigger
  on pillar_entry that enqueues the locator and pg_notifies
  'so_pillar_change'. Queue is drained at-least-once so engine restarts
  don't lose events; pg_notify is just the wakeup signal.
- New salt-master engine pg_notify_pillar.py LISTENs on the channel,
  drains the queue with FOR UPDATE SKIP LOCKED, debounces bursts, and
  fires 'so/pillar/changed' events grouped by (scope, role, minion);
  the listen/drain loop is sketched after this list.
- Reactor so_pillar_changed.sls catches the tag and dispatches to
  orch.so_pillar_reload, which carries a DISPATCH map of pillar-path
  prefix -> (state sls, role grain set) so adding a new service to
  the auto-reload list is a one-line edit instead of a new reactor.
- Engine + reactor wiring is gated on the same postgres:so_pillar:enabled
  flag as the schema and ext_pillar config so the whole stack flips
  on/off together.
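
A hedged sketch of the LISTEN-then-drain pattern described above, using
psycopg2. The queue columns (id, locator) are assumed from the "enqueues the
locator" wording, fire_event stands in for the Salt event bus, and debouncing
is omitted; this does not mirror pg_notify_pillar.py line for line:

```python
import select

import psycopg2

def drain(conn, fire_event):
    # At-least-once: a row is deleted only after its event has been handled,
    # all inside one transaction guarded by FOR UPDATE SKIP LOCKED.
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        cur.execute(
            "SELECT id, locator FROM so_pillar.change_queue "
            "ORDER BY id FOR UPDATE SKIP LOCKED"
        )
        for row_id, locator in cur.fetchall():
            fire_event("so/pillar/changed", {"locator": locator})
            cur.execute("DELETE FROM so_pillar.change_queue WHERE id = %s", (row_id,))
        cur.execute("COMMIT")

def listen_forever(dsn, fire_event):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True          # needed so NOTIFY wakeups arrive promptly
    with conn.cursor() as cur:
        cur.execute("LISTEN so_pillar_change")
    drain(conn, fire_event)         # catch up on anything queued while down
    while True:
        if select.select([conn], [], [], 60) == ([], [], []):
            continue                # periodic wakeup even without notifications
        conn.poll()
        conn.notifies.clear()       # pg_notify is only the wakeup signal
        drain(conn, fire_event)
```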

Tests: 21 new cases (112 total, all passing) covering mode resolution,
PG-managed detection, and PG-canonical read/write/purge routing with
the PG client stubbed.
2026-05-01 09:31:48 -04:00
Mike Reeves 614f32c5e0 Split postgres auth from per-minion telegraf creds
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). They drifted on first boot, and there was
no trigger to create DB roles when a new minion joined.

Split responsibilities:

- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
  admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
  pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
  flock, atomic YAML write, and PyYAML safe_dump so passwords never
  round-trip through so-yaml.py's type coercion (see the sketch after
  this list). Idempotent add, quiet remove.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
  instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
  roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
  refreshes the new minion's pillar before the new node highstates,
  so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
  keys (idempotent) and runs postgres.telegraf_users once to reconcile
  the DB.
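
Roughly the single-writer pattern the so-telegraf-cred bullet describes
(flock serialization, safe_dump, atomic replace). The creds path, the side
lock file, and the pillar key layout are assumptions, not the script's actual
contents:

```python
import fcntl
import os
import tempfile

import yaml

CREDS_FILE = "/opt/so/saltstack/local/pillar/telegraf/creds.sls"   # assumed path
LOCK_FILE = CREDS_FILE + ".lock"

def add_cred(minion_id, user, password):
    # Lock a side file (never replaced) so concurrent writers serialize cleanly.
    with open(LOCK_FILE, "w") as lockf:
        fcntl.flock(lockf, fcntl.LOCK_EX)
        data = {}
        if os.path.exists(CREDS_FILE):
            with open(CREDS_FILE) as f:
                data = yaml.safe_load(f) or {}
        creds = data.setdefault("telegraf", {}).setdefault("creds", {})
        if minion_id in creds:
            return                                   # idempotent add
        creds[minion_id] = {"user": user, "pass": password}
        # safe_dump avoids so-yaml.py's type coercion, and temp-file + rename
        # makes the write atomic for concurrent readers.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CREDS_FILE))
        with os.fdopen(fd, "w") as tmpf:
            yaml.safe_dump(data, tmpf, default_flow_style=False)
        os.replace(tmp, CREDS_FILE)
```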
2026-04-22 10:55:15 -04:00
Mike Reeves 5f28e9b191 Move per-minion telegraf cred provisioning into so-minion
Simpler, race-free replacement for the reactor + orch + fan-out chain.

- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
  generate a random 72-char password, reuse any existing password from
  the aggregate pillar, write postgres.telegraf.{user,pass} into the
  minion's own pillar file, and update the aggregate pillar so
  postgres.telegraf_users can CREATE ROLE on the next manager apply.
  Every create<ROLE> function already calls this hook, so add / addVM /
  setup dispatches are all covered identically and synchronously.
- salt/postgres/auth.sls: strip the fanout_targets loop and the
  postgres_telegraf_minion_pillar_<safe> cmd.run block — it's now
  redundant. The state still manages the so_postgres admin user and
  writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
  registered the reactor on /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
  step and the require: it added to the newnode highstate. Back to its
  original 3/dev shape.

No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
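
The password handling from the first bullet, sketched in Python for
illustration only (so-minion itself is a shell script; the alphabet and the
helper name are assumptions):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def telegraf_password(existing=None):
    # Reuse a password already recorded in the aggregate pillar so re-running
    # the hook never rotates a credential the database already knows about.
    if existing:
        return existing
    return "".join(secrets.choice(ALPHABET) for _ in range(72))
```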
2026-04-21 15:34:15 -04:00
Mike Reeves 1abfd77351 Hide telegraf password from console and close so-minion race
Two fixes on the postgres telegraf fan-out path:

1. postgres.auth cmd.run leaked the password to the console because
   Salt always prints the Name: field and `show_changes: False` does
   not apply to cmd.run. Move the user and password into the `env:`
   attribute so the shell body still sees them via $PG_USER / $PG_PASS
   but Salt's state reporter never renders them.

2. so-minion's addMinion -> setupMinionFiles sequence removes the
   minion pillar file and rewrites it from scratch, which wipes the
   postgres.telegraf.* entries the reactor may have already written on
   salt-key accept. Add a postgres.auth fan-out step to
   orch.deploy_newnode (the orch so-minion kicks off after
   setupMinionFiles) and require it from the new minion's highstate.
   Idempotent via the existing unless: guard in postgres.auth.
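
The same idea as fix 1, illustrated outside Salt: pass the secret through the
process environment so it never appears in the command text that gets printed
or logged. This is plain subprocess, not cmd.run, and the psql invocation is
illustrative:

```python
import os
import subprocess

def set_role_password(user, password):
    env = dict(os.environ, PG_USER=user, PG_PASS=password)
    # The command string contains no secret; the shell expands $PG_USER and
    # $PG_PASS at run time, much like cmd.run's env: attribute lets the body
    # see values that the state reporter never renders.
    cmd = "psql -c \"ALTER ROLE $PG_USER WITH PASSWORD '$PG_PASS'\""
    subprocess.run(cmd, shell=True, env=env, check=True)
```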
2026-04-21 15:10:57 -04:00
Mike Reeves 05f6503d61 Gate postgres telegraf fan-out on reactor-provided minion id
postgres.auth was running an `unless` shell check per up-minion on every
manager highstate, even when nothing had changed — N fork+python starts
of so-yaml.py add up on large grids. The work is only needed when a
specific minion's key is accepted.

- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
  pillar is set (targets that single minion). Manager highstates with
  no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
  as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
  salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
  the per-minion state.apply and do an in-shell loop over the minion
  pillar files using so-yaml.py directly. Skips minions that already
  have postgres.telegraf.user set.
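
Roughly what that backfill loop does, sketched in Python (the real soup code
shells out to so-yaml.py; the pillar glob and key path are assumptions):

```python
import glob

import yaml

PILLAR_GLOB = "/opt/so/saltstack/local/pillar/minions/*.sls"   # assumed location

def minions_needing_backfill():
    todo = []
    for path in glob.glob(PILLAR_GLOB):
        with open(path) as f:
            pillar = yaml.safe_load(f) or {}
        # Skip minions that already have postgres.telegraf.user set.
        if pillar.get("postgres", {}).get("telegraf", {}).get("user"):
            continue
        todo.append(path)
    return todo
```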
2026-04-21 10:05:08 -04:00
Mike Reeves a902f667ba Target manager by role grain in telegraf_postgres_sync orch
The previous MANAGER resolution used pillar.get('setup:manager') with a
fallback to grains.get('master'). Neither works from the reactor:
setup:manager is only populated by the setup workflow (not by reactor
runs), and grains.master returns the minion's master-hostname setting,
not a targetable minion id.

Match the pattern used by orch/delete_hypervisor.sls: compound-target
whichever minion is the manager via role grain.
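
For illustration, the same compound role-grain match expressed through the
Salt Python API (the orch does this in SLS; the exact grain value is an
assumption):

```python
import salt.client

local = salt.client.LocalClient()
# Target whichever minion carries the manager role grain, independent of the
# setup:manager pillar or the minion's master: setting.
managers = local.cmd("G@role:so-manager*", "test.ping", tgt_type="compound")
print(sorted(managers))
```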
2026-04-21 09:37:35 -04:00
Mike Reeves 72105f1f2f Drop telegraf push from new-minion orch; highstate covers it
New minions run highstate as part of onboarding, which already applies
the telegraf state with the fresh pillar entry we just wrote. Pushing
telegraf a second time from the reactor is redundant.

- Remove the MINION-scoped salt.state block from the orch; keep only
  the manager-side postgres.auth + postgres.telegraf_users provisioning.
- Stop passing minion_id as pillar in the reactor; the orch doesn't
  reference it anymore.
2026-04-21 09:31:45 -04:00
Mike Reeves cefbe01333 Add telegraf_output selector for InfluxDB/Postgres dual-write
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.

- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
  minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
  postgres.telegraf.retention_days (default 14)
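
A hedged sketch of the role/schema reconciliation that
postgres/telegraf_users.sls performs on each apply, written against psycopg2;
the pillar shape, role naming, and per-minion schema naming are assumptions:

```python
import psycopg2
from psycopg2 import sql

def reconcile_telegraf_users(dsn, creds):
    # creds: {minion_id: {"user": "so_telegraf_<minion>", "pass": "..."}}
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        for entry in creds.values():
            user, password = entry["user"], entry["pass"]
            cur.execute("SELECT 1 FROM pg_roles WHERE rolname = %s", (user,))
            if cur.fetchone() is None:
                cur.execute(sql.SQL("CREATE ROLE {} LOGIN PASSWORD %s")
                            .format(sql.Identifier(user)), (password,))
            else:
                cur.execute(sql.SQL("ALTER ROLE {} WITH PASSWORD %s")
                            .format(sql.Identifier(user)), (password,))
            # One schema per minion inside the shared so_telegraf database,
            # owned by that minion's role so each credential only reaches its
            # own data.
            cur.execute(sql.SQL("CREATE SCHEMA IF NOT EXISTS {} AUTHORIZATION {}")
                        .format(sql.Identifier(user), sql.Identifier(user)))
```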
2026-04-15 14:32:10 -04:00
Josh Patterson 453c32df0d handle '-' in hypervisor hostname 2025-08-04 15:25:26 -04:00
Josh Patterson 6d7066c381 add license 2025-07-02 16:20:30 -04:00
Josh Patterson d003e1380f ensure hypervisor is removed from salt cloud profiles when key is deleted 2025-07-02 16:14:43 -04:00
Josh Patterson f6a0e62853 include managerhype in orch. run hypervisor state before libvirt states 2025-04-08 09:50:26 -04:00
Josh Patterson 445afca6ee use vrt 2025-04-03 13:44:13 -04:00
Josh Patterson ae993c47c1 remove minion pillar files when a vm is destroyed 2025-03-11 11:12:45 -04:00
Josh Patterson c784a6e440 fix setting hypervisor for our custom event tag 2025-03-10 16:55:02 -04:00
Josh Patterson f30938ed59 hypervisor annotation shows if base domain is initialized or not 2025-03-06 15:26:08 -05:00
Josh Patterson 2c5861a0c2 ensure local hypervisor dir when new hypervisor key accepted. apply soc.dyanno.hypervisor when hypervisor key accepted 2025-03-05 08:51:10 -05:00
Josh Patterson c896785480 fix vm deletion 2025-02-24 14:20:09 -05:00
Josh Patterson 0006948c29 get hypervisor from dir name 2025-02-24 12:26:28 -05:00
Josh Patterson fd9a4966ec move logic from reactor to orchestration 2025-02-23 14:07:51 -05:00
Josh Patterson 3246176c0a comments 2025-02-21 14:34:08 -05:00
Josh Patterson b68f561e6f progress and hw tracking for soc hypervisor dynamic annotations 2025-02-21 09:50:01 -05:00
m0duspwnens 2e3c1adc63 runner to setup manager for first hypervisor 2025-01-14 16:20:21 -05:00
m0duspwnens 776afa4a36 setup items on manager when hypervisor joins the grid 2025-01-09 16:32:41 -05:00
m0duspwnens 1862deaf5e add copyright 2024-05-08 10:14:08 -04:00
m0duspwnens 0d2e5e0065 need repo and docker first 2024-05-08 09:50:01 -04:00
m0duspwnens dcc1f656ee predownload logstash and elastic for new searchnode and heavynode 2024-05-07 10:13:51 -04:00
m0duspwnens bdf1b45a07 redirect and throw in bg 2024-05-03 14:54:44 -04:00
m0duspwnens 3d4fd59a15 orchit 2024-05-03 13:48:51 -04:00
m0duspwnens 442a717d75 orchit 2024-05-03 12:08:57 -04:00
m0duspwnens fa3522a233 fix requirement 2024-05-03 11:10:21 -04:00
m0duspwnens bbc374b56e add logic in orch 2024-05-03 09:56:52 -04:00
m0duspwnens 2929877042 fix var 2024-05-02 16:37:54 -04:00
m0duspwnens e9b1263249 orchestrate searchnode deployment 2024-05-02 16:32:43 -04:00