Exercises the FileNotFoundError and generic-exception branches added to
loadYaml in the previous commit, restoring 100% coverage required by
the build.
so-telegraf-cred was committed with mode 644, causing
`so-telegraf-cred add "$MINION_ID"` in so-minion's add_telegraf_to_minion
to fail with "Permission denied" and log "Failed to provision postgres
telegraf cred for <minion>". Mark it executable.
Also bail early in seed_creds_file if mkdir/printf/chmod fail, and in
so-yaml.py loadYaml surface a clear stderr message with the filename
instead of an unhandled FileNotFoundError traceback.
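For illustration only, the error handling described for loadYaml could look roughly like the sketch below. The function name `load_file`, the pluggable `parse` callable, the message wording, and the exit codes are assumptions made for this example; they are not taken from so-yaml.py.

```python
import sys


def load_file(filename, parse=lambda text: text):
    """Read and parse a file, exiting with a clear stderr message on failure.

    `parse` stands in for the YAML parser (assumption: kept pluggable so the
    sketch needs only the standard library).
    """
    try:
        with open(filename, "r") as handle:
            return parse(handle.read())
    except FileNotFoundError:
        # Name the missing file instead of dumping a traceback.
        print(f"File not found: {filename}", file=sys.stderr)
        sys.exit(2)  # exit code is an assumption
    except Exception as exc:
        # Any other read or parse failure still reports the filename.
        print(f"Error reading {filename}: {exc}", file=sys.stderr)
        sys.exit(1)  # exit code is an assumption
```

Both branches are the ones a coverage test would need to exercise: one call with a missing path and one call that fails for any other reason.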
Swap the ~150-line Python implementation for a 48-line bash script that
delegates YAML mutation to so-yaml.py — the same helper so-minion and
soup already use. Same semantics: seed the creds pillar on first use,
idempotent add, silent remove.
SO minion ids are dot-free by construction (setup/so-functions:1884
strips everything after the first '.'), so using the raw id as the
so-yaml.py key path is safe.
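A small sketch of why the dot-free property matters, assuming a dotted key path is split on '.' to walk nested keys. The helper names and the `creds` prefix are hypothetical and only illustrate the reasoning.

```python
def short_minion_id(hostname: str) -> str:
    """Drop everything after the first '.', as the setup step is described to do."""
    return hostname.split(".", 1)[0]


def split_key_path(key: str) -> list:
    """Split a dotted key path into its nested components."""
    return key.split(".")


# A dotted id would be misread as extra nesting levels; a dot-free id is not.
raw = "sensor01.example.com"           # hypothetical hostname
safe = short_minion_id(raw)            # "sensor01"
dotted_path = split_key_path("creds." + raw)   # four components
safe_path = split_key_path("creds." + safe)    # two components
```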
The old flow had two writers for each per-minion Telegraf password
(so-minion wrote the minion pillar; postgres.auth regenerated any
missing aggregate entries). The two drifted on first boot, and nothing
triggered DB role creation when a new minion joined.
Split responsibilities:
- pillar/postgres/auth.sls (manager-scoped) keeps only the so_postgres
admin cred.
- pillar/telegraf/creds.sls (grid-wide) holds a {minion_id: {user,
pass}} map, shadowed per-install by the local-pillar copy.
- salt/manager/tools/sbin/so-telegraf-cred is the single writer:
flock, atomic YAML write, PyYAML safe_dump so passwords never
round-trip through so-yaml.py's type coercion. Idempotent add, quiet
remove.
- so-minion's add/remove hooks now shell out to so-telegraf-cred
instead of editing pillar files directly.
- postgres.telegraf_users iterates the new pillar key and CREATE/ALTERs
roles from it; telegraf.conf reads its own entry via grains.id.
- orch.deploy_newnode runs postgres.telegraf_users on the manager and
refreshes the new minion's pillar before the new node highstates,
so the DB role is in place the first time telegraf tries to connect.
- soup's post_to_3.1.0 backfills the creds pillar from accepted salt
keys (idempotent) and runs postgres.telegraf_users once to reconcile
the DB.
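The single-writer pattern (lock, idempotent add, atomic write) can be sketched as follows. This is not the so-telegraf-cred source: the function name, the `.lock` suffix, and the use of JSON in place of PyYAML safe_dump are assumptions chosen to keep the example stdlib-only.

```python
import fcntl
import json
import os
import tempfile


def add_cred(path, minion_id, user, password):
    """Add a credential entry under an exclusive lock; return True if written."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the lock file closes
        try:
            with open(path) as handle:
                creds = json.load(handle)
        except FileNotFoundError:
            creds = {}  # first use: seed an empty map
        if minion_id in creds:
            return False  # idempotent add: keep the existing password
        creds[minion_id] = {"user": user, "pass": password}
        # Write to a temp file in the same directory, then rename over the
        # target so readers never observe a half-written file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as out:
                json.dump(creds, out)
                out.flush()
                os.fsync(out.fileno())
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise
        return True
```

Keeping the temp file in the target directory matters because a rename is only atomic within one filesystem.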
The reactor path is gone; so-minion now owns add/delete for new
minions. The backfill itself is unchanged: postgres.auth's up_minions
fallback fills the aggregate, postgres.telegraf_users creates the
roles, and the bash loop fans out to the per-minion pillar files, so the
pre-feature upgrade story still works end-to-end. Only the comment is
refreshed so it isn't misleading.
Paired with the add path in add_telegraf_to_minion: when a minion is
removed, drop its entry from the aggregate postgres pillar and drop the
matching so_telegraf_<safe> role from the database. Without this, stale
entries and DB roles accumulate over time.
Makes rotate-password and compromise-recovery both a clean delete+add:
so-minion -o=delete -m=<id>
so-minion -o=add -m=<id>
The first call drops the role and clears the aggregate pillar; the
second generates a brand-new password.
The cleanup is best-effort — if so-postgres isn't running or the DROP
ROLE fails (e.g., the role owns unexpected objects), we log a warning
and continue so the minion delete itself never gets blocked by postgres
state. Admins can mop up stray roles manually if that happens.
Simpler, race-free replacement for the reactor + orch + fan-out chain.
- salt/manager/tools/sbin/so-minion: expand add_telegraf_to_minion to
generate a random 72-char password, reuse any existing password from
the aggregate pillar, write postgres.telegraf.{user,pass} into the
minion's own pillar file, and update the aggregate pillar so
postgres.telegraf_users can CREATE ROLE on the next manager apply.
Every create<ROLE> function already calls this hook, so add / addVM /
setup dispatches are all covered identically and synchronously.
- salt/postgres/auth.sls: strip the fanout_targets loop and the
postgres_telegraf_minion_pillar_<safe> cmd.run block — it's now
redundant. The state still manages the so_postgres admin user and
writes the aggregate pillar for postgres.telegraf_users to consume.
- salt/reactor/telegraf_user_sync.sls: deleted.
- salt/orch/telegraf_postgres_sync.sls: deleted.
- salt/salt/master.sls: drop the reactor_config_telegraf block that
registered the reactor on /etc/salt/master.d/reactor_telegraf.conf.
- salt/orch/deploy_newnode.sls: drop the manager_fanout_postgres_telegraf
step and the require: it added to the newnode highstate. Back to its
original 3/dev shape.
No more ephemeral postgres_fanout_minion pillar, no more async salt/key
reactor, no more so-minion setupMinionFiles race: the pillar write
happens inline inside setupMinionFiles itself.
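The generate-or-reuse rule for the password can be sketched like this. The alphabet, the function name, and the shape of the aggregate map are assumptions; only the 72-character length and the reuse behavior come from the description above.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed character set


def telegraf_password(minion_id, aggregate, length=72):
    """Reuse the stored password for this minion, else generate a new one.

    `aggregate` is assumed to map minion ids to {"user": ..., "pass": ...}.
    """
    existing = aggregate.get(minion_id, {}).get("pass")
    if existing:
        return existing  # re-running add must not rotate the password
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Reusing the stored value is what keeps repeated add calls idempotent; rotation happens only through an explicit delete followed by add.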
Two fixes on the postgres telegraf fan-out path:
1. postgres.auth cmd.run leaked the password to the console because
Salt always prints the Name: field and `show_changes: False` does
not apply to cmd.run. Move the user and password into the `env:`
attribute so the shell body still sees them via $PG_USER / $PG_PASS
but Salt's state reporter never renders them.
2. so-minion's addMinion -> setupMinionFiles sequence removes the
minion pillar file and rewrites it from scratch, which wipes the
postgres.telegraf.* entries the reactor may have already written on
salt-key accept. Add a postgres.auth fan-out step to
orch.deploy_newnode (the orch so-minion kicks off after
setupMinionFiles) and require it from the new minion's highstate.
Idempotent via the existing unless: guard in postgres.auth.
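The idea behind the first fix, shown outside Salt as a rough analogy: keep the secret out of the command text and hand it to the shell through the environment. The function name is hypothetical, and this uses Python's subprocess rather than Salt's cmd.run.

```python
import os
import subprocess


def run_with_secret_env(script, user, password):
    """Run a shell snippet that reads credentials from $PG_USER / $PG_PASS."""
    # The secret travels in the environment, so anything that logs the
    # command line (the analogue of Salt printing Name:) never sees it.
    env = dict(os.environ, PG_USER=user, PG_PASS=password)
    return subprocess.run(
        ["sh", "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


# The script text references the variables, not their values.
result = run_with_secret_env('printf "%s" "$PG_USER"', "so_telegraf_m1", "s3cret")
```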
replace calls removeKey before addKey, so running `so-yaml.py replace`
on a new dotted key whose parent doesn't exist — e.g., postgres.auth
fanning postgres.telegraf.user into a minion pillar file that has
never carried any postgres.* keys — crashed with
KeyError: 'postgres'
from removeKey recursing into a missing parent dict.
Make removeKey a no-op when an intermediate key is absent so that:
- `remove` has the natural "remove if exists" semantics, and
- `replace` works for brand-new nested keys.
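A minimal sketch of the intended semantics on a plain nested dict. These helpers are written for illustration and are not the so-yaml.py functions; the names and signatures are assumptions.

```python
def remove_key(content, key):
    """Remove a dotted key; do nothing if any part of the path is missing."""
    head, _, rest = key.partition(".")
    if not isinstance(content, dict) or head not in content:
        return  # missing parent or leaf: "remove if exists"
    if rest:
        remove_key(content[head], rest)
    else:
        del content[head]


def add_key(content, key, value):
    """Set a dotted key, creating intermediate dicts as needed."""
    head, _, rest = key.partition(".")
    if rest:
        add_key(content.setdefault(head, {}), rest, value)
    else:
        content[head] = value


def replace_key(content, key, value):
    """Remove then add, which now also works for brand-new nested keys."""
    remove_key(content, key)
    add_key(content, key, value)
```

With the early return in place, `replace_key(pillar, "postgres.telegraf.user", ...)` on a pillar that has no `postgres` key simply creates the nested entry instead of raising KeyError.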
The empty-pillar case produced a telegraf.conf with `user= password=`
which libpq misparses ("password=" gets consumed as the user value),
yielding `password authentication failed for user "password="` on
every manager without a prior fan-out (fresh install, not the salt-key
path the reactor handles).
Two fixes:
- salt/postgres/auth.sls: always fan out for grains.id in addition to any
postgres_fanout_minion from the reactor, so the manager's own pillar
is populated on every postgres.auth run. The existing `unless` guard
keeps re-runs idempotent.
- salt/telegraf/etc/telegraf.conf: gate the [[outputs.postgresql]]
block on PG_USER and PG_PASS being non-empty. If a minion hasn't
received its pillar yet the output block simply isn't rendered — the
next highstate picks up the creds once the fan-out completes, and in
the meantime telegraf keeps running the other outputs instead of
erroring with a malformed connection string.
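The gating logic in the second fix amounts to the following, shown here in Python rather than the Jinja used in telegraf.conf. The function name and the exact set of connection keywords are assumptions for the example.

```python
def postgres_connection(host, user, password):
    """Return a keyword/value connection string, or None if creds are missing.

    Rendering "user= password=" with empty values is what produced the
    misparsed string, so the output is skipped entirely in that case.
    """
    if not user or not password:
        return None  # caller omits the postgres output block
    return f"host={host} user={user} password={password}"
```

Returning None models "the block simply isn't rendered"; the other outputs are unaffected.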
postgres.auth was running an `unless` shell check per up-minion on every
manager highstate, even when nothing had changed. Each check forks a
shell and starts Python for so-yaml.py, and N of those add up on large
grids. The work is only needed when a
specific minion's key is accepted.
- salt/postgres/auth.sls: fan out only when postgres_fanout_minion
pillar is set (targets that single minion). Manager highstates with
no pillar take a zero-N code path.
- salt/reactor/telegraf_user_sync.sls: re-pass the accepted minion id
as postgres_fanout_minion to the orch.
- salt/orch/telegraf_postgres_sync.sls: forward the pillar to the
salt.state invocation so the state render sees it.
- salt/manager/tools/sbin/soup: for the one-time 3.1.0 backfill, drop
the per-minion state.apply and do an in-shell loop over the minion
pillar files using so-yaml.py directly. Skips minions that already
have postgres.telegraf.user set.
Every postgres.auth run was rewriting every minion pillar file via
two so-yaml.py replace calls, even when nothing had changed. Passwords
are only generated on first encounter (see the `if key not in
telegraf_users` guard) and never rotate, so re-writing the same values
on every apply is wasted work and noisy state output.
Add an `unless:` check that compares the already-written
postgres.telegraf.user to the one we'd set. If they match, skip the
fan-out entirely. On first apply for a new minion the key isn't there,
so the replace runs; on subsequent applies it's a no-op.
pillar/top.sls only distributes postgres.auth to manager-class roles,
so sensors / heavynodes / searchnodes / receivers / fleet / idh /
hypervisor / desktop minions never received the postgres telegraf
password they need to write metrics. Broadcasting the aggregate
postgres.auth pillar to every role would leak the so_postgres admin
password and every other minion's cred.
Fan out per-minion credentials into each minion's own pillar file at
/opt/so/saltstack/local/pillar/minions/<id>.sls. That file is already
distributed by pillar/top.sls exclusively to the matching minion via
`- minions.{{ grains.id }}`, so each minion sees only its own
postgres.telegraf.{user,pass} and nothing else.
- salt/postgres/auth.sls: after writing the manager-scoped aggregate
pillar, fan the per-minion creds out via so-yaml.py replace for every
up-minion. Creates the minion pillar file if missing. Requires
postgres_auth_pillar so the manager pillar lands first.
- salt/telegraf/etc/telegraf.conf: consume postgres:telegraf:user and
postgres:telegraf:pass directly from the minion's own pillar instead
of walking postgres:auth:users which isn't visible off the manager.
The so-postgres-backup script and its cron were living under
salt/backup/config_backup.sls, so both were deployed regardless of
whether postgres was enabled or disabled.
- Relocate salt/backup/tools/sbin/so-postgres-backup to
salt/postgres/tools/sbin/so-postgres-backup so the existing
postgres_sbin file.recurse in postgres/config.sls picks it up with
everything else — no separate file.managed needed.
- Remove postgres_backup_script and so_postgres_backup from
salt/backup/config_backup.sls.
- Add cron.present for so_postgres_backup to salt/postgres/enabled.sls
and the matching cron.absent to salt/postgres/disabled.sls so the
cron follows the container's lifecycle.
- config.sls: postgresconfdir creates /opt/so/conf/postgres, so the
two subdirectories under it (postgressecretsdir, postgresinitdir)
don't need their own makedirs — require the parent instead.
- soc_postgres.yaml: helpLink for every annotated key now points to
'postgres' instead of the carried-over 'influxdb' slug.
The previous MANAGER resolution used pillar.get('setup:manager') with a
fallback to grains.get('master'). Neither works from the reactor:
setup:manager is only populated by the setup workflow (not by reactor
runs), and grains.master returns the minion's master-hostname setting,
not a targetable minion id.
Match the pattern used by orch/delete_hypervisor.sls: compound-target
whichever minion is the manager via role grain.
postgres_wait_ready requires docker_container: so-postgres, which is
declared in postgres.enabled. Running postgres.telegraf_users on its own
— as the reactor orch and the soup post-upgrade step both do — errored
because Salt couldn't resolve the require.
Include postgres.enabled from postgres.telegraf_users so the container
state is always in the render. postgres.enabled already includes
telegraf_users; Salt de-duplicates the circular include and the included
states are all idempotent, so repeated application is a no-op.
New minions run highstate as part of onboarding, which already applies
the telegraf state with the fresh pillar entry we just wrote. Pushing
telegraf a second time from the reactor is redundant.
- Remove the MINION-scoped salt.state block from the orch; keep only
the manager-side postgres.auth + postgres.telegraf_users provisioning.
- Stop passing minion_id as pillar in the reactor; the orch doesn't
reference it anymore.
salt/auth fires on every minion authentication — including every minion
restart and every master restart — so the reactor was re-running the
postgres.auth + postgres.telegraf_users + telegraf orchestration for
every already-accepted minion on every reconnect. The underlying states
are idempotent, so this was wasted work and log noise, not a correctness
issue.
Switch the subscription to salt/key, which fires only when the master
actually changes a key's state (accept / reject / delete). Match the
pattern used by salt/reactor/check_hypervisor.sls (registered in
salt/salt/cloud/reactor_config_hypervisor.sls) and add the result==True
guard so half-failed key operations don't trigger the orchestration.
Add Configuration-UI annotations for every postgres pillar key defined
in defaults.yaml, not just telegraf.retention_days:
- postgres.enabled — readonly; admin-visible but toggled via state
- postgres.telegraf.retention_days — drop advanced so user-tunable knobs
surface in the default view
- postgres.config.max_connections, shared_buffers, log_min_messages —
user-tunable performance/verbosity knobs, not advanced
- postgres.config.listen_addresses, port, ssl, ssl_cert_file, ssl_key_file,
ssl_ca_file, hba_file, log_destination, logging_collector,
shared_preload_libraries, cron.database_name — infra/Salt-managed,
marked advanced so they're visible but out of the way
No defaults.yaml change; value-side stays the same.
- firewall/map.jinja and postgres/telegraf_users.sls now pull the
telegraf output selector through TELEGRAFMERGED so the defaults.yaml
value (BOTH) is the source of truth and pillar overrides merge in
cleanly. pillar.get with a hardcoded fallback was brittle and would
disagree with defaults.yaml if the two ever diverged.
- Rename salt/postgres/files/pg_hba.conf.jinja to pg_hba.conf and drop
template: jinja from config.sls — the file has no jinja besides the
comment header.
The Telegraf backend selector lived at global.telegraf_output but it is
a Telegraf-scoped setting, not a cross-cutting grid global. Move both
the value and the UI annotation under the telegraf pillar so it shows
up alongside the other Telegraf tuning knobs in the Configuration UI.
- salt/telegraf/defaults.yaml: add telegraf.output: BOTH
- salt/telegraf/soc_telegraf.yaml: add telegraf.output annotation
- salt/global/defaults.yaml: remove global.telegraf_output
- salt/global/soc_global.yaml: remove global.telegraf_output annotation
- salt/vars/globals.map.jinja: drop telegraf_output from GLOBALS
- salt/firewall/map.jinja: read via pillar.get('telegraf:output')
- salt/postgres/telegraf_users.sls: read via pillar.get('telegraf:output')
- salt/telegraf/etc/telegraf.conf: read via TELEGRAFMERGED.output
- salt/postgres/tools/sbin/so-stats-show: update user-facing docs
No behavioral change — default stays BOTH.
state.apply takes a single mods argument; space-separated names are not
a list, so `state.apply postgres.auth postgres.telegraf_users` was only
applying postgres.auth and silently dropping the telegraf_users state.
Use comma-separated mods and add queue=True to match the rest of soup.
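To make the argument shape concrete, here is a sketch that builds the command line. The use of `salt-call` and the helper name are assumptions; the comma-joined mods and `queue=True` are the parts described above.

```python
def state_apply_args(mods, queue=True):
    """Build an argv for state.apply with all mods in one comma-joined argument."""
    args = ["salt-call", "state.apply", ",".join(mods)]  # salt-call is assumed
    if queue:
        args.append("queue=True")
    return args


argv = state_apply_args(["postgres.auth", "postgres.telegraf_users"])
```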
feature/postgres had rewritten the 3.1.0 upgrade block, dropping the
elastic upgrade work 3/dev landed for 9.0.8→9.3.3:
elasticsearch_backup_index_templates, the component template state
cleanup, and the /usr/sbin/so-kibana-space-defaults post-upgrade call.
It also carried an older ES upgrade mapping
(8.18.8→9.0.8) that was superseded on 3/dev (9.0.8→9.3.3 for
3.0.0-20260331), and a handful of latent shell-quoting regressions in
verify_es_version_compatibility and the intermediate-upgrade helpers.
Adopt the 3/dev soup verbatim and only add the new Telegraf Postgres
provisioning to post_to_3.1.0 on top of so-kibana-space-defaults.
- Deliver postgres super and app passwords via mounted 0600 secret files
(POSTGRES_PASSWORD_FILE, SO_POSTGRES_PASS_FILE) instead of plaintext env
vars visible in docker inspect output
- Mount a managed pg_hba.conf that only allows local trust and hostssl
scram-sha-256 so TCP clients cannot negotiate cleartext sessions
- Restrict postgres.key to 0400 and ensure owner/group 939
- Set umask 0077 on so-postgres-backup output
- Validate host values in so-stats-show against [A-Za-z0-9._-] before SQL
interpolation so a compromised minion cannot inject SQL via a tag value
- Coerce postgres:telegraf:retention_days to int before rendering into SQL
- Escape single quotes when rendering pillar values into postgresql.conf
- Own postgres tooling in /usr/sbin as root:root so a container escape
cannot rewrite admin scripts
- Gate ES migration TLS verification on esVerifyCert (default false,
matching the elastic module's existing pattern)
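Three of the input-hardening items above (host validation, integer coercion, quote escaping) can be sketched as small helpers. The function names are hypothetical, and doubling the single quote is an assumed escaping style, chosen because postgresql.conf accepts two quotes as an embedded quote.

```python
import re

_HOST_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_host(host):
    """Accept only hosts made of [A-Za-z0-9._-]; reject everything else."""
    if not _HOST_RE.fullmatch(host):
        raise ValueError(f"invalid host value: {host!r}")
    return host


def retention_days(value):
    """Coerce the pillar value to int so no free-form text reaches SQL."""
    return int(value)


def conf_quote(value):
    """Quote a value for postgresql.conf, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"
```

Validation happens before interpolation, so a rejected value never appears in a query string at all.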
Per-minion telegraf roles inherit CONNECT via PUBLIC by default and
could open sessions to the SOC database (though they have no readable
grants inside). Close the soft edge by revoking PUBLIC's CONNECT and
re-granting it to so_postgres only.
feature/postgres never shipped the original cron.present, so this
cleanup state is a no-op on every fresh install. The script itself
stays on disk for emergency use.
docker-entrypoint.sh runs the init-scripts phase with listen_addresses=''
(Unix socket only). The old pg_isready check passed there and then raced
the docker_temp_server_stop shutdown before the final postgres started.
pg_isready -h 127.0.0.1 only returns success once the real CMD binds
TCP, so downstream psql execs never land during the shutdown window.
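The readiness idea (wait for the TCP listener rather than the Unix socket) can be approximated with a plain connect probe. This is only an analogy: pg_isready performs a protocol-level check, while this sketch merely tests that the port accepts connections. The function name, default port 5432, and timing defaults are assumptions.

```python
import socket
import time


def wait_for_tcp(host="127.0.0.1", port=5432, timeout=30.0, interval=0.5):
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # something is listening on TCP
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A server that listens only on a Unix socket never satisfies this probe, which mirrors why the init-scripts phase no longer passes the gate.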
Use CREATE TABLE IF NOT EXISTS and a WHERE-guarded create_parent() so
a Telegraf restart can re-run the templates safely after manual DB
surgery. Add an explicit tag_table_create_templates mirroring the
plugin default with IF NOT EXISTS for the same reason.
pg_partman 5.x's create_parent() creates a per-parent template
table inside the partman schema at runtime, which requires CREATE on
that schema. Also extend ALTER DEFAULT PRIVILEGES so the runtime-
created template tables are accessible to so_telegraf.
pg_partman 5.x splits p_parent_table on '.' and looks up the parts as
raw identifiers, so the literal must be 'schema.name' rather than the
double-quoted form quoteLiteral emits for .table.
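A toy model of the lookup described above, to show why the double-quoted form fails. This is not pg_partman code; the function name and the `cpu` table name are hypothetical.

```python
def split_parent_table(p_parent_table):
    """Split on the first '.' and return the parts as raw identifiers."""
    schema, _, name = p_parent_table.partition(".")
    return schema, name


plain = split_parent_table("telegraf.cpu")
quoted = split_parent_table('"telegraf"."cpu"')
# The quoted form keeps its double quotes, so a raw-identifier lookup
# searches for a schema literally named "telegraf" with quotes included.
```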
Telegraf calls partman.create_parent() on first write of each metric,
which needs USAGE on the partman schema, EXECUTE on its functions and
procedures, and DML on partman.part_config.
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
were passed through to Postgres literally, producing syntax errors
on every metric's first write. Switch to {{ .table }}, {{ .columns }},
and {{ .table|quoteLiteral }} so partitioned parents and the partman
create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
both init-users.sh and telegraf_users.sls with an explicit shell
conditional. The prior idiom occasionally fired CREATE DATABASE even
when so_telegraf already existed, producing duplicate-key failures.
- Telegraf's partman template passed p_type:='native', which pg_partman
5.x (the version shipped by postgresql-17-partman on Debian) rejects.
Switched to 'range' so partman.create_parent() actually creates
partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
setup log. Dropped the matching entry from so-log-check, which scans
container stdout where that token never appears.
init-users.sh only runs on a fresh data dir, so upgrades onto an
existing /nsm/postgres volume never got so_telegraf. Pinning partman's
schema also makes partman.part_config reliably resolvable.
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.
- Load pg_cron at server start via shared_preload_libraries and point
cron.database_name at so_telegraf so job metadata lives alongside
the metrics
- Telegraf create_templates override makes every new metric table a
PARTITION BY RANGE (time) parent registered with partman.create_parent
in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
disk as a manual fallback
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).
- so-stats-show rewritten to use JSONB accessors
((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
match the convention upstream used when removing them from
mdengine/pcapengine/pipeline; options: list drives validation