salt/auth fires on every minion authentication — including every minion
restart and every master restart — so the reactor was re-running the
postgres.auth + postgres.telegraf_users + telegraf orchestration for
every already-accepted minion on every reconnect. The underlying states
are idempotent, so this was wasted work and log noise, not a correctness
issue.
Switch the subscription to salt/key, which fires only when the master
actually changes a key's state (accept / reject / delete). Match the
pattern used by salt/reactor/check_hypervisor.sls (registered in
salt/salt/cloud/reactor_config_hypervisor.sls) and add the result==True
guard so half-failed key operations don't trigger the orchestration.
Add Configuration-UI annotations for every postgres pillar key defined
in defaults.yaml, not just telegraf.retention_days:
- postgres.enabled — readonly; admin-visible but toggled via state
- postgres.telegraf.retention_days — drop advanced so user-tunable knobs
surface in the default view
- postgres.config.max_connections, shared_buffers, log_min_messages —
user-tunable performance/verbosity knobs, not advanced
- postgres.config.listen_addresses, port, ssl, ssl_cert_file, ssl_key_file,
ssl_ca_file, hba_file, log_destination, logging_collector,
shared_preload_libraries, cron.database_name — infra/Salt-managed,
marked advanced so they're visible but out of the way
No defaults.yaml change; value-side stays the same.
- firewall/map.jinja and postgres/telegraf_users.sls now pull the
telegraf output selector through TELEGRAFMERGED so the defaults.yaml
value (BOTH) is the source of truth and pillar overrides merge in
cleanly. pillar.get with a hardcoded fallback was brittle and would
disagree with defaults.yaml if the two ever diverged.
- Rename salt/postgres/files/pg_hba.conf.jinja to pg_hba.conf and drop
template: jinja from config.sls — the file has no jinja besides the
comment header.
The Telegraf backend selector lived at global.telegraf_output but it is
a Telegraf-scoped setting, not a cross-cutting grid global. Move both
the value and the UI annotation under the telegraf pillar so it shows
up alongside the other Telegraf tuning knobs in the Configuration UI.
- salt/telegraf/defaults.yaml: add telegraf.output: BOTH
- salt/telegraf/soc_telegraf.yaml: add telegraf.output annotation
- salt/global/defaults.yaml: remove global.telegraf_output
- salt/global/soc_global.yaml: remove global.telegraf_output annotation
- salt/vars/globals.map.jinja: drop telegraf_output from GLOBALS
- salt/firewall/map.jinja: read via pillar.get('telegraf:output')
- salt/postgres/telegraf_users.sls: read via pillar.get('telegraf:output')
- salt/telegraf/etc/telegraf.conf: read via TELEGRAFMERGED.output
- salt/postgres/tools/sbin/so-stats-show: update user-facing docs
No behavioral change — default stays BOTH.
state.apply takes a single mods argument; space-separated names are not
a list, so `state.apply postgres.auth postgres.telegraf_users` was only
applying postgres.auth and silently dropping the telegraf_users state.
Use comma-separated mods and add queue=True to match the rest of soup.
feature/postgres had rewritten the 3.1.0 upgrade block, dropping the
elastic upgrade work 3/dev landed for 9.0.8→9.3.3: elasticsearch_backup_index_templates,
the component template state cleanup, and the /usr/sbin/so-kibana-space-defaults
post-upgrade call. It also carried an older ES upgrade mapping
(8.18.8→9.0.8) that was superseded on 3/dev (9.0.8→9.3.3 for
3.0.0-20260331), and a handful of latent shell-quoting regressions in
verify_es_version_compatibility and the intermediate-upgrade helpers.
Adopt the 3/dev soup verbatim and only add the new Telegraf Postgres
provisioning to post_to_3.1.0 on top of so-kibana-space-defaults.
- Deliver postgres super and app passwords via mounted 0600 secret files
(POSTGRES_PASSWORD_FILE, SO_POSTGRES_PASS_FILE) instead of plaintext env
vars visible in docker inspect output
- Mount a managed pg_hba.conf that only allows local trust and hostssl
scram-sha-256 so TCP clients cannot negotiate cleartext sessions
- Restrict postgres.key to 0400 and ensure owner/group 939
- Set umask 0077 on so-postgres-backup output
- Validate host values in so-stats-show against [A-Za-z0-9._-] before SQL
interpolation so a compromised minion cannot inject SQL via a tag value
- Coerce postgres:telegraf:retention_days to int before rendering into SQL
- Escape single quotes when rendering pillar values into postgresql.conf
- Own postgres tooling in /usr/sbin as root:root so a container escape
cannot rewrite admin scripts
- Gate ES migration TLS verification on esVerifyCert (default false,
matching the elastic module's existing pattern)
Per-minion telegraf roles inherit CONNECT via PUBLIC by default and
could open sessions to the SOC database (though they have no readable
grants inside). Close the soft edge by revoking PUBLIC's CONNECT and
re-granting it to so_postgres only.
feature/postgres never shipped the original cron.present, so this
cleanup state is a no-op on every fresh install. The script itself
stays on disk for emergency use.
docker-entrypoint.sh runs the init-scripts phase with listen_addresses=''
(Unix socket only). The old pg_isready check passed there and then raced
the docker_temp_server_stop shutdown before the final postgres started.
pg_isready -h 127.0.0.1 only returns success once the real CMD binds
TCP, so downstream psql execs never land during the shutdown window.
Use CREATE TABLE IF NOT EXISTS and a WHERE-guarded create_parent() so
a Telegraf restart can re-run the templates safely after manual DB
surgery. Add an explicit tag_table_create_templates mirroring the
plugin default with IF NOT EXISTS for the same reason.
pg_partman 5.x's create_partition() creates a per-parent template
table inside the partman schema at runtime, which requires CREATE on
that schema. Also extend ALTER DEFAULT PRIVILEGES so the runtime-
created template tables are accessible to so_telegraf.
pg_partman 5.x splits p_parent_table on '.' and looks up the parts as
raw identifiers, so the literal must be 'schema.name' rather than the
double-quoted form quoteLiteral emits for .table.
Telegraf calls partman.create_parent() on first write of each metric,
which needs USAGE on the partman schema, EXECUTE on its functions and
procedures, and DML on partman.part_config.
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
were passed through to Postgres literally, producing syntax errors
on every metric's first write. Switch to {{ .table }}, {{ .columns }},
and {{ .table|quoteLiteral }} so partitioned parents and the partman
create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
both init-users.sh and telegraf_users.sls with an explicit shell
conditional. The prior idiom occasionally fired CREATE DATABASE even
when so_telegraf already existed, producing duplicate-key failures.
- Telegraf's partman template passed p_type:='native', which pg_partman
5.x (the version shipped by postgresql-17-partman on Debian) rejects.
Switched to 'range' so partman.create_parent() actually creates
partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
setup log. Dropped the matching entry from so-log-check, which scans
container stdout where that token never appears.
init-users.sh only runs on a fresh data dir, so upgrades onto an
existing /nsm/postgres volume never got so_telegraf. Pinning partman's
schema also makes partman.part_config reliably resolvable.
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.
- Load pg_cron at server start via shared_preload_libraries and point
cron.database_name at so_telegraf so job metadata lives alongside
the metrics
- Telegraf create_templates override makes every new metric table a
PARTITION BY RANGE (time) parent registered with partman.create_parent
in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
disk as a manual fallback
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).
- so-stats-show rewritten to use JSONB accessors
((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
match the convention upstream used when removing them from
mdengine/pcapengine/pipeline; options: list drives validation
Per-minion schemas cause table count to explode (N minions * M metrics)
and the per-minion revocation story isn't worth it when retention is
short. Move all minions to a shared 'telegraf' schema while keeping
per-minion login credentials for audit.
- New so_telegraf NOLOGIN group role owns the telegraf schema; each
per-minion role is a member and inherits insert/select via role
inheritance
- Telegraf connection string uses options='-c role=so_telegraf' so
tables auto-created on first write belong to the group role
- so-telegraf-trim walks the flat telegraf.* table set instead of
per-minion schemas
- so-stats-show filters by host tag; CLI arg is now the hostname as
tagged by Telegraf rather than a sanitized schema suffix
- Also renames so-show-stats -> so-stats-show
The psql invocation flag '-v ON_ERROR_STOP=1' used by the so-postgres
init script gets flagged by so-log-check because the token 'ERROR'
matches its error regex. Add to the exclusion list.
Telegraf's postgresql output stores tag values either as individual
columns on <metric>_tag or as a single JSONB 'tags' column, depending
on plugin version. Introspect information_schema.columns and build the
right accessor per tag instead of assuming one layout.
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.
- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
postgres.telegraf.retention_days (default 14)