feature/postgres never shipped the original cron.present, so this
cleanup state is a no-op on every fresh install. The script itself
stays on disk for emergency use.
docker-entrypoint.sh runs the init-scripts phase with listen_addresses=''
(Unix socket only). The old pg_isready check passed there and then raced
the docker_temp_server_stop shutdown before the final postgres started.
pg_isready -h 127.0.0.1 only returns success once the real CMD binds
TCP, so downstream psql execs never land during the shutdown window.
Use CREATE TABLE IF NOT EXISTS and a WHERE-guarded create_parent() so
a Telegraf restart can re-run the templates safely after manual DB
surgery. Add an explicit tag_table_create_templates mirroring the
plugin default with IF NOT EXISTS for the same reason.
pg_partman 5.x's create_partition() creates a per-parent template
table inside the partman schema at runtime, which requires CREATE on
that schema. Also extend ALTER DEFAULT PRIVILEGES so the runtime-
created template tables are accessible to so_telegraf.
pg_partman 5.x splits p_parent_table on '.' and looks up the parts as
raw identifiers, so the literal must be 'schema.name' rather than the
double-quoted form quoteLiteral emits for .table.
Telegraf calls partman.create_parent() on first write of each metric,
which needs USAGE on the partman schema, EXECUTE on its functions and
procedures, and DML on partman.part_config.
- Telegraf's outputs.postgresql plugin uses Go text/template syntax,
not uppercase tokens. The {TABLE}/{COLUMNS}/{TABLELITERAL} strings
were passed through to Postgres literally, producing syntax errors
on every metric's first write. Switch to {{ .table }}, {{ .columns }},
and {{ .table|quoteLiteral }} so partitioned parents and the partman
create_parent() call succeed.
- Replace the \gexec "CREATE DATABASE ... WHERE NOT EXISTS" idiom in
both init-users.sh and telegraf_users.sls with an explicit shell
conditional. The prior idiom occasionally fired CREATE DATABASE even
when so_telegraf already existed, producing duplicate-key failures.
- Telegraf's partman template passed p_type:='native', which pg_partman
5.x (the version shipped by postgresql-17-partman on Debian) rejects.
Switched to 'range' so partman.create_parent() actually creates
partitions and Telegraf's INSERTs succeed.
- Added a postgres_wait_ready gate in telegraf_users.sls so psql execs
don't race the init-time restart that docker-entrypoint.sh performs.
- so-verify now ignores the literal "-v ON_ERROR_STOP=1" token in the
setup log. Dropped the matching entry from so-log-check, which scans
container stdout where that token never appears.
init-users.sh only runs on a fresh data dir, so upgrades onto an
existing /nsm/postgres volume never got so_telegraf. Pinning partman's
schema also makes partman.part_config reliably resolvable.
Every telegraf.* metric table is now a daily time-range partitioned
parent managed by pg_partman. Retention drops old partitions instead
of the row-by-row DELETE that so-telegraf-trim used to run nightly,
and dashboards will benefit from partition pruning at query time.
- Load pg_cron at server start via shared_preload_libraries and point
cron.database_name at so_telegraf so job metadata lives alongside
the metrics
- Telegraf create_templates override makes every new metric table a
PARTITION BY RANGE (time) parent registered with partman.create_parent
in one transaction (1 day interval, 3 premade)
- postgres_telegraf_group_role now also creates pg_partman and pg_cron
extensions and schedules hourly partman.run_maintenance_proc
- New retention reconcile state updates partman.part_config.retention
from postgres.telegraf.retention_days on every apply
- so_telegraf_trim cron is now unconditionally absent; script stays on
disk as a manual fallback
High-cardinality inputs (docker, procstat, kafka) trigger ALTER TABLE
ADD COLUMN on every new field name, and with all minions writing into
a shared 'telegraf' schema the metric tables hit Postgres's 1600-column
per-table ceiling quickly. Setting fields_as_jsonb and tags_as_jsonb on
the postgresql output keeps metric tables fixed at (time, tag_id,
fields jsonb) and tag tables at (tag_id, tags jsonb).
- so-stats-show rewritten to use JSONB accessors
((fields->>'x')::numeric, tags->>'host', etc.) and cast memory/disk
sizes to bigint so pg_size_pretty works
- Drop regex/regexFailureMessage from telegraf_output SOC UI entry to
match the convention upstream used when removing them from
mdengine/pcapengine/pipeline; options: list drives validation
Per-minion schemas cause table count to explode (N minions * M metrics)
and the per-minion revocation story isn't worth it when retention is
short. Move all minions to a shared 'telegraf' schema while keeping
per-minion login credentials for audit.
- New so_telegraf NOLOGIN group role owns the telegraf schema; each
per-minion role is a member and inherits insert/select via role
inheritance
- Telegraf connection string uses options='-c role=so_telegraf' so
tables auto-created on first write belong to the group role
- so-telegraf-trim walks the flat telegraf.* table set instead of
per-minion schemas
- so-stats-show filters by host tag; CLI arg is now the hostname as
tagged by Telegraf rather than a sanitized schema suffix
- Also renames so-show-stats -> so-stats-show
The psql invocation flag '-v ON_ERROR_STOP=1' used by the so-postgres
init script gets flagged by so-log-check because the token 'ERROR'
matches its error regex. Add to the exclusion list.
Telegraf's postgresql output stores tag values either as individual
columns on <metric>_tag or as a single JSONB 'tags' column, depending
on plugin version. Introspect information_schema.columns and build the
right accessor per tag instead of assuming one layout.
Introduces global.telegraf_output (INFLUXDB|POSTGRES|BOTH, default BOTH)
so Telegraf can write metrics to Postgres alongside or instead of
InfluxDB. Each minion authenticates with its own so_telegraf_<minion>
role and writes to a matching schema inside a shared so_telegraf
database, keeping blast radius per-credential to that minion's data.
- Per-minion credentials auto-generated and persisted in postgres/auth.sls
- postgres/telegraf_users.sls reconciles roles/schemas on every apply
- Firewall opens 5432 only to minion hostgroups when Postgres output is active
- Reactor on salt/auth + orch/telegraf_postgres_sync.sls provision new
minions automatically on key accept
- soup post_to_3.1.0 backfills users for existing minions on upgrade
- so-show-stats prints latest CPU/mem/disk/load per minion for sanity checks
- so-telegraf-trim + nightly cron prune rows older than
postgres.telegraf.retention_days (default 14)