Calm operations

Signals, recovery, low noise

tl;dr

Predictive disk alerts, multi-signal gating, contract-aware paging, and safe automation: allow-listed auto-restarts and event-driven scale-out (golden image → configure → health → join LB).

Fewer false pages; quieter on-call.

Scope & windows

Mixed estates and tools; mostly Zabbix.

Apps varied; logs shipped centrally (not always journald).

Role

Implementer/SME: added checks and integrations, tuned alert policy.

Approach

Predictive disks: time-to-full (TTF) with a severity ladder.

• P4 warning: TTF < 14d (service desk FIFO).

• P2 high: TTF < 7d (prioritised ticket, no on-call).

• P1 crit (weekday-aware):

  • Fri 08:00–18:00: TTF < 3d (gives runway pre-weekend).

  • Mon–Thu: TTF < 1d.

  • Weekend: no TTF P1; rely on static <5% free emergency trigger.
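
The ladder above can be sketched as a small Python function. This is an illustration of the decision logic only, not the Zabbix trigger itself. The function name and parameters are hypothetical, and two details are assumptions because the text does not state them: the static <5% free trigger is treated as always armed and mapped to P1, and Friday outside 08:00–18:00 is treated like the weekend (no TTF-based P1).

```python
from datetime import datetime, timedelta
from typing import Optional


def ttf_priority(ttf: timedelta, now: datetime, free_pct: float) -> Optional[str]:
    """Map time-to-full onto the ladder: 'P1', 'P2', 'P4', or None."""
    # Assumption: the static <5% free emergency trigger is always armed
    # and pages as P1.
    if free_pct < 5.0:
        return "P1"

    days = ttf / timedelta(days=1)  # TTF as fractional days
    weekday = now.weekday()         # Monday = 0 ... Sunday = 6

    # Mon-Thu: page only when the disk would fill within a day.
    if weekday <= 3 and days < 1:
        return "P1"
    # Fri 08:00-18:00: wider 3-day threshold, runway before the weekend.
    # Assumption: outside that window Friday has no TTF-based P1.
    if weekday == 4 and 8 <= now.hour < 18 and days < 3:
        return "P1"

    # Below P1 the ladder is the same every day.
    if days < 7:
        return "P2"  # prioritised ticket, no on-call
    if days < 14:
        return "P4"  # service desk FIFO
    return None


# Example: Friday morning with two days of runway left.
friday_morning = datetime(2024, 1, 5, 10, 0)
print(ttf_priority(timedelta(days=2), friday_morning, free_pct=40.0))
```

The same two-day TTF on a Wednesday falls through to P2, which is the point of the weekday-aware rung: the ticket is raised without paging anyone.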

Multi-signal LB gating: only page when impact is likely (e.g., 2/3 backends down = P1, 1/3 down = P2).
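
A minimal sketch of the gating rule in Python, assuming the 2/3 and 1/3 examples generalise to "at least two thirds of the pool down = P1, anything less = P2". The function name is hypothetical, and the production rule lives in a Zabbix trigger, not in Python.

```python
from typing import Optional


def lb_priority(backends_down: int, backends_total: int) -> Optional[str]:
    """Return 'P1', 'P2', or None for a load-balanced pool."""
    if backends_total <= 0:
        raise ValueError("backends_total must be positive")
    if not 0 <= backends_down <= backends_total:
        raise ValueError("backends_down must be between 0 and backends_total")

    if backends_down == 0:
        return None
    # Integer comparison avoids float rounding: down/total >= 2/3.
    if 3 * backends_down >= 2 * backends_total:
        return "P1"  # user impact likely, page on-call
    return "P2"      # pool degraded but still serving


# Example: a three-node pool losing one node, then two.
print(lb_priority(1, 3), lb_priority(2, 3))
```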

Auto-remediation: Zabbix Action → Rundeck webhook restarts known-safe (allow-listed) services on the failed backend, with a full audit trail.
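
The allow-list check can be sketched as follows. Everything here is illustrative: the service names, the payload fields, and the function name are hypothetical placeholders, not the real manifest or the Rundeck webhook schema. The idea is only that a restart request is built for services on the list and refused for anything else, and that the triggering event id travels with the request so the action can be audited.

```python
from typing import Optional

# Hypothetical allow-list; in practice this would come from a manifest file.
ALLOWED_RESTARTS = frozenset({"nginx", "php-fpm"})


def remediation_payload(
    host: str,
    service: str,
    event_id: str,
    allowlist: frozenset = ALLOWED_RESTARTS,
) -> Optional[dict]:
    """Build a restart request, or return None if the service is not allow-listed."""
    if service not in allowlist:
        # Unknown or unsafe service: no automation, leave it to a human.
        return None
    return {
        "host": host,
        "service": service,
        "action": "restart",
        "event_id": event_id,  # ties the restart back to the monitoring event
    }


# Example: an allow-listed service yields a payload, a database does not.
print(remediation_payload("web-01", "nginx", "evt-1001"))
print(remediation_payload("db-01", "postgresql", "evt-1002"))
```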

Event-driven scale: a monitoring trigger on prolonged increased load starts scale-out.
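
One way to express "prolonged" is to require every sample in a recent window to exceed the threshold, so a single spike does not start a scale-out. The sketch below assumes that interpretation; the function name, threshold, and window size are hypothetical, and the real condition is a monitoring trigger.

```python
from typing import Sequence


def should_scale_out(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True when the last `window` samples are all above `threshold`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(samples) < window:
        return False  # not enough history to call the load prolonged
    # Taking the minimum of the window means one low sample resets the condition.
    return min(samples[-window:]) > threshold


# Example: load has stayed above 0.8 for the last three samples.
print(should_scale_out([0.5, 0.9, 0.95, 0.92], threshold=0.8, window=3))
```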

Deploy hygiene: playbooks create monitoring maintenance windows via the API.
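
As a rough sketch of what such a call carries, the function below builds a JSON-RPC request body for the Zabbix `maintenance.create` method with a single one-time period. It only constructs the body and sends nothing. The host ids and maintenance name are placeholders, authentication is deliberately left out, and field names should be checked against the Zabbix version in use (older releases take `hostids`, not `hosts`).

```python
import json
from datetime import datetime, timedelta, timezone
from typing import List


def maintenance_request(
    name: str,
    host_ids: List[str],
    start: datetime,
    duration: timedelta,
    request_id: int = 1,
) -> dict:
    """Build a JSON-RPC body for a one-time maintenance window."""
    since = int(start.timestamp())
    period = int(duration.total_seconds())
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": since,
            "active_till": since + period,
            # Assumption: newer API shape; older versions use "hostids": [...].
            "hosts": [{"hostid": h} for h in host_ids],
            "timeperiods": [
                # timeperiod_type 0 = one-time period
                {"timeperiod_type": 0, "start_date": since, "period": period}
            ],
        },
        "id": request_id,
    }


# Example: a 30-minute window for two placeholder hosts.
body = maintenance_request(
    "deploy-example",
    ["10001", "10002"],
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    timedelta(minutes=30),
)
print(json.dumps(body, indent=2))
```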

Results

Late “disk full” pages disappeared; priority now escalates earlier and calmly.

LB incidents escalate only when user impact is likely.

Known recurring faults self-heal; humans are paged for true P1s.

No alert floods during maintenance; on-call calmer after releases.

Confidentiality

Client artifacts can't be shared.

Examples are anonymized and recreated; configs, names, and IPs are placeholders.

Receipts use the actual stack and are representative.

Code snippets

Predictive disks (TTF ladder in Zabbix)

Zabbix - Trigger

LB multi-signal gating (concept)

Zabbix - Trigger

Allow-listed auto-remediation (manifest + Rundeck webhook)

Zabbix - Action payload

Ansible - customer/ops/manifest.yml

Ansible - remediate.yml

Rundeck - remediate-job.yml

Event-driven scale-in/out (provision → configure → validate → LB)

Zabbix - Trigger

Zabbix - Payload

Shell - name_gen.sh

Ansible - scale_precheck.yml

Rundeck - scale-out-job.yml

Ansible - provision.yml

Ansible - lb_join.yml

Deploy hygiene (monitoring maintenance via API)

Ansible - zbx_maintenance.yml

example call step