Calm operations
Signals, recovery, low noise
tl;dr
Predictive disk alerts, multi-signal gating, contract-aware paging, and safe automation:
allow-listed auto-restarts and event-driven scale-out (golden image → configure → health → join LB).
Fewer false pages; quieter on-call.
Scope & windows
Mixed estates and tools; mostly Zabbix.
Apps varied; logs shipped centrally (not always journald).
Role
Implementer/SME: added checks, built integrations, and tuned alert policy.
Approach
Predictive disks: time-to-full (TTF) with a severity ladder.
• P4 warning: TTF < 14d (service desk FIFO).
• P2 high: TTF < 7d (prioritised ticket, no on-call).
• P1 crit (weekday-aware):
  • Fri 08:00–18:00: TTF < 3d (gives runway pre-weekend).
  • Mon–Thu: TTF < 1d.
  • Weekend: no TTF P1; rely on static <5% free emergency trigger.
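The ladder above can be sketched as a small decision function. This is a minimal Python illustration, not the production Zabbix trigger set. The function name and parameters are hypothetical, and two behaviours are assumptions the ladder does not state: the static <5% free trigger is treated as active on every day, and Friday outside 08:00–18:00 falls back to the Mon–Thu threshold.

```python
from datetime import datetime
from typing import Optional

DAY = 24 * 3600  # seconds in one day


def ttf_severity(ttf_seconds: float, free_pct: float, now: datetime) -> Optional[str]:
    """Return "P1", "P2", "P4", or None for one filesystem."""
    # Static emergency trigger (assumed to apply on all days, not only weekends).
    if free_pct < 5.0:
        return "P1"

    weekday = now.weekday()  # Monday=0 ... Sunday=6
    if weekday >= 5:
        p1_limit = None  # weekend: no TTF-based P1
    elif weekday == 4 and 8 <= now.hour < 18:
        p1_limit = 3 * DAY  # Friday daytime: extra runway before the weekend
    else:
        # Mon-Thu. Friday outside 08:00-18:00 landing here is an assumption.
        p1_limit = 1 * DAY

    if p1_limit is not None and ttf_seconds < p1_limit:
        return "P1"
    if ttf_seconds < 7 * DAY:
        return "P2"
    if ttf_seconds < 14 * DAY:
        return "P4"
    return None
```

With this sketch, a disk two days from full is a P1 on Friday morning but only a P2 on Wednesday, which is the weekday-aware behaviour the ladder describes.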
Multi-signal LB gating: only page when impact is likely (e.g., 2/3 backends down = P1, 1/3 down = P2).
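The gating rule can be illustrated in a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, and generalising the 2/3 and 1/3 examples to "at least two thirds down is P1, anything less is P2" is my reading rather than something stated here.

```python
from typing import Optional, Sequence


def lb_severity(backends_up: Sequence[bool]) -> Optional[str]:
    """Map backend health behind one load balancer to a page priority."""
    total = len(backends_up)
    down = sum(1 for up in backends_up if not up)
    if total == 0 or down == 0:
        return None  # nothing to report
    # Assumed generalisation: two thirds or more down means likely user impact.
    if down * 3 >= total * 2:
        return "P1"
    # Some capacity lost, but the pool is still serving traffic.
    return "P2"
```

A single failed backend therefore raises a ticket-level alert, and on-call is paged only once most of the pool is gone.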
Auto-remediation: Zabbix Action → Rundeck webhook restarts known-safe (allow-listed) services on the failed backend;
every run is fully audited.
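The allow-list check at the heart of this flow can be sketched in Python. Everything named here is hypothetical: the roles, the service names, the shape of the returned payload, and the in-memory audit list standing in for the real audit trail. The actual call to the Rundeck webhook is left out because its URL and token are deployment-specific.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set

# Hypothetical allow-list: host role -> services considered safe to restart.
ALLOW_LIST: Dict[str, Set[str]] = {
    "web-backend": {"nginx", "php-fpm"},
}


def plan_restart(role: str, host: str, service: str, audit: List[dict]) -> Optional[dict]:
    """Return a restart request for an allow-listed service, else None.

    Every decision, including refusals, is appended to the audit list.
    """
    allowed = service in ALLOW_LIST.get(role, set())
    audit.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "host": host,
        "service": service,
        "decision": "restart" if allowed else "refused",
    })
    if not allowed:
        return None  # leave it to a human
    # Body that would be sent on to the automation job.
    return {"host": host, "service": service, "action": "restart"}
```

Recording refusals as well as restarts keeps the audit trail complete: it shows what automation declined to touch, not only what it changed.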
Event-driven scale: a monitoring trigger on sustained high load starts scale-out (golden image → configure → health → join LB).
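Two ideas sit behind this line: load must stay high for a while before anything happens, and a new node joins the load balancer only after the earlier steps succeed. A minimal Python sketch of both follows. The function names, step names, threshold, and sample count are illustrative assumptions, not the real trigger expression or job definition.

```python
from typing import Callable, Dict, List, Sequence

# Step order taken from the pipeline: golden image -> configure -> health -> join LB.
SCALE_OUT_STEPS = ("provision_from_golden_image", "configure", "health_check", "join_lb")


def should_scale_out(load_samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """True only when the most recent min_consecutive samples all exceed threshold."""
    if len(load_samples) < min_consecutive:
        return False
    recent = load_samples[-min_consecutive:]
    return all(sample > threshold for sample in recent)


def run_scale_out(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the steps in order, stop at the first failure, return the completed ones."""
    completed: List[str] = []
    for name in SCALE_OUT_STEPS:
        if not steps[name]():
            break  # a failed health check means the node never joins the LB
        completed.append(name)
    return completed
```

Requiring consecutive samples filters out short spikes, and the ordered pipeline means an unhealthy node never receives traffic.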
Deploy hygiene: deploy playbooks open a monitoring maintenance window via the API, so releases do not trigger alert floods.
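As an illustration of what such a call carries, the Python sketch below builds a JSON-RPC request body for the Zabbix `maintenance.create` method without sending it. The helper name, maintenance name, and host IDs are hypothetical. The parameter layout reflects my understanding of recent Zabbix releases (`hosts` as a list of objects, where older versions take `hostids`), so it should be checked against the API reference for the version in use. Authentication is omitted.

```python
from datetime import datetime, timezone
from typing import List


def build_maintenance_request(name: str, host_ids: List[str], start: datetime,
                              minutes: int, request_id: int = 1) -> dict:
    """Build (not send) a JSON-RPC body for a one-time maintenance window."""
    since = int(start.timestamp())  # Zabbix expects Unix timestamps
    period = minutes * 60           # window length in seconds
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": since,
            "active_till": since + period,
            # Assumption: newer API shape; older versions use "hostids": [...]
            "hosts": [{"hostid": host_id} for host_id in host_ids],
            "timeperiods": [
                # timeperiod_type 0 = one-time period
                {"timeperiod_type": 0, "start_date": since, "period": period}
            ],
        },
        "id": request_id,
    }


# Example: a 30-minute window for two placeholder hosts.
request = build_maintenance_request(
    "deploy-example", ["10101", "10102"],
    datetime(2024, 1, 3, 10, 0, tzinfo=timezone.utc), 30,
)
```

A playbook would post a body like this before changing hosts. Keeping the window short and scoped to the affected hosts limits how much alerting is suppressed.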
Results
Late “disk full” pages disappeared; priority now escalates earlier and calmly.
LB incidents escalate only when user impact is likely.
Known recurring faults self-heal; humans paged for true P1s.
No alert floods during maintenance; on-call calmer after releases.
Confidentiality
Client artifacts can't be shared.
Examples are anonymized and recreated; configs, names, and IPs are placeholders.
Receipts use the actual stack and are representative.
Code snippets
Predictive disks (TTF ladder in Zabbix)
Zabbix - Trigger
LB multi-signal gating (concept)
Zabbix - Trigger
Allow-listed auto-remediation (manifest + Rundeck webhook)
Zabbix - Action payload
Ansible - customer/ops/manifest.yml
Ansible - remediate.yml
Rundeck - remediate-job.yml
Event-driven scale-in/out (provision → configure → validate → LB)
Zabbix - Trigger
Zabbix - Payload
Shell - name_gen.sh
Ansible - scale_precheck.yml
Rundeck - scale-out-job.yml
Ansible - provision.yml
Ansible - lb_join.yml
Deploy hygiene (monitoring maintenance via API)
Ansible - zbx_maintenance.yml
example call step