Work
Platform engineering case studies — no-downtime migrations, automated DB updates, and calm on-call.
Enterprise Linux Migration
Scope & windows:
• All VMs (hypervisor lifecycle owned by infra team).
• Short non-prod windows for testing; no practical prod downtime.
Approach:
• Build EL9 golden image (hardened/approved); provision env-by-env with Ansible.
• OS/app preflights; rehearsed rollback; blue-green cutover via LB/proxies.
• Post-cutover validation and monitoring; rollback path stays ready.
Results:
• No practical downtime in the production upgrade.
• Rollback plan documented and tested.
• No increase in alert noise post-cutover.
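The preflight, blue-green cutover, and rollback flow from the approach above can be sketched roughly as follows. This is a minimal illustration, not the actual tooling: the Cutover class, run_cutover, and the "blue"/"green" pool names are hypothetical, and the switch step only records what a real LB/proxy change would do.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cutover:
    """Tracks which pool receives traffic ("blue" = old OS, "green" = EL9).

    Hypothetical stand-in for the LB/proxy layer; names are assumptions.
    """
    active: str = "blue"
    log: List[str] = field(default_factory=list)

    def switch(self, target: str) -> None:
        # A real implementation would repoint the LB/proxy backend here.
        self.log.append(f"switch {self.active} -> {target}")
        self.active = target


def run_cutover(cutover: Cutover,
                preflight: Callable[[], bool],
                validate: Callable[[], bool]) -> bool:
    """Preflight, switch to green, validate, and roll back on failure."""
    if not preflight():
        cutover.log.append("preflight failed; no change made")
        return False
    previous = cutover.active
    cutover.switch("green")
    if validate():
        # The old pool is left untouched so the rollback path stays ready.
        cutover.log.append("validation passed; previous pool kept for rollback")
        return True
    cutover.switch(previous)
    cutover.log.append("validation failed; rolled back")
    return False
```

Passing the checks in as callables keeps the gate logic separate from whatever OS/app preflights and post-cutover validations an environment actually needs.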
MariaDB Automation
Scope & windows:
• Multi-node MariaDB Galera cluster behind a load balancer.
• Rolling node ops during short windows; no write loss.
Approach:
• Opinionated lane: preflight → change → validate; serial: 1, any_errors_fatal: true.
• Drain per node via HAProxy socket; promote only when wsrep Synced and ready=ON.
• Health-gated rejoin with retries/timeouts; audit via job logs.
Results:
• Predictable windows; fewer manual steps; lower incident risk.
• Clear pass/fail gates; easy to pause/rollback per node.
• Auditable runs (who/what/which ref).
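The drain and health-gated rejoin steps above might look something like the sketch below. It assumes the HAProxy Runtime API "set server ... state" command and the Galera status variables wsrep_local_state_comment and wsrep_ready; the function names, the backend/server names, and the retry defaults are illustrative assumptions, and fetch_status stands in for however the node's status is actually queried.

```python
import time
from typing import Callable, Dict


def drain_command(backend: str, server: str) -> str:
    """HAProxy Runtime API command that stops new connections to one server."""
    return f"set server {backend}/{server} state drain"


def ready_command(backend: str, server: str) -> str:
    """HAProxy Runtime API command that puts the server back in rotation."""
    return f"set server {backend}/{server} state ready"


def node_is_healthy(status: Dict[str, str]) -> bool:
    """Gate: the node must report Synced and wsrep_ready=ON."""
    return (status.get("wsrep_local_state_comment") == "Synced"
            and status.get("wsrep_ready") == "ON")


def wait_until_healthy(fetch_status: Callable[[], Dict[str, str]],
                       retries: int = 10,
                       delay_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the node up to `retries` times; return False if it never passes."""
    for attempt in range(retries):
        if node_is_healthy(fetch_status()):
            return True
        if attempt < retries - 1:
            sleep(delay_s)  # wait between attempts, not after the last one
    return False
```

A rolling run would send drain_command for one node, apply the change, and only send ready_command once wait_until_healthy returns True; a False result is the point at which the run stops for that node.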
Calm Operations
Scope & windows:
• Mixed estates/tools; mostly Zabbix.
• Apps varied.
• Logs centralized across varied stacks.
• On-call only for P1; P2/P4 routed to tickets/Slack.
Approach:
• Predictive disks (TTF ladder); weekday-aware P1; weekend static guard.
• AND-gated paging (e.g., LB 2/3 down = P1) + allow-listed auto-remediation with health checks and cooldowns.
• Event-driven scale-out: add uniquely named instances (no naming collisions) and merge into the app’s backend pool; cooldowns apply.
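As a rough illustration of the predictive-disk idea above, a time-to-full (TTF) ladder can map a linear fill estimate to a severity. The thresholds, severity labels, and function names below are hypothetical, and the weekday/weekend handling is left out; the real rules live in the monitoring configuration.

```python
from typing import List, Optional, Tuple


def hours_to_full(free_bytes: float, growth_bytes_per_hour: float) -> Optional[float]:
    """Linear time-to-full estimate; None when the disk is not growing."""
    if growth_bytes_per_hour <= 0:
        return None
    return free_bytes / growth_bytes_per_hour


# Hypothetical ladder: (max hours to full, severity), tightest rung first.
TTF_LADDER: List[Tuple[float, str]] = [(4.0, "P1"), (24.0, "P2"), (72.0, "P3")]


def severity_for_ttf(ttf_hours: Optional[float]) -> Optional[str]:
    """Return the first rung the estimate falls under, or None if no alert."""
    if ttf_hours is None:
        return None
    for limit_hours, severity in TTF_LADDER:
        if ttf_hours <= limit_hours:
            return severity
    return None
```

Alerting on projected hours rather than on a fixed percentage is what lets a slowly filling large disk stay quiet while a fast-filling small one escalates early.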
Results:
• Fewer false pages; calmer on-call.
• Earlier, clearer severities.
• Known faults self-recover; bursts absorbed.
• No alert floods; full audit via job logs.
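The AND-gated paging and allow-listed auto-remediation from the approach can be sketched as below, assuming the "LB 2/3 down = P1" example means a page fires only when at least two of three load balancer members are down. The lower "P3" severity, the action names, and the Remediator class are illustrative assumptions.

```python
from typing import Dict, Iterable, Optional


def lb_severity(member_up: Iterable[bool], min_down_for_p1: int = 2) -> Optional[str]:
    """Page (P1) only when enough members are down at the same time."""
    down = sum(1 for up in member_up if not up)
    if down >= min_down_for_p1:
        return "P1"
    if down > 0:
        return "P3"  # hypothetical lower severity, routed to a ticket
    return None


class Remediator:
    """Permits an action only if it is allow-listed and out of cooldown."""

    def __init__(self, allow_list: Iterable[str], cooldown_s: float) -> None:
        self.allow_list = set(allow_list)
        self.cooldown_s = cooldown_s
        self._last_run: Dict[str, float] = {}

    def may_run(self, action: str, now: float) -> bool:
        if action not in self.allow_list:
            return False
        last = self._last_run.get(action)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_run[action] = now  # record the run to start the cooldown
        return True
```

The cooldown is what keeps a flapping service from triggering the same remediation in a loop; a health check after each permitted run would decide whether the fault counts as recovered.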