
Enterprise Linux upgrade factory

A case study in running Enterprise Linux major-version upgrades as a repeatable, factory-style process across client environments.

Scope & windows

All in-scope systems are VMs; the hypervisor lifecycle is owned by a separate infra team.

Short non-production windows for testing; no user-visible downtime in production.

Role

Platform engineer.

Approach

Built an EL9 baseline template (hardened and compliance-approved).
Provisioned new VMs from the template per environment (development, test, staging, production).
Ran installation/configuration playbooks per client, environment, and application.
Ran preflights via Ansible: OS checks as playbook tasks, application checks via ad-hoc commands per application.
Rehearsed rollback in test under prod-like constraints.
Performed a blue/green cutover behind load balancers/proxies; rollback = switching traffic back to the old VMs (a rollback sketch follows the cutover snippet below).

Results

No user-visible downtime during the production cutover.
Rollback plan documented and tested.
No increase in alert noise post-cutover.

Confidentiality

Client artifacts can't be shared.

Examples are anonymized and recreated; configs, names, and IPs are placeholders.

Receipts use the actual stack and are representative.

Code snippets

VM template preparation
Ansible - el9-template.yml
# create ansible user/sudo; install python3, cloud-init, guest-agent; run hardening role; clear machine-id.
---
- hosts: el9-template
  become: true
  gather_facts: true
  vars:
    guest_packages:
      - python3
      - cloud-init
      - open-vm-tools
      - auditd
  tasks:
    - name: Install packages
      package:
        name: "{{ guest_packages }}"
        state: present

    - name: Create ansible user
      user:
        name: ansible
        state: present
        shell: /bin/bash
        create_home: true

    - name: Add SSH key for ansible user
      authorized_key:
        user: ansible
        state: present
        key: "{{ lookup('file', lookup('env','HOME') + '/.ssh/id_ed25519.pub') }}"

    - name: Grant passwordless sudo to ansible user
      copy:
        dest: /etc/sudoers.d/99-ansible
        mode: '0440'
        content: "ansible ALL=(ALL) NOPASSWD: ALL\n"
      notify: Validate sudoers

    - name: Import hardening role
      import_role:
        name: centos9-hardening-baseline

    - name: Clean up for template
      shell: |
        cloud-init clean --logs
        : > /etc/machine-id
        rm -f /var/lib/dbus/machine-id /etc/ssh/ssh_host_*
  handlers:
    - name: Validate sudoers
      command: visudo -cf /etc/sudoers
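
How the template playbook fits the workflow: it runs once against the golden VM, which is then shut down and converted into the template each environment clones from. The invocation below is a sketch; the inventory path is a placeholder, not the client layout.

Shell - prep-template.sh (illustrative)
# Placeholder inventory; the el9-template alias matches the play's hosts: line.
ansible-playbook -i inventory/template.ini el9-template.yml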
OS preflight
Ansible - el9-preflight.yml
# SELinux enforcing; agent active; cloud-init ready; auditd ok.
---
- hosts: new_pool
  gather_facts: false
  become: true
  tasks:
    - name: SELinux must be enforcing
      command: getenforce
      register: se
      changed_when: false
      failed_when: se.stdout != "Enforcing"

    - name: auditd must be active
      command: systemctl is-active auditd
      changed_when: false

    - name: cloud-init must have finished successfully
      command: cloud-init status --wait
      changed_when: false

    - name: Guest agent (vmtoolsd) must be active
      command: systemctl is-active vmtoolsd
      changed_when: false

    - name: Monitoring agent must be active
      command: systemctl is-active zabbix-agent
      changed_when: false
Application preflight (Ansible ad-hoc)
Shell - app-preflight.sh
#!/usr/bin/env bash
# Per-application preflight: ad-hoc checks against the new pool; fail fast on the first broken check.
set -euo pipefail

# Health endpoint must return 200 on every new VM.
ansible new_pool -m uri -a "url=http://app/health status_code=200"
# Spot-check application logs.
ansible new_pool -m shell -a "tail -n 100 /var/log/app.log"
# Check backend/DB connectivity from the app VMs.
ansible new_pool -m shell -a "nc -zv backend 3306"
Cutover (Ansible)
Ansible - cutover.yml
# Cutover: add new backend → verify → switch → verify → drain old
---
- hosts: lb
  gather_facts: false
  become: true
  vars:
    old_backend: "app-old"
    new_backend: "app-new"
    haproxy_socket: "/run/haproxy/admin.sock"
    vip_health_url: "http://vip/health"
  tasks:
    - name: Add new backend (keep old active)
      template:
        src: config.j2
        dest: /etc/haproxy/haproxy.cfg
      vars: { active_backend: "{{ old_backend }}" }

    - name: Validate HAProxy config
      command: haproxy -c -f /etc/haproxy/haproxy.cfg
      changed_when: false

    - name: Reload HAProxy
      service: { name: haproxy, state: reloaded }

    - name: Ensure new backend is healthy (runtime JSON)
      shell: |
        set -o pipefail
        printf 'show stat json\n' | socat - "$HAPROXY_SOCKET" \
        | jq -e --arg bk "$NEW_BACKEND" '
            [ .[]
              | map({(.field.name): .value.value}) | add
              | select(.pxname == $bk and .svname != "BACKEND")
              | .status ]
            | length > 0 and all(. == "UP")'
      args: { executable: /bin/bash }
      environment:
        HAPROXY_SOCKET: "{{ haproxy_socket }}"
        NEW_BACKEND: "{{ new_backend }}"
      changed_when: false

    - name: Switch traffic to new backend
      template:
        src: config.j2
        dest: /etc/haproxy/haproxy.cfg
      vars: { active_backend: "{{ new_backend }}" }

    - name: Validate HAProxy config
      command: haproxy -c -f /etc/haproxy/haproxy.cfg
      changed_when: false
      
    - name: Reload HAProxy
      service: { name: haproxy, state: reloaded }

    - name: Verify VIP /health
      uri: { url: "{{ vip_health_url }}", status_code: 200 }

    - name: Drain old backend (runtime socket)
      shell: |
        set -o pipefail
        for s in $(printf 'show stat json\n' | socat - "$HAPROXY_SOCKET" \
          | jq -r --arg bk "$OLD_BACKEND" '
              .[] | map({(.field.name): .value.value}) | add
              | select(.pxname == $bk and .svname != "BACKEND")
              | .svname'); do
          printf 'disable server %s/%s\n' "$OLD_BACKEND" "$s"
        done | socat - "$HAPROXY_SOCKET"
      args: { executable: /bin/bash }
      environment:
        HAPROXY_SOCKET: "{{ haproxy_socket }}"
        OLD_BACKEND: "{{ old_backend }}"
      changed_when: true
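
The cutover playbook renders /etc/haproxy/haproxy.cfg from a config.j2 template that isn't shown above. A minimal, hypothetical sketch of how the active_backend variable could drive the switch; frontend name, ports, and server addresses are placeholders, not client values.

Jinja2 - config.j2 (illustrative excerpt)
# Excerpt only; assumes a standard global/defaults section elsewhere in the template.
frontend fe_app
    bind *:80
    default_backend {{ active_backend }}

backend app-old
    option httpchk GET /health
    server old1 10.0.0.11:8080 check

backend app-new
    option httpchk GET /health
    server new1 10.0.0.21:8080 check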
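
Rollback, as noted in the approach, meant pointing traffic back at the old VMs, which stayed in place until sign-off. The sketch below is a reconstruction under the same HAProxy assumptions (not the client playbook) and reuses the placeholder names from cutover.yml.

Ansible - rollback.yml (illustrative sketch)
---
# Rollback sketch: re-render with the old backend active, reload, re-enable drained servers.
- hosts: lb
  gather_facts: false
  become: true
  vars:
    old_backend: "app-old"
    haproxy_socket: "/run/haproxy/admin.sock"
  tasks:
    - name: Render config with old backend active
      template:
        src: config.j2
        dest: /etc/haproxy/haproxy.cfg
      vars: { active_backend: "{{ old_backend }}" }

    - name: Validate HAProxy config
      command: haproxy -c -f /etc/haproxy/haproxy.cfg
      changed_when: false

    - name: Reload HAProxy
      service: { name: haproxy, state: reloaded }

    - name: Re-enable old backend servers over the runtime socket
      shell: |
        set -o pipefail
        for s in $(printf 'show stat json\n' | socat - "$HAPROXY_SOCKET" \
          | jq -r --arg bk "$OLD_BACKEND" '
              .[] | map({(.field.name): .value.value}) | add
              | select(.pxname == $bk and .svname != "BACKEND")
              | .svname'); do
          printf 'enable server %s/%s\n' "$OLD_BACKEND" "$s"
        done | socat - "$HAPROXY_SOCKET"
      args: { executable: /bin/bash }
      environment:
        HAPROXY_SOCKET: "{{ haproxy_socket }}"
        OLD_BACKEND: "{{ old_backend }}"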