04Infrastructure2024

AlmaLinux Disaster Recovery

Boot-level recovery and hardening for a mission-critical AlmaLinux production server.

Context

Production recovery, VM ops

Focus

Boot repair, safe restore, hardening

Stack

AlmaLinux · Linux · GRUB · chroot · Rsync

Outcome

Service restored + runbook

What happened

A boot-level failure.

The server hit a boot-level failure that prevented normal startup. Recovery required taking control of the machine state, preserving evidence, restoring service safely, and then hardening the setup to reduce recurrence.

›Stabilize first: stop automation that makes the failure worse.
›Preserve evidence: logs and disk state before invasive repair steps.
›Restore service safely: avoid “fixes” that create silent corruption.
›Harden immediately after: write the runbook while it’s fresh.

Playbook

Correctness over improvisation.

›Boot into a recovery environment (rescue mode / console access).
›Confirm disk/filesystem health before mounting read-write.
›Repair the boot chain as needed (GRUB, initramfs, configs).
›Validate services and data directories, then bring traffic back gradually.
›Document the incident timeline and convert the steps into a runbook.

Hardening

Backups with restore drills (not just “backup succeeded”), explicit alerts for disk pressure and failed backup jobs, configuration tracking so “what changed?” is answerable quickly, and a clear “stop the bleeding” policy during incidents.