Talha Ijlal

Project

AlmaLinux Disaster Recovery

Recovery work for a mission-critical AlmaLinux production server, focused on boot repair, data integrity, and a repeatable response playbook that reduces future downtime.

Context: Bare-metal / VM ops, production recovery
Focus: Boot repair, safe restore, hardening
Technologies: AlmaLinux, Linux, GRUB, chroot, SSH, Rsync, Backups

What Happened

The server hit a boot-level failure that prevented normal startup. Recovery required taking control of the machine state, preserving evidence, restoring service safely, and then hardening the setup to reduce recurrence.

Recovery Principles

  • Stabilize first: stop automation that makes the failure worse.
  • Preserve evidence: capture logs and disk state before invasive repair steps.
  • Restore service safely: avoid “fixes” that create silent corruption.
  • Harden immediately after: write the runbook while it’s fresh.
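
One way to make the "preserve evidence" principle concrete is to record a checksum for every file you copy off the machine before repair begins. The sketch below is illustrative, not the tooling used in this incident: the function names `sha256_of` and `build_manifest` are hypothetical, and it assumes the logs have already been copied to a separate directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large logs do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and symlinks; only hash real file contents.
        if path.is_file() and not path.is_symlink():
            manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest
```

Calling `build_manifest(Path("/mnt/evidence/var-log"))` (a hypothetical path) and saving the result next to the copied logs lets you show later that the evidence was not altered during repair.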

Playbook

A structured sequence that favors correctness and repeatability over improvisation.

High-Level Steps

  1. Boot into a recovery environment (rescue mode / console access).
  2. Confirm disk/filesystem health before mounting read-write.
  3. Repair boot chain as needed (GRUB, initramfs, configs).
  4. Validate services and data directories, then bring traffic back gradually.
  5. Document the incident timeline and convert the steps into a runbook.
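
Steps 2 and 3 can be sketched as an ordered, dry-run command plan that is reviewed before anything is executed. This is a minimal sketch under stated assumptions, not the exact commands run during this incident: it assumes a BIOS-booted system with GRUB 2 and dracut, and the device names, mount point, and the function name `recovery_plan` are hypothetical.

```python
import shlex


def recovery_plan(root_dev: str, boot_disk: str, fstype: str = "xfs",
                  sysroot: str = "/mnt/sysroot") -> list[list[str]]:
    """Return the recovery commands in order. Nothing is executed here."""
    # Step 2: check the filesystem without modifying it.
    # xfs_repair -n is "no modify" mode; for ext4, fsck -n answers "no" to every prompt.
    check = ["xfs_repair", "-n", root_dev] if fstype == "xfs" else ["fsck", "-n", root_dev]
    plan = [
        check,
        # Mount read-only first so evidence can be copied off.
        ["mount", "-o", "ro", root_dev, sysroot],
        # Switch to read-write only once the check has come back clean.
        ["mount", "-o", "remount,rw", sysroot],
    ]
    # The chroot needs the kernel pseudo-filesystems for the GRUB and dracut tools.
    for pseudo in ("dev", "proc", "sys"):
        plan.append(["mount", "--bind", f"/{pseudo}", f"{sysroot}/{pseudo}"])
    # Step 3: repair the boot chain from inside the installed system.
    plan += [
        ["chroot", sysroot, "grub2-install", boot_disk],  # assumption: BIOS boot, not UEFI
        ["chroot", sysroot, "grub2-mkconfig", "-o", "/boot/grub2/grub.cfg"],
        ["chroot", sysroot, "dracut", "--force", "--regenerate-all"],
    ]
    return plan


if __name__ == "__main__":
    # Print the plan for review; device names here are placeholders.
    for command in recovery_plan("/dev/sda2", "/dev/sda"):
        print(shlex.join(command))
```

Generating the plan as data rather than running it directly keeps the sequence reviewable and easy to paste into the runbook from step 5. On UEFI systems the bootloader step differs, so treat that line as a placeholder.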

Hardening Improvements

  • Backups with restore drills (not just “backup succeeded”).
  • Explicit alerts for disk pressure, I/O latency, and failed backup jobs.
  • Configuration tracking so “what changed?” is answerable quickly.
  • Clear ownership and a “stop the bleeding” policy during incidents.
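
A restore drill only counts if the restored data is actually compared against a known-good source. The following is a minimal sketch of that comparison, assuming the backup has been restored into a scratch directory; the function name `verify_restore` and both paths are hypothetical.

```python
import filecmp
from pathlib import Path


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return a list of problems: files missing from or differing in the restore."""
    problems = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        # shallow=False compares file contents byte by byte, not just size and mtime.
        elif not filecmp.cmp(src, dst, shallow=False):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list means every source file was restored with identical contents. Comparing against a snapshot rather than live data avoids false alarms from files that changed after the backup ran.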