How to Stop Your Raspberry Pi From Failing to Boot (And Survive a Hard Power Cut)
A real-world guide to making your Pi bulletproof, from SD card corruption to NVMe migration
I learned this lesson the hard way. My Raspberry Pi 5 was happily serving a production WordPress site when I rebooted it. Two minutes later there was no SSH, no site, just a solid red light staring back at me. The culprit was MariaDB mid-write when the restart fired, the ext4 journal left half-committed, and the Pi unable to boot as a result. This is one of the most common Raspberry Pi failure modes, and it is almost entirely preventable. This guide covers every layer of the problem: understanding why it happens, hardening your setup to survive a hard power cut at any time, and ultimately migrating to NVMe so the problem goes away permanently.
1. What Actually Happened (And Why)
When you run sudo reboot on a Pi running a database, a race begins. Linux receives the reboot command while Docker containers (MariaDB, nginx, WordPress) are still doing I/O, the kernel starts unmounting filesystems, MariaDB never finishes flushing its write buffers, the ext4 journal is left mid-transaction, and on the next boot the kernel tries to replay the journal and fails. The result is orphaned inodes, corrupt block bitmaps, and a filesystem that refuses to mount cleanly.
What makes this worse on a Pi than on a desktop machine comes down to three compounding factors. A server PSU has roughly 10ms of holdup time after power is cut, while the Pi has zero. SD cards have no power-loss protection, unlike enterprise SSDs whose controllers flush their internal write cache on power loss. And because your OS, logs, and database all share the same card with no background fsck to silently fix filesystem errors, a single interrupted write at the wrong moment can take the entire system down.
2. Diagnosing a Failed Boot
If your Pi shows a solid red light with no green activity, it has not booted. The green LED blink code tells you why: no green activity at all means the SD card is not detected or not seated; three flashes means start.elf was not found due to a corrupted boot partition; four flashes means start.elf cannot launch due to corrupted firmware; seven flashes means the kernel image was not found; and a solid green that then stops means the kernel loaded but the OS crashed.
If the Pi will not boot, pop the SD card into a Mac or Linux machine and run a filesystem check using the e2fsprogs package.
# macOS
brew install e2fsprogs
# Linux
sudo apt-get install e2fsprogs
With the SD card inserted into your Mac or Linux machine, run the following block. It creates a script that runs a read-only check to show what is broken and, if errors are found, applies safe automatic repairs in place:
cat > /tmp/pi-fsck.sh << 'EOF'
#!/usr/bin/env bash
# pi-fsck.sh — diagnose and repair a Pi SD card from another machine
# Usage: bash /tmp/pi-fsck.sh /dev/disk4s2
# (replace /dev/disk4s2 with the ext4 partition of your SD card)
set -euo pipefail
DEVICE="${1:?Usage: $0 /dev/PARTITION}"
# Locate e2fsck — Homebrew installs it under the Cellar on macOS
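E2FSCK="$(command -v e2fsck || true)"
if [[ -z "${E2FSCK}" ]]; then
# Not on PATH: search the Homebrew Cellar (Apple Silicon prefix, then Intel prefix)
E2FSCK="$(find /opt/homebrew/Cellar/e2fsprogs /usr/local/Cellar/e2fsprogs -name e2fsck -type f 2>/dev/null | head -1 || true)"
fi
if [[ -z "${E2FSCK}" ]]; then
echo "ERROR: e2fsck not found. Install e2fsprogs first." >&2
exit 1
fi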
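echo "=== Read-only check on ${DEVICE} ==="
sudo "${E2FSCK}" -n "${DEVICE}" || true
echo ""
echo "=== Applying safe automatic repairs (-p) ==="
# e2fsck exit status: 0 = clean, 1 or 2 = errors corrected, 4 and above = problems remain
STATUS=0
sudo "${E2FSCK}" -p "${DEVICE}" || STATUS=$?
if (( STATUS < 4 )); then
echo "Repair complete. SD card should boot."
else
echo ""
echo "Errors remain after -p pass. Escalating to -y (fix everything)..."
STATUS=0
sudo "${E2FSCK}" -y "${DEVICE}" || STATUS=$?
if (( STATUS < 4 )); then echo "Full repair complete."; else echo "Repair failed (e2fsck exit ${STATUS})." >&2; exit 1; fi
fi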
EOF
chmod +x /tmp/pi-fsck.sh
Run it with your SD card’s ext4 partition as the argument (check diskutil list on macOS or lsblk on Linux to confirm the device path):
bash /tmp/pi-fsck.sh /dev/disk4s2
Look for output like Orphan file (inode 12) block 39 is not clean or Block bitmap differences: +(2916492--2916495). This is metadata corruption and it is almost always fixable without data loss. The FSCK0000.REC file on the boot partition is a dead giveaway that fsck ran and found problems. The script escalates automatically from safe repairs to a full -y pass if the first pass is not sufficient, after which you can pop the card back in and the Pi will boot.
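One detail that helps when reading the result: e2fsck reports its outcome as a bitmask exit status (documented in its man page), where 1 and 2 mean errors were found and corrected, and 4 or above means something is still wrong. Here is a minimal sketch of a decoder. The function name describe_e2fsck_status is my own and is not part of e2fsprogs:

```bash
#!/usr/bin/env bash
# describe_e2fsck_status STATUS
# Decodes an e2fsck exit status (a bitmask, per the e2fsck man page)
# into a comma-separated, human-readable summary.
describe_e2fsck_status() {
  local status="$1"
  local parts=()
  if (( status == 0 )); then
    echo "no errors"
    return 0
  fi
  (( status & 1 ))   && parts+=("errors corrected")
  (( status & 2 ))   && parts+=("errors corrected and reboot recommended")
  (( status & 4 ))   && parts+=("errors left uncorrected")
  (( status & 8 ))   && parts+=("operational error")
  (( status & 16 ))  && parts+=("usage or syntax error")
  (( status & 32 ))  && parts+=("cancelled by user")
  (( status & 128 )) && parts+=("shared library error")
  # Any other bit is not documented, so say so rather than guess
  (( ${#parts[@]} )) || parts+=("unknown status")
  local IFS=','
  echo "${parts[*]}"
}

# Example: a run that fixed some problems but left others behind (1 + 4)
describe_e2fsck_status 5
```

In practice you would call it straight after a check, for example with describe_e2fsck_status "$?" on the line following the e2fsck command.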
3. The Prevention Stack
The following changes eliminate this class of problem entirely. Each layer targets a different failure mode, and together they take a Pi from “occasionally corrupts when I look at it funny” to a system that survives hard power cuts and keeps running reliably for years.
3.1 Always Stop Docker Before Rebooting
The simplest and most important rule is to never run bare sudo reboot on a Pi running containers. Create a scripts/safe-reboot.sh that gives Docker 30 seconds to flush MariaDB’s write buffers before the kernel touches the filesystem:
cat > ~/scripts/safe-reboot.sh << 'EOF'
#!/usr/bin/env bash
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "${REPO_ROOT}"
HALT=false
[[ "${1:-}" == "--halt" ]] && HALT=true
echo "Stopping Docker stack gracefully..."
docker compose down --timeout 30
echo "Syncing filesystem..."
sync
if [[ "${HALT}" == "true" ]]; then
sudo shutdown -h now
else
sudo reboot
fi
EOF
chmod +x ~/scripts/safe-reboot.sh
3.2 Install log2ram
log2ram redirects /var/log to a RAM disk (tmpfs) and syncs it back to disk only periodically (once a day by default) and on graceful shutdown, which eliminates the most write-intensive part of normal Pi operation.
curl -fsSL https://azlux.fr/repo.gpg | sudo gpg --dearmor \
-o /usr/share/keyrings/azlux-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/azlux-archive-keyring.gpg] \
http://packages.azlux.fr/debian/ bookworm main" \
| sudo tee /etc/apt/sources.list.d/azlux.list
sudo apt-get update && sudo apt-get install -y log2ram
sudo sed -i 's/^SIZE=.*/SIZE=128M/' /etc/log2ram.conf
sudo sed -i 's/^USE_RSYNC=.*/USE_RSYNC=true/' /etc/log2ram.conf
On a hard power cut you lose recent log entries, but the filesystem survives. That is the right trade-off.
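After the reboot that activates log2ram, it is worth confirming that /var/log really is backed by RAM. The following is a small sketch that reads /proc/mounts-style data and prints the source and filesystem type of whatever is mounted on a path. The helper name mount_info is hypothetical, and I am assuming a standard log2ram install where /var/log shows up as a tmpfs mount (with the zram option enabled the type will differ):

```bash
#!/usr/bin/env bash
# mount_info MOUNTPOINT [MOUNTS_FILE]
# Prints "SOURCE FSTYPE" for the most recent mount on MOUNTPOINT.
# Input is in /proc/mounts format; the last matching line wins because
# later mounts sit on top of earlier ones.
mount_info() {
  local mountpoint="$1"
  local mounts_file="${2:-/proc/mounts}"
  awk -v mp="$mountpoint" '
    $2 == mp { src = $1; fs = $3 }
    END { if (src != "") print src, fs }
  ' "$mounts_file"
}

# On the Pi: report what /var/log is currently mounted from
if [[ -r /proc/mounts ]]; then
  info="$(mount_info /var/log)"
  echo "/var/log: ${info:-no separate mount (still on the root filesystem)}"
fi
```

If the second field printed is tmpfs, log writes are landing in RAM. If there is no separate mount at all, log2ram is not active yet.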
3.3 Volatile systemd Journal
By default, systemd writes a binary journal to /var/log/journal, which produces high-frequency disk I/O that can corrupt on power loss. Moving it entirely to RAM fixes this:
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/volatile.conf << 'EOF'
[Journal]
Storage=volatile
RuntimeMaxUse=64M
RuntimeMaxFileSize=16M
Compress=yes
EOF
sudo systemctl restart systemd-journald
The journal now lives in /run/log/journal in RAM, and the change takes effect immediately without a reboot.
3.4 Harden the Root Partition Mount Options
Add commit=10 and errors=remount-ro to your root partition’s entry in /etc/fstab, alongside noatime (which Raspberry Pi OS already sets by default):
PARTUUID=xxxxx / ext4 defaults,noatime,commit=10,errors=remount-ro 0 1
noatime stops the filesystem updating the last-accessed timestamp on file reads, cutting SD writes by roughly 30%. commit=10 flushes the ext4 journal every 10 seconds instead of the default 5, so the journal is committed half as often; the cost is that up to 10 seconds of recent changes can be lost on a hard power cut, although the journal still keeps the filesystem itself consistent. errors=remount-ro is the critical safety net: if the kernel detects filesystem errors at runtime, it remounts root as read-only instead of panicking or silently corrupting further, and SSH remains functional in read-only mode.
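Editing /etc/fstab only changes what happens at the next mount, so after rebooting it is worth checking the options the kernel is actually using. This is a sketch, assuming findmnt from util-linux is available (it is on Raspberry Pi OS); the helper name has_mount_option is my own:

```bash
#!/usr/bin/env bash
# has_mount_option OPTIONS NAME
# Succeeds if NAME (e.g. "noatime" or "commit=10") appears as a whole entry
# in a comma-separated mount option string.
has_mount_option() {
  local opts=",$1,"
  local name="$2"
  [[ "$opts" == *",$name,"* ]]
}

# Read the live options for / and report on each hardening option
if command -v findmnt >/dev/null 2>&1; then
  live_opts="$(findmnt -no OPTIONS / || true)"
  for opt in noatime commit=10 errors=remount-ro; do
    if has_mount_option "$live_opts" "$opt"; then
      echo "ok:      $opt"
    else
      echo "missing: $opt"
    fi
  done
fi
```

Matching on whole comma-separated entries avoids false positives such as treating nodiratime as noatime.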
3.5 Reduce MariaDB Write Frequency
By default, InnoDB flushes its transaction log to disk on every single commit, which is extreme for a blog serving occasional traffic. Changing it to flush once per second is the single most impactful database change for SD card longevity:
[mysqld]
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
The trade-off is that up to roughly one second of database writes could be lost on a hard power cut, but they roll back cleanly with no corruption. For a blog, this is a completely acceptable risk.
3.6 Cap Docker Container Logs
Docker’s default logging writes unbounded JSON files to /var/lib/docker/containers/*/. Create /etc/docker/daemon.json to cap this:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
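}
Reload Docker with sudo systemctl reload-or-restart docker. Each container’s logs are now capped at 30MB total, which prevents runaway logging from filling the SD card. Note that these defaults only apply to containers created after the change, so recreate existing containers (for example with docker compose down followed by docker compose up -d) to pick them up.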
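The 30MB figure is simply max-size multiplied by max-file, and the worst case for the whole card scales with the number of containers. A tiny sketch of that arithmetic; the function name log_budget_mb and the four-container example are my own illustration:

```bash
#!/usr/bin/env bash
# log_budget_mb MAX_SIZE_MB MAX_FILE CONTAINERS
# Worst-case disk usage in MB for json-file logs:
# each container keeps up to MAX_FILE files of MAX_SIZE_MB each.
log_budget_mb() {
  local max_size_mb="$1"
  local max_file="$2"
  local containers="$3"
  echo $(( max_size_mb * max_file * containers ))
}

echo "One container:   $(log_budget_mb 10 3 1) MB"
echo "Four containers: $(log_budget_mb 10 3 4) MB"
```

Even a stack with a handful of containers stays well within what a 32GB or 64GB card can absorb.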
3.7 Enable the Hardware Watchdog
The Pi 5 has a hardware watchdog built into the SoC. If the kernel hangs and nothing kicks the watchdog for 15 seconds, it forces a hard reboot. Enable it in /boot/firmware/config.txt:
dtparam=watchdog=on
Then configure systemd to use it:
sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/watchdog.conf << 'EOF'
[Manager]
RuntimeWatchdogSec=15s
RebootWatchdogSec=2min
EOF
RuntimeWatchdogSec=15s forces an automatic hard reboot if the system hangs for 15 seconds, while RebootWatchdogSec=2min forces a reboot if a shutdown or reboot takes more than two minutes due to a stuck service. This takes effect after reboot.
3.8 Fix Service Startup Order
A subtle but important failure mode: cloudflared starts before the network is ready, fails silently, and never recovers. The fix is a systemd drop-in that adds the correct dependencies:
sudo mkdir -p /etc/systemd/system/cloudflared.service.d
sudo tee /etc/systemd/system/cloudflared.service.d/wait-for-network.conf << 'EOF'
[Unit]
After=network-online.target docker.service
Wants=network-online.target
[Service]
Restart=on-failure
RestartSec=10s
EOF
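sudo systemctl daemon-reload
The distinction between After=network.target and After=network-online.target matters enormously here. The former only indicates that the network management stack has started; interfaces may not have an address yet, and DNS may not be functional. The latter is reached only once the system’s wait-online service reports the network as up, which normally means a configured, routable address. Use network-online.target for anything that communicates with the internet, and keep Restart=on-failure as a backstop, since DNS can still lag briefly. Also add a timeout guard so boot does not stall if the network is unavailable (this drop-in targets systemd-networkd; on a NetworkManager-based system the corresponding unit is NetworkManager-wait-online.service):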
sudo mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
sudo tee /etc/systemd/system/systemd-networkd-wait-online.service.d/timeout.conf << 'EOF'
[Service]
TimeoutStartSec=15s
EOF
3.9 Graceful Docker Shutdown on Any Reboot
Even if someone runs sudo reboot directly, Docker should stop cleanly. Install a systemd hook to guarantee this:
sudo tee /etc/systemd/system/wordpress-stack-shutdown.service << 'EOF'
[Unit]
Description=Graceful WordPress stack shutdown
DefaultDependencies=no
# Ordered after Docker at boot, so at shutdown this unit stops (and ExecStop runs) before Docker does
Before=shutdown.target reboot.target halt.target
After=docker.service
Requires=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/home/pi/andrewbakerninja-pi/scripts/stack-stop.sh
TimeoutStopSec=60s
[Install]
WantedBy=multi-user.target
EOF
The unit calls stack-stop.sh, a short script that stops the stack and syncs the filesystem. Create the script, then reload systemd and enable the unit:
cat > /home/pi/andrewbakerninja-pi/scripts/stack-stop.sh << 'EOF'
#!/usr/bin/env bash
# Assumed to mirror safe-reboot.sh: stop the stack with a 30-second grace period and log the output
cd /home/pi/andrewbakerninja-pi
docker compose down --timeout 30 2>&1 | logger -t wordpress-shutdown
sync
EOF
chmod +x /home/pi/andrewbakerninja-pi/scripts/stack-stop.sh
sudo systemctl daemon-reload
# --now also starts the unit immediately, so ExecStop runs on the very next reboot
sudo systemctl enable --now wordpress-stack-shutdown.service
With this hook in place, even a bare sudo reboot will trigger a clean Docker shutdown before anything else gets killed.
3.10 Set a Static IP
mDNS .local hostnames can be slow to register on boot, leaving SSH unreachable for 30 to 60 seconds after startup. Setting a static IP means the device is always reachable at a known address the moment the network interface comes up:
sudo tee -a /etc/dhcpcd.conf << 'EOF'
interface wlan0
static ip_address=192.168.0.45/24
static routers=192.168.0.1
static domain_name_servers=192.168.0.1 8.8.8.8
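EOF
Replace wlan0 with eth0 if you are on a wired connection. Check your current interface with ip route show default. Note that /etc/dhcpcd.conf only applies where dhcpcd manages the network; Raspberry Pi OS Bookworm uses NetworkManager by default, so there the static address has to be set through NetworkManager instead.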
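On a NetworkManager-based system the same addresses can be applied with nmcli. The following is a sketch rather than a drop-in: the connection name preconfigured is only a placeholder (list yours with nmcli -t -f NAME connection show --active), and the APPLY switch is my own safety guard so that the block merely prints the command unless you opt in:

```bash
#!/usr/bin/env bash
# Placeholder connection name: replace with the one nmcli reports on your Pi
CONN_NAME="${CONN_NAME:-preconfigured}"

# Same address, gateway and DNS servers as the dhcpcd example
nmcli_args=(
  connection modify "$CONN_NAME"
  ipv4.addresses 192.168.0.45/24
  ipv4.gateway 192.168.0.1
  ipv4.dns "192.168.0.1 8.8.8.8"
  ipv4.method manual
)

if [[ "${APPLY:-0}" == "1" ]]; then
  # Apply the settings, then re-activate the connection so they take effect
  sudo nmcli "${nmcli_args[@]}"
  sudo nmcli connection up "$CONN_NAME"
else
  # Dry run: show the command that would be executed
  printf 'Would run: nmcli'
  printf ' %q' "${nmcli_args[@]}"
  echo
fi
```

Run it once as-is to review the command, then again with APPLY=1 set in the environment. Do this from a local console or be ready to reconnect, because the SSH session will drop if the address changes.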
4. Choosing the Right SD Card
Not all SD cards are equal for Pi server use. The key metric is Total Bytes Written (TBW) endurance, and consumer-grade cards optimise for read speed rather than write durability. For serious use, the Samsung PRO Endurance (approximately 43TB TBW on 64GB) is the best pick, designed for 24/7 write loads in CCTV cameras. The SanDisk Max Endurance (approximately 40TB TBW) is a solid alternative that is often slightly cheaper. Avoid the SanDisk Ultra, Kingston Canvas, and any random eBay brands: the consumer-grade cards deliver only 3 to 5TB TBW and will die within a year or two under server load, while counterfeit cards are rampant on marketplaces and often present an 8GB card as 64GB.
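To see why the TBW rating matters so much, divide it by how much your Pi writes per day. Here is a back-of-the-envelope sketch; the 10GB/day write rate is purely an assumed figure for illustration, the function name days_until_worn is my own, and write amplification inside the card is ignored, so real lifetimes will be shorter:

```bash
#!/usr/bin/env bash
# days_until_worn TBW_TB DAILY_GB
# Days until the rated endurance is used up, in decimal units (1 TB = 1000 GB).
# Integer arithmetic only; write amplification is not modelled.
days_until_worn() {
  local tbw_tb="$1"
  local daily_gb="$2"
  echo $(( tbw_tb * 1000 / daily_gb ))
}

# Compare a consumer-grade card (~4 TB TBW) with an endurance card (~43 TB TBW)
for tbw in 4 43; do
  days="$(days_until_worn "$tbw" 10)"
  echo "${tbw} TB TBW at 10 GB/day: ${days} days (about $(( days / 365 )) years)"
done
```

At that assumed write rate the consumer card is rated for a little over a year, while the endurance card is rated for more than a decade.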
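You can test for counterfeit cards with f3probe. The --destructive flag overwrites data on the card, so run it against a spare or freshly bought card from another machine rather than the card your Pi is booted from, and adjust the device path to match:
sudo apt-get install f3
sudo f3probe --destructive --time-ops /dev/mmcblk0
5. The Permanent Fix: Migrating to NVMe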
SD cards are fundamentally the wrong storage medium for a server. The Pi 5 has a PCIe 2.0 x1 connector specifically for NVMe, and moving to it eliminates the entire class of problems described in this guide.
5.1 Hardware You Need
The official option is the Raspberry Pi M.2 HAT+, which supports the M.2 2230 and 2242 form factors, requires M-key NVMe (B+M key drives may have physical clearance issues), and plugs into the Pi 5’s PCIe FPC connector. Solid third-party alternatives include the Geekworm X1008/X1009 and the Pimoroni NVMe Base; a full-length 2280 drive needs a board long enough to hold it, such as the NVMe Base. For the NVMe drive itself, any M.2 NVMe will work; the WD Black SN770 and Samsung 980 are reliable choices at the 1TB tier. The PCIe Gen 2 x1 connection gives you roughly 500MB/s, which is not the drive’s full rated speed but is about ten times faster than the SD card.
5.2 Enabling PCIe Gen 3 (Optional)
The Pi 5’s PCIe link runs at Gen 2 by default but can be switched to Gen 3, pushing theoretical throughput to around 1GB/s. Gen 3 is not officially certified on the Pi 5, so treat it as an optional experiment. Add this to /boot/firmware/config.txt to enable it:
dtparam=pciex1_gen=3
Results vary by drive, as some drives do not negotiate Gen 3 correctly on Pi hardware. Test with Gen 2 first and only enable this if you want to squeeze out extra performance.
5.3 Updating the EEPROM Firmware
Before migrating, ensure your EEPROM is current. The minimum version for stable NVMe boot is 2024-04-16 or later:
sudo rpi-eeprom-update
# If an update is available:
sudo rpi-eeprom-update -a
sudo reboot
5.4 Migrating the Root Filesystem
The simplest method is rpi-clone, which creates an exact copy of your SD card on the NVMe drive, updates UUIDs and fstab, and configures the bootloader automatically. If apt cannot find the package, install it from the project’s GitHub repository instead:
sudo apt-get install rpi-clone
lsblk # Identify your NVMe device, usually /dev/nvme0n1
# rpi-clone takes the bare device name, without the /dev/ prefix
sudo rpi-clone nvme0n1
If you prefer more control, you can use rsync directly: partition and format the NVMe, mount it, copy everything across excluding virtual filesystems, update fstab on the new root to reference the NVMe root partition’s PARTUUID (obtained via sudo blkid /dev/nvme0n1p2 if you mirror the SD card’s layout), sync, and unmount.
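The copy step of the manual route is the part most people get wrong, so here is a sketch of it. It only builds and prints the rsync command rather than running it. The mount point /mnt/nvme is a hypothetical location for the already-formatted NVMe root partition, and the sketch does not create partitions or set up the separate FAT boot partition, which you still have to handle yourself:

```bash
#!/usr/bin/env bash
# Hypothetical mount point of the new NVMe root partition
TARGET="${TARGET:-/mnt/nvme}"

rsync_args=(
  -aAXH                  # archive mode, plus ACLs, extended attributes and hard links
  --numeric-ids          # keep raw UID/GID numbers rather than mapping by name
  --exclude='/dev/*'     # virtual and runtime filesystems: recreated at boot
  --exclude='/proc/*'
  --exclude='/sys/*'
  --exclude='/run/*'
  --exclude='/tmp/*'
  --exclude='/mnt/*'     # also stops rsync copying the target into itself
  --exclude='/media/*'
  --exclude='/lost+found'
)

# Print the command for review; run it by hand once it looks right
printf 'sudo rsync'
printf ' %q' "${rsync_args[@]}" / "${TARGET}/"
echo
```

Stop the Docker stack before copying so the database files are not changing underneath rsync, then edit ${TARGET}/etc/fstab with the new PARTUUID before unmounting.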
5.5 Setting the Boot Order
Tell the EEPROM to boot from NVMe first with the SD card as fallback:
sudo rpi-eeprom-config --edit
Find the BOOT_ORDER line and change it to:
BOOT_ORDER=0xf16
Boot order is read right-to-left, so this means: try NVMe (PCIe, code 6) first, fall back to SD card (code 1), and restart from the beginning if all options fail (code f). Save, exit, and reboot. Verify the Pi booted from NVMe with findmnt /, which should show /dev/nvme0n1p2 rather than /dev/mmcblk0p2.
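Because the right-to-left reading trips people up, here is a small sketch that decodes a BOOT_ORDER value into the sequence the bootloader will actually try. The digit meanings follow the Raspberry Pi bootloader documentation; the function name decode_boot_order and the short labels are my own:

```bash
#!/usr/bin/env bash
# decode_boot_order VALUE
# Prints the boot modes in the order they are tried. The bootloader reads the
# hex digits from right (least significant) to left.
decode_boot_order() {
  local hex="${1#0x}"
  hex="${hex#0X}"
  local out=()
  local i digit
  for (( i = ${#hex} - 1; i >= 0; i-- )); do
    digit="${hex:i:1}"
    case "$digit" in
      0)   out+=("sd-card-detect") ;;
      1)   out+=("sd-card") ;;
      2)   out+=("network") ;;
      3)   out+=("rpiboot") ;;
      4)   out+=("usb-msd") ;;
      5)   out+=("bcm-usb-msd") ;;
      6)   out+=("nvme") ;;
      7)   out+=("http") ;;
      e|E) out+=("stop") ;;
      f|F) out+=("restart") ;;
      *)   out+=("unknown(${digit})") ;;
    esac
  done
  echo "${out[*]}"
}

decode_boot_order 0xf16   # prints: nvme sd-card restart
```

You can feed it the current setting taken from the output of rpi-eeprom-config to confirm the order before rebooting.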
5.6 After Migration: Keep SD as Emergency Boot
Leave the SD card installed. It now functions as a recovery device: if the NVMe fails or develops corruption, removing the HAT causes the Pi to fall back to the SD card automatically. Periodically refresh the SD card with rpi-clone or rsync so it stays reasonably current as a recovery image.
6. Putting It All Together
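Here is a hardening script that applies the host-level storage, logging, watchdog, and network changes from this guide in one pass, along with a small SSH restart drop-in. The MariaDB, cloudflared, and shutdown-hook changes from sections 3.5, 3.8, and 3.9 still need to be applied separately: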
#!/usr/bin/env bash
# harden-storage.sh — run once after initial deploy
set -euo pipefail
# log2ram
curl -fsSL https://azlux.fr/repo.gpg | sudo gpg --dearmor \
-o /usr/share/keyrings/azlux-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/azlux-archive-keyring.gpg] \
http://packages.azlux.fr/debian/ bookworm main" \
| sudo tee /etc/apt/sources.list.d/azlux.list
sudo apt-get update -qq && sudo apt-get install -y log2ram
sudo sed -i 's/^SIZE=.*/SIZE=128M/' /etc/log2ram.conf
sudo sed -i 's/^USE_RSYNC=.*/USE_RSYNC=true/' /etc/log2ram.conf
# Volatile journald
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/volatile.conf << 'EOF'
[Journal]
Storage=volatile
RuntimeMaxUse=64M
RuntimeMaxFileSize=16M
Compress=yes
EOF
sudo systemctl restart systemd-journald
# fstab hardening
sudo sed -i 's|\(PARTUUID=[^ ]*\s*/\s*ext4\s*\)defaults,noatime|\1defaults,noatime,commit=10,errors=remount-ro|' /etc/fstab
# Docker log limits
sudo tee /etc/docker/daemon.json << 'EOF'
{
"log-driver": "json-file",
"log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
sudo systemctl reload-or-restart docker
# Hardware watchdog
echo 'dtparam=watchdog=on' | sudo tee -a /boot/firmware/config.txt
sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/watchdog.conf << 'EOF'
[Manager]
RuntimeWatchdogSec=15s
RebootWatchdogSec=2min
EOF
# Static IP
IFACE=$(ip route show default | awk '/default/ {print $5; exit}')
sudo tee -a /etc/dhcpcd.conf << EOF
interface ${IFACE}
static ip_address=192.168.0.45/24
static routers=192.168.0.1
static domain_name_servers=192.168.0.1 8.8.8.8
EOF
# SSH resilience
sudo mkdir -p /etc/systemd/system/ssh.service.d
sudo tee /etc/systemd/system/ssh.service.d/resilience.conf << 'EOF'
[Unit]
After=network.target
Before=network-online.target
[Service]
Restart=always
RestartSec=3s
EOF
sudo systemctl daemon-reload
echo "Done. Reboot with: bash scripts/safe-reboot.sh"
7. What Each Layer Protects Against
| Scenario | Protection |
|---|---|
| Hard power cut during MariaDB write | innodb_flush_log_at_trx_commit=2: last ~1s rolls back cleanly |
| Reboot with containers running | safe-reboot.sh plus wordpress-stack-shutdown.service |
| Log writes corrupting ext4 journal | log2ram plus volatile journald |
| Filesystem errors detected at runtime | errors=remount-ro: root goes read-only, SSH still works |
| Kernel hang or infinite loop | Hardware watchdog reboots in 15 seconds |
| cloudflared fails on boot due to network race | After=network-online.target plus Restart=on-failure |
| SSH unreachable after boot | Static IP plus Restart=always on SSH service |
| SD card fills up with logs | Docker log caps plus log2ram |
| SD card wears out | NVMe migration as permanent fix |
8. Summary
SD card corruption on a Raspberry Pi is not bad luck. It is a predictable failure mode with well-understood causes and straightforward mitigations, and the combination of all the layers described here takes a Pi from a fragile single-device server into something that can survive a hard power cut, recover automatically, and run reliably for years.
Never reboot without stopping Docker first: use safe-reboot.sh. Move logs to RAM with log2ram and volatile journald. Tune ext4 and MariaDB to reduce flush frequency and add errors=remount-ro. Enable the hardware watchdog for automatic recovery from kernel hangs. Fix service startup order so that anything internet-dependent waits for network-online.target. And if you want the problem to go away permanently, migrate to NVMe using the Pi 5’s PCIe slot. The hardware is capable of being genuinely production-grade; it just needs a little help from the software layer to get there.