How to Stop Your Raspberry Pi From Failing to Boot (And Survive a Hard Power Cut)

Article Summary
  1. What it is: A step-by-step guide to preventing Raspberry Pi boot failures caused by filesystem corruption, covering diagnosis, software hardening, and migration from SD cards to NVMe storage.
  2. Why it matters: Anyone running a Pi as a production server risks complete data loss from a single power cut or unclean reboot, and most of the fixes are simple configuration changes that take minutes to apply.
  3. Key takeaway: Never reboot a Pi running databases without first stopping Docker containers, and apply filesystem and logging hardening to survive hard power cuts without corruption.

A real-world guide to making your Pi bulletproof, from SD card corruption to NVMe migration

I learned this lesson the hard way. My Raspberry Pi 5 was happily serving a production WordPress site when I rebooted it. Two minutes later there was no SSH, no site, just a solid red light staring back at me. The culprit was MariaDB mid-write when the restart fired, the ext4 journal left half-committed, and the Pi unable to boot as a result. This is one of the most common Raspberry Pi failure modes, and it is almost entirely preventable. This guide covers every layer of the problem: understanding why it happens, hardening your setup to survive a hard power cut at any time, and ultimately migrating to NVMe so the problem goes away permanently.

1. What Actually Happened (And Why)

When you run sudo reboot on a Pi running a database, a race begins. Linux receives the reboot command while Docker containers (MariaDB, nginx, WordPress) are still doing I/O, the kernel starts unmounting filesystems, MariaDB never finishes flushing its write buffers, the ext4 journal is left mid-transaction, and on the next boot the kernel tries to replay the journal and fails. The result is orphaned inodes, corrupt block bitmaps, and a filesystem that refuses to mount cleanly.

What makes this worse on a Pi than on a desktop machine comes down to three compounding factors. A server PSU has roughly 10ms of holdup time after power is cut, while the Pi has zero. SD cards have no power-loss protection, unlike enterprise SSDs whose controllers flush their internal write cache on power loss. And because your OS, logs, and database all share the same card with no background fsck to silently fix filesystem errors, a single interrupted write at the wrong moment can take the entire system down.

2. Diagnosing a Failed Boot

If your Pi shows a solid red light with no green activity, it has not booted. The green LED blink code tells you why: no green activity at all means the SD card is not detected or not seated; three flashes means start.elf was not found due to a corrupted boot partition; four flashes means start.elf cannot launch due to corrupted firmware; seven flashes means the kernel image was not found; and a solid green that then stops means the kernel loaded but the OS crashed.

If the Pi will not boot, pop the SD card into a Mac or Linux machine and run a filesystem check using the e2fsprogs package.

# macOS
brew install e2fsprogs

# Linux
sudo apt-get install e2fsprogs

With e2fsprogs installed, run the following block on your Mac or Linux machine with the SD card inserted. It creates a script that runs a read-only check to show what is broken and then applies safe automatic repairs in place:

cat > /tmp/pi-fsck.sh << 'EOF'
#!/usr/bin/env bash
# pi-fsck.sh — diagnose and repair a Pi SD card from another machine
# Usage: bash /tmp/pi-fsck.sh /dev/disk4s2
#   (replace /dev/disk4s2 with the ext4 partition of your SD card)
set -euo pipefail

DEVICE="${1:?Usage: $0 <device>}"

# Locate e2fsck: Homebrew may not link e2fsprogs into PATH on macOS
if command -v e2fsck &>/dev/null; then
    E2FSCK="e2fsck"
else
    # Search both the Apple Silicon and Intel Homebrew prefixes
    E2FSCK=$(find /opt/homebrew/Cellar/e2fsprogs /usr/local/Cellar/e2fsprogs \
        -name e2fsck -type f 2>/dev/null | head -1 || true)
    if [[ -z "${E2FSCK}" ]]; then
        echo "ERROR: e2fsck not found. Install e2fsprogs first." >&2
        exit 1
    fi
fi

echo "=== Read-only check on ${DEVICE} ==="
sudo "${E2FSCK}" -n "${DEVICE}" || true

echo ""
echo "=== Applying safe automatic repairs (-p) ==="
# e2fsck exit status: 0 = clean, 1 or 2 = errors corrected, 4+ = errors remain
rc=0
sudo "${E2FSCK}" -p "${DEVICE}" || rc=$?
if (( rc < 4 )); then
    echo "Repair complete. SD card should boot."
else
    echo ""
    echo "Errors remain after -p pass. Escalating to -y (fix everything)..."
    rc=0
    sudo "${E2FSCK}" -y "${DEVICE}" || rc=$?
    (( rc < 4 )) && echo "Full repair complete."
fi
EOF
chmod +x /tmp/pi-fsck.sh

Run it with your SD card’s ext4 partition as the argument (check diskutil list on macOS or lsblk on Linux to confirm the device path):

bash /tmp/pi-fsck.sh /dev/disk4s2

Look for output like Orphan file (inode 12) block 39 is not clean or Block bitmap differences: +(2916492--2916495). This is metadata corruption and it is almost always fixable without data loss. The FSCK0000.REC file on the boot partition is a dead giveaway that fsck ran and found problems. The script escalates automatically from safe repairs to a full -y pass if the first pass is not sufficient, after which you can pop the card back in and the Pi should boot.
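When running e2fsck by hand, its exit status says more than the final line of output. The sketch below decodes that status using the bitmask values listed in the e2fsck man page. The function name describe_e2fsck_status is a hypothetical helper for illustration and is not part of e2fsprogs.

```shell
# Decode an e2fsck exit status, which is a bitmask of the conditions below
# (values as listed in the e2fsck man page).
# describe_e2fsck_status is a hypothetical helper name, not part of e2fsprogs.
describe_e2fsck_status() {
    rc="$1"
    if [ "$rc" -eq 0 ]; then
        echo "clean"
    else
        out=""
        [ $((rc & 1)) -ne 0 ]   && out="${out}errors corrected; "
        [ $((rc & 2)) -ne 0 ]   && out="${out}errors corrected, reboot advised; "
        [ $((rc & 4)) -ne 0 ]   && out="${out}errors left uncorrected; "
        [ $((rc & 8)) -ne 0 ]   && out="${out}operational error; "
        [ $((rc & 16)) -ne 0 ]  && out="${out}usage or syntax error; "
        [ $((rc & 32)) -ne 0 ]  && out="${out}cancelled by user; "
        [ $((rc & 128)) -ne 0 ] && out="${out}shared library error; "
        # Trim the trailing separator before printing
        echo "${out%; }"
    fi
}
```

A typical use would be to run the check and then pass $? straight to the helper. A status of 1 or 2 means the repair succeeded, so only 4 and above should be treated as a failure.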

3. The Prevention Stack

The following changes eliminate this class of problem entirely. Each layer targets a different failure mode, and together they take a Pi from “occasionally corrupts when I look at it funny” to a system that survives hard power cuts and keeps running reliably for years.

3.1 Always Stop Docker Before Rebooting

The simplest and most important rule is to never run bare sudo reboot on a Pi running containers. Create a scripts/safe-reboot.sh that gives Docker 30 seconds to flush MariaDB’s write buffers before the kernel touches the filesystem:

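mkdir -p ~/scripts
cat > ~/scripts/safe-reboot.sh << 'EOF'
#!/usr/bin/env bash
# Run from the directory above scripts/, where docker-compose.yml is expected to live
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "${REPO_ROOT}"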

HALT=false
[[ "${1:-}" == "--halt" ]] && HALT=true

echo "Stopping Docker stack gracefully..."
docker compose down --timeout 30
echo "Syncing filesystem..."
sync

if [[ "${HALT}" == "true" ]]; then
    sudo shutdown -h now
else
    sudo reboot
fi
EOF
chmod +x ~/scripts/safe-reboot.sh

3.2 Install log2ram

log2ram redirects /var/log to a RAM disk (tmpfs) and syncs it back to disk only periodically (once a day by default) and on graceful shutdown, which eliminates the most write-intensive part of normal Pi operation.

curl -fsSL https://azlux.fr/repo.gpg | sudo gpg --dearmor \
    -o /usr/share/keyrings/azlux-archive-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/azlux-archive-keyring.gpg] \
    http://packages.azlux.fr/debian/ bookworm main" \
    | sudo tee /etc/apt/sources.list.d/azlux.list

sudo apt-get update && sudo apt-get install -y log2ram

sudo sed -i 's/^SIZE=.*/SIZE=128M/' /etc/log2ram.conf
sudo sed -i 's/^USE_RSYNC=.*/USE_RSYNC=true/' /etc/log2ram.conf

On a hard power cut you lose recent log entries, but the filesystem survives. That is the right trade-off.
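To confirm that /var/log really is RAM-backed after installing log2ram and rebooting, it helps to look at the mount table. The following is a minimal sketch that reads /proc/mounts-style lines from standard input. The helper name mount_fstype is hypothetical, and the expected result assumes log2ram is using its default tmpfs backend.

```shell
# Report the filesystem type mounted at a given path, reading /proc/mounts-style
# lines (device mountpoint fstype options ...) from standard input.
# mount_fstype is a hypothetical helper name used only for this sketch.
mount_fstype() {
    # Keep the last match, since later mounts shadow earlier ones.
    awk -v target="$1" '$2 == target { fstype = $3 } END { if (fstype != "") print fstype }'
}
```

Running mount_fstype /var/log < /proc/mounts on the Pi should print tmpfs once log2ram is active. If it prints nothing, /var/log is not a separate mount and is still being written to the root filesystem.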

3.3 Volatile systemd Journal

By default, systemd writes a binary journal to /var/log/journal, which produces high-frequency disk I/O that can corrupt on power loss. Moving it entirely to RAM fixes this:

sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/volatile.conf << 'EOF'
[Journal]
Storage=volatile
RuntimeMaxUse=64M
RuntimeMaxFileSize=16M
Compress=yes
EOF

sudo systemctl restart systemd-journald

The journal now lives in /run/log/journal in RAM, and the change takes effect immediately without a reboot.

3.4 Harden the Root Partition Mount Options

Add the following options to the root partition entry in /etc/fstab:

PARTUUID=xxxxx  /  ext4  defaults,noatime,commit=10,errors=remount-ro  0  1

noatime stops the filesystem updating the last-accessed timestamp on file reads, cutting SD writes by roughly 30%. commit=10 flushes the ext4 journal every 10 seconds instead of the default 5, halving the number of periodic journal writes; the cost is that up to 10 seconds of recent changes can be lost on a hard power cut, although the filesystem itself stays consistent. errors=remount-ro is the critical safety net: if the kernel detects filesystem errors at runtime, it remounts root as read-only instead of panicking or silently corrupting further, and SSH remains functional in read-only mode.
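Because a typo in /etc/fstab can itself stop the Pi booting, it is worth checking the root entry after editing it. The sketch below reports which of the three options discussed above are missing from the root line. The helper name missing_root_options is hypothetical, and the parsing assumes the usual whitespace-separated fstab layout.

```shell
# Print which of the hardening options from this section are missing from the
# root (/) entry of an fstab-format file. Prints nothing if all are present.
# missing_root_options is a hypothetical helper name for this sketch.
missing_root_options() {
    fstab="${1:-/etc/fstab}"
    # Field 2 is the mount point, field 4 the comma-separated option list.
    opts=$(awk '$1 !~ /^#/ && $2 == "/" { print $4; exit }' "$fstab")
    for want in noatime commit=10 errors=remount-ro; do
        case ",${opts}," in
            *",${want},"*) ;;          # option already present
            *) echo "$want" ;;         # option missing
        esac
    done
}
```

Calling missing_root_options with no argument checks /etc/fstab. An empty result means all three options are in place.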

3.5 Reduce MariaDB Write Frequency

By default, InnoDB flushes its transaction log to disk on every single commit, which is extreme for a blog serving occasional traffic. Changing it to flush once per second is the single most impactful database change for SD card longevity:

[mysqld]
innodb_flush_log_at_trx_commit = 2
sync_binlog                    = 0

The trade-off is that up to roughly one second of database writes could be lost on a hard power cut, but they roll back cleanly with no corruption. For a blog, this is a completely acceptable risk.

3.6 Cap Docker Container Logs

Docker’s default logging writes unbounded JSON files to /var/lib/docker/containers/*/. Create /etc/docker/daemon.json to cap this:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

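Restart Docker with sudo systemctl restart docker. The new limits apply only to containers created after the change, so recreate existing containers (for example with docker compose up -d --force-recreate) to pick them up. Each container’s logs are then capped at 30MB total, which prevents runaway logging from filling the SD card.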
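The 30MB figure is simply max-size multiplied by max-file, and it applies per container. As a rough worked example, the sketch below computes the worst-case log footprint for a whole stack. The helper name docker_log_cap_mb is hypothetical, and the three-container count assumes the MariaDB, nginx, and WordPress stack described earlier.

```shell
# Worst-case disk usage of json-file logs, in megabytes:
#   containers x max-size x max-file
# docker_log_cap_mb is a hypothetical helper name for this sketch.
docker_log_cap_mb() {
    containers="$1" max_size_mb="$2" max_file="$3"
    echo $(( containers * max_size_mb * max_file ))
}

docker_log_cap_mb 1 10 3   # one container with the limits above
docker_log_cap_mb 3 10 3   # assumed three-container stack
```

This prints 30 and then 90, so the whole stack stays under 100MB of container logs even in the worst case.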

3.7 Enable the Hardware Watchdog

The Pi 5 has a hardware watchdog built into the SoC. If the kernel hangs and nothing kicks the watchdog for 15 seconds, it forces a hard reboot. Enable it in /boot/firmware/config.txt:

dtparam=watchdog=on

Then configure systemd to use it:

sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/watchdog.conf << 'EOF'
[Manager]
RuntimeWatchdogSec=15s
RebootWatchdogSec=2min
EOF

RuntimeWatchdogSec=15s forces an automatic hard reboot if the system hangs for 15 seconds, while RebootWatchdogSec=2min forces a reboot if a shutdown or reboot takes more than two minutes due to a stuck service. This takes effect after reboot.

3.8 Fix Service Startup Order

A subtle but important failure mode: cloudflared starts before the network is ready, fails silently, and never recovers. The fix is a systemd drop-in that adds the correct dependencies:

sudo mkdir -p /etc/systemd/system/cloudflared.service.d
sudo tee /etc/systemd/system/cloudflared.service.d/wait-for-network.conf << 'EOF'
[Unit]
After=network-online.target docker.service
Wants=network-online.target

[Service]
Restart=on-failure
RestartSec=10s
EOF

sudo systemctl daemon-reload

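The distinction between After=network.target and After=network-online.target matters enormously here. The former only signals that the network stack has started; interfaces may not yet have an IP address, let alone working DNS. The latter waits until the network manager reports the connection as configured, which in practice means a routable address is available. Use network-online.target for anything that communicates with the internet, and keep Restart=on-failure as a backstop, because DNS can still lag briefly behind. Also add a timeout guard so boot does not stall if the network is unavailable. The drop-in below targets systemd-networkd-wait-online.service; on systems that manage networking with NetworkManager, the corresponding unit is NetworkManager-wait-online.service: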

sudo mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
sudo tee /etc/systemd/system/systemd-networkd-wait-online.service.d/timeout.conf << 'EOF'
[Service]
TimeoutStartSec=15s
EOF

3.9 Graceful Docker Shutdown on Any Reboot

Even if someone runs sudo reboot directly, Docker should stop cleanly. Install a systemd hook to guarantee this:

sudo tee /etc/systemd/system/wordpress-stack-shutdown.service << 'EOF'
[Unit]
Description=Graceful WordPress stack shutdown
DefaultDependencies=no
Before=shutdown.target reboot.target halt.target
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/home/pi/andrewbakerninja-pi/scripts/stack-stop.sh
TimeoutStopSec=60s

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable wordpress-stack-shutdown.service

The stack-stop.sh script referenced by ExecStop is created as follows:

cat > /home/pi/andrewbakerninja-pi/scripts/stack-stop.sh << 'EOF'
#!/usr/bin/env bash
cd /home/pi/andrewbakerninja-pi
docker compose down --timeout 30 2>&1 | logger -t wordpress-shutdown
sync
EOF
chmod +x /home/pi/andrewbakerninja-pi/scripts/stack-stop.sh

With this hook in place, even a bare sudo reboot will trigger a clean Docker shutdown before anything else gets killed.

3.10 Set a Static IP

mDNS .local hostnames can be slow to register on boot, leaving SSH unreachable for 30 to 60 seconds after startup. Setting a static IP means the device is always reachable at a known address the moment the network interface comes up:

sudo tee -a /etc/dhcpcd.conf << 'EOF'
interface wlan0
static ip_address=192.168.0.45/24
static routers=192.168.0.1
static domain_name_servers=192.168.0.1 8.8.8.8
EOF

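Replace wlan0 with eth0 if you are on a wired connection. Check your current interface with ip route show default. Note that /etc/dhcpcd.conf is only read on releases that use dhcpcd; Raspberry Pi OS Bookworm manages networking with NetworkManager, where the same settings are applied with nmcli instead.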
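On a NetworkManager-based system, the same static address could be applied with nmcli. The sketch below only prints the command so it can be reviewed before running. The helper name static_ip_command is hypothetical, the addresses are the example values from this section, and "Wired connection 1" is an assumed profile name (list your own with nmcli connection show).

```shell
# Print (not run) the nmcli command that would pin a static IPv4 address on an
# existing NetworkManager connection profile. Review the output, then run it.
# static_ip_command is a hypothetical helper; the addresses are the example
# values used in this section.
static_ip_command() {
    conn="$1" addr="$2" gateway="$3" dns="$4"
    printf 'sudo nmcli connection modify "%s" ipv4.addresses %s ipv4.gateway %s ipv4.dns "%s" ipv4.method manual\n' \
        "$conn" "$addr" "$gateway" "$dns"
}

# "Wired connection 1" is an assumed profile name, not a guaranteed default.
static_ip_command "Wired connection 1" 192.168.0.45/24 192.168.0.1 "192.168.0.1 8.8.8.8"
```

After running the printed command, bring the profile up again (nmcli connection up followed by the profile name) for the change to take effect. Do this from a local console if possible, since a wrong address will drop an SSH session.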

4. Choosing the Right SD Card

Not all SD cards are equal for Pi server use. The key metric is Total Bytes Written (TBW) endurance, and consumer-grade cards optimise for read speed rather than write durability. For serious use, the Samsung PRO Endurance (approximately 43TB TBW on 64GB) is the best pick, designed for 24/7 write loads in CCTV cameras. The SanDisk Max Endurance (approximately 40TB TBW) is a solid alternative that is often slightly cheaper. Avoid the SanDisk Ultra, Kingston Canvas, and any random eBay brands: the consumer-grade cards deliver only 3 to 5TB TBW and will die within a year or two under server load, while counterfeit cards are rampant on marketplaces and often present an 8GB card as 64GB.

You can test for counterfeit cards with f3probe:

sudo apt-get install f3
sudo f3probe --destructive --time-ops /dev/mmcblk0

5. The Permanent Fix: Migrating to NVMe

SD cards are fundamentally the wrong storage medium for a server. The Pi 5 has a PCIe 2.0 x1 connector specifically for NVMe, and moving to it eliminates the entire class of problems described in this guide.

5.1 Hardware You Need

The official option is the Raspberry Pi M.2 HAT+, which supports M.2 2230 and 2242 form factors, requires M-key NVMe (B+M key drives may have physical clearance issues), and plugs into the Pi 5’s PCIe FPC connector. Solid third-party alternatives include the Geekworm X1008/X1009 and the Pimoroni NVMe Base, which are the route to take for longer 2280 drives. For the NVMe drive itself, any M.2 NVMe will work; the WD Black SN770 and Samsung 980 are reliable choices at the 1TB tier. The PCIe Gen 2 x1 connection gives you roughly 500MB/s, which is not the drive’s full rated speed but is about ten times faster than the SD card.

5.2 Enabling PCIe Gen 3 (Optional)

The Pi 5 is only certified for PCIe Gen 2, but the link can be forced to Gen 3, pushing theoretical throughput to around 1GB/s. Add this to /boot/firmware/config.txt to enable it:

dtparam=pciex1_gen=3

Results vary by drive, as some drives do not negotiate Gen 3 correctly on Pi hardware. Test with Gen 2 first and only enable this if you want to squeeze out extra performance.
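The 500MB/s and 1GB/s figures follow from the PCIe line rate and encoding overhead of a single lane. The sketch below works through that arithmetic using integer maths and ignoring protocol overhead, so real-world throughput will be somewhat lower. The helper name pcie_lane_mbps is hypothetical.

```shell
# Usable bandwidth of a single PCIe lane in MB/s, ignoring protocol overhead:
#   transfer rate (GT/s) x encoding efficiency / 8 bits per byte
# pcie_lane_mbps is a hypothetical helper name for this sketch.
pcie_lane_mbps() {
    gts="$1" payload_bits="$2" line_bits="$3"
    echo $(( gts * 1000 * payload_bits / line_bits / 8 ))
}

pcie_lane_mbps 5 8 10      # Gen 2: 5 GT/s with 8b/10b encoding
pcie_lane_mbps 8 128 130   # Gen 3: 8 GT/s with 128b/130b encoding
```

This prints 500 for Gen 2 and 984 for Gen 3. The jump is close to double because Gen 3 raises the line rate and also wastes far fewer bits on encoding.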

5.3 Updating the EEPROM Firmware

Before migrating, ensure your EEPROM is current. The minimum version for stable NVMe boot is 2024-04-16 or later:

sudo rpi-eeprom-update

# If an update is available:
sudo rpi-eeprom-update -a
sudo reboot

5.4 Migrating the Root Filesystem

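The simplest method is rpi-clone, which creates an exact copy of your SD card on the NVMe drive, updates UUIDs and fstab, and configures the bootloader automatically. rpi-clone is a third-party script, so if apt cannot find a package for it on your release, install it from the project’s GitHub repository instead: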

sudo apt-get install rpi-clone
lsblk  # Identify your NVMe device, usually /dev/nvme0n1
sudo rpi-clone /dev/nvme0n1

If you prefer more control, you can use rsync directly: partition and format the NVMe, mount it, copy everything across excluding virtual filesystems, update fstab and the root= entry in cmdline.txt on the new drive to reference the NVMe root partition’s PARTUUID (obtained via sudo blkid /dev/nvme0n1p2), sync, and unmount.

5.5 Setting the Boot Order

Tell the EEPROM to boot from NVMe first with the SD card as fallback:

sudo rpi-eeprom-config --edit

Find the BOOT_ORDER line and change it to:

BOOT_ORDER=0xf16

Boot order is read right-to-left, so this means: try NVMe (PCIe, code 6) first, fall back to SD card (code 1), and restart from the beginning if all options fail (code f). Save, exit, and reboot. Verify the Pi booted from NVMe with findmnt /, which should show /dev/nvme0n1p2 rather than /dev/mmcblk0p2.
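Since the right-to-left reading is easy to get backwards, here is a small sketch that decodes a BOOT_ORDER value into the sequence the bootloader will try. The helper name decode_boot_order is hypothetical. The codes for NVMe, SD card, and restart are the ones given above; the network, USB, and stop codes are taken from the Raspberry Pi bootloader documentation, and any other digit is printed as-is rather than guessed.

```shell
# Decode a BOOT_ORDER value into the sequence the bootloader will try.
# Digits are read right to left, one boot mode per hex digit.
# decode_boot_order is a hypothetical helper; only the modes most relevant
# to this guide are named, anything else is printed as "mode 0x<digit>".
decode_boot_order() {
    value="${1#0x}"                      # strip the 0x prefix
    result=""
    while [ -n "$value" ]; do
        digit="${value#"${value%?}"}"    # take the last character
        value="${value%?}"               # and drop it from the value
        case "$digit" in
            1) name="SD card" ;;
            2) name="network" ;;
            4) name="USB mass storage" ;;
            6) name="NVMe" ;;
            e|E) name="stop" ;;
            f|F) name="restart" ;;
            *) name="mode 0x${digit}" ;;
        esac
        result="${result:+${result} -> }${name}"
    done
    echo "$result"
}

decode_boot_order 0xf16
```

For 0xf16 this prints NVMe -> SD card -> restart, matching the order described above.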

5.6 After Migration: Keep SD as Emergency Boot

Leave the SD card installed. It now functions as a recovery device: if the NVMe fails or develops corruption, removing the HAT causes the Pi to fall back to the SD card automatically. Periodically refresh the SD card with rpi-clone or rsync so it stays reasonably current as a recovery image.

6. Putting It All Together

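Here is a hardening script that applies the system-level changes from this guide in one pass, plus a small drop-in that keeps SSH restarting if it ever exits. The MariaDB, cloudflared, and shutdown-hook changes above are applied separately: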

#!/usr/bin/env bash
# harden-storage.sh — run once after initial deploy
set -euo pipefail

# log2ram
curl -fsSL https://azlux.fr/repo.gpg | sudo gpg --dearmor \
    -o /usr/share/keyrings/azlux-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/azlux-archive-keyring.gpg] \
    http://packages.azlux.fr/debian/ bookworm main" \
    | sudo tee /etc/apt/sources.list.d/azlux.list
sudo apt-get update -qq && sudo apt-get install -y log2ram
sudo sed -i 's/^SIZE=.*/SIZE=128M/' /etc/log2ram.conf
sudo sed -i 's/^USE_RSYNC=.*/USE_RSYNC=true/' /etc/log2ram.conf

# Volatile journald
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/volatile.conf << 'EOF'
[Journal]
Storage=volatile
RuntimeMaxUse=64M
RuntimeMaxFileSize=16M
Compress=yes
EOF
sudo systemctl restart systemd-journald

# fstab hardening
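# The trailing \(\s\) keeps this idempotent: a line that already has the extra options no longer matches
sudo sed -i 's|\(PARTUUID=[^ ]*\s*/\s*ext4\s*\)defaults,noatime\(\s\)|\1defaults,noatime,commit=10,errors=remount-ro\2|' /etc/fstab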

# Docker log limits
sudo tee /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
sudo systemctl restart docker

# Hardware watchdog
grep -q '^dtparam=watchdog=on' /boot/firmware/config.txt \
    || echo 'dtparam=watchdog=on' | sudo tee -a /boot/firmware/config.txt
sudo mkdir -p /etc/systemd/system.conf.d
sudo tee /etc/systemd/system.conf.d/watchdog.conf << 'EOF'
[Manager]
RuntimeWatchdogSec=15s
RebootWatchdogSec=2min
EOF

# Static IP (dhcpcd-based releases only; NetworkManager ignores /etc/dhcpcd.conf)
IFACE=$(ip route show default | awk '/default/ {print $5; exit}')
sudo tee -a /etc/dhcpcd.conf << EOF
interface ${IFACE}
static ip_address=192.168.0.45/24
static routers=192.168.0.1
static domain_name_servers=192.168.0.1 8.8.8.8
EOF

# SSH resilience
sudo mkdir -p /etc/systemd/system/ssh.service.d
sudo tee /etc/systemd/system/ssh.service.d/resilience.conf << 'EOF'
[Unit]
After=network.target
Before=network-online.target

[Service]
Restart=always
RestartSec=3s
EOF

sudo systemctl daemon-reload
echo "Done. Reboot with: bash scripts/safe-reboot.sh"

7. What Each Layer Protects Against

Scenario | Protection
Hard power cut during MariaDB write | innodb_flush_log_at_trx_commit=2: last ~1s rolls back cleanly
Reboot with containers running | safe-reboot.sh plus wordpress-stack-shutdown.service
Log writes corrupting ext4 journal | log2ram plus volatile journald
Filesystem errors detected at runtime | errors=remount-ro: root goes read-only, SSH still works
Kernel hang or infinite loop | Hardware watchdog reboots in 15 seconds
cloudflared fails on boot due to network race | After=network-online.target plus Restart=on-failure
SSH unreachable after boot | Static IP plus Restart=always on SSH service
SD card fills up with logs | Docker log caps plus log2ram
SD card wears out | NVMe migration as permanent fix

8. Summary

SD card corruption on a Raspberry Pi is not bad luck. It is a predictable failure mode with well-understood causes and straightforward mitigations, and the combination of all the layers described here takes a Pi from a fragile single-device server into something that can survive a hard power cut, recover automatically, and run reliably for years.

Never reboot without stopping Docker first: use safe-reboot.sh. Move logs to RAM with log2ram and volatile journald. Tune ext4 and MariaDB to reduce flush frequency and add errors=remount-ro. Enable the hardware watchdog for automatic recovery from kernel hangs. Fix service startup order so that anything internet-dependent waits for network-online.target. And if you want the problem to go away permanently, migrate to NVMe using the Pi 5’s PCIe slot. The hardware is capable of being genuinely production-grade; it just needs a little help from the software layer to get there.