A Simple Script to Check if Your Page is SEO and AEO Friendly

Search engines no longer operate alone. Your content is now consumed by
Google, Bing, Perplexity, ChatGPT, Claude, Gemini, and dozens of other
AI-driven systems that crawl the web and extract answers.

Classic SEO focuses on ranking. Modern discovery also requires AEO (Answer Engine Optimization), which focuses on being understood and extracted by AI systems. A marketing page must therefore satisfy four technical conditions:

  1. It must be crawlable
  2. It must be indexable
  3. It must be structured so machines understand it
  4. It must contain content that AI systems can extract and summarize

Many sites fail before content quality even matters. Robots rules block
crawlers, canonical tags are missing, structured data is absent, or the
page simply contains too little readable content.

The easiest way to diagnose this is to run a single script that inspects
the page like a crawler would.

The following Bash script performs a quick diagnostic to check whether
your page is friendly for both search engines and AI answer systems.

The script focuses only on technical discoverability, not marketing copy
quality.

2. What the Script Checks

The script inspects the following signals.

Crawlability

  • robots.txt presence
  • sitemap.xml presence
  • HTTP response status

Indexability

  • canonical tag
  • robots meta directives
  • noindex detection

Search Metadata

  • title tag
  • meta description
  • OpenGraph tags

Structured Data

  • JSON-LD schema detection

Content Structure

  • heading structure
  • word count
  • lists and FAQ signals

AI Extraction Signals

  • presence of lists
  • FAQ style content
  • paragraph density

This combination gives a quick technical indication of whether a page is
discoverable and understandable by both crawlers and AI systems.

3. Installation Script

Run the following command once on your Mac. It will create the
diagnostic script and make it executable.

cat << 'EOF' > ~/seo-aeo-check.sh
#!/usr/bin/env bash

set -euo pipefail

URL="${1:-}"

if [[ -z "$URL" ]]; then
  echo "Usage: seo-aeo-check.sh https://example.com/page"
  exit 1
fi

UA="Mozilla/5.0 (compatible; SEO-AEO-Inspector/1.0)"

TMP=$(mktemp -d)
BODY="$TMP/body.html"
HEAD="$TMP/headers.txt"

cleanup() { rm -rf "$TMP"; }
trap cleanup EXIT

pass=0
warn=0
fail=0

p(){ echo "PASS  $1"; pass=$((pass+1)); }
w(){ echo "WARN  $1"; warn=$((warn+1)); }
f(){ echo "FAIL  $1"; fail=$((fail+1)); }

echo
echo "========================================"
echo "SEO / AEO PAGE ANALYSIS"
echo "========================================"
echo

curl -sSL -A "$UA" -D "$HEAD" "$URL" -o "$BODY"

# With redirects, -D captures every response's headers; take the final ones.
status=$(grep -i '^HTTP' "$HEAD" | tail -1 | awk '{print $2}' || true)
ctype=$(grep -i '^content-type' "$HEAD" | tail -1 | awk '{print $2}' || true)

echo "URL: $URL"
echo "Status: $status"
echo "Content type: $ctype"
echo

if [[ "$status" =~ ^2 ]]; then
  p "Page returns successful HTTP status"
else
  f "Page does not return HTTP 200"
fi

# "|| true" keeps set -e/pipefail from aborting when no title is found
title=$(grep -i "<title>" "$BODY" | sed -e 's/<[^>]*>//g' | head -1 || true)

if [[ -n "$title" ]]; then
  p "Title tag present"
  echo "Title: $title"
else
  f "Missing title tag"
fi

desc=$(grep -i 'meta name="description"' "$BODY" || true)

if [[ -n "$desc" ]]; then
  p "Meta description present"
else
  w "Meta description missing"
fi

canon=$(grep -i 'rel="canonical"' "$BODY" || true)

if [[ -n "$canon" ]]; then
  p "Canonical tag found"
else
  f "Canonical tag missing"
fi

robots=$(grep -i 'meta name="robots"' "$BODY" || true)

if [[ "$robots" == *noindex* ]]; then
  f "Page contains noindex directive"
else
  p "No index blocking meta tag"
fi

og=$(grep -i 'property="og:title"' "$BODY" || true)

if [[ -n "$og" ]]; then
  p "OpenGraph tags present"
else
  w "OpenGraph tags missing"
fi

schema=$(grep -i 'application/ld+json' "$BODY" || true)

if [[ -n "$schema" ]]; then
  p "JSON-LD structured data detected"
else
  w "No structured data detected"
fi

# -o counts every occurrence, even when the HTML is minified onto one line;
# "|| true" keeps set -e/pipefail from aborting when no H1 exists
h1=$(grep -oi "<h1" "$BODY" | wc -l | tr -d ' ' || true)

if [[ "$h1" == "1" ]]; then
  p "Single H1 detected"
elif [[ "$h1" == "0" ]]; then
  f "No H1 found"
else
  w "Multiple H1 tags"
fi

words=$(sed 's/<[^>]*>/ /g' "$BODY" | wc -w | tr -d ' ')

echo "Word count: $words"

if [[ "$words" -gt 300 ]]; then
  p "Page contains enough textual content"
else
  w "Thin content detected"
fi

domain=$(echo "$URL" | awk -F/ '{print $1"//"$3}')
robots_url="$domain/robots.txt"

if curl -s -A "$UA" "$robots_url" | grep -q "User-agent"; then
  p "robots.txt detected"
else
  w "robots.txt missing"
fi

sitemap="$domain/sitemap.xml"

if [[ "$(curl -s -o /dev/null -A "$UA" -w '%{http_code}' "$sitemap" || true)" == "200" ]]; then
  p "Sitemap detected"
else
  w "No sitemap.xml found"
fi

faq=$(grep -i "FAQ" "$BODY" || true)

if [[ -n "$faq" ]]; then
  p "FAQ style content detected"
else
  w "No FAQ style content"
fi

lists=$(grep -iE "<(ul|ol)" "$BODY" || true)

if [[ -n "$lists" ]]; then
  p "Lists present, which helps answer extraction"
else
  w "No lists found"
fi

echo
echo "========================================"
echo "RESULT"
echo "========================================"
echo "Pass: $pass"
echo "Warn: $warn"
echo "Fail: $fail"

total=$((pass+warn+fail))
score=$((pass*100/total))

echo "SEO/AEO Score: $score/100"
echo
echo "Done."
EOF

chmod +x ~/seo-aeo-check.sh

4. Running the Diagnostic

You can now check any page with a single command.

~/seo-aeo-check.sh https://yourdomain.com/page

Example:

~/seo-aeo-check.sh https://andrewbaker.ninja

The script will print a simple report showing pass signals, warnings,
failures, and an overall score.

5. How to Interpret the Results

Failures normally indicate hard blockers such as:

  • missing canonical tags
  • no H1 heading
  • noindex directives
  • HTTP errors

Warnings normally indicate optimization opportunities such as:

  • missing structured data
  • thin content
  • lack of lists or FAQ style sections
  • missing OpenGraph tags

For AI answer systems, the most important structural signals are:

  • clear headings
  • structured lists
  • question based sections
  • FAQ schema
  • sufficient readable text

Without these signals many AI systems struggle to extract meaningful
answers.
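The FAQ schema item above is expressed as JSON-LD in the page head. A minimal sketch of a schema.org FAQPage block, with placeholder question and answer text:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is AEO?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Answer Engine Optimization: structuring content so AI systems can extract and cite it."
    }
  }]
}
</script>
```

Each question on the page becomes one entry in the mainEntity array.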

6. Why This Matters More in the AI Era

Search engines index pages. AI systems extract answers.

That difference means structure now matters as much as keywords. Pages that perform well for AI discovery tend to include:

  • clear headings
  • structured content blocks
  • lists and steps
  • explicit questions and answers
  • schema markup

When these signals exist, your content becomes much easier for AI
systems to interpret and reference. In other words, good AEO makes your content easier for machines to read, summarize, and cite. And in an AI-driven discovery ecosystem, that visibility increasingly
determines whether your content is seen at all.

The Hitchhiker's Guide to Fixing Why a Thumbnail Image Does Not Show for Your Article on WhatsApp, LinkedIn, Twitter, or Instagram

When you share a link on WhatsApp, LinkedIn, X, or Instagram and nothing appears except a bare URL, it feels broken in a way that is surprisingly hard to diagnose. The page loads fine in a browser, the image exists, the og:image tag is there, yet the preview is blank. This post gives you a single unified diagnostic script that checks every known failure mode, produces a categorised report, and flags the specific fix for each issue it finds. It then walks through each failure pattern in detail so you understand what the output means and what to do about it.

1. How Link Preview Crawlers Work

When you paste a URL into WhatsApp, LinkedIn, X, or Instagram, the platform does not wait for you to send it. A background process immediately dispatches a headless HTTP request to that URL and this request looks like a bot because it is one. It reads the page’s <head> section, extracts Open Graph meta tags, fetches the og:image URL, and caches the result. The preview you see is assembled entirely from that cached crawl with no browser rendering involved at any point.

Every platform runs its own crawler with its own user agent string, its own image dimension requirements, its own file size tolerance, and its own sensitivity to HTTP response headers. If anything in that chain fails, the preview either shows no image, shows the wrong image, or does not render at all. The key insight is that your website must serve correct, accessible, standards-compliant responses not to humans in browsers but to automated crawlers that look nothing like browsers. Security rules that protect against bots can inadvertently block the very crawlers you need.
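In miniature, that extraction step looks like this. The sketch below uses sample markup in place of a fetched page; real pages vary attribute order, which the full script later in this post accounts for.

```shell
# What a preview crawler does after fetching: scan the markup for
# og: properties and read out their content values. The sample HTML
# here stands in for a real fetched page.
html='<meta property="og:title" content="Demo Title" />
<meta property="og:image" content="https://example.com/hero.jpg" />'

extract_og() {
  # Emit "property -> content" for every og: tag on stdin.
  grep -oE 'property="og:[^"]+" content="[^"]+"' |
    sed -E 's/property="([^"]+)" content="([^"]+)"/\1 -> \2/'
}

echo "$html" | extract_og
# og:title -> Demo Title
# og:image -> https://example.com/hero.jpg
```

If that pipeline returns nothing against your page, the crawler sees nothing either.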

2. Platform Requirements at a Glance

Platform     | Crawler User Agent                               | Recommended og:image Size | Max File Size | Aspect Ratio
WhatsApp     | WhatsApp/2.x, facebookexternalhit/1.1, Facebot   | 1200 x 630 px             | ~300 KB       | 1.91:1
LinkedIn     | LinkedInBot/1.0                                  | 1200 x 627 px             | 5 MB          | 1.91:1
X (Twitter)  | Twitterbot/1.0                                   | 1200 x 675 px             | 5 MB          | 1.91:1
Instagram    | facebookexternalhit/1.1                          | 1200 x 630 px             | 8 MB          | 1.91:1
Facebook     | facebookexternalhit/1.1                          | 1200 x 630 px             | 8 MB          | 1.91:1
iMessage     | facebookexternalhit/1.1, Facebot, Twitterbot/1.0 | 1200 x 630 px             | 5 MB          | 1.91:1

The minimum required OG tags across all platforms are the same five properties, and every page you want to share should carry all of them:

<meta property="og:title" content="Your Page Title" />
<meta property="og:description" content="A brief description" />
<meta property="og:image" content="https://example.com/image.jpg" />
<meta property="og:url" content="https://example.com/page" />
<meta property="og:type" content="article" />

X additionally requires two Twitter Card tags to render the large image preview format. Without these, X falls back to a small summary card with no prominent image:

<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://example.com/image.jpg" />

3. Why WhatsApp Is the Most Sensitive Platform

WhatsApp imposes constraints that none of the other major platforms enforce as strictly and most of them are undocumented. The first and most commonly missed is the image file size limit. Facebook supports og:image files up to 8 MB and LinkedIn up to 5 MB, but WhatsApp silently drops the thumbnail if the image exceeds roughly 300 KB. There is no error anywhere in your logs, no HTTP error code, no indication in Cloudflare analytics, and the preview simply renders without an image. WhatsApp also caches that failure, which means users who share the link shortly after you publish will see a bare URL even after you fix the underlying image.
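A quick way to test for this specific failure is to measure the image exactly as a crawler would download it. This helper is a sketch; the URL in the usage note is a placeholder.

```shell
# Report the served size of an og:image in KB, fetched the way a
# crawler fetches it. WhatsApp silently drops thumbnails over ~300 KB,
# so anything near that figure is worth recompressing.
og_image_kb() {
  curl -s -L --max-time 15 "$1" | wc -c | awk '{printf "%.1f\n", $1/1024}'
}
```

Usage: og_image_kb "https://example.com/hero.jpg" prints the size in KB as the crawler receives it, after any CDN transforms.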

A single WhatsApp link share can trigger requests from three distinct Meta crawler user agents: WhatsApp/2.x, facebookexternalhit/1.1, and Facebot. If your WAF or bot protection blocks any one of them the preview fails. Cloudflare’s Super Bot Fight Mode treats facebookexternalhit as an automated bot by default and will challenge or block it unless you have explicitly created an exception. Unlike LinkedIn’s bot which retries on challenge pages, WhatsApp’s crawler has no retry mechanism and if it gets a 403, a challenge page, or a slow response, it caches the failure immediately.

Response time compounds this further because WhatsApp's crawler has an aggressive timeout, and if your origin server takes more than a few seconds to respond, the crawl times out before it can read any OG tags at all. This matters most on cold start servers or on cache miss paths where your origin has to run full PHP to generate the page. Redirect chains make things worse still because each hop consumes time against WhatsApp's timeout budget, and a chain of three or four redirects on a slow origin can tip a borderline-fast site over the threshold. The diagnostic script follows every redirect and reports the final status and total response time so you can see when the chain is costing you.

4. The Unified Diagnostic Script

This is the only script you need. Run it against any URL and it produces a full categorised report covering all known failure modes. It tests everything in a single pass: OG and Twitter Card tags, og:image accessibility, size, MIME type and dimensions, HTTPS, response time, security headers such as CSP and X-Content-Type-Options, accessibility for each major crawler user agent, robots.txt directives, and Cloudflare detection.

The install block below writes the script to disk and makes it executable in one paste, creating check-social-preview.sh ready to use. Then point it at any URL with bash check-social-preview.sh https://yoursite.com/your-post/.

cat > check-social-preview.sh << 'EOF'
#!/usr/bin/env bash
# check-social-preview.sh
# Usage: bash check-social-preview.sh <url>
# Runs a full diagnostic against social preview crawler requirements.
# Tests: OG tags, image size, HTTPS, CSP, X-Content-Type-Options,
#        response time, robots.txt, and crawler accessibility.

set -uo pipefail

TARGET_URL="${1:-}"

if [[ -z "$TARGET_URL" ]]; then
  echo "Usage: bash check-social-preview.sh <url>"
  exit 1
fi

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
BOLD='\033[1m'
RESET='\033[0m'

PASS=0
WARN=0
FAIL=0

pass() { echo -e "  ${GREEN}[PASS]${RESET} $1"; PASS=$(( PASS + 1 )); }
warn() { echo -e "  ${YELLOW}[WARN]${RESET} $1"; WARN=$(( WARN + 1 )); }
fail() { echo -e "  ${RED}[FAIL]${RESET} $1"; FAIL=$(( FAIL + 1 )); }
section() { echo -e "\n${BOLD}${CYAN}--- $1 ---${RESET}"; }

WA_UA="WhatsApp/2.23.24.82 A"
FB_UA="facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
LI_UA="LinkedInBot/1.0 (compatible; Mozilla/5.0; Apache-HttpClient +http://www.linkedin.com)"
TW_UA="Twitterbot/1.0"

TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

echo -e "\n${BOLD}=============================================="
echo -e "  Social Preview Health Check"
echo -e "  $TARGET_URL"
echo -e "==============================================${RESET}\n"

# -------------------------------------------------------
section "1. HTTPS"

if [[ "$TARGET_URL" =~ ^https:// ]]; then
  pass "URL uses HTTPS"
else
  fail "URL uses HTTP. WhatsApp requires HTTPS for previews."
fi

# -------------------------------------------------------
section "2. HTTP Response (WhatsApp UA)"

HTTP_STATUS=$(curl -o /dev/null -s -w "%{http_code}" -A "$WA_UA" -L --max-time 10 "$TARGET_URL" || echo "000")
if [[ "$HTTP_STATUS" == "200" ]]; then
  pass "HTTP 200 OK for WhatsApp user agent"
elif [[ "$HTTP_STATUS" == "301" || "$HTTP_STATUS" == "302" ]]; then
  warn "Redirect ($HTTP_STATUS). Crawler follows redirects but this adds latency."
else
  fail "Non-200 response: $HTTP_STATUS. Crawler may not read page content."
fi

# -------------------------------------------------------
section "3. Response Time"

TOTAL_TIME=$(curl -o /dev/null -s -w "%{time_total}" -A "$WA_UA" -L --max-time 15 "$TARGET_URL" || echo "0")
TTFB=$(curl -o /dev/null -s -w "%{time_starttransfer}" -A "$WA_UA" -L --max-time 15 "$TARGET_URL" || echo "0")
echo "  Total: ${TOTAL_TIME}s  TTFB: ${TTFB}s"

# Convert float to integer milliseconds using awk (no bc dependency)
TOTAL_MS=$(echo "$TOTAL_TIME" | awk '{printf "%d", $1 * 1000}')
if [ "$TOTAL_MS" -lt 3000 ] 2>/dev/null; then
  pass "Response time under 3s"
elif [ "$TOTAL_MS" -lt 5000 ] 2>/dev/null; then
  warn "Response time between 3 and 5s. WhatsApp crawler may timeout."
else
  fail "Response time over 5s. Crawler will likely timeout before reading OG tags."
fi

# -------------------------------------------------------
section "4. Open Graph Tags"

HTML=$(curl -s -A "$WA_UA" -L --max-time 10 "$TARGET_URL" 2>/dev/null || echo "")

check_og_tag() {
  local tag="$1"
  local label="$2"
  local val
  val=$(echo "$HTML" | grep -oiE "property=\"${tag}\"[^>]+content=\"[^\"]+\"" \
    | grep -oiE 'content="[^"]+"' | sed 's/content="//;s/"//' | head -1 || echo "")
  if [[ -z "$val" ]]; then
    val=$(echo "$HTML" | grep -oiE "content=\"[^\"]+\"[^>]+property=\"${tag}\"" \
      | grep -oiE 'content="[^"]+"' | sed 's/content="//;s/"//' | head -1 || echo "")
  fi
  if [[ -n "$val" ]]; then
    pass "${label}: $val" >&2
  else
    fail "${label} missing" >&2
  fi
  echo "$val"
}

OG_TITLE=$(check_og_tag "og:title" "og:title")
OG_DESC=$(check_og_tag "og:description" "og:description")
OG_IMAGE=$(check_og_tag "og:image" "og:image")
OG_URL=$(check_og_tag "og:url" "og:url")
OG_TYPE=$(check_og_tag "og:type" "og:type")

TW_CARD=$(echo "$HTML" | grep -oiE 'name="twitter:card"[^>]+content="[^"]+"' \
  | grep -oiE 'content="[^"]+"' | sed 's/content="//;s/"//' | head -1 || echo "")
TW_IMAGE=$(echo "$HTML" | grep -oiE 'name="twitter:image"[^>]+content="[^"]+"' \
  | grep -oiE 'content="[^"]+"' | sed 's/content="//;s/"//' | head -1 || echo "")

if [[ -n "$TW_CARD" ]]; then
  pass "twitter:card: $TW_CARD"
else
  warn "twitter:card missing. X/Twitter will not render large image cards."
fi

if [[ -n "$TW_IMAGE" ]]; then
  pass "twitter:image: $TW_IMAGE"
else
  warn "twitter:image missing. X/Twitter falls back to og:image, which usually works."
fi

# -------------------------------------------------------
section "5. og:image Analysis"

if [[ -z "$OG_IMAGE" ]]; then
  fail "Cannot analyse og:image because the tag is missing."
else
  IMG_STATUS=$(curl -o /dev/null -s -w "%{http_code}" -A "$WA_UA" -L --max-time 10 "$OG_IMAGE" || echo "000")
  if [[ "$IMG_STATUS" == "200" ]]; then
    pass "og:image URL returns HTTP 200"
  else
    fail "og:image URL returns HTTP $IMG_STATUS. Image is inaccessible to the crawler."
  fi

  if [[ "$OG_IMAGE" =~ ^https:// ]]; then
    pass "og:image uses HTTPS"
  else
    fail "og:image uses HTTP. WhatsApp requires HTTPS images."
  fi

  IMG_CT=$(curl -sI -A "$WA_UA" -L --max-time 10 "$OG_IMAGE" \
    | grep -i "^content-type:" | head -1 | cut -d: -f2- | xargs 2>/dev/null || echo "")
  if [[ "$IMG_CT" =~ image/ ]]; then
    pass "og:image Content-Type: $IMG_CT"
  else
    warn "og:image Content-Type unexpected: '$IMG_CT'. WhatsApp may reject non-image MIME types."
  fi

  IMG_FILE="${TMP}/ogimage"
  curl -s -A "$WA_UA" -L --max-time 15 -o "$IMG_FILE" "$OG_IMAGE" 2>/dev/null || true
  if [[ -f "$IMG_FILE" ]]; then
    SIZE_BYTES=$(wc -c < "$IMG_FILE" | tr -d ' ')
    SIZE_KB=$(echo "$SIZE_BYTES" | awk '{printf "%.1f", $1 / 1024}')
    if [ "$SIZE_BYTES" -gt 307200 ] 2>/dev/null; then
      fail "Image is ${SIZE_KB} KB and exceeds WhatsApp's undocumented ~300 KB limit. This is the most common cause of missing WhatsApp thumbnails."
    elif [ "$SIZE_BYTES" -gt 204800 ] 2>/dev/null; then
      warn "Image is ${SIZE_KB} KB and is approaching the WhatsApp 300 KB limit. Consider optimising."
    else
      pass "Image size: ${SIZE_KB} KB (well within the 300 KB WhatsApp limit)"
    fi

    if command -v identify &>/dev/null; then
      DIMS=$(identify -format '%wx%h' "$IMG_FILE" 2>/dev/null || echo "unknown")
      W=$(echo "$DIMS" | cut -dx -f1)
      H=$(echo "$DIMS" | cut -dx -f2)
      echo "  Image dimensions: ${DIMS}px"
      # Guard against non-numeric W/H when identify could not read the file
      if [ "$DIMS" = "unknown" ]; then
        warn "Could not read image dimensions from the downloaded file."
      elif [ "$W" -ge 1200 ] && [ "$H" -ge 630 ]; then
        pass "Image dimensions meet the 1200x630 minimum recommendation"
      elif [ "$W" -ge 600 ]; then
        warn "Image is ${DIMS}px. Minimum 1200x630 is recommended for large card previews."
      else
        fail "Image is ${DIMS}px and is too small for reliable previews on most platforms."
      fi
    else
      warn "ImageMagick not installed so image dimensions could not be checked. Install with: brew install imagemagick"
    fi
  else
    warn "Could not download image for size check"
  fi
fi

# -------------------------------------------------------
section "6. HTTP Security Headers (Crawler Compatibility)"

HEADERS=$(curl -sI -A "$WA_UA" -L --max-time 10 "$TARGET_URL" || echo "")

CSP=$(echo "$HEADERS" | grep -i "^content-security-policy:" | head -1 | cut -d: -f2- || echo "")
if [[ -n "$CSP" ]]; then
  echo "  CSP present: $CSP"
  if echo "$CSP" | grep -qi "img-src"; then
    # Pass if img-src contains https: wildcard or * (allows all HTTPS images)
    if echo "$CSP" | grep -qi "img-src[^;]*https:"; then
      pass "CSP img-src allows all HTTPS images. og:image will be accessible to crawlers."
    elif echo "$CSP" | grep -qi "img-src[^;]* \*"; then
      pass "CSP img-src uses wildcard. og:image will be accessible to crawlers."
    else
      warn "CSP has a restrictive img-src directive. Verify it includes your image CDN domain."
    fi
  else
    pass "CSP does not restrict img-src"
  fi
else
  warn "No Content-Security-Policy header present. Recommended for security, though not required for previews."
fi

XCTO=$(echo "$HEADERS" | grep -i "^x-content-type-options:" | head -1 || echo "")
if [[ -n "$XCTO" ]]; then
  pass "X-Content-Type-Options present: $XCTO"
else
  warn "X-Content-Type-Options missing. Add nosniff for security."
fi

# -------------------------------------------------------
section "7. Crawler Accessibility (Multiple User Agents)"

# check_ua fetches the full HTML body for each crawler UA.
# It passes if HTTP 200 is returned AND og:image is present in the response.
# It notes if Cloudflare Bot Fight Mode has injected a challenge-platform script
# alongside readable OG tags — this is benign cached residue, not an active block.
# It fails if og:image is absent regardless of challenge injection, or if HTTP != 200.
check_ua() {
  local label="$1"
  local ua="$2"
  local tmpfile="${TMP}/ua_$(echo "$label" | tr ' ' '_')"

  local code
  code=$(curl -s -o "$tmpfile" -w "%{http_code}" -A "$ua" -L --max-time 10 "$TARGET_URL" 2>/dev/null || echo "000")

  if [[ "$code" != "200" ]]; then
    fail "$label: HTTP $code. This crawler is being blocked or challenged."
    return
  fi

  local has_og
  has_og=$(grep -oiE 'property="og:image"' "$tmpfile" 2>/dev/null | head -1 || echo "")

  local has_challenge
  has_challenge=$(grep -c "challenge-platform" "$tmpfile" 2>/dev/null || true)

  if [[ -n "$has_og" ]]; then
    if [ "$has_challenge" -gt 0 ] 2>/dev/null; then
      pass "$label: HTTP 200, og:image present. Bot Fight Mode script in HTML but OG tags are readable. WAF Skip rule is working correctly."
    else
      pass "$label: HTTP 200, og:image present. No challenge injection detected."
    fi
  else
    if [ "$has_challenge" -gt 0 ] 2>/dev/null; then
      fail "$label: HTTP 200 but Bot Fight Mode is injecting challenge scripts and og:image is absent. WAF Skip rule is not working for this user agent."
    else
      fail "$label: HTTP 200 but og:image is absent. The page may not be rendering OG tags for this crawler."
    fi
  fi
}

check_ua "WhatsApp UA"          "$WA_UA"
check_ua "facebookexternalhit"  "$FB_UA"
check_ua "LinkedInBot"          "$LI_UA"
check_ua "Twitterbot"           "$TW_UA"

# -------------------------------------------------------
section "8. robots.txt"

BASE_URL=$(echo "$TARGET_URL" | grep -oE '^https?://[^/]+' || echo "")
ROBOTS=$(curl -s --max-time 5 "${BASE_URL}/robots.txt" 2>/dev/null || echo "")

if [[ -z "$ROBOTS" ]]; then
  warn "robots.txt not found or empty"
else
  for bot in facebookexternalhit WhatsApp Facebot LinkedInBot Twitterbot; do
    if echo "$ROBOTS" | grep -qi "User-agent: ${bot}"; then
      if echo "$ROBOTS" | grep -A2 -i "User-agent: ${bot}" | grep -qi "Disallow: /$"; then
        fail "robots.txt blocks $bot. This will prevent all previews from that platform."
      else
        pass "robots.txt references $bot but does not block it."
      fi
    else
      pass "robots.txt has no restrictions for $bot"
    fi
  done
fi

# -------------------------------------------------------
section "9. Cloudflare Detection"

CF_RAY=$(echo "$HEADERS" | grep -i "^cf-ray:" | head -1 || echo "")
CF_CACHE=$(echo "$HEADERS" | grep -i "^cf-cache-status:" | head -1 || echo "")

if [[ -n "$CF_RAY" ]]; then
  echo "  Cloudflare detected: $CF_RAY"
  [[ -n "$CF_CACHE" ]] && echo "  $CF_CACHE"
  pass "Cloudflare is present. If all crawler UA checks above passed, your WAF Skip rule is configured correctly."
else
  pass "No Cloudflare detected. WAF skip rule not required."
fi

# -------------------------------------------------------
echo -e "\n${BOLD}=============================================="
echo -e "  Results Summary"
echo -e "  ${GREEN}PASS: $PASS${RESET}  ${YELLOW}WARN: $WARN${RESET}  ${RED}FAIL: $FAIL${RESET}"
echo -e "==============================================${RESET}\n"

if [ "$FAIL" -gt 0 ]; then
  echo -e "${RED}${BOLD}Action required: $FAIL critical issue(s) found.${RESET}"
elif [ "$WARN" -gt 0 ]; then
  echo -e "${YELLOW}${BOLD}Review warnings: $WARN potential issue(s) found.${RESET}"
else
  echo -e "${GREEN}${BOLD}All checks passed. Social previews should work correctly.${RESET}"
fi
echo 
EOF
chmod +x check-social-preview.sh

Example results below:

==============================================
  Social Preview Health Check
  https://andrewbaker.ninja/2026/03/01/the-silent-killer-in-your-aws-architecture-iops-mismatches/
==============================================

--- 1. HTTPS ---
  [PASS] URL uses HTTPS

--- 2. HTTP Response (WhatsApp UA) ---
  [PASS] HTTP 200 OK for WhatsApp user agent

--- 3. Response Time ---
  Total: 0.247662s  TTFB: 0.155405s
  [PASS] Response time under 3s

--- 4. Open Graph Tags ---
  [PASS] twitter:card: summary_large_image
  [PASS] twitter:image: https://andrewbaker.ninja/wp-content/uploads/2026/03/StorageVsInstanceSize-1200x630.jpg

--- 5. og:image Analysis ---
  [PASS] og:image URL returns HTTP 200
  [PASS] og:image uses HTTPS
  [PASS] og:image Content-Type: image/jpeg
  [PASS] Image size: 127,5 KB (well within the 300 KB WhatsApp limit)
  [WARN] ImageMagick not installed so image dimensions could not be checked. Install with: brew install imagemagick

--- 6. HTTP Security Headers (Crawler Compatibility) ---
  CSP present:  default-src 'self' https: data: 'unsafe-inline' 'unsafe-eval'; img-src 'self' data: https: blob:; font-src 'self' https: data:; script-src 'self' 'unsafe-inline' 'unsafe-eval' https:; style-src 'self' 'unsafe-inline' https:; frame-src 'self' https:; connect-src 'self' https:;
  [PASS] CSP img-src allows all HTTPS images. og:image will be accessible to crawlers.
  [PASS] X-Content-Type-Options present: X-Content-Type-Options: nosniff

--- 7. Crawler Accessibility (Multiple User Agents) ---
  [PASS] WhatsApp UA: HTTP 200, og:image present. Bot Fight Mode script in HTML but OG tags are readable. WAF Skip rule is working correctly.
  [PASS] facebookexternalhit: HTTP 200, og:image present. Bot Fight Mode script in HTML but OG tags are readable. WAF Skip rule is working correctly.
  [PASS] LinkedInBot: HTTP 200, og:image present. Bot Fight Mode script in HTML but OG tags are readable. WAF Skip rule is working correctly.
  [PASS] Twitterbot: HTTP 200, og:image present. Bot Fight Mode script in HTML but OG tags are readable. WAF Skip rule is working correctly.

--- 8. robots.txt ---
  [PASS] robots.txt has no restrictions for facebookexternalhit
  [PASS] robots.txt has no restrictions for WhatsApp
  [PASS] robots.txt has no restrictions for Facebot
  [PASS] robots.txt has no restrictions for LinkedInBot
  [PASS] robots.txt has no restrictions for Twitterbot

--- 9. Cloudflare Detection ---
  Cloudflare detected: CF-RAY: 9d80a8bc4ee8e6de-LIS
  cf-cache-status: HIT
  [PASS] Cloudflare is present. If all crawler UA checks above passed, your WAF Skip rule is configured correctly.

==============================================
  Results Summary
  PASS: 21  WARN: 1  FAIL: 0
==============================================

Review warnings: 1 potential issue(s) found.

5. Understanding the Report: Known Issue Patterns

Each numbered section below corresponds to a known failure pattern. When the report shows a FAIL or WARN, this is what it means and exactly what to do.

5.1 Image Over 300 KB (WhatsApp Silent Failure)

The script fails the image size check when the og:image file exceeds roughly 300 KB, for example a 412 KB hero image. WhatsApp silently drops the thumbnail above that threshold: there is no error in your logs, no HTTP error code, and no indication in Cloudflare analytics. The preview simply renders without an image, and WhatsApp also caches that failure, so users who share the link before you fix the image will continue to see a bare URL until WhatsApp's cache expires, typically around 7 days and not under your control. This is the single most common cause of missing WhatsApp thumbnails. Facebook supports images up to 8 MB and LinkedIn up to 5 MB, so developers publishing a large hero image have no idea anything is wrong until they test specifically on WhatsApp.

The fix is to compress the image to under 250 KB to leave a safe margin. At 1200×630 pixels, JPEG quality 80 will almost always achieve this. After recompressing, force a cache refresh using the Facebook Sharing Debugger and then retest with check-social-preview.sh.

# ImageMagick: resize to 1200x630, recompress at quality 80, strip metadata
convert input.jpg -resize 1200x630 -quality 80 -strip output.jpg

# jpegoptim: cap the file at 250 KB and strip all metadata
jpegoptim --size=250k --strip-all image.jpg

# cwebp: produce a WebP alternative at quality 80
cwebp -q 80 input.png -o output.webp

5.2 Cloudflare Blocking Meta Crawlers (Super Bot Fight Mode)

The script fails a crawler accessibility check when a user agent such as facebookexternalhit receives a challenge page instead of your content. This is the second most common failure on WordPress sites behind Cloudflare. Cloudflare's Super Bot Fight Mode classifies facebookexternalhit as an automated bot and serves it a JavaScript challenge page. The challenge returns HTTP 200 with an HTML body that looks like a normal page; the crawler reads it, finds no OG tags, and caches a blank preview. This is particularly insidious because your monitoring will show HTTP 200 and you will have no idea why previews are broken. A single WhatsApp link preview can trigger requests from three distinct Meta crawler user agents (WhatsApp/2.x, facebookexternalhit/1.1, and Facebot) and all three must be allowed. If any one is challenged, previews fail intermittently depending on which crawler fires first. The fix is to create a Cloudflare WAF Custom Rule as described in Section 6.

5.3 Slow TTFB Causing Crawler Timeout

The script flags slow responses, for example a TTFB of 4.2 s, which will cause WhatsApp crawler timeouts on cache miss. WhatsApp's crawler has an aggressive HTTP timeout, and if your origin takes more than a few seconds to deliver the first bytes of HTML, the crawl times out before any OG tags are read. This is most common on cold start servers, WordPress sites with no page cache where every crawler request hits the database, and servers under load where the crawler request queues behind real user traffic. Your CDN cache may be serving humans fine while every crawler request is a cache miss, because crawlers send unique user agent strings that your cache rules do not recognise. Ensure your page cache serves all user agent strings and not just browser user agents. In Cloudflare, verify that your cache rules are not excluding non-browser UAs. The target is a TTFB under 800 ms.
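To see what the crawler sees on your own origin, a minimal probe under a Meta crawler user agent (the URL in the usage note is a placeholder):

```shell
# Measure time-to-first-byte and total time under a crawler user agent.
# The 800 ms TTFB target comes from the discussion above.
ttfb_probe() {
  curl -o /dev/null -s -A "facebookexternalhit/1.1" -L --max-time 15 \
    -w "TTFB %{time_starttransfer}s  total %{time_total}s\n" "$1"
}
```

Usage: ttfb_probe "https://example.com/page". Running it twice can expose the difference between your cache-miss and cache-hit paths.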

5.4 Redirect Chain

A long redirect chain, for example four hops, is very likely causing WhatsApp crawler timeouts. Each redirect hop consumes time against WhatsApp's timeout budget, and a chain of four hops at 200 ms each costs 800 ms before the origin even begins delivering HTML. Common causes include an HTTP to HTTPS redirect followed by a www to non-www redirect followed by a trailing slash normalisation redirect, old permalink structures redirecting to new ones, and canonical URL enforcement with multiple intermediate redirects. The goal is zero redirects for the canonical URL, and your og:url tag should match the exact final URL with no redirects between them.
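You can count the hops yourself with curl's redirect counters. A sketch, with a placeholder URL in the usage note:

```shell
# Follow redirects and report how many hops the crawler pays for,
# plus the final URL your og:url tag should match exactly.
trace_redirects() {
  curl -s -o /dev/null -L --max-time 15 \
    -w "hops %{num_redirects}  final %{url_effective}\n" "$1"
}
```

Usage: trace_redirects "http://example.com/old-path". The goal is hops 0 for the canonical URL.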

5.5 CSP Blocking the Image URL

The script warns when a restrictive CSP img-src directive may block your og:image domain (for example cdn.example.com). A Content-Security-Policy header with a restrictive img-src directive can interfere with WhatsApp's internal image rendering pipeline in certain client versions, and if your CSP blocks the image URL in the browser context used for preview rendering, the preview will show the title and description but not the image. Add your image CDN domain to the img-src directive:

Content-Security-Policy: img-src 'self' https://your-cdn-domain.com https://s3.af-south-1.amazonaws.com;
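To inspect the directive you are actually serving, pull the live header and extract img-src. The sample header below stands in for the output of `curl -sI https://yoursite.com/`:

```shell
# Extract the img-src directive from a CSP header; in practice feed this from
# your live site:  curl -sI https://yoursite.com/ | grep -i content-security-policy
csp="Content-Security-Policy: default-src 'self'; img-src 'self' https://cdn.example.com"
printf '%s\n' "$csp" | grep -io "img-src[^;]*"
# If your og:image host is not in the printed list, add it to the directive.
```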

5.6 Meta Refresh Redirect

The script reports [FAIL] Meta refresh redirect found in HTML. Meta refresh tags are HTML-level redirects that social crawlers do not execute. The crawler reads the page at the original URL, finds the meta refresh, ignores it, and attempts to extract OG tags from the pre-redirect page. If the pre-redirect page has no OG tags the preview is blank. This appears in some WordPress themes, landing page plugins, and maintenance mode plugins. Replace meta refresh redirects with proper HTTP 301 or 302 redirects at the server or Cloudflare redirect rule level.
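The check itself is a simple pattern match. The sample HTML below stands in for `html=$(curl -s https://yoursite.com/page)`:

```shell
# Detect an HTML-level meta refresh redirect, which social crawlers do not follow.
html='<html><head><meta http-equiv="refresh" content="0;url=/new/"></head></html>'
if printf '%s' "$html" | grep -qiE '<meta[^>]*http-equiv=.?refresh'; then
  echo "meta refresh redirect found - replace with an HTTP 301/302"
fi
```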

6. The Cloudflare WAF Skip Rule

If the diagnostic script detects a Cloudflare challenge page for any Meta crawler user agent, this is exactly how to fix it. Navigate to your Cloudflare dashboard, select your domain, and go to Security then WAF then Custom Rules and click Create rule. Set the rule name to WhatsappThumbnail, switch to Edit expression mode, and paste the following expression:

(http.user_agent contains "WhatsApp") or
(http.user_agent contains "facebookexternalhit") or
(http.user_agent contains "Facebot")

Set the action to Skip. Under WAF components to skip, enable all rate limiting rules, all managed rules, and all Super Bot Fight Mode rules, but leave all remaining custom rules unchecked. Leaving custom rules unchecked means your Fail2ban IP block list still applies to these user agents: a real attacker spoofing a Meta user agent cannot bypass your IP blocklist, while legitimate Meta crawlers get through. Turn Log matching requests off, because these are high-frequency crawls and logging every one will consume your event quota quickly.

[Screenshot: Cloudflare firewall rule allowing Meta bot crawlers for WhatsApp thumbnails]

On rule priority, ensure this rule sits below your primary edge rule (e.g. a Fail2ban Block List rule) because Cloudflare evaluates WAF rules top to bottom and the IP blocklist must fire first. All three user agents are required because a single WhatsApp link preview can trigger requests from each of them independently; if any one is missing from the skip rule, previews will fail intermittently.

7. WordPress Specific: Posts with Missing Featured Images

If you are running WordPress and the diagnostic script is passing all checks but some posts still have no og:image, the likely cause is that those posts have no featured image set. Most WordPress SEO plugins generate the og:image tag from the featured image and if it is not set, there is no tag. This script SSHs into your WordPress server and audits which published posts are missing a featured image. Update the four variables at the top before running, then run it as bash audit-wp-og.sh audit or bash audit-wp-og.sh fix <post-id>.

cat > audit-wp-og.sh << 'EOF'
#!/usr/bin/env bash
# audit-wp-og.sh
# Usage: bash audit-wp-og.sh audit|fix [post-id]
# Audits WordPress posts for missing og:image via WP-CLI on remote EC2.
#
# Update the four variables below before running.

set -euo pipefail

MODE="${1:-audit}"
SPECIFIC_POST="${2:-}"

EC2_HOST="[email protected]"
SSH_KEY="$HOME/.ssh/your-key.pem"
WP_PATH="/var/www/html"
SITE_URL="https://yoursite.com"

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
BOLD='\033[1m'
RESET='\033[0m'

echo -e "\n${BOLD}${CYAN}WordPress OG Tag Auditor${RESET}"
echo -e "Mode: ${BOLD}$MODE${RESET}\n"

if [[ "$MODE" == "audit" ]]; then
  echo -e "${YELLOW}Fetching published posts with no featured image...${RESET}\n"

  ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no "$EC2_HOST" WP_PATH="$WP_PATH" bash <<'REMOTE'
echo "Posts with no featured image (og:image will be missing for these):"
wp post list \
  --post_type=post \
  --post_status=publish \
  --fields=ID,post_title,post_date \
  --format=table \
  --meta_query='[{"key":"_thumbnail_id","compare":"NOT EXISTS"}]' \
  --path="$WP_PATH" \
  --allow-root \
  2>/dev/null || echo "(WP-CLI not available or no posts found)"

echo ""
echo "Total published posts:"
wp post list \
  --post_type=post \
  --post_status=publish \
  --format=count \
  --path="$WP_PATH" \
  --allow-root \
  2>/dev/null

echo ""
echo "Posts with featured image set:"
wp post list \
  --post_type=post \
  --post_status=publish \
  --format=count \
  --meta_key=_thumbnail_id \
  --path="$WP_PATH" \
  --allow-root \
  2>/dev/null
REMOTE

  echo -e "\n${YELLOW}Spot-checking live og:image tags on recent posts...${RESET}\n"
  WA_UA="WhatsApp/2.23.24.82 A"
  URLS=$(curl -s "${SITE_URL}/post-sitemap.xml" \
    | grep -oE '<loc>[^<]+</loc>' \
    | sed 's|<loc>||;s|</loc>||' \
    | head -10)

  if [[ -z "$URLS" ]]; then
    echo -e "${YELLOW}Could not fetch sitemap at ${SITE_URL}/post-sitemap.xml${RESET}"
  else
    printf "%-70s %s\n" "URL" "og:image"
    printf "%-70s %s\n" "---" "--------"
    while IFS= read -r url; do
      html=$(curl -s -A "$WA_UA" -L --max-time 8 "$url" 2>/dev/null)
      og_img=$(echo "$html" \
        | grep -oiE 'property="og:image"[^>]+content="[^"]+"' \
        | grep -oiE 'content="[^"]+"' \
        | sed 's/content="//;s/"//' \
        | head -1)
      if [[ -n "$og_img" ]]; then
        printf "%-70s ${GREEN}%s${RESET}\n" "$(echo "$url" | sed "s|${SITE_URL}||")" "PRESENT"
      else
        printf "%-70s ${RED}%s${RESET}\n" "$(echo "$url" | sed "s|${SITE_URL}||")" "MISSING"
      fi
    done <<< "$URLS"
  fi

elif [[ "$MODE" == "fix" ]]; then
  if [[ -z "$SPECIFIC_POST" ]]; then
    echo -e "${RED}Provide a post ID: bash audit-wp-og.sh fix <post-id>${RESET}"
    exit 1
  fi

  ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no "$EC2_HOST" bash <<REMOTE
echo "Available media attachments (most recent 10):"
wp post list \
  --post_type=attachment \
  --posts_per_page=10 \
  --fields=ID,post_title,guid \
  --format=table \
  --path=$WP_PATH \
  --allow-root \
  2>/dev/null
REMOTE

  echo -e "\n${YELLOW}To assign a featured image to post $SPECIFIC_POST:${RESET}"
  echo "  ssh -i $SSH_KEY $EC2_HOST \\"
  echo "    wp post meta update $SPECIFIC_POST _thumbnail_id <ATTACHMENT_ID> --path=$WP_PATH --allow-root"
  echo ""
  echo "Then retest: bash diagnose-social-preview.sh ${SITE_URL}/?p=${SPECIFIC_POST}"

else
  echo -e "${RED}Unknown mode: $MODE. Use 'audit' or 'fix'.${RESET}"
  exit 1
fi
EOF

chmod +x audit-wp-og.sh
echo "Written and made executable: audit-wp-og.sh"

8. The Diagnostic Checklist

Before you create a Cloudflare rule or start modifying OG tags, run diagnose-social-preview.sh against your URL. It works through every item below in under 30 seconds and flags exactly which one is failing:

  • The URL uses HTTPS
  • There is no redirect chain, or the chain is two hops or fewer
  • There are no meta refresh redirects in the HTML
  • TTFB is under 800ms and total response time is under 3s
  • og:title, og:description, og:image, og:url, and og:type are all present and non-empty
  • twitter:card is present for the X/Twitter large image format
  • The og:image URL returns HTTP 200 with the correct MIME type and uses HTTPS
  • The og:image file size is under 300 KB
  • og:image dimensions are at least 1200×630 px
  • CSP img-src does not block the og:image domain
  • robots.txt does not disallow facebookexternalhit, WhatsApp, or Facebot
  • All five crawler user agents return HTTP 200 with no challenge page detected

The two most common failures on WordPress sites behind Cloudflare are Super Bot Fight Mode blocking facebookexternalhit and an og:image file exceeding 300 KB. Both are invisible in your logs and immediately visible when you run the script.
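The image-size failure is easy to confirm by hand without downloading the file. IMG below is a placeholder; substitute the og:image URL from your own page:

```shell
# Check the og:image URL's status, MIME type, and size from its response headers.
IMG="https://cdn.example.com/og-image.jpg"
curl -sI --max-time 10 "$IMG" | grep -iE '^(HTTP|content-type|content-length)' || true
# A healthy result: a 200 status, content-type image/jpeg (or png/webp),
# and content-length below 307200 bytes (300 KB).
```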

Knowing Your IOPS Are Broken Is Not As Valuable As Knowing They Are About To Break

Andrew Baker | March 2026

Companion article to: https://andrewbaker.ninja/2026/03/01/the-silent-killer-in-your-aws-architecture-iops-mismatches/

Last week I published a script that scans your AWS estate and finds every EBS volume and RDS instance where your provisioned storage IOPS exceed what the compute instance can actually consume. That problem, the structural mismatch between storage ceiling and instance ceiling, is important and expensive and almost completely invisible to your existing tooling. You should run that script.

But there is a second problem that the mismatch auditor does not solve, and in some ways it is the more dangerous one. The mismatch auditor tells you where the gap exists but it does not tell you whether you are actually falling into it.

Consider the difference. A provisioned storage IOPS ceiling of 10,000 on an instance that can only push 3,500 is a configuration problem, meaning you are paying for performance you cannot consume and your headroom assumptions are wrong. But if your actual workload is only ever generating 1,200 IOPS, the mismatch is a cost and an architecture risk rather than an active emergency. The mismatch auditor will find it and you should fix it, but the building is not on fire yet.

Now consider the other case. Your provisioned storage ceiling is correct, your instance class ceiling matches what you need, and your architecture review would pass. But your workload is generating 3,400 IOPS against a 3,500 ceiling for minutes at a time, every day, during the lunchtime transaction peak. CloudWatch shows nothing alarming because the volume is not saturated and the instance is not at CPU capacity. Performance Insights shows no problematic wait events and your APM shows acceptable latency. You are running at 97 percent of your I/O capacity for sustained periods without knowing it.

That is the building that is about to catch fire.

A 3 percent buffer against a hard ceiling is not a buffer, it is a queue waiting to form. When load spikes because a batch job overlaps with transaction traffic, or a partner integration runs slightly earlier than usual, or a retry storm arrives from an upstream timeout, you cross that ceiling and requests start stacking in the virtualised I/O path. What was a 2ms storage read becomes 40ms as the queue grows, connection pools back up, upstream services time out and retry, and those retries compound the I/O load further. You are now in a feedback loop where your database looks healthy by every metric your team is watching and you have no obvious cause to debug because the bottleneck lives in the gap between what your instance can consume and what your workload is demanding, a gap that none of your standard monitoring instruments will name for you.

The script in this post closes that gap.

1. What This Script Actually Does

The script scans your AWS estate across multiple accounts and regions and queries CloudWatch for every EBS volume and RDS instance. For each resource it asks whether actual observed IOPS reached or exceeded a percentage threshold of the resource’s effective ceiling, and if so, whether that condition persisted continuously for longer than a duration threshold you specify.

You provide both numbers at runtime. Running it with 90 percent and 120 seconds means any resource that sustained IOPS at or above 90 percent of its ceiling for more than two consecutive minutes in the lookback window gets reported, along with which resource breached, by how much, when it started and ended, and what the peak utilisation was.

Both parameters matter because a brief spike to 92 percent is not the same problem as 92 percent sustained for eight minutes. A spike is a normal part of operating any database under variable load, but a sustained breach is a sign that your headroom is structurally insufficient and the next slightly larger spike will tip you into saturation and queuing. The duration threshold is what separates the two.

2. Why the Metrics Differ By Service

This is the part that is easy to get wrong, and getting it wrong means your script either misses breaches entirely or fires false positives that erode trust in the output. The correct metric and the correct ceiling are different for EBS and standard RDS, and understanding why Aurora is excluded entirely is just as important as understanding why EBS and RDS are included.

2.1 The Dual Ceiling Problem for EBS and RDS

Before covering each service in detail, there is a principle that applies to EBS volumes and standard RDS instances that does not apply to Aurora, and it is the most common source of incorrect saturation calculations.

Every EBS volume and every RDS instance has two independent IOPS ceilings operating simultaneously. The first is the storage ceiling, which is the provisioned IOPS on the volume or instance. The second is the instance throughput ceiling, which is the maximum IOPS the underlying compute can push to attached storage. Your workload saturates whichever of these two ceilings it hits first, and that effective ceiling is always the lower of the two.

This is exactly the mismatch problem the companion script identifies: when your storage ceiling is higher than your instance ceiling, the instance ceiling becomes the binding constraint and the storage headroom above it is inaccessible. But even when there is no mismatch and both ceilings are sensibly set, you still need to compare observed IOPS against the lower of the two, because that is the number that actually governs when your workload runs out of room.

If you only compare against the storage ceiling you can build a false picture. A db.m6g.large with 4,000 provisioned storage IOPS has an instance class ceiling of 3,500 IOPS. If your workload hits 3,480 IOPS you are at 99.4 percent of your effective capacity, but a naive comparison against the storage ceiling gives you 87 percent and nothing fires. You are six minutes from saturation and your monitoring tells you everything is comfortable.
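The arithmetic is worth making concrete. A minimal sketch of the effective-ceiling calculation, using the db.m6g.large figures above:

```shell
# Effective ceiling = min(storage IOPS, instance class IOPS ceiling).
STORAGE_IOPS=4000      # provisioned storage IOPS
INSTANCE_IOPS=3500     # db.m6g.large instance class ceiling
OBSERVED=3480          # workload's observed IOPS

EFFECTIVE=$(( STORAGE_IOPS < INSTANCE_IOPS ? STORAGE_IOPS : INSTANCE_IOPS ))
awk -v o="$OBSERVED" -v s="$STORAGE_IOPS" -v e="$EFFECTIVE" 'BEGIN {
  printf "Naive  (vs storage ceiling %d):   %.1f%%\n", s, 100*o/s
  printf "Actual (vs effective ceiling %d): %.1f%%\n", e, 100*o/e
}'
# Naive  (vs storage ceiling 4000):   87.0%
# Actual (vs effective ceiling 3500): 99.4%
```

Same workload, same instance: the naive number says comfortable, the effective number says nearly saturated.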

The script handles this by computing the effective ceiling as the minimum of the storage IOPS and the instance IOPS ceiling at runtime, using the instance type ceiling tables that also power the mismatch auditor. The note field in the output records both values so you can see which ceiling is binding.

2.2 EBS Volumes

EBS publishes VolumeReadOps and VolumeWriteOps as operation counts per CloudWatch collection period rather than as a rate. A 60-second period that returns a value of 180,000 for VolumeReadOps means 180,000 read operations happened in that minute, so to convert that to IOPS, which is the unit your provisioned ceiling is expressed in, you divide by 60. The script does this automatically.
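The conversion the script applies can be sketched in one line:

```shell
# VolumeReadOps Sum over a 60-second CloudWatch period is a count, not a rate;
# divide by the period length to get IOPS.
READ_OPS=180000   # Sum of VolumeReadOps for one 60s period
PERIOD=60
echo "$(( READ_OPS / PERIOD )) read IOPS"   # prints: 3000 read IOPS
```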

For in-use volumes, the script looks up the EC2 instance the volume is attached to and retrieves the instance type’s IOPS ceiling from the lookup table. The effective ceiling used for breach detection is min(provisioned_storage_iops, instance_type_iops_ceiling). Only io1, io2, and gp3 volumes are scanned because gp2 volumes use a burst credit model where the effective ceiling is elastic and not meaningfully comparable to a fixed provisioned number. If a volume is not attached to a known instance type, the script falls back to the provisioned storage IOPS and records a note accordingly.

2.3 Standard RDS

RDS publishes ReadIOPS and WriteIOPS in the AWS/RDS namespace as rate metrics, meaning they are already expressed in IOPS rather than as counts per period. You add them together. The ceiling requires the same dual-minimum treatment as EBS: the script takes min(provisioned_storage_iops, instance_class_iops_ceiling) as the effective ceiling, using the RDS_IOPS_CEILING table keyed on instance class. This covers PostgreSQL, MySQL, Oracle, SQL Server, and MariaDB. Only instances with io1 or io2 storage are examined since those are the storage types where you have a defined and fixed IOPS ceiling on both sides of the comparison.

The ceiling used for comparison is printed in the note field of each finding, along with both the storage and instance values so you can see immediately which constraint is binding and what the other ceiling is. In the common case where a mismatch exists and the instance ceiling is lower, the percentage reported reflects the instance ceiling, which is the number that actually determines when the workload saturates.

2.4 Aurora — Why It Is Excluded

Aurora is intentionally not monitored by this script, and understanding why is more useful than just knowing that it is not.

Aurora has two completely separate storage systems operating in parallel, and they are governed by completely different constraints.

The first is Aurora cluster storage, which is what the ReadIOPS and WriteIOPS CloudWatch metrics measure. This is a distributed, shared storage subsystem managed entirely outside the instance. It auto-scales from 10 GB to 128 TiB and can sustain up to 256,000 IOPS at the storage layer. It is not backed by EBS. It is not constrained by instance class EBS bandwidth limits. There is no provisioned IOPS value to configure on Aurora because there is nothing to provision — the storage layer manages its own capacity. If you call DescribeDBInstances on an Aurora writer the Iops field returns zero and SupportsIops is false.

The second is Aurora local storage, which is instance-local EBS used for temporary files, sort operations, index builds, and query scratch space. This is constrained by instance class EBS bandwidth limits. But it is not what ReadIOPS and WriteIOPS measure.

This is the fundamental problem with applying an instance class IOPS ceiling to Aurora’s CloudWatch metrics. The metrics measure the cluster storage layer, which has no instance-level ceiling. The instance class ceiling applies to local storage, which is not what the metrics measure. Comparing ReadIOPS + WriteIOPS against any figure from the RDS_IOPS_CEILING table for an Aurora instance is comparing the wrong metric against the wrong ceiling. It will generate false positives on any adequately busy Aurora cluster because the cluster storage layer is designed to sustain IOPS that would saturate any instance class EBS ceiling many times over.

AWS documents this distinction explicitly on the Aurora instance types page: the maximum EBS bandwidth listed for each instance type applies to local storage only and does not apply to communication with the Aurora storage volume.

Aurora Serverless v2 is excluded for the same reason. It uses the same distributed Aurora cluster storage subsystem as provisioned Aurora. There is no instance-level IOPS ceiling to compare against.

If you want to detect pressure on Aurora local storage, that requires the Performance Insights API rather than CloudWatch, specifically the IO:BufFileRead and IO:BufFileWrite wait events. That is a different monitoring approach, the remediation is typically query tuning rather than instance scaling, and it is not covered by this script.

3. The Script

Paste this into your terminal. It writes iops_saturation.py, marks it executable, and the script will install its own dependencies on first run.

cat > iops_saturation.py << 'PYEOF'
#!/usr/bin/env python3
"""
IOPS Saturation Monitor

Scans EBS volumes and standard RDS instances to identify resources
that have sustained IOPS utilisation at or above a threshold percentage
of their capacity for longer than a specified duration.

IMPORTANT: Aurora is NOT monitored by this script because:

  Aurora Architecture (Two Separate Storage Layers):

  1. Local EBS Storage (temp files, NOT monitored by ReadIOPS/WriteIOPS):
     - Used for: sorting, temp tables, index builds, query scratch space
     - Monitored by: FreeLocalStorage CloudWatch metric
     - Constrained by: Instance class EBS bandwidth/IOPS limits ✓

  2. Aurora Cluster Storage (tables/indexes, monitored by ReadIOPS/WriteIOPS):
     - Used for: All persistent database data (tables, indexes)
     - Monitored by: ReadIOPS and WriteIOPS CloudWatch metrics
     - Auto-scales: 10 GB to 128 TiB, handles up to 256k IOPS
     - NOT constrained by instance class EBS bandwidth/IOPS limits ✗
     - SupportsIops=false, SupportsStorageThroughput=false, StorageType="aurora"

  Since ReadIOPS/WriteIOPS (what users care about) measure Aurora CLUSTER storage,
  and cluster storage has NO instance-level IOPS ceiling, there is nothing to monitor
  against instance class IOPS limits.

  NOTE: Aurora LOCAL storage (temp files) CAN be constrained by instance class EBS limits.
  This could be monitored via Performance Insights wait events (IO:BufFileRead, IO:BufFileWrite),
  but would require a different script using the Performance Insights API, not CloudWatch metrics.
  Temp file I/O issues are typically solved by query tuning (work_mem, sort optimization),
  not instance scaling, so this is not included in this script.

  Aurora Serverless V2 Exclusion:
  - Aurora Serverless V2 is excluded (no instance-level IOPS ceiling)
  - Uses Aurora Capacity Units (ACUs): 0-256 ACUs (increments of 0.5)
  - Aurora storage auto-scales (10 GB to 128 TiB) and I/O scales with workload demand
  - Same distributed storage subsystem as Aurora provisioned clusters
  - No fixed instance-level IOPS bottleneck exists to monitor

  References:
  - Aurora Serverless V2: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html
  - Aurora instance types: https://aws.amazon.com/rds/aurora/instance-types/
    "The maximum EBS bandwidth refers to I/O bandwidth for local storage within
    the DB instance. It doesn't apply to communication with the Aurora storage volume."

Metric selection and ceiling calculation are automatic per service type:

  EBS                  VolumeReadOps + VolumeWriteOps (Count / 60s = IOPS)
                       Ceiling = min(provisioned_storage_iops, instance_type_iops_ceiling)
                       ✓ Dual-ceiling problem: storage + instance

  RDS standard         ReadIOPS + WriteIOPS (rate metric, IOPS directly)
                       Ceiling = min(provisioned_storage_iops, instance_class_iops_ceiling)
                       ✓ Dual-ceiling problem: storage + instance
                       ✓ Engines: postgres, mysql, mariadb, oracle, sqlserver (NOT aurora)

Usage:
  python iops_saturation.py --max-ops-pct 90 --max-ops-duration-secs 120 \
      --ou-id ou-xxxx-xxxxxxxx --regions eu-west-1 us-east-1

  python iops_saturation.py --max-ops-pct 95 --max-ops-duration-secs 300 \
      --lookback-hours 48 --accounts 123456789012 --regions af-south-1

Required permissions on the assumed role:
  cloudwatch:GetMetricStatistics
  rds:DescribeDBInstances
  ec2:DescribeVolumes
  ec2:DescribeInstances

References:
  - Blog: https://andrewbaker.ninja/2026/03/03/knowing-your-iops-are-broken-is-not-the-same-as-knowing-they-are-about-to-break/
  - Aurora instance types: https://aws.amazon.com/rds/aurora/instance-types/
  - RDS instance classes: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html
"""

# Bootstrap: install missing dependencies before any other imports
import subprocess
import sys

def _bootstrap():
    required = {"boto3": "boto3", "pandas": "pandas", "openpyxl": "openpyxl"}
    missing = []
    for import_name, pkg_name in required.items():
        try:
            __import__(import_name)
        except ImportError:
            missing.append(pkg_name)
    if missing:
        print(f"[bootstrap] Installing missing packages: {', '.join(missing)}", flush=True)
        base_cmd = [sys.executable, "-m", "pip", "install", "--quiet"]
        try:
            # PEP 668 environments (recent Debian, Homebrew Python) require this flag
            subprocess.check_call(base_cmd + ["--break-system-packages"] + missing,
                                  stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError:
            # Older pip versions do not recognise --break-system-packages
            subprocess.check_call(base_cmd + missing, stderr=subprocess.STDOUT)

_bootstrap()

import boto3
import csv
import argparse
import logging
from datetime import datetime, timezone, timedelta
from dataclasses import dataclass, asdict, field
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed

try:
    import pandas as pd
    PANDAS_AVAILABLE = True
except ImportError:
    PANDAS_AVAILABLE = False

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
log = logging.getLogger(__name__)


EC2_IOPS_CEILING = {
    "t3.nano": 2085, "t3.micro": 2085, "t3.small": 2085, "t3.medium": 2085,
    "t3.large": 2085, "t3.xlarge": 2085, "t3.2xlarge": 2085,
    "t3a.nano": 2085, "t3a.micro": 2085, "t3a.small": 2085, "t3a.medium": 2085,
    "t3a.large": 2085, "t3a.xlarge": 2085, "t3a.2xlarge": 2085,
    "t4g.nano": 2085, "t4g.micro": 2085, "t4g.small": 2085, "t4g.medium": 2085,
    "t4g.large": 2085, "t4g.xlarge": 2085, "t4g.2xlarge": 2085,
    "m5.large": 3600, "m5.xlarge": 6000, "m5.2xlarge": 8333,
    "m5.4xlarge": 16667, "m5.8xlarge": 18750, "m5.12xlarge": 28750,
    "m5.16xlarge": 37500, "m5.24xlarge": 40000, "m5.metal": 40000,
    "m5a.large": 3600, "m5a.xlarge": 6000, "m5a.2xlarge": 8333,
    "m5a.4xlarge": 16667, "m5a.8xlarge": 18750, "m5a.12xlarge": 28750,
    "m5a.16xlarge": 37500, "m5a.24xlarge": 40000,
    "m6g.medium": 3500, "m6g.large": 3500, "m6g.xlarge": 7000,
    "m6g.2xlarge": 10000, "m6g.4xlarge": 20000, "m6g.8xlarge": 30000,
    "m6g.12xlarge": 40000, "m6g.16xlarge": 40000, "m6g.metal": 40000,
    "m6i.large": 6667, "m6i.xlarge": 10000, "m6i.2xlarge": 13333,
    "m6i.4xlarge": 20000, "m6i.8xlarge": 26667, "m6i.12xlarge": 40000,
    "m6i.16xlarge": 40000, "m6i.24xlarge": 40000, "m6i.32xlarge": 40000,
    "r5.large": 3600, "r5.xlarge": 6000, "r5.2xlarge": 8333,
    "r5.4xlarge": 16667, "r5.8xlarge": 18750, "r5.12xlarge": 28750,
    "r5.16xlarge": 37500, "r5.24xlarge": 40000, "r5.metal": 40000,
    "r6g.medium": 3500, "r6g.large": 3500, "r6g.xlarge": 7000,
    "r6g.2xlarge": 10000, "r6g.4xlarge": 20000, "r6g.8xlarge": 30000,
    "r6g.12xlarge": 40000, "r6g.16xlarge": 40000, "r6g.metal": 40000,
    "r6i.large": 6667, "r6i.xlarge": 10000, "r6i.2xlarge": 13333,
    "r6i.4xlarge": 20000, "r6i.8xlarge": 26667, "r6i.12xlarge": 40000,
    "r6i.16xlarge": 40000, "r6i.24xlarge": 40000, "r6i.32xlarge": 40000,
    "c5.large": 3600, "c5.xlarge": 6000, "c5.2xlarge": 8333,
    "c5.4xlarge": 16667, "c5.9xlarge": 20000, "c5.12xlarge": 28750,
    "c5.18xlarge": 37500, "c5.24xlarge": 40000, "c5.metal": 40000,
    "c6g.medium": 3500, "c6g.large": 3500, "c6g.xlarge": 7000,
    "c6g.2xlarge": 10000, "c6g.4xlarge": 20000, "c6g.8xlarge": 30000,
    "c6g.12xlarge": 40000, "c6g.16xlarge": 40000, "c6g.metal": 40000,
    "c6i.large": 6667, "c6i.xlarge": 10000, "c6i.2xlarge": 13333,
    "c6i.4xlarge": 20000, "c6i.8xlarge": 26667, "c6i.12xlarge": 40000,
    "c6i.16xlarge": 40000, "c6i.24xlarge": 40000, "c6i.32xlarge": 40000,
    "i3.large": 3000, "i3.xlarge": 6000, "i3.2xlarge": 12000,
    "i3.4xlarge": 16000, "i3.8xlarge": 32500, "i3.16xlarge": 65000,
    "i3en.large": 4750, "i3en.xlarge": 9500, "i3en.2xlarge": 19000,
    "i3en.3xlarge": 26125, "i3en.6xlarge": 52250, "i3en.12xlarge": 65000,
    "i3en.24xlarge": 65000,
}

RDS_IOPS_CEILING = {
    "db.t3.micro": 1536, "db.t3.small": 1536, "db.t3.medium": 1536,
    "db.t3.large": 2048, "db.t3.xlarge": 2048, "db.t3.2xlarge": 2048,
    "db.t4g.micro": 1700, "db.t4g.small": 1700, "db.t4g.medium": 1700,
    "db.t4g.large": 2000, "db.t4g.xlarge": 2000, "db.t4g.2xlarge": 2000,
    "db.m5.large": 3600, "db.m5.xlarge": 6000, "db.m5.2xlarge": 8333,
    "db.m5.4xlarge": 16667, "db.m5.8xlarge": 18750, "db.m5.12xlarge": 28750,
    "db.m5.16xlarge": 37500, "db.m5.24xlarge": 40000,
    "db.m6g.large": 3500, "db.m6g.xlarge": 7000, "db.m6g.2xlarge": 10000,
    "db.m6g.4xlarge": 20000, "db.m6g.8xlarge": 30000, "db.m6g.12xlarge": 40000,
    "db.m6g.16xlarge": 40000,
    "db.m6i.large": 6667, "db.m6i.xlarge": 10000, "db.m6i.2xlarge": 13333,
    "db.m6i.4xlarge": 20000, "db.m6i.8xlarge": 26667, "db.m6i.12xlarge": 40000,
    "db.m6i.16xlarge": 40000,
    "db.r5.large": 3600, "db.r5.xlarge": 6000, "db.r5.2xlarge": 8333,
    "db.r5.4xlarge": 16667, "db.r5.8xlarge": 18750, "db.r5.12xlarge": 28750,
    "db.r5.16xlarge": 37500, "db.r5.24xlarge": 40000,
    "db.r6g.large": 3500, "db.r6g.xlarge": 7000, "db.r6g.2xlarge": 10000,
    "db.r6g.4xlarge": 20000, "db.r6g.8xlarge": 30000, "db.r6g.12xlarge": 40000,
    "db.r6g.16xlarge": 40000,
    "db.r6i.large": 6667, "db.r6i.xlarge": 10000, "db.r6i.2xlarge": 13333,
    "db.r6i.4xlarge": 20000, "db.r6i.8xlarge": 26667, "db.r6i.12xlarge": 40000,
    "db.r6i.16xlarge": 40000,
    "db.x1e.xlarge": 3700, "db.x1e.2xlarge": 7400, "db.x1e.4xlarge": 14800,
    "db.x1e.8xlarge": 29600, "db.x1e.16xlarge": 40000, "db.x1e.32xlarge": 40000,
    "db.x2g.large": 3500, "db.x2g.xlarge": 7000, "db.x2g.2xlarge": 10000,
    "db.x2g.4xlarge": 20000, "db.x2g.8xlarge": 30000, "db.x2g.12xlarge": 40000,
    "db.x2g.16xlarge": 40000,
}

PERIOD_SECONDS = 60


@dataclass
class SaturationBreach:
    account_id: str
    account_name: str
    region: str
    service_type: str
    resource_id: str
    resource_name: str
    instance_type: str
    iops_ceiling: int           # effective ceiling = min(storage, instance) for EBS/RDS
    storage_iops_ceiling: int   # provisioned storage IOPS
    instance_iops_ceiling: int  # instance class IOPS ceiling (0 = not applicable)
    threshold_pct: float
    threshold_iops: float
    max_observed_iops: float
    max_observed_pct: float
    longest_breach_seconds: int
    breach_start_utc: str
    breach_end_utc: str
    metric_used: str
    note: str = ""


def get_metric_datapoints(cw_client, namespace, metric_name, dimensions, start_time, end_time, stat="Sum"):
    # GetMetricStatistics returns at most 1,440 datapoints per call; at a
    # 60-second period that is 24 hours, so longer lookbacks are fetched in chunks.
    max_window = timedelta(seconds=PERIOD_SECONDS * 1440)
    points = []
    window_start = start_time
    while window_start < end_time:
        window_end = min(window_start + max_window, end_time)
        resp = cw_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            Dimensions=dimensions,
            StartTime=window_start,
            EndTime=window_end,
            Period=PERIOD_SECONDS,
            Statistics=[stat],
        )
        points.extend((dp["Timestamp"], dp[stat]) for dp in resp.get("Datapoints", []))
        window_start = window_end
    points.sort(key=lambda x: x[0])
    return points


def find_sustained_breaches(combined_iops, threshold_iops, max_ops_duration_seconds, is_count_metric=False):
    if not combined_iops:
        return []

    timestamps = sorted(combined_iops.keys())
    breaches = []
    run_start = None
    run_end = None
    run_max = 0.0

    for ts in timestamps:
        raw = combined_iops[ts]
        iops = raw / PERIOD_SECONDS if is_count_metric else raw

        if iops >= threshold_iops:
            if run_start is None:
                run_start = ts
            run_end = ts
            run_max = max(run_max, iops)
        else:
            if run_start is not None:
                duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
                if duration >= max_ops_duration_seconds:
                    breaches.append((run_start, run_end, run_max, duration))
                run_start = None
                run_end = None
                run_max = 0.0

    if run_start is not None:
        duration = (run_end - run_start).total_seconds() + PERIOD_SECONDS
        if duration >= max_ops_duration_seconds:
            breaches.append((run_start, run_end, run_max, duration))

    return breaches


def build_volume_instance_map(ec2_client):
    """
    Returns a dict mapping volume_id -> (instance_id, instance_type)
    for all in-use volumes in the region.
    """
    vol_to_instance = {}
    instance_types = {}

    # Collect instance types first
    inst_paginator = ec2_client.get_paginator("describe_instances")
    for page in inst_paginator.paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                instance_types[inst["InstanceId"]] = inst.get("InstanceType", "unknown")

    # Map volumes to instances
    vol_paginator = ec2_client.get_paginator("describe_volumes")
    for page in vol_paginator.paginate(
        Filters=[{"Name": "status", "Values": ["in-use"]}]
    ):
        for vol in page["Volumes"]:
            for attachment in vol.get("Attachments", []):
                iid = attachment.get("InstanceId")
                if iid and iid in instance_types:
                    vol_to_instance[vol["VolumeId"]] = (iid, instance_types[iid])
                    break

    return vol_to_instance


def audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    ec2 = session.client("ec2", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)

    # Build volume -> instance type mapping once for the region
    try:
        vol_instance_map = build_volume_instance_map(ec2)
    except Exception as e:
        log.warning(f"Could not build volume/instance map in {account_id}/{region}: {e}")
        vol_instance_map = {}

    vol_paginator = ec2.get_paginator("describe_volumes")
    for page in vol_paginator.paginate(
        Filters=[
            {"Name": "volume-type", "Values": ["io1", "io2", "gp3"]},
            {"Name": "status", "Values": ["in-use"]},
        ]
    ):
        for vol in page["Volumes"]:
            provisioned_iops = vol.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue

            vol_id = vol["VolumeId"]
            tags = vol.get("Tags", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), vol_id)
            vol_type = vol.get("VolumeType", "unknown")

            # Determine effective ceiling: min(storage IOPS, instance IOPS ceiling)
            instance_type = "unknown"
            instance_iops_ceiling = 0
            ceiling_note = ""
            if vol_id in vol_instance_map:
                _, instance_type = vol_instance_map[vol_id]
                instance_iops_ceiling = EC2_IOPS_CEILING.get(instance_type, 0)

            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )

            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)

            try:
                pts_read = get_metric_datapoints(cw, "AWS/EBS", "VolumeReadOps",
                    [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                pts_write = get_metric_datapoints(cw, "AWS/EBS", "VolumeWriteOps",
                    [{"Name": "VolumeId", "Value": vol_id}], start_time, end_time)
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=True)
            except Exception as e:
                log.warning(f"CloudWatch error for EBS {vol_id} in {account_id}/{region}: {e}")
                continue

            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"EBS ({vol_type})", resource_id=vol_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/EBS: VolumeReadOps + VolumeWriteOps (Sum / 60s = IOPS)",
                    note=ceiling_note,
                ))
    return findings


def audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    findings = []
    rds = session.client("rds", region_name=region)
    cw = session.client("cloudwatch", region_name=region)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=lookback_hours)

    paginator = rds.get_paginator("describe_db_instances")
    for page in paginator.paginate():
        for db in page["DBInstances"]:
            engine = db.get("Engine", "")
            if "aurora" in engine.lower():
                continue
            provisioned_iops = db.get("Iops", 0) or 0
            if provisioned_iops == 0:
                continue
            status = db.get("DBInstanceStatus", "")
            if status not in ("available", "backing-up", "modifying"):
                continue

            db_id = db.get("DBInstanceIdentifier", "")
            instance_type = db.get("DBInstanceClass", "unknown")
            tags = db.get("TagList", [])
            name = next((t["Value"] for t in tags if t["Key"] == "Name"), db_id)

            # Determine effective ceiling: min(storage IOPS, instance class IOPS ceiling)
            instance_iops_ceiling = RDS_IOPS_CEILING.get(instance_type, 0)
            if instance_iops_ceiling > 0:
                effective_ceiling = min(provisioned_iops, instance_iops_ceiling)
                binding = "storage" if provisioned_iops <= instance_iops_ceiling else "instance"
                ceiling_note = (
                    f"Effective ceiling = min(storage: {provisioned_iops:,}, "
                    f"instance {instance_type}: {instance_iops_ceiling:,}) = {effective_ceiling:,} IOPS "
                    f"[{binding} ceiling is binding]"
                )
            else:
                effective_ceiling = provisioned_iops
                ceiling_note = (
                    f"Storage ceiling used ({provisioned_iops:,} IOPS); "
                    f"instance type {instance_type!r} not in lookup table"
                )

            threshold_iops = effective_ceiling * (max_ops_pct / 100.0)
            dims = [{"Name": "DBInstanceIdentifier", "Value": db_id}]

            try:
                pts_read = get_metric_datapoints(cw, "AWS/RDS", "ReadIOPS", dims, start_time, end_time, stat="Average")
                pts_write = get_metric_datapoints(cw, "AWS/RDS", "WriteIOPS", dims, start_time, end_time, stat="Average")
                combined = {}
                for ts, val in pts_read:
                    combined[ts] = combined.get(ts, 0.0) + val
                for ts, val in pts_write:
                    combined[ts] = combined.get(ts, 0.0) + val
                breaches = find_sustained_breaches(combined, threshold_iops, max_ops_duration_seconds, is_count_metric=False)
            except Exception as e:
                log.warning(f"CloudWatch error for RDS {db_id} in {account_id}/{region}: {e}")
                continue

            for breach_start, breach_end, breach_max_iops, breach_secs in breaches:
                findings.append(SaturationBreach(
                    account_id=account_id, account_name=account_name, region=region,
                    service_type=f"RDS ({engine})", resource_id=db_id, resource_name=name,
                    instance_type=instance_type,
                    iops_ceiling=effective_ceiling,
                    storage_iops_ceiling=provisioned_iops,
                    instance_iops_ceiling=instance_iops_ceiling,
                    threshold_pct=max_ops_pct, threshold_iops=round(threshold_iops, 1),
                    max_observed_iops=round(breach_max_iops, 1),
                    max_observed_pct=round((breach_max_iops / effective_ceiling) * 100, 1),
                    longest_breach_seconds=int(breach_secs),
                    breach_start_utc=breach_start.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    breach_end_utc=breach_end.strftime("%Y-%m-%d %H:%M:%S UTC"),
                    metric_used="AWS/RDS: ReadIOPS + WriteIOPS (Average)",
                    note=ceiling_note,
                ))
    return findings


def audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    """
    Aurora monitoring is DISABLED in this script.

    Aurora has TWO separate storage systems:

    1. Local EBS Storage (temporary files, monitored by FreeLocalStorage):
       - Used for sorting, temp tables, index builds
       - Constrained by instance class EBS bandwidth/IOPS limits
       - NOT what ReadIOPS/WriteIOPS metrics measure

    2. Aurora Cluster Storage (tables/indexes, monitored by ReadIOPS/WriteIOPS):
       - Managed by Aurora storage subsystem (NOT EBS)
       - Auto-scales: Storage (10 GB to 128 TiB) and I/O scale with workload demand
       - NOT constrained by instance class EBS bandwidth/IOPS limits
       - SupportsIops: false (no provisioned IOPS to configure)
       - SupportsStorageThroughput: false (no throughput to configure)
       - StorageType: "aurora" (fixed, no gp3/io2 choice)

    Since ReadIOPS/WriteIOPS (what users care about) measure Aurora cluster storage,
    and cluster storage has no instance-level IOPS ceiling, there is nothing to monitor
    against instance class IOPS limits.

    AURORA SERVERLESS V2:
    Aurora Serverless V2 is also excluded - no instance-level IOPS ceiling exists:
    - Uses ACUs: 0-256 ACUs (increments of 0.5 ACUs)
    - Same distributed Aurora cluster storage as provisioned Aurora
    - Storage auto-scales (10 GB to 256 TiB) and I/O scales with workload demand
    - No instance-level IOPS ceiling exists

    References:
    - Aurora Serverless V2: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html
    - "Aurora Serverless DB clusters use the same distributed storage subsystem as
      Aurora provisioned DB clusters"

    COULD WE MONITOR LOCAL STORAGE (TEMP FILES)?
    Yes, but it would require a different approach:
    - Data source: Performance Insights API (not CloudWatch)
    - Wait events: IO:BufFileRead, IO:BufFileWrite
    - Detection: High wait times + high temp file usage = potential EBS limit constraint
    - Fix: Usually query tuning (work_mem), not instance scaling
    - Not included in this script due to different data source and remediation approach

    References:
    - https://aws.amazon.com/rds/aurora/instance-types/
      "The maximum EBS bandwidth refers to I/O bandwidth for local storage within
      the DB instance. It doesn't apply to communication with the Aurora storage volume."
    - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Managing.html#AuroraPostgreSQL.Managing.TempStorage
      "Aurora PostgreSQL stores tables and indexes in the Aurora storage subsystem,
      which is separate from the temporary storage."
    - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/PostgreSQL.ManagingTempFiles.Example.html
      How to monitor temp file I/O using Performance Insights
    """
    log.info(f"  {account_id}/{region}: Aurora cluster storage monitoring skipped (no instance-level IOPS ceiling)")
    log.info(f"  {account_id}/{region}: Aurora local storage (temp files) would require Performance Insights API monitoring")
    return []


def list_accounts_in_ou(ou_id):
    org = boto3.client("organizations")
    accounts = []

    def recurse(parent_id):
        for child_type in ("ACCOUNT", "ORGANIZATIONAL_UNIT"):
            paginator = org.get_paginator("list_children")
            for page in paginator.paginate(ParentId=parent_id, ChildType=child_type):
                for child in page["Children"]:
                    if child_type == "ACCOUNT":
                        try:
                            resp = org.describe_account(AccountId=child["Id"])
                            acc = resp["Account"]
                            if acc["Status"] == "ACTIVE":
                                accounts.append({"id": acc["Id"], "name": acc["Name"]})
                        except Exception as e:
                            log.warning(f"Could not describe account {child['Id']}: {e}")
                    else:
                        recurse(child["Id"])

    recurse(ou_id)
    return accounts


def get_session(account_id, role_name):
    if role_name:
        sts = boto3.client("sts")
        role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
        creds = sts.assume_role(RoleArn=role_arn, RoleSessionName="IOPSSaturationScan")["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    return boto3.Session()


def audit_account(account, role_name, regions, max_ops_pct, max_ops_duration_seconds, lookback_hours):
    account_id = account["id"]
    account_name = account["name"]
    all_findings = []
    log.info(f"Auditing account {account_id} ({account_name})")
    try:
        session = get_session(account_id, role_name)
    except Exception as e:
        log.error(f"Cannot assume role in {account_id}: {e}")
        return []

    for region in regions:
        log.info(f"  {account_id} scanning {region}...")
        try:
            all_findings.extend(audit_ebs(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_rds_standard(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
            all_findings.extend(audit_aurora(session, account_id, account_name, region, max_ops_pct, max_ops_duration_seconds, lookback_hours))
        except Exception as e:
            log.error(f"  Error in {account_id}/{region}: {e}")

    return all_findings


def write_csv(findings, path):
    if not findings:
        return
    fieldnames = list(asdict(findings[0]).keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for finding in findings:
            writer.writerow(asdict(finding))
    log.info(f"CSV written: {path}")


def write_excel(findings, path):
    if not PANDAS_AVAILABLE:
        log.warning("pandas/openpyxl not installed -- skipping Excel output. pip install pandas openpyxl")
        return
    if not findings:
        return

    from openpyxl.styles import PatternFill

    rows = [asdict(f) for f in findings]
    df = pd.DataFrame(rows)
    df = df.sort_values("max_observed_pct", ascending=False)

    def row_colour(pct):
        if pct >= 100:
            return "FF2222"
        elif pct >= 95:
            return "FF6600"
        elif pct >= 90:
            return "FFB300"
        return "90EE90"

    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name="IOPS Saturation Breaches")
        ws = writer.sheets["IOPS Saturation Breaches"]
        pct_col_idx = list(df.columns).index("max_observed_pct") + 1
        for row_idx in range(2, len(df) + 2):
            pct_val = ws.cell(row=row_idx, column=pct_col_idx).value or 0
            colour = row_colour(pct_val)
            for col_idx in range(1, len(df.columns) + 1):
                ws.cell(row=row_idx, column=col_idx).fill = PatternFill(
                    start_color=colour, end_color=colour, fill_type="solid"
                )
        summary = df.groupby("service_type").agg(
            breaches=("resource_id", "count"),
            max_pct_observed=("max_observed_pct", "max"),
            avg_breach_seconds=("longest_breach_seconds", "mean"),
        ).reset_index()
        summary.to_excel(writer, index=False, sheet_name="Summary by Service")

    log.info(f"Excel written: {path}")


def print_results(findings, max_ops_pct, max_ops_duration_seconds):
    print()
    print("=" * 70)
    print("IOPS SATURATION BREACH REPORT")
    print(f"Threshold : >= {max_ops_pct}% of effective IOPS ceiling")
    print(f"Duration  : >= {max_ops_duration_seconds}s sustained")
    print("=" * 70)

    if not findings:
        print("\nNo sustained IOPS saturation breaches found.")
        print("=" * 70)
        return

    findings_sorted = sorted(findings, key=lambda f: f.max_observed_pct, reverse=True)
    by_type = {}
    for f in findings_sorted:
        by_type.setdefault(f.service_type, []).append(f)

    for svc_type, items in sorted(by_type.items()):
        print(f"\n  {svc_type} ({len(items)} breach{'es' if len(items) != 1 else ''})")
        print(f"  {'Resource':<40} {'Ceiling':>8} {'Peak IOPS':>10} {'Peak %':>7} {'Duration':>10}")
        print(f"  {'=' * 40} {'=' * 8} {'=' * 10} {'=' * 7} {'=' * 10}")
        for f in items:
            print(f"  {f.resource_name:<40} {f.iops_ceiling:>8,} {f.max_observed_iops:>10,.0f} {f.max_observed_pct:>6.1f}% {f.longest_breach_seconds:>8}s")
            print(f"    Account: {f.account_id} | Region: {f.region}")
            print(f"    Window:  {f.breach_start_utc}  to  {f.breach_end_utc}")
            if f.note:
                print(f"    Note:    {f.note}")

    print(f"\n  Total breaches found: {len(findings)}")
    print("=" * 70)


def parse_args():
    parser = argparse.ArgumentParser(
        description="Scan EBS volumes and standard RDS instances for sustained IOPS saturation. "
                    "Aurora (including Serverless V2) is excluded (no instance-level IOPS ceiling for cluster storage)."
    )
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--ou-id", help="AWS Organizations OU ID")
    group.add_argument("--accounts", nargs="+", help="Specific AWS account IDs")
    parser.add_argument("--max-ops-pct", type=float, required=True,
        help="Percentage of IOPS ceiling that constitutes a breach (e.g. 90)")
    parser.add_argument("--max-ops-duration-secs", type=int, required=True,
        help="Minimum sustained breach duration in seconds to report (e.g. 120)")
    parser.add_argument("--lookback-hours", type=int, default=24,
        help="Hours of CloudWatch history to examine (default: 24)")
    parser.add_argument("--role-name", default="OrganizationAccountAccessRole")
    parser.add_argument("--regions", nargs="+", default=["af-south-1"])
    parser.add_argument("--workers", type=int, default=5)
    parser.add_argument("--output-prefix", default="iops_saturation_report")
    return parser.parse_args()


def main():
    args = parse_args()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")

    log.info(f"IOPS Saturation Scan starting")
    log.info(f"  Threshold : >= {args.max_ops_pct}% for >= {args.max_ops_duration_secs}s")
    log.info(f"  Lookback  : {args.lookback_hours}h | Regions: {', '.join(args.regions)}")

    accounts = list_accounts_in_ou(args.ou_id) if args.ou_id else [{"id": a, "name": a} for a in args.accounts]
    if args.ou_id:
        log.info(f"  Found {len(accounts)} active accounts in OU {args.ou_id}")

    all_findings = []
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = {
            executor.submit(audit_account, acc, args.role_name, args.regions,
                args.max_ops_pct, args.max_ops_duration_secs, args.lookback_hours): acc
            for acc in accounts
        }
        for future in as_completed(futures):
            acc = futures[future]
            try:
                findings = future.result()
                all_findings.extend(findings)
                log.info(f"Account {acc['id']} complete: {len(findings)} breach(es)")
            except Exception as e:
                log.error(f"Account {acc['id']} failed: {e}")

    print_results(all_findings, args.max_ops_pct, args.max_ops_duration_secs)

    write_csv(all_findings, f"{args.output_prefix}_{timestamp}.csv")
    write_excel(all_findings, f"{args.output_prefix}_{timestamp}.xlsx")

    log.info("Scan complete.")
    return 1 if any(f.max_observed_pct >= 100.0 for f in all_findings) else 0


if __name__ == "__main__":
    sys.exit(main())
PYEOF

chmod +x iops_saturation.py
echo "iops_saturation.py written — run it directly, dependencies install automatically on first use"

4. Installation and Permissions

The script self-installs its dependencies on first run: if boto3, pandas, or openpyxl are missing, it installs them automatically before proceeding. pandas and openpyxl are optional; without them the script still runs and produces CSV output, and Excel output is simply skipped with a warning.

The IAM role you assume in each target account needs these permissions:

cloudwatch:GetMetricStatistics
rds:DescribeDBInstances
ec2:DescribeVolumes
ec2:DescribeInstances

If you are using AWS Organizations, the caller identity in your management account also needs organizations:ListChildren, organizations:DescribeAccount, and sts:AssumeRole into each member account. The script attempts to assume the role name you specify in every account it discovers; where the assumption fails it logs an error and moves on rather than aborting the run, so you still get full coverage of every account where the role is in place.
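For the member-account role, a minimal identity policy sketch might look like the following. These are all read-only calls that do not support resource-level scoping, which is why Resource is "*"; adapt the sketch to your own naming and SCP conventions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "rds:DescribeDBInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```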

5. Running the Script

The two required parameters are --max-ops-pct and --max-ops-duration-secs; everything else has sensible defaults. To scan an entire OU for anything that held 90 percent of its effective ceiling for at least two minutes in the last 24 hours, run it like this:

python iops_saturation.py \
  --max-ops-pct 90 \
  --max-ops-duration-secs 120 \
  --ou-id ou-xxxx-xxxxxxxx \
  --regions af-south-1 \
  --workers 10

To scan specific accounts with a tighter threshold and a 48-hour lookback, pass account IDs directly instead of an OU:

python iops_saturation.py \
  --max-ops-pct 95 \
  --max-ops-duration-secs 300 \
  --lookback-hours 48 \
  --accounts 123456789012 234567890123 \
  --regions af-south-1

The script exits with code 1 if any resource hit 100 percent of its effective ceiling during the lookback window, and 0 otherwise (findings below 100 percent are still reported but do not fail the run). That means you can wire it directly into a CI pipeline or an EventBridge scheduled job that posts to your incident channel when the condition fires. A resource reaching 100 percent of its I/O ceiling is not a configuration curiosity; it is a past incident or an active risk, and it deserves the same treatment as any other production alert.
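If you route findings into alerting rather than just failing the build, the same utilisation bands the workbook uses for row colours can drive routing decisions. A minimal sketch; the function name and action labels are illustrative, not part of the script:

```python
def alert_action(max_observed_pct: float) -> str:
    # Bands mirror the workbook's row colours:
    # red >= 100, orange >= 95, amber >= 90, green below.
    if max_observed_pct >= 100:
        return "page"    # confirmed saturation: treat as an incident
    if max_observed_pct >= 95:
        return "ticket"  # no headroom: fix before the next load event
    if max_observed_pct >= 90:
        return "review"  # structural warning: next architecture review
    return "ok"
```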

6. Reading the Output

The script produces a colour-coded Excel workbook sorted by peak utilisation percentage, a flat CSV for programmatic consumption, and a summary printed to stdout. Here is what a realistic run looks like across a small estate with two breach findings:

======================================================================
IOPS SATURATION BREACH REPORT
Threshold : >= 90.0% of effective IOPS ceiling
Duration  : >= 120s sustained
======================================================================

  EBS (io2) (1 breach)
  Resource                                  Ceiling  Peak IOPS  Peak %   Duration
  ======================================== ======== ========== ======= ==========
  analytics-etl-vol / vol-0a1b2c3d4e5f      13,333     12,940   97.0%      180s
    Account: 234567890123 | Region: af-south-1
    Window:  2026-03-01 02:31:00 UTC  to  2026-03-01 02:34:00 UTC
    Note:    Effective ceiling = min(storage: 16,000, instance m6i.2xlarge: 13,333) = 13,333 IOPS [instance ceiling is binding]

  RDS (postgres) (1 breach)
  Resource                                  Ceiling  Peak IOPS  Peak %   Duration
  ======================================== ======== ========== ======= ==========
  reporting-db / db-reporting-prod            3,500      3,412   97.5%      240s
    Account: 123456789012 | Region: af-south-1
    Window:  2026-03-01 08:45:00 UTC  to  2026-03-01 08:49:00 UTC
    Note:    Effective ceiling = min(storage: 6,000, instance db.r6g.large: 3,500) = 3,500 IOPS [instance ceiling is binding]

  Total breaches found: 2
======================================================================

Notice how the RDS finding above shows a ceiling of 3,500 even though the instance has 6,000 provisioned storage IOPS. The instance class is the binding constraint. A naive comparison against the storage ceiling would have shown 56.9 percent utilisation and produced no finding at all. The workload is actually at 97.5 percent of its effective capacity.
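The arithmetic behind that finding is worth making explicit. A worked example using the numbers above:

```python
# Numbers from the RDS finding above.
storage_iops = 6000      # provisioned storage IOPS
instance_iops = 3500     # db.r6g.large instance class ceiling
observed_iops = 3412     # peak observed via CloudWatch

effective_ceiling = min(storage_iops, instance_iops)    # 3500: instance binds
naive_pct = observed_iops / storage_iops * 100          # ~56.9%: no finding
actual_pct = observed_iops / effective_ceiling * 100    # ~97.5%: breach
```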

There are three things to read in each finding. The peak percentage tells you how close to the ceiling the resource actually got, the duration tells you how long it held there, and the window timestamps tell you exactly when to look in your application metrics and logs to find the correlated latency spike or error rate increase.

6.1 What the Excel Workbook Contains

The workbook has two sheets. The first, IOPS Saturation Breaches, contains one row per breach event, sorted by peak utilisation percentage descending, with columns for:

  • account ID, account name, and region
  • service type, resource ID, and resource name
  • instance type
  • the effective IOPS ceiling used for comparison, plus the storage and instance IOPS ceilings
  • the threshold IOPS at your specified percentage
  • peak observed IOPS and peak percentage of ceiling
  • longest breach duration in seconds
  • breach start and end timestamps in UTC
  • the CloudWatch metric used and any notes on how the ceiling was calculated

The second sheet, Summary by Service, groups findings by service type and shows total breach count, maximum peak percentage observed, and average breach duration.

Row colours in the first sheet map to utilisation severity. Red means the resource hit or exceeded 100 percent of its ceiling, which is confirmed saturation and should be treated as an incident retrospective item. Orange covers 95 to 100 percent and represents a resource operating without meaningful headroom that needs attention before the next load event. Amber covers 90 to 95 percent, which is a structural warning worth adding to your next architecture review with the breach timestamps and duration included in the discussion.
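The flat CSV makes the red band easy to pull out programmatically too. A minimal sketch using only the standard library, assuming the CSV column names from the SaturationBreach dataclass (max_observed_pct holds the peak percentage of ceiling):

```python
import csv

def red_rows(csv_path):
    # Rows at or above 100% of the effective ceiling (the red band).
    with open(csv_path, newline="") as f:
        return [row for row in csv.DictReader(f)
                if float(row["max_observed_pct"]) >= 100.0]
```

Feed the result into whatever opens your incident tickets.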

7. What To Do With the Findings

Red rows require immediate action. Pull the breach timestamps and correlate them with your application latency and error rate metrics for the same period. There is almost certainly an incident in your history that traces back to this resource, even if it was attributed at the time to something else such as a downstream timeout, connection pool exhaustion, or an upstream retry storm; the I/O saturation was likely the initiating cause. Fix the instance class or provisioned IOPS, then open a retrospective item to understand why this was not caught at provisioning time.

Orange rows mean the resource does not have enough headroom to absorb any meaningful load increase. You need to either increase the instance ceiling by upgrading the instance class, reduce the workload through query optimisation, connection pooling improvements, or read replica offloading, or accept the risk consciously and document it. What you should not do is leave it and assume it will be fine because it has not triggered an outage yet. Luck is not a capacity model.

Amber rows are planning cycle items rather than emergencies. Add them to your next architecture review and include the breach timestamps and duration in the discussion. If a resource is repeatedly hitting 90 percent during predictable traffic peaks the fix is straightforward, and the conversation with your team is easier when you have the data to show them what is actually happening rather than asking them to take it on faith.

If the script returns no findings, either your estate is genuinely healthy from an I/O capacity perspective, or your threshold is set too conservatively, or the lookback window did not capture your peak traffic period. Try a 72-hour lookback across your typical weekly peak if you are not seeing results you expect, because the absence of findings with a 24-hour window that does not cover your busiest period is not the same as confirmation that nothing is wrong.

8. Using Both Scripts Together

The mismatch auditor and this saturation detector answer different questions and you need both to have complete coverage. The mismatch auditor runs against your configuration data from the AWS APIs, does not need CloudWatch, and tells you where your architecture has provisioned more storage IOPS than your instance class can consume. It is a preventive tool, and you should run it on a schedule as part of your infrastructure compliance pipeline, treat findings with the same severity as security misconfigurations, and block deployments that introduce new critical-severity mismatches.

The saturation detector runs against observed operational data from CloudWatch and tells you which resources are actually approaching or hitting their ceiling under real workload conditions, regardless of whether a configuration mismatch exists. It is a detective tool, and you should run it on a schedule against your recent history, pipe its exit code into your alerting infrastructure, and use it as an input to your capacity planning cycle.

The scenarios they catch are different. You can have a mismatch with no saturation because the storage is over-provisioned but the workload is light and the instance ceiling is never approached. You can have saturation with no mismatch because the configuration is correct but the workload has grown to the point where even the right instance class is running out of room. And you can have both, which is the worst case: a resource where the effective ceiling is lower than you think because of a mismatch, and where observed IOPS are already approaching that lower ceiling. Running both scripts closes both gaps, giving you the structural audit and the operational signal together. Between them they surface the class of failure that your existing tooling, including Trusted Advisor, CloudWatch alarms, Performance Insights, and your APM, will not name for you until it has already caused an outage.

Install Chrome MCP for Claude Desktop in a single script

If you have ever sat there manually clicking through a UI, copying error messages, and pasting them into Claude just to get help debugging something, I have good news. There is a better way.

Chrome MCP gives Claude Desktop direct access to your Chrome browser, allowing it to read the page, inspect the DOM, execute JavaScript, monitor network requests, and capture console output without you lifting a finger. For anyone doing software development, QA, or release testing, this changes the game entirely.

Why This Matters

When you are debugging a production issue or validating a new release, the bottleneck is almost never Claude's reasoning ability. It is the friction of getting context into Claude in the first place: copying stack traces, screenshotting UI states, manually describing what you see, and repeating yourself every time something changes. Chrome MCP eliminates that friction, giving Claude direct visibility into what is actually happening in your browser. It can read live page content and DOM state, capture JavaScript errors straight from the console, intercept network requests and API responses in real time, and autonomously navigate and interact with your application while flagging anything that looks wrong.

For senior engineers and CTOs who care about reducing MTTR and shipping with confidence, this is a genuine force multiplier.

Install in One Command

Copy the block below in its entirety and paste it into your terminal. It writes the installer script, makes it executable, and runs it all in one go.

cat > install-chrome-mcp.sh << 'EOF'
#!/bin/bash
set -euo pipefail

echo "Installing Chrome MCP for Claude Desktop..."

CONFIG_DIR="$HOME/Library/Application Support/Claude"
CONFIG_FILE="$CONFIG_DIR/claude_desktop_config.json"

mkdir -p "$CONFIG_DIR"

if [[ -f "$CONFIG_FILE" ]]; then
  command -v node >/dev/null 2>&1 || { echo "node is required to merge an existing config; install Node.js and rerun" >&2; exit 1; }
  echo "Existing config found. Merging Chrome MCP entry..."
  node -e "
    const fs = require('fs');
    const config = JSON.parse(fs.readFileSync('$CONFIG_FILE', 'utf8'));
    config.mcpServers = config.mcpServers || {};
    config.mcpServers['chrome-devtools'] = {
      command: 'npx',
      args: ['-y', 'chrome-devtools-mcp@latest']
    };
    fs.writeFileSync('$CONFIG_FILE', JSON.stringify(config, null, 2));
    console.log('Config updated successfully.');
  "
else
  echo "No existing config found. Creating new config..."
  printf '{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["-y", "chrome-devtools-mcp@latest"]
    }
  }
}
' > "$CONFIG_FILE"
  echo "Config created at $CONFIG_FILE"
fi

echo ""
echo "Done. Restart Claude Desktop to activate Chrome MCP."
echo "You should see a browser tools indicator in the Claude interface."
EOF
chmod +x install-chrome-mcp.sh
./install-chrome-mcp.sh

One paste and you are done. The script writes itself to disk, becomes executable, and runs immediately without any manual file editing or separate steps. Using chrome-devtools-mcp@latest means you will always pull the current version without needing to reinstall.
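If you want to confirm the entry actually landed in the config, a quick grep is enough. The sketch below runs against a throwaway fixture file that mirrors what the installer writes, so it is safe to paste anywhere; point CONFIG at the real path (~/Library/Application Support/Claude/claude_desktop_config.json) to check your own machine.

```shell
# Sanity-check sketch: does the config contain the chrome-devtools server?
# CONFIG is a throwaway fixture here; swap in the real path for a live check.
CONFIG="$(mktemp)"
cat > "$CONFIG" << 'JSON'
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["-y", "chrome-devtools-mcp@latest"]
    }
  }
}
JSON

if grep -q '"chrome-devtools"' "$CONFIG"; then result="configured"; else result="missing"; fi
echo "chrome-devtools: $result"
```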

Using It for Debugging

Once Chrome MCP is active, you direct Claude to navigate to any URL and investigate it directly. You might ask it to check the dev console on a page for JavaScript errors, navigate to your staging environment and verify the dashboard loads cleanly, or walk through a specific user flow and report back on anything unexpected. Claude reads the console output, intercepts the network calls, and reports back in plain language with specifics you can act on immediately rather than a vague description you then have to go and verify yourself.

Using It for Release Testing

This is where Chrome MCP really earns its keep. Before pushing a release to production, you can give Claude a test checklist and let it execute the entire regression suite autonomously against your staging URL, navigating through each scenario, capturing screenshots, checking for console errors, and producing a structured pass/fail summary at the end. The alternative is a human doing this manually for an hour before every release, and there is simply no comparison once you have seen what autonomous browser testing looks like in practice.

How It Actually Works

Chrome MCP connects to your browser using the Chrome DevTools Protocol, the same underlying mechanism that powers Chrome’s built-in developer tools. When Claude Desktop has Chrome MCP active, it can issue DevTools commands directly to pages it navigates to, reading the accessibility tree, querying DOM elements, firing JavaScript in the page context, and listening on the network and console streams.

There is no screen recording, no pixel scraping, and no vision model trying to interpret screenshots. Claude is working with structured data, the actual DOM state, actual network payloads, actual console messages, which means it reasons about your application the same way a senior developer would when sitting at the DevTools panel, not the way a junior tester would when eyeballing a screen.

The connection is local. Chrome MCP runs as a process on your machine and communicates with Claude Desktop over a local socket. Nothing leaves your machine except what Claude sends to the Anthropic API as part of normal inference.

One important clarification on scope: chrome-devtools-mcp operates in its own managed browser context, separate from your normal Chrome windows. Claude cannot see or interact with tabs you already have open. It only controls pages it has navigated to itself. This is worth understanding both practically and as a security property. Claude cannot accidentally interact with your AWS console, banking session, or anything else you have open unless you explicitly direct it to navigate there within its own context.

What Claude Will and Will Not Do

Giving an AI agent direct access to a browser raises a fair question about guardrails. Here is how it breaks down in practice.

Claude will not enter passwords or credentials under any circumstances, even if you provide them directly in the chat. It will not touch financial data, will not permanently delete content, and will not modify security permissions or access controls, including sharing documents or changing who can view or edit files. It will not create accounts on your behalf.

For anything irreversible, Claude stops and asks for explicit confirmation before proceeding. Clicking Publish, submitting a form, sending an email, or executing a purchase all require you to say yes in the chat before Claude acts. The instruction to proceed must come from you in the conversation, not from content found on a web page.

That last point matters more than it sounds. If a web page contains hidden instructions telling Claude to take some action, Claude treats that as untrusted data and surfaces it to you rather than following it. This class of attack is called prompt injection and it is a real risk when AI agents interact with arbitrary web content. Chrome MCP is designed to be resistant to it by default.

Things Worth Trying

Once you have it running, here are some concrete starting points.

Debug a broken page in seconds. Direct Claude to navigate to the broken page and check it for JavaScript errors. Claude reads the console, identifies the error, traces it back to the relevant DOM state or network call, and gives you a specific diagnosis rather than a list of things to check.

Validate an API integration. Navigate Claude to a feature that calls your backend and ask it to monitor the network requests while it triggers the action. Claude captures the request payload, the response, the status code, and any timing anomalies, and flags anything that deviates from what you would expect.

Audit a form for accessibility issues. Point Claude at a form and ask it to walk the accessibility tree and identify any inputs missing labels, incorrect ARIA roles, or tab order problems. This takes Claude about ten seconds and would take a human tester considerably longer.

Smoke test a deployment. After pushing to staging, give Claude your critical user journeys as a numbered list and ask it to execute each one, navigate through the steps, and report back with a pass or fail and the reason for any failure. Claude does not get tired, does not skip steps, and does not interpret close enough as a pass.

Compare environments. Ask Claude to open your production and staging URLs in sequence and compare the DOM structure of a specific component across both. Subtle differences in class names, missing elements, or divergent data often show up immediately when you stop looking with your eyes and start looking with structured queries.

The common thread across all of these is that you stop describing your problem to Claude and start showing it directly. That shift in how you interact with the tool is where the real productivity gain lives.

A Note on Security

Chrome MCP runs entirely locally and is not sending your browser data to any external service beyond your normal Claude API calls. That said, it is worth being deliberate about which tabs you have open when Claude is actively using the browser tool, and you should avoid leaving authenticated sessions open that you would not want an automated agent interacting with.

Final Thought

The best debugging tools are the ones that remove the distance between the problem and the person solving it, and Chrome MCP does exactly that by putting Claude in the same browser you are looking at with full visibility into what is actually happening. If you are serious about software quality and not using this yet, you are leaving time on the table.

Andrew Baker is CIO at Capitec Bank and writes about enterprise architecture, cloud infrastructure, and the tools that actually move the needle at andrewbaker.ninja.

You Just Updated a Plugin and Your WordPress Site Crashed. Now What?

You updated a plugin five minutes ago. Maybe it was a security patch. Maybe you were trying a new caching layer. You clicked “Update Now,” saw the progress bar fill, got the green tick, and moved on with your day. Now the site is down. Not partially down. Not slow. Gone. A blank white page. No error message, no admin panel, no way in. Your visitors see nothing. Your contact forms are dead. If you are running WooCommerce, your checkout just stopped processing orders.

If you are running WordPress 5.2 or later, you might not even get a white screen. Instead you get this:

There has been a critical error on this website. Please check your site admin email inbox for instructions.

That is the exact message. No stack trace, no file name, no line number. Just a single sentence telling you to check an email that may or may not arrive. WordPress also sends a notification to the admin email address with the subject line “Your Site Is Experiencing a Technical Issue” containing a recovery mode link. In theory this is helpful. In practice the email can take minutes to arrive, land in spam, or never arrive at all.

If you are running WordPress older than 5.2, you get nothing. A blank white page. No message at all. That is the original White Screen of Death.

Either way, the question is not whether it will happen to you. The question is what happens in the 60 seconds after it does.

1. Why WordPress Does Not Protect You

WordPress has no runtime health check. There is no circuit breaker, no post activation validation, no automatic rollback. When you activate a plugin, WordPress writes the plugin name into an active_plugins option in the database and then loads that plugin’s PHP file on the next request. If that file throws a fatal error, PHP dies and takes the entire request pipeline with it. Apache or Nginx returns a 500 or a blank page. WordPress never gets far enough into its own boot sequence to realise something is wrong.

There is a recovery mode that was introduced in WordPress 5.2. It catches fatal errors and sends an email to the admin address with a special recovery link. In theory this is helpful. In practice it has three problems. First, the email may take minutes to arrive or may never arrive at all if your site’s mail configuration is itself broken (which it often is on cheap shared hosting). Second, the recovery link expires after a short window. Third, it only pauses the offending plugin for the recovery session. It does not deactivate it permanently. If you log in via the recovery link but forget to deactivate the plugin manually, the next regular visitor request will crash the site again.

The core issue is architectural. WordPress loads every active plugin on every request. There is no sandbox, no isolation, no health gate between plugin activation and the next page load. A single throw or require of a missing file in any active plugin will take down the entire application. The plugin system is cooperative, not defensive.
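The cooperative loading model is easy to simulate. The sketch below is a toy: shell scripts standing in for PHP plugin files, a loop standing in for wp-settings.php, and every name a made-up fixture. The point is structural: one fatal load aborts the whole request, and nothing after it runs.

```shell
# Toy model of WordPress plugin loading: every "plugin" is loaded on every
# "request", and a fatal error in any one of them kills the whole request.
workdir="$(mktemp -d)"
mkdir -p "$workdir/plugins"
printf 'echo "plugin-a loaded"\n' > "$workdir/plugins/plugin-a.sh"
printf 'echo "fatal error in plugin-b" >&2; exit 1\n' > "$workdir/plugins/plugin-b.sh"
printf 'echo "plugin-c loaded"\n' > "$workdir/plugins/plugin-c.sh"

serve_request() {
  for plugin in "$workdir"/plugins/*.sh; do
    bash "$plugin" || return 1   # no sandbox: one bad plugin aborts everything
  done
  echo "page rendered"
}

if serve_request; then status=200; else status=500; fi
echo "request status: $status"   # plugin-c never loaded, no page was served
```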

2. What Recovery Normally Looks Like

If you have SSH access, the fix takes about 30 seconds. You connect to the server, navigate to wp-content/plugins/, and either rename or delete the offending plugin directory. The next request to WordPress skips the missing plugin and the site comes back.

If you do not have SSH, you try FTP. Most hosting providers still offer it. You open FileZilla or whatever client you have configured, navigate to the plugins folder, and do the same thing. This takes longer because FTP clients are slow, and if you do not have your credentials saved, you are now hunting through old emails from your hosting provider.

If you do not have FTP, or you are on a managed host that restricts file access, you file a support ticket. On a good host this gets resolved in minutes. On a bad one it takes hours. On a weekend it takes longer. Your site is down the entire time.

If you have a backup plugin and it stored snapshots externally (S3, Google Drive, Dropbox), you can restore from the last known good state. This works, but it is a sledgehammer for a thumbtack. You are restoring the entire site, including the database, to fix a single bad plugin file. If any content was created between the backup and the crash, it is gone.

Every one of these options assumes technical knowledge, preconfigured access, or a responsive support team. Most WordPress site owners have none of the three.

The Emergency SSH One (Two) Liner(s)

If you do have SSH access and just need the site back up immediately, it takes two commands. First, see what you are about to kill:

find /var/www/html/wp-content/plugins -maxdepth 1 -mindepth 1 -type d -mmin -60 -printf '%T+ %f\n' | sort

This lists every plugin folder modified in the last 60 minutes with its timestamp. Review the output. If you are happy with the list, delete them:

find /var/www/html/wp-content/plugins -maxdepth 1 -mindepth 1 -type d -mmin -60 -exec rm -rf {} \;

Adjust the path if your WordPress installation is not at /var/www/html. On many hosts it will be /home/username/public_html or similar. Change -mmin -60 to -mmin -30 for 30 minutes or -mmin -120 for two hours.

This is the nuclear option. It does not deactivate the plugin cleanly through WordPress. It deletes the files from disk. WordPress will notice the plugin is missing on the next request and remove it from the active plugins list automatically. If you need to be more surgical, use WP-CLI instead:

wp plugin deactivate $(find /var/www/html/wp-content/plugins -maxdepth 1 -mindepth 1 -type d -mmin -60 -printf '%f\n') --path=/var/www/html

This deactivates recently modified plugins without deleting them, so you can inspect them later.
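Before running any of these against a live site, you can rehearse the -mmin filter on a scratch directory to see exactly which folders a given window catches. Everything below is a throwaway fixture (GNU find and touch assumed, same as the one-liners above).

```shell
# Rehearse the -mmin window on throwaway fixtures before touching a real site.
plugins="$(mktemp -d)"
mkdir "$plugins/old-stable-plugin" "$plugins/just-updated-plugin"
touch -d '90 minutes ago' "$plugins/old-stable-plugin"   # simulate an old install

# Same filter as the emergency one-liners: dirs modified in the last 60 minutes.
recent="$(find "$plugins" -maxdepth 1 -mindepth 1 -type d -mmin -60 -printf '%f\n')"
echo "candidates: $recent"
```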

3. The Watchdog Pattern

The solution is a plugin that watches the site from the inside. Not a monitoring service that pings your URL from an external server and sends you an alert. Not an uptime checker that tells you the site is down (you already know the site is down). A plugin that detects the crash, identifies the cause, and fixes it automatically before you even notice.

The pattern works like this. A lightweight cron job fires every 60 seconds. Each tick does three things.

Probe. The plugin sends an HTTP GET to a dedicated health endpoint on its own site. The endpoint is registered at init priority 1, before themes and most other plugins load. It returns a plain text response: CLOUDSCALE_OK. No HTML, no template, no database queries. The request includes cache busting parameters and no cache headers to ensure CDNs and browsers do not serve a stale 200 when the site is actually dead.

Evaluate. If the probe comes back HTTP 200 with the expected body, the site is healthy. The tick exits and does nothing. No logging, no database writes, no overhead.

Recover. If the probe fails (500 error, timeout, connection refused, unexpected response body), the plugin scans the wp-content/plugins/ directory and identifies the plugin file with the most recent modification time. If that file was modified within the last 10 minutes, the watchdog deactivates it, deletes its files from disk, and lets the next cron tick re-probe to confirm the site is back.

The entire recovery loop takes less than two minutes from crash to restored site. No human intervention. No SSH. No support ticket.
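A minimal sketch of one tick, with the HTTP probe replaced by a flag file so the example is self-contained (the real plugin issues a cache-busted curl against its health endpoint). The window, paths, and plugin name here are all illustrative.

```shell
# One watchdog tick: probe, evaluate, recover. probe() reads a flag file
# in this sketch; the real plugin issues an HTTP GET to its own endpoint.
site="$(mktemp -d)"
mkdir -p "$site/wp-content/plugins/bad-plugin"
echo "DOWN" > "$site/health"          # simulate a crashed site

probe() {
  [ "$(cat "$site/health")" = "CLOUDSCALE_OK" ]
}

tick() {
  probe && return 0                   # healthy: exit quietly, no overhead
  # Crashed: pick the most recently modified plugin inside the 10 minute window.
  candidate="$(find "$site/wp-content/plugins" -maxdepth 1 -mindepth 1 -type d \
    -mmin -10 -printf '%T@ %f\n' | sort -rn | head -1 | cut -d' ' -f2)"
  if [ -n "$candidate" ]; then
    rm -rf "$site/wp-content/plugins/$candidate"   # real plugin deactivates first, then deletes
    echo "removed: $candidate"
  fi
}

tick
```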

4. The Recovery Window

The 10 minute window is the most important design decision in the plugin. It defines the boundary between “a plugin that was just installed or updated” and “a plugin that has been sitting on the server for days.”

Without a time window, the watchdog would be dangerous. If the site crashes because the database is down or the disk is full, the watchdog would delete whatever plugin happens to have the newest file, even if that plugin has been stable for months and had nothing to do with the crash. That would be worse than the original problem.

The 10 minute window scopes the blast radius. The watchdog only acts on plugins that were modified in the last 600 seconds. If no plugin was recently modified, the watchdog sees the crash, finds no candidate, and does nothing. This is the correct behaviour. A crash with no recent plugin change is a server problem, not a plugin problem, and the watchdog should not try to fix server problems.

The timing scenarios are worth walking through explicitly.

You install a plugin at 14:00. The site crashes at 14:03. The plugin’s file modification time is 3 minutes ago, well within the window. The watchdog removes it.

You install a plugin at 14:00. The site crashes at 14:15. The plugin’s file modification time is 15 minutes ago, outside the window. The watchdog sees the crash but finds no candidate within the window. It does nothing. This is correct. If the plugin ran fine for 15 minutes and the site only now crashed, the plugin is probably not the cause.

You update two plugins at 14:00 and 14:05. The site crashes at 14:06. The watchdog finds the 14:05 plugin (most recently modified) and removes it. If the site is still down at the next tick 60 seconds later, the 14:00 plugin is now the most recently modified and still within the window. It gets removed next. The watchdog works through the candidates sequentially, most recent first.

5. What It Deletes and What It Leaves Alone

The watchdog targets one plugin per tick: the most recently modified file within the recovery window. It deactivates the plugin first (removes it from the active_plugins list in the database), then deletes the plugin’s files from disk.

It deletes rather than just deactivates. A deactivated plugin still has files on disk that could be autoloaded, could contain vulnerable code, or could conflict with other plugins through file level includes. If the plugin crashed your site, you do not want its files sitting around. You want it gone. You can reinstall it later once you have investigated the issue.

The watchdog never touches itself. It explicitly skips its own plugin file when scanning for candidates. It also never touches themes, mu-plugins, or drop-in plugins. Its scope is strictly the wp-content/plugins/ directory.

It does not act on database corruption. It does not act on PHP version incompatibilities at the server level. It does not act on disk space exhaustion, memory limit errors caused by the WordPress core, or misconfigurations in wp-config.php. It is a single purpose tool with a narrow scope, and that narrowness is what makes it safe.

6. The Design Decisions

Single file, no dependencies. The entire plugin is one PHP file. No Composer packages, no JavaScript assets, no CSS, no database tables, no options. A recovery tool that requires its own infrastructure is a recovery tool that can fail for infrastructure reasons. The fewer moving parts, the more likely it works when everything else is broken.

No configuration UI. There is no settings page. There is nothing to configure. The recovery window is a constant in the code. The probe endpoint is hardcoded. The cron schedule is fixed at 60 seconds. Every configuration option is a potential misconfiguration. A watchdog plugin that requires the user to set it up correctly is a watchdog plugin that will be set up incorrectly on exactly the sites that need it most.

Self probe, not external ping. The plugin probes itself from inside WordPress, not from an external monitoring service. This means it works on localhost development environments, on staging servers behind VPNs, on intranets, and on any host where inbound HTTP is restricted. It also means the probe tests the full WordPress request pipeline, not just whether the server is responding to TCP connections.

SSL verification disabled on the probe. The self probe sets sslverify to false. This is deliberate. Many staging and development environments use self signed certificates. A watchdog that fails because it cannot verify its own SSL certificate is useless in exactly the environments where you are most likely to be testing plugin changes.

Cache busting on every probe. The probe URL includes a timestamp parameter and sends explicit no cache headers. WordPress sites frequently run behind Varnish, Cloudflare, or plugin level page caches. Without cache busting, the probe could receive a cached 200 response from the CDN while the origin server is returning 500 errors. The site would appear healthy when it is actually dead.

7. WordPress Cron: The One Thing You Need to Know

WordPress does not have a real cron system. The built in “WP-Cron” is triggered by visitor requests. When someone visits your site, WordPress checks whether any scheduled events are due and runs them before serving the page.

This means on a low traffic site, the watchdog might not tick for several minutes or even hours if nobody visits. On a crashed site with zero traffic, it might never tick at all, because the crash happens before WordPress gets far enough into its boot sequence to check the cron schedule.

The fix is a real system cron. One line in your server’s crontab:

* * * * * curl -s "https://yoursite.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1

This hits wp-cron.php every 60 seconds regardless of visitor traffic. Combined with the watchdog plugin, it means your site self heals within two minutes of a plugin crash, even if nobody is visiting.
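If you want to script the crontab change rather than edit it by hand, a small helper that prints the line first lets you eyeball it before installing. The site URL is a placeholder; the install command is left commented so the sketch has no side effects.

```shell
# Helper that prints the crontab entry for a given site URL (placeholder).
wp_cron_line() {
  printf '* * * * * curl -s "%s/wp-cron.php?doing_wp_cron" > /dev/null 2>&1\n' "$1"
}

line="$(wp_cron_line "https://yoursite.com")"
echo "$line"
# Idempotent install, commented out so nothing is modified by the sketch:
# ( crontab -l 2>/dev/null | grep -v 'wp-cron.php'; echo "$line" ) | crontab -
```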

If you are on shared hosting without cron access, services like EasyCron or cron-job.org can make the same request externally. Some managed WordPress hosts (Kinsta, WP Engine, Cloudways) already run system cron for you. Check with your host.

8. Test It Yourself

Confidence in a recovery tool comes from seeing it work. Included with this post is a downloadable test plugin: CloudScale Crash Test. It is a WordPress plugin that does exactly one thing: throw a fatal error on every request, immediately white screening your site:

https://andrewninjawordpress.s3.af-south-1.amazonaws.com/cloudscale-plugin-crash-recovery.zip

The test procedure:

  1. Install and activate CloudScale Plugin Crash Recovery on your site
  2. Confirm your system cron is running (or that your site has enough traffic to trigger WP-Cron reliably)
  3. Install and activate the CloudScale Crash Test plugin
  4. Your site will immediately show: “There has been a critical error on this website. Please check your site admin email inbox for instructions.”
  5. Wait 60 to 120 seconds
  6. Refresh your site. It should be back online
  7. Check your plugins list. CloudScale Crash Test should be gone

The crash test plugin contains a single throw new \Error() statement at the top level. It is not subtle. It does not simulate a partial failure or an intermittent bug. It kills the site immediately on every request. If the watchdog can recover from this, it can recover from any plugin that fatal errors within the recovery window.

Do not install the crash test plugin on a production site without the recovery plugin active. If you do and your site is down, SSH in and run:

rm -rf /var/www/html/wp-content/plugins/cloudscale-crash-test-plugin/

Adjust the path to match your WordPress installation. On most shared hosts this will be /home/username/public_html/wp-content/plugins/cloudscale-crash-test-plugin/. Your site will come back on the next request.

9. When This Does Not Help

No tool solves every problem, and it is worth being explicit about the boundaries.

The watchdog does not help if the crash is caused by a theme. Themes are loaded through a different mechanism and the watchdog only scans the plugins directory. It does not help if the crash is caused by a mu-plugin (must use plugin), because mu-plugins load before regular plugins and before the cron system has a chance to act. It does not help if the database is down, because WordPress cannot read its own options (including the active plugins list) without a database connection. It does not help if the server’s PHP process is completely dead, because there is no PHP runtime to execute the cron tick.

It also does not help if the crash happens more than 10 minutes after the plugin was installed. If you install a plugin at 09:00 and it causes a crash at 11:00 due to a cron job or a deferred activation hook, the plugin’s file modification time is two hours old and outside the recovery window. The watchdog will see the crash but find no candidate to remove. This is a design tradeoff: a wider window catches more edge cases but increases the risk of removing an innocent plugin.

The watchdog is one layer in a broader defence strategy. It handles the most common failure mode (a recently installed or updated plugin that immediately crashes the site) and handles it automatically. For everything else, you still need backups, monitoring, and access to your server.

10. The Code

The full source code is available on GitHub under GPLv2. It is a single PHP file with no dependencies:

CloudScale Plugin Crash Recovery on GitHub

The crash test plugin is available as a zip download attached to this post.

Install the recovery plugin. Set up your system cron. Forget about it until it saves you at 2am on a Saturday when a plugin auto update goes wrong and you are nowhere near a terminal. That is the point.

Enable Claude Desktop To Run Bash MCP : Fully Scripted Installation

Andrew Baker | 01 Mar 2026 | andrewbaker.ninja

You want one script that does everything. No digging around in settings. No manually editing JSON. No clicking through Settings → Developer → Edit Config. Just run it once and Claude Desktop can execute bash commands through an MCP server.

This guide gives you exactly that.

1. Why You Would Want This

Out of the box, Claude Desktop is a chat window. It can write code, explain things, and draft documents, but it cannot actually do anything on your machine. It cannot run a command. It cannot check a log file. It cannot restart a service. You are the middleman, copying and pasting between Claude and your terminal.

MCP (Model Context Protocol) changes that. It lets Claude Desktop call local tools directly. Once you wire up a bash MCP server, Claude stops being a suggestion engine and becomes something closer to a capable assistant that can act on your behalf.

Here are real situations where this matters.

1.1 Debugging Without the Copy Paste Loop

You are troubleshooting a failing deployment. Normally the conversation goes like this: you describe the error, Claude suggests a command, you copy it into terminal, you copy the output back into Claude, Claude suggests the next command, and you repeat this loop fifteen times.

With bash MCP enabled, you say:

Check the last 50 lines of /var/log/app/error.log and tell me what is going wrong

Claude runs the command, reads the output, and gives you a diagnosis. If it needs more context it runs the next command itself. A fifteen step copy paste loop becomes one prompt.

1.2 System Health Checks on Demand

You want to know if your machine is in good shape. Instead of remembering the right incantations for disk usage, memory pressure, and process counts, you ask Claude:

Give me a quick health check on this machine. Disk, memory, CPU, and any processes using more than 1GB of RAM

Claude runs df -h, free -m, uptime, and ps aux --sort=-%mem in sequence, then summarises everything into a single readable report. No tab switching. No forgetting flags.
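Those four commands collapse naturally into one script you could keep around. The sketch below assumes a Linux host (free and ps --sort come from procps and GNU ps; macOS needs different flags), and the 1 GB threshold is illustrative.

```shell
# One-shot health check sketch. Linux assumptions: free (procps), GNU ps --sort.
echo "== Disk =="
disk="$(df -h / | tail -1)"
echo "$disk"

echo "== Memory =="
command -v free >/dev/null && free -m | awk 'NR==2 {printf "%s MB used of %s MB\n", $3, $2}'

echo "== Load =="
load="$(uptime)"
echo "$load"

echo "== Processes over 1GB RSS =="
ps aux --sort=-%mem | awk 'NR>1 && $6 > 1048576 {printf "%s %d MB\n", $11, $6/1024}'
```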

1.3 File Operations at Scale

You have 200 log files from last month and you need to find which ones contain a specific error code, then extract the timestamps of each occurrence into a CSV. Describing this to Claude without bash access means Claude writes you a script, you save it, chmod it, run it, fix the one thing that did not work, and run it again.

With bash MCP, you say:

Search all .log files in /var/log/myapp/ from February for error code E4012, extract the timestamps, and save them to ~/Desktop/e4012-timestamps.csv

Claude writes the pipeline, executes it, checks the output, and tells you it is done. If something fails it adjusts and retries.
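For a sense of what the pipeline Claude builds actually looks like, here is a self-contained rehearsal against made-up fixture logs. The directory, the E4012 code, and the timestamp format are all illustrative.

```shell
# Rehearsal of the log-mining pipeline on throwaway fixtures.
logs="$(mktemp -d)"
printf '2026-02-01T10:00:01 E4012 payment failed\n2026-02-01T10:05:09 E2001 retry ok\n' > "$logs/app-01.log"
printf '2026-02-02T11:30:44 E4012 payment failed\n' > "$logs/app-02.log"

out="$(mktemp)"
echo "timestamp,file" > "$out"
for f in "$logs"/*.log; do
  # First field is the timestamp; tag each hit with its source file.
  grep 'E4012' "$f" | awk -v file="${f##*/}" '{print $1 "," file}'
done >> "$out"

echo "matches: $(( $(wc -l < "$out") - 1 ))"
cat "$out"
```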

1.4 Git Operations and Code Exploration

You are picking up an unfamiliar codebase. Instead of manually navigating directories, you ask Claude:

Show me the directory structure of this repo, find all Python files that import redis, and tell me how many lines of code are in each one

Claude runs find, grep, and wc itself, then gives you an annotated summary. You can follow up with questions like “show me the largest one” and Claude will cat the file and walk you through it.
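As a concrete rehearsal of that find, grep, and wc combination, here is the same idea run against a throwaway two-file mini-repo (redis and the file names are illustrative fixtures).

```shell
# find + grep + wc, rehearsed on a made-up mini-repo.
repo="$(mktemp -d)"
printf 'import redis\nimport os\n\nclient = redis.Redis()\n' > "$repo/cache.py"
printf 'import os\n' > "$repo/util.py"

# Python files that import redis, with a line count for each.
report="$(find "$repo" -name '*.py' -exec grep -l '^import redis' {} + \
  | while read -r f; do lines=$(wc -l < "$f"); echo "${f##*/}: $((lines)) lines"; done)"
echo "$report"
```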

1.5 Environment Setup and Configuration

You are setting up a new development environment and need to install dependencies, configure services, and verify everything works. Instead of following a README step by step, you point Claude at it:

Read the SETUP.md in this repo and execute the setup steps for a macOS development environment. Stop and ask me before doing anything destructive.

Claude reads the file, runs each installation command, checks for errors, and reports back. You stay in control of anything risky, but you are not manually typing brew install forty times.

2. What the Script Does

The installation script below handles the full setup in one shot:

  1. Creates a local MCP launcher script at ~/mcp/run-bash-mcp.sh that runs a bash MCP server via npx bash-mcp
  2. Locates your Claude Desktop config file automatically (macOS and Windows paths)
  3. Creates a timestamped backup of the existing config
  4. Safely merges the required mcpServers entry using jq without overwriting your other MCP servers
  5. Sets correct file permissions
  6. Validates the JSON and restores the backup if anything goes wrong

After a restart, Claude Desktop will have a tool called myLocalBashServer available in every conversation.

3. One Command Installation

I dislike wasting time following step by step guides. So just copy this entire block into your Terminal and run it. Done!

cat << 'EOF' > ~/claude-enable-bash-mcp.sh
#!/usr/bin/env bash
set -euo pipefail

SERVER_NAME="myLocalBashServer"
MCP_PACKAGE="bash-mcp"

die() { echo "ERROR: $*" >&2; exit 1; }
have() { command -v "$1" >/dev/null 2>&1; }
timestamp() { date +"%Y%m%d-%H%M%S"; }

echo "Creating MCP launcher..."

mkdir -p "$HOME/mcp"
LAUNCHER="$HOME/mcp/run-bash-mcp.sh"

cat > "$LAUNCHER" <<LAUNCH_EOF
#!/usr/bin/env bash
set -euo pipefail

export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:\$PATH"

if ! command -v node >/dev/null 2>&1; then
  echo "node is not installed or not on PATH" >&2
  exit 1
fi

exec npx ${MCP_PACKAGE}
LAUNCH_EOF

chmod +x "$LAUNCHER"

echo "Locating Claude Desktop config..."

OS="$(uname -s || true)"
CONFIG=""

if [[ "$OS" == "Darwin" ]]; then
  C1="$HOME/Library/Application Support/Claude/claude_desktop_config.json"
  C2="$HOME/Library/Application Support/Anthropic/Claude/claude_desktop_config.json"
  C3="$HOME/Library/Application Support/claude/claude_desktop_config.json"

  if [[ -f "$C1" || -d "$(dirname "$C1")" ]]; then CONFIG="$C1"; fi
  if [[ -z "$CONFIG" && ( -f "$C2" || -d "$(dirname "$C2")" ) ]]; then CONFIG="$C2"; fi
  if [[ -z "$CONFIG" && ( -f "$C3" || -d "$(dirname "$C3")" ) ]]; then CONFIG="$C3"; fi
fi

if [[ -z "$CONFIG" && -n "${APPDATA:-}" ]]; then
  W1="${APPDATA}/Claude/claude_desktop_config.json"
  if [[ -f "$W1" || -d "$(dirname "$W1")" ]]; then CONFIG="$W1"; fi
fi

[[ -n "$CONFIG" ]] || die "Could not determine Claude Desktop config path. Open Claude Desktop → Settings → Developer → Edit Config once, then rerun this script."

mkdir -p "$(dirname "$CONFIG")"

if ! have jq; then
  if [[ "$OS" == "Darwin" ]] && have brew; then
    echo "Installing jq..."
    brew install jq
  else
    die "jq is required. Install it and rerun."
  fi
fi

if [[ ! -f "$CONFIG" ]]; then
  echo '{}' > "$CONFIG"
fi

BACKUP="${CONFIG}.bak.$(timestamp)"
cp -f "$CONFIG" "$BACKUP"

echo "Updating Claude config..."

if ! jq . "$CONFIG" >/dev/null 2>&1; then
  cp -f "$BACKUP" "$CONFIG"
  die "Config was invalid JSON. Restored backup."
fi

TMP="$(mktemp)"

jq --arg name "$SERVER_NAME" --arg cmd "$LAUNCHER" '
  .mcpServers = (.mcpServers // {}) |
  .mcpServers[$name] = (
    (.mcpServers[$name] // {}) |
    .command = $cmd
  )
' "$CONFIG" > "$TMP"

mv "$TMP" "$CONFIG"

echo ""
echo "DONE."
echo ""
echo "Launcher created at:"
echo "  $LAUNCHER"
echo ""
echo "Claude config updated at:"
echo "  $CONFIG"
echo ""
echo "Backup saved at:"
echo "  $BACKUP"
echo ""
echo "IMPORTANT: Completely quit Claude Desktop and relaunch it."
echo "Claude only loads MCP servers on startup."
echo ""
echo "Then try:"
echo "  Use the MCP tool ${SERVER_NAME} to run: pwd"
echo ""
EOF

chmod +x ~/claude-enable-bash-mcp.sh
~/claude-enable-bash-mcp.sh

4. What Happens Under the Hood

Claude Desktop runs local tools using MCP. The config file contains a key called mcpServers. Each entry defines a command Claude launches when it starts.

The script creates ~/mcp/run-bash-mcp.sh which uses npx bash-mcp to expose shell execution as a tool. The launcher explicitly sets PATH to include common binary locations like /opt/homebrew/bin because GUI launched apps on macOS do not inherit your shell profile. Without this, Node would not be found even if it is installed.

The config update uses jq to merge the new server entry into your existing config rather than replacing the whole file. If you already have other MCP servers configured they will not be touched. If the existing config is invalid JSON, the script restores the backup and exits rather than making things worse.
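To see what that merge actually does, here is an isolated sketch that runs the same jq expression against a hypothetical config which already contains an unrelated server (the "otherServer" name is made up purely for illustration):

```shell
# Hypothetical existing config with one unrelated MCP server registered
EXISTING='{"mcpServers":{"otherServer":{"command":"/usr/local/bin/other"}}}'

# Same merge expression the installer uses: add or update one entry,
# leave everything else untouched
MERGED=$(echo "$EXISTING" | jq --arg name "myLocalBashServer" --arg cmd "$HOME/mcp/run-bash-mcp.sh" '
  .mcpServers = (.mcpServers // {}) |
  .mcpServers[$name] = ((.mcpServers[$name] // {}) | .command = $cmd)
')

# Both servers survive the merge
echo "$MERGED" | jq -r '.mcpServers | keys | join(", ")'
```

Running it prints both server names, which is exactly why your existing MCP entries are safe.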

5. Test It

After restarting Claude Desktop, open a new chat and type:

Use your MCP myLocalBashServer to run: ls -la

If everything worked, Claude will call the MCP tool and return your directory listing. From there you can ask it to do anything your shell can do.

Some good first tests:

Use your MCP to show me disk usage on this machine

Use your MCP to determine what versions of Python and Node I have installed

Use your MCP to find all files larger than 100MB in my home directory

6. Security Warning

You are giving Claude the ability to execute shell commands with your user permissions. That means file access, deletion, modification, everything your account can do.

Only enable this on a machine you control. Consider creating a dedicated limited permission user if you want stronger isolation. Claude will ask for confirmation before running destructive commands in most cases, but the capability is there.

That is it. One script. Full setup. No clicking through menus.

Simple Guide to Publishing Your Code on GitHub

GitHub is not just a code hosting platform. It is your public engineering ledger. It shows how you think, how you structure problems, how you document tradeoffs, and how you ship. If you build software and it never lands on GitHub, as far as the wider technical world is concerned, it does not exist.

This guide walks you from nothing to a clean public repository that is properly licensed, tagged, and released. No clicking around aimlessly. No half-configured repos. No “I’ll tidy it later.” We will automate the entire process.

1 Why GitHub Matters

Before the mechanics, understand the leverage. Recruiters, engineers, and contributors can see your work, which gives you visibility you cannot get any other way. Clean commits and structured repos demonstrate discipline, and that builds credibility in a way that talking about your work never will. Tags and releases formalise change through proper versioning, and GitHub Releases turn your repo into a distribution channel. Beyond all of that, issues and pull requests scale development beyond you by opening the door to community contribution.

If you are building WordPress plugins, internal tooling, or AI integrations, publishing them properly is a signal. Discipline in open source hygiene matters.

2 The Manual Way vs The Correct Way

The manual way looks like this: install Git, create a repo in the browser, clone it, copy your files across, add a README, add a LICENSE, commit, push, tag, upload a release, add topics, then go back and fix all the mistakes you made along the way. That is friction. Friction creates inconsistency. Inconsistency creates messy repos.

Instead, automate it once and reuse it.

3 One Shot GitHub Publish Script (macOS)

The script below handles everything in a single pass. It installs Homebrew if needed, then installs Git and GitHub CLI. It authenticates you with GitHub via browser OAuth so you never have to manually create tokens. It then scaffolds a clean project directory with an MIT license, a sensible .gitignore, and a README.md. From there it initialises Git, creates the public GitHub repo, pushes the initial commit, tags a release, and sets repository topics. You edit three variables at the top of the script and the rest takes care of itself.

Save this as github-publish.sh, then run it:

chmod +x github-publish.sh
./github-publish.sh

Here is the script:

#!/usr/bin/env bash
# ============================================================================
# github-publish.sh
#
# One shot script to install tools, create a public GitHub repo, and publish
# your project as a clean, properly licensed open source repository.
#
# What it does:
#   1. Installs Homebrew, Git, and GitHub CLI (gh) if not already present
#   2. Authenticates with GitHub via browser OAuth
#   3. Scaffolds LICENSE, .gitignore, and README.md
#   4. Creates the public repo, pushes, tags a release, and sets topics
#
# Usage:
#   chmod +x github-publish.sh
#   ./github-publish.sh
#
# Prerequisites:
#   macOS with admin rights.
# ============================================================================

set -euo pipefail

# ---------- configuration (edit these three lines) ----------
REPO_NAME="my-project"
REPO_DESC="A short description of what your project does."
VERSION="1.0.0"
# ------------------------------------------------------------

echo ""
echo "========================================="
echo " GitHub Open Source Publish"
echo " Project: $REPO_NAME"
echo "========================================="
echo ""

# ── 1. Homebrew ──────────────────────────────────────────────────────────────
if ! command -v brew &>/dev/null; then
    echo "[1/7] Installing Homebrew..."
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    if [[ -f /opt/homebrew/bin/brew ]]; then
        eval "$(/opt/homebrew/bin/brew shellenv)"
    fi
else
    echo "[1/7] Homebrew already installed."
fi

# ── 2. Git ───────────────────────────────────────────────────────────────────
if ! command -v git &>/dev/null; then
    echo "[2/7] Installing Git..."
    brew install git
else
    echo "[2/7] Git already installed ($(git --version))."
fi

# ── 3. GitHub CLI ────────────────────────────────────────────────────────────
if ! command -v gh &>/dev/null; then
    echo "[3/7] Installing GitHub CLI..."
    brew install gh
else
    echo "[3/7] GitHub CLI already installed ($(gh --version | head -1))."
fi

# ── 4. GitHub auth ───────────────────────────────────────────────────────────
if ! gh auth status &>/dev/null; then
    echo "[4/7] Logging into GitHub..."
    echo "       A browser window will open. Approve the OAuth request."
    gh auth login --web --git-protocol https
else
    echo "[4/7] Already authenticated with GitHub."
fi

# ── 5. Scaffold project ─────────────────────────────────────────────────────
echo "[5/7] Scaffolding project directory..."

mkdir -p "$REPO_NAME"
cd "$REPO_NAME"

# MIT license (swap this for GPLv2 or Apache if you prefer)
GH_USER=$(gh api user --jq .login)
YEAR=$(date +%Y)

cat > LICENSE << EOF
MIT License

Copyright (c) $YEAR $GH_USER

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
EOF

# .gitignore
cat > .gitignore << 'GITIGNORE'
# macOS
.DS_Store
._*

# IDE
.idea/
.vscode/
*.swp
*.swo

# Build artifacts
node_modules/
vendor/
dist/
*.zip
*.tar.gz

# Environment
.env
.env.local
GITIGNORE

# README.md
cat > README.md << README
# $REPO_NAME

$REPO_DESC

## Getting Started

Clone the repository and you are ready to go:

\`\`\`bash
git clone https://github.com/$GH_USER/$REPO_NAME.git
cd $REPO_NAME
\`\`\`

## License

MIT License. See [LICENSE](LICENSE) for the full text.

## Author

$GH_USER
README

# Initialise git repo
git init -b main
git add -A

# Set commit identity from GitHub if not already configured globally
if ! git config user.email &>/dev/null; then
    GH_EMAIL=$(gh api user --jq '.email // empty')
    if [[ -z "$GH_EMAIL" ]]; then
        GH_EMAIL="${GH_USER}@users.noreply.github.com"
    fi
    git config user.name "$GH_USER"
    git config user.email "$GH_EMAIL"
fi

git commit -m "Initial commit: $REPO_NAME v${VERSION}"

# ── 6. Create repo and push ─────────────────────────────────────────────────
echo "[6/7] Creating public GitHub repo and pushing..."

gh repo create "$REPO_NAME" \
    --public \
    --description "$REPO_DESC" \
    --source . \
    --remote origin \
    --push

# ── 7. Tag release and set topics ───────────────────────────────────────────
echo "[7/7] Creating release and setting topics..."

gh release create "v${VERSION}" \
    --title "v${VERSION}" \
    --notes "Initial open source release of $REPO_NAME."

# Add your own topics here. These help people discover your repo.
gh repo edit \
    --add-topic open-source

REPO_URL=$(gh repo view --json url --jq .url)

echo ""
echo "========================================="
echo " Done!"
echo ""
echo " Repository: $REPO_URL"
echo " Release:    $REPO_URL/releases/tag/v${VERSION}"
echo "========================================="

When the script completes, you have a public repository with a clean initial commit, a tagged release, and a structured open source project ready for contribution. The whole thing runs in under a minute on a machine that already has Homebrew installed.

4 Anatomy of a Good README

The README is the front door of your project. Most developers either skip it entirely or write something so vague it tells you nothing. A good README answers three questions immediately: what does this project do, how do I use it, and where is the license.

Here is a minimal example that covers the essentials:

# hello-world

A minimal CLI tool that prints a greeting. Built as a reference for clean
GitHub repository structure.

## Getting Started

Clone the repository:

    git clone https://github.com/your-username/hello-world.git
    cd hello-world

Run the script:

    python hello.py

You should see:

    Hello, world!

## Usage

Pass a name as an argument to personalise the greeting:

    python hello.py Andrew
    Hello, Andrew!

## Requirements

Python 3.8 or higher. No external dependencies.

## License

MIT License. See [LICENSE](LICENSE) for the full text.

## Author

[Your Name](https://your-site.com)

That is enough to tell someone everything they need to know in thirty seconds. You can always expand it later with sections for configuration, contributing guidelines, or architecture notes, but this baseline should exist from day one.

5 Final Thought

Most developers overthink GitHub and underinvest in automation. The difference between a hobby repo and a professional one is not complexity. It is structure.

Automate structure once. Then focus on shipping. Your code deserves to exist in public properly.

What is Minification and How to Test if it is Actually Working

1. What is Minification

Minification is the process of removing everything from source code that a browser does not need to execute it. This includes whitespace, line breaks, comments, and long variable names. The resulting file is functionally identical to the original but significantly smaller.

A CSS file written for human readability might look like this:

/* Main navigation styles */
.nav-container {
    display: flex;
    align-items: center;
    padding: 16px 24px;
    background-color: #ffffff;
}

After minification it becomes:

.nav-container{display:flex;align-items:center;padding:16px 24px;background-color:#fff}

Same output in the browser. A fraction of the bytes over the wire.
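The saving is easy to quantify. This sketch writes both versions of the snippet above to temporary files and compares their byte counts (the /tmp paths are illustrative):

```shell
# Readable version
cat > /tmp/nav.css <<'CSS'
/* Main navigation styles */
.nav-container {
    display: flex;
    align-items: center;
    padding: 16px 24px;
    background-color: #ffffff;
}
CSS

# Minified version
printf '.nav-container{display:flex;align-items:center;padding:16px 24px;background-color:#fff}\n' > /tmp/nav.min.css

# Compare byte counts; even this tiny rule shrinks by roughly 40%,
# and the ratio improves on larger real-world stylesheets
wc -c /tmp/nav.css /tmp/nav.min.css
```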

The same principle applies to JavaScript and HTML. Minified JavaScript also shortens variable and function names where safe to do so, which compounds the size reduction.

2. Why it Matters

Every byte of CSS and JavaScript has to be downloaded before the browser can render the page. Unminified assets increase page load time, increase bandwidth costs, and penalise your Core Web Vitals scores. On mobile connections the difference between a minified and unminified asset bundle can be several seconds of load time.

The impact is not theoretical. A JavaScript file that is 400KB unminified is routinely 150KB or less after minification, before any compression is applied on top.
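Compression stacks on top of minification rather than replacing it: gzip helps both files, but starting from fewer bytes still wins. A quick local demonstration, using a crude whitespace-stripping stand-in for a real minifier (file names and the generated stylesheet are illustrative):

```shell
# Generate a repetitive "stylesheet" large enough to measure
for i in $(seq 1 200); do
  printf '.rule-%s {\n    margin: 0 auto;\n    color: #333333;\n}\n' "$i"
done > /tmp/big.css

# Crude minification: strip newlines and indentation
tr -d '\n' < /tmp/big.css | sed 's/    //g' > /tmp/big.min.css

# Compress both, keeping the originals
gzip -kf /tmp/big.css /tmp/big.min.css

# The minified file is smaller before compression, and typically
# stays ahead after compression too
wc -c /tmp/big.css /tmp/big.min.css /tmp/big.css.gz /tmp/big.min.css.gz
```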

3. Why You Cannot Assume it is Working

Most teams believe minification is handled because something is configured to do it. A WordPress plugin is active. A CDN toggle is switched on. A build pipeline supposedly runs. Everyone moves on. Performance is declared handled.

But assumptions fail silently. Plugins get disabled during updates. Build steps get bypassed in deployment shortcuts. CDN configurations get overridden. The site continues to serve unminified assets while the team believes otherwise.

The only way to know is to measure it.

4. How to Test it From the Terminal

The script below checks your HTML, CSS, and JavaScript assets and colour codes the results. Green means likely minified. Red means likely not minified. Yellow means borderline and worth inspecting.

4.1 Create the Script

Paste this into your terminal:

cat << 'EOF' > check-minify.sh
#!/bin/bash

if [ -z "$1" ]; then
  echo "Usage: ./check-minify.sh https://yourdomain.com"
  exit 1
fi

SITE=$1

GREEN="\033[0;32m"
RED="\033[0;31m"
YELLOW="\033[1;33m"
NC="\033[0m"

colorize() {
  LINES=$1
  TYPE=$2

  if [ "$TYPE" = "html" ]; then
    if [ "$LINES" -lt 50 ]; then
      echo -e "${GREEN}$LINES (minified)${NC}"
    else
      echo -e "${RED}$LINES (not minified)${NC}"
    fi
  else
    if [ "$LINES" -lt 50 ]; then
      echo -e "${GREEN}$LINES (minified)${NC}"
    elif [ "$LINES" -lt 200 ]; then
      echo -e "${YELLOW}$LINES (borderline)${NC}"
    else
      echo -e "${RED}$LINES (not minified)${NC}"
    fi
  fi
}

echo
echo "Checking $SITE"
echo

echo "---- HTML ----"
HTML_LINES=$(curl -s "$SITE" | wc -l)
echo -n "HTML lines: "
colorize $HTML_LINES "html"
echo

echo "---- CSS ----"
curl -s "$SITE" \
| grep -oE 'https?://[^"]+\.css' \
| sort -u \
| while read -r url; do
    LINES=$(curl -s "$url" | wc -l)
    SIZE=$(curl -s -I "$url" | grep -i content-length | awk '{print $2}' | tr -d '\r')
    echo "$url"
    echo -n "  Lines: "
    colorize $LINES "asset"
    echo "  Size:  ${SIZE:-unknown} bytes"
    echo
done

echo "---- JS ----"
curl -s "$SITE" \
| grep -oE 'https?://[^"]+\.js' \
| sort -u \
| while read -r url; do
    LINES=$(curl -s "$url" | wc -l)
    SIZE=$(curl -s -I "$url" | grep -i content-length | awk '{print $2}' | tr -d '\r')
    echo "$url"
    echo -n "  Lines: "
    colorize $LINES "asset"
    echo "  Size:  ${SIZE:-unknown} bytes"
    echo
done
EOF

chmod +x check-minify.sh

4.2 Run It

./check-minify.sh https://yourdomain.com

5. Reading the Output

5.1 Properly Minified Site

Checking https://example.com

---- HTML ----
HTML lines: 18 (minified)

---- CSS ----
https://example.com/assets/main.min.css
  Lines: 6 (minified)
  Size:  48213 bytes

---- JS ----
https://example.com/assets/app.min.js
  Lines: 3 (minified)
  Size:  164882 bytes

5.2 Site Leaking Development Assets

Checking https://example.com

---- HTML ----
HTML lines: 1243 (not minified)

---- CSS ----
https://example.com/assets/main.css
  Lines: 892 (not minified)
  Size:  118432 bytes

---- JS ----
https://example.com/assets/app.js
  Lines: 2147 (not minified)
  Size:  402771 bytes

The line count is the key signal. A minified CSS or JavaScript file collapses to a handful of lines regardless of how large the source file was. If you are seeing hundreds or thousands of lines in production, your minification pipeline is not running.
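A complementary signal is average bytes per line: minified bundles pack hundreds or thousands of characters onto each line, while hand-written source rarely exceeds a hundred. A minimal sketch; the helper function and the sample files are my own illustration, not part of the script above:

```shell
# Average bytes per line; minified assets score very high,
# readable source scores low
avg_bytes_per_line() {
  bytes=$(wc -c < "$1")
  lines=$(wc -l < "$1")
  if [ "$lines" -eq 0 ]; then
    lines=1   # a file with no trailing newline makes wc -l report 0
  fi
  echo $(( bytes / lines ))
}

# Tiny illustrative files: same rules, different formatting
printf '.a{color:red}.b{margin:0}.c{padding:4px}\n' > /tmp/sample.min.css
printf '.a {\n  color: red;\n}\n.b {\n  margin: 0;\n}\n' > /tmp/sample.css

avg_bytes_per_line /tmp/sample.min.css   # everything on one line: high
avg_bytes_per_line /tmp/sample.css       # many short lines: low
```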

Minification is not a belief system. It is measurable in seconds from a terminal. If performance matters, verify it.

Automatically Recovering a Failed WordPress Instance on AWS

When WordPress goes down on your AWS instance, waiting for manual intervention means downtime and lost revenue. Here are two robust approaches to automatically detect and recover from WordPress failures.

Approach 1: Lambda Based Intelligent Recovery

This approach tries the least disruptive fix first (restarting services) before escalating to a full instance reboot.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k is needed because the TLS certificate is issued for your domain,
# not for localhost
response=$(curl -sk -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Install CloudWatch Agent on Your EC2 Instance

Still on your EC2 instance, download and install the CloudWatch agent:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

Step 3: Create Metric Publishing Script on Your EC2 Instance

This script sends the health check result to CloudWatch (cron will run it every minute in the next step). It uses the AWS CLI, so make sure that is installed on the instance:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value $HEALTH \
  --dimensions Instance=$INSTANCE_ID \
  --region $REGION
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it:

/usr/local/bin/send-wordpress-metric.sh

If you get permission errors, ensure your EC2 instance has an IAM role with CloudWatch permissions.
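If the role is missing that permission, a minimal inline policy is enough. A sketch, assuming the role attached to your instance profile is called WordPressEC2Role (substitute your own role name); it grants only the single action the metric script needs:

```shell
# "WordPressEC2Role" is a hypothetical name; use whichever role is
# attached to your instance profile
aws iam put-role-policy \
  --role-name WordPressEC2Role \
  --policy-name PutHealthMetric \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }]
  }'
```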

Step 4: Add Health Check to Cron on Your EC2 Instance

This runs the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 5: Create IAM Role for Lambda on Your Laptop

Now switch to your laptop (or use AWS CloudShell in your browser). You’ll need the AWS CLI installed and configured with credentials.

Create the IAM role that Lambda will use:

aws iam create-role \
  --role-name WordPressRecoveryRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

Attach the necessary policies:

aws iam attach-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

aws iam put-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-name EC2SSMAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ec2:RebootInstances",
          "ec2:DescribeInstances",
          "ssm:SendCommand",
          "ssm:GetCommandInvocation"
        ],
        "Resource": "*"
      }
    ]
  }'

Step 6: Create Lambda Function on Your Laptop

On your laptop, create a file called wordpress-recovery.py in a new directory:

import boto3
import os
import time

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    instance_id = os.environ.get('INSTANCE_ID')
    
    if not instance_id:
        return {'statusCode': 400, 'body': 'INSTANCE_ID not configured'}
    
    print(f"WordPress health check failed for {instance_id}")
    
    # Step 1: Try restarting services
    try:
        print("Attempting to restart services...")
        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    'systemctl restart php-fpm || systemctl restart php8.2-fpm || systemctl restart php8.1-fpm',
                    'systemctl restart nginx || systemctl restart apache2',
                    'sleep 30',
                    'curl -fk https://localhost || exit 1'
                ]
            },
            TimeoutSeconds=120
        )
        
        command_id = response['Command']['CommandId']
        print(f"Command ID: {command_id}")
        
        # Wait for command to complete
        time.sleep(35)
        
        result = ssm.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id
        )
        
        if result['Status'] == 'Success':
            print("Services restarted successfully")
            return {'statusCode': 200, 'body': 'Services restarted successfully'}
        else:
            print(f"Service restart failed with status: {result['Status']}")
    
    except Exception as e:
        print(f"Service restart failed with error: {str(e)}")
    
    # Step 2: Reboot the instance as last resort
    try:
        print(f"Rebooting instance {instance_id}")
        ec2.reboot_instances(InstanceIds=[instance_id])
        return {'statusCode': 200, 'body': 'Instance rebooted'}
    except Exception as e:
        print(f"Reboot failed: {str(e)}")
        return {'statusCode': 500, 'body': f'Recovery failed: {str(e)}'}

Create the deployment package:

zip wordpress-recovery.zip wordpress-recovery.py

Get your AWS account ID:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Deploy the Lambda function (replace i-1234567890abcdef0 with your actual instance ID and us-east-1 with your region):

aws lambda create-function \
  --function-name wordpress-recovery \
  --runtime python3.11 \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/WordPressRecoveryRole \
  --handler wordpress-recovery.lambda_handler \
  --zip-file fileb://wordpress-recovery.zip \
  --timeout 180 \
  --region us-east-1 \
  --environment Variables={INSTANCE_ID=i-1234567890abcdef0}

Step 7: Create CloudWatch Alarm on Your Laptop

Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching

This alarm triggers if the health check fails for 10 minutes (2 periods of 5 minutes each).

Step 8: Connect Alarm to Lambda on Your Laptop

Create an SNS topic (replace us-east-1 with your region):

aws sns create-topic --name wordpress-recovery-topic --region us-east-1

Get the topic ARN:

export TOPIC_ARN=$(aws sns list-topics --region us-east-1 --query 'Topics[?contains(TopicArn, `wordpress-recovery-topic`)].TopicArn' --output text)

Subscribe Lambda to the topic:

aws sns subscribe \
  --region us-east-1 \
  --topic-arn ${TOPIC_ARN} \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:${AWS_ACCOUNT_ID}:function:wordpress-recovery

Give SNS permission to invoke Lambda:

aws lambda add-permission \
  --region us-east-1 \
  --function-name wordpress-recovery \
  --statement-id AllowSNSInvoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn ${TOPIC_ARN}

Update the CloudWatch alarm to notify SNS (replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region):

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions ${TOPIC_ARN}

Approach 2: Custom Health Check with CloudWatch Reboot

This approach is simpler than the Lambda version. It uses a custom CloudWatch metric based on checking your WordPress homepage, then automatically reboots when the check fails.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k is needed because the TLS certificate is issued for your domain,
# not for localhost
response=$(curl -sk -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Create Metric Publishing Script on Your EC2 Instance

This script sends the health check result to CloudWatch:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value $HEALTH \
  --dimensions Instance=$INSTANCE_ID \
  --region $REGION
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it (ensure your EC2 instance has an IAM role with CloudWatch permissions):

/usr/local/bin/send-wordpress-metric.sh

Step 3: Add Health Check to Cron on Your EC2 Instance

Run the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 4: Create CloudWatch Alarm with Reboot Action on Your Laptop

Now from your laptop (or AWS CloudShell), create the alarm. Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-health-reboot \
  --alarm-description "Reboot instance when WordPress health check fails" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot

This will reboot your instance if WordPress fails health checks for 10 minutes (2 periods of 5 minutes).

That’s it. The entire setup is contained in 4 steps, and there’s no Lambda function to maintain. When WordPress goes down, CloudWatch will automatically reboot your instance.

Which Approach Should You Use?

Use Lambda Recovery (Approach 1) if:

  • You want intelligent recovery that tries service restart before rebooting
  • You need visibility into what recovery actions are taken
  • You want to extend the logic later (notifications, multiple recovery steps, etc)
  • You have SSM agent installed on your instance

Use Custom Health Check Reboot (Approach 2) if:

  • You want a simple solution with minimal moving parts
  • A full reboot is acceptable for all WordPress failures
  • You don’t need to try service restarts before rebooting
  • You prefer fewer AWS services to maintain

The Lambda approach is more sophisticated and tries to minimize downtime by restarting services first. The custom health check reboot approach is simpler, requires no Lambda function, but always reboots the entire instance.

Testing Your Setup

For Lambda Approach

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Watch the Lambda logs from your laptop:

aws logs tail /aws/lambda/wordpress-recovery --follow --region us-east-1

After 10 minutes, you should see the Lambda function trigger and attempt to restart services.

For Custom Health Check Reboot

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Check that the metric is being sent from your laptop:

aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average

You should see values of 0 appearing. After 10 minutes, your instance will automatically reboot.

Both approaches ensure your WordPress site recovers automatically without manual intervention.

Create / Migrate WordPress to AWS Graviton: Maximum Performance, Minimum Cost

Running WordPress on ARM-based Graviton instances delivers up to 40% better price-performance compared to x86 equivalents. This guide provides production-ready scripts to deploy an optimised WordPress stack in minutes, plus everything you need to migrate your existing site.

Why Graviton for WordPress?

Graviton3 processors deliver:

  • 40% better price-performance vs comparable x86 instances
  • Up to 25% lower cost for equivalent workloads
  • 60% less energy consumption per compute hour
  • Native ARM64 optimisations for PHP 8.x and MariaDB

The t4g.small instance (2 vCPU, 2GB RAM) at ~$12/month handles most WordPress sites comfortably. For higher traffic, t4g.medium or c7g instances scale beautifully.

Architecture

┌─────────────────────────────────────────────────┐
│                   CloudFront                     │
│              (Optional CDN Layer)                │
└─────────────────────┬───────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────┐
│              Graviton EC2 Instance               │
│  ┌─────────────────────────────────────────────┐│
│  │              Caddy (Reverse Proxy)          ││
│  │         Auto-TLS, HTTP/2, Compression       ││
│  └─────────────────────┬───────────────────────┘│
│                        │                         │
│  ┌─────────────────────▼───────────────────────┐│
│  │              PHP-FPM 8.3                     ││
│  │         OPcache, JIT Compilation            ││
│  └─────────────────────┬───────────────────────┘│
│                        │                         │
│  ┌─────────────────────▼───────────────────────┐│
│  │              MariaDB 10.11                   ││
│  │         InnoDB Optimised, Query Cache       ││
│  └─────────────────────────────────────────────┘│
│                                                  │
│  ┌─────────────────────────────────────────────┐│
│  │              EBS gp3 Volume                  ││
│  │         3000 IOPS, 125 MB/s baseline        ││
│  └─────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘

Prerequisites

  • AWS CLI configured with appropriate permissions
  • A domain name with DNS you control
  • SSH key pair in your target region

If you’d prefer to download these scripts, check out https://github.com/Scr1ptW0lf/wordpress-graviton.

Part 1: Launch the Instance

Save this as launch-graviton-wp.sh and run from AWS CloudShell:

#!/bin/bash

# AWS EC2 ARM Instance Launch Script with Elastic IP
# Launches ARM-based instances with Ubuntu 24.04 LTS ARM64

set -e

echo "=== AWS EC2 ARM Ubuntu Instance Launcher ==="
echo ""

# Function to get Ubuntu 24.04 ARM64 AMI for a region
get_ubuntu_ami() {
    local region=$1
    # Get the latest Ubuntu 24.04 LTS ARM64 AMI
    aws ec2 describe-images \
        --region "$region" \
        --owners 099720109477 \
        --filters "Name=name,Values=ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-arm64-server-*" \
                  "Name=state,Values=available" \
        --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
        --output text
}

# Check for default region
if [ -n "$AWS_DEFAULT_REGION" ]; then
    echo "AWS default region detected: $AWS_DEFAULT_REGION"
    read -p "Use this region? (y/n, default: y): " use_default
    use_default=${use_default:-y}
    
    if [ "$use_default" == "y" ]; then
        REGION="$AWS_DEFAULT_REGION"
        echo "Using region: $REGION"
    else
        use_default="n"
    fi
else
    use_default="n"
fi

# Prompt for region if not using default
if [ "$use_default" == "n" ]; then
    echo ""
    echo "Available regions for ARM instances:"
    echo "1. us-east-1 (N. Virginia)"
    echo "2. us-east-2 (Ohio)"
    echo "3. us-west-2 (Oregon)"
    echo "4. eu-west-1 (Ireland)"
    echo "5. eu-central-1 (Frankfurt)"
    echo "6. ap-southeast-1 (Singapore)"
    echo "7. ap-northeast-1 (Tokyo)"
    echo "8. Enter custom region"
    echo ""
    read -p "Select region (1-8): " region_choice

    case $region_choice in
        1) REGION="us-east-1" ;;
        2) REGION="us-east-2" ;;
        3) REGION="us-west-2" ;;
        4) REGION="eu-west-1" ;;
        5) REGION="eu-central-1" ;;
        6) REGION="ap-southeast-1" ;;
        7) REGION="ap-northeast-1" ;;
        8) read -p "Enter region code: " REGION ;;
        *) echo "Invalid choice"; exit 1 ;;
    esac
    
    echo "Selected region: $REGION"
fi

# Prompt for instance type
echo ""
echo "Select instance type (ARM/Graviton):"
echo "1. t4g.micro   (2 vCPU, 1 GB RAM)   - Free tier eligible"
echo "2. t4g.small   (2 vCPU, 2 GB RAM)   - ~\$0.0168/hr"
echo "3. t4g.medium  (2 vCPU, 4 GB RAM)   - ~\$0.0336/hr"
echo "4. t4g.large   (2 vCPU, 8 GB RAM)   - ~\$0.0672/hr"
echo "5. t4g.xlarge  (4 vCPU, 16 GB RAM)  - ~\$0.1344/hr"
echo "6. t4g.2xlarge (8 vCPU, 32 GB RAM)  - ~\$0.2688/hr"
echo "7. Enter custom ARM instance type"
echo ""
read -p "Select instance type (1-7): " instance_choice

case $instance_choice in
    1) INSTANCE_TYPE="t4g.micro" ;;
    2) INSTANCE_TYPE="t4g.small" ;;
    3) INSTANCE_TYPE="t4g.medium" ;;
    4) INSTANCE_TYPE="t4g.large" ;;
    5) INSTANCE_TYPE="t4g.xlarge" ;;
    6) INSTANCE_TYPE="t4g.2xlarge" ;;
    7) read -p "Enter instance type (e.g., c7g.medium): " INSTANCE_TYPE ;;
    *) echo "Invalid choice"; exit 1 ;;
esac

echo "Selected instance type: $INSTANCE_TYPE"
echo ""
echo "Fetching latest Ubuntu 24.04 ARM64 AMI..."

AMI_ID=$(get_ubuntu_ami "$REGION")

if [ -z "$AMI_ID" ]; then
    echo "Error: Could not find Ubuntu ARM64 AMI in region $REGION"
    exit 1
fi

echo "Found AMI: $AMI_ID"
echo ""

# List existing key pairs
echo "Fetching existing key pairs in $REGION..."
EXISTING_KEYS=$(aws ec2 describe-key-pairs \
    --region "$REGION" \
    --query 'KeyPairs[*].KeyName' \
    --output text 2>/dev/null || echo "")

if [ -n "$EXISTING_KEYS" ]; then
    echo "Existing key pairs in $REGION:"
    # Convert to array for number selection
    mapfile -t KEY_ARRAY < <(echo "$EXISTING_KEYS" | tr '\t' '\n')
    for i in "${!KEY_ARRAY[@]}"; do
        echo "$((i+1)). ${KEY_ARRAY[$i]}"
    done
    echo ""
else
    echo "No existing key pairs found in $REGION"
    echo ""
fi

# Prompt for key pair
read -p "Enter key pair name, number to select from list, or press Enter to create new: " KEY_INPUT
CREATE_NEW_KEY=false

if [ -z "$KEY_INPUT" ]; then
    KEY_NAME="arm-key-$(date +%s)"
    CREATE_NEW_KEY=true
    echo "Will create new key pair: $KEY_NAME"
elif [[ "$KEY_INPUT" =~ ^[0-9]+$ ]] && [ -n "$EXISTING_KEYS" ]; then
    # User entered a number
    if [ "$KEY_INPUT" -ge 1 ] && [ "$KEY_INPUT" -le "${#KEY_ARRAY[@]}" ]; then
        KEY_NAME="${KEY_ARRAY[$((KEY_INPUT-1))]}"
        echo "Will use existing key pair: $KEY_NAME"
    else
        echo "Invalid selection number"
        exit 1
    fi
else
    KEY_NAME="$KEY_INPUT"
    echo "Will use existing key pair: $KEY_NAME"
fi

echo ""

# List existing security groups
echo "Fetching existing security groups in $REGION..."
EXISTING_SGS=$(aws ec2 describe-security-groups \
    --region "$REGION" \
    --query 'SecurityGroups[*].[GroupId,GroupName,Description]' \
    --output text 2>/dev/null || echo "")

if [ -n "$EXISTING_SGS" ]; then
    echo "Existing security groups in $REGION:"
    # Convert to arrays for number selection
    mapfile -t SG_LINES < <(echo "$EXISTING_SGS")
    declare -a SG_ID_ARRAY
    declare -a SG_NAME_ARRAY
    declare -a SG_DESC_ARRAY

    for line in "${SG_LINES[@]}"; do
        read -r sg_id sg_name sg_desc <<< "$line"
        SG_ID_ARRAY+=("$sg_id")
        SG_NAME_ARRAY+=("$sg_name")
        SG_DESC_ARRAY+=("$sg_desc")
    done

    for i in "${!SG_ID_ARRAY[@]}"; do
        echo "$((i+1)). ${SG_ID_ARRAY[$i]} - ${SG_NAME_ARRAY[$i]} (${SG_DESC_ARRAY[$i]})"
    done
    echo ""
else
    echo "No existing security groups found in $REGION"
    echo ""
fi

# Prompt for security group
read -p "Enter security group ID, number to select from list, or press Enter to create new: " SG_INPUT
CREATE_NEW_SG=false

if [ -z "$SG_INPUT" ]; then
    SG_NAME="arm-sg-$(date +%s)"
    CREATE_NEW_SG=true
    echo "Will create new security group: $SG_NAME"
    echo "  - Port 22 (SSH) - open to 0.0.0.0/0"
    echo "  - Port 80 (HTTP) - open to 0.0.0.0/0"
    echo "  - Port 443 (HTTPS) - open to 0.0.0.0/0"
elif [[ "$SG_INPUT" =~ ^[0-9]+$ ]] && [ -n "$EXISTING_SGS" ]; then
    # User entered a number
    if [ "$SG_INPUT" -ge 1 ] && [ "$SG_INPUT" -le "${#SG_ID_ARRAY[@]}" ]; then
        SG_ID="${SG_ID_ARRAY[$((SG_INPUT-1))]}"
        echo "Will use existing security group: $SG_ID (${SG_NAME_ARRAY[$((SG_INPUT-1))]})"
        echo "Note: Ensure ports 22, 80, and 443 are open if needed"
    else
        echo "Invalid selection number"
        exit 1
    fi
else
    SG_ID="$SG_INPUT"
    echo "Will use existing security group: $SG_ID"
    echo "Note: Ensure ports 22, 80, and 443 are open if needed"
fi

echo ""

# Prompt for Elastic IP
read -p "Allocate and assign an Elastic IP? (y/n, default: n): " ALLOCATE_EIP
ALLOCATE_EIP=${ALLOCATE_EIP:-n}

echo ""
read -p "Enter instance name tag (default: ubuntu-arm-instance): " INSTANCE_NAME
INSTANCE_NAME=${INSTANCE_NAME:-ubuntu-arm-instance}

echo ""
echo "=== Launch Configuration ==="
echo "Region: $REGION"
echo "Instance Type: $INSTANCE_TYPE"
echo "AMI: $AMI_ID (Ubuntu 24.04 ARM64)"
echo "Key Pair: $KEY_NAME $([ "$CREATE_NEW_KEY" == true ] && echo '(will be created)')"
echo "Security Group: $([ "$CREATE_NEW_SG" == true ] && echo "$SG_NAME (will be created)" || echo "$SG_ID")"
echo "Name: $INSTANCE_NAME"
echo "Elastic IP: $([ "$ALLOCATE_EIP" == "y" ] && echo 'Yes' || echo 'No')"
echo ""
read -p "Launch instance? (y/n, default: y): " CONFIRM
CONFIRM=${CONFIRM:-y}

if [ "$CONFIRM" != "y" ]; then
    echo "Launch cancelled"
    exit 0
fi

echo ""
echo "Starting launch process..."

# Create key pair if needed
if [ "$CREATE_NEW_KEY" == true ]; then
    echo ""
    echo "Creating key pair: $KEY_NAME"
    aws ec2 create-key-pair \
        --region "$REGION" \
        --key-name "$KEY_NAME" \
        --query 'KeyMaterial' \
        --output text > "${KEY_NAME}.pem"
    chmod 400 "${KEY_NAME}.pem"
    echo "  ✓ Key saved to: ${KEY_NAME}.pem"
    echo "  ⚠️  IMPORTANT: Download this key file from CloudShell if you need it elsewhere!"
fi

# Create security group if needed
if [ "$CREATE_NEW_SG" == true ]; then
    echo ""
    echo "Creating security group: $SG_NAME"
    
    # Get default VPC
    VPC_ID=$(aws ec2 describe-vpcs \
        --region "$REGION" \
        --filters "Name=isDefault,Values=true" \
        --query 'Vpcs[0].VpcId' \
        --output text)
    
    if [ -z "$VPC_ID" ] || [ "$VPC_ID" == "None" ]; then
        echo "Error: No default VPC found. Please specify a security group ID."
        exit 1
    fi
    
    SG_ID=$(aws ec2 create-security-group \
        --region "$REGION" \
        --group-name "$SG_NAME" \
        --description "Security group for ARM instance with web access" \
        --vpc-id "$VPC_ID" \
        --query 'GroupId' \
        --output text)
    
    echo "  ✓ Created security group: $SG_ID"
    echo "  Adding security rules..."
    
    # Add SSH rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges="[{CidrIp=0.0.0.0/0,Description='SSH'}]" \
        > /dev/null
    
    # Add HTTP rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=80,ToPort=80,IpRanges="[{CidrIp=0.0.0.0/0,Description='HTTP'}]" \
        > /dev/null
    
    # Add HTTPS rule
    aws ec2 authorize-security-group-ingress \
        --region "$REGION" \
        --group-id "$SG_ID" \
        --ip-permissions \
        IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges="[{CidrIp=0.0.0.0/0,Description='HTTPS'}]" \
        > /dev/null
    
    echo "  ✓ Port 22 (SSH) configured"
    echo "  ✓ Port 80 (HTTP) configured"
    echo "  ✓ Port 443 (HTTPS) configured"
fi

echo ""
echo "Launching instance..."

INSTANCE_ID=$(aws ec2 run-instances \
    --region "$REGION" \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$INSTANCE_NAME}]" \
    --query 'Instances[0].InstanceId' \
    --output text)

echo "  ✓ Instance launched: $INSTANCE_ID"
echo "  Waiting for instance to be running..."

aws ec2 wait instance-running \
    --region "$REGION" \
    --instance-ids "$INSTANCE_ID"

echo "  ✓ Instance is running!"

# Handle Elastic IP
if [ "$ALLOCATE_EIP" == "y" ]; then
    echo ""
    echo "Allocating Elastic IP..."
    
    # Capture both the AllocationId and the public IP in one call via --query
    read -r ALLOCATION_ID ELASTIC_IP < <(aws ec2 allocate-address \
        --region "$REGION" \
        --domain vpc \
        --tag-specifications "ResourceType=elastic-ip,Tags=[{Key=Name,Value=$INSTANCE_NAME-eip}]" \
        --query '[AllocationId,PublicIp]' \
        --output text)
    
    echo "  ✓ Elastic IP allocated: $ELASTIC_IP"
    echo "  Associating Elastic IP with instance..."
    
    ASSOCIATION_ID=$(aws ec2 associate-address \
        --region "$REGION" \
        --instance-id "$INSTANCE_ID" \
        --allocation-id "$ALLOCATION_ID" \
        --query 'AssociationId' \
        --output text)
    
    echo "  ✓ Elastic IP associated"
    PUBLIC_IP=$ELASTIC_IP
else
    PUBLIC_IP=$(aws ec2 describe-instances \
        --region "$REGION" \
        --instance-ids "$INSTANCE_ID" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text)
fi

echo ""
echo "=========================================="
echo "=== Instance Ready ==="
echo "=========================================="
echo "Instance ID: $INSTANCE_ID"
echo "Instance Type: $INSTANCE_TYPE"
echo "Public IP: $PUBLIC_IP"
if [ "$ALLOCATE_EIP" == "y" ]; then
    echo "Elastic IP: Yes (IP will persist after stop/start)"
    echo "Allocation ID: $ALLOCATION_ID"
else
    echo "Elastic IP: No (IP will change if instance is stopped)"
fi
echo "Region: $REGION"
echo "Security: SSH (22), HTTP (80), HTTPS (443) open"
echo ""
echo "Connect with:"
echo "  ssh -i ${KEY_NAME}.pem ubuntu@${PUBLIC_IP}"
echo ""
echo "Test web access (after installing the web stack in Part 2):"
echo "  curl http://${PUBLIC_IP}"
echo ""
echo "⏱️  Wait 30-60 seconds for SSH to become available"

if [ "$ALLOCATE_EIP" == "y" ]; then
    echo ""
    echo "=========================================="
    echo "⚠️  ELASTIC IP WARNING"
    echo "=========================================="
    echo "Elastic IPs cost \$0.005/hour when NOT"
    echo "associated with a running instance!"
    echo ""
    echo "To avoid charges, release the EIP if you"
    echo "delete the instance:"
    echo ""
    echo "aws ec2 release-address \\"
    echo "  --region $REGION \\"
    echo "  --allocation-id $ALLOCATION_ID"
fi

echo ""
echo "=========================================="

Run it:

chmod +x launch-graviton-wp.sh
./launch-graviton-wp.sh
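One detail worth understanding in the script above: the AMI lookup sorts candidate images by CreationDate and takes the last element (`sort_by(@, &CreationDate) | [-1]`), so the newest image wins. The same "newest wins" selection in plain shell, over hypothetical sample dates and IDs:

```shell
# Equivalent of the JMESPath sort_by(...) | [-1] step: ISO dates sort
# lexicographically, so the last line after sort is the newest image.
printf '%s\n' \
  "2024-03-01 ami-aaa" \
  "2024-07-15 ami-ccc" \
  "2024-05-20 ami-bbb" | sort | tail -n 1 | awk '{print $2}'
# → ami-ccc
```

This is why the script never needs to hard-code an AMI ID per region.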

Part 2: Install WordPress Stack

SSH into your new instance and save this as setup-wordpress.sh:

#!/bin/bash

# WordPress Installation Script for Ubuntu 24.04 ARM64
# Installs Apache, MySQL, PHP, and WordPress with automatic configuration

set -e

echo "=== WordPress Installation Script (Apache) ==="
echo "This script will install and configure:"
echo "  - Apache web server"
echo "  - MySQL database"
echo "  - PHP 8.3"
echo "  - WordPress (latest version)"
echo ""

# Check if running as root
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root (use: sudo bash $0)"
    exit 1
fi

# Get configuration from user
echo "=== WordPress Configuration ==="
read -p "Enter your domain name (or press Enter to use server IP): " DOMAIN_NAME
read -p "Enter WordPress site title (default: My WordPress Site): " SITE_TITLE
SITE_TITLE=${SITE_TITLE:-My WordPress Site}
read -p "Enter WordPress admin username (default: admin): " WP_ADMIN_USER
WP_ADMIN_USER=${WP_ADMIN_USER:-admin}
read -sp "Enter WordPress admin password (or press Enter to generate): " WP_ADMIN_PASS
echo ""
if [ -z "$WP_ADMIN_PASS" ]; then
    WP_ADMIN_PASS=$(openssl rand -base64 16)
    echo "Generated password: $WP_ADMIN_PASS"
fi
read -p "Enter WordPress admin email (default: admin@example.com): " WP_ADMIN_EMAIL
WP_ADMIN_EMAIL=${WP_ADMIN_EMAIL:-admin@example.com}

# Generate database credentials
DB_NAME="wordpress"
DB_USER="wpuser"
DB_PASS=$(openssl rand -base64 16)
DB_ROOT_PASS=$(openssl rand -base64 16)

echo ""
echo "=== Installation Summary ==="
echo "Domain: ${DOMAIN_NAME:-Server IP}"
echo "Site Title: $SITE_TITLE"
echo "Admin User: $WP_ADMIN_USER"
echo "Admin Email: $WP_ADMIN_EMAIL"
echo "Database: $DB_NAME"
echo ""
read -p "Proceed with installation? (y/n, default: y): " CONFIRM
CONFIRM=${CONFIRM:-y}

if [ "$CONFIRM" != "y" ]; then
    echo "Installation cancelled"
    exit 0
fi

echo ""
echo "Starting installation..."

# Update system
echo ""
echo "[1/8] Updating system packages..."
apt-get update -qq
apt-get upgrade -y -qq

# Install Apache
echo ""
echo "[2/8] Installing Apache..."
apt-get install -y -qq apache2

# Enable Apache modules
echo "Enabling Apache modules..."
a2enmod rewrite
a2enmod ssl
a2enmod headers

# Check if MySQL is already installed
MYSQL_INSTALLED=false
if systemctl is-active --quiet mysql || systemctl is-active --quiet mysqld; then
    MYSQL_INSTALLED=true
    echo ""
    echo "MySQL is already installed and running."
elif command -v mysql &> /dev/null; then
    MYSQL_INSTALLED=true
    echo ""
    echo "MySQL is already installed."
fi

if [ "$MYSQL_INSTALLED" = true ]; then
    echo ""
    echo "[3/8] Using existing MySQL installation..."
    read -sp "Enter MySQL root password (or press Enter to try without password): " EXISTING_ROOT_PASS
    echo ""

    MYSQL_CONNECTION_OK=false

    # Test the password
    if [ -n "$EXISTING_ROOT_PASS" ]; then
        if mysql -u root -p"${EXISTING_ROOT_PASS}" -e "SELECT 1;" &> /dev/null; then
            echo "Successfully connected to MySQL."
            DB_ROOT_PASS="$EXISTING_ROOT_PASS"
            MYSQL_CONNECTION_OK=true
        else
            echo "Error: Could not connect to MySQL with provided password."
        fi
    fi

    # Try without password if previous attempt failed or no password was provided
    if [ "$MYSQL_CONNECTION_OK" = false ]; then
        echo "Trying to connect without password..."
        if mysql -u root -e "SELECT 1;" &> /dev/null; then
            echo "Connected without password. Will set a password now."
            DB_ROOT_PASS=$(openssl rand -base64 16)
            mysql -u root -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '${DB_ROOT_PASS}';"
            echo "New root password set: $DB_ROOT_PASS"
            MYSQL_CONNECTION_OK=true
        fi
    fi

    # If still cannot connect, offer to reinstall
    if [ "$MYSQL_CONNECTION_OK" = false ]; then
        echo ""
        echo "ERROR: Cannot connect to MySQL with any method."
        echo "This usually means MySQL is in an inconsistent state."
        echo ""
        read -p "Remove and reinstall MySQL? (y/n, default: y): " REINSTALL_MYSQL
        REINSTALL_MYSQL=${REINSTALL_MYSQL:-y}

        if [ "$REINSTALL_MYSQL" = "y" ]; then
            echo ""
            echo "Removing MySQL..."
            systemctl stop mysql 2>/dev/null || systemctl stop mysqld 2>/dev/null || true
            apt-get remove --purge -y mysql-server mysql-client mysql-common mysql-server-core-* mysql-client-core-* -qq
            apt-get autoremove -y -qq
            apt-get autoclean -qq
            rm -rf /etc/mysql /var/lib/mysql /var/log/mysql

            echo "Reinstalling MySQL..."
            export DEBIAN_FRONTEND=noninteractive
            apt-get update -qq
            apt-get install -y -qq mysql-server

            # Generate new root password
            DB_ROOT_PASS=$(openssl rand -base64 16)

            # Set root password and secure installation
            mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '${DB_ROOT_PASS}';"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='';"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DROP DATABASE IF EXISTS test;"
            mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';"
            mysql -u root -p"${DB_ROOT_PASS}" -e "FLUSH PRIVILEGES;"

            echo "MySQL reinstalled successfully."
            echo "New root password: $DB_ROOT_PASS"
        else
            echo "Installation cancelled."
            exit 1
        fi
    fi
else
    # Install MySQL
    echo ""
    echo "[3/8] Installing MySQL..."
    export DEBIAN_FRONTEND=noninteractive
    apt-get install -y -qq mysql-server

    # Secure MySQL installation
    echo ""
    echo "Securing MySQL installation..."
    mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '${DB_ROOT_PASS}';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DROP DATABASE IF EXISTS test;"
    mysql -u root -p"${DB_ROOT_PASS}" -e "DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';"
    mysql -u root -p"${DB_ROOT_PASS}" -e "FLUSH PRIVILEGES;"
fi

# Check if WordPress database already exists
echo ""
echo "[4/8] Setting up WordPress database..."

# Create MySQL defaults file for safer password handling
MYSQL_CNF=$(mktemp)
cat > "$MYSQL_CNF" <<EOF
[client]
user=root
password=${DB_ROOT_PASS}
EOF
chmod 600 "$MYSQL_CNF"

# Test MySQL connection first
echo "Testing MySQL connection..."
if ! mysql --defaults-extra-file="$MYSQL_CNF" -e "SELECT 1;" &> /dev/null; then
    echo "ERROR: Cannot connect to MySQL to create database."
    rm -f "$MYSQL_CNF"
    exit 1
fi

echo "MySQL connection successful."

# Check if database exists
echo "Checking for existing database '${DB_NAME}'..."
DB_EXISTS=$(mysql --defaults-extra-file="$MYSQL_CNF" -e "SHOW DATABASES LIKE '${DB_NAME}';" 2>/dev/null | grep -c "${DB_NAME}" || true)

if [ "$DB_EXISTS" -gt 0 ]; then
    echo ""
    echo "WARNING: Database '${DB_NAME}' already exists!"
    read -p "Delete existing database and create fresh? (y/n, default: n): " DELETE_DB
    DELETE_DB=${DELETE_DB:-n}

    if [ "$DELETE_DB" = "y" ]; then
        echo "Dropping existing database..."
        mysql --defaults-extra-file="$MYSQL_CNF" -e "DROP DATABASE ${DB_NAME};"
        echo "Creating fresh WordPress database..."
        mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE DATABASE ${DB_NAME} DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
EOF
    else
        echo "Using existing database '${DB_NAME}'."
    fi
else
    echo "Creating WordPress database..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE DATABASE ${DB_NAME} DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
EOF
    echo "Database created successfully."
fi

# Check if WordPress user already exists
echo "Checking for existing database user '${DB_USER}'..."
USER_EXISTS=$(mysql --defaults-extra-file="$MYSQL_CNF" -e "SELECT User FROM mysql.user WHERE User='${DB_USER}';" 2>/dev/null | grep -c "${DB_USER}" || true)

if [ "$USER_EXISTS" -gt 0 ]; then
    echo "Database user '${DB_USER}' already exists. Updating password and permissions..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
ALTER USER '${DB_USER}'@'localhost' IDENTIFIED BY '${DB_PASS}';
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF
    echo "User updated successfully."
else
    echo "Creating WordPress database user..."
    mysql --defaults-extra-file="$MYSQL_CNF" <<EOF
CREATE USER '${DB_USER}'@'localhost' IDENTIFIED BY '${DB_PASS}';
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF
    echo "User created successfully."
fi

echo "Database setup complete."
rm -f "$MYSQL_CNF"

# Install PHP
echo ""
echo "[5/8] Installing PHP and extensions..."
apt-get install -y -qq php8.3 php8.3-mysql php8.3-curl php8.3-gd php8.3-mbstring \
    php8.3-xml php8.3-xmlrpc php8.3-soap php8.3-intl php8.3-zip libapache2-mod-php8.3 php8.3-imagick

# Configure PHP
echo "Configuring PHP..."
sed -i 's/upload_max_filesize = .*/upload_max_filesize = 64M/' /etc/php/8.3/apache2/php.ini
sed -i 's/post_max_size = .*/post_max_size = 64M/' /etc/php/8.3/apache2/php.ini
sed -i 's/max_execution_time = .*/max_execution_time = 300/' /etc/php/8.3/apache2/php.ini

# Check if WordPress directory already exists
if [ -d "/var/www/html/wordpress" ]; then
    echo ""
    echo "WARNING: WordPress directory /var/www/html/wordpress already exists!"
    read -p "Delete existing WordPress installation? (y/n, default: n): " DELETE_WP
    DELETE_WP=${DELETE_WP:-n}

    if [ "$DELETE_WP" = "y" ]; then
        echo "Removing existing WordPress directory..."
        rm -rf /var/www/html/wordpress
    fi
fi

# Download WordPress
echo ""
echo "[6/8] Downloading WordPress..."
cd /tmp
wget -q https://wordpress.org/latest.tar.gz
tar -xzf latest.tar.gz
mv wordpress /var/www/html/
chown -R www-data:www-data /var/www/html/wordpress
rm -f latest.tar.gz

# Configure WordPress
echo ""
echo "[7/8] Configuring WordPress..."
cd /var/www/html/wordpress

# Generate WordPress salts
SALTS=$(curl -s https://api.wordpress.org/secret-key/1.1/salt/)

# Create wp-config.php
cat > wp-config.php <<EOF
<?php
define( 'DB_NAME', '${DB_NAME}' );
define( 'DB_USER', '${DB_USER}' );
define( 'DB_PASSWORD', '${DB_PASS}' );
define( 'DB_HOST', 'localhost' );
define( 'DB_CHARSET', 'utf8mb4' );
define( 'DB_COLLATE', '' );

${SALTS}

\$table_prefix = 'wp_';

define( 'WP_DEBUG', false );

if ( ! defined( 'ABSPATH' ) ) {
    define( 'ABSPATH', __DIR__ . '/' );
}

require_once ABSPATH . 'wp-settings.php';
EOF

chown www-data:www-data wp-config.php
chmod 640 wp-config.php

# Configure Apache
echo ""
echo "[8/8] Configuring Apache..."

# Determine server name
if [ -z "$DOMAIN_NAME" ]; then
    # Try to get EC2 public IP first (IMDSv2 requires a session token)
    IMDS_TOKEN=$(curl -sf --connect-timeout 5 -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null || true)
    SERVER_NAME=$(curl -sf --connect-timeout 5 \
        -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" \
        http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null || true)

    # If we got a valid public IP, use it
    if [ -n "$SERVER_NAME" ] && [[ ! "$SERVER_NAME" =~ ^172\. ]] && [[ ! "$SERVER_NAME" =~ ^10\. ]] && [[ ! "$SERVER_NAME" =~ ^192\.168\. ]]; then
        echo "Detected EC2 public IP: $SERVER_NAME"
    else
        # Fallback: try to get public IP from external service
        echo "EC2 metadata not available, trying external service..."
        SERVER_NAME=$(curl -s --connect-timeout 5 https://api.ipify.org 2>/dev/null || curl -s --connect-timeout 5 https://icanhazip.com 2>/dev/null || true)

        if [ -n "$SERVER_NAME" ]; then
            echo "Detected public IP from external service: $SERVER_NAME"
        else
            # Last resort: use local IP (but warn user)
            SERVER_NAME=$(hostname -I | awk '{print $1}')
            echo "WARNING: Using local IP address: $SERVER_NAME"
            echo "This is a private IP and won't be accessible from the internet."
            echo "Consider specifying a domain name or public IP."
        fi
    fi
else
    SERVER_NAME="$DOMAIN_NAME"
    echo "Using provided domain: $SERVER_NAME"
fi

# Create Apache virtual host
cat > /etc/apache2/sites-available/wordpress.conf <<EOF
<VirtualHost *:80>
    ServerName ${SERVER_NAME}
    ServerAdmin ${WP_ADMIN_EMAIL}
    DocumentRoot /var/www/html/wordpress

    <Directory /var/www/html/wordpress>
        Options FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

    ErrorLog \${APACHE_LOG_DIR}/wordpress-error.log
    CustomLog \${APACHE_LOG_DIR}/wordpress-access.log combined
</VirtualHost>
EOF

# Enable WordPress site
echo "Enabling WordPress site..."
a2ensite wordpress.conf

# Disable default site if it exists
if [ -f /etc/apache2/sites-enabled/000-default.conf ]; then
    echo "Disabling default site..."
    a2dissite 000-default.conf
fi

# Test Apache configuration
echo ""
echo "Testing Apache configuration..."
if ! apache2ctl configtest 2>&1 | grep -q "Syntax OK"; then
    echo "ERROR: Apache configuration test failed!"
    apache2ctl configtest
    exit 1
fi

echo "Apache configuration is valid."

# Restart Apache
echo "Restarting Apache..."
systemctl restart apache2

# Enable services to start on boot
systemctl enable apache2
systemctl enable mysql

# Install WP-CLI for command line WordPress management
echo ""
echo "Installing WP-CLI..."
wget -q https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar -O /usr/local/bin/wp
chmod +x /usr/local/bin/wp

# Complete WordPress installation via WP-CLI
echo ""
echo "Completing WordPress installation..."
cd /var/www/html/wordpress

# Determine WordPress URL
# If SERVER_NAME looks like a private IP, try to get public IP
if [[ "$SERVER_NAME" =~ ^172\. ]] || [[ "$SERVER_NAME" =~ ^10\. ]] || [[ "$SERVER_NAME" =~ ^192\.168\. ]]; then
    # IMDSv2 requires a session token for metadata reads
    IMDS_TOKEN=$(curl -sf --connect-timeout 5 -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null || true)
    PUBLIC_IP=$(curl -sf --connect-timeout 5 -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" \
        http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null \
        || curl -sf --connect-timeout 5 https://api.ipify.org 2>/dev/null || true)
    if [ -n "$PUBLIC_IP" ]; then
        WP_URL="http://${PUBLIC_IP}"
        echo "Using public IP for WordPress URL: $PUBLIC_IP"
    else
        WP_URL="http://${SERVER_NAME}"
        echo "WARNING: Could not determine public IP, using private IP: $SERVER_NAME"
    fi
else
    # Serve over HTTP until TLS is configured with Let's Encrypt
    WP_URL="http://${SERVER_NAME}"
fi

echo "WordPress URL will be: $WP_URL"

# Check if WordPress is already installed
if sudo -u www-data wp core is-installed 2>/dev/null; then
    echo ""
    echo "WARNING: WordPress is already installed!"
    read -p "Continue with fresh installation? (y/n, default: n): " REINSTALL_WP
    REINSTALL_WP=${REINSTALL_WP:-n}

    if [ "$REINSTALL_WP" = "y" ]; then
        echo "Reinstalling WordPress..."
        sudo -u www-data wp db reset --yes
        sudo -u www-data wp core install \
            --url="$WP_URL" \
            --title="${SITE_TITLE}" \
            --admin_user="${WP_ADMIN_USER}" \
            --admin_password="${WP_ADMIN_PASS}" \
            --admin_email="${WP_ADMIN_EMAIL}" \
            --skip-email
    fi
else
    sudo -u www-data wp core install \
        --url="$WP_URL" \
        --title="${SITE_TITLE}" \
        --admin_user="${WP_ADMIN_USER}" \
        --admin_password="${WP_ADMIN_PASS}" \
        --admin_email="${WP_ADMIN_EMAIL}" \
        --skip-email
fi

echo ""
echo "=========================================="
echo "=== WordPress Installation Complete! ==="
echo "=========================================="
echo ""
echo "Website URL: $WP_URL"
echo "Admin URL: $WP_URL/wp-admin"
echo ""
echo "WordPress Admin Credentials:"
echo "  Username: $WP_ADMIN_USER"
echo "  Password: $WP_ADMIN_PASS"
echo "  Email: $WP_ADMIN_EMAIL"
echo ""
echo "Database Credentials:"
echo "  Database: $DB_NAME"
echo "  User: $DB_USER"
echo "  Password: $DB_PASS"
echo ""
echo "MySQL Root Password: $DB_ROOT_PASS"
echo ""
echo "IMPORTANT: Save these credentials securely!"
echo ""

# Save credentials to file
CREDS_FILE="/root/wordpress-credentials.txt"
cat > "$CREDS_FILE" <<EOF
WordPress Installation Credentials
===================================
Date: $(date)

Website URL: $WP_URL
Admin URL: $WP_URL/wp-admin

WordPress Admin:
  Username: $WP_ADMIN_USER
  Password: $WP_ADMIN_PASS
  Email: $WP_ADMIN_EMAIL

Database:
  Name: $DB_NAME
  User: $DB_USER
  Password: $DB_PASS

MySQL Root Password: $DB_ROOT_PASS

WP-CLI installed at: /usr/local/bin/wp
Usage: sudo -u www-data wp <command>

Apache Configuration: /etc/apache2/sites-available/wordpress.conf
EOF

chmod 600 "$CREDS_FILE"

echo "Credentials saved to: $CREDS_FILE"
echo ""
echo "Next steps:"
echo "1. Visit $WP_URL/wp-admin to access your site"
echo "2. Consider setting up SSL/HTTPS with Let's Encrypt"
echo "3. Install a caching plugin for better performance"
echo "4. Configure regular backups"
echo ""

if [ -n "$DOMAIN_NAME" ]; then
    echo "To set up SSL with Let's Encrypt:"
    echo "  apt-get install -y certbot python3-certbot-apache"
    echo "  certbot --apache -d ${DOMAIN_NAME}"
    echo ""
fi

echo "To manage WordPress from command line:"
echo "  cd /var/www/html/wordpress"
echo "  sudo -u www-data wp plugin list"
echo "  sudo -u www-data wp theme list"
echo ""
echo "Apache logs:"
echo "  Error log: /var/log/apache2/wordpress-error.log"
echo "  Access log: /var/log/apache2/wordpress-access.log"
echo ""
echo "=========================================="

Run it:

chmod +x setup-wordpress.sh
sudo ./setup-wordpress.sh
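One detail worth calling out: the setup script classifies an address as private by prefix before deciding which IP to bake into the WordPress URL. That check, pulled out on its own (note it treats all of 172.x as private, which is broader than the RFC 1918 172.16.0.0/12 range but safe for this purpose):

```shell
# The private-address prefix check used by setup-wordpress.sh, in isolation.
is_private_ip() {
  [[ "$1" =~ ^10\. ]] || [[ "$1" =~ ^192\.168\. ]] || [[ "$1" =~ ^172\. ]]
}

is_private_ip "10.0.4.17"    && echo "10.0.4.17 is private"
is_private_ip "203.0.113.9"  || echo "203.0.113.9 is public"
```

If the detected address is private (the usual case inside a VPC), the script falls back to the instance metadata service or an external lookup to find the public IP.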

Part 3: Migrate Your Existing Site

If you’re migrating from an existing WordPress installation, follow these steps.

What gets migrated:

  • All posts, pages, and media
  • All users and their roles
  • All plugins (files + database settings)
  • All themes (including customisations)
  • All plugin/theme configurations (stored in wp_options table)
  • Widgets, menus, and customizer settings
  • WooCommerce products, orders, customers (if applicable)
  • All custom database tables created by plugins

Step 3a: Export from Old Server

Run this on your existing WordPress server. Save as wp-export.sh:

#!/bin/bash
set -euo pipefail

# Configuration
WP_PATH="/var/www/html"           # Adjust to your WordPress path
EXPORT_DIR="/tmp/wp-migration"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Detect WordPress path if not set correctly
if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    for path in "/var/www/wordpress" "/var/www/html/wordpress" "/home/*/public_html" "/var/www/*/public_html"; do
        if [ -f "${path}/wp-config.php" ]; then
            WP_PATH="$path"
            break
        fi
    done
fi

if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    echo "ERROR: wp-config.php not found. Please set WP_PATH correctly."
    exit 1
fi

echo "==> WordPress found at: ${WP_PATH}"

# Extract database credentials from wp-config.php
DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_USER=$(grep "DB_USER" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_PASS=$(grep "DB_PASSWORD" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_HOST=$(grep "DB_HOST" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)

echo "==> Database: ${DB_NAME}"

# Create export directory
mkdir -p "${EXPORT_DIR}"
cd "${EXPORT_DIR}"

echo "==> Exporting database..."
mysqldump -h "${DB_HOST}" -u "${DB_USER}" -p"${DB_PASS}" \
    --single-transaction \
    --quick \
    --lock-tables=false \
    --routines \
    --triggers \
    "${DB_NAME}" > database.sql

DB_SIZE=$(ls -lh database.sql | awk '{print $5}')
echo "    Database exported: ${DB_SIZE}"

echo "==> Exporting wp-content..."
tar czf wp-content.tar.gz -C "${WP_PATH}" wp-content

CONTENT_SIZE=$(ls -lh wp-content.tar.gz | awk '{print $5}')
echo "    wp-content exported: ${CONTENT_SIZE}"

echo "==> Exporting wp-config.php..."
cp "${WP_PATH}/wp-config.php" wp-config.php.bak

echo "==> Creating migration package..."
tar czf "wordpress-migration-${TIMESTAMP}.tar.gz" \
    database.sql \
    wp-content.tar.gz \
    wp-config.php.bak

rm -f database.sql wp-content.tar.gz wp-config.php.bak

PACKAGE_SIZE=$(ls -lh "wordpress-migration-${TIMESTAMP}.tar.gz" | awk '{print $5}')

echo ""
echo "============================================"
echo "Export complete!"
echo ""
echo "Package: ${EXPORT_DIR}/wordpress-migration-${TIMESTAMP}.tar.gz"
echo "Size:    ${PACKAGE_SIZE}"
echo ""
echo "Transfer to new server with:"
echo "  scp ${EXPORT_DIR}/wordpress-migration-${TIMESTAMP}.tar.gz ec2-user@NEW_IP:/tmp/"
echo "============================================"

Step 3b: Transfer the Export

scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@YOUR_NEW_IP:/tmp/
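
An interrupted scp can leave a truncated archive that tar only rejects much later, mid-import. Comparing checksums on both ends catches this early. A minimal sketch using a throwaway file; on the real servers you’d run sha256sum against the migration package path on each side:

```shell
# Sketch: verify a transferred archive by checksum. The temp file stands in
# for the migration package; compare the two sums across servers.
pkg=$(mktemp)
printf 'example archive contents' > "$pkg"

src_sum=$(sha256sum "$pkg" | awk '{print $1}')   # run on the old server
# ...transfer the file, then on the destination:
dst_sum=$(sha256sum "$pkg" | awk '{print $1}')   # run on the new server

if [ "$src_sum" = "$dst_sum" ]; then
    echo "checksum OK"
else
    echo "checksum MISMATCH" >&2
fi
rm -f "$pkg"
```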

Step 3c: Import on New Server

Run this on your new Graviton instance. Save as wp-import.sh:

#!/bin/bash
set -euo pipefail

# Configuration - EDIT THESE
MIGRATION_FILE="${1:-/tmp/wordpress-migration-*.tar.gz}"
OLD_DOMAIN="oldsite.com"          # Your old domain
NEW_DOMAIN="newsite.com"          # Your new domain (can be same)
WP_PATH="/var/www/wordpress"

# Resolve migration file path (variable deliberately unquoted so the glob expands)
MIGRATION_FILE=$(ls -1 ${MIGRATION_FILE} 2>/dev/null | head -1)

if [ ! -f "${MIGRATION_FILE}" ]; then
    echo "ERROR: Migration file not found: ${MIGRATION_FILE}"
    echo "Usage: $0 /path/to/wordpress-migration-XXXXXX.tar.gz"
    exit 1
fi

echo "==> Using migration file: ${MIGRATION_FILE}"

# Get database credentials from existing wp-config
if [ ! -f "${WP_PATH}/wp-config.php" ]; then
    echo "ERROR: wp-config.php not found at ${WP_PATH}"
    echo "Please run the WordPress setup script first"
    exit 1
fi

DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_USER=$(grep "DB_USER" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
DB_PASS=$(grep "DB_PASSWORD" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)
MYSQL_ROOT_PASS=$(grep "MySQL Root" /root/.wordpress/credentials | awk '{print $4}')

echo "==> Extracting migration package..."
TEMP_DIR=$(mktemp -d)
cd "${TEMP_DIR}"
tar xzf "${MIGRATION_FILE}"

echo "==> Backing up current installation..."
BACKUP_DIR="/var/backups/wordpress/pre-migration-$(date +%Y%m%d_%H%M%S)"
mkdir -p "${BACKUP_DIR}"
cp -r "${WP_PATH}/wp-content" "${BACKUP_DIR}/" 2>/dev/null || true
mysqldump -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" > "${BACKUP_DIR}/database.sql" 2>/dev/null || true

echo "==> Importing database..."
mysql -u root -p"${MYSQL_ROOT_PASS}" << EOF
DROP DATABASE IF EXISTS ${DB_NAME};
CREATE DATABASE ${DB_NAME} CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
GRANT ALL PRIVILEGES ON ${DB_NAME}.* TO '${DB_USER}'@'localhost';
FLUSH PRIVILEGES;
EOF

mysql -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" < database.sql

echo "==> Importing wp-content..."
rm -rf "${WP_PATH}/wp-content"
tar xzf wp-content.tar.gz -C "${WP_PATH}"
chown -R caddy:caddy "${WP_PATH}/wp-content"
find "${WP_PATH}/wp-content" -type d -exec chmod 755 {} \;
find "${WP_PATH}/wp-content" -type f -exec chmod 644 {} \;

echo "==> Updating URLs in database..."
cd "${WP_PATH}"

OLD_URL_HTTP="https://${OLD_DOMAIN}"
OLD_URL_HTTPS="https://${OLD_DOMAIN}"
NEW_URL="https://${NEW_DOMAIN}"

# Install WP-CLI if not present
if ! command -v wp &> /dev/null; then
    curl -sO https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
    chmod +x wp-cli.phar
    mv wp-cli.phar /usr/local/bin/wp
fi

echo "    Replacing ${OLD_URL_HTTPS} with ${NEW_URL}..."
sudo -u caddy wp search-replace "${OLD_URL_HTTPS}" "${NEW_URL}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "    Replacing ${OLD_URL_HTTP} with ${NEW_URL}..."
sudo -u caddy wp search-replace "${OLD_URL_HTTP}" "${NEW_URL}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "    Replacing //${OLD_DOMAIN} with //${NEW_DOMAIN}..."
sudo -u caddy wp search-replace "//${OLD_DOMAIN}" "//${NEW_DOMAIN}" --all-tables --precise --skip-columns=guid 2>/dev/null || true

echo "==> Flushing caches and rewrite rules..."
sudo -u caddy wp cache flush
sudo -u caddy wp rewrite flush

echo "==> Reactivating plugins..."
# Some plugins may deactivate during migration - reactivate all
sudo -u caddy wp plugin activate --all 2>/dev/null || true

echo "==> Verifying import..."
POST_COUNT=$(sudo -u caddy wp post list --post_type=post --format=count)
PAGE_COUNT=$(sudo -u caddy wp post list --post_type=page --format=count)
USER_COUNT=$(sudo -u caddy wp user list --format=count)
PLUGIN_COUNT=$(sudo -u caddy wp plugin list --format=count)

echo ""
echo "============================================"
echo "Migration complete!"
echo ""
echo "Imported content:"
echo "  - Posts:   ${POST_COUNT}"
echo "  - Pages:   ${PAGE_COUNT}"
echo "  - Users:   ${USER_COUNT}"
echo "  - Plugins: ${PLUGIN_COUNT}"
echo ""
echo "Site URL: https://${NEW_DOMAIN}"
echo ""
echo "Pre-migration backup: ${BACKUP_DIR}"
echo "============================================"

rm -rf "${TEMP_DIR}"

Run it:

chmod +x wp-import.sh
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz
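
Using wp search-replace (rather than sed on a raw SQL dump) matters because WordPress stores many settings as PHP-serialized strings, where each value carries a byte-length prefix. A naive text replacement changes the value but not the prefix, corrupting the option. A quick illustration with a made-up value:

```shell
# PHP serialization: s:<byte length>:"<value>";  the prefix must match the value.
old='s:19:"https://oldsite.com";'          # 19 bytes: https://oldsite.com
# A naive sed replacement changes the value but leaves the prefix stale:
broken=$(printf '%s' "$old" | sed 's|oldsite.com|my-much-longer-domain.com|')
echo "$broken"    # prefix still says 19, but the value is now 33 bytes: corrupt
# wp search-replace recomputes the prefix, producing:
fixed='s:33:"https://my-much-longer-domain.com";'
echo "$fixed"
```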

Step 3d: Verify Migration

#!/bin/bash
set -euo pipefail

WP_PATH="/var/www/wordpress"
cd "${WP_PATH}"

echo "==> WordPress Verification Report"
echo "=================================="
echo ""

echo "WordPress Version:"
sudo -u caddy wp core version
echo ""

echo "Site URL Configuration:"
sudo -u caddy wp option get siteurl
sudo -u caddy wp option get home
echo ""

echo "Database Status:"
sudo -u caddy wp db check
echo ""

echo "Content Summary:"
echo "  Posts:      $(sudo -u caddy wp post list --post_type=post --format=count)"
echo "  Pages:      $(sudo -u caddy wp post list --post_type=page --format=count)"
echo "  Media:      $(sudo -u caddy wp post list --post_type=attachment --format=count)"
echo "  Users:      $(sudo -u caddy wp user list --format=count)"
echo ""

echo "Plugin Status:"
sudo -u caddy wp plugin list --format=table
echo ""

echo "Uploads Directory:"
UPLOAD_COUNT=$(find "${WP_PATH}/wp-content/uploads" -type f 2>/dev/null | wc -l)
UPLOAD_SIZE=$(du -sh "${WP_PATH}/wp-content/uploads" 2>/dev/null | cut -f1)
echo "  Files: ${UPLOAD_COUNT}"
echo "  Size:  ${UPLOAD_SIZE}"
echo ""

echo "Service Status:"
echo "  PHP-FPM: $(systemctl is-active php-fpm)"
echo "  MariaDB: $(systemctl is-active mariadb)"
echo "  Caddy:   $(systemctl is-active caddy)"
echo ""

echo "Page Load Test:"
DOMAIN=$(sudo -u caddy wp option get siteurl | sed 's|https://||' | sed 's|/.*||')
curl -w "  Total time: %{time_total}s\n  HTTP code: %{http_code}\n" -o /dev/null -s "https://${DOMAIN}/"

Rollback if Needed

If something goes wrong:

#!/bin/bash
set -euo pipefail

BACKUP_DIR=$(ls -1d /var/backups/wordpress/pre-migration-* 2>/dev/null | tail -1)

if [ -z "${BACKUP_DIR}" ]; then
    echo "ERROR: No backup found"
    exit 1
fi

echo "==> Rolling back to: ${BACKUP_DIR}"

WP_PATH="/var/www/wordpress"
MYSQL_ROOT_PASS=$(grep "MySQL Root" /root/.wordpress/credentials | awk '{print $4}')
DB_NAME=$(grep "DB_NAME" "${WP_PATH}/wp-config.php" | cut -d "'" -f 4)

mysql -u root -p"${MYSQL_ROOT_PASS}" "${DB_NAME}" < "${BACKUP_DIR}/database.sql"

rm -rf "${WP_PATH}/wp-content"
cp -r "${BACKUP_DIR}/wp-content" "${WP_PATH}/"
chown -R caddy:caddy "${WP_PATH}/wp-content"

cd "${WP_PATH}"
sudo -u caddy wp cache flush
sudo -u caddy wp rewrite flush

echo "Rollback complete!"

Part 4: Post-Installation Optimisations

After setup (or migration), run these additional optimisations:

#!/bin/bash

cd /var/www/wordpress

# Remove default content
sudo -u caddy wp post delete 1 2 --force 2>/dev/null || true
sudo -u caddy wp theme delete twentytwentytwo twentytwentythree 2>/dev/null || true

# Update everything
sudo -u caddy wp core update
sudo -u caddy wp plugin update --all
sudo -u caddy wp theme update --all

# Configure WP Super Cache
sudo -u caddy wp super-cache enable 2>/dev/null || true

# Set optimal permalink structure
sudo -u caddy wp rewrite structure '/%postname%/'
sudo -u caddy wp rewrite flush

echo "Optimisations complete!"

Performance Verification

Check your stack is running optimally:

# Verify PHP OPcache status
php -i | grep -i opcache

# Check PHP-FPM status
systemctl status php-fpm

# Test page load time
curl -w "@-" -o /dev/null -s "https://yourdomain.com" << 'EOF'
     time_namelookup:  %{time_namelookup}s
        time_connect:  %{time_connect}s
     time_appconnect:  %{time_appconnect}s
    time_pretransfer:  %{time_pretransfer}s
       time_redirect:  %{time_redirect}s
  time_starttransfer:  %{time_starttransfer}s
                     ----------
          time_total:  %{time_total}s
EOF

Cost Comparison

Instance     vCPU   RAM    Monthly Cost   Use Case
t4g.micro    2      1GB    ~$6            Dev/testing
t4g.small    2      2GB    ~$12           Small blogs
t4g.medium   2      4GB    ~$24           Medium traffic
t4g.large    2      8GB    ~$48           High traffic
c7g.medium   1      2GB    ~$25           CPU-intensive

All prices are approximate for eu-west-1 with on-demand pricing. Reserved instances or Savings Plans reduce costs by 30-60%.
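
As a rough worked example of that discount range (figures are approximate, not a price quote):

```shell
# Rough arithmetic for the 30-60% discount range on t4g.medium (~$24/month).
ondemand=24
best=$(awk -v c="$ondemand" 'BEGIN { printf "%.2f", c * 0.40 }')   # 60% off
worst=$(awk -v c="$ondemand" 'BEGIN { printf "%.2f", c * 0.70 }')  # 30% off
echo "t4g.medium with a Savings Plan: roughly \$${best}-\$${worst}/month"
```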


Troubleshooting

502 Bad Gateway: PHP-FPM socket permissions issue

systemctl restart php-fpm
ls -la /run/php-fpm/www.sock

Database connection error: Check MariaDB is running

systemctl status mariadb
mysql -u wp_user -p wordpress

SSL certificate not working: Ensure DNS is pointing to instance IP

dig +short yourdomain.com
curl -I https://yourdomain.com

OPcache not working: Verify with phpinfo

php -r "phpinfo();" | grep -i opcache.enable

Quick Reference

# 1. Launch instance (local machine)
./launch-graviton-wp.sh

# 2. SSH in and setup WordPress
ssh -i ~/.ssh/key.pem ec2-user@IP
sudo ./setup-wordpress.sh

# 3. If migrating - on old server
./wp-export.sh
scp /tmp/wp-migration/wordpress-migration-*.tar.gz ec2-user@NEW_IP:/tmp/

# 4. If migrating - on new server
sudo ./wp-import.sh /tmp/wordpress-migration-*.tar.gz

This setup delivers a production-ready WordPress installation that’ll handle significant traffic while keeping your AWS bill minimal. The combination of Graviton’s price-performance, Caddy’s efficiency, and properly tuned PHP creates a stack that punches well above its weight class.