Automatically Recovering a Failed WordPress Instance on AWS

When WordPress goes down on your AWS instance, waiting for manual intervention means downtime and lost revenue. Here are two robust approaches to automatically detect and recover from WordPress failures.

Approach 1: Lambda-Based Intelligent Recovery

This approach tries the least disruptive fix first (restarting services) before escalating to a full instance reboot.
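
The escalation idea can be sketched independently of AWS. The snippet below is only an illustration, not the Lambda function built later: recover and the two stand-in actions are hypothetical names, and each action is assumed to return True when it has restored the site.

```python
from typing import Callable, List, Optional, Tuple

# Each recovery step is a (label, action) pair; the action returns True on success.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover(steps: List[RecoveryStep]) -> Optional[str]:
    """Run steps from least to most disruptive; return the label that worked."""
    for label, action in steps:
        try:
            if action():
                return label
        except Exception as exc:  # a failing step must not block escalation
            print(f"{label} raised {exc!r}, escalating")
    return None


if __name__ == "__main__":
    # Hypothetical stand-ins: the restart fails, so the reboot is attempted.
    steps = [
        ("restart services", lambda: False),
        ("reboot instance", lambda: True),
    ]
    print(recover(steps))
```

Keeping the steps in an ordered list makes it straightforward to add further stages later without touching the control flow.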

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# --max-time keeps a hung site from stalling the check; curl prints 000 on failure
response=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

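You should see 1 if WordPress is running. A 0 on a site that is otherwise healthy can mean WordPress is redirecting http://localhost (for example to HTTPS or to the canonical domain), because the script only accepts a 200 status.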
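
For comparison, the same check can be written in Python with an explicit timeout. This is a rough sketch rather than a replacement for the script: wordpress_health is a hypothetical name, the 10-second timeout is an assumption, and the opener parameter exists only so the function can be exercised without a live server. Unlike the curl command, urllib follows redirects, so a redirecting site that ends in a 200 would count as healthy here.

```python
import http.client
import urllib.request


def wordpress_health(url="http://localhost", timeout=10, opener=urllib.request.urlopen):
    """Return 1 if `url` answers HTTP 200 within `timeout` seconds, else 0."""
    try:
        with opener(url, timeout=timeout) as response:
            return 1 if response.status == 200 else 0
    except (OSError, http.client.HTTPException, ValueError):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return 0
```

Calling wordpress_health() on the instance should mirror the 1 or 0 printed by the shell script, apart from the redirect behaviour noted above.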

Step 2: Install CloudWatch Agent on Your EC2 Instance

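Still on your EC2 instance, download and install the CloudWatch agent. The package below is the Ubuntu amd64 build, so pick the matching package for other distributions or architectures. The metric script in Step 3 publishes through the AWS CLI rather than the agent, so make sure the aws command is installed as well: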

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

Step 3: Create Metric Publishing Script on Your EC2 Instance

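This script publishes the health check result to CloudWatch as a custom metric; Step 4 schedules it to run every minute. It relies on the ec2-metadata helper, which ships with Amazon Linux; on other distributions you may need to install an equivalent or query the instance metadata service directly: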

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value $HEALTH \
  --dimensions Instance=$INSTANCE_ID \
  --region $REGION
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it:

/usr/local/bin/send-wordpress-metric.sh

If you get permission errors, ensure your EC2 instance has an IAM role (instance profile) that allows the cloudwatch:PutMetricData action.
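
If you would rather publish the metric from Python, the equivalent boto3 call looks roughly like this. It is a sketch under assumptions: publish_health is a hypothetical helper, boto3 is assumed to be installed on the instance, and the client is passed in so the function can be tried with a stand-in object. The namespace, metric name, and dimension match the shell script above.

```python
def publish_health(cloudwatch, instance_id, health):
    """Send one HealthCheck datapoint using the same names as the shell script."""
    cloudwatch.put_metric_data(
        Namespace="WordPress",
        MetricData=[
            {
                "MetricName": "HealthCheck",
                "Dimensions": [{"Name": "Instance", "Value": instance_id}],
                "Value": float(health),
            }
        ],
    )


# Example call (requires boto3 and credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("cloudwatch", region_name="us-east-1")
#   publish_health(client, "i-1234567890abcdef0", 1)
```

The instance ID and region in the commented example are the article's placeholders and need to be replaced with your own.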

Step 4: Add Health Check to Cron on Your EC2 Instance

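This runs the health check every minute. cron uses a minimal PATH, so if no datapoints arrive, switch to absolute paths for aws and ec2-metadata inside the script: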

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 5: Create IAM Role for Lambda on Your Laptop

Now switch to your laptop (or use AWS CloudShell in your browser). You’ll need the AWS CLI installed and configured with credentials.

Create the IAM role that Lambda will use:

aws iam create-role \
  --role-name WordPressRecoveryRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

Attach the necessary policies:

aws iam attach-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

aws iam put-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-name EC2SSMAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ec2:RebootInstances",
          "ec2:DescribeInstances",
          "ssm:SendCommand",
          "ssm:GetCommandInvocation"
        ],
        "Resource": "*"
      }
    ]
  }'

Step 6: Create Lambda Function on Your Laptop

On your laptop, create a file called wordpress-recovery.py in a new directory:

import boto3
import os
import time

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    instance_id = os.environ.get('INSTANCE_ID')
    
    if not instance_id:
        return {'statusCode': 400, 'body': 'INSTANCE_ID not configured'}
    
    print(f"WordPress health check failed for {instance_id}")
    
    # Step 1: Try restarting services
    try:
        print("Attempting to restart services...")
        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    'systemctl restart php-fpm || systemctl restart php8.2-fpm || systemctl restart php8.1-fpm',
                    'systemctl restart nginx || systemctl restart apache2',
                    'sleep 30',
                    'curl -f http://localhost || exit 1'
                ]
            },
            TimeoutSeconds=120
        )
        
        command_id = response['Command']['CommandId']
        print(f"Command ID: {command_id}")
        
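        # Poll until the command reaches a terminal state. A single fixed
        # sleep can still read 'InProgress' (the script itself sleeps 30s),
        # which would be mistaken for a failure and trigger a reboot.
        status = 'Pending'
        for _ in range(24):
            time.sleep(5)
            result = ssm.get_command_invocation(
                CommandId=command_id,
                InstanceId=instance_id
            )
            status = result['Status']
            if status not in ('Pending', 'InProgress', 'Delayed'):
                break
        
        if status == 'Success':
            print("Services restarted successfully")
            return {'statusCode': 200, 'body': 'Services restarted successfully'}
        else:
            print(f"Service restart failed with status: {status}")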
    
    except Exception as e:
        print(f"Service restart failed with error: {str(e)}")
    
    # Step 2: Reboot the instance as last resort
    try:
        print(f"Rebooting instance {instance_id}")
        ec2.reboot_instances(InstanceIds=[instance_id])
        return {'statusCode': 200, 'body': 'Instance rebooted'}
    except Exception as e:
        print(f"Reboot failed: {str(e)}")
        return {'statusCode': 500, 'body': f'Recovery failed: {str(e)}'}

Create the deployment package:

zip wordpress-recovery.zip wordpress-recovery.py

Get your AWS account ID:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Deploy the Lambda function (replace i-1234567890abcdef0 with your actual instance ID and us-east-1 with your region):

aws lambda create-function \
  --function-name wordpress-recovery \
  --runtime python3.11 \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/WordPressRecoveryRole \
  --handler wordpress-recovery.lambda_handler \
  --zip-file fileb://wordpress-recovery.zip \
  --timeout 180 \
  --region us-east-1 \
  --environment Variables={INSTANCE_ID=i-1234567890abcdef0}

Step 7: Create CloudWatch Alarm on Your Laptop

Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching

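This alarm fires when the 5-minute average of HealthCheck stays below 1 for two consecutive periods (10 minutes). Because the statistic is Average, a single failed check is enough to mark a whole period as breaching. With --treat-missing-data notBreaching, an instance that stops reporting entirely will not trigger the alarm.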
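
To see how the period, statistic, and threshold settings interact, the evaluation can be modelled in a few lines. This is a deliberately simplified sketch, not CloudWatch's actual algorithm: alarm_breaches is a hypothetical name, it assumes exactly one sample per minute, and it ignores missing-data handling and period alignment.

```python
from statistics import mean


def alarm_breaches(samples, per_period=5, evaluation_periods=2, threshold=1.0):
    """Return True if the most recent period averages are all below threshold.

    `samples` holds one health value (1 or 0) per minute, oldest first.
    """
    # Group the per-minute samples into consecutive periods.
    periods = [
        samples[i:i + per_period]
        for i in range(0, len(samples), per_period)
    ]
    # Only complete periods are considered in this simplified model.
    averages = [mean(p) for p in periods if len(p) == per_period]
    recent = averages[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(a < threshold for a in recent)


# One failed check in each 5-minute period: averages are 0.8 and 0.8.
print(alarm_breaches([1, 1, 1, 0, 1, 1, 1, 1, 1, 0]))  # prints True
# A full outage that recovers in the second period: averages are 0 and 1.
print(alarm_breaches([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))  # prints False
```

The first case shows that two isolated blips five minutes apart are enough to satisfy the alarm condition, which is worth keeping in mind if your site has occasional slow responses.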

Step 8: Connect Alarm to Lambda on Your Laptop

Create an SNS topic (replace us-east-1 with your region):

aws sns create-topic --name wordpress-recovery-topic --region us-east-1

Get the topic ARN:

export TOPIC_ARN=$(aws sns list-topics --region us-east-1 --query 'Topics[?contains(TopicArn, `wordpress-recovery-topic`)].TopicArn' --output text)

Subscribe Lambda to the topic:

aws sns subscribe \
  --region us-east-1 \
  --topic-arn ${TOPIC_ARN} \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:${AWS_ACCOUNT_ID}:function:wordpress-recovery

Give SNS permission to invoke Lambda:

aws lambda add-permission \
  --region us-east-1 \
  --function-name wordpress-recovery \
  --statement-id AllowSNSInvoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn ${TOPIC_ARN}

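Update the CloudWatch alarm to notify SNS. put-metric-alarm overwrites the existing alarm of the same name, so all settings are repeated here (replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region):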

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions ${TOPIC_ARN}
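
Once the alarm publishes to SNS, the event that Lambda receives wraps the alarm notification as a JSON string. The helper below is a sketch of how the handler could log which alarm fired; alarm_details is a hypothetical name, and the sample event is trimmed to the few fields used here (real notifications carry more).

```python
import json


def alarm_details(event):
    """Extract alarm name, state and reason from an SNS-delivered alarm event."""
    details = []
    for record in event.get("Records", []):
        # The alarm payload arrives as a JSON string inside the SNS record.
        message = json.loads(record["Sns"]["Message"])
        details.append(
            {
                "alarm": message.get("AlarmName"),
                "state": message.get("NewStateValue"),
                "reason": message.get("NewStateReason"),
            }
        )
    return details


# Trimmed sample event for illustration only.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "wordpress-down-recovery",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}
print(alarm_details(sample_event))
```

Printing these details at the top of lambda_handler would put the triggering alarm and its reason into the function's CloudWatch logs alongside the recovery messages.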

Approach 2: Custom Health Check with CloudWatch Reboot

This approach is simpler than the Lambda version. It uses a custom CloudWatch metric based on checking your WordPress homepage, then automatically reboots when the check fails.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# --max-time keeps a hung site from stalling the check; curl prints 000 on failure
response=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Create Metric Publishing Script on Your EC2 Instance

This script sends the health check result to CloudWatch:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value $HEALTH \
  --dimensions Instance=$INSTANCE_ID \
  --region $REGION
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it (ensure your EC2 instance has an IAM role that allows the cloudwatch:PutMetricData action):

/usr/local/bin/send-wordpress-metric.sh

Step 3: Add Health Check to Cron on Your EC2 Instance

Run the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 4: Create CloudWatch Alarm with Reboot Action on Your Laptop

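Now from your laptop (or AWS CloudShell), create the alarm. Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region. Note that EC2 alarm actions such as reboot apply only to custom metrics that carry an InstanceId dimension naming a running instance, so if the alarm fires without rebooting, rename the dimension from Instance to InstanceId in both the Step 2 script and the command below: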

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-health-reboot \
  --alarm-description "Reboot instance when WordPress health check fails" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot

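This reboots your instance when the 5-minute average of HealthCheck stays below 1 for two consecutive periods (10 minutes). With the Average statistic, a single failed check is enough to mark a period as breaching.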

That’s it. The entire setup is contained in 4 steps, and there’s no Lambda function to maintain. When WordPress goes down, CloudWatch will automatically reboot your instance.

Which Approach Should You Use?

Use Lambda Recovery (Approach 1) if:

  • You want intelligent recovery that tries service restart before rebooting
  • You need visibility into what recovery actions are taken
  • You want to extend the logic later (notifications, multiple recovery steps, etc.)
  • Your instance runs the SSM agent and has an instance profile that lets Systems Manager manage it

Use Custom Health Check Reboot (Approach 2) if:

  • You want a simple solution with minimal moving parts
  • A full reboot is acceptable for all WordPress failures
  • You don’t need to try service restarts before rebooting
  • You prefer fewer AWS services to maintain

The Lambda approach is more sophisticated and tries to minimize downtime by restarting services first. The custom health check reboot approach is simpler, requires no Lambda function, but always reboots the entire instance.

Testing Your Setup

For Lambda Approach

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Watch the Lambda logs from your laptop:

aws logs tail /aws/lambda/wordpress-recovery --follow --region us-east-1

After 10 minutes, you should see the Lambda function trigger and attempt to restart services.

For Custom Health Check Reboot

SSH into your instance and stop nginx:

sudo systemctl stop nginx

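From your laptop, check that the metric is arriving (the date -d syntax below requires GNU date; on macOS use date -u -v-15M +%Y-%m-%dT%H:%M:%S for the start time):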

aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average

You should see values of 0 appearing. After 10 minutes, your instance will automatically reboot.
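
Because the date -d form used above is specific to GNU date, a portable way to produce the same timestamps is a few lines of Python. This is a small sketch; metric_window is a hypothetical helper and the 15-minute default simply mirrors the command above.

```python
from datetime import datetime, timedelta, timezone


def metric_window(minutes=15, now=None):
    """Return (start, end) UTC timestamps in the format used by the command above."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return start.strftime(fmt), end.strftime(fmt)


start, end = metric_window()
print(f"--start-time {start} --end-time {end}")
```

The printed flags can be pasted into the get-metric-statistics command in place of the two date substitutions.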

Both approaches ensure your WordPress site recovers automatically without manual intervention.
